
2025

Day 6

Work Done

  • Patched original codebase and re-ran evaluation.
  • Completed analysis of training results.
  • Completed documentation.

Day 5

Work Done

  • Trained and evaluated policies from full training runs.
  • Plotted training curves from the JSON log (see the sketch below).
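
A minimal sketch of the plotting script, assuming the run directory contains a JSON-lines log (one JSON object per line); the file name logs.json.txt and the keys global_step / train_loss are assumptions from my run and may differ:

```python
# Sketch: plot the training loss curve from a JSON-lines training log.
# The file name and record keys are assumptions; adjust to your run.
import json
import matplotlib.pyplot as plt

steps, losses = [], []
with open("outputs/logs.json.txt") as f:
    for line in f:
        record = json.loads(line)
        if "train_loss" in record:  # skip records without a training loss
            steps.append(record.get("global_step", len(steps)))
            losses.append(record["train_loss"])

plt.plot(steps, losses)
plt.xlabel("global step")
plt.ylabel("train loss")
plt.savefig("train_loss.png", dpi=150)
```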

WIP

  • Analyzing training results.
  • Writing documentation.

TODO

  • Complete documentation and wrap-up.
  • Understanding
    • Methodology section: diffusion model training, evaluation.
    • Data preprocessing: how image and low-dimension tasks utilize data.

Day 4

Work Done

  • Evaluated policies from full training runs.
  • Completed dataset analysis notebook.

WIP

  • Training CNN-based policy model on state-based dataset.

TODO

  • Training
    • Compare results between CNN and transformer-based policy models.
    • Plot training curves from the JSON log.
    • Analyze training results. Compare with original paper.
  • Documentation
    • Methodology section: diffusion model training, evaluation.
    • Data preprocessing: how image and low-dimension tasks utilize data.

Day 3

Work Done

  • Dataset analysis: played with both versions of dataset in data_analysis.ipynb.
  • diffusion_policy
    • Successfully installed and imported as a package after patching in the necessary __init__.py files (see the sketch after this list).
    • Understood success rate and metrics and how evaluation works.
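
A minimal sketch of that patch, assuming the repo is checked out at a hypothetical path external/diffusion_policy; it just drops empty __init__.py files into every directory that contains Python modules so the flat layout becomes importable:

```python
# Sketch: make the flat-layout diffusion_policy checkout importable as a
# package by creating the missing __init__.py files.
# The checkout path is an assumption; adjust it to your layout.
from pathlib import Path

pkg_root = Path("external/diffusion_policy/diffusion_policy")
for directory in [pkg_root, *(p for p in pkg_root.rglob("*") if p.is_dir())]:
    if any(directory.glob("*.py")) and not (directory / "__init__.py").exists():
        (directory / "__init__.py").touch()
        print(f"created {directory / '__init__.py'}")
```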

WIP

  • Still figuring out what this repository should look like. It feels like I'd only need some basic Python scripts to launch the training and evaluation jobs, so why bother installing the original repo?

TODO

  • Look into colab notebooks.
  • Training understanding
    • Policy input/output.
    • How is a diffusion model trained?
    • Hyperparameters.
  • Evaluation understanding: how does validation work?
  • PushT environment: compare it to lerobot/gym-pusht.
  • Data preprocessing
    • How image and low-dimension tasks utilize data.

Day 2

Work Done

Training with real-stanford/diffusion_policy

Ran two experiment setups using the real-stanford/diffusion_policy repository.

  1. Transformer + state-based observations.
  2. UNet + image-based observations.

Both configurations were trained on two datasets, making a 2x2 matrix. Most default settings were adopted, except for the number of epochs and the learning rate scheduler. The number of epochs was set to 1000 for all cases to get a quick taste, and the learning rate scheduler was set to constant to make sure the model goes far enough.
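
A sketch of how the four runs could be launched, assuming the repo's Hydra-based train.py entry point; the config names, override keys, and dataset paths below are assumptions and need to be checked against the actual files under diffusion_policy/config:

```python
# Sketch: launch the 2x2 sweep (2 architectures x 2 datasets) through the
# Hydra-based train.py. Config names, override keys, and dataset paths are
# assumptions; check diffusion_policy/config for the real names.
import itertools
import subprocess

configs = {
    "transformer_lowdim": "train_diffusion_transformer_lowdim_workspace",
    "unet_image": "train_diffusion_unet_image_workspace",
}
datasets = {"v1": "data/pusht_v1.zarr", "v2": "data/pusht_v2.zarr"}

for (arch, config_name), (tag, zarr_path) in itertools.product(
    configs.items(), datasets.items()
):
    subprocess.run(
        [
            "python", "train.py",
            "--config-name", config_name,
            f"task.dataset.zarr_path={zarr_path}",  # assumed override key
            "training.num_epochs=1000",             # quick-look budget
            "training.lr_scheduler=constant",       # keep the LR flat
            f"hydra.run.dir=outputs/{arch}_{tag}",
        ],
        check=True,
    )
```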

Insert table

Analysis

Interestingly, the models trained on dataset v1 both outperformed the ones trained on dataset v2. Dataset v2 is roughly double the size of v1, so I had initially expected it to benefit from scale. Two potential reasons:

  • Data quality: maybe v2 contains lower-quality demonstrations?
  • A larger dataset may simply need longer training.

On the other hand, the state-based models outperform the image-based ones. I think this is expected, since vision inevitably introduces estimation errors.

WIP

  • Still fighting with environment setup 😢.
    • Currently maintaining two environments: one for diffusion_policy, the other for my repository.
  • Change of plan again.
    • Don't want to fork diffusion_policy \(\to\) temporarily use git submodule instead for reproduction purposes.
    • Tried to install diffusion_policy with pip or conda but had no luck. The flat layout and the lack of __init__.py files prevented me from importing it as a package.
    • The colab notebook is almost self-contained training + evaluation code. Adopt that, but restructure it into a tiny_dp Python submodule.

TODO

  • Look into colab notebooks.
  • Training understanding
    • Policy input/output.
    • How is a diffusion model trained?
    • Hyperparameters.
  • Evaluation understanding:
    • How do validation and test work?
    • Definition of metrics: success rate, reward.
  • PushT environment: compare it to lerobot/gym-pusht.
  • Data preprocessing
    • How image and low-dimension tasks utilize data.
    • Convert to lerobot-style dataset?

Day 1

Work Done

Test Evaluation Script and Environment

Ran the evaluation command from lerobot/diffusion_policy.

python -m lerobot.scripts.eval --policy.path=lerobot/diffusion_pusht --output_dir ./output --env.type=pusht --eval.n_episodes=500 --eval.batch_size=50

And the results:

|                                     | Mine  | lerobot/diffusion_pusht | Paper |
| ----------------------------------- | ----- | ----------------------- | ----- |
| Average max. overlap ratio          | 0.962 | 0.955                   | 0.957 |
| Success rate for 500 episodes (%)   | 64.2  | 65.4                    | 64.2  |
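
To make sure I understand how the two numbers relate, here is a rough aggregation sketch, assuming the usual PushT convention that an episode counts as a success once its overlap (coverage) ratio reaches 0.95 at any step; the threshold and the data layout are assumptions:

```python
# Sketch: aggregate per-episode PushT coverage into the two table metrics.
# Assumptions: `coverage` holds the overlap ratio per step for each episode,
# and an episode is a success if coverage ever reaches 0.95.
import numpy as np

rng = np.random.default_rng(0)
coverage = rng.uniform(0.0, 1.0, size=(500, 300))  # (episodes, steps), fake data

max_coverage = coverage.max(axis=1)                  # best overlap per episode
avg_max_overlap = max_coverage.mean()                # "Average max. overlap ratio"
success_rate = (max_coverage >= 0.95).mean() * 100   # "Success rate (%)"

print(f"avg max overlap: {avg_max_overlap:.3f}, success rate: {success_rate:.1f}%")
```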

Dataset Discovery

I opened up a jupyter notebook playground and fiddled with the data a little bit. Here's the structure of the data with zarr.open_group(path).tree():

/
├── data
│   ├── action (N, 2) float32
│   ├── img (N, 96, 96, 3) float32
│   ├── keypoint (N, 9, 2) float32
│   ├── n_contacts (N, 1) float32
│   └── state (N, 5) float32
└── meta
    └── episode_ends (K,) int64

Initially, I compared it to the lerobot/pusht dataset released by HuggingFace. However, the entries are so different that it's difficult to match them. I printed the arrays and displayed the images, trying to get a sense of what those values mean. Here's my attempt:

  • episode_ends marks the ending index of each episode. Use this to split the data into K episodes and label them with episode indices from 0 to K - 1 (see the sketch after this list).
  • state: my guesses, made by looking at the corresponding images side by side.
    • The first two numbers are the position of the tooltip.
    • The 3rd & 4th are the position of the T-shaped object.
    • The 5th looks like the orientation of the object in radians.
  • img visualizes the current state (potentially given the keypoints and states).
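
As a sanity check of the episode_ends interpretation, a small sketch that splits the flat arrays into episodes (the zarr path below is a placeholder):

```python
# Sketch: split the flat (N, ...) arrays into K episodes using
# meta/episode_ends. The dataset path is a placeholder.
import numpy as np
import zarr

root = zarr.open_group("data/pusht.zarr", mode="r")
episode_ends = root["meta/episode_ends"][:]   # (K,) cumulative end indices
states = root["data/state"][:]                # (N, 5)

starts = np.concatenate(([0], episode_ends[:-1]))
episodes = [states[s:e] for s, e in zip(starts, episode_ends)]
print(f"{len(episodes)} episodes, first episode has {len(episodes[0])} steps")
```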

At some point I came up with the idea that I should also check the dataset released by the authors of the original paper. BINGO!

real-stanford/diffusion_policy

Naturally, the next step is to discover the diffusion policy paper and code. Their README suggests running the notebook in colab, but I failed to open it due to some issues. I then turned to the example commands in the README.

I started with the low-dimension setup. Using the exact configuration from the paper, my results (0.944@750 and 0.948@950) seemed to match the authors' checkpoints. However, this comparison is based entirely on the checkpoint file names. Further investigation is required to determine whether this is a successful reproduction.

WIP

  • Reproducing both image and low-dimension experiments.
  • Naively training on the custom v1 dataset, swapping in only the dataset itself.

TODO

Focus on real-stanford/diffusion_policy.

  • Look into colab notebooks.
  • Training understanding
    • Policy input/output.
    • How is a diffusion model trained?
    • Hyperparameters.
  • Evaluation understanding:
    • How do validation and test work?
    • Definition of metrics: success rate, reward.
  • PushT environment: compare it to lerobot/gym-pusht.
  • Data preprocessing
    • How image and low-dimension tasks utilize data.
    • Convert to lerobot-style dataset?
  • Setup local wandb?
  • Code cleanup and commit.

Random Notes

AttributeError: 'Space' object has no attribute 'add_collision_handler'

Looks like pymunk removed the method in version 7.0. Running uv add 'pymunk<7' solves the issue.

Environment Preparation for diffusion_policy

Setting up the environment wasn't the easiest. This is a two-year-old project. Huggingface libraries have been moving fast and are not afraid of breaking things. Python's dependency management via conda and pip isn't the best[^1]. All three factors led to hours of fixing module/attribute-not-found errors and chasing version combinations that sometimes simply don't exist. Eventually, I had a fragile but working environment. Time for running some code!


[^1]: Let's hope for uv.

Project start

Work Done

  • Watched the YouTube video of the diffusion policy presentation. My takeaways:
    • Human visual reaction time is ~300 ms. If the training data is collected by a human, each action sequence/chunk should be on the same order of magnitude as that[^2].
    • Diffusion policy works well both in joint-space and action-space. However, working in action space requires a good IK.
  • Created repository.
  • Draft plan:
    • huggingface/lerobot: start from the training and evaluation scripts there. Maybe reproduce lerobot/diffusion_pusht if feasible.
    • Per request, use huggingface/gym-pusht for simulation environment.
    • Maybe Material for MkDocs for documentation and report.
    • Or maybe just the paper-style, good-old \(\LaTeX\).
    • uv for package management? Not sure if this would work since most of the environment requires conda/mamba for non-python dependencies.
    • marimo or jupyter notebook for interactive sessions? Or use the jupyter notebook extension for mkdocs.

TODO

  • Understand the difference between DDPM and DDIM[^3].
  • Fiddle with lerobot/diffusion_pusht.
    • Understand the workflow.
    • Get a feeling for how resource-hungry the training & evaluation scripts are.
  • Discover the custom pusht dataset.
  • Perhaps read the paper?

[^2]: During the Q&A session at 51:18.

[^3]: Mentioned during the final Q&A regarding speed optimizations.