My first week of imitation learning: ACT and Diffusion Policy on cube stacking
I just started as an intern at Dream Machines, a startup teaching robots through demonstration so that non-technical people in small and medium manufacturing can automate the repetitive parts of their work. The field is new to me. I have a mechanical-engineering background, an upcoming robotics master's, and a recently finished bachelor's thesis on reinforcement learning for a robot I built that learned to jump in simulation. So: some adjacent experience, but no real-world manipulation, no imitation learning, and limited intuition about how the modern policies actually behave on a real arm.
What lit the fire was watching Physical Intelligence's $\pi^*_{0.6}$ demos: the same model folding cardboard boxes, making espresso, doing the laundry. The kind of thing that reframes what's possible.
This post is what I did during my first week. The brief from Dominique (the founder) was simple: pick a small task, collect data, train both ACT and a Diffusion Policy, and learn from what does and doesn't work. The results are less "we discovered something publishable" and more "here's what happened when a beginner ran the obvious experiment", which is the part I think might be useful for other beginners.
Getting warmed up
I started with two papers, the bare minimum to know what I was even doing:
- Chi et al. (2023), Diffusion Policy. Uses the same denoising idea as image-generation diffusion models, but on action sequences instead of pixels.
- Black et al. (2024), π₀. A vision-language-action model. The pi0 blog post is the friendlier entry point.
This gave me a rough mental model. I'd known transformers from language and diffusion from images. How do they end up controlling a robot?
Transformers, very briefly: they're a way to model how elements in a sequence depend on each other, via "attention". The input can be tokens of text, patches of an image, audio, or for our purposes a sequence of joint positions and an image observation. As long as you can encode something as a vector, you can throw it in.
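To make "attention" slightly less hand-wavy, here is the core operation in a few lines. This is the generic scaled dot-product form, not any particular policy's code:

```python
import torch

def attention(q, k, v):
    # q, k, v: (sequence_length, d); each row is one encoded element
    # (a text token, an image patch, a joint-state vector, ...)
    scores = q @ k.T / k.shape[-1] ** 0.5    # how strongly each element attends to each other one
    weights = torch.softmax(scores, dim=-1)  # each row sums to 1
    return weights @ v                       # each output is a weighted mix of the values
```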
Diffusion, very briefly: at training time, you take a clean action sequence and gradually add noise to it; the model learns to predict and remove that noise. At inference time, you start from pure noise and iteratively denoise into a real action sequence, conditioned on the current observation. In other words: instead of "predict the next action", the policy generates a whole short action chunk by progressively refining random noise into something coherent.
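Here is a minimal sketch of both halves in plain PyTorch: the training objective (regress the noise that was added to a demonstrated chunk) and the iterative denoising at inference. It's my toy version with a placeholder MLP and made-up shapes, not LeRobot's or the paper's implementation:

```python
import torch
import torch.nn as nn

T = 100                                    # diffusion timesteps
betas = torch.linspace(1e-4, 0.02, T)      # noise schedule
alphas_cumprod = torch.cumprod(1 - betas, dim=0)

class NoisePredictor(nn.Module):
    """Toy stand-in for the policy net: predicts the noise hidden in a noisy
    action chunk, conditioned on the current observation."""
    def __init__(self, obs_dim=64, chunk=16, act_dim=6):
        super().__init__()
        self.chunk, self.act_dim = chunk, act_dim
        self.net = nn.Sequential(
            nn.Linear(obs_dim + chunk * act_dim + 1, 256), nn.ReLU(),
            nn.Linear(256, chunk * act_dim))

    def forward(self, obs, noisy_actions, t):
        x = torch.cat([obs, noisy_actions.flatten(1), t.float().unsqueeze(1) / T], dim=1)
        return self.net(x).view(-1, self.chunk, self.act_dim)

def training_loss(model, obs, actions):
    # take a clean demonstrated chunk, noise it to a random timestep, regress the noise
    t = torch.randint(0, T, (actions.shape[0],))
    noise = torch.randn_like(actions)
    a_bar = alphas_cumprod[t].view(-1, 1, 1)
    noisy = a_bar.sqrt() * actions + (1 - a_bar).sqrt() * noise
    return nn.functional.mse_loss(model(obs, noisy, t), noise)

@torch.no_grad()
def sample_chunk(model, obs):
    # start from pure noise and iteratively denoise into an action chunk
    a = torch.randn(obs.shape[0], model.chunk, model.act_dim)
    for t in reversed(range(T)):
        eps = model(obs, a, torch.full((obs.shape[0],), t))
        alpha, a_bar = 1 - betas[t], alphas_cumprod[t]
        a = (a - (1 - alpha) / (1 - a_bar).sqrt() * eps) / alpha.sqrt()
        if t > 0:
            a = a + betas[t].sqrt() * torch.randn_like(a)
    return a
```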
I later went back and read Jay Alammar's Illustrated Transformer to refresh the details. Strongly recommended if you've forgotten how attention actually works under the hood.
For tooling, we use LeRobot, Hugging Face's open-source library for real-robot ML. It bundles datasets, pretrained policies (ACT, DP, π₀, etc.), training and eval scripts, and a hardware-abstraction layer. To get my bearings with the data side I read the cloth-folding write-up on the LeRobot blog, which is a refreshingly honest account of what does and doesn't work.
Then, per advice in that same post, I stopped reading and just started teleoperating to get a feel for the system. I felt clumsy for about ten minutes; after that, your hands learn the leader-follower mapping and you can move pretty deliberately. It's fun.
[WRITE YOURSELF: a sentence or two on the grippers you tried and which TPU-gripper variant you liked, including model name/link if you want.]
I picked a deliberately small task: stack one cube on top of another, single arm. From the start I wanted not just "a working policy" but a small experiment. Specifically I wanted to know:
- Can ACT and DP generalise across cube placement on this task?
- Does adding wider-placement training data help generalisation?
- How do ACT and DP compare on the same setup?
To test that, I planned three training datasets, each with a different cube-placement distribution:
| name | demos | placement |
|---|---|---|
| narrow | 100 | cubes within ~1 cube-side radius around a few fixed locations |
| wide | 100 | cubes within ~2 cube-side radius (i.e. the same radius doubled) |
| combined | 200 | union of narrow and wide |
Setup
[WRITE YOURSELF: a couple of sentences on the TRLC-DK1 dual-arm rig from The Robot Learning Company: DoF, motor family, what cameras you mounted where, anything specific to the right arm. I don't want to invent specs.]
[IMAGE PLACEHOLDER: media/setup.jpeg. Caption: "Data collection setup, single right arm with top + wrist cameras."]
One workflow detail worth calling out: we use a foot pedal wired through LeRobot to start, end, discard, and rate each demonstration (1 to 5 quality score, written into the dataset's metadata). It's a small change but it shaves real time off a recording session: you never have to take your hands off the leader arms to push a key. Same idea with make targets that wrap the long lerobot-record invocations; it makes the operator side fast enough to actually want to use.
Data collection and what I learned to look out for
Things I tried to apply, partly from the LeRobot blog and partly from Dominique:
- Practice the task yourself before recording any episodes. First demonstrations are always your worst.
- Be decisive. Imitation policies imitate everything, including your hesitation.
- Commit to one strategy. If you sometimes grasp from the side and sometimes from the top, the policy has to model the choice. For a simple task and a small dataset, that's wasted capacity.
I also paid attention to camera placement: not just whether I could solve the task while looking at the camera feed (I couldn't reliably, even after trying; wrist cameras are deeply unintuitive for teleop), but whether the policy would have the information it needed at inference time. That's a real failure mode in imitation learning: if your demonstrations rely on information that won't be in the policy's observation stream, the policy can't reproduce them. Sherry Chen's blog post on ACT for SO-101 makes this point clearly.
My personal takeaway: be deliberate about what the cameras see, but don't try to teleoperate from the camera feed alone. The speed and quality hit isn't worth it for a simple task. Knowing what the policy will see is what matters.
[IMAGE PLACEHOLDER: side by side, media/overlay_ldd_mean.jpg and media/overlay_mdd_mean.jpg. Caption: "Cube starting positions across 5 trials. Left: narrow placement. Right: wide placement."]
First ACT policy: surprised it works at all
I trained a first ACT policy on the 100 narrow demonstrations, with the default LeRobot config. The whole thing was make train, wait roughly 15 minutes on our local RTX 5090, done.
Honestly, I had low expectations. After 100 demonstrations and 15 minutes of training on a 6-DoF arm, I expected the robot to jitter approximately in the direction of the task. Instead the intent was clearly there: grasp the right cube, move to the left one, drop. Lots of misses, motion not exactly smooth, but the policy obviously knew what it was supposed to do.
That was my first surprise. Imitation learning is more sample-efficient than I'd internalised.
[VIDEO PLACEHOLDER: media/videos/act_v1_baseline.mp4. Caption: "Initial act_v1 policy on narrow. Visibly rough, but the intent is clearly there."]
At that point I realised I'd been training a policy without reading its paper, so I went and read the ACT blog and paper. ACT is, very roughly: an encoder-decoder transformer that takes recent observations (joint states + camera images) and predicts a chunk of future actions in one shot, with an optional CVAE branch that injects stochasticity to handle multi-modal demonstrations. The "predict a chunk at a time" part matters. At inference, you commit to N actions before re-planning, which is way smoother than predicting one timestep at a time.
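The execution pattern, as I understand it, looks roughly like the sketch below. `policy` and `robot` are hypothetical stand-ins (LeRobot's actual interfaces differ); the point is only the commit-to-a-chunk-then-replan loop:

```python
def run_chunked(policy, robot, n_action_steps=50, max_chunks=20):
    """Run a chunking policy: one inference call, then commit to the next
    n_action_steps actions before re-planning. Interfaces are hypothetical."""
    for _ in range(max_chunks):
        obs = robot.get_observation()          # joint states + camera images
        chunk = policy.predict_chunk(obs)      # one forward pass -> a sequence of future actions
        for action in chunk[:n_action_steps]:  # at 50 Hz, 50 actions cover one second of motion
            robot.send_action(action)
```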
Improving the policy: Tony's tips and the GPU lesson
Dominique pointed me at Tony Zhao's tuning tips for ACT. Two takeaways I acted on:
- `chunk_size` and `n_action_steps` should match. I had defaults of 48 and 30, which means the policy generates 48 actions per inference but only executes 30 before re-planning: wasted compute. Setting both to 50 (one second at our 50 Hz control rate) was the obvious choice. That's what act_v2 uses.
- Larger batches help. This is where I made my beginner mistake worth confessing: I checked Weights & Biases mid-training and saw GPU utilisation at ~95%, which I read as "GPU is maxed out, can't push harder". Wrong: utilisation tracks how often the GPU is doing something, not how much of its capacity is used. The memory side was at ~25%, which is the more honest "are we using the chip" number (the snippet below is what I check now instead). Bigger batches mean more parallel work per kernel launch and fewer kernel launches per training step, so faster wall-clock time and (often) better gradients.

act_v3 bumped the batch size from 16 to 64 (now hitting ~86% VRAM) and the learning rate from 3e-5 to 5e-5 to compensate, as Tony's tips suggest.
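Concretely, before settling on a batch size I now look at peak allocated memory rather than the utilisation graph. A minimal check, assuming a PyTorch training loop on a single CUDA device:

```python
import torch

torch.cuda.reset_peak_memory_stats()
# ... run a handful of training steps at the candidate batch size ...
peak_gb = torch.cuda.max_memory_allocated() / 1e9
total_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
print(f"peak VRAM: {peak_gb:.1f} / {total_gb:.1f} GB")  # lots of headroom -> try a bigger batch
```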
So in the end I had three ACT configurations:
| config | chunk / n_action_steps | batch | lr | steps |
|---|---|---|---|---|
| act_v1 | 48 / 30 | 16 | 3e-5 | 10 000 |
| act_v2 | 50 / 50 | 16 | 3e-5 | 20 000 |
| act_v3 | 50 / 50 | 64 | 5e-5 | 20 000 |
For diffusion policy I used the configuration from the original paper's "Real Pour / Spread / Mug Flip" tasks (DDIM with 16 inference steps, horizon=16, n_action_steps=8) for 20 000 steps. More on diffusion in its own section below.
[IMAGE PLACEHOLDER: media/wandb_train_loss_act.svg (top), media/wandb_train_loss_dp.svg (bottom). Caption: "Training L1 loss for ACT (top); training loss for Diffusion Policy (bottom)."]
The evaluation problem (round 1)
Once the three ACT configurations had trained overnight, I needed to actually compare them. I copied the scoring scheme from the π₀ paper: a task-progress score from 0.0 (failure) up to 1.0 (block stacked), in 0.2 increments per completed stage.
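To make the scoring concrete, here's how I think of it; the stage names below are my own shorthand for this stacking task, not the π₀ paper's wording:

```python
def progress_score(reached, grasped, lifted, released, stacked):
    """Task-progress score in 0.2 increments. Stage names are my shorthand
    for the cube-stack task; a stage only counts if all earlier ones succeeded."""
    score = 0.0
    for done in (reached, grasped, lifted, released, stacked):
        if not done:
            break
        score += 0.2
    return score

progress_score(True, True, True, False, False)  # 0.6: grasped and lifted, never released over the target
```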
Now I had to design the evaluation. Two choices that mattered, both of which I got partly wrong on the first pass:
- Sample size. I started with 10 episodes per checkpoint, half at narrow cube positions, half at wide. I figured 10 was "enough to see if something obvious was happening". It wasn't. With per-episode noise that big, most cross-policy differences could plausibly be just sampling luck (there's a quick back-of-envelope on this after the list).
- Reproducibility of cube positions. I took photos of the cube starting positions and reused them across sessions so the test set was actually fixed. Tedious, but at least it ruled out "the cubes happened to be in a hard spot" as a confound.
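On the sample-size point, here's the back-of-envelope I wish I'd done first (my own arithmetic, not from any of the papers): the 95% interval on a success-rate estimate shrinks slowly with episode count.

```python
import math

def ci_halfwidth(p_hat, n):
    # normal-approximation 95% confidence half-width for a success rate
    return 1.96 * math.sqrt(p_hat * (1 - p_hat) / n)

for n in (10, 25, 50):
    print(n, round(ci_halfwidth(0.5, n), 2))
# n=10 -> ±0.31, n=25 -> ±0.2, n=50 -> ±0.14
```

At 10 episodes per cell, two policies can differ by thirty percentage points of success rate and still sit inside the noise.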
The result of round 1 was: I had a 10-point checkpoint sweep on act_v2 trained on narrow (steps 2k to 20k, n=10 each) plus a few single-checkpoint cells across the other (dataset, config) combinations. Qualitatively, early checkpoints moved more roughly than later ones, and later checkpoints tended to stall on harder cube positions, a classic late-training overfit signature on a narrow dataset. But with only 10 episodes per cell, the numbers had so much noise that the trends were hard to defend in a paper-style write-up. I had to do it again.
Tighter evaluations (round 2)
For round 2 I bumped to 50 episodes per setting (25 narrow positions, 25 wide positions). I picked two things to compare with this tighter budget:
Experiment 1: how much does training-step depth matter? Specifically act_v3 trained on combined, at step 10 000 vs step 20 000.
Experiment 2: how much does dataset diversity matter? All three of act_v3 on narrow, act_v3 on wide, act_v3 on combined, all evaluated at the final (20 000-step) checkpoint.
Experiment 1 result
| checkpoint | overall score | narrow positions | wide positions | strict success rate |
|---|---|---|---|---|
| act_v3 on combined, step 10 000 | 0.58 ± 0.34 | 0.62 ± 0.33 | 0.53 ± 0.35 | 22% |
| act_v3 on combined, step 20 000 | 0.83 ± 0.27 | 0.97 ± 0.07 | 0.70 ± 0.32 | 58% |
The same training run at 20 000 steps completely changes the picture. On in-distribution cube positions the task is close to solved, at 84% strict success, and the policy never fails earlier than the release stage: every single narrow-position episode reaches "release block A". On the harder, wider distribution it succeeds 32% of the time. The takeaway was straightforward: the 10 000-step checkpoint was undertrained.
[VIDEO PLACEHOLDER: media/videos/v3_act_v3_last_run.mp4. Caption: "act_v3 on combined, step 20 000, 50-episode evaluation. The headline working result."]
Experiment 2 result (same location)
At first read this looked clean too: training on combined was the best, narrower datasets were behind. But the differences were small, a few percentage points of strict success, and the per-cell noise was still in the same range as the differences I wanted to claim. I couldn't actually say diversity helped on the basis of those numbers; the gap might just be sampling luck.
That ambiguity is what pushed me to the next experiment.
The location-shift surprise (round 3)
I moved the rig to a different physical location with a different background and re-ran the same act_v3 on combined evaluation, plus act_v3 on narrow and act_v3 on wide for the diversity comparison. 50 episodes each, same 25-narrow / 25-wide cube-placement split.
The result was much more decisive than I expected.
Same policy (act_v3 on combined), different location:
| cube positions | original location | new location | drop |
|---|---|---|---|
| narrow | 0.97 (84% succ) | 0.51 (24% succ) | -0.46 |
| wide | 0.70 (32% succ) | 0.30 (0% succ) | -0.40 |
| both | 0.83 | 0.40 | -0.43 |
The same trained policy halved its score by moving rooms. On wide-position episodes at the new location, zero out of 25 attempts produced a full stack.
Same location (new), different training dataset:
| training dataset | overall score | strict success | reached release (>= 0.8) |
|---|---|---|---|
| narrow (100 demos) | 0.36 | 10% | 24% |
| wide (100 demos) | 0.47 | 10% | 41% |
| combined (200 demos) | 0.40 | 12% | 30% |
The differences between the three datasets are small relative to the per-cell noise, so at the new location, the dataset diversity I'd built specifically to test generalisation did not help in any clearly measurable way.
What this says to me is that for this policy, on this task, with this amount of data, the dominant axis of generalisation isn't "where the cubes start", it's the visual context of the workspace. Slight changes to camera framing, table colour, ambient light: catastrophic. Wider object-placement diversity in the data doesn't transfer across that gap.
I find that genuinely useful to know. The 97% narrow-position success number from act_v3 on combined wasn't fake, it just had a more limited scope than I'd have guessed before doing this experiment.
[IMAGE PLACEHOLDER: media/eval_results.png. Caption: "Top: act_v2 on narrow checkpoint sweep. Bottom: cross-policy comparison at the canonical 12 000-step checkpoint, with the two annotated act_v3 on combined 50-episode bars."]
What about Diffusion Policy?
I trained DPs with the paper's "Real Pour / Spread / Mug Flip" configuration on each of the three datasets. None of them performed well in evaluation, all sat around 0.30 score, and qualitatively the motion was visibly jerky, with the gripper sometimes hovering and backing off near the first cube before committing.
I spent some time on this before deciding it was a sidebar rather than a comparable result. The smoking gun was a record-loop diagnostic:
```
Record loop is running slower (28.5 Hz) than the target FPS (50 Hz).
breakdown: obs=0.1ms inference=34.9ms send=0.0ms post=0.1ms total=35.1ms
```
DP inference with 16 denoising steps takes about 35 ms on our 5090. Our control loop targets 50 Hz, which is a 20 ms budget per tick. The math: DP can't service every control tick, even on a recent GPU. Because the policy uses action chunking (n_action_steps=8), only every 8th tick actually runs the full denoising. The other 7 are sub-millisecond queue pops. So one tick in eight blows the budget by 15 ms, the loop runs uneven, and the motors see a stuttery cadence.
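Written out as a tiny script, with the numbers from the diagnostic above (this is just the budget arithmetic, not LeRobot code):

```python
fps = 50
budget_ms = 1000 / fps      # 20 ms per control tick
inference_ms = 35           # measured: 16 DDIM steps on the 5090
n_action_steps = 8          # ticks served per denoising pass

# One tick in every n_action_steps pays the full denoising cost;
# the others just pop a pre-computed action off the queue.
tick_costs = [inference_ms if i % n_action_steps == 0 else 0.1 for i in range(16)]
print(budget_ms)                    # 20.0
print(max(tick_costs) - budget_ms)  # 15.0 ms over budget on the heavy ticks
```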
I tried the obvious knobs:
- `fps=30`: wider 33 ms budget, but inference jitter pushed inference up to ~45 ms in that run, still over budget.
- `num_inference_steps=8`: halves the denoising work, lands inference at ~24 ms, still 4 ms over the 50 Hz budget.
- `fps=30` + `num_inference_steps=8`: the first combo that fits. No more warnings. Motion still jerky.
That last result is the interesting one. Once the loop timing is OK, the remaining jerkiness is not a timing problem, it's the policy itself disagreeing with its previous chunk every 8 frames. That's a known DP failure mode and it's separate from inference latency.
[VIDEO PLACEHOLDER: side by side, media/videos/dp_50fps16steps.mp4, media/videos/dp_30fps16steps.mp4, media/videos/dp_50fps8steps.mp4. Caption: "Left to right: 50 fps / 16 inference steps (baseline); 30 fps / 16 steps; 50 fps / 8 steps."]
Honest takeaway. Diffusion Policy underperforms ACT in this experiment, but the comparison isn't fair: the policy never got to run with a stable control cadence at the training rate. A re-train at the rate the GPU can actually serve (probably 30 Hz), or with num_inference_steps reduced before training, would be the prerequisite for a fair ACT-vs-DP comparison on this hardware. I didn't do that; that's a different week.
What I'd do differently
If I were running this experiment for the first time, knowing what I know now:
- Plan the evaluation before you train. I wandered into round 1 with vague ideas about how to compare checkpoints, ended up with under-sampled noisy data, and had to redo most of it. A 30-minute plan ahead of time would have saved a day of robot time.
- Pick fewer comparisons, with more samples each. 10 episodes per cell is almost never enough on a noisy 0 to 1 task-success score. 25 to 50 per cell is the floor for any claim you want to defend. I'd cut the number of (policy x checkpoint) cells in half and double the episodes per cell.
- Decide what "out-of-distribution" actually means. I started this with a specific axis in mind (cube placement), trained data to vary it, then discovered after the fact that the much bigger axis in practice was the visual context. Worth a quick pilot to find out what actually shifts behaviour before committing to a multi-day data-collection plan.
- Don't mix in DP unless you've checked it can run real-time on the hardware. A 5-minute timing test on a single checkpoint would have told me DP wasn't going to give comparable numbers at 50 Hz, before I trained three of them.
Other experiments I'd like to run
A few things I left on the table that I think would be worth a follow-up:
- Re-train DP at a frame rate the GPU can actually serve (30 Hz, possibly with fewer inference steps), then redo the ACT-vs-DP comparison at matched cadence.
- Wider-context demonstrations: train ACT with deliberately varied lighting, background, and camera framing, and test whether visual diversity in the training data closes the location-shift gap I found in round 3.
- Test multi-modality directly: pick a task that has two equally-valid solutions, demonstrate both, and see whether ACT and DP commit to one mode or genuinely mix. Compare against half-data (single-mode) versions to see if the multi-mode case actually costs you anything.
- Two-arm-observer experiment: for a one-arm task, run two arms, one solving, one parked with a wrist camera oriented at the workpiece. Compare against a single-arm baseline. Inspired by Pi's RLT post.
Wrap-up
[WRITE YOURSELF: 3 to 5 sentences reflecting on the whole week. Things worth touching: imitation learning is more sample-efficient on a real arm than you'd expect; evaluation is harder than training; "obvious" sources of OOD aren't always the actual ones; small-N evaluation lies to you.]
Resources
[WRITE YOURSELF: trim to the 5 to 8 you actually want to recommend. Candidates from this draft:]
- Diffusion Policy, Chi et al. (2023)
- π₀, Black et al. (2024) and the pi0 blog post
- ACT (ALOHA) project page and tuning tips
- LeRobot docs
- LeRobot cloth-folding write-up
- The Illustrated Transformer, Jay Alammar
- ACT-on-SO-101 blog, Sherry Chen
- π*0.6 demos and Pi's RLT post