You already know how Diffusion Policy conditions actions on observations using a global vector. Now let's see how tokens and attention do the same thing — but better.
Scroll to explore ↓
The two-step framework from Diffusion Policy. Click each step to see how it works.
Take the camera image, robot joint positions, and the current timestep. Run them through encoder networks. Compress everything into a single vector.
Feed the conditioning vector + noisy actions into a neural network. It predicts the noise. Subtract the noise to get cleaner actions. Repeat.
The same two-step framework — but Step 1 is fundamentally different. Toggle to compare.
Camera image + robot state + timestep are encoded into a single conditioning vector.
The conditioning vector is concatenated with noisy actions and fed to the neural network.
Instead of one conditioning vector, we represent every piece of information as a token. Click any token to learn more.
Each token carries rich information about one piece of the observation or one action step.
This is the core mechanism. The attention mask controls which tokens can look at which other tokens. Hover over any cell in the matrix below.
See which tokens can attend to which — and understand how actions get conditioned on observations.
Click on any action token below to see which observation tokens it attends to. The beam thickness shows attention weight.
See the attention connections — each action token selectively attends to the observation tokens it needs.
Both approaches follow the same two-step framework. The difference is in how observations reach the actions.
(Diffusion Policy)
(Pi Zero approach)
Images, language instructions, and robot state are each converted into sequences of tokens — not compressed into one vector.
The predicted actions are also tokens in the same sequence. They sit alongside the observation tokens.
Action tokens attend to observation tokens through the attention mechanism. This is how observations influence actions — no bottleneck.
The two-step framework stays the same.
The conditioning mechanism evolves:
single vector
→ token attention
More flexible. More powerful. No information bottleneck.