From Conditioning Vectors to Token Attention

You already know how Diffusion Policy conditions actions on observations using a global vector. Now let's see how tokens and attention do the same job without squeezing every observation through a single vector.

Overview: conditioning vectors vs token attention

1

What You Already Know

The two-step framework from Diffusion Policy.

STEP 1

Build a Global Conditioning Vector

Take the camera image, the robot joint positions, and the current diffusion timestep k. Run them through encoder networks, then combine the results into a single conditioning vector c.
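As a rough sketch of Step 1, the snippet below builds one conditioning vector from toy inputs. The functions `encode_image`, `encode_state`, and `encode_timestep`, along with every dimension used, are hypothetical stand-ins for trained encoder networks, not the encoders Diffusion Policy actually uses.

```python
import math

def encode_image(pixels, out_dim=4):
    # Hypothetical stand-in for a vision encoder: average equal-sized pixel chunks.
    chunk = max(1, len(pixels) // out_dim)
    return [sum(pixels[i * chunk:(i + 1) * chunk]) / chunk for i in range(out_dim)]

def encode_state(joint_angles):
    # Hypothetical state encoder: sine and cosine of every joint angle.
    features = []
    for angle in joint_angles:
        features += [math.sin(angle), math.cos(angle)]
    return features

def encode_timestep(k, dim=4):
    # Sinusoidal embedding of the diffusion step k (dim must be even).
    embedding = []
    half = dim // 2
    for i in range(half):
        freq = 1.0 / (10000 ** (i / half))
        embedding += [math.sin(k * freq), math.cos(k * freq)]
    return embedding

def build_conditioning_vector(pixels, joint_angles, k):
    # Everything the policy knows ends up in this one flat list.
    return encode_image(pixels) + encode_state(joint_angles) + encode_timestep(k)

pixels = [i / 15 for i in range(16)]   # toy 4 x 4 grayscale "image"
joint_angles = [0.1, -0.4, 1.2]        # toy 3-joint robot state
c = build_conditioning_vector(pixels, joint_angles, k=10)
print(len(c))  # 4 image + 6 state + 4 timestep features = 14
```

However large the image or however many sensors there are, the output has a fixed length: here, 14 numbers.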

STEP 2

Neural Network Predicts Noise

Feed the conditioning vector and the noisy actions into a neural network, which predicts the noise in those actions. Subtract the predicted noise to get slightly cleaner actions, then repeat for each diffusion step.
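The loop in Step 2 can be sketched in a few lines. This is a toy under stated assumptions: `predict_noise` is a hypothetical stand-in for the trained network (it pretends the clean action is the mean of the conditioning vector), and the fixed `step_size` replaces the noise-schedule coefficients that a real diffusion sampler would use.

```python
import random

def predict_noise(noisy_actions, c, k):
    # Hypothetical stand-in for the trained network. It assumes the clean
    # action equals the mean of the conditioning vector c, so the "noise"
    # is whatever separates the current actions from that target.
    # The step index k is accepted but unused in this toy.
    target = sum(c) / len(c)
    return [a - target for a in noisy_actions]

def denoise(c, horizon=4, num_steps=10, step_size=0.5, seed=0):
    rng = random.Random(seed)
    # Start from pure Gaussian noise, one value per action step.
    actions = [rng.gauss(0.0, 1.0) for _ in range(horizon)]
    for k in reversed(range(num_steps)):
        eps = predict_noise(actions, c, k)
        # Subtract a fraction of the predicted noise, then repeat.
        actions = [a - step_size * e for a, e in zip(actions, eps)]
    return actions

c = [0.2, 0.4, 0.6]   # toy conditioning vector
actions = denoise(c)
print(actions)        # every entry ends up close to 0.4, the mean of c
```

Note that `c` is the only argument through which observations reach `predict_noise`.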

[Diagram: Camera (224 x 224), Robot State (joint angles), and Timestep (k = 0...K) → Encoder Networks → conditioning vector c. Noisy Actions + c → Neural Network → Predicted Noise.]
Key observation: Everything the robot sees and knows gets compressed into a single conditioning vector. This vector is the only way observations influence the predicted actions.
Diffusion Policy 2-step framework
2

What Changes in Pi Zero

The same two-step framework, but Step 1 is fundamentally different.

Diffusion Policy

Step 1: Processing Observations

Camera image + robot state + timestep are encoded into a single conditioning vector.

Camera Image
Robot State
Timestep
↓ compress ↓
Single Conditioning Vector c

Step 2: Conditioning the Network

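The conditioning vector is fed to the neural network together with the noisy actions. In the CNN variant of Diffusion Policy this is done with FiLM layers, which use the vector to scale and shift the network's features.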

The problem: All observations are squeezed through one fixed-size vector. Information can get lost, like trying to describe a photo in one sentence.
VLM replaces conditioning vector
3

Everything Becomes Tokens

Instead of one conditioning vector, we represent every piece of information as one or more tokens.

[Token sequence. Observations: Image Tokens I1-I4, Language Tokens L1-L3, State Tokens S1-S2. Actions: Action Tokens A1-A4.]

Each token carries rich information about one piece of the observation or one action step.
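To make the layout concrete, here is a minimal sketch that builds the same 13-token sequence. The `Token` class and `build_sequence` helper are hypothetical names used only for illustration, and the token counts are the ones from the diagram, not the sizes a real model uses.

```python
from dataclasses import dataclass

@dataclass
class Token:
    name: str   # e.g. "I1" or "A3"
    kind: str   # "image", "language", "state", or "action"

def build_sequence(n_image=4, n_language=3, n_state=2, n_action=4):
    # Observation tokens come first, action tokens last.
    layout = [
        ("I", "image", n_image),
        ("L", "language", n_language),
        ("S", "state", n_state),
        ("A", "action", n_action),
    ]
    return [Token(f"{prefix}{i + 1}", kind)
            for prefix, kind, count in layout
            for i in range(count)]

tokens = build_sequence()
print(" ".join(t.name for t in tokens))
# I1 I2 I3 I4 L1 L2 L3 S1 S2 A1 A2 A3 A4

observation_idx = [i for i, t in enumerate(tokens) if t.kind != "action"]
action_idx = [i for i, t in enumerate(tokens) if t.kind == "action"]
```

In a real model each token would also carry an embedding vector. Only the ordering and the observation/action split matter here.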

The key question: How do the action tokens (A1-A4) know about the observations (I, L, S tokens)? They need to be conditioned on the observations to predict useful actions. The answer: attention.
Token sequence layout
4

Attention: How Actions "See" Observations

This is the core mechanism. The attention mask controls which tokens can look at which other tokens.

[Attention mask matrix. Rows are queries (who is looking); columns are keys (who is being looked at). Legend: observations see observations; actions see observations; actions see actions; blocked (cannot attend).]

The "aha" moment: Look at the bottom-left block of the mask, where action queries meet observation keys. Every action token can attend to every observation token. This IS the conditioning: instead of reading a single vector, each action token directly queries the observations it needs through attention.
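The block structure can be sketched as a boolean matrix. This follows the simplified two-group legend above (observations versus actions) and assumes the blocked region is the one where observation queries would look at action keys. The mask in an actual model may be split into finer blocks, so treat this as an illustration rather than the exact Pi Zero mask.

```python
# Token kinds in sequence order: 4 image, 3 language, 2 state, 4 action.
KINDS = ["image"] * 4 + ["language"] * 3 + ["state"] * 2 + ["action"] * 4

def build_mask(kinds):
    """Return mask[q][k], True when query token q may attend to key token k."""
    mask = []
    for q_kind in kinds:
        row = []
        for k_kind in kinds:
            if q_kind == "action":
                # Actions see observations and other actions.
                row.append(True)
            else:
                # Observations see observations but never actions.
                row.append(k_kind != "action")
        mask.append(row)
    return mask

mask = build_mask(KINDS)
for row in mask:
    print("".join("#" if allowed else "." for allowed in row))
```

The first nine printed rows (observation queries) end in four dots, the blocked region. The last four rows (action queries) are all `#`, and their first nine columns are the bottom-left block that does the conditioning.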
Attention mask block structure
5

Watching Attention in Action

Pick any action token and look at which observation tokens it attends to. In the diagram, beam thickness shows attention weight.

[Diagram: observation tokens (Image Tokens I1-I4, Language L1-L3, State S1-S2) connected by attention beams to Action Tokens A1-A4.]

Each action token selectively attends to the observation tokens it needs.
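The attention weights behind those beams come from masked scaled dot-product attention, sketched below in plain Python. The query, key, and value vectors are random placeholders (in a trained model they come from learned projections), and the helper names `softmax` and `attend` are hypothetical.

```python
import math
import random

def softmax(scores):
    # Numerically stable softmax; a score of -inf becomes a weight of exactly 0.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values, allowed):
    # Scaled dot-product attention for one query token.
    scale = math.sqrt(len(query))
    scores = []
    for key, ok in zip(keys, allowed):
        dot = sum(q * k for q, k in zip(query, key))
        scores.append(dot / scale if ok else float("-inf"))
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return weights, output

names = ["I1", "I2", "I3", "I4", "L1", "L2", "L3", "S1", "S2", "A1", "A2", "A3", "A4"]
rng = random.Random(0)
dim = 8

def random_vectors():
    # Placeholder for learned projections of the token embeddings.
    return {n: [rng.gauss(0.0, 1.0) for _ in range(dim)] for n in names}

queries, key_map, value_map = random_vectors(), random_vectors(), random_vectors()
keys = [key_map[n] for n in names]
values = [value_map[n] for n in names]

# A1 is an action token, so it may attend to every token in the sequence.
a1_weights, a1_output = attend(queries["A1"], keys, values, [True] * len(names))

# I1 is an observation token, so the action keys are blocked.
obs_allowed = [not n.startswith("A") for n in names]
i1_weights, _ = attend(queries["I1"], keys, values, obs_allowed)

for n, w in zip(names, a1_weights):
    print(f"{n}: {w:.3f}")   # the "beam thickness" from A1 to each token
```

The weights for A1 sum to 1 and are spread unevenly across the tokens, which is the selective attention described above. For I1, the four weights on action tokens are exactly zero because the mask blocks them.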

Attention flow between tokens
6

Side by Side: Two Ways to Condition

Both approaches follow the same two-step framework. The difference is in how observations reach the actions.

Global Conditioning Vector (Diffusion Policy)

Camera Image, Robot State, Timestep
↓ compress ↓
One Vector c
Neural Network
Information bottleneck: details get lost

Token Attention (Pi Zero approach)

Camera → Image Tokens
Language → Language Tokens
State → State Tokens
↓ attention ↓
Action Tokens A1-A4
Neural Network
Fine-grained access: each action attends selectively
Side-by-side comparison
7

The Key Takeaway

1

Observations Become Tokens

Images, language instructions, and robot state are each converted into sequences of tokens — not compressed into one vector.

2

Actions Are Tokens Too

The predicted actions are also tokens in the same sequence. They sit alongside the observation tokens.

3

Attention IS Conditioning

Action tokens attend to observation tokens through the attention mechanism. This is how observations influence actions, with no single-vector bottleneck in between.

The two-step framework stays the same.
The conditioning mechanism evolves:
single vector → token attention

More flexible. More powerful. No information bottleneck.

Key insight: attention is conditioning