From Conditioning Vectors to Token Attention

1. What You Already Know

The two-step framework from Diffusion Policy.

STEP 1: Build a Global Conditioning Vector

Take the camera image, robot joint positions, and the current diffusion timestep. Run them through encoder networks and compress everything into a single vector.

STEP 2: Neural Network Predicts Noise

Feed the conditioning vector + noisy actions into a neural network. It predicts the noise. Subtract the noise to get cleaner actions. Repeat.

Key observation: Everything the robot sees and knows gets compressed into a single conditioning vector. This vector is the only way observations influence the predicted actions.
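The two steps above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the Diffusion Policy implementation: `encode`, `build_conditioning_vector`, `predict_noise`, and `denoise` are hypothetical names, the "encoders" are fixed random projections, the "network" is a placeholder formula, and the update omits the scaling terms a real noise schedule would use. The point it shows is that the conditioning vector has the same fixed size no matter how large the image is.

```python
import random

def encode(values, out_dim, seed):
    """Toy stand-in for an encoder network: a fixed random linear projection."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-1, 1) for _ in values] for _ in range(out_dim)]
    return [sum(w * v for w, v in zip(row, values)) for row in weights]

def build_conditioning_vector(image, joints, timestep):
    """Step 1: compress every input into one fixed-size vector c (8 numbers here)."""
    return (encode(image, 4, seed=0)
            + encode(joints, 2, seed=1)
            + encode([float(timestep)], 2, seed=2))

def predict_noise(c, noisy_actions):
    """Step 2 (placeholder): the 'network' sees only c and the noisy actions."""
    bias = sum(c) / len(c)
    return [0.1 * (a - bias) for a in noisy_actions]

def denoise(image, joints, num_steps=10, horizon=4, seed=0):
    """Start from pure noise and repeatedly subtract the predicted noise."""
    rng = random.Random(seed)
    actions = [rng.gauss(0, 1) for _ in range(horizon)]
    for t in reversed(range(num_steps)):
        c = build_conditioning_vector(image, joints, t)
        noise = predict_noise(c, actions)
        actions = [a - n for a, n in zip(actions, noise)]
    return actions

small_c = build_conditioning_vector([0.1] * 16, [0.5, -0.2], timestep=3)
large_c = build_conditioning_vector([0.1] * 1024, [0.5, -0.2], timestep=3)
actions = denoise([0.1] * 16, [0.5, -0.2])
```

Both `small_c` and `large_c` have the same length, even though the second image has 64 times as many values. That fixed size is the single channel through which observations reach the actions.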

2. What Changes in Pi Zero

The same two-step framework, but Step 1 is fundamentally different. For comparison, here is the Diffusion Policy version:

Step 1: Processing Observations

Camera image + robot state + timestep are encoded into a single conditioning vector.

Camera Image + Robot State + Timestep → compress → Single Conditioning Vector c

Step 2: Conditioning the Network

The conditioning vector is concatenated with noisy actions and fed to the neural network.

The problem: All observations are squeezed through one fixed-size vector, so fine-grained detail can get lost, like trying to describe a photo in one sentence.

3. Everything Becomes Tokens

Instead of one conditioning vector, we represent every piece of information as a token.

Observations:
Image Tokens: I₁ I₂ I₃ I₄
Language Tokens: L₁ L₂ L₃
State Tokens: S₁ S₂

Actions:
Action Tokens: A₁ A₂ A₃ A₄

Each token carries rich information about one piece of the observation or one action step.

The key question: How do the action tokens (A₁-A₄) know about the observations (I, L, S tokens)? They need to be conditioned on the observations to predict useful actions. The answer: attention.
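As a rough sketch of what "everything becomes tokens" means as a data structure, the snippet below builds one sequence with the same token counts used in this walkthrough (4 image, 3 language, 2 state, 4 action). The `Token` class, `make_tokens`, and `build_sequence` are hypothetical names for illustration, and the embedding vectors are zero-filled placeholders rather than real encoder outputs.

```python
from dataclasses import dataclass

@dataclass
class Token:
    name: str     # label such as "I1" or "A3"
    kind: str     # "image", "language", "state", or "action"
    vector: list  # embedding for this token (placeholder values here)

def make_tokens(prefix, kind, count, dim=4):
    """Create `count` tokens of one kind, named prefix1..prefixN."""
    return [Token(f"{prefix}{i}", kind, [0.0] * dim) for i in range(1, count + 1)]

def build_sequence():
    """Observation tokens first, then action tokens, all in one sequence."""
    observations = (make_tokens("I", "image", 4)
                    + make_tokens("L", "language", 3)
                    + make_tokens("S", "state", 2))
    actions = make_tokens("A", "action", 4)
    return observations + actions

sequence = build_sequence()
names = [t.name for t in sequence]
num_obs = sum(1 for t in sequence if t.kind != "action")
```

Unlike a single conditioning vector, the sequence grows with the amount of information: more image patches or a longer instruction simply means more tokens.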

4. Attention: How Actions "See" Observations

This is the core mechanism. The attention mask controls which tokens can look at which other tokens.

Attention mask (rows: Query, who is looking; columns: Key, who is being looked at):

                      Observation keys                Action keys
Observation queries   Observations see observations   Blocked (cannot attend)
Action queries        Actions see observations        Actions see actions

The "aha" moment: Look at the bottom-left block of the mask (terracotta in the original figure, "actions see observations"). Every action token can attend to every observation token. This IS the conditioning: instead of a single vector, each action token directly queries the observations it needs through attention.
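A block mask like the one described here can be built in a few lines. This is a minimal sketch under assumptions: it treats all observation tokens as one group, lets every action token attend to every action token, and blocks observation queries from attending to action keys. The function name `build_attention_mask` is hypothetical, and the exact mask used by Pi Zero may be finer-grained than this two-group version.

```python
def build_attention_mask(num_obs, num_act):
    """Return mask[q][k], True when query token q may attend to key token k.

    Tokens 0..num_obs-1 are observations; the rest are actions.
    """
    n = num_obs + num_act
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            query_is_action = q >= num_obs
            key_is_obs = k < num_obs
            # Every query may look at observation keys;
            # only action queries may look at action keys.
            mask[q][k] = key_is_obs or query_is_action
    return mask

mask = build_attention_mask(num_obs=9, num_act=4)

# Print the matrix: '#' = may attend, '.' = blocked.
for row in mask:
    print("".join("#" if allowed else "." for allowed in row))
```

In the printed matrix the bottom-left block (action rows, observation columns) is fully filled in, which is the conditioning path, while the top-right block (observation rows, action columns) is blocked.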

5. Watching Attention in Action

For any action token, the visualization shows which observation tokens it attends to. The beam thickness shows attention weight.

Image Tokens: I₁ I₂ I₃ I₄
Language: L₁ L₂ L₃
State: S₁ S₂
Action Tokens: A₁ A₂ A₃ A₄

Each action token selectively attends to the observation tokens it needs.
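The attention weight behind each beam can be computed with standard scaled dot-product attention. The sketch below handles a single query (one action token) over a list of keys, with a per-key `allowed` flag playing the role of one row of the attention mask. The function names and the toy vectors are assumptions made for illustration; a real model would use learned query, key, and value projections and many attention heads.

```python
import math

def softmax(scores):
    """Numerically stable softmax; blocked scores (-inf) get weight 0."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, keys, allowed):
    """Weights for one query over all keys (at least one key must be allowed)."""
    scale = math.sqrt(len(query))
    scores = []
    for key, ok in zip(keys, allowed):
        dot = sum(q * k for q, k in zip(query, key))
        scores.append(dot / scale if ok else float("-inf"))
    return softmax(scores)

def attend(query, keys, values, allowed):
    """Weighted sum of the value vectors: what the query token 'reads'."""
    weights = attention_weights(query, keys, allowed)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

# One action query and three keys; the third key is blocked by the mask.
query = [1.0, 0.0]
keys = [[2.0, 0.0], [0.0, 2.0], [2.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
allowed = [True, True, False]

weights = attention_weights(query, keys, allowed)
output = attend(query, keys, values, allowed)
```

Here the first key matches the query best and receives the largest weight, the second receives a smaller one, and the blocked third key receives exactly zero, so its value never reaches the output.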

6. Side by Side: Two Ways to Condition

Both approaches follow the same two-step framework. The difference is in how observations reach the actions.

Global Conditioning Vector (Diffusion Policy)
Camera Image + Robot State + Timestep → compress → One Vector c → Neural Network
Information bottleneck: details get lost

Token Attention (Pi Zero approach)
Camera → Image Tokens, Language → Language Tokens, State → State Tokens
→ attention → A₁ A₂ A₃ A₄ → Neural Network
Fine-grained access: each action attends selectively

7. The Key Takeaway

1. Observations Become Tokens

Images, language instructions, and robot state are each converted into sequences of tokens — not compressed into one vector.

2. Actions Are Tokens Too

The predicted actions are also tokens in the same sequence. They sit alongside the observation tokens.

3. Attention IS Conditioning

Action tokens attend to observation tokens through the attention mechanism. This is how observations influence actions, without passing through a single-vector bottleneck.

The two-step framework stays the same.
The conditioning mechanism evolves:
single vector → token attention

More flexible and more expressive, with no single-vector information bottleneck.
