DREAMSTEER: Latent World Models Can Steer VLA Policies During Deployment

1Fundamental AI Research (FAIR), Meta 2University of Minnesota Twin Cities
*Work done during an internship at Meta. Joint last authors.

Latent world model overview

DREAMSTEER teaser figure
Heterogeneous latent action-conditioned world model. Our world model learns from diverse embodiments, maps visual and proprioceptive inputs into a shared latent space, and appends context tokens to capture recent interaction history.
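The shared-latent-space idea above can be sketched as follows. This is a minimal illustration under assumed shapes and names (`encode_obs`, `build_tokens`, the dimensions, and the linear projections are all hypothetical, not the paper's actual encoder): heterogeneous inputs are projected into one latent dimension, and latents from recent frames are appended as context tokens.

```python
import numpy as np

# Hypothetical dimensions; the paper does not specify these.
D = 16   # shared latent dimension
P = 4    # visual patches per frame

def encode_obs(rgb_patches, proprio, W_rgb, W_prop):
    """Project heterogeneous inputs (vision + proprioception) into one shared latent space."""
    vis = rgb_patches @ W_rgb           # (P, D) visual latents
    prop = (proprio @ W_prop)[None]     # (1, D) proprioceptive latent
    return np.concatenate([vis, prop], axis=0)   # (P + 1, D) per-frame latents

def build_tokens(history, current):
    """Append context tokens from recent interaction history to the current frame."""
    return np.concatenate(history + [current], axis=0)

rng = np.random.default_rng(0)
W_rgb = rng.normal(size=(32, D))    # toy projection for flattened RGB patches
W_prop = rng.normal(size=(7, D))    # toy projection for a 7-DoF proprio vector
frames = [encode_obs(rng.normal(size=(P, 32)), rng.normal(size=7), W_rgb, W_prop)
          for _ in range(3)]
tokens = build_tokens(frames[:-1], frames[-1])   # (3 * (P + 1), D) token sequence
```

Because every embodiment is mapped through its own projection into the same D-dimensional space, one transformer can consume the resulting token sequence regardless of the source robot.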

Model architecture

Model architecture figure
Architecture of the Spatio-Temporal Transformer Block. The model processes RGB and Control Latents through N repeated layers as shown on the left, utilizing factorized spatio-temporal self-attention for efficiency. The Spatio-Temporal Cross-Attention mechanism on the right integrates control signals by performing independent spatial cross-attention per timestep and causal temporal cross-attention per patch.
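The factorization described in the caption (spatial attention within each timestep, causal temporal attention per patch) can be sketched in numpy. This is a simplified single-head version without learned projections or the cross-attention path; `factorized_st_attention` and all shapes are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, k, v, mask=None):
    """Scaled dot-product attention over the second-to-last axis."""
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -1e9)   # block masked positions
    return softmax(scores) @ v

def factorized_st_attention(x):
    """x: (T, P, D) tokens — T timesteps, P spatial patches, D channels."""
    # Spatial: each timestep attends over its own patches (batched over T).
    x = x + attend(x, x, x)
    # Temporal: each patch attends causally over timesteps (batched over P).
    xt = x.swapaxes(0, 1)                        # (P, T, D)
    T = xt.shape[1]
    causal = np.tril(np.ones((T, T), dtype=bool))
    xt = xt + attend(xt, xt, xt, mask=causal)
    return xt.swapaxes(0, 1)                     # back to (T, P, D)
```

The efficiency gain comes from replacing one attention over all T*P tokens (cost proportional to (T*P)^2) with two smaller attentions of cost T*P^2 and P*T^2; the causal temporal mask keeps rollouts autoregressive in time.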
World model rollouts
Unseen episode: three paired rollouts (real vs. imagined)

Dexterous hands manipulation: three paired rollouts (real vs. imagined)

Unseen environment: three paired rollouts (real vs. imagined)

Abstract

Pretrained generalist policies, such as vision-language-action (VLA) models, promise impressive zero-shot generalization in robot manipulation. However, their real-world performance degrades quickly under distribution shift, leading to reduced robustness and inconsistent instruction following.

To address these challenges, we propose DreamSteer, a deploy-time steering framework to enhance pretrained VLAs without the need for finetuning on demonstration data collected in the target distribution. The key insight in DreamSteer is to leverage a latent world model and a general-purpose value function to steer pretrained VLA policies. During deployment, DreamSteer generates diverse action candidates, sourced from the VLA policy and a set of predefined motion primitives, and imagines the outcome of each action sequence by rolling it out within the latent world model.

By evaluating these predicted trajectories with the value model, DreamSteer identifies and executes the highest-scoring action, resulting in better instruction following and filtering out task-irrelevant behaviors. Across four real-world manipulation benchmarks of unseen objects, DreamSteer improves task success rates by 42.5 percentage points, from 23.75% to 66.25%, and increases instruction following accuracy by 17.5 percentage points, from 38.75% to 56.25%, compared to the base VLA. These results suggest that latent world models can steer VLA policies during deployment and provide an effective pathway for improving the reliability of generalist robot policies when finetuning may not be desired or feasible.

DreamSteer framework

DreamSteer overview figure
DreamSteer: test-time policy steering. Given a language instruction and the current observation, a pretrained generalist policy generates multiple candidate action sequences, which are augmented with a small set of predefined action primitives. An action-conditioned world model predicts the outcome of each candidate through imagined rollouts, and a language-conditioned value model scores the predicted trajectories. The highest-scoring action is then selected and executed in the real environment.
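The selection loop described above can be sketched in a few lines of Python. Everything here is a toy stand-in for illustration: the scalar latent state, the additive world model, and the goal-distance value function are assumptions, not the paper's learned models.

```python
def select_action(candidates, obs, world_model, value_model):
    """Roll each candidate action sequence out in imagination and pick the best."""
    def imagined_value(actions):
        latent = obs
        for a in actions:
            latent = world_model(latent, a)   # one latent prediction per action
        return value_model(latent)            # score the imagined outcome
    return max(candidates, key=imagined_value)

# Toy stand-ins (assumptions): the latent state is a scalar position, the world
# model integrates actions, and value is closeness to a goal position.
goal = 5.0
world_model = lambda state, action: state + action
value_model = lambda state: -abs(state - goal)

vla_samples = [[1.0, 1.0, 1.0], [2.0, 2.0, 2.0]]   # candidates sampled from the VLA
primitives = [[0.0, 0.0, 0.0], [5.0, 0.0, 0.0]]    # predefined motion primitives
best = select_action(vla_samples + primitives, 0.0, world_model, value_model)
# best is the primitive whose imagined rollout ends closest to the goal
```

Note that no gradient updates are involved: the VLA stays frozen, and steering happens purely by ranking imagined futures, which is what makes the method usable at deploy time.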

Real robot experiments

Out-of-distribution (OOD) results

Pick up the phone and place it in the brown box (π0 + DreamSteer vs. π0)

Pick up the mustard and place it in the brown box (π0 + DreamSteer vs. π0)

Pick up the whiteboard eraser and place it in the black bowl (π0 + DreamSteer vs. π0)

Pick up the blue tape and place it in the black bowl (π0 + DreamSteer vs. π0)

Instruction following (IF) accuracy

Pick up the banana and place it in the black bowl (π0 + DreamSteer vs. π0)

Pick up the sponge and place it in the black bowl (π0 + DreamSteer vs. π0)

Pick up the apple and place it in the brown box (π0 + DreamSteer vs. π0)

Pick up the pencil case and place it in the brown box (π0 + DreamSteer vs. π0)

Quantitative results

Out-of-distribution objects, success rate ↑

Policy                        Phone   Mustard  Tape    Eraser  Average
π0 (k=1)                      4/20    3/20     6/20    6/20    23.75%
π0 (k=5) + DreamSteer         7/20    6/20     11/20   10/20   42.5%
π0 (k=5+prim) + random        0/20    0/20     0/20    0/20    0%
prim + DreamSteer             0/20    0/20     0/20    0/20    0%
π0 (k=5+prim) + DreamSteer    12/20   11/20    16/20   14/20   66.25%

Instruction following accuracy ↑

Policy                        Sponge  Banana   Pencil case  Apple   Average
π0 (k=1)                      8/20    9/20     6/20         8/20    38.75%
π0 (k=5+prim) + DreamSteer    14/20   13/20    9/20         9/20    56.25%