DREAMSTEER: Latent World Models Can Steer VLA Policies During Deployment

1Fundamental AI Research (FAIR), Meta 2University of Minnesota Twin Cities
*Work done during an internship at Meta. Joint last authors.

Latent world model overview

DREAMSTEER teaser figure
Heterogeneous latent action-conditioned world model. Our world model learns from diverse embodiments, maps visual and proprioceptive inputs into a shared latent space, and appends context tokens to capture recent interaction history.
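The shared-latent-space idea above can be sketched as follows. This is a minimal illustration under assumed shapes and names (`encode_obs`, `build_tokens`, the dimensions, and the linear projections are all hypothetical, not the paper's actual encoder): heterogeneous inputs are projected into one latent dimension, and latents from recent frames are appended as context tokens.

```python
import numpy as np

# Hypothetical dimensions; the paper does not specify these.
D = 16   # shared latent dimension
P = 4    # visual patches per frame

def encode_obs(rgb_patches, proprio, W_rgb, W_prop):
    """Project heterogeneous inputs (vision + proprioception) into one shared latent space."""
    vis = rgb_patches @ W_rgb           # (P, D) visual latents
    prop = (proprio @ W_prop)[None]     # (1, D) proprioceptive latent
    return np.concatenate([vis, prop], axis=0)   # (P + 1, D) per-frame latents

def build_tokens(history, current):
    """Append context tokens from recent interaction history to the current frame."""
    return np.concatenate(history + [current], axis=0)

rng = np.random.default_rng(0)
W_rgb = rng.normal(size=(32, D))    # toy projection for flattened RGB patches
W_prop = rng.normal(size=(7, D))    # toy projection for a 7-DoF proprio vector
frames = [encode_obs(rng.normal(size=(P, 32)), rng.normal(size=7), W_rgb, W_prop)
          for _ in range(3)]
tokens = build_tokens(frames[:-1], frames[-1])   # (3 * (P + 1), D) token sequence
```

Because every embodiment is mapped through its own projection into the same D-dimensional space, one transformer can consume the resulting token sequence regardless of the source robot.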

Model architecture

Model architecture figure
Architecture of the Spatio-Temporal Transformer Block. The model processes RGB and Control Latents through N repeated layers as shown on the left, utilizing factorized spatio-temporal self-attention for efficiency. The Spatio-Temporal Cross-Attention mechanism on the right integrates control signals by performing independent spatial cross-attention per timestep and causal temporal cross-attention per patch.
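The factorization described in the caption (spatial attention within each timestep, causal temporal attention per patch) can be sketched in numpy. This is a simplified single-head version without learned projections or the cross-attention path; `factorized_st_attention` and all shapes are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, k, v, mask=None):
    """Scaled dot-product attention over the second-to-last axis."""
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -1e9)   # block masked positions
    return softmax(scores) @ v

def factorized_st_attention(x):
    """x: (T, P, D) tokens — T timesteps, P spatial patches, D channels."""
    # Spatial: each timestep attends over its own patches (batched over T).
    x = x + attend(x, x, x)
    # Temporal: each patch attends causally over timesteps (batched over P).
    xt = x.swapaxes(0, 1)                        # (P, T, D)
    T = xt.shape[1]
    causal = np.tril(np.ones((T, T), dtype=bool))
    xt = xt + attend(xt, xt, xt, mask=causal)
    return xt.swapaxes(0, 1)                     # back to (T, P, D)
```

The efficiency gain comes from replacing one attention over all T*P tokens (cost proportional to (T*P)^2) with two smaller attentions of cost T*P^2 and P*T^2; the causal temporal mask keeps rollouts autoregressive in time.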
World model rollouts
Unseen episode: three paired rollouts (real vs. imagined)

Dexterous hands manipulation: three paired rollouts (real vs. imagined)

Unseen environment: three paired rollouts (real vs. imagined)

Abstract

Pretrained generalist policies, such as vision-language-action (VLA) models, promise impressive zero-shot generalization in robot manipulation. However, their real-world performance degrades quickly under distribution shift, leading to reduced robustness and inconsistent instruction following.

To address these challenges, we propose DreamSteer, a deploy-time steering framework to enhance pretrained VLAs without the need for finetuning on demonstration data collected in the target distribution. The key insight in DreamSteer is to leverage a latent world model and a general-purpose value function to steer pretrained VLA policies. During deployment, DreamSteer generates diverse action candidates, sourced from the VLA policy and a set of predefined motion primitives, and imagines the outcome of each action sequence by rolling it out within the latent world model.

By evaluating these predicted trajectories with the value model, DreamSteer identifies and executes the highest-scoring action, resulting in better instruction following and filtering out task-irrelevant behaviors. Across four real-world manipulation benchmarks of unseen objects, DreamSteer improves task success rates by 42.5 percentage points, from 23.75% to 66.25%, and increases instruction following accuracy by 17.5 percentage points, from 38.75% to 56.25%, compared to the base VLA. These results suggest that latent world models can steer VLA policies during deployment and provide an effective pathway for improving the reliability of generalist robot policies when finetuning may not be desired or feasible.

DreamSteer framework

DreamSteer overview figure
DreamSteer: test-time policy steering. Given a language instruction and the current observation, a pretrained generalist policy generates multiple candidate action sequences, which are augmented with a small set of predefined action primitives. An action-conditioned world model predicts the outcome of each candidate through imagined rollouts, and a language-conditioned value model scores the predicted trajectories. The highest-scoring action is then selected and executed in the real environment.
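The selection loop described above can be sketched in a few lines of Python. Everything here is a toy stand-in for illustration: the scalar latent state, the additive world model, and the goal-distance value function are assumptions, not the paper's learned models.

```python
def select_action(candidates, obs, world_model, value_model):
    """Roll each candidate action sequence out in imagination and pick the best."""
    def imagined_value(actions):
        latent = obs
        for a in actions:
            latent = world_model(latent, a)   # one latent prediction per action
        return value_model(latent)            # score the imagined outcome
    return max(candidates, key=imagined_value)

# Toy stand-ins (assumptions): the latent state is a scalar position, the world
# model integrates actions, and value is closeness to a goal position.
goal = 5.0
world_model = lambda state, action: state + action
value_model = lambda state: -abs(state - goal)

vla_samples = [[1.0, 1.0, 1.0], [2.0, 2.0, 2.0]]   # candidates sampled from the VLA
primitives = [[0.0, 0.0, 0.0], [5.0, 0.0, 0.0]]    # predefined motion primitives
best = select_action(vla_samples + primitives, 0.0, world_model, value_model)
# best is the primitive whose imagined rollout ends closest to the goal
```

Note that no gradient updates are involved: the VLA stays frozen, and steering happens purely by ranking imagined futures, which is what makes the method usable at deploy time.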

Real robot experiments

Out-of-distribution (OOD) results

Pick up the phone and place it in the brown box (π0 + DreamSteer vs. π0)

Pick up the mustard and place it in the brown box (π0 + DreamSteer vs. π0)

Pick up the whiteboard eraser and place it in the black bowl (π0 + DreamSteer vs. π0)

Pick up the blue tape and place it in the black bowl (π0 + DreamSteer vs. π0)

Instruction following (IF) accuracy

Pick up the banana and place it in the black bowl (π0 + DreamSteer vs. π0)

Pick up the sponge and place it in the black bowl (π0 + DreamSteer vs. π0)

Pick up the apple and place it in the brown box (π0 + DreamSteer vs. π0)

Pick up the pencil case and place it in the brown box (π0 + DreamSteer vs. π0)

Quantitative results

Out-of-distribution objects, success rate ↑

Policy                        Phone   Mustard  Tape    Eraser  Average
π0 (k=1)                      4/20    3/20     6/20    6/20    23.75%
π0 (k=5) + DreamSteer         7/20    6/20     11/20   10/20   42.5%
π0 (k=5+prim) + random        0/20    0/20     0/20    0/20    0%
prim + DreamSteer             0/20    0/20     0/20    0/20    0%
π0 (k=5+prim) + DreamSteer    12/20   11/20    16/20   14/20   66.25%

Instruction following accuracy ↑

Policy                        Sponge  Banana   Pencil case  Apple   Average
π0 (k=1)                      8/20    9/20     6/20         8/20    38.75%
π0 (k=5+prim) + DreamSteer    14/20   13/20    9/20         9/20    56.25%