RoboPapers

Ep#68: DreamZero: World Action Models are Zero-Shot Policies

Mar 20, 2026
Seonghyeon Ye, a PhD student at KAIST and NVIDIA research intern, is a co-author of DreamZero and builder of a 14B World Action Model. He discusses using video-generation priors for robot control, joint video-action modeling, inverse dynamics, system tricks to run a 14B model at real-time rates, and surprising cross-embodiment and few-shot transfer from short video data.
INSIGHT

World Action Models Learn Dynamics From Video

  • World Action Models (WAMs) jointly predict video futures and actions to learn physical dynamics rather than mapping observations straight to actions.
  • Seonghyeon Ye explains WAMs use video as a dense representation so inverse dynamics is easier and generalizes to new motions from diverse data.
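The idea that a dense video representation makes inverse dynamics easy can be illustrated with a toy world. This is a minimal sketch, not the DreamZero model: the "frame" is a tiny grid with one lit pixel, the "action" is a displacement, and the hypothetical `inverse_dynamics` helper recovers the action directly from two consecutive frames.

```python
import numpy as np

# Toy world: the state is a 2D position rendered into an 8x8 "frame";
# the action is the displacement between consecutive positions.
def render(pos):
    frame = np.zeros((8, 8))
    x, y = np.clip(pos.astype(int), 0, 7)
    frame[x, y] = 1.0
    return frame

def inverse_dynamics(frame_t, frame_t1):
    # Because the frame densely encodes the state, recovering the action
    # reduces to differencing the two decoded positions.
    p_t = np.argwhere(frame_t == 1.0)[0]
    p_t1 = np.argwhere(frame_t1 == 1.0)[0]
    return p_t1 - p_t

pos = np.array([2.0, 3.0])
action = np.array([1, 2])
f0, f1 = render(pos), render(pos + action)
recovered = inverse_dynamics(f0, f1)
print(recovered)  # → [1 2]
```

In the real setting the decoder is a learned network rather than an `argwhere`, but the structure is the same: predict the video future, then read actions off it.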
ADVICE

Convert Video Diffusion Into A Closed Loop Policy

  • Turn a pretrained video diffusion backbone into a policy by adding inverse dynamics and training to jointly denoise video and actions with an autoregressive objective.
  • Seonghyeon Ye describes converting a bidirectional model to autoregressive and teacher-forcing video plus action denoising for policy use.
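A toy numerical sketch of the joint-denoising objective, under simplifying assumptions (a one-step DDPM-style noising, an oracle denoiser standing in for the trained network): video and action latents are concatenated so a single model denoises both together, which is the core of the joint objective described above.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy latents: a "video" chunk and an "action" chunk, concatenated so
# one model denoises both jointly (the joint video-action objective).
video = rng.normal(size=4)
action = rng.normal(size=2)
x0 = np.concatenate([video, action])

# Forward diffusion at noise level t (variance-preserving mix).
t = 0.3
eps = rng.normal(size=x0.shape)
xt = np.sqrt(1 - t) * x0 + np.sqrt(t) * eps

# Oracle denoiser: stands in for the trained joint model, which in
# training would be teacher-forced on ground-truth past frames.
def denoiser(xt, t, eps_hat=eps):
    return eps_hat

# Reverse step: subtract predicted noise to recover the clean latent.
x0_hat = (xt - np.sqrt(t) * denoiser(xt, t)) / np.sqrt(1 - t)
video_hat, action_hat = x0_hat[:4], x0_hat[4:]
```

Training would regress the denoiser's noise prediction toward `eps` across timesteps; the autoregressive conversion constrains each chunk to attend only to past (teacher-forced) context.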
INSIGHT

Autoregression Lets You Patch Generated Context With Reality

  • Autoregressive generation enables KV-cache updates with real observations, avoiding the error accumulation of rolling out purely on generated frames.
  • Seonghyeon Ye notes policies can replace generated frames in the cache with live observations during closed-loop control to prevent blur and drift.
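The cache-patching trick can be sketched as follows. This is a schematic stand-in, not a real transformer KV cache: the hypothetical `ContextCache` keeps one entry per past frame, and the policy overwrites a generated entry with the embedding of the live observation so later predictions condition on reality instead of compounding drift.

```python
def embed(frame):
    # Stand-in for the vision encoder that produces cache entries.
    return sum(frame) / len(frame)

class ContextCache:
    def __init__(self):
        self.entries = []  # one cached entry per past frame

    def append_generated(self, frame):
        self.entries.append(("generated", embed(frame)))

    def patch_with_observation(self, step, real_frame):
        # Replace the imagined frame's entry with the real one, so the
        # model's future rollout is grounded in the live observation.
        self.entries[step] = ("observed", embed(real_frame))

cache = ContextCache()
cache.append_generated([0.9, 1.1])           # model's imagined frame
cache.patch_with_observation(0, [1.0, 1.0])  # live camera frame arrives
print(cache.entries[0][0])  # → observed
```

A bidirectional video model cannot do this cheaply, since every frame attends to every other; the autoregressive conversion is what makes per-step cache surgery valid.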