
RoboPapers Ep#68: DreamZero: World Action Models are Zero-Shot Policies
Mar 20, 2026
Seonghyeon Ye, a PhD student at KAIST and NVIDIA research intern, is a co-author of DreamZero and builder of a 14B World Action Model. He discusses using video-generation priors for robot control, joint video-action modeling, inverse dynamics, system tricks to run a 14B model at real-time rates, and surprising cross-embodiment and few-shot transfer from short video data.
World Action Models Learn Dynamics From Video
- World Action Models (WAMs) jointly predict video futures and actions to learn physical dynamics rather than mapping observations straight to actions.
- Seonghyeon Ye explains that WAMs use video as a dense representation, which makes inverse dynamics easier and lets the model generalize to new motions from diverse data (a minimal sketch follows below).
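
A minimal, illustrative sketch of the joint video-and-action heads described above, in PyTorch. The backbone, module names, and dimensions are assumptions for illustration, not taken from the DreamZero paper: a shared transformer is presumed to produce per-frame features, a video head predicts future frame latents, and an inverse-dynamics head reads pairs of consecutive frame features to recover the connecting action.

```python
# Illustrative WAM heads over an assumed per-frame feature backbone.
import torch
import torch.nn as nn

class WAMHeads(nn.Module):
    def __init__(self, d_model=1024, patch_dim=768, action_dim=7):
        super().__init__()
        # Video head: predicts (denoised) future-frame latent patches.
        self.video_head = nn.Linear(d_model, patch_dim)
        # Inverse-dynamics head: infers the action connecting two
        # consecutive frame latents, grounding actions in the dense
        # video representation rather than raw observations.
        self.inv_dyn = nn.Sequential(
            nn.Linear(2 * d_model, d_model),
            nn.GELU(),
            nn.Linear(d_model, action_dim),
        )

    def forward(self, frame_feats):
        # frame_feats: (batch, time, d_model) pooled per-frame features.
        video_pred = self.video_head(frame_feats)
        pairs = torch.cat([frame_feats[:, :-1], frame_feats[:, 1:]], dim=-1)
        action_pred = self.inv_dyn(pairs)  # one action per frame transition
        return video_pred, action_pred

heads = WAMHeads()
feats = torch.randn(2, 8, 1024)
video, actions = heads(feats)
print(video.shape, actions.shape)  # (2, 8, 768) and (2, 7, 7)
```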
Convert Video Diffusion Into A Closed Loop Policy
- Turn a pretrained video diffusion backbone into a policy by adding inverse dynamics and training to jointly denoise video and actions with an autoregressive objective.
- Seonghyeon Ye describes converting a bidirectional model into an autoregressive one and teacher-forcing joint video-and-action denoising for policy use (see the sketch after this list).
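
A hedged sketch of what "jointly denoise video and actions with teacher forcing" could look like as a training objective. Everything here is an assumption for illustration: the stub model, the rectified-flow-style corruption and velocity target, and the interface names. The point is only the structure: clean past frames are fed as context while noisy future frame latents and noisy action chunks are denoised together under one loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyJointDenoiser(nn.Module):
    """Stand-in for the causal video-action diffusion backbone (illustrative)."""
    def __init__(self, v_dim=64, a_dim=7):
        super().__init__()
        self.v_out = nn.Linear(v_dim + 1, v_dim)  # +1 for the timestep input
        self.a_out = nn.Linear(a_dim + 1, a_dim)

    def forward(self, context, noisy_video, noisy_actions, t):
        # A real model would attend causally over `context`; omitted here.
        tv = t.expand(*noisy_video.shape[:-1], 1)
        ta = t.expand(*noisy_actions.shape[:-1], 1)
        return (self.v_out(torch.cat([noisy_video, tv], -1)),
                self.a_out(torch.cat([noisy_actions, ta], -1)))

def joint_denoise_loss(model, past_frames, future_frames, actions):
    b = future_frames.shape[0]
    t = torch.rand(b, 1, 1, device=future_frames.device)
    noise_v = torch.randn_like(future_frames)
    noise_a = torch.randn_like(actions)
    # Flow-matching-style corruption of both modalities (an assumption).
    noisy_v = (1 - t) * future_frames + t * noise_v
    noisy_a = (1 - t) * actions + t * noise_a
    # Teacher forcing: clean past frames condition the causal model.
    pred_v, pred_a = model(past_frames, noisy_v, noisy_a, t)
    # Velocity targets for a rectified-flow parameterization.
    return F.mse_loss(pred_v, noise_v - future_frames) + \
           F.mse_loss(pred_a, noise_a - actions)

model = TinyJointDenoiser()
loss = joint_denoise_loss(model,
                          past_frames=torch.randn(2, 4, 64),
                          future_frames=torch.randn(2, 4, 64),
                          actions=torch.randn(2, 4, 7))
loss.backward()
```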
Autoregression Lets You Patch Generated Context With Reality
- Autoregressive generation lets the policy update the KV cache with real observations, avoiding the error accumulation of rolling out purely on generated frames.
- Seonghyeon Ye notes that during closed-loop control, policies can replace generated frames in the cache with live observations to prevent blur and drift (a toy sketch follows).
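
A toy sketch of the cache-patching idea: the rollout caches each generated frame's tokens so the next step can proceed immediately, then overwrites that entry with tokens encoded from the real camera frame once it arrives. `encode_frame` and `generate_step` are hypothetical stand-ins, not the paper's API.

```python
import torch

def encode_frame(obs):
    # Stand-in encoder: a real system would use the model's video tokenizer.
    return obs.flatten().unsqueeze(0)

def generate_step(kv_cache):
    # Stand-in for one autoregressive step: predict next-frame tokens
    # and an action from the cached context.
    ctx = torch.cat(kv_cache, dim=0) if kv_cache else torch.zeros(1, 16)
    frame_tokens = ctx.mean(0, keepdim=True) + 0.1 * torch.randn(1, 16)
    action = torch.tanh(frame_tokens[:, :7])
    return frame_tokens, action

kv_cache = []
env_obs = torch.randn(4, 4)  # fake 4x4 "camera image"
for step in range(5):
    frame_tokens, action = generate_step(kv_cache)
    kv_cache.append(frame_tokens)    # optimistic: cache the generated frame
    env_obs = env_obs + 0.01 * step  # fake environment transition
    # Patch: swap the generated frame's cache entry for the real one,
    # so generation errors never compound across steps (no blur/drift).
    kv_cache[-1] = encode_frame(env_obs)
```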
