RoboPapers

Ep#57: Learning Dexterity from Human Videos with Gen2Act and SPIDER

Jan 6, 2026
Homanga Bharadhwaj, research scientist at Meta Reality Labs and incoming Johns Hopkins assistant professor, works on teaching robots from human video. He discusses Gen2Act, which generates a human video of a task from a language prompt and an initial scene image, then uses it to guide robot actions. He also covers SPIDER, which retargets human hand and object motion through simulation for dexterous, contact-rich tasks.
AI Snips
INSIGHT

Use Human Rather Than Robot Video

  • Pretrained video models excel at generating human videos but struggle to produce plausible robot-specific embodiments.
  • Generating human videos instead lets you use a pretrained model zero-shot, keeping its generalization across scenes and tasks.
ANECDOTE

Fine-Tuning Collapsed Generalization

  • The team tried fine-tuning a video model on robot data and saw it collapse to the limited robot distribution.
  • Generating human videos zero-shot preserved the broader generalization that fine-tuning lost.
ADVICE

Co-Train With Generated Video Pairs

  • Train robot policies by pairing each robot trajectory with a human video generated from the same initial frame and language prompt (see the sketch after this list).
  • Use auxiliary motion prediction losses to force the policy to attend to motion cues in the generated video.
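To make the recipe concrete, here is a minimal PyTorch sketch of one co-training update. It assumes precomputed features for the generated human video and precomputed motion-track targets; the class and function names, feature dimensions, and the MSE losses are illustrative assumptions, not Gen2Act's actual implementation.

```python
# Hypothetical sketch of the co-training recipe described above.
# Names and dimensions are illustrative, not Gen2Act's actual API.
import torch
import torch.nn as nn

class CoTrainedPolicy(nn.Module):
    """Policy conditioned on a generated human video plus the robot observation."""
    def __init__(self, video_dim=512, obs_dim=64, act_dim=7, track_dim=32):
        super().__init__()
        self.video_encoder = nn.Sequential(nn.Linear(video_dim, 256), nn.ReLU())
        self.obs_encoder = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU())
        self.action_head = nn.Linear(512, act_dim)
        # Auxiliary head: predict motion cues (e.g. point tracks) from the video
        # features, forcing the policy to attend to the generated video's motion.
        self.track_head = nn.Linear(256, track_dim)

    def forward(self, video_feat, obs):
        v = self.video_encoder(video_feat)
        o = self.obs_encoder(obs)
        action = self.action_head(torch.cat([v, o], dim=-1))
        tracks = self.track_head(v)
        return action, tracks

def co_training_step(policy, optimizer, batch, aux_weight=0.1):
    """One update on (generated human video, robot obs, action, motion target) tuples.

    Each robot trajectory is paired offline with a human video generated from
    the same initial frame and language prompt; here we assume video features
    and motion-track targets were extracted in advance.
    """
    pred_action, pred_tracks = policy(batch["video_feat"], batch["obs"])
    bc_loss = nn.functional.mse_loss(pred_action, batch["action"])        # behavior cloning
    aux_loss = nn.functional.mse_loss(pred_tracks, batch["track_target"]) # motion prediction
    loss = bc_loss + aux_weight * aux_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage with random tensors standing in for real features.
policy = CoTrainedPolicy()
opt = torch.optim.Adam(policy.parameters(), lr=1e-4)
batch = {
    "video_feat": torch.randn(8, 512),   # features of the generated human video
    "obs": torch.randn(8, 64),           # robot observation features
    "action": torch.randn(8, 7),         # demonstrated robot actions
    "track_target": torch.randn(8, 32),  # motion cues extracted from the video
}
print(co_training_step(policy, opt, batch))
```

The auxiliary weight trades off action imitation against motion prediction; the key design point is that the policy cannot minimize the auxiliary loss without reading the generated video, which is what ties the robot trajectory to its paired human video.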