RoboPapers

Chris Paxton and Michael Cho

Chris Paxton & Michael Cho geek out over robotic papers with paper authors. robopapers.substack.com

Episodes

Mentioned books

Apr 18, 2026 • 44min

Ep#73: VideoManip: Dexterous Manipulation Policies from RGB Human Videos via 3D Hand-Object Trajectory Reconstruction

Teaching robots to perform dexterous manipulation tasks currently requires teleoperation, which limits demonstration quality, speed, and scalability. Instead, why not use human videos? The problem is that a human hand isn’t a robot hand, so data must be retargeted using simulation to resolve issues like collisions and interpenetration when controlling the hand.In VideoManip, Hongyi Chen and co-authors built a system to solve this problem, taking in RGB videos of humans performing manipulation tasks and using them to create accurate simulations with which to learn robot policies.Watch episode #73 of RoboPapers, hosted by Michael Cho and Chris Paxton, now to learn more!AbstractMulti-finger robotic hand manipulation and grasping are challenging due to the high-dimensional action space and the difficulty of acquiring large-scale training data. Existing approaches largely rely on human teleoperation with wearable devices or specialized sensing equipment to capture hand-object interactions, which limits scalability. In this work, we propose VIDEOMANIP, a device-free framework that learns dexterous manipulation directly from RGB human videos. Leveraging recent advances in computer vision, VIDEOMANIP reconstructs explicit 3D robot-object trajectories from monocular videos by estimating human hand poses, object meshes, and retargets the reconstructed human motions to robotic hands for manipulation learning. To make the reconstructed robot data suitable for dexterous manipulation training, we introduce hand-object contact optimization with interaction-centric grasp modeling, as well as a demonstration synthesis strategy that generates diverse training trajectories from a single video, enabling generalizable policy learning without additional robot demonstrations. In simulation, the learned grasping model achieves a 70.25% success rate across 20 diverse objects using the Inspire Hand. In the real world, manipulation policies trained from RGB videos achieve an average 62.86% success rate across seven tasks using the LEAP Hand, outperforming retargeting-based methods by 15.87%. Project videos are available at this http URL.Learn MoreProject page: https://videomanip.github.io/ArXiV: https://arxiv.org/abs/2602.09013 This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit robopapers.substack.com

Apr 15, 2026 • 60min

Ep#72: SONIC: Supersizing Motion Tracking for Natural Humanoid Whole-Body Control

How can we build a general-purpose “foundation model” for robot motion? Zhengyi Luo joitns us to talk about SONIC, which uses motion tracking as a foundational task for humanoid robot control, and scales humanoid control training to 9k GPU hours and 100 million frames worth of data. The result: a model with a generally-useful embedding space that can be controlled by a VLA, or from human video, to perform a wide variety of humanoid whole-body-control tasks, including with zero-shot transfer to previously unseen motions.Watch episode 72 of RoboPapers, with Michael Cho and Jiafei Duan, now!AbstractDespite the rise of billion-parameter foundation models trained across thousands of GPUs, similar scaling gains have not been shown for humanoid control. Current neural controllers for humanoids remain modest in size, target a limited set of behaviors, and are trained on a handful of GPUs. We show that scaling model capacity, data, and compute yields a generalist humanoid controller capable of natural, robust whole-body movements. We position motion tracking as a scalable task for humanoid control, leveraging dense supervision from diverse motion-capture data to acquire human motion priors without manual reward engineering. We build a foundation model for motion tracking by scaling along three axes: network size (1.2M to 42M parameters), dataset volume (100M+ frames from 700 hours of motion capture), and compute (21k GPU hours). Beyond demonstrating the benefits of scale, we further show downstream utility through: (1) a real-time kinematic planner bridging motion tracking to tasks such as navigation, enabling natural and interactive control, and (2) a unified token space supporting VR teleoperation and vision-language-action (VLA) models with a single policy. Through this interface, we demonstrate autonomous VLA-driven whole-body loco-manipulation requiring coordinated hand and foot placement. Scaling motion tracking exhibits favorable properties: performance improves steadily with compute and data diversity, and learned policies generalize to unseen motions, establishing motion tracking at scale as a practical foundation for humanoid control.Learn MoreProject Page: https://nvlabs.github.io/GEAR-SONIC/ArXiV: https://arxiv.org/abs/2511.07820Paper PDF: https://nvlabs.github.io/GEAR-SONIC/static/pdf/sonic_paper.pdf This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit robopapers.substack.com

6 snips

Apr 8, 2026 • 1h 1min

Ep#71: Build Your Own Robot

Enes Erciyes, hardware and control contributor to YOR. Manan Anjaria, researcher who helped design the low-cost mobile manipulator. Mahi Shafiullah, robotics researcher focused on mobile manipulation and deployments. They discuss building an affordable, modular open-source mobile manipulator. Topics include gripper design, compliant whole-body control, cost-saving hacks, modularity, SLAM/navigation, and community-driven extensions.

Apr 1, 2026 • 1h 25min

Ep#70: A Systematic Study of Data Modalities and Strategies for Co-training Large Behavior Models for Robot Manipulation

Co-training has become a key part of the recipe for training large robotics models; it means that you mix some proportion of real robot data with other data sources, like simulation or egocentric human video data. This is especially important because robotics data tends to lack diversity which can be somewhat compensated for by the inclusion of these other modalities.And yet there has not been a sizable study on what constitute good practices for cotraining until now! We talk to Fanqi Lin and Jose Barreiros about their new work, a massive study which evaluated 89 policies over thousands of rollouts to tell us which forms of co-training were most useful for robotics.Watch episode 70 of RoboPapers, with Michael Cho and Chris Paxton, now!AbstractLarge behavior models have shown strong dexterous manipulation capabilities by extending imitation learning to large-scale training on multi-task robot data, yet their generalization remains limited by the insufficient robot data coverage. To expand this coverage without costly additional data collection, recent work relies on co-training: jointly learning from target robot data and heterogeneous data modalities. However, how different co-training data modalities and strategies affect policy performance remains poorly understood. We present a large-scale empirical study examining five co-training data modalities: standard vision-language data, dense language annotations for robot trajectories, cross-embodiment robot data, human videos, and discrete robot action tokens across single- and multi-phase training strategies. Our study leverages 4,000 hours of robot and human manipulation data and 50M vision-language samples to train vision-language-action policies. We evaluate 89 policies over 58,000 simulation rollouts and 2,835 real-world rollouts. Our results show that co-training with forms of vision-language and cross-embodiment robot data substantially improves generalization to distribution shifts, unseen tasks, and language following, while discrete action token variants yield no significant benefits. Combining effective modalities produces cumulative gains and enables rapid adaptation to unseen long-horizon dexterous tasks via fine-tuning. Training exclusively on robot data degrades the visiolinguistic understanding of the vision-language model backbone, while co-training with effective modalities restores these capabilities. Explicitly conditioning action generation on chain-of-thought traces learned from co-training data does not improve performance in our simulation benchmark. Together, these results provide practical guidance for building scalable generalist robot policies.Learn MoreProject page: https://co-training-lbm.github.ioArXiV: https://arxiv.org/abs/2602.01067 This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit robopapers.substack.com

Mar 25, 2026 • 1h 11min

Ep#65: VLM4VLA: Revisiting Vision-Language Models in Vision-Language-Action Models

Pretraining is essential for good performance on a wide variety of robotics tasks, and so most vision-language-action models build off of a vision language model (VLM) trained on a wide variety of image-language data. But how does the choice of VLM translate to downstream robotics performance?Jianke Zhang and Yanjiang Guo join us to talk about this key part of the robot policy, looking at a wide variety of different VLMs and how they perform. Interestingly, they see that performance on auxiliary tasks like quesiton answering did not lead to downstream improvements in control.To learn more, watch episode 65 of RoboPapers now, with Chris Paxton and Jiafei Duan.Abstract:Vision-Language-Action (VLA) models, which integrate pretrained large Vision-Language Models (VLM) into their policy backbone, are gaining significant attention for their promising generalization capabilities. This paper revisits a fundamental yet seldom systematically studied question: how VLM choice and competence translate to downstream VLA policies performance? We introduce VLM4VLA, a minimal adaptation pipeline that converts general-purpose VLMs into VLA policies using only a small set of new learnable parameters for fair and efficient comparison. Despite its simplicity, VLM4VLA proves surprisingly competitive with more sophisticated network designs. Through extensive empirical studies on various downstream tasks across three benchmarks, we find that while VLM initialization offers a consistent benefit over training from scratch, a VLM's general capabilities are poor predictors of its downstream task performance. This challenges common assumptions, indicating that standard VLM competence is necessary but insufficient for effective embodied control. We further investigate the impact of specific embodied capabilities by fine-tuning VLMs on seven auxiliary embodied tasks (e.g., embodied QA, visual pointing, depth estimation). Contrary to intuition, improving a VLM's performance on specific embodied skills does not guarantee better downstream control performance. Finally, modality-level ablations identify the visual module in VLM, rather than the language component, as the primary performance bottleneck. We demonstrate that injecting control-relevant supervision into the vision encoder of the VLM yields consistent gains, even when the encoder remains frozen during downstream fine-tuning. This isolates a persistent domain gap between current VLM pretraining objectives and the requirements of embodied action-planning.Learn more:Project page: https://cladernyjorn.github.io/VLM4VLA.github.io/ArXiV: https://arxiv.org/abs/2601.03309Code: https://github.com/CladernyJorn/VLM4VLA This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit robopapers.substack.com

Feb 26, 2026 • 56min

Ep#64: Project Instinct

Human motion is instinctual. We know how to interact with the world around us, almost without thinking about it at all. Ziwen and Shaoting joined us on RoboPapers to talk about their ambitious Project Instinct: which provides the tools, algorithms, and environments necessary to build humanoid whole-body control which can handle contact with the environment.Watch Episode #64 of RoboPapers with Michael Cho and Jiafei Duan now!Abstract:We present a unified framework from algorithm, environment, dataset curation, and deployment for Instinct-Level intelligence on humanoid robots.Project Site: https://project-instinct.github.io/Github for InstinctLab: https://github.com/project-instinct/instinctlabEmbrace CollisionsPerform contact-rich humanoid robot tasks like getting up from the ground.Abstract:Previous humanoid robot research works treat the robot as a bipedal mobile manipulation platform, where only the feet and hands contact the environment. However, we humans use all body parts to interact with the world, e.g., we sit in chairs, get up from the ground, or roll on the floor. Contacting the environment using body parts other than feet and hands brings significant challenges in both model-predictive control and reinforcement learning-based methods. An unpredictable contact sequence makes it almost impossible for model-predictive control to plan ahead in real-time. The success of the zero-shot sim-to-real reinforcement learning method for humanoids heavily depends on the acceleration of GPU-based rigid-body physical simulator and simplification of the collision detection. Lacking extreme torso movement of the humanoid research makes all other components non-trivial to design, such as termination conditions, motion commands and reward designs. To address these potential challenges, we propose a general humanoid motion framework that takes discrete motion commands and controls the robot's motor action in real-time. Using a GPU-accelerated rigid-body simulator, we train a humanoid whole-body control policy that follows the high-level motion command in the real world in real-time, even with stochastic contacts and extremely large robot base rotation and not-so-feasible motion command.Project Site: https://project-instinct.github.io/embrace-collisions/ArXiV: https://arxiv.org/abs/2502.01465Deep Whole-Body ParkourCurrent approaches to humanoid control generally fall into two paradigms: perceptive locomotion, which handles terrain well but is limited to pedal gaits, and general motion tracking, which reproduces complex skills but ignores environmental capabilities. This work unites these paradigms to achieve perceptive general motion control. We present a framework where exteroceptive sensing is integrated into whole-body motion tracking, permitting a humanoid to perform highly dynamic, non-locomotion tasks on uneven terrain. By training a single policy to perform multiple distinct motions across varied terrestrial features, we demonstrate the non-trivial benefit of integrating perception into the control loop. Our results show that this framework enables robust, highly dynamic multi-contact motions, such as vaulting and dive-rolling, on unstructured terrain, significantly expanding the robot's traversability beyond simple walking or running. this https URLProject Site: https://project-instinct.github.io/deep-whole-body-parkour/ArXiV: https://arxiv.org/abs/2601.07701Hiking in the WildAchieving robust humanoid hiking in complex, unstructured environments requires transitioning from reactive proprioception to proactive perception. However, integrating exteroception remains a significant challenge: mapping-based methods suffer from state estimation drift. For instance, LiDAR-based methods do not handle torso jitter well. Existing end-to-end approaches often struggle with scalability and training complexity. Specifically, some previous works using virtual obstacles are implemented case-by-case. In this work, we present Hiking in the Wild, a scalable, end-to-end perceptive parkour framework designed for robust humanoid hiking. To ensure safety and training stability, we introduce two key mechanisms: a foothold safety mechanism combining scalable Terrain Edge Detection with Foot Volume Points to prevent catastrophic slippage on edges, and a Flat Patch Sampling strategy that eliminates reward hacking by generating feasible navigation targets. Our approach utilizes a single-stage reinforcement learning scheme, mapping raw depth inputs and proprioception directly to joint actions, without relying on external state estimation. Extensive field experiments on a full-size humanoid demonstrate that our policy enables robust traversal of complex terrains at speeds up to 2.5 m/s. The training and deployment code is open-sourced to facilitate reproducible research and deployment on real robots with minimal hardware modifications.Project Site: https://project-instinct.github.io/hiking-in-the-wild/ArXiV: https://arxiv.org/abs/2601.07718 This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit robopapers.substack.com

The AI-powered Podcast Player

Save insights by tapping your headphones, chat with episodes, discover the best highlights - and more!

Get the app