

RoboPapers
Chris Paxton and Michael Cho
Chris Paxton & Michael Cho geek out over robotic papers with paper authors. robopapers.substack.com
Episodes

Sep 26, 2025 • 1h 22min
Ep#33: A Smooth Sea Never Made a Skilled SAILOR: Robust Imitation via Learning to Search
Learning policies via imitation is extremely potent, but making sure those policies generalize to out-of-distribution settings is still very hard. SAILOR proposes a solution: learning to search via a learned world model, which outperforms existing imitation approaches. Gokul, Vibhakar, and Arnav tell us about their approach.

Watch Episode #33 of RoboPapers, co-hosted by Michael Cho and Chris Paxton, now!

Abstract:
The fundamental limitation of the behavioral cloning (BC) approach to imitation learning is that it only teaches an agent what the expert did at states the expert visited. This means that when a BC agent makes a mistake that takes it out of the support of the demonstrations, it often doesn’t know how to recover. In this sense, BC is akin to giving the agent a fish (dense supervision across a narrow set of states) rather than teaching it to fish: to reason independently about achieving the expert’s outcome even when faced with unseen situations at test time. In response, we explore learning to search (L2S) from expert demonstrations, i.e. learning the components required to plan, at test time, to match expert outcomes, even after making a mistake. These include (1) a world model and (2) a reward model. We carefully ablate the set of algorithmic and design decisions required to combine these and other components for stable and sample/interaction-efficient learning of recovery behavior without additional human corrections. Across a dozen visual manipulation tasks from three benchmarks, our approach SAILOR consistently outperforms state-of-the-art Diffusion Policies trained via BC on the same data. Furthermore, scaling up the amount of demonstrations used for BC by 5-10x still leaves a performance gap. We find that SAILOR can identify nuanced failures and is robust to reward hacking. Our code is available at this https URL.

Project Page | GitHub | arXiv

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit robopapers.substack.com
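The learning-to-search recipe in the abstract (plan over a learned world model, scoring imagined rollouts with a learned reward model) can be sketched with a toy planner. Everything below is an illustration, not the authors' SAILOR code: the "world model" and "reward model" are hand-written 1-D stand-ins for learned networks, and the goal, horizon, and action set are made up.

```python
import itertools

# Toy learning-to-search sketch: given a world model f(s, a) -> s' and a
# reward model r(s), plan at test time by searching over short action
# sequences and executing the first action of the best one.

GOAL = 10.0  # stand-in for "the expert's outcome"

def world_model(state, action):
    # Pretend-learned dynamics: the state moves by the chosen action.
    return state + action

def reward_model(state):
    # Pretend-learned reward: closer to the expert's outcome is better.
    return -abs(state - GOAL)

def plan(state, horizon=5, actions=(-1.0, 0.0, 1.0)):
    """Exhaustive search over imagined rollouts; returns the first action
    of the highest-return sequence (the action set is tiny, so we can
    afford to enumerate all of them)."""
    best_return, best_first = float("-inf"), 0.0
    for seq in itertools.product(actions, repeat=horizon):
        s, total = state, 0.0
        for a in seq:
            s = world_model(s, a)
            total += reward_model(s)
        if total > best_return:
            best_return, best_first = total, seq[0]
    return best_first

# Even starting from an off-distribution state (as if after a "mistake"),
# the planner steers back toward the expert's outcome.
state = -3.0
for _ in range(20):
    state = world_model(state, plan(state))
print(round(state, 1))  # -> 10.0
```

The property the sketch shows is the one the abstract argues for: from a state the demonstrations never covered, searching over imagined rollouts still recovers behavior that matches the expert's outcome, with no extra human corrections.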

Sep 21, 2025 • 1h 3min
Ep#32: GMT: General Motion Tracking for Humanoid Whole-Body Control
We’ve all seen very impressive videos of humanoid robots performing single tasks, like dancing or karate. But training humanoid robots to perform a wide range of complex motions is difficult. GMT is a general-purpose policy that can learn a wide range of robot motions.

Watch Episode #32 of RoboPapers, with Zixuan Chen, co-hosted by Michael Cho and Chris Paxton, to learn more.

Abstract:
The ability to track general whole-body motions in the real world is a useful way to build general-purpose humanoid robots. However, achieving this can be challenging due to the temporal and kinematic diversity of the motions, the policy’s capability, and the difficulty of coordinating the upper and lower bodies. To address these issues, we propose GMT, a general and scalable motion-tracking framework that trains a single unified policy to enable humanoid robots to track diverse motions in the real world. GMT is built upon two core components: an Adaptive Sampling strategy and a Motion Mixture-of-Experts (MoE) architecture. The Adaptive Sampling strategy automatically balances easy and difficult motions during training, while the MoE ensures better specialization across different regions of the motion manifold. We show the effectiveness of GMT through extensive experiments in both simulation and the real world, achieving state-of-the-art performance across a broad spectrum of motions with a single unified policy. Videos and additional information can be found at this https URL.

Project Page | arXiv
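The Adaptive Sampling idea (spend more training time on motions the policy tracks poorly) can be sketched as a difficulty-weighted sampler. This is an illustrative sketch, not GMT's implementation: the class name, the per-motion error numbers, and the moving-average update rule are all assumptions.

```python
import random

# Toy adaptive sampling curriculum: motions with higher running tracking
# error are drawn more often during training.
random.seed(0)  # reproducible illustration

class AdaptiveSampler:
    def __init__(self, motion_ids, alpha=0.9):
        self.alpha = alpha                          # smoothing for error estimates
        self.errors = {m: 1.0 for m in motion_ids}  # optimistic init: all "hard"

    def sample(self):
        # Sampling probability proportional to the running tracking error.
        motions = list(self.errors)
        weights = [self.errors[m] for m in motions]
        return random.choices(motions, weights=weights, k=1)[0]

    def update(self, motion_id, tracking_error):
        # Exponential moving average of the observed tracking error.
        old = self.errors[motion_id]
        self.errors[motion_id] = self.alpha * old + (1 - self.alpha) * tracking_error

sampler = AdaptiveSampler(["walk", "dance", "karate"])
for _ in range(100):
    m = sampler.sample()
    # Pretend "dance" is easy (low error) and "karate" is hard (high error).
    err = {"walk": 0.3, "dance": 0.1, "karate": 0.9}[m]
    sampler.update(m, err)
print(sampler.errors["karate"] > sampler.errors["dance"])  # -> True
```

After a few updates the hard motion ("karate") keeps a high sampling weight while the easy one ("dance") fades, which is the balancing behavior the abstract describes.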

Sep 17, 2025 • 56min
Ep#31: Vision in Action: Learning Active Perception from Human Demonstrations
Most robots are fixed in one location, with cameras placed in exactly the right spot for whatever their task is going to be. This makes setting up the camera in the correct location a key part of task setup; it also makes the task unnecessarily difficult. Ideally, robots would move their cameras around intelligently to gather all the information they need to perform a task.

In “Vision in Action,” the authors look at how to use a flexible 6-DoF “neck” to move around and gather the information necessary to perform the task, based on what a human operator is actually looking at.

Learn more by watching Episode #31 of RoboPapers with Haoyu Xiong, co-hosted by Michael Cho and Chris Paxton.

Abstract:
We present Vision in Action (ViA), an active perception system for bimanual robot manipulation. ViA learns task-relevant active perceptual strategies (e.g., searching, tracking, and focusing) directly from human demonstrations. On the hardware side, ViA employs a simple yet effective 6-DoF robotic neck to enable flexible, human-like head movements. To capture human active perception strategies, we design a VR-based teleoperation interface that creates a shared observation space between the robot and the human operator. To mitigate VR motion sickness caused by latency in the robot’s physical movements, the interface uses an intermediate 3D scene representation, enabling real-time view rendering on the operator side while asynchronously updating the scene with the robot’s latest observations. Together, these design elements enable the learning of robust visuomotor policies for three complex, multi-stage bimanual manipulation tasks involving visual occlusions, significantly outperforming baseline systems.

Project Page | arXiv

Sep 12, 2025 • 1h 10min
Ep#30: R2S2: Unleashing Humanoid Reaching Potential via Real-world-Ready Skill Space
How can we train humanoid mobile manipulators to perform complex manipulation tasks in the real world? Humanoids must coordinate their arms and legs to perform complex tasks, but learning such skills via reinforcement learning is challenging. In R2S2, Li Yi tells us how we can learn a robot action space, based on reinforcement learning, that transfers from simulation to the real world and supports manipulation tasks like picking up a box.

Watch Episode #30 of RoboPapers to find out more!

Abstract:
Humans possess a large reachable space in the 3D world, enabling interaction with objects at varying heights and distances. However, realizing such large-space reaching on humanoids is a complex whole-body control problem, and it requires the robot to master diverse skills simultaneously, including base positioning and reorientation, height and body posture adjustments, and end-effector pose control. Learning from scratch often leads to optimization difficulty and poor sim2real transferability. To address this challenge, we propose Real-world-Ready Skill Space (R2S2). Our approach begins with a carefully designed skill library consisting of real-world-ready primitive skills. We ensure optimal performance and robust sim2real transfer through individual skill tuning and sim2real evaluation. These skills are then ensembled into a unified latent space, serving as a structured prior that enables efficient, sim2real-transferable task execution. A high-level planner, trained to sample skills from this space, enables the robot to accomplish real-world goal-reaching tasks. We demonstrate zero-shot sim2real transfer and validate R2S2 in multiple challenging goal-reaching scenarios.

Project Page

Sep 12, 2025 • 1h 23min
Ep#29: Exploring Low Cost Robots using LLM Agents
One of the most exciting things about modern robot learning is that it’s so accessible. Ilia Larchenko talked to us about his journey into robotics: using low-cost robots like those from Hugging Face, building agents powered by Claude, and putting these together into robots that can perform real, open-vocabulary tasks in the real world.

Unlike most of our episodes, this one isn’t really about a paper, although Ilia has certainly been doing his research, developing new techniques and ways to get these robots to perform more and more complex tasks.

You can find Ilia’s work on X.

Sep 9, 2025 • 58min
Ep#28: DreamGen: Unlocking Generalization in Robot Learning through Video World Models
Robotics has a data problem. Generative video models offer a solution: we can fine-tune these models with robot data to generate vast amounts of diverse task data, which in turn unlocks new behaviors and new environments.

Abstract:
We introduce DreamGen, a simple yet highly effective 4-stage pipeline for training robot policies that generalize across behaviors and environments through neural trajectories: synthetic robot data generated from video world models. DreamGen leverages state-of-the-art image-to-video generative models, adapting them to the target robot embodiment to produce photorealistic synthetic videos of familiar or novel tasks in diverse environments. Since these models generate only videos, we recover pseudo-action sequences using either a latent action model or an inverse-dynamics model (IDM). Despite its simplicity, DreamGen unlocks strong behavior and environment generalization: a humanoid robot can perform 22 new behaviors in both seen and unseen environments, while requiring teleoperation data from only a single pick-and-place task in one environment. To evaluate the pipeline systematically, we introduce DreamGen Bench, a video generation benchmark that shows a strong correlation between benchmark performance and downstream policy success. Our work establishes a promising new axis for scaling robot learning well beyond manual data collection. Code available at this https URL.

Project Page | arXiv
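The pseudo-action recovery step (an inverse-dynamics model labels action-free video with actions) can be sketched in one dimension. This is purely illustrative, not the paper's IDM: here "frames" are scalar positions, the model is a single linear weight, and the data is invented.

```python
# Toy inverse-dynamics model (IDM): fit a model mapping consecutive frames
# to the action between them on real robot data, then use it to label
# synthetic video (which has frames but no actions) with pseudo-actions.

def make_real_data():
    # Real teleop data: 1-D positions and the actions that produced them.
    frames, actions = [0.0], []
    for a in [1.0, 1.0, -1.0, 2.0]:
        frames.append(frames[-1] + a)
        actions.append(a)
    return frames, actions

def fit_idm(frames, actions, lr=0.1, epochs=200):
    # Linear IDM: action ≈ w * (next_frame - frame); the true w here is 1.0.
    w = 0.0
    for _ in range(epochs):
        for (f0, f1), a in zip(zip(frames, frames[1:]), actions):
            pred = w * (f1 - f0)
            w -= lr * (pred - a) * (f1 - f0)  # gradient step on squared error
    return w

frames, actions = make_real_data()
w = fit_idm(frames, actions)

# Label a synthetic "video" (frames only, no actions) with pseudo-actions.
synthetic = [0.0, 0.5, 1.5, 1.0]
pseudo = [round(w * (f1 - f0), 2) for f0, f1 in zip(synthetic, synthetic[1:])]
print(pseudo)  # -> [0.5, 1.0, -0.5]
```

The point mirrors the abstract: the video model supplies the frames, and a model trained on a small amount of real action-labeled data supplies the actions, turning raw video into policy training data.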

Sep 9, 2025 • 52sec
Ep#27: DextrAH-RGB: Visuomotor Policies to Grasp Anything with Dexterous Hands
How can we learn sim-to-real manipulation policies for grasping any object? Watch this episode with Ritvik Singh of NVIDIA to find out.

Abstract:
One of the most important yet challenging skills for robots is dexterous multi-fingered grasping of a diverse range of objects. Much of the prior work is limited by speed, by dexterity, or by a reliance on depth maps. In this paper, we introduce DextrAH-RGB, a system that can perform dexterous arm-hand grasping end2end from stereo RGB input. We train a teacher policy in simulation through reinforcement learning that acts on a geometric fabric action space to ensure reactivity and safety. We then distill this teacher into an RGB-based student in simulation. To our knowledge, this is the first work to demonstrate robust sim2real transfer of an end2end RGB-based policy for a complex, dynamic, contact-rich task such as dexterous grasping. Our policies are also able to generalize to grasping novel objects with geometry, textures, or lighting conditions unseen during training.

Project Page
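The teacher-student distillation step the abstract mentions can be sketched as simple regression: a teacher acting on privileged simulator state supervises a student that only sees its own observation. This toy stands in for the real pipeline (the actual student consumes stereo RGB and both policies are neural networks); every name and number below is illustrative.

```python
# Toy distillation: regress a student's action toward the teacher's action
# at the same underlying state.

def teacher(state):
    # Privileged expert policy (made-up: action is proportional to state).
    return 2.0 * state

def observe(state):
    # Identity stand-in for the student's perception pipeline; in the real
    # system this would be rendered stereo RGB, not the state itself.
    return state

w = 0.0    # the student's single parameter
lr = 0.01  # small step size keeps the regression stable
for step in range(100):
    s = (step % 10) - 5                      # sweep over simulator states
    target = teacher(s)                      # teacher supervises...
    pred = w * observe(s)                    # ...the observation-based student
    w -= lr * (pred - target) * observe(s)   # gradient of squared error
print(round(w, 3))  # -> 2.0
```

The student ends up reproducing the teacher's behavior from observations alone, which is what lets the final policy run end2end from RGB without the privileged state.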

Sep 6, 2025 • 1h 9min
Ep#26: ManiSkill3: GPU Parallelized Robotics Simulation and Rendering for Generalizable Embodied AI
Robotics as a field faces a large data gap, and one of the most promising directions for closing it is sim-to-real robot learning. We talked with Stone Tao about ManiSkill3: a powerful, easy-to-use framework for generalizable sim-to-real learning.

In particular, we learned how, with ManiSkill + Hugging Face LeRobot, you can easily train your own sim-to-real manipulation policies on a low-cost robot!

Abstract:
Simulation has enabled unprecedented compute-scalable approaches to robot learning. However, many existing simulation frameworks typically support a narrow range of scenes/tasks and lack features critical for scaling generalizable robotics and sim2real. We introduce and open source ManiSkill3, the fastest state-visual GPU-parallelized robotics simulator with contact-rich physics targeting generalizable manipulation. ManiSkill3 supports GPU parallelization of many aspects, including simulation+rendering, heterogeneous simulation, point-cloud/voxel visual input, and more. Simulation with rendering on ManiSkill3 can run 10-1000x faster with 2-3x less GPU memory usage than other platforms, achieving up to 30,000+ FPS in benchmarked environments thanks to minimal Python/PyTorch overhead in the system, simulation on the GPU, and the use of the SAPIEN parallel rendering system. Tasks that used to take hours to train can now take minutes. We further provide the most comprehensive range of GPU-parallelized environments/tasks, spanning 12 distinct domains including but not limited to mobile manipulation for tasks such as drawing, humanoids, and dexterous manipulation in realistic scenes designed by artists or real-world digital twins. In addition, millions of demonstration frames are provided from motion planning, RL, and teleoperation. ManiSkill3 also provides a comprehensive set of baselines spanning popular RL and learning-from-demonstrations algorithms.

Project Page | arXiv | Original post on X
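The throughput numbers come from batching: one step call advances thousands of environments at once, so fixed per-step overhead is amortized across the whole batch. The toy class below only illustrates that batched-stepping pattern and how FPS is counted as env-steps per second; it is not ManiSkill3's API (the real simulator is gym-style and GPU-backed).

```python
import time

# Schematic batched simulation: a single step() updates every environment.

class BatchedToyEnv:
    def __init__(self, num_envs):
        self.num_envs = num_envs
        self.states = [0.0] * num_envs  # one scalar state per environment

    def step(self, actions):
        # One vectorized call advances all environments together.
        self.states = [s + a for s, a in zip(self.states, actions)]
        return self.states

env = BatchedToyEnv(num_envs=1024)
steps = 100
t0 = time.perf_counter()
for _ in range(steps):
    env.step([0.01] * env.num_envs)
elapsed = time.perf_counter() - t0

# Benchmarked "FPS" counts env-steps across the batch, not loop iterations.
fps = steps * env.num_envs / elapsed
print(f"{steps * env.num_envs} env-steps")  # -> 102400 env-steps
```

A hundred loop iterations here produce 102,400 environment steps, which is why batched simulators report FPS figures far above what a single serial environment could reach.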

Sep 5, 2025 • 59min
Ep#25: DexUMI: Using Human Hand as the Universal Manipulation Interface for Dexterous Manipulation
Collecting dexterous humanoid robot data is difficult to scale. That's why Mengda Xu and Han Zhang built DexUMI: a tool for demonstrating how to control a dexterous robot hand, which allows you to quickly collect task data.

Abstract:
We present DexUMI, a data collection and policy learning framework that uses the human hand as the natural interface to transfer dexterous manipulation skills to various robot hands. DexUMI includes hardware and software adaptations to minimize the embodiment gap between the human hand and various robot hands. The hardware adaptation bridges the kinematics gap using a wearable hand exoskeleton. It allows direct haptic feedback during manipulation data collection and adapts human motion to feasible robot hand motion. The software adaptation bridges the visual gap by replacing the human hand in video data with high-fidelity robot hand inpainting. We demonstrate DexUMI's capabilities through comprehensive real-world experiments on two different dexterous robot hand hardware platforms, achieving an average task success rate of 86%.

Project Page | arXiv | Original post on X

Sep 4, 2025 • 1h 7min
Ep#24: CLONE: Closed-Loop Whole-Body Humanoid Teleoperation for Long-Horizon Tasks
How can we control humanoid robots so that they perform a variety of challenging, real-world tasks while moving around in their environments? Yixuan Li and Siyuan Huang of BIGAI tell us how their complex system works.

Abstract:
Humanoid teleoperation plays a vital role in demonstrating and collecting data for complex humanoid-scene interactions. However, current teleoperation systems face critical limitations: they decouple upper- and lower-body control to maintain stability, restricting natural coordination, and they operate open-loop without real-time position feedback, leading to accumulated drift. The fundamental challenge is achieving precise, coordinated whole-body teleoperation over extended durations while maintaining accurate global positioning. Here we show that an MoE-based teleoperation system, CLONE, with closed-loop error correction enables unprecedented whole-body teleoperation fidelity, maintaining minimal positional drift over long-range trajectories using only head and hand tracking from an MR headset. Unlike previous methods that either sacrifice coordination for stability or suffer from unbounded drift, CLONE learns diverse motion skills while preventing tracking-error accumulation through real-time feedback, enabling complex coordinated movements such as “picking up objects from the ground.” These results establish a new milestone in whole-body humanoid teleoperation for long-horizon humanoid-scene interaction tasks.

Project Site | arXiv
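The open-loop vs. closed-loop distinction in the abstract can be shown with a toy drift model. This is a schematic illustration, not CLONE's controller: the drift constant, the feedback gain, and the 1-D "walking" dynamics are all made up.

```python
# Toy model of teleoperation drift: each commanded step lands slightly off.
# Open-loop execution lets that error accumulate; closed-loop feedback on
# the tracked global position keeps it bounded.

DRIFT = 0.05  # constant per-step execution error (made-up)

def run(n_steps, closed_loop, gain=0.5):
    target = position = 0.0
    for _ in range(n_steps):
        error = target - position          # feedback from global tracking
        target += 1.0                      # operator keeps moving forward
        command = 1.0                      # nominal open-loop step
        if closed_loop:
            command += gain * error        # closed-loop correction term
        position += command + DRIFT        # actual motion includes drift
    return abs(target - position)

open_err = run(100, closed_loop=False)
closed_err = run(100, closed_loop=True)
print(open_err > 1.0 and closed_err < 0.2)  # -> True
```

Over 100 steps the open-loop error grows linearly (here to about 5 units) while the closed-loop error settles near a small constant, which is the "minimal positional drift over long-range trajectories" behavior the paper targets.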


