

RoboPapers
Chris Paxton and Michael Cho
Chris Paxton & Michael Cho geek out over robotics papers with paper authors. robopapers.substack.com
Episodes

Sep 4, 2025 • 1h 7min
Ep#24: CLONE: Closed-Loop Whole-Body Humanoid Teleoperation for Long-Horizon Tasks
How can we control humanoid robots to get them to perform a variety of challenging, real-world tasks while moving around in their environments? Yixuan Li and Siyuan Huang of BIGAI tell us how their complex system works.

Abstract: Humanoid teleoperation plays a vital role in demonstrating and collecting data for complex humanoid-scene interactions. However, current teleoperation systems face critical limitations: they decouple upper- and lower-body control to maintain stability, restricting natural coordination, and operate open-loop without real-time position feedback, leading to accumulated drift. The fundamental challenge is achieving precise, coordinated whole-body teleoperation over extended durations while maintaining accurate global positioning. Here we show that an MoE-based teleoperation system, CLONE, with closed-loop error correction enables unprecedented whole-body teleoperation fidelity, maintaining minimal positional drift over long-range trajectories using only head and hand tracking from an MR headset. Unlike previous methods that either sacrifice coordination for stability or suffer from unbounded drift, CLONE learns diverse motion skills while preventing tracking error accumulation through real-time feedback, enabling complex coordinated movements such as “picking up objects from the ground.” These results establish a new milestone for whole-body humanoid teleoperation for long-horizon humanoid-scene interaction tasks.

Project Site | arXiv
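As a way to picture the closed-loop correction described above, here is a minimal Python sketch in which the policy's observation includes the real-time global position error between the operator's target and the robot's estimated root position, so tracking drift is fed back and corrected online rather than accumulating. Everything here (the linear "policy", the point-mass "robot", the observation layout) is a hypothetical stand-in, not CLONE's actual MoE architecture.

```python
import numpy as np

# Hypothetical stand-ins: a random linear "policy" and a point-mass "robot",
# used only to make the closed-loop structure concrete.
rng = np.random.default_rng(0)
policy_weights = rng.normal(scale=0.01, size=(12, 10))  # obs dim 12 -> 10 joint targets

def policy(obs):
    return policy_weights.T @ obs

robot_root = np.zeros(3)  # estimated global root position (e.g. from onboard odometry)

def teleop_step(head_pose, hand_pos, drift_gain=1.0):
    """One closed-loop step: the policy sees the head/hand targets *and* the
    current global position error, so drift is corrected online instead of
    accumulating as it would in an open-loop system."""
    global robot_root
    target_root = head_pose[:3]             # operator's head implies the desired root position
    drift_error = target_root - robot_root  # real-time feedback term
    obs = np.concatenate([head_pose, hand_pos, drift_gain * drift_error])
    joint_targets = policy(obs)
    # Toy "dynamics": the root moves partway toward the target, plus noise.
    robot_root = robot_root + 0.5 * drift_error + rng.normal(scale=1e-3, size=3)
    return joint_targets

for t in range(100):
    head = np.array([0.1 * t, 0.0, 1.6, 0.0, 0.0, 0.0])  # walk forward, head at 1.6 m
    hands = np.array([0.3, 0.2, 1.2])
    teleop_step(head, hands)

print("residual drift (m):", np.linalg.norm(np.array([9.9, 0.0, 1.6]) - robot_root))
```

The key point is that any residual error re-enters the observation on the next step, which is what keeps long-horizon trajectories from drifting.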

Sep 3, 2025 • 42min
Ep#23: FALCON 🦅- Learning Force-Adaptive Humanoid Loco-Manipulation
One of the key advantages of humanoid robots is how they can handle heavier objects and exert forces in a more dynamic way, compared to wheeled robots. However, few research papers show this. We talked to Yuanhang Zhang about his really impressive work, FALCON, which shows humanoid robots pushing or pulling carts, lifting weights, and more.

Abstract: Humanoid loco-manipulation holds transformative potential for daily service and industrial tasks, yet achieving precise, robust whole-body control with 3D end-effector force interaction remains a major challenge. Prior approaches are often limited to lightweight tasks or quadrupedal/wheeled platforms. To overcome these limitations, we propose FALCON, a dual-agent reinforcement-learning-based framework for robust force-adaptive humanoid loco-manipulation. FALCON decomposes whole-body control into two specialized agents: (1) a lower-body agent ensuring stable locomotion under external force disturbances, and (2) an upper-body agent precisely tracking end-effector positions with implicit adaptive force compensation. These two agents are jointly trained in simulation with a force curriculum that progressively escalates the magnitude of external force exerted on the end effector while respecting torque limits. Experiments demonstrate that, compared to the baselines, FALCON achieves 2x more accurate upper-body joint tracking, while maintaining robust locomotion under force disturbances and achieving faster training convergence. Moreover, FALCON enables policy training without embodiment-specific reward or curriculum tuning. Using the same training setup, we obtain policies that are deployed across multiple humanoids, enabling forceful loco-manipulation tasks such as transporting payloads (0-20 N force), cart-pulling (0-100 N), and door-opening (0-40 N) in the real world.

Project Site | arXiv
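The force curriculum mentioned in the abstract is easy to sketch. Below is a hedged, illustrative version (not FALCON's actual schedule): the magnitude of the external force applied to the simulated end effector grows with training progress, so the two agents first learn the task unloaded and are only gradually exposed to large disturbances.

```python
import numpy as np

def force_curriculum(step, total_steps, f_max=100.0, seed=0):
    """Illustrative force curriculum: sample an external end-effector force whose
    magnitude scales with training progress (0 N at the start, up to f_max N)."""
    rng = np.random.default_rng(seed + step)
    progress = min(step / total_steps, 1.0)
    direction = rng.normal(size=3)
    direction /= np.linalg.norm(direction)
    return progress * f_max * direction

# During joint training, this force would be applied to the simulated end effector:
# the lower-body agent must stay stable under it while the upper-body agent keeps
# tracking the commanded end-effector position.
for step in (0, 50_000, 100_000):
    f = force_curriculum(step, total_steps=100_000)
    print(f"step {step}: external force magnitude {np.linalg.norm(f):.1f} N")
```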

Aug 27, 2025 • 1h 3min
Ep#22: DexWild: Dexterous Human Interactions for In-the-Wild Robot Policies
How can we collect large-scale manipulation data in the real world? DexWild proposes a solution: an easy-to-use wearable that lets operators perform robotic tasks in a wide variety of environments. With the added data diversity, the resulting robot policies can operate across many different environments. Watch the video to learn more!

Abstract: Large-scale, diverse robot datasets have emerged as a promising path toward enabling dexterous manipulation policies to generalize to novel environments, but acquiring such datasets presents many challenges. While teleoperation provides high-fidelity datasets, its high cost limits its scalability. Instead, what if people could use their own hands, just as they do in everyday life, to collect data? In DexWild, a diverse team of data collectors uses their hands to collect hours of interactions across a multitude of environments and objects. To record this data, we create DexWild-System, a low-cost, mobile, and easy-to-use device. The DexWild learning framework co-trains on both human and robot demonstrations, leading to improved performance compared to training on each dataset individually. This combination results in robust robot policies capable of generalizing to novel environments, tasks, and embodiments with minimal additional robot-specific data. Experimental results demonstrate that DexWild significantly improves performance, achieving a 68.5% success rate in unseen environments, nearly four times higher than policies trained with robot data only, and offering 5.8x better cross-embodiment generalization. Video results, codebases, and instructions at this https URL

Project Page | Original Post on YouTube | Paper on arXiv
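To illustrate what co-training on human and robot demonstrations can look like in practice, here is a small, hypothetical sketch of a mixed-batch sampler: each batch draws mostly from the large, diverse human dataset and a smaller share from the robot dataset. The 90/10 ratio and the names are assumptions for illustration, not DexWild's reported recipe.

```python
import random

def cotrain_batches(human_demos, robot_demos, human_ratio=0.9, batch_size=32, steps=3):
    """Hypothetical co-training sampler: each batch mixes in-the-wild human demos
    with a smaller amount of robot demos, so the policy sees broad visual diversity
    while staying grounded in the robot embodiment."""
    n_human = int(batch_size * human_ratio)
    for _ in range(steps):
        batch = (random.choices(human_demos, k=n_human)
                 + random.choices(robot_demos, k=batch_size - n_human))
        random.shuffle(batch)
        yield batch  # in practice each element would be an (observation, action) pair

human = [f"human_demo_{i}" for i in range(1000)]  # placeholder demo identifiers
robot = [f"robot_demo_{i}" for i in range(50)]
for batch in cotrain_batches(human, robot):
    n_h = sum(x.startswith("human") for x in batch)
    print(f"{n_h} human / {len(batch) - n_h} robot samples")
```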

Aug 24, 2025 • 60min
Ep#21: TesserAct: Learning 4D Embodied World Models
World models are an exciting area of research, wherein we predict video given robot actions and use these predictions for planning or training. But two-dimensional image frames aren’t necessarily enough for robot planning; robotic tasks are inherently three-dimensional, so shouldn’t we be predicting how objects move around in 3D space? TesserAct does this, tuning a video generative model to produce 3D predictions. Watch the video to learn more!

Abstract: This paper presents an effective approach for learning novel 4D embodied world models, which predict the dynamic evolution of 3D scenes over time in response to an embodied agent's actions, providing both spatial and temporal consistency. We propose to learn a 4D world model by training on RGB-DN (RGB, Depth, and Normal) videos. This not only surpasses traditional 2D models by incorporating detailed shape, configuration, and temporal changes into their predictions, but also allows us to effectively learn accurate inverse dynamic models for an embodied agent. Specifically, we first extend existing robotic manipulation video datasets with depth and normal information leveraging off-the-shelf models. Next, we fine-tune a video generation model on this annotated dataset, which jointly predicts RGB-DN (RGB, Depth, and Normal) for each frame. We then present an algorithm to directly convert generated RGB, Depth, and Normal videos into a high-quality 4D scene of the world. Our method ensures temporal and spatial coherence in 4D scene predictions from embodied scenarios, enables novel view synthesis for embodied environments, and facilitates policy learning that significantly outperforms those derived from prior video-based world models.

Project Site | arXiv
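A quick sketch of the RGB-DN idea: every frame of an existing manipulation video is annotated with per-pixel depth and surface normals (from off-the-shelf estimators), giving a 7-channel target that the video model learns to predict. The stand-in annotator below returns constant placeholder values; only the data layout is meant to be illustrative.

```python
import numpy as np

def annotate_frame(rgb):
    """Stand-in for off-the-shelf monocular depth / surface-normal estimators:
    attach a per-pixel depth map and unit-normal map to an RGB frame."""
    h, w, _ = rgb.shape
    depth = np.full((h, w, 1), 1.0)                           # placeholder depth (metres)
    normals = np.tile(np.array([0.0, 0.0, 1.0]), (h, w, 1))   # placeholder normals
    return np.concatenate([rgb, depth, normals], axis=-1)     # RGB-DN: 3 + 1 + 3 channels

# Each training clip becomes a sequence of 7-channel RGB-DN frames. A video model
# fine-tuned to predict this stack for future frames yields depth and normals that
# can then be lifted into a consistent 4D scene.
clip = [np.random.rand(64, 64, 3) for _ in range(8)]
rgbdn_clip = np.stack([annotate_frame(f) for f in clip])
print(rgbdn_clip.shape)  # (8, 64, 64, 7)
```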

Aug 22, 2025 • 1h 13min
Ep#20: VideoMimic: Visual imitation enables contextual humanoid control
Part of the advantage of humanoid robots is their ability to interact with the environment: to climb stairs, to sit down, to navigate rough terrain, and so on. In VideoMimic, the authors produced a pipeline which let them:

* Scan a scene into simulation using an iPhone
* Train a whole-body control policy for manipulation and locomotion
* Use this policy to control a humanoid robot in diverse, real-world environments

Abstract: How can we teach humanoids to climb staircases and sit on chairs using the surrounding environment context? Arguably, the simplest way is to just show them—casually capture a human motion video and feed it to humanoids. We introduce VideoMimic, a real-to-sim-to-real pipeline that mines everyday videos, jointly reconstructs the humans and the environment, and produces whole-body control policies for humanoid robots that perform the corresponding skills. We demonstrate the results of our pipeline on real humanoid robots, showing robust, repeatable contextual control such as staircase ascents and descents, sitting and standing from chairs and benches, as well as other dynamic whole-body skills—all from a single policy, conditioned on the environment and global root commands. VideoMimic offers a scalable path towards teaching humanoids to operate in diverse real-world environments.

Project Page | arXiv
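As a rough illustration of what "conditioned on the environment and global root commands" can mean, here is a hypothetical observation builder: the policy sees a local terrain heightmap (from the reconstructed scene), a global root command, and proprioception. The shapes and names are assumptions, not VideoMimic's actual interface.

```python
import numpy as np

def policy_observation(heightmap, root_command, proprioception):
    """Hypothetical observation for a single contextual policy: a local terrain
    heightmap around the robot (the reconstructed environment), a global root
    command (where to move the root), and the robot's proprioceptive state."""
    return np.concatenate([heightmap.ravel(), root_command, proprioception])

# Toy example: a small staircase heightmap, a "walk forward and up" root command,
# and a dummy proprioceptive vector. In the pipeline described above, the heightmap
# would come from the phone-scanned reconstruction and the policy is trained in sim.
stairs = np.repeat(np.linspace(0.0, 0.6, 5), 5).reshape(5, 5)  # 5x5 height samples (m)
command = np.array([1.0, 0.0, 0.6])                            # desired root displacement (m)
proprio = np.zeros(30)
obs = policy_observation(stairs, command, proprio)
print(obs.shape)  # (58,)
```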

Aug 21, 2025 • 51min
Ep#19: Learning to Drive From a World Model
Comma has been selling end-to-end driving kits for your car for years. Their Openpilot is available now, with source code on GitHub. Very interestingly, they train their lane-keeping behavior using generative world models. Watch to learn more.

Abstract: Most self-driving systems rely on hand-coded perception outputs and engineered driving rules. Learning directly from human driving data with an end-to-end method can allow for a training architecture that is simpler and scales well with compute and data. In this work, we propose an end-to-end training architecture that uses real driving data to train a driving policy in an on-policy simulator. We show two different methods of simulation, one with reprojective simulation and one with a learned world model. We show that both methods can be used to train a policy that learns driving behavior without any hand-coded driving rules. We evaluate the performance of these policies in a closed-loop simulation and when deployed in a real-world advanced driver-assistance system.

Blog post | arXiv
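The phrase "train a driving policy in an on-policy simulator" is the interesting bit: the policy's own actions drive a learned (or reprojective) simulator, so it is trained on the states it actually reaches rather than only on logged human states. Here is a toy, self-contained sketch of that closed loop; the linear "world model" and hand-written gains are stand-ins, not Comma's models.

```python
import numpy as np

rng = np.random.default_rng(0)

def world_model(state, action):
    """Stand-in for a learned on-policy simulator: given the current (latent)
    driving state and a steering action, predict the next state."""
    lane_offset, heading = state
    heading = heading + 0.1 * action + rng.normal(scale=0.01)
    lane_offset = lane_offset + 0.5 * heading
    return np.array([lane_offset, heading])

def policy(state, k_offset=1.5, k_heading=2.0):
    """Toy lane-keeping policy; in the real system this would be a learned network
    optimised so that closed-loop rollouts in the world model stay in lane."""
    lane_offset, heading = state
    return -k_offset * lane_offset - k_heading * heading

# Closed-loop rollout: the policy's own actions feed the simulator, so the errors
# it induces are the errors it must learn to recover from.
state = np.array([0.5, 0.0])  # start half a metre off-centre
for _ in range(50):
    state = world_model(state, policy(state))
print(f"final lane offset: {state[0]:.3f} m")
```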

Aug 21, 2025 • 58min
Ep#18: HuB: Learning Extreme Humanoid Balance
This time, Tong and Boyuan tell us about how to achieve some of the most impressive humanoid robot balancing that we have ever seen. Take a look!

Here’s the abstract: The human body demonstrates exceptional motor capabilities—such as standing steadily on one foot or performing a high kick with the leg raised over 1.5 meters—both requiring precise balance control. While recent research on humanoid control has leveraged reinforcement learning to track human motions for skill acquisition, applying this paradigm to balance-intensive tasks remains challenging.

In this work, we identify three key obstacles: instability from reference motion errors, learning difficulties due to morphological mismatch, and the sim-to-real gap caused by sensor noise and unmodeled dynamics. To address these challenges, we propose HuB (Humanoid Balance), a unified framework that integrates reference motion refinement, balance-aware policy learning, and sim-to-real robustness training, with each component targeting a specific challenge.

We validate our approach on the Unitree G1 humanoid robot across challenging quasi-static balance tasks, including extreme single-legged poses such as Swallow Balance and Bruce Lee's Kick. Our policy remains stable even under strong physical disturbances—such as a forceful soccer strike—while baseline methods consistently fail to complete these tasks.

Project Page | arXiv
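Of the three components named in the abstract, the sim-to-real robustness training is the easiest to sketch generically: add sensor noise to observations and periodically apply random pushes in simulation so the balance policy learns to recover from disturbances. The numbers and scheme below are illustrative assumptions, not HuB's actual recipe.

```python
import numpy as np

rng = np.random.default_rng(0)

def perturb_for_robustness(obs, step, push_interval=200, max_push=50.0, noise_std=0.02):
    """Illustrative robustness training step: noisy observations every step, plus
    an occasional random push applied to the simulated torso."""
    noisy_obs = obs + rng.normal(scale=noise_std, size=obs.shape)
    push = np.zeros(3)
    if step % push_interval == 0:
        direction = rng.normal(size=3)
        push = rng.uniform(0.0, max_push) * direction / np.linalg.norm(direction)
    return noisy_obs, push

obs = np.zeros(48)  # placeholder proprioceptive observation
for step in range(1, 401):
    noisy_obs, push = perturb_for_robustness(obs, step)
    if np.any(push):
        print(f"step {step}: push of {np.linalg.norm(push):.1f} N applied")
```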

Aug 17, 2025 • 1h 11min
Ep#17: EgoZero: Robot Learning from Smart Glasses
Using human data is the key to unlocking the potential of general-purpose robotics: it’s much more plentiful and can potentially be collected quickly at scale for relatively little money. But how can we actually do this?

Abstract: Despite recent progress in general purpose robotics, robot policies still lag far behind basic human capabilities in the real world. Humans constantly interact with the physical world, yet this rich data resource remains largely untapped in robot learning. We propose EgoZero, a minimal system that learns robust manipulation policies from human demonstrations captured with Project Aria smart glasses, and zero robot data. EgoZero enables: (1) extraction of complete, robot-executable actions from in-the-wild, egocentric, human demonstrations, (2) compression of human visual observations into morphology-agnostic state representations, and (3) closed-loop policy learning that generalizes morphologically, spatially, and semantically. We deploy EgoZero policies on a gripper Franka Panda robot and demonstrate zero-shot transfer with 70% success rate over 7 manipulation tasks and only 20 minutes of data collection per task. Our results suggest that in-the-wild human data can serve as a scalable foundation for real-world robot learning — paving the way toward a future of abundant, diverse, and naturalistic training data for robots.

Project Site | Original Post on X
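Point (2) in the abstract, the morphology-agnostic state representation, can be pictured with a tiny sketch: instead of raw egocentric images showing a human hand, the state keeps only 3D points for the task-relevant object and for the end effector, so the same representation can be produced from either a human hand or a robot gripper. The specific keypoints below are hypothetical.

```python
import numpy as np

def morphology_agnostic_state(object_points, end_effector_points):
    """Hypothetical morphology-agnostic state: 3D points for the task-relevant
    object plus 3D points for the end effector (human fingertips during data
    collection, gripper jaws at robot deployment), instead of raw images."""
    return np.concatenate([object_points.ravel(), end_effector_points.ravel()])

# From a human demo: object keypoints and thumb/index fingertip positions,
# all expressed in metres in a shared world frame.
object_pts = np.array([[0.40, 0.05, 0.02], [0.42, 0.07, 0.02]])
fingertips = np.array([[0.38, 0.04, 0.10], [0.39, 0.08, 0.10]])
state = morphology_agnostic_state(object_pts, fingertips)
print(state.shape)  # (12,) -> input to the closed-loop policy
```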

Aug 17, 2025 • 43min
Ep#16: TWIST: Teleoperated Whole-Body Imitation System
Learn how to do versatile, capable whole-body teleoperation of humanoid robots — a key capability for unlocking the data we need to train general-purpose autonomous humanoids.

Abstract: Teleoperating humanoid robots in a whole-body manner marks a fundamental step toward developing general-purpose robotic intelligence, with human motion providing an ideal interface for controlling all degrees of freedom. Yet, most current humanoid teleoperation systems fall short of enabling coordinated whole-body behavior, typically limiting themselves to isolated locomotion or manipulation tasks. We present the Teleoperated Whole-Body Imitation System (TWIST), a system for humanoid teleoperation through whole-body motion imitation. We first generate reference motion clips by retargeting human motion capture data to the humanoid robot. We then develop a robust, adaptive, and responsive whole-body controller using a combination of reinforcement learning and behavior cloning (RL+BC). Through systematic analysis, we demonstrate how incorporating privileged future motion frames and real-world motion capture (MoCap) data improves tracking accuracy. TWIST enables real-world humanoid robots to achieve unprecedented, versatile, and coordinated whole-body motor skills--spanning whole-body manipulation, legged manipulation, locomotion, and expressive movement--using a single unified neural network controller. Our project website: this https URL

Project Site | Original Post on X | arXiv
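The "privileged future motion frames" mentioned above are a simple idea to sketch: besides proprioception and the current reference frame, the tracking controller also sees a short window of upcoming reference frames, so it can anticipate motion rather than react late. The dimensions and names below are illustrative assumptions, not TWIST's exact observation.

```python
import numpy as np

def tracking_observation(motion_clip, t, proprio, horizon=4):
    """Hypothetical observation for a whole-body tracking controller: the robot's
    proprioception plus the current reference frame and `horizon` future frames,
    clamped at the end of the clip."""
    future = [motion_clip[min(t + k, len(motion_clip) - 1)] for k in range(horizon + 1)]
    return np.concatenate([proprio] + future)

# Toy retargeted motion clip: 100 frames of 29 reference joint targets.
clip = [np.zeros(29) for _ in range(100)]
proprio = np.zeros(60)
obs = tracking_observation(clip, t=10, proprio=proprio)
print(obs.shape)  # (60 + 5 * 29,) = (205,)
```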

Aug 16, 2025 • 55min
Ep#15: Navigation World Models
World models are one of the hottest research areas. Amir’s work on navigation world models shows how we can use world models to achieve new goals in unfamiliar environments.

Abstract: Navigation is a fundamental skill of agents with visual-motor capabilities. We introduce a Navigation World Model (NWM), a controllable video generation model that predicts future visual observations based on past observations and navigation actions. To capture complex environment dynamics, NWM employs a Conditional Diffusion Transformer (CDiT), trained on a diverse collection of egocentric videos of both human and robotic agents, and scaled up to 1 billion parameters. In familiar environments, NWM can plan navigation trajectories by simulating them and evaluating whether they achieve the desired goal. Unlike supervised navigation policies with fixed behavior, NWM can dynamically incorporate constraints during planning. Experiments demonstrate its effectiveness in planning trajectories from scratch or by ranking trajectories sampled from an external policy. Furthermore, NWM leverages its learned visual priors to imagine trajectories in unfamiliar environments from a single input image, making it a flexible and powerful tool for next-generation navigation systems.

Project Site | Original Post on X | Open-Source Code
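The planning procedure in the abstract, "plan navigation trajectories by simulating them and evaluating whether they achieve the desired goal", is essentially sampling-based planning inside the world model. Here is a minimal sketch under toy assumptions: a 2-D latent "observation", random candidate action sequences, and goal distance as the score. The real NWM operates on images with a conditional diffusion transformer; none of the stand-ins below are its actual components.

```python
import numpy as np

rng = np.random.default_rng(0)

def world_model(obs, action):
    """Stand-in for the learned world model: predict the next (latent) observation
    from the current one and a navigation action."""
    return obs + action + rng.normal(scale=0.01, size=obs.shape)

def plan_by_simulation(obs, goal, n_candidates=256, horizon=8):
    """Sample candidate action sequences, roll each out inside the world model,
    and keep the one whose simulated endpoint lands closest to the goal."""
    best_actions, best_cost = None, np.inf
    for _ in range(n_candidates):
        actions = rng.uniform(-0.2, 0.2, size=(horizon, obs.shape[0]))
        sim_obs = obs
        for a in actions:
            sim_obs = world_model(sim_obs, a)
        cost = float(np.linalg.norm(sim_obs - goal))
        if cost < best_cost:
            best_actions, best_cost = actions, cost
    return best_actions, best_cost

start = np.zeros(2)            # toy 2-D "observation"
goal = np.array([0.4, 0.2])    # desired endpoint
plan, cost = plan_by_simulation(start, goal)
print(f"best of 256 sampled plans ends {cost:.2f} from the goal")
```

The same scoring loop also covers the "ranking trajectories sampled from an external policy" case: simply replace the random candidate sampler with the external policy's proposals.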


