

RoboPapers
Chris Paxton and Michael Cho
Chris Paxton & Michael Cho geek out over robotic papers with paper authors. robopapers.substack.com
Episodes

Dec 15, 2025 • 1h 5min
Ep#53: Semantic World Models
World models — action-conditioned predictive models of the environment — are an exciting area of research for robots, useful both for training and for test-time compute. But video-based world models waste a lot of predictive power on reconstructing pixels, which drives up model and data requirements and limits how far into the future their predictions remain viable.

Instead, what if we learned a purely semantic world model, one which predicts which properties will be true about the world after a sequence of actions, without reconstructing whole images? Jacob Berg tells us more.

Watch Episode #53 of RoboPapers now, with Michael Cho and Chris Paxton!

Abstract:
Planning with world models offers a powerful paradigm for robotic control. Conventional approaches train a model to predict future frames conditioned on current frames and actions, which can then be used for planning. However, the objective of predicting future pixels is often at odds with the actual planning objective; strong pixel reconstruction does not always correlate with good planning decisions. This paper posits that instead of reconstructing future frames as pixels, world models only need to predict task-relevant semantic information about the future. To make such predictions, the paper poses world modeling as a visual question answering problem about semantic information in future frames. This perspective allows world modeling to be approached with the same tools underlying vision-language models. Thus vision-language models can be trained as "semantic" world models through a supervised finetuning process on image-action-text data, enabling planning for decision-making while inheriting many of the generalization and robustness properties of the pretrained vision-language models.
The paper demonstrates how such a semantic world model can be used for policy improvement on open-ended robotics tasks, leading to significant generalization improvements over typical paradigms of reconstruction-based action-conditional world modeling. Website available at this https URL.

Project Page: https://weirdlabuw.github.io/swm/
ArXiv: https://arxiv.org/abs/2510.19818

You may also find this episode interesting, which covers ideas in symbolic learning:

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit robopapers.substack.com
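To make the idea concrete, here is a minimal toy sketch of planning with a semantic world model. Everything in it (the predicate and action names, the hand-coded stub model) is invented for illustration — in the paper the model is a finetuned VLM answering questions about future frames, not a lookup like this:

```python
# Toy sketch: plan by querying a "semantic" world model about the future
# instead of predicting pixels. All names here are illustrative.
from itertools import product

def semantic_world_model(facts, actions, query):
    # Stub: given initial facts and an action sequence, return the
    # probability that `query` holds afterwards. A real system would
    # query a VLM finetuned on image-action-text data.
    facts = set(facts)
    for act in actions:
        if act == "open_drawer":
            facts.add("drawer_open")
        elif act == "pick_block" and "drawer_open" in facts:
            facts.add("holding_block")
        elif act == "place_in_drawer" and "holding_block" in facts:
            facts.add("block_in_drawer")
    return 1.0 if query in facts else 0.0

def plan(facts, goal, primitives, horizon=3):
    # Exhaustively score short action sequences; keep the best one.
    best, best_score = None, -1.0
    for seq in product(primitives, repeat=horizon):
        score = semantic_world_model(facts, seq, goal)
        if score > best_score:
            best, best_score = list(seq), score
    return best, best_score

primitives = ["open_drawer", "pick_block", "place_in_drawer"]
plan_found, score = plan([], "block_in_drawer", primitives)
print(plan_found, score)  # ['open_drawer', 'pick_block', 'place_in_drawer'] 1.0
```

The point of the sketch: the planner only ever asks the model whether a predicate will be true, never what the scene will look like.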

Dec 12, 2025 • 46min
Ep#52: Probe, Learn, Distill: Self-improving Vision-Language-Action Models
Wenli Xiao, a PhD student and robotics researcher, introduces her innovative Probe, Learn, Distill (PLD) method for enhancing vision-language-action models. She details how freezing a VLA's backbone and training lightweight residual actors can improve reliability in complex tasks. Wenli also discusses the use of hybrid rollouts for optimizing data collection and the significance of training on fewer tasks to generalize better on unseen challenges. Her insights on continual learning and practical workflows could reshape the future of robotics!

Dec 10, 2025 • 54min
Ep#51: Humanoid Everyday
Robotics, as we know, has a data problem. Many workarounds have been proposed, but one of the most important things is just to collect a large amount of real-robot data — something very difficult, especially for mobile humanoids. Enter Humanoid Everyday, which provides a large, diverse dataset of humanoid mobile manipulation examples.

With 260 tasks across 7 different categories, this is the largest humanoid robot dataset we’ve ever seen — and, most importantly, the authors have provided clear evidence that it works for robot learning.

Zhenyu Zhao, Hongyi Jing, Xiawei Liu, Jiageng Mao, and Yue Wang all join us to tell us more about their thought process, their dataset, and the future of humanoid robot evaluation.

Watch Episode #51 of RoboPapers, with Michael Cho and Chris Paxton, now!

Abstract:
From locomotion to dexterous manipulation, humanoid robots have made remarkable strides in demonstrating complex full-body capabilities. However, the majority of current robot learning datasets and benchmarks mainly focus on stationary robot arms, and the few existing humanoid datasets are either confined to fixed environments or limited in task diversity, often lacking human-humanoid interaction and lower-body locomotion. Moreover, there are few standardized evaluation platforms for benchmarking learning-based policies on humanoid data. In this work, we present Humanoid Everyday, a large-scale and diverse humanoid manipulation dataset characterized by extensive task variety involving dexterous object manipulation, human-humanoid interaction, locomotion-integrated actions, and more. Leveraging a highly efficient human-supervised teleoperation pipeline, Humanoid Everyday aggregates high-quality multimodal sensory data, including RGB, depth, LiDAR, and tactile inputs, together with natural language annotations, comprising 10.3k trajectories and over 3 million frames of data across 260 tasks in 7 broad categories.
In addition, we conduct an analysis of representative policy learning methods on our dataset, providing insights into their strengths and limitations across different task categories. For standardized evaluation, we introduce a cloud-based evaluation platform that allows researchers to seamlessly deploy their policies in our controlled setting and receive performance feedback. By releasing Humanoid Everyday along with our policy learning analysis and a standardized cloud-based evaluation platform, we intend to advance research in general-purpose humanoid manipulation and lay the groundwork for more capable and embodied robotic agents in real-world scenarios.

Project Page: https://humanoideveryday.github.io/
ArXiv: https://arxiv.org/abs/2510.08807

Dec 8, 2025 • 1h 4min
Ep#50: EMMA: Scaling Mobile Manipulation via Egocentric Human Data
Collecting robot teleoperation data for mobile manipulation is incredibly time consuming, even more so than collecting teleoperation data for a stationary manipulator. Fortunately, Lawrence and Pranav have a solution: EMMA, or Egocentric Mobile MAnipulation.

In short, they find that they can skip mobile teleoperation entirely, using static arms for manipulation tasks and co-training with egocentric human video. This is enough to show generalization to more complex scenes and tasks.

To learn more, watch Episode #50 of RoboPapers now, hosted by Michael Cho and Chris Paxton!

Abstract:
Scaling mobile manipulation imitation learning is bottlenecked by expensive mobile robot teleoperation. We present Egocentric Mobile MAnipulation (EMMA), an end-to-end framework training mobile manipulation policies from human mobile manipulation data together with static robot data, sidestepping mobile teleoperation. To accomplish this, we co-train human full-body motion data with static robot data. In our experiments across three real-world tasks, EMMA demonstrates performance comparable to baselines trained on teleoperated mobile robot data (Mobile ALOHA), achieving higher or equivalent full-task success. We find that EMMA is able to generalize to new spatial configurations and scenes, and we observe positive performance scaling as we increase the hours of human data, opening new avenues for scalable robotic learning in real-world environments. Details of this project can be found at this https URL.

Project Page: https://ego-moma.github.io/
ArXiv: https://arxiv.org/abs/2509.04443
Original Thread on X
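The co-training recipe — mixing two data sources in every batch — can be sketched in a few lines. This is a generic illustration, not the EMMA code; the dataset names and the fixed 50/50 mixing ratio are assumptions for the example:

```python
# Toy sketch of co-training on two data sources: each batch mixes
# egocentric human data with static-robot demonstrations at a fixed
# ratio. Illustrative only; names and ratio are invented.
import random

random.seed(0)
robot_data = [("robot_obs_%d" % i, "robot_act_%d" % i) for i in range(100)]
human_data = [("human_obs_%d" % i, "human_act_%d" % i) for i in range(100)]

def sample_batch(batch_size=8, human_ratio=0.5):
    # Draw a fixed fraction of the batch from each source, then shuffle
    # so the optimizer sees an interleaved stream.
    n_human = int(batch_size * human_ratio)
    batch = random.sample(human_data, n_human)
    batch += random.sample(robot_data, batch_size - n_human)
    random.shuffle(batch)
    return batch

batch = sample_batch()
n_human = sum(1 for obs, _ in batch if obs.startswith("human"))
print(len(batch), n_human)  # 8 4
```

In practice the mixing ratio is a hyperparameter; the blurb's scaling result suggests the human-data fraction is where the leverage is.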

Dec 5, 2025 • 50min
Ep#49: Learning a Unified Policy for Position and Force Control in Legged Loco-Manipulation
Robots need to be able to apply pressure and make contact with objects as needed in order to accomplish their tasks. From compliance, to working safely around humans, to whole-body manipulation of heavy objects, combining force and position control can dramatically expand the capabilities of robots. This is especially true for legged robots, which have so much ability to exert forces on the world around them. But how do we train robots that can do this?

Baoxiong Jia tells us more in our discussion of his team’s recent Best Paper Award-winning work on learning a unified policy for position and force control, called UniFP.

To learn more, watch Episode #49 of RoboPapers, hosted by Michael Cho and Chris Paxton.

Abstract:
Robotic loco-manipulation tasks often involve contact-rich interactions with the environment, requiring the joint modeling of contact force and robot position. However, recent visuomotor policies often focus solely on learning position or force control, overlooking their co-learning. In this work, we propose the first unified policy for legged robots that jointly models force and position control, learned without reliance on force sensors. By simulating diverse combinations of position and force commands alongside external disturbance forces, we use reinforcement learning to learn a policy that estimates forces from historical robot states and compensates for them through position and velocity adjustments. This policy enables a wide range of manipulation behaviors under varying force and position inputs, including position tracking, force application, force tracking, and compliant interactions. Furthermore, we demonstrate that the learned policy enhances trajectory-based imitation learning pipelines by incorporating essential contact information through its force estimation module, achieving approximately 39.5% higher success rates across four challenging contact-rich manipulation tasks compared to position-control policies.
Extensive experiments on both a quadrupedal manipulator and a humanoid robot validate the versatility and robustness of the proposed policy across diverse scenarios.

Project Page: https://unified-force.github.io/
ArXiv: https://arxiv.org/abs/2505.20829
Post on X
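A classical way to see how one command stream can serve both objectives is a simple admittance rule: a desired contact force is converted into a position offset through a stiffness model. This is a textbook sketch, not the UniFP policy (which learns the mapping with RL and no force sensor); the gains and stiffness value are invented:

```python
# Toy admittance sketch: turn a force target into a position offset,
# so the same position-tracking loop can also apply forces.
# Illustrative only; stiffness and gain values are made up.

def admittance_command(x_desired, f_desired, f_measured,
                       stiffness=200.0, gain=0.01):
    # Shift the position target to shrink the force error; in contact,
    # pushing the target "into" the surface raises the applied force.
    force_error = f_desired - f_measured
    return x_desired + gain * force_error / stiffness

# Free space, no force target: pure position tracking.
print(admittance_command(0.30, 0.0, 0.0))   # 0.3
# In contact, commanding 10 N while measuring 4 N: target shifts inward.
print(admittance_command(0.30, 10.0, 4.0))
```

UniFP's twist, per the abstract, is estimating the force term from historical robot states instead of reading it from a sensor.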

Dec 4, 2025 • 56min
Ep#48: VisualMimic: Visual Humanoid Loco-Manipulation via Motion Tracking and Generation
Robots must often be able to move around and interact with objects in previously unseen environments to be useful. And the interaction part is important; to do this, they must be able to perceive and interact with the world using onboard sensing. Enter VisualMimic. Shaofeng Yin and Yanjie Ze show us how to use visual sim-to-real training for diverse loco-manipulation tasks, which can even handle diverse outdoor environments.

Learn more in Episode #48 of RoboPapers today, hosted by Michael Cho and Chris Paxton.

Abstract:
Humanoid loco-manipulation in unstructured environments demands tight integration of egocentric perception and whole-body control. However, existing approaches either depend on external motion capture systems or fail to generalize across diverse tasks. We introduce VisualMimic, a visual sim-to-real framework that unifies egocentric vision with hierarchical whole-body control for humanoid robots. VisualMimic combines a task-agnostic low-level keypoint tracker -- trained from human motion data via a teacher-student scheme -- with a task-specific high-level policy that generates keypoint commands from visual and proprioceptive input. To ensure stable training, we inject noise into the low-level policy and clip high-level actions using human motion statistics. VisualMimic enables zero-shot transfer of visuomotor policies trained in simulation to real humanoid robots, accomplishing a wide range of loco-manipulation tasks such as box lifting, pushing, football dribbling, and kicking. Beyond controlled laboratory settings, our policies also generalize robustly to outdoor environments. Videos are available at: this https URL.

Project Page: https://visualmimic.github.io/
ArXiv: https://arxiv.org/abs/2509.20322
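The hierarchy the abstract describes — a high-level policy emitting keypoint commands that are clipped before a low-level tracker executes them — can be sketched abstractly. All numbers and function bodies below are placeholders, not the VisualMimic networks:

```python
# Toy sketch of hierarchical control with action clipping.
# Illustrative only: the clip range stands in for human motion
# statistics, and the "policies" are hand-coded placeholders.

KEYPOINT_MIN, KEYPOINT_MAX = -0.5, 0.5  # stand-in for motion stats

def high_level_policy(obs):
    # Placeholder for the task-specific visual policy: raw keypoint
    # displacement commands from (here, fake) observations.
    return [0.8 * o for o in obs]

def clip(cmd):
    # Clipping high-level actions keeps commands inside the range the
    # low-level tracker was trained on, stabilizing training.
    return [min(max(c, KEYPOINT_MIN), KEYPOINT_MAX) for c in cmd]

def low_level_tracker(keypoints, joints, gain=0.5):
    # Placeholder tracker: move joints a fraction toward the targets.
    return [j + gain * (k - j) for j, k in zip(joints, keypoints)]

cmd = clip(high_level_policy([1.0, -0.25]))
joints = low_level_tracker(cmd, [0.0, 0.0])
print(cmd)     # [0.5, -0.2]
print(joints)  # [0.25, -0.1]
```

The separation is what makes the low level task-agnostic: it only ever sees keypoint targets, never task observations.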

Dec 2, 2025 • 49min
Ep#47: ResMimic: From General Motion Tracking to Humanoid Whole-Body Loco-Manipulation via Residual Learning
For robots to be useful, they can’t just dance — they must be able to physically interact with the world around them. Unfortunately, the sorts of motion-tracking policies you see performing dancing or martial arts are not really capable of the kind of precise, forceful interaction needed to perform useful work. Siheng and Yanjie join us to talk about ResMimic, their new paper which takes a general-purpose human motion-tracking policy and improves it with a residual policy to reliably interact with objects.

To learn more, watch Episode #47 of RoboPapers, hosted by Michael Cho and Chris Paxton, today!

Abstract:
Humanoid whole-body loco-manipulation promises transformative capabilities for daily service and warehouse tasks. While recent advances in general motion tracking (GMT) have enabled humanoids to reproduce diverse human motions, these policies lack the precision and object awareness required for loco-manipulation. To this end, we introduce ResMimic, a two-stage residual learning framework for precise and expressive humanoid control from human motion data. First, a GMT policy, trained on large-scale human-only motion, serves as a task-agnostic base for generating human-like whole-body movements. An efficient but precise residual policy is then learned to refine the GMT outputs to improve locomotion and incorporate object interaction. To further facilitate efficient training, we design (i) a point-cloud-based object tracking reward for smoother optimization, (ii) a contact reward that encourages accurate humanoid body-object interactions, and (iii) a curriculum-based virtual object controller to stabilize early training. We evaluate ResMimic in both simulation and on a real Unitree G1 humanoid.
Results show substantial gains in task success, training efficiency, and robustness over strong baselines.

Project Page: https://resmimic.github.io/
ArXiv: https://www.arxiv.org/abs/2510.05070
Original post on X: https://x.com/SihengZhao/status/1975985531298476316
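The residual idea itself is compact: keep the base tracking policy frozen and learn only an additive correction. Here is a toy sketch of that structure — the "policies" are hand-coded stand-ins, not the ResMimic networks:

```python
# Toy sketch of residual policy refinement: a frozen base policy
# proposes an action, and a small learned residual corrects it.
# Illustrative only; both policies here are hand-coded placeholders.

def base_policy(state):
    # Frozen general motion-tracking base: step toward a reference pose.
    reference = [1.0, 1.0]
    return [0.5 * (ref - s) for s, ref in zip(state, reference)]

def residual_policy(state):
    # Learned correction, typically small; here a fixed toy offset
    # standing in for an object-interaction adjustment.
    return [0.1, -0.05]

def act(state):
    # Final command = base action + residual correction.
    base = base_policy(state)
    res = residual_policy(state)
    return [b + r for b, r in zip(base, res)]

print(act([0.0, 0.0]))
```

Because the residual only has to model the *difference* from already-human-like motion, it can be much smaller and cheaper to train than a from-scratch policy.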

Dec 1, 2025 • 57min
Ep#46: ManiFlow: A General Robot Manipulation Policy via Consistency Flow Training
Improving robot's’ ability to learn from human demonstrations is key to getting better performance from them in a wide variety of tasks. Algorithmic improvements like consistency flow training and a new architecture which can leverage multimodal inputs, allows ManiFlow to substantially improve on prior work while also showing strong generalization to unseen environments and distractors. Ge Yan tells us more about how this works and how we can make imitation learning better.Find out more on RoboPapers #46, with Michael Cho and Chris Paxton!Abstract:We introduces ManiFlow, a visuomotor imitation learning policy for general robot manipulation that generates precise, high-dimensional actions conditioned on diverse visual, language and proprioceptive inputs. We leverage flow matching with consistency training to enable high-quality dexterous action generation in just 1-2 inference steps. To handle diverse input modalities efficiently, we propose DiT-X, a diffusion transformer architecture with adaptive cross-attention and AdaLN-Zero conditioning that enables fine-grained feature interactions between action tokens and multi-modal observations. ManiFlow demonstrates consistent improvements across diverse simulation benchmarks and nearly doubles success rates on real-world tasks across single-arm, bimanual, and humanoid robot setups with increasing dexterity. The extensive evaluation further demonstrates the strong robustness and generalizability of ManiFlow to novel objects and background changes, and highlights its strong scaling capability with larger-scale datasets.Project Page: https://maniflow-policy.github.io/ArXiV Paper: https://www.arxiv.org/pdf/2509.01819Thread on X This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit robopapers.substack.com
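Why does flow matching allow 1-2 inference steps? The intuition is that if the learned velocity field transports samples along (nearly) straight paths, plain Euler integration converges almost immediately. A toy 1D example — purely illustrative, with a hand-written velocity field rather than a trained network — makes the point:

```python
# Toy sketch of few-step flow-matching inference. For a point target a,
# the field v(x, t) = (a - x) / (1 - t) moves any start x0 to a along a
# straight line; consistency-style training aims at exactly this kind
# of straight, few-step-integrable path. Illustrative only.

def velocity(x, t, target):
    return (target - x) / (1.0 - t)

def sample(x0, target, steps):
    # Plain Euler integration of dx/dt = v(x, t) from t=0 to t=1.
    x, dt = x0, 1.0 / steps
    for i in range(steps):
        t = i * dt
        x = x + dt * velocity(x, t, target)
    return x

# A straight path reaches the target even with a single Euler step.
print(sample(0.0, 3.0, steps=1))  # 3.0
print(sample(0.0, 3.0, steps=2))  # 3.0
```

Real action distributions are not point targets, so trained fields only approximate straightness — which is why ManiFlow still uses 1-2 steps rather than exactly one.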

Nov 25, 2025 • 1h 1min
Ep#45: HERMES: Human-to-Robot Embodied Learning From Multi-Source Motion Data for Mobile Dexterous Manipulation
Just collecting manipulation data isn’t enough for robots — they need to be able to move around in the world, which has a whole different set of challenges from pure manipulation. And bringing navigation and manipulation together in a single framework is even more challenging.

Enter HERMES, from Zhecheng Yuan and Tianming Wei. This is a four-stage process in which human videos are used to set up an RL sim-to-real training pipeline that overcomes differences between robot and human kinematics, used together with a navigation foundation model to move around in a variety of environments.

To learn more, join us as Zhecheng Yuan and Tianming Wei tell us how they built their system to perform mobile dexterous manipulation from human videos.

Watch Episode #45 of RoboPapers today, hosted by Michael Cho and Chris Paxton!

Abstract:
Leveraging human motion data to impart robots with versatile manipulation skills has emerged as a promising paradigm in robotic manipulation. Nevertheless, translating multi-source human hand motions into feasible robot behaviors remains challenging, particularly for robots equipped with multi-fingered dexterous hands characterized by complex, high-dimensional action spaces. Moreover, existing approaches often struggle to produce policies capable of adapting to diverse environmental conditions. In this paper, we introduce HERMES, a human-to-robot learning framework for mobile bimanual dexterous manipulation. First, HERMES formulates a unified reinforcement learning approach capable of seamlessly transforming heterogeneous human hand motions from multiple sources into physically plausible robotic behaviors. Subsequently, to mitigate the sim2real gap, we devise an end-to-end, depth image-based sim2real transfer method for improved generalization to real-world scenarios.
Furthermore, to enable autonomous operation in varied and unstructured environments, we augment the navigation foundation model with a closed-loop Perspective-n-Point (PnP) localization mechanism, ensuring precise alignment of visual goals and effectively bridging autonomous navigation and dexterous manipulation. Extensive experimental results demonstrate that HERMES consistently exhibits generalizable behaviors across diverse, in-the-wild scenarios, successfully performing numerous complex mobile bimanual dexterous manipulation tasks.

Project Page: https://gemcollector.github.io/HERMES/
ArXiv: https://arxiv.org/abs/2508.20085

Nov 24, 2025 • 1h 5min
Ep#44: From Pixels to Predicates: Learning Symbolic World Models via Pretrained Vision-Language Models
Reasoning over long horizons would allow robots to generalize better to unseen environments and settings zero-shot. One mechanism for this kind of reasoning would be world models, but traditional video world models still tend to struggle with long horizons and are very data-intensive to train. But what if, instead of predicting images of the future, we predicted just the symbolic information necessary for reasoning?

Nishanth Kumar tells us about Pixels to Predicates, a method for symbol grounding which allows a VLM to plan sequences of robot skills to achieve unseen goals in previously unseen settings.

To find out more, watch Episode #44 of RoboPapers with Michael Cho and Chris Paxton now!

Abstract:
Our aim is to learn to solve long-horizon decision-making problems in complex robotics domains given low-level skills and a handful of short-horizon demonstrations containing sequences of images. To this end, we focus on learning abstract symbolic world models that facilitate zero-shot generalization to novel goals via planning. A critical component of such models is the set of symbolic predicates that define properties of and relationships between objects. In this work, we leverage pretrained vision-language models (VLMs) to propose a large set of visual predicates potentially relevant for decision-making, and to evaluate those predicates directly from camera images. At training time, we pass the proposed predicates and demonstrations into an optimization-based model-learning algorithm to obtain an abstract symbolic world model that is defined in terms of a compact subset of the proposed predicates. At test time, given a novel goal in a novel setting, we use the VLM to construct a symbolic description of the current world state, and then use a search-based planning algorithm to find a sequence of low-level skills that achieves the goal.
We demonstrate empirically across experiments in both simulation and the real world that our method can generalize aggressively, applying its learned world model to solve problems with a wide variety of object types, arrangements, numbers of objects, and visual backgrounds, as well as novel goals and much longer horizons than those seen at training time.

Project Page
ArXiv
Thread on X
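The test-time loop the abstract describes — symbolic state from the VLM, then search over skill operators — can be sketched with classical STRIPS-style planning. The predicate and operator names below are invented for illustration; in the paper both the predicates and the world model are learned:

```python
# Toy sketch of search-based planning over symbolic predicates.
# States are frozensets of true predicates; each operator has
# preconditions, add effects, and delete effects. Illustrative only.
from collections import deque

OPERATORS = {
    # name: (preconditions, add effects, delete effects)
    "pick(cup)": ({"on_table(cup)", "hand_empty"},
                  {"holding(cup)"},
                  {"on_table(cup)", "hand_empty"}),
    "place(cup, shelf)": ({"holding(cup)"},
                          {"on(cup, shelf)", "hand_empty"},
                          {"holding(cup)"}),
}

def bfs_plan(init, goal):
    # Breadth-first search over abstract states for a skill sequence.
    start = frozenset(init)
    queue, seen = deque([(start, [])]), {start}
    while queue:
        state, partial = queue.popleft()
        if goal <= state:
            return partial
        for name, (pre, add, delete) in OPERATORS.items():
            if pre <= state:
                nxt = frozenset((state - delete) | add)
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, partial + [name]))
    return None

plan = bfs_plan({"on_table(cup)", "hand_empty"}, {"on(cup, shelf)"})
print(plan)  # ['pick(cup)', 'place(cup, shelf)']
```

The planner never touches pixels: once the VLM has grounded the current state into predicates, the search is purely symbolic, which is what makes long horizons cheap.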


