

RoboPapers
Chris Paxton and Michael Cho
Chris Paxton & Michael Cho geek out over robotic papers with paper authors. robopapers.substack.com
Episodes

Dec 15, 2025 • 1h 5min
Ep#53: Semantic World Models
World models — action-conditioned predictive models of the environment — are an exciting area of research for robots, useful both for training and for test-time compute. But video-based world models waste a lot of predictive power on reconstructing pixels, which drives up model and data requirements and limits how far into the future their predictions remain viable.

Instead, what if we learned a purely semantic world model, one which predicts which properties will be true about the world after a sequence of actions, without reconstructing whole images? Jacob Berg tells us more.

Watch Episode #53 of RoboPapers now, with Michael Cho and Chris Paxton!

Abstract:
Planning with world models offers a powerful paradigm for robotic control. Conventional approaches train a model to predict future frames conditioned on current frames and actions, which can then be used for planning. However, the objective of predicting future pixels is often at odds with the actual planning objective; strong pixel reconstruction does not always correlate with good planning decisions. This paper posits that instead of reconstructing future frames as pixels, world models only need to predict task-relevant semantic information about the future. To make such predictions, the paper poses world modeling as a visual question answering problem about semantic information in future frames. This perspective allows world modeling to be approached with the same tools underlying vision-language models. Thus vision-language models can be trained as "semantic" world models through a supervised finetuning process on image-action-text data, enabling planning for decision-making while inheriting many of the generalization and robustness properties of the pretrained vision-language models.
The paper demonstrates how such a semantic world model can be used for policy improvement on open-ended robotics tasks, leading to significant generalization improvements over typical paradigms of reconstruction-based action-conditional world modeling. Website available at this https URL.

Project Page: https://weirdlabuw.github.io/swm/
ArXiv: https://arxiv.org/abs/2510.19818

You may also find this episode interesting, which covers ideas in symbolic learning:

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit robopapers.substack.com
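To make the idea concrete, here is a minimal toy sketch of planning with a semantic world model. Everything in it (the predicate and action names, the hand-coded stub model) is invented for illustration — in the paper the model is a finetuned VLM answering questions about future frames, not a lookup like this:

```python
# Toy sketch: plan by querying a "semantic" world model about the future
# instead of predicting pixels. All names here are illustrative.
from itertools import product

def semantic_world_model(facts, actions, query):
    # Stub: given initial facts and an action sequence, return the
    # probability that `query` holds afterwards. A real system would
    # query a VLM finetuned on image-action-text data.
    facts = set(facts)
    for act in actions:
        if act == "open_drawer":
            facts.add("drawer_open")
        elif act == "pick_block" and "drawer_open" in facts:
            facts.add("holding_block")
        elif act == "place_in_drawer" and "holding_block" in facts:
            facts.add("block_in_drawer")
    return 1.0 if query in facts else 0.0

def plan(facts, goal, primitives, horizon=3):
    # Exhaustively score short action sequences; keep the best one.
    best, best_score = None, -1.0
    for seq in product(primitives, repeat=horizon):
        score = semantic_world_model(facts, seq, goal)
        if score > best_score:
            best, best_score = list(seq), score
    return best, best_score

primitives = ["open_drawer", "pick_block", "place_in_drawer"]
plan_found, score = plan([], "block_in_drawer", primitives)
print(plan_found, score)  # ['open_drawer', 'pick_block', 'place_in_drawer'] 1.0
```

The point of the sketch: the planner only ever asks the model whether a predicate will be true, never what the scene will look like.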

Dec 12, 2025 • 46min
Ep#52: Probe, Learn, Distill: Self-improving Vision-Language-Action Models
Wenli Xiao, a PhD student and robotics researcher, introduces her innovative Probe, Learn, Distill (PLD) method for enhancing vision-language-action models. She details how freezing a VLA's backbone and training lightweight residual actors can improve reliability in complex tasks. Wenli also discusses the use of hybrid rollouts for optimizing data collection and the significance of training on fewer tasks to generalize better on unseen challenges. Her insights on continual learning and practical workflows could reshape the future of robotics!

Dec 10, 2025 • 54min
Ep#51: Humanoid Everyday
Robotics, as we know, has a data problem. Many workarounds have been proposed, but one of the most important things is just to collect a large amount of real-robot data — something very difficult, especially for mobile humanoids. Enter Humanoid Everyday, which provides a large, diverse dataset of humanoid mobile manipulation examples.

With 260 tasks across 7 different categories, this is the largest humanoid robot dataset we’ve ever seen — and, most importantly, the authors have provided clear evidence that it works for robot learning.

Zhenyu Zhao, Hongyi Jing, Xiawei Liu, Jiageng Mao, and Yue Wang all join us to tell us more about their thought process, their dataset, and the future of humanoid robot evaluation.

Watch Episode #51 of RoboPapers, with Michael Cho and Chris Paxton, now!

Abstract:
From locomotion to dexterous manipulation, humanoid robots have made remarkable strides in demonstrating complex full-body capabilities. However, the majority of current robot learning datasets and benchmarks mainly focus on stationary robot arms, and the few existing humanoid datasets are either confined to fixed environments or limited in task diversity, often lacking human-humanoid interaction and lower-body locomotion. Moreover, there are few standardized evaluation platforms for benchmarking learning-based policies on humanoid data. In this work, we present Humanoid Everyday, a large-scale and diverse humanoid manipulation dataset characterized by extensive task variety involving dexterous object manipulation, human-humanoid interaction, locomotion-integrated actions, and more. Leveraging a highly efficient human-supervised teleoperation pipeline, Humanoid Everyday aggregates high-quality multimodal sensory data, including RGB, depth, LiDAR, and tactile inputs, together with natural language annotations, comprising 10.3k trajectories and over 3 million frames of data across 260 tasks in 7 broad categories.
In addition, we conduct an analysis of representative policy learning methods on our dataset, providing insights into their strengths and limitations across different task categories. For standardized evaluation, we introduce a cloud-based evaluation platform that allows researchers to seamlessly deploy their policies in our controlled setting and receive performance feedback. By releasing Humanoid Everyday along with our policy learning analysis and a standardized cloud-based evaluation platform, we intend to advance research in general-purpose humanoid manipulation and lay the groundwork for more capable and embodied robotic agents in real-world scenarios.

Project Page: https://humanoideveryday.github.io/
ArXiv: https://arxiv.org/abs/2510.08807

Dec 8, 2025 • 1h 4min
Ep#50: EMMA: Scaling Mobile Manipulation via Egocentric Human Data
Collecting robot teleoperation data for mobile manipulation is incredibly time consuming, even more so than collecting teleoperation data for a stationary manipulator. Fortunately, Lawrence and Pranav have a solution: EMMA, or Egocentric Mobile MAnipulation.

In short, they find that they can skip mobile teleoperation entirely, using static arms for manipulation tasks and co-training with egocentric human video. This is enough to show generalization to more complex scenes and tasks.

To learn more, watch Episode #50 of RoboPapers now, hosted by Michael Cho and Chris Paxton!

Abstract:
Scaling mobile manipulation imitation learning is bottlenecked by expensive mobile robot teleoperation. We present Egocentric Mobile MAnipulation (EMMA), an end-to-end framework training mobile manipulation policies from human mobile manipulation data together with static robot data, sidestepping mobile teleoperation. To accomplish this, we co-train human full-body motion data with static robot data. In our experiments across three real-world tasks, EMMA demonstrates performance comparable to baselines trained on teleoperated mobile robot data (Mobile ALOHA), achieving higher or equivalent full-task success. We find that EMMA is able to generalize to new spatial configurations and scenes, and we observe positive performance scaling as we increase the hours of human data, opening new avenues for scalable robotic learning in real-world environments. Details of this project can be found at this https URL.

Project Page: https://ego-moma.github.io/
ArXiv: https://arxiv.org/abs/2509.04443
Original Thread on X
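The co-training recipe — mixing two data sources in every batch — can be sketched in a few lines. This is a generic illustration, not the EMMA code; the dataset names and the fixed 50/50 mixing ratio are assumptions for the example:

```python
# Toy sketch of co-training on two data sources: each batch mixes
# egocentric human data with static-robot demonstrations at a fixed
# ratio. Illustrative only; names and ratio are invented.
import random

random.seed(0)
robot_data = [("robot_obs_%d" % i, "robot_act_%d" % i) for i in range(100)]
human_data = [("human_obs_%d" % i, "human_act_%d" % i) for i in range(100)]

def sample_batch(batch_size=8, human_ratio=0.5):
    # Draw a fixed fraction of the batch from each source, then shuffle
    # so the optimizer sees an interleaved stream.
    n_human = int(batch_size * human_ratio)
    batch = random.sample(human_data, n_human)
    batch += random.sample(robot_data, batch_size - n_human)
    random.shuffle(batch)
    return batch

batch = sample_batch()
n_human = sum(1 for obs, _ in batch if obs.startswith("human"))
print(len(batch), n_human)  # 8 4
```

In practice the mixing ratio is a hyperparameter; the blurb's scaling result suggests the human-data fraction is where the leverage is.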

Dec 5, 2025 • 50min
Ep#49: Learning a Unified Policy for Position and Force Control in Legged Loco-Manipulation
Robots need to be able to apply pressure and make contact with objects as needed in order to accomplish their tasks. From compliance, to working safely around humans, to whole-body manipulation of heavy objects, combining force and position control can dramatically expand the capabilities of robots. This is especially true for legged robots, which have so much ability to exert forces on the world around them. But how do we train robots that can do this?

Baoxiong Jia tells us more in our discussion of his team’s recent Best Paper Award-winning work on learning a unified policy for position and force control, called UniFP.

To learn more, watch Episode #49 of RoboPapers, hosted by Michael Cho and Chris Paxton.

Abstract:
Robotic loco-manipulation tasks often involve contact-rich interactions with the environment, requiring the joint modeling of contact force and robot position. However, recent visuomotor policies often focus solely on learning position or force control, overlooking their co-learning. In this work, we propose the first unified policy for legged robots that jointly models force and position control, learned without reliance on force sensors. By simulating diverse combinations of position and force commands alongside external disturbance forces, we use reinforcement learning to learn a policy that estimates forces from historical robot states and compensates for them through position and velocity adjustments. This policy enables a wide range of manipulation behaviors under varying force and position inputs, including position tracking, force application, force tracking, and compliant interactions. Furthermore, we demonstrate that the learned policy enhances trajectory-based imitation learning pipelines by incorporating essential contact information through its force estimation module, achieving approximately 39.5% higher success rates across four challenging contact-rich manipulation tasks compared to position-control policies.
Extensive experiments on both a quadrupedal manipulator and a humanoid robot validate the versatility and robustness of the proposed policy across diverse scenarios.

Project Page: https://unified-force.github.io/
ArXiv: https://arxiv.org/abs/2505.20829
Post on X
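A classical way to see how one command stream can serve both objectives is a simple admittance rule: a desired contact force is converted into a position offset through a stiffness model. This is a textbook sketch, not the UniFP policy (which learns the mapping with RL and no force sensor); the gains and stiffness value are invented:

```python
# Toy admittance sketch: turn a force target into a position offset,
# so the same position-tracking loop can also apply forces.
# Illustrative only; stiffness and gain values are made up.

def admittance_command(x_desired, f_desired, f_measured,
                       stiffness=200.0, gain=0.01):
    # Shift the position target to shrink the force error; in contact,
    # pushing the target "into" the surface raises the applied force.
    force_error = f_desired - f_measured
    return x_desired + gain * force_error / stiffness

# Free space, no force target: pure position tracking.
print(admittance_command(0.30, 0.0, 0.0))   # 0.3
# In contact, commanding 10 N while measuring 4 N: target shifts inward.
print(admittance_command(0.30, 10.0, 4.0))
```

UniFP's twist, per the abstract, is estimating the force term from historical robot states instead of reading it from a sensor.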

Dec 4, 2025 • 56min
Ep#48: VisualMimic: Visual Humanoid Loco-Manipulation via Motion Tracking and Generation
Robots must often be able to move around and interact with objects in previously unseen environments to be useful. And the interaction part is important; to do this, they must be able to perceive and interact with the world using onboard sensing. Enter VisualMimic. Shaofeng Yin and Yanjie Ze show us how to use visual sim-to-real training for diverse loco-manipulation tasks, which can even handle diverse outdoor environments.

Learn more in Episode #48 of RoboPapers today, hosted by Michael Cho and Chris Paxton.

Abstract:
Humanoid loco-manipulation in unstructured environments demands tight integration of egocentric perception and whole-body control. However, existing approaches either depend on external motion capture systems or fail to generalize across diverse tasks. We introduce VisualMimic, a visual sim-to-real framework that unifies egocentric vision with hierarchical whole-body control for humanoid robots. VisualMimic combines a task-agnostic low-level keypoint tracker -- trained from human motion data via a teacher-student scheme -- with a task-specific high-level policy that generates keypoint commands from visual and proprioceptive input. To ensure stable training, we inject noise into the low-level policy and clip high-level actions using human motion statistics. VisualMimic enables zero-shot transfer of visuomotor policies trained in simulation to real humanoid robots, accomplishing a wide range of loco-manipulation tasks such as box lifting, pushing, football dribbling, and kicking. Beyond controlled laboratory settings, our policies also generalize robustly to outdoor environments. Videos are available at: this https URL.

Project Page: https://visualmimic.github.io/
ArXiv: https://arxiv.org/abs/2509.20322
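The hierarchy the abstract describes — a high-level policy emitting keypoint commands that are clipped before a low-level tracker executes them — can be sketched abstractly. All numbers and function bodies below are placeholders, not the VisualMimic networks:

```python
# Toy sketch of hierarchical control with action clipping.
# Illustrative only: the clip range stands in for human motion
# statistics, and the "policies" are hand-coded placeholders.

KEYPOINT_MIN, KEYPOINT_MAX = -0.5, 0.5  # stand-in for motion stats

def high_level_policy(obs):
    # Placeholder for the task-specific visual policy: raw keypoint
    # displacement commands from (here, fake) observations.
    return [0.8 * o for o in obs]

def clip(cmd):
    # Clipping high-level actions keeps commands inside the range the
    # low-level tracker was trained on, stabilizing training.
    return [min(max(c, KEYPOINT_MIN), KEYPOINT_MAX) for c in cmd]

def low_level_tracker(keypoints, joints, gain=0.5):
    # Placeholder tracker: move joints a fraction toward the targets.
    return [j + gain * (k - j) for j, k in zip(joints, keypoints)]

cmd = clip(high_level_policy([1.0, -0.25]))
joints = low_level_tracker(cmd, [0.0, 0.0])
print(cmd)     # [0.5, -0.2]
print(joints)  # [0.25, -0.1]
```

The separation is what makes the low level task-agnostic: it only ever sees keypoint targets, never task observations.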

Dec 2, 2025 • 49min
Ep#47: ResMimic: From General Motion Tracking to Humanoid Whole-Body Loco-Manipulation via Residual Learning
For robots to be useful, they can’t just dance — they must be able to physically interact with the world around them. Unfortunately, the sorts of motion-tracking policies you see performing dancing or martial arts are not really capable of the kind of precise, forceful interaction needed to perform useful work. Siheng and Yanjie join us to talk about ResMimic, their new paper which takes a general-purpose human motion-tracking policy and improves it with a residual policy to reliably interact with objects.

To learn more, watch Episode #47 of RoboPapers, hosted by Michael Cho and Chris Paxton, today!

Abstract:
Humanoid whole-body loco-manipulation promises transformative capabilities for daily service and warehouse tasks. While recent advances in general motion tracking (GMT) have enabled humanoids to reproduce diverse human motions, these policies lack the precision and object awareness required for loco-manipulation. To this end, we introduce ResMimic, a two-stage residual learning framework for precise and expressive humanoid control from human motion data. First, a GMT policy, trained on large-scale human-only motion, serves as a task-agnostic base for generating human-like whole-body movements. An efficient but precise residual policy is then learned to refine the GMT outputs to improve locomotion and incorporate object interaction. To further facilitate efficient training, we design (i) a point-cloud-based object tracking reward for smoother optimization, (ii) a contact reward that encourages accurate humanoid body-object interactions, and (iii) a curriculum-based virtual object controller to stabilize early training. We evaluate ResMimic in both simulation and on a real Unitree G1 humanoid.
Results show substantial gains in task success, training efficiency, and robustness over strong baselines.

Project Page: https://resmimic.github.io/
ArXiv: https://www.arxiv.org/abs/2510.05070
Original post on X: https://x.com/SihengZhao/status/1975985531298476316
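The residual idea itself is compact: keep the base tracking policy frozen and learn only an additive correction. Here is a toy sketch of that structure — the "policies" are hand-coded stand-ins, not the ResMimic networks:

```python
# Toy sketch of residual policy refinement: a frozen base policy
# proposes an action, and a small learned residual corrects it.
# Illustrative only; both policies here are hand-coded placeholders.

def base_policy(state):
    # Frozen general motion-tracking base: step toward a reference pose.
    reference = [1.0, 1.0]
    return [0.5 * (ref - s) for s, ref in zip(state, reference)]

def residual_policy(state):
    # Learned correction, typically small; here a fixed toy offset
    # standing in for an object-interaction adjustment.
    return [0.1, -0.05]

def act(state):
    # Final command = base action + residual correction.
    base = base_policy(state)
    res = residual_policy(state)
    return [b + r for b, r in zip(base, res)]

print(act([0.0, 0.0]))
```

Because the residual only has to model the *difference* from already-human-like motion, it can be much smaller and cheaper to train than a from-scratch policy.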

Dec 1, 2025 • 57min
Ep#46: ManiFlow: A General Robot Manipulation Policy via Consistency Flow Training
Improving robot's’ ability to learn from human demonstrations is key to getting better performance from them in a wide variety of tasks. Algorithmic improvements like consistency flow training and a new architecture which can leverage multimodal inputs, allows ManiFlow to substantially improve on prior work while also showing strong generalization to unseen environments and distractors. Ge Yan tells us more about how this works and how we can make imitation learning better.Find out more on RoboPapers #46, with Michael Cho and Chris Paxton!Abstract:We introduces ManiFlow, a visuomotor imitation learning policy for general robot manipulation that generates precise, high-dimensional actions conditioned on diverse visual, language and proprioceptive inputs. We leverage flow matching with consistency training to enable high-quality dexterous action generation in just 1-2 inference steps. To handle diverse input modalities efficiently, we propose DiT-X, a diffusion transformer architecture with adaptive cross-attention and AdaLN-Zero conditioning that enables fine-grained feature interactions between action tokens and multi-modal observations. ManiFlow demonstrates consistent improvements across diverse simulation benchmarks and nearly doubles success rates on real-world tasks across single-arm, bimanual, and humanoid robot setups with increasing dexterity. The extensive evaluation further demonstrates the strong robustness and generalizability of ManiFlow to novel objects and background changes, and highlights its strong scaling capability with larger-scale datasets.Project Page: https://maniflow-policy.github.io/ArXiV Paper: https://www.arxiv.org/pdf/2509.01819Thread on X This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit robopapers.substack.com
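Why does flow matching allow 1-2 inference steps? The intuition is that if the learned velocity field transports samples along (nearly) straight paths, plain Euler integration converges almost immediately. A toy 1D example — purely illustrative, with a hand-written velocity field rather than a trained network — makes the point:

```python
# Toy sketch of few-step flow-matching inference. For a point target a,
# the field v(x, t) = (a - x) / (1 - t) moves any start x0 to a along a
# straight line; consistency-style training aims at exactly this kind
# of straight, few-step-integrable path. Illustrative only.

def velocity(x, t, target):
    return (target - x) / (1.0 - t)

def sample(x0, target, steps):
    # Plain Euler integration of dx/dt = v(x, t) from t=0 to t=1.
    x, dt = x0, 1.0 / steps
    for i in range(steps):
        t = i * dt
        x = x + dt * velocity(x, t, target)
    return x

# A straight path reaches the target even with a single Euler step.
print(sample(0.0, 3.0, steps=1))  # 3.0
print(sample(0.0, 3.0, steps=2))  # 3.0
```

Real action distributions are not point targets, so trained fields only approximate straightness — which is why ManiFlow still uses 1-2 steps rather than exactly one.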

Nov 25, 2025 • 1h 1min
Ep#45: HERMES: Human-to-Robot Embodied Learning From Multi-Source Motion Data for Mobile Dexterous Manipulation
Just collecting manipulation data isn’t enough for robots — they need to be able to move around in the world, which has a whole different set of challenges from pure manipulation. And bringing navigation and manipulation together in a single framework is even more challenging.

Enter HERMES, from Zhecheng Yuan and Tianming Wei. This is a four-stage process in which human videos are used to set up an RL sim-to-real training pipeline that overcomes differences between robot and human kinematics, used together with a navigation foundation model to move around in a variety of environments.

To learn more, join us as Zhecheng Yuan and Tianming Wei tell us how they built their system to perform mobile dexterous manipulation from human videos.

Watch Episode #45 of RoboPapers today, hosted by Michael Cho and Chris Paxton!

Abstract:
Leveraging human motion data to impart robots with versatile manipulation skills has emerged as a promising paradigm in robotic manipulation. Nevertheless, translating multi-source human hand motions into feasible robot behaviors remains challenging, particularly for robots equipped with multi-fingered dexterous hands characterized by complex, high-dimensional action spaces. Moreover, existing approaches often struggle to produce policies capable of adapting to diverse environmental conditions. In this paper, we introduce HERMES, a human-to-robot learning framework for mobile bimanual dexterous manipulation. First, HERMES formulates a unified reinforcement learning approach capable of seamlessly transforming heterogeneous human hand motions from multiple sources into physically plausible robotic behaviors. Subsequently, to mitigate the sim2real gap, we devise an end-to-end, depth image-based sim2real transfer method for improved generalization to real-world scenarios.
Furthermore, to enable autonomous operation in varied and unstructured environments, we augment the navigation foundation model with a closed-loop Perspective-n-Point (PnP) localization mechanism, ensuring precise alignment of visual goals and effectively bridging autonomous navigation and dexterous manipulation. Extensive experimental results demonstrate that HERMES consistently exhibits generalizable behaviors across diverse, in-the-wild scenarios, successfully performing numerous complex mobile bimanual dexterous manipulation tasks.

Project Page: https://gemcollector.github.io/HERMES/
ArXiv: https://arxiv.org/abs/2508.20085

Nov 24, 2025 • 1h 5min
Ep#44: From Pixels to Predicates: Learning Symbolic World Models via Pretrained Vision-Language Models
Reasoning over long horizons would allow robots to generalize better to unseen environments and settings zero-shot. One mechanism for this kind of reasoning would be world models, but traditional video world models still tend to struggle with long horizons and are very data-intensive to train. But what if, instead of predicting images of the future, we predicted just the symbolic information necessary for reasoning?

Nishanth Kumar tells us about Pixels to Predicates, a method for symbol grounding which allows a VLM to plan sequences of robot skills to achieve unseen goals in previously unseen settings.

To find out more, watch Episode #44 of RoboPapers with Michael Cho and Chris Paxton now!

Abstract:
Our aim is to learn to solve long-horizon decision-making problems in complex robotics domains given low-level skills and a handful of short-horizon demonstrations containing sequences of images. To this end, we focus on learning abstract symbolic world models that facilitate zero-shot generalization to novel goals via planning. A critical component of such models is the set of symbolic predicates that define properties of and relationships between objects. In this work, we leverage pretrained vision-language models (VLMs) to propose a large set of visual predicates potentially relevant for decision-making, and to evaluate those predicates directly from camera images. At training time, we pass the proposed predicates and demonstrations into an optimization-based model-learning algorithm to obtain an abstract symbolic world model that is defined in terms of a compact subset of the proposed predicates. At test time, given a novel goal in a novel setting, we use the VLM to construct a symbolic description of the current world state, and then use a search-based planning algorithm to find a sequence of low-level skills that achieves the goal.
We demonstrate empirically across experiments in both simulation and the real world that our method can generalize aggressively, applying its learned world model to solve problems with a wide variety of object types, arrangements, numbers of objects, and visual backgrounds, as well as novel goals and much longer horizons than those seen at training time.

Project Page
ArXiv
Thread on X
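The test-time loop the abstract describes — symbolic state from the VLM, then search over skill operators — can be sketched with classical STRIPS-style planning. The predicate and operator names below are invented for illustration; in the paper both the predicates and the world model are learned:

```python
# Toy sketch of search-based planning over symbolic predicates.
# States are frozensets of true predicates; each operator has
# preconditions, add effects, and delete effects. Illustrative only.
from collections import deque

OPERATORS = {
    # name: (preconditions, add effects, delete effects)
    "pick(cup)": ({"on_table(cup)", "hand_empty"},
                  {"holding(cup)"},
                  {"on_table(cup)", "hand_empty"}),
    "place(cup, shelf)": ({"holding(cup)"},
                          {"on(cup, shelf)", "hand_empty"},
                          {"holding(cup)"}),
}

def bfs_plan(init, goal):
    # Breadth-first search over abstract states for a skill sequence.
    start = frozenset(init)
    queue, seen = deque([(start, [])]), {start}
    while queue:
        state, partial = queue.popleft()
        if goal <= state:
            return partial
        for name, (pre, add, delete) in OPERATORS.items():
            if pre <= state:
                nxt = frozenset((state - delete) | add)
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, partial + [name]))
    return None

plan = bfs_plan({"on_table(cup)", "hand_empty"}, {"on(cup, shelf)"})
print(plan)  # ['pick(cup)', 'place(cup, shelf)']
```

The planner never touches pixels: once the VLM has grounded the current state into predicates, the search is purely symbolic, which is what makes long horizons cheap.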


