RoboPapers

Chris Paxton and Michael Cho
Nov 20, 2025 • 58min

Ep#043: Attention-based map encoding for learning generalized legged locomotion

Walking robots can do all kinds of exciting things like dancing, running, and martial arts — but for them to be useful, they must be able to use their legs to handle terrain and move over obstacles, not just around them. So, how can we train walking policies for legged robots that are actually useful? Unlike most manipulation policies, these are trained with end-to-end, sim-to-real reinforcement learning, using attention. It turns out that maybe "attention is all you need" applies to locomotion, too. Chong Zhang joins us to explain more.

Watch Episode #43 of RoboPapers, hosted by Michael Cho and Chris Paxton, now to find out more.

Abstract: Dynamic locomotion of legged robots is a critical yet challenging topic in expanding the operational range of mobile robots. It requires precise planning when possible footholds are sparse, robustness against uncertainties and disturbances, and generalizability across diverse terrains. Although traditional model-based controllers excel at planning on complex terrains, they struggle with real-world uncertainties. Learning-based controllers offer robustness to such uncertainties but often lack precision on terrains with sparse steppable areas. Hybrid methods achieve enhanced robustness on sparse terrains by combining both approaches but are computationally demanding and constrained by the inherent limitations of model-based planners. To achieve generalized legged locomotion on diverse terrains while preserving the robustness of learning-based controllers, this paper proposes an attention-based map encoding conditioned on robot proprioception, which is trained as part of the controller using reinforcement learning. We show that the network learns to focus on steppable areas for future footholds when the robot dynamically navigates diverse and challenging terrains. We synthesized behaviors that exhibited robustness against uncertainties while enabling precise and agile traversal of sparse terrains. In addition, our method offers a way to interpret the topographical perception of a neural network. We have trained two controllers, for a 12-degrees-of-freedom quadrupedal robot and a 23-degrees-of-freedom humanoid robot, and tested the resulting controllers in the real world under various challenging indoor and outdoor scenarios, including ones unseen during training.

Paper in Science Robotics
ArXiV

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit robopapers.substack.com
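The attention-based map encoding described above can be illustrated with a toy single-head cross-attention step, in which the robot's proprioceptive state forms the query and heightmap patches form the keys and values. This is a minimal numpy sketch under our own assumptions (random weights, one head, invented dimensions), not the paper's controller:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a vector of attention scores."""
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_map_encoding(proprio, height_patches, Wq, Wk, Wv):
    """Cross-attention: proprioception queries terrain heightmap patches.
    The attention weights show which patches (e.g. steppable areas) the
    encoding focuses on, which is what makes the perception interpretable."""
    q = proprio @ Wq                       # query from robot state, (d,)
    k = height_patches @ Wk                # keys from map patches, (n, d)
    v = height_patches @ Wv                # values from map patches, (n, d)
    weights = softmax(k @ q / np.sqrt(q.shape[0]))   # distribution over patches
    return weights @ v, weights            # encoded map (d,), per-patch weights

rng = np.random.default_rng(0)
d_p, d_m, d, n = 8, 16, 32, 25             # proprio dim, patch dim, embed dim, 5x5 grid
enc, w = attention_map_encoding(
    rng.standard_normal(d_p), rng.standard_normal((n, d_m)),
    rng.standard_normal((d_p, d)), rng.standard_normal((d_m, d)),
    rng.standard_normal((d_m, d)))
print(enc.shape, w.shape)
```

In the paper this encoder is trained end-to-end with RL as part of the controller; here the weights are random purely to show the shapes and the attention distribution over patches.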
Nov 13, 2025 • 54min

Ep#42: General Intuition

Discover how AI can learn from video games to create predictive world models. The team shares insights on using diffusion models for better visual detail in training agents. They explore the challenges of multi-player dynamics and the importance of high-quality action labels. The discussion includes innovations for stability and speed in model training, as well as the advantages of transferring knowledge across different games. Learn about their mission to develop general agents for complex reasoning in three-dimensional spaces.
undefined
Nov 5, 2025 • 41min

Ep#41: HITTER: A Humanoid Table Tennis Robot via Hierarchical Planning and Learning

How can we make a humanoid robot play table tennis? The robot must hit a moving ball and return it over and over again, requiring precise whole-body control. Zhi Su tells us how he developed a hierarchical approach to planning and whole-body control that lets people play this game with a humanoid robot.

Watch Episode #41 of RoboPapers with Michael Cho and Chris Paxton now!

Abstract: Humanoid robots have recently achieved impressive progress in locomotion and whole-body control, yet they remain constrained in tasks that demand rapid interaction with dynamic environments through manipulation. Table tennis exemplifies such a challenge: with ball speeds exceeding 5 m/s, players must perceive, predict, and act within sub-second reaction times, requiring both agility and precision. To address this, we present a hierarchical framework for humanoid table tennis that integrates a model-based planner for ball trajectory prediction and racket target planning with a reinforcement learning-based whole-body controller. The planner determines striking position, velocity, and timing, while the controller generates coordinated arm and leg motions that mimic human strikes and maintain stability and agility across consecutive rallies. Moreover, to encourage natural movements, human motion references are incorporated during training. We validate our system on a general-purpose humanoid robot, achieving up to 106 consecutive shots with a human opponent and sustained exchanges against another humanoid. These results demonstrate real-world humanoid table tennis with sub-second reactive control, marking a step toward agile and interactive humanoid behaviors.

Project Page
ArXiV
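The model-based half of the hierarchy has a concrete core: predicting when and where the ball will cross a strike plane so the controller can be given a target. Here is a drag-free, bounce-free ballistic sketch of that step (our simplification with invented names; the paper's planner handles the full trajectory prediction):

```python
import numpy as np

G = 9.81  # gravitational acceleration, m/s^2

def predict_strike(p0, v0, strike_x):
    """Predict when and where the ball crosses the vertical plane x = strike_x,
    assuming drag-free, bounce-free ballistic flight. p0 and v0 are the ball's
    current position and velocity in (x, y, z)."""
    t = (strike_x - p0[0]) / v0[0]               # time to reach the plane
    y = p0[1] + v0[1] * t                        # straight-line lateral motion
    z = p0[2] + v0[2] * t - 0.5 * G * t ** 2     # gravity acts on height only
    return t, np.array([strike_x, y, z])

# ball at 1 m height moving 5 m/s toward a strike plane 2.5 m away
t, target = predict_strike(np.array([0.0, 0.0, 1.0]),
                           np.array([5.0, 0.2, 1.0]), strike_x=2.5)
print(round(t, 3), np.round(target, 3))
```

With ball speeds above 5 m/s, the resulting strike time is well under a second, which is why the whole pipeline has to run at sub-second reaction times.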
Nov 3, 2025 • 1h 1min

Ep#40: Daxo Robotics

How can we build robotic hands with truly superhuman dexterity? Daxo Robotics is developing a unique tendon-driven soft robot hand, which aims to be tougher and more capable than a traditional humanoid hand. Each finger consists of many different tendons, which act in concert to move or manipulate.

This is a special episode of RoboPapers where, instead of discussing a scientific paper, we talk to Tom Zhang, founder of Daxo Robotics, to learn both about his background and about how this one-of-a-kind robot hand design works.

Watch Episode #40 of RoboPapers with Michael Cho and Chris Paxton now!

Daxo Robotics website
Follow Tom on X
Watch this episode on YouTube
Oct 28, 2025 • 1h 21min

Ep#39: MolmoAct: An Action Reasoning Model that reasons in 3D space

Reasoning models have massively expanded what LLMs are capable of, but this hasn't necessarily carried over to robotics. Perhaps this is in part because robots need to reason over space, not just words and symbols; the robotics version of a reasoning model would need to think in 3D. That's the idea behind MolmoAct, an "Action Reasoning Model" that generates spatial plans in order to predict precise low-level robot actions. Jason Lee, Haoquan Fang, and Jiafei Duan told us more about their work.

Watch Episode #39 of RoboPapers, with Michael Cho and Chris Paxton, now!

Abstract: Reasoning is central to purposeful action, yet most robotic foundation models map perception and instructions directly to control, which limits adaptability, generalization, and semantic grounding. We introduce Action Reasoning Models (ARMs), a class of robotic foundation models that integrate perception, planning, and control through a structured three-stage pipeline. Our model, MolmoAct, encodes observations and instructions into depth-aware perception tokens, generates mid-level spatial plans as editable trajectory traces, and predicts precise low-level actions, enabling explainable and steerable behavior. MolmoAct-7B-D achieves strong performance across simulation and real-world settings: 70.5% zero-shot accuracy on SimplerEnv Visual Matching tasks, surpassing closed-source Pi-0 and GR00T N1.5; 86.6% average success on LIBERO, including an additional 6.3% gain over ThinkAct on long-horizon tasks; and in real-world fine-tuning, an additional 10% (single-arm) and an additional 22.7% (bimanual) task progression over Pi-0-FAST. It also outperforms baselines by an additional 23.3% on out-of-distribution generalization and achieves top human-preference scores for open-ended instruction following and trajectory steering. Furthermore, we release, for the first time, the MolmoAct Dataset -- a mid-training robot dataset comprising over 10,000 high-quality robot trajectories across diverse scenarios and tasks. Training with this dataset yields an average 5.5% improvement in general performance over the base model. We release all model weights, training code, our collected dataset, and our action reasoning dataset, establishing MolmoAct as both a state-of-the-art robotics foundation model and an open blueprint for building ARMs that transform perception into purposeful action through structured reasoning.

Blogpost: this https URL
Paper
Project website
Original post on X
Oct 24, 2025 • 52min

Ep#38: Q Learning is Not Yet Scalable

Offline reinforcement learning is crucial for robotics, but does it scale? We talk to Seohong, who discusses how, for long-horizon manipulation problems, the answer may be no — at least not yet. But there are tricks you can use to make it work effectively.

Watch Episode #38 of RoboPapers with Michael Cho and Chris Paxton now!

Abstract: In this work, we study the scalability of offline reinforcement learning (RL) algorithms. In principle, a truly scalable offline RL algorithm should be able to solve any given problem, regardless of its complexity, given sufficient data, compute, and model capacity. We investigate if and how current offline RL algorithms match up to this promise on diverse, challenging, previously unsolved tasks, using datasets up to 1000x larger than typical offline RL datasets. We observe that despite scaling up data, many existing offline RL algorithms exhibit poor scaling behavior, saturating well below the maximum performance. We hypothesize that the horizon is the main cause behind the poor scaling of offline RL. We empirically verify this hypothesis through several analysis experiments, showing that long horizons indeed present a fundamental barrier to scaling up offline RL. We then show that various horizon reduction techniques substantially enhance scalability on challenging tasks. Based on our insights, we also introduce a minimal yet scalable method named SHARSA that effectively reduces the horizon. SHARSA achieves the best asymptotic performance and scaling behavior among our evaluation methods, showing that explicitly reducing the horizon unlocks the scalability of offline RL. Code: this https URL

And from the blog post: Over the past few years, we've seen that next-token prediction scales, denoising diffusion scales, contrastive learning scales, and so on, all the way to the point where we can train models with billions of parameters with a scalable objective that can eat up as much data as we can throw at it. Then, what about reinforcement learning (RL)? Does RL also scale like all the other objectives?

ArXiV
Blog Post
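Why the horizon matters can be seen in a tiny tabular experiment: with 1-step TD backups, value information crawls backward one state per sweep, while n-step backups propagate it n states per sweep. This toy chain is our own illustration of the horizon-reduction effect, not the paper's benchmark or SHARSA itself:

```python
import numpy as np

def td_chain(n_states=50, n_step=1, sweeps=20, alpha=0.5):
    """Tabular n-step TD on a deterministic chain with a single reward of 1
    at the goal (gamma = 1). Returns the value estimates after `sweeps`
    full passes over the states."""
    V = np.zeros(n_states)                 # values for non-terminal states
    for _ in range(sweeps):
        for s in range(n_states):
            end = s + n_step
            if end >= n_states:            # the n-step lookahead reaches the goal
                target = 1.0
            else:                          # bootstrap from the value n steps ahead
                target = V[end]
            V[s] += alpha * (target - V[s])
    return V

one_step = td_chain(n_step=1)
ten_step = td_chain(n_step=10)
# after the same number of sweeps, the start state has learned nothing
# from 1-step backups but has a solid estimate from 10-step backups
print(round(one_step[0], 3), round(ten_step[0], 3))
```

The same budget of updates goes much further when each backup spans more of the horizon, which is the intuition behind the horizon reduction techniques the paper studies.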
Oct 21, 2025 • 45min

Ep#37: AMPLIFY: Actionless Motion Priors for Robot Learning from Videos

Robotics has a data problem, in that robotics data is rare. While human video is quite common, it's not usually directly usable for robots, for a variety of reasons, most significantly that it's missing explicit, accurate robot actions. Instead, Jeremy proposes that we predict keypoint trajectories: basically, how any given point on an object will move as a robot performs a task. This lets us use action-free human video to train robot skills.

Learn more by watching Episode #37 of RoboPapers with Michael Cho and Chris Paxton.

Abstract: Action-labeled data for robotics is scarce and expensive, limiting the generalization of learned policies. In contrast, vast amounts of action-free video data are readily available, but translating these observations into effective policies remains a challenge. We introduce AMPLIFY, a novel framework that leverages large-scale video data by encoding visual dynamics into compact, discrete motion tokens derived from keypoint trajectories. Our modular approach separates visual motion prediction from action inference, decoupling the challenges of learning what motion defines a task from how robots can perform it. We train a forward dynamics model on abundant action-free videos and an inverse dynamics model on a limited set of action-labeled examples, allowing for independent scaling. Extensive evaluations demonstrate that the learned dynamics are both accurate, achieving up to 3.7x better MSE and over 2.5x better pixel prediction accuracy compared to prior approaches, and broadly useful. In downstream policy learning, our dynamics predictions enable a 1.2-2.2x improvement in low-data regimes, a 1.4x average improvement by learning from action-free human videos, and the first generalization to LIBERO tasks from zero in-distribution action data. Beyond robotic control, we find the dynamics learned by AMPLIFY to be a versatile latent world model, enhancing video prediction quality. Our results present a novel paradigm leveraging heterogeneous data sources to build efficient, generalizable world models. More information can be found at this https URL.

ArXiV
Project Page
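The decoupling the abstract describes, a forward dynamics model trained on action-free video and an inverse dynamics model fit on a small action-labeled set, can be sketched with stand-in linear models. These classes and the toy data are ours, not AMPLIFY's networks:

```python
import numpy as np

class ForwardDynamics:
    """Predicts next keypoint positions from current ones; in AMPLIFY this
    role is learned from abundant action-free video (here: a linear stand-in)."""
    def __init__(self, n_keypoints):
        self.W = np.eye(2 * n_keypoints)        # identity init: predicts no motion
    def predict(self, keypoints):               # keypoints: (n_keypoints, 2)
        return (self.W @ keypoints.ravel()).reshape(keypoints.shape)

class InverseDynamics:
    """Maps keypoint motion to a robot action; needs only a small
    action-labeled dataset to fit (here: ordinary least squares)."""
    def fit(self, motions, actions):            # (N, d_motion), (N, d_action)
        self.W, *_ = np.linalg.lstsq(motions, actions, rcond=None)
        return self
    def act(self, motion):
        return motion @ self.W

# toy check: the inverse model recovers a linear motion-to-action map
rng = np.random.default_rng(1)
true_W = rng.standard_normal((8, 3))
motions = rng.standard_normal((100, 8))
inv = InverseDynamics().fit(motions, motions @ true_W)
print(np.allclose(inv.act(motions[0]), motions[0] @ true_W))
```

Because the two models never share parameters, each can be scaled with whichever data source it has in abundance, which is the independent-scaling point made above.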
Oct 17, 2025 • 49min

Ep#36: Whole-Body Conditioned Egocentric Video Prediction

Learning a true world model for a human body means taking high-dimensional actions representing the full body pose — the locations of hands and feet, for example — and using them to predict the effects of each action. This would allow for an unprecedented level of simulation of the effects of each action on the world, but this level of information is usually not available. With a new dataset from Meta, however, Yutong Bai and co-authors were able to train just such a world model, using detailed 3D information of whole human bodies in different apartments and predicting the results of granular actions.

Watch Episode #36 of RoboPapers, co-hosted by Michael Cho and Chris Paxton, now to find out more.

Abstract: We train models to Predict Egocentric Video from human Actions (PEVA), given the past video and an action represented by the relative 3D body pose. By conditioning on kinematic pose trajectories, structured by the joint hierarchy of the body, our model learns to simulate how physical human actions shape the environment from a first-person point of view. We train an auto-regressive conditional diffusion transformer on Nymeria, a large-scale dataset of real-world egocentric video and body pose capture. We further design a hierarchical evaluation protocol with increasingly challenging tasks, enabling a comprehensive analysis of the model's embodied prediction and control abilities. Our work represents an initial attempt to tackle the challenges of modeling complex real-world environments and embodied agent behaviors with video prediction from the perspective of a human.

Project Site
ArXiV
Thread on X
Oct 8, 2025 • 1h 2min

Ep#35: Reinforcement Learning with Action Chunking

Today, most robot learning from demonstration predicts action chunks: short robot action trajectories rather than single actions. Doing this is crucial for better performance and has all kinds of advantages. But how can we bring these advantages to reinforcement learning? We talked to Colin Li and Paul Zhou to find out more.

Abstract: We present Q-chunking, a simple yet effective recipe for improving reinforcement learning (RL) algorithms for long-horizon, sparse-reward tasks. Our recipe is designed for the offline-to-online RL setting, where the goal is to leverage an offline prior dataset to maximize the sample-efficiency of online learning. Effective exploration and sample-efficient learning remain central challenges in this setting, as it is not obvious how the offline data should be utilized to acquire a good exploratory policy. Our key insight is that action chunking, a technique popularized in imitation learning where sequences of future actions are predicted rather than a single action at each timestep, can be applied to temporal difference (TD)-based RL methods to mitigate the exploration challenge. Q-chunking adopts action chunking by directly running RL in a 'chunked' action space, enabling the agent to (1) leverage temporally consistent behaviors from offline data for more effective online exploration and (2) use unbiased n-step backups for more stable and efficient TD learning. Our experimental results demonstrate that Q-chunking exhibits strong offline performance and online sample efficiency, outperforming prior best offline-to-online methods on a range of long-horizon, sparse-reward manipulation tasks.

ArXiV
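The core idea, running TD-based RL directly in a chunked action space, can be sketched as an environment wrapper that executes k primitive actions per agent decision, so a one-step backup on the wrapper is an unbiased k-step backup underneath. This is our sketch, not the authors' code, and the toy environment is invented:

```python
class CountEnv:
    """Invented toy environment: reward 1 per primitive step, done after 5."""
    def __init__(self):
        self.t = 0
    def step(self, action):
        self.t += 1
        return self.t, 1.0, self.t >= 5         # obs, reward, done

class ChunkedActionEnv:
    """Executes a chunk of k primitive actions per agent decision and returns
    the discounted sum of their rewards. A one-step TD backup on this wrapper
    is therefore an unbiased k-step backup on the underlying environment."""
    def __init__(self, env, gamma=0.99):
        self.env, self.gamma = env, gamma
    def step(self, action_chunk):
        total, discount, done, obs = 0.0, 1.0, False, None
        for a in action_chunk:
            obs, r, done = self.env.step(a)
            total += discount * r
            discount *= self.gamma
            if done:                             # stop mid-chunk at episode end
                break
        return obs, total, done

env = ChunkedActionEnv(CountEnv(), gamma=1.0)
obs, r, done = env.step([0, 0, 0])              # one decision = 3 primitive steps
print(obs, r, done)
```

Because the chunk is committed to in full, the backup needs no off-policy correction, which is why the n-step returns stay unbiased.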
Oct 3, 2025 • 48min

Ep#34: RoboArena

Evaluating robot policies is hard. Every lab has a different robot, and reproducible evaluations are really challenging. This makes it hard to know which methods for learning robot policies are likely to perform best in real-world scenarios. Taking a page from LLM evaluations like Chatbot Arena, RoboArena aims to address this problem by crowdsourcing evaluations across a network of different evaluators.

Watch Episode #34 of RoboPapers, hosted by Chris Paxton and Michael Cho, now to learn more from authors Pranav Atreya and Karl Pertsch.

Abstract: Comprehensive, unbiased, and comparable evaluation of modern generalist policies is uniquely challenging: existing approaches for robot benchmarking typically rely on heavy standardization, either by specifying fixed evaluation tasks and environments or by hosting centralized "robot challenges", and do not readily scale to evaluating generalist policies across a broad range of tasks and environments. In this work, we propose RoboArena, a new approach for scalable evaluation of generalist robot policies in the real world. Instead of standardizing evaluations around fixed tasks, environments, or locations, we propose to crowd-source evaluations across a distributed network of evaluators. Importantly, evaluators can freely choose the tasks and environments they evaluate on, enabling easy scaling of diversity, but they are required to perform double-blind evaluations over pairs of policies. Then, by aggregating preference feedback from pairwise comparisons across diverse tasks and environments, we can derive a ranking of policies. We instantiate our approach across a network of evaluators at seven academic institutions using the DROID robot platform. Through more than 600 pairwise real-robot evaluation episodes across seven generalist policies, we demonstrate that our crowd-sourced approach can more accurately rank the performance of existing generalist policies than conventional, centralized evaluation approaches, while being more scalable, resilient, and trustworthy. We open our evaluation network to the community and hope that it can enable more accessible comparisons of generalist robot policies.

Project Site
ArXiV
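Aggregating double-blind pairwise preferences into a ranking is classically done with a Bradley-Terry model fit by minorization-maximization. Whether RoboArena uses exactly this estimator is our assumption, so treat this as a generic preference-aggregation sketch with invented data:

```python
import numpy as np

def bradley_terry(n_policies, comparisons, iters=200):
    """Fit Bradley-Terry skill scores from (winner, loser) pairs using
    simple minorization-maximization updates."""
    wins = np.zeros((n_policies, n_policies))
    for winner, loser in comparisons:
        wins[winner, loser] += 1
    p = np.ones(n_policies)
    for _ in range(iters):
        for i in range(n_policies):
            total_wins = wins[i].sum()
            # expected-games denominator from the current score estimates
            den = sum((wins[i, j] + wins[j, i]) / (p[i] + p[j])
                      for j in range(n_policies) if j != i)
            if den > 0:
                p[i] = total_wins / den
        p /= p.sum()                        # fix the scale ambiguity
    return p

# policy 0 usually beats 1; policy 1 usually beats 2; 0 and 2 never meet
comps = [(0, 1)] * 8 + [(1, 0)] * 2 + [(1, 2)] * 7 + [(2, 1)] * 3
scores = bradley_terry(3, comps)
print(np.argsort(-scores))                  # policy indices, best first
```

Note that a global ranking emerges even though policies 0 and 2 were never compared directly, which is what lets evaluators choose their own tasks and environments while still producing one leaderboard.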
