

RoboPapers
Chris Paxton and Michael Cho
Chris Paxton & Michael Cho geek out over robotics papers with the papers' authors. robopapers.substack.com
Episodes

Feb 19, 2026 • 1h 9min
Ep#63: NovaFlow: Zero-Shot Manipulation via Actionable Flow from Generated Videos
The holy grail of robotics is to be able to perform previously-unseen, out-of-distribution manipulation tasks “zero shot” in a new environment. NovaFlow proposes an approach which (1) generates a video, (2) computes predicted flow — how points move through the scene — and (3) uses this flow as an objective to generate a motion. Using this procedure, NovaFlow generates motions in unseen scenes, for unseen tasks, and can transfer across embodiments. To learn more, we are joined by Hongyu Li and Jiahui Fu from RAI.

Watch Episode #63 of RoboPapers now to learn more!

Abstract:
Enabling robots to execute novel manipulation tasks zero-shot is a central goal in robotics. Most existing methods assume in-distribution tasks or rely on fine-tuning with embodiment-matched data, limiting transfer across platforms. We present NovaFlow, an autonomous manipulation framework that converts a task description into an actionable plan for a target robot without any demonstrations. Given a task description, NovaFlow synthesizes a video using a video generation model and distills it into 3D actionable object flow using off-the-shelf perception modules. From the object flow, it computes relative poses for rigid objects and realizes them as robot actions via grasp proposals and trajectory optimization. For deformable objects, this flow serves as a tracking objective for model-based planning with a particle-based dynamics model. By decoupling task understanding from low-level control, NovaFlow naturally transfers across embodiments. We validate on rigid, articulated, and deformable object manipulation tasks using a table-top Franka arm and a Spot quadrupedal mobile robot, and achieve effective zero-shot execution without demonstrations or embodiment-specific training.

Learn more:
Project site: https://novaflow.lhy.xyz/
arXiv: https://arxiv.org/abs/2510.08568

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit robopapers.substack.com
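For rigid objects, the abstract says NovaFlow computes relative poses from 3D object flow and realizes them via grasp proposals and trajectory optimization. A standard way to recover a rigid transform from matched 3D points (each point's position before and after the flow) is the Kabsch algorithm. The NumPy sketch below is illustrative only, not the authors' implementation; the function name and interface are invented here.

```python
import numpy as np

def rigid_transform_from_flow(p_start, p_end):
    """Recover rotation R and translation t such that R @ p + t maps
    p_start onto p_end (least-squares, via the Kabsch algorithm).

    p_start, p_end: (N, 3) arrays of corresponding 3D points; the
    object flow supplies each tracked point's start and end position.
    """
    c0 = p_start.mean(axis=0)            # centroids
    c1 = p_end.mean(axis=0)
    H = (p_start - c0).T @ (p_end - c1)  # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))     # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = c1 - R @ c0
    return R, t
```

The recovered pose would then serve as a goal for the downstream grasp proposal and trajectory-optimization steps described in the abstract.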

Feb 11, 2026 • 60min
Ep#62: PolaRiS: Scalable Real-to-Sim Evaluations for Generalist Robot Policies
Evaluating robot policies is hard. Ideally, instead of testing every new policy on a real robot, you could test in simulation; but simulations rarely correlate well with real-world performance. In order to make good, useful simulations, you need to spend a great deal of time and effort.

That’s where PolaRiS comes in: it’s a toolkit that lets you take a short video of a real scene and turn it into a high-fidelity simulation. It provides what you need to build a good evaluation environment, and it “ships” with off-the-shelf environments that already show strong sim-to-real correlation, meaning that they can be used to inform policy performance.

Arhan Jain and Karl Pertsch join us to talk about what they have built, why, and how you can use it.

Watch Episode #62 of RoboPapers, with Chris Paxton and Jiafei Duan, now!

Abstract:
A significant challenge for robot learning research is our ability to accurately measure and compare the performance of robot policies. Benchmarking in robotics is historically challenging due to the stochasticity, reproducibility, and time-consuming nature of real-world rollouts. This challenge is exacerbated for recent generalist policies, which have to be evaluated across a wide variety of scenes and tasks. Evaluation in simulation offers a scalable complement to real-world evaluations, but the visual and physical domain gap between existing simulation benchmarks and the real world has made them an unreliable signal for policy improvement. Furthermore, building realistic and diverse simulated environments has traditionally required significant human effort and expertise. To bridge the gap, we introduce Policy Evaluation and Environment Reconstruction in Simulation (PolaRiS), a scalable real-to-sim framework for high-fidelity simulated robot evaluation. PolaRiS utilizes neural reconstruction methods to turn short video scans of real-world scenes into interactive simulation environments. Additionally, we develop a simple simulation data co-training recipe that bridges remaining real-to-sim gaps and enables zero-shot evaluation in unseen simulation environments. Through extensive paired evaluations between simulation and the real world, we demonstrate that PolaRiS evaluations provide a much stronger correlation to real-world generalist policy performance than existing simulated benchmarks. Its simplicity also enables rapid creation of diverse simulated environments. As such, this work takes a step towards distributed and democratized evaluation for the next generation of robotic foundation models.

Learn more:
Project Page: https://polaris-evals.github.io/
arXiv: https://arxiv.org/abs/2512.16881
This post on X
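The key claim above is that PolaRiS evaluations correlate strongly with real-world policy performance. One simple way to quantify sim-to-real correlation from paired evaluations, shown here for intuition rather than as the paper's actual metric code, is the Pearson correlation between per-policy success rates measured in simulation and on the real robot:

```python
import numpy as np

def pearson(sim_scores, real_scores):
    """Pearson correlation between per-policy success rates measured
    in simulation and in the real world. Values near 1.0 mean the
    simulated benchmark tracks real-world performance."""
    sim = np.asarray(sim_scores, float) - np.mean(sim_scores)
    real = np.asarray(real_scores, float) - np.mean(real_scores)
    return float(sim @ real / np.sqrt((sim @ sim) * (real @ real)))
```

Each entry pairs one policy's simulated score with its real-world score; a high correlation across many policies is what makes a simulated benchmark a useful proxy signal for policy improvement.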

Feb 4, 2026 • 55min
Ep#61: 1x World Model
Daniel Ho, Director of Evaluations at 1X, builds world-model-based control for humanoid robots. He describes using internet and egocentric videos to train world models that act as imagined worlds for generating zero-shot robot behaviors. The conversation covers how prompts and action labels guide imagined rollouts, training recipes across web, egocentric, and robot data, evaluation with learned simulators, and challenges like contact-rich tasks and latency.

Jan 28, 2026 • 1h 14min
Ep#60: Sim-to-Real Manipulation with VIRAL and DoorMan
For robots to be useful, they must be able to interact with a wide variety of environments; and yet, scaling interaction data is difficult, expensive, and time-consuming. Instead, much research revolves around sim-to-real manipulation — but mostly this has not been mobile manipulation.

Recently, though, this has begun to change. Two recent papers from Tairan He and Haoru Xue show us how to unlock the potential of this technique, building policies which, without any real data at all, can move objects around and open doors in the real world with a humanoid robot.

Watch Episode #60 of RoboPapers now to learn more, hosted by Chris Paxton and Jiafei Duan. In this episode, we cover two papers: first, VIRAL: Visual Sim-to-Real at Scale for Humanoid Loco-Manipulation; and second, DoorMan: Opening the Sim-to-Real Door for Humanoid Pixel-to-Action Policy Transfer.

Paper #1: VIRAL
Abstract:
A key barrier to the real-world deployment of humanoid robots is the lack of autonomous loco-manipulation skills. We introduce VIRAL, a visual sim-to-real framework that learns humanoid loco-manipulation entirely in simulation and deploys it zero-shot to real hardware. VIRAL follows a teacher-student design: a privileged RL teacher, operating on full state, learns long-horizon loco-manipulation using a delta action space and reference state initialization. A vision-based student policy is then distilled from the teacher via large-scale simulation with tiled rendering, trained with a mixture of online DAgger and behavior cloning. We find that compute scale is critical: scaling simulation to tens of GPUs (up to 64) makes both teacher and student training reliable, while low-compute regimes often fail. To bridge the sim-to-real gap, VIRAL combines large-scale visual domain randomization (over lighting, materials, camera parameters, image quality, and sensor delays) with real-to-sim alignment of the dexterous hands and cameras. Deployed on a Unitree G1 humanoid, the resulting RGB-based policy performs continuous loco-manipulation for up to 54 cycles, generalizing to diverse spatial and appearance variations without any real-world fine-tuning, and approaching expert-level teleoperation performance. Extensive ablations dissect the key design choices required to make RGB-based humanoid loco-manipulation work in practice.

Project page: https://viral-humanoid.github.io/
arXiv: https://arxiv.org/abs/2511.15200
Original thread on X:

Paper #2: DoorMan
Abstract:
Recent progress in GPU-accelerated, photorealistic simulation has opened a scalable data-generation path for robot learning, where massive physics and visual randomization allow policies to generalize beyond curated environments. Building on these advances, we develop a teacher-student-bootstrap learning framework for vision-based humanoid loco-manipulation, using articulated-object interaction as a representative high-difficulty benchmark. Our approach introduces a staged-reset exploration strategy that stabilizes long-horizon privileged-policy training, and a GRPO-based fine-tuning procedure that mitigates partial observability and improves closed-loop consistency in sim-to-real RL. Trained entirely on simulation data, the resulting policy achieves robust zero-shot performance across diverse door types and outperforms human teleoperators by up to 31.7% in task completion time under the same whole-body control stack. This represents the first humanoid sim-to-real policy capable of diverse articulated loco-manipulation using pure RGB perception.

Project page: https://doorman-humanoid.github.io/
arXiv: https://arxiv.org/abs/2512.01061

Jan 21, 2026 • 49min
Ep#59: SAIL: Faster-than-Demonstration Execution of Imitation Learning Policies
Teleoperating a robot is hard. This means that when performing a robot task via teleoperation — say, to collect examples for training a robot policy — it’s almost unavoidably slower than you would like, below the capabilities of either the human expert alone or the robot performing the task. Wouldn’t it be great if there was a way to fix this?

Unfortunately, it’s harder than it looks. You can’t just execute faster, as this alters the distribution of environment states the policy will encounter. Nadun Ranawaka Arachchige and Zhenyang Chen propose Speed-Adaptive Imitation Learning (SAIL), which adds error-adaptive guidance, adapts execution speed according to task structure, predicts controller-invariant action targets to ensure robustness across execution speeds, and explicitly models delays from, for example, sensor latency.

Watch Episode #59 of RoboPapers, with Chris Paxton and Michael Cho, to learn more!

Abstract:
Offline Imitation Learning (IL) methods such as Behavior Cloning are effective at acquiring complex robotic manipulation skills. However, existing IL-trained policies are confined to executing the task at the same speed as shown in demonstration data. This limits the task throughput of a robotic system, a critical requirement for applications such as industrial automation. In this paper, we introduce and formalize the novel problem of enabling faster-than-demonstration execution of visuomotor policies and identify fundamental challenges in robot dynamics and state-action distribution shifts. We instantiate the key insights as SAIL (Speed Adaptation for Imitation Learning), a full-stack system integrating four tightly-connected components: (1) a consistency-preserving action inference algorithm for smooth motion at high speed, (2) high-fidelity tracking of controller-invariant motion targets, (3) adaptive speed modulation that dynamically adjusts execution speed based on motion complexity, and (4) action scheduling to handle real-world system latencies. Experiments on 12 tasks across simulation and two real, distinct robot platforms show that SAIL achieves up to a 4x speedup over demonstration speed in simulation and up to 3.2x speedup in the real world. Additional detail is available at this https URL

Learn more:
Project site: https://nadunranawaka1.github.io/sail-policy/
arXiv: https://arxiv.org/abs/2506.11948
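Component (3) above, adaptive speed modulation, adjusts execution speed based on motion complexity. As a toy illustration of the idea (not SAIL's actual algorithm; the curvature heuristic, gains, and function name here are invented), one could retime a demonstrated path so that straight stretches are sped up while sharp bends stay near demonstration speed:

```python
import numpy as np

def retime_waypoints(points, base_speedup=3.0, curvature_gain=5.0, dt=0.1):
    """Retime a demonstrated path: run faster on straight segments and
    slow down where the path bends (a crude proxy for motion complexity).

    points: (N, D) waypoints recorded at a fixed interval `dt`.
    Returns the new per-waypoint timestamps.
    """
    p = np.asarray(points, float)
    v = np.diff(p, axis=0)                 # per-segment displacement
    angles = np.zeros(len(v))              # turning angle between segments
    for i in range(1, len(v)):
        a, b = v[i - 1], v[i]
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        if denom > 1e-9:
            angles[i] = np.arccos(np.clip(a @ b / denom, -1.0, 1.0))
    # Per-segment speedup, shrinking toward 1x as curvature grows.
    speedup = np.maximum(1.0, base_speedup / (1.0 + curvature_gain * angles))
    seg_time = dt / speedup
    return np.concatenate([[0.0], np.cumsum(seg_time)])
```

On a perfectly straight demonstration this yields the full base speedup; near sharp turns it falls back toward demonstration speed, mirroring the intuition that complexity should gate execution speed.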

Jan 14, 2026 • 1h 12min
Ep#58: RL-100: Performant Robotic Manipulation with Real-World Reinforcement Learning
In order for robots to be deployed in the real world, performing tasks of real value, they must be reliable. Unfortunately, most robotic demos work maybe 70-80% of the time at best. The way to get better reliability is to do real-world reinforcement learning: having the robot teach itself how to perform the task up to a high level of success.

The key is to start with a core of expert human data, use that to train a policy, then iteratively improve it, finishing with on-policy reinforcement learning. Kun Lei talks through a unified framework for imitation and reinforcement learning based on PPO, which enables this improvement process.

In this episode, Kun Lei explains the theory behind his reinforcement learning method and how it allowed his robot to run in a shopping mall, juicing oranges for seven hours at a time, among experiments on a wide variety of tasks and embodiments.

Watch Episode 58 of RoboPapers now, hosted by Michael Cho and Chris Paxton!

Abstract:
Real-world robotic manipulation in homes and factories demands reliability, efficiency, and robustness that approach or surpass the performance of skilled human operators. We present RL-100, a real-world reinforcement learning framework built on diffusion-based visuomotor policies. RL-100 unifies imitation and reinforcement learning under a single PPO-style objective applied within the denoising process, yielding conservative and stable policy improvements across both offline and online stages. To meet deployment latency constraints, we employ a lightweight consistency distillation procedure that compresses multi-step diffusion into a one-step controller for high-frequency control. The framework is task-, embodiment-, and representation-agnostic, and supports both single-action outputs and action-chunking control. We evaluate RL-100 on seven diverse real-robot manipulation tasks, ranging from dynamic pushing and agile bowling to pouring, cloth folding, unscrewing, and multi-stage juicing. RL-100 attains 100% success across evaluated trials, achieving 900 out of 900 successful episodes, including up to 250 out of 250 consecutive trials on one task, and matches or surpasses expert teleoperators in time-to-completion. Without retraining, a single policy attains approximately 90% zero-shot success under environmental and dynamics shifts, adapts in a few-shot regime to significant task variations (86.7%), and remains robust to aggressive human perturbations (about 95%). In a public shopping-mall deployment, the juicing robot served random customers continuously for roughly seven hours without failure. Together, these results suggest a practical path toward deployment-ready robot learning: start from human priors, align training objectives with human-grounded metrics, and reliably extend performance beyond human demonstrations.

Learn more:
Project Page: https://lei-kun.github.io/RL-100/
arXiv: https://arxiv.org/abs/2510.14830
Original thread on X:
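The abstract describes a single PPO-style objective unifying imitation and reinforcement learning. Applying it inside the diffusion denoising process is the paper's contribution; for intuition only, here is the standard PPO clipped surrogate in its generic form (a sketch, not the RL-100 training code):

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Standard PPO clipped surrogate loss (to be minimized).

    The probability ratio between new and old policies is clipped to
    [1 - eps, 1 + eps], which keeps policy updates conservative --
    the same stability property the abstract highlights.
    """
    ratio = np.exp(logp_new - logp_old)        # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Pessimistic (elementwise minimum) of the two surrogates.
    return float(-np.mean(np.minimum(unclipped, clipped)))
```

When the new policy equals the old one, the loss reduces to the negative mean advantage; large ratio changes are capped, bounding how far any single update can move the policy.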

Jan 6, 2026 • 51min
Ep#57: Learning Dexterity from Human Videos with Gen2Act and SPIDER
Homanga Bharadhwaj, research scientist at Meta Reality Labs and incoming Johns Hopkins assistant professor, works on teaching robots from human video. He discusses Gen2Act, which generates human videos from language to guide robot actions. He also covers SPIDER, which retargets human hand and object motion through simulation for dexterous, contact-rich tasks.

Dec 22, 2025 • 46min
Ep#56: GSWorld: Closed-Loop Photo-Realistic Simulation Suite for Robotic Manipulation
It’s long been a dream of roboticists to be able to teach a robot in simulation, so as to skip the long and expensive process of collecting large amounts of real-world training data. However, building simulations for robot tasks is extremely hard. Ideally, we could go from real data to a useful simulation.

This is exactly what Guangqi Jiang and his co-authors do: they use 3D Gaussian splatting to reconstruct scenes, letting them create interactive environments that, when combined with a physics engine, allow for training robot policies that show zero-shot sim-to-real transfer (i.e., using no real-world demonstrations).

To learn more, watch Episode 56 of RoboPapers with Michael Cho and Chris Paxton now!

Abstract:
This paper presents GSWorld, a robust, photo-realistic simulator for robotics manipulation that combines 3D Gaussian Splatting with physics engines. Our framework advocates "closing the loop" of developing manipulation policies with reproducible evaluation of policies learned from real-robot data and sim2real policy training without using real robots. To enable photo-realistic rendering of diverse scenes, we propose a new asset format, which we term GSDF (Gaussian Scene Description File), that infuses Gaussian-on-Mesh representation with robot URDF and other objects. With a streamlined reconstruction pipeline, we curate a database of GSDF that contains 3 robot embodiments for single-arm and bimanual manipulation, as well as more than 40 objects. Combining GSDF with physics engines, we demonstrate several immediate interesting applications: (1) learning zero-shot sim2real pixel-to-action manipulation policy with photo-realistic rendering, (2) automated high-quality DAgger data collection for adapting policies to deployment environments, (3) reproducible benchmarking of real-robot manipulation policies in simulation, (4) simulation data collection by virtual teleoperation, and (5) zero-shot sim2real visual reinforcement learning. Website: this https URL.

Learn more:
Project Page: https://3dgsworld.github.io/
arXiv: https://arxiv.org/abs/2510.20813
Authors’ Original Thread on X

Dec 19, 2025 • 54min
Ep#55: Trace Anything: Representing Any Video in 4D via Trajectory Fields
Modeling how worlds evolve over time is an important aspect of interacting with them. Video world models have become an exciting area of research in robotics over the past year in part for this reason. What if there were a better way to represent changes over time, though?

Trace Anything represents a video as a trajectory field: every pixel in every frame is assigned a continuous trajectory through 3D space. This provides a unique foundation for all kinds of downstream tasks, like goal-conditioned manipulation and motion forecasting.

We talked to Xinhang Liu to learn more.

Watch Episode 55 of RoboPapers with Michael Cho and Chris Paxton now!

Abstract:
Effective spatio-temporal representation is fundamental to modeling, understanding, and predicting dynamics in videos. The atomic unit of a video, the pixel, traces a continuous 3D trajectory over time, serving as the primitive element of dynamics. Based on this principle, we propose representing any video as a Trajectory Field: a dense mapping that assigns a continuous 3D trajectory function of time to each pixel in every frame. With this representation, we introduce Trace Anything, a neural network that predicts the entire trajectory field in a single feed-forward pass. Specifically, for each pixel in each frame, our model predicts a set of control points that parameterizes a trajectory (i.e., a B-spline), yielding its 3D position at arbitrary query time instants. We trained the Trace Anything model on large-scale 4D data, including data from our new platform, and our experiments demonstrate that: (i) Trace Anything achieves state-of-the-art performance on our new benchmark for trajectory field estimation and performs competitively on established point-tracking benchmarks; (ii) it offers significant efficiency gains thanks to its one-pass paradigm, without requiring iterative optimization or auxiliary estimators; and (iii) it exhibits emergent abilities, including goal-conditioned manipulation, motion forecasting, and spatio-temporal fusion.

Learn more:
Project Page: https://trace-anything.github.io/
arXiv: https://arxiv.org/abs/2510.13802
This Post on X
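The abstract says each pixel's trajectory is parameterized by B-spline control points, so its 3D position can be queried at arbitrary times. As a hedged illustration (the paper's exact spline order and parameterization may differ), here is how a uniform cubic B-spline trajectory can be evaluated from its control points:

```python
import numpy as np

# Standard uniform cubic B-spline basis matrix (divided by 6).
_M = np.array([[-1,  3, -3, 1],
               [ 3, -6,  3, 0],
               [-3,  0,  3, 0],
               [ 1,  4,  1, 0]], float) / 6.0

def eval_uniform_cubic_bspline(ctrl, u):
    """Evaluate a uniform cubic B-spline at parameter u in [0, 1].

    ctrl: (K, D) control points (K >= 4), e.g. one pixel's 3D
    trajectory over time; u maps linearly onto the K - 3 segments.
    """
    ctrl = np.asarray(ctrl, float)
    n_seg = len(ctrl) - 3
    s = min(u * n_seg, n_seg - 1e-9)     # segment index + local parameter
    i = int(s)
    t = s - i
    basis = np.array([t**3, t**2, t, 1.0]) @ _M   # 4 blending weights
    return basis @ ctrl[i:i + 4]
```

Because the basis weights sum to one, the curve stays in the convex hull of the control points, and a small, fixed set of control points yields a smooth, continuously queryable 3D position, which is what makes this a compact per-pixel trajectory representation.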

Dec 17, 2025 • 51min
Ep#54: MemER: Scaling Up Memory for Robot Control via Experience Retrieval
Ajay Sridhar, a Robotics PhD student, and Jenny Pan, a visiting researcher, dive into MemER's groundbreaking work on robot memory. They discuss how robots can enhance decision-making by selecting crucial keyframes, improving long-horizon task execution like object search. Jenny explains the innovative training methods for keyframe selection, while Ajay shares insights into robust retry behaviors and the challenges of memory management in dynamic environments. Their vision includes transferable memories across robots, enhancing collaboration in robotic tasks.


