

RoboPapers
Chris Paxton and Michael Cho
Chris Paxton & Michael Cho geek out over robotics papers with the papers' authors. robopapers.substack.com
Episodes

Feb 19, 2026 • 1h 9min
Ep#63: NovaFlow: Zero-Shot Manipulation via Actionable Flow from Generated Videos
The holy grail of robotics is to be able to perform previously-unseen, out-of-distribution manipulation tasks “zero shot” in a new environment. NovaFlow proposes an approach which (1) generates a video, (2) computes predicted flow — how points move through the scene — and (3) uses this flow as an objective to generate a motion. Using this procedure, NovaFlow generates motions in unseen scenes, for unseen tasks, and can transfer across embodiments. To learn more, we are joined by Hongyu Li and Jiahui Fu from RAI.

Watch Episode #63 of RoboPapers now to learn more!

Abstract:
Enabling robots to execute novel manipulation tasks zero-shot is a central goal in robotics. Most existing methods assume in-distribution tasks or rely on fine-tuning with embodiment-matched data, limiting transfer across platforms. We present NovaFlow, an autonomous manipulation framework that converts a task description into an actionable plan for a target robot without any demonstrations. Given a task description, NovaFlow synthesizes a video using a video generation model and distills it into 3D actionable object flow using off-the-shelf perception modules. From the object flow, it computes relative poses for rigid objects and realizes them as robot actions via grasp proposals and trajectory optimization. For deformable objects, this flow serves as a tracking objective for model-based planning with a particle-based dynamics model. By decoupling task understanding from low-level control, NovaFlow naturally transfers across embodiments. We validate on rigid, articulated, and deformable object manipulation tasks using a table-top Franka arm and a Spot quadrupedal mobile robot, and achieve effective zero-shot execution without demonstrations or embodiment-specific training.

Learn more:
Project site: https://novaflow.lhy.xyz/
arXiv: https://arxiv.org/abs/2510.08568

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit robopapers.substack.com
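For rigid objects, the abstract says NovaFlow computes relative poses from 3D object flow and realizes them via grasp proposals and trajectory optimization. A standard way to recover a rigid transform from matched 3D points (each point's position before and after the flow) is the Kabsch algorithm. The NumPy sketch below is illustrative only, not the authors' implementation; the function name and interface are invented here.

```python
import numpy as np

def rigid_transform_from_flow(p_start, p_end):
    """Recover rotation R and translation t such that R @ p + t maps
    p_start onto p_end (least-squares, via the Kabsch algorithm).

    p_start, p_end: (N, 3) arrays of corresponding 3D points; the
    object flow supplies each tracked point's start and end position.
    """
    c0 = p_start.mean(axis=0)            # centroids
    c1 = p_end.mean(axis=0)
    H = (p_start - c0).T @ (p_end - c1)  # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))     # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = c1 - R @ c0
    return R, t
```

The recovered pose would then serve as a goal for the downstream grasp proposal and trajectory-optimization steps described in the abstract.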

Feb 11, 2026 • 60min
Ep#62: PolaRiS: Scalable Real-to-Sim Evaluations for Generalist Robot Policies
Evaluating robot policies is hard. Ideally, instead of testing every new policy on a real robot, you could test in simulation; but simulations rarely correlate well with real-world performance. In order to make good, useful simulations, you need to spend a great deal of time and effort.

That’s where PolaRiS comes in: it’s a toolkit that lets you take a short video of a real scene and turn it into a high-fidelity simulation. It provides what you need to build a good evaluation environment, and it “ships” with off-the-shelf environments that already show strong sim-to-real correlation, meaning that they can be used to inform policy performance.

Arhan Jain and Karl Pertsch join us to talk about what they have built, why, and how you can use it.

Watch Episode #62 of RoboPapers, with Chris Paxton and Jiafei Duan, now!

Abstract:
A significant challenge for robot learning research is our ability to accurately measure and compare the performance of robot policies. Benchmarking in robotics is historically challenging due to the stochasticity, reproducibility, and time-consuming nature of real-world rollouts. This challenge is exacerbated for recent generalist policies, which have to be evaluated across a wide variety of scenes and tasks. Evaluation in simulation offers a scalable complement to real-world evaluations, but the visual and physical domain gap between existing simulation benchmarks and the real world has made them an unreliable signal for policy improvement. Furthermore, building realistic and diverse simulated environments has traditionally required significant human effort and expertise. To bridge the gap, we introduce Policy Evaluation and Environment Reconstruction in Simulation (PolaRiS), a scalable real-to-sim framework for high-fidelity simulated robot evaluation. PolaRiS utilizes neural reconstruction methods to turn short video scans of real-world scenes into interactive simulation environments. Additionally, we develop a simple simulation data co-training recipe that bridges remaining real-to-sim gaps and enables zero-shot evaluation in unseen simulation environments. Through extensive paired evaluations between simulation and the real world, we demonstrate that PolaRiS evaluations provide a much stronger correlation to real-world generalist policy performance than existing simulated benchmarks. Its simplicity also enables rapid creation of diverse simulated environments. As such, this work takes a step towards distributed and democratized evaluation for the next generation of robotic foundation models.

Learn more:
Project Page: https://polaris-evals.github.io/
arXiv: https://arxiv.org/abs/2512.16881
This post on X
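The key claim above is that PolaRiS evaluations correlate strongly with real-world policy performance. One simple way to quantify sim-to-real correlation from paired evaluations, shown here for intuition rather than as the paper's actual metric code, is the Pearson correlation between per-policy success rates measured in simulation and on the real robot:

```python
import numpy as np

def pearson(sim_scores, real_scores):
    """Pearson correlation between per-policy success rates measured
    in simulation and in the real world. Values near 1.0 mean the
    simulated benchmark tracks real-world performance."""
    sim = np.asarray(sim_scores, float) - np.mean(sim_scores)
    real = np.asarray(real_scores, float) - np.mean(real_scores)
    return float(sim @ real / np.sqrt((sim @ sim) * (real @ real)))
```

Each entry pairs one policy's simulated score with its real-world score; a high correlation across many policies is what makes a simulated benchmark a useful proxy signal for policy improvement.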

Feb 4, 2026 • 55min
Ep#61: 1x World Model
Daniel Ho, Director of Evaluations at 1X, builds world-model-based control for humanoid robots. He describes using internet and egocentric videos to train world models that act as imagined worlds for generating zero-shot robot behaviors. The conversation covers how prompts and action labels guide imagined rollouts, training recipes across web, egocentric, and robot data, evaluation with learned simulators, and challenges like contact-rich tasks and latency.

Jan 28, 2026 • 1h 14min
Ep#60: Sim-to-Real Manipulation with VIRAL and DoorMan
For robots to be useful, they must be able to interact with a wide variety of environments; and yet, scaling interaction data is difficult, expensive, and time-consuming. Instead, much research revolves around sim-to-real manipulation — but mostly this has not been mobile manipulation.

Recently, though, this has begun to change. Two recent papers from Tairan He and Haoru Xue show us how to unlock the potential of this technique, building policies which, without any real data at all, can move objects around and open doors in the real world with a humanoid robot.

Watch Episode #60 of RoboPapers now to learn more, hosted by Chris Paxton and Jiafei Duan. In this episode, we cover two papers: first, VIRAL: Visual Sim-to-Real at Scale for Humanoid Loco-Manipulation; and second, DoorMan: Opening the Sim-to-Real Door for Humanoid Pixel-to-Action Policy Transfer.

Paper #1: VIRAL
Abstract:
A key barrier to the real-world deployment of humanoid robots is the lack of autonomous loco-manipulation skills. We introduce VIRAL, a visual sim-to-real framework that learns humanoid loco-manipulation entirely in simulation and deploys it zero-shot to real hardware. VIRAL follows a teacher-student design: a privileged RL teacher, operating on full state, learns long-horizon loco-manipulation using a delta action space and reference state initialization. A vision-based student policy is then distilled from the teacher via large-scale simulation with tiled rendering, trained with a mixture of online DAgger and behavior cloning. We find that compute scale is critical: scaling simulation to tens of GPUs (up to 64) makes both teacher and student training reliable, while low-compute regimes often fail. To bridge the sim-to-real gap, VIRAL combines large-scale visual domain randomization (over lighting, materials, camera parameters, image quality, and sensor delays) with real-to-sim alignment of the dexterous hands and cameras. Deployed on a Unitree G1 humanoid, the resulting RGB-based policy performs continuous loco-manipulation for up to 54 cycles, generalizing to diverse spatial and appearance variations without any real-world fine-tuning, and approaching expert-level teleoperation performance. Extensive ablations dissect the key design choices required to make RGB-based humanoid loco-manipulation work in practice.

Project page: https://viral-humanoid.github.io/
arXiv: https://arxiv.org/abs/2511.15200
Original thread on X:

Paper #2: DoorMan
Abstract:
Recent progress in GPU-accelerated, photorealistic simulation has opened a scalable data-generation path for robot learning, where massive physics and visual randomization allow policies to generalize beyond curated environments. Building on these advances, we develop a teacher-student-bootstrap learning framework for vision-based humanoid loco-manipulation, using articulated-object interaction as a representative high-difficulty benchmark. Our approach introduces a staged-reset exploration strategy that stabilizes long-horizon privileged-policy training, and a GRPO-based fine-tuning procedure that mitigates partial observability and improves closed-loop consistency in sim-to-real RL. Trained entirely on simulation data, the resulting policy achieves robust zero-shot performance across diverse door types and outperforms human teleoperators by up to 31.7% in task completion time under the same whole-body control stack. This represents the first humanoid sim-to-real policy capable of diverse articulated loco-manipulation using pure RGB perception.

Project page: https://doorman-humanoid.github.io/
arXiv: https://arxiv.org/abs/2512.01061

Jan 21, 2026 • 49min
Ep#59: SAIL: Faster-than-Demonstration Execution of Imitation Learning Policies
Teleoperating a robot is hard. This means that when performing a robot task via teleoperation — say, to collect examples for training a robot policy — it’s almost unavoidably slower than you would like, below the capabilities of either the human expert alone or the robot performing the task. Wouldn’t it be great if there was a way to fix this?

Unfortunately, it’s harder than it looks. You can’t just execute faster, as this alters the distribution of environment states the policy will encounter. Nadun Ranawaka Arachchige and Zhenyang Chen propose Speed-Adaptive Imitation Learning (SAIL), which adds error-adaptive guidance, adapts execution speed according to task structure, predicts controller-invariant action targets to ensure robustness across execution speeds, and explicitly models delays from, for example, sensor latency.

Watch Episode #59 of RoboPapers, with Chris Paxton and Michael Cho, to learn more!

Abstract:
Offline Imitation Learning (IL) methods such as Behavior Cloning are effective at acquiring complex robotic manipulation skills. However, existing IL-trained policies are confined to executing the task at the same speed as shown in demonstration data. This limits the task throughput of a robotic system, a critical requirement for applications such as industrial automation. In this paper, we introduce and formalize the novel problem of enabling faster-than-demonstration execution of visuomotor policies and identify fundamental challenges in robot dynamics and state-action distribution shifts. We instantiate the key insights as SAIL (Speed Adaptation for Imitation Learning), a full-stack system integrating four tightly-connected components: (1) a consistency-preserving action inference algorithm for smooth motion at high speed, (2) high-fidelity tracking of controller-invariant motion targets, (3) adaptive speed modulation that dynamically adjusts execution speed based on motion complexity, and (4) action scheduling to handle real-world system latencies. Experiments on 12 tasks across simulation and two real, distinct robot platforms show that SAIL achieves up to a 4x speedup over demonstration speed in simulation and up to 3.2x speedup in the real world. Additional detail is available at this https URL

Learn more:
Project site: https://nadunranawaka1.github.io/sail-policy/
arXiv: https://arxiv.org/abs/2506.11948
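Component (3) above, adaptive speed modulation, adjusts execution speed based on motion complexity. As a toy illustration of the idea (not SAIL's actual algorithm; the curvature heuristic, gains, and function name here are invented), one could retime a demonstrated path so that straight stretches are sped up while sharp bends stay near demonstration speed:

```python
import numpy as np

def retime_waypoints(points, base_speedup=3.0, curvature_gain=5.0, dt=0.1):
    """Retime a demonstrated path: run faster on straight segments and
    slow down where the path bends (a crude proxy for motion complexity).

    points: (N, D) waypoints recorded at a fixed interval `dt`.
    Returns the new per-waypoint timestamps.
    """
    p = np.asarray(points, float)
    v = np.diff(p, axis=0)                 # per-segment displacement
    angles = np.zeros(len(v))              # turning angle between segments
    for i in range(1, len(v)):
        a, b = v[i - 1], v[i]
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        if denom > 1e-9:
            angles[i] = np.arccos(np.clip(a @ b / denom, -1.0, 1.0))
    # Per-segment speedup, shrinking toward 1x as curvature grows.
    speedup = np.maximum(1.0, base_speedup / (1.0 + curvature_gain * angles))
    seg_time = dt / speedup
    return np.concatenate([[0.0], np.cumsum(seg_time)])
```

On a perfectly straight demonstration this yields the full base speedup; near sharp turns it falls back toward demonstration speed, mirroring the intuition that complexity should gate execution speed.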

Jan 14, 2026 • 1h 12min
Ep#58: RL-100: Performant Robotic Manipulation with Real-World Reinforcement Learning
In order for robots to be deployed in the real world, performing tasks of real value, they must be reliable. Unfortunately, most robotic demos work maybe 70-80% of the time at best. The way to get better reliability is to do real-world reinforcement learning: having the robot teach itself how to perform the task up to a high level of success.

The key is to start with a core of expert human data, use that to train a policy, then iteratively improve it, finishing with on-policy reinforcement learning. Kun Lei talks through a unified framework for imitation and reinforcement learning based on PPO, which enables this improvement process.

In this episode, Kun Lei explains the theory behind his reinforcement learning method and how it allowed his robot to run in a shopping mall, juicing oranges for seven hours at a time, among experiments on a wide variety of tasks and embodiments.

Watch Episode 58 of RoboPapers now, hosted by Michael Cho and Chris Paxton!

Abstract:
Real-world robotic manipulation in homes and factories demands reliability, efficiency, and robustness that approach or surpass the performance of skilled human operators. We present RL-100, a real-world reinforcement learning framework built on diffusion-based visuomotor policies. RL-100 unifies imitation and reinforcement learning under a single PPO-style objective applied within the denoising process, yielding conservative and stable policy improvements across both offline and online stages. To meet deployment latency constraints, we employ a lightweight consistency distillation procedure that compresses multi-step diffusion into a one-step controller for high-frequency control. The framework is task-, embodiment-, and representation-agnostic, and supports both single-action outputs and action-chunking control. We evaluate RL-100 on seven diverse real-robot manipulation tasks, ranging from dynamic pushing and agile bowling to pouring, cloth folding, unscrewing, and multi-stage juicing. RL-100 attains 100% success across evaluated trials, achieving 900 out of 900 successful episodes, including up to 250 out of 250 consecutive trials on one task, and matches or surpasses expert teleoperators in time-to-completion. Without retraining, a single policy attains approximately 90% zero-shot success under environmental and dynamics shifts, adapts in a few-shot regime to significant task variations (86.7%), and remains robust to aggressive human perturbations (about 95%). In a public shopping-mall deployment, the juicing robot served random customers continuously for roughly seven hours without failure. Together, these results suggest a practical path toward deployment-ready robot learning: start from human priors, align training objectives with human-grounded metrics, and reliably extend performance beyond human demonstrations.

Learn more:
Project Page: https://lei-kun.github.io/RL-100/
arXiv: https://arxiv.org/abs/2510.14830
Original thread on X:
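The abstract describes a single PPO-style objective unifying imitation and reinforcement learning. Applying it inside the diffusion denoising process is the paper's contribution; for intuition only, here is the standard PPO clipped surrogate in its generic form (a sketch, not the RL-100 training code):

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Standard PPO clipped surrogate loss (to be minimized).

    The probability ratio between new and old policies is clipped to
    [1 - eps, 1 + eps], which keeps policy updates conservative --
    the same stability property the abstract highlights.
    """
    ratio = np.exp(logp_new - logp_old)        # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Pessimistic (elementwise minimum) of the two surrogates.
    return float(-np.mean(np.minimum(unclipped, clipped)))
```

When the new policy equals the old one, the loss reduces to the negative mean advantage; large ratio changes are capped, bounding how far any single update can move the policy.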

Jan 6, 2026 • 51min
Ep#57: Learning Dexterity from Human Videos with Gen2Act and SPIDER
Homanga Bharadhwaj, research scientist at Meta Reality Labs and incoming Johns Hopkins assistant professor, works on teaching robots from human video. He discusses Gen2Act, which generates human videos from language to guide robot actions. He also covers SPIDER, which retargets human hand and object motion through simulation for dexterous, contact-rich tasks.

Dec 22, 2025 • 46min
Ep#56: GSWorld: Closed-Loop Photo-Realistic Simulation Suite for Robotic Manipulation
It’s long been a dream of roboticists to be able to teach a robot in simulation, so as to skip the long and expensive process of collecting large amounts of real-world training data. However, building simulations for robot tasks is extremely hard. Ideally, we could go from real data to a useful simulation.

This is exactly what Guangqi Jiang and his co-authors do: they use 3D Gaussian splatting to reconstruct scenes, letting them create interactive environments that, when combined with a physics engine, allow for training robot policies that show zero-shot sim-to-real transfer (i.e., using no real-world demonstrations).

To learn more, watch Episode 56 of RoboPapers with Michael Cho and Chris Paxton now!

Abstract:
This paper presents GSWorld, a robust, photo-realistic simulator for robotics manipulation that combines 3D Gaussian Splatting with physics engines. Our framework advocates "closing the loop" of developing manipulation policies with reproducible evaluation of policies learned from real-robot data and sim2real policy training without using real robots. To enable photo-realistic rendering of diverse scenes, we propose a new asset format, which we term GSDF (Gaussian Scene Description File), that infuses Gaussian-on-Mesh representation with robot URDF and other objects. With a streamlined reconstruction pipeline, we curate a database of GSDF that contains 3 robot embodiments for single-arm and bimanual manipulation, as well as more than 40 objects. Combining GSDF with physics engines, we demonstrate several immediate interesting applications: (1) learning zero-shot sim2real pixel-to-action manipulation policy with photo-realistic rendering, (2) automated high-quality DAgger data collection for adapting policies to deployment environments, (3) reproducible benchmarking of real-robot manipulation policies in simulation, (4) simulation data collection by virtual teleoperation, and (5) zero-shot sim2real visual reinforcement learning. Website: this https URL.

Learn more:
Project Page: https://3dgsworld.github.io/
arXiv: https://arxiv.org/abs/2510.20813
Authors’ Original Thread on X

Dec 19, 2025 • 54min
Ep#55: Trace Anything: Representing Any Video in 4D via Trajectory Fields
Modeling how worlds evolve over time is an important aspect of interacting with them. Video world models have become an exciting area of research in robotics over the past year in part for this reason. What if there were a better way to represent changes over time, though?

Trace Anything represents a video as a trajectory field: every pixel in every frame is assigned a continuous trajectory through 3D space. This provides a unique foundation for all kinds of downstream tasks, like goal-conditioned manipulation and motion forecasting.

We talked to Xinhang Liu to learn more.

Watch Episode 55 of RoboPapers with Michael Cho and Chris Paxton now!

Abstract:
Effective spatio-temporal representation is fundamental to modeling, understanding, and predicting dynamics in videos. The atomic unit of a video, the pixel, traces a continuous 3D trajectory over time, serving as the primitive element of dynamics. Based on this principle, we propose representing any video as a Trajectory Field: a dense mapping that assigns a continuous 3D trajectory function of time to each pixel in every frame. With this representation, we introduce Trace Anything, a neural network that predicts the entire trajectory field in a single feed-forward pass. Specifically, for each pixel in each frame, our model predicts a set of control points that parameterizes a trajectory (i.e., a B-spline), yielding its 3D position at arbitrary query time instants. We trained the Trace Anything model on large-scale 4D data, including data from our new platform, and our experiments demonstrate that: (i) Trace Anything achieves state-of-the-art performance on our new benchmark for trajectory field estimation and performs competitively on established point-tracking benchmarks; (ii) it offers significant efficiency gains thanks to its one-pass paradigm, without requiring iterative optimization or auxiliary estimators; and (iii) it exhibits emergent abilities, including goal-conditioned manipulation, motion forecasting, and spatio-temporal fusion.

Learn more:
Project Page: https://trace-anything.github.io/
arXiv: https://arxiv.org/abs/2510.13802
This Post on X
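The abstract says each pixel's trajectory is parameterized by B-spline control points, so its 3D position can be queried at arbitrary times. As a hedged illustration (the paper's exact spline order and parameterization may differ), here is how a uniform cubic B-spline trajectory can be evaluated from its control points:

```python
import numpy as np

# Standard uniform cubic B-spline basis matrix (divided by 6).
_M = np.array([[-1,  3, -3, 1],
               [ 3, -6,  3, 0],
               [-3,  0,  3, 0],
               [ 1,  4,  1, 0]], float) / 6.0

def eval_uniform_cubic_bspline(ctrl, u):
    """Evaluate a uniform cubic B-spline at parameter u in [0, 1].

    ctrl: (K, D) control points (K >= 4), e.g. one pixel's 3D
    trajectory over time; u maps linearly onto the K - 3 segments.
    """
    ctrl = np.asarray(ctrl, float)
    n_seg = len(ctrl) - 3
    s = min(u * n_seg, n_seg - 1e-9)     # segment index + local parameter
    i = int(s)
    t = s - i
    basis = np.array([t**3, t**2, t, 1.0]) @ _M   # 4 blending weights
    return basis @ ctrl[i:i + 4]
```

Because the basis weights sum to one, the curve stays in the convex hull of the control points, and a small, fixed set of control points yields a smooth, continuously queryable 3D position, which is what makes this a compact per-pixel trajectory representation.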

Dec 17, 2025 • 51min
Ep#54: MemER: Scaling Up Memory for Robot Control via Experience Retrieval
Ajay Sridhar, a Robotics PhD student, and Jenny Pan, a visiting researcher, dive into MemER's groundbreaking work on robot memory. They discuss how robots can enhance decision-making by selecting crucial keyframes, improving long-horizon task execution like object search. Jenny explains the innovative training methods for keyframe selection, while Ajay shares insights into robust retry behaviors and the challenges of memory management in dynamic environments. Their vision includes transferable memories across robots, enhancing collaboration in robotic tasks.


