RoboPapers

Chris Paxton and Michael Cho
Aug 15, 2025 • 1h 9min

Ep#13: Instant Policy

How can we do in-context learning for robots? Watch this episode to find out. This work was an ICLR 2025 oral paper and winner of the Best Paper Award at the ICLR 2025 Robot Learning Workshop.

Following the impressive capabilities of in-context learning with large transformers, In-Context Imitation Learning (ICIL) is a promising opportunity for robotics. We introduce Instant Policy, which learns new tasks instantly from just one or two demonstrations, achieving ICIL through two key components. First, we introduce inductive biases through a graph representation and model ICIL as a graph generation problem using a learned diffusion process, enabling structured reasoning over demonstrations, observations, and actions. Second, we show that such a model can be trained using pseudo-demonstrations – arbitrary trajectories generated in simulation – as a virtually infinite pool of training data. Our experiments, in both simulation and reality, show that Instant Policy enables rapid learning of various everyday robot tasks. We also show how it can serve as a foundation for cross-embodiment and zero-shot transfer to language-defined tasks.

Project Site | Original Post on X | arXiv PDF

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit robopapers.substack.com
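To make the graph-diffusion idea concrete, here is a heavily simplified toy sketch (not the authors' architecture: 1D actions, a hand-rolled linear denoising rule, and a made-up "demo implies target" heuristic). Demonstrations, the current observation, and the action are nodes of one graph, and the action node is iteratively denoised conditioned on the rest:

```python
import random

def build_graph(demo_waypoints, current_obs):
    """Toy 'context graph': demo waypoints and the current observation
    are conditioning nodes; the action node starts as pure noise."""
    return {
        "demo": list(demo_waypoints),      # demonstration nodes
        "obs": current_obs,                # current-state node
        "action": random.gauss(0.0, 1.0),  # noisy action node to denoise
    }

def denoise_step(graph, step_size=0.5):
    """One toy denoising step: pull the action node toward the target
    implied by the demonstration (here: the demo's final waypoint
    relative to the current observation)."""
    target = graph["demo"][-1] - graph["obs"]
    graph["action"] += step_size * (target - graph["action"])
    return graph

def instant_policy(demo_waypoints, current_obs, n_steps=20):
    """Run the full toy denoising chain and return the action."""
    g = build_graph(demo_waypoints, current_obs)
    for _ in range(n_steps):
        g = denoise_step(g)
    return g["action"]
```

The real model learns the denoiser from pseudo-demonstrations; here the "learned" part is replaced by a fixed heuristic so the loop structure is visible.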
Aug 14, 2025 • 1h 9min

Ep#12: VaViM and VaVAM: Autonomous Driving through Video Generative Modeling

How can world models be used for training autonomous driving? Learn by watching this episode with Florent Bartoccioni!

We explore the potential of large-scale generative video models to enhance autonomous driving capabilities, introducing an open-source autoregressive video model (VaViM) and a companion video-action model (VaVAM). VaViM is a simple autoregressive model that predicts frames using spatio-temporal token sequences, while VaVAM leverages the learned representations to generate driving trajectories through imitation learning. Together, they offer a complete perception-to-action pipeline.

Project Site | Original Post on X | arXiv
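As a toy analogue of the autoregressive setup (a bigram frequency table standing in for the transformer, integer IDs standing in for spatio-temporal video tokens), next-token video rollout looks like this:

```python
from collections import Counter, defaultdict

def fit_bigram(token_sequences):
    """Count token bigrams across training sequences. A stand-in for the
    autoregressive model: both estimate p(next token | context), the
    transformer just with a far richer context than one token."""
    counts = defaultdict(Counter)
    for seq in token_sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return counts

def predict_rollout(counts, prompt, n_tokens):
    """Greedily extend a token prompt, mimicking frame-by-frame video
    rollout from spatio-temporal tokens."""
    seq = list(prompt)
    for _ in range(n_tokens):
        nxt_counts = counts.get(seq[-1])
        if not nxt_counts:
            break
        seq.append(nxt_counts.most_common(1)[0][0])
    return seq
```

VaVAM's action head would then read the learned representations rather than the raw token stream; that part is omitted here.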
Aug 14, 2025 • 1h 8min

Ep#14: In-Air Vehicle Maneuver for High-Speed Off-Road Navigation and VERTIFORMER

Two papers in one episode! Learn about how we can use small amounts of data to train transformers capable of doing truly impressive stuff.

Original Post on X

Dom, cars don’t fly! Or do they?

In-Air Vehicle Maneuver for High-Speed Off-Road Navigation

When pushing the speed limit for aggressive off-road navigation on uneven terrain, it is inevitable that vehicles may become airborne from time to time. During time-sensitive tasks, being able to fly over challenging terrain can also save time, instead of cautiously circumventing or slowly negotiating through it. However, most off-road autonomy systems operate under the assumption that the vehicles are always on the ground and therefore limit operational speed. In this paper, we present a novel approach for in-air vehicle maneuver during high-speed off-road navigation. Based on a hybrid forward kinodynamic model using both physics principles and machine learning, our fixed-horizon, sampling-based motion planner ensures accurate vehicle landing poses and their derivatives within a short airborne time window using vehicle throttle and steering commands. We test our approach in extensive in-air experiments both indoors and outdoors, compare it against an error-driven control method, and demonstrate that precise and timely in-air vehicle maneuvers are possible through existing ground vehicle controls.

Paper PDF

VERTIFORMER: A Data-Efficient Multi-Task Transformer on Vertically Challenging Terrain

We propose VERTIFORMER, a novel data-efficient multi-task Transformer trained with only one hour of multi-modal data to address the challenges of applying Transformers to robot mobility on extremely rugged, vertically challenging, off-road terrain. With a Transformer encoder and decoder to predict the next robot pose, action, and terrain patch, VERTIFORMER employs a unified state space and missing-modality infilling to respectively enhance dynamics understanding and enable a variety of off-road mobility tasks simultaneously, e.g., forward and inverse kinodynamics modeling. By leveraging this unified representation alongside modality infilling, it also achieves real-time task switching during inference for improved fault tolerance and better generalization to unseen environments. Furthermore, VERTIFORMER’s non-autoregressive design mitigates the computational bottlenecks and error propagation associated with autoregressive models. Our experiments offer insights into effectively utilizing Transformers for off-road robot mobility with limited data and demonstrate that VERTIFORMER can facilitate multiple off-road mobility tasks onboard a physical mobile robot.

Paper PDF | Open-Source Code
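The fixed-horizon sampling-based planner from the first paper can be caricatured in a few lines. The dynamics below are invented (yaw integration only), not the paper's hybrid physics-plus-learning kinodynamic model; the point is the sample, roll out, and score loop:

```python
import random

def rollout_airborne(yaw0, yaw_rate_cmd, dt, n_steps):
    """Toy airborne dynamics: steering only integrates yaw. A stand-in
    for the paper's learned hybrid forward kinodynamic model."""
    yaw = yaw0
    for _ in range(n_steps):
        yaw += yaw_rate_cmd * dt
    return yaw

def plan_landing(yaw0, yaw_target, dt=0.05, n_steps=10,
                 n_samples=200, seed=0):
    """Fixed-horizon sampling planner: sample yaw-rate commands, keep
    the one whose predicted landing yaw best matches the target pose."""
    rng = random.Random(seed)
    best_cmd, best_err = 0.0, float("inf")
    for _ in range(n_samples):
        cmd = rng.uniform(-2.0, 2.0)  # candidate steering command (rad/s)
        err = abs(rollout_airborne(yaw0, cmd, dt, n_steps) - yaw_target)
        if err < best_err:
            best_cmd, best_err = cmd, err
    return best_cmd, best_err
```

The real planner also samples throttle, scores full landing poses and their derivatives, and must finish within the short airborne window.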
Aug 14, 2025 • 1h 2min

Ep#11: Sim-and-Real Co-Training: A Simple Recipe for Vision-Based Robotic Manipulation

Large real-world robot datasets hold great potential to train generalist robot models, but scaling real-world human data collection is time-consuming and resource-intensive. Simulation has great potential in supplementing large-scale data, especially with recent advances in generative AI and automated data generation tools that enable scalable creation of robot behavior datasets. However, training a policy solely in simulation and transferring it to the real world often demands substantial human effort to bridge the reality gap. A compelling alternative is to co-train the policy on a mixture of simulation and real-world datasets. Preliminary studies have recently shown this strategy to substantially improve the performance of a policy over one trained on a limited amount of real-world data. Nonetheless, the community lacks a systematic understanding of sim-and-real co-training and what it takes to reap the benefits of simulation data for real-robot learning. This work presents a simple yet effective recipe for utilizing simulation data to solve vision-based robotic manipulation tasks. We derive this recipe from comprehensive experiments that validate the co-training strategy on various simulation and real-world datasets. Using two domains--a robot arm and a humanoid--across diverse tasks, we demonstrate that simulation data can enhance real-world task performance by an average of 38%, even with notable differences between the simulation and real-world data. Videos and additional results can be found at this https URL

Original Post on X | Project Site | arXiv
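The co-training recipe boils down to drawing each training batch partly from simulation and partly from real data. A minimal sketch, assuming a tunable `sim_ratio` mixing knob (the paper studies such ratios empirically; the value used below is illustrative):

```python
import random

def cotrain_batches(sim_data, real_data, batch_size, sim_ratio,
                    n_batches, seed=0):
    """Yield co-training batches mixing simulation and real samples.

    sim_ratio is the fraction of each batch drawn from simulation;
    the remainder comes from the (typically much smaller) real set.
    """
    rng = random.Random(seed)
    n_sim = int(batch_size * sim_ratio)
    n_real = batch_size - n_sim
    for _ in range(n_batches):
        batch = [rng.choice(sim_data) for _ in range(n_sim)]
        batch += [rng.choice(real_data) for _ in range(n_real)]
        rng.shuffle(batch)  # avoid ordering the policy could exploit
        yield batch
```

Because real samples are drawn with replacement at a fixed per-batch fraction, the small real dataset is effectively oversampled relative to its size, which is the core of the co-training trick.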
Aug 12, 2025 • 1h 1min

Ep#10: Humanoid Policy ~ Human Policy

It’s hard to collect data for humanoid robots at sufficient scale for generalization. The authors of “Humanoid Policy ~ Human Policy” have the answer: collect human data at scale, and retarget it to humanoid robots. This acts as a multiplier, letting you get away with using far less robot data to accomplish challenging robot tasks. Watch or listen to learn more.

Abstract: Training manipulation policies for humanoid robots with diverse data enhances their robustness and generalization across tasks and platforms. However, learning solely from robot demonstrations is labor-intensive, requiring expensive teleoperated data collection which is difficult to scale. This paper investigates a more scalable data source, egocentric human demonstrations, to serve as cross-embodiment training data for robot learning. We mitigate the embodiment gap between humanoids and humans from both the data and modeling perspectives. We collect an egocentric task-oriented dataset (PH2D) that is directly aligned with humanoid manipulation demonstrations. We then train a human-humanoid behavior policy, which we term Human Action Transformer (HAT). The state-action space of HAT is unified for both humans and humanoid robots and can be differentiably retargeted to robot actions. Co-trained with smaller-scale robot data, HAT directly models humanoid robots and humans as different embodiments without additional supervision. We show that human data improves both generalization and robustness of HAT with significantly better data collection efficiency. Code and data: this https URL

Project Website | arXiv | YouTube Link
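A toy illustration of the unified state-action space idea: both human hand keypoints and robot commands map into one (end-effector position, gripper aperture) vector. The calibration constants `arm_scale` and `max_grip` are made up for the example, and the real retargeting is a learned, differentiable mapping rather than this fixed linear one:

```python
def to_unified_state(wrist_xyz, grip_aperture):
    """Unified state-action vector shared by human and humanoid:
    3D wrist/end-effector position plus a scalar gripper aperture."""
    return [*wrist_xyz, grip_aperture]

def retarget_human_to_robot(thumb_tip, index_tip, wrist_xyz,
                            arm_scale=0.9, max_grip=0.08):
    """Map human hand keypoints into the unified space as a robot
    command. arm_scale and max_grip are invented calibration values."""
    # thumb-index pinch distance stands in for gripper aperture
    pinch = sum((a - b) ** 2 for a, b in zip(thumb_tip, index_tip)) ** 0.5
    grip = min(pinch, max_grip)            # clamp to gripper range (m)
    ee_xyz = [arm_scale * c for c in wrist_xyz]  # scale arm workspace
    return to_unified_state(ee_xyz, grip)
```

Because human and robot trajectories land in the same vector format, one transformer can be co-trained on both without per-embodiment heads.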
Aug 12, 2025 • 52min

Ep#9: AutoEval: Autonomous Evaluation of Generalist Robot Manipulation Policies in the Real World

Evaluating robot policies is challenging, because everyone has a slightly different testing setup and environment. Paul joined us to talk about his work AutoEval, a continuously-operating benchmark that lets you test robot policies remotely, in any environment.

Abstract: Scalable and reproducible policy evaluation has been a long-standing challenge in robot learning. Evaluations are critical to assess progress and build better policies, but evaluation in the real world, especially at a scale that would provide statistically reliable results, is costly in terms of human time and hard to obtain. Evaluation of increasingly generalist robot policies requires an increasingly diverse repertoire of evaluation environments, making the evaluation bottleneck even more pronounced. To make real-world evaluation of robotic policies more practical, we propose AutoEval, a system to autonomously evaluate generalist robot policies around the clock with minimal human intervention. Users interact with AutoEval by submitting evaluation jobs to the AutoEval queue, much like how software jobs are submitted with a cluster scheduling system, and AutoEval will schedule the policies for evaluation within a framework supplying automatic success detection and automatic scene resets. We show that AutoEval can nearly fully eliminate human involvement in the evaluation process, permitting around-the-clock evaluations, and that the evaluation results correspond closely to ground-truth evaluations conducted by hand. To facilitate the evaluation of generalist policies in the robotics community, we provide public access to multiple AutoEval scenes in the popular BridgeData robot setup with WidowX robot arms. In the future, we hope that AutoEval scenes can be set up across institutions to form a diverse and distributed evaluation network.

Project Page | arXiv Page
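The submit-a-job workflow can be sketched as a plain work queue. Here `run_episode` stands in for a real rollout with automatic success detection and scene reset; everything below is illustrative, not AutoEval's actual API:

```python
import queue

def autoeval_worker(job_queue, run_episode, n_episodes=10):
    """Drain the evaluation queue: for each submitted policy, run a
    fixed number of episodes (each with automatic success detection
    and scene reset handled inside run_episode) and report a success
    rate per policy. A toy of the cluster-scheduler-style idea."""
    results = {}
    while not job_queue.empty():
        name, policy = job_queue.get()
        successes = sum(run_episode(policy) for _ in range(n_episodes))
        results[name] = successes / n_episodes
    return results
```

In the real system the queue persists across users and the worker runs around the clock on the physical WidowX setup; the loop structure is the same.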
Aug 8, 2025 • 59min

Ep#8: VGGT: Visual Geometry Grounded Transformer

3D spatial information provides a really strong signal for robotics policies, something we’ve discussed in previous episodes. But computing this 3D structure is hard, and often relies on imperfect, low-quality depth sensors. It would be great if we could reconstruct this information from cameras alone, with little prior information.

Well, that’s exactly what VGGT does!

We present VGGT, a feed-forward neural network that directly infers all key 3D attributes of a scene, including camera parameters, point maps, depth maps, and 3D point tracks, from one, a few, or hundreds of its views. This approach is a step forward in 3D computer vision, where models have typically been constrained to and specialized for single tasks. It is also simple and efficient, reconstructing images in under one second, and still outperforming alternatives that rely on post-processing with visual geometry optimization techniques. The network achieves state-of-the-art results in multiple 3D tasks, including camera parameter estimation, multi-view depth estimation, dense point cloud reconstruction, and point tracking. We also show that using pretrained VGGT as a feature backbone significantly enhances downstream tasks, such as non-rigid point tracking and feed-forward novel view synthesis.

Project Page
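The quantities VGGT predicts (camera parameters, depth maps, point maps) are tied together by standard pinhole geometry. A small sketch of that relationship, recovering a camera-frame point map from a depth map and intrinsics (this is textbook unprojection, not VGGT's network):

```python
def unproject(depth, fx, fy, cx, cy):
    """Turn a per-pixel depth map into a 3D point map with pinhole
    intrinsics: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy, Z = depth.
    depth is a list of rows; the result mirrors its layout."""
    points = []
    for v, row in enumerate(depth):
        prow = []
        for u, z in enumerate(row):
            prow.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
        points.append(prow)
    return points
```

Predicting depth, point maps, and cameras jointly lets the network keep these outputs mutually consistent without an optimization-based post-processing step.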
Aug 8, 2025 • 59min

Ep#7: AnyDexGrasp: General Dexterous Grasping for Different Hands with Human-level Learning Efficiency

Hao-Shu talks to us about how we can learn a contact-centric grasp representation that works across many different robots.

We introduce an efficient approach for learning dexterous grasping with minimal data, advancing robotic manipulation capabilities across different robotic hands. Unlike traditional methods that require millions of grasp labels for each robotic hand, our method achieves high performance with human-level learning efficiency: only hundreds of grasp attempts on 40 training objects. The approach separates the grasping process into two stages: first, a universal model maps scene geometry to intermediate contact-centric grasp representations, independent of specific robotic hands. Next, a unique grasp decision model is trained for each robotic hand through real-world trial and error, translating these representations into final grasp poses. Our results show a grasp success rate of 75-95% across three different robotic hands in real-world cluttered environments with over 150 novel objects, improving to 80-98% with increased training objects. This adaptable method demonstrates promising applications for humanoid robots, prosthetics, and other domains requiring robust, versatile robotic manipulation.

Paper on arXiv
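The two-stage split in miniature: a hand-agnostic stage proposes contact-centric candidates from geometry, and a per-hand stage picks among them. Both stages below are toy stand-ins for the learned models:

```python
def propose_contacts(object_points):
    """Stage 1 (hand-agnostic): propose contact-pair candidates from
    scene geometry. Toy version: every point pair, annotated with its
    separation width -- no hand model involved."""
    cands = []
    pts = list(object_points)
    for i, p in enumerate(pts):
        for q in pts[i + 1:]:
            width = sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
            cands.append((p, q, width))
    return cands

def decide_grasp(candidates, hand_max_width):
    """Stage 2 (per-hand): score candidates with a hand-specific rule.
    Toy version: the widest contact pair that still fits the hand;
    the paper learns this decision model from real trial and error."""
    feasible = [c for c in candidates if c[2] <= hand_max_width]
    return max(feasible, key=lambda c: c[2]) if feasible else None
```

Only stage 2 depends on the hand, which is why the expensive stage-1 model transfers across grippers while each new hand needs just hundreds of attempts.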
Aug 8, 2025 • 53min

Ep#6: FP3: A 3D Foundation Policy for Robotic Manipulation

Given limited data, 3D features generalize better than 2D image features. Learn how we can use them to train generalizable robot manipulation policies with Yang Gao.

Following their success in natural language processing and computer vision, foundation models pre-trained on large-scale multi-task datasets have also shown great potential in robotics. However, most existing robot foundation models rely solely on 2D image observations, ignoring 3D geometric information, which is essential for robots to perceive and reason about the 3D world. In this paper, we introduce FP3, a first large-scale 3D foundation policy model for robotic manipulation. FP3 builds on a scalable diffusion transformer architecture and is pre-trained on 60k trajectories with point cloud observations. With this model design and diverse pre-training data, FP3 can be efficiently fine-tuned for downstream tasks while exhibiting strong generalization capabilities. Experiments on real robots demonstrate that with only 80 demonstrations, FP3 is able to learn a new task with over 90% success rates in novel environments with unseen objects, significantly surpassing existing robot foundation models.

Paper on arXiv
Aug 8, 2025 • 1h

Ep#5: R+X: Retrieval and Execution from Everyday Human Videos

Human data is much more plentiful than robot data, and humans already know how to perform so many tasks. Teaching robots from human videos, then, has a ton of potential.

We present R+X, a framework which enables robots to learn skills from long, unlabelled, first-person videos of humans performing everyday tasks. Given a language command from a human, R+X first retrieves short video clips containing relevant behaviour, and then executes the skill by conditioning an in-context imitation learning method (KAT) on this behaviour. By leveraging a Vision Language Model (VLM) for retrieval, R+X does not require any manual annotation of the videos, and by leveraging in-context learning for execution, robots can perform commanded skills immediately, without requiring a period of training on the retrieved videos. Experiments studying a range of everyday household tasks show that R+X succeeds at translating unlabelled human videos into robust robot skills, and that R+X outperforms several recent alternative methods. Videos and code are available at this https URL.

Find the paper on arXiv.
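A runnable caricature of the retrieve-then-execute pipeline: keyword overlap stands in for the VLM retriever, and `in_context_policy` stands in for KAT conditioned on the retrieved clips; both names are ours, not the paper's API:

```python
def retrieve_clips(video_clips, command, top_k=2):
    """Toy retrieval: rank clips by word overlap between the command
    and each clip's caption. (R+X uses a Vision Language Model here;
    keyword overlap is just a runnable stand-in.)"""
    cmd = set(command.lower().split())
    scored = sorted(
        video_clips,
        key=lambda c: len(cmd & set(c["caption"].lower().split())),
        reverse=True,
    )
    return scored[:top_k]

def execute(command, video_clips, in_context_policy):
    """Retrieve, then execute: condition an in-context imitation
    policy on the retrieved clips. No per-task training happens."""
    context = retrieve_clips(video_clips, command)
    return in_context_policy(context, command)
```

The key property this preserves is that nothing is trained at command time: retrieval selects the context, and the frozen in-context policy acts immediately.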
