Recognition: no theorem link
Octo: An Open-Source Generalist Robot Policy
Pith reviewed 2026-05-11 00:21 UTC · model grok-4.3
The pith
A large policy pretrained on hundreds of thousands of robot trajectories can be finetuned in hours to work with new sensors and actions on different robots.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that a transformer-based policy trained on a large collection of robot manipulation trajectories can accept instructions through language or goal images and can be finetuned to new observation spaces and action spaces across nine different robotic platforms, all within a few hours on standard consumer graphics processing units, providing a versatile initialization for generalist policies.
What carries the argument
A transformer architecture that processes sequences of observations and instructions to predict actions, serving as the reusable initialization for adaptation to new robot configurations.
Load-bearing premise
That pretraining on varied robot data creates features general enough for effective transfer to new hardware through limited finetuning without needing lots of extra data or careful tuning.
What would settle it
An experiment on a new robot setup where finetuning the pretrained policy yields performance no better than training an identical model from random initialization on the same limited data would disprove the central result.
read the original abstract
Large policies pretrained on diverse robot datasets have the potential to transform robotic learning: instead of training new policies from scratch, such generalist robot policies may be finetuned with only a little in-domain data, yet generalize broadly. However, to be widely applicable across a range of robotic learning scenarios, environments, and tasks, such policies need to handle diverse sensors and action spaces, accommodate a variety of commonly used robotic platforms, and finetune readily and efficiently to new domains. In this work, we aim to lay the groundwork for developing open-source, widely applicable, generalist policies for robotic manipulation. As a first step, we introduce Octo, a large transformer-based policy trained on 800k trajectories from the Open X-Embodiment dataset, the largest robot manipulation dataset to date. It can be instructed via language commands or goal images and can be effectively finetuned to robot setups with new sensory inputs and action spaces within a few hours on standard consumer GPUs. In experiments across 9 robotic platforms, we demonstrate that Octo serves as a versatile policy initialization that can be effectively finetuned to new observation and action spaces. We also perform detailed ablations of design decisions for the Octo model, from architecture to training data, to guide future research on building generalist robot models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Octo, a large transformer-based policy pretrained on 800k trajectories from the Open X-Embodiment dataset. It claims that this model serves as a versatile initialization that can be effectively finetuned (via language or goal-image instructions) to new robotic platforms with different observation and action spaces, with experiments across 9 platforms and ablations on architecture and data choices to support future generalist robot model development.
Significance. If the central claim holds, Octo would provide a valuable open-source starting point for robot learning that reduces per-task data collection needs. The release of the model, code, and detailed ablations on design decisions (architecture, training data) are concrete strengths that can accelerate community progress on generalist policies.
major comments (2)
- [Experiments (across 9 platforms)] The experiments across 9 robotic platforms demonstrate successful finetuning, but the manuscript does not report results for an identical transformer architecture trained from random initialization on the same target data volume and compute budget. This comparison is needed to isolate whether the 800k-trajectory pretraining supplies transferable features or whether any sufficiently large model would succeed under the reported finetuning protocol.
- [Experiments and ablations] The weakest assumption—that modest finetuning on new sensors/actions yields reliable performance because of pretraining rather than hyperparameter choices or model scale—is not directly tested. Adding from-scratch baselines (or at minimum reporting the finetuning hyperparameter search effort and data volume per platform) would make the versatility claim load-bearing rather than suggestive.
minor comments (1)
- [Abstract] The abstract and introduction use 'a few hours on standard consumer GPUs' without specifying exact GPU type, batch size, or number of steps; adding these details would improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback and for recognizing the potential impact of Octo as an open-source generalist robot policy. We address the major comments below, focusing on strengthening the experimental validation.
read point-by-point responses
-
Referee: [Experiments (across 9 platforms)] The experiments across 9 robotic platforms demonstrate successful finetuning, but the manuscript does not report results for an identical transformer architecture trained from random initialization on the same target data volume and compute budget. This comparison is needed to isolate whether the 800k-trajectory pretraining supplies transferable features or whether any sufficiently large model would succeed under the reported finetuning protocol.
Authors: We concur that a from-scratch baseline with the same architecture, data volume, and compute would help isolate the contribution of pretraining. However, the scale of the model and the number of platforms make running these additional experiments infeasible with our current computational budget. Our ablations on architecture variants and subsets of the pretraining data provide indirect evidence for the benefits of large-scale pretraining. In the revised manuscript, we will include a dedicated limitations paragraph discussing this and report the exact finetuning data amounts and hyperparameter optimization details for each of the 9 platforms. revision: partial
-
Referee: [Experiments and ablations] The weakest assumption—that modest finetuning on new sensors/actions yields reliable performance because of pretraining rather than hyperparameter choices or model scale—is not directly tested. Adding from-scratch baselines (or at minimum reporting the finetuning hyperparameter search effort and data volume per platform) would make the versatility claim load-bearing rather than suggestive.
Authors: We agree that to make the claims more robust, additional controls are valuable. We will update the manuscript to explicitly report the data volume used for finetuning on each platform and describe the extent of hyperparameter tuning performed. While we cannot add full from-scratch runs, we believe the combination of cross-platform results and ablations supports the utility of the pretrained model as a starting point. We will revise the text to emphasize these points more clearly. revision: partial
- Full from-scratch baselines across all nine platforms under matched compute constraints
Circularity Check
No significant circularity; empirical results are independent of model inputs
full rationale
The paper's central claim is that a transformer policy pretrained on 800k Open X-Embodiment trajectories serves as a versatile initialization for finetuning to new observation and action spaces. This is supported by new experimental runs across 9 robotic platforms with held-out data, not by any derivation that reduces to the pretraining data or fitted parameters by construction. No equations, uniqueness theorems, or self-citations are used to derive performance claims; the work contains no mathematical derivation chain and reports direct empirical outcomes from finetuning protocols. The results are therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (2)
- transformer architecture hyperparameters
- pretraining and finetuning optimization settings
axioms (1)
- domain assumption A transformer can learn general robot control features from heterogeneous trajectory data collected across many embodiments.
Forward citations
Cited by 60 Pith papers
-
Open-H-Embodiment: A Large-Scale Dataset for Enabling Foundation Models in Medical Robotics
Open-H-Embodiment is the largest open multi-embodiment medical robotics dataset, used to train GR00T-H, the first open vision-language-action model that achieves end-to-end suturing completion where prior models fail.
-
Test-time Sparsity for Extreme Fast Action Diffusion
Test-time sparsity with a parallel pipeline and omnidirectional feature reuse accelerates action diffusion by 5x to 47.5 Hz while cutting FLOPs 92% with no performance loss.
-
MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving
MindVLA-U1 introduces a unified streaming VLA with shared backbone, framewise memory, and language-guided action diffusion that surpasses human drivers on WOD-E2E planning metrics.
-
Beyond World-Frame Action Heads: Motion-Centric Action Frames for Vision-Language-Action Models
MCF-Proto adds a motion-centric local action frame and prototype parameterization to VLA models, inducing emergent geometric structure and improved robustness from standard demonstrations alone.
-
VEGA: Visual Encoder Grounding Alignment for Spatially-Aware Vision-Language-Action Models
VEGA improves spatial reasoning in VLA models for robotics by aligning visual encoder features with 3D-supervised DINOv2 representations via a temporary projector and cosine similarity loss.
-
Trust Region Inverse Reinforcement Learning: Explicit Dual Ascent using Local Policy Updates
TRIRL enables explicit dual-ascent IRL via trust-region local policy updates that guarantee monotonic improvement without full RL solves per iteration, outperforming prior imitation methods by 2.4x aggregate IQM and r...
-
ECHO: Continuous Hierarchical Memory for Vision-Language-Action Models
ECHO organizes VLA experiences into a hierarchical memory tree in hyperbolic space via autoencoder and entailment constraints, delivering a 12.8% success-rate gain on LIBERO-Long over the pi0 baseline.
-
Learning to Communicate Locally for Large-Scale Multi-Agent Pathfinding
LC-MAPF uses multi-round local communication between neighboring agents in a pre-trained model to outperform prior learning-based MAPF solvers on diverse unseen scenarios while preserving scalability.
-
OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation
OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.
-
ReflectDrive-2: Reinforcement-Learning-Aligned Self-Editing for Discrete Diffusion Driving
ReflectDrive-2 achieves 91.0 PDMS on NAVSIM with camera input by training a discrete diffusion model to self-edit trajectories via RL-aligned AutoEdit.
-
Thinking in Text and Images: Interleaved Vision--Language Reasoning Traces for Long-Horizon Robot Manipulation
A multimodal transformer generates and caches interleaved text-image traces to guide closed-loop actions, achieving 92.4% success on LIBERO-Long and 95.5% average on LIBERO.
-
Characterizing Vision-Language-Action Models across XPUs: Constraints and Acceleration for On-Robot Deployment
VLA models exhibit a compute-bound VLM phase followed by a memory-bound action phase on edge hardware; DP-Cache and V-AEFusion reduce redundancy and enable pipeline parallelism for up to 6x speedup on NPUs with margin...
-
Mask World Model: Predicting What Matters for Robust Robot Policy Learning
Mask World Model predicts semantic mask dynamics with video diffusion and integrates it with a diffusion policy head, outperforming RGB world models on LIBERO and RLBench while showing better real-world generalization...
-
${\pi}_{0.7}$: a Steerable Generalist Robotic Foundation Model with Emergent Capabilities
π₀.₇ is a steerable generalist robotic model that uses rich multimodal prompts including language, subgoal images, and performance metadata to achieve out-of-the-box generalization across tasks and robot bodies.
-
STRONG-VLA: Decoupled Robustness Learning for Vision-Language-Action Models under Multimodal Perturbations
STRONG-VLA uses decoupled two-stage training to improve VLA model robustness, yielding up to 16% higher task success rates under seen and unseen perturbations on the LIBERO benchmark.
-
ViVa: A Video-Generative Value Model for Robot Reinforcement Learning
ViVa turns a video generator into a value model for robot RL that jointly forecasts future states and task value, yielding better performance on real-world box assembly when integrated with RECAP.
-
Action Images: End-to-End Policy Learning via Multiview Video Generation
Action Images turn robot arm motions into interpretable multiview pixel videos, letting video backbones serve as zero-shot policies for end-to-end robot learning.
-
BiCoord: A Bimanual Manipulation Benchmark towards Long-Horizon Spatial-Temporal Coordination
BiCoord is a new benchmark for long-horizon tightly coordinated bimanual manipulation that includes quantitative metrics and shows existing policies like DP, RDT, Pi0 and OpenVLA-OFT struggle on such tasks.
-
Diffusion Policy with Bayesian Expert Selection for Active Multi-Target Tracking
A Bayesian expert selection framework with variational Bayesian last layers and lower confidence bounds improves diffusion policies for active multi-target tracking.
-
VP-VLA: Visual Prompting as an Interface for Vision-Language-Action Models
VP-VLA decouples high-level reasoning from low-level control in VLA models by rendering spatial anchors as visual prompts directly in the RGB observation space, outperforming end-to-end baselines.
-
FrameSkip: Learning from Fewer but More Informative Frames in VLA Training
FrameSkip improves VLA policy training success from 66.50% to 76.15% by selecting high-importance frames and retaining only 20% of unique frames across three benchmarks.
-
Guide, Think, Act: Interactive Embodied Reasoning in Vision-Language-Action Models
GTA-VLA conditions VLA models on user spatial priors to produce a unified spatial-visual chain-of-thought, reaching 81.2% success on SimplerEnv WidowX and improving performance under out-of-distribution shifts.
-
Towards Long-horizon Embodied Agents with Tool-Aligned Vision-Language-Action Models
VLAs-as-Tools pairs a VLM planner with specialized VLA executors via a new interface and Tool-Aligned Post-Training to raise long-horizon robot success rates on LIBERO-Long and RoboTwin benchmarks.
-
Overcoming Dynamics-Blindness: Training-Free Pace-and-Path Correction for VLA Models
Pace-and-Path Correction is a closed-form inference-time operator that decomposes a quadratic cost minimization into orthogonal pace compression and path offset channels to correct dynamics-blindness in chunked-action...
-
HarmoWAM: Harmonizing Generalizable and Precise Manipulation via Adaptive World Action Models
HarmoWAM unifies predictive and reactive control in world action models via an adaptive gating mechanism to deliver improved zero-shot generalization and precision in robotic manipulation.
-
PriorVLA: Prior-Preserving Adaptation for Vision-Language-Action Models
PriorVLA preserves pretrained priors in VLA models through a frozen Prior Expert and trained Adaptation Expert, delivering better robot manipulation performance than full fine-tuning with only 25% of the parameter updates.
-
Unified Noise Steering for Efficient Human-Guided VLA Adaptation
UniSteer unifies human corrective actions and noise-space RL for VLA adaptation by inverting actions to noise targets, raising success rates from 20% to 90% in 66 minutes across four real-world manipulation tasks.
-
Retrieve-then-Steer: Online Success Memory for Test-Time Adaptation of Generative VLAs
Retrieve-then-steer stores successful observation-action segments in memory, retrieves relevant chunks, filters them, and uses an elite prior with confidence-adaptive guidance to steer a flow-matching action sampler f...
-
Retrieve-then-Steer: Online Success Memory for Test-Time Adaptation of Generative VLAs
A retrieve-then-steer method stores successful robot actions in memory and uses them to steer a frozen VLA's flow-matching sampler for better test-time reliability without parameter updates.
-
Kintsugi: Learning Policies by Repairing Executable Knowledge Bases
Kintsugi learns policies by repairing composable executable knowledge bases through agentic diagnosis, localized typed edits, and deterministic verification gates that admit only improvements.
-
One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy
Reducing visual input to one token per frame via adaptive attention pooling and a unified flow-matching objective improves long-horizon performance in VLA policies on MetaWorld, LIBERO, and real-robot tasks.
-
One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy
Reducing visual input to one token per frame in world models for vision-language-action policies maintains long-horizon performance while improving success rates on MetaWorld, LIBERO, and real-robot tasks.
-
Learning to Communicate Locally for Large-Scale Multi-Agent Pathfinding
LC-MAPF is a decentralized MAPF solver that uses a learnable multi-round communication module among nearby agents to outperform prior IL and RL methods while preserving scalability.
-
Escaping the Diversity Trap in Robotic Manipulation via Anchor-Centric Adaptation
Anchor-Centric Adaptation escapes the diversity trap by prioritizing repeated demonstrations at core anchors over broad coverage, yielding higher success rates under fixed data budgets in robotic manipulation.
-
Toward Visually Realistic Simulation: A Benchmark for Evaluating Robot Manipulation in Simulation
VISER is a new visually realistic simulation benchmark for robot manipulation tasks that uses PBR materials and MLLM-assisted asset generation, achieving 0.92 Pearson correlation with real-world policy performance.
-
TriRelVLA: Triadic Relational Structure for Generalizable Embodied Manipulation
TriRelVLA introduces triadic object-hand-task relational representations and a task-grounded graph transformer with a relational bottleneck to improve generalization in robotic manipulation across scenes, objects, and tasks.
-
When Life Gives You BC, Make Q-functions: Extracting Q-values from Behavior Cloning for On-Robot Reinforcement Learning
Q2RL extracts Q-functions from BC policies via minimal interactions and applies Q-gating to enable stable offline-to-online RL, outperforming baselines on manipulation benchmarks and achieving up to 100% success on-robot.
-
ConsisVLA-4D: Advancing Spatiotemporal Consistency in Efficient 3D-Perception and 4D-Reasoning for Robotic Manipulation
ConsisVLA-4D adds cross-view semantic alignment, cross-object geometric fusion, and cross-scene dynamic reasoning to VLA models, delivering 21.6% and 41.5% gains plus 2.3x and 2.4x speedups on LIBERO and real-world tasks.
-
ReflectDrive-2: Reinforcement-Learning-Aligned Self-Editing for Discrete Diffusion Driving
ReflectDrive-2 combines masked discrete diffusion with RL-aligned self-editing to generate and refine driving trajectories, reaching 91.0 PDMS on NAVSIM camera-only and 94.8 in best-of-6.
-
An Efficient Metric for Data Quality Measurement in Imitation Learning
Power spectral density of trajectories ranks demonstration quality for imitation learning, enabling rollout-free curation that improves fine-tuned policy success.
-
Decompose and Recompose: Reasoning New Skills from Existing Abilities for Cross-Task Robotic Manipulation
Decompose and Recompose decomposes seen robotic demonstrations into skill-action alignments and recomposes them via visual-semantic retrieval and planning to enable zero-shot cross-task generalization.
-
VLA-ATTC: Adaptive Test-Time Compute for VLA Models with Relative Action Critic Model
VLA-ATTC equips VLA models with adaptive test-time compute via an uncertainty clutch and relative action critic, cutting failure rates by over 50% on LIBERO-LONG.
-
Learning while Deploying: Fleet-Scale Reinforcement Learning for Generalist Robot Policies
Fleet-scale RL framework improves a single generalist VLA policy from deployment data to 95% average success on eight real-world manipulation tasks with 16 dual-arm robots.
-
$M^2$-VLA: Boosting Vision-Language Models for Generalizable Manipulation via Layer Mixture and Meta-Skills
M²-VLA shows that generalized VLMs can serve as direct backbones for robotic manipulation by selectively extracting task-critical features via Mixture of Layers and adding Meta Skill Modules for efficient trajectory learning.
-
AsyncShield: A Plug-and-Play Edge Adapter for Asynchronous Cloud-based VLA Navigation
AsyncShield restores VLA geometric intent from latency via kinematic pose mapping and uses PPO-Lagrangian to balance tracking with LiDAR safety constraints in a plug-and-play module.
-
Breaking Lock-In: Preserving Steerability under Low-Data VLA Post-Training
DeLock mitigates lock-in in low-data VLA post-training via visual grounding preservation and test-time contrastive prompt guidance, outperforming baselines across eight evaluations while matching data-heavy generalist...
-
GazeVLA: Learning Human Intention for Robotic Manipulation
GazeVLA pretrains on large human egocentric datasets to capture gaze-based intention, then finetunes on limited robot data with chain-of-thought reasoning to achieve better robotic manipulation performance than baselines.
-
CorridorVLA: Explicit Spatial Constraints for Generative Action Heads via Sparse Anchors
CorridorVLA improves VLA models by using predicted sparse anchors to impose explicit spatial corridors on action trajectories, yielding 3.4-12.4% success rate gains on LIBERO-Plus with GR00T-Corr reaching 83.21%.
-
Exploring High-Order Self-Similarity for Video Understanding
The MOSS module learns and combines multi-order space-time self-similarity features to enhance temporal dynamics modeling in videos across action recognition, VQA, and robotic tasks.
-
Temporal Difference Calibration in Sequential Tasks: Application to Vision-Language-Action Models
Temporal difference calibration aligns uncertainty estimates in vision-language-action models with their value functions for better sequential performance.
-
UniT: Toward a Unified Physical Language for Human-to-Humanoid Policy Learning and World Modeling
UniT creates a unified physical language via visual anchoring and tri-branch reconstruction to enable scalable human-to-humanoid transfer for policy learning and world modeling.
-
HELM: Harness-Enhanced Long-horizon Memory for Vision-Language-Action Manipulation
HELM raises long-horizon VLA success from 58.4% to 81.5% on LIBERO-LONG by combining episodic memory retrieval, learned failure prediction, and replanning, outperforming context extension or adaptation alone.
-
ST-$\pi$: Structured SpatioTemporal VLA for Robotic Manipulation
ST-π structures VLA models by having a spatiotemporal VLM produce causally ordered chunk-level prompts that guide a dual-generator action expert to jointly handle spatial and temporal control in robotic manipulation.
-
OFlow: Injecting Object-Aware Temporal Flow Matching for Robust Robotic Manipulation
OFlow unifies temporal foresight and object-aware reasoning inside a shared latent space via flow matching to improve VLA robustness in robotic manipulation under distribution shifts.
-
A Hierarchical Spatiotemporal Action Tokenizer for In-Context Imitation Learning in Robotics
A two-level hierarchical vector quantization tokenizer that clusters actions spatially and temporally achieves new state-of-the-art results in in-context imitation learning for robotics.
-
Robotic Manipulation is Vision-to-Geometry Mapping ($f(v) \rightarrow G$): Vision-Geometry Backbones over Language and Video Models
Vision-geometry backbones using pretrained 3D world models outperform vision-language and video models for robotic manipulation by enabling direct mapping from visual input to geometric actions.
-
DexWorldModel: Causal Latent World Modeling towards Automated Learning of Embodied Tasks
CLWM with DINOv3 targets, O(1) TTT memory, SAI latency masking, and EmbodiChain training achieves SOTA dual-arm simulation performance and zero-shot sim-to-real transfer that beats real-data finetuned baselines.
-
MARL-GPT: Foundation Model for Multi-Agent Reinforcement Learning
A single transformer model trained offline on expert trajectories from three distinct MARL environments achieves competitive performance against specialized baselines without per-task tuning.
-
SnapFlow: One-Step Action Generation for Flow-Matching VLAs via Progressive Self-Distillation
SnapFlow compresses multi-step denoising in flow-matching VLAs into one step via progressive self-distillation using two-step Euler shortcuts from marginal velocities, matching 10-step teacher success rates with 9.6x ...
-
Uncovering Linguistic Fragility in Vision-Language-Action Models via Diversity-Aware Red Teaming
DAERT generates diverse adversarial instructions via a uniform policy in RL to drop VLA task success rates from 93.33% to 5.85% on benchmarks with models like π0 and OpenVLA.
Reference graph
Works this paper leans on
-
[1]
Introducing scale’s automotive foundation model, 2023
Scale AI. Introducing scale’s automotive foundation model, 2023. URL https://scale.com/blog/afm1
work page 2023
-
[2]
Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, Pieter Abbeel, and Wojciech Zaremba. Hindsight experience replay. In NeurIPS, 2017
work page 2017
-
[3]
Human-to-robot imitation in the wild
Shikhar Bahl, Abhinav Gupta, and Deepak Pathak. Human-to-robot imitation in the wild. arXiv preprint arXiv:2207.09450, 2022
-
[4]
Affordances from human videos as a versatile representation for robotics
Shikhar Bahl, Russell Mendonca, Lili Chen, Unnat Jain, and Deepak Pathak. Affordances from human videos as a versatile representation for robotics. In CVPR, 2023
work page 2023
-
[5]
Hydra: Hybrid robot actions for imitation learning
Suneel Belkhale, Yuchen Cui, and Dorsa Sadigh. Hydra: Hybrid robot actions for imitation learning. arxiv, 2023
work page 2023
-
[6]
Homanga Bharadhwaj, Jay Vakil, Mohit Sharma, Ab- hinav Gupta, Shubham Tulsiani, and Vikash Kumar. Roboagent: Generalization and efficiency in robot manip- ulation via semantic augmentations and action chunking. arXiv preprint arXiv:2309.01918 , 2023
- [7]
-
[8]
End to End Learning for Self-Driving Cars
Mariusz Bojarski, Davide Del Testa, Daniel Dworakowski, Bernhard Firner, Beat Flepp, Prasoon Goyal, Lawrence D Jackel, Mathew Monfort, Urs Muller, Jiakai Zhang, et al. End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316 , 2016
work page internal anchor Pith review arXiv 2016
-
[9]
RoboCat : A self-improving foundation agent for robotic manipulation
Konstantinos Bousmalis, Giulia Vezzani, Dushyant Rao, Coline Devin, Alex X Lee, Maria Bauza, Todor Davchev, Yuxiang Zhou, Agrim Gupta, Akhil Raju, et al. Robocat: A self-improving foundation agent for robotic manipu- lation. arXiv preprint arXiv:2306.11706 , 2023
-
[10]
RT-1: Robotics Transformer for Real-World Control at Scale
Anthony Brohan, Noah Brown, Justice Carbajal, Yev- gen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817 , 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[11]
Do as i can, not as i say: Grounding language in robotic affordances
Anthony Brohan, Yevgen Chebotar, Chelsea Finn, Karol Hausman, Alexander Herzog, Daniel Ho, Julian Ibarz, Alex Irpan, Eric Jang, Ryan Julian, et al. Do as i can, not as i say: Grounding language in robotic affordances. In Conference on Robot Learning , pages 287–318. PMLR, 2023
work page 2023
-
[12]
Serkan Cabi, Sergio Gómez Colmenarejo, Alexander Novikov, Ksenia Konyushkova, Scott Reed, Rae Jeong, Konrad Zolna, Yusuf Aytar, David Budden, Mel Vecerik, Oleg Sushkov, David Barker, Jonathan Scholz, Misha Denil, Nando de Freitas, and Ziyu Wang. Scaling data- driven robotics with reward sketching and batch rein- forcement learning. arXiv preprint arXiv:19...
-
[13]
nuscenes: A multimodal dataset for autonomous driving
Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh V ora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages 11621–11631, 2020
work page 2020
-
[14]
Berkeley UR5 demonstration dataset
Lawrence Yunliang Chen, Simeon Adebola, and Ken Goldberg. Berkeley UR5 demonstration dataset. https: //sites.google.com/view/berkeley-ur5/home
-
[15]
Vision-language models provide promptable rep- resentations for reinforcement learning
William Chen, Oier Mees, Aviral Kumar, and Sergey Levine. Vision-language models provide promptable rep- resentations for reinforcement learning. arXiv preprint arXiv:2402.02651, 2024
-
[16]
Genaug: Retargeting behaviors to unseen situ- ations via generative augmentation, 2023
Zoey Chen, Sho Kiami, Abhishek Gupta, and Vikash Kumar. Genaug: Retargeting behaviors to unseen situations via generative augmentation. arXiv preprint arXiv:2302.06671, 2023
-
[17]
Dif- fusion policy: Visuomotor policy learning via action diffusion
Cheng Chi, Siyuan Feng, Yilun Du, Zhenjia Xu, Eric Cousineau, Benjamin Burchfiel, and Shuran Song. Dif- fusion policy: Visuomotor policy learning via action diffusion. In Proceedings of Robotics: Science and Systems (RSS), 2023
work page 2023
-
[18]
From play to policy: Conditional behavior generation from uncurated robot data
Zichen Jeff Cui, Yibin Wang, Nur Muhammad Mahi Shafiullah, and Lerrel Pinto. From play to policy: Conditional behavior generation from uncurated robot data. In The Eleventh International Conference on Learning Representations, 2022
work page 2022
-
[19]
Robonet: Large-scale multi-robot learning
Sudeep Dasari, Frederik Ebert, Stephen Tian, Suraj Nair, Bernadette Bucher, Karl Schmeckpeper, Siddharth Singh, Sergey Levine, and Chelsea Finn. Robonet: Large-scale multi-robot learning. In Conference on Robot Learning , pages 885–897. PMLR, 2020
work page 2020
-
[20]
Shivin Dass, Jullian Yapeter, Jesse Zhang, Jiahui Zhang, Karl Pertsch, Stefanos Nikolaidis, and Joseph J. Lim. CLVR jaco play dataset, 2023. URL https://github.com/ clvrai/clvr_jaco_play_dataset
work page 2023
-
[21]
Causal confusion in imitation learning
Pim de Haan, Dinesh Jayaraman, and Sergey Levine. Causal confusion in imitation learning. NeurIPS, 2019
work page 2019
-
[22]
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2010
-
[23]
PaLM-E: An Embodied Multimodal Language Model
Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. Palm- e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[24]
Behavior retrieval: Few-shot imitation learning by querying unlabeled datasets, 2023
Maximilian Du, Suraj Nair, Dorsa Sadigh, and Chelsea Finn. Behavior retrieval: Few-shot imitation learning by querying unlabeled datasets. ArXiv, abs/2304.08742,
-
[25]
URL https://api.semanticscholar.org/CorpusID: 258186973
-
[26]
Bridge Data: Boosting Generalization of Robotic Skills with Cross-Domain Datasets
Frederik Ebert, Yanlai Yang, Karl Schmeckpeper, Bernadette Bucher, Georgios Georgakis, Kostas Dani- ilidis, Chelsea Finn, and Sergey Levine. Bridge data: Boosting generalization of robotic skills with cross- domain datasets. arXiv preprint arXiv:2109.13396, 2021
work page internal anchor Pith review arXiv 2021
-
[27]
Rh20t: A comprehensive robotic dataset for learning diverse skills in one-shot
Hao-Shu Fang, Hongjie Fang, Zhenyu Tang, Jirong Liu, Chenxi Wang, Junbo Wang, Haoyi Zhu, and Cewu Lu. Rh20t: A comprehensive robotic dataset for learning diverse skills in one-shot. Towards Generalist Robots: Learning Paradigms for Scalable Skill Acquisition@ CoRL2023, 3:5, 2023
work page 2023
-
[28]
Deep visual foresight for planning robot motion
Chelsea Finn and Sergey Levine. Deep visual foresight for planning robot motion. In 2017 IEEE International Conference on Robotics and Automation (ICRA) , pages 2786–2793. IEEE, 2017
work page 2017
-
[29]
Mobile ALOHA: Learning Bimanual Mobile Manipulation with Low-Cost Whole-Body Teleoperation
Zipeng Fu, Tony Z Zhao, and Chelsea Finn. Mobile aloha: Learning bimanual mobile manipulation with low-cost whole-body teleoperation. arXiv preprint arXiv:2401.02117, 2024
work page internal anchor Pith review arXiv 2024
-
[30]
Are we ready for autonomous driving? the kitti vision benchmark suite
Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. In 2012 IEEE conference on computer vision and pattern recognition , pages 3354–3361. IEEE, 2012
work page 2012
-
[31]
Robot learning in homes: Improving generalization and reducing dataset bias
Abhinav Gupta, Adithyavairavan Murali, Dhiraj Prakashchand Gandhi, and Lerrel Pinto. Robot learning in homes: Improving generalization and reducing dataset bias. Advances in neural information processing systems, 31, 2018
work page 2018
-
[32]
Scaling up and distilling down: Language-guided robot skill acquisition
Huy Ha, Pete Florence, and Shuran Song. Scaling up and distilling down: Language-guided robot skill acquisition. In Conference on Robot Learning , pages 3766–3777. PMLR, 2023
work page 2023
-
[33]
Deep residual learning for image recognition
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition , pages 770–778, 2016
work page 2016
-
[34]
Minho Heo, Youngwoon Lee, Doohyun Lee, and Joseph J. Lim. Furniturebench: Reproducible real-world benchmark for long-horizon complex manipulation. In Robotics: Science and Systems , 2023
work page 2023
-
[35]
Denoising diffusion probabilistic models
Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems , 33:6840–6851, 2020
work page 2020
-
[36]
Gaia-1: A generative world model for autonomous driving, 2023
Anthony Hu, Lloyd Russell, Hudson Yeo, Zak Murez, George Fedoseev, Alex Kendall, Jamie Shotton, and Gianluca Corrado. Gaia-1: A generative world model for autonomous driving, 2023
work page 2023
-
[37]
Visual language maps for robot navigation
Chenguang Huang, Oier Mees, Andy Zeng, and Wolfram Burgard. Visual language maps for robot navigation. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 10608–10615. IEEE, 2023
work page 2023
-
[38]
Audio visual language maps for robot naviga- tion
Chenguang Huang, Oier Mees, Andy Zeng, and Wolfram Burgard. Audio visual language maps for robot naviga- tion. In Proceedings of the International Symposium on Experimental Robotics (ISER) , Chiang Mai, Thailand, 2023
work page 2023
-
[39]
VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models
Wenlong Huang, Chen Wang, Ruohan Zhang, Yunzhu Li, Jiajun Wu, and Li Fei-Fei. V oxposer: Composable 3d value maps for robotic manipulation with language models. arXiv preprint arXiv:2307.05973 , 2023
work page internal anchor Pith review arXiv 2023
-
[40]
Bc-z: Zero-shot task generalization with robotic imitation learning
Eric Jang, Alex Irpan, Mohi Khansari, Daniel Kappler, Frederik Ebert, Corey Lynch, Sergey Levine, and Chelsea Finn. Bc-z: Zero-shot task generalization with robotic imitation learning. In Conference on Robot Learning , pages 991–1002. PMLR, 2022
work page 2022
-
[41]
VIMA: Robot manipulation with multimodal prompts
Yunfan Jiang, Agrim Gupta, Zichen Zhang, Guanzhi Wang, Yongqiang Dou, Yanjun Chen, Li Fei-Fei, Anima Anandkumar, Yuke Zhu, and Linxi Fan. VIMA: Robot manipulation with multimodal prompts. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, Proceedings of the 40th International Conference on M...
work page 2023
-
[42]
Qt-opt: Scalable deep reinforcement learning for vision-based robotic manipulation, 2018
Dmitry Kalashnikov, Alex Irpan, Peter Pastor, Julian Ibarz, Alexander Herzog, Eric Jang, Deirdre Quillen, Ethan Holly, Mrinal Kalakrishnan, Vincent Vanhoucke, et al. QT-Opt: Scalable deep reinforcement learning for vision-based robotic manipulation. arXiv preprint arXiv:1806.10293, 2018
-
[43]
Scaling up multi- task robotic reinforcement learning
Dmitry Kalashnikov, Jake Varley, Yevgen Chebotar, Benjamin Swanson, Rico Jonschkowski, Chelsea Finn, Sergey Levine, and Karol Hausman. Scaling up multi- task robotic reinforcement learning. In 5th Annual Conference on Robot Learning , 2021
work page 2021
-
[44]
Haresh Karnan, Anirudh Nair, Xuesu Xiao, Garrett Warnell, Sören Pirk, Alexander Toshev, Justin Hart, Joydeep Biswas, and Peter Stone. Socially compliant navigation dataset (scand): A large-scale dataset of demonstrations for social navigation. IEEE Robotics and Automation Letters , 7(4):11807–11814, 2022
work page 2022
-
[45]
Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, et al. Segment Anything, April 2023
work page 2023
-
[46]
Robohive: A unified framework for robot learning
Vikash Kumar, Rutav Shah, Gaoyue Zhou, Vincent Moens, Vittorio Caggiano, Abhishek Gupta, and Ar- avind Rajeswaran. Robohive: A unified framework for robot learning. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023. URL https://openreview.net/ forum?id=0H5fRQcpQ7
work page 2023
-
[47]
Language models as zero-shot trajectory generators
Teyun Kwon, Norman Di Palo, and Edward Johns. Language models as zero-shot trajectory generators. arXiv preprint arXiv:2310.11604 , 2023
-
[48]
End-to-end training of deep visuomotor policies
Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research , 17(1):1334– 1373, 2016
work page 2016
-
[49]
Sergey Levine, Peter Pastor, Alex Krizhevsky, Julian Ibarz, and Deirdre Quillen. Learning hand-eye coor- dination for robotic grasping with deep learning and large-scale data collection. The International journal of robotics research, 37(4-5):421–436, 2018
work page 2018
-
[50]
Wang, Giovanni Sutanto, Ak- shara Rai, and Franziska Meier
Yixin Lin, Austin S. Wang, Giovanni Sutanto, Ak- shara Rai, and Franziska Meier. Polymetis. https: //facebookresearch.github.io/fairo/polymetis/, 2021
work page 2021
-
[51]
Robot learning on the job: Human- in-the-loop autonomy and learning during deployment
Huihan Liu, Soroush Nasiriany, Lance Zhang, Zhiyao Bao, and Yuke Zhu. Robot learning on the job: Human- in-the-loop autonomy and learning during deployment. In Robotics: Science and Systems (RSS) , 2023
work page 2023
-
[52]
Decoupled weight decay regularization
Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2018
work page 2018
-
[53]
Multi-stage cable routing through hierarchical imitation learning, 2024
Jianlan Luo, Charles Xu, Xinyang Geng, Gilbert Feng, Kuan Fang, Liam Tan, Stefan Schaal, and Sergey Levine. Multi-stage cable routing through hierarchical imitation learning. arXiv preprint arXiv:2307.08927 , 2023
-
[54]
Fmb: a functional manipulation benchmark for generalizable robotic learning
Jianlan Luo, Charles Xu, Fangchen Liu, Liam Tan, Zipeng Lin, Jeffrey Wu, Pieter Abbeel, and Sergey Levine. Fmb: a functional manipulation benchmark for generalizable robotic learning. arXiv preprint arXiv:2401.08553, 2024
-
[55]
Language conditioned imitation learning over unstructured data
Corey Lynch and Pierre Sermanet. Language conditioned imitation learning over unstructured data. In RSS, 2021
work page 2021
-
[56]
Interactive language: Talking to robots in real time
Corey Lynch, Ayzaan Wahid, Jonathan Tompson, Tianli Ding, James Betker, Robert Baruch, Travis Armstrong, and Pete Florence. Interactive language: Talking to robots in real time. IEEE Robotics and Automation Letters, 2023
work page 2023
-
[57]
Where are we in the search for an artificial visual cortex for embodied intelligence? 2023
Arjun Majumdar, Karmesh Yadav, Sergio Arnaud, Yecheng Jason Ma, Claire Chen, Sneha Silwal, Aryan Jain, Vincent-Pierre Berges, Pieter Abbeel, Jitendra Malik, Dhruv Batra, Yixin Lin, Oleksandr Maksymets, Aravind Rajeswaran, and Franziska Meier. Where are we in the search for an artificial visual cortex for embodied intelligence? 2023
work page 2023
-
[58]
Where are we in the search for an artificial vi- sual cortex for embodied intelligence?
Arjun Majumdar, Karmesh Yadav, Sergio Arnaud, Yecheng Jason Ma, Claire Chen, Sneha Silwal, Aryan Jain, Vincent-Pierre Berges, Pieter Abbeel, Jitendra Malik, et al. Where are we in the search for an artificial visual cortex for embodied intelligence? arXiv preprint arXiv:2303.18240, 2023
-
[59]
Ajay Mandlekar, Yuke Zhu, Animesh Garg, Jonathan Booher, Max Spero, Albert Tung, Julian Gao, John Emmons, Anchit Gupta, Emre Orbay, Silvio Savarese, and Li Fei-Fei. RoboTurk: A crowdsourcing platform for robotic skill learning through imitation. CoRR, abs/1811.02790, 2018. URL http://arxiv.org/abs/1811. 02790
-
[60]
Roboturk: A crowdsourcing platform for robotic skill learning through imitation
Ajay Mandlekar, Yuke Zhu, Animesh Garg, Jonathan Booher, Max Spero, Albert Tung, Julian Gao, John Emmons, Anchit Gupta, Emre Orbay, et al. Roboturk: A crowdsourcing platform for robotic skill learning through imitation. In Conference on Robot Learning , pages 879–
-
[61]
Mimicgen: A data generation system for scalable robot learning using human demonstrations
Ajay Mandlekar, Soroush Nasiriany, Bowen Wen, Ireti- ayo Akinola, Yashraj Narang, Linxi Fan, Yuke Zhu, and Dieter Fox. Mimicgen: A data generation system for scalable robot learning using human demonstrations. In 7th Annual Conference on Robot Learning , 2023
work page 2023
-
[62]
What matters in language conditioned robotic imitation learning over unstructured data
Oier Mees, Lukas Hermann, and Wolfram Burgard. What matters in language conditioned robotic imitation learning over unstructured data. IEEE Robotics and Automation Letters, 7(4):11205–11212, 2022
work page 2022
-
[63]
Oier Mees, Lukas Hermann, Erick Rosete-Beas, and Wolfram Burgard. Calvin: A benchmark for language- conditioned policy learning for long-horizon robot ma- nipulation tasks. IEEE Robotics and Automation Letters (RA-L), 7(3):7327–7334, 2022
work page 2022
-
[64]
Grounding language with visual affordances over un- structured data
Oier Mees, Jessica Borja-Diaz, and Wolfram Burgard. Grounding language with visual affordances over un- structured data. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), London, UK, 2023
work page 2023
-
[65]
Structured world models from human videos
Russell Mendonca, Shikhar Bahl, and Deepak Pathak. Structured world models from human videos. CoRL, 2023
work page 2023
-
[66]
Learning and retrieval from prior data for skill-based imitation learning
Soroush Nasiriany, Tian Gao, Ajay Mandlekar, and Yuke Zhu. Learning and retrieval from prior data for skill-based imitation learning. In Conference on Robot Learning (CoRL), 2022
work page 2022
-
[67]
Im- proved denoising diffusion probabilistic models
Alexander Quinn Nichol and Prafulla Dhariwal. Im- proved denoising diffusion probabilistic models. In International Conference on Machine Learning , pages 8162–8171. PMLR, 2021
work page 2021
-
[68]
Open X-Embodiment Collaboration, Abhishek Padalkar, Acorn Pooley, Ajinkya Jain, Alex Bewley, Alex Her- zog, Alex Irpan, Alexander Khazatsky, Anant Rai, Anikait Singh, Anthony Brohan, Antonin Raffin, Ayzaan Wahid, Ben Burgess-Limerick, Beomjoon Kim, Bernhard Schölkopf, Brian Ichter, Cewu Lu, Charles Xu, Chelsea Finn, Chenfeng Xu, Cheng Chi, Chenguang Huang...
work page internal anchor Pith review arXiv 2023
- [69]
-
[70]
The surprising effectiveness of representation learning for visual imita- tion, 2021
Jyothish Pari, Nur Muhammad Shafiullah, Sridhar Pan- dian Arunachalam, and Lerrel Pinto. The surprising effectiveness of representation learning for visual imita- tion, 2021
work page 2021
-
[71]
Film: Visual reasoning with a general conditioning layer
Ethan Perez, Florian Strub, Harm De Vries, Vincent Dumoulin, and Aaron Courville. Film: Visual reasoning with a general conditioning layer. In Proceedings of the AAAI conference on artificial intelligence , volume 32, 2018
work page 2018
-
[72]
Supersizing self- supervision: Learning to grasp from 50k tries and 700 robot hours
Lerrel Pinto and Abhinav Gupta. Supersizing self- supervision: Learning to grasp from 50k tries and 700 robot hours. In 2016 IEEE international conference on robotics and automation (ICRA) , pages 3406–3413. IEEE, 2016
work page 2016
-
[73]
Shared Control Templates for Assistive Robotics
Gabriel Quere, Annette Hagengruber, Maged Iskandar, Samuel Bustamante, Daniel Leidner, Freek Stulp, and Joern V ogel. Shared Control Templates for Assistive Robotics. In 2020 IEEE International Conference on Robotics and Automation (ICRA) , page 7, Paris, France, 2020
work page 2020
-
[74]
Robot learning with sensorimotor pre-training
Ilija Radosavovic, Baifeng Shi, Letian Fu, Ken Goldberg, Trevor Darrell, and Jitendra Malik. Robot learning with sensorimotor pre-training. Conference on Robot Learning, 2023
work page 2023
-
[75]
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research , 21(140):1–67, 2020. URL http://jmlr.org/papers/v21/20-074.html
work page 2020
-
[76]
Scott Reed, Konrad Zolna, Emilio Parisotto, Ser- gio Gómez Colmenarejo, Alexander Novikov, Gabriel Barth-maron, Mai Giménez, Yury Sulsky, Jackie Kay, Jost Tobias Springenberg, et al. A generalist agent. Transactions on Machine Learning Research , 2022
work page 2022
-
[77]
High-Resolution Image Synthesis with Latent Diffusion Models, April 2022
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-Resolution Image Synthesis with Latent Diffusion Models, April 2022
work page 2022
-
[78]
Latent plans for task agnostic offline reinforcement learning
Erick Rosete-Beas, Oier Mees, Gabriel Kalweit, Joschka Boedecker, and Wolfram Burgard. Latent plans for task agnostic offline reinforcement learning. In Proceedings of the 6th Conference on Robot Learning (CoRL) , 2022
work page 2022
-
[79]
Multi-resolution sensing for real-time control with vision- language models
Saumya Saxena, Mohit Sharma, and Oliver Kroemer. Multi-resolution sensing for real-time control with vision- language models. In 7th Annual Conference on Robot Learning, 2023. URL https://openreview.net/forum?id= WuBv9-IGDUA
work page 2023
-
[80]
Nur Muhammad Mahi Shafiullah, Anant Rai, Haritheja Etukuru, Yiqian Liu, Ishan Misra, Soumith Chintala, and Lerrel Pinto. On bringing robots home, 2023
work page 2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.