pith. machine review for the scientific record.

arxiv: 2403.12945 · v2 · submitted 2024-03-19 · 💻 cs.RO

Recognition: 1 theorem link

DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

Abdul Rehman, Abhinav Gupta, Abhiram Maddukuri, Abhishek Gupta, Abigail O'Neill, Abraham Lee, Albert Zhan, Alexander Khazatsky, Andrew E. Wang, Annie Xie, Archit Sharma, Arefeh Yavary, Arhan Jain, Ashwin Balakrishna, Blake Wulfe, Charlotte Le, Chelsea Finn, Cheng Chi, Christopher Agia, Cody Simpson, Daniel Morton, Daphne Chen, David Antonio Herrera, Derick Seale, Dinesh Jayaraman, Donovon Jackson, Dorsa Sadigh, Emi Tran, Ethan Paul Foster, Glen Berseth, Homer Rich Walke, Huy Ha, Ilija Radosavovic, Jaimyn Drake, Jean Mercat, Jeannette Bohg, Jensen Gao, Jiaheng Hu, Jiajun Wu, Jimmy Wu, Jingpei Lu, Jingyun Yang, Jitendra Malik, Joey Hejna, Jonathan Heewon Yang, Joseph J Lim, Kaiyuan Wang, Karl Pertsch, Ken Goldberg, Kevin Black, Kevin Lin, Kirsty Ellis, Kyle Beltran Hatch, Kyle Hsu, Lawrence Yunliang Chen, Marion Lepert, Marius Memmel, Masha Itkina, Mateo Guaman Castro, Michael C. Yip, Minho Heo, Mohan Kumar Srirama, Muhammad Zubair Irshad, Osbert Bastani, Pannag R Sanketi, Patrick Tree Miller, Patrick Yin, Peter David Fagan, Qiuyu Chen, Quan Vuong, Roberto Martín-Martín, Rohan Baijal, Rosario Scalise, Roy Lin, Sergey Levine, Shan Lin, Shivin Dass, Shuran Song, Siddharth Karamcheti, Soroush Nasiriany, Stephen Tian, Subramanian Ramamoorthy, Sudeep Dasari, Suneel Belkhale, Sungjae Park, Suraj Nair, Suvir Mirchandani, Ted Xiao, Thomas Kollar, Tony Nguyen, Tony Z. Zhao, Trinity Chung, Victor Son, Vitor Guizilini, Yecheng Jason Ma, Yilin Wu, Youngwoon Lee, Yuke Zhu, Yunchu Zhang, Yunshuang Li, Zehan Ma

Pith reviewed 2026-05-11 05:47 UTC · model grok-4.3

classification 💻 cs.RO
keywords robot manipulation · imitation learning · dataset · generalization · demonstration learning · in-the-wild robotics · policy training · real-world data

The pith

Training on the DROID dataset of 350 hours of real-world robot manipulation data produces policies with higher performance and stronger generalization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that a very large robot manipulation dataset collected across many different real-world scenes and tasks can be used to train policies that succeed more often and transfer better to new situations than policies trained on smaller, more controlled datasets. Current robot learning is held back by the difficulty and cost of gathering diverse data outside the lab, so most policies remain narrow. By releasing 76,000 trajectories gathered by non-experts across 564 scenes and 84 tasks, the work tests whether scale and variety in ordinary environments can overcome those limits. The reported result is that models trained with this data outperform those trained without it on both seen and unseen tasks.

Core claim

Policies trained with the DROID dataset achieve higher success rates and improved generalization compared with policies trained only on smaller laboratory datasets. The dataset supplies 76k demonstration trajectories totaling 350 hours, gathered across 564 distinct scenes and 84 tasks by 50 non-expert collectors in homes and offices on three continents over twelve months. The authors open-source the trajectories, the policy-training code, and instructions for reproducing the robot hardware.
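
The trajectories are released openly; as a hedged illustration only, the sketch below shows how episodes in an RLDS-style release could be iterated with tensorflow_datasets. The builder path and the observation/action field names are assumptions for illustration, not the release's confirmed layout.

import tensorflow_datasets as tfds

# Hypothetical builder directory; check the DROID release for the real
# dataset name, storage path, and version.
builder = tfds.builder_from_directory("/data/droid/1.0.0")
ds = builder.as_dataset(split="train")

for episode in ds.take(1):
    # In RLDS, each demonstration is an episode wrapping a nested
    # dataset of steps: observations, actions, and metadata.
    for step in episode["steps"]:
        obs = step["observation"]  # e.g. camera images, proprioception
        action = step["action"]    # teleoperated control command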

What carries the argument

The DROID dataset itself: a collection of 76k real-world demonstration trajectories spanning 564 scenes and 84 tasks, gathered by non-experts in varied everyday environments.

If this is right

  • Policies trained on DROID data reach higher success rates on the same tasks used during collection.
  • The same policies transfer more successfully to rooms and object arrangements never seen in training.
  • Robot learning pipelines can now incorporate data gathered by non-experts without specialized facilities.
  • Open release of the full trajectories and training code allows other groups to reproduce and extend the results directly.
  • Future dataset efforts can prioritize geographic and scene diversity over perfect control of every variable.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the generalization gains hold, robot training may shift from controlled lab collection toward distributed, low-cost gathering in homes and workplaces.
  • The result implies that scene diversity can substitute for some amount of expert supervision and hardware precision.
  • A natural next test would be whether the same dataset also improves performance when used for imitation learning from video only, without robot proprioception.
  • The collection protocol could be applied to other robot platforms to test whether the benefits are hardware-specific.

Load-bearing premise

The data collected by non-experts in ordinary, uncontrolled settings is high enough in quality and coverage to produce measurable gains in policy performance and generalization beyond what smaller lab datasets already provide.

What would settle it

Train identical policy architectures on DROID versus on a comparable number of trajectories from existing lab datasets, then measure success rates on a held-out set of tasks performed in new rooms; if the DROID-trained policies show no advantage or perform worse, the central claim is falsified.
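
A minimal sketch of that settling experiment, assuming hypothetical helpers train_policy and rollout_success_rate in place of an actual training and evaluation harness; none of these names come from the DROID release.

def falsification_test(droid_trajs, lab_trajs, heldout_tasks, n_trials=50):
    """Train the same architecture on each dataset, then compare mean
    success rates on held-out tasks performed in unseen rooms."""
    rates = {}
    for name, trajs in (("droid", droid_trajs), ("lab", lab_trajs)):
        policy = train_policy(trajs)  # identical architecture both times
        successes = [
            rollout_success_rate(policy, task, n_trials)
            for task in heldout_tasks
        ]
        rates[name] = sum(successes) / len(successes)
    # The central claim is falsified if this margin is <= 0.
    return rates["droid"] - rates["lab"]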

read the original abstract

The creation of large, diverse, high-quality robot manipulation datasets is an important stepping stone on the path toward more capable and robust robotic manipulation policies. However, creating such datasets is challenging: collecting robot manipulation data in diverse environments poses logistical and safety challenges and requires substantial investments in hardware and human labour. As a result, even the most general robot manipulation policies today are mostly trained on data collected in a small number of environments with limited scene and task diversity. In this work, we introduce DROID (Distributed Robot Interaction Dataset), a diverse robot manipulation dataset with 76k demonstration trajectories or 350 hours of interaction data, collected across 564 scenes and 84 tasks by 50 data collectors in North America, Asia, and Europe over the course of 12 months. We demonstrate that training with DROID leads to policies with higher performance and improved generalization ability. We open source the full dataset, policy learning code, and a detailed guide for reproducing our robot hardware setup.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces DROID, a large-scale in-the-wild robot manipulation dataset with 76k demonstration trajectories (350 hours) collected across 564 scenes and 84 tasks by 50 non-expert collectors in North America, Asia, and Europe. It claims that policies trained on DROID achieve higher performance and improved generalization ability relative to smaller lab-collected datasets, and releases the full dataset, policy learning code, and hardware reproduction guide.

Significance. If the empirical results hold, DROID would be a valuable resource for robot learning by substantially increasing the scale and diversity of available real-world manipulation data, addressing a core limitation of current policies trained in limited environments. The open-sourcing of the dataset, code, and detailed hardware guide is a clear strength that supports reproducibility and community use.

major comments (1)
  1. [Section 5] Section 5 (Experiments): The manuscript reports that training with DROID yields higher performance and better generalization, but does not present volume-controlled ablations (e.g., comparing full DROID against an equal-sized subset collected in a single lab environment). Without such controls it is difficult to isolate whether gains arise from the in-the-wild diversity, geographic spread, and non-expert collection rather than from increased total interaction time, which directly affects the central attribution in the abstract and title.
minor comments (1)
  1. [Abstract] Abstract: The claim of improved performance and generalization is stated without any quantitative metrics, baselines, or effect sizes; including one or two key numbers would make the summary more informative.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed review and constructive feedback on our manuscript. The primary concern raised is the lack of volume-controlled ablations in Section 5 to better isolate the contributions of in-the-wild diversity versus raw data volume. We address this point directly below and commit to revisions that strengthen the empirical claims.

read point-by-point responses
  1. Referee: [Section 5] Section 5 (Experiments): The manuscript reports that training with DROID yields higher performance and better generalization, but does not present volume-controlled ablations (e.g., comparing full DROID against an equal-sized subset collected in a single lab environment). Without such controls it is difficult to isolate whether gains arise from the in-the-wild diversity, geographic spread, and non-expert collection rather than from increased total interaction time, which directly affects the central attribution in the abstract and title.

    Authors: We agree that volume-controlled ablations would provide stronger evidence for attributing performance gains specifically to the in-the-wild aspects of DROID rather than scale alone. Our current experiments in Section 5 compare policies trained on DROID against those trained on smaller existing lab-collected datasets (e.g., Bridge, RT-1), which implicitly vary both volume and diversity. However, we did not include explicit controls matching DROID's full volume against an equivalently sized single-lab subset, as our data collection prioritized geographic and scene diversity across 564 environments rather than concentrating volume in one setting. In the revised manuscript, we will add volume-controlled experiments by (1) subsampling DROID to match the sizes of the baseline datasets used in our comparisons and (2) constructing a volume-matched subset drawn from the largest individual scenes or geographic clusters in DROID where sufficient data exists. We will also include a discussion of the practical challenges in collecting large-scale single-lab data at DROID's volume. These additions will clarify the relative contributions of scale and diversity. revision: yes
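
As a sketch of the volume-controlled ablation the rebuttal commits to, the helper below subsamples a trajectory list until its total interaction time matches a baseline dataset's volume, so that only diversity varies between conditions. The episode records and their duration_s field are hypothetical names, not part of the DROID codebase.

import random

def volume_matched_subset(episodes, target_hours, seed=0):
    """Randomly draw episodes until total interaction time reaches the
    baseline dataset's volume."""
    rng = random.Random(seed)
    pool = list(episodes)
    rng.shuffle(pool)
    subset, hours = [], 0.0
    for ep in pool:
        if hours >= target_hours:
            break
        subset.append(ep)
        hours += ep["duration_s"] / 3600.0
    return subset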

Circularity Check

0 steps flagged

No circularity: empirical dataset release with direct experimental validation

full rationale

The paper's core contribution is the collection and public release of the DROID dataset (76k trajectories across 564 scenes) followed by empirical policy training results showing performance gains. No derivation chain, equations, fitted parameters, or predictions exist that reduce to inputs by construction. Claims rest on direct experimentation and comparisons to baselines rather than self-definitional loops, self-citation load-bearing premises, or renamed known results. The work is validated against external benchmarks rather than its own constructions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper rests on the domain assumption that in-the-wild data collection yields higher-quality training signals for generalization than lab data; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption Diverse real-world robot interaction data improves policy generalization over limited lab data
    Invoked when claiming higher performance and generalization from training on DROID.

pith-pipeline@v0.9.0 · 5912 in / 1221 out tokens · 54192 ms · 2026-05-11T05:47:41.116365+00:00 · methodology

discussion (0)


Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Membership Inference Attacks on Vision-Language-Action Models

    cs.CR 2026-05 unverdicted novelty 8.0

    Vision-language-action models are highly vulnerable to membership inference attacks, including practical black-box versions that exploit generated actions and motion trajectories.

  2. From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation

    cs.RO 2026-05 unverdicted novelty 7.0

    MoLA infers a mixture of latent actions from generated future videos via modality-aware inverse dynamics models to improve robot manipulation policies.

  3. Split the Differences, Pool the Rest: Provably Efficient Multi-Objective Imitation

    cs.LG 2026-05 unverdicted novelty 7.0

    MA-BC partitions divergent expert data while pooling non-conflicting pairs in MOMDPs, converging faster to Pareto-optimal policies than independent learners and matching a new minimax lower bound.

  4. Coordinated Diffusion: Generating Multi-Agent Behavior Without Multi-Agent Demonstrations

    cs.RO 2026-05 unverdicted novelty 7.0

    CoDi decomposes the multi-agent diffusion score into pre-trained single-agent policies plus a gradient-free cost guidance term to generate coordinated behavior from single-agent data alone.

  5. Offline Policy Evaluation for Manipulation Policies via Discounted Liveness Formulation

    cs.RO 2026-05 conditional novelty 7.0

    A liveness-based Bellman operator enables conservative offline policy evaluation for manipulation tasks by encoding task progression and reducing truncation bias from finite horizons.

  6. CapVector: Learning Transferable Capability Vectors in Parametric Space for Vision-Language-Action Models

    cs.CV 2026-05 unverdicted novelty 7.0

    Capability vectors extracted from parameter differences between standard and auxiliary-finetuned VLA models can be merged into pretrained weights to match auxiliary-training performance while reducing computational ov...

  7. OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation

    cs.RO 2026-05 unverdicted novelty 7.0

    OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.

  8. OmniRobotHome: A Multi-Camera Platform for Real-Time Multiadic Human-Robot Interaction

    cs.RO 2026-04 unverdicted novelty 7.0

    A 48-camera residential platform delivers real-time occlusion-robust 3D perception and coordinated actuation for multi-human multi-robot interaction in a shared home workspace.

  9. Atomic-Probe Governance for Skill Updates in Compositional Robot Policies

    cs.RO 2026-04 unverdicted novelty 7.0

    A cross-version swap protocol reveals dominant skills that swing composition success by up to 50 percentage points, and an atomic probe with selective revalidation governs updates at lower cost than always re-testing ...

  10. 3D Generation for Embodied AI and Robotic Simulation: A Survey

    cs.RO 2026-04 accept novelty 7.0

    3D generation for embodied AI is shifting from visual realism toward interaction readiness, organized into data generation, simulation environments, and sim-to-real bridging roles.

  11. Modular Sensory Stream for Integrating Physical Feedback in Vision-Language-Action Models

    cs.RO 2026-04 unverdicted novelty 7.0

    MoSS augments VLAs with decoupled modality streams for multiple physical signals, achieving synergistic gains in real-world robot tasks via joint attention and auxiliary future-signal prediction.

  12. CodeGraphVLP: Code-as-Planner Meets Semantic-Graph State for Non-Markovian Vision-Language-Action Models

    cs.RO 2026-04 unverdicted novelty 7.0

    CodeGraphVLP uses a semantic-graph state and executable code planner to enable reliable long-horizon non-Markovian robot manipulation, improving task success and lowering latency over standard VLA baselines.

  13. Action Images: End-to-End Policy Learning via Multiview Video Generation

    cs.CV 2026-04 unverdicted novelty 7.0

    Action Images turn robot arm motions into interpretable multiview pixel videos, letting video backbones serve as zero-shot policies for end-to-end robot learning.

  14. Tune to Learn: How Controller Gains Shape Robot Policy Learning

    cs.RO 2026-04 conditional novelty 7.0

    Controller gains affect learnability differently for behavior cloning, RL from scratch, and sim-to-real transfer, so optimal gains depend on the learning paradigm rather than desired task behavior.

  15. Towards Long-horizon Embodied Agents with Tool-Aligned Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 6.0

    VLAs-as-Tools pairs a VLM planner with specialized VLA executors via a new interface and Tool-Aligned Post-Training to raise long-horizon robot success rates on LIBERO-Long and RoboTwin benchmarks.

  16. What to Ignore, What to React: Visually Robust RL Fine-Tuning of VLA Models

    cs.RO 2026-05 conditional novelty 6.0

    PAIR-VLA adds invariance and sensitivity objectives over paired visual variants during PPO fine-tuning of VLA models, yielding 9-16% average gains on ManiSkill3 under distractors, textures, poses, viewpoints, and ligh...

  17. GuidedVLA: Specifying Task-Relevant Factors via Plug-and-Play Action Attention Specialization

    cs.RO 2026-05 unverdicted novelty 6.0

    GuidedVLA improves VLA success rates by manually supervising separate attention heads in the action decoder with auxiliary signals for task-relevant factors.

  18. HarmoWAM: Harmonizing Generalizable and Precise Manipulation via Adaptive World Action Models

    cs.RO 2026-05 unverdicted novelty 6.0

    HarmoWAM unifies predictive and reactive control in world action models via an adaptive gating mechanism to deliver improved zero-shot generalization and precision in robotic manipulation.

  19. StereoPolicy: Improving Robotic Manipulation Policies via Stereo Perception

    cs.RO 2026-05 unverdicted novelty 6.0

    StereoPolicy fuses stereo image pairs via a Stereo Transformer on pretrained 2D encoders to boost robotic manipulation policies, showing gains over monocular, RGB-D, point cloud, and multi-view methods in simulations ...

  20. ACWM-Phys: Investigating Generalized Physical Interaction in Action-Conditioned Video World Models

    cs.CV 2026-05 unverdicted novelty 6.0

    ACWM-Phys benchmark shows action-conditioned world models generalize on simple geometric interactions but drop sharply on deformable contacts, high-dimensional control, and complex articulated motion, indicating relia...

  21. ForgeVLA: Federated Vision-Language-Action Learning without Language Annotations

    cs.CV 2026-05 unverdicted novelty 6.0

    ForgeVLA enables federated VLA model training from unlabeled vision-action pairs by recovering language via embodied classifiers and using contrastive planning plus adaptive aggregation to avoid feature collapse.

  22. HumanNet: Scaling Human-centric Video Learning to One Million Hours

    cs.CV 2026-05 unverdicted novelty 6.0

    HumanNet is a 1M-hour human-centric video dataset with interaction annotations that enables better vision-language-action model performance than equivalent robot data in a controlled test.

  23. DexSynRefine: Synthesizing and Refining Human-Object Interaction Motion for Physically Feasible Dexterous Robot Actions

    cs.RO 2026-05 unverdicted novelty 6.0

    DexSynRefine synthesizes HOI motions with an extended manifold method, refines them via task-space residual RL, and adapts for sim-to-real transfer, outperforming kinematic retargeting by 50-70 percentage points on fi...

  24. TriRelVLA: Triadic Relational Structure for Generalizable Embodied Manipulation

    cs.CV 2026-05 unverdicted novelty 6.0

    TriRelVLA introduces triadic object-hand-task relational representations and a task-grounded graph transformer with a relational bottleneck to improve generalization in robotic manipulation across scenes, objects, and tasks.

  25. BifrostUMI: Bridging Robot-Free Demonstrations and Humanoid Whole-Body Manipulation

    cs.RO 2026-05 unverdicted novelty 6.0

    BifrostUMI enables robot-free human demonstration capture via VR and wrist cameras to train visuomotor policies that predict keypoint trajectories for transfer to humanoid whole-body control through retargeting.

  26. LaST-R1: Reinforcing Robotic Manipulation via Adaptive Physical Latent Reasoning

    cs.RO 2026-04 unverdicted novelty 6.0

    LaST-R1 reaches 99.8% average success on the LIBERO benchmark using one-shot warm-up plus LAPO reinforcement learning on latent physical reasoning, with up to 44% real-world gains on complex single- and dual-arm tasks.

  27. LaST-R1: Reinforcing Robotic Manipulation via Adaptive Physical Latent Reasoning

    cs.RO 2026-04 unverdicted novelty 6.0

    LaST-R1 introduces a RL post-training method called LAPO that optimizes latent Chain-of-Thought reasoning in vision-language-action models, yielding 99.9% success on LIBERO and up to 22.5% real-world gains.

  28. Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising

    cs.RO 2026-04 unverdicted novelty 6.0

    X-WAM unifies robotic action execution and 4D world synthesis by adapting video diffusion priors with a lightweight depth branch and asynchronous noise sampling, achieving 79-91% success on robot benchmarks.

  29. Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising

    cs.RO 2026-04 unverdicted novelty 6.0

    X-WAM unifies real-time robotic action execution with high-fidelity 4D world synthesis by adapting video diffusion priors through lightweight depth branches and asynchronous noise sampling, achieving 79-91% success on...

  30. Atomic-Probe Governance for Skill Updates in Compositional Robot Policies

    cs.RO 2026-04 unverdicted novelty 6.0

    Empirical study on robosuite tasks reveals a dominant-skill effect in compositions and shows that an atomic probe approximates full revalidation for skill updates at much lower cost.

  31. Breaking Lock-In: Preserving Steerability under Low-Data VLA Post-Training

    cs.RO 2026-04 unverdicted novelty 6.0

    DeLock mitigates lock-in in low-data VLA post-training via visual grounding preservation and test-time contrastive prompt guidance, outperforming baselines across eight evaluations while matching data-heavy generalist...

  32. Temporal Difference Calibration in Sequential Tasks: Application to Vision-Language-Action Models

    cs.RO 2026-04 unverdicted novelty 6.0

    Temporal difference calibration aligns uncertainty estimates in vision-language-action models with their value functions for better sequential performance.

  33. Unmasking the Illusion of Embodied Reasoning in Vision-Language-Action Models

    cs.RO 2026-04 unverdicted novelty 6.0

    State-of-the-art vision-language-action models catastrophically fail dynamic embodied reasoning due to lexical-kinematic shortcuts, behavioral inertia, and semantic feature collapse caused by architectural bottlenecks...

  34. ST-π: Structured SpatioTemporal VLA for Robotic Manipulation

    cs.RO 2026-04 unverdicted novelty 6.0

    ST-π structures VLA models by having a spatiotemporal VLM produce causally ordered chunk-level prompts that guide a dual-generator action expert to jointly handle spatial and temporal control in robotic manipulation.

  35. Chain Of Interaction Benchmark (COIN): When Reasoning meets Embodied Interaction

    cs.RO 2026-04 unverdicted novelty 6.0

    COIN provides 50 interactive robotic tasks, a 1000-demonstration dataset collected via AR teleoperation, and metrics showing that CodeAsPolicy, VLA, and H-VLA models fail at causally-dependent interactive reasoning du...

  36. V-CAGE: Vision-Closed-Loop Agentic Generation Engine for Robotic Manipulation

    cs.RO 2026-04 unverdicted novelty 6.0

    V-CAGE automates the creation of scalable, high-quality robotic manipulation datasets through context-aware scene construction, closed-loop visual verification, and perceptually-driven compression.

  37. SIM1: Physics-Aligned Simulator as Zero-Shot Data Scaler in Deformable Worlds

    cs.RO 2026-04 unverdicted novelty 6.0

    SIM1 converts sparse real demonstrations into high-fidelity synthetic data through physics-aligned simulation, yielding policies that match real-data performance at a 1:15 ratio with 90% zero-shot success on deformabl...

  38. EgoVerse: An Egocentric Human Dataset for Robot Learning from Around the World

    cs.RO 2026-04 unverdicted novelty 6.0

    EgoVerse releases 1,362 hours of standardized egocentric human data across 1,965 tasks and shows via multi-lab experiments that robot policy performance scales with human data volume when the data aligns with robot ob...

  39. A1: A Fully Transparent Open-Source, Adaptive and Efficient Truncated Vision-Language-Action Model

    cs.RO 2026-04 unverdicted novelty 6.0

    A1 is a transparent VLA framework achieving state-of-the-art robot manipulation success with up to 72% lower latency via adaptive layer truncation and inter-layer flow matching.

  40. Hierarchical Planning with Latent World Models

    cs.LG 2026-04 unverdicted novelty 6.0

    Hierarchical planning over multi-scale latent world models enables 70% success on real robotic pick-and-place with goal-only input where flat models achieve 0%, while cutting planning compute up to 4x in simulations.

  41. ARM: Advantage Reward Modeling for Long-Horizon Manipulation

    cs.RO 2026-04 unverdicted novelty 6.0

    ARM trains reward models on Progressive/Regressive/Stagnant labels to enable adaptive reweighting in offline RL, reaching 99.4% success on towel-folding with minimal human intervention.

  42. World Action Models are Zero-shot Policies

    cs.RO 2026-02 unverdicted novelty 6.0

    DreamZero uses a 14B video diffusion model as a World Action Model to achieve over 2x better zero-shot generalization on real robots than state-of-the-art VLAs, real-time 7Hz closed-loop control, and cross-embodiment ...

  43. InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy

    cs.RO 2025-10 unverdicted novelty 6.0

    InternVLA-M1 uses spatially guided pre-training on 2.3M examples followed by action post-training to deliver up to 17% gains on robot manipulation benchmarks and 20.6% on unseen objects.

  44. V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

    cs.AI 2025-06 unverdicted novelty 6.0

    V-JEPA 2 pre-trained on massive unlabeled video achieves strong results on motion understanding and action anticipation, SOTA video QA at 8B scale, and enables zero-shot robotic planning on Franka arms using only 62 h...

  45. SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics

    cs.LG 2025-06 unverdicted novelty 6.0

    SmolVLA is a small efficient VLA model that achieves performance comparable to 10x larger models while training on one GPU and deploying on consumer hardware via community data and chunked asynchronous action prediction.

  46. Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets

    cs.RO 2025-04 unverdicted novelty 6.0

    Unified World Models couple video and action diffusion inside one transformer with independent timesteps, enabling pretraining on heterogeneous robot datasets that include action-free video and producing more generali...

  47. Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success

    cs.RO 2025-02 accept novelty 6.0

    OpenVLA-OFT fine-tuning boosts LIBERO success rate from 76.5% to 97.1%, speeds action generation 26x, and outperforms baselines on real bimanual dexterous tasks.

  48. π₀: A Vision-Language-Action Flow Model for General Robot Control

    cs.LG 2024-10 unverdicted novelty 6.0

    π₀ is a vision-language-action flow model trained on diverse multi-platform robot data that supports zero-shot task performance, language instruction following, and efficient fine-tuning for dexterous tasks.

  49. RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation

    cs.RO 2024-10 conditional novelty 6.0

    RDT-1B is a diffusion foundation model that unifies action spaces across robots and demonstrates superior bimanual manipulation with zero-shot generalization, language following, and few-shot learning on real robots.

  50. Evaluating Real-World Robot Manipulation Policies in Simulation

    cs.RO 2024-05 conditional novelty 6.0

    SIMPLER simulated environments yield policy performance that correlates strongly with real-world robot manipulation results and captures similar sensitivity to distribution shifts.

  51. Nautilus: From One Prompt to Plug-and-Play Robot Learning

    cs.RO 2026-05 unverdicted novelty 5.0

    NAUTILUS is a prompt-driven harness that automates plug-and-play adapters, typed contracts, and validation for policies, benchmarks, and robots in learning research.

  52. ProcVLM: Learning Procedure-Grounded Progress Rewards for Robotic Manipulation

    cs.RO 2026-05 unverdicted novelty 5.0

    ProcVLM learns procedure-grounded dense progress rewards for robotic manipulation via a reasoning-before-estimation VLM trained on a 60M-frame synthesized corpus from 30 embodied datasets.

  53. Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling

    cs.CV 2026-04 unverdicted novelty 5.0

    Visual generation models are evolving from passive renderers to interactive agentic world modelers, but current systems lack spatial reasoning, temporal consistency, and causal understanding, with evaluations overemph...

  54. ReFineVLA: Multimodal Reasoning-Aware Generalist Robotic Policies via Teacher-Guided Fine-Tuning

    cs.RO 2026-04 unverdicted novelty 5.0

    ReFineVLA adds teacher-generated reasoning steps to VLA training and reports state-of-the-art success rates on SimplerEnv WidowX and Google Robot benchmarks.

  55. AEGIS: Anchor-Enforced Gradient Isolation for Knowledge-Preserving Vision-Language-Action Fine-Tuning

    cs.LG 2026-04 unverdicted novelty 5.0

    AEGIS uses a pre-computed Gaussian anchor and layer-wise Gram-Schmidt orthogonal projections to isolate destructive gradients during VLA fine-tuning, preserving VQA performance without co-training or replay.

  56. Jump-Start Reinforcement Learning with Vision-Language-Action Regularization

    cs.LG 2026-04 unverdicted novelty 5.0

    VLAJS augments PPO with sparse annealed VLA guidance through directional regularization to cut required interactions by over 50% on manipulation tasks and enable zero-shot sim-to-real transfer.

  57. From Video to Control: A Survey of Learning Manipulation Interfaces from Temporal Visual Data

    cs.RO 2026-04 accept novelty 5.0

    A survey introduces an interface-centric taxonomy for video-to-control methods in robotic manipulation and identifies the robotics integration layer as the central open challenge.

  58. SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model

    cs.RO 2025-01 unverdicted novelty 5.0

    SpatialVLA adds 3D-aware position encoding and adaptive discretized action grids to visual-language-action models, enabling strong zero-shot performance and fine-tuning on new robot setups after pre-training on 1.1 mi...

  59. World Action Models: The Next Frontier in Embodied AI

    cs.RO 2026-05 unverdicted novelty 4.0

    The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.

  60. JoyAI-RA 0.1: A Foundation Model for Robotic Autonomy

    cs.RO 2026-04 unverdicted novelty 4.0

    JoyAI-RA is a multi-source pretrained VLA model that claims to bridge human-to-robot embodiment gaps via data unification and outperforms prior methods on generalization-heavy robotic tasks.

Reference graph

Works this paper leans on

79 extracted references · 79 canonical work pages · cited by 60 Pith papers · 11 internal anchors

  1. [1]

    Roboagent: Generalization and efficiency in robot manipulation via semantic augmentations and action chunking, 2023

    Homanga Bharadhwaj, Jay Vakil, Mohit Sharma, Abhinav Gupta, Shubham Tulsiani, and Vikash Kumar. Roboagent: Generalization and efficiency in robot manipulation via semantic augmentations and action chunking. arXiv preprint arXiv:2309.01918, 2023

  2. [2]

    RT-1: Robotics Transformer for Real-World Control at Scale

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022

  3. [3]

    Scaling data-driven robotics with reward sketching and batch reinforcement learning

    Serkan Cabi, Sergio Gomez Colmenarejo, Alexander Novikov, Ksenia Konyushkova, Scott Reed, Rae Jeong, Konrad Zolna, Yusuf Aytar, David Budden, Mel Vecerik, Oleg Sushkov, David Barker, Jonathan Scholz, Misha Denil, Nando de Freitas, and Ziyu Wang. Scaling data-driven robotics with reward sketching and batch reinforcement learning. RSS, 2019

  4. [4]

    nuScenes: A multimodal dataset for autonomous driving

    Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuScenes: A multimodal dataset for autonomous driving. arXiv preprint arXiv:1903.11027, 2019

  5. [5]

    ShapeNet: An Information-Rich 3D Model Repository

    Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012, 2015

  6. [6]

    Rovi-aug: Robot and viewpoint augmentation for cross-embodiment robot learning

    Lawrence Yunliang Chen, Chenfeng Xu, Karthik Dharmarajan, Muhammad Zubair Irshad, Richard Cheng, Kurt Keutzer, Masayoshi Tomizuka, Quan Vuong, and Ken Goldberg. Rovi-aug: Robot and viewpoint augmentation for cross-embodiment robot learning. In Conference on Robot Learning (CoRL), Munich, Germany, 2024

  7. [7]

    Diffusion policy: Visuomotor policy learning via action diffusion

    Cheng Chi, Siyuan Feng, Yilun Du, Zhenjia Xu, Eric Cousineau, Benjamin Burchfiel, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. In Proceedings of Robotics: Science and Systems (RSS), 2023

  8. [8]

    Robonet: Large-scale multi-robot learning

    Sudeep Dasari, Frederik Ebert, Stephen Tian, Suraj Nair, Bernadette Bucher, Karl Schmeckpeper, Siddharth Singh, Sergey Levine, and Chelsea Finn. Robonet: Large-scale multi-robot learning. CoRL, 2019

  9. [9]

    Objaverse-xl: A universe of 10m+ 3d objects. arXiv preprint arXiv:2307.05663, 2023

    Matt Deitke, Ruoshi Liu, Matthew Wallingford, Huong Ngo, Oscar Michel, Aditya Kusupati, Alan Fan, Christian Laforte, Vikram Voleti, Samir Yitzhak Gadre, et al. Objaverse-xl: A universe of 10m+ 3d objects. arXiv preprint arXiv:2307.05663, 2023

  10. [10]

    Objaverse: A universe of annotated 3d objects

    Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse: A universe of annotated 3d objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13142–13153, 2023

  11. [11]

    Imagenet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009

  12. [12]

    Visual foresight: Model-based deep reinforcement learning for vision-based robotic control

    Frederik Ebert, Chelsea Finn, Sudeep Dasari, Annie Xie, Alex Lee, and Sergey Levine. Visual foresight: Model-based deep reinforcement learning for vision-based robotic control. arXiv:1812.00568, 2018

  13. [13]

    Bridge Data: Boosting Generalization of Robotic Skills with Cross-Domain Datasets

    Frederik Ebert, Yanlai Yang, Karl Schmeckpeper, Bernadette Bucher, Georgios Georgakis, Kostas Daniilidis, Chelsea Finn, and Sergey Levine. Bridge data: Boosting generalization of robotic skills with cross-domain datasets. arXiv preprint arXiv:2109.13396 , 2021

  14. [14]

    Rh20t: A comprehensive robotic dataset for learning diverse skills in one-shot

    Hao-Shu Fang, Hongjie Fang, Zhenyu Tang, Jirong Liu, Chenxi Wang, Junbo Wang, Haoyi Zhu, and Cewu Lu. Rh20t: A comprehensive robotic dataset for learning diverse skills in one-shot. Towards Generalist Robots: Learning Paradigms for Scalable Skill Acquisition@CoRL2023, 3:5, 2023

  15. [15]

    Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography

    Martin A Fischler and Robert C Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM , 24(6):381–395, 1981

  16. [16]

    Mobile aloha: Learning bimanual mobile manipulation with low-cost whole-body teleoperation.arXiv preprint arXiv:2401.02117,

    Zipeng Fu, Tony Z Zhao, and Chelsea Finn. Mobile aloha: Learning bimanual mobile manipulation with low-cost whole-body teleoperation. arXiv preprint arXiv:2401.02117, 2024

  17. [17]

    The Pile: An 800GB Dataset of Diverse Text for Language Modeling

    Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. The pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020

  18. [18]

    Are we ready for autonomous driving? the kitti vision benchmark suite

    Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. In 2012 IEEE conference on computer vision and pattern recognition, pages 3354–3361. IEEE, 2012

  19. [19]

    Ego4d: Around the world in 3,000 hours of egocentric video

    Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18995–19012, 2022

  20. [20]

    Robot learning in homes: Improving generalization and reducing dataset bias

    Abhinav Gupta, Adithyavairavan Murali, Dhiraj Prakashc- hand Gandhi, and Lerrel Pinto. Robot learning in homes: Improving generalization and reducing dataset bias. Advances in neural information processing systems , 31, 2018

  21. [21]

    Scaling up and distilling down: Language-guided robot skill acquisition

    Huy Ha, Pete Florence, and Shuran Song. Scaling up and distilling down: Language-guided robot skill acquisition. In Conference on Robot Learning , pages 3766–3777. PMLR, 2023

  22. [22]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems , 33:6840–6851, 2020

  23. [23]

    spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing

    Matthew Honnibal and Ines Montani. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear, 2017

  24. [24]

    Bc-z: Zero-shot task generalization with robotic imitation learning

    Eric Jang, Alex Irpan, Mohi Khansari, Daniel Kappler, Frederik Ebert, Corey Lynch, Sergey Levine, and Chelsea Finn. Bc-z: Zero-shot task generalization with robotic imitation learning. In Conference on Robot Learning , pages 991–1002. PMLR, 2022

  25. [25]

    VIMA: Robot manipulation with multimodal prompts

    Yunfan Jiang, Agrim Gupta, Zichen Zhang, Guanzhi Wang, Yongqiang Dou, Yanjun Chen, Li Fei-Fei, Anima Anandkumar, Yuke Zhu, and Linxi Fan. VIMA: Robot manipulation with multimodal prompts. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, Proceedings of the 40th International Conference on...

  26. [26]

    Qt-opt: Scalable deep reinforcement learning for vision-based robotic manipulation, 2018

    Dmitry Kalashnikov, Alex Irpan, Peter Pastor, Julian Ibarz, Alexander Herzog, Eric Jang, Deirdre Quillen, Ethan Holly, Mrinal Kalakrishnan, Vincent Vanhoucke, et al. QT-Opt: Scalable deep reinforcement learning for vision-based robotic manipulation. arXiv preprint arXiv:1806.10293, 2018

  27. [27]

    Mt-opt: Continuous multi-task robotic reinforcement learning at scale

    Dmitry Kalashnikov, Jake Varley, Yevgen Chebotar, Ben Swanson, Rico Jonschkowski, Chelsea Finn, Sergey Levine, and Karol Hausman. Mt-opt: Continuous multi-task robotic reinforcement learning at scale. arXiv, 2021

  28. [28]

    Socially compliant navigation dataset (scand): A large-scale dataset of demonstrations for social navigation

    Haresh Karnan, Anirudh Nair, Xuesu Xiao, Garrett Warnell, Sören Pirk, Alexander Toshev, Justin Hart, Joydeep Biswas, and Peter Stone. Socially compliant navigation dataset (scand): A large-scale dataset of demonstrations for social navigation. IEEE Robotics and Automation Letters, 7(4):11807–11814, 2022

  29. [29]

    Segment Anything

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick. Segment anything, 2023. URL https://arxiv.org/abs/2304.02643

  30. [30]

    Segment Anything

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, et al. Segment Anything, April 2023

  31. [31]

    EPnP: An accurate O(n) solution to the PnP problem

    Vincent Lepetit, Francesc Moreno-Noguer, and Pascal Fua. EPnP: An accurate O(n) solution to the PnP problem. International Journal of Computer Vision, 81:155–166, 2009

  32. [32]

    Learning hand-eye coordination for robotic grasping with large-scale data collection

    Sergey Levine, Peter Pastor, Alex Krizhevsky, and Deirdre Quillen. Learning hand-eye coordination for robotic grasping with large-scale data collection. In International Symposium on Experimental Robotics . Springer, 2016

  33. [33]

    Polymetis

    Yixin Lin, Austin S. Wang, Giovanni Sutanto, Akshara Rai, and Franziska Meier. Polymetis. https://facebookresearch.github.io/fairo/polymetis/, 2021

  34. [34]

    Ctrnet-x: Camera-to-robot pose estimation in real-world conditions using a single camera

    Jingpei Lu, Zekai Liang, Tristin Xie, Florian Richter, Shan Lin, Sainan Liu, and Michael C Yip. Ctrnet-x: Camera-to-robot pose estimation in real-world conditions using a single camera. arXiv preprint arXiv:2409.10441, 2024

  35. [35]

    Interactive language: Talking to robots in real time

    Corey Lynch, Ayzaan Wahid, Jonathan Tompson, Tianli Ding, James Betker, Robert Baruch, Travis Armstrong, and Pete Florence. Interactive language: Talking to robots in real time. IEEE Robotics and Automation Letters , 2023

  36. [36]

    Roboturk: A crowdsourcing platform for robotic skill learning through imitation

    Ajay Mandlekar, Yuke Zhu, Animesh Garg, Jonathan Booher, Max Spero, Albert Tung, Julian Gao, John Emmons, Anchit Gupta, Emre Orbay, et al. Roboturk: A crowdsourcing platform for robotic skill learning through imitation. In Conference on Robot Learning , pages 879–

  37. [37]

    What Matters in Learning from Offline Human Demonstrations for Robot Manipulation

    Ajay Mandlekar, Danfei Xu, Josiah Wong, Soroush Nasiriany, Chen Wang, Rohun Kulkarni, Li Fei-Fei, Silvio Savarese, Yuke Zhu, and Roberto Martín-Martín. What matters in learning from offline human demonstrations for robot manipulation. In arXiv preprint arXiv:2108.03298 , 2021

  38. [38]

    Octo: An open-source generalist robot policy

    Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Charles Xu, Jianlan Luo, Tobias Kreiman, You Liang Tan, Dorsa Sadigh, Chelsea Finn, and Sergey Levine. Octo: An open-source generalist robot policy. https://octo-models.github.io, 2023

  39. [39]

    Open X-Embodiment: Robotic learning datasets and RT-X models

    Open X-Embodiment Collaboration, Abhishek Padalkar, Acorn Pooley, Ajinkya Jain, Alex Bewley, Alex Herzog, Alex Irpan, Alexander Khazatsky, Anant Rai, Anikait Singh, Anthony Brohan, Antonin Raffin, Ayzaan Wahid, Ben Burgess-Limerick, Beomjoon Kim, Bernhard Schölkopf, Brian Ichter, Cewu Lu, Charles Xu, Chelsea Finn, Chenfeng Xu, Cheng Chi, Chenguang Hua...

  40. [40]

    Supersizing self- supervision: Learning to grasp from 50k tries and 700 robot hours

    Lerrel Pinto and Abhinav Gupta. Supersizing self- supervision: Learning to grasp from 50k tries and 700 robot hours. In 2016 IEEE international conference on robotics and automation (ICRA) , pages 3406–3413. IEEE, 2016

  41. [41]

    Learning Transferable Visual Models From Natural Language Supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision, 2021. URL https://arxiv.org/abs/2103.00020

  42. [42]

    Robot learning with sensorimotor pre-training

    Ilija Radosavovic, Baifeng Shi, Letian Fu, Ken Goldberg, Trevor Darrell, and Jitendra Malik. Robot learning with sensorimotor pre-training. In Conference on Robot Learning, 2023

  43. [43]

    Common crawl – building an open web-scale crawl using hadoop, 2010

    Ahad Rana. Common crawl – building an open web-scale crawl using hadoop, 2010. URL https://www.slideshare.net/hadoopusergroup/common-crawlpresentation

  44. [44]

    Accelerating 3d deep learning with pytorch3d

    Nikhila Ravi, Jeremy Reizenstein, David Novotny, Taylor Gordon, Wan-Yen Lo, Justin Johnson, and Georgia Gkioxari. Accelerating 3d deep learning with pytorch3d. arXiv:2007.08501, 2020

  45. [45]

    DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

    Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. ArXiv, abs/1910.01108, 2019. URL https://api.semanticscholar.org/CorpusID:203626972

  47. [47]

    Laion-5b: An open large-scale dataset for training next generation image-text models

    Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294, 2022

  48. [48]

    On bringing robots home, 2023

    Nur Muhammad Mahi Shafiullah, Anant Rai, Haritheja Etukuru, Yiqian Liu, Ishan Misra, Soumith Chintala, and Lerrel Pinto. On bringing robots home, 2023

  49. [49]

    Gnm: A general navigation model to drive any robot

    Dhruv Shah, Ajay Sridhar, Arjun Bhorkar, Noriaki Hirose, and Sergey Levine. Gnm: A general navigation model to drive any robot. In 2023 IEEE International Conference on Robotics and Automation (ICRA) , pages 7226–7233. IEEE, 2023

  50. [50]

    Vint: A foundation model for visual navigation. arXiv preprint arXiv:2306.14846, 2023

    Dhruv Shah, Ajay Sridhar, Nitish Dashora, Kyle Stachow- icz, Kevin Black, Noriaki Hirose, and Sergey Levine. ViNT: A foundation model for visual navigation. In 7th Annual Conference on Robot Learning , 2023. URL https://arxiv.org/abs/2306.14846

  51. [51]

    Multiple interactions made easy (mime): Large scale demonstrations data for imitation

    Pratyusha Sharma, Lekha Mohan, Lerrel Pinto, and Abhinav Gupta. Multiple interactions made easy (mime): Large scale demonstrations data for imitation. In Conference on Robot Learning, pages 906–915. PMLR, 2018

  52. [52]

    Perceiver-actor: A multi-task transformer for robotic manipulation

    Mohit Shridhar, Lucas Manuelli, and Dieter Fox. Perceiver-actor: A multi-task transformer for robotic manipulation. In Proceedings of the 6th Conference on Robot Learning (CoRL) , 2022

  53. [53]

    Perceiver-actor: A multi-task transformer for robotic manipulation

    Mohit Shridhar, Lucas Manuelli, and Dieter Fox. Perceiver-actor: A multi-task transformer for robotic manipulation. In Conference on Robot Learning , pages 785–799. PMLR, 2023

  54. [54]

    Grasping in the wild: Learning 6dof closed-loop grasping from low-cost demonstrations

    Shuran Song, Andy Zeng, Johnny Lee, and Thomas Funkhouser. Grasping in the wild: Learning 6dof closed-loop grasping from low-cost demonstrations. IEEE Robotics and Automation Letters, 5(3):4978–4985, 2020

  55. [55]

    Nomad: Goal masked diffusion policies for navigation and exploration. arXiv preprint arXiv:2310.07896, 2023

    Ajay Sridhar, Dhruv Shah, Catherine Glossop, and Sergey Levine. Nomad: Goal masked diffusion policies for navigation and exploration. arXiv preprint arXiv:2310.07896, 2023

  56. [56]

    Scalability in perception for autonomous driving: Waymo open dataset, 2019

    Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, Vijay Vasudevan, Wei Han, Jiquan Ngiam, Hang Zhao, Aleksei Timofeev, Scott Ettinger, Maxim Krivokon, Amy Gao, Aditya Joshi, Yu Zhang, Jonathon Shlens, Zhifeng Chen, and Dragomir Anguelov. Scalability in perception...

  57. [57]

    View-invariant policy learning via zero-shot novel view synthesis

    Stephen Tian, Blake Wulfe, Kyle Sargent, Katherine Liu, Sergey Zakharov, Vitor Guizilini, and Jiajun Wu. View-invariant policy learning via zero-shot novel view synthesis. arXiv, 2024

  58. [58]

    Tartandrive: A large-scale dataset for learning off-road dynamics models

    Samuel Triest, Matthew Sivaprakasam, Sean J Wang, Wenshan Wang, Aaron M Johnson, and Sebastian Scherer. Tartandrive: A large-scale dataset for learning off-road dynamics models. In 2022 International Conference on Robotics and Automation (ICRA), pages 2546–2552. IEEE, 2022

  59. [59]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems , pages 5998–6008, 2017

  60. [60]

    Bridgedata v2: A dataset for robot learning at scale, 2023

    Homer Walke, Kevin Black, Abraham Lee, Moo Jin Kim, Max Du, Chongyi Zheng, Tony Zhao, Philippe Hansen-Estruch, Quan Vuong, Andre He, Vivek Myers, Kuan Fang, Chelsea Finn, and Sergey Levine. Bridgedata v2: A dataset for robot learning at scale, 2023

  61. [61]

    Vggt: Visual geometry grounded transformer

    Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

  62. [62]

    Dust3r: Geometric 3d vision made easy, 2024

    Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vision made easy, 2024. URL https://arxiv.org/abs/2312.14132

  63. [63]

    Fast3r: Towards 3d reconstruction of 1000+ images in one forward pass

    Jianing Yang, Alexander Sax, Kevin J. Liang, Mikael Henaff, Hao Tang, Ang Cao, Joyce Chai, Franziska Meier, and Matt Feiszli. Fast3r: Towards 3d reconstruction of 1000+ images in one forward pass. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2025

  64. [64]

    Visual imitation made easy, 2020

    Sarah Young, Dhiraj Gandhi, Shubham Tulsiani, Abhinav Gupta, Pieter Abbeel, and Lerrel Pinto. Visual imitation made easy, 2020

  65. [65]

    Bdd100k: A diverse driving dataset for heterogeneous multitask learning

    Fisher Yu, Haofeng Chen, Xin Wang, Wenqi Xian, Yingying Chen, Fangchen Liu, Vashisht Madhavan, and Trevor Darrell. Bdd100k: A diverse driving dataset for heterogeneous multitask learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2636–2645, 2020

  66. [66]

    Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware

    Tony Z Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware. arXiv preprint arXiv:2304.13705 , 2023

  67. [67]

    3D-VLA: A 3D Vision-Language-Action Generative World Model

    Haoyu Zhen, Xiaowen Qiu, Peihao Chen, Jincheng Yang, Xin Yan, Yilun Du, Yining Hong, and Chuang Gan. 3D-VLA: A 3D vision-language-action generative world model. arXiv preprint arXiv:2403.09631, 2024

  68. [68]

    RT-2: Vision-language-action models transfer web knowledge to robotic control

    Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In 7th Annual Conference on Robot Learning, 2023

  69. [69]

    Industrial office: industrial office tables and chairs, conference rooms, conference TVs

  70. [70]

    Industrial kitchen: industrial refrigerator, sink, coffee maker

  71. [71]

    Industrial dining room: industrial setting with dining tables

  72. [72]

    Home office: desk or desk chairs in a home setting

  73. [73]

    Home kitchen: refrigerator, kitchen sink, kitchen tabletop in a home setting

  74. [74]

    Home dining room: dining table, dining chairs, in a home setting

  75. [75]

    Bedroom: room with a bed

  76. [76]

    Bathroom: Showers, baths, toilets, bathroom sinks

  77. [77]

    Living room: places with couches, armchairs, coffee tables, tvs in a home setting

  78. [78]

    Hallway / closet: areas between rooms, situations where the robot is interacting with a door or objects in a closet

  79. [79]

    Other: any other location that does not fit into those categories. Unknown: a scene that's too hard to classify because the image is dark or too close up. (Both categories from Listing C.1, the prompt provided to GPT4V to classify scene types.)