pith. machine review for the scientific record.

arxiv: 2403.12945 · v2 · submitted 2024-03-19 · 💻 cs.RO

Recognition: 1 theorem link

DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

Abdul Rehman, Abhinav Gupta, Abhiram Maddukuri, Abhishek Gupta, Abigail O'Neill, Abraham Lee, Albert Zhan, Alexander Khazatsky, Andrew E. Wang, Annie Xie, Archit Sharma, Arefeh Yavary, Arhan Jain, Ashwin Balakrishna, Blake Wulfe, Charlotte Le, Chelsea Finn, Cheng Chi, Christopher Agia, Cody Simpson, Daniel Morton, Daphne Chen, David Antonio Herrera, Derick Seale, Dinesh Jayaraman, Donovon Jackson, Dorsa Sadigh, Emi Tran, Ethan Paul Foster, Glen Berseth, Homer Rich Walke, Huy Ha, Ilija Radosavovic, Jaimyn Drake, Jean Mercat, Jeannette Bohg, Jensen Gao, Jiaheng Hu, Jiajun Wu, Jimmy Wu, Jingpei Lu, Jingyun Yang, Jitendra Malik, Joey Hejna, Jonathan Heewon Yang, Joseph J Lim, Kaiyuan Wang, Karl Pertsch, Ken Goldberg, Kevin Black, Kevin Lin, Kirsty Ellis, Kyle Beltran Hatch, Kyle Hsu, Lawrence Yunliang Chen, Marion Lepert, Marius Memmel, Masha Itkina, Mateo Guaman Castro, Michael C. Yip, Minho Heo, Mohan Kumar Srirama, Muhammad Zubair Irshad, Osbert Bastani, Pannag R Sanketi, Patrick Tree Miller, Patrick Yin, Peter David Fagan, Qiuyu Chen, Quan Vuong, Roberto Martín-Martín, Rohan Baijal, Rosario Scalise, Roy Lin, Sergey Levine, Shan Lin, Shivin Dass, Shuran Song, Siddharth Karamcheti, Soroush Nasiriany, Stephen Tian, Subramanian Ramamoorthy, Sudeep Dasari, Suneel Belkhale, Sungjae Park, Suraj Nair, Suvir Mirchandani, Ted Xiao, Thomas Kollar, Tony Nguyen, Tony Z. Zhao, Trinity Chung, Victor Son, Vitor Guizilini, Yecheng Jason Ma, Yilin Wu, Youngwoon Lee, Yuke Zhu, Yunchu Zhang, Yunshuang Li, Zehan Ma

Pith reviewed 2026-05-11 05:47 UTC · model grok-4.3

classification 💻 cs.RO
keywords robot manipulation · imitation learning · dataset · generalization · demonstration learning · in-the-wild robotics · policy training · real-world data

The pith

Training on the DROID dataset of 350 hours of real-world robot manipulation data produces policies with higher performance and stronger generalization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that a very large robot manipulation dataset collected across many different real-world scenes and tasks can be used to train policies that succeed more often and transfer better to new situations than policies trained on smaller, more controlled datasets. Current robot learning is held back by the difficulty and cost of gathering diverse data outside the lab, so most policies remain narrow. By releasing 76,000 trajectories gathered by non-experts across 564 scenes and 84 tasks, the work tests whether scale and variety in ordinary environments can overcome those limits. The reported result is that models trained with this data outperform those trained without it on both seen and unseen tasks.

Core claim

Policies trained with the DROID dataset achieve higher success rates and improved generalization compared with policies trained only on smaller laboratory datasets. The dataset supplies 76k demonstration trajectories totaling 350 hours, gathered across 564 distinct scenes and 84 tasks by 50 non-expert collectors in homes and offices on three continents over twelve months. The authors open-source the trajectories, the policy-training code, and instructions for reproducing the robot hardware.
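
The trajectories are released openly; as a hedged illustration only, the sketch below shows how episodes in an RLDS-style release could be iterated with tensorflow_datasets. The builder path and the observation/action field names are assumptions for illustration, not the release's confirmed layout.

import tensorflow_datasets as tfds

# Hypothetical builder directory; check the DROID release for the real
# dataset name, storage path, and version.
builder = tfds.builder_from_directory("/data/droid/1.0.0")
ds = builder.as_dataset(split="train")

for episode in ds.take(1):
    # In RLDS, each demonstration is an episode wrapping a nested
    # dataset of steps: observations, actions, and metadata.
    for step in episode["steps"]:
        obs = step["observation"]  # e.g. camera images, proprioception
        action = step["action"]    # teleoperated control command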

What carries the argument

The DROID dataset itself: a collection of 76k real-world demonstration trajectories spanning 564 scenes and 84 tasks, gathered by non-experts in varied everyday environments.

If this is right

  • Policies trained on DROID data reach higher success rates on the same tasks used during collection.
  • The same policies transfer more successfully to rooms and object arrangements never seen in training.
  • Robot learning pipelines can now incorporate data gathered by non-experts without specialized facilities.
  • Open release of the full trajectories and training code allows other groups to reproduce and extend the results directly.
  • Future dataset efforts can prioritize geographic and scene diversity over perfect control of every variable.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the generalization gains hold, robot training may shift from controlled lab collection toward distributed, low-cost gathering in homes and workplaces.
  • The result implies that scene diversity can substitute for some amount of expert supervision and hardware precision.
  • A natural next test would be whether the same dataset also improves performance when used for imitation learning from video only, without robot proprioception.
  • The collection protocol could be applied to other robot platforms to test whether the benefits are hardware-specific.

Load-bearing premise

The data collected by non-experts in ordinary, uncontrolled settings is high enough in quality and coverage to produce measurable gains in policy performance and generalization beyond what smaller lab datasets already provide.

What would settle it

Train identical policy architectures on DROID versus on a comparable number of trajectories from existing lab datasets, then measure success rates on a held-out set of tasks performed in new rooms; if the DROID-trained policies show no advantage or perform worse, the central claim is falsified.
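
A minimal sketch of that settling experiment, assuming hypothetical helpers train_policy and rollout_success_rate in place of an actual training and evaluation harness; none of these names come from the DROID release.

def falsification_test(droid_trajs, lab_trajs, heldout_tasks, n_trials=50):
    """Train the same architecture on each dataset, then compare mean
    success rates on held-out tasks performed in unseen rooms."""
    rates = {}
    for name, trajs in (("droid", droid_trajs), ("lab", lab_trajs)):
        policy = train_policy(trajs)  # identical architecture both times
        successes = [
            rollout_success_rate(policy, task, n_trials)
            for task in heldout_tasks
        ]
        rates[name] = sum(successes) / len(successes)
    # The central claim is falsified if this margin is <= 0.
    return rates["droid"] - rates["lab"]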

read the original abstract

The creation of large, diverse, high-quality robot manipulation datasets is an important stepping stone on the path toward more capable and robust robotic manipulation policies. However, creating such datasets is challenging: collecting robot manipulation data in diverse environments poses logistical and safety challenges and requires substantial investments in hardware and human labour. As a result, even the most general robot manipulation policies today are mostly trained on data collected in a small number of environments with limited scene and task diversity. In this work, we introduce DROID (Distributed Robot Interaction Dataset), a diverse robot manipulation dataset with 76k demonstration trajectories or 350 hours of interaction data, collected across 564 scenes and 84 tasks by 50 data collectors in North America, Asia, and Europe over the course of 12 months. We demonstrate that training with DROID leads to policies with higher performance and improved generalization ability. We open source the full dataset, policy learning code, and a detailed guide for reproducing our robot hardware setup.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces DROID, a large-scale in-the-wild robot manipulation dataset with 76k demonstration trajectories (350 hours) collected across 564 scenes and 84 tasks by 50 non-expert collectors in North America, Asia, and Europe. It claims that policies trained on DROID achieve higher performance and improved generalization ability relative to smaller lab-collected datasets, and releases the full dataset, policy learning code, and hardware reproduction guide.

Significance. If the empirical results hold, DROID would be a valuable resource for robot learning by substantially increasing the scale and diversity of available real-world manipulation data, addressing a core limitation of current policies trained in limited environments. The open-sourcing of the dataset, code, and detailed hardware guide is a clear strength that supports reproducibility and community use.

major comments (1)
  1. [Section 5] Section 5 (Experiments): The manuscript reports that training with DROID yields higher performance and better generalization, but does not present volume-controlled ablations (e.g., comparing full DROID against an equal-sized subset collected in a single lab environment). Without such controls it is difficult to isolate whether gains arise from the in-the-wild diversity, geographic spread, and non-expert collection rather than from increased total interaction time, which directly affects the central attribution in the abstract and title.
minor comments (1)
  1. [Abstract] Abstract: The claim of improved performance and generalization is stated without any quantitative metrics, baselines, or effect sizes; including one or two key numbers would make the summary more informative.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed review and constructive feedback on our manuscript. The primary concern raised is the lack of volume-controlled ablations in Section 5 to better isolate the contributions of in-the-wild diversity versus raw data volume. We address this point directly below and commit to revisions that strengthen the empirical claims.

read point-by-point responses
  1. Referee: [Section 5] Section 5 (Experiments): The manuscript reports that training with DROID yields higher performance and better generalization, but does not present volume-controlled ablations (e.g., comparing full DROID against an equal-sized subset collected in a single lab environment). Without such controls it is difficult to isolate whether gains arise from the in-the-wild diversity, geographic spread, and non-expert collection rather than from increased total interaction time, which directly affects the central attribution in the abstract and title.

    Authors: We agree that volume-controlled ablations would provide stronger evidence for attributing performance gains specifically to the in-the-wild aspects of DROID rather than scale alone. Our current experiments in Section 5 compare policies trained on DROID against those trained on smaller existing lab-collected datasets (e.g., Bridge, RT-1), which implicitly vary both volume and diversity. However, we did not include explicit controls matching DROID's full volume against an equivalently sized single-lab subset, as our data collection prioritized geographic and scene diversity across 564 environments rather than concentrating volume in one setting. In the revised manuscript, we will add volume-controlled experiments by (1) subsampling DROID to match the sizes of the baseline datasets used in our comparisons and (2) constructing a volume-matched subset drawn from the largest individual scenes or geographic clusters in DROID where sufficient data exists. We will also include a discussion of the practical challenges in collecting large-scale single-lab data at DROID's volume. These additions will clarify the relative contributions of scale and diversity. revision: yes
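
As a sketch of the volume-controlled ablation the rebuttal commits to, the helper below subsamples a trajectory list until its total interaction time matches a baseline dataset's volume, so that only diversity varies between conditions. The episode records and their duration_s field are hypothetical names, not part of the DROID codebase.

import random

def volume_matched_subset(episodes, target_hours, seed=0):
    """Randomly draw episodes until total interaction time reaches the
    baseline dataset's volume."""
    rng = random.Random(seed)
    pool = list(episodes)
    rng.shuffle(pool)
    subset, hours = [], 0.0
    for ep in pool:
        if hours >= target_hours:
            break
        subset.append(ep)
        hours += ep["duration_s"] / 3600.0
    return subset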

Circularity Check

0 steps flagged

No circularity: empirical dataset release with direct experimental validation

full rationale

The paper's core contribution is the collection and public release of the DROID dataset (76k trajectories across 564 scenes) followed by empirical policy training results showing performance gains. No derivation chain, equations, fitted parameters, or predictions exist that reduce to inputs by construction. Claims rest on direct experimentation and comparisons to baselines rather than self-definitional loops, self-citation load-bearing premises, or renamed known results. The work is validated against external benchmarks rather than its own constructions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper rests on the domain assumption that in-the-wild data collection yields higher-quality training signals for generalization than lab data; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption Diverse real-world robot interaction data improves policy generalization over limited lab data
    Invoked when claiming higher performance and generalization from training on DROID.

pith-pipeline@v0.9.0 · 5912 in / 1221 out tokens · 54192 ms · 2026-05-11T05:47:41.116365+00:00 · methodology

discussion (0)


Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Membership Inference Attacks on Vision-Language-Action Models

    cs.CR 2026-05 unverdicted novelty 8.0

    Vision-language-action models are highly vulnerable to membership inference attacks, including practical black-box versions that exploit generated actions and motion trajectories.

  2. From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation

    cs.RO 2026-05 unverdicted novelty 7.0

    MoLA infers a mixture of latent actions from generated future videos via modality-aware inverse dynamics models to improve robot manipulation policies.

  3. Split the Differences, Pool the Rest: Provably Efficient Multi-Objective Imitation

    cs.LG 2026-05 unverdicted novelty 7.0

    MA-BC partitions divergent expert data while pooling non-conflicting pairs in MOMDPs, converging faster to Pareto-optimal policies than independent learners and matching a new minimax lower bound.

  4. Coordinated Diffusion: Generating Multi-Agent Behavior Without Multi-Agent Demonstrations

    cs.RO 2026-05 unverdicted novelty 7.0

    CoDi decomposes the multi-agent diffusion score into pre-trained single-agent policies plus a gradient-free cost guidance term to generate coordinated behavior from single-agent data alone.

  5. Offline Policy Evaluation for Manipulation Policies via Discounted Liveness Formulation

    cs.RO 2026-05 conditional novelty 7.0

    A liveness-based Bellman operator enables conservative offline policy evaluation for manipulation tasks by encoding task progression and reducing truncation bias from finite horizons.

  6. CapVector: Learning Transferable Capability Vectors in Parametric Space for Vision-Language-Action Models

    cs.CV 2026-05 unverdicted novelty 7.0

    Capability vectors extracted from parameter differences between standard and auxiliary-finetuned VLA models can be merged into pretrained weights to match auxiliary-training performance while reducing computational ov...

  7. OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation

    cs.RO 2026-05 unverdicted novelty 7.0

    OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.

  8. OmniRobotHome: A Multi-Camera Platform for Real-Time Multiadic Human-Robot Interaction

    cs.RO 2026-04 unverdicted novelty 7.0

    A 48-camera residential platform delivers real-time occlusion-robust 3D perception and coordinated actuation for multi-human multi-robot interaction in a shared home workspace.

  9. Atomic-Probe Governance for Skill Updates in Compositional Robot Policies

    cs.RO 2026-04 unverdicted novelty 7.0

    A cross-version swap protocol reveals dominant skills that swing composition success by up to 50 percentage points, and an atomic probe with selective revalidation governs updates at lower cost than always re-testing ...

  10. 3D Generation for Embodied AI and Robotic Simulation: A Survey

    cs.RO 2026-04 accept novelty 7.0

    3D generation for embodied AI is shifting from visual realism toward interaction readiness, organized into data generation, simulation environments, and sim-to-real bridging roles.

  11. Modular Sensory Stream for Integrating Physical Feedback in Vision-Language-Action Models

    cs.RO 2026-04 unverdicted novelty 7.0

    MoSS augments VLAs with decoupled modality streams for multiple physical signals, achieving synergistic gains in real-world robot tasks via joint attention and auxiliary future-signal prediction.

  12. CodeGraphVLP: Code-as-Planner Meets Semantic-Graph State for Non-Markovian Vision-Language-Action Models

    cs.RO 2026-04 unverdicted novelty 7.0

    CodeGraphVLP uses a semantic-graph state and executable code planner to enable reliable long-horizon non-Markovian robot manipulation, improving task success and lowering latency over standard VLA baselines.

  13. Action Images: End-to-End Policy Learning via Multiview Video Generation

    cs.CV 2026-04 unverdicted novelty 7.0

    Action Images turn robot arm motions into interpretable multiview pixel videos, letting video backbones serve as zero-shot policies for end-to-end robot learning.

  14. Tune to Learn: How Controller Gains Shape Robot Policy Learning

    cs.RO 2026-04 conditional novelty 7.0

    Controller gains affect learnability differently for behavior cloning, RL from scratch, and sim-to-real transfer, so optimal gains depend on the learning paradigm rather than desired task behavior.

  15. Towards Long-horizon Embodied Agents with Tool-Aligned Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 6.0

    VLAs-as-Tools pairs a VLM planner with specialized VLA executors via a new interface and Tool-Aligned Post-Training to raise long-horizon robot success rates on LIBERO-Long and RoboTwin benchmarks.

  16. What to Ignore, What to React: Visually Robust RL Fine-Tuning of VLA Models

    cs.RO 2026-05 conditional novelty 6.0

    PAIR-VLA adds invariance and sensitivity objectives over paired visual variants during PPO fine-tuning of VLA models, yielding 9-16% average gains on ManiSkill3 under distractors, textures, poses, viewpoints, and ligh...

  17. GuidedVLA: Specifying Task-Relevant Factors via Plug-and-Play Action Attention Specialization

    cs.RO 2026-05 unverdicted novelty 6.0

    GuidedVLA improves VLA success rates by manually supervising separate attention heads in the action decoder with auxiliary signals for task-relevant factors.

  18. HarmoWAM: Harmonizing Generalizable and Precise Manipulation via Adaptive World Action Models

    cs.RO 2026-05 unverdicted novelty 6.0

    HarmoWAM unifies predictive and reactive control in world action models via an adaptive gating mechanism to deliver improved zero-shot generalization and precision in robotic manipulation.

  19. StereoPolicy: Improving Robotic Manipulation Policies via Stereo Perception

    cs.RO 2026-05 unverdicted novelty 6.0

    StereoPolicy fuses stereo image pairs via a Stereo Transformer on pretrained 2D encoders to boost robotic manipulation policies, showing gains over monocular, RGB-D, point cloud, and multi-view methods in simulations ...

  20. ACWM-Phys: Investigating Generalized Physical Interaction in Action-Conditioned Video World Models

    cs.CV 2026-05 unverdicted novelty 6.0

    ACWM-Phys benchmark shows action-conditioned world models generalize on simple geometric interactions but drop sharply on deformable contacts, high-dimensional control, and complex articulated motion, indicating relia...

  21. ForgeVLA: Federated Vision-Language-Action Learning without Language Annotations

    cs.CV 2026-05 unverdicted novelty 6.0

    ForgeVLA enables federated VLA model training from unlabeled vision-action pairs by recovering language via embodied classifiers and using contrastive planning plus adaptive aggregation to avoid feature collapse.

  22. HumanNet: Scaling Human-centric Video Learning to One Million Hours

    cs.CV 2026-05 unverdicted novelty 6.0

    HumanNet is a 1M-hour human-centric video dataset with interaction annotations that enables better vision-language-action model performance than equivalent robot data in a controlled test.

  23. DexSynRefine: Synthesizing and Refining Human-Object Interaction Motion for Physically Feasible Dexterous Robot Actions

    cs.RO 2026-05 unverdicted novelty 6.0

    DexSynRefine synthesizes HOI motions with an extended manifold method, refines them via task-space residual RL, and adapts for sim-to-real transfer, outperforming kinematic retargeting by 50-70 percentage points on fi...

  24. TriRelVLA: Triadic Relational Structure for Generalizable Embodied Manipulation

    cs.CV 2026-05 unverdicted novelty 6.0

    TriRelVLA introduces triadic object-hand-task relational representations and a task-grounded graph transformer with a relational bottleneck to improve generalization in robotic manipulation across scenes, objects, and tasks.

  25. BifrostUMI: Bridging Robot-Free Demonstrations and Humanoid Whole-Body Manipulation

    cs.RO 2026-05 unverdicted novelty 6.0

    BifrostUMI enables robot-free human demonstration capture via VR and wrist cameras to train visuomotor policies that predict keypoint trajectories for transfer to humanoid whole-body control through retargeting.

  26. LaST-R1: Reinforcing Robotic Manipulation via Adaptive Physical Latent Reasoning

    cs.RO 2026-04 unverdicted novelty 6.0

    LaST-R1 reaches 99.8% average success on the LIBERO benchmark using one-shot warm-up plus LAPO reinforcement learning on latent physical reasoning, with up to 44% real-world gains on complex single- and dual-arm tasks.

  27. LaST-R1: Reinforcing Robotic Manipulation via Adaptive Physical Latent Reasoning

    cs.RO 2026-04 unverdicted novelty 6.0

    LaST-R1 introduces a RL post-training method called LAPO that optimizes latent Chain-of-Thought reasoning in vision-language-action models, yielding 99.9% success on LIBERO and up to 22.5% real-world gains.

  28. Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising

    cs.RO 2026-04 unverdicted novelty 6.0

    X-WAM unifies robotic action execution and 4D world synthesis by adapting video diffusion priors with a lightweight depth branch and asynchronous noise sampling, achieving 79-91% success on robot benchmarks.

  29. Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising

    cs.RO 2026-04 unverdicted novelty 6.0

    X-WAM unifies real-time robotic action execution with high-fidelity 4D world synthesis by adapting video diffusion priors through lightweight depth branches and asynchronous noise sampling, achieving 79-91% success on...

  30. Atomic-Probe Governance for Skill Updates in Compositional Robot Policies

    cs.RO 2026-04 unverdicted novelty 6.0

    Empirical study on robosuite tasks reveals a dominant-skill effect in compositions and shows that an atomic probe approximates full revalidation for skill updates at much lower cost.

  31. Breaking Lock-In: Preserving Steerability under Low-Data VLA Post-Training

    cs.RO 2026-04 unverdicted novelty 6.0

    DeLock mitigates lock-in in low-data VLA post-training via visual grounding preservation and test-time contrastive prompt guidance, outperforming baselines across eight evaluations while matching data-heavy generalist...

  32. Temporal Difference Calibration in Sequential Tasks: Application to Vision-Language-Action Models

    cs.RO 2026-04 unverdicted novelty 6.0

    Temporal difference calibration aligns uncertainty estimates in vision-language-action models with their value functions for better sequential performance.

  33. Unmasking the Illusion of Embodied Reasoning in Vision-Language-Action Models

    cs.RO 2026-04 unverdicted novelty 6.0

    State-of-the-art vision-language-action models catastrophically fail dynamic embodied reasoning due to lexical-kinematic shortcuts, behavioral inertia, and semantic feature collapse caused by architectural bottlenecks...

  34. ST-π: Structured SpatioTemporal VLA for Robotic Manipulation

    cs.RO 2026-04 unverdicted novelty 6.0

    ST-π structures VLA models by having a spatiotemporal VLM produce causally ordered chunk-level prompts that guide a dual-generator action expert to jointly handle spatial and temporal control in robotic manipulation.

  35. Chain Of Interaction Benchmark (COIN): When Reasoning meets Embodied Interaction

    cs.RO 2026-04 unverdicted novelty 6.0

    COIN provides 50 interactive robotic tasks, a 1000-demonstration dataset collected via AR teleoperation, and metrics showing that CodeAsPolicy, VLA, and H-VLA models fail at causally-dependent interactive reasoning du...

  36. V-CAGE: Vision-Closed-Loop Agentic Generation Engine for Robotic Manipulation

    cs.RO 2026-04 unverdicted novelty 6.0

    V-CAGE automates the creation of scalable, high-quality robotic manipulation datasets through context-aware scene construction, closed-loop visual verification, and perceptually-driven compression.

  37. SIM1: Physics-Aligned Simulator as Zero-Shot Data Scaler in Deformable Worlds

    cs.RO 2026-04 unverdicted novelty 6.0

    SIM1 converts sparse real demonstrations into high-fidelity synthetic data through physics-aligned simulation, yielding policies that match real-data performance at a 1:15 ratio with 90% zero-shot success on deformabl...

  38. EgoVerse: An Egocentric Human Dataset for Robot Learning from Around the World

    cs.RO 2026-04 unverdicted novelty 6.0

    EgoVerse releases 1,362 hours of standardized egocentric human data across 1,965 tasks and shows via multi-lab experiments that robot policy performance scales with human data volume when the data aligns with robot ob...

  39. A1: A Fully Transparent Open-Source, Adaptive and Efficient Truncated Vision-Language-Action Model

    cs.RO 2026-04 unverdicted novelty 6.0

    A1 is a transparent VLA framework achieving state-of-the-art robot manipulation success with up to 72% lower latency via adaptive layer truncation and inter-layer flow matching.

  40. Hierarchical Planning with Latent World Models

    cs.LG 2026-04 unverdicted novelty 6.0

    Hierarchical planning over multi-scale latent world models enables 70% success on real robotic pick-and-place with goal-only input where flat models achieve 0%, while cutting planning compute up to 4x in simulations.

  41. ARM: Advantage Reward Modeling for Long-Horizon Manipulation

    cs.RO 2026-04 unverdicted novelty 6.0

    ARM trains reward models on Progressive/Regressive/Stagnant labels to enable adaptive reweighting in offline RL, reaching 99.4% success on towel-folding with minimal human intervention.

  42. World Action Models are Zero-shot Policies

    cs.RO 2026-02 unverdicted novelty 6.0

    DreamZero uses a 14B video diffusion model as a World Action Model to achieve over 2x better zero-shot generalization on real robots than state-of-the-art VLAs, real-time 7Hz closed-loop control, and cross-embodiment ...

  43. InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy

    cs.RO 2025-10 unverdicted novelty 6.0

    InternVLA-M1 uses spatially guided pre-training on 2.3M examples followed by action post-training to deliver up to 17% gains on robot manipulation benchmarks and 20.6% on unseen objects.

  44. V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

    cs.AI 2025-06 unverdicted novelty 6.0

    V-JEPA 2 pre-trained on massive unlabeled video achieves strong results on motion understanding and action anticipation, SOTA video QA at 8B scale, and enables zero-shot robotic planning on Franka arms using only 62 h...

  45. SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics

    cs.LG 2025-06 unverdicted novelty 6.0

    SmolVLA is a small efficient VLA model that achieves performance comparable to 10x larger models while training on one GPU and deploying on consumer hardware via community data and chunked asynchronous action prediction.

  46. Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets

    cs.RO 2025-04 unverdicted novelty 6.0

    Unified World Models couple video and action diffusion inside one transformer with independent timesteps, enabling pretraining on heterogeneous robot datasets that include action-free video and producing more generali...

  47. Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success

    cs.RO 2025-02 accept novelty 6.0

    OpenVLA-OFT fine-tuning boosts LIBERO success rate from 76.5% to 97.1%, speeds action generation 26x, and outperforms baselines on real bimanual dexterous tasks.

  48. π₀: A Vision-Language-Action Flow Model for General Robot Control

    cs.LG 2024-10 unverdicted novelty 6.0

    π₀ is a vision-language-action flow model trained on diverse multi-platform robot data that supports zero-shot task performance, language instruction following, and efficient fine-tuning for dexterous tasks.

  49. RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation

    cs.RO 2024-10 conditional novelty 6.0

    RDT-1B is a diffusion foundation model that unifies action spaces across robots and demonstrates superior bimanual manipulation with zero-shot generalization, language following, and few-shot learning on real robots.

  50. Evaluating Real-World Robot Manipulation Policies in Simulation

    cs.RO 2024-05 conditional novelty 6.0

    SIMPLER simulated environments yield policy performance that correlates strongly with real-world robot manipulation results and captures similar sensitivity to distribution shifts.

  51. Nautilus: From One Prompt to Plug-and-Play Robot Learning

    cs.RO 2026-05 unverdicted novelty 5.0

    NAUTILUS is a prompt-driven harness that automates plug-and-play adapters, typed contracts, and validation for policies, benchmarks, and robots in learning research.

  52. ProcVLM: Learning Procedure-Grounded Progress Rewards for Robotic Manipulation

    cs.RO 2026-05 unverdicted novelty 5.0

    ProcVLM learns procedure-grounded dense progress rewards for robotic manipulation via a reasoning-before-estimation VLM trained on a 60M-frame synthesized corpus from 30 embodied datasets.

  53. Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling

    cs.CV 2026-04 unverdicted novelty 5.0

    Visual generation models are evolving from passive renderers to interactive agentic world modelers, but current systems lack spatial reasoning, temporal consistency, and causal understanding, with evaluations overemph...

  54. ReFineVLA: Multimodal Reasoning-Aware Generalist Robotic Policies via Teacher-Guided Fine-Tuning

    cs.RO 2026-04 unverdicted novelty 5.0

    ReFineVLA adds teacher-generated reasoning steps to VLA training and reports state-of-the-art success rates on SimplerEnv WidowX and Google Robot benchmarks.

  55. AEGIS: Anchor-Enforced Gradient Isolation for Knowledge-Preserving Vision-Language-Action Fine-Tuning

    cs.LG 2026-04 unverdicted novelty 5.0

    AEGIS uses a pre-computed Gaussian anchor and layer-wise Gram-Schmidt orthogonal projections to isolate destructive gradients during VLA fine-tuning, preserving VQA performance without co-training or replay.

  56. Jump-Start Reinforcement Learning with Vision-Language-Action Regularization

    cs.LG 2026-04 unverdicted novelty 5.0

    VLAJS augments PPO with sparse annealed VLA guidance through directional regularization to cut required interactions by over 50% on manipulation tasks and enable zero-shot sim-to-real transfer.

  57. From Video to Control: A Survey of Learning Manipulation Interfaces from Temporal Visual Data

    cs.RO 2026-04 accept novelty 5.0

    A survey introduces an interface-centric taxonomy for video-to-control methods in robotic manipulation and identifies the robotics integration layer as the central open challenge.

  58. SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model

    cs.RO 2025-01 unverdicted novelty 5.0

    SpatialVLA adds 3D-aware position encoding and adaptive discretized action grids to visual-language-action models, enabling strong zero-shot performance and fine-tuning on new robot setups after pre-training on 1.1 mi...

  59. World Action Models: The Next Frontier in Embodied AI

    cs.RO 2026-05 unverdicted novelty 4.0

    The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.

  60. JoyAI-RA 0.1: A Foundation Model for Robotic Autonomy

    cs.RO 2026-04 unverdicted novelty 4.0

    JoyAI-RA is a multi-source pretrained VLA model that claims to bridge human-to-robot embodiment gaps via data unification and outperforms prior methods on generalization-heavy robotic tasks.

Reference graph

Works this paper leans on

79 extracted references · 79 canonical work pages · cited by 60 Pith papers · 11 internal anchors

  1. [1]

    Roboagent: Generalization and efficiency in robot manipulation via semantic augmentations and action chunking, 2023

    Homanga Bharadhwaj, Jay Vakil, Mohit Sharma, Abhinav Gupta, Shubham Tulsiani, and Vikash Kumar. Roboagent: Generalization and efficiency in robot manipulation via semantic augmentations and action chunking. arXiv preprint arXiv:2309.01918, 2023

  2. [2]

    RT-1: Robotics Transformer for Real-World Control at Scale

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022

  3. [3]

    Scaling data-driven robotics with reward sketching and batch reinforcement learning

    Serkan Cabi, Sergio Gomez Colmenarejo, Alexander Novikov, Ksenia Konyushkova, Scott Reed, Rae Jeong, Konrad Zolna, Yusuf Aytar, David Budden, Mel Vecerik, Oleg Sushkov, David Barker, Jonathan Scholz, Misha Denil, Nando de Freitas, and Ziyu Wang. Scaling data-driven robotics with reward sketching and batch reinforcement learning. RSS, 2019

  4. [4]

    nuScenes: A multimodal dataset for autonomous driving

    Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuScenes: A multimodal dataset for autonomous driving. arXiv preprint arXiv:1903.11027, 2019

  5. [5]

    ShapeNet: An Information-Rich 3D Model Repository

    Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012, 2015

  6. [6]

    Rovi-aug: Robot and viewpoint augmentation for cross-embodiment robot learning

    Lawrence Yunliang Chen, Chenfeng Xu, Karthik Dharmarajan, Muhammad Zubair Irshad, Richard Cheng, Kurt Keutzer, Masayoshi Tomizuka, Quan Vuong, and Ken Goldberg. Rovi-aug: Robot and viewpoint augmentation for cross-embodiment robot learning. In Conference on Robot Learning (CoRL), Munich, Germany, 2024

  7. [7]

    Diffusion policy: Visuomotor policy learning via action diffusion

    Cheng Chi, Siyuan Feng, Yilun Du, Zhenjia Xu, Eric Cousineau, Benjamin Burchfiel, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. In Proceedings of Robotics: Science and Systems (RSS), 2023

  8. [8]

    Robonet: Large-scale multi-robot learning

    Sudeep Dasari, Frederik Ebert, Stephen Tian, Suraj Nair, Bernadette Bucher, Karl Schmeckpeper, Siddharth Singh, Sergey Levine, and Chelsea Finn. Robonet: Large-scale multi-robot learning. CoRL, 2019

  9. [9]

    Objaverse-xl: A universe of 10m+ 3d objects. arXiv preprint arXiv:2307.05663, 2023

    Matt Deitke, Ruoshi Liu, Matthew Wallingford, Huong Ngo, Oscar Michel, Aditya Kusupati, Alan Fan, Christian Laforte, Vikram Voleti, Samir Yitzhak Gadre, et al. Objaverse-xl: A universe of 10m+ 3d objects. arXiv preprint arXiv:2307.05663, 2023

  10. [10]

    Objaverse: A universe of annotated 3d objects

    Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse: A universe of annotated 3d objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13142–13153, 2023

  11. [11]

    Imagenet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009

  12. [12]

    Visual foresight: Model-based deep reinforcement learning for vision-based robotic control

    Frederik Ebert, Chelsea Finn, Sudeep Dasari, Annie Xie, Alex Lee, and Sergey Levine. Visual foresight: Model-based deep reinforcement learning for vision-based robotic control. arXiv:1812.00568, 2018

  13. [13]

    Bridge Data: Boosting Generalization of Robotic Skills with Cross-Domain Datasets

    Frederik Ebert, Yanlai Yang, Karl Schmeckpeper, Bernadette Bucher, Georgios Georgakis, Kostas Daniilidis, Chelsea Finn, and Sergey Levine. Bridge data: Boosting generalization of robotic skills with cross-domain datasets. arXiv preprint arXiv:2109.13396 , 2021

  14. [14]

    Rh20t: A comprehensive robotic dataset for learning diverse skills in one-shot

    Hao-Shu Fang, Hongjie Fang, Zhenyu Tang, Jirong Liu, Chenxi Wang, Junbo Wang, Haoyi Zhu, and Cewu Lu. Rh20t: A comprehensive robotic dataset for learning diverse skills in one-shot. Towards Generalist Robots: Learning Paradigms for Scalable Skill Acquisition@CoRL2023, 3:5, 2023

  15. [15]

    Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography

    Martin A Fischler and Robert C Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM , 24(6):381–395, 1981

  16. [16]

    Mobile aloha: Learning bimanual mobile manipulation with low-cost whole-body teleoperation.arXiv preprint arXiv:2401.02117,

    Zipeng Fu, Tony Z Zhao, and Chelsea Finn. Mobile aloha: Learning bimanual mobile manipulation with low-cost whole-body teleoperation. arXiv preprint arXiv:2401.02117, 2024

  17. [17]

    The Pile: An 800GB Dataset of Diverse Text for Language Modeling

    Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. The pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020

  18. [18]

    Are we ready for autonomous driving? the kitti vision benchmark suite

    Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. In 2012 IEEE conference on computer vision and pattern recognition, pages 3354–3361. IEEE, 2012

  19. [19]

    Ego4d: Around the world in 3,000 hours of egocentric video

    Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18995–19012, 2022

  20. [20]

    Robot learning in homes: Improving generalization and reducing dataset bias

    Abhinav Gupta, Adithyavairavan Murali, Dhiraj Prakashc- hand Gandhi, and Lerrel Pinto. Robot learning in homes: Improving generalization and reducing dataset bias. Advances in neural information processing systems , 31, 2018

  21. [21]

    Scaling up and distilling down: Language-guided robot skill acquisition

    Huy Ha, Pete Florence, and Shuran Song. Scaling up and distilling down: Language-guided robot skill acquisition. In Conference on Robot Learning , pages 3766–3777. PMLR, 2023

  22. [22]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems , 33:6840–6851, 2020

  23. [23]

    spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing

    Matthew Honnibal and Ines Montani. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear, 2017

  24. [24]

    Bc-z: Zero-shot task generalization with robotic imitation learning

    Eric Jang, Alex Irpan, Mohi Khansari, Daniel Kappler, Frederik Ebert, Corey Lynch, Sergey Levine, and Chelsea Finn. Bc-z: Zero-shot task generalization with robotic imitation learning. In Conference on Robot Learning , pages 991–1002. PMLR, 2022

  25. [25]

    VIMA: Robot manipulation with multimodal prompts

    Yunfan Jiang, Agrim Gupta, Zichen Zhang, Guanzhi Wang, Yongqiang Dou, Yanjun Chen, Li Fei-Fei, Anima Anandkumar, Yuke Zhu, and Linxi Fan. VIMA: Robot manipulation with multimodal prompts. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, Proceedings of the 40th International Conference on...

  26. [26]

    Qt-opt: Scalable deep reinforcement learning for vision-based robotic manipulation, 2018

    Dmitry Kalashnikov, Alex Irpan, Peter Pastor, Julian Ibarz, Alexander Herzog, Eric Jang, Deirdre Quillen, Ethan Holly, Mrinal Kalakrishnan, Vincent Vanhoucke, et al. QT-Opt: Scalable deep reinforcement learning for vision-based robotic manipulation. arXiv preprint arXiv:1806.10293, 2018

  27. [27]

    Mt-opt: Continuous multi-task robotic reinforcement learning at scale

    Dmitry Kalashnikov, Jake Varley, Yevgen Chebotar, Ben Swanson, Rico Jonschkowski, Chelsea Finn, Sergey Levine, and Karol Hausman. Mt-opt: Continuous multi-task robotic reinforcement learning at scale. arXiv, 2021

  28. [28]

    Socially compliant navigation dataset (scand): A large-scale dataset of demonstrations for social navigation

    Haresh Karnan, Anirudh Nair, Xuesu Xiao, Garrett Warnell, Sören Pirk, Alexander Toshev, Justin Hart, Joydeep Biswas, and Peter Stone. Socially compliant navigation dataset (scand): A large-scale dataset of demonstrations for social navigation. IEEE Robotics and Automation Letters, 7(4):11807–11814, 2022

  29. [29]

    Segment Anything

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick. Segment anything, 2023. URL https://arxiv.org/abs/2304.02643

  30. [30]

    Segment Anything

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, et al. Segment Anything, April 2023

  31. [31]

    EPnP: An accurate O(n) solution to the PnP problem

    Vincent Lepetit, Francesc Moreno-Noguer, and Pascal Fua. EPnP: An accurate O(n) solution to the PnP problem. International Journal of Computer Vision, 81:155–166, 2009

  32. [32]

    Learning hand-eye coordination for robotic grasping with large-scale data collection

    Sergey Levine, Peter Pastor, Alex Krizhevsky, and Deirdre Quillen. Learning hand-eye coordination for robotic grasping with large-scale data collection. In International Symposium on Experimental Robotics . Springer, 2016

  33. [33]

    Polymetis

    Yixin Lin, Austin S. Wang, Giovanni Sutanto, Akshara Rai, and Franziska Meier. Polymetis. https://facebookresearch.github.io/fairo/polymetis/, 2021

  34. [34]

    Ctrnet-x: Camera-to-robot pose estimation in real-world conditions using a single camera

    Jingpei Lu, Zekai Liang, Tristin Xie, Florian Richter, Shan Lin, Sainan Liu, and Michael C Yip. Ctrnet-x: Camera-to-robot pose estimation in real-world conditions using a single camera. arXiv preprint arXiv:2409.10441, 2024

  35. [35]

    Interactive language: Talking to robots in real time

    Corey Lynch, Ayzaan Wahid, Jonathan Tompson, Tianli Ding, James Betker, Robert Baruch, Travis Armstrong, and Pete Florence. Interactive language: Talking to robots in real time. IEEE Robotics and Automation Letters , 2023

  36. [36]

    Roboturk: A crowdsourcing platform for robotic skill learning through imitation

    Ajay Mandlekar, Yuke Zhu, Animesh Garg, Jonathan Booher, Max Spero, Albert Tung, Julian Gao, John Emmons, Anchit Gupta, Emre Orbay, et al. Roboturk: A crowdsourcing platform for robotic skill learning through imitation. In Conference on Robot Learning , pages 879–

  37. [37]

    What Matters in Learning from Offline Human Demonstrations for Robot Manipulation

    Ajay Mandlekar, Danfei Xu, Josiah Wong, Soroush Nasiriany, Chen Wang, Rohun Kulkarni, Li Fei-Fei, Silvio Savarese, Yuke Zhu, and Roberto Martín-Martín. What matters in learning from offline human demonstrations for robot manipulation. In arXiv preprint arXiv:2108.03298 , 2021

  38. [38]

    Octo: An open-source generalist robot policy

    Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Charles Xu, Jianlan Luo, Tobias Kreiman, You Liang Tan, Dorsa Sadigh, Chelsea Finn, and Sergey Levine. Octo: An open-source generalist robot policy. https://octo-models.github.io, 2023

  39. [39]

    Open X-Embodiment: Robotic learning datasets and RT-X models

    Open X-Embodiment Collaboration, Abhishek Padalkar, Acorn Pooley, Ajinkya Jain, Alex Bewley, Alex Herzog, Alex Irpan, Alexander Khazatsky, Anant Rai, Anikait Singh, Anthony Brohan, Antonin Raffin, Ayzaan Wahid, Ben Burgess-Limerick, Beomjoon Kim, Bernhard Schölkopf, Brian Ichter, Cewu Lu, Charles Xu, Chelsea Finn, Chenfeng Xu, Cheng Chi, Chenguang Hua...

  40. [40]

    Supersizing self- supervision: Learning to grasp from 50k tries and 700 robot hours

    Lerrel Pinto and Abhinav Gupta. Supersizing self- supervision: Learning to grasp from 50k tries and 700 robot hours. In 2016 IEEE international conference on robotics and automation (ICRA) , pages 3406–3413. IEEE, 2016

  41. [41]

    Learning Transferable Visual Models From Natural Language Supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision, 2021. URL https://arxiv.org/abs/2103.00020

  42. [42]

    Robot learning with sensorimotor pre-training

    Ilija Radosavovic, Baifeng Shi, Letian Fu, Ken Goldberg, Trevor Darrell, and Jitendra Malik. Robot learning with sensorimotor pre-training. In Conference on Robot Learning, 2023

  43. [43]

    Common crawl – building an open web-scale crawl using hadoop, 2010

    Ahad Rana. Common crawl – building an open web-scale crawl using hadoop, 2010. URL https://www.slideshare.net/hadoopusergroup/common-crawlpresentation

  44. [44]

    Accelerating 3d deep learning with pytorch3d

    Nikhila Ravi, Jeremy Reizenstein, David Novotny, Taylor Gordon, Wan-Yen Lo, Justin Johnson, and Georgia Gkioxari. Accelerating 3d deep learning with pytorch3d. arXiv:2007.08501, 2020

  45. [45]

    DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

    Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. ArXiv, abs/1910.01108, 2019. URL https://api.semanticscholar.org/CorpusID:203626972

  47. [47]

    Laion-5b: An open large-scale dataset for training next generation image-text models

    Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294, 2022

  48. [48]

    On bringing robots home, 2023

    Nur Muhammad Mahi Shafiullah, Anant Rai, Haritheja Etukuru, Yiqian Liu, Ishan Misra, Soumith Chintala, and Lerrel Pinto. On bringing robots home, 2023

  49. [49]

    Gnm: A general navigation model to drive any robot

    Dhruv Shah, Ajay Sridhar, Arjun Bhorkar, Noriaki Hirose, and Sergey Levine. Gnm: A general navigation model to drive any robot. In 2023 IEEE International Conference on Robotics and Automation (ICRA) , pages 7226–7233. IEEE, 2023

  50. [50]

    Vint: A foundation model for visual navigation. arXiv preprint arXiv:2306.14846, 2023

    Dhruv Shah, Ajay Sridhar, Nitish Dashora, Kyle Stachow- icz, Kevin Black, Noriaki Hirose, and Sergey Levine. ViNT: A foundation model for visual navigation. In 7th Annual Conference on Robot Learning , 2023. URL https://arxiv.org/abs/2306.14846

  51. [51]

    Multiple interactions made easy (mime): Large scale demonstrations data for imitation

    Pratyusha Sharma, Lekha Mohan, Lerrel Pinto, and Abhinav Gupta. Multiple interactions made easy (mime): Large scale demonstrations data for imitation. In Conference on Robot Learning, pages 906–915. PMLR, 2018

  52. [52]

    Perceiver-actor: A multi-task transformer for robotic manipulation

    Mohit Shridhar, Lucas Manuelli, and Dieter Fox. Perceiver-actor: A multi-task transformer for robotic manipulation. In Proceedings of the 6th Conference on Robot Learning (CoRL) , 2022

  53. [53]

    Perceiver-actor: A multi-task transformer for robotic manipulation

    Mohit Shridhar, Lucas Manuelli, and Dieter Fox. Perceiver-actor: A multi-task transformer for robotic manipulation. In Conference on Robot Learning , pages 785–799. PMLR, 2023

  54. [54]

    Grasping in the wild: Learning 6dof closed-loop grasping from low-cost demonstrations

    Shuran Song, Andy Zeng, Johnny Lee, and Thomas Funkhouser. Grasping in the wild: Learning 6dof closed-loop grasping from low-cost demonstrations. IEEE Robotics and Automation Letters, 5(3):4978–4985, 2020

  55. [55]

    Nomad: Goal masked diffusion policies for navigation and exploration. arXiv preprint arXiv:2310.07896, 2023

    Ajay Sridhar, Dhruv Shah, Catherine Glossop, and Sergey Levine. Nomad: Goal masked diffusion policies for navigation and exploration. arXiv preprint arXiv:2310.07896, 2023

  56. [56]

    Scalability in perception for autonomous driving: Waymo open dataset, 2019

    Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, Vijay Vasudevan, Wei Han, Jiquan Ngiam, Hang Zhao, Aleksei Timofeev, Scott Ettinger, Maxim Krivokon, Amy Gao, Aditya Joshi, Yu Zhang, Jonathon Shlens, Zhifeng Chen, and Dragomir Anguelov. Scalability in perception...

  57. [57]

    View-invariant policy learning via zero-shot novel view synthesis

    Stephen Tian, Blake Wulfe, Kyle Sargent, Katherine Liu, Sergey Zakharov, Vitor Guizilini, and Jiajun Wu. View-invariant policy learning via zero-shot novel view synthesis. arXiv, 2024

  58. [58]

    Tartandrive: A large-scale dataset for learning off-road dynamics models

    Samuel Triest, Matthew Sivaprakasam, Sean J Wang, Wenshan Wang, Aaron M Johnson, and Sebastian Scherer. Tartandrive: A large-scale dataset for learning off-road dynamics models. In 2022 International Conference on Robotics and Automation (ICRA), pages 2546–2552. IEEE, 2022

  59. [59]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems , pages 5998–6008, 2017

  60. [60]

    Bridgedata v2: A dataset for robot learning at scale, 2023

    Homer Walke, Kevin Black, Abraham Lee, Moo Jin Kim, Max Du, Chongyi Zheng, Tony Zhao, Philippe Hansen-Estruch, Quan Vuong, Andre He, Vivek Myers, Kuan Fang, Chelsea Finn, and Sergey Levine. Bridgedata v2: A dataset for robot learning at scale, 2023

  61. [61]

    Vggt: Visual geometry grounded transformer

    Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

  62. [62]

    Dust3r: Geometric 3d vision made easy, 2024

    Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vision made easy, 2024. URL https://arxiv.org/abs/2312.14132

  63. [63]

    Fast3r: Towards 3d reconstruction of 1000+ images in one forward pass

    Jianing Yang, Alexander Sax, Kevin J. Liang, Mikael Henaff, Hao Tang, Ang Cao, Joyce Chai, Franziska Meier, and Matt Feiszli. Fast3r: Towards 3d reconstruction of 1000+ images in one forward pass. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2025

  64. [64]

    Visual imitation made easy, 2020

    Sarah Young, Dhiraj Gandhi, Shubham Tulsiani, Abhinav Gupta, Pieter Abbeel, and Lerrel Pinto. Visual imitation made easy, 2020

  65. [65]

    Bdd100k: A diverse driving dataset for heterogeneous multitask learning

    Fisher Yu, Haofeng Chen, Xin Wang, Wenqi Xian, Yingying Chen, Fangchen Liu, Vashisht Madhavan, and Trevor Darrell. Bdd100k: A diverse driving dataset for heterogeneous multitask learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2636–2645, 2020

  66. [66]

    Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware

    Tony Z Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware. arXiv preprint arXiv:2304.13705 , 2023

  67. [67]

    3D-VLA: A 3D Vision-Language-Action Generative World Model

    Haoyu Zhen, Xiaowen Qiu, Peihao Chen, Jincheng Yang, Xin Yan, Yilun Du, Yining Hong, and Chuang Gan. 3D-VLA: A 3D vision-language-action generative world model. arXiv preprint arXiv:2403.09631, 2024

  68. [68]

    RT-2: Vision-language-action models transfer web knowledge to robotic control

    Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In 7th Annual Conference on Robot Learning, 2023

  69. [69]

    Industrial office: industrial office tables and chairs, conference rooms, conference TVs

  70. [70]

    Industrial kitchen: industrial refrigerator, sink, coffee maker

  71. [71]

    Industrial dining room: industrial setting with dining tables

  72. [72]

    Home office: desk or desk chairs in a home setting

  73. [73]

    Home kitchen: refrigerator, kitchen sink, kitchen tabletop in a home setting

  74. [74]

    Home dining room: dining table, dining chairs, in a home setting

  75. [75]

    Bedroom: room with a bed

  76. [76]

    Bathroom: Showers, baths, toilets, bathroom sinks

  77. [77]

    Living room: places with couches, armchairs, coffee tables, tvs in a home setting

  78. [78]

    Hallway / closet: areas between rooms, situations where the robot is interacting with a door or objects in a closet

  79. [79]

    Other: any other location that does not fit into those categories. Unknown: a scene that's too hard to classify because the image is dark or too close up. (Both categories from Listing C.1, the prompt provided to GPT4V to classify scene types.)