pith. machine review for the scientific record.

arxiv: 2412.13877 · v3 · submitted 2024-12-18 · 💻 cs.RO · cs.AI

Recognition: 2 theorem links


RoboMIND: Benchmark on Multi-embodiment Intelligence Normative Data for Robot Manipulation

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 22:09 UTC · model grok-4.3

classification 💻 cs.RO cs.AI
keywords: RoboMIND · multi-embodiment · teleoperation dataset · imitation learning · robot manipulation · Vision-Language-Action models · failure demonstrations · digital twin

The pith

RoboMIND supplies 107k teleoperated trajectories across four robot embodiments to train generalizable manipulation policies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents RoboMIND as a dataset of 107k demonstration trajectories covering 479 tasks and 96 object classes, collected via human teleoperation on a single unified platform. It spans four robotic embodiments and records multi-view observations, robot states, language instructions, plus 5k labeled failure cases. Experiments apply imitation learning methods to single tasks and Vision-Language-Action models to multi-task settings, reporting high success rates and improved generalization when policies are trained on this data. The work claims this constitutes the largest multi-embodiment teleoperation collection built under consistent protocols.

Core claim

RoboMIND is a dataset containing 107k demonstration trajectories across 479 diverse tasks involving 96 object classes, collected through human teleoperation on a unified platform that covers four distinct robotic embodiments: the Franka Emika Panda, the UR5e, the AgileX dual-arm robot, and a humanoid robot with dual dexterous hands. The dataset includes multi-view observations, proprioceptive robot state information, linguistic task descriptions, and 5k real-world failure demonstrations each paired with detailed causes, together with a matching digital twin in the Isaac Sim simulator.
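
To make the record contents concrete, here is a minimal sketch of what one trajectory in such a dataset could look like. The field names and types below are illustrative assumptions, not the actual RoboMIND release schema.

```python
# Hypothetical per-trajectory schema implied by the core claim above.
# Field names (rgb_views, proprio, instruction, failure_cause) are assumptions,
# not the dataset's published format.
from dataclasses import dataclass, field

import numpy as np


@dataclass
class TrajectoryStep:
    rgb_views: dict[str, np.ndarray]  # camera name -> HxWx3 image (multi-view observations)
    proprio: np.ndarray               # proprioceptive state: joint positions, gripper, etc.
    action: np.ndarray                # teleoperated command recorded at this step


@dataclass
class Trajectory:
    embodiment: str                   # "franka_panda", "ur5e", "agilex_dual_arm", or "humanoid_hands"
    task_id: str                      # one of the 479 tasks
    instruction: str                  # linguistic task description
    steps: list[TrajectoryStep] = field(default_factory=list)
    success: bool = True              # roughly 5k trajectories are labeled failures
    failure_cause: str | None = None  # annotated cause, only meaningful when success is False
```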

What carries the argument

The unified data collection platform and standardized protocol that records consistent teleoperation demonstrations across multiple robotic embodiments.

Load-bearing premise

Human teleoperation demonstrations collected on one unified platform supply enough quality and coverage to train policies that generalize across robot embodiments and unseen real-world conditions.

What would settle it

A controlled test in which a VLA model trained on RoboMIND is evaluated on a previously unseen robot embodiment or on tasks outside the 479 covered ones and yields success rates no higher than models trained on smaller single-embodiment datasets.
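
A minimal sketch of that protocol is given below; `train_policy` and `evaluate` are placeholders for whatever VLA training and real-robot evaluation stack is actually used, and the trial count is an arbitrary assumption.

```python
# Sketch of the held-out-embodiment test described above: compare a policy
# trained on the other three embodiments against one trained only on the
# held-out robot's own (smaller) data. Placeholder functions, not paper code.
EMBODIMENTS = ["franka_panda", "ur5e", "agilex_dual_arm", "humanoid_hands"]


def held_out_embodiment_test(dataset, train_policy, evaluate, n_trials=50):
    results = {}
    for held_out in EMBODIMENTS:
        cross_split = [t for t in dataset if t.embodiment != held_out]
        single_split = [t for t in dataset if t.embodiment == held_out]

        cross_policy = train_policy(cross_split)    # multi-embodiment training
        single_policy = train_policy(single_split)  # single-embodiment baseline

        results[held_out] = {
            "cross_embodiment_success": evaluate(cross_policy, held_out, n_trials),
            "single_embodiment_success": evaluate(single_policy, held_out, n_trials),
        }
    return results
```

If the cross-embodiment success rates came out no higher than the single-embodiment baselines, the load-bearing premise above would be undercut.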

Original abstract

In this paper, we introduce RoboMIND (Multi-embodiment Intelligence Normative Data for Robot Manipulation), a dataset containing 107k demonstration trajectories across 479 diverse tasks involving 96 object classes. RoboMIND is collected through human teleoperation and encompasses comprehensive robotic-related information, including multi-view observations, proprioceptive robot state information, and linguistic task descriptions. To ensure data consistency and reliability for imitation learning, RoboMIND is built on a unified data collection platform and a standardized protocol, covering four distinct robotic embodiments: the Franka Emika Panda, the UR5e, the AgileX dual-arm robot, and a humanoid robot with dual dexterous hands. Our dataset also includes 5k real-world failure demonstrations, each accompanied by detailed causes, enabling failure reflection and correction during policy learning. Additionally, we created a digital twin environment in the Isaac Sim simulator, replicating the real-world tasks and assets, which facilitates the low-cost collection of additional training data and enables efficient evaluation. To demonstrate the quality and diversity of our dataset, we conducted extensive experiments using various imitation learning methods for single-task settings and state-of-the-art Vision-Language-Action (VLA) models for multi-task scenarios. By leveraging RoboMIND, the VLA models achieved high manipulation success rates and demonstrated strong generalization capabilities. To the best of our knowledge, RoboMIND is the largest multi-embodiment teleoperation dataset collected on a unified platform, providing large-scale and high-quality robotic training data. Our project is at https://x-humanoid-robomind.github.io/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces RoboMIND, a dataset of 107k teleoperated demonstration trajectories across 479 tasks, 96 object classes, and four robotic embodiments (Franka Emika Panda, UR5e, AgileX dual-arm, humanoid with dexterous hands). Collected on a unified platform with multi-view observations, proprioceptive states, and language descriptions, it also includes 5k failure cases with annotated causes and a matching Isaac Sim digital twin. Experiments with imitation learning and VLA models are said to produce high manipulation success rates and strong generalization.

Significance. If the experimental validation holds, RoboMIND would be a valuable large-scale resource for multi-embodiment imitation and VLA training. The unified collection protocol, scale, inclusion of real failure demonstrations for reflection/correction, and simulator twin are concrete strengths that could support reproducible policy development and low-cost data augmentation.

major comments (1)
  1. §4 (Experiments): The VLA multi-task results claim 'high manipulation success rates and strong generalization capabilities' without reporting numerical success rates, baselines, per-embodiment breakdowns, or the cross-embodiment protocol (e.g., whether any platform was held out, how kinematics/dynamics differences were handled, or the contribution of the 5k failure cases). This directly undermines assessment of the central claim that the dataset enables cross-embodiment transfer.
minor comments (2)
  1. Abstract: Add at least one key quantitative result (e.g., average success rate or comparison to prior datasets) to make the empirical claims concrete rather than qualitative.
  2. §3 (Dataset): Clarify the distribution of the 107k trajectories across the four embodiments to allow readers to judge balance and potential single-embodiment dominance.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive comments on our manuscript. We provide a point-by-point response to the major comment below and outline the revisions we will make to address the concerns.

Point-by-point responses
  1. Referee: §4 (Experiments): The VLA multi-task results claim 'high manipulation success rates and strong generalization capabilities' without reporting numerical success rates, baselines, per-embodiment breakdowns, or the cross-embodiment protocol (e.g., whether any platform was held out, how kinematics/dynamics differences were handled, or the contribution of the 5k failure cases). This directly undermines assessment of the central claim that the dataset enables cross-embodiment transfer.

    Authors: We acknowledge that the manuscript does not provide specific numerical values for the VLA success rates or detailed breakdowns in the current version. To address this, we will revise Section 4 to include quantitative results, including overall and per-embodiment success rates for the VLA models, comparisons against relevant baselines, and a clear description of the experimental protocol. Specifically, we will clarify that the multi-embodiment training was performed jointly on data from all four robots using a unified observation and action space to mitigate kinematic differences, without holding out any embodiment. We will also add results showing the impact of incorporating the 5k failure demonstrations on policy performance. These changes will allow readers to better evaluate the cross-embodiment transfer capabilities enabled by RoboMIND. revision: yes
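
A minimal sketch of what such a unified action space could look like is below; the per-robot dimensions and the zero-padding-plus-mask scheme are assumptions for illustration, not the paper's actual encoding.

```python
# Illustrative only: map each robot's native action vector into one fixed-width
# space so a single policy head can serve all four embodiments.
import numpy as np

# Hypothetical native action dimensions per embodiment (assumed, not from the paper).
NATIVE_DIMS = {
    "franka_panda": 8,      # 7 joints + gripper
    "ur5e": 7,              # 6 joints + gripper
    "agilex_dual_arm": 14,  # two arms, each 6 joints + gripper
    "humanoid_hands": 26,   # arms plus dexterous hands
}
UNIFIED_DIM = max(NATIVE_DIMS.values())


def to_unified(action: np.ndarray, embodiment: str) -> np.ndarray:
    """Zero-pad a native action into the shared space and append a validity mask."""
    padded = np.zeros(UNIFIED_DIM, dtype=np.float32)
    mask = np.zeros(UNIFIED_DIM, dtype=np.float32)
    d = NATIVE_DIMS[embodiment]
    padded[:d] = action[:d]
    mask[:d] = 1.0
    return np.concatenate([padded, mask])


def from_unified(unified: np.ndarray, embodiment: str) -> np.ndarray:
    """Recover the native command for execution on the given robot."""
    return unified[: NATIVE_DIMS[embodiment]]
```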

Circularity Check

0 steps flagged

No circularity: empirical dataset introduction with no derivation chain

Full rationale

The paper introduces an empirical teleoperation dataset (107k trajectories across four embodiments) and reports standard imitation-learning and VLA experiments. No equations, fitted parameters, or self-referential reductions appear; the size/quality claim rests on the described collection protocol rather than any self-definition, fitted-input prediction, or load-bearing self-citation. The derivation chain is therefore self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical dataset and benchmark paper. No mathematical derivations, fitted parameters, background axioms, or new postulated entities are introduced.

pith-pipeline@v0.9.0 · 5728 in / 1217 out tokens · 23332 ms · 2026-05-15T22:09:44.467479+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. RotVLA: Rotational Latent Action for Vision-Language-Action Model

    cs.RO 2026-05 unverdicted novelty 7.0

    RotVLA models latent actions as continuous SO(n) rotations with triplet-frame supervision and flow-matching to reach 98.2% success on LIBERO and 89.6%/88.5% on RoboTwin2.0 using a 1.7B-parameter model.

  2. RIO: Flexible Real-Time Robot I/O for Cross-Embodiment Robot Learning

    cs.RO 2026-05 unverdicted novelty 7.0

    RIO introduces a lightweight open-source framework that abstracts real-time robot I/O to support easy switching between embodiments and platforms for collecting data and deploying VLAs.

  3. HRDexDB: A Large-Scale Dataset of Dexterous Human and Robotic Hand Grasps

    cs.RO 2026-04 unverdicted novelty 7.0

    HRDexDB is a multi-modal dataset of 1.4K human and robotic dexterous grasps across 100 objects, providing aligned 3D kinematics, high-resolution tactile data, and video streams.

  4. BiCoord: A Bimanual Manipulation Benchmark towards Long-Horizon Spatial-Temporal Coordination

    cs.RO 2026-04 conditional novelty 7.0

    BiCoord is a new benchmark for long-horizon tightly coordinated bimanual manipulation that includes quantitative metrics and shows existing policies like DP, RDT, Pi0 and OpenVLA-OFT struggle on such tasks.

  5. Towards Generalizable Robotic Manipulation in Dynamic Environments

    cs.CV 2026-03 unverdicted novelty 7.0

    DOMINO dataset and PUMA architecture enable better dynamic robotic manipulation by incorporating motion history, delivering 6.3% higher success rates than prior VLA models.

  6. HarmoWAM: Harmonizing Generalizable and Precise Manipulation via Adaptive World Action Models

    cs.RO 2026-05 unverdicted novelty 6.0

    HarmoWAM unifies predictive and reactive control in world action models via an adaptive gating mechanism to deliver improved zero-shot generalization and precision in robotic manipulation.

  7. AT-VLA: Adaptive Tactile Injection for Enhanced Feedback Reaction in Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 6.0

    AT-VLA introduces adaptive tactile injection and a dual-stream tactile reaction mechanism to integrate real-time tactile feedback into pretrained VLA models for contact-rich robotic manipulation.

  8. LaST-R1: Reinforcing Robotic Manipulation via Adaptive Physical Latent Reasoning

    cs.RO 2026-04 unverdicted novelty 6.0

    LaST-R1 introduces an RL post-training method called LAPO that optimizes latent Chain-of-Thought reasoning in vision-language-action models, yielding 99.9% success on LIBERO and up to 22.5% real-world gains.

  9. LaST-R1: Reinforcing Robotic Manipulation via Adaptive Physical Latent Reasoning

    cs.RO 2026-04 unverdicted novelty 6.0

    LaST-R1 reaches 99.8% average success on the LIBERO benchmark using one-shot warm-up plus LAPO reinforcement learning on latent physical reasoning, with up to 44% real-world gains on complex single- and dual-arm tasks.

  10. PRTS: A Primitive Reasoning and Tasking System via Contrastive Representations

    cs.AI 2026-04 unverdicted novelty 6.0

    PRTS pretrains VLA models with contrastive goal-conditioned RL to embed goal-reachability probabilities from offline data, yielding SOTA results on robotic benchmarks especially for long-horizon and novel instructions.

  11. DexWorldModel: Causal Latent World Modeling towards Automated Learning of Embodied Tasks

    cs.CV 2026-04 unverdicted novelty 6.0

    CLWM with DINOv3 targets, O(1) TTT memory, SAI latency masking, and EmbodiChain training achieves SOTA dual-arm simulation performance and zero-shot sim-to-real transfer that beats real-data finetuned baselines.

  12. ROBOGATE: Adaptive Failure Discovery for Safe Robot Policy Deployment via Two-Stage Boundary-Focused Sampling

    cs.RO 2026-03 unverdicted novelty 6.0

    ROBOGATE applies adaptive boundary-focused sampling in simulation to discover robot policy failure boundaries, revealing a 97.65 percentage point performance gap for a VLA model between LIBERO and industrial scenarios.

  13. InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy

    cs.RO 2025-10 unverdicted novelty 6.0

    InternVLA-M1 uses spatially guided pre-training on 2.3M examples followed by action post-training to deliver up to 17% gains on robot manipulation benchmarks and 20.6% on unseen objects.

  14. RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation

    cs.RO 2025-06 unverdicted novelty 6.0

    RoboTwin 2.0 automates diverse synthetic data creation for dual-arm robots via MLLMs and five-axis domain randomization, leading to 228-367% gains in manipulation success.

  15. HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model

    cs.CV 2025-03 unverdicted novelty 6.0

    HybridVLA unifies diffusion and autoregression in a single VLA model via collaborative training and ensemble to raise robot manipulation success rates by 14% in simulation and 19% in real-world tasks.

  16. AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems

    cs.RO 2025-03 unverdicted novelty 6.0

    AgiBot World supplies over 1 million trajectories enabling GO-1 to deliver 30% average gains over Open X-Embodiment and over 60% success on complex dexterous tasks while open-sourcing everything.

  17. StableIDM: Stabilizing Inverse Dynamics Model against Manipulator Truncation via Spatio-Temporal Refinement

    cs.RO 2026-04 unverdicted novelty 5.0

    StableIDM stabilizes inverse dynamics models under manipulator truncation by combining robot-centric masking, directional spatial feature aggregation, and temporal dynamics refinement, yielding 12.1% higher strict act...

  18. Motus: A Unified Latent Action World Model

    cs.CV 2025-12 unverdicted novelty 5.0

    Motus unifies understanding, video generation, and action in one latent world model via MoT experts and optical-flow latent actions, reporting gains over prior methods in simulation and real robots.

  19. Towards Robotic Dexterous Hand Intelligence: A Survey

    cs.RO 2026-05 unverdicted novelty 4.0

    A structured survey of dexterous robotic hand research that reviews hardware, control methods, data resources, and benchmarks while identifying major limitations and future directions.

  20. World Simulation with Video Foundation Models for Physical AI

    cs.CV 2025-10 unverdicted novelty 4.0

    Cosmos-Predict2.5 unifies text-to-world, image-to-world, and video-to-world generation in one model trained on 200M clips with RL post-training, delivering improved quality and control for physical AI.

Reference graph

Works this paper leans on

118 extracted references · 118 canonical work pages · cited by 19 Pith papers · 8 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

  2. [2]

    Do as i can, not as i say: Grounding language in robotic affordances

    Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Haus- man, et al. Do as i can, not as i say: Grounding language in robotic affordances. In Proceedings of The 6th Conference on Robot Learning , volume 205 of Proceedings of Machine Learning Research , pages 287–318...

  3. [3]

    Learning dexterous in-hand manipulation

    OpenAI: Marcin Andrychowicz, Bowen Baker, Maciek Chociej, Rafal Jozefowicz, Bob McGrew, Jakub Pa- chocki, Arthur Petron, Matthias Plappert, Glenn Powell, Alex Ray, et al. Learning dexterous in-hand manipula- tion. The International Journal of Robotics Research , 39(1):3–20, 2020

  4. [4]

    OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models

    Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, et al. Open- flamingo: An open-source framework for training large autoregressive vision-language models. arXiv preprint arXiv:2308.01390, 2023

  5. [5]

    Affordances from human videos as a versatile representation for robotics

    Shikhar Bahl, Russell Mendonca, Lili Chen, Unnat Jain, and Deepak Pathak. Affordances from human videos as a versatile representation for robotics. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13778–13790, 2023

  6. [6]

    Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

    Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforce- ment learning from human feedback. arXiv preprint arXiv:2204.05862, 2022

  7. [7]

    Rt-h: Action hierarchies using language

    Suneel Belkhale, Tianli Ding, Ted Xiao, Pierre Ser- manet, Quon Vuong, Jonathan Tompson, Yevgen Chebotar, Debidatta Dwibedi, and Dorsa Sadigh. Rt-h: Action hierarchies using language. In https://arxiv.org/abs/2403.01823, 2024

  8. [8]

    Roboagent: Generalization and efficiency in robot manipulation via semantic augmentations and action chunking

    Homanga Bharadhwaj, Jay Vakil, Mohit Sharma, Ab- hinav Gupta, Shubham Tulsiani, and Vikash Kumar. Roboagent: Generalization and efficiency in robot manipulation via semantic augmentations and action chunking. In 2024 IEEE International Conference on Robotics and Automation (ICRA) , pages 4788–4795. IEEE, 2024

  9. [9]

    RT-1: Robotics Transformer for Real-World Control at Scale

    Anthony Brohan, Noah Brown, Justice Carbajal, Yev- gen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. RT-1: Robotics Transformer for Real-World Control at Scale. In Proceedings of Robotics: Science and Systems, Daegu, Republic of Korea, July 2023

  10. [10]

    Scaling data-driven robotics with reward sketching and batch reinforcement learning

    Serkan Cabi, Sergio Gómez Colmenarejo, Alexander Novikov, Ksenia Konyushkova, Scott Reed, Rae Jeong, Konrad Zolna, Yusuf Aytar, David Budden, Mel Vecerik, et al. Scaling data-driven robotics with reward sketching and batch reinforcement learning. In Proceedings of Robotics: Science and Systems, July 2020

  11. [11]

    LeRobot: State-of-the-art machine learning for real-world robotics in PyTorch

    Remi Cadene, Simon Alibert, Alexander Soare, Quentin Gallouedec, Adil Zouitine, and Thomas Wolf. LeRobot: State-of-the-art machine learning for real-world robotics in PyTorch. https://github.com/huggingface/lerobot, 2024

  12. [12]

    X-Humanoid Tien Kung, 2024

    The Beijing Humanoid Robot Innovation Center. X-Humanoid Tien Kung, 2024. URL https://x-humanoid.com/

  13. [13]

    Matterport3d: Learning from rgb-d data in indoor environments

    Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niebner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3d: Learning from rgb-d data in indoor environments. In 2017 International Conference on 3D Vision (3DV) , pages 667–676, 2017

  14. [14]

    GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation

    Chi-Lam Cheang, Guangzeng Chen, Ya Jing, Tao Kong, Hang Li, Yifeng Li, Yuxiao Liu, Hongtao Wu, Jiafeng Xu, Yichu Yang, et al. Gr-2: A generative video- language-action model with web-scale knowledge for robot manipulation. arXiv preprint arXiv:2410.06158 , 2024

  15. [15]

    Mirage: Cross-embodiment zero-shot policy transfer with cross-painting

    Lawrence Yunliang Chen, Kush Hari, Karthik Dhar- marajan, Chenfeng Xu, Quan Vuong, and Ken Gold- berg. Mirage: Cross-embodiment zero-shot policy trans- fer with cross-painting. In Proceedings of Robotics: Science and Systems , 2024

  16. [16]

    Open-television: Teleoperation with immersive active visual feedback

    Xuxin Cheng, Jialong Li, Shiqi Yang, Ge Yang, and Xiaolong Wang. Open-television: Teleoperation with immersive active visual feedback. In 8th Annual Con- ference on Robot Learning , 2024

  17. [17]

    Diffusion policy: Visuomotor policy learning via action diffusion

    Cheng Chi, Siyuan Feng, Yilun Du, Zhenjia Xu, Eric Cousineau, Benjamin Burchfiel, and Shuran Song. Dif- fusion policy: Visuomotor policy learning via action diffusion. RSS, 2023

  18. [18]

    Deep reinforcement learning from human preferences

    Paul F Christiano, Jan Leike, Tom Brown, Miljan Mar- tic, Shane Legg, and Dario Amodei. Deep reinforce- ment learning from human preferences. Advances in neural information processing systems , 30, 2017

  19. [19]

    Pybullet, a python module for physics simulation for games, robotics and machine learning, 2016

    Erwin Coumans and Yunfei Bai. Pybullet, a python module for physics simulation for games, robotics and machine learning, 2016

  20. [20]

    Scaling egocentric vision: The epic-kitchens dataset

    Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, et al. Scaling egocentric vision: The epic-kitchens dataset. In Proceedings of the European conference on computer vision (ECCV) , pages 720–736, 2018

  21. [21]

    Rescaling egocentric vision: Collection, pipeline and challenges for epic-kitchens-100

    Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Antonino Furnari, Evangelos Kazakos, Jian Ma, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, et al. Rescaling egocentric vision: Collection, pipeline and challenges for epic-kitchens-100. International Journal of Computer Vision, pages 1–23, 2022

  23. [23]

    Robonet: Large-scale multi-robot learning

    Sudeep Dasari, Frederik Ebert, Stephen Tian, Suraj Nair, Bernadette Bucher, Karl Schmeckpeper, Siddharth Singh, Sergey Levine, and Chelsea Finn. Robonet: Large-scale multi-robot learning. In Conference on Robot Learning, pages 885–897. PMLR, 2020

  24. [24]

    A learning-based hierarchical control scheme for an exoskeleton robot in human–robot cooperative manipulation

    Mingdi Deng, Zhijun Li, Yu Kang, CL Philip Chen, and Xiaoli Chu. A learning-based hierarchical control scheme for an exoskeleton robot in human–robot coop- erative manipulation. IEEE transactions on cybernetics, 50(1):112–125, 2018

  25. [25]

    Scaling cross-embodied learning: One policy for manipulation, navigation, locomotion and aviation

    Ria Doshi, Homer Rich Walke, Oier Mees, Sudeep Dasari, and Sergey Levine. Scaling cross-embodied learning: One policy for manipulation, navigation, lo- comotion and aviation. In 8th Annual Conference on Robot Learning, 2024

  26. [26]

    Palm-e: an embodied multimodal language model

    Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. Palm-e: an embodied multimodal language model. In Proceedings of the 40th International Conference on Machine Learning, ICML’23. JMLR.org, 2023

  27. [27]

    Bridge Data: Boosting Generalization of Robotic Skills with Cross-Domain Datasets

    Frederik Ebert, Yanlai Yang, Karl Schmeckpeper, Bernadette Bucher, Georgios Georgakis, Kostas Dani- ilidis, Chelsea Finn, and Sergey Levine. Bridge Data: Boosting Generalization of Robotic Skills with Cross- Domain Datasets. In Proceedings of Robotics: Science and Systems, New York City, NY , USA, June 2022. doi: 10.15607/RSS.2022.XVIII.063

  28. [28]

    FlowBot3D: Learning 3D Articulation Flow to Manipulate Articulated Objects

    Ben Eisner, Harry Zhang, and David Held. FlowBot3D: Learning 3D Articulation Flow to Manipulate Articu- lated Objects. In Proceedings of Robotics: Science and Systems, June 2022. doi: 10.15607/RSS.2022.XVIII. 018

  29. [29]

    Anygrasp: Robust and efficient grasp perception in spatial and temporal domains

    Hao-Shu Fang, Chenxi Wang, Hongjie Fang, Minghao Gou, Jirong Liu, Hengxu Yan, Wenhai Liu, Yichen Xie, and Cewu Lu. Anygrasp: Robust and efficient grasp perception in spatial and temporal domains. IEEE Transactions on Robotics , 2023

  30. [30]

    Rh20t: A comprehensive robotic dataset for learning diverse skills in one-shot

    Hao-Shu Fang, Hongjie Fang, Zhenyu Tang, Jirong Liu, Chenxi Wang, Junbo Wang, Haoyi Zhu, and Cewu Lu. Rh20t: A comprehensive robotic dataset for learning diverse skills in one-shot. In 2024 IEEE International Conference on Robotics and Automation (ICRA) , pages 653–660. IEEE, 2024

  31. [31]

    Humanplus: Humanoid shadowing and imitation from humans

    Zipeng Fu, Qingqing Zhao, Qi Wu, Gordon Wetzstein, and Chelsea Finn. Humanplus: Humanoid shadowing and imitation from humans. In 8th Annual Conference on Robot Learning , 2024

  32. [32]

    Mobile ALOHA: Learning bimanual mobile manipulation using low-cost whole-body teleoperation

    Zipeng Fu, Tony Z. Zhao, and Chelsea Finn. Mobile ALOHA: Learning bimanual mobile manipulation using low-cost whole-body teleoperation. In 8th Annual Conference on Robot Learning , 2024

  33. [33]

    Datacomp: In search of the next generation of multimodal datasets

    Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, Jonathan Hayase, Georgios Smyrnis, Thao Nguyen, Ryan Marten, Mitchell Wortsman, Dhruba Ghosh, Jieyu Zhang, et al. Datacomp: In search of the next generation of multimodal datasets. Advances in Neural Information Processing Systems, 36, 2024

  34. [34]

    Act3d: 3d feature field transformers for multi-task robotic manipulation

    Theophile Gervet, Zhou Xian, Nikolaos Gkanatsios, and Katerina Fragkiadaki. Act3d: 3d feature field transformers for multi-task robotic manipulation. In 7th Annual Conference on Robot Learning , 2023

  35. [35]

    Franka robotics, 2024

    Franka Robotics GmbH. Franka robotics, 2024. URL https://franka.de/

  36. [36]

    Rvt: Robotic view transformer for 3d object manipulation

    Ankit Goyal, Jie Xu, Yijie Guo, Valts Blukis, Yu-Wei Chao, and Dieter Fox. Rvt: Robotic view transformer for 3d object manipulation. In Conference on Robot Learning, pages 694–710. PMLR, 2023

  37. [37]

    The "something something" video database for learning and evaluating visual common sense

    Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, He- una Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The” something some- thing” video database for learning and evaluating visual common sense. In Proceedings of the IEEE interna- tional conference on computer vision, pages 5...

  38. [38]

    Ego4d: Around the world in 3,000 hours of egocentric video

    Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jack- son Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 18995–19012, 2022

  39. [39]

    Robot learning in homes: Improving generalization and reducing dataset bias

    Abhinav Gupta, Adithyavairavan Murali, Dhiraj Prakashchand Gandhi, and Lerrel Pinto. Robot learning in homes: Improving generalization and reducing dataset bias. Advances in neural information processing systems, 31, 2018

  40. [40]

    BAKU: An efficient transformer for multi-task policy learning

    Siddhant Haldar, Zhuoran Peng, and Lerrel Pinto. BAKU: An efficient transformer for multi-task policy learning. In The Thirty-eighth Annual Conference on Neural Information Processing Systems , 2024

  41. [41]

    Dexpilot: Vision-based teleoperation of dexterous robotic hand-arm system

    Ankur Handa, Karl Van Wyk, Wei Yang, Jacky Liang, Yu-Wei Chao, Qian Wan, Stan Birchfield, Nathan Ratliff, and Dieter Fox. Dexpilot: Vision-based tele- operation of dexterous robotic hand-arm system. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pages 9164–9170. IEEE, 2020

  42. [42]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. NeurIPS, 2020

  43. [43]

    Sharegpt, 2023

    https://sharegpt.com/. ShareGPT, 2023. URL https://sharegpt.com/

  44. [44]

    VoxPoser: Composable 3d value maps for robotic manipulation with language models

    Wenlong Huang, Chen Wang, Ruohan Zhang, Yunzhu Li, Jiajun Wu, and Li Fei-Fei. VoxPoser: Composable 3d value maps for robotic manipulation with language models. In 7th Annual Conference on Robot Learning, 2023

  45. [45]

    Depth camera d435i

    Intel. Depth camera d435i. https://www.intelrealsense.com/depth-camera-d435i/, 2019

  46. [46]

    Bc-z: Zero-shot task generalization with robotic imitation learning

    Eric Jang, Alex Irpan, Mohi Khansari, Daniel Kap- pler, Frederik Ebert, Corey Lynch, Sergey Levine, and Chelsea Finn. Bc-z: Zero-shot task generalization with robotic imitation learning. In Conference on Robot Learning, pages 991–1002. PMLR, 2022

  47. [47]

    Robotic grasping using deep reinforcement learning

    Shirin Joshi, Sulabh Kumra, and Ferat Sahin. Robotic grasping using deep reinforcement learning. In 2020 IEEE 16th International Conference on Automation Science and Engineering (CASE) , pages 1461–1466. IEEE, 2020

  48. [48]

    Mt-opt: Continuous multi-task robotic reinforcement learning at scale

    Dmitry Kalashnikov, Jacob Varley, Yevgen Chebotar, Benjamin Swanson, Rico Jonschkowski, Chelsea Finn, Sergey Levine, and Karol Hausman. Mt-opt: Continu- ous multi-task robotic reinforcement learning at scale. arXiv preprint arXiv:2104.08212 , 2021

  49. [49]

    DEFT: Dexterous fine-tuning for hand policies

    Aditya Kannan, Kenneth Shaw, Shikhar Bahl, Pragna Mannam, and Deepak Pathak. DEFT: Dexterous fine- tuning for hand policies. In 7th Annual Conference on Robot Learning, 2023

  50. [50]

    3d diffuser actor: Policy diffusion with 3d scene representations

    Tsung-Wei Ke, Nikolaos Gkanatsios, and Katerina Fragkiadaki. 3d diffuser actor: Policy diffusion with 3d scene representations. In 8th Annual Conference on Robot Learning, 2024

  51. [51]

    DROID: A large-scale in-the-wild robot manipulation dataset

    Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ash- win Balakrishna, Sudeep Dasari, Siddharth Karam- cheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, et al. DROID: A large-scale in-the-wild robot manipulation dataset. In RSS 2024 Workshop: Data Generation for Robotics , 2024

  52. [52]

    OpenVLA: An open-source vision-language-action model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Open- VLA: An open-source vision-language-action model. In 8th Annual Conference on Robot Learning , 2024

  53. [53]

    Design and use paradigms for gazebo, an open-source multi-robot simulator

    Nathan Koenig and Andrew Howard. Design and use paradigms for gazebo, an open-source multi-robot simulator. In IEEE/RSJ International Conference on Intelligent Robots and Systems , volume 3, pages 2149– 2154, 2004

  54. [54]

    AI2-THOR: An Interactive 3D Environment for Visual AI

    Eric Kolve, Roozbeh Mottaghi, Winson Han, Eli Van- derBilt, Luca Weihs, Alvaro Herrasti, Daniel Gordon, Yuke Zhu, Abhinav Gupta, and Ali Farhadi. AI2- THOR: An Interactive 3D Environment for Visual AI. arXiv, 2017

  55. [55]

    Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection

    Sergey Levine, Peter Pastor, Alex Krizhevsky, Julian Ibarz, and Deirdre Quillen. Learning hand-eye coor- dination for robotic grasping with deep learning and large-scale data collection. The International journal of robotics research, 37(4-5):421–436, 2018

  56. [56]

    Datacomp-LM: In search of the next generation of training sets for language models

    Jeffrey Li, Alex Fang, Georgios Smyrnis, Maor Ivgi, Matt Jordan, Samir Gadre, Hritik Bansal, Etash Guha, Sedrick Keh, Kushal Arora, et al. Datacomp-LM: In search of the next generation of training sets for language models. In The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024

  57. [57]

    CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation

    Qixiu Li, Yaobo Liang, Zeyu Wang, Lin Luo, Xi Chen, Mozheng Liao, Fangyun Wei, Yu Deng, Sicheng Xu, Yizhong Zhang, et al. Cogact: A foundational vision-language-action model for synergizing cognition and action in robotic manipulation. arXiv preprint arXiv:2411.19650, 2024

  58. [58]

    Manipllm: Embodied multimodal large language model for object-centric robotic manipulation

    Xiaoqi Li, Mingxu Zhang, Yiran Geng, Haoran Geng, Yuxing Long, Yan Shen, Renrui Zhang, Jiaming Liu, and Hao Dong. Manipllm: Embodied multimodal large language model for object-centric robotic manipulation. In Proceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition , pages 18061– 18070, 2024

  59. [59]

    Vision-language foundation models as effective robot imitators

    Xinghang Li, Minghuan Liu, Hanbo Zhang, Cunjun Yu, Jie Xu, Hongtao Wu, Chilam Cheang, Ya Jing, Weinan Zhang, Huaping Liu, Hang Li, and Tao Kong. Vision- language foundation models as effective robot imitators. In The Twelfth International Conference on Learning Representations, 2024

  60. [60]

    Evaluating Real-World Robot Manipulation Policies in Simulation

    Xuanlin Li, Kyle Hsu, Jiayuan Gu, Karl Pertsch, Oier Mees, Homer Rich Walke, Chuyuan Fu, Ishikaa Lu- nawat, Isabel Sieh, Sean Kirmani, Sergey Levine, Jiajun Wu, Chelsea Finn, Hao Su, Quan Vuong, and Ted Xiao. Evaluating real-world robot manipulation policies in simulation. arXiv preprint arXiv:2405.05941 , 2024

  61. [61]

    Code as policies: Language model programs for embodied control

    Jacky Liang, Wenlong Huang, Fei Xia, Peng Xu, Karol Hausman, Brian Ichter, Pete Florence, and Andy Zeng. Code as policies: Language model programs for em- bodied control. In 2023 IEEE International Conference on Robotics and Automation (ICRA) , pages 9493–9500. IEEE, 2023

  62. [62]

    Video-LLaVA: Learning united visual representation by alignment before projection

    Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-LLaVA: Learning united visual representation by alignment before projection. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 5971–5984, Miami, Florida, USA, November 2024

  63. [63]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems , 36, 2024

  64. [64]

    Robot learning on the job: Human-in-the-loop autonomy and learning during deployment

    Huihan Liu, Soroush Nasiriany, Lance Zhang, Zhiyao Bao, and Yuke Zhu. Robot learning on the job: Human- in-the-loop autonomy and learning during deployment. The International Journal of Robotics Research , page 02783649241273901, 2022

  65. [65]

    Robomamba: Efficient vision-language-action model for robotic reasoning and manipulation

    Jiaming Liu, Mengzhen Liu, Zhenyu Wang, Pengju An, Xiaoqi Li, Kaichen Zhou, Senqiao Yang, Renrui Zhang, Yandong Guo, and Shanghang Zhang. Robomamba: Efficient vision-language-action model for robotic rea- soning and manipulation. In The Thirty-eighth Annual Conference on Neural Information Processing Systems , 2024

  66. [66]

    RDT-1b: a diffusion foundation model for bimanual manipulation

    Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu. RDT-1b: a diffusion foundation model for bimanual manipulation. In The Thirteenth International Conference on Learning Representations , 2025

  67. [67]

    REFLECT: Summarizing robot experiences for failure explanation and correction

    Zeyi Liu, Arpit Bahety, and Shuran Song. REFLECT: Summarizing robot experiences for failure explanation and correction. In 7th Annual Conference on Robot Learning, 2023

  68. [68]

    Isaac gym: High performance GPU based physics simulation for robot learning

    Viktor Makoviychuk, Lukasz Wawrzyniak, Yunrong Guo, Michelle Lu, Kier Storey, Miles Macklin, David Hoeller, Nikita Rudin, Arthur Allshire, Ankur Handa, et al. Isaac gym: High performance GPU based physics simulation for robot learning. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2) , 2021

  69. [69]

    Learning dexterous grasping with object-centric visual affordances

    Priyanka Mandikal and Kristen Grauman. Learning dexterous grasping with object-centric visual affor- dances. In 2021 IEEE international conference on robotics and automation (ICRA) , pages 6169–6176. IEEE, 2021

  70. [70]

    Dexvip: Learning dexterous grasping with human hand pose priors from video

    Priyanka Mandikal and Kristen Grauman. Dexvip: Learning dexterous grasping with human hand pose priors from video. In Conference on Robot Learning , pages 651–661. PMLR, 2022

  71. [71]

    Roboturk: A crowdsourcing platform for robotic skill learning through imitation

    Ajay Mandlekar, Yuke Zhu, Animesh Garg, Jonathan Booher, Max Spero, Albert Tung, Julian Gao, John Emmons, Anchit Gupta, Emre Orbay, et al. Roboturk: A crowdsourcing platform for robotic skill learning through imitation. In Conference on Robot Learning , pages 879–893. PMLR, 2018

  72. [72]

    Generation and comprehension of unambiguous object descriptions

    Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan L Yuille, and Kevin Murphy. Generation and comprehension of unambiguous object descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition , pages 11–20, 2016

  73. [73]

    Calvin: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks

    Oier Mees, Lukas Hermann, Erick Rosete-Beas, and Wolfram Burgard. Calvin: A benchmark for language- conditioned policy learning for long-horizon robot ma- nipulation tasks. IEEE Robotics and Automation Let- ters, 7(3):7327–7334, 2022

  74. [74]

    Where2act: From pixels to actions for articulated 3d objects

    Kaichun Mo, Leonidas J Guibas, Mustafa Mukadam, Abhinav Gupta, and Shubham Tulsiani. Where2act: From pixels to actions for articulated 3d objects. In Proceedings of the IEEE/CVF International Conference on Computer Vision , pages 6813–6823, 2021

  75. [75]

    Xsens

    Movella. Xsens. https://www.movella.com/products/xsens, 2025. Accessed: 2025-01-15

  76. [76]

    R3m: A universal visual representation for robot manipulation

    Suraj Nair, Aravind Rajeswaran, Vikash Kumar, Chelsea Finn, and Abhinav Gupta. R3m: A universal visual representation for robot manipulation. In 6th Annual Conference on Robot Learning , 2022

  77. [77]

    Nvidia isaac sim: Robotics simulation and synthetic data, 2023

    NVIDIA. Nvidia isaac sim: Robotics simulation and synthetic data, 2023. URL https://developer.nvidia.com/isaac-sim

  78. [78]

    Open x-embodiment: Robotic learning datasets and rt-x models

    Abby O'Neill, Abdul Rehman, Abhinav Gupta, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, et al. Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 6892–6903, 2024

  79. [79]

    Astra series - structured light camera

    ORBBEC. Astra series - structured light camera. https://www.orbbec.com/products/structured-light-camera/astra-series/, 2022

  80. [80]

    Gemini 335 - 3d vision for a 3d world

    ORBBEC. Gemini 335 - 3d vision for a 3d world. https://www.orbbec.com/products/stereo-vision-camera/gemini-335/, 2024

Showing first 80 references.