pith. machine review for the scientific record.

arxiv: 2412.13877 · v3 · submitted 2024-12-18 · 💻 cs.RO · cs.AI

Recognition: 2 theorem links


RoboMIND: Benchmark on Multi-embodiment Intelligence Normative Data for Robot Manipulation

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 22:09 UTC · model grok-4.3

classification 💻 cs.RO cs.AI
keywords: RoboMIND · multi-embodiment · teleoperation dataset · imitation learning · robot manipulation · Vision-Language-Action models · failure demonstrations · digital twin

The pith

RoboMIND supplies 107k teleoperated trajectories across four robot embodiments to train generalizable manipulation policies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents RoboMIND as a dataset of 107k demonstration trajectories covering 479 tasks and 96 object classes, collected via human teleoperation on a single unified platform. It spans four robotic embodiments and records multi-view observations, robot states, language instructions, plus 5k labeled failure cases. Experiments apply imitation learning methods to single tasks and Vision-Language-Action models to multi-task settings, reporting high success rates and improved generalization when policies are trained on this data. The work claims this constitutes the largest multi-embodiment teleoperation collection built under consistent protocols.

Core claim

RoboMIND is a dataset containing 107k demonstration trajectories across 479 diverse tasks involving 96 object classes, collected through human teleoperation on a unified platform that covers four distinct robotic embodiments: the Franka Emika Panda, the UR5e, the AgileX dual-arm robot, and a humanoid robot with dual dexterous hands. The dataset includes multi-view observations, proprioceptive robot state information, linguistic task descriptions, and 5k real-world failure demonstrations each paired with detailed causes, together with a matching digital twin in the Isaac Sim simulator.
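
To make the record contents concrete, here is a minimal sketch of what one trajectory in such a dataset could look like. The field names and types below are illustrative assumptions, not the actual RoboMIND release schema.

```python
# Hypothetical per-trajectory schema implied by the core claim above.
# Field names (rgb_views, proprio, instruction, failure_cause) are assumptions,
# not the dataset's published format.
from dataclasses import dataclass, field

import numpy as np


@dataclass
class TrajectoryStep:
    rgb_views: dict[str, np.ndarray]  # camera name -> HxWx3 image (multi-view observations)
    proprio: np.ndarray               # proprioceptive state: joint positions, gripper, etc.
    action: np.ndarray                # teleoperated command recorded at this step


@dataclass
class Trajectory:
    embodiment: str                   # "franka_panda", "ur5e", "agilex_dual_arm", or "humanoid_hands"
    task_id: str                      # one of the 479 tasks
    instruction: str                  # linguistic task description
    steps: list[TrajectoryStep] = field(default_factory=list)
    success: bool = True              # roughly 5k trajectories are labeled failures
    failure_cause: str | None = None  # annotated cause, only meaningful when success is False
```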

What carries the argument

The unified data collection platform and standardized protocol that records consistent teleoperation demonstrations across multiple robotic embodiments.

Load-bearing premise

Human teleoperation demonstrations collected on one unified platform supply enough quality and coverage to train policies that generalize across robot embodiments and unseen real-world conditions.

What would settle it

A controlled test in which a VLA model trained on RoboMIND is evaluated on a previously unseen robot embodiment or on tasks outside the 479 covered ones and yields success rates no higher than models trained on smaller single-embodiment datasets.
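
A minimal sketch of that protocol is given below; `train_policy` and `evaluate` are placeholders for whatever VLA training and real-robot evaluation stack is actually used, and the trial count is an arbitrary assumption.

```python
# Sketch of the held-out-embodiment test described above: compare a policy
# trained on the other three embodiments against one trained only on the
# held-out robot's own (smaller) data. Placeholder functions, not paper code.
EMBODIMENTS = ["franka_panda", "ur5e", "agilex_dual_arm", "humanoid_hands"]


def held_out_embodiment_test(dataset, train_policy, evaluate, n_trials=50):
    results = {}
    for held_out in EMBODIMENTS:
        cross_split = [t for t in dataset if t.embodiment != held_out]
        single_split = [t for t in dataset if t.embodiment == held_out]

        cross_policy = train_policy(cross_split)    # multi-embodiment training
        single_policy = train_policy(single_split)  # single-embodiment baseline

        results[held_out] = {
            "cross_embodiment_success": evaluate(cross_policy, held_out, n_trials),
            "single_embodiment_success": evaluate(single_policy, held_out, n_trials),
        }
    return results
```

If the cross-embodiment success rates came out no higher than the single-embodiment baselines, the load-bearing premise above would be undercut.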

Original abstract

In this paper, we introduce RoboMIND (Multi-embodiment Intelligence Normative Data for Robot Manipulation), a dataset containing 107k demonstration trajectories across 479 diverse tasks involving 96 object classes. RoboMIND is collected through human teleoperation and encompasses comprehensive robotic-related information, including multi-view observations, proprioceptive robot state information, and linguistic task descriptions. To ensure data consistency and reliability for imitation learning, RoboMIND is built on a unified data collection platform and a standardized protocol, covering four distinct robotic embodiments: the Franka Emika Panda, the UR5e, the AgileX dual-arm robot, and a humanoid robot with dual dexterous hands. Our dataset also includes 5k real-world failure demonstrations, each accompanied by detailed causes, enabling failure reflection and correction during policy learning. Additionally, we created a digital twin environment in the Isaac Sim simulator, replicating the real-world tasks and assets, which facilitates the low-cost collection of additional training data and enables efficient evaluation. To demonstrate the quality and diversity of our dataset, we conducted extensive experiments using various imitation learning methods for single-task settings and state-of-the-art Vision-Language-Action (VLA) models for multi-task scenarios. By leveraging RoboMIND, the VLA models achieved high manipulation success rates and demonstrated strong generalization capabilities. To the best of our knowledge, RoboMIND is the largest multi-embodiment teleoperation dataset collected on a unified platform, providing large-scale and high-quality robotic training data. Our project is at https://x-humanoid-robomind.github.io/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces RoboMIND, a dataset of 107k teleoperated demonstration trajectories across 479 tasks, 96 object classes, and four robotic embodiments (Franka Emika Panda, UR5e, AgileX dual-arm, humanoid with dexterous hands). Collected on a unified platform with multi-view observations, proprioceptive states, and language descriptions, it also includes 5k failure cases with annotated causes and a matching Isaac Sim digital twin. Experiments with imitation learning and VLA models are said to produce high manipulation success rates and strong generalization.

Significance. If the experimental validation holds, RoboMIND would be a valuable large-scale resource for multi-embodiment imitation and VLA training. The unified collection protocol, scale, inclusion of real failure demonstrations for reflection/correction, and simulator twin are concrete strengths that could support reproducible policy development and low-cost data augmentation.

major comments (1)
  1. §4 (Experiments): The VLA multi-task results claim 'high manipulation success rates and strong generalization capabilities' without reporting numerical success rates, baselines, per-embodiment breakdowns, or the cross-embodiment protocol (e.g., whether any platform was held out, how kinematics/dynamics differences were handled, or the contribution of the 5k failure cases). This directly undermines assessment of the central claim that the dataset enables cross-embodiment transfer.
minor comments (2)
  1. Abstract: Add at least one key quantitative result (e.g., average success rate or comparison to prior datasets) to make the empirical claims concrete rather than qualitative.
  2. §3 (Dataset): Clarify the distribution of the 107k trajectories across the four embodiments to allow readers to judge balance and potential single-embodiment dominance.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive comments on our manuscript. We provide a point-by-point response to the major comment below and outline the revisions we will make to address the concerns.

Point-by-point responses
  1. Referee: §4 (Experiments): The VLA multi-task results claim 'high manipulation success rates and strong generalization capabilities' without reporting numerical success rates, baselines, per-embodiment breakdowns, or the cross-embodiment protocol (e.g., whether any platform was held out, how kinematics/dynamics differences were handled, or the contribution of the 5k failure cases). This directly undermines assessment of the central claim that the dataset enables cross-embodiment transfer.

    Authors: We acknowledge that the manuscript does not provide specific numerical values for the VLA success rates or detailed breakdowns in the current version. To address this, we will revise Section 4 to include quantitative results, including overall and per-embodiment success rates for the VLA models, comparisons against relevant baselines, and a clear description of the experimental protocol. Specifically, we will clarify that the multi-embodiment training was performed jointly on data from all four robots using a unified observation and action space to mitigate kinematic differences, without holding out any embodiment. We will also add results showing the impact of incorporating the 5k failure demonstrations on policy performance. These changes will allow readers to better evaluate the cross-embodiment transfer capabilities enabled by RoboMIND. revision: yes
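
A minimal sketch of what such a unified action space could look like is below; the per-robot dimensions and the zero-padding-plus-mask scheme are assumptions for illustration, not the paper's actual encoding.

```python
# Illustrative only: map each robot's native action vector into one fixed-width
# space so a single policy head can serve all four embodiments.
import numpy as np

# Hypothetical native action dimensions per embodiment (assumed, not from the paper).
NATIVE_DIMS = {
    "franka_panda": 8,      # 7 joints + gripper
    "ur5e": 7,              # 6 joints + gripper
    "agilex_dual_arm": 14,  # two arms, each 6 joints + gripper
    "humanoid_hands": 26,   # arms plus dexterous hands
}
UNIFIED_DIM = max(NATIVE_DIMS.values())


def to_unified(action: np.ndarray, embodiment: str) -> np.ndarray:
    """Zero-pad a native action into the shared space and append a validity mask."""
    padded = np.zeros(UNIFIED_DIM, dtype=np.float32)
    mask = np.zeros(UNIFIED_DIM, dtype=np.float32)
    d = NATIVE_DIMS[embodiment]
    padded[:d] = action[:d]
    mask[:d] = 1.0
    return np.concatenate([padded, mask])


def from_unified(unified: np.ndarray, embodiment: str) -> np.ndarray:
    """Recover the native command for execution on the given robot."""
    return unified[: NATIVE_DIMS[embodiment]]
```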

Circularity Check

0 steps flagged

No circularity: empirical dataset introduction with no derivation chain

Full rationale

The paper introduces an empirical teleoperation dataset (107k trajectories across four embodiments) and reports standard imitation-learning and VLA experiments. No equations, fitted parameters, or self-referential reductions appear; the size/quality claim rests on the described collection protocol rather than any self-definition, fitted-input prediction, or load-bearing self-citation. The derivation chain is therefore self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical dataset and benchmark paper. No mathematical derivations, fitted parameters, background axioms, or new postulated entities are introduced.

pith-pipeline@v0.9.0 · 5728 in / 1217 out tokens · 23332 ms · 2026-05-15T22:09:44.467479+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. RotVLA: Rotational Latent Action for Vision-Language-Action Model

    cs.RO 2026-05 unverdicted novelty 7.0

    RotVLA models latent actions as continuous SO(n) rotations with triplet-frame supervision and flow-matching to reach 98.2% success on LIBERO and 89.6%/88.5% on RoboTwin2.0 using a 1.7B-parameter model.

  2. RIO: Flexible Real-Time Robot I/O for Cross-Embodiment Robot Learning

    cs.RO 2026-05 unverdicted novelty 7.0

    RIO introduces a lightweight open-source framework that abstracts real-time robot I/O to support easy switching between embodiments and platforms for collecting data and deploying VLAs.

  3. HRDexDB: A Large-Scale Dataset of Dexterous Human and Robotic Hand Grasps

    cs.RO 2026-04 unverdicted novelty 7.0

    HRDexDB is a multi-modal dataset of 1.4K human and robotic dexterous grasps across 100 objects, providing aligned 3D kinematics, high-resolution tactile data, and video streams.

  4. BiCoord: A Bimanual Manipulation Benchmark towards Long-Horizon Spatial-Temporal Coordination

    cs.RO 2026-04 conditional novelty 7.0

    BiCoord is a new benchmark for long-horizon tightly coordinated bimanual manipulation that includes quantitative metrics and shows existing policies like DP, RDT, Pi0 and OpenVLA-OFT struggle on such tasks.

  5. Towards Generalizable Robotic Manipulation in Dynamic Environments

    cs.CV 2026-03 unverdicted novelty 7.0

    DOMINO dataset and PUMA architecture enable better dynamic robotic manipulation by incorporating motion history, delivering 6.3% higher success rates than prior VLA models.

  6. HarmoWAM: Harmonizing Generalizable and Precise Manipulation via Adaptive World Action Models

    cs.RO 2026-05 unverdicted novelty 6.0

    HarmoWAM unifies predictive and reactive control in world action models via an adaptive gating mechanism to deliver improved zero-shot generalization and precision in robotic manipulation.

  7. AT-VLA: Adaptive Tactile Injection for Enhanced Feedback Reaction in Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 6.0

    AT-VLA introduces adaptive tactile injection and a dual-stream tactile reaction mechanism to integrate real-time tactile feedback into pretrained VLA models for contact-rich robotic manipulation.

  8. LaST-R1: Reinforcing Robotic Manipulation via Adaptive Physical Latent Reasoning

    cs.RO 2026-04 unverdicted novelty 6.0

    LaST-R1 introduces an RL post-training method called LAPO that optimizes latent Chain-of-Thought reasoning in vision-language-action models, yielding 99.9% success on LIBERO and up to 22.5% real-world gains.

  9. LaST-R1: Reinforcing Robotic Manipulation via Adaptive Physical Latent Reasoning

    cs.RO 2026-04 unverdicted novelty 6.0

    LaST-R1 reaches 99.8% average success on the LIBERO benchmark using one-shot warm-up plus LAPO reinforcement learning on latent physical reasoning, with up to 44% real-world gains on complex single- and dual-arm tasks.

  10. PRTS: A Primitive Reasoning and Tasking System via Contrastive Representations

    cs.AI 2026-04 unverdicted novelty 6.0

    PRTS pretrains VLA models with contrastive goal-conditioned RL to embed goal-reachability probabilities from offline data, yielding SOTA results on robotic benchmarks especially for long-horizon and novel instructions.

  11. DexWorldModel: Causal Latent World Modeling towards Automated Learning of Embodied Tasks

    cs.CV 2026-04 unverdicted novelty 6.0

    CLWM with DINOv3 targets, O(1) TTT memory, SAI latency masking, and EmbodiChain training achieves SOTA dual-arm simulation performance and zero-shot sim-to-real transfer that beats real-data finetuned baselines.

  12. ROBOGATE: Adaptive Failure Discovery for Safe Robot Policy Deployment via Two-Stage Boundary-Focused Sampling

    cs.RO 2026-03 unverdicted novelty 6.0

    ROBOGATE applies adaptive boundary-focused sampling in simulation to discover robot policy failure boundaries, revealing a 97.65 percentage point performance gap for a VLA model between LIBERO and industrial scenarios.

  13. InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy

    cs.RO 2025-10 unverdicted novelty 6.0

    InternVLA-M1 uses spatially guided pre-training on 2.3M examples followed by action post-training to deliver up to 17% gains on robot manipulation benchmarks and 20.6% on unseen objects.

  14. RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation

    cs.RO 2025-06 unverdicted novelty 6.0

    RoboTwin 2.0 automates diverse synthetic data creation for dual-arm robots via MLLMs and five-axis domain randomization, leading to 228-367% gains in manipulation success.

  15. HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model

    cs.CV 2025-03 unverdicted novelty 6.0

    HybridVLA unifies diffusion and autoregression in a single VLA model via collaborative training and ensemble to raise robot manipulation success rates by 14% in simulation and 19% in real-world tasks.

  16. AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems

    cs.RO 2025-03 unverdicted novelty 6.0

    AgiBot World supplies over 1 million trajectories enabling GO-1 to deliver 30% average gains over Open X-Embodiment and over 60% success on complex dexterous tasks while open-sourcing everything.

  17. StableIDM: Stabilizing Inverse Dynamics Model against Manipulator Truncation via Spatio-Temporal Refinement

    cs.RO 2026-04 unverdicted novelty 5.0

    StableIDM stabilizes inverse dynamics models under manipulator truncation by combining robot-centric masking, directional spatial feature aggregation, and temporal dynamics refinement, yielding 12.1% higher strict act...

  18. Motus: A Unified Latent Action World Model

    cs.CV 2025-12 unverdicted novelty 5.0

    Motus unifies understanding, video generation, and action in one latent world model via MoT experts and optical-flow latent actions, reporting gains over prior methods in simulation and real robots.

  19. Towards Robotic Dexterous Hand Intelligence: A Survey

    cs.RO 2026-05 unverdicted novelty 4.0

    A structured survey of dexterous robotic hand research that reviews hardware, control methods, data resources, and benchmarks while identifying major limitations and future directions.

  20. World Simulation with Video Foundation Models for Physical AI

    cs.CV 2025-10 unverdicted novelty 4.0

    Cosmos-Predict2.5 unifies text-to-world, image-to-world, and video-to-world generation in one model trained on 200M clips with RL post-training, delivering improved quality and control for physical AI.

Reference graph

Works this paper leans on

118 extracted references · 118 canonical work pages · cited by 19 Pith papers · 8 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

  2. [2]

    Do as i can, not as i say: Grounding language in robotic affordances

    Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Haus- man, et al. Do as i can, not as i say: Grounding language in robotic affordances. In Proceedings of The 6th Conference on Robot Learning , volume 205 of Proceedings of Machine Learning Research , pages 287–318...

  3. [3]

    Learning dexterous in-hand manipulation

    OpenAI: Marcin Andrychowicz, Bowen Baker, Maciek Chociej, Rafal Jozefowicz, Bob McGrew, Jakub Pa- chocki, Arthur Petron, Matthias Plappert, Glenn Powell, Alex Ray, et al. Learning dexterous in-hand manipula- tion. The International Journal of Robotics Research , 39(1):3–20, 2020

  4. [4]

    OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models

    Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, et al. Open- flamingo: An open-source framework for training large autoregressive vision-language models. arXiv preprint arXiv:2308.01390, 2023

  5. [5]

    Affordances from human videos as a versatile representation for robotics

    Shikhar Bahl, Russell Mendonca, Lili Chen, Unnat Jain, and Deepak Pathak. Affordances from human videos as a versatile representation for robotics. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13778–13790, 2023

  6. [6]

    Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

    Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforce- ment learning from human feedback. arXiv preprint arXiv:2204.05862, 2022

  7. [7]

    Rt-h: Action hierarchies using language

    Suneel Belkhale, Tianli Ding, Ted Xiao, Pierre Ser- manet, Quon Vuong, Jonathan Tompson, Yevgen Chebotar, Debidatta Dwibedi, and Dorsa Sadigh. Rt-h: Action hierarchies using language. In https://arxiv.org/abs/2403.01823, 2024

  8. [8]

    Roboagent: Generalization and efficiency in robot manipulation via semantic augmentations and action chunking

    Homanga Bharadhwaj, Jay Vakil, Mohit Sharma, Ab- hinav Gupta, Shubham Tulsiani, and Vikash Kumar. Roboagent: Generalization and efficiency in robot manipulation via semantic augmentations and action chunking. In 2024 IEEE International Conference on Robotics and Automation (ICRA) , pages 4788–4795. IEEE, 2024

  9. [9]

    RT-1: Robotics Transformer for Real-World Control at Scale

    Anthony Brohan, Noah Brown, Justice Carbajal, Yev- gen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. RT-1: Robotics Transformer for Real-World Control at Scale. In Proceedings of Robotics: Science and Systems, Daegu, Republic of Korea, July 2023

  10. [10]

    Scaling data-driven robotics with reward sketching and batch reinforcement learning

    Serkan Cabi, Sergio Gómez Colmenarejo, Alexander Novikov, Ksenia Konyushkova, Scott Reed, Rae Jeong, Konrad Zolna, Yusuf Aytar, David Budden, Mel Vecerik, et al. Scaling data-driven robotics with reward sketching and batch reinforcement learning. In Proceedings of Robotics: Science and Systems, July 2020

  11. [11]

    LeRobot: State-of-the-art machine learning for real-world robotics in PyTorch

    Remi Cadene, Simon Alibert, Alexander Soare, Quentin Gallouedec, Adil Zouitine, and Thomas Wolf. LeRobot: State-of-the-art machine learning for real-world robotics in PyTorch. https://github.com/huggingface/lerobot, 2024

  12. [12]

    X-Humanoid Tien Kung, 2024

    The Beijing Humanoid Robot Innovation Center. X-Humanoid Tien Kung, 2024. URL https://x-humanoid.com/

  13. [13]

    Matterport3d: Learning from rgb-d data in indoor environments

    Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niebner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3d: Learning from rgb-d data in indoor environments. In 2017 International Conference on 3D Vision (3DV) , pages 667–676, 2017

  14. [14]

    GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation

    Chi-Lam Cheang, Guangzeng Chen, Ya Jing, Tao Kong, Hang Li, Yifeng Li, Yuxiao Liu, Hongtao Wu, Jiafeng Xu, Yichu Yang, et al. Gr-2: A generative video- language-action model with web-scale knowledge for robot manipulation. arXiv preprint arXiv:2410.06158 , 2024

  15. [15]

    Mirage: Cross-embodiment zero-shot policy transfer with cross-painting

    Lawrence Yunliang Chen, Kush Hari, Karthik Dhar- marajan, Chenfeng Xu, Quan Vuong, and Ken Gold- berg. Mirage: Cross-embodiment zero-shot policy trans- fer with cross-painting. In Proceedings of Robotics: Science and Systems , 2024

  16. [16]

    Open-television: Teleoperation with immersive active visual feedback

    Xuxin Cheng, Jialong Li, Shiqi Yang, Ge Yang, and Xiaolong Wang. Open-television: Teleoperation with immersive active visual feedback. In 8th Annual Con- ference on Robot Learning , 2024

  17. [17]

    Diffusion policy: Visuomotor policy learning via action diffusion

    Cheng Chi, Siyuan Feng, Yilun Du, Zhenjia Xu, Eric Cousineau, Benjamin Burchfiel, and Shuran Song. Dif- fusion policy: Visuomotor policy learning via action diffusion. RSS, 2023

  18. [18]

    Deep reinforcement learning from human preferences

    Paul F Christiano, Jan Leike, Tom Brown, Miljan Mar- tic, Shane Legg, and Dario Amodei. Deep reinforce- ment learning from human preferences. Advances in neural information processing systems , 30, 2017

  19. [19]

    Pybullet, a python module for physics simulation for games, robotics and machine learning, 2016

    Erwin Coumans and Yunfei Bai. Pybullet, a python module for physics simulation for games, robotics and machine learning, 2016

  20. [20]

    Scaling egocentric vision: The epic-kitchens dataset

    Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, et al. Scaling egocentric vision: The epic-kitchens dataset. In Proceedings of the European conference on computer vision (ECCV) , pages 720–736, 2018

  21. [21]

    Rescaling egocentric vision: Collection, pipeline and challenges for epic-kitchens-100

    Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Antonino Furnari, Evangelos Kazakos, Jian Ma, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, et al. Rescaling egocentric vision: Collection, pipeline and challenges for epic-kitchens-100. International Journal of Computer Vision, pages 1–23, 2022

  23. [23]

    Robonet: Large-scale multi-robot learning

    Sudeep Dasari, Frederik Ebert, Stephen Tian, Suraj Nair, Bernadette Bucher, Karl Schmeckpeper, Siddharth Singh, Sergey Levine, and Chelsea Finn. Robonet: Large-scale multi-robot learning. In Conference on Robot Learning, pages 885–897. PMLR, 2020

  24. [24]

    A learning-based hierarchical control scheme for an exoskeleton robot in human–robot cooperative manipulation

    Mingdi Deng, Zhijun Li, Yu Kang, CL Philip Chen, and Xiaoli Chu. A learning-based hierarchical control scheme for an exoskeleton robot in human–robot coop- erative manipulation. IEEE transactions on cybernetics, 50(1):112–125, 2018

  25. [25]

    Scaling cross-embodied learning: One policy for manipulation, navigation, locomotion and aviation

    Ria Doshi, Homer Rich Walke, Oier Mees, Sudeep Dasari, and Sergey Levine. Scaling cross-embodied learning: One policy for manipulation, navigation, lo- comotion and aviation. In 8th Annual Conference on Robot Learning, 2024

  26. [26]

    Palm-e: an embodied multimodal language model

    Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. Palm-e: an embodied multimodal language model. In Proceedings of the 40th International Conference on Machine Learning, ICML’23. JMLR.org, 2023

  27. [27]

    Bridge Data: Boosting Generalization of Robotic Skills with Cross-Domain Datasets

    Frederik Ebert, Yanlai Yang, Karl Schmeckpeper, Bernadette Bucher, Georgios Georgakis, Kostas Dani- ilidis, Chelsea Finn, and Sergey Levine. Bridge Data: Boosting Generalization of Robotic Skills with Cross- Domain Datasets. In Proceedings of Robotics: Science and Systems, New York City, NY , USA, June 2022. doi: 10.15607/RSS.2022.XVIII.063

  28. [28]

    FlowBot3D: Learning 3D Articulation Flow to Manipulate Articulated Objects

    Ben Eisner, Harry Zhang, and David Held. FlowBot3D: Learning 3D Articulation Flow to Manipulate Articu- lated Objects. In Proceedings of Robotics: Science and Systems, June 2022. doi: 10.15607/RSS.2022.XVIII. 018

  29. [29]

    Anygrasp: Robust and efficient grasp perception in spatial and temporal domains

    Hao-Shu Fang, Chenxi Wang, Hongjie Fang, Minghao Gou, Jirong Liu, Hengxu Yan, Wenhai Liu, Yichen Xie, and Cewu Lu. Anygrasp: Robust and efficient grasp perception in spatial and temporal domains. IEEE Transactions on Robotics , 2023

  30. [30]

    Rh20t: A comprehensive robotic dataset for learning diverse skills in one-shot

    Hao-Shu Fang, Hongjie Fang, Zhenyu Tang, Jirong Liu, Chenxi Wang, Junbo Wang, Haoyi Zhu, and Cewu Lu. Rh20t: A comprehensive robotic dataset for learning diverse skills in one-shot. In 2024 IEEE International Conference on Robotics and Automation (ICRA) , pages 653–660. IEEE, 2024

  31. [31]

    Humanplus: Humanoid shadowing and imitation from humans

    Zipeng Fu, Qingqing Zhao, Qi Wu, Gordon Wetzstein, and Chelsea Finn. Humanplus: Humanoid shadowing and imitation from humans. In 8th Annual Conference on Robot Learning , 2024

  32. [32]

    Mobile ALOHA: Learning bimanual mobile manipulation using low-cost whole-body teleoperation

    Zipeng Fu, Tony Z. Zhao, and Chelsea Finn. Mobile ALOHA: Learning bimanual mobile manipulation using low-cost whole-body teleoperation. In 8th Annual Conference on Robot Learning , 2024

  33. [33]

    Datacomp: In search of the next generation of multimodal datasets

    Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, Jonathan Hayase, Georgios Smyrnis, Thao Nguyen, Ryan Marten, Mitchell Wortsman, Dhruba Ghosh, Jieyu Zhang, et al. Datacomp: In search of the next generation of multimodal datasets. Advances in Neural Information Processing Systems, 36, 2024

  34. [34]

    Act3d: 3d feature field transformers for multi-task robotic manipulation

    Theophile Gervet, Zhou Xian, Nikolaos Gkanatsios, and Katerina Fragkiadaki. Act3d: 3d feature field transformers for multi-task robotic manipulation. In 7th Annual Conference on Robot Learning , 2023

  35. [35]

    Franka robotics, 2024

    Franka Robotics GmbH. Franka robotics, 2024. URL https://franka.de/

  36. [36]

    Rvt: Robotic view transformer for 3d object manipulation

    Ankit Goyal, Jie Xu, Yijie Guo, Valts Blukis, Yu-Wei Chao, and Dieter Fox. Rvt: Robotic view transformer for 3d object manipulation. In Conference on Robot Learning, pages 694–710. PMLR, 2023

  37. [37]

    The "something something" video database for learning and evaluating visual common sense

    Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, He- una Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The” something some- thing” video database for learning and evaluating visual common sense. In Proceedings of the IEEE interna- tional conference on computer vision, pages 5...

  38. [38]

    Ego4d: Around the world in 3,000 hours of egocentric video

    Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jack- son Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 18995–19012, 2022

  39. [39]

    Robot learning in homes: Improving generalization and reducing dataset bias

    Abhinav Gupta, Adithyavairavan Murali, Dhiraj Prakashchand Gandhi, and Lerrel Pinto. Robot learning in homes: Improving generalization and reducing dataset bias. Advances in neural information processing systems, 31, 2018

  40. [40]

    BAKU: An efficient transformer for multi-task policy learning

    Siddhant Haldar, Zhuoran Peng, and Lerrel Pinto. BAKU: An efficient transformer for multi-task policy learning. In The Thirty-eighth Annual Conference on Neural Information Processing Systems , 2024

  41. [41]

    Dexpilot: Vision-based teleoperation of dexterous robotic hand-arm system

    Ankur Handa, Karl Van Wyk, Wei Yang, Jacky Liang, Yu-Wei Chao, Qian Wan, Stan Birchfield, Nathan Ratliff, and Dieter Fox. Dexpilot: Vision-based tele- operation of dexterous robotic hand-arm system. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pages 9164–9170. IEEE, 2020

  42. [42]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. NeurIPS, 2020

  43. [43]

    Sharegpt, 2023

    https://sharegpt.com/. ShareGPT, 2023. URL https://sharegpt.com/

  44. [44]

    VoxPoser: Composable 3d value maps for robotic manipulation with language models

    Wenlong Huang, Chen Wang, Ruohan Zhang, Yunzhu Li, Jiajun Wu, and Li Fei-Fei. VoxPoser: Composable 3d value maps for robotic manipulation with language models. In 7th Annual Conference on Robot Learning, 2023

  45. [45]

    Depth camera d435i

    Intel. Depth camera d435i. https://www.intelrealsense.com/depth-camera-d435i/, 2019

  46. [46]

    Bc-z: Zero-shot task generalization with robotic imitation learning

    Eric Jang, Alex Irpan, Mohi Khansari, Daniel Kap- pler, Frederik Ebert, Corey Lynch, Sergey Levine, and Chelsea Finn. Bc-z: Zero-shot task generalization with robotic imitation learning. In Conference on Robot Learning, pages 991–1002. PMLR, 2022

  47. [47]

    Robotic grasping using deep reinforcement learning

    Shirin Joshi, Sulabh Kumra, and Ferat Sahin. Robotic grasping using deep reinforcement learning. In 2020 IEEE 16th International Conference on Automation Science and Engineering (CASE) , pages 1461–1466. IEEE, 2020

  48. [48]

    Mt-opt: Continuous multi-task robotic reinforcement learning at scale

    Dmitry Kalashnikov, Jacob Varley, Yevgen Chebotar, Benjamin Swanson, Rico Jonschkowski, Chelsea Finn, Sergey Levine, and Karol Hausman. Mt-opt: Continu- ous multi-task robotic reinforcement learning at scale. arXiv preprint arXiv:2104.08212 , 2021

  49. [49]

    DEFT: Dexterous fine-tuning for hand policies

    Aditya Kannan, Kenneth Shaw, Shikhar Bahl, Pragna Mannam, and Deepak Pathak. DEFT: Dexterous fine- tuning for hand policies. In 7th Annual Conference on Robot Learning, 2023

  50. [50]

    3d diffuser actor: Policy diffusion with 3d scene representations

    Tsung-Wei Ke, Nikolaos Gkanatsios, and Katerina Fragkiadaki. 3d diffuser actor: Policy diffusion with 3d scene representations. In 8th Annual Conference on Robot Learning, 2024

  51. [51]

    DROID: A large-scale in-the-wild robot manipulation dataset

    Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ash- win Balakrishna, Sudeep Dasari, Siddharth Karam- cheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, et al. DROID: A large-scale in-the-wild robot manipulation dataset. In RSS 2024 Workshop: Data Generation for Robotics , 2024

  52. [52]

    OpenVLA: An open-source vision-language-action model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Open- VLA: An open-source vision-language-action model. In 8th Annual Conference on Robot Learning , 2024

  53. [53]

    Design and use paradigms for gazebo, an open-source multi-robot simulator

    Nathan Koenig and Andrew Howard. Design and use paradigms for gazebo, an open-source multi-robot simulator. In IEEE/RSJ International Conference on Intelligent Robots and Systems , volume 3, pages 2149– 2154, 2004

  54. [54]

    AI2-THOR: An Interactive 3D Environment for Visual AI

    Eric Kolve, Roozbeh Mottaghi, Winson Han, Eli Van- derBilt, Luca Weihs, Alvaro Herrasti, Daniel Gordon, Yuke Zhu, Abhinav Gupta, and Ali Farhadi. AI2- THOR: An Interactive 3D Environment for Visual AI. arXiv, 2017

  55. [55]

    Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection

    Sergey Levine, Peter Pastor, Alex Krizhevsky, Julian Ibarz, and Deirdre Quillen. Learning hand-eye coor- dination for robotic grasping with deep learning and large-scale data collection. The International journal of robotics research, 37(4-5):421–436, 2018

  56. [56]

    Datacomp-LM: In search of the next generation of training sets for language models

    Jeffrey Li, Alex Fang, Georgios Smyrnis, Maor Ivgi, Matt Jordan, Samir Gadre, Hritik Bansal, Etash Guha, Sedrick Keh, Kushal Arora, et al. Datacomp-LM: In search of the next generation of training sets for language models. In The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024

  57. [57]

    CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation

    Qixiu Li, Yaobo Liang, Zeyu Wang, Lin Luo, Xi Chen, Mozheng Liao, Fangyun Wei, Yu Deng, Sicheng Xu, Yizhong Zhang, et al. Cogact: A foundational vision-language-action model for synergizing cognition and action in robotic manipulation. arXiv preprint arXiv:2411.19650, 2024

  58. [58]

    Manipllm: Embodied multimodal large language model for object-centric robotic manipulation

    Xiaoqi Li, Mingxu Zhang, Yiran Geng, Haoran Geng, Yuxing Long, Yan Shen, Renrui Zhang, Jiaming Liu, and Hao Dong. Manipllm: Embodied multimodal large language model for object-centric robotic manipulation. In Proceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition , pages 18061– 18070, 2024

  59. [59]

    Vision-language foundation models as effective robot imitators

    Xinghang Li, Minghuan Liu, Hanbo Zhang, Cunjun Yu, Jie Xu, Hongtao Wu, Chilam Cheang, Ya Jing, Weinan Zhang, Huaping Liu, Hang Li, and Tao Kong. Vision- language foundation models as effective robot imitators. In The Twelfth International Conference on Learning Representations, 2024

  60. [60]

    Evaluating Real-World Robot Manipulation Policies in Simulation

    Xuanlin Li, Kyle Hsu, Jiayuan Gu, Karl Pertsch, Oier Mees, Homer Rich Walke, Chuyuan Fu, Ishikaa Lu- nawat, Isabel Sieh, Sean Kirmani, Sergey Levine, Jiajun Wu, Chelsea Finn, Hao Su, Quan Vuong, and Ted Xiao. Evaluating real-world robot manipulation policies in simulation. arXiv preprint arXiv:2405.05941 , 2024

  61. [61]

    Code as policies: Language model programs for embodied control

    Jacky Liang, Wenlong Huang, Fei Xia, Peng Xu, Karol Hausman, Brian Ichter, Pete Florence, and Andy Zeng. Code as policies: Language model programs for em- bodied control. In 2023 IEEE International Conference on Robotics and Automation (ICRA) , pages 9493–9500. IEEE, 2023

  62. [62]

    Video-LLaVA: Learning united visual representation by alignment before projection

    Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-LLaVA: Learning united visual representation by alignment before projection. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 5971–5984, Miami, Florida, USA, November 2024

  63. [63]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems , 36, 2024

  64. [64]

    Robot learning on the job: Human-in-the-loop autonomy and learning during deployment

    Huihan Liu, Soroush Nasiriany, Lance Zhang, Zhiyao Bao, and Yuke Zhu. Robot learning on the job: Human- in-the-loop autonomy and learning during deployment. The International Journal of Robotics Research , page 02783649241273901, 2022

  65. [65]

    Robomamba: Efficient vision-language-action model for robotic reasoning and manipulation

    Jiaming Liu, Mengzhen Liu, Zhenyu Wang, Pengju An, Xiaoqi Li, Kaichen Zhou, Senqiao Yang, Renrui Zhang, Yandong Guo, and Shanghang Zhang. Robomamba: Efficient vision-language-action model for robotic rea- soning and manipulation. In The Thirty-eighth Annual Conference on Neural Information Processing Systems , 2024

  66. [66]

    RDT-1b: a diffusion foundation model for bimanual manipulation

    Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu. RDT-1b: a diffusion foundation model for bimanual manipulation. In The Thirteenth International Conference on Learning Representations , 2025

  67. [67]

    REFLECT: Summarizing robot experiences for failure explanation and correction

    Zeyi Liu, Arpit Bahety, and Shuran Song. REFLECT: Summarizing robot experiences for failure explanation and correction. In 7th Annual Conference on Robot Learning, 2023

  68. [68]

    Isaac gym: High performance GPU based physics simulation for robot learning

    Viktor Makoviychuk, Lukasz Wawrzyniak, Yunrong Guo, Michelle Lu, Kier Storey, Miles Macklin, David Hoeller, Nikita Rudin, Arthur Allshire, Ankur Handa, et al. Isaac gym: High performance GPU based physics simulation for robot learning. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2) , 2021

  69. [69]

    Learning dexterous grasping with object-centric visual affordances

    Priyanka Mandikal and Kristen Grauman. Learning dexterous grasping with object-centric visual affor- dances. In 2021 IEEE international conference on robotics and automation (ICRA) , pages 6169–6176. IEEE, 2021

  70. [70]

    Dexvip: Learning dexterous grasping with human hand pose priors from video

    Priyanka Mandikal and Kristen Grauman. Dexvip: Learning dexterous grasping with human hand pose priors from video. In Conference on Robot Learning , pages 651–661. PMLR, 2022

  71. [71]

    Roboturk: A crowdsourcing platform for robotic skill learning through imitation

    Ajay Mandlekar, Yuke Zhu, Animesh Garg, Jonathan Booher, Max Spero, Albert Tung, Julian Gao, John Emmons, Anchit Gupta, Emre Orbay, et al. Roboturk: A crowdsourcing platform for robotic skill learning through imitation. In Conference on Robot Learning , pages 879–893. PMLR, 2018

  72. [72]

    Generation and comprehension of unambiguous object descriptions

    Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan L Yuille, and Kevin Murphy. Generation and comprehension of unambiguous object descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition , pages 11–20, 2016

  73. [73]

    Calvin: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks

    Oier Mees, Lukas Hermann, Erick Rosete-Beas, and Wolfram Burgard. Calvin: A benchmark for language- conditioned policy learning for long-horizon robot ma- nipulation tasks. IEEE Robotics and Automation Let- ters, 7(3):7327–7334, 2022

  74. [74]

    Where2act: From pixels to actions for articulated 3d objects

    Kaichun Mo, Leonidas J Guibas, Mustafa Mukadam, Abhinav Gupta, and Shubham Tulsiani. Where2act: From pixels to actions for articulated 3d objects. In Proceedings of the IEEE/CVF International Conference on Computer Vision , pages 6813–6823, 2021

  75. [75]

    Xsens

    Movella. Xsens. https://www.movella.com/products/xsens, 2025. Accessed: 2025-01-15

  76. [76]

    R3m: A universal visual representation for robot manipulation

    Suraj Nair, Aravind Rajeswaran, Vikash Kumar, Chelsea Finn, and Abhinav Gupta. R3m: A universal visual representation for robot manipulation. In 6th Annual Conference on Robot Learning , 2022

  77. [77]

    Nvidia isaac sim: Robotics simulation and synthetic data, 2023

    NVIDIA. Nvidia isaac sim: Robotics simulation and synthetic data, 2023. URL https://developer.nvidia.com/isaac-sim

  78. [78]

    Open x-embodiment: Robotic learning datasets and rt-x models

    Abby O'Neill, Abdul Rehman, Abhinav Gupta, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, et al. Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 6892–6903, 2024

  79. [79]

    Astra series - structured light camera

    ORBBEC. Astra series - structured light camera. https://www.orbbec.com/products/structured-light-camera/astra-series/, 2022

  80. [80]

    Gemini 335 - 3d vision for a 3d world

    ORBBEC. Gemini 335 - 3d vision for a 3d world. https://www.orbbec.com/products/stereo-vision-camera/gemini-335/, 2024

Showing first 80 references.