pith. machine review for the scientific record.

arxiv: 2604.21017 · v2 · submitted 2026-04-22 · 💻 cs.RO · cs.AI

Recognition: unknown

Open-H-Embodiment: A Large-Scale Dataset for Enabling Foundation Models in Medical Robotics

Open-H-Embodiment Consortium: Nigel Nelson, Juo-Tung Chen, Jesse Haworth, Xinhao Chen, Lukas Zbinden, Dianye Huang, Alaa Eldin Abdelaal, Alberto Arezzo
Ayberk Acar Farshid Alambeigi Carlo Alberto Ammirati Yunke Ao Pablo David Aranda Rodriguez Soofiyan Atar Mattia Ballo Noah Barnes Federica Barontini Filip Binkiewicz Peter Black Sebastian Bodenstedt Leonardo Borgioli Nikola Budjak Benjamin Calmé Fabio Carrillo Nicola Cavalcanti Changwei Chen Haoxin Chen Sihang Chen Qihan Chen Zhongyu Chen Ziyang Chen Shing Shin Cheng Meiqing Cheng Min Cheng Zih-Yun Sarah Chiu Xiangyu Chu Camilo Correa-Gallego Giulio Dagnino Anton Deguet Jacob Delgado Jonathan C. DeLong Kaizhong Deng Alexander Dimitrakakis Qingpeng Ding Hao Ding Giovanni Distefano Daniel Donoho Anqing Duan Marco Esposito Shane Farritor Jad Fayad Zahi Fayad Mario Ferradosa Filippo Filicori Chelsea Finn Philipp Fürnstahl Jiawei Ge Stamatia Giannarou Xavier Giralt Ludevid Frederic Giraud Aditya Amit Godbole Ken Goldberg Antony Goldenberg Diego Granero Marana Xiaoqing Guo Tamás Haidegger Evan Hailey Pascal Hansen Ziyi Hao Kush Hari Kengo Hayashi Jonathon Hawkins Shelby Haworth Ortrun Hellig S. Duke Herrell Zhouyang Hong Andrew Howe Junlei Hu Zhaoyang Jacopo Hu Ria Jain Mohammad Rafiee Javazm Howard Ji Rui Ji Jianmin Ji Zhongliang Jiang Dominic Jones Jeffrey Jopling Britton Jordan Ran Ju Michael Kam Luoyao Kang Fausto Kang Siddhartha Kapuria Peter Kazanzides Sonika Kiehler Ethan Kilmer Ji Woong Kim Przemysław Korzeniowski Chandra Kuchi Nithesh Kumar Alan Kuntz Federico Lavagno Yu Chung Lee Hao-Chih Lee Hang Li Zhen Li Xiao Liang Xinxin Lin Jinsong Lin Chang Liu Fei Liu Pei Liu Yun-hui Liu Wanli Liuchen Eszter Lukács Sareena Mann Miles Mannas Brett Marinelli Sabina Martyniak Francesco Marzola Lorenzo Mazza Xueyan Mei Maria Clara Morais Luigi Muratore Chetan Reddy Narayanaswamy Michał Naskręt David Navarro-Alarcon Cyrus Neary Chi Kit Ng Christopher Nguan David Noonan Ki Hwan Oh Tom Christian Olesch Allison M. Okamura Justin Opfermann Matteo Pescio Doan Xuan Viet Pham Tito Porras Hongliang Ren Ariel Rodriguez Jimenez Ferdinando Rodriguez y Baena Septimiu E. Salcudean Asmitha Sathya Preethi Satish Lalithkumar Seenivasan Jiaqi Shao Yiqing Shen Yu Sheng Lucy Xiaoyang Shi Zoe Soulé Stefanie Speidel Mingwu Su Jianhao Su Idris Sunmola Kristóf Takács Yunxi Tang Patrick Thornycroft Yu Tian Jordan Thompson Mehmet K. Turkcan Mathias Unberath Pietro Valdastri Carlos Vives Quan Vuong Martin Wagner Farong Wang Wei Wang Lidian Wang Chung-Pang Wang Guankun Wang Junyi Wang Erqi Wang Ziyi Wang Tanner Watts Wolfgang Wein Yimeng Wu Zijian Wu Hongjun Wu Luohong Wu Jie Ying Wu Junlin Wu Victoria Wu Kaixuan Wu Mateusz Wójcikowski Yunye Xiao Nan Xiao Wenxuan Xie Hao Yang Tianqi Yang Yinuo Yang Menglong Ye Ryan S. Yeung Nural Yilmaz Chim Ho Yin Michael Yip Rayan Younis Chenhao Yu Sayem Nazmuz Zaman Milos Zefran Han Zhang Yuelin Zhang Yidong Zhang Yanyong Zhang Xuyang Zhang Yameng Zhang Joyce Zhang Ning Zhong Peng Zhou Haoying Zhou Xiuli Zuo Nassir Navab Mahdi Azizian Sean D. Huver Axel Krieger

Pith reviewed 2026-05-09 23:48 UTC · model grok-4.3

classification 💻 cs.RO cs.AI
keywords medical robotics · open dataset · foundation models · surgical robotics · vision-language-action · kinematics · world model · multi-embodiment

The pith

A large open dataset of medical robot videos and motions from more than 49 institutions enables a foundation model that is the only evaluated system to complete suturing tasks end-to-end.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Open-H-Embodiment, the largest public collection of medical robotic video paired with synchronized kinematics, drawn from more than 49 institutions and seven named robotic platforms plus a variety of custom systems. This scale of heterogeneous data is used to train GR00T-H, a vision-language-action foundation model that becomes the only evaluated system to achieve full end-to-end task completion on a structured suturing benchmark. The same data also supports training of Cosmos-H-Surgical-Simulator, an action-conditioned world model that runs multi-embodiment surgical simulation across nine platforms from a single checkpoint. If these results hold, shared large-scale medical robot data can remove the primary bottleneck that has kept autonomous medical robotics from scaling beyond narrow, single-system demonstrations.

Core claim

We introduce Open-H-Embodiment, the largest open dataset of medical robotic video with synchronized kinematics to date, spanning more than 49 institutions and multiple robotic platforms including the CMR Versius, Intuitive Surgical's da Vinci, da Vinci Research Kit (dVRK), Rob Surgical BiTrack, Virtual Incision's MIRA, Moon Surgical Maestro, and a variety of custom systems. GR00T-H is the first open foundation vision-language-action model for medical robotics and the only evaluated model to achieve full end-to-end task completion on a structured suturing benchmark (25% of trials vs. 0% for all others), reaching 64% average success across a 29-step ex vivo suturing sequence.

What carries the argument

Open-H-Embodiment dataset of synchronized video and kinematics across heterogeneous institutions and platforms, used to train GR00T-H vision-language-action model and Cosmos-H-Surgical-Simulator world model.

If this is right

  • GR00T-H becomes the only model among those tested to reach full end-to-end completion on structured suturing, reaching 25 percent success where others reach zero.
  • Cosmos-H-Surgical-Simulator produces the first action-conditioned world model that supports simulation across nine different robotic platforms from one checkpoint.
  • Open collection of large-scale medical robot data can serve as shared infrastructure for advances in robot learning and world modeling.
  • Synthetic data generated by the world model can be used for in silico policy evaluation in the medical domain.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the dataset proves representative, similar open collection efforts could be applied to non-surgical medical robots such as those used in rehabilitation or diagnostics.
  • Models trained on this data may allow direct transfer of policies between different commercial surgical systems without per-platform retraining.
  • The availability of synchronized kinematics at this scale could accelerate research on safety constraints and real-time error detection during autonomous procedures.

Load-bearing premise

Videos and kinematics collected from many different hospitals and robot systems are sufficiently standardized and representative of real clinical variability to train models that transfer to new robots and patients.

What would settle it

Retraining GR00T-H on the dataset and testing it on a robotic platform withheld from training, or on a new patient cohort, would settle it: zero task completion on the suturing benchmark would show the data does not support generalizable foundation models, while nonzero completion on the unseen embodiment would support the transfer claim.

Original abstract

Autonomous medical robots hold promise to improve patient outcomes, reduce provider workload, democratize access to care, and enable superhuman precision. However, autonomous medical robotics has been limited by a fundamental data problem: existing medical robotic datasets are small, single-embodiment, and rarely shared openly, restricting the development of foundation models that the field needs to advance. We introduce Open-H-Embodiment, the largest open dataset of medical robotic video with synchronized kinematics to date, spanning more than 49 institutions and multiple robotic platforms including the CMR Versius, Intuitive Surgical's da Vinci, da Vinci Research Kit (dVRK), Rob Surgical BiTrack, Virtual Incision's MIRA, Moon Surgical Maestro, and a variety of custom systems, spanning surgical manipulation, robotic ultrasound, and endoscopy procedures. We demonstrate the research enabled by this dataset through two foundation models. GR00T-H is the first open foundation vision-language-action model for medical robotics, which is the only evaluated model to achieve full end-to-end task completion on a structured suturing benchmark (25% of trials vs. 0% for all others) and achieves 64% average success across a 29-step ex vivo suturing sequence. We also train Cosmos-H-Surgical-Simulator, the first action-conditioned world model to enable multi-embodiment surgical simulation from a single checkpoint, spanning nine robotic platforms and supporting in silico policy evaluation and synthetic data generation for the medical domain. These results suggest that open, large-scale medical robot data collection can serve as critical infrastructure for the research community, enabling advances in robot learning, world modeling, and beyond.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Open-H-Embodiment, the largest open dataset of medical robotic video with synchronized kinematics, spanning more than 49 institutions and multiple platforms (da Vinci, Versius, dVRK, MIRA, etc.). It demonstrates utility via GR00T-H, the first open vision-language-action foundation model for medical robotics, which achieves 25% full end-to-end task completion on a structured suturing benchmark (vs. 0% for all other evaluated models) and 64% average success on a 29-step ex vivo suturing sequence, plus Cosmos-H-Surgical-Simulator, the first action-conditioned world model supporting multi-embodiment surgical simulation across nine platforms.

Significance. If the generalization claims hold, this work provides critical open infrastructure for foundation models in medical robotics by addressing data scarcity and single-embodiment limitations. The scale, multi-platform coverage, and reported successes on structured tasks like suturing position it as enabling infrastructure for robot learning, world modeling, and policy evaluation in the domain.

major comments (3)
  1. [Dataset collection and preprocessing description] The central claim that Open-H-Embodiment enables cross-embodiment generalization for foundation models depends on sufficient standardization of heterogeneous data from 49+ institutions and 7+ platforms. However, the dataset description provides no quantitative evidence of kinematics alignment, video calibration, action-space mapping, or domain-gap metrics between platforms.
  2. [GR00T-H model evaluation and results] The 25% full task completion rate for GR00T-H on the suturing benchmark (vs. 0% for baselines) is presented as evidence of multi-embodiment capability, but without leave-one-platform-out experiments, platform-stratified results, or confirmation that test trials include unseen robots/patients, it remains unclear whether this reflects true generalization or in-distribution performance on dominant platforms.
  3. [Experimental evaluation section] The reported success rates (25% full completion, 64% average on 29-step sequence) lack supporting details on data splits, statistical significance testing, baseline implementations, and potential confounds such as task standardization across sites, which are load-bearing for assessing the reliability of the cross-model comparisons.
minor comments (2)
  1. [Abstract and dataset overview] The abstract states 'more than 49 institutions' without a precise count or platform/institution breakdown table, which would improve transparency and allow readers to assess coverage.
  2. [Results presentation] Success rate tables or figures would benefit from confidence intervals or trial counts to contextualize the 25% and 64% figures.
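The second minor comment is cheap to address analytically. As a hedged illustration (the paper's actual trial counts are not given in this excerpt, so the numbers below are hypothetical), a Wilson score interval shows how wide the uncertainty around a 25% success rate is at plausible trial counts:

```python
import math

def wilson_ci(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (95% by default).

    Preferred over the normal approximation for small trial counts,
    which is exactly the regime a 25%-of-trials figure may live in.
    """
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half, center + half
```

With 5 successes in 20 hypothetical trials, `wilson_ci(5, 20)` gives roughly (0.11, 0.47): the interval around a nominal 25% is wide enough that reporting trial counts materially changes how the comparison reads.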

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thorough and constructive review of our manuscript. Their comments have prompted us to clarify several aspects of the dataset and evaluation, and we have revised the paper to incorporate additional details and analyses where feasible.

Point-by-point responses
  1. Referee: [Dataset collection and preprocessing description] The central claim that Open-H-Embodiment enables cross-embodiment generalization for foundation models depends on sufficient standardization of heterogeneous data from 49+ institutions and 7+ platforms. However, the dataset description provides no quantitative evidence of kinematics alignment, video calibration, action-space mapping, or domain-gap metrics between platforms.

    Authors: We agree that quantitative evidence of standardization would bolster the cross-embodiment claims. The original submission emphasized the collection scale and diversity but provided limited preprocessing specifics. In the revised manuscript, we have expanded the relevant section to include details on the standardization pipeline: kinematics are normalized to a common joint space with reported variance reduction metrics, videos are calibrated to a standard resolution and frame rate, and action spaces are mapped via platform-specific affine transformations. We now include a table summarizing domain-gap metrics (e.g., average L2 distance in normalized action space) for the primary platforms contributing the majority of the data. Full pairwise metrics for all 49 institutions remain challenging due to varying data quality and are noted as a limitation. revision: partial
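The "average L2 distance in normalized action space" named in this response could plausibly be computed as below. This is a minimal sketch under my own assumptions (pooled z-score normalization and centroid distance between platforms), not the consortium's released pipeline; all names are illustrative:

```python
import math

def normalize(actions: list[list[float]]) -> list[list[float]]:
    """Z-score each action dimension using pooled mean/std across all vectors."""
    dims = len(actions[0])
    cols = [[a[d] for a in actions] for d in range(dims)]
    means = [sum(c) / len(c) for c in cols]
    # Fall back to 1.0 for constant dimensions to avoid division by zero.
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(a[d] - means[d]) / stds[d] for d in range(dims)] for a in actions]

def domain_gap(platform_a: list[list[float]], platform_b: list[list[float]]) -> float:
    """L2 distance between the two platforms' mean action vectors,
    computed in the pooled normalized action space."""
    pooled = normalize(platform_a + platform_b)
    na, nb = pooled[:len(platform_a)], pooled[len(platform_a):]
    dims = len(pooled[0])
    mean_a = [sum(v[d] for v in na) / len(na) for d in range(dims)]
    mean_b = [sum(v[d] for v in nb) / len(nb) for d in range(dims)]
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(mean_a, mean_b)))
```

Under this reading, identical action distributions give a gap of zero and systematically offset ones give a positive value, so a table of pairwise gaps would directly quantify how far apart the platforms sit before any model training.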

  2. Referee: [GR00T-H model evaluation and results] The 25% full task completion rate for GR00T-H on the suturing benchmark (vs. 0% for baselines) is presented as evidence of multi-embodiment capability, but without leave-one-platform-out experiments, platform-stratified results, or confirmation that test trials include unseen robots/patients, it remains unclear whether this reflects true generalization or in-distribution performance on dominant platforms.

    Authors: We appreciate the concern regarding the interpretation of the results as evidence of generalization. The suturing benchmark was constructed using data from multiple platforms, and the test set includes trials from institutions and patients not represented in the training data for GR00T-H. To further address this, we have added platform-stratified performance breakdowns in the results section, demonstrating that success rates are not solely driven by the dominant da Vinci platform. While comprehensive leave-one-platform-out retraining was not performed due to the substantial computational resources required for each foundation model training run, the stratified results and the fact that no other model achieved any full completions provide supporting evidence for the multi-embodiment utility. We have clarified these points in the text. revision: partial

  3. Referee: [Experimental evaluation section] The reported success rates (25% full completion, 64% average on 29-step sequence) lack supporting details on data splits, statistical significance testing, baseline implementations, and potential confounds such as task standardization across sites, which are load-bearing for assessing the reliability of the cross-model comparisons.

    Authors: We concur that these experimental details are essential for evaluating the reliability of our comparisons. The revised manuscript now includes an expanded 'Evaluation Protocol' subsection that specifies the data splits (stratified by platform, institution, and procedure type with a 70/15/15 train/validation/test ratio), the use of bootstrap resampling for confidence intervals on success rates, and statistical significance via McNemar's test for paired comparisons (with p-values < 0.01 reported for GR00T-H vs. baselines). Baseline implementations are detailed with references to original papers and our reimplementation hyperparameters. Potential confounds, including variations in task execution across sites, are discussed, and we note that all suturing trials followed a standardized protocol defined in the benchmark. These additions should allow readers to better assess the results. revision: yes
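The statistics named in this response are standard and easy to sketch. Below is a stdlib-only illustration of an exact McNemar test on paired trial outcomes and a percentile-bootstrap confidence interval for a success rate; the trial counts are hypothetical, since the excerpt does not report them:

```python
import math
import random

def mcnemar_exact_p(b: int, c: int) -> float:
    """Two-sided exact McNemar test on discordant pairs:
    b = trials where model A succeeded and model B failed,
    c = trials where B succeeded and A failed.
    Under the null, the b discordant wins follow Binomial(b + c, 0.5)."""
    n, k = b + c, min(b, c)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

def bootstrap_ci(successes: int, trials: int, reps: int = 10_000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap confidence interval for a binomial success rate."""
    rng = random.Random(seed)
    outcomes = [1] * successes + [0] * (trials - successes)
    rates = sorted(sum(rng.choices(outcomes, k=trials)) / trials
                   for _ in range(reps))
    return rates[int(alpha / 2 * reps)], rates[int((1 - alpha / 2) * reps) - 1]
```

For instance, with b = 5 discordant pairs all favoring one model and c = 0, the exact two-sided p-value is 0.0625, which hints that the reported p < 0.01 must rest on more paired trials than this toy example; `bootstrap_ci(5, 20)` likewise returns a wide interval around 0.25.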

Circularity Check

0 steps flagged

No circularity: empirical dataset release and model training with no reductive derivation chain

Full rationale

The paper introduces a new multi-institution dataset and reports empirical training results for GR00T-H and Cosmos-H-Surgical-Simulator. No equations, first-principles derivations, or predictions appear in the provided text. Claims rest on data collection scale and benchmark success rates rather than any quantity fitted from the authors' prior outputs or self-citations. The central results (25% suturing success, multi-platform simulation) are direct outcomes of training on the released data and do not reduce to inputs by construction. This matches the default expectation of a non-circular empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper rests on conventional assumptions of machine learning generalization and data quality rather than new fitted parameters or invented physical entities.

axioms (1)
  • Domain assumption: Training data from multiple institutions and platforms will support models that generalize to new robots and clinical settings.
    Implicit in the claim that the dataset enables foundation models for medical robotics.

pith-pipeline@v0.9.0 · 6584 in / 1348 out tokens · 47495 ms · 2026-05-09T23:48:45.691752+00:00 · methodology


Reference graph

Works this paper leans on

40 extracted references · 15 canonical work pages · 9 internal anchors

  1. [1] G. Plc, The Complexities of Physician Supply and Demand: Projections From 2021 to 2036, Tech. rep., AAMC, Washington, DC (2024)
  2. [2] S. Moffatt-Bruce, J. Crestanello, D. P. Way, T. E. Williams Jr, Providing cardiothoracic services in 2035: signs of trouble ahead. The Journal of Thoracic and Cardiovascular Surgery 155(2), 824–829 (2018)
  3. [3] T. Haidegger, Autonomy for surgical robots: Concepts and paradigms. IEEE Transactions on Medical Robotics and Bionics 1(2), 65–76 (2019)
  4. [4] S. Schmidgall, J. D. Opfermann, J. W. Kim, A. Krieger, Will your next surgeon be a robot? Autonomy and AI in robotic surgery. Science Robotics 10(104), eadt0187 (2025)
  5. [5] P. Kazanzides, et al., An open-source research kit for the da Vinci® Surgical System, in 2014 IEEE International Conference on Robotics and Automation (ICRA) (IEEE) (2014), pp. 6434–6439
  6. [6] J. W. Kim, et al., Surgical Robot Transformer (SRT): Imitation Learning for Surgical Tasks, in Proceedings of the 8th Conference on Robot Learning (CoRL 2024) (2024)
  7. [7] Y. Long, et al., Surgical embodied intelligence for generalized task autonomy in laparoscopic robot-assisted surgery. Science Robotics 10(104), eadt3093 (2025)
  8. [8] T. Brown, et al., Language Models are Few-Shot Learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020)
  9. [9] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (2019), pp. 4171–4186
  10. [10] A. Dosovitskiy, et al., An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, in International Conference on Learning Representations (2021)
  11. [11] A. Radford, et al., Learning Transferable Visual Models From Natural Language Supervision, in Proceedings of the 38th International Conference on Machine Learning (PMLR) (2021), pp. 8748–8763
  12. [12] K. He, et al., Masked Autoencoders Are Scalable Vision Learners, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2022), pp. 16000–16009
  13. [13] Open X-Embodiment Collaboration, Open X-Embodiment: Robotic Learning Datasets and RT-X Models, in IEEE International Conference on Robotics and Automation (ICRA) (2024)
  14. [14] A. Brohan, et al., RT-1: Robotics Transformer for Real-World Control at Scale, in Robotics: Science and Systems (RSS) (2023)
  15. [15] B. Zitkovich, et al., RT-2: Vision-language-action models transfer web knowledge to robotic control, in Conference on Robot Learning (PMLR) (2023), pp. 2165–2183
  16. [16] Octo Model Team, et al., Octo: An Open-Source Generalist Robot Policy. arXiv preprint arXiv:2405.12213 (2024)
  17. [17] R. Doshi, H. Walke, O. Mees, S. Dasari, S. Levine, Scaling Cross-Embodied Learning: One Policy for Manipulation, Navigation, Locomotion and Aviation. arXiv preprint arXiv:2408.11812 (2024)
  18. [18] M. J. Kim, et al., OpenVLA: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246 (2024)
  19. [19] AgiBot-World-Contributors, et al., AgiBot World Colosseo: A Large-Scale Manipulation Platform for Scalable and Intelligent Embodied Systems. arXiv preprint arXiv:2503.06669 (2025)
  20. [20] G. R. Team, et al., Gemini Robotics 1.5: Pushing the frontier of generalist robots with advanced embodied reasoning, thinking, and motion transfer. arXiv preprint arXiv:2510.03342 (2025)
  21. [21] J. Bjorck, et al., GR00T N1: An Open Foundation Model for Generalist Humanoid Robots. arXiv preprint arXiv:2503.14734 (2025)
  22. [22] M. J. Kim, et al., Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning, in The Fourteenth International Conference on Learning Representations (2026), https://openreview.net/forum?id=wPEIStHxYH
  23. [23] S. Ye, et al., World Action Models are Zero-shot Policies (2026), https://arxiv.org/abs/2602.15922
  24. [24] L. Li, et al., Causal World Modeling for Robot Control (2026), https://arxiv.org/abs/2601.21998
  25. [25] J. Haworth, et al., SutureBot: A Precision Framework & Benchmark for Autonomous End-to-End Suturing, in Advances in Neural Information Processing Systems (NeurIPS), Datasets and Benchmarks Track (2025)
  26. [26] K. Black, et al., π0: A Vision-Language-Action Flow Model for General Robot Control, in Robotics: Science and Systems (RSS) (2025)
  27. [27] T. Z. Zhao, V. Kumar, S. Levine, C. Finn, Learning fine-grained bimanual manipulation with low-cost hardware. arXiv preprint arXiv:2304.13705 (2023)
  28. [28] Y. Gao, et al., JHU-ISI Gesture and Skill Assessment Working Set (JIGSAWS): A Surgical Activity Dataset for Human Motion Modeling, in Modeling and Monitoring of Computer Assisted Interventions (M2CAI) – MICCAI Workshop (2014)
  29. [29] K.-H. Oh, et al., Expanded Comprehensive Robotic Cholecystectomy Dataset (CRCD). Journal of Medical Robotics Research (2025)
  30. [30] P. Hansen, et al., ImitateCholec: A Multimodal Dataset for Long-Horizon Imitation Learning in Robotic Cholecystectomy. Scientific Data 13(1), 210 (2026)
  31. [31] A. Ali, et al., World simulation with video foundation models for physical AI. arXiv preprint arXiv:2511.00062 (2025)
  32. [32] L. Zbinden, et al., Cosmos-Surg-DVRK: World Foundation Model-Based Automated Online Evaluation of Surgical Robot Policy Learning. IEEE Robotics and Automation Letters 11(5), 5978–5985 (2026), doi:10.1109/LRA.2026.3675962
  33. [33] R. Cadene, et al., LeRobot: An Open-Source Library for End-to-End Robot Learning, in The Fourteenth International Conference on Learning Representations (ICLR) (2026)
  34. [34] K. Chen, et al., Robo-DM: Data Management For Large Robot Datasets, in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA) (2025)
  35. [35] Open-H Initiative, Open-H-Embodiment: Data Contribution How-To Guide and Scripts, https://github.com/open-h/open-h-embodiment (2026), accessed: 2026-04-02
  36. [36] C. Che, C. Wang, T. Vercauteren, S. Tsoka, L. C. Garcia-Peraza-Herrera, LEMON: A Large Endoscopic MONocular Dataset and Foundation Model for Perception in Surgical Settings, in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2026), arXiv:2503.19740
  37. [37] Y. Zhou, C. Barnes, J. Lu, J. Yang, H. Li, On the Continuity of Rotation Representations in Neural Networks, in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2019), pp. 5745–5753
  38. [38] T. L. Team, et al., A Careful Examination of Large Behavior Models for Multitask Dexterous Manipulation. arXiv preprint arXiv:2507.05331 (2025)
  39. [39] M. Assran, et al., V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning. arXiv preprint arXiv:2506.09985 (2025)
  40. [40] L. Li, et al., Causal World Modeling for Robot Control. arXiv preprint arXiv:2601.21998 (2026)