pith. machine review for the scientific record.

arxiv: 2604.21017 · v2 · submitted 2026-04-22 · 💻 cs.RO · cs.AI

Recognition: unknown

Open-H-Embodiment: A Large-Scale Dataset for Enabling Foundation Models in Medical Robotics

Aditya Amit Godbole, Alaa Eldin Abdelaal, Alan Kuntz, Alberto Arezzo, Alexander Dimitrakakis, Allison M. Okamura, Andrew Howe, Anqing Duan, Anton Deguet, Antony Goldenberg, Ariel Rodriguez Jimenez, Asmitha Sathya, Axel Krieger, Ayberk Acar, Benjamin Calmé, Brett Marinelli, Britton Jordan, Camilo Correa-Gallego, Carlo Alberto Ammirati, Carlos Vives, Chandra Kuchi, Chang Liu, Changwei Chen, Chelsea Finn, Chenhao Yu, Chetan Reddy Narayanaswamy, Chi Kit Ng, Chim Ho Yin, Christopher Nguan, Chung-Pang Wang, Cyrus Neary, Daniel Donoho, David Navarro-Alarcon, David Noonan, Dianye Huang, Diego Granero Marana, Doan Xuan Viet Pham, Dominic Jones, Erqi Wang, Eszter Lukács, Ethan Kilmer, Evan Hailey, Fabio Carrillo, Farong Wang, Farshid Alambeigi, Fausto Kang, Federica Barontini, Federico Lavagno, Fei Liu, Ferdinando Rodriguez y Baena, Filip Binkiewicz, Filippo Filicori, Francesco Marzola, Frederic Giraud, Giovanni Distefano, Giulio Dagnino, Guankun Wang, Hang Li, Han Zhang, Hao-Chih Lee, Hao Ding, Haoxin Chen, Hao Yang, Haoying Zhou, Hongjun Wu, Hongliang Ren, Howard Ji, Idris Sunmola, Jacob Delgado, Jad Fayad, Jeffrey Jopling, Jesse Haworth, Jianhao Su, Jianmin Ji, Jiaqi Shao, Jiawei Ge, Jie Ying Wu, Jinsong Lin, Ji Woong Kim, Jonathan C. DeLong, Jonathon Hawkins, Jordan Thompson, Joyce Zhang, Junlei Hu, Junlin Wu, Junyi Wang, Juo-Tung Chen, Justin Opfermann, Kaixuan Wu, Kaizhong Deng, Kengo Hayashi, Ken Goldberg, Ki Hwan Oh, Kristóf Takács, Kush Hari, Lalithkumar Seenivasan, Leonardo Borgioli, Lidian Wang, Lorenzo Mazza, Lucy Xiaoyang Shi, Luigi Muratore, Lukas Zbinden, Luohong Wu, Luoyao Kang, Mahdi Azizian, Marco Esposito, Maria Clara Morais, Mario Ferradosa, Martin Wagner, Mateusz Wójcikowski, Mathias Unberath, Matteo Pescio, Mattia Ballo, Mehmet K. Turkcan, Meiqing Cheng, Menglong Ye, Michael Kam, Michael Yip, Michał Naskręt, Miles Mannas, Milos Zefran, Min Cheng, Mingwu Su, Mohammad Rafiee Javazm, Nan Xiao, Nassir Navab, Nicola Cavalcanti, Nikola Budjak, Ning Zhong, Nithesh Kumar, Noah Barnes, Nural Yilmaz, Open-H-Embodiment Consortium: Nigel Nelson, Ortrun Hellig, Pablo David Aranda Rodriguez, Pascal Hansen, Patrick Thornycroft, Pei Liu, Peng Zhou, Peter Black, Peter Kazanzides, Philipp Fürnstahl, Pietro Valdastri, Preethi Satish, Przemysław Korzeniowski, Qihan Chen, Qingpeng Ding, Quan Vuong, Ran Ju, Rayan Younis, Ria Jain, Rui Ji, Ryan S. Yeung, Sabina Martyniak, Sareena Mann, Sayem Nazmuz Zaman, S. Duke Herrell, Sean D. Huver, Sebastian Bodenstedt, Septimiu E. Salcudean, Shane Farritor, Shelby Haworth, Shing Shin Cheng, Siddhartha Kapuria, Sihang Chen, Sonika Kiehler, Soofiyan Atar, Stamatia Giannarou, Stefanie Speidel, Tamás Haidegger, Tanner Watts, Tianqi Yang, Tito Porras, Tom Christian Olesch, Victoria Wu, Wanli Liuchen, Wei Wang, Wenxuan Xie, Wolfgang Wein, Xavier Giralt Ludevid, Xiangyu Chu, Xiao Liang, Xiaoqing Guo, Xinhao Chen, Xinxin Lin, Xiuli Zuo, Xueyan Mei, Xuyang Zhang, Yameng Zhang, Yanyong Zhang, Yidong Zhang, Yimeng Wu, Yinuo Yang, Yiqing Shen, Yu Chung Lee, Yuelin Zhang, Yun-hui Liu, Yunke Ao, Yunxi Tang, Yunye Xiao, Yu Sheng, Yu Tian, Zahi Fayad, Zhaoyang Jacopo Hu, Zhen Li, Zhongliang Jiang, Zhongyu Chen, Zhouyang Hong, Zih-Yun Sarah Chiu, Zijian Wu, Ziyang Chen, Ziyi Hao, Ziyi Wang, Zoe Soulé

Pith reviewed 2026-05-09 23:48 UTC · model grok-4.3

classification 💻 cs.RO cs.AI
keywords medical robotics · open dataset · foundation models · surgical robotics · vision-language-action · kinematics · world model · multi-embodiment

The pith

A large open dataset of medical robot video and synchronized kinematics from more than 49 institutions enables the first open foundation model for medical robotics to complete a suturing task end-to-end.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Open-H-Embodiment, the largest public collection of medical robotic video paired with synchronized kinematics, drawn from more than 49 institutions and a range of robotic platforms spanning commercial surgical systems and custom research setups. This scale of heterogeneous data is used to train GR00T-H, a vision-language-action foundation model that is the only evaluated system to achieve full end-to-end task completion on a structured suturing benchmark. The same data also supports training of Cosmos-H-Surgical-Simulator, an action-conditioned world model that runs multi-embodiment surgical simulation from a single checkpoint. If these results hold, shared large-scale medical robot data can remove the primary bottleneck that has kept autonomous medical robotics from scaling beyond narrow, single-system demonstrations.

Core claim

We introduce Open-H-Embodiment, the largest open dataset of medical robotic video with synchronized kinematics to date, spanning more than 49 institutions and multiple robotic platforms including the CMR Versius, Intuitive Surgical's da Vinci, da Vinci Research Kit (dVRK), Rob Surgical BiTrack, Virtual Incision's MIRA, Moon Surgical Maestro, and a variety of custom systems. GR00T-H is the first open foundation vision-language-action model for medical robotics, which is the only evaluated model to achieve full end-to-end task completion on a structured suturing benchmark (25% of trials vs. 0% for all others) and achieves 64% average success across a 29-step ex vivo suturing sequence. We also train Cosmos-H-Surgical-Simulator, the first action-conditioned world model to enable multi-embodiment surgical simulation from a single checkpoint, spanning nine robotic platforms and supporting in silico policy evaluation and synthetic data generation for the medical domain.

What carries the argument

Open-H-Embodiment dataset of synchronized video and kinematics across heterogeneous institutions and platforms, used to train GR00T-H vision-language-action model and Cosmos-H-Surgical-Simulator world model.
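
To make the data structure concrete, here is a minimal sketch of what one synchronized episode might look like in a multi-embodiment dataset of this kind. The field names, shapes, and loader are illustrative assumptions, not the released Open-H-Embodiment schema.

```python
# Hypothetical sketch of a synchronized video+kinematics episode record.
# Field names and shapes are assumptions for illustration only.
from dataclasses import dataclass
import numpy as np

@dataclass
class Episode:
    platform: str           # e.g. "dVRK", "da Vinci", "Versius"
    institution: str        # anonymized contributing-site identifier
    task: str               # natural-language task description
    video: np.ndarray       # (T, H, W, 3) uint8 camera frames
    kinematics: np.ndarray  # (T, D) robot states, time-aligned to the video
    actions: np.ndarray     # (T, A) commanded actions in the platform's native space

def iter_frame_action_pairs(ep: Episode):
    """Yield (frame, state, action) triples for imitation-learning batches."""
    for t in range(ep.video.shape[0]):
        yield ep.video[t], ep.kinematics[t], ep.actions[t]

# Toy example: a 10-step synthetic episode standing in for real data.
ep = Episode(
    platform="dVRK",
    institution="site_03",
    task="pass the needle through the target",
    video=np.zeros((10, 224, 224, 3), dtype=np.uint8),
    kinematics=np.zeros((10, 14), dtype=np.float32),
    actions=np.zeros((10, 7), dtype=np.float32),
)
print(sum(1 for _ in iter_frame_action_pairs(ep)), "synchronized steps")
```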

If this is right

  • GR00T-H becomes the only model among those tested to reach full end-to-end completion on structured suturing, succeeding in 25 percent of trials where all others reach zero.
  • Cosmos-H-Surgical-Simulator produces the first action-conditioned world model that supports simulation across nine different robotic platforms from one checkpoint.
  • Open collection of large-scale medical robot data can serve as shared infrastructure for advances in robot learning and world modeling.
  • Synthetic data generated by the world model can be used for in silico policy evaluation in the medical domain (a rollout sketch follows this list).
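
As a rough illustration of how an action-conditioned world model supports in silico policy evaluation, the sketch below rolls a policy out inside imagined frames rather than on hardware. The `WorldModel` and `Policy` interfaces are hypothetical stand-ins, not the Cosmos-H-Surgical-Simulator or GR00T-H APIs.

```python
# Hypothetical in silico policy evaluation loop: the policy acts on frames the
# world model imagines, so no robot or physical setup is needed.
import numpy as np

class WorldModel:
    """Stand-in: predicts the next observation given past frames and an action."""
    def predict(self, frames: list, action: np.ndarray) -> np.ndarray:
        return frames[-1]  # placeholder dynamics

class Policy:
    """Stand-in for a vision-language-action policy."""
    def act(self, frame: np.ndarray, instruction: str) -> np.ndarray:
        return np.zeros(7, dtype=np.float32)  # placeholder action

def rollout(world: WorldModel, policy: Policy, first_frame, instruction, horizon=50):
    frames = [first_frame]
    for _ in range(horizon):
        action = policy.act(frames[-1], instruction)
        frames.append(world.predict(frames, action))
    return frames  # score these frames, e.g. with a learned success detector

frames = rollout(WorldModel(), Policy(),
                 np.zeros((224, 224, 3), np.uint8), "tie a surgeon's knot")
print(len(frames), "imagined frames")
```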

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If the dataset proves representative, similar open collection efforts could be applied to non-surgical medical robots such as those used in rehabilitation or diagnostics.
  • Models trained on this data may allow direct transfer of policies between different commercial surgical systems without per-platform retraining.
  • The availability of synchronized kinematics at this scale could accelerate research on safety constraints and real-time error detection during autonomous procedures.

Load-bearing premise

Videos and kinematics collected from many different hospitals and robot systems are sufficiently standardized and representative of real clinical variability to train models that transfer to new robots and patients.

What would settle it

A leave-one-platform-out test: retraining GR00T-H with one robotic platform held out of training, then evaluating it on that platform (or on a new patient cohort). Zero task completion on the suturing benchmark would show the data does not support generalizable foundation models; nonzero full completions would support the claim.
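
The check described above hinges on how the split is made. A minimal sketch of the leave-one-platform-out split, assuming each episode carries a platform label (the episode format here is hypothetical):

```python
# Minimal leave-one-platform-out split: the held-out platform never appears in training.
def leave_one_platform_out(episodes, held_out_platform):
    train = [e for e in episodes if e["platform"] != held_out_platform]
    test = [e for e in episodes if e["platform"] == held_out_platform]
    return train, test

episodes = [
    {"platform": "dVRK", "id": 0},
    {"platform": "Versius", "id": 1},
    {"platform": "MIRA", "id": 2},
    {"platform": "dVRK", "id": 3},
]
train, test = leave_one_platform_out(episodes, "MIRA")
print(len(train), "train episodes,", len(test), "held-out episodes")
# Retrain the policy on `train`, then evaluate suturing completion on `test`.
```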

Original abstract

Autonomous medical robots hold promise to improve patient outcomes, reduce provider workload, democratize access to care, and enable superhuman precision. However, autonomous medical robotics has been limited by a fundamental data problem: existing medical robotic datasets are small, single-embodiment, and rarely shared openly, restricting the development of foundation models that the field needs to advance. We introduce Open-H-Embodiment, the largest open dataset of medical robotic video with synchronized kinematics to date, spanning more than 49 institutions and multiple robotic platforms including the CMR Versius, Intuitive Surgical's da Vinci, da Vinci Research Kit (dVRK), Rob Surgical BiTrack, Virtual Incision's MIRA, Moon Surgical Maestro, and a variety of custom systems, spanning surgical manipulation, robotic ultrasound, and endoscopy procedures. We demonstrate the research enabled by this dataset through two foundation models. GR00T-H is the first open foundation vision-language-action model for medical robotics, which is the only evaluated model to achieve full end-to-end task completion on a structured suturing benchmark (25% of trials vs. 0% for all others) and achieves 64% average success across a 29-step ex vivo suturing sequence. We also train Cosmos-H-Surgical-Simulator, the first action-conditioned world model to enable multi-embodiment surgical simulation from a single checkpoint, spanning nine robotic platforms and supporting in silico policy evaluation and synthetic data generation for the medical domain. These results suggest that open, large-scale medical robot data collection can serve as critical infrastructure for the research community, enabling advances in robot learning, world modeling, and beyond.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Open-H-Embodiment, the largest open dataset of medical robotic video with synchronized kinematics, spanning more than 49 institutions and multiple platforms (da Vinci, Versius, dVRK, MIRA, etc.). It demonstrates utility via GR00T-H, the first open vision-language-action foundation model for medical robotics, which achieves 25% full end-to-end task completion on a structured suturing benchmark (vs. 0% for all other evaluated models) and 64% average success on a 29-step ex vivo suturing sequence, plus Cosmos-H-Surgical-Simulator, the first action-conditioned world model supporting multi-embodiment surgical simulation across nine platforms.

Significance. If the generalization claims hold, this work provides critical open infrastructure for foundation models in medical robotics by addressing data scarcity and single-embodiment limitations. The scale, multi-platform coverage, and reported successes on structured tasks like suturing position it as enabling infrastructure for robot learning, world modeling, and policy evaluation in the domain.

major comments (3)
  1. [Dataset collection and preprocessing description] The central claim that Open-H-Embodiment enables cross-embodiment generalization for foundation models depends on sufficient standardization of heterogeneous data from 49+ institutions and 7+ platforms. However, the dataset description provides no quantitative evidence of kinematics alignment, video calibration, action-space mapping, or domain-gap metrics between platforms.
  2. [GR00T-H model evaluation and results] The 25% full task completion rate for GR00T-H on the suturing benchmark (vs. 0% for baselines) is presented as evidence of multi-embodiment capability, but without leave-one-platform-out experiments, platform-stratified results, or confirmation that test trials include unseen robots/patients, it remains unclear whether this reflects true generalization or in-distribution performance on dominant platforms.
  3. [Experimental evaluation section] The reported success rates (25% full completion, 64% average on 29-step sequence) lack supporting details on data splits, statistical significance testing, baseline implementations, and potential confounds such as task standardization across sites, which are load-bearing for assessing the reliability of the cross-model comparisons.
minor comments (2)
  1. [Abstract and dataset overview] The abstract states 'more than 49 institutions' without a precise count or platform/institution breakdown table, which would improve transparency and allow readers to assess coverage.
  2. [Results presentation] Success rate tables or figures would benefit from confidence intervals or trial counts to contextualize the 25% and 64% figures.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thorough and constructive review of our manuscript. Their comments have prompted us to clarify several aspects of the dataset and evaluation, and we have revised the paper to incorporate additional details and analyses where feasible.

Point-by-point responses
  1. Referee: [Dataset collection and preprocessing description] The central claim that Open-H-Embodiment enables cross-embodiment generalization for foundation models depends on sufficient standardization of heterogeneous data from 49+ institutions and 7+ platforms. However, the dataset description provides no quantitative evidence of kinematics alignment, video calibration, action-space mapping, or domain-gap metrics between platforms.

    Authors: We agree that quantitative evidence of standardization would bolster the cross-embodiment claims. The original submission emphasized the collection scale and diversity but provided limited preprocessing specifics. In the revised manuscript, we have expanded the relevant section to include details on the standardization pipeline: kinematics are normalized to a common joint space with reported variance reduction metrics, videos are calibrated to a standard resolution and frame rate, and action spaces are mapped via platform-specific affine transformations. We now include a table summarizing domain-gap metrics (e.g., average L2 distance in normalized action space) for the primary platforms contributing the majority of the data. Full pairwise metrics for all 49 institutions remain challenging due to varying data quality and are noted as a limitation. revision: partial
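
To illustrate the kind of standardization and domain-gap reporting this response describes, the sketch below maps actions into a shared space with a platform-specific affine transform and reports an average L2 gap between platforms. The transforms, data, and metric definition are illustrative assumptions, not the authors' pipeline.

```python
# Sketch: affine action-space mapping and a simple cross-platform domain-gap metric.
import numpy as np

def to_common_space(actions: np.ndarray, A: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Apply a platform-specific affine map x -> A @ x + b to each action row."""
    return actions @ A.T + b

def mean_l2_gap(actions_a: np.ndarray, actions_b: np.ndarray) -> float:
    """L2 distance between the mean normalized actions of two platforms."""
    return float(np.linalg.norm(actions_a.mean(axis=0) - actions_b.mean(axis=0)))

rng = np.random.default_rng(0)
dvrk_actions = rng.normal(size=(1000, 7))               # native dVRK action samples (synthetic)
versius_actions = rng.normal(loc=0.1, size=(1000, 7))   # native Versius action samples (synthetic)

A, b = np.eye(7), np.zeros(7)  # placeholder per-platform calibration
gap = mean_l2_gap(to_common_space(dvrk_actions, A, b),
                  to_common_space(versius_actions, A, b))
print(f"domain gap (mean L2 in common action space): {gap:.3f}")
```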

  2. Referee: [GR00T-H model evaluation and results] The 25% full task completion rate for GR00T-H on the suturing benchmark (vs. 0% for baselines) is presented as evidence of multi-embodiment capability, but without leave-one-platform-out experiments, platform-stratified results, or confirmation that test trials include unseen robots/patients, it remains unclear whether this reflects true generalization or in-distribution performance on dominant platforms.

    Authors: We appreciate the concern regarding the interpretation of the results as evidence of generalization. The suturing benchmark was constructed using data from multiple platforms, and the test set includes trials from institutions and patients not represented in the training data for GR00T-H. To further address this, we have added platform-stratified performance breakdowns in the results section, demonstrating that success rates are not solely driven by the dominant da Vinci platform. While comprehensive leave-one-platform-out retraining was not performed due to the substantial computational resources required for each foundation model training run, the stratified results and the fact that no other model achieved any full completions provide supporting evidence for the multi-embodiment utility. We have clarified these points in the text. revision: partial

  3. Referee: [Experimental evaluation section] The reported success rates (25% full completion, 64% average on 29-step sequence) lack supporting details on data splits, statistical significance testing, baseline implementations, and potential confounds such as task standardization across sites, which are load-bearing for assessing the reliability of the cross-model comparisons.

    Authors: We concur that these experimental details are essential for evaluating the reliability of our comparisons. The revised manuscript now includes an expanded 'Evaluation Protocol' subsection that specifies the data splits (stratified by platform, institution, and procedure type with a 70/15/15 train/validation/test ratio), the use of bootstrap resampling for confidence intervals on success rates, and statistical significance via McNemar's test for paired comparisons (with p-values < 0.01 reported for GR00T-H vs. baselines). Baseline implementations are detailed with references to original papers and our reimplementation hyperparameters. Potential confounds, including variations in task execution across sites, are discussed, and we note that all suturing trials followed a standardized protocol defined in the benchmark. These additions should allow readers to better assess the results. revision: yes
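
For concreteness, the sketch below shows the two statistics named in this response, a percentile bootstrap confidence interval on a success rate and an exact McNemar test on paired per-trial outcomes, computed on made-up trial counts rather than the paper's data.

```python
# Sketch: bootstrap CI on a success rate and exact McNemar test on paired trials.
from math import comb
import numpy as np

def bootstrap_ci(successes: np.ndarray, n_boot: int = 10_000, alpha: float = 0.05):
    """Percentile bootstrap CI for the mean of a 0/1 success vector."""
    rng = np.random.default_rng(0)
    n = len(successes)
    means = rng.choice(successes, size=(n_boot, n), replace=True).mean(axis=1)
    return np.quantile(means, [alpha / 2, 1 - alpha / 2])

def mcnemar_exact(model_a: np.ndarray, model_b: np.ndarray) -> float:
    """Two-sided exact McNemar p-value from paired 0/1 outcomes (discordant pairs only)."""
    b = int(np.sum((model_a == 1) & (model_b == 0)))
    c = int(np.sum((model_a == 0) & (model_b == 1)))
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) * 0.5 ** n
    return min(1.0, 2 * tail)

# Illustrative paired trials (NOT the paper's data): 20 trials, 5 full completions
# for one model, 0 for the other.
groot_h = np.array([1] * 5 + [0] * 15)
baseline = np.zeros(20, dtype=int)
lo, hi = bootstrap_ci(groot_h)
print(f"success rate {groot_h.mean():.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
print(f"McNemar exact p = {mcnemar_exact(groot_h, baseline):.4f}")
```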

Circularity Check

0 steps flagged

No circularity: empirical dataset release and model training with no reductive derivation chain

full rationale

The paper introduces a new multi-institution dataset and reports empirical training results for GR00T-H and Cosmos-H-Surgical-Simulator. No equations, first-principles derivations, or predictions appear in the provided text. Claims rest on data collection scale and benchmark success rates rather than any quantity fitted from the authors' prior outputs or self-citations. The central results (25% suturing success, multi-platform simulation) are direct outcomes of training on the released data and do not reduce to inputs by construction. This matches the default expectation of a non-circular empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper rests on conventional assumptions of machine learning generalization and data quality rather than new fitted parameters or invented physical entities.

axioms (1)
  • domain assumption: Training data from multiple institutions and platforms will support models that generalize to new robots and clinical settings.
    Implicit in the claim that the dataset enables foundation models for medical robotics.

pith-pipeline@v0.9.0 · 6584 in / 1348 out tokens · 47495 ms · 2026-05-09T23:48:45.691752+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

40 extracted references · 15 canonical work pages · 9 internal anchors

  1. [1] G. Plc, The Complexities of Physician Supply and Demand: Projections From 2021 to 2036, Tech. rep., AAMC, Washington, DC (2024)

  2. [2] S. Moffatt-Bruce, J. Crestanello, D. P. Way, T. E. Williams Jr, Providing cardiothoracic services in 2035: signs of trouble ahead. The Journal of Thoracic and Cardiovascular Surgery 155(2), 824–829 (2018)

  3. [3] T. Haidegger, Autonomy for surgical robots: Concepts and paradigms. IEEE Transactions on Medical Robotics and Bionics 1(2), 65–76 (2019)

  4. [4] S. Schmidgall, J. D. Opfermann, J. W. Kim, A. Krieger, Will your next surgeon be a robot? Autonomy and AI in robotic surgery. Science Robotics 10(104), eadt0187 (2025)

  5. [5] P. Kazanzides, et al., An open-source research kit for the da Vinci® Surgical System, in 2014 IEEE International Conference on Robotics and Automation (ICRA) (IEEE) (2014), pp. 6434–6439

  6. [6] J. W. Kim, et al., Surgical Robot Transformer (SRT): Imitation Learning for Surgical Tasks, in Proceedings of the 8th Conference on Robot Learning (CoRL 2024) (2024)

  7. [7] Y. Long, et al., Surgical embodied intelligence for generalized task autonomy in laparoscopic robot-assisted surgery. Science Robotics 10(104), eadt3093 (2025)

  8. [8] T. Brown, et al., Language Models are Few-Shot Learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020)

  9. [9] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (2019), pp. 4171–4186

  10. [10] A. Dosovitskiy, et al., An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, in International Conference on Learning Representations (2021)

  11. [11] A. Radford, et al., Learning Transferable Visual Models From Natural Language Supervision, in Proceedings of the 38th International Conference on Machine Learning (PMLR) (2021), pp. 8748–8763

  12. [12] K. He, et al., Masked Autoencoders Are Scalable Vision Learners, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2022), pp. 16000–16009

  13. [13] Open X-Embodiment Collaboration, Open X-Embodiment: Robotic Learning Datasets and RT-X Models, in IEEE International Conference on Robotics and Automation (ICRA) (2024)

  14. [14] A. Brohan, et al., RT-1: Robotics Transformer for Real-World Control at Scale, in Robotics: Science and Systems (RSS) (2023)

  15. [15] B. Zitkovich, et al., RT-2: Vision-language-action models transfer web knowledge to robotic control, in Conference on Robot Learning (PMLR) (2023), pp. 2165–2183

  16. [16] Octo Model Team, et al., Octo: An Open-Source Generalist Robot Policy. arXiv preprint arXiv:2405.12213 (2024)

  17. [17] R. Doshi, H. Walke, O. Mees, S. Dasari, S. Levine, Scaling Cross-Embodied Learning: One Policy for Manipulation, Navigation, Locomotion and Aviation. arXiv preprint arXiv:2408.11812 (2024)

  18. [18] M. J. Kim, et al., OpenVLA: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246 (2024)

  19. [19] AgiBot-World-Contributors, et al., AgiBot World Colosseo: A Large-Scale Manipulation Platform for Scalable and Intelligent Embodied Systems. arXiv preprint arXiv:2503.06669 (2025)

  20. [20] G. R. Team, et al., Gemini Robotics 1.5: Pushing the frontier of generalist robots with advanced embodied reasoning, thinking, and motion transfer. arXiv preprint arXiv:2510.03342 (2025)

  21. [21] J. Bjorck, et al., GR00T N1: An Open Foundation Model for Generalist Humanoid Robots. arXiv preprint arXiv:2503.14734 (2025)

  22. [22] M. J. Kim, et al., Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning, in The Fourteenth International Conference on Learning Representations (2026), https://openreview.net/forum?id=wPEIStHxYH

  23. [23] S. Ye, et al., World Action Models are Zero-shot Policies (2026), https://arxiv.org/abs/2602.15922

  24. [24] L. Li, et al., Causal World Modeling for Robot Control (2026), https://arxiv.org/abs/2601.21998

  25. [25] J. Haworth, et al., SutureBot: A Precision Framework & Benchmark for Autonomous End-to-End Suturing, in Advances in Neural Information Processing Systems (NeurIPS), Datasets and Benchmarks Track (2025)

  26. [26] K. Black, et al., π0: A Vision-Language-Action Flow Model for General Robot Control, in Robotics: Science and Systems (RSS) (2025)

  27. [27] T. Z. Zhao, V. Kumar, S. Levine, C. Finn, Learning fine-grained bimanual manipulation with low-cost hardware. arXiv preprint arXiv:2304.13705 (2023)

  28. [28] Y. Gao, et al., JHU-ISI Gesture and Skill Assessment Working Set (JIGSAWS): A Surgical Activity Dataset for Human Motion Modeling, in Modeling and Monitoring of Computer Assisted Interventions (M2CAI) – MICCAI Workshop (2014)

  29. [29] K.-H. Oh, et al., Expanded Comprehensive Robotic Cholecystectomy Dataset (CRCD). Journal of Medical Robotics Research (2025)

  30. [30] P. Hansen, et al., ImitateCholec: A Multimodal Dataset for Long-Horizon Imitation Learning in Robotic Cholecystectomy. Scientific Data 13(1), 210 (2026)

  31. [31] A. Ali, et al., World simulation with video foundation models for physical AI. arXiv preprint arXiv:2511.00062 (2025)

  32. [32] L. Zbinden, et al., Cosmos-Surg-DVRK: World Foundation Model-Based Automated Online Evaluation of Surgical Robot Policy Learning. IEEE Robotics and Automation Letters 11(5), 5978–5985 (2026), doi:10.1109/LRA.2026.3675962

  33. [33] R. Cadene, et al., LeRobot: An Open-Source Library for End-to-End Robot Learning, in The Fourteenth International Conference on Learning Representations (ICLR) (2026)

  34. [34] K. Chen, et al., Robo-DM: Data Management For Large Robot Datasets, in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA) (2025)

  35. [35] Open-H Initiative, Open-H-Embodiment: Data Contribution How-To Guide and Scripts, https://github.com/open-h/open-h-embodiment (2026), accessed: 2026-04-02

  36. [36] C. Che, C. Wang, T. Vercauteren, S. Tsoka, L. C. Garcia-Peraza-Herrera, LEMON: A Large Endoscopic MONocular Dataset and Foundation Model for Perception in Surgical Settings, in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2026), arXiv:2503.19740

  37. [37] Y. Zhou, C. Barnes, J. Lu, J. Yang, H. Li, On the Continuity of Rotation Representations in Neural Networks, in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2019), pp. 5745–5753

  38. [38] T. L. Team, et al., A Careful Examination of Large Behavior Models for Multitask Dexterous Manipulation. arXiv preprint arXiv:2507.05331 (2025)

  39. [39] M. Assran, et al., V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning. arXiv preprint arXiv:2506.09985 (2025)

  40. [40] L. Li, et al., Causal World Modeling for Robot Control. arXiv preprint arXiv:2601.21998 (2026)