pith. machine review for the scientific record.

arxiv: 2605.06747 · v1 · submitted 2026-05-07 · 💻 cs.CV · cs.RO

Recognition: 2 theorem links · Lean Theorem

HumanNet: Scaling Human-centric Video Learning to One Million Hours

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 00:45 UTC · model grok-4.3

classification 💻 cs.CV cs.RO
keywords HumanNet · human-centric video · egocentric video · embodied intelligence · vision-language-action · dataset curation · robot data substitute · physical interaction learning

The pith

A one-million-hour human video dataset lets vision-language models learn physical interactions better than training on real robot data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents HumanNet, a curated corpus of one million hours of first-person and third-person human activity videos drawn from internet sources, complete with captions, motion descriptions, and hand-body signals. It argues that this scale of human-centric data, structured around interaction and viewpoint diversity, can serve as a practical substitute for scarce robot recordings when training embodied models. The central evidence comes from an ablation where continued training of the Qwen vision-language model on 1000 hours of egocentric HumanNet footage outperformed the same training on 100 hours of real-robot data from Magic Cobot, all on identical validation sets. A sympathetic reader would care because embodied intelligence has long been limited by the cost and volume of physical interaction data; if human video works, the field gains a path to much larger training sets without building more robots. The work therefore treats data curation itself as the key engineering contribution rather than any new model architecture.

Core claim

HumanNet is a one-million-hour human-centric video corpus spanning first- and third-person views, fine-grained activities, human-object interactions, tool use, and long-horizon behaviors, accompanied by interaction-centric annotations including captions, motion descriptions, and hand-body signals. The authors treat human-centric filtering, temporal structuring, viewpoint diversity, and annotation enrichment as first-class design principles that convert unstructured internet video into a substrate for representation learning, activity understanding, motion generation, and human-to-robot transfer. In a controlled vision-language-action ablation, continued training from the Qwen VLM using 1000 hours of egocentric HumanNet video surpassed the same procedure using 100 hours of real-robot data from Magic Cobot on a fixed validation set, which the authors take as evidence that egocentric human video can be a scalable, cost-effective substitute for robot data.
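The paper does not release its training code, so the following is only a minimal sketch of the two-arm protocol the claim rests on: one continued-training recipe, two data sources, one fixed validation set. The loader and metric names (load_qwen_vlm, load_humannet_egocentric, load_magic_cobot, fixed_validation_metric) are hypothetical placeholders, not the authors' pipeline.

```python
# Sketch of the two-arm ablation: identical continued training on two data
# sources, scored on one shared validation set. Not the paper's actual code.
import torch
from torch.utils.data import DataLoader, Dataset


def continue_training(model: torch.nn.Module, data: Dataset,
                      epochs: int = 1, lr: float = 1e-5) -> torch.nn.Module:
    """Apply the same continued-training recipe to one data subset."""
    loader = DataLoader(data, batch_size=8, shuffle=True)
    optim = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for batch in loader:
            # Assumes an HF-style VLM whose forward pass returns an object
            # exposing .loss; the real recipe and hyperparameters are unknown.
            loss = model(**batch).loss
            loss.backward()
            optim.step()
            optim.zero_grad()
    return model


def compare_on_fixed_validation(model_a, model_b, val_metric):
    """Score both arms with the same metric on the same held-out set."""
    return {"humannet_1000h_arm": val_metric(model_a),
            "magic_cobot_100h_arm": val_metric(model_b)}


# Hypothetical usage (all loaders are placeholders):
#   arm_a = continue_training(load_qwen_vlm(), load_humannet_egocentric(hours=1000))
#   arm_b = continue_training(load_qwen_vlm(), load_magic_cobot(hours=100))
#   print(compare_on_fixed_validation(arm_a, arm_b, fixed_validation_metric))
```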

What carries the argument

HumanNet's systematic data curation paradigm that applies human-centric filtering, temporal structuring, viewpoint diversity, and annotation enrichment to internet video.
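The paper states these four principles but not their implementation; as a rough illustration only, the sketch below composes them into a single pass over raw clips. The Clip fields and the segmenter/captioner/annotator callables are hypothetical placeholders, not the authors' tooling.

```python
# Illustrative composition of the four curation stages named above.
from dataclasses import dataclass, field


@dataclass
class Clip:
    video_id: str
    viewpoint: str                                  # "egocentric" or "third_person"
    has_visible_human: bool
    segments: list = field(default_factory=list)    # temporal activity segments
    annotations: dict = field(default_factory=dict)


def curate(raw_clips, segmenter, captioner, motion_annotator, hand_body_annotator):
    """Human-centric filtering -> temporal structuring -> viewpoint diversity
    -> annotation enrichment, in the order the principles are listed."""
    # 1. Human-centric filtering: keep clips that actually show human activity.
    clips = [c for c in raw_clips if c.has_visible_human]

    # 2. Temporal structuring: split long videos into activity-level segments.
    for c in clips:
        c.segments = segmenter(c)

    # 3. Viewpoint diversity: retain both first- and third-person footage
    #    instead of collapsing onto a single camera perspective.
    by_view = {
        "egocentric": [c for c in clips if c.viewpoint == "egocentric"],
        "third_person": [c for c in clips if c.viewpoint == "third_person"],
    }

    # 4. Annotation enrichment: captions, motion descriptions, hand/body signals.
    for view_clips in by_view.values():
        for c in view_clips:
            c.annotations = {
                "caption": captioner(c),
                "motion": motion_annotator(c),
                "hand_body": hand_body_annotator(c),
            }
    return by_view
```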

If this is right

  • Embodied models can be scaled using abundant human video rather than limited robot recordings.
  • Human-to-robot transfer becomes feasible at larger data volumes without proportional hardware costs.
  • Representation learning, activity understanding, and motion generation all benefit from the same interaction-centric annotations.
  • Unstructured internet video can be systematically turned into training data once the four curation principles are applied.
  • Vision-language-action training benefits from viewpoint diversity across first- and third-person footage.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Data-collection budgets in robotics could shift from building robot fleets to mining and annotating existing human video archives.
  • The same curation principles might be tested on other modalities such as audio or force signals to further reduce reliance on physical hardware.
  • Downstream robotic manipulation benchmarks could be used to measure whether the observed VLM gains translate to actual policy improvement.
  • If the performance edge holds at even larger scales, training runs that currently require robot time could instead run on cloud video corpora.

Load-bearing premise

The 1000-hour egocentric human subset and the 100-hour robot subset are comparable in task distribution, interaction complexity, and annotation quality so that performance differences can be attributed to the data source rather than other factors.

What would settle it

Re-running the exact continued-training experiment after matching the human and robot subsets for task distribution, interaction complexity, and annotation quality; if the human-video advantage disappears, the claim that egocentric human data is a scalable substitute would be falsified.
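One way to build such a matched subset, sketched below under the assumption that every clip record carries task and duration_h fields (hypothetical names; the paper does not describe a matching procedure): resample the egocentric human pool to the robot subset's task mix before re-running the identical comparison.

```python
# Stratified resampling of human clips to mirror the robot subset's task mix.
import random
from collections import Counter


def match_task_distribution(human_clips, robot_clips, hours_budget, seed=0):
    """Return a human-video subset whose task distribution follows the robot data."""
    rng = random.Random(seed)
    robot_mix = Counter(c["task"] for c in robot_clips)
    total = sum(robot_mix.values())

    matched = []
    for task, count in robot_mix.items():
        pool = [c for c in human_clips if c["task"] == task]
        share = hours_budget * count / total      # hours allotted to this task
        rng.shuffle(pool)
        taken_hours = 0.0
        for clip in pool:
            if taken_hours >= share:
                break
            matched.append(clip)
            taken_hours += clip["duration_h"]
    return matched
```

Interaction complexity and annotation quality would need analogous controls before the re-run carries the falsification weight described above.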

read the original abstract

Progress in embodied intelligence increasingly depends on scalable data infrastructure. While vision and language have scaled with internet corpora, learning physical interaction remains constrained by the lack of large, diverse, and richly annotated human activity data. We present HumanNet, a one-million-hour human-centric video corpus that captures how humans interact with the physical world at scale. HumanNet spans both first-person and third-person perspectives and covers fine-grained activities, human-object interactions, tool use, and long-horizon behaviors across diverse real-world environments. Beyond raw video, the dataset provides interaction-centric annotations, including captions, motion descriptions, and hand and body-related signals, enabling motion-aware and interaction-aware learning. Beyond scale, HumanNet introduces a systematic data curation paradigm for embodied learning, where human-centric filtering, temporal structuring, viewpoint diversity, and annotation enrichment are treated as first-class design principles. This design transforms unstructured internet video into a scalable substrate for representation learning, activity understanding, motion generation, and human-to-robot transfer. We conduct a first-step validation on the value of this design through controlled vision-language-action ablation: under a fixed set of validation data, continued training from the Qwen VLM model with 1000 hours of egocentric video drawn from HumanNet surpasses the continued training with 100 hours of real-robot data from Magic Cobot, indicating that egocentric human video could be a scalable and cost-effective substitute for robot data. By building this project, we aim to explore the opportunity to scale embodied foundation models using human-centric videos, rather than relying solely on robot-specific data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces HumanNet, a 1-million-hour human-centric video dataset spanning first- and third-person views with annotations for captions, motion descriptions, and hand/body signals. It outlines a systematic curation paradigm treating human-centric filtering, temporal structuring, and viewpoint diversity as core principles for embodied representation learning. The central empirical validation shows that continued training of the Qwen VLM on 1000 hours of egocentric HumanNet video outperforms the identical procedure on 100 hours of real-robot data from Magic Cobot under a fixed validation set, suggesting human video as a scalable substitute for robot data.

Significance. If the ablation comparison holds after addressing volume and distribution controls, the result would indicate that large-scale human-centric video can serve as a cost-effective proxy for scarce robot interaction data, lowering barriers to training embodied vision-language-action models. The dataset scale and curation framework represent a substantial infrastructure contribution to the field.

major comments (1)
  1. [Abstract] The claim that continued training from Qwen VLM with 1000 hours of HumanNet egocentric video surpasses the same procedure with 100 hours of Magic Cobot robot data is load-bearing for the substitution conclusion, yet the tenfold volume mismatch is unaddressed. No equal-volume HumanNet baseline (e.g., 100 hours) or explicit matching of task distributions, interaction complexity, or annotation quality between subsets is reported, so performance differences cannot be causally attributed to data source rather than quantity.
minor comments (2)
  1. [Abstract] The abstract would be strengthened by specifying the exact metrics, validation set composition (e.g., robot-specific vs. generic actions), and any statistical significance tests supporting the performance comparison; a paired-bootstrap sketch of such a test follows this list.
  2. Quantitative statistics on corpus diversity (number of environments, activity categories, or viewpoint balance) would better substantiate the claims of broad coverage and systematic curation.
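One shape such a significance check could take, offered purely as an editorial illustration: a paired bootstrap over per-example scores from the two arms on the shared validation set. The score lists here are hypothetical; the paper reports no per-example metrics.

```python
# Paired bootstrap: resample validation examples with replacement and count
# how often the HumanNet-trained arm still outscores the robot-data arm.
import random


def paired_bootstrap(scores_human, scores_robot, n_resamples=10_000, seed=0):
    """Fraction of resamples in which the human-video arm wins on average."""
    assert len(scores_human) == len(scores_robot)
    rng = random.Random(seed)
    n, wins = len(scores_human), 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        mean_delta = sum(scores_human[i] - scores_robot[i] for i in idx) / n
        if mean_delta > 0:
            wins += 1
    return wins / n_resamples
```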

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The point raised about the volume mismatch in the ablation study is well-taken and directly impacts the strength of our substitution claim. We address it below and will revise the paper accordingly.

read point-by-point responses
  1. Referee: [Abstract] The claim that continued training from Qwen VLM with 1000 hours of HumanNet egocentric video surpasses the same procedure with 100 hours of Magic Cobot robot data is load-bearing for the substitution conclusion, yet the tenfold volume mismatch is unaddressed. No equal-volume HumanNet baseline (e.g., 100 hours) or explicit matching of task distributions, interaction complexity, or annotation quality between subsets is reported, so performance differences cannot be causally attributed to data source rather than quantity.

    Authors: We agree that the tenfold volume difference (1000 hours HumanNet vs. 100 hours Magic Cobot) prevents strong causal attribution to data source alone and that the current comparison is insufficient for the substitution conclusion. In the revised manuscript we will add a controlled equal-volume baseline using 100 hours of egocentric HumanNet video under the same training protocol and validation set. We will also add a section comparing task distributions, interaction complexity, and annotation characteristics between the HumanNet subset and Magic Cobot data to clarify what is and is not matched. These additions will be placed in the experiments section and referenced from the abstract. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical comparison stands on direct experimental outcome

full rationale

The paper presents HumanNet as a dataset and validates its utility via a controlled ablation experiment comparing continued training from Qwen VLM on 1000 hours of egocentric HumanNet video versus 100 hours of Magic Cobot robot data, under fixed validation. No derivations, equations, or first-principles results are claimed. No parameters are fitted and then renamed as predictions. No self-citations are invoked to establish uniqueness theorems or load-bearing premises. The reported performance delta is a direct empirical measurement rather than a quantity that reduces to its own inputs by construction. Potential confounding from unequal data volumes affects causal interpretation but does not create circularity in any derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that human video contains transferable physical interaction knowledge that can substitute for robot data. No free parameters or invented entities are introduced; the work is empirical dataset construction rather than theoretical derivation.

axioms (1)
  • domain assumption: Human-centric video captures transferable interaction patterns usable for robot learning
    Invoked in the validation experiment and the claim that human video can substitute for robot data.

pith-pipeline@v0.9.0 · 5579 in / 1251 out tokens · 68357 ms · 2026-05-11T00:45:14.497748+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.


Reference graph

Works this paper leans on

41 extracted references · 27 canonical work pages · 8 internal anchors

  1. [1]

    Qwen2.5-VL technical report, 2025

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-VL technical report, 2025

  2. [2]

    RT-1: Robotics Transformer for Real-World Control at Scale

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022

  3. [3]

    Activitynet: A large-scale video benchmark for human activity understanding

    Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. Activitynet: A large-scale video benchmark for human activity understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 961–970, 2015

  4. [4]

    Dexycb: A benchmark for capturing hand grasping of objects

    Yu-Wei Chao, Wei Yang, Yu Xiang, Pavlo Molchanov, Ankur Handa, Jonathan Tremblay, Yashraj S Narang, Karl Van Wyk, Umar Iqbal, Stan Birchfield, et al. Dexycb: A benchmark for capturing hand grasping of objects. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9044–9053, 2021

  5. [5]

    Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling, 2024

    Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling, 2024

  6. [6]

    Embodiment Collaboration, Abby O’Neill, Abdul Rehman, Abhinav Gupta, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, Albert Tung, Alex Bewley, Alex Herzog, Alex Irpan, Alexander Khazatsky, Anant Rai, Anchit Gupta, Andrew Wang, Andrey Kolobov, Anikait Singh, Animesh Garg, Aniruddha...

  7. [7]

    Rescaling egocentric vision: Collection, pipeline and challenges for EPIC-KITCHENS-100

    Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Antonino Furnari, Evangelos Kazakos, Jian Ma, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, et al. Rescaling egocentric vision: Collection, pipeline and challenges for epic-kitchens-100. International Journal of Computer Vision, 130(1):33–55, 2022

  8. [8]

    DeepSeek-V3 technical report, 2024

    DeepSeek-AI et al. DeepSeek-V3 technical report, 2024

  9. [9]

    Rethinking video generation model for the embodied world,

    Yufan Deng, Zilin Pan, Hongyu Zhang, Xiaojie Li, Ruoqing Hu, Yufei Ding, Yiming Zou, Yan Zeng, and Daquan Zhou. Rethinking video generation model for the embodied world. arXiv preprint arXiv:2601.15282, 2026

  10. [10]

    Rh20t: A robotic dataset for learning diverse skills in one-shot

    Hao-Shu Fang, Hongjie Fang, Zhenyu Tang, Jirong Liu, Chenxi Wang, Junbo Wang, Haoyi Zhu, and Cewu Lu. Rh20t: A comprehensive robotic dataset for learning diverse skills in one-shot, 2023. URL https://arxiv.org/abs/2307.00595

  11. [11]

    Gemma 3 technical report, 2025

    Gemma Team. Gemma 3 technical report, 2025

  12. [12]

    The "something something" video database for learning and evaluating visual common sense

    Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzyńska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, Florian Hoppe, Christian Thurau, Ingo Bax, and Roland Memisevic. The "something something" video database for learning and evaluating visual common sense, 2017. URL https://arxiv.org/...

  13. [13]

    Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, Miguel Martin, Tushar Nagarajan, Ilija Radosavovic, Santhosh Ku- mar Ramakrishnan, Fiona Ryan, Jayant Sharma, Michael Wray, Mengmeng Xu, Eric Zhongcong Xu, Chen Zhao, Siddhant Bansal, Dhruv Batra, Vincent Car...

  14. [14]

    Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, Eugene Byrne, Zach Chavis, Joya Chen, Feng Cheng, Fu-Jen Chu, Sean Crane, Avijit Dasgupta, Jing Dong, Maria Escobar, Cristhian Forigua, Abrham Gebreselasie, Sanjay Haresh, Jing Huang, Md Moha...

  15. [15]

    Ava: A video dataset of spatio-temporally localized atomic visual actions

    Chunhui Gu, Chen Sun, David A. Ross, Carl Vondrick, Caroline Pantofaru, Yeqing Li, Sudheendra Vijayanarasimhan, George Toderici, Susanna Ricco, Rahul Sukthankar, Cordelia Schmid, and Jitendra Malik. Ava: A video dataset of spatio-temporally localized atomic visual actions, 2018. URL https://arxiv.org/abs/1705.08421

  16. [16]

    EgoDex: Learning Dexterous Manipulation from Large-Scale Egocentric Video

    Ryan Hoque, Peide Huang, David J. Yoon, Mouli Sivapurapu, and Jian Zhang. Egodex: Learning dexterous manipulation from large-scale egocentric video, 2026. URL https://arxiv.org/abs/2505.11709

  17. [17]

    Openego: A large-scale multimodal egocentric dataset for dexterous manipulation,

    Ahad Jawaid and Yu Xiang. Openego: A large-scale multimodal egocentric dataset for dexterous manipulation,

  18. [18]

    URL https://arxiv.org/abs/2509.05513

  19. [19]

    Egomimic: Scaling imitation learning via egocentric video, 2024

    Simar Kareer, Dhruv Patel, Ryan Punamiya, Pranay Mathur, Shuo Cheng, Chen Wang, Judy Hoffman, and Danfei Xu. Egomimic: Scaling imitation learning via egocentric video, 2024. URL https://arxiv.org/abs/2410.24221

  20. [20]

    The Kinetics Human Action Video Dataset

    Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, Mustafa Suleyman, and Andrew Zisserman. The kinetics human action video dataset, 2017. URL https://arxiv.org/abs/1705.06950

  21. [21]

    Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, Peter David Fagan, Joey Hejna, Masha Itkina, Marion Lepert, Yecheng Jason Ma, Patrick Tree Miller, Jimmy Wu, Suneel Belkhale, Shivin Dass, Huy Ha, Arhan Jain, Abraham Lee, You...

  22. [22]

    URL https://arxiv.org/abs/2403.12945

  23. [23]

    Hoi4d: A 4d egocentric dataset for category-level human-object interaction, 2024

    Yunze Liu, Yun Liu, Che Jiang, Kangbo Lyu, Weikang Wan, Hao Shen, Boqiang Liang, Zhoujie Fu, He Wang, and Li Yi. Hoi4d: A 4d egocentric dataset for category-level human-object interaction, 2024. URL https://arxiv.org/abs/2203.01577

  24. [24]

    Being-H0: Vision-language-action pretraining from large-scale human videos

    Hao Luo, Yicheng Feng, Wanpeng Zhang, Sipeng Zheng, Ye Wang, Haoqi Yuan, Jiazheng Liu, Chaoyi Xu, Qin Jin, and Zongqing Lu. Being-h0: Vision-language-action pretraining from large-scale human videos, 2025. URL https://arxiv.org/abs/2507.15597

  25. [25]

    Being-h0.5: Scaling human-centric robot learning for cross-embodiment generalization

    Hao Luo, Ye Wang, Wanpeng Zhang, Sipeng Zheng, Ziheng Xi, Chaoyi Xu, Haiweng Xu, Haoqi Yuan, Chi Zhang, Yiqing Wang, Yicheng Feng, and Zongqing Lu. Being-h0.5: Scaling human-centric robot learning for cross-embodiment generalization, 2026. URL https://arxiv.org/abs/2601.12993

  26. [26]

    Being-H0.7: A Latent World-Action Model from Egocentric Videos

    Hao Luo, Wanpeng Zhang, Yicheng Feng, Sipeng Zheng, Haiweng Xu, Chaoyi Xu, Ziheng Xi, Yuhui Fu, and Zongqing Lu. Being-h0.7: A latent world-action model from egocentric videos, 2026. URL https://arxiv.org/abs/2605.00078

  27. [27]

    Egoschema: A diagnostic benchmark for very long-form video language understanding, 2023

    Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik. Egoschema: A diagnostic benchmark for very long-form video language understanding, 2023. URL https://arxiv.org/abs/2308.09126

  28. [28]

    Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs

    Microsoft, :, Abdelrahman Abouelenin, Atabak Ashfaq, Adam Atkinson, Hany Awadalla, Nguyen Bach, Jianmin Bao, Alon Benhaim, Martin Cai, Vishrav Chaudhary, Congcong Chen, Dong Chen, Dongdong Chen, Junkun Chen, Weizhu Chen, Yen-Chun Chen, Yi ling Chen, Qi Dai, Xiyang Dai, Ruchao Fan, Mei Gao, Min Gao, Amit Garg, Abhishek Goswami, Junheng Hao, Amr Hendy, Yuxu...

  29. [29]

    Howto100m: Learning a text-video embedding by watching hundred million narrated video clips, 2019

    Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. Howto100m: Learning a text-video embedding by watching hundred million narrated video clips, 2019. URL https://arxiv.org/abs/1906.03327

  30. [30]

    R3m: A universal visual representation for robot manipulation

    Suraj Nair, Aravind Rajeswaran, Vikash Kumar, Chelsea Finn, and Abhinav Gupta. R3m: A universal visual representation for robot manipulation, 2022. URL https://arxiv.org/abs/2203.12601

  31. [31]

    GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

    NVIDIA, :, Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi "Jim" Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, Joel Jang, Zhenyu Jiang, Jan Kautz, Kaushil Kundalia, Lawrence Lao, Zhiqi Li, Zongyu Lin, Kevin Lin, Guilin Liu, Edith Llontop, Loic Magne, Ajay Mandlekar, Avnish Narayan, Soroush Nasiriany, Scott Reed, Y...

  32. [32]

    Ryan Punamiya, Simar Kareer, Zeyi Liu, Josh Citron, Ri-Zhao Qiu, Xiongyi Cai, Alexey Gavryushin, Jiaqi Chen, Davide Liconti, Lawrence Y. Zhu, Patcharapong Aphiwetsa, Baoyu Li, Aniketh Cheluva, Pranav Kuppili, Yangcen Liu, Dhruv Patel, Aidan Gao, Hye-Young Chung, Ryan Co, Renee Zbizika, Jeff Liu, Xiaomeng Xu, Haoyu Xiong, Geng Chen, Sebastiano Oliani, Chen...

  33. [33]

    Assembly101: A large-scale multi-view video dataset for understanding procedural activities

    Fadime Sener, Dibyadip Chatterjee, Daniel Shelepov, Kun He, Dipika Singhania, Robert Wang, and Angela Yao. Assembly101: A large-scale multi-view video dataset for understanding procedural activities. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21096–21106, 2022

  34. [34]

    Finegym: A hierarchical video dataset for fine-grained action understanding, 2020

    Dian Shao, Yue Zhao, Bo Dai, and Dahua Lin. Finegym: A hierarchical video dataset for fine-grained action understanding, 2020. URL https://arxiv.org/abs/2004.06704

  35. [35]

    Hollywood in homes: Crowdsourcing data collection for activity understanding

    Gunnar A. Sigurdsson, Gül Varol, Xiaolong Wang, Ali Farhadi, Ivan Laptev, and Abhinav Gupta. Hollywood in homes: Crowdsourcing data collection for activity understanding, 2016. URL https://arxiv.org/abs/1604.01753

  36. [36]

    A pragmatic VLA foundation model

    Wei Wu, Fan Lu, Yunnan Wang, Shuai Yang, Shi Liu, Fangjing Wang, Qian Zhu, He Sun, Yong Wang, Shuailei Ma, et al. A pragmatic VLA foundation model. arXiv preprint arXiv:2601.18692, 2026

  37. [37]

    Human2robot: Learning robot actions from paired human-robot videos, 2025

    Sicheng Xie, Haidong Cao, Zejia Weng, Zhen Xing, Haoran Chen, Shiwei Shen, Jiaqi Leng, Zuxuan Wu, and Yu-Gang Jiang. Human2robot: Learning robot actions from paired human-robot videos, 2025. URL https://arxiv.org/abs/2502.16587

  38. [38]

    Qwen3 technical report, 2025

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, et al. Qwen3 technical report, 2025

  39. [39]

    Hacs: Human action clips and segments dataset for recognition and temporal localization, 2019

    Hang Zhao, Antonio Torralba, Lorenzo Torresani, and Zhicheng Yan. Hacs: Human action clips and segments dataset for recognition and temporal localization, 2019. URL https://arxiv.org/abs/1712.09374

  40. [40]

    EgoScale: Scaling Dexterous Manipulation with Diverse Egocentric Human Data

    Ruijie Zheng, Dantong Niu, Yuqi Xie, Jing Wang, Mengda Xu, Yunfan Jiang, Fernando Castañeda, Fengyuan Hu, You Liang Tan, Letian Fu, Trevor Darrell, Furong Huang, Yuke Zhu, Danfei Xu, and Linxi Fan. Egoscale: Scaling dexterous manipulation with diverse egocentric human data, 2026. URL https://arxiv.org/abs/2602.16710

  41. [41]

    Rt-2: Vision-language-action models transfer web knowledge to robotic control

    Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, pages 2165–2183. PMLR, 2023