pith · machine review for the scientific record

arxiv: 2604.23570 · v1 · submitted 2026-04-26 · 💻 cs.RO

Recognition: unknown

EgoLive: A Large-Scale Egocentric Dataset from Real-World Human Tasks

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 06:14 UTC · model grok-4.3

classification 💻 cs.RO
keywords egocentric dataset · robot manipulation learning · real-world data collection · human task routines · multi-modal annotations · head-mounted video · ecological validity

The pith

EgoLive is the largest open-source egocentric dataset of real-world human tasks collected for advancing robot manipulation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces EgoLive as a large-scale egocentric video dataset gathered from people performing routine tasks in everyday settings like home service and retail work. It claims three main advantages: greater size than previous datasets, higher data quality from a custom head-mounted device with multi-modal annotations, and better realism from unconstrained real-world collection. If correct, this would allow robot learning to use more scalable and transferable data than lab teleoperation methods provide. The dataset aims to support the development of generalizable robotic models that can deploy in practical environments.

Core claim

The authors establish EgoLive as the largest annotated egocentric dataset focused on real-world task-oriented human routines. It delivers leading data quality via a customized head-mounted capture device and comprehensive high-precision multi-modal annotations. Data collection occurs exclusively in unconstrained real-world scenarios encompassing home service, retail, and other practical work, providing superior diversity and ecological validity.

What carries the argument

The custom head-mounted capture device and the multi-modal annotations applied to unconstrained recordings of real-world human work.

If this is right

  • Training on this dataset will produce robot policies with better performance in diverse real environments.
  • Scalable collection of natural human demonstrations becomes feasible for manipulation learning.
  • Multi-modal data supports richer training signals for robotic systems.
  • Vertical-domain field data (home service, retail) improves applicability to robots deployed in those settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Future benchmarks could compare robot performance directly on tasks represented in the dataset.
  • Privacy-preserving methods may need development if this approach scales to more recordings.
  • Hybrid datasets combining this with simulated data could further enhance learning efficiency.

Load-bearing premise

The premise that unconstrained real-world egocentric human videos will yield data scalable and transferable enough to improve robot manipulation beyond teleoperation or lab datasets.

What would settle it

A side-by-side comparison in which robots trained on EgoLive data and robots trained on existing datasets attempt the same manipulation tasks in real settings; if the EgoLive-trained policies show no advantage, the central claim fails.
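
As one concrete way to read that test, the sketch below compares per-task success rates of an EgoLive-trained policy against a baseline-trained policy with a simple two-proportion z-test; a consistent, significant gap in either direction would settle the question. Everything in it (task names, trial counts, success figures) is hypothetical, not data from the paper.

```python
# Hedged sketch, not from the paper: one way the side-by-side test could be
# scored. Task names, trial counts, and success counts below are hypothetical.
from math import sqrt, erf

def two_proportion_z(success_a, n_a, success_b, n_b):
    """Two-sided z-test for a difference in task success rates."""
    p_a, p_b = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se if se > 0 else 0.0
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # two normal tails
    return p_a, p_b, z, p_value

# Hypothetical results: 50 real-world rollouts per task per policy.
trials = {  # task: (EgoLive-trained successes, baseline-trained successes)
    "shelf_restock": (38, 29),
    "table_cleanup": (41, 35),
}
for task, (ego, base) in trials.items():
    p_e, p_b, z, p = two_proportion_z(ego, 50, base, 50)
    print(f"{task}: EgoLive {p_e:.2f} vs baseline {p_b:.2f} (z={z:.2f}, p={p:.3f})")
```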

Original abstract

The advancement of robot learning is currently hindered by the scarcity of large-scale, high-quality datasets. While established data collection methods such as teleoperation and universal manipulation interfaces dominate current datasets, they suffer from inherent limitations in scalability and real-world deployability. Human egocentric video collection, by contrast, has emerged as a promising approach to enable scalable, natural and in-the-wild data collection. As such, we present EgoLive, a large-scale, high-quality egocentric dataset designed explicitly for robot manipulation learning. EgoLive establishes three distinctive technical advantages over existing egocentric datasets: first, it represents the largest open-source annotated egocentric dataset focused on real-world task-oriented human routines to date; second, it delivers leading data quality via a customized head-mounted capture device and comprehensive high-precision multi-modal annotations; third, all data is collected exclusively in unconstrained real-world scenarios and encompasses vertical field human working data, including home service, retail, and other practical work scenarios, providing superior diversity and ecological validity. With the introduction of EgoLive, we aim to provide the research community with a scalable, high-quality dataset that accelerates breakthroughs in generalizable robotic models and facilitates the real-world deployment of robot systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces EgoLive, a large-scale egocentric video dataset collected from unconstrained real-world human tasks (e.g., home service, retail) using a customized head-mounted capture device. It provides multi-modal annotations and positions the release as overcoming limitations of teleoperation and lab-based datasets by offering superior scale, data quality, and ecological validity to accelerate generalizable robot manipulation learning.

Significance. If the dataset's scale, annotation quality, and real-world diversity claims hold, EgoLive could provide a valuable open resource for robot learning research, enabling training on natural human routines that may improve policy generalization and real-world deployability beyond current teleoperated datasets.

major comments (3)
  1. [Abstract] The three claimed technical advantages (largest open-source annotated egocentric dataset for task-oriented routines, leading data quality, superior ecological validity) are asserted via device description and scenario coverage but lack quantitative comparisons (e.g., size, annotation density, task diversity metrics) to prior datasets such as Ego4D or EPIC-KITCHENS.
  2. [Abstract] The central claim that EgoLive will accelerate breakthroughs in generalizable robotic models and facilitate real-world robot deployment is not supported by any experiments, baseline policy training results, or transferability evaluations showing improved success rates or generalization on manipulation tasks.
  3. [Dataset Collection and Annotation (inferred from abstract claims)] The manuscript provides no details on annotation protocols, quality control procedures, inter-annotator agreement, or validation metrics for the high-precision multi-modal annotations, preventing verification of the 'leading data quality' advantage.
minor comments (2)
  1. Add a dedicated comparison table in the related work or dataset sections listing EgoLive statistics against existing egocentric datasets to make the scale and diversity claims concrete and verifiable (a minimal illustrative sketch of such a table follows this report).
  2. Clarify the exact sensor specifications, synchronization methods, and annotation taxonomy in the methods section to allow reproducibility and assessment of the customized capture device.
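
To make the referee's first minor comment concrete, here is a minimal sketch of the kind of comparison table being asked for. The column set (video hours, annotated segments, task categories, annotation density) is an assumption about what such a table might report, and every number below is a placeholder rather than a published statistic for EgoLive or any prior dataset.

```python
# Illustrative sketch only: placeholder statistics, assumed column set.
from dataclasses import dataclass

@dataclass
class DatasetStats:
    name: str
    video_hours: float        # total recorded footage
    annotated_segments: int   # labeled clips / action segments
    task_categories: int      # distinct task labels
    scenario_types: int       # e.g. home service, retail, office

    @property
    def annotation_density(self) -> float:
        """Annotated segments per hour of video."""
        return self.annotated_segments / self.video_hours

def comparison_table(datasets: list[DatasetStats]) -> str:
    header = f"{'dataset':<16}{'hours':>8}{'segments':>10}{'tasks':>7}{'dens./h':>9}"
    rows = [header]
    for d in datasets:
        rows.append(
            f"{d.name:<16}{d.video_hours:>8.0f}{d.annotated_segments:>10d}"
            f"{d.task_categories:>7d}{d.annotation_density:>9.1f}"
        )
    return "\n".join(rows)

# Placeholder rows; swap in the datasets' actual published statistics.
print(comparison_table([
    DatasetStats("EgoLive", 1000, 400_000, 150, 6),   # placeholders
    DatasetStats("Ego4D", 1000, 400_000, 150, 6),     # placeholders
]))
```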

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for the constructive feedback on our manuscript. We address each of the major comments point by point below, indicating the revisions we plan to make.

Point-by-point responses
  1. Referee: [Abstract] The three claimed technical advantages (largest open-source annotated egocentric dataset for task-oriented routines, leading data quality, superior ecological validity) are asserted via device description and scenario coverage but lack quantitative comparisons (e.g., size, annotation density, task diversity metrics) to prior datasets such as Ego4D or EPIC-KITCHENS.

    Authors: We agree that quantitative comparisons are needed to substantiate the claims. In the revised manuscript, we will add a comparison table (likely in the Dataset section) that reports concrete metrics including total video hours, annotated frames or segments, number of task categories, scenario diversity (e.g., indoor/outdoor, home/retail), and annotation density for EgoLive versus Ego4D, EPIC-KITCHENS, and similar datasets. This will provide objective evidence for the scale and validity advantages. revision: yes

  2. Referee: [Abstract] The central claim that EgoLive will accelerate breakthroughs in generalizable robotic models and facilitate real-world robot deployment is not supported by any experiments, baseline policy training results, or transferability evaluations showing improved success rates or generalization on manipulation tasks.

    Authors: The paper is a dataset release and does not contain robot learning experiments, which would require substantial additional work outside the current scope. We will revise the abstract language to avoid overclaiming by changing the phrasing to indicate that EgoLive 'is designed to enable' or 'has the potential to accelerate' breakthroughs in generalizable models and real-world deployment, rather than asserting that it will directly do so. This keeps the intended motivation while removing unsupported assertions. revision: partial

  3. Referee: [Dataset Collection and Annotation (inferred from abstract claims)] The manuscript provides no details on annotation protocols, quality control procedures, inter-annotator agreement, or validation metrics for the high-precision multi-modal annotations, preventing verification of the 'leading data quality' advantage.

    Authors: We agree that these procedural details are required to verify the quality claims. We will expand the relevant section of the revised manuscript with a full description of the annotation pipeline, including tools and interfaces used, step-by-step protocols for each modality, quality-control steps (multiple independent annotators plus expert review), inter-annotator agreement measures (e.g., IoU for bounding boxes, kappa for action labels), and any automated consistency checks performed (a minimal sketch of these two agreement measures follows this list). revision: yes
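
As a reading aid for the agreement measures named in the last response, here is a minimal sketch of IoU for bounding boxes and Cohen's kappa for action labels. The box format and the toy annotator labels are assumptions for illustration, not taken from the paper's annotation pipeline.

```python
# Hedged sketch: standard definitions of the two agreement measures named above.
def box_iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators' categorical labels."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    classes = set(labels_a) | set(labels_b)
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    expected = sum((labels_a.count(c) / n) * (labels_b.count(c) / n) for c in classes)
    return (observed - expected) / (1 - expected) if expected < 1 else 1.0

# Toy example: two annotators boxing and labeling the same clips (hypothetical).
print(box_iou((0, 0, 10, 10), (5, 5, 15, 15)))   # ~0.14
print(cohens_kappa(["pick", "place", "pick", "idle", "pick"],
                   ["pick", "place", "idle", "idle", "pick"]))   # ~0.69
```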

Circularity Check

0 steps flagged

No circularity: dataset release with descriptive claims only

Full rationale

The manuscript introduces EgoLive as a large-scale egocentric dataset collected via head-mounted devices in real-world scenarios. Its central claims concern scale, annotation quality, and ecological validity, all asserted directly from the collection protocol and device description rather than from any derivation, equation, or fitted model. No predictions, first-principles results, or parameter estimations appear; therefore none can reduce to inputs by construction. Self-citations, if present, are not load-bearing for any derived quantity. The absence of policy-training experiments is an evidence gap for the transferability claim but does not constitute circularity in the paper's stated contributions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a dataset introduction paper. No free parameters, axioms, or invented entities are introduced because there are no derivations or new theoretical constructs.

pith-pipeline@v0.9.0 · 5615 in / 1103 out tokens · 70033 ms · 2026-05-08T06:14:30.836534+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

52 extracted references · 18 canonical work pages · 3 internal anchors

  1. [1]

    Create unlimited productivity via intelligent machines

    AgiBot. Create unlimited productivity via intelligent machines. https://www.agibot.com/, 2026. Accessed: 2026-04-06

  2. [2]

    BoT-SORT: Robust associations multi-pedestrian tracking

    Nir Aharon, Roy Orfaig, and Ben-Zion Bobrovsky. BoT-SORT: Robust associations multi-pedestrian tracking. arXiv preprint arXiv:2206.14651, 2022

  3. [3]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-VL technical report. arXiv preprint arXiv:2511.21631, 2025

  4. [4]

    HOT3D: Hand and object tracking in 3D from egocentric multi-view videos

    Prithviraj Banerjee, Sindi Shkodrani, Pierre Moulon, Shreyas Hampali, Shangchen Han, Fan Zhang, Linguang Zhang, Jade Fountain, Edward Miller, Selen Basol, et al. HOT3D: Hand and object tracking in 3D from egocentric multi-view videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7061–7071, 2025

  5. [5]

    H-RDT: Human manipulation enhanced bimanual robotic manipulation

    Hongzhe Bi, Lingxuan Wu, Tianwei Lin, Hengkai Tan, Zhizhong Su, Hang Su, and Jun Zhu. H-RDT: Human manipulation enhanced bimanual robotic manipulation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 18135–18143, 2026

  6. [6]

    Language models are few-shot learners

    Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems (NeurIPS), 2020

  7. [7]

    Egocentric-10K

    AI Build. Egocentric-10K. URL https://huggingface.co/datasets/builddotai/Egocentric-10K, 2025

  8. [8]

    ORB-SLAM3: An accurate open-source library for visual, visual–inertial and multi-map SLAM

    Carlos Campos, Richard Elvira, Juan J. Gómez Rodríguez, José M. M. Montiel, and Juan D. Tardós. ORB-SLAM3: An accurate open-source library for visual, visual–inertial and multi-map SLAM. IEEE Transactions on Robotics, 37(6):1874–1890, 2021. doi: 10.1109/TRO.2021.3075644

  9. [9]

    On scaling up a multilingual vision and language model

    Xi Chen, Josip Djolonga, Piotr Padlewski, Basil Mustafa, Soravit Changpinyo, Jialin Wu, Carlos Riquelme Ruiz, Sebastian Goodman, Xiao Wang, Yi Tay, Siamak Shakeri, Mostafa Dehghani, Daniel Salz, Mario Lucic, Michael Tschannen, Arsha Nagrani, et al. On scaling up a multilingual vision and language model. In Proceedings of the IEEE/CVF Conference on Computer...

  10. [10]

    Universal Manipulation Interface: In-The-Wild Robot Teaching Without In-The-Wild Robots

    Cheng Chi, Zhenjia Xu, Chuer Pan, Eric Cousineau, Benjamin Burchfiel, Siyuan Feng, Russ Tedrake, and Shuran Song. Universal manipulation interface: In-the-wild robot teaching without in-the-wild robots. arXiv preprint arXiv:2402.10329, 2024

  11. [11]

    Rescaling egocentric vision: Collection, pipeline and challenges for EPIC-KITCHENS-100

    Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Antonino Furnari, Evangelos Kazakos, Jian Ma, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, et al. Rescaling egocentric vision: Collection, pipeline and challenges for EPIC-KITCHENS-100. International Journal of Computer Vision, 130(1):33–55, 2022

  12. [12]

    Egocentric human-object interaction detection: A new benchmark and method

    Kunyuan Deng, Yi Wang, and Lap-Pui Chau. Egocentric human-object interaction detection: A new benchmark and method. Expert Systems with Applications, 300:130216, 2026

  13. [13]

    Datacomp: In search of the next generation of multimodal datasets

    Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, Jonathan Hayase, Georgios Smyrnis, Thao Nguyen, Ryan Marten, Mitchell Wortsman, et al. Datacomp: In search of the next generation of multimodal datasets. Advances in Neural Information Processing Systems, 36:22046–22078, 2023

  14. [14]

    GEN-0: Embodied foundation models that scale with physical interaction

    Generalist AI Team. GEN-0: Embodied foundation models that scale with physical interaction. https://generalistai.com/blog/nov-04-2025-GEN-0, 2025. Accessed: 2026-04-16

  15. [15]

    GEN-1: Scaling embodied foundation models to mastery

    Generalist AI Team. GEN-1: Scaling embodied foundation models to mastery. https://generalistai.com/blog/apr-02-2026-GEN-1, 2025. Accessed: 2026-04-16

  16. [16]

    Ego4D: Around the world in 3,000 hours of egocentric video

    Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4D: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18995–19012, 2022

  17. [17]

    ReMix: Optimizing data mixtures for large scale imitation learning

    Joey Hejna, Chethan Anand Bhateja, Yichen Jiang, Karl Pertsch, and Dorsa Sadigh. ReMix: Optimizing data mixtures for large scale imitation learning. In Pulkit Agrawal, Oliver Kroemer, and Wolfram Burgard, editors, Proceedings of The 8th Conference on Robot Learning, volume 270 of Proceedings of Machine Learning Research, pages 145–164. PMLR, 06–09 Nov 2025...

  18. [18]

    EgoDex: Learning dexterous manipulation from large-scale egocentric video

    Ryan Hoque, Peide Huang, David J. Yoon, Mouli Sivapurapu, and Jian Zhang. EgoDex: Learning dexterous manipulation from large-scale egocentric video. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=FFxkFMU89E

  19. [19]

    EgoMimic: Scaling imitation learning via egocentric video

    Simar Kareer, Dhruv Patel, Ryan Punamiya, Pranay Mathur, Shuo Cheng, Chen Wang, Judy Hoffman, and Danfei Xu. EgoMimic: Scaling imitation learning via egocentric video. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 13226–13233. IEEE, 2025

  20. [20]

    Phantom: Training robots without robots using only human videos

    Marion Lepert, Jiaying Fang, and Jeannette Bohg. Phantom: Training robots without robots using only human videos. In 9th Annual Conference on Robot Learning, 2025. URL https://openreview.net/forum?id=BTUioBmCWo

  21. [21]

    Masquerade: Learning from in-the-wild human videos using data-editing

    Marion Lepert, Jiaying Fang, and Jeannette Bohg. Masquerade: Learning from in-the-wild human videos using data-editing. In IEEE International Conference on Robotics and Automation (ICRA), 2026

  22. [22]

    MimicDreamer: Aligning human and robot demonstrations for scalable VLA training

    Haoyun Li, Ivan Zhang, Runqi Ouyang, Xiaofeng Wang, Zheng Zhu, Zhiqin Yang, Zhentao Zhang, Boyuan Wang, Chaojun Ni, Wenkang Qin, et al. MimicDreamer: Aligning human and robot demonstrations for scalable VLA training. arXiv preprint arXiv:2509.22199, 2025

  23. [23]

    VITRA: Scalable vision-language-action model pretraining for robotic manipulation with real-life human activity videos

    Qixiu Li, Yu Deng, Yaobo Liang, Lin Luo, Lei Zhou, Chengtang Yao, Lingqi Zeng, Zhiyuan Feng, Huizhi Liang, Sicheng Xu, et al. VITRA: Scalable vision-language-action model pretraining for robotic manipulation with real-life human activity videos. In IEEE International Conference on Robotics and Automation (ICRA), 2026

  24. [24]

    HOI4D: A 4D egocentric dataset for category-level human-object interaction

    Yunze Liu, Yun Liu, Che Jiang, Kangbo Lyu, Weikang Wan, Hao Shen, Boqiang Liang, Zhoujie Fu, He Wang, and Li Yi. HOI4D: A 4D egocentric dataset for category-level human-object interaction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21013–21022, 2022

  25. [25]

    Being-H0: Vision-language-action pretraining from large-scale human videos

    Hao Luo, Yicheng Feng, Wanpeng Zhang, Sipeng Zheng, Ye Wang, Haoqi Yuan, Jiazheng Liu, Chaoyi Xu, Qin Jin, and Zongqing Lu. Being-H0: Vision-language-action pretraining from large-scale human videos. arXiv preprint arXiv:2507.15597, 2025

  26. [26]

    Being-H0.5: Scaling human-centric robot learning for cross-embodiment generalization

    Hao Luo, Ye Wang, Wanpeng Zhang, Sipeng Zheng, Ziheng Xi, Chaoyi Xu, Haiweng Xu, Haoqi Yuan, Chi Zhang, Yiqing Wang, et al. Being-H0.5: Scaling human-centric robot learning for cross-embodiment generalization. arXiv preprint arXiv:2601.12993, 2026

  27. [27]

    Cosmos-embed1-448p: Vision-language embedding model for multimodal representation learning

    NVIDIA. Cosmos-embed1-448p: Vision-language embedding model for multimodal representation learning. Technical Report, 2024. URL https://www.nvidia.com

  28. [28]

    GR00T N1: An open foundation model for generalist humanoid robots

    NVIDIA, Johan Bjorck, Nikita Cherniadev, Fernando Castañeda, Xingye Da, Runyu Ding, Linxi "Jim" Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, Joel Jang, Zhenyu Jiang, Jan Kautz, Kaushil Kundalia, Lawrence Lao, Zhiqi Li, Zongyu Lin, Kevin Lin, Guilin Liu, Edith Llontop, Loic Magne, Ajay Mandlekar, Avnish Narayan, Soroush Nasiriany, Scott Reed, You L...

  29. [29]

    Reconstructing hands in 3D with transformers

    Georgios Pavlakos, Dandan Shan, Ilija Radosavovic, Angjoo Kanazawa, David Fouhey, and Jitendra Malik. Reconstructing hands in 3D with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9826–9836, 2024

  30. [30]

    Bringing general-purpose AI to the physical world

    Physical Intelligence. Bringing general-purpose AI to the physical world. https://www.pi.website, 2026. Accessed: 2026-04-16

  31. [31]

    EgoBridge: Domain adaptation for generalizable imitation from egocentric human data

    Ryan Punamiya, Dhruv Patel, Patcharapong Aphiwetsa, Pranav Kuppili, Lawrence Y Zhu, Simar Kareer, Judy Hoffman, and Danfei Xu. EgoBridge: Domain adaptation for generalizable imitation from egocentric human data. In Human to Robot: Workshop on Sensorizing, Modeling, and Learning from Humans, 2025

  32. [32]

    Humanoid policy ∼ human policy

    Ri-Zhao Qiu, Shiqi Yang, Xuxin Cheng, Chaitanya Chawla, Jialong Li, Tairan He, Ge Yan, David J Yoon, Ryan Hoque, Lars Paulsen, et al. Humanoid policy ∼ human policy. arXiv preprint arXiv:2503.13441, 2025

  33. [33]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning, pages 8748–8763. PMLR, 2021

  34. [34]

    SAM 2: Segment Anything in Images and Videos

    Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, and Christoph Feichtenhofer. SAM 2: Segment anything in images and videos. arXiv preprint arXiv:2408....

  35. [35]

    Embodied hands: Modeling and capturing hands and bodies together

    Javier Romero, Dimitrios Tzionas, and Michael J. Black. Embodied hands: Modeling and capturing hands and bodies together. ACM Transactions on Graphics (Proc. SIGGRAPH Asia), November 2017. URL http://doi.acm.org/10.1145/3130800.3130883

  36. [36]

    Xperience-10M

    AI Ropedia. Xperience-10M. URL https://huggingface.co/datasets/ropedia-ai/xperience-10m, 2026

  37. [37]

    LAION-5B: An open large-scale dataset for training next generation image-text models

    Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Sasanka Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. LAION-5B: An open large-scale dataset for training next generation image-text models.A...

  38. [38]

    Understanding human hands in contact at internet scale

    Dandan Shan, Jiaqi Geng, Michelle Shu, and David Fouhey. Understanding human hands in contact at internet scale. 2020

  39. [39]

    EgoHumanoid: Unlocking in-the-wild loco-manipulation with robot-free egocentric demonstration

    Modi Shi, Shijia Peng, Jin Chen, Haoran Jiang, Yinghui Li, Di Huang, Ping Luo, Hongyang Li, and Li Chen. EgoHumanoid: Unlocking in-the-wild loco-manipulation with robot-free egocentric demonstration, 2026. URL https://arxiv.org/abs/2602.10106

  40. [40]

    Not your mother's view: the dynamics of toddler visual experience

    Linda B. Smith, Chen Yu, and Alfredo F. Pereira. Not your mother's view: the dynamics of toddler visual experience. Developmental Science, 14(1):9–17, 2011. doi: https://doi.org/10.1111/j.1467-7687.2009.00947.x. URL https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1467-7687.2009.00947.x

  41. [41]

    Tesla AI and Robotics

    Tesla, Inc. Tesla AI and Robotics. https://www.tesla.com/AI, 2026. Accessed: 2026-04-16

  42. [42]

    Open X-Embodiment: Robotic learning datasets and RT-X models

    Quan Vuong, Sergey Levine, Homer Rich Walke, Karl Pertsch, Anikait Singh, Ria Doshi, Charles Xu, Jianlan Luo, Liam Tan, Dhruv Shah, Chelsea Finn, Max Du, Moo Jin Kim, Alexander Khazatsky, Jonathan Heewon Yang, Tony Z. Zhao, et al. Open X-Embodiment: Robotic learning datasets and RT-X models. In Towards Generalist Robots: Learning Paradigms for Scalable Skil...

  43. [43]

    FoundationStereo: Zero-shot stereo matching

    Bowen Wen, Matthew Trepte, Joseph Aribido, Jan Kautz, Orazio Gallo, and Stan Birchfield. FoundationStereo: Zero-shot stereo matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5249–5260, 2025

  44. [44]

    ZeroWBC: Learning natural visuomotor humanoid control directly from human egocentric video

    Haoran Yang, Jiacheng Bao, Yucheng Xin, Haoming Song, Yuyang Tian, Bin Zhao, Dong Wang, and Xuelong Li. ZeroWBC: Learning natural visuomotor humanoid control directly from human egocentric video. arXiv preprint arXiv:2603.09170, 2026

  45. [45]

    EgoVLA: Learning vision-language-action models from egocentric human videos

    Ruihan Yang, Qinxi Yu, Yecheng Wu, Rui Yan, Borui Li, An-Chieh Cheng, Xueyan Zou, Yunhao Fang, Xuxin Cheng, Ri-Zhao Qiu, et al. EgoVLA: Learning vision-language-action models from egocentric human videos. arXiv preprint arXiv:2507.12440, 2025

  46. [46]

    Embodied attention and word learning by toddlers

    Chen Yu and Linda B. Smith. Embodied attention and word learning by toddlers. Cognition, 125(2):244–262, 2012. ISSN 0010-0277. doi: https://doi.org/10.1016/j.cognition.2012.06.016. URL https://www.sciencedirect.com/science/article/pii/S0010027712001369

  48. [48]

    EgoMI: Learning active vision and whole-body manipulation from egocentric human demonstrations

    Justin Yu, Yide Shentu, Di Wu, Pieter Abbeel, Ken Goldberg, and Philipp Wu. EgoMI: Learning active vision and whole-body manipulation from egocentric human demonstrations. arXiv preprint arXiv:2511.00153, 2025

  49. [49]

    SCIZOR: A self-supervised approach to data curation for large-scale imitation learning

    Yu Zhang, Yuqi Xie, Huihan Liu, Rutav Shah, Michael Wan, Linxi Fan, and Yuke Zhu. SCIZOR: A self-supervised approach to data curation for large-scale imitation learning. In IEEE International Conference on Robotics and Automation (ICRA), 2026. To appear

  50. [50]

    Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware

    Tony Z. Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware. In Proceedings of Robotics: Science and Systems, Daegu, Republic of Korea, July 2023. doi: 10.15607/RSS.2023.XIX.016

  51. [51]

    FastUMI: A scalable and hardware-independent universal manipulation interface with dataset

    Zhaxizhuoma, Kehui Liu, Chuyue Guan, Zhongjie Jia, Ziniu Wu, Xin Liu, Tianyu Wang, Shuai Liang, Pengan Chen, Pingrui Zhang, Haoming Song, Delin Qu, Dong Wang, Zhigang Wang, Nieqing Cao, Yan Ding, Bin Zhao, and Xuelong Li. FastUMI: A scalable and hardware-independent universal manipulation interface with dataset. In 9th Annual Conference on Robot Learning,...

  52. [52]

    EgoScale: Scaling dexterous manipulation with diverse egocentric human data

    Ruijie Zheng, Dantong Niu, Yuqi Xie, Jing Wang, Mengda Xu, Yunfan Jiang, Fernando Castañeda, Fengyuan Hu, You Liang Tan, Letian Fu, et al. EgoScale: Scaling dexterous manipulation with diverse egocentric human data. arXiv preprint arXiv:2602.16710, 2026