pith · machine review for the scientific record

arxiv: 2604.23570 · v1 · submitted 2026-04-26 · 💻 cs.RO

Recognition: unknown

EgoLive: A Large-Scale Egocentric Dataset from Real-World Human Tasks

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 06:14 UTC · model grok-4.3

classification 💻 cs.RO
keywords egocentric dataset · robot manipulation learning · real-world data collection · human task routines · multi-modal annotations · head-mounted video · ecological validity

The pith

EgoLive is the largest open-source egocentric dataset of real-world human tasks collected for advancing robot manipulation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces EgoLive as a large-scale egocentric video dataset gathered from people performing routine tasks in everyday settings like home service and retail work. It claims three main advantages: greater size than previous datasets, higher data quality from a custom head-mounted device with multi-modal annotations, and better realism from unconstrained real-world collection. If correct, this would allow robot learning to use more scalable and transferable data than lab teleoperation methods provide. The dataset aims to support the development of generalizable robotic models that can deploy in practical environments.

Core claim

The authors establish EgoLive as the largest annotated egocentric dataset focused on real-world task-oriented human routines. It delivers leading data quality via a customized head-mounted capture device and comprehensive high-precision multi-modal annotations. Data collection occurs exclusively in unconstrained real-world scenarios encompassing home service, retail, and other practical work, providing superior diversity and ecological validity.

What carries the argument

The custom head-mounted capture device and the multi-modal annotations applied to unconstrained recordings of real-world human work.

If this is right

  • Training on this dataset will produce robot policies with better performance in diverse real environments.
  • Scalable collection of natural human demonstrations becomes feasible for manipulation learning.
  • Multi-modal data supports richer training signals for robotic systems.
  • Vertical-domain field data (home service, retail) improves applicability to robots deployed in those settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Future benchmarks could compare robot performance directly on tasks represented in the dataset.
  • Privacy-preserving methods may need development if this approach scales to more recordings.
  • Hybrid datasets combining this with simulated data could further enhance learning efficiency.

Load-bearing premise

The premise that unconstrained real-world egocentric human videos will yield data scalable and transferable enough to improve robot manipulation beyond teleoperation or lab datasets.

What would settle it

A side-by-side comparison in which robots trained on EgoLive data and robots trained on existing datasets attempt the same manipulation tasks in real settings; if the EgoLive-trained policies show no advantage, the central claim fails.
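
As one concrete way to read that test, the sketch below compares per-task success rates of an EgoLive-trained policy against a baseline-trained policy with a simple two-proportion z-test; a consistent, significant gap in either direction would settle the question. Everything in it (task names, trial counts, success figures) is hypothetical, not data from the paper.

```python
# Hedged sketch, not from the paper: one way the side-by-side test could be
# scored. Task names, trial counts, and success counts below are hypothetical.
from math import sqrt, erf

def two_proportion_z(success_a, n_a, success_b, n_b):
    """Two-sided z-test for a difference in task success rates."""
    p_a, p_b = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se if se > 0 else 0.0
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # two normal tails
    return p_a, p_b, z, p_value

# Hypothetical results: 50 real-world rollouts per task per policy.
trials = {  # task: (EgoLive-trained successes, baseline-trained successes)
    "shelf_restock": (38, 29),
    "table_cleanup": (41, 35),
}
for task, (ego, base) in trials.items():
    p_e, p_b, z, p = two_proportion_z(ego, 50, base, 50)
    print(f"{task}: EgoLive {p_e:.2f} vs baseline {p_b:.2f} (z={z:.2f}, p={p:.3f})")
```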

Original abstract

The advancement of robot learning is currently hindered by the scarcity of large-scale, high-quality datasets. While established data collection methods such as teleoperation and universal manipulation interfaces dominate current datasets, they suffer from inherent limitations in scalability and real-world deployability. Human egocentric video collection, by contrast, has emerged as a promising approach to enable scalable, natural and in-the-wild data collection. As such, we present EgoLive, a large-scale, high-quality egocentric dataset designed explicitly for robot manipulation learning. EgoLive establishes three distinctive technical advantages over existing egocentric datasets: first, it represents the largest open-source annotated egocentric dataset focused on real-world task-oriented human routines to date; second, it delivers leading data quality via a customized head-mounted capture device and comprehensive high-precision multi-modal annotations; third, all data is collected exclusively in unconstrained real-world scenarios and encompasses vertical field human working data, including home service, retail, and other practical work scenarios, providing superior diversity and ecological validity. With the introduction of EgoLive, we aim to provide the research community with a scalable, high-quality dataset that accelerates breakthroughs in generalizable robotic models and facilitates the real-world deployment of robot systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces EgoLive, a large-scale egocentric video dataset collected from unconstrained real-world human tasks (e.g., home service, retail) using a customized head-mounted capture device. It provides multi-modal annotations and positions the release as overcoming limitations of teleoperation and lab-based datasets by offering superior scale, data quality, and ecological validity to accelerate generalizable robot manipulation learning.

Significance. If the dataset's scale, annotation quality, and real-world diversity claims hold, EgoLive could provide a valuable open resource for robot learning research, enabling training on natural human routines that may improve policy generalization and real-world deployability beyond current teleoperated datasets.

major comments (3)
  1. [Abstract] The three claimed technical advantages (largest open-source annotated egocentric dataset for task-oriented routines, leading data quality, superior ecological validity) are asserted via device description and scenario coverage but lack quantitative comparisons (e.g., size, annotation density, task diversity metrics) to prior datasets such as Ego4D or EPIC-KITCHENS.
  2. [Abstract] The central claim that EgoLive will accelerate breakthroughs in generalizable robotic models and facilitate real-world robot deployment is not supported by any experiments, baseline policy training results, or transferability evaluations showing improved success rates or generalization on manipulation tasks.
  3. [Dataset Collection and Annotation (inferred from abstract claims)] The manuscript provides no details on annotation protocols, quality control procedures, inter-annotator agreement, or validation metrics for the high-precision multi-modal annotations, preventing verification of the 'leading data quality' advantage.
minor comments (2)
  1. Add a dedicated comparison table in the related work or dataset sections listing EgoLive statistics against existing egocentric datasets to make the scale and diversity claims concrete and verifiable (a minimal illustrative sketch of such a table follows this report).
  2. Clarify the exact sensor specifications, synchronization methods, and annotation taxonomy in the methods section to allow reproducibility and assessment of the customized capture device.
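
To make the referee's first minor comment concrete, here is a minimal sketch of the kind of comparison table being asked for. The column set (video hours, annotated segments, task categories, annotation density) is an assumption about what such a table might report, and every number below is a placeholder rather than a published statistic for EgoLive or any prior dataset.

```python
# Illustrative sketch only: placeholder statistics, assumed column set.
from dataclasses import dataclass

@dataclass
class DatasetStats:
    name: str
    video_hours: float        # total recorded footage
    annotated_segments: int   # labeled clips / action segments
    task_categories: int      # distinct task labels
    scenario_types: int       # e.g. home service, retail, office

    @property
    def annotation_density(self) -> float:
        """Annotated segments per hour of video."""
        return self.annotated_segments / self.video_hours

def comparison_table(datasets: list[DatasetStats]) -> str:
    header = f"{'dataset':<16}{'hours':>8}{'segments':>10}{'tasks':>7}{'dens./h':>9}"
    rows = [header]
    for d in datasets:
        rows.append(
            f"{d.name:<16}{d.video_hours:>8.0f}{d.annotated_segments:>10d}"
            f"{d.task_categories:>7d}{d.annotation_density:>9.1f}"
        )
    return "\n".join(rows)

# Placeholder rows; swap in the datasets' actual published statistics.
print(comparison_table([
    DatasetStats("EgoLive", 1000, 400_000, 150, 6),   # placeholders
    DatasetStats("Ego4D", 1000, 400_000, 150, 6),     # placeholders
]))
```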

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for the constructive feedback on our manuscript. We address each of the major comments point by point below, indicating the revisions we plan to make.

Point-by-point responses
  1. Referee: [Abstract] The three claimed technical advantages (largest open-source annotated egocentric dataset for task-oriented routines, leading data quality, superior ecological validity) are asserted via device description and scenario coverage but lack quantitative comparisons (e.g., size, annotation density, task diversity metrics) to prior datasets such as Ego4D or EPIC-KITCHENS.

    Authors: We agree that quantitative comparisons are needed to substantiate the claims. In the revised manuscript, we will add a comparison table (likely in the Dataset section) that reports concrete metrics including total video hours, annotated frames or segments, number of task categories, scenario diversity (e.g., indoor/outdoor, home/retail), and annotation density for EgoLive versus Ego4D, EPIC-KITCHENS, and similar datasets. This will provide objective evidence for the scale and validity advantages. revision: yes

  2. Referee: [Abstract] The central claim that EgoLive will accelerate breakthroughs in generalizable robotic models and facilitate real-world robot deployment is not supported by any experiments, baseline policy training results, or transferability evaluations showing improved success rates or generalization on manipulation tasks.

    Authors: The paper is a dataset release and does not contain robot learning experiments, which would require substantial additional work outside the current scope. We will revise the abstract language to avoid overclaiming by changing the phrasing to indicate that EgoLive 'is designed to enable' or 'has the potential to accelerate' breakthroughs in generalizable models and real-world deployment, rather than asserting that it will directly do so. This keeps the intended motivation while removing unsupported assertions. revision: partial

  3. Referee: [Dataset Collection and Annotation (inferred from abstract claims)] The manuscript provides no details on annotation protocols, quality control procedures, inter-annotator agreement, or validation metrics for the high-precision multi-modal annotations, preventing verification of the 'leading data quality' advantage.

    Authors: We agree that these procedural details are required to verify the quality claims. We will expand the relevant section of the revised manuscript with a full description of the annotation pipeline, including tools and interfaces used, step-by-step protocols for each modality, quality-control steps (multiple independent annotators plus expert review), inter-annotator agreement measures (e.g., IoU for bounding boxes, kappa for action labels), and any automated consistency checks performed (a minimal sketch of these two agreement measures follows this list). revision: yes
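
As a reading aid for the agreement measures named in the last response, here is a minimal sketch of IoU for bounding boxes and Cohen's kappa for action labels. The box format and the toy annotator labels are assumptions for illustration, not taken from the paper's annotation pipeline.

```python
# Hedged sketch: standard definitions of the two agreement measures named above.
def box_iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators' categorical labels."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    classes = set(labels_a) | set(labels_b)
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    expected = sum((labels_a.count(c) / n) * (labels_b.count(c) / n) for c in classes)
    return (observed - expected) / (1 - expected) if expected < 1 else 1.0

# Toy example: two annotators boxing and labeling the same clips (hypothetical).
print(box_iou((0, 0, 10, 10), (5, 5, 15, 15)))   # ~0.14
print(cohens_kappa(["pick", "place", "pick", "idle", "pick"],
                   ["pick", "place", "idle", "idle", "pick"]))   # ~0.69
```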

Circularity Check

0 steps flagged

No circularity: dataset release with descriptive claims only

Full rationale

The manuscript introduces EgoLive as a large-scale egocentric dataset collected via head-mounted devices in real-world scenarios. Its central claims concern scale, annotation quality, and ecological validity, all asserted directly from the collection protocol and device description rather than from any derivation, equation, or fitted model. No predictions, first-principles results, or parameter estimations appear; therefore none can reduce to inputs by construction. Self-citations, if present, are not load-bearing for any derived quantity. The absence of policy-training experiments is an evidence gap for the transferability claim but does not constitute circularity in the paper's stated contributions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a dataset introduction paper. No free parameters, axioms, or invented entities are introduced because there are no derivations or new theoretical constructs.

pith-pipeline@v0.9.0 · 5615 in / 1103 out tokens · 70033 ms · 2026-05-08T06:14:30.836534+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

52 extracted references · 18 canonical work pages · 3 internal anchors

  1. [1]

    Create unlimited productivity via intelligent machines

    AgiBot. Create unlimited productivity via intelligent machines. https://www.agibot.com/, 2026. Accessed: 2026-04-06

  2. [2]

    BoT-SORT: Robust associations multi-pedestrian tracking

    Nir Aharon, Roy Orfaig, and Ben-Zion Bobrovsky. BoT-SORT: Robust associations multi-pedestrian tracking. arXiv preprint arXiv:2206.14651, 2022

  3. [3]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-VL technical report. arXiv preprint arXiv:2511.21631, 2025

  4. [4]

    HOT3D: Hand and object tracking in 3D from egocentric multi-view videos

    Prithviraj Banerjee, Sindi Shkodrani, Pierre Moulon, Shreyas Hampali, Shangchen Han, Fan Zhang, Linguang Zhang, Jade Fountain, Edward Miller, Selen Basol, et al. HOT3D: Hand and object tracking in 3D from egocentric multi-view videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7061–7071, 2025

  5. [5]

    H-RDT: Human manipulation enhanced bimanual robotic manipulation

    Hongzhe Bi, Lingxuan Wu, Tianwei Lin, Hengkai Tan, Zhizhong Su, Hang Su, and Jun Zhu. H-RDT: Human manipulation enhanced bimanual robotic manipulation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 18135–18143, 2026

  6. [6]

    Language models are few-shot learners

    Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems (NeurIPS), 2020

  7. [7]

    Egocentric-10K

    AI Build. Egocentric-10K. URL https://huggingface.co/datasets/builddotai/Egocentric-10K, 2025

  8. [8]

    ORB-SLAM3: An accurate open-source library for visual, visual–inertial and multi-map SLAM

    Carlos Campos, Richard Elvira, Juan J. Gómez Rodríguez, José M. M. Montiel, and Juan D. Tardós. ORB-SLAM3: An accurate open-source library for visual, visual–inertial and multi-map SLAM. IEEE Transactions on Robotics, 37(6):1874–1890, 2021. doi: 10.1109/TRO.2021.3075644

  9. [9]

    On scaling up a multilingual vision and language model

    Xi Chen, Josip Djolonga, Piotr Padlewski, Basil Mustafa, Soravit Changpinyo, Jialin Wu, Carlos Riquelme Ruiz, Sebastian Goodman, Xiao Wang, Yi Tay, Siamak Shakeri, Mostafa Dehghani, Daniel Salz, Mario Lucic, Michael Tschannen, Arsha Nagrani, et al. On scaling up a multilingual vision and language model. In Proceedings of the IEEE/CVF Conference on Computer...

  10. [10]

    Universal Manipulation Interface: In-The-Wild Robot Teaching Without In-The-Wild Robots

    Cheng Chi, Zhenjia Xu, Chuer Pan, Eric Cousineau, Benjamin Burchfiel, Siyuan Feng, Russ Tedrake, and Shuran Song. Universal manipulation interface: In-the-wild robot teaching without in-the-wild robots. arXiv preprint arXiv:2402.10329, 2024

  11. [11]

    Rescaling egocentric vision: Collection, pipeline and challenges for EPIC-KITCHENS-100

    Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Antonino Furnari, Evangelos Kazakos, Jian Ma, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, et al. Rescaling egocentric vision: Collection, pipeline and challenges for EPIC-KITCHENS-100. International Journal of Computer Vision, 130(1):33–55, 2022

  12. [12]

    Egocentric human-object interaction detection: A new benchmark and method

    Kunyuan Deng, Yi Wang, and Lap-Pui Chau. Egocentric human-object interaction detection: A new benchmark and method. Expert Systems with Applications, 300:130216, 2026

  13. [13]

    Datacomp: In search of the next generation of multimodal datasets

    Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, Jonathan Hayase, Georgios Smyrnis, Thao Nguyen, Ryan Marten, Mitchell Wortsman, et al. Datacomp: In search of the next generation of multimodal datasets. Advances in Neural Information Processing Systems, 36:22046–22078, 2023

  14. [14]

    GEN-0: Embodied foundation models that scale with physical interaction

    Generalist AI Team. GEN-0: Embodied foundation models that scale with physical interaction. https://generalistai.com/blog/nov-04-2025-GEN-0, 2025. Accessed: 2026-04-16

  15. [15]

    GEN-1: Scaling embodied foundation models to mastery

    Generalist AI Team. GEN-1: Scaling embodied foundation models to mastery. https://generalistai.com/blog/apr-02-2026-GEN-1, 2025. Accessed: 2026-04-16

  16. [16]

    Ego4D: Around the world in 3,000 hours of egocentric video

    Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4D: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18995–19012, 2022

  17. [17]

    ReMix: Optimizing data mixtures for large scale imitation learning

    Joey Hejna, Chethan Anand Bhateja, Yichen Jiang, Karl Pertsch, and Dorsa Sadigh. ReMix: Optimizing data mixtures for large scale imitation learning. In Pulkit Agrawal, Oliver Kroemer, and Wolfram Burgard, editors, Proceedings of The 8th Conference on Robot Learning, volume 270 of Proceedings of Machine Learning Research, pages 145–164. PMLR, 06–09 Nov 2025...

  18. [18]

    EgoDex: Learning dexterous manipulation from large-scale egocentric video

    Ryan Hoque, Peide Huang, David J. Yoon, Mouli Sivapurapu, and Jian Zhang. EgoDex: Learning dexterous manipulation from large-scale egocentric video. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=FFxkFMU89E

  19. [19]

    EgoMimic: Scaling imitation learning via egocentric video

    Simar Kareer, Dhruv Patel, Ryan Punamiya, Pranay Mathur, Shuo Cheng, Chen Wang, Judy Hoffman, and Danfei Xu. EgoMimic: Scaling imitation learning via egocentric video. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 13226–13233. IEEE, 2025

  20. [20]

    Phantom: Training robots without robots using only human videos

    Marion Lepert, Jiaying Fang, and Jeannette Bohg. Phantom: Training robots without robots using only human videos. In 9th Annual Conference on Robot Learning, 2025. URL https://openreview.net/forum?id=BTUioBmCWo

  21. [21]

    Masquerade: Learning from in-the-wild human videos using data-editing

    Marion Lepert, Jiaying Fang, and Jeannette Bohg. Masquerade: Learning from in-the-wild human videos using data-editing. In IEEE International Conference on Robotics and Automation (ICRA), 2026

  22. [22]

    MimicDreamer: Aligning human and robot demonstrations for scalable VLA training

    Haoyun Li, Ivan Zhang, Runqi Ouyang, Xiaofeng Wang, Zheng Zhu, Zhiqin Yang, Zhentao Zhang, Boyuan Wang, Chaojun Ni, Wenkang Qin, et al. MimicDreamer: Aligning human and robot demonstrations for scalable VLA training. arXiv preprint arXiv:2509.22199, 2025

  23. [23]

    VITRA: Scalable vision-language-action model pretraining for robotic manipulation with real-life human activity videos

    Qixiu Li, Yu Deng, Yaobo Liang, Lin Luo, Lei Zhou, Chengtang Yao, Lingqi Zeng, Zhiyuan Feng, Huizhi Liang, Sicheng Xu, et al. VITRA: Scalable vision-language-action model pretraining for robotic manipulation with real-life human activity videos. In IEEE International Conference on Robotics and Automation (ICRA), 2026

  24. [24]

    HOI4D: A 4D egocentric dataset for category-level human-object interaction

    Yunze Liu, Yun Liu, Che Jiang, Kangbo Lyu, Weikang Wan, Hao Shen, Boqiang Liang, Zhoujie Fu, He Wang, and Li Yi. HOI4D: A 4D egocentric dataset for category-level human-object interaction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21013–21022, 2022

  25. [25]

    Being-H0: Vision-language-action pretraining from large-scale human videos

    Hao Luo, Yicheng Feng, Wanpeng Zhang, Sipeng Zheng, Ye Wang, Haoqi Yuan, Jiazheng Liu, Chaoyi Xu, Qin Jin, and Zongqing Lu. Being-H0: Vision-language-action pretraining from large-scale human videos. arXiv preprint arXiv:2507.15597, 2025

  26. [26]

    Being-H0.5: Scaling human-centric robot learning for cross-embodiment generalization

    Hao Luo, Ye Wang, Wanpeng Zhang, Sipeng Zheng, Ziheng Xi, Chaoyi Xu, Haiweng Xu, Haoqi Yuan, Chi Zhang, Yiqing Wang, et al. Being-H0.5: Scaling human-centric robot learning for cross-embodiment generalization. arXiv preprint arXiv:2601.12993, 2026

  27. [27]

    Cosmos-embed1-448p: Vision-language embedding model for multimodal representation learning

    NVIDIA. Cosmos-embed1-448p: Vision-language embedding model for multimodal representation learning. Technical Report, 2024. URL https://www.nvidia.com

  28. [28]

    GR00T N1: An open foundation model for generalist humanoid robots

    NVIDIA, Johan Bjorck, Nikita Cherniadev, Fernando Castañeda, Xingye Da, Runyu Ding, Linxi "Jim" Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, Joel Jang, Zhenyu Jiang, Jan Kautz, Kaushil Kundalia, Lawrence Lao, Zhiqi Li, Zongyu Lin, Kevin Lin, Guilin Liu, Edith Llontop, Loic Magne, Ajay Mandlekar, Avnish Narayan, Soroush Nasiriany, Scott Reed, You L...

  29. [29]

    Reconstructing hands in 3D with transformers

    Georgios Pavlakos, Dandan Shan, Ilija Radosavovic, Angjoo Kanazawa, David Fouhey, and Jitendra Malik. Reconstructing hands in 3D with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9826–9836, 2024

  30. [30]

    Bringing general-purpose AI to the physical world

    Physical Intelligence. Bringing general-purpose AI to the physical world. https://www.pi.website, 2026. Accessed: 2026-04-16

  31. [31]

    EgoBridge: Domain adaptation for generalizable imitation from egocentric human data

    Ryan Punamiya, Dhruv Patel, Patcharapong Aphiwetsa, Pranav Kuppili, Lawrence Y Zhu, Simar Kareer, Judy Hoffman, and Danfei Xu. EgoBridge: Domain adaptation for generalizable imitation from egocentric human data. In Human to Robot: Workshop on Sensorizing, Modeling, and Learning from Humans, 2025

  32. [32]

    Humanoid policy ∼ human policy

    Ri-Zhao Qiu, Shiqi Yang, Xuxin Cheng, Chaitanya Chawla, Jialong Li, Tairan He, Ge Yan, David J Yoon, Ryan Hoque, Lars Paulsen, et al. Humanoid policy ∼ human policy. arXiv preprint arXiv:2503.13441, 2025

  33. [33]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning, pages 8748–8763. PMLR, 2021

  34. [34]

    SAM 2: Segment Anything in Images and Videos

    Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, and Christoph Feichtenhofer. SAM 2: Segment anything in images and videos. arXiv preprint arXiv:2408....

  35. [35]

    Embodied hands: Modeling and capturing hands and bodies together

    Javier Romero, Dimitrios Tzionas, and Michael J. Black. Embodied hands: Modeling and capturing hands and bodies together. ACM Transactions on Graphics (Proc. SIGGRAPH Asia), November 2017. URL http://doi.acm.org/10.1145/3130800.3130883

  36. [36]

    Xperience-10M

    AI Ropedia. Xperience-10M. URL https://huggingface.co/datasets/ropedia-ai/xperience-10m, 2026

  37. [37]

    LAION-5B: An open large-scale dataset for training next generation image-text models

    Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Sasanka Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. LAION-5B: An open large-scale dataset for training next generation image-text models.A...

  38. [38]

    Understanding human hands in contact at internet scale

    Dandan Shan, Jiaqi Geng, Michelle Shu, and David Fouhey. Understanding human hands in contact at internet scale. 2020

  39. [39]

    EgoHumanoid: Unlocking in-the-wild loco-manipulation with robot-free egocentric demonstration

    Modi Shi, Shijia Peng, Jin Chen, Haoran Jiang, Yinghui Li, Di Huang, Ping Luo, Hongyang Li, and Li Chen. EgoHumanoid: Unlocking in-the-wild loco-manipulation with robot-free egocentric demonstration, 2026. URL https://arxiv.org/abs/2602.10106

  40. [40]

    Not your mother's view: the dynamics of toddler visual experience

    Linda B. Smith, Chen Yu, and Alfredo F. Pereira. Not your mother's view: the dynamics of toddler visual experience. Developmental Science, 14(1):9–17, 2011. doi: https://doi.org/10.1111/j.1467-7687.2009.00947.x. URL https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1467-7687.2009.00947.x

  41. [41]

    Tesla AI and Robotics

    Tesla, Inc. Tesla AI and Robotics. https://www.tesla.com/AI, 2026. Accessed: 2026-04-16

  42. [42]

    Open X-Embodiment: Robotic learning datasets and RT-X models

    Quan Vuong, Sergey Levine, Homer Rich Walke, Karl Pertsch, Anikait Singh, Ria Doshi, Charles Xu, Jianlan Luo, Liam Tan, Dhruv Shah, Chelsea Finn, Max Du, Moo Jin Kim, Alexander Khazatsky, Jonathan Heewon Yang, Tony Z. Zhao, et al. Open X-Embodiment: Robotic learning datasets and RT-X models. In Towards Generalist Robots: Learning Paradigms for Scalable Skil...

  43. [43]

    FoundationStereo: Zero-shot stereo matching

    Bowen Wen, Matthew Trepte, Joseph Aribido, Jan Kautz, Orazio Gallo, and Stan Birchfield. FoundationStereo: Zero-shot stereo matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5249–5260, 2025

  44. [44]

    ZeroWBC: Learning natural visuomotor humanoid control directly from human egocentric video

    Haoran Yang, Jiacheng Bao, Yucheng Xin, Haoming Song, Yuyang Tian, Bin Zhao, Dong Wang, and Xuelong Li. ZeroWBC: Learning natural visuomotor humanoid control directly from human egocentric video. arXiv preprint arXiv:2603.09170, 2026

  45. [45]

    EgoVLA: Learning vision-language-action models from egocentric human videos

    Ruihan Yang, Qinxi Yu, Yecheng Wu, Rui Yan, Borui Li, An-Chieh Cheng, Xueyan Zou, Yunhao Fang, Xuxin Cheng, Ri-Zhao Qiu, et al. EgoVLA: Learning vision-language-action models from egocentric human videos. arXiv preprint arXiv:2507.12440, 2025

  46. [46]

    Embodied attention and word learning by toddlers

    Chen Yu and Linda B. Smith. Embodied attention and word learning by toddlers. Cognition, 125(2):244–262, 2012. ISSN 0010-0277. doi: https://doi.org/10.1016/j.cognition.2012.06.016. URL https://www.sciencedirect.com/science/article/pii/S0010027712001369

  48. [48]

    EgoMI: Learning active vision and whole-body manipulation from egocentric human demonstrations

    Justin Yu, Yide Shentu, Di Wu, Pieter Abbeel, Ken Goldberg, and Philipp Wu. EgoMI: Learning active vision and whole-body manipulation from egocentric human demonstrations. arXiv preprint arXiv:2511.00153, 2025

  49. [49]

    SCIZOR: A self-supervised approach to data curation for large-scale imitation learning

    Yu Zhang, Yuqi Xie, Huihan Liu, Rutav Shah, Michael Wan, Linxi Fan, and Yuke Zhu. SCIZOR: A self-supervised approach to data curation for large-scale imitation learning. In IEEE International Conference on Robotics and Automation (ICRA), 2026. To appear

  50. [50]

    Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware

    Tony Z. Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware. In Proceedings of Robotics: Science and Systems, Daegu, Republic of Korea, July 2023. doi: 10.15607/RSS.2023.XIX.016

  51. [51]

    FastUMI: A scalable and hardware-independent universal manipulation interface with dataset

    Zhaxizhuoma, Kehui Liu, Chuyue Guan, Zhongjie Jia, Ziniu Wu, Xin Liu, Tianyu Wang, Shuai Liang, Pengan Chen, Pingrui Zhang, Haoming Song, Delin Qu, Dong Wang, Zhigang Wang, Nieqing Cao, Yan Ding, Bin Zhao, and Xuelong Li. FastUMI: A scalable and hardware-independent universal manipulation interface with dataset. In 9th Annual Conference on Robot Learning,...

  52. [52]

    EgoScale: Scaling dexterous manipulation with diverse egocentric human data

    Ruijie Zheng, Dantong Niu, Yuqi Xie, Jing Wang, Mengda Xu, Yunfan Jiang, Fernando Castañeda, Fengyuan Hu, You Liang Tan, Letian Fu, et al. EgoScale: Scaling dexterous manipulation with diverse egocentric human data. arXiv preprint arXiv:2602.16710, 2026