HumanScale: Egocentric Human Video Can Outperform Real-Robot Data for Embodied Pretraining

arxiv: 2606.20521 · v1 · pith:6WXUQCOOnew · submitted 2026-06-18 · 💻 cs.CV

Juncheng Ma, Jianxin Bi, Yufan Deng, Xuanran Zhai, Kewei Zhang, Ye Huang, Bo Liang, Shukai Gong, Jiankai Tu, Xiaotian Tang, Jiaxin Li, Kaiqi Chen, Duomin Wang, Yuqi Wang, Bingyi Kang, Eric Huang, Zhiyang Dou, Zhen Dong, Enze Xie, Wojciech Matusik, Tat-Seng Chua, Daquan Zhou

Pith reviewed 2026-06-26 17:50 UTC · model grok-4.3

classification 💻 cs.CV

keywords egocentric human video, embodied pretraining, foundation models, robot learning, action prediction, task success, data scaling, teleoperated trajectories

The pith

Egocentric human video outperforms teleoperated robot trajectories for embodied model pretraining.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper reports that egocentric human video, after a purpose-built filtering and labeling pipeline, produces better pretrained models than an equal volume of teleoperated real-robot trajectories. The comparison holds post-training and validation protocols fixed, so differences are attributed to the pretraining source. A reader would care because robot data is costly and scarce while human video is abundant and diverse, which suggests a route to scaling embodied foundation models without proportional increases in robot data collection. The result points to a two-stage approach: pretrain on human video for broad world representations, then adapt with limited robot data for action alignment.

Core claim

With the same amount of pretraining data, models pretrained on egocentric data achieve a 24% lower validation loss on real-robot action prediction, as well as 52.5% and 90% higher success rates on in-distribution and out-of-distribution real-robot task execution, respectively. This verifies a scalable paradigm for embodied foundation models: pretrain on egocentric human video to learn diverse world representations, then adapt with a small amount of labeled real-robot data for action-space alignment.
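
The abstract gives no baseline values and does not say whether "higher success rates" means a relative gain or an absolute gain in percentage points. The sketch below shows how far apart the two readings can be. The baseline loss and success rate are invented for illustration and are not taken from the paper; the function names are likewise hypothetical.

```python
def with_relative_gain(baseline: float, pct: float) -> float:
    """Value after a relative change of `pct` percent (negative = decrease)."""
    return baseline * (1.0 + pct / 100.0)


def with_absolute_gain(baseline_rate: float, points: float) -> float:
    """Success rate after an absolute change of `points` percentage points."""
    rate = baseline_rate + points / 100.0
    if not 0.0 <= rate <= 1.0:
        raise ValueError("resulting success rate falls outside [0, 1]")
    return rate


# Hypothetical baselines, NOT values from the paper.
baseline_loss = 1.00
baseline_success = 0.40

# "24% lower validation loss" read as a relative reduction.
loss_after = with_relative_gain(baseline_loss, -24.0)

# "52.5% higher success rate" under the two possible readings.
relative_reading = with_relative_gain(baseline_success, 52.5)
absolute_reading = with_absolute_gain(baseline_success, 52.5)

print(f"loss: {baseline_loss:.2f} -> {loss_after:.2f}")
print(f"success, relative reading: {relative_reading:.3f}")
print(f"success, absolute reading: {absolute_reading:.3f}")
```

Under the absolute reading, a 90-point gain is only possible when the baseline success rate is at most 10%, so `with_absolute_gain(0.40, 90.0)` raises `ValueError`. Which reading the authors intend cannot be determined from the abstract alone.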

What carries the argument

The filtering and labeling pipeline that converts raw egocentric human video into pretraining data aligned for embodied action prediction.

If this is right

Egocentric pretraining learns more diverse world representations than robot-trajectory pretraining.
A small amount of labeled real-robot data suffices afterward to align the action space.
The paradigm reduces dependence on high-cost, low-diversity robot data collection.
The study supplies guidance for assessing data quality before committing to robot data gathering.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The out-of-distribution gains imply that human behavioral variety transfers better to novel robot environments than robot trajectories do.
If the pipeline proves reusable, the same human-video source could support pretraining across multiple robot embodiments without new collection.
Scaling the human-video volume further while keeping robot adaptation data fixed could widen the observed gap.

Load-bearing premise

The filtering and labeling pipeline applied to egocentric human video is neutral with respect to the downstream evaluation tasks and does not confer an unfair advantage relative to the raw teleoperated robot trajectories.

What would settle it

Running the identical downstream evaluation on models pretrained with the same human videos but without the described filtering and labeling pipeline, and finding no performance advantage over the robot-data baseline, would falsify the central claim.

Original abstract

Embodied foundation models are expected to benefit from data scaling like large language models, but face a much tighter data bottleneck. Teleoperated real-robot trajectories remain the dominant pretraining source due to their precise action supervision and embodiment alignment, yet their scalability is limited by high collection cost, acquisition difficulty, and low behavioral and environmental diversity. These limitations have sparked interest in egocentric human video as a scalable, substantially lower-cost, and more diverse alternative for embodied model pretraining. However, its effectiveness compared to teleoperated real-robot data remains underexplored. To address this question, we conduct a systematic study comparing egocentric human video and teleoperated real-robot trajectories as pretraining data sources for embodied foundation models, under fixed post-training and validation protocols. Surprisingly, we find that egocentric data, when processed through a carefully designed filtering and labeling pipeline, is not merely a viable substitute for model pretraining but can lead to superior performance. With the same amount of pretraining data, models pretrained on egocentric data achieve a 24% lower validation loss on real-robot action prediction, as well as 52.5% and 90% higher success rates on in-distribution and out-of-distribution real-robot task execution, respectively. This finding verifies a scalable paradigm for embodied foundation models: pretrain on egocentric human video to learn diverse world representations, then adapt with a small amount of labeled real-robot data for action-space alignment. We hope this study encourages broader exploration of egocentric data and offers guidance for data quality assessment before costly robot data collection.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Filtered human egocentric video appears to outperform raw robot data for embodied pretraining, but the processing steps need close examination.

The main takeaway is that this paper reports filtered egocentric human video beating raw teleoperated robot trajectories for pretraining embodied models, with a 24% lower validation loss and substantially higher task success rates under matched post-training conditions.

The new element is the controlled head-to-head comparison of the two data sources using fixed protocols. Earlier papers have suggested human video as a cheaper alternative, but this one runs the direct test and quantifies the gap.

The work does a solid job spelling out the scalability limits of robot data and sketching a practical two-stage approach: pretrain on abundant human video for broad representations, then adapt with limited robot data for action alignment.

The soft spot is the filtering and labeling pipeline, which is described only for the human data. The abstract calls it carefully designed and mentions no comparable processing of the robot data, yet it supplies no criteria, ablations, or checks that the steps stay neutral to the downstream robot tasks. If those choices embed task-relevant selection or pseudo-labeling tuned to the evaluation, the gains may not trace to the egocentric distribution itself. The reported numbers also come without error bars or significance tests, which adds to the uncertainty.

This paper targets researchers working on data scaling for embodied foundation models. Readers focused on robot learning and generalist systems will find the empirical comparison worth their time. It deserves peer review because the question is relevant and the setup is straightforward, even though the pipeline details require more scrutiny before the central claim can be taken as settled.

I would send it to referees.

Referee Report

3 major / 2 minor

Summary. The paper claims that, under fixed post-training and validation protocols, embodied foundation models pretrained on the same volume of egocentric human video (after a filtering and labeling pipeline) achieve a 24% lower validation loss on real-robot action prediction and 52.5%/90% higher success rates on in-distribution and out-of-distribution real-robot tasks than models pretrained on raw teleoperated robot trajectories. It concludes that human video is not merely a substitute but can be superior for learning diverse world representations before action-space adaptation with limited robot data.

Significance. If the reported gains are shown to arise from the data distribution rather than pipeline-induced selection effects, the result would be significant for embodied AI scaling: it would support a cheaper, higher-diversity pretraining paradigm that reduces reliance on costly robot teleoperation while still enabling strong downstream robot performance. The fixed-protocol design and quantitative comparisons are strengths that make the finding falsifiable and reproducible in principle.

major comments (3)

[Abstract] Abstract: The central quantitative claims (24% lower loss, 52.5% and 90% higher success rates) are presented without error bars, number of runs, or statistical significance tests. Because these numbers are the primary evidence for the superiority claim, the absence of variance estimates leaves the magnitude and reliability of the gains difficult to assess.
[Abstract] Abstract and methods: The human-video pipeline is described only as 'carefully designed filtering and labeling' with no enumeration of criteria (action mapping, quality thresholds, scene selection), no ablation of each step, and no statement that an identical pipeline was applied to the robot trajectories. This detail is load-bearing for the claim that gains derive from the egocentric distribution itself rather than post-hoc selection bias relative to raw robot data.
[Results] Results (assumed §4 or equivalent): The comparison is between processed human data and raw robot data; without a control that applies the same filtering/labeling steps symmetrically to robot trajectories or an ablation isolating the pipeline's contribution, the reported outperformance cannot be unambiguously attributed to data source rather than processing asymmetry.
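
The controls requested in the major comments can be framed as a small grid of pretraining conditions, all evaluated under the same fixed protocol. The sketch below only enumerates that grid and marks which cells the referee report says are compared. The condition labels and the `quality_filter` flag are hypothetical names, not the paper's terminology, and the actual pipeline steps are not specified in the abstract.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class PretrainCondition:
    source: str           # "human_egocentric" or "robot_teleop" (hypothetical labels)
    quality_filter: bool  # hypothetical flag: was a filtering step applied?

    @property
    def name(self) -> str:
        suffix = "filtered" if self.quality_filter else "raw"
        return f"{self.source}/{suffix}"


def build_grid() -> list[PretrainCondition]:
    """All source x filtering combinations, i.e. the full 2x2 design."""
    sources = ("human_egocentric", "robot_teleop")
    return [PretrainCondition(s, f) for s, f in product(sources, (False, True))]


# The comparison as the referee report characterizes it:
# processed human video versus raw robot trajectories.
compared = {
    PretrainCondition("human_egocentric", True),
    PretrainCondition("robot_teleop", False),
}

grid = build_grid()
missing_controls = [c.name for c in grid if c not in compared]

print("compared:", sorted(c.name for c in compared))
print("missing controls:", missing_controls)
```

The two missing cells correspond to the ablations the referee asks for: human video without the pipeline, and robot trajectories with a comparable quality filter.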

minor comments (2)

Notation for action spaces and loss functions should be defined explicitly on first use to aid readers comparing the two data regimes.
Figure captions for success-rate plots should state the number of evaluation episodes and whether error bars represent standard deviation or standard error.
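
To make the distinction in the last comment concrete, here is a minimal sketch of standard deviation versus standard error across repeated runs, plus a binomial standard error for a success rate measured over a fixed number of evaluation episodes. The per-run values and episode counts are made-up placeholders, not results from the paper.

```python
import math
import statistics


def std_and_sem(values):
    """Sample standard deviation (n - 1 denominator) and standard error of the mean."""
    if len(values) < 2:
        raise ValueError("need at least two values")
    sd = statistics.stdev(values)
    sem = sd / math.sqrt(len(values))
    return sd, sem


def binomial_se(successes: int, episodes: int) -> float:
    """Standard error of a success rate estimated from `episodes` independent trials."""
    if episodes <= 0 or not 0 <= successes <= episodes:
        raise ValueError("require 0 <= successes <= episodes and episodes > 0")
    p = successes / episodes
    return math.sqrt(p * (1.0 - p) / episodes)


# Placeholder success rates from four hypothetical runs.
per_run_success = [0.60, 0.70, 0.65, 0.75]
sd, sem = std_and_sem(per_run_success)

# Placeholder single evaluation: 13 successes out of 20 episodes.
se_single = binomial_se(13, 20)

print(f"SD = {sd:.4f}, SEM = {sem:.4f}")
print(f"binomial SE for 13/20 = {se_single:.4f}")
```

With four runs the standard error is half the standard deviation, so a caption that does not say which one is plotted leaves the bar width ambiguous by a factor of two.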

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the careful review and constructive suggestions. The comments correctly identify areas where additional transparency and statistical rigor would strengthen the manuscript. We respond to each point below and will incorporate revisions to address the concerns about reporting and methodological clarity.

Referee: [Abstract] Abstract: The central quantitative claims (24% lower loss, 52.5% and 90% higher success rates) are presented without error bars, number of runs, or statistical significance tests. Because these numbers are the primary evidence for the superiority claim, the absence of variance estimates leaves the magnitude and reliability of the gains difficult to assess.

Authors: We agree that variance estimates are important for assessing reliability. In the revised manuscript we will report means and standard deviations across multiple random seeds for both validation loss and success rates, and include statistical significance tests (e.g., paired t-tests) comparing the two pretraining conditions.

revision: yes

Referee: [Abstract] Abstract and methods: The human-video pipeline is described only as 'carefully designed filtering and labeling' with no enumeration of criteria (action mapping, quality thresholds, scene selection), no ablation of each step, and no statement that an identical pipeline was applied to the robot trajectories. This detail is load-bearing for the claim that gains derive from the egocentric distribution itself rather than post-hoc selection bias relative to raw robot data.

Authors: The methods section already enumerates the pipeline criteria, but the abstract is concise. We will expand the abstract to list the main filtering steps and add an explicit clarification that the pipeline is applied exclusively to human video because it requires action inference from visual observations; robot trajectories already contain direct action labels, so the same steps are neither applicable nor necessary.

revision: yes

Referee: [Results] Results (assumed §4 or equivalent): The comparison is between processed human data and raw robot data; without a control that applies the same filtering/labeling steps symmetrically to robot trajectories or an ablation isolating the pipeline's contribution, the reported outperformance cannot be unambiguously attributed to data source rather than processing asymmetry.

Authors: We will add a dedicated paragraph in the discussion section explaining why symmetric application of the full pipeline is not meaningful (robot data already supplies precise actions). To further isolate effects we will include, where data permits, a quality-filtering ablation on the robot trajectories and report whether performance changes materially.

revision: partial
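
The paired t-test mentioned in the first response can be sketched with the standard library alone. This assumes runs of the two pretraining conditions are matched by random seed, which the rebuttal does not state. The per-seed numbers are placeholders, not results from the paper, and `paired_t_statistic` is a hypothetical helper name.

```python
import math
import statistics


def paired_t_statistic(cond_a, cond_b):
    """Return (t, degrees_of_freedom) for a paired t-test on matched samples."""
    if len(cond_a) != len(cond_b):
        raise ValueError("paired samples must have equal length")
    if len(cond_a) < 2:
        raise ValueError("need at least two pairs")
    # Work on per-pair differences; pairing removes shared seed-level variation.
    diffs = [a - b for a, b in zip(cond_a, cond_b)]
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    n = len(diffs)
    t = statistics.mean(diffs) / (sd / math.sqrt(n))
    return t, n - 1


# Placeholder per-seed success rates, NOT results from the paper.
human_pretrained = [0.60, 0.70, 0.65, 0.75]
robot_pretrained = [0.40, 0.45, 0.50, 0.45]

t_stat, dof = paired_t_statistic(human_pretrained, robot_pretrained)
print(f"t = {t_stat:.3f} with {dof} degrees of freedom")
```

The sketch stops at the test statistic. Turning it into a p-value requires the Student t distribution, for example via `scipy.stats.ttest_rel`, which performs the whole test on the two samples directly.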

Circularity Check

0 steps flagged

No circularity; empirical comparison is self-contained

The paper reports an empirical head-to-head comparison of pretraining data sources under fixed post-training and validation protocols. No equations, fitted parameters, or derivations are present that would reduce the reported loss reductions or success-rate gains to the input data by construction. The filtering/labeling pipeline is described as part of the human-video processing step, but the central claim is an observed performance difference rather than a mathematical identity or self-referential prediction. No self-citation chains or uniqueness theorems are invoked as load-bearing premises. This is the expected non-finding for a data-comparison study whose results remain externally falsifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical machine-learning comparison study. No new mathematical axioms, free parameters, or invented entities are introduced in the abstract.

pith-pipeline@v0.9.1-grok · 5901 in / 1056 out tokens · 33611 ms · 2026-06-26T17:50:56.541335+00:00 · methodology


[1] [1]

Phi-4-mini technical report: Compact yet powerful multimodal language models via mixture-of-loras.arXiv preprint arXiv:2503.01743, 2025

Abdelrahman Abouelenin, Atabak Ashfaq, Adam Atkinson, Hany Awadalla, Nguyen Bach, Jianmin Bao, Alon Benhaim, Martin Cai, Vishrav Chaudhary, Congcong Chen, et al. Phi-4-mini technical report: Compact yet powerful multimodal language models via mixture-of-loras.arXiv preprint arXiv:2503.01743, 2025

Pith/arXiv arXiv 2025

[2] [2]

Agibot world colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems.arXiv preprint arXiv:2503.06669, 2025

AgiBot World Contributors. Agibot world colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems.arXiv preprint arXiv:2503.06669, 2025

Pith/arXiv arXiv 2025

[3] [3]

Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

Pith/arXiv arXiv 2025

[4] [4]

arXiv preprint arXiv:2410.24164, 2024

Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al.π0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024

Pith/arXiv arXiv 2024

[5] [5]

Rt-1: Robotics transformer for real-world control at scale.arXiv preprint arXiv:2212.06817, 2022

Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real-world control at scale.arXiv preprint arXiv:2212.06817, 2022

Pith/arXiv arXiv 2022

[6] [6]

Egocentric-100k: 100,000 hours of real-world egocentric video from factory workers

Build AI. Egocentric-100k: 100,000 hours of real-world egocentric video from factory workers. https:// huggingface.co/datasets/builddotai/Egocentric-100K, 2026

2026

[7] [7]

Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling.arXiv preprint arXiv:2412.05271, 2024

Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling.arXiv preprint arXiv:2412.05271, 2024

Pith/arXiv arXiv 2024

[8] [8]

Rescaling egocentric vision: Collection, pipeline and challenges for epic-kitchens-100.International Journal of Computer Vision, 130(1):33–55, 2022

Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Antonino Furnari, Evangelos Kazakos, Jian Ma, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, et al. Rescaling egocentric vision: Collection, pipeline and challenges for epic-kitchens-100.International Journal of Computer Vision, 130(1):33–55, 2022

2022

[9] [9]

Humannet: Scaling human-centric video learning to one million hours.arXiv preprint arXiv:2605.06747, 2026

Yufan Deng and Daquan Zhou. Humannet: Scaling human-centric video learning to one million hours.arXiv preprint arXiv:2605.06747, 2026

Pith/arXiv arXiv 2026

[10] [10]

Rethinking video generation model for the embodied world.arXiv preprint arXiv:2601.15282, 2026

Yufan Deng, Zilin Pan, Hongyu Zhang, Xiaojie Li, Ruoqing Hu, Yufei Ding, Yiming Zou, Yan Zeng, and Daquan Zhou. Rethinking video generation model for the embodied world.arXiv preprint arXiv:2601.15282, 2026

arXiv 2026

[11] [11]

Molmoact2: Action reasoning models for real-world deployment.arXiv preprint arXiv:2605.02881, 2026

Haoquan Fang, Jiafei Duan, Donovan Clay, Sam Wang, Shuo Liu, Weikai Huang, Xiang Fan, Wei-Chuan Tsai, et al. Molmoact2: Action reasoning models for real-world deployment.arXiv preprint arXiv:2605.02881, 2026

Pith/arXiv arXiv 2026

[12] [12]

Ego4D: Around the world in 3,000 hours of egocentric video

Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, et al. Ego4D: Around the world in 3,000 hours of egocentric video. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

2022

[13]

Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, et al. Ego-Exo4D: Understanding skilled human activity from first- and third-person perspectives, 2024. URL https://arxiv.org/abs/2311.18259

[14]

Ryan Hoque, Peide Huang, David J. Yoon, Mouli Sivapurapu, and Jian Zhang. Egodex: Learning dexterous manipulation from large-scale egocentric video, 2026. URL https://arxiv.org/abs/2505.11709

[15]

Chengkai Hou, Kun Wu, Jiaming Liu, Zhengping Che, Di Wu, Fei Liao, Guangrun Li, Jingyang He, et al. Robomind 2.0: A multimodal, bimanual mobile manipulation dataset for generalizable embodied intelligence. arXiv preprint arXiv:2512.24653, 2025

[16]

Tao Jiang, Tianyuan Yuan, Yicheng Liu, Chenhao Lu, Jianning Cui, Xiao Liu, Shuiqi Cheng, Jiyang Gao, et al. Galaxea open-world dataset and g0 dual-system vla model. arXiv preprint arXiv:2509.00576, 2025

[17]

Simar Kareer, Dhruv Patel, Ryan Punamiya, Pranay Mathur, Shuo Cheng, Chen Wang, Judy Hoffman, and Danfei Xu. Egomimic: Scaling imitation learning via egocentric video, 2024. URL https://arxiv.org/abs/2410.24221

[18]

Alexander Khazatsky et al. Droid: A large-scale in-the-wild robot manipulation dataset, 2025. URL https://arxiv.org/abs/2403.12945

[19]

Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, Quan Vuong, Thomas Kollar, Benjamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, and Chelsea Finn. Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024

[20]

Lin Li, Qihang Zhang, Yiming Luo, Shuai Yang, Ruilin Wang, Fei Han, Mingrui Yu, Zelin Gao, Nan Xue, Xing Zhu, et al. Causal world modeling for robot control. arXiv preprint arXiv:2601.21998, 2026

[21]

Zhen Li, Chuanhao Li, Xiaofeng Mao, Shaoheng Lin, Ming Li, Shitian Zhao, Zhaopan Xu, Xinyue Li, et al. Sekai: A video dataset towards world exploration. arXiv preprint arXiv:2506.15675, 2025

[22]

Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024

[23]

Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu. Rdt-1b: a diffusion foundation model for bimanual manipulation. arXiv preprint arXiv:2410.07864, 2024

[24]

Hao Luo, Yicheng Feng, Wanpeng Zhang, Sipeng Zheng, Ye Wang, Haoqi Yuan, Jiazheng Liu, Chaoyi Xu, Qin Jin, and Zongqing Lu. Being-h0: Vision-language-action pretraining from large-scale human videos, 2025. URL https://arxiv.org/abs/2507.15597

[25]

Hao Luo, Ye Wang, Wanpeng Zhang, Sipeng Zheng, Ziheng Xi, Chaoyi Xu, Haiweng Xu, Haoqi Yuan, Chi Zhang, Yiqing Wang, Yicheng Feng, and Zongqing Lu. Being-h0.5: Scaling human-centric robot learning for cross-embodiment generalization, 2026. URL https://arxiv.org/abs/2601.12993

[26]

Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. Howto100m: Learning a text-video embedding by watching hundred million narrated video clips, 2019. URL https://arxiv.org/abs/1906.03327

[27]

Suraj Nair, Aravind Rajeswaran, Vikash Kumar, Chelsea Finn, and Abhinav Gupta. R3m: A universal visual representation for robot manipulation, 2022. URL https://arxiv.org/abs/2203.12601

[28]

NVIDIA et al. GR00T N1: An open foundation model for generalist humanoid robots, 2025. URL https://arxiv.org/abs/2503.14734

[29]

Open X-Embodiment Collaboration et al. Open X-Embodiment: Robotic learning datasets and RT-X models. URL https://arxiv.org/abs/2310.08864

[31]

Ryan Punamiya et al. Egoverse: An egocentric human dataset for robot learning from around the world, 2026. URL https://arxiv.org/abs/2604.07607

[32]

Ropedia. Xperience-10m: A large-scale egocentric multimodal dataset with structured 3d/4d annotations, 2026. Dataset

[33]

Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, et al. Kimi-vl technical report. arXiv preprint arXiv:2504.07491, 2025

[34]

Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fang, T... Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025

[35]

Zhi Wang, Botao He, Kelin Yu, Seungjae Lee, Ruohan Gao, Furong Huang, and Yiannis Aloimonos. Humanego: Zero-shot robot learning from minutes of human egocentric videos. arXiv preprint, 2025

[36]

Wei Wu, Fan Lu, Yunnan Wang, Shuai Yang, Shi Liu, Fangjing Wang, Qian Zhu, He Sun, Yong Wang, Shuailei Ma, et al. A pragmatic vla foundation model. arXiv preprint arXiv:2601.18692, 2026

[37]

Seonghyeon Ye et al. World action models are zero-shot policies, 2026. URL https://arxiv.org/abs/2602.15922

[38]

Tianyuan Yuan, Zibin Dong, Yicheng Liu, and Hang Zhao. Fast-wam: Do world action models need test-time future imagination?, 2026. URL https://arxiv.org/abs/2603.16666

[39]

Tony Z. Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware. In Robotics: Science and Systems (RSS), 2023

[40]

Ruijie Zheng, Dantong Niu, Yuqi Xie, Jing Wang, Mengda Xu, Yunfan Jiang, Fernando Castañeda, Fengyuan Hu, You Liang Tan, Letian Fu, Trevor Darrell, Furong Huang, Yuke Zhu, Danfei Xu, and Linxi Fan. Egoscale: Scaling dexterous manipulation with diverse egocentric human data, 2026. URL https://arxiv.org/abs/2602.16710

[41]

Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, pages 2165–2183. PMLR, 2023