ACE-Ego-0: Unifying Egocentric Human and Robotic Data for VLA Pretraining

Chunxiao Liu; Ganlong Zhao; Guoquan Ye; Hao Li; Haotian Hou; Hongsheng Li; Jianbo Liu; Siyuan Huang; Tongyan Fang; Xiaogang Wang

arxiv: 2606.17200 · v1 · pith:SGQTYOWKnew · submitted 2026-06-15 · 💻 cs.RO

Hao Li, Ganlong Zhao, Yufei Liu, Haotian Hou, Guoquan Ye, Tongyan Fang, Chunxiao Liu, Siyuan Huang, Jianbo Liu, Xiaogang Wang, Hongsheng Li

Pith reviewed 2026-06-27 03:21 UTC · model grok-4.3

classification 💻 cs.RO

keywords VLA pretraining · egocentric human videos · pseudo-action trajectories · reliability-aware weighting · unified action representation · robotic manipulation · human-robot data unification · bimanual manipulation


The pith

A unified VLA framework turns egocentric human videos into pseudo robot actions and uses reliability-aware weighting to improve pretraining.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to demonstrate that abundant egocentric human videos can supply useful training signals for vision-language-action models once converted into robot-compatible pseudo trajectories. It does this by building a scalable video-to-action pipeline and aligning the data through camera-space actions, morphology conditioning, and chunked timing. A reliability-aware objective then down-weights noisy human labels while adding an auxiliary loss on the human data. If successful, this approach lets models scale pretraining with cheap human footage rather than only costly robot demonstrations, producing better results on manipulation benchmarks after both joint pretraining and fine-tuning.
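One ingredient named above, camera-space actions, can be illustrated with a small sketch. The abstract does not define the representation, so this assumes the simplest reading: a 3D hand or end-effector position expressed in the frame of the egocentric camera instead of a world or robot-base frame. The function name `world_to_camera` and the pose convention are illustrative assumptions, not the authors' code.

```python
from typing import List, Sequence

Vec3 = Sequence[float]
Mat3 = Sequence[Sequence[float]]


def world_to_camera(p_world: Vec3, R_wc: Mat3, t_wc: Vec3) -> List[float]:
    """Map a 3D point from world coordinates to camera coordinates.

    Assumed convention: R_wc and t_wc are the camera's rotation and position
    in the world frame, so p_world = R_wc @ p_cam + t_wc and therefore
    p_cam = R_wc^T @ (p_world - t_wc).
    """
    d = [p_world[i] - t_wc[i] for i in range(3)]
    # Multiply by the transpose of R_wc: entry j uses column j of R_wc.
    return [sum(R_wc[i][j] * d[i] for i in range(3)) for j in range(3)]


# Hypothetical example: camera at (1, 0, 0), rotated 90 degrees about the z axis.
R_wc = [[0.0, -1.0, 0.0],
        [1.0, 0.0, 0.0],
        [0.0, 0.0, 1.0]]
t_wc = [1.0, 0.0, 0.0]
p_cam = world_to_camera([1.0, 2.0, 0.0], R_wc, t_wc)
print(p_cam)
```

Under this reading, a human wrist and a robot gripper seen by a head-mounted camera share one coordinate frame, which is plausibly what makes the two data sources comparable; human videos have no robot base frame to fall back on.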

Core claim

The paper's claim is that adding 1.48K hours of pseudo-action-labeled egocentric human data to 4.53K hours of robot and simulation data consistently improves results after both unified joint pretraining and supervised fine-tuning. The mixing relies on a unified camera-space action representation with morphology conditioning and time-aligned chunking, together with a reliability-aware training objective and a human auxiliary loss. The resulting model, ACE-EGO-0, reaches state-of-the-art performance on RoboCasa GR1 TableTop and RoboTwin 2.0 and transfers to real-world bimanual manipulation.

What carries the argument

The reliability-aware training objective with human auxiliary loss that concentrates supervision on reliable pseudo-action signals from human videos after conversion via the egocentric video-to-action pipeline.
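The abstract does not give the form of this objective. As a rough sketch of the general idea only, one common way to concentrate supervision on reliable labels is to scale each human sample's loss by a per-sample reliability score and add the result to the robot loss as an auxiliary term. The function name, the `aux_weight` parameter, and the assumption that reliability scores lie in [0, 1] are all hypothetical choices made here for illustration.

```python
from typing import Sequence


def reliability_weighted_loss(
    robot_errors: Sequence[float],
    human_errors: Sequence[float],
    human_reliability: Sequence[float],
    aux_weight: float = 0.5,
) -> float:
    """Mean robot loss plus a reliability-weighted mean of human losses.

    human_reliability holds one assumed score in [0, 1] per human sample;
    a score of 0 removes that sample from the human term entirely.
    """
    if len(human_errors) != len(human_reliability):
        raise ValueError("need one reliability score per human sample")
    robot_term = sum(robot_errors) / len(robot_errors)
    total_weight = sum(human_reliability)
    if total_weight > 0:
        weighted = sum(w * e for w, e in zip(human_reliability, human_errors))
        human_term = weighted / total_weight
    else:
        human_term = 0.0  # no trustworthy human labels in this batch
    return robot_term + aux_weight * human_term


# Hypothetical per-sample losses: the second human label is judged unreliable.
loss = reliability_weighted_loss(
    robot_errors=[0.2, 0.4],
    human_errors=[1.0, 3.0],
    human_reliability=[1.0, 0.0],
)
print(round(loss, 6))
```

Normalising by the sum of weights keeps the human term on the same scale regardless of how many labels are trusted in a batch; whether the paper does this is not stated.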

If this is right

Joint pretraining that adds the weighted human data improves performance over robot-only baselines.
The same weighted human signals also improve results after supervised fine-tuning.
The resulting model reaches state-of-the-art scores on RoboCasa GR1 TableTop.
The resulting model reaches state-of-the-art scores on RoboTwin 2.0.
The pretrained model transfers effectively to real-world bimanual manipulation tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method points toward training pipelines that can draw on far larger volumes of everyday human video instead of depending primarily on robot-collected trajectories.
Similar conversion and weighting steps could be tested on other sources of human movement data such as third-person videos or motion-capture archives.
If the reliability weighting proves robust, it may allow incremental addition of new noisy data sources without retraining the entire model from scratch.

Load-bearing premise

The pseudo-action trajectories extracted from egocentric human videos remain useful and comparable to real robot demonstrations once placed in the unified camera-space representation with morphology conditioning.

What would settle it

An ablation that trains the same VLA architecture on the 4.53K hours of robot data alone and compares it, on the RoboCasa GR1 TableTop and RoboTwin 2.0 benchmarks, with the version that adds the 1.48K hours of weighted human pseudo-actions. If the robot-only model matches or beats the joint model, the central claim fails; a clear drop for the robot-only model would support it.
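To make such a comparison concrete, the gap between two success rates can be checked against sampling noise with a pooled two-proportion z statistic. The sketch below is a generic statistical helper, not something the paper describes, and the episode counts in the example are made-up placeholders, not reported results.

```python
import math
from typing import Tuple


def success_rate_gap(succ_a: int, n_a: int, succ_b: int, n_b: int) -> Tuple[float, float]:
    """Return (rate_b - rate_a, z) using a pooled two-proportion z statistic."""
    p_a = succ_a / n_a
    p_b = succ_b / n_b
    pooled = (succ_a + succ_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    z = (p_b - p_a) / se if se > 0 else 0.0
    return p_b - p_a, z


# Placeholder counts for illustration only: A = robot-only, B = robot + human.
gap, z = success_rate_gap(succ_a=50, n_a=100, succ_b=65, n_b=100)
print(f"gap={gap:.2f}, z={z:.2f}")
```

A |z| above roughly 1.96 corresponds to the usual two-sided 5% threshold; with few evaluation episodes per task, even a large-looking gap can fall below it.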

Original abstract

Vision-Language-Action (VLA) models benefit from large-scale and diverse embodied data, yet scaling robot trajectory collection is costly and labor-intensive. Recent advances show that large-scale egocentric human videos provide complementary real-world supervision in pretraining. However, joint training on human and robot data remains challenging due to divergences in action spaces, embodiment structures, temporal dynamics, and supervision quality. We introduce ACE-EGO-0, a unified VLA pretraining framework jointly leveraging heterogeneous data sources. To extract large-scale pretraining supervision from egocentric human videos, we build a scalable egocentric video-to-action pipeline that converts raw human videos into robot-format pseudo-action trajectories. To make these labels comparable with robot demonstrations, ACE-EGO-0 uses a unified action representation based on camera-space actions, morphology conditioning, and time-aligned action chunking. To robustly leverage noisy pseudo-action supervision from egocentric human videos, we formulate a reliability-aware training objective with a human auxiliary loss that concentrates supervision on reliable signals. We instantiate ACE-EGO-0 on 4.53K hours of robot and simulation data, together with 1.48K hours of pseudo-action-labeled egocentric human data. Experiments show that incorporating large-scale human supervision under reliability-aware weighting consistently improves both unified joint pretraining and supervised fine-tuning. ACE-EGO-0 achieves state-of-the-art performance on RoboCasa GR1 TableTop and RoboTwin 2.0, while demonstrating strong transfer to real-world bimanual manipulation.
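The abstract names time-aligned action chunking without defining it. One plausible reading, sketched below as an assumption, is that trajectories recorded at different rates are resampled to a shared rate so that a chunk of fixed length covers the same wall-clock duration for every source. The helper names `resample_linear` and `chunk`, the linear interpolation, and the 30 Hz and 10 Hz rates are illustrative choices, not details from the paper.

```python
import math
from typing import List


def resample_linear(values: List[float], src_hz: float, dst_hz: float) -> List[float]:
    """Linearly interpolate a 1-D action signal from src_hz to dst_hz."""
    if len(values) < 2:
        raise ValueError("need at least two samples to interpolate")
    duration = (len(values) - 1) / src_hz
    n_out = int(math.floor(duration * dst_hz + 1e-9)) + 1
    out = []
    for k in range(n_out):
        pos = k * src_hz / dst_hz  # fractional index into the source signal
        i = min(int(pos), len(values) - 2)
        frac = pos - i
        out.append(values[i] * (1.0 - frac) + values[i + 1] * frac)
    return out


def chunk(seq: List[float], horizon: int) -> List[List[float]]:
    """Split into non-overlapping chunks of length `horizon`, dropping an incomplete tail."""
    return [seq[i:i + horizon] for i in range(0, len(seq) - horizon + 1, horizon)]


# Hypothetical human-video signal at 30 Hz, resampled to a 10 Hz robot rate.
human_30hz = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
human_10hz = resample_linear(human_30hz, src_hz=30.0, dst_hz=10.0)
print(human_10hz)
print(chunk(human_10hz, horizon=2))
```

Real pipelines would resample full poses (rotations need care beyond linear interpolation) and may use overlapping chunks; this sketch only shows the alignment idea on a scalar signal.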

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper packages a video-to-action pipeline, unified camera-space actions with morphology conditioning, and a reliability-weighted loss to mix human egocentric videos with robot data for VLA pretraining, but the abstract supplies no numbers or checks on pseudo-label quality.


The paper's main contribution is ACE-EGO-0, a VLA pretraining setup that converts large amounts of egocentric human video into pseudo-action trajectories and trains jointly with robot data. They use a unified camera-space action representation with morphology conditioning and time-aligned chunking to make the data comparable, plus a reliability-aware objective that includes a human auxiliary loss to focus on good signals.

This is new as a packaged framework for this kind of mixed training. The paper does well at breaking down the specific problems with human-robot data mixing and giving practical solutions for each one. The claim that adding the human data improves both pretraining and fine-tuning, leading to SOTA on RoboCasa GR1 TableTop and RoboTwin 2.0, is the kind of result that could matter for scaling.

The soft spots are in the evidence presented. The abstract states the improvements but includes no quantitative results, no baselines, no ablation studies, and no analysis of how accurate or useful the pseudo-actions are. The weakest point is whether the video-to-action pipeline produces pseudo-trajectories that remain effective after conversion and conditioning. Without checks on that, the reported gains might not come from the human supervision at all. The stress-test note is accurate based on the abstract.

This work is aimed at people in robotics and AI who are trying to scale VLA models beyond what robot data alone can provide. Readers interested in data unification techniques or reliability weighting in training could find the ideas worth discussing.

It deserves a serious referee because the problem is central to the field and the proposed methods are specific enough to review properly. I would recommend sending it out for peer review to get the full experimental details and see if the claims hold.

Referee Report

3 major / 2 minor

Summary. The paper introduces ACE-EGO-0, a unified VLA pretraining framework for jointly leveraging robot demonstrations and egocentric human videos. It describes a scalable egocentric video-to-action pipeline that converts human videos into robot-format pseudo-action trajectories, a unified action representation based on camera-space actions, morphology conditioning, and time-aligned chunking, and a reliability-aware training objective with a human auxiliary loss to focus on reliable signals. The model is instantiated on 4.53K hours of robot/simulation data plus 1.48K hours of pseudo-labeled human data, with claims that human supervision under this weighting consistently improves joint pretraining and fine-tuning, achieving SOTA on RoboCasa GR1 TableTop and RoboTwin 2.0 while showing real-world bimanual transfer.

Significance. If the empirical results hold and the pseudo-action signals are validated as useful, the work could meaningfully advance scalable VLA pretraining by demonstrating how to incorporate abundant human video data alongside scarce robot trajectories. The reliability-aware objective and unified representation address practical challenges in heterogeneous embodiment and noisy supervision, with potential to reduce dependence on expensive robot data collection.

major comments (3)

[Abstract] Abstract: The claims of 'consistent improvements' and 'state-of-the-art performance' on RoboCasa GR1 TableTop and RoboTwin 2.0 are presented without any quantitative metrics, baseline comparisons, ablation results, or error bars, which is load-bearing for evaluating whether gains arise from the human data rather than architecture or data volume alone.
[Video-to-action pipeline] Egocentric video-to-action pipeline: No quantitative checks on pseudo-action fidelity are described (e.g., robot execution success rates of the converted trajectories or correlation with ground-truth robot actions), which is required to substantiate that the human supervision survives conversion to unified camera-space representation and morphology conditioning and contributes under the reliability-aware objective.
[Reliability-aware training objective] Reliability-aware training objective: The formulation of the human auxiliary loss and weighting mechanism lacks ablations isolating its contribution versus the joint architecture or extra data volume; without these, it is unclear whether the reported gains on the two benchmarks are attributable to selective amplification of reliable human signals.
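The fidelity check requested in the second major comment could take a form like the following sketch, which compares one dimension of a pseudo-action trajectory against a paired ground-truth trajectory using Pearson correlation and mean absolute error. This is a generic illustration of such a check, assuming paired data exists; the function names and the toy values are hypothetical and do not come from the paper.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences (nan if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / math.sqrt(vx * vy)


def mean_abs_error(x: Sequence[float], y: Sequence[float]) -> float:
    """Average absolute difference between paired samples."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)


# Toy values: the pseudo-labels track the ground truth but carry a constant offset.
ground_truth = [0.0, 0.1, 0.2, 0.3]
pseudo = [0.05, 0.15, 0.25, 0.35]
print(round(pearson(ground_truth, pseudo), 6), round(mean_abs_error(ground_truth, pseudo), 6))
```

The toy case shows why both numbers are worth reporting: correlation is blind to a constant bias that the absolute error picks up.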

minor comments (2)

[Unified action representation] The exact definitions of camera-space actions and the morphology conditioning mechanism would benefit from explicit equations or pseudocode for reproducibility.
[Data] Data sources and collection details for the 4.53K hours of robot/simulation data and 1.48K hours of human data should be expanded to support replication.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will incorporate revisions to strengthen the presentation of results, validation of the pipeline, and ablations for the training objective.


Referee: [Abstract] Abstract: The claims of 'consistent improvements' and 'state-of-the-art performance' on RoboCasa GR1 TableTop and RoboTwin 2.0 are presented without any quantitative metrics, baseline comparisons, ablation results, or error bars, which is load-bearing for evaluating whether gains arise from the human data rather than architecture or data volume alone.

Authors: We agree that the abstract would be strengthened by including specific quantitative results. In the revised manuscript, we will add key metrics (e.g., success rates with error bars), baseline comparisons, and references to ablations to better substantiate the claims of consistent improvements from human data and SOTA performance.
Revision: yes

Referee: [Video-to-action pipeline] Egocentric video-to-action pipeline: No quantitative checks on pseudo-action fidelity are described (e.g., robot execution success rates of the converted trajectories or correlation with ground-truth robot actions), which is required to substantiate that the human supervision survives conversion to unified camera-space representation and morphology conditioning and contributes under the reliability-aware objective.

Authors: The manuscript validates the pipeline through downstream task improvements when including the pseudo-labeled human data. However, we acknowledge the value of direct fidelity checks. We will add quantitative evaluations, including execution success rates on held-out converted trajectories and correlation analyses with available ground-truth where possible, in a new subsection of the revised version.
Revision: yes

Referee: [Reliability-aware training objective] Reliability-aware training objective: The formulation of the human auxiliary loss and weighting mechanism lacks ablations isolating its contribution versus the joint architecture or extra data volume; without these, it is unclear whether the reported gains on the two benchmarks are attributable to selective amplification of reliable human signals.

Authors: We will expand the experiments section with targeted ablations that isolate the reliability-aware objective and human auxiliary loss. These will include comparisons to joint training without the weighting mechanism and to data-volume-matched baselines, to clarify the contribution of selective amplification of reliable signals.
Revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical framework relies on external data and benchmarks.


The paper presents an empirical VLA pretraining approach that converts egocentric videos to pseudo-actions, applies a unified camera-space representation with morphology conditioning, and uses a reliability-aware loss to weight human supervision. No equations, derivations, or 'predictions' are described that reduce by construction to fitted inputs or self-citations. The claimed gains on RoboCasa and RoboTwin benchmarks are evaluated against external test sets rather than internal redefinitions, so the chain is self-contained and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are described.


Reference graph

Works this paper leans on

61 extracted references · 16 canonical work pages · 3 internal anchors

[1]

Brohan, N

A. Brohan, N. Brown, J. Carbajal, Y . Chebotar, J. Dabis, C. Finn, K. Gopalakrishnan, K. Hausman, A. Herzog, J. Hsu, et al. Rt-1: Robotics transformer for real-world control at scale.Robotics: Science and Systems XIX, 2023

2023
[2]

Zitkovich, T

B. Zitkovich, T. Yu, S. Xu, P. Xu, T. Xiao, F. Xia, J. Wu, P. Wohlhart, S. Welker, A. Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. InConference on Robot Learning, pages 2165–2183. PMLR, 2023

2023
[3]

and Kragic, D

K. Black, N. Brown, D. Driess, A. Esmail, M. R. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter, S. Jakubczak, T. Jones, L. Ke, S. Levine, A. Li-Bell, M. Mothukuri, S. Nair, K. Pertsch, L. X. Shi, L. Smith, J. Tanner, Q. Vuong, A. Walling, H. Wang, and U. Zhilinsky.π 0: A Vision-Language-Action Flow Model for General Robot Control. InProceedings ...

work page doi:10.15607/rss.2025.xxi.010 2025
[4]

Black, N

K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. R. Equi, C. Finn, N. Fusai, M. Y . Galliker, D. Ghosh, L. Groom, K. Hausman, B. Ichter, S. Jakubczak, T. Jones, L. Ke, D. LeBlanc, S. Levine, A. Li-Bell, M. Mothukuri, S. Nair, K. Pertsch, A. Z. Ren, L. X. Shi, L. Smith, J. T. Springenberg, K. Stachowicz, J. Tanner, Q. Vuong, H. Walke...

2025
[5]

ONeill, A

A. ONeill, A. Rehman, A. Maddukuri, A. Gupta, A. Padalkar, A. Lee, A. Pooley, A. Gupta, A. Mandlekar, A. Jain, et al. Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration 0. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 6892–6903. IEEE, 2024

2024
[6]

Ghosh, H

Octo Model Team, D. Ghosh, H. Walke, K. Pertsch, K. Black, O. Mees, S. Dasari, J. Hejna, C. Xu, J. Luo, T. Kreiman, Y . Tan, L. Y . Chen, P. Sanketi, Q. Vuong, T. Xiao, D. Sadigh, C. Finn, and S. Levine. Octo: An open-source generalist robot policy. InProceedings of Robotics: Science and Systems, Delft, Netherlands, 2024

2024
[7]

L. Wang, X. Chen, J. Zhao, and K. He. Scaling proprioceptive-visual learning with heterogeneous pre-trained transformers. In Advances in Neural Information Processing Systems, volume 37, 2024. doi:10.52202/079017-3952

work page doi:10.52202/079017-3952 2024
[8]

GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

NVIDIA, J. Bjorck, F. Castañeda, N. Cherniadev, X. Da, R. Ding, L. Fan, Y . Fang, D. Fox, F. Hu, S. Huang, J. Jang, Z. Jiang, J. Kautz, K. Kundalia, L. Lao, Z. Li, Z. Lin, K. Lin, G. Liu, E. Llontop, L. Magne, A. Mandlekar, A. Narayan, S. Nasiriany, S. Reed, Y . L. Tan, G. Wang, Z. Wang, J. Wang, Q. Wang, J. Xiang, Y . Xie, Y . Xu, Z. Xu, S. Ye, Z. Yu, A....

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2503.14734 2025
[9]

Zheng, J

J. Zheng, J. Li, Z. Wang, D. Liu, X. Kang, Y . Feng, Y . Zheng, J. Zou, Y . Chen, J. Zeng, et al. X-vla: Soft-prompted transformer as scalable cross-embodiment vision-language-action model.arXiv preprint arXiv:2510.10274, 2025

Pith/arXiv arXiv 2025
[10]

Kareer, D

S. Kareer, D. Patel, R. Punamiya, P. Mathur, S. Cheng, C. Wang, J. Hoffman, and D. Xu. Egomimic: Scaling imitation learning via egocentric video. In2025 IEEE International Conference on Robotics and Automation, pages 13226–13233. IEEE, 2025

2025
[11]

R. Yang, Q. Yu, Y . Wu, R. Yan, B. Li, A.-C. Cheng, X. Zou, Y . Fang, X. Cheng, R.-Z. Qiu, H. Yin, S. Liu, S. Han, Y . Lu, and X. Wang. EgoVLA: Learning vision-language-action models from egocentric human videos.arXiv preprint arXiv:2507.12440,

Pith/arXiv arXiv
[12]

doi:10.48550/arXiv.2507.12440

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2507.12440
[13]

Z. Fu, Q. Zhao, Q. Wu, G. Wetzstein, and C. Finn. Humanplus: Humanoid shadowing and imitation from humans. In P. Agrawal, O. Kroemer, and W. Burgard, editors,Proceedings of The 8th Conference on Robot Learning, volume 270 of Proceedings of Machine Learning Research, pages 2828–2844. PMLR, 2025

2025
[14]

Long, Y .-C

G. Pavlakos, D. Shan, I. Radosavovic, A. Kanazawa, D. Fouhey, and J. Malik. Reconstructing hands in 3d with transformers. In2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9826–9836. IEEE, June 2024. doi:10.1109/cvpr52733.2024.00938

work page doi:10.1109/cvpr52733.2024.00938 2024
[15]

R. A. Potamias, J. Zhang, J. Deng, and S. Zafeiriou. Wilor: End-to-end 3d hand localization and reconstruction in-the-wild. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 12242–12254, 2025. 15

2025
[16]

ACM Trans

J. Romero, D. Tzionas, and M. J. Black. Embodied hands: modeling and capturing hands and bodies together.ACM Transac- tions on Graphics, 36(6):1–17, November 2017. doi:10.1145/3130800.3130883

work page doi:10.1145/3130800.3130883 2017
[17]

M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. P. Foster, P. R. Sanketi, Q. Vuong, et al. Openvla: An open-source vision-language-action model. InConference on Robot Learning, pages 2679–2713. PMLR, 2025

2025
[18]

S. Liu, L. Wu, B. Li, H. Tan, H. Chen, Z. Wang, K. Xu, H. Su, and J. Zhu. Rdt-1b: a diffusion foundation model for bimanual manipulation. InInternational Conference on Learning Representations, volume 2025, pages 29982–30009, 2025

2025
[19]

Q. Li, Y . Liang, Z. Wang, L. Luo, X. Chen, M. Liao, F. Wei, Y . Deng, S. Xu, Y . Zhang, et al. Cogact: A foundational vision-language-action model for synergizing cognition and action in robotic manipulation.arXiv preprint arXiv:2411.19650, 2024

Pith/arXiv arXiv 2024
[20]

Zheng, J

J. Zheng, J. Li, D. Liu, Y . Zheng, Z. Wang, Z. Ou, Y . Liu, J. Liu, Y .-Q. Zhang, and X. Zhan. Universal actions for enhanced embodied foundation models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22508–22519, 2025

2025
[21]

S. Ye, J. Jang, B. Jeon, S. J. Joo, J. Yang, B. Peng, A. Mandlekar, R. Tan, Y .-W. Chao, B. Y . Lin, L. Liden, K. Lee, J. Gao, L. Zettlemoyer, D. Fox, and M. Seo. Latent action pretraining from videos. InInternational Conference on Learning Repre- sentations, 2025

2025
[22]

S. Liu, B. Li, K. Ma, L. Wu, H. Tan, X. Ouyang, H. Su, and J. Zhu. Rdt2: Exploring the scaling limit of umi data towards zero-shot cross-embodiment generalization.arXiv preprint arXiv:2602.03310, 2026

arXiv 2026
[23]

D. Qu, H. Song, Q. Chen, Y . Yao, X. Ye, J. Gu, Z. Wang, Y . Ding, B. Zhao, D. Wang, and X. Li. SpatialVLA: Exploring spatial representations for visual-language-action models. InProceedings of Robotics: Science and Systems, Los Angeles, CA, USA, June 2025. doi:10.15607/RSS.2025.XXI.011

work page doi:10.15607/rss.2025.xxi.011 2025
[24]

H. Zhen, X. Qiu, P. Chen, J. Yang, X. Yan, Y . Du, Y . Hong, and C. Gan. 3D-VLA: A 3D vision-language-action generative world model. In R. Salakhutdinov, Z. Kolter, K. Heller, A. Weller, N. Oliver, J. Scarlett, and F. Berkenkamp, editors,Proceed- ings of the 41st International Conference on Machine Learning, volume 235 ofProceedings of Machine Learning Re...

2024
[25]

Zheng, Y

R. Zheng, Y . Liang, S. Huang, J. Gao, H. Daumé III, A. Kolobov, F. Huang, and J. Yang. TraceVLA: Visual trace prompting enhances spatial-temporal awareness for generalist robotic policies. InInternational Conference on Learning Representations, 2025

2025
[26]

Assembly101: A large-scale multi-view video dataset for understanding procedural activities

K. Grauman et al. Ego4d: Around the world in 3,000 hours of egocentric video. In2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18973–18990. IEEE, June 2022. doi:10.1109/cvpr52688.2022.01842

work page doi:10.1109/cvpr52688.2022.01842 2022
[27]

Damen, H

D. Damen, H. Doughty, G. M. Farinella, A. Furnari, E. Kazakos, J. Ma, D. Moltisanti, J. Munro, T. Perrett, W. Price, and M. Wray. Rescaling egocentric vision: Collection, pipeline and challenges for EPIC-KITCHENS-100.International Journal of Computer Vision, 130(1):33–55, 2022. doi:10.1007/s11263-021-01531-2

work page doi:10.1007/s11263-021-01531-2 2022
[28]

Grauman, A

K. Grauman, A. Westbury, L. Torresani, K. Kitani, J. Malik, T. Afouras, K. Ashutosh, V . Baiyya, S. Bansal, B. Boote, E. Byrne, Z. Chavis, J. Chen, F. Cheng, F.-J. Chu, et al. Ego-Exo4D: Understanding skilled human activity from first- and third-person perspectives. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages...

2024
[29]

EgoDex: Learning Dexterous Manipulation from Large-Scale Egocentric Video

R. Hoque, P. Huang, D. J. Yoon, M. Sivapurapu, and J. Zhang. EgoDex: Learning dexterous manipulation from large-scale egocentric video.arXiv preprint arXiv:2505.11709, 2025. doi:10.48550/arXiv.2505.11709

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2505.11709 2025
[30]

Zheng, D

R. Zheng, D. Niu, Y . Xie, J. Wang, M. Xu, Y . Jiang, F. Castañeda, F. Hu, Y . L. Tan, L. Fu, T. Darrell, F. Huang, Y . Zhu, D. Xu, and L. Fan. EgoScale: Scaling dexterous manipulation with diverse egocentric human data.arXiv preprint arXiv:2602.16710,

arXiv
[31]

doi:10.48550/arXiv.2602.16710

work page doi:10.48550/arxiv.2602.16710
[32]

S. Nair, A. Rajeswaran, V . Kumar, C. Finn, and A. Gupta. R3M: A universal visual representation for robot manipulation. In K. Liu, D. Kulic, and J. Ichnowski, editors,Proceedings of The 6th Conference on Robot Learning, volume 205 ofProceedings of Machine Learning Research, pages 892–909. PMLR, 14–18 Dec 2023. URLhttps://proceedings.mlr.press/v205/ nair23a.html

2023
[33]

Y . J. Ma, S. Sodhani, D. Jayaraman, O. Bastani, V . Kumar, and A. Zhang. VIP: Towards universal visual reward and representation via value-implicit pre-training. InInternational Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=YJ7o2wetJ2

2023
[34]

Y . J. Ma, V . Kumar, A. Zhang, O. Bastani, and D. Jayaraman. LIV: Language-image representations and rewards for robotic control. In A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett, editors,Proceedings of the 40th 16 International Conference on Machine Learning, volume 202 ofProceedings of Machine Learning Research, pages 23301...

2023
[35]

T. Xiao, I. Radosavovic, T. Darrell, and J. Malik. Masked visual pre-training for motor control.arXiv preprint arXiv:2203.06173, 2022. URLhttps://arxiv.org/abs/2203.06173

arXiv 2022
[36]

Majumdar, K

A. Majumdar, K. Yadav, S. Arnaud, Y . J. Ma, C. Chen, S. Silwal, A. Jain, V .-P. Berges, T. Wu, J. Vakil, P. Abbeel, J. Malik, D. Batra, Y . Lin, O. Maksymets, A. Rajeswaran, and F. Meier. Where are we in the search for an artificial visual cortex for embodied intelligence? InAdvances in Neural Information Processing Systems, volume 36, 2023. URLhttps://p...

2023
[37]

Karamcheti, S

S. Karamcheti, S. Nair, A. S. Chen, T. Kollar, C. Finn, D. Sadigh, and P. Liang. Language-driven representation learning for robotics. InRobotics: Science and Systems, 2023. URLhttps://arxiv.org/abs/2302.12766

arXiv 2023
[38]

K. Q. Lin, J. Wang, M. Soldan, M. Wray, R. Yan, E. Z. Xu, D. Gao, R.-C. Tu, W. Zhao, W. Kong, C. Cai, H. Wang, D. Damen, B. Ghanem, W. Liu, and M. Z. Shou. Egocentric video-language pretraining. InAdvances in Neural Infor- mation Processing Systems, volume 35, 2022. URLhttps://proceedings.neurips.cc/paper_files/paper/2022/hash/ 31fb284a0aaaad837d2930a610c...

2022
[39]

Y . Zhao, I. Misra, P. Krähenbühl, and R. Girdhar. Learning video representations from large language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6586–6597, June 2023

2023
[40]

J. Li, Y . Zhu, Y . Xie, Z. Jiang, M. Seo, G. Pavlakos, and Y . Zhu. OKAMI: Teaching humanoid robots manipulation skills through single video imitation. In P. Agrawal, O. Kroemer, and W. Burgard, editors,Proceedings of The 8th Conference on Robot Learning, volume 270 ofProceedings of Machine Learning Research, pages 299–317. PMLR, 2025

2025
[41]

Lepert, J

M. Lepert, J. Fang, and J. Bohg. Phantom: Training robots without robots using only human videos. In J. Lim, S. Song, and H.-W. Park, editors,Proceedings of The 9th Conference on Robot Learning, volume 305 ofProceedings of Machine Learning Research, pages 4545–4565. PMLR, 2025

2025
[42]

L. Y . Zhu, P. Kuppili, R. Punamiya, P. Aphiwetsa, D. Patel, S. Kareer, S. Ha, and D. Xu. EMMA: scaling mobile manipulation via egocentric human data.IEEE Robotics Autom. Lett., 11(3):3087–3094, 2026. doi:10.1109/LRA.2026.3653320. URL https://doi.org/10.1109/LRA.2026.3653320

work page doi:10.1109/lra.2026.3653320 2026
[43]

G. Li, Y . Lyu, Z. Liu, C. Hou, Y . Xu, J. Zhang, and S. Zhang. H2R: A human-to-robot data augmentation for robot pre-training from videos.arXiv preprint arXiv:2505.11920, 2025. doi:10.48550/arXiv.2505.11920

work page doi:10.48550/arxiv.2505.11920 2025
[44]

V . Liu, A. Adeniji, D. Zhan, S. Haldar, R. Bhirangi, P. Abbeel, and L. Pinto. Egozero: Robot learning from smart glasses. arXiv preprint arXiv:2505.20290, 2025. doi:10.48550/arXiv.2505.20290

work page doi:10.48550/arxiv.2505.20290 2025
[45]

J. Shi, Z. Zhao, T. Wang, I. Pedroza, A. Luo, J. Wang, J. Ma, and D. Jayaraman. Zeromimic: Distilling robotic manipulation skills from web videos. In2025 IEEE International Conference on Robotics and Automation, pages 16939–16947. IEEE,
[46]

doi:10.1109/ICRA55743.2025.11128283

work page doi:10.1109/icra55743.2025.11128283 2025
[47]

Y . Chen, Y . Ge, H. Zhou, M. Ding, Y . Ge, and X. Liu. Dial: Decoupling intent and action via latent world modeling for end-to-end vla.arXiv preprint arXiv:2603.29844, 2026

Pith/arXiv arXiv 2026
[48]

Y . Zhou, C. Barnes, J. Lu, J. Yang, and H. Li. On the continuity of rotation representations in neural networks. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5745–5753, 2019

2019
[49]

Damen, H

D. Damen, H. Doughty, G. M. Farinella, S. Fidler, A. Furnari, E. Kazakos, D. Moltisanti, J. Munro, T. Perrett, W. Price, and M. Wray. The epic-kitchens dataset: Collection, challenges and baselines.IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(11):4125–4141, November 2021. doi:10.1109/tpami.2020.2991965

work page doi:10.1109/tpami.2020.2991965 2021
[50]

Y . Liu, Y . Liu, C. Jiang, K. Lyu, W. Wan, H. Shen, B. Liang, Z. Fu, H. He, and H. Dong. Hoi4d: A 4d egocentric dataset for category-level human-object interaction. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

2022
[51]

Xperience-10m: A large-scale egocentric multimodal dataset with structured 3d/4d annotations, 2026

Ropedia. Xperience-10m: A large-scale egocentric multimodal dataset with structured 3d/4d annotations, 2026. Dataset

2026
[52]

Carion, L

N. Carion, L. Gustafson, Y .-T. Hu, S. Debnath, R. Hu, D. Suris, C. Ryali, K. V . Alwala, H. Khedr, A. Huang, et al. Sam 3: Segment anything with concepts.arXiv preprint arXiv:2511.16719, 2025

Pith/arXiv arXiv 2025
[53]

Z. Yu, S. Zafeiriou, and T. Birdal. Dyn-hamr: Recovering 4d interacting hand motion from a dynamic camera. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 27716–27726, 2025

2025
[54]

Huang, Q

J. Huang, Q. Zhou, H. Rabeti, A. Korovko, H. Ling, X. Ren, T. Shen, J. Gao, D. Slepichev, C.-H. Lin, et al. Vipe: Video pose engine for 3d geometric perception.arXiv preprint arXiv:2508.10934, 2025. 17

Pith/arXiv arXiv 2025
[55]

T. Chen, Z. Chen, B. Chen, Z. Cai, Y . Liu, Z. Li, Q. Liang, X. Lin, Y . Ge, Z. Gu, et al. Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation.arXiv preprint arXiv:2506.18088, 2025

Pith/arXiv arXiv 2025
[56]

Zheng, J

R. Zheng, J. Wang, S. Reed, J. Bjorck, Y . Fang, F. Hu, J. Jang, K. Kundalia, Z. Lin, L. Magne, A. Narayan, Y . L. Tan, G. Wang, Q. Wang, J. Xiang, Y . Xu, S. Ye, J. Kautz, F. Huang, Y . Zhu, and L. Fan. Flare: Robot learning with implicit world modeling. In J. Lim, S. Song, and H.-W. Park, editors,Proceedings of The 9th Conference on Robot Learning, volu...

2025
[57]

Y . Yang, S. Zeng, T. Lin, X. Chang, D. Qi, J. Xiao, H. Liu, R. Chen, Y . Chen, D. Huo, F. Xiong, X. Wei, Z. Ma, and M. Xu. Abot-m0: Vla foundation model for robotic manipulation with action manifold learning.arXiv preprint arXiv:2602.11236, 2026

Pith/arXiv arXiv 2026
[58]

T. Zhang, Z. Yuan, D. Chi, P. Liu, D. Li, K. Hu, L. Zhang, J. Nie, Z. Wei, Z. Chen, Y. Tang, J. Li, Z. Xiang, M. Li, T. Luo, H. Wan, A. Li, L. Zhai, Z. Zhan, X. Bai, J. Cai, P. Cao, K. Chen, S. Chen, Y. Dai, S. Di, Y. Gong, C. Gui, Y. Guo, P. Hao, Q. He, H. Huang, K. Huang, Z. Huang, S. Jin, Y. Jin, A. Li, D. Li, J. Li, R. Li, Y. Li, Y. Li, J. Lian... 2026

[59]

H. Bi, H. Tan, S. Xie, Z. Wang, S. Huang, H. Liu, R. Zhao, Y. Feng, C. Xiang, Y. Rong, H. Zhao, H. Liu, Z. Su, L. Ma, H. Su, and J. Zhu. Motus: A unified latent action world model, 2025. URL https://arxiv.org/abs/2512.13030

[60]

W. Wu, F. Lu, Y. Wang, S. Yang, S. Liu, F. Wang, Q. Zhu, H. Sun, Y. Wang, S. Ma, et al. A pragmatic vla foundation model. arXiv preprint arXiv:2601.18692, 2026.

[61]

Tencent Robotics and Tencent Hy Team. Hy-Embodied-0.5-VLA: From vision-language-action models to a real-world robot learning stack. arXiv preprint arXiv:2606.14409, 2026.

A Additional Method Details

Figure 7 (panel labels: Human Video RoboTwin Galaxea World Agibot World). Camera-space action visualization across real robot demonstrations, simulation rollouts, and human e...

[1]

A. Brohan, N. Brown, J. Carbajal, Y. Chebotar, J. Dabis, C. Finn, K. Gopalakrishnan, K. Hausman, A. Herzog, J. Hsu, et al. Rt-1: Robotics transformer for real-world control at scale. Robotics: Science and Systems XIX, 2023.

[2]

B. Zitkovich, T. Yu, S. Xu, P. Xu, T. Xiao, F. Xia, J. Wu, P. Wohlhart, S. Welker, A. Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, pages 2165–2183. PMLR, 2023.

[3]

K. Black, N. Brown, D. Driess, A. Esmail, M. R. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter, S. Jakubczak, T. Jones, L. Ke, S. Levine, A. Li-Bell, M. Mothukuri, S. Nair, K. Pertsch, L. X. Shi, L. Smith, J. Tanner, Q. Vuong, A. Walling, H. Wang, and U. Zhilinsky. π0: A Vision-Language-Action Flow Model for General Robot Control. In Proceedings ... doi:10.15607/rss.2025.xxi.010, 2025

[4]

K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. R. Equi, C. Finn, N. Fusai, M. Y. Galliker, D. Ghosh, L. Groom, K. Hausman, B. Ichter, S. Jakubczak, T. Jones, L. Ke, D. LeBlanc, S. Levine, A. Li-Bell, M. Mothukuri, S. Nair, K. Pertsch, A. Z. Ren, L. X. Shi, L. Smith, J. T. Springenberg, K. Stachowicz, J. Tanner, Q. Vuong, H. Walke... 2025

[5]

A. O'Neill, A. Rehman, A. Maddukuri, A. Gupta, A. Padalkar, A. Lee, A. Pooley, A. Gupta, A. Mandlekar, A. Jain, et al. Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 6892–6903. IEEE, 2024.

[6]

Octo Model Team, D. Ghosh, H. Walke, K. Pertsch, K. Black, O. Mees, S. Dasari, J. Hejna, C. Xu, J. Luo, T. Kreiman, Y. Tan, L. Y. Chen, P. Sanketi, Q. Vuong, T. Xiao, D. Sadigh, C. Finn, and S. Levine. Octo: An open-source generalist robot policy. In Proceedings of Robotics: Science and Systems, Delft, Netherlands, 2024.

[7]

L. Wang, X. Chen, J. Zhao, and K. He. Scaling proprioceptive-visual learning with heterogeneous pre-trained transformers. In Advances in Neural Information Processing Systems, volume 37, 2024. doi:10.52202/079017-3952

[8]

NVIDIA, J. Bjorck, F. Castañeda, N. Cherniadev, X. Da, R. Ding, L. Fan, Y. Fang, D. Fox, F. Hu, S. Huang, J. Jang, Z. Jiang, J. Kautz, K. Kundalia, L. Lao, Z. Li, Z. Lin, K. Lin, G. Liu, E. Llontop, L. Magne, A. Mandlekar, A. Narayan, S. Nasiriany, S. Reed, Y. L. Tan, G. Wang, Z. Wang, J. Wang, Q. Wang, J. Xiang, Y. Xie, Y. Xu, Z. Xu, S. Ye, Z. Yu, A.... GR00T N1: An Open Foundation Model for Generalist Humanoid Robots. doi:10.48550/arxiv.2503.14734, 2025

[9]

J. Zheng, J. Li, Z. Wang, D. Liu, X. Kang, Y. Feng, Y. Zheng, J. Zou, Y. Chen, J. Zeng, et al. X-vla: Soft-prompted transformer as scalable cross-embodiment vision-language-action model. arXiv preprint arXiv:2510.10274, 2025.

[10]

S. Kareer, D. Patel, R. Punamiya, P. Mathur, S. Cheng, C. Wang, J. Hoffman, and D. Xu. Egomimic: Scaling imitation learning via egocentric video. In 2025 IEEE International Conference on Robotics and Automation, pages 13226–13233. IEEE, 2025.

[11]

R. Yang, Q. Yu, Y. Wu, R. Yan, B. Li, A.-C. Cheng, X. Zou, Y. Fang, X. Cheng, R.-Z. Qiu, H. Yin, S. Liu, S. Han, Y. Lu, and X. Wang. EgoVLA: Learning vision-language-action models from egocentric human videos. arXiv preprint arXiv:2507.12440,

[12]

doi:10.48550/arXiv.2507.12440

[13]

Z. Fu, Q. Zhao, Q. Wu, G. Wetzstein, and C. Finn. Humanplus: Humanoid shadowing and imitation from humans. In P. Agrawal, O. Kroemer, and W. Burgard, editors, Proceedings of The 8th Conference on Robot Learning, volume 270 of Proceedings of Machine Learning Research, pages 2828–2844. PMLR, 2025.

[14]

G. Pavlakos, D. Shan, I. Radosavovic, A. Kanazawa, D. Fouhey, and J. Malik. Reconstructing hands in 3d with transformers. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9826–9836. IEEE, June 2024. doi:10.1109/cvpr52733.2024.00938

[15]

R. A. Potamias, J. Zhang, J. Deng, and S. Zafeiriou. Wilor: End-to-end 3d hand localization and reconstruction in-the-wild. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 12242–12254, 2025.

[16]

J. Romero, D. Tzionas, and M. J. Black. Embodied hands: modeling and capturing hands and bodies together. ACM Transactions on Graphics, 36(6):1–17, November 2017. doi:10.1145/3130800.3130883

[17]

M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. P. Foster, P. R. Sanketi, Q. Vuong, et al. Openvla: An open-source vision-language-action model. In Conference on Robot Learning, pages 2679–2713. PMLR, 2025.

[18]

S. Liu, L. Wu, B. Li, H. Tan, H. Chen, Z. Wang, K. Xu, H. Su, and J. Zhu. Rdt-1b: a diffusion foundation model for bimanual manipulation. In International Conference on Learning Representations, volume 2025, pages 29982–30009, 2025.

[19]

Q. Li, Y. Liang, Z. Wang, L. Luo, X. Chen, M. Liao, F. Wei, Y. Deng, S. Xu, Y. Zhang, et al. Cogact: A foundational vision-language-action model for synergizing cognition and action in robotic manipulation. arXiv preprint arXiv:2411.19650, 2024.

[20]

J. Zheng, J. Li, D. Liu, Y. Zheng, Z. Wang, Z. Ou, Y. Liu, J. Liu, Y.-Q. Zhang, and X. Zhan. Universal actions for enhanced embodied foundation models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22508–22519, 2025.

[21]

S. Ye, J. Jang, B. Jeon, S. J. Joo, J. Yang, B. Peng, A. Mandlekar, R. Tan, Y.-W. Chao, B. Y. Lin, L. Liden, K. Lee, J. Gao, L. Zettlemoyer, D. Fox, and M. Seo. Latent action pretraining from videos. In International Conference on Learning Representations, 2025.

[22]

S. Liu, B. Li, K. Ma, L. Wu, H. Tan, X. Ouyang, H. Su, and J. Zhu. Rdt2: Exploring the scaling limit of umi data towards zero-shot cross-embodiment generalization. arXiv preprint arXiv:2602.03310, 2026.

[23]

D. Qu, H. Song, Q. Chen, Y. Yao, X. Ye, J. Gu, Z. Wang, Y. Ding, B. Zhao, D. Wang, and X. Li. SpatialVLA: Exploring spatial representations for visual-language-action models. In Proceedings of Robotics: Science and Systems, Los Angeles, CA, USA, June 2025. doi:10.15607/RSS.2025.XXI.011

[24]

H. Zhen, X. Qiu, P. Chen, J. Yang, X. Yan, Y. Du, Y. Hong, and C. Gan. 3D-VLA: A 3D vision-language-action generative world model. In R. Salakhutdinov, Z. Kolter, K. Heller, A. Weller, N. Oliver, J. Scarlett, and F. Berkenkamp, editors, Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Re... 2024

[25]

R. Zheng, Y. Liang, S. Huang, J. Gao, H. Daumé III, A. Kolobov, F. Huang, and J. Yang. TraceVLA: Visual trace prompting enhances spatial-temporal awareness for generalist robotic policies. In International Conference on Learning Representations, 2025.

[26]

K. Grauman et al. Ego4d: Around the world in 3,000 hours of egocentric video. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18973–18990. IEEE, June 2022. doi:10.1109/cvpr52688.2022.01842

[27]

D. Damen, H. Doughty, G. M. Farinella, A. Furnari, E. Kazakos, J. Ma, D. Moltisanti, J. Munro, T. Perrett, W. Price, and M. Wray. Rescaling egocentric vision: Collection, pipeline and challenges for EPIC-KITCHENS-100. International Journal of Computer Vision, 130(1):33–55, 2022. doi:10.1007/s11263-021-01531-2

[28]

K. Grauman, A. Westbury, L. Torresani, K. Kitani, J. Malik, T. Afouras, K. Ashutosh, V. Baiyya, S. Bansal, B. Boote, E. Byrne, Z. Chavis, J. Chen, F. Cheng, F.-J. Chu, et al. Ego-Exo4D: Understanding skilled human activity from first- and third-person perspectives. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages... 2024

[29]

R. Hoque, P. Huang, D. J. Yoon, M. Sivapurapu, and J. Zhang. EgoDex: Learning dexterous manipulation from large-scale egocentric video. arXiv preprint arXiv:2505.11709, 2025. doi:10.48550/arXiv.2505.11709

[30]

R. Zheng, D. Niu, Y. Xie, J. Wang, M. Xu, Y. Jiang, F. Castañeda, F. Hu, Y. L. Tan, L. Fu, T. Darrell, F. Huang, Y. Zhu, D. Xu, and L. Fan. EgoScale: Scaling dexterous manipulation with diverse egocentric human data. arXiv preprint arXiv:2602.16710,

[31]

doi:10.48550/arXiv.2602.16710

[32]

S. Nair, A. Rajeswaran, V. Kumar, C. Finn, and A. Gupta. R3M: A universal visual representation for robot manipulation. In K. Liu, D. Kulic, and J. Ichnowski, editors, Proceedings of The 6th Conference on Robot Learning, volume 205 of Proceedings of Machine Learning Research, pages 892–909. PMLR, 14–18 Dec 2023. URL https://proceedings.mlr.press/v205/nair23a.html

[33]

Y. J. Ma, S. Sodhani, D. Jayaraman, O. Bastani, V. Kumar, and A. Zhang. VIP: Towards universal visual reward and representation via value-implicit pre-training. In International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=YJ7o2wetJ2

[34]

Y. J. Ma, V. Kumar, A. Zhang, O. Bastani, and D. Jayaraman. LIV: Language-image representations and rewards for robotic control. In A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett, editors, Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 23301... 2023

[35]

T. Xiao, I. Radosavovic, T. Darrell, and J. Malik. Masked visual pre-training for motor control. arXiv preprint arXiv:2203.06173, 2022. URL https://arxiv.org/abs/2203.06173

[36]

A. Majumdar, K. Yadav, S. Arnaud, Y. J. Ma, C. Chen, S. Silwal, A. Jain, V.-P. Berges, T. Wu, J. Vakil, P. Abbeel, J. Malik, D. Batra, Y. Lin, O. Maksymets, A. Rajeswaran, and F. Meier. Where are we in the search for an artificial visual cortex for embodied intelligence? In Advances in Neural Information Processing Systems, volume 36, 2023. URL https://p...

[37]

S. Karamcheti, S. Nair, A. S. Chen, T. Kollar, C. Finn, D. Sadigh, and P. Liang. Language-driven representation learning for robotics. In Robotics: Science and Systems, 2023. URL https://arxiv.org/abs/2302.12766

[38]

K. Q. Lin, J. Wang, M. Soldan, M. Wray, R. Yan, E. Z. Xu, D. Gao, R.-C. Tu, W. Zhao, W. Kong, C. Cai, H. Wang, D. Damen, B. Ghanem, W. Liu, and M. Z. Shou. Egocentric video-language pretraining. In Advances in Neural Information Processing Systems, volume 35, 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/hash/31fb284a0aaaad837d2930a610c...

[39]

Y. Zhao, I. Misra, P. Krähenbühl, and R. Girdhar. Learning video representations from large language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6586–6597, June 2023.

[40]

J. Li, Y. Zhu, Y. Xie, Z. Jiang, M. Seo, G. Pavlakos, and Y. Zhu. OKAMI: Teaching humanoid robots manipulation skills through single video imitation. In P. Agrawal, O. Kroemer, and W. Burgard, editors, Proceedings of The 8th Conference on Robot Learning, volume 270 of Proceedings of Machine Learning Research, pages 299–317. PMLR, 2025.

[41]

M. Lepert, J. Fang, and J. Bohg. Phantom: Training robots without robots using only human videos. In J. Lim, S. Song, and H.-W. Park, editors, Proceedings of The 9th Conference on Robot Learning, volume 305 of Proceedings of Machine Learning Research, pages 4545–4565. PMLR, 2025.

[42]

L. Y. Zhu, P. Kuppili, R. Punamiya, P. Aphiwetsa, D. Patel, S. Kareer, S. Ha, and D. Xu. EMMA: scaling mobile manipulation via egocentric human data. IEEE Robotics Autom. Lett., 11(3):3087–3094, 2026. doi:10.1109/LRA.2026.3653320. URL https://doi.org/10.1109/LRA.2026.3653320

[43]

G. Li, Y. Lyu, Z. Liu, C. Hou, Y. Xu, J. Zhang, and S. Zhang. H2R: A human-to-robot data augmentation for robot pre-training from videos. arXiv preprint arXiv:2505.11920, 2025. doi:10.48550/arXiv.2505.11920

[44]

V. Liu, A. Adeniji, D. Zhan, S. Haldar, R. Bhirangi, P. Abbeel, and L. Pinto. Egozero: Robot learning from smart glasses. arXiv preprint arXiv:2505.20290, 2025. doi:10.48550/arXiv.2505.20290

[45]

J. Shi, Z. Zhao, T. Wang, I. Pedroza, A. Luo, J. Wang, J. Ma, and D. Jayaraman. Zeromimic: Distilling robotic manipulation skills from web videos. In 2025 IEEE International Conference on Robotics and Automation, pages 16939–16947. IEEE,

[46]

doi:10.1109/ICRA55743.2025.11128283

[47]

Y. Chen, Y. Ge, H. Zhou, M. Ding, Y. Ge, and X. Liu. Dial: Decoupling intent and action via latent world modeling for end-to-end vla. arXiv preprint arXiv:2603.29844, 2026.

[48]

Y. Zhou, C. Barnes, J. Lu, J. Yang, and H. Li. On the continuity of rotation representations in neural networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5745–5753, 2019.

[49]

D. Damen, H. Doughty, G. M. Farinella, S. Fidler, A. Furnari, E. Kazakos, D. Moltisanti, J. Munro, T. Perrett, W. Price, and M. Wray. The epic-kitchens dataset: Collection, challenges and baselines. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(11):4125–4141, November 2021. doi:10.1109/tpami.2020.2991965

[50]

Y. Liu, Y. Liu, C. Jiang, K. Lyu, W. Wan, H. Shen, B. Liang, Z. Fu, H. He, and H. Dong. Hoi4d: A 4d egocentric dataset for category-level human-object interaction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.

[51]

Ropedia. Xperience-10m: A large-scale egocentric multimodal dataset with structured 3d/4d annotations, 2026. Dataset.
