ACE-Ego-0: Unifying Egocentric Human and Robotic Data for VLA Pretraining
Pith reviewed 2026-06-27 03:21 UTC · model grok-4.3
The pith
A unified VLA framework turns egocentric human videos into pseudo robot actions and uses reliability-aware weighting to improve pretraining.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that adding 1.48K hours of pseudo-action-labeled egocentric human data to 4.53K hours of robot and simulation data consistently improves results after both unified joint pretraining and supervised fine-tuning. ACE-EGO-0 makes the two data sources comparable through a unified camera-space action representation with morphology conditioning and time-aligned chunking, and it handles label noise with a reliability-aware training objective plus a human auxiliary loss. The resulting model is reported to reach state-of-the-art performance on RoboCasa GR1 TableTop and RoboTwin 2.0 and to transfer to real-world bimanual manipulation.
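To make "time-aligned chunking" concrete, here is a minimal sketch under stated assumptions. The paper's actual procedure is not given in this summary, so the function name `time_aligned_chunk`, the linear interpolation, and the one-second horizon are all hypothetical. The idea illustrated is only that action chunks are sampled on a shared wall-clock grid, so that a 30 Hz human clip and a 10 Hz robot log yield chunks with the same temporal meaning.

```python
import numpy as np

def time_aligned_chunk(timestamps, actions, t_start, horizon_s, num_steps):
    """Resample a trajectory onto a fixed time grid (hypothetical helper).

    timestamps: (T,) increasing sample times in seconds.
    actions:    (T, D) action values recorded at those times.
    Returns a (num_steps, D) chunk covering [t_start, t_start + horizon_s).
    """
    timestamps = np.asarray(timestamps, dtype=float)
    actions = np.asarray(actions, dtype=float)
    # Query times are evenly spaced in wall-clock time, not in frame index.
    query = t_start + np.arange(num_steps) * (horizon_s / num_steps)
    # Linear interpolation per action dimension; np.interp clamps at the ends.
    return np.stack(
        [np.interp(query, timestamps, actions[:, d]) for d in range(actions.shape[1])],
        axis=1,
    )

# A 30 Hz "human" clip and a 10 Hz "robot" log of the same 1-D ramp motion.
human_t = np.arange(0, 2.0, 1 / 30)
robot_t = np.arange(0, 2.0, 1 / 10)
human_chunk = time_aligned_chunk(human_t, human_t[:, None] * 0.5, 0.0, 1.0, 8)
robot_chunk = time_aligned_chunk(robot_t, robot_t[:, None] * 0.5, 0.0, 1.0, 8)
```

Both chunks have shape `(8, 1)` and describe the same one-second motion despite the different source frame rates. Linear interpolation is only reasonable for translation-like dimensions; rotations would need a proper interpolation scheme, which this sketch does not cover.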
What carries the argument
The reliability-aware training objective with human auxiliary loss that concentrates supervision on reliable pseudo-action signals from human videos after conversion via the egocentric video-to-action pipeline.
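One way such an objective could look is sketched below. This is an illustration under assumptions, not the paper's formulation: the per-sample reliability scores, the weighted-average form of the human term, and the `aux_weight` scalar are all hypothetical, and the paper's human auxiliary loss may be defined differently.

```python
import numpy as np

def reliability_weighted_loss(robot_err, human_err, reliability, aux_weight=0.5):
    """Combine robot and human per-sample losses (hypothetical form).

    robot_err:   (Nr,) per-sample action loss on robot data.
    human_err:   (Nh,) per-sample action loss on pseudo-labeled human data.
    reliability: (Nh,) scores in [0, 1]; 0 means "ignore this sample".
    aux_weight:  assumed scalar weight on the human term.
    """
    robot_err = np.asarray(robot_err, dtype=float)
    human_err = np.asarray(human_err, dtype=float)
    w = np.clip(np.asarray(reliability, dtype=float), 0.0, 1.0)
    robot_term = robot_err.mean()
    # Normalize by total weight so the human term is a weighted average;
    # the small epsilon avoids division by zero when every weight is 0.
    human_term = (w * human_err).sum() / (w.sum() + 1e-8)
    return robot_term + aux_weight * human_term

# The unreliable human sample (weight 0) contributes nothing to the loss.
loss = reliability_weighted_loss([1.0, 3.0], [2.0, 10.0], [1.0, 0.0])
```

The design point the sketch tries to show is that a badly labeled human clip with a large error cannot dominate the gradient, because its weight scales its contribution down to zero.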
If this is right
- Joint pretraining that adds the weighted human data improves performance over robot-only baselines.
- The same weighted human signals also improve results after supervised fine-tuning.
- The resulting model reaches state-of-the-art scores on RoboCasa GR1 TableTop.
- The resulting model reaches state-of-the-art scores on RoboTwin 2.0.
- The pretrained model transfers effectively to real-world bimanual manipulation tasks.
Where Pith is reading between the lines
- The method points toward training pipelines that can draw on far larger volumes of everyday human video instead of depending primarily on robot-collected trajectories.
- Similar conversion and weighting steps could be tested on other sources of human movement data such as third-person videos or motion-capture archives.
- If the reliability weighting proves robust, it may allow incremental addition of new noisy data sources without retraining the entire model from scratch.
Load-bearing premise
The pseudo-action trajectories extracted from egocentric human videos remain useful and comparable to real robot demonstrations once placed in the unified camera-space representation with morphology conditioning.
What would settle it
An ablation that trains the same VLA architecture on the 4.53K hours of robot and simulation data alone and compares it with the version that adds the 1.48K hours of weighted human pseudo-actions. If adding the human data yields no improvement, or a drop, on the RoboCasa GR1 TableTop and RoboTwin 2.0 benchmarks, the core claim fails.
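A comparison of this kind is only informative if the gap is larger than evaluation noise. The sketch below shows one simple way to check that, using a normal-approximation confidence interval on the difference between two success rates. The function name and the episode counts are made up for illustration and are not results from the paper.

```python
import math

def success_rate_gap(successes_a, trials_a, successes_b, trials_b, z=1.96):
    """Difference in success rate (b minus a) with a normal-approximation CI."""
    p_a = successes_a / trials_a
    p_b = successes_b / trials_b
    se = math.sqrt(p_a * (1 - p_a) / trials_a + p_b * (1 - p_b) / trials_b)
    gap = p_b - p_a
    return gap, (gap - z * se, gap + z * se)

# Made-up counts purely for illustration: a = robot-only, b = robot + human.
gap, (lo, hi) = success_rate_gap(successes_a=120, trials_a=200,
                                 successes_b=140, trials_b=200)
# The interval excludes zero only if the gap is clear at this sample size.
human_data_helps = lo > 0
```

With few evaluation episodes the interval widens quickly, which is why reported gains without trial counts or error bars are hard to interpret.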
Original abstract
Vision-Language-Action (VLA) models benefit from large-scale and diverse embodied data, yet scaling robot trajectory collection is costly and labor-intensive. Recent advances show that large-scale egocentric human videos provide complementary real-world supervision in pretraining. However, joint training on human and robot data remains challenging due to divergences in action spaces, embodiment structures, temporal dynamics, and supervision quality. We introduce ACE-EGO-0, a unified VLA pretraining framework jointly leveraging heterogeneous data sources. To extract large-scale pretraining supervision from egocentric human videos, we build a scalable egocentric video-to-action pipeline that converts raw human videos into robot-format pseudo-action trajectories. To make these labels comparable with robot demonstrations, ACE-EGO-0 uses a unified action representation based on camera-space actions, morphology conditioning, and time-aligned action chunking. To robustly leverage noisy pseudo-action supervision from egocentric human videos, we formulate a reliability-aware training objective with a human auxiliary loss that concentrates supervision on reliable signals. We instantiate ACE-EGO-0 on 4.53K hours of robot and simulation data, together with 1.48K hours of pseudo-action-labeled egocentric human data. Experiments show that incorporating large-scale human supervision under reliability-aware weighting consistently improves both unified joint pretraining and supervised fine-tuning. ACE-EGO-0 achieves state-of-the-art performance on RoboCasa GR1 TableTop and RoboTwin 2.0, while demonstrating strong transfer to real-world bimanual manipulation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ACE-EGO-0, a unified VLA pretraining framework for jointly leveraging robot demonstrations and egocentric human videos. It describes a scalable egocentric video-to-action pipeline that converts human videos into robot-format pseudo-action trajectories, a unified action representation based on camera-space actions, morphology conditioning, and time-aligned chunking, and a reliability-aware training objective with a human auxiliary loss to focus on reliable signals. The model is instantiated on 4.53K hours of robot/simulation data plus 1.48K hours of pseudo-labeled human data, with claims that human supervision under this weighting consistently improves joint pretraining and fine-tuning, achieving SOTA on RoboCasa GR1 TableTop and RoboTwin 2.0 while showing real-world bimanual transfer.
Significance. If the empirical results hold and the pseudo-action signals are validated as useful, the work could meaningfully advance scalable VLA pretraining by demonstrating how to incorporate abundant human video data alongside scarce robot trajectories. The reliability-aware objective and unified representation address practical challenges in heterogeneous embodiment and noisy supervision, with potential to reduce dependence on expensive robot data collection.
major comments (3)
- [Abstract] Abstract: The claims of 'consistent improvements' and 'state-of-the-art performance' on RoboCasa GR1 TableTop and RoboTwin 2.0 are presented without any quantitative metrics, baseline comparisons, ablation results, or error bars, which is load-bearing for evaluating whether gains arise from the human data rather than architecture or data volume alone.
- [Video-to-action pipeline] Egocentric video-to-action pipeline: No quantitative checks on pseudo-action fidelity are described (e.g., robot execution success rates of the converted trajectories or correlation with ground-truth robot actions), which is required to substantiate that the human supervision survives conversion to unified camera-space representation and morphology conditioning and contributes under the reliability-aware objective.
- [Reliability-aware training objective] Reliability-aware training objective: The formulation of the human auxiliary loss and weighting mechanism lacks ablations isolating its contribution versus the joint architecture or extra data volume; without these, it is unclear whether the reported gains on the two benchmarks are attributable to selective amplification of reliable human signals.
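The second major comment asks for a correlation between converted trajectories and ground-truth robot actions. A minimal sketch of such a check follows, assuming paired `(T, D)` trajectories are available; the function name is hypothetical and the paper does not describe any such evaluation.

```python
import numpy as np

def per_dim_correlation(pseudo, reference):
    """Pearson correlation per action dimension between two (T, D) trajectories."""
    pseudo = np.asarray(pseudo, dtype=float)
    reference = np.asarray(reference, dtype=float)
    p = pseudo - pseudo.mean(axis=0)
    r = reference - reference.mean(axis=0)
    denom = np.sqrt((p ** 2).sum(axis=0) * (r ** 2).sum(axis=0))
    # Dimensions with zero variance have undefined correlation; report NaN.
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(denom > 0, (p * r).sum(axis=0) / denom, np.nan)

# Synthetic check: one dimension is a scaled copy, the other is sign-flipped.
t = np.linspace(0.0, 1.0, 50)
reference = np.stack([t, np.sin(2 * np.pi * t)], axis=1)
pseudo = np.stack([2 * t + 1, -np.sin(2 * np.pi * t)], axis=1)
corr = per_dim_correlation(pseudo, reference)
```

Correlation is insensitive to scale and offset, so a high value shows that the pseudo-actions follow the right shape but not that their magnitudes are correct. A fidelity report would likely pair it with an absolute error metric.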
minor comments (2)
- [Unified action representation] The exact definitions of camera-space actions and the morphology conditioning mechanism would benefit from explicit equations or pseudocode for reproducibility.
- [Data] Data sources and collection details for the 4.53K hours of robot/simulation data and 1.48K hours of human data should be expanded to support replication.
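As an example of the kind of explicit definition the first minor comment requests, the sketch below expresses an end-effector or wrist pose in the camera frame using standard rigid-transform algebra. It assumes 4x4 homogeneous transforms that map local coordinates to world coordinates; whether the paper defines camera-space actions this way is not stated in this summary, and morphology conditioning is not covered.

```python
import numpy as np

def to_camera_frame(T_world_cam, T_world_ee):
    """Express an end-effector (or wrist) pose in the camera frame.

    Both inputs are 4x4 homogeneous transforms mapping local coordinates
    to world coordinates. Returns T_cam_ee = inv(T_world_cam) @ T_world_ee.
    """
    T_world_cam = np.asarray(T_world_cam, dtype=float)
    T_world_ee = np.asarray(T_world_ee, dtype=float)
    R = T_world_cam[:3, :3]
    t = T_world_cam[:3, 3]
    # Closed-form inverse of a rigid transform: rotation R^T, translation -R^T t.
    T_cam_world = np.eye(4)
    T_cam_world[:3, :3] = R.T
    T_cam_world[:3, 3] = -R.T @ t
    return T_cam_world @ T_world_ee

# Camera at world position (1, 0, 0) with identity rotation;
# end effector at world position (1, 0, 2).
T_world_cam = np.eye(4)
T_world_cam[:3, 3] = [1.0, 0.0, 0.0]
T_world_ee = np.eye(4)
T_world_ee[:3, 3] = [1.0, 0.0, 2.0]
T_cam_ee = to_camera_frame(T_world_cam, T_world_ee)
```

Here the end effector ends up two units along the camera's z axis. Expressing poses relative to the camera removes the dependence on each dataset's world or robot-base frame, which is presumably what makes human and robot trajectories comparable.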
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will incorporate revisions to strengthen the presentation of results, validation of the pipeline, and ablations for the training objective.
- Referee: [Abstract] Abstract: The claims of 'consistent improvements' and 'state-of-the-art performance' on RoboCasa GR1 TableTop and RoboTwin 2.0 are presented without any quantitative metrics, baseline comparisons, ablation results, or error bars, which is load-bearing for evaluating whether gains arise from the human data rather than architecture or data volume alone.
  Authors: We agree that the abstract would be strengthened by including specific quantitative results. In the revised manuscript, we will add key metrics (e.g., success rates with error bars), baseline comparisons, and references to ablations to better substantiate the claims of consistent improvements from human data and SOTA performance.
  Revision: yes
- Referee: [Video-to-action pipeline] Egocentric video-to-action pipeline: No quantitative checks on pseudo-action fidelity are described (e.g., robot execution success rates of the converted trajectories or correlation with ground-truth robot actions), which is required to substantiate that the human supervision survives conversion to unified camera-space representation and morphology conditioning and contributes under the reliability-aware objective.
  Authors: The manuscript validates the pipeline through downstream task improvements when including the pseudo-labeled human data. However, we acknowledge the value of direct fidelity checks. We will add quantitative evaluations, including execution success rates on held-out converted trajectories and correlation analyses with available ground-truth where possible, in a new subsection of the revised version.
  Revision: yes
- Referee: [Reliability-aware training objective] Reliability-aware training objective: The formulation of the human auxiliary loss and weighting mechanism lacks ablations isolating its contribution versus the joint architecture or extra data volume; without these, it is unclear whether the reported gains on the two benchmarks are attributable to selective amplification of reliable human signals.
  Authors: We will expand the experiments section with targeted ablations that isolate the reliability-aware objective and human auxiliary loss. These will include comparisons to joint training without the weighting mechanism and to data-volume-matched baselines, to clarify the contribution of selective amplification of reliable signals.
  Revision: yes
Circularity Check
No significant circularity; empirical framework relies on external data and benchmarks.
The paper presents an empirical VLA pretraining approach that converts egocentric videos to pseudo-actions, applies a unified camera-space representation with morphology conditioning, and uses a reliability-aware loss to weight human supervision. No equations, derivations, or 'predictions' are described that reduce by construction to fitted inputs or self-citations. The claimed gains on RoboCasa and RoboTwin benchmarks are evaluated against external test sets rather than internal redefinitions, so the chain is self-contained and does not exhibit any of the enumerated circularity patterns.