arxiv: 2606.20521 · v1 · pith:6WXUQCOOnew · submitted 2026-06-18 · 💻 cs.CV

HumanScale: Egocentric Human Video Can Outperform Real-Robot Data for Embodied Pretraining

Pith reviewed 2026-06-26 17:50 UTC · model grok-4.3

classification 💻 cs.CV
keywords egocentric human video · embodied pretraining · foundation models · robot learning · action prediction · task success · data scaling · teleoperated trajectories

The pith

Egocentric human video outperforms teleoperated robot trajectories for embodied model pretraining.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reports that egocentric human video, after a purpose-built filtering and labeling pipeline, yields better pretrained models than an equal volume of teleoperated real-robot trajectories. The comparison holds the post-training and validation protocols fixed, so differences are attributed to the pretraining source. This matters because robot data is costly and scarce while human video is abundant and diverse, which suggests a route to scaling embodied foundation models without proportional increases in robot data collection. The result points to a two-stage approach: pretrain on human video for broad world representations, then adapt with limited robot data for action alignment.

Core claim

With the same amount of pretraining data, models pretrained on egocentric data achieve a 24% lower validation loss on real-robot action prediction, as well as 52.5% and 90% higher success rates on in-distribution and out-of-distribution real-robot task execution, respectively. This verifies a scalable paradigm for embodied foundation models: pretrain on egocentric human video to learn diverse world representations, then adapt with a small amount of labeled real-robot data for action-space alignment.
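The abstract does not say whether the success-rate gains are relative improvements or absolute (percentage-point) differences, and the two readings can diverge widely. A minimal sketch of both calculations follows; the function names and every numeric input are hypothetical illustrations, not values taken from the paper.

```python
def relative_change(baseline: float, new: float) -> float:
    """Fractional change of `new` relative to `baseline` (negative means lower)."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (new - baseline) / baseline


def percentage_point_change(baseline_rate: float, new_rate: float) -> float:
    """Absolute difference between two rates in [0, 1], in percentage points."""
    return (new_rate - baseline_rate) * 100.0


# Hypothetical numbers chosen only to illustrate the arithmetic.
robot_loss, ego_loss = 1.00, 0.76
robot_success, ego_success = 0.40, 0.61

loss_change = relative_change(robot_loss, ego_loss)                   # about -0.24
success_relative = relative_change(robot_success, ego_success)        # about +0.525
success_points = percentage_point_change(robot_success, ego_success)  # about 21
```

With these made-up inputs the same pair of success rates reads as "52.5% higher" in relative terms but only 21 percentage points higher in absolute terms, which is why the convention needs to be stated alongside the headline numbers.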

What carries the argument

The filtering and labeling pipeline that converts raw egocentric human video into pretraining data aligned for embodied action prediction.

If this is right

  • Egocentric pretraining learns more diverse world representations than robot-trajectory pretraining.
  • A small amount of labeled real-robot data suffices afterward to align the action space.
  • The paradigm reduces dependence on high-cost, low-diversity robot data collection.
  • The study supplies guidance for assessing data quality before committing to robot data gathering.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The out-of-distribution gains imply that human behavioral variety transfers better to novel robot environments than robot trajectories do.
  • If the pipeline proves reusable, the same human-video source could support pretraining across multiple robot embodiments without new collection.
  • Scaling the human-video volume further while keeping robot adaptation data fixed could widen the observed gap.

Load-bearing premise

The filtering and labeling pipeline applied to egocentric human video is neutral with respect to the downstream evaluation tasks and does not confer an unfair advantage relative to the raw teleoperated robot trajectories.

What would settle it

Running the identical downstream evaluation on models pretrained with the same human videos but without the described filtering and labeling pipeline, and finding no performance advantage over the robot-data baseline, would falsify the central claim.

Original abstract

Embodied foundation models are expected to benefit from data scaling like large language models, but face a much tighter data bottleneck. Teleoperated real-robot trajectories remain the dominant pretraining source due to their precise action supervision and embodiment alignment, yet their scalability is limited by high collection cost, acquisition difficulty, and low behavioral and environmental diversity. These limitations have sparked interest in egocentric human video as a scalable, substantially lower-cost, and more diverse alternative for embodied model pretraining. However, its effectiveness compared to teleoperated real-robot data remains underexplored. To address this question, we conduct a systematic study comparing egocentric human video and teleoperated real-robot trajectories as pretraining data sources for embodied foundation models, under fixed post-training and validation protocols. Surprisingly, we find that egocentric data, when processed through a carefully designed filtering and labeling pipeline, is not merely a viable substitute for model pretraining but can lead to superior performance. With the same amount of pretraining data, models pretrained on egocentric data achieve a 24% lower validation loss on real-robot action prediction, as well as 52.5% and 90% higher success rates on in-distribution and out-of-distribution real-robot task execution, respectively. This finding verifies a scalable paradigm for embodied foundation models: pretrain on egocentric human video to learn diverse world representations, then adapt with a small amount of labeled real-robot data for action-space alignment. We hope this study encourages broader exploration of egocentric data and offers guidance for data quality assessment before costly robot data collection.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that, under fixed post-training and validation protocols, embodied foundation models pretrained on the same volume of egocentric human video (after a filtering and labeling pipeline) achieve a 24% lower validation loss on real-robot action prediction and 52.5%/90% higher success rates on in-distribution and out-of-distribution real-robot tasks than models pretrained on raw teleoperated robot trajectories. It concludes that human video is not merely a substitute but can be superior for learning diverse world representations before action-space adaptation with limited robot data.

Significance. If the reported gains are shown to arise from the data distribution rather than pipeline-induced selection effects, the result would be significant for embodied AI scaling: it would support a cheaper, higher-diversity pretraining paradigm that reduces reliance on costly robot teleoperation while still enabling strong downstream robot performance. The fixed-protocol design and quantitative comparisons are strengths that make the finding falsifiable and reproducible in principle.

major comments (3)
  1. [Abstract] Abstract: The central quantitative claims (24% lower loss, 52.5% and 90% higher success rates) are presented without error bars, number of runs, or statistical significance tests. Because these numbers are the primary evidence for the superiority claim, the absence of variance estimates leaves the magnitude and reliability of the gains difficult to assess.
  2. [Abstract] Abstract and methods: The human-video pipeline is described only as 'carefully designed filtering and labeling' with no enumeration of criteria (action mapping, quality thresholds, scene selection), no ablation of each step, and no statement that an identical pipeline was applied to the robot trajectories. This detail is load-bearing for the claim that gains derive from the egocentric distribution itself rather than post-hoc selection bias relative to raw robot data.
  3. [Results] Results (assumed §4 or equivalent): The comparison is between processed human data and raw robot data; without a control that applies the same filtering/labeling steps symmetrically to robot trajectories or an ablation isolating the pipeline's contribution, the reported outperformance cannot be unambiguously attributed to data source rather than processing asymmetry.
minor comments (2)
  1. Notation for action spaces and loss functions should be defined explicitly on first use to aid readers comparing the two data regimes.
  2. Figure captions for success-rate plots should state the number of evaluation episodes and whether error bars represent standard deviation or standard error.
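The requests for error bars and episode counts can be made concrete with a short sketch. The code below assumes each evaluation episode is an independent success/failure trial and uses the Wilson score interval, a standard choice for binomial proportions; the paper does not state which interval, if any, it uses, and the function names here are hypothetical.

```python
import math


def wilson_interval(successes: int, episodes: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial success rate (z=1.96 is roughly 95%)."""
    if episodes <= 0 or not 0 <= successes <= episodes:
        raise ValueError("need 0 <= successes <= episodes and episodes > 0")
    p = successes / episodes
    denom = 1.0 + z * z / episodes
    centre = (p + z * z / (2 * episodes)) / denom
    half = z * math.sqrt(p * (1 - p) / episodes + z * z / (4 * episodes ** 2)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the extremes.
    return max(0.0, centre - half), min(1.0, centre + half)


def standard_error(std_dev: float, n: int) -> float:
    """Standard error of the mean: standard deviation divided by sqrt(n)."""
    return std_dev / math.sqrt(n)


# Hypothetical example: 10 successes out of 20 evaluation episodes.
low, high = wilson_interval(10, 20)
```

The interval narrows as the number of episodes grows, so a success rate reported without its episode count cannot be compared fairly across conditions. The standard error differs from the standard deviation by a factor of sqrt(n), which is why captions need to say which one the bars show.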

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the careful review and constructive suggestions. The comments correctly identify areas where additional transparency and statistical rigor would strengthen the manuscript. We respond to each point below and will incorporate revisions to address the concerns about reporting and methodological clarity.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The central quantitative claims (24% lower loss, 52.5% and 90% higher success rates) are presented without error bars, number of runs, or statistical significance tests. Because these numbers are the primary evidence for the superiority claim, the absence of variance estimates leaves the magnitude and reliability of the gains difficult to assess.

    Authors: We agree that variance estimates are important for assessing reliability. In the revised manuscript we will report means and standard deviations across multiple random seeds for both validation loss and success rates, and include statistical significance tests (e.g., paired t-tests) comparing the two pretraining conditions. revision: yes

  2. Referee: [Abstract] Abstract and methods: The human-video pipeline is described only as 'carefully designed filtering and labeling' with no enumeration of criteria (action mapping, quality thresholds, scene selection), no ablation of each step, and no statement that an identical pipeline was applied to the robot trajectories. This detail is load-bearing for the claim that gains derive from the egocentric distribution itself rather than post-hoc selection bias relative to raw robot data.

    Authors: The methods section already enumerates the pipeline criteria, but the abstract is concise. We will expand the abstract to list the main filtering steps and add an explicit clarification that the pipeline is applied exclusively to human video because it requires action inference from visual observations; robot trajectories already contain direct action labels, so the same steps are neither applicable nor necessary. revision: yes

  3. Referee: [Results] Results (assumed §4 or equivalent): The comparison is between processed human data and raw robot data; without a control that applies the same filtering/labeling steps symmetrically to robot trajectories or an ablation isolating the pipeline's contribution, the reported outperformance cannot be unambiguously attributed to data source rather than processing asymmetry.

    Authors: We will add a dedicated paragraph in the discussion section explaining why symmetric application of the full pipeline is not meaningful (robot data already supplies precise actions). To further isolate effects we will include, where data permits, a quality-filtering ablation on the robot trajectories and report whether performance changes materially. revision: partial
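The paired t-test proposed in the first response can be sketched as follows. This is a minimal illustration, assuming the two pretraining conditions are evaluated on matched random seeds so that results pair up one-to-one; the function name and the sample values are hypothetical, not results from the paper.

```python
import math
import statistics


def paired_t_statistic(a: list[float], b: list[float]) -> tuple[float, int]:
    """Return (t, degrees_of_freedom) for paired samples a[i], b[i]."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = statistics.stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if sd == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    n = len(diffs)
    return statistics.mean(diffs) / (sd / math.sqrt(n)), n - 1


# Hypothetical per-seed success rates for the two pretraining conditions.
ego_runs = [0.62, 0.58, 0.65]
robot_runs = [0.41, 0.40, 0.43]
t_stat, dof = paired_t_statistic(ego_runs, robot_runs)
```

Turning the statistic into a p-value requires the Student t distribution with the returned degrees of freedom; `scipy.stats.ttest_rel` computes both in one call if SciPy is available. With only a few seeds the test has little power, so reporting the per-seed values alongside the statistic is the safer choice.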

Circularity Check

0 steps flagged

No circularity; empirical comparison is self-contained

Full rationale

The paper reports an empirical head-to-head comparison of pretraining data sources under fixed post-training and validation protocols. No equations, fitted parameters, or derivations are present that would reduce the reported loss reductions or success-rate gains to the input data by construction. The filtering/labeling pipeline is described as part of the human-video processing step, but the central claim is an observed performance difference rather than a mathematical identity or self-referential prediction. No self-citation chains or uniqueness theorems are invoked as load-bearing premises. This is the expected non-finding for a data-comparison study whose results remain externally falsifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical machine-learning comparison study. No new mathematical axioms, free parameters, or invented entities are introduced in the abstract.
