Learning to Move Before Learning to Do: Task-Agnostic pretraining for VLAs

Jingjing Gong; Junhao Shi; Li Ji; Siyin Wang; Xiaopeng Yu; Xipeng Qiu

arxiv: 2607.02466 · v1 · pith:MAWHW5A6new · submitted 2026-07-02 · 💻 cs.RO · cs.AI

Junhao Shi, Siyin Wang, Xiaopeng Yu, Li Ji, Jingjing Gong, Xipeng Qiu

Pith reviewed 2026-07-03 10:50 UTC · model grok-4.3

keywords: vision-language-action · task-agnostic pretraining · inverse dynamics · embodied AI · robot learning · behavior cloning · motor priors · SIMPLER benchmark

The pith

Task-agnostic pretraining on unlabeled robot interactions lets VLAs match models trained on over a million expert trajectories while using far less labeled data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims the data bottleneck in vision-language-action models comes from treating physical movement skills and task-specific language alignment as a single supervised problem. It separates these by first learning motor competence from abundant cheap unlabeled data such as off-task trajectories and robot play, using only a self-supervised inverse dynamics objective. A second lightweight stage then aligns the resulting priors to language instructions with minimal expert demonstrations. This decomposition matters because expert trajectories are expensive to collect at scale, so decoupling the objectives opens a route to higher performance with orders of magnitude less labeled data.

Core claim

Task-Agnostic Pretraining first acquires transferable motor priors from cheap unlabeled interaction data via a self-supervised Inverse Dynamics objective, then grounds those priors in language using minimal expert data. On the SIMPLER benchmark this matches the performance of models trained on over 1M expert trajectories while using orders of magnitude less labeled data, and it yields a 10% absolute gain over standard behavior cloning. On a real WidowX platform the method retains 25% success under camera perturbations where internet-scale baselines fall to 0%.

What carries the argument

Task-Agnostic Pretraining (TAP), a two-stage framework whose first stage learns motor priors through self-supervised inverse dynamics on unlabeled robot trajectories.
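
As a rough illustration of the inverse dynamics objective named above, the sketch below fits a model that predicts the logged action from a pair of consecutive observations. Nothing here is taken from the paper: the toy linear dynamics, the least-squares model, the mean-squared-error loss, and every name (`inverse_dynamics_loss`, `linear_idm`, `B`) are assumptions chosen to keep the example small.

```python
import numpy as np

def inverse_dynamics_loss(predict_action, obs_t, obs_next, actions):
    """Mean squared error between predicted and logged actions.

    predict_action maps (obs_t, obs_next) to predicted actions.
    No language instruction or task label enters the loss.
    """
    pred = predict_action(obs_t, obs_next)
    return float(np.mean((pred - actions) ** 2))

# Hypothetical "robot play" data with toy dynamics: obs_next = obs_t + actions @ B.
rng = np.random.default_rng(0)
B = rng.normal(size=(2, 4))            # 2-D actions, 4-D observations (assumed sizes)
obs_t = rng.normal(size=(256, 4))
actions = rng.normal(size=(256, 2))
obs_next = obs_t + actions @ B

# Fit a linear inverse-dynamics model on [obs_t, obs_next] by least squares.
features = np.concatenate([obs_t, obs_next], axis=1)
W, *_ = np.linalg.lstsq(features, actions, rcond=None)

def linear_idm(o, o_next):
    """Predict the action that moved the system from o to o_next."""
    return np.concatenate([o, o_next], axis=1) @ W

def zero_baseline(o, o_next):
    """Untrained reference that always predicts a zero action."""
    return np.zeros((o.shape[0], 2))

loss_fit = inverse_dynamics_loss(linear_idm, obs_t, obs_next, actions)
loss_zero = inverse_dynamics_loss(zero_baseline, obs_t, obs_next, actions)
print(f"fitted model loss: {loss_fit:.2e}, zero baseline loss: {loss_zero:.2f}")
```

The toy setup only shows that the supervision signal (the action) comes from the robot's own logs, so no instruction or task label is needed. TAP's actual encoder, action representation, and loss may differ.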

If this is right

Physical representations learned from unlabeled data transfer across tasks and remain robust under camera changes.
Discarded off-task trajectories and autonomous robot play become useful training resources instead of waste.
Performance improves by scaling cheap unlabeled data rather than by collecting more costly expert demonstrations.
Real-world success rates stay positive under distribution shift where purely supervised baselines reach zero.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Large-scale unlabeled interaction datasets collected across many robots could further strengthen the motor priors without added labeling cost.
The separation of control learning from goal specification may apply to other sequential decision domains that currently rely on fully supervised trajectories.
Motor priors from inverse dynamics could be combined with video-only pretraining to bootstrap competence even before any robot interaction occurs.

Load-bearing premise

Physical competence for moving can be learned effectively from unlabeled interaction data alone without any language or task supervision.

What would settle it

A controlled ablation would settle it: remove the pretraining stage and train the model only on the same small expert set. If the full TAP model shows no gain over this ablated model on the SIMPLER benchmark, the first stage adds no value.

Original abstract

Vision-Language-Action (VLA) models are fundamentally bottlenecked by the scarcity of expert demonstrations -- triplets of observations, instructions, and actions that are costly to collect at scale. We argue that this bottleneck stems from conflating two distinct learning objectives: acquiring physical competence (how to move) and acquiring semantic alignment (what to do). Crucially, only the latter requires language supervision. Building on this Decomposition Hypothesis, we propose Task-Agnostic Pretraining (TAP), a two-stage framework that first learns transferable motor priors from cheap, unlabeled interaction data -- including discarded off-task trajectories and autonomous robot play -- via a self-supervised Inverse Dynamics objective. A lightweight second stage then grounds these priors in language using minimal expert data. On the SIMPLER benchmark, TAP matches models trained on over 1M expert trajectories while using orders of magnitude less labeled data, yielding a 10% absolute gain over standard behavior cloning. On a real-world WidowX platform, TAP retains 25% success under camera perturbations where internet-scale baselines collapse to 0%, demonstrating that task-agnostic pretraining produces robust, transferable physical representations and offers a scalable path forward for Embodied AI.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

TAP separates motor pretraining via inverse dynamics on unlabeled trajectories from later language grounding and reports matching large-scale VLA results with far less expert data.

The core claim is that robot physical competence can be learned separately from semantic alignment, so the first stage uses self-supervised inverse dynamics on cheap unlabeled and off-task trajectories while the second stage adds language with minimal expert demos. This produces the reported 10% gain over behavior cloning on SIMPLER and the 25% success retention under camera shift on WidowX where other baselines drop to zero.

The explicit two-stage framing and the decision to recycle discarded trajectories are the clearest new pieces. Treating motor priors as task-agnostic and pre-trainable without language is a practical move that directly targets the data scarcity problem in VLAs. The real-robot robustness result is the strongest concrete evidence offered.

The load-bearing part is whether the inverse-dynamics stage actually isolates transferable high-level competence rather than low-level dynamics that happen to help on these particular tasks. Without ablations that isolate the pretraining effect or representation comparisons, it is hard to know how much of the gain traces to the decomposition versus other factors like training schedule or architecture tweaks. The WidowX result is also on a single platform, so the generalization claim stays narrow.

This is for groups working on scaling embodied models under tight data budgets. The idea is straightforward to test and the efficiency numbers are large enough to matter, so it deserves referee time even if the experiments will need tightening on controls and ablations.

Referee Report

3 major / 3 minor

Summary. The manuscript proposes Task-Agnostic Pretraining (TAP) for Vision-Language-Action (VLA) models based on a Decomposition Hypothesis that physical competence ('how to move') can be acquired separately from semantic alignment ('what to do') via self-supervised inverse dynamics on unlabeled/off-task trajectories, followed by a lightweight language-grounding stage using minimal expert data. On the SIMPLER benchmark, TAP is reported to match models trained on >1M expert trajectories while using orders of magnitude less labeled data and yielding a 10% absolute gain over standard behavior cloning; on a real WidowX platform it retains 25% success under camera perturbations where baselines drop to 0%.

Significance. If the empirical claims and the separability of motor priors are substantiated, the work would offer a concrete path to scaling VLAs by exploiting cheap unlabeled interaction data, potentially reducing reliance on costly expert demonstrations. The reported robustness under camera shift would be a notable strength if shown to stem from the task-agnostic pretraining rather than other factors.

major comments (3)

[Abstract, §3] Abstract and §3 (method): The Decomposition Hypothesis is load-bearing for the central claim that inverse-dynamics pretraining isolates transferable physical competence, yet the manuscript provides no representation-similarity metrics, probing experiments, or controlled ablations comparing the pre-trained encoder against an end-to-end baseline to demonstrate that the self-supervised stage indeed captures high-level motor priors rather than low-level dynamics.
[§5] §5 (experiments): The 10% absolute gain over behavior cloning and the match to 1M-trajectory models on SIMPLER are the primary quantitative results; however, the text does not report the precise number of expert trajectories used in the second-stage grounding, the exact composition and volume of the unlabeled pretraining corpus, or statistical significance across seeds, which are required to evaluate the 'orders of magnitude less labeled data' assertion.
[§5.2] §5.2 (real-world WidowX): The 25% success rate under camera perturbations is presented as evidence of robust physical representations, but without an ablation that freezes the pre-trained motor prior versus training from scratch or using a different pretraining objective, it remains unclear whether the two-stage separation is causally responsible for the robustness gain.

minor comments (3)

[§3] Notation: The inverse-dynamics loss is introduced without an explicit equation number; adding Eq. (X) would improve traceability when the objective is referenced in later sections.
[§5.1] Figure clarity: The SIMPLER benchmark results table would benefit from error bars or standard deviations across multiple runs to allow readers to assess the reliability of the reported 10% margin.
[§2] References: The manuscript cites prior inverse-dynamics work but does not discuss how TAP differs from recent self-supervised robotics pretraining methods (e.g., those using forward dynamics or contrastive objectives); a short related-work paragraph would strengthen positioning.
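
The per-seed reporting that the referee asks for could be produced with something like the sketch below. It is only an illustration: the method names and success rates are placeholders invented for this example, not numbers from the paper, and the sample standard deviation is one reasonable choice of spread statistic among several.

```python
import statistics

def summarize(success_rates):
    """Return (mean, sample standard deviation) of per-seed success rates."""
    return statistics.mean(success_rates), statistics.stdev(success_rates)

# Placeholder per-seed success rates, NOT results from the paper.
runs = {
    "method_a": [0.50, 0.54, 0.52],
    "method_b": [0.40, 0.44, 0.42],
}

summary = {name: summarize(rates) for name, rates in runs.items()}
gap = summary["method_a"][0] - summary["method_b"][0]

for name, (mean, std) in summary.items():
    print(f"{name}: {mean:.3f} +/- {std:.3f} over {len(runs[name])} seeds")
print(f"absolute gap in means: {gap:.3f}")
```

Reporting the gap next to each method's spread lets a reader judge whether a margin of this size is larger than the seed-to-seed variation.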

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below, agreeing that additional analyses and clarifications will strengthen the manuscript. All requested details and ablations can be incorporated in a revision.

Referee: [Abstract, §3] Abstract and §3 (method): The Decomposition Hypothesis is load-bearing for the central claim that inverse-dynamics pretraining isolates transferable physical competence, yet the manuscript provides no representation-similarity metrics, probing experiments, or controlled ablations comparing the pre-trained encoder against an end-to-end baseline to demonstrate that the self-supervised stage indeed captures high-level motor priors rather than low-level dynamics.

Authors: We agree that direct representation-level evidence would provide stronger support for the Decomposition Hypothesis beyond the downstream performance gains. The current results on SIMPLER and real-robot transfer serve as indirect validation, but we will add controlled ablations in the revision, including layer-wise representation similarity (e.g., CKA) between the TAP encoder and an end-to-end trained counterpart, plus a probing task that evaluates motor-skill transfer with the pre-trained encoder frozen. revision: yes
Referee: [§5] §5 (experiments): The 10% absolute gain over behavior cloning and the match to 1M-trajectory models on SIMPLER are the primary quantitative results; however, the text does not report the precise number of expert trajectories used in the second-stage grounding, the exact composition and volume of the unlabeled pretraining corpus, or statistical significance across seeds, which are required to evaluate the 'orders of magnitude less labeled data' assertion.

Authors: We will revise §5 to explicitly report the exact counts and composition of both the unlabeled pretraining corpus and the expert trajectories used for language grounding, along with mean and standard deviation results across multiple random seeds to establish statistical significance. revision: yes
Referee: [§5.2] §5.2 (real-world WidowX): The 25% success rate under camera perturbations is presented as evidence of robust physical representations, but without an ablation that freezes the pre-trained motor prior versus training from scratch or using a different pretraining objective, it remains unclear whether the two-stage separation is causally responsible for the robustness gain.

Authors: We acknowledge that a direct ablation isolating the pre-trained motor prior is needed to establish causality. In the revision we will add an experiment comparing the full TAP model against a from-scratch baseline (and, if feasible, an alternative pretraining objective) under identical camera-perturbation conditions on the WidowX platform. revision: yes
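
The representation-similarity analysis mentioned in the first response names CKA as an example. A minimal sketch of the linear variant is given below, assuming activations have already been extracted as matrices with one row per input. The function name `linear_cka`, the random stand-in activations, and the choice of the linear kernel are assumptions for illustration; the rebuttal does not specify which variant or layers would be used.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear centered kernel alignment between two activation matrices.

    X has shape (n_samples, d1) and Y has shape (n_samples, d2);
    rows must correspond to the same inputs in both matrices.
    """
    X = X - X.mean(axis=0, keepdims=True)   # center each feature
    Y = Y - Y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, ord="fro")
    norm_y = np.linalg.norm(Y.T @ Y, ord="fro")
    return float(cross / (norm_x * norm_y))

# Random stand-ins for encoder activations (hypothetical, not model outputs).
rng = np.random.default_rng(0)
acts_a = rng.normal(size=(100, 8))
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))   # random orthogonal matrix
acts_rotated = 3.0 * acts_a @ Q                # same features, rotated and rescaled
acts_unrelated = rng.normal(size=(100, 8))

cka_same = linear_cka(acts_a, acts_rotated)
cka_diff = linear_cka(acts_a, acts_unrelated)
print(f"rotated copy: {cka_same:.3f}, unrelated features: {cka_diff:.3f}")
```

Linear CKA is invariant to orthogonal transformations and isotropic scaling of the features, which is why the rotated copy scores close to 1 while unrelated features score much lower. In the proposed ablation, `acts_a` and the comparison matrix would instead hold activations from the TAP encoder and the end-to-end counterpart on the same inputs.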

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on external benchmarks.

The paper advances an empirical claim on the SIMPLER benchmark and a real-world platform, grounded in the Decomposition Hypothesis presented as an argumentative premise rather than a derived result. No equations, self-citations, fitted parameters renamed as predictions, or derivation chains appear in the provided text. The two-stage framework is described at the level of objectives and data sources without any reduction of outputs to inputs by construction. This is a standard non-circular empirical paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central approach rests on one domain assumption extracted from the abstract.

axioms (1)

Domain assumption: the Decomposition Hypothesis. Physical competence and semantic alignment are distinct objectives, and only the latter requires language supervision.
The paper states this explicitly as the foundation for separating the two training stages.


[1] [1]

$\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization

Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al. 𝜋0.5: a vision-language-action model with open-world generaliza- tion. arXiv preprint arXiv:2504.16054, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[2] [2]

Sanketi, Quan Vuong, Thomas Kollar, Benjamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, and Chelsea Finn

Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov , Ethan Paul Foster, Pannag R. Sanketi, Quan Vuong, Thomas Kollar, Benjamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, and Chelsea Finn. Openvla: An open-source vision-language-action model. In Pulkit Agrawal, Oliver Kroemer, ...

2024

[3] [3]

Octo: An Open-Source Generalist Robot Policy

Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, T obias Kreiman, Charles Xu, Jianlan Luo, You Liang Tan, Lawrence Yunliang Chen, Pannag Sanketi, Quan Vuong, Ted Xiao, Dorsa Sadigh, Chelsea Finn, and Sergey Levine. Octo: An open-source generalist robot policy. CoRR, abs/2405.12213, 2024. doi: 10.4...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2405.12213 2024

[4] [4]

Open X-Embodiment Collaboration, Abby O’Neill, Abdul Rehman, Abhinav Gupta, Abhiram Maddukuri, Ab- hishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley , Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, Albert T ung, Alex Bewley , Alex Herzog, Alex Irpan, Alexander Khazatsky , Anant Rai, Anchit Gupta, Andrew Wang, An- drey Kolobov , Anikait Singh, Animesh...

work page internal anchor Pith review Pith/arXiv arXiv 2023

[5] [5]

$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Lucy Xiaoyang Shi, James Tanner, Quan Vuong, Anna Walling, Haohuan Wang, and Ury Zhilinsky. 𝜋0: A visi...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2410.24164 2024

[6] [6]

Open X-Embodiment: Robotic learning datasets and RT-X models

Abhishek Padalkar, Acorn Pooley , Ajinkya Jain, et al. Open X-Embodiment: Robotic learning datasets and RT-X models. In IEEE International Conference on Robotics and Automation (ICRA), pages 6892–6903, 2024

2024

[7] [7]

Bridgedata v2: A dataset for robot learning at scale

Homer Walke, Kevin Black, Abraham Lee, Moo Jin Kim, Max Du, Chongyi Zheng, T ony Zhao, Philippe Hansen- Estruch, Quan Vuong, Andre He, Vivek Myers, Kuan Fang, Chelsea Finn, and Sergey Levine. Bridgedata v2: A dataset for robot learning at scale. In Conference on Robot Learning (CoRL), 2023

2023

[8] [8]

DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

Alexander Khazatsky , Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany , Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, P Fagan, Joey Hejna, Masha Itkina, Marion Lepert, Ye Ma, Patrick Tree Miller, Jimmy Wu, Suneel Belkhale, Shivin Dass, Huy Ha, Arhan Jain, Abraham Lee, Youngwoon Lee, Marius ...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[9] [9]

It’s the journey , not the destination: Locomotor explo- ration in infants

Justine E Hoch, Sinclaire M O’Grady , and Karen E Adolph. It’s the journey , not the destination: Locomotor explo- ration in infants. Developmental science, 22(2):e12740, 2019

2019

[10] [10]

The psychology and neuroscience of curiosity

Celeste Kidd and Benjamin Y Hayden. The psychology and neuroscience of curiosity. Neuron, 88(3):449–460, 2015

2015

[11] [11]

Motor development: Embodied, embedded, enculturated, and enabling

Karen E Adolph and Justine E Hoch. Motor development: Embodied, embedded, enculturated, and enabling. Annual review of psychology, 70(1):141–164, 2019

2019

[12] [12]

Inverse dynamics pretraining learns good representations for multitask imitation

David Brandfonbrener, Ofir Nachum, and Joan Bruna. Inverse dynamics pretraining learns good representations for multitask imitation. ArXiv, abs/2305.16985, 2023. URL https://api.semanticscholar.org/CorpusID:258947266

2023

[13]

Evaluating real-world robot manipulation policies in simulation

Xuanlin Li, Kyle Hsu, Jiayuan Gu, Oier Mees, Karl Pertsch, Homer Rich Walke, Chuyuan Fu, Ishikaa Lunawat, Isabel Sieh, Sean Kirmani, Sergey Levine, Jiajun Wu, Chelsea Finn, Hao Su, Quan Vuong, and Ted Xiao. Evaluating real-world robot manipulation policies in simulation. In Pulkit Agrawal, Oliver Kroemer, and Wolfram Burgard, editors, Conference on Robot...

2024

[14]

A Survey on Vision-Language-Action Models for Embodied AI

Yueen Ma, Zixing Song, Yuzheng Zhuang, Jianye Hao, and Irwin King. A survey on vision-language-action models for embodied ai, 2026. URL https://arxiv.org/abs/2405.14093

2026

[15]

Pure Vision Language Action (VLA) Models: A Comprehensive Survey

Dapeng Zhang, Jing Sun, Chenghui Hu, Xiaoyan Wu, Zhenlong Yuan, Rui Zhou, Fei Shen, and Qingguo Zhou. Pure vision language action (vla) models: A comprehensive survey, 2025. URL https://arxiv.org/abs/2509.19012.

2025

[16]

A Survey on Vision-Language-Action Models: An Action Tokenization Perspective

Yifan Zhong, Fengshuo Bai, Shaofei Cai, Xuchuan Huang, Zhang Chen, Xiaowei Zhang, Yuanfei Wang, Shaoyang Guo, Tianrui Guan, Ka Nam Lui, Zhiquan Qi, Yitao Liang, Yuanpei Chen, and Yaodong Yang. A survey on vision-language-action models: An action tokenization perspective, 2025. URL https://arxiv.org/abs/2507.01925

2025

[17]

Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alexander Herzog, Jasmine Hsu, Julian Ibarz, Brian Ichter, Alex Irpan, Tomas Jackson, Sally Jesmonth, Nikhil J. Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, Isabel Leal, Kuang-Huei Lee, Sergey Levine, Yao Lu, Uts...

doi:10.15607/rss.2023.xix.025, 2023

[18]

Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, Quan Vuong, Vincent Vanhoucke, Huong T. Tran, Radu Soricut, Anikait Singh, Jaspiar Singh, Pierre Sermanet, Pannag R. Sanketi, Grecia Salazar, Michael S. Ryoo, Krista Reymann, Kanishka Rao, Karl Pertsch, Igor Mordatch, Henryk Michalew...

2023

[19]

Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, Peter David Fagan, Joey Hejna, Masha Itkina, Marion Lepert, Yecheng Jason Ma, Patrick Tree Miller, Jimmy Wu, Suneel Belkhale, Shivin Dass, Huy Ha, Arhan Jain, Abraham Lee, Y...

doi:10.15607/rss.2024.xx.120, 2024

[20]

Gen-0: Embodied foundation models that scale with physical interaction

Generalist AI Team. Gen-0: Embodied foundation models that scale with physical interaction. Generalist AI Blog,

[21]

https://generalistai.com/blog/preview-uqlxvb-bb.html

[22]

Roboomni: Proactive robot manipulation in omni-modal context

Siyin Wang, Jinlan Fu, Feihong Liu, Xinzhe He, Huangxuan Wu, Junhao Shi, Kexin Huang, Zhaoye Fei, Jingjing Gong, Zuxuan Wu, Yu-Gang Jiang, See-Kiong Ng, Tat-Seng Chua, and Xipeng Qiu. Roboomni: Proactive robot manipulation in omni-modal context, 2025. URL https://arxiv.org/abs/2510.23763

2025

[23]

MIDAS: multi-layered attack detection architecture with decision optimisation

Kieran Rendall, Alexios Mylonas, Stilianos Vidalis, and Dimitris Gritzalis. MIDAS: multi-layered attack detection architecture with decision optimisation. Comput. Secur., 148:104154, 2025. doi: 10.1016/J.COSE.2024.104154. URL https://doi.org/10.1016/j.cose.2024.104154

2025

[24]

SMART: self-supervised multi-task pretraining with control transformers

Yanchao Sun, Shuang Ma, Ratnesh Madaan, Rogerio Bonatti, Furong Huang, and Ashish Kapoor. SMART: self-supervised multi-task pretraining with control transformers. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023. URL https://openreview.net/forum?id=9piH3Hg8QEf

2023

[25]

PACT: perception-action causal transformer for autoregressive robotics pre-training

Rogerio Bonatti, Sai Vemprala, Shuang Ma, Felipe Frujeri, Shuhang Chen, and Ashish Kapoor. PACT: perception-action causal transformer for autoregressive robotics pre-training. In IROS, pages 3621–3627, 2023. doi: 10.1109/IROS55552.2023.10342381. URL https://doi.org/10.1109/IROS55552.2023.10342381

2023

[26]

Exploring visual pre-training for robot manipulation: Datasets, models and methods

Ya Jing, Xuelin Zhu, Xingbin Liu, Qie Sima, Taozheng Yang, Yunhai Feng, and Tao Kong. Exploring visual pre-training for robot manipulation: Datasets, models and methods. In IROS, pages 11390–11395, 2023. doi: 10.1109/IROS55552.2023.10342201. URL https://doi.org/10.1109/IROS55552.2023.10342201

2023

[27]

Masked autoencoding for scalable and generalizable decision making

Fangchen Liu, Hao Liu, Aditya Grover, and Pieter Abbeel. Masked autoencoding for scalable and generalizable decision making. In Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orlea...

2022

[28]

Video pretraining (VPT): learning to act by watching unlabeled online videos

Bowen Baker, Ilge Akkaya, Peter Zhokhov, Joost Huizinga, Jie Tang, Adrien Ecoffet, Brandon Houghton, Raul Sampedro, and Jeff Clune. Video pretraining (VPT): learning to act by watching unlabeled online videos. In Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems 35: Annu...

2022

[29]

Unleashing large-scale video generative pre-training for visual robot manipulation

Hongtao Wu, Ya Jing, Chilam Cheang, Guangzeng Chen, Jiafeng Xu, Xinghang Li, Minghuan Liu, Hang Li, and Tao Kong. Unleashing large-scale video generative pre-training for visual robot manipulation. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. URL https://openrevie...

2024

[30]

NORA: A Small Open-Sourced Generalist Vision Language Action Model for Embodied Tasks

Chia-Yu Hung, Qi Sun, Pengfei Hong, Amir Zadeh, Chuan Li, U-Xuan Tan, Navonil Majumder, and Soujanya Poria. NORA: A small open-sourced generalist vision language action model for embodied tasks. CoRR, abs/2504.19854,

[31]

NORA: A Small Open-Sourced Generalist Vision Language Action Model for Embodied Tasks

doi: 10.48550/ARXIV.2504.19854. URL https://doi.org/10.48550/arXiv.2504.19854

[32]

Grad-cam: Visual explanations from deep networks via gradient-based localization

Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE international conference on computer vision, pages 618–626, 2017

2017

[33]

World-aware planning narratives enhance large vision-language model planner

Junhao Shi, Zhaoye Fei, Siyin Wang, Qipeng Guo, Jingjing Gong, and Xipeng Qiu. World-aware planning narratives enhance large vision-language model planner, 2025. URL https://arxiv.org/abs/2506.21230

2025

[34]

LIBERO-Plus: In-depth Robustness Analysis of Vision-Language-Action Models

Senyu Fei, Siyin Wang, Junhao Shi, Zihao Dai, Jikun Cai, Pengfang Qian, Li Ji, Xinzhe He, Shiduo Zhang, Zhaoye Fei, Jinlan Fu, Jingjing Gong, and Xipeng Qiu. Libero-plus: In-depth robustness analysis of vision-language-action models, 2025. URL https://arxiv.org/abs/2510.13626.

2025

Appendix A Details of Autonomous Random Play Data Collection

To ensure that...