pith. machine review for the scientific record.

arxiv: 2605.10925 · v1 · submitted 2026-05-11 · 💻 cs.RO

Recognition: no theorem link

PriorVLA: Prior-Preserving Adaptation for Vision-Language-Action Models

Bin Xie, Tiancai Wang, Wei Chai, Xianchi Deng, Xingyu Chen, Xinyu Guo, Zhengxing Wu

Pith reviewed 2026-05-12 03:33 UTC · model grok-4.3

classification 💻 cs.RO
keywords vision-language-action models · robot manipulation · prior preservation · parameter-efficient adaptation · out-of-distribution generalization · few-shot learning · embodied AI

The pith

PriorVLA adapts vision-language-action models to robot tasks by freezing a Prior Expert and integrating its priors via Expert Queries into a trainable Adaptation Expert, updating only 25 percent of the parameters changed by full fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large pretrained vision-language-action models serve as generalist foundations for robot manipulation, yet full fine-tuning often overwrites broad priors with narrow task patterns. PriorVLA keeps the pretrained component frozen as a read-only Prior Expert while training a separate Adaptation Expert. Expert Queries pull scene understanding from the vision-language model and motor knowledge from the Prior Expert, feeding both into the adaptation process. The design delivers higher success rates than full fine-tuning or existing baselines across simulation and real-robot evaluations, with the clearest advantages when test conditions differ from training data or when only a few demonstrations are available.

Core claim

PriorVLA preserves pretrained priors during adaptation of vision-language-action models by maintaining a frozen Prior Expert as a source of scene and motor knowledge while training only an Adaptation Expert that receives integrated priors through Expert Queries. This approach updates just 25 percent of the parameters changed by full fine-tuning. It produces stronger overall performance than full fine-tuning and current VLA baselines on RoboTwin 2.0, LIBERO, and real-world tasks, with the largest gains under out-of-distribution and few-shot conditions, including an 11-point improvement over pi0.5 on RoboTwin 2.0-Hard and 99.1 percent average success on LIBERO.

What carries the argument

Expert Queries, which extract scene priors from the pretrained vision-language model and motor priors from the frozen Prior Expert to guide the trainable Adaptation Expert during task specialization.
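
To make the mechanism concrete, below is a minimal PyTorch-style sketch of the frozen-expert-plus-queries pattern as Pith reads it. It is not the authors' implementation: the module names, query counts, hidden size, concatenation-based fusion, and the handling of the VLM are illustrative assumptions.

```python
# Minimal sketch of prior-preserving adaptation with learnable Expert Queries.
# Names, sizes, and the fusion scheme are assumptions, not the paper's code.
import torch
import torch.nn as nn


class QueryReader(nn.Module):
    """Learnable query tokens that cross-attend into a frozen feature stream."""

    def __init__(self, num_queries: int, dim: int, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, frozen_feats: torch.Tensor) -> torch.Tensor:
        # frozen_feats: (B, T, dim). Gradients reach the queries and this
        # attention layer, never the frozen source that produced the features.
        q = self.queries.unsqueeze(0).expand(frozen_feats.size(0), -1, -1)
        out, _ = self.attn(q, frozen_feats, frozen_feats)
        return out  # (B, num_queries, dim)


class PriorVLASketch(nn.Module):
    def __init__(self, vlm, prior_expert, adaptation_expert, dim: int = 1024):
        super().__init__()
        self.vlm = vlm                      # pretrained vision-language backbone
        self.prior_expert = prior_expert    # frozen copy of the pretrained action expert
        for p in self.prior_expert.parameters():
            p.requires_grad_(False)         # read-only Prior Expert
        self.scene_queries = QueryReader(16, dim)   # read scene priors from the VLM
        self.motor_queries = QueryReader(16, dim)   # read motor priors from the Prior Expert
        self.adaptation_expert = adaptation_expert  # trainable specialization module

    def forward(self, obs, language, noisy_actions):
        vlm_feats = self.vlm(obs, language)
        with torch.no_grad():
            prior_feats = self.prior_expert(vlm_feats, noisy_actions)
        scene_prior = self.scene_queries(vlm_feats)
        motor_prior = self.motor_queries(prior_feats)
        priors = torch.cat([scene_prior, motor_prior], dim=1)
        # Only the queries and the Adaptation Expert receive gradient updates,
        # which is where the reduced trainable-parameter fraction comes from.
        return self.adaptation_expert(vlm_feats, noisy_actions, priors)
```

Figure 3's attention mask suggests the actual integration runs through structured attention among token groups rather than the simple concatenation used here; the sketch only fixes the gradient boundary: priors are read from frozen components, and only the queries and the Adaptation Expert are updated.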

If this is right

  • PriorVLA updates only 25 percent of the parameters changed during full fine-tuning while achieving higher task success.
  • Performance gains are largest in out-of-distribution and few-shot settings, such as an 11-point lift over pi0.5 on RoboTwin 2.0-Hard.
  • The method reaches 99.1 percent average success on LIBERO and, on eight real-world tasks across two embodiments, attains 81 percent in-distribution and 57 percent out-of-distribution success with standard data.
  • With only 10 demonstrations per task it still achieves 48 percent in-distribution and 32 percent out-of-distribution success, surpassing pi0.5 by 24 and 22 points respectively.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If prior preservation works as claimed, the same frozen-expert-plus-query pattern could support incremental skill acquisition without repeated full retraining.
  • Fewer updated parameters may enable on-device or edge adaptation of robot policies where compute and memory are limited.
  • The query-based integration of a frozen expert could be tested on other sequential decision models that currently rely on full fine-tuning.

Load-bearing premise

The frozen Prior Expert holds useful non-conflicting priors that the Adaptation Expert can reliably extract and apply through Expert Queries without needing joint optimization of the full model.

What would settle it

Running full fine-tuning and PriorVLA on identical data and tasks, then observing that full fine-tuning matches or exceeds PriorVLA's success rates in the out-of-distribution and few-shot regimes, would show that freezing the prior source adds no benefit.
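
Operationally, that test reduces to a matched-data, matched-seed comparison across evaluation regimes. A minimal sketch, assuming hypothetical training and evaluation helpers rather than any existing benchmark API:

```python
# Matched-data comparison of full fine-tuning vs. prior-preserving adaptation.
# train_full_finetune, train_priorvla, and evaluate_success are hypothetical
# placeholders; the regime names mirror the paper's evaluation settings.
from statistics import mean

REGIMES = ("in_distribution", "out_of_distribution", "few_shot_10_demos")


def head_to_head(train_full_finetune, train_priorvla, tasks, demos_by_regime,
                 evaluate_success, seeds=(0, 1, 2)):
    results = {"full_finetune": {}, "priorvla": {}}
    for name, train in (("full_finetune", train_full_finetune),
                        ("priorvla", train_priorvla)):
        for regime in REGIMES:
            scores = [evaluate_success(train(tasks, demos_by_regime[regime], seed=s),
                                       tasks, regime)
                      for s in seeds]
            results[name][regime] = mean(scores)
    # Freezing the prior source adds nothing if full fine-tuning matches or
    # beats PriorVLA in the OOD and few-shot columns under identical data.
    return results
```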

Figures

Figures reproduced from arXiv: 2605.10925 by Bin Xie, Tiancai Wang, Wei Chai, Xianchi Deng, Xingyu Chen, Xinyu Guo, Zhengxing Wu.

Figure 1
Figure 1: Overview of PriorVLA. Large-scale pretraining provides broad priors for general manipulation, but full fine-tuning on limited downstream data can treat these priors mainly as initialization and lead to prior forgetting, especially when evaluated under OOD scenes. PriorVLA instead preserves and leverages pretrained scene and motor priors through Dual Action Experts and Expert Queries, improving adaptation a… view at source ↗
Figure 2
Figure 2: PriorVLA architecture. PriorVLA builds on a pretrained VLA and introduces two coupled modules: Dual Action Experts and Expert Queries. Dual Action Experts keep the original AE as a frozen Prior Expert and train an Adaptation Expert for downstream action generation. Expert Queries capture scene and motor priors from pretrained forward paths and integrate them into the Adaptation Expert; the Prior Expert ser… view at source ↗
Figure 3
Figure 3: Attention design of PriorVLA. (Left) Attention mask over token groups in the VLM, Prior Expert (PE), and Adaptation Expert (AE). Orange cells indicate allowed attention and blank cells indicate blocked attention. OBS denotes original VLM input tokens, SQ Scene Queries, N1 PE noisy action tokens, MQ Motor Queries, AQ Action Queries, and N2 AE noisy action tokens. (Right) Information flow induced by the mask… view at source ↗ (see the mask-building sketch after the figure list)
Figure 4
Figure 4: Experimental overview. We evaluate PriorVLA on RoboTwin 2.0 [22], LIBERO [25], and two real-world robot embodiments, covering ID/OOD generalization, data regimes, and component ablations. … view at source ↗
Figure 5
Figure 5: Real-world platforms used in our experiments. Top: the single-arm Franka setup with one third-person and one wrist Intel RealSense D435 camera. Bottom: the dual-arm AC-One setup with one top-view Intel RealSense D435 camera and two Intel RealSense D405 wrist cameras. … view at source ↗
Figure 6
Figure 6: Qualitative real-world results on the Franka platform. We show rollout snapshots for four representative tasks under both in-distribution (ID) and out-of-distribution (OOD) evaluation. These examples illustrate representative task progression under nominal and perturbed real-world conditions. view at source ↗
Figure 7
Figure 7: Qualitative real-world results on the AC-One platform. We show rollout snapshots for four representative dual-arm tasks under both in-distribution (ID) and out-of-distribution (OOD) evaluation. These examples illustrate representative task progression on the dual-arm platform. view at source ↗
Figure 8
Figure 8: Qualitative simulation results on RoboTwin 2.0 (Part I). We show additional representative rollout snapshots under ID/Easy and OOD/Hard evaluation. view at source ↗
Figure 9
Figure 9: Qualitative simulation results on RoboTwin 2.0 (Part II). We show additional representative rollout snapshots under ID/Easy and OOD/Hard evaluation. view at source ↗
Figure 10
Figure 10: Qualitative simulation results on LIBERO. We show one representative task from each of the four suites. Since LIBERO is evaluated under the standard benchmark setting, only in-distribution rollouts are shown. view at source ↗
Figure 11
Figure 11: Qualitative VQA examples after adaptation. The probes include general visual recognition, … view at source ↗
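
Figure 3 describes the attention design as a block mask over six token groups (OBS, SQ, N1, MQ, AQ, N2). A minimal sketch of how such a mask could be assembled follows; the group sizes and the allowed (query, key) pairs are illustrative guesses pieced together from the captions (Scene Queries reading the VLM, Motor Queries reading the Prior Expert, Action Queries integrating both), not the paper's actual pattern.

```python
# Block attention mask over the token groups named in Figure 3.
# Group sizes and the allowed (query, key) pairs are illustrative assumptions.
import torch

GROUPS = {"OBS": 256, "SQ": 16, "N1": 50, "MQ": 16, "AQ": 16, "N2": 50}

ALLOWED = {                                          # guessed, not the paper's mask
    ("OBS", "OBS"),
    ("SQ", "OBS"), ("SQ", "SQ"),                     # Scene Queries read the VLM
    ("N1", "OBS"), ("N1", "N1"),                     # Prior Expert forward path
    ("MQ", "N1"), ("MQ", "MQ"),                      # Motor Queries read the Prior Expert
    ("AQ", "SQ"), ("AQ", "MQ"), ("AQ", "AQ"),        # Action Queries integrate both
    ("N2", "OBS"), ("N2", "AQ"), ("N2", "N2"),       # Adaptation Expert denoising tokens
}


def build_block_mask(groups, allowed):
    names = list(groups)
    bounds, offset = {}, 0
    for name in names:
        bounds[name] = (offset, offset + groups[name])
        offset += groups[name]
    can_attend = torch.zeros(offset, offset, dtype=torch.bool)
    for q in names:
        for k in names:
            if (q, k) in allowed:
                (qs, qe), (ks, ke) = bounds[q], bounds[k]
                can_attend[qs:qe, ks:ke] = True
    # PyTorch attention layers mask out True positions, so invert before use:
    # attn_mask = ~can_attend
    return can_attend
```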
read the original abstract

Large-scale pretraining has made Vision-Language-Action (VLA) models promising foundations for generalist robot manipulation, yet adapting them to downstream tasks remains necessary. However, the common practice of full fine-tuning treats pretraining as initialization and can shift broad priors toward narrow training-distribution patterns. We propose PriorVLA, a novel framework that preserves pretrained priors and learns to leverage them for effective adaptation. PriorVLA keeps a frozen Prior Expert as a read-only prior source and trains an Adaptation Expert for downstream specialization. Expert Queries capture scene priors from the pretrained VLM and motor priors from the Prior Expert, integrating both into the Adaptation Expert to guide adaptation. Together, PriorVLA updates only 25% of the parameters updated by full fine-tuning. Across RoboTwin 2.0, LIBERO, and real-world tasks, PriorVLA achieves stronger overall performance than full fine-tuning and state-of-the-art VLA baselines, with the largest gains under out-of-distribution (OOD) and few-shot settings. PriorVLA improves over pi0.5 by 11 points on RoboTwin 2.0-Hard and achieves 99.1% average success on LIBERO. Across eight real-world tasks and two embodiments, PriorVLA reaches 81% in-distribution (ID) and 57% OOD success with standard data. With only 10 demonstrations per task, PriorVLA reaches 48% ID and 32% OOD success, surpassing pi0.5 by 24 and 22 points, respectively.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes PriorVLA, a framework for adapting large Vision-Language-Action (VLA) models that freezes a Prior Expert to retain pretrained priors while training an Adaptation Expert (updating only 25% of parameters) that uses Expert Queries to extract and integrate scene priors from a VLM and motor priors from the Prior Expert. Empirical results on RoboTwin 2.0, LIBERO, and real-world tasks with two embodiments claim consistent outperformance over full fine-tuning and baselines such as pi0.5, with largest gains in OOD and few-shot regimes (e.g., +11 points on RoboTwin 2.0-Hard, 99.1% average on LIBERO, +24/+22 points with 10 demos).

Significance. If the core assumption holds, PriorVLA offers a practical route to more efficient and generalizable VLA adaptation that could reduce compute demands in robotics while improving robustness under distribution shift and limited data. The reported gains in few-shot OOD settings, if reproducible, would be a meaningful empirical contribution to parameter-efficient robot learning.

major comments (2)
  1. [§4] §4 (Ablation studies) and Table 3: no ablation isolates the contribution of the frozen Prior Expert's motor priors versus the effect of updating only 25% of parameters. A control with a randomly initialized frozen expert or disabled Expert Queries is required to confirm that OOD/few-shot gains (e.g., RoboTwin 2.0-Hard and 10-demo results) arise from useful non-conflicting priors rather than reduced overfitting.
  2. [§3.2] §3.2 (Expert Queries formulation): the query mechanism for reading motor priors from the frozen Prior Expert is described at high level without equations or pseudocode showing how alignment with downstream actions is enforced. This leaves open the risk of distribution mismatch in OOD regimes, which is load-bearing for the central claim that priors remain useful without joint optimization.
minor comments (2)
  1. [Abstract] Abstract and §5: exact data splits, number of random seeds, and statistical significance tests (e.g., p-values or confidence intervals) for the reported success rates are not stated, making it difficult to assess reliability of the 99.1% LIBERO and real-world numbers.
  2. [§3] Figure 2 and §3: notation for Expert Queries (e.g., how scene vs. motor queries are distinguished and fused) could be clarified with a diagram or explicit equations to improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. The comments highlight important aspects of our ablation design and methodological clarity that we will address in the revision. Below we respond point-by-point to the major comments.

read point-by-point responses
  1. Referee: [§4] §4 (Ablation studies) and Table 3: no ablation isolates the contribution of the frozen Prior Expert's motor priors versus the effect of updating only 25% of parameters. A control with a randomly initialized frozen expert or disabled Expert Queries is required to confirm that OOD/few-shot gains (e.g., RoboTwin 2.0-Hard and 10-demo results) arise from useful non-conflicting priors rather than reduced overfitting.

    Authors: We agree that the current ablations in Table 3 do not fully isolate the pretrained motor priors from the general benefits of updating fewer parameters. While the existing controls demonstrate the value of the Adaptation Expert and Expert Queries, they lack a randomly initialized frozen Prior Expert baseline. In the revised manuscript we will add this control experiment (and an additional ablation disabling Expert Queries to the Prior Expert) and report the results in an expanded Table 3. These new runs will directly test whether the OOD and few-shot gains derive from the preserved priors rather than reduced overfitting alone. We will also update §4 to discuss the outcomes. revision: yes

  2. Referee: [§3.2] §3.2 (Expert Queries formulation): the query mechanism for reading motor priors from the frozen Prior Expert is described at high level without equations or pseudocode showing how alignment with downstream actions is enforced. This leaves open the risk of distribution mismatch in OOD regimes, which is load-bearing for the central claim that priors remain useful without joint optimization.

    Authors: We acknowledge that §3.2 currently presents the Expert Queries at a conceptual level. To improve rigor, the revised version will include the explicit mathematical formulation (query, key, and value projections together with the cross-attention equations) and pseudocode in the appendix that shows how motor-prior features are read from the frozen Prior Expert and fused into the Adaptation Expert. Alignment with downstream actions is enforced by the end-to-end action-prediction loss; we will add a short paragraph clarifying this point. We will also expand the discussion of potential distribution mismatch in OOD settings, noting that our empirical results on RoboTwin 2.0-Hard and real-world OOD tasks indicate the priors remain beneficial, while acknowledging the design choice of freezing the Prior Expert as a deliberate safeguard against catastrophic forgetting. revision: yes
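
As orientation while that revision is pending, the standard cross-attention read-out such a formulation would most likely instantiate is sketched below; the symbols (motor queries M, frozen Prior Expert states H_PE, projection matrices W, key dimension d_k) are generic placeholders rather than the paper's notation.

```latex
% Generic cross-attention read-out of motor priors from a frozen expert
% (placeholder notation; not the paper's exact formulation).
Q = M W_Q, \qquad K = H_{\mathrm{PE}} W_K, \qquad V = H_{\mathrm{PE}} W_V, \qquad
\operatorname{Prior}(M) = \operatorname{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```

Gradients from the end-to-end action-prediction loss reach M and the W projections; H_PE stays fixed, which is the sense in which the prior source is read-only.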

Circularity Check

0 steps flagged

No circularity: empirical results on held-out benchmarks

full rationale

The paper introduces PriorVLA as an architectural framework (frozen Prior Expert + trainable Adaptation Expert + Expert Queries) and reports success rates on RoboTwin 2.0, LIBERO, and real-world tasks. These metrics are direct experimental measurements on separate test sets, not algebraic derivations, fitted parameters renamed as predictions, or self-referential equations. No load-bearing self-citations, uniqueness theorems, or ansatzes that reduce the central performance claims to the method's own inputs appear in the abstract or described structure. The derivation chain consists of design choices followed by empirical validation against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 3 invented entities

The central claim rests on the domain assumption that large pretrained VLAs encode broadly useful scene and motor priors that remain valuable after freezing, plus the ad-hoc architectural choice of separate Prior and Adaptation Experts connected only by queries. No new physical entities are postulated.

free parameters (1)
  • fraction of parameters updated (25%)
    The exact split between frozen and trainable components is chosen to balance preservation and adaptation; its value is not derived from first principles.
axioms (1)
  • domain assumption Pretrained VLA models contain broad, transferable priors about scenes and motor skills that are worth preserving during downstream adaptation.
    Invoked in the motivation and method description to justify freezing the Prior Expert.
invented entities (3)
  • Prior Expert no independent evidence
    purpose: Frozen read-only source of pretrained scene and motor priors
    New architectural component introduced to hold the original model weights unchanged.
  • Adaptation Expert no independent evidence
    purpose: Trainable module that specializes to downstream tasks while guided by priors
    New architectural component that receives the queried priors.
  • Expert Queries no independent evidence
    purpose: Mechanism to extract and inject priors from the frozen expert into the adaptation expert
    New interface defined between the two experts.

pith-pipeline@v0.9.0 · 5598 in / 1570 out tokens · 36658 ms · 2026-05-12T03:33:31.703289+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

47 extracted references · 47 canonical work pages · 7 internal anchors

  1. [1]

    RT-1: Robotics Transformer for real-world control at scale

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alexander Herzog, Jasmine Hsu, Julian Ibarz, Brian Ichter, Alex Irpan, Tomas Jackson, Sally Jesmonth, Nikhil Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, Isabel Leal, Kuang-Huei Lee, Sergey Levine, Yao Lu, Utsav Mall...

  2. [2]

    RT-2: Vision-language-action models transfer web knowledge to robotic control

    Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, Quan Vuong, Vincent Vanhoucke, Huong Tran, Radu Soricut, Anikait Singh, Jaspiar Singh, Pierre Sermanet, Pannag R. Sanketi, Grecia Salazar, Michael S. Ryoo, Krista Reymann, Kanishka Rao, Karl Pertsch, Igor Mordatch, Henryk Michalewski...

  3. [3]

    OpenVLA: An open-source vision-language-action model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan P. Foster, Pannag R. Sanketi, Quan Vuong, Thomas Kollar, Benjamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, and Chelsea Finn. OpenVLA: An open-source vision-language-action model. In Pulkit Agrawal, Oliver Kroemer, a...

  4. [4]

    π0: A vision-language-action flow model for general robot control

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Robert Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Lucy Xiaoyang Shi, Laura Smith, James Tanner, Quan Vuong, Anna Walling, Haohuan Wang, and Ury Z...

  5. [5]

    Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Robert Equi, Chelsea Finn, Niccolo Fusai, Manuel Y . Galliker, Dibya Ghosh, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Devin LeBlanc, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Allen Z. Ren,...

  6. [6]

    BridgeData V2: A dataset for robot learning at scale

    Homer Rich Walke, Kevin Black, Tony Z. Zhao, Quan Vuong, Chongyi Zheng, Philippe Hansen- Estruch, Andre Wang He, Vivek Myers, Moo Jin Kim, Max Du, Abraham Lee, Kuan Fang, Chelsea Finn, and Sergey Levine. BridgeData V2: A dataset for robot learning at scale. In Jie Tan, Marc Toussaint, and Kourosh Darvish, editors,Proceedings of The 7th Conference on Robot...

  7. [7]

    Open X-Embodiment: Robotic learning datasets and RT-X models

    Abby O’Neill, Abdul Rehman, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, et al. Open X-Embodiment: Robotic learning datasets and RT-X models. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 6892–6903. IEEE, 2024

  8. [8]

    DROID: A large-scale in-the-wild robot manipulation dataset

    Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, Peter David Fagan, Joey Hejna, Masha Itkina, Marion Lepert, Yecheng Jason Ma, Patrick Tree Miller, Jimmy Wu, Suneel Belkhale, Shivin Dass, Huy Ha, Arhan Jain, Abraham Lee, You...

  9. [9]

    Octo: An Open-Source Generalist Robot Policy

    Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, Jianlan Luo, You Liang Tan, Lawrence Yunliang Chen, Pannag Sanketi, Quan Vuong, Ted Xiao, Dorsa Sadigh, Chelsea Finn, and Sergey Levine. Octo: An open-source generalist robot policy.arXiv preprint arXiv:2405.12213, 2024

  10. [10]

    GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation

    Chi-Lam Cheang, Guangzeng Chen, Ya Jing, Tao Kong, Hang Li, Yifeng Li, Yuxiao Liu, Hongtao Wu, Jiafeng Xu, Yichu Yang, Hanbo Zhang, and Minzhao Zhu. GR-2: A generative video-language-action model with web-scale knowledge for robot manipulation.arXiv preprint arXiv:2410.06158, 2024

  11. [11]

    UniVLA: Learning to act anywhere with task-centric latent actions

    Qingwen Bu, Yanting Yang, Jisong Cai, Shenyuan Gao, Guanghui Ren, Maoqing Yao, Ping Luo, and Hongyang Li. UniVLA: Learning to act anywhere with task-centric latent actions. In Proceedings of Robotics: Science and Systems, 2025

  12. [12]

    Fine-tuning vision-language-action models: Optimizing speed and success

    Moo Jin Kim, Chelsea Finn, and Percy Liang. Fine-tuning vision-language-action models: Optimizing speed and success. InProceedings of Robotics: Science and Systems, Los Angeles, CA, USA, June 2025

  13. [13]

    CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation

    Qixiu Li, Yaobo Liang, Zeyu Wang, Lin Luo, Xi Chen, Mozheng Liao, Fangyun Wei, Yu Deng, Sicheng Xu, Yizhong Zhang, Xiaofan Wang, Bei Liu, Jianlong Fu, Jianmin Bao, Dong Chen, Yuanchun Shi, Jiaolong Yang, and Baining Guo. CogACT: A foundational vision-language-action model for synergizing cognition and action in robotic manipulation.arXiv preprint arXiv:2...

  14. [14]

    SpatialVLA: Exploring spatial representations for visual-language-action model

    Delin Qu, Haoming Song, Qizhi Chen, Yuanqi Yao, Xinyi Ye, Yan Ding, Zhigang Wang, JiaYuan Gu, Bin Zhao, Dong Wang, and Xuelong Li. SpatialVLA: Exploring spatial representations for visual-language-action model. InProceedings of Robotics: Science and Systems, 2025

  15. [15]

    TinyVLA: Towards fast, data-efficient vision-language-action models for robotic manipulation

    Junjie Wen, Yichen Zhu, Jinming Li, Minjie Zhu, Kun Wu, Zhiyuan Xu, Ning Liu, Ran Cheng, Chaomin Shen, Yaxin Peng, Feifei Feng, and Jian Tang. TinyVLA: Towards fast, data-efficient vision-language-action models for robotic manipulation.arXiv preprint arXiv:2409.12514, 2024

  16. [16]

    Knowledge insulating vision-language-action models: Train fast, run fast, generalize better

    Danny Driess, Jost Tobias Springenberg, Brian Ichter, Lili Yu, Adrian Li-Bell, Karl Pertsch, Allen Z. Ren, Homer Walke, Quan Vuong, Lucy Xiaoyang Shi, and Sergey Levine. Knowledge insulating vision-language-action models: Train fast, run fast, generalize better.arXiv preprint arXiv:2505.23705, 2025

  17. [17]

    VLA-Adapter: An effective paradigm for tiny-scale vision-language-action model

    Yihao Wang, Pengxiang Ding, Lingxiao Li, Can Cui, Zirui Ge, Xinyang Tong, Wenxuan Song, Han Zhao, Wei Zhao, Pengxu Hou, Siteng Huang, Yifan Tang, Wenhui Wang, Ru Zhang, Jianyi Liu, and Donglin Wang. VLA-Adapter: An effective paradigm for tiny-scale vision-language- action model.arXiv preprint arXiv:2509.09372, 2025

  18. [18]

    MAPS: Preserving vision-language representations via module-wise proximity scheduling for better vision-language-action generalization

    Chengyue Huang, Mellon M. Zhang, Robert Azarcon, Glen Chou, and Zsolt Kira. MAPS: Preserving vision-language representations via module-wise proximity scheduling for better vision-language-action generalization.arXiv preprint arXiv:2511.19878, 2025

  19. [19]

    Robust fine-tuning of vision-language-action robot policies via parameter merging

    Yajat Yadav, Zhiyuan Zhou, Andrew Wagenmaker, Karl Pertsch, and Sergey Levine. Robust fine-tuning of vision-language-action robot policies via parameter merging. InInternational Conference on Learning Representations, 2026

  20. [20]

    MimicGen: A data generation system for scalable robot learning using human demonstrations

    Ajay Mandlekar, Soroush Nasiriany, Bowen Wen, Iretiayo Akinola, Yashraj Narang, Linxi Fan, Yuke Zhu, and Dieter Fox. MimicGen: A data generation system for scalable robot learning using human demonstrations. In Jie Tan, Marc Toussaint, and Kourosh Darvish, editors, Proceedings of The 7th Conference on Robot Learning, volume 229 ofProceedings of Machine Le...

  21. [21]

    RoboCasa: Large-scale simulation of household tasks for generalist robots

    Soroush Nasiriany, Abhiram Maddukuri, Lance Zhang, Adeet Parikh, Aaron Lo, Abhishek Joshi, Ajay Mandlekar, and Yuke Zhu. RoboCasa: Large-scale simulation of household tasks for generalist robots. InProceedings of Robotics: Science and Systems, Delft, Netherlands, July 2024

  22. [22]

    RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation

    Tianxing Chen, Zanxin Chen, Baijun Chen, Zijian Cai, Yibin Liu, Zixuan Li, Qiwei Liang, Xianliang Lin, Yiheng Ge, Zhenyu Gu, Weiliang Deng, Yubin Guo, Tian Nian, Xuanbing Xie, Qiangyu Chen, Kailun Su, Tianling Xu, Guodong Liu, Mengkang Hu, Huan-ang Gao, Kaixuan Wang, Zhixuan Liang, Yusen Qin, Xiaokang Yang, Ping Luo, and Yao Mu. RoboTwin 2.0: A scalable d...

  23. [23]

    RoboVerse: A unified platform, benchmark and dataset for scalable and generalizable robot learning

    Haoran Geng, Feishi Wang, Songlin Wei, Yuyang Li, Bangjun Wang, Boshi An, Haozhe Lou, Charlie Tianyue Cheng, Peihao Li, Haozhe Chen, Yutong Liang, Yuxi Qian, Jiageng Mao, Weikang Wan, Yiran Geng, Mingtong Zhang, Jiangran Lyu, Siheng Zhao, Jiazhao Zhang, Chaoyi Xu, Jialiang Zhang, Chengyang Zhao, Haoran Lu, Yufei Ding, Ran Gong, Yuran Wang, Yuxuan Kuang, R...

  24. [24]

    LoRA: Low-rank adaptation of large language models

    Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022

  25. [25]

    LIBERO: Benchmarking knowledge transfer for lifelong robot learning

    Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. LIBERO: Benchmarking knowledge transfer for lifelong robot learning. InAdvances in Neural Information Processing Systems, volume 36, 2023

  26. [26]

    Cheng Chi, Siyuan Feng, Yilun Du, Zhenjia Xu, Eric Cousineau, Benjamin C. M. Burchfiel, and Shuran Song. Diffusion Policy: Visuomotor policy learning via action diffusion. InProceedings of Robotics: Science and Systems, Daegu, Republic of Korea, July 2023

  27. [27]

    Learning fine-grained bimanual manipulation with low-cost hardware

    Tony Z. Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware. InProceedings of Robotics: Science and Systems, Daegu, Republic of Korea, July 2023

  28. [28]

    DenseGrounding: Improving dense language-vision semantics for egocentric 3D visual grounding

    Henry Zheng, Hao Shi, Qihang Peng, Yong Xien Chng, Rui Huang, Yepeng Weng, Zhongchao Shi, and Gao Huang. DenseGrounding: Improving dense language-vision semantics for egocentric 3D visual grounding.arXiv preprint arXiv:2505.04965, 2025

  29. [29]

    Perceiver-actor: A multi-task transformer for robotic manipulation

    Mohit Shridhar, Lucas Manuelli, and Dieter Fox. Perceiver-actor: A multi-task transformer for robotic manipulation. InProceedings of The 6th Conference on Robot Learning, volume 205 of Proceedings of Machine Learning Research, pages 785–799. PMLR, 2023

  30. [30]

    RVT: Robotic view transformer for 3d object manipulation

    Ankit Goyal, Jie Xu, Yijie Guo, Valts Blukis, Yu-Wei Chao, and Dieter Fox. RVT: Robotic view transformer for 3d object manipulation. InProceedings of The 7th Conference on Robot Learning, volume 229 ofProceedings of Machine Learning Research, pages 694–710. PMLR, 2023

  31. [31]

    SpatialActor: Exploring disentangled spatial representations for robust robotic manipulation

    Hao Shi, Bin Xie, Yingfei Liu, Yang Yue, Tiancai Wang, Haoqiang Fan, Xiangyu Zhang, and Gao Huang. SpatialActor: Exploring disentangled spatial representations for robust robotic manipulation. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 8969–8977, 2026

  32. [32]

    AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems

    Qingwen Bu, Jisong Cai, Li Chen, Xiuqi Cui, Yan Ding, Siyuan Feng, Shenyuan Gao, Xindong He, Xu Huang, Shu Jiang, et al. AgiBot World Colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems.arXiv preprint arXiv:2503.06669, 2025

  33. [33]

    AutoTrialGen: Automated data generation from few human demonstrations via trajectory annotation and simulation trials

    Huailiang Ma, Aiguo Song, Mutian He, Mingyu Li, Yibing Yan, and Linhu Wei. Autotrialgen: Automated data generation from few human demonstrations via trajectory annotation and simulation trials.IEEE Robotics and Automation Letters, 11(6):6935–6942, 2026

  34. [34]

    RDT-1B: A diffusion foundation model for bimanual manipulation

    Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu. RDT-1B: A diffusion foundation model for bimanual manipulation. In International Conference on Learning Representations, 2025

  35. [35]

    GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

    NVIDIA, Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Jim Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, Joel Jang, Zhenyu Jiang, Jan Kautz, Kaushil Kundalia, Lawrence Lao, Zhiqi Li, Zongyu Lin, Kevin Lin, Guilin Liu, Edith Llontop, Loïc Magne, Ajay Mandlekar, Avnish Narayan, Soroush Nasiriany, Scott Reed, You Li...

  36. [36]

    Gemini Robotics 1.5: Pushing the frontier of generalist robots with advanced embodied reasoning, thinking, and motion transfer

    Gemini Robotics Team. Gemini Robotics 1.5: Pushing the frontier of generalist robots with advanced embodied reasoning, thinking, and motion transfer.arXiv preprint arXiv:2510.03342, 2025

  37. [37]

    SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics

    Mustafa Shukor, Dana Aubakirova, Francesco Capuano, Pepijn Kooijmans, Steven Palma, Adil Zouitine, Michel Aractingi, Caroline Pascal, Martino Russi, Andres Marafioti, Simon Alibert, Matthieu Cord, Thomas Wolf, and Remi Cadene. SmolVLA: A vision-language-action model for affordable and efficient robotics.arXiv preprint arXiv:2506.01844, 2025

  38. [38]

    HAMLET: Switch your vision-language-action model into a history-aware policy

    Myungkyu Koo, Daewon Choi, Taeyoung Kim, Kyungmin Lee, Changyeon Kim, Younggyo Seo, and Jinwoo Shin. HAMLET: Switch your vision-language-action model into a history-aware policy. InInternational Conference on Learning Representations, 2026

  39. [39]

    Overcoming catastrophic forgetting in neural networks

    James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. Overcoming catastrophic forgetting in neural networks.Proceedings of the National Academy of Sciences, 114(13):352...

  40. [40]

    Gradient episodic memory for continual learning

    David Lopez-Paz and Marc’Aurelio Ranzato. Gradient episodic memory for continual learning. InAdvances in Neural Information Processing Systems, volume 30, 2017

  41. [41]

    FAST: Efficient Action Tokenization for Vision-Language-Action Models

    Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, and Sergey Levine. FAST: Efficient action tokenization for vision-language-action models.arXiv preprint arXiv:2501.09747, 2025

  42. [42]

    DreamVLA: A vision-language-action model dreamed with comprehensive world knowledge

    Wenyao Zhang, Hongsi Liu, Zekun Qi, Yunnan Wang, Xinqiang Yu, Jiazhao Zhang, Runpei Dong, Jiawei He, Fan Lu, He Wang, Zhizheng Zhang, Li Yi, Wenjun Zeng, and Xin Jin. DreamVLA: A vision-language-action model dreamed with comprehensive world knowledge. arXiv preprint arXiv:2507.04447, 2025

  43. [43]

    F1: A vision-language-action model bridging understanding and generation to actions

    Qi Lv, Weijie Kong, Hao Li, Jia Zeng, Zherui Qiu, Delin Qu, Haoming Song, Qizhi Chen, Xiang Deng, and Jiangmiao Pang. F1: A vision-language-action model bridging understanding and generation to actions.arXiv preprint arXiv:2509.06951, 2025

  44. [44]

    Discrete Diffusion VLA: Bringing discrete diffusion to action decoding in vision-language-action policies

    Zhixuan Liang, Yizhuo Li, Tianshuo Yang, Chengyue Wu, Sitong Mao, Tian Nian, Liuao Pei, Shunbo Zhou, Xiaokang Yang, Jiangmiao Pang, Yao Mu, and Ping Luo. Discrete Diffusion VLA: Bringing discrete diffusion to action decoding in vision-language-action policies.arXiv preprint arXiv:2508.20072, 2025

  45. [45]

    Genie Envisioner: A unified world foundation platform for robotic manipulation

    Yue Liao, Pengfei Zhou, Siyuan Huang, Donglin Yang, Shengcong Chen, Yuxin Jiang, Yue Hu, Si Liu, Jianlan Luo, Liliang Chen, Shuicheng Yan, Maoqing Yao, and Guanghui Ren. Genie Envisioner: A unified world foundation platform for robotic manipulation. InInternational Conference on Learning Representations, 2026

  46. [46]

    MemoryVLA: Perceptual-cognitive memory in vision- language-action models for robotic manipulation

    Hao Shi, Bin Xie, Yingfei Liu, Lin Sun, Fengrong Liu, Tiancai Wang, Erjin Zhou, Haoqiang Fan, Xiangyu Zhang, and Gao Huang. MemoryVLA: Perceptual-cognitive memory in vision-language-action models for robotic manipulation. InInternational Conference on Learning Representations, 2026

  47. [47]

    NVIDIA Isaac GR00T N1.7: Open foundation model for generalized humanoid robot reasoning and skills

    NVIDIA. NVIDIA Isaac GR00T N1.7: Open foundation model for generalized humanoid robot reasoning and skills. https://huggingface.co/nvidia/GR00T-N1.7-3B, 2026. Model card.