Recognition: no theorem link
PriorVLA: Prior-Preserving Adaptation for Vision-Language-Action Models
Pith reviewed 2026-05-12 03:33 UTC · model grok-4.3
The pith
PriorVLA adapts vision-language-action models to robot tasks by freezing a Prior Expert and integrating its priors via Expert Queries into a trainable Adaptation Expert, using only 25 percent of the parameter updates required by full fine-tuning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PriorVLA preserves pretrained priors during adaptation of vision-language-action models by maintaining a frozen Prior Expert as a source of scene and motor knowledge while training only an Adaptation Expert that receives integrated priors through Expert Queries. This approach updates just 25 percent of the parameters changed by full fine-tuning. It produces stronger overall performance than full fine-tuning and current VLA baselines on RoboTwin 2.0, LIBERO, and real-world tasks, with the largest gains under out-of-distribution and few-shot conditions, including an 11-point improvement over pi0.5 on RoboTwin 2.0-Hard and 99.1 percent average success on LIBERO.
What carries the argument
Expert Queries, which extract scene priors from the pretrained vision-language model and motor priors from the frozen Prior Expert to guide the trainable Adaptation Expert during task specialization.
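The paper's implementation of this mechanism is not reproduced here; what follows is a minimal PyTorch sketch under the assumption that Expert Queries are learned tokens reading frozen prior features through standard cross-attention. The names (PriorReadout, scene_feats, motor_feats) are hypothetical and do not come from the paper.

# Minimal sketch of a query-based prior readout via cross-attention (an assumption, not the paper's code).
import torch
import torch.nn as nn

class PriorReadout(nn.Module):
    """Learned queries that read features from a frozen prior source."""
    def __init__(self, dim: int, num_queries: int, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, prior_feats: torch.Tensor) -> torch.Tensor:
        # prior_feats: (batch, tokens, dim) from a frozen module (VLM or Prior Expert).
        q = self.queries.unsqueeze(0).expand(prior_feats.size(0), -1, -1)
        out, _ = self.attn(q, prior_feats, prior_feats)  # queries attend to the prior tokens
        return out  # (batch, num_queries, dim) summary handed to the Adaptation Expert

if __name__ == "__main__":
    scene_readout = PriorReadout(dim=512, num_queries=16)  # reads scene priors from the VLM
    motor_readout = PriorReadout(dim=512, num_queries=16)  # reads motor priors from the Prior Expert
    # Stand-ins for features the frozen modules would produce (run under no_grad in practice).
    scene_feats = torch.randn(2, 196, 512)
    motor_feats = torch.randn(2, 64, 512)
    prior_tokens = torch.cat([scene_readout(scene_feats), motor_readout(motor_feats)], dim=1)
    print(prior_tokens.shape)  # torch.Size([2, 32, 512]): conditioning for the Adaptation Expert

In a setup of this kind only the readout and the Adaptation Expert receive gradients; the frozen sources stay read-only, which is the property the argument leans on.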
If this is right
- PriorVLA updates only 25 percent of the parameters changed during full fine-tuning while achieving higher task success (a parameter-counting sketch follows this list).
- Performance gains are largest in out-of-distribution and few-shot settings, such as an 11-point lift over pi0.5 on RoboTwin 2.0-Hard.
- The method reaches 99.1 percent average success on LIBERO and, on eight real-world tasks across two embodiments, attains 81 percent in-distribution and 57 percent out-of-distribution success with standard data.
- With only 10 demonstrations per task it still achieves 48 percent in-distribution and 32 percent out-of-distribution success, surpassing pi0.5 by 24 and 22 points respectively.
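The 25 percent figure in the first bullet is auditable directly from the model's parameter groups once the frozen modules are fixed. Below is a minimal PyTorch-style sketch with hypothetical module names; the toy sizes are illustrative and do not reproduce the paper's architecture.

# Audit sketch: freeze the prior-holding modules, then report the trainable share.
import torch.nn as nn

def trainable_fraction(model: nn.Module, frozen_prefixes: tuple) -> float:
    """Freeze parameters whose names start with a frozen prefix and return the trainable share."""
    total = trainable = 0
    for name, p in model.named_parameters():
        if name.startswith(frozen_prefixes):
            p.requires_grad_(False)
        total += p.numel()
        trainable += p.numel() if p.requires_grad else 0
    return trainable / total

if __name__ == "__main__":
    # Toy stand-in: frozen VLM backbone + frozen Prior Expert + trainable Adaptation Expert.
    policy = nn.ModuleDict({
        "vlm": nn.Linear(1024, 1024),
        "prior_expert": nn.Linear(512, 512),
        "adaptation_expert": nn.Linear(512, 256),
    })
    frac = trainable_fraction(policy, ("vlm", "prior_expert"))
    print(f"trainable fraction: {frac:.2%}")  # toy number; the paper reports roughly 25% for PriorVLA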
Where Pith is reading between the lines
- If prior preservation works as claimed, the same frozen-expert-plus-query pattern could support incremental skill acquisition without repeated full retraining.
- Fewer updated parameters may enable on-device or edge adaptation of robot policies where compute and memory are limited.
- The query-based integration of a frozen expert could be tested on other sequential decision models that currently rely on full fine-tuning.
Load-bearing premise
The frozen Prior Expert holds useful non-conflicting priors that the Adaptation Expert can reliably extract and apply through Expert Queries without needing joint optimization of the full model.
What would settle it
Running full fine-tuning and PriorVLA on identical data and tasks, then observing that full fine-tuning matches or exceeds PriorVLA's success rates in out-of-distribution and few-shot regimes, would show that freezing the prior source adds no benefit.
Original abstract
Large-scale pretraining has made Vision-Language-Action (VLA) models promising foundations for generalist robot manipulation, yet adapting them to downstream tasks remains necessary. However, the common practice of full fine-tuning treats pretraining as initialization and can shift broad priors toward narrow training-distribution patterns. We propose PriorVLA, a novel framework that preserves pretrained priors and learns to leverage them for effective adaptation. PriorVLA keeps a frozen Prior Expert as a read-only prior source and trains an Adaptation Expert for downstream specialization. Expert Queries capture scene priors from the pretrained VLM and motor priors from the Prior Expert, integrating both into the Adaptation Expert to guide adaptation. Together, PriorVLA updates only 25% of the parameters updated by full fine-tuning. Across RoboTwin 2.0, LIBERO, and real-world tasks, PriorVLA achieves stronger overall performance than full fine-tuning and state-of-the-art VLA baselines, with the largest gains under out-of-distribution (OOD) and few-shot settings. PriorVLA improves over pi0.5 by 11 points on RoboTwin 2.0-Hard and achieves 99.1% average success on LIBERO. Across eight real-world tasks and two embodiments, PriorVLA reaches 81% in-distribution (ID) and 57% OOD success with standard data. With only 10 demonstrations per task, PriorVLA reaches 48% ID and 32% OOD success, surpassing pi0.5 by 24 and 22 points, respectively.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes PriorVLA, a framework for adapting large Vision-Language-Action (VLA) models that freezes a Prior Expert to retain pretrained priors while training an Adaptation Expert (updating only 25% of parameters) that uses Expert Queries to extract and integrate scene priors from a VLM and motor priors from the Prior Expert. Empirical results on RoboTwin 2.0, LIBERO, and real-world tasks with two embodiments claim consistent outperformance over full fine-tuning and baselines such as pi0.5, with largest gains in OOD and few-shot regimes (e.g., +11 points on RoboTwin 2.0-Hard, 99.1% average on LIBERO, +24/+22 points with 10 demos).
Significance. If the core assumption holds, PriorVLA offers a practical route to more efficient and generalizable VLA adaptation that could reduce compute demands in robotics while improving robustness under distribution shift and limited data. The reported gains in few-shot OOD settings, if reproducible, would be a meaningful empirical contribution to parameter-efficient robot learning.
major comments (2)
- [§4] §4 (Ablation studies) and Table 3: no ablation isolates the contribution of the frozen Prior Expert's motor priors versus the effect of updating only 25% of parameters. A control with a randomly initialized frozen expert or disabled Expert Queries is required to confirm that OOD/few-shot gains (e.g., RoboTwin 2.0-Hard and 10-demo results) arise from useful non-conflicting priors rather than reduced overfitting (a sketch of such controls appears after this list).
- [§3.2] §3.2 (Expert Queries formulation): the query mechanism for reading motor priors from the frozen Prior Expert is described at high level without equations or pseudocode showing how alignment with downstream actions is enforced. This leaves open the risk of distribution mismatch in OOD regimes, which is load-bearing for the central claim that priors remain useful without joint optimization.
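For concreteness, the controls requested in the first major comment could be built by swapping modules in an otherwise identical training run. The sketch below assumes module names (prior_expert, expert_queries) and attribute paths that are not taken from the paper.

# Two control variants for the requested ablation (module names are assumptions).
import copy
import torch.nn as nn

def make_random_frozen_control(policy: nn.Module) -> nn.Module:
    """Variant A: replace the pretrained frozen expert with a randomly re-initialized, still-frozen copy."""
    control = copy.deepcopy(policy)
    prior = control.get_submodule("prior_expert")  # hypothetical path
    for m in prior.modules():
        if hasattr(m, "reset_parameters"):
            m.reset_parameters()  # discard pretrained motor priors while keeping the parameter budget
    for p in prior.parameters():
        p.requires_grad_(False)
    return control

def make_no_query_control(policy: nn.Module) -> nn.Module:
    """Variant B: freeze the Expert Queries at zero; a stricter version would drop the prior tokens entirely."""
    control = copy.deepcopy(policy)
    queries = control.get_submodule("expert_queries")  # hypothetical path
    for p in queries.parameters():
        nn.init.zeros_(p)
        p.requires_grad_(False)
    return control

Training both variants with the same data, schedule, and trainable-parameter budget as PriorVLA would separate the value of the preserved priors from the regularizing effect of updating fewer parameters.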
minor comments (2)
- [Abstract] Abstract and §5: exact data splits, number of random seeds, and statistical significance tests (e.g., p-values or confidence intervals) for the reported success rates are not stated, making it difficult to assess reliability of the 99.1% LIBERO and real-world numbers (one way to compute such intervals is sketched after this list).
- [§3] Figure 2 and §3: notation for Expert Queries (e.g., how scene vs. motor queries are distinguished and fused) could be clarified with a diagram or explicit equations to improve reproducibility.
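On the first minor comment, a standard way to attach uncertainty to per-task success rates is a binomial interval pooled over seeds. The sketch below uses a Wilson score interval with hypothetical trial counts; it is not the paper's evaluation protocol.

# Wilson score interval for a success rate, aggregated over seeds (illustrative numbers).
import math
import statistics

def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple:
    """95% Wilson score interval for a binomial proportion."""
    if trials == 0:
        return (0.0, 0.0)
    p = successes / trials
    denom = 1 + z ** 2 / trials
    center = (p + z ** 2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z ** 2 / (4 * trials ** 2)) / denom
    return (center - half, center + half)

if __name__ == "__main__":
    per_seed_successes = [46, 44, 47]  # hypothetical: 3 seeds, 50 evaluation episodes each
    rates = [s / 50 for s in per_seed_successes]
    lo, hi = wilson_interval(sum(per_seed_successes), 150)
    print(f"mean {statistics.mean(rates):.3f}, pooled 95% CI [{lo:.3f}, {hi:.3f}]")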
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. The comments highlight important aspects of our ablation design and methodological clarity that we will address in the revision. Below we respond point-by-point to the major comments.
Point-by-point responses
- Referee: [§4] §4 (Ablation studies) and Table 3: no ablation isolates the contribution of the frozen Prior Expert's motor priors versus the effect of updating only 25% of parameters. A control with a randomly initialized frozen expert or disabled Expert Queries is required to confirm that OOD/few-shot gains (e.g., RoboTwin 2.0-Hard and 10-demo results) arise from useful non-conflicting priors rather than reduced overfitting.
Authors: We agree that the current ablations in Table 3 do not fully isolate the pretrained motor priors from the general benefits of updating fewer parameters. While the existing controls demonstrate the value of the Adaptation Expert and Expert Queries, they lack a randomly initialized frozen Prior Expert baseline. In the revised manuscript we will add this control experiment (and an additional ablation disabling Expert Queries to the Prior Expert) and report the results in an expanded Table 3. These new runs will directly test whether the OOD and few-shot gains derive from the preserved priors rather than reduced overfitting alone. We will also update §4 to discuss the outcomes. revision: yes
- Referee: [§3.2] §3.2 (Expert Queries formulation): the query mechanism for reading motor priors from the frozen Prior Expert is described at high level without equations or pseudocode showing how alignment with downstream actions is enforced. This leaves open the risk of distribution mismatch in OOD regimes, which is load-bearing for the central claim that priors remain useful without joint optimization.
Authors: We acknowledge that §3.2 currently presents the Expert Queries at a conceptual level. To improve rigor, the revised version will include the explicit mathematical formulation (query, key, and value projections together with the cross-attention equations) and pseudocode in the appendix that shows how motor-prior features are read from the frozen Prior Expert and fused into the Adaptation Expert. Alignment with downstream actions is enforced by the end-to-end action-prediction loss; we will add a short paragraph clarifying this point. We will also expand the discussion of potential distribution mismatch in OOD settings, noting that our empirical results on RoboTwin 2.0-Hard and real-world OOD tasks indicate the priors remain beneficial, while acknowledging the design choice of freezing the Prior Expert as a deliberate safeguard against catastrophic forgetting. revision: yes
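The promised §3.2 formulation is not available in the reviewed text; a generic cross-attention readout of the kind the response describes would look like the following, where the symbols are assumptions rather than the paper's notation:

\begin{aligned}
Q &= E_q W_Q, \qquad K = H_{\mathrm{prior}} W_K, \qquad V = H_{\mathrm{prior}} W_V,\\
Z &= \operatorname{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V,\\
\mathcal{L}_{\mathrm{action}} &= \mathbb{E}\left[\, \lVert \hat{a}_{\theta}(o, Z) - a \rVert^{2} \,\right].
\end{aligned}

Here E_q are the learned Expert Queries, H_prior the frozen prior features (VLM tokens or Prior Expert activations), and the action-prediction loss (written as a simple regression objective for illustration) is the only training signal, which is how end-to-end alignment with downstream actions would be enforced.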
Circularity Check
No circularity: empirical results on held-out benchmarks
Full rationale
The paper introduces PriorVLA as an architectural framework (frozen Prior Expert + trainable Adaptation Expert + Expert Queries) and reports success rates on RoboTwin 2.0, LIBERO, and real-world tasks. These metrics are direct experimental measurements on separate test sets, not algebraic derivations, fitted parameters renamed as predictions, or self-referential equations. No load-bearing self-citations, uniqueness theorems, or ansatzes that reduce the central performance claims to the method's own inputs appear in the abstract or described structure. The derivation chain consists of design choices followed by empirical validation against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- fraction of parameters updated (25%)
axioms (1)
- domain assumption: Pretrained VLA models contain broad, transferable priors about scenes and motor skills that are worth preserving during downstream adaptation.
invented entities (3)
- Prior Expert (no independent evidence)
- Adaptation Expert (no independent evidence)
- Expert Queries (no independent evidence)