TTT-VLA: Test-Time Latent Prompt Optimization for Vision-Language-Action Models

Jiajun Liu; Jianxiong Li; Lingqiao Liu; Shuai Yang; Sijin Chen; Wenbo Zhang; Xiao Ma

arxiv: 2606.03127 · v1 · pith:7R5FNZ4Lnew · submitted 2026-06-02 · 💻 cs.RO

Wenbo Zhang, Jianxiong Li, Shuai Yang, Sijin Chen, Jiajun Liu, Lingqiao Liu, Xiao Ma

Pith reviewed 2026-06-28 10:07 UTC · model grok-4.3

classification 💻 cs.RO

keywords: test-time training · latent prompt optimization · vision-language-action models · robot manipulation · policy adaptation · distribution shift · foundation models

The pith

Optimizing a latent prompt at test time raises the task success rates of vision-language-action models in new environments by correcting a few critical decisions, without changing the policy weights.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces TTT-VLA, a framework that learns a latent prompt during training alongside the main policy by adding a proxy task. At test time the prompt is optimized alone using self-supervised signals from interaction data collected in the target setting, while the policy weights stay fixed. This yields higher task success rates on SimplerEnv benchmarks in both single-embodiment and multi-embodiment cases. The gains arise mainly from fixing a handful of important mistakes rather than reshaping overall behavior. The approach therefore supplies a lightweight way to adapt foundation manipulation policies to distribution shift after deployment.

Core claim

TTT-VLA performs test-time training for VLA models by optimizing only the latent prompt on the proxy task's self-supervised signal derived from current-environment interaction data, producing higher task success rates without any modification to the policy itself.
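The claim can be written compactly as a two-stage objective. The notation below is an assumption introduced here for illustration, since the abstract supplies no equations: \theta stands for the policy weights, z for the latent prompt, \mathcal{L}_{\text{act}} for the action loss, \mathcal{L}_{\text{proxy}} for the self-supervised proxy loss, and \lambda for a weighting. The additive form of the training objective is itself a guess.

```latex
% Hypothetical notation; not taken from the paper.
% Training: policy weights and latent prompt are learned jointly.
(\theta^{\star}, z_{0}) = \arg\min_{\theta,\, z}\;
  \mathbb{E}_{(o,\ell,a) \sim \mathcal{D}_{\text{train}}}
  \Big[ \mathcal{L}_{\text{act}}(\theta, z;\, o, \ell, a)
        + \lambda\, \mathcal{L}_{\text{proxy}}(\theta, z;\, o) \Big]

% Test time: only the prompt moves, starting from z_0; \theta^{\star} stays fixed.
z^{\star} = \arg\min_{z}\;
  \mathbb{E}_{o \sim \mathcal{D}_{\text{test}}}
  \Big[ \mathcal{L}_{\text{proxy}}(\theta^{\star}, z;\, o) \Big]
```

Written this way, the load-bearing question is visible in the second line: the test-time objective contains no action or success term, so any gain in task success has to come through whatever coupling the proxy loss has with the policy's behavior.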

What carries the argument

Latent Prompt Optimization (LPO), an extra learned conditioning signal trained with a proxy task and then tuned at test time to steer the frozen policy.

If this is right

Task success rates rise consistently in single-embodiment settings on SimplerEnv.
Task success rates rise consistently in multi-embodiment settings on SimplerEnv.
Performance gains come chiefly from correcting a small number of critical decisions rather than global changes to policy behavior.
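
One way to probe the last implication is to replay the same logged observations through the frozen policy under the original and the adapted prompt, then count the steps where the two actions differ. The sketch below is a hypothetical diagnostic, not the paper's analysis; `divergent_steps`, the tolerance `tol`, and the toy action vectors are all assumptions made for illustration.

```python
def divergent_steps(baseline_actions, adapted_actions, tol=0.05):
    """Return indices of timesteps where the two action sequences differ.

    Both inputs are equal-length lists of action vectors computed on the SAME
    observations (an assumption), once per prompt. A step counts as divergent
    when any action dimension differs by more than `tol`.
    """
    if len(baseline_actions) != len(adapted_actions):
        raise ValueError("action sequences must have the same length")
    steps = []
    for t, (base, adapted) in enumerate(zip(baseline_actions, adapted_actions)):
        gap = max(abs(x - y) for x, y in zip(base, adapted))
        if gap > tol:
            steps.append(t)
    return steps


# Toy data (made up): six steps of 2-D actions; only step 3 changes materially.
baseline = [[0.10, 0.0], [0.12, 0.0], [0.15, 0.1], [0.20, 0.1], [0.22, 0.1], [0.25, 0.1]]
adapted = [[0.10, 0.0], [0.12, 0.0], [0.15, 0.1], [0.45, 0.1], [0.23, 0.1], [0.25, 0.1]]

changed = divergent_steps(baseline, adapted)
fraction_changed = len(changed) / len(baseline)
```

A small `fraction_changed` alongside a higher success rate would be consistent with the "few critical decisions" reading; a large fraction would point toward a global change in behavior.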

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same prompt-only adaptation pattern could be tested on other prompt-conditioned foundation models that encounter distribution shift.
Targeted prompt updates may prove sufficient for many practical deployment fixes in robotics without full policy retraining.
Combining LPO with additional test-time signals could be explored to handle more severe shifts.

Load-bearing premise

The proxy task's self-supervised signal remains informative and sufficient to optimize the latent prompt for correcting distribution-shift errors when only interaction data from the current environment is available and the policy itself is not modified.

What would settle it

If optimizing the latent prompt on interaction data from a new environment produces no increase in task success rates compared with the unadapted baseline, the central claim would be falsified.
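
The falsification test above amounts to comparing two success rates. A minimal sketch of that comparison follows, using a one-sided two-proportion z-test from the standard library; the function name and the trial counts are hypothetical, and the test assumes independent trials in each arm.

```python
import math


def two_proportion_z(success_base, n_base, success_adapted, n_adapted):
    """One-sided z-test for 'adapted success rate > baseline success rate'.

    Returns (z, p_value). Assumes independent Bernoulli trials in each arm.
    """
    p_base = success_base / n_base
    p_adapted = success_adapted / n_adapted
    pooled = (success_base + success_adapted) / (n_base + n_adapted)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_base + 1.0 / n_adapted))
    if se == 0.0:
        raise ValueError("test undefined when every trial has the same outcome")
    z = (p_adapted - p_base) / se
    # Upper-tail probability of a standard normal: P(Z >= z).
    p_value = 0.5 * math.erfc(z / math.sqrt(2.0))
    return z, p_value


# Made-up counts: 40/100 successes unadapted, 60/100 after prompt adaptation.
z, p_value = two_proportion_z(40, 100, 60, 100)
```

If both arms are evaluated on the same initial conditions, the trials are paired and a paired test such as McNemar's would be the more appropriate choice.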

Original abstract

Vision-Language-Action (VLA) models trained on large-scale data have made remarkable progress, but they remain vulnerable to distribution shifts at deployment time. Recent VLA models suggest that prompts can serve as an efficient interface for steering policy behavior, but existing prompt-based steering typically relies on external guidance. This raises a natural question: can test-time training (TTT) for VLA be achieved by optimizing a prompt, so that the steering interface itself can be learned and adapted from interaction? We address this question with TTT-VLA, a test-time training framework based on Latent Prompt Optimization (LPO). During training, the latent prompt is learned with an additional proxy task, providing an extra learned conditioning signal for policy learning. At test time, TTT is performed by collecting interaction data from the current environment and optimizing only the latent prompt on those data using the proxy task's self-supervised signal, without modifying the policy itself. Experiments on SimplerEnv demonstrate that the proposed method consistently improves task success rates in both single- and multi-embodiment settings. Further analysis shows that the gains arise primarily from correcting a small number of critical decisions rather than globally altering policy behavior. These results suggest that LPO provides an effective and practical pathway for deployment-time improvement of foundation manipulation policies.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

TTT-VLA gives a workable test-time route for VLA models by optimizing a learned latent prompt on proxy signals, with reported gains on SimplerEnv, though the proxy's link to fixing shift errors stays under-explained.

The paper's core move is to train a latent prompt alongside the main VLA policy using an extra proxy task, then at test time collect fresh interaction data and tune only that prompt via the proxy's self-supervised loss. No policy weights change. This turns the prompt into an adaptable steering knob learned from the current environment.
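
The loop described above can be illustrated with a toy example. Everything in it is an assumption: the source does not define the proxy task, so the sketch stands in a next-observation predictor with two frozen scalar weights and a single scalar prompt, and models the distribution shift as a constant offset. It shows only the pattern of "frozen weights, gradient steps on the prompt alone", not the authors' implementation.

```python
import random

# Frozen weights of a toy proxy head (hypothetical): next_obs ~ a*obs + b*act + prompt.
FROZEN = (0.9, 0.5)


def proxy_loss_and_grad(prompt, frozen, transitions):
    """Mean squared prediction error and its gradient w.r.t. the prompt only."""
    a, b = frozen
    loss, grad = 0.0, 0.0
    for obs, act, next_obs in transitions:
        err = (a * obs + b * act + prompt) - next_obs
        loss += err * err
        grad += 2.0 * err
    n = len(transitions)
    return loss / n, grad / n


def optimize_prompt(prompt, frozen, transitions, lr=0.1, steps=200):
    """Gradient descent on the prompt; the frozen weights are never updated."""
    for _ in range(steps):
        _, grad = proxy_loss_and_grad(prompt, frozen, transitions)
        prompt -= lr * grad
    return prompt


def collect_transitions(n, shift, seed=0):
    """Simulated target-environment data whose dynamics carry a constant offset."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        obs, act = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
        data.append((obs, act, 0.9 * obs + 0.5 * act + shift))
    return data


transitions = collect_transitions(64, shift=0.3)
loss_before, _ = proxy_loss_and_grad(0.0, FROZEN, transitions)
adapted_prompt = optimize_prompt(0.0, FROZEN, transitions)
loss_after, _ = proxy_loss_and_grad(adapted_prompt, FROZEN, transitions)
```

In this toy the prompt absorbs the offset and the proxy loss falls to nearly zero. Whether a lower proxy loss also means better actions is exactly the question the toy cannot answer.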

It is new in tying latent prompt optimization directly to a test-time training loop for VLAs, rather than relying on external guidance or full retraining. The SimplerEnv results claim consistent success-rate lifts in both single- and multi-embodiment settings, with the added note that gains come from fixing a few critical decisions instead of broad behavior change. That framing is useful for deployment where you want light adaptation.

The experiments appear to support the practical claim, and the method avoids the usual cost of updating large models. Credit for shipping a concrete framework that others could try.

The soft spot is the proxy task itself. The abstract gives no definition, no relation to the main objective, and no check that its signal actually points at the distribution-shift errors rather than unrelated features. If the proxy is only loosely coupled, test-time optimization could satisfy the loss without improving (or while hurting) task success. The stress-test concern lands here; the paper needs to show the proxy targets the right errors and that the gains survive ablations on the proxy choice.
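
A simple diagnostic for this coupling is to correlate, per episode, how much the proxy loss dropped after adaptation with whether the episode succeeded. The sketch below uses a plain Pearson correlation (equivalent to a point-biserial correlation when one variable is 0/1). The function, the variable names, and the eight data points are invented for illustration; nothing here comes from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Made-up per-episode records: proxy-loss reduction and task outcome (1 = success).
proxy_loss_drop = [0.02, 0.05, 0.30, 0.28, 0.01, 0.35, 0.25, 0.03]
success = [0, 0, 1, 1, 0, 1, 1, 0]

coupling = pearson(proxy_loss_drop, success)
```

A coefficient near zero on real rollouts would support the concern that the proxy can be satisfied without helping the task. A strong positive value would not prove causation, but it would make the mechanism more plausible.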

This is for robotics groups working on VLA deployment and test-time methods. Readers already following prompt steering or foundation-model adaptation will get the most from the experiments and the LPO framing. It is coherent enough on its own terms to deserve referee time, even if the proxy details need tightening.

Referee Report

2 major / 2 minor

Summary. The paper introduces TTT-VLA, a test-time training framework for Vision-Language-Action (VLA) models based on Latent Prompt Optimization (LPO). During training, a latent prompt is learned jointly with an additional proxy task that provides an extra conditioning signal. At test time, interaction data from the current environment is collected and used to optimize only the latent prompt via the proxy task's self-supervised signal, without modifying the policy parameters. Experiments on SimplerEnv are reported to show consistent improvements in task success rates in both single- and multi-embodiment settings, with further analysis indicating that gains come from correcting a small number of critical decisions rather than global policy changes.

Significance. If the central claim holds with rigorous evidence, the work would demonstrate a practical mechanism for deployment-time adaptation of large foundation VLA policies using only interaction data and a fixed proxy signal. This could be significant for robotics, as it avoids the computational cost of policy fine-tuning while addressing distribution shift. The approach of learning a proxy task specifically to enable self-supervised prompt optimization at test time is a potentially useful idea, provided the signal is shown to be informative for the claimed error-correction behavior.

major comments (2)

[Method] Method section (description of LPO and proxy task): The proxy task is load-bearing for the entire test-time procedure, yet no definition of its formulation, loss function, or coupling to the main manipulation objective is supplied. Without this, it is impossible to evaluate whether the self-supervised signal remains informative for correcting distribution-shift errors (as required by the central claim) or whether optimization could converge to prompts that satisfy the proxy loss without improving success rates.
[Experiments] Experiments section: The abstract states that experiments demonstrate consistent improvements and that gains arise from correcting a small number of critical decisions, but supplies no quantitative success rates, baselines, statistical tests, number of trials, or details on interaction data collection and optimization procedure. These omissions make the empirical support for the claim unverifiable from the provided text.
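
To make the second comment concrete: a success rate reported with its trial count can be turned into an interval. The sketch below computes a Wilson score interval with the standard library. The function name and the example counts are assumptions for illustration and are not numbers from the paper.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial success rate (z=1.96 gives ~95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    center = (p_hat + z * z / (2.0 * trials)) / denom
    half = z * math.sqrt(p_hat * (1.0 - p_hat) / trials
                         + z * z / (4.0 * trials * trials)) / denom
    # Clamp to [0, 1] to absorb floating-point round-off at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical example: 50 successes in 100 evaluation episodes.
low, high = wilson_interval(50, 100)
```

With 100 trials the interval spans roughly ten percentage points on either side of 50%, which is why the number of evaluation trials matters for judging whether a reported lift is meaningful.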

minor comments (2)

[Abstract] The abstract and introduction use the term 'latent prompt' without an initial formal definition or diagram showing its integration into the VLA architecture.
[Method] Notation for the proxy task loss and the test-time optimization objective should be introduced explicitly with equations to allow reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on TTT-VLA. The comments correctly identify needs for greater methodological transparency and empirical detail. We address each point below and will revise the manuscript accordingly.

Referee: [Method] Method section (description of LPO and proxy task): The proxy task is load-bearing for the entire test-time procedure, yet no definition of its formulation, loss function, or coupling to the main manipulation objective is supplied. Without this, it is impossible to evaluate whether the self-supervised signal remains informative for correcting distribution-shift errors (as required by the central claim) or whether optimization could converge to prompts that satisfy the proxy loss without improving success rates.

Authors: We agree that an explicit definition is required for the proxy task to be evaluable. In the revised manuscript we will expand the Method section to state the precise formulation of the proxy task (including input/output structure), its loss function, and the joint training objective that couples it to the primary VLA policy. This addition will directly address whether the self-supervised signal is expected to remain informative under distribution shift.
Revision: yes

Referee: [Experiments] Experiments section: The abstract states that experiments demonstrate consistent improvements and that gains arise from correcting a small number of critical decisions, but supplies no quantitative success rates, baselines, statistical tests, number of trials, or details on interaction data collection and optimization procedure. These omissions make the empirical support for the claim unverifiable from the provided text.

Authors: We acknowledge that the current text does not supply the requested quantitative details. In the revision we will augment the Experiments section with concrete success rates per task and embodiment setting, baseline comparisons, number of evaluation trials, any statistical tests performed, and a precise description of interaction-data collection and latent-prompt optimization steps. We will also add a short summary of these numbers to the abstract if length permits.
Revision: yes

Circularity Check

0 steps flagged

No circularity in derivation chain

The paper describes an empirical method for test-time latent prompt optimization using a proxy task's self-supervised signal on new interaction data, without any equations, derivations, or mathematical claims. The optimization is performed on external data at deployment and does not reduce any prediction or result to quantities fitted inside the paper. No self-citations are invoked as load-bearing premises, and the central claims rest on experimental outcomes rather than any self-referential construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Abstract-only review yields minimal concrete parameters or axioms; the core premise that a proxy self-supervised signal can drive effective prompt adaptation is treated as a domain assumption.

axioms (1)

Domain assumption: The proxy task learned during training supplies a reliable self-supervised signal usable for test-time prompt optimization under distribution shift.
Invoked as the mechanism enabling adaptation without policy modification.

invented entities (1)

Latent prompt (no independent evidence)
purpose: Extra learned conditioning signal for both training and test-time adaptation of the policy.
Introduced as the central new interface in the TTT-VLA framework.

[1] [1]

$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. A vision-language-action flow model for general robot control.arXiv preprint arXiv:2410.24164, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[2] [2]

RT-1: Robotics Transformer for Real-World Control at Scale

Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real-world control at scale.arXiv preprint arXiv:2212.06817, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[3] [3]

RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control.arXiv preprint arXiv:2307.15818, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[4] [4]

RT-1: robotics transformer for real-world control at scale

Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alexander Herzog, Jasmine Hsu, Julian Ibarz, Brian Ichter, Alex Irpan, Tomas Jackson, Sally Jesmonth, Nikhil Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, Isabel Leal, Kuang-Huei Lee, Sergey Levine, Yao Lu, Utsav Mall...

work page doi:10.15607/rss.2023.xix.025 2023

[5] [5]

Univla: Learning to act anywhere with task-centric latent actions, 2025

Qingwen Bu, Yanting Yang, Jisong Cai, Shenyuan Gao, Guanghui Ren, Maoqing Yao, Ping Luo, and Hongyang Li. Univla: Learning to act anywhere with task-centric latent actions, 2025

2025

[6] [6]

Yossi Gandelsman, Yu Sun, Xinlei Chen, and Alexei A. Efros. Test-time training with masked autoencoders. In Advancesin Neural Information Processing Systems, 2022

2022

[7] [7]

Efros, Lerrel Pinto, and Xiaolong Wang

Nicklas Hansen, Rishabh Jangir, Yu Sun, Guillem Alenyà, Pieter Abbeel, Alexei A. Efros, Lerrel Pinto, and Xiaolong Wang. Self-supervised policy adaptation during deployment.arXiv preprint arXiv:2007.04309, 2020

work page arXiv 2007

[8] [8]

ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning

Chi-Pin Huang, Yueh-Hua Wu, Min-Hung Chen, Yu-Chiang Frank Wang, and Fu-En Yang. Thinkact: Vision- language-action reasoning via reinforced visual latent planning.arXiv preprint arXiv:2507.16815, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[9] [9]

Physical Intelligence, Bo Ai, Ali Amin, Raichelle Aniceto, Ashwin Balakrishna, Greg Balke, Kevin Black, George Bokinsky, Shihao Cao, Thomas Charbonnier, et al.π0.7: a steerable generalist robotic foundation model with emergent capabilities.arXiv preprint arXiv:2604.15483, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[10] [10]

Oxe-auge: A large-scale robot augmentation of oxe for scaling cross-embodiment policy learning, 2025

Guanhua Ji, Harsha Polavaram, Lawrence Yunliang Chen, Sandeep Bajamahal, Zehan Ma, Simeon Adebola, Chenfeng Xu, and Ken Goldberg. Oxe-auge: A large-scale robot augmentation of oxe for scaling cross-embodiment policy learning, 2025

2025

[11] [11]

OpenVLA: An Open-Source Vision-Language-Action Model, June 2024

Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, Quan Vuong, Thomas Kollar, Benjamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, and Chelsea Finn. OpenVLA: An Open-Source Vision-Language-Action Model, June 2024

[12]

Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Paul Foster, Pannag R. Sanketi, Quan Vuong, Thomas Kollar, Benjamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, and Chelsea Finn. Openvla: An open-source vision-language-action model. In Conference on Robot Learning, vol...

[13]

Moo Jin Kim, Yihuai Gao, Tsung-Yi Lin, Yen-Chen Lin, Yunhao Ge, Grace Lam, Percy Liang, Shuran Song, Ming-Yu Liu, Chelsea Finn, and Jinwei Gu. Cosmos policy: Fine-tuning video models for visuomotor control and planning, 2026

[14]

Jason Lee, Jiafei Duan, Haoquan Fang, Yuquan Deng, Shuo Liu, Boyang Li, Bohan Fang, Jieyu Zhang, Yi Ru Wang, Sangho Lee, Winson Han, Wilbert Pumacay, Angelica Wu, Rose Hendrix, Karen Farley, Eli VanderBilt, Ali Farhadi, Dieter Fox, and Ranjay Krishna. Molmoact: Action reasoning models that can reason in space, 2025

[15]

Pannag R. Sanketi Li, Hao Wang, Ajay Mandlekar, Sichen M. Yang, Haoran Geng, Yizhou Duan, Ruslan Li, Vincent Vanhoucke, Chelsea Finn, Julian Ibarz, Fei Xia, and Tianhe Yu. Evaluating real-world robot manipulation policies in simulation. In Proceedings of the 8th Conference on Robot Learning, 2025.

[16]

Qixiu Li, Yaobo Liang, Zeyu Wang, Lin Luo, Xi Chen, Mozheng Liao, Fangyun Wei, Yu Deng, Sicheng Xu, Yizhong Zhang, Xiaofan Wang, Bei Liu, Jianlong Fu, Jianmin Bao, Dong Chen, Yuanchun Shi, Jiaolong Yang, and Baining Guo. Cogact: A foundational vision-language-action model for synergizing cognition and action in robotic manipulation. arXiv preprint arXiv:2...

[17]

Shijie Li, Weixian Li, Tianle Zhou, Abul Kalam Azad, Jiankai Liu, Yeyun Gong, Liangming Pan, Chao Li, Zhangyang Wang, Bo An, and Li Dong. Emma-x: An embodied multimodal action model with grounded chain of thought and look-ahead spatial reasoning. arXiv preprint arXiv:2412.11974, 2024

[18]

Xinghang Li, Peiyan Li, Minghuan Liu, Dong Wang, Jirong Liu, Bingyi Kang, Xiao Ma, Tao Kong, Hanbo Zhang, and Huaping Liu. Towards generalist robot policies: What matters in building vision-language-action models. arXiv preprint arXiv:2412.14058, 2024

[19]

NVIDIA GEAR Team, Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, Joel Jang, Zhenyu Jiang, Jan Kautz, Kaushil Kundalia, Lawrence Lao, Zhiqi Li, Zongyu Lin, Kevin Lin, Guilin Liu, et al. Gr00t n1: An open foundation model for generalist humanoid robots, 2025

[20]

Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Charles Xu, Jianlan Luo, Tobias Kreiman, You Liang Tan, Dorsa Sadigh, Chelsea Finn, and Sergey Levine. Octo: An open-source generalist robot policy. https://octo-models.github.io, 2023

[21]

Abhishek Padalkar, Acorn Pooley, Ajinkya Jain, Alex Bewley, Alex Herzog, Alex Irpan, Alexander Khazatsky, Anant Rai, Anikait Singh, Anthony Brohan, et al. Open x-embodiment: Robotic learning datasets and rt-x models. arXiv preprint arXiv:2310.08864, 2023

[22]

Kaustubh Pathak, Lawrence Yunliang Chen, Yiming Gao, Somil Kent, Long Ma, Anca Dragan, Dorsa Sadigh, and Chelsea Finn. Gr-rl: Going dexterous and precise for long-horizon robotic manipulation, 2025

[23]

Karl Pertsch, Quan Vuong, Kevin Black, You Liang Tan, Adnan Esmail, Isabel Leal, Homer Walke, Chelsea Finn, and Sergey Levine. Fast: Efficient action tokenization for vision-language-action models. arXiv preprint arXiv:2501.09747, 2025

[24]

Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, Alex Irpan, Julian Jones, Nikhil Joshi, Dmitry Kalashnikov, Yuheng Kuang, Isabel Leal, Sergey Levine, Yao Lu, Corey Lynch, Karl Pertsch, Kanishka Rao, Krista Reyma...

[25]

Physical Intelligence, Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π*0.6: a vla that learns from experience, 2025

[26]

Physical Intelligence, Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0.7: a steerable generalist robotic foundation model with emergent behaviors, 2026

[27]

Delin Qu, Haoming Song, Qizhi Chen, Yuanqi Yao, Xinyi Ye, Yan Ding, Zhigang Wang, JiaYuan Gu, Bin Zhao, Dong Wang, and Xuelong Li. Spatialvla: Exploring spatial representations for visual-language-action model. arXiv preprint arXiv:2501.15830, 2025

[28]

Lucy Xiaoyang Shi, Zheyuan Hu, Tony Z. Zhao, Archit Sharma, Karl Pertsch, Jianlan Luo, Sergey Levine, and Chelsea Finn. Yell at your robot: Improving on-the-fly from language corrections. In Robotics: Science and Systems, 2024

[29]

Yu Sun, Xiaolong Wang, Zhuang Liu, John Miller, Alexei A. Efros, and Moritz Hardt. Test-time training with self-supervision for generalization under distribution shifts. In Proceedings of the 37th International Conference on Machine Learning, pages 9229–9248, 2020

[30]

Homer Walke, Kevin Black, Abraham Lee, Moo Jin Kim, Max Du, Chongyi Zheng, Tony Zhao, Philippe Hansen-Estruch, Quan Vuong, Andre He, Vivek Myers, Kuan Fang, Chelsea Finn, and Sergey Levine. Bridgedata v2: A dataset for robot learning at scale. In Conference on Robot Learning (CoRL), 2023

[31]

Chi Wan, Kangrui Wang, Yuan Si, Pingyue Zhang, and Manling Li. Worldagen: Unified state-action prediction with test-time world model training. In Proceedings of the AAAI Conference on Artificial Intelligence, 2026.

[32]

Dequan Wang, Evan Shelhamer, Shaoteng Liu, Bruno Olshausen, and Trevor Darrell. Tent: Fully test-time adaptation by entropy minimization. In International Conference on Learning Representations, 2021

[33]

Lirui Wang, Xinlei Chen, Jialiang Zhao, and Kaiming He. Scaling proprioceptive-visual learning with heterogeneous pre-trained transformers. In Advances in Neural Information Processing Systems, 2024

[34]

Qin Wang, Olga Fink, Luc Van Gool, and Dengxin Dai. Continual test-time domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7201–7211, 2022

[35]

Yuqing Wen, Kefan Gu, Haoxuan Liu, Yucheng Zhao, Tiancai Wang, Haoqiang Fan, and Xiaoyan Sun. Rosa: Harnessing robot states for vision-language and action alignment, 2025

[36]

Jianwei Yang, Reuben Tan, Qianhui Wu, Ruijie Zheng, Baolin Peng, Yongyuan Liang, Yu Gu, Mu Cai, Seonghyeon Ye, Joel Jang, Yuquan Deng, Lars Liden, and Jianfeng Gao. Magma: A foundation model for multimodal ai agents. arXiv preprint arXiv:2502.13130, 2025

[37]

Shuai Yang, Hao Li, Bin Wang, Yilun Chen, Yang Tian, Tai Wang, Hanqing Wang, Feng Zhao, Yiyi Liao, and Jiangmiao Pang. Instructvla: Vision-language-action instruction tuning from understanding to manipulation. arXiv preprint arXiv:2507.17520, 2025

[38]

Seonghyeon Ye, Joel Jang, Byeongguk Jeon, Sejune Joo, Jianwei Yang, Baolin Peng, Ajay Mandlekar, Reuben Tan, Yu-Wei Chao, Bill Yuchen Lin, Lars Liden, Kimin Lee, Jianfeng Gao, Luke Zettlemoyer, Dieter Fox, and Minjoon Seo. Latent action pretraining from videos, 2025

[39]

Seonghyeon Ye, Yunhao Ge, Kaiyuan Zheng, Shenyuan Gao, Sihyun Yu, George Kurian, Suneel Indupuru, You Liang Tan, Chuning Zhu, Jiannan Xiang, Ayaan Malik, Kyungmin Lee, William Liang, Nadun Ranawaka, Jiasheng Gu, Yinzhen Xu, Guanzhi Wang, Fengyuan Hu, Avnish Narayan, Johan Bjorck, Jing Wang, Gwanghyun Kim, Dantong Niu, Ruijie Zheng, Yuqi Xie, Jimmy Wu, Qi ...

[40]

Qingqing Zhao, Yao Lu, Moo Jin Kim, Zipeng Fu, Zhuoyang Zhang, Yecheng Wu, Zhaoshuo Li, Qianli Ma, Song Han, Chelsea Finn, et al. Cot-vla: Visual chain-of-thought reasoning for vision-language-action models. arXiv preprint arXiv:2503.22020, 2025

[41]

Ruijie Zheng, Yongyuan Liang, Shuaiyi Huang, Jianfeng Gao, Hal Daumé III, Andrey Kolobov, Furong Huang, and Jianwei Yang. Tracevla: Visual trace prompting enhances spatial-temporal awareness for generalist robotic policies. arXiv preprint arXiv:2412.10345, 2024.

Appendix

Table 5 Extended comparison of deployment-time improvement strategies for VLA. This...
