arxiv: 2606.03127 · v1 · pith:7R5FNZ4Lnew · submitted 2026-06-02 · 💻 cs.RO

TTT-VLA: Test-Time Latent Prompt Optimization for Vision-Language-Action Models

Pith reviewed 2026-06-28 10:07 UTC · model grok-4.3

classification 💻 cs.RO
keywords test-time training · latent prompt optimization · vision-language-action models · robot manipulation · policy adaptation · distribution shift · foundation models

The pith

Optimizing a latent prompt at test time improves vision-language-action model success rates in new environments by correcting critical decisions, without changing the policy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces TTT-VLA, a framework that learns a latent prompt during training alongside the main policy by adding a proxy task. At test time the prompt is optimized alone using self-supervised signals from interaction data collected in the target setting, while the policy weights stay fixed. This yields higher task success rates on SimplerEnv benchmarks in both single-embodiment and multi-embodiment cases. The gains arise mainly from fixing a handful of important mistakes rather than reshaping overall behavior. The approach therefore supplies a lightweight way to adapt foundation manipulation policies to distribution shift after deployment.

Core claim

TTT-VLA performs test-time training for VLA models by optimizing only the latent prompt on the proxy task's self-supervised signal derived from current-environment interaction data, producing higher task success rates without any modification to the policy itself.
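The prompt-only update can be illustrated with a toy sketch. It rests on loud assumptions: the text reviewed here does not define the proxy task, so the example substitutes a hypothetical linear next-observation predictor whose latent prompt acts as an additive bias, trained with a mean-squared-error proxy loss. The names `predict`, `proxy_loss`, and `optimize_prompt` are illustrative and are not taken from the paper.

```python
import random

def predict(weights, prompt, obs):
    # Hypothetical proxy head: frozen linear map plus the latent prompt as a bias.
    return [sum(w * o for w, o in zip(row, obs)) + p
            for row, p in zip(weights, prompt)]

def proxy_loss(weights, prompt, batch):
    # Mean over samples of the squared prediction error (assumed proxy objective).
    total = 0.0
    for obs, target in batch:
        pred = predict(weights, prompt, obs)
        total += sum((p - t) ** 2 for p, t in zip(pred, target))
    return total / len(batch)

def optimize_prompt(weights, prompt, batch, lr=0.1, steps=200):
    # Gradient descent on the prompt only; `weights` is read but never written.
    prompt = list(prompt)
    for _ in range(steps):
        grad = [0.0] * len(prompt)
        for obs, target in batch:
            pred = predict(weights, prompt, obs)
            for j, (p, t) in enumerate(zip(pred, target)):
                grad[j] += 2.0 * (p - t) / len(batch)
        prompt = [p - lr * g for p, g in zip(prompt, grad)]
    return prompt

rng = random.Random(0)
weights = [[0.9, 0.1], [0.0, 1.0]]          # stands in for the frozen policy
frozen_copy = [row[:] for row in weights]
shift = [0.5, -0.3]                          # synthetic distribution shift

# Synthetic "interaction data" from the shifted environment.
batch = []
for _ in range(64):
    obs = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
    clean = predict(weights, [0.0, 0.0], obs)
    batch.append((obs, [c + s for c, s in zip(clean, shift)]))

prompt0 = [0.0, 0.0]
loss_before = proxy_loss(weights, prompt0, batch)
prompt_star = optimize_prompt(weights, prompt0, batch)
loss_after = proxy_loss(weights, prompt_star, batch)
```

In this toy the optimized prompt absorbs the synthetic shift while the weights stay untouched. Whether a real proxy signal transfers to task success in the same way is exactly what the paper's experiments would need to show.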

What carries the argument

Latent Prompt Optimization (LPO), an extra learned conditioning signal trained with a proxy task and then tuned at test time to steer the frozen policy.

If this is right

  • Task success rates rise consistently in single-embodiment settings on SimplerEnv.
  • Task success rates rise consistently in multi-embodiment settings on SimplerEnv.
  • Performance gains come chiefly from correcting a small number of critical decisions rather than global changes to policy behavior.
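One way to probe the "small number of critical decisions" claim is to count how many timesteps differ between the unadapted and adapted policies on the same observations. The sketch below is an assumed analysis, not the paper's procedure; the function name `changed_steps`, the tolerance, and the action values are hypothetical.

```python
import math

def changed_steps(base_actions, adapted_actions, tol=0.05):
    """Return indices of timesteps whose actions differ by more than tol (L2 norm)."""
    changed = []
    for t, (a, b) in enumerate(zip(base_actions, adapted_actions)):
        if math.dist(a, b) > tol:
            changed.append(t)
    return changed

# Made-up action sequences for illustration; not data from the paper.
base = [[0.0, 0.0, 1.0] for _ in range(10)]
adapted = [list(a) for a in base]
adapted[4] = [0.0, 0.3, 1.0]   # a single altered decision

idx = changed_steps(base, adapted)
fraction = len(idx) / len(base)
```

A low fraction of changed steps alongside a higher success rate would be consistent with targeted correction. A high fraction would point toward a global change in behavior.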

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same prompt-only adaptation pattern could be tested on other prompt-conditioned foundation models that encounter distribution shift.
  • Targeted prompt updates may prove sufficient for many practical deployment fixes in robotics without full policy retraining.
  • Combining LPO with additional test-time signals could be explored to handle more severe shifts.

Load-bearing premise

The proxy task's self-supervised signal remains informative and sufficient to optimize the latent prompt for correcting distribution-shift errors when only interaction data from the current environment is available and the policy itself is not modified.

What would settle it

If optimizing the latent prompt on interaction data from a new environment produces no increase in task success rates compared with the unadapted baseline, the central claim would be falsified.
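Such a comparison needs uncertainty estimates on the success rates. The following sketch computes Wilson score intervals for two success rates. The trial counts are placeholders chosen for illustration and are not results from the paper; the choice of the Wilson interval is also an assumption, since the text does not say which statistical test, if any, was used.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials
                         + z * z / (4 * trials * trials)) / denom
    return center - half, center + half

# Placeholder counts for illustration only; NOT results from the paper.
baseline = wilson_interval(50, 100)   # unadapted policy
adapted = wilson_interval(70, 100)    # policy with optimized latent prompt
intervals_overlap = baseline[1] >= adapted[0]
```

Non-overlapping intervals are a conservative check. A paired test over matched evaluation episodes would usually be more sensitive if the two policies are run on the same initial conditions.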

Original abstract

Vision-Language-Action (VLA) models trained on large-scale data have made remarkable progress, but they remain vulnerable to distribution shifts at deployment time. Recent VLA models suggest that prompts can serve as an efficient interface for steering policy behavior, but existing prompt-based steering typically relies on external guidance. This raises a natural question: can test-time training (TTT) for VLA be achieved by optimizing a prompt, so that the steering interface itself can be learned and adapted from interaction? We address this question with TTT-VLA, a test-time training framework based on Latent Prompt Optimization (LPO). During training, the latent prompt is learned with an additional proxy task, providing an extra learned conditioning signal for policy learning. At test time, TTT is performed by collecting interaction data from the current environment and optimizing only the latent prompt on those data using the proxy task's self-supervised signal, without modifying the policy itself. Experiments on SimplerEnv demonstrate that the proposed method consistently improves task success rates in both single- and multi-embodiment settings. Further analysis shows that the gains arise primarily from correcting a small number of critical decisions rather than globally altering policy behavior. These results suggest that LPO provides an effective and practical pathway for deployment-time improvement of foundation manipulation policies.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces TTT-VLA, a test-time training framework for Vision-Language-Action (VLA) models based on Latent Prompt Optimization (LPO). During training, a latent prompt is learned jointly with an additional proxy task that provides an extra conditioning signal. At test time, interaction data from the current environment is collected and used to optimize only the latent prompt via the proxy task's self-supervised signal, without modifying the policy parameters. Experiments on SimplerEnv are reported to show consistent improvements in task success rates in both single- and multi-embodiment settings, with further analysis indicating that gains come from correcting a small number of critical decisions rather than global policy changes.

Significance. If the central claim holds with rigorous evidence, the work would demonstrate a practical mechanism for deployment-time adaptation of large foundation VLA policies using only interaction data and a fixed proxy signal. This could be significant for robotics, as it avoids the computational cost of policy fine-tuning while addressing distribution shift. The approach of learning a proxy task specifically to enable self-supervised prompt optimization at test time is a potentially useful idea, provided the signal is shown to be informative for the claimed error-correction behavior.

major comments (2)
  1. [Method] Method section (description of LPO and proxy task): The proxy task is load-bearing for the entire test-time procedure, yet no definition of its formulation, loss function, or coupling to the main manipulation objective is supplied. Without this, it is impossible to evaluate whether the self-supervised signal remains informative for correcting distribution-shift errors (as required by the central claim) or whether optimization could converge to prompts that satisfy the proxy loss without improving success rates.
  2. [Experiments] Experiments section: The abstract states that experiments demonstrate consistent improvements and that gains arise from correcting a small number of critical decisions, but supplies no quantitative success rates, baselines, statistical tests, number of trials, or details on interaction data collection and optimization procedure. These omissions make the empirical support for the claim unverifiable from the provided text.
minor comments (2)
  1. [Abstract] The abstract and introduction use the term 'latent prompt' without an initial formal definition or diagram showing its integration into the VLA architecture.
  2. [Method] Notation for the proxy task loss and the test-time optimization objective should be introduced explicitly with equations to allow reproducibility.
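For illustration only, notation of the kind requested in the comments above might take the following shape. Every symbol here is an assumption introduced for this sketch: the text under review supplies no equations, so the additive form of the training objective, the weighting \lambda, and the dataset symbol are hypothetical.

```latex
% Hypothetical notation, not taken from the paper.
% \theta: policy parameters, z: latent prompt, \lambda: assumed loss weighting,
% \mathcal{D}_{\text{test}}: interaction data collected in the current environment.
\begin{align}
  \text{Training:}\quad
    & \min_{\theta,\, z}\;
      \mathcal{L}_{\text{act}}(\theta, z)
      + \lambda\, \mathcal{L}_{\text{proxy}}(\theta, z) \\
  \text{Test time:}\quad
    & z^{\star} = \arg\min_{z}\;
      \mathcal{L}_{\text{proxy}}(\theta, z;\, \mathcal{D}_{\text{test}}),
      \qquad \theta \text{ held fixed}
\end{align}
```

Written this way, the referee's concern becomes concrete: nothing in the second line guarantees that a lower proxy loss implies a lower action loss in the shifted environment.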

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on TTT-VLA. The comments correctly identify needs for greater methodological transparency and empirical detail. We address each point below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [Method] Method section (description of LPO and proxy task): The proxy task is load-bearing for the entire test-time procedure, yet no definition of its formulation, loss function, or coupling to the main manipulation objective is supplied. Without this, it is impossible to evaluate whether the self-supervised signal remains informative for correcting distribution-shift errors (as required by the central claim) or whether optimization could converge to prompts that satisfy the proxy loss without improving success rates.

    Authors: We agree that an explicit definition is required for the proxy task to be evaluable. In the revised manuscript we will expand the Method section to state the precise formulation of the proxy task (including input/output structure), its loss function, and the joint training objective that couples it to the primary VLA policy. This addition will directly address whether the self-supervised signal is expected to remain informative under distribution shift.
    Revision: yes

  2. Referee: [Experiments] Experiments section: The abstract states that experiments demonstrate consistent improvements and that gains arise from correcting a small number of critical decisions, but supplies no quantitative success rates, baselines, statistical tests, number of trials, or details on interaction data collection and optimization procedure. These omissions make the empirical support for the claim unverifiable from the provided text.

    Authors: We acknowledge that the current text does not supply the requested quantitative details. In the revision we will augment the Experiments section with concrete success rates per task and embodiment setting, baseline comparisons, number of evaluation trials, any statistical tests performed, and a precise description of interaction-data collection and latent-prompt optimization steps. We will also add a short summary of these numbers to the abstract if length permits.
    Revision: yes

Circularity Check

0 steps flagged

No circularity in derivation chain

The paper describes an empirical method for test-time latent prompt optimization using a proxy task's self-supervised signal on new interaction data, without any equations, derivations, or mathematical claims. The optimization is performed on external data at deployment and does not reduce any prediction or result to quantities fitted inside the paper. No self-citations are invoked as load-bearing premises, and the central claims rest on experimental outcomes rather than any self-referential construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

An abstract-only review yields few concrete parameters or axioms; the core premise that a proxy self-supervised signal can drive effective prompt adaptation is treated as a domain assumption.

axioms (1)
  • domain assumption: The proxy task learned during training supplies a reliable self-supervised signal usable for test-time prompt optimization under distribution shift.
    Invoked as the mechanism enabling adaptation without policy modification.
invented entities (1)
  • Latent prompt (no independent evidence)
    purpose: Extra learned conditioning signal for both training and test-time adaptation of the policy.
    Introduced as the central new interface in the TTT-VLA framework.

pith-pipeline@v0.9.1-grok · 5779 in / 1254 out tokens · 22528 ms · 2026-06-28T10:07:21.040992+00:00 · methodology

