TTT-VLA: Test-Time Latent Prompt Optimization for Vision-Language-Action Models
Pith reviewed 2026-06-28 10:07 UTC · model grok-4.3
The pith
Optimizing a latent prompt at test time improves vision-language-action model success rates in new environments by correcting critical decisions without changing the policy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TTT-VLA performs test-time training for VLA models by optimizing only the latent prompt, using the proxy task's self-supervised signal computed on interaction data from the current environment. The paper reports higher task success rates without any modification to the policy itself.
What carries the argument
Latent Prompt Optimization (LPO): a latent prompt, which is an extra learned conditioning signal, is trained with a proxy task and then tuned at test time to steer the frozen policy.
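A minimal sketch of the prompt-only update may help fix ideas. It assumes a hypothetical next-observation prediction proxy, because the paper's actual proxy task is not specified in the text reviewed here. The scalar prompt `z`, the `FROZEN` parameters, and the toy dynamics are all illustrative assumptions, not the authors' implementation.

```python
import random

# Hypothetical frozen parameters standing in for a pretrained policy/proxy head.
FROZEN = {"a": 0.9, "b": 0.5}

def proxy_predict(obs, z):
    """Assumed proxy head: predict the next observation from obs and latent prompt z."""
    return FROZEN["a"] * obs + FROZEN["b"] * z

def proxy_loss(transitions, z):
    """Mean squared next-observation error; needs no reward or success label."""
    return sum((proxy_predict(o, z) - o_next) ** 2
               for o, o_next in transitions) / len(transitions)

def optimize_latent_prompt(transitions, z0=0.0, lr=0.5, steps=200):
    """Plain gradient descent on z only; FROZEN is read but never written."""
    z = z0
    n = len(transitions)
    for _ in range(steps):
        # d/dz of the mean squared error above.
        grad = sum(2.0 * (proxy_predict(o, z) - o_next) * FROZEN["b"]
                   for o, o_next in transitions) / n
        z -= lr * grad
    return z

def collect_transitions(n=64, shift=0.3, seed=0):
    """Toy 'new environment': dynamics gain a constant offset the frozen head lacks."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        o = rng.uniform(-1.0, 1.0)
        data.append((o, FROZEN["a"] * o + shift))
    return data

data = collect_transitions()
z_star = optimize_latent_prompt(data)
print("proxy loss before:", proxy_loss(data, 0.0))
print("proxy loss after: ", proxy_loss(data, z_star))
```

In this toy, the prompt absorbs the environment offset while the frozen parameters stay untouched. Whether a lower proxy loss also yields better actions is exactly the open question raised under the load-bearing premise.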
If this is right
- Task success rates rise consistently in single-embodiment settings on SimplerEnv.
- Task success rates rise consistently in multi-embodiment settings on SimplerEnv.
- Performance gains come chiefly from correcting a small number of critical decisions rather than global changes to policy behavior.
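The third consequence suggests a simple diagnostic: replay the baseline and the adapted policy on the same logged observations and count the timesteps where their actions differ. The sketch below assumes actions are stored as equal-length tuples of floats; the function name and the tolerance are hypothetical and are not the paper's analysis procedure.

```python
def decision_change_report(base_actions, adapted_actions, tol=1e-3):
    """Return (changed_indices, fraction_changed).

    A timestep counts as changed when any action dimension differs by more
    than tol. Both sequences must come from the same observations (open-loop
    replay); closed-loop rollouts diverge and are not comparable step by step.
    """
    if len(base_actions) != len(adapted_actions):
        raise ValueError("action sequences must have the same length")
    changed = [
        t for t, (a, b) in enumerate(zip(base_actions, adapted_actions))
        if max(abs(x - y) for x, y in zip(a, b)) > tol
    ]
    fraction = len(changed) / len(base_actions) if base_actions else 0.0
    return changed, fraction
```

A small fraction of changed steps combined with a higher success rate would be consistent with the "few critical decisions" reading; a large fraction would point to a global change in behavior.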
Where Pith is reading between the lines
- The same prompt-only adaptation pattern could be tested on other prompt-conditioned foundation models that encounter distribution shift.
- Targeted prompt updates may prove sufficient for many practical deployment fixes in robotics without full policy retraining.
- Combining LPO with additional test-time signals could be explored to handle more severe shifts.
Load-bearing premise
The proxy task's self-supervised signal remains informative and sufficient to optimize the latent prompt for correcting distribution-shift errors when only interaction data from the current environment is available and the policy itself is not modified.
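One way to probe this premise is to check whether lower proxy loss goes together with task success across episodes. The sketch below is a hand-written Pearson correlation; with a 0/1 success indicator it equals the point-biserial correlation. The per-episode proxy losses and success flags are hypothetical quantities an experimenter would have to log; none are reported in the text reviewed here.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (NaN if either is constant)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0.0 or syy == 0.0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)
```

A clearly negative correlation between final proxy loss and success would support the premise. A correlation near zero would suggest that the prompt can satisfy the proxy objective without helping the task.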
What would settle it
If optimizing the latent prompt on interaction data from a new environment produces no increase in task success rates compared with the unadapted baseline, the central claim would be falsified.
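A hedged sketch of how such a comparison could be scored: a one-sided two-proportion z-test on success counts, using the normal approximation. The function name is hypothetical, the test is a standard textbook choice rather than anything the paper is known to use, and an exact test would be safer with few trials.

```python
import math

def one_sided_two_proportion_test(succ_adapted, n_adapted, succ_base, n_base):
    """Return (z, p_value) for H1: adapted success rate > baseline success rate."""
    if n_adapted <= 0 or n_base <= 0:
        raise ValueError("trial counts must be positive")
    p_adapted = succ_adapted / n_adapted
    p_base = succ_base / n_base
    pooled = (succ_adapted + succ_base) / (n_adapted + n_base)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_adapted + 1.0 / n_base))
    if se == 0.0:
        # Both arms are all-success or all-failure: no evidence either way.
        return 0.0, 0.5
    z = (p_adapted - p_base) / se
    # Upper-tail probability of a standard normal at z.
    p_value = 0.5 * math.erfc(z / math.sqrt(2.0))
    return z, p_value
```

A large p-value across tasks and embodiments, given enough trials, would be the kind of null result that counts against the central claim.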
Original abstract
Vision-Language-Action (VLA) models trained on large-scale data have made remarkable progress, but they remain vulnerable to distribution shifts at deployment time. Recent VLA models suggest that prompts can serve as an efficient interface for steering policy behavior, but existing prompt-based steering typically relies on external guidance. This raises a natural question: can test-time training (TTT) for VLA be achieved by optimizing a prompt, so that the steering interface itself can be learned and adapted from interaction? We address this question with TTT-VLA, a test-time training framework based on Latent Prompt Optimization (LPO). During training, the latent prompt is learned with an additional proxy task, providing an extra learned conditioning signal for policy learning. At test time, TTT is performed by collecting interaction data from the current environment and optimizing only the latent prompt on those data using the proxy task's self-supervised signal, without modifying the policy itself. Experiments on SimplerEnv demonstrate that the proposed method consistently improves task success rates in both single- and multi-embodiment settings. Further analysis shows that the gains arise primarily from correcting a small number of critical decisions rather than globally altering policy behavior. These results suggest that LPO provides an effective and practical pathway for deployment-time improvement of foundation manipulation policies.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces TTT-VLA, a test-time training framework for Vision-Language-Action (VLA) models based on Latent Prompt Optimization (LPO). During training, a latent prompt is learned jointly with an additional proxy task that provides an extra conditioning signal. At test time, interaction data from the current environment is collected and used to optimize only the latent prompt via the proxy task's self-supervised signal, without modifying the policy parameters. Experiments on SimplerEnv are reported to show consistent improvements in task success rates in both single- and multi-embodiment settings, with further analysis indicating that gains come from correcting a small number of critical decisions rather than global policy changes.
Significance. If the central claim holds with rigorous evidence, the work would demonstrate a practical mechanism for deployment-time adaptation of large foundation VLA policies using only interaction data and a fixed proxy signal. This could be significant for robotics, as it avoids the computational cost of policy fine-tuning while addressing distribution shift. The approach of learning a proxy task specifically to enable self-supervised prompt optimization at test time is a potentially useful idea, provided the signal is shown to be informative for the claimed error-correction behavior.
major comments (2)
- [Method] Method section (description of LPO and proxy task): The proxy task is load-bearing for the entire test-time procedure, yet no definition of its formulation, loss function, or coupling to the main manipulation objective is supplied. Without this, it is impossible to evaluate whether the self-supervised signal remains informative for correcting distribution-shift errors (as required by the central claim) or whether optimization could converge to prompts that satisfy the proxy loss without improving success rates.
- [Experiments] Experiments section: The abstract states that experiments demonstrate consistent improvements and that gains arise from correcting a small number of critical decisions, but supplies no quantitative success rates, baselines, statistical tests, number of trials, or details on interaction data collection and optimization procedure. These omissions make the empirical support for the claim unverifiable from the provided text.
minor comments (2)
- [Abstract] The abstract and introduction use the term 'latent prompt' without an initial formal definition or diagram showing its integration into the VLA architecture.
- [Method] Notation for the proxy task loss and the test-time optimization objective should be introduced explicitly with equations to allow reproducibility.
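For orientation, one generic form such equations could take is sketched below. Every symbol is an assumption introduced here, not the paper's notation: policy parameters \theta, latent prompt z, an action loss, a proxy loss, a weighting coefficient \lambda, and training and test-time datasets. The additive weighted objective is also an assumption; the abstract states only that the prompt is learned with an additional proxy task and then optimized alone at test time.

```latex
% Assumed notation (not from the paper):
%   \theta: policy parameters, z: latent prompt, \lambda: loss weight,
%   \mathcal{L}_{\text{act}}: action loss, \mathcal{L}_{\text{proxy}}: proxy loss.
\begin{align}
  (\hat{\theta}, z_0) &= \arg\min_{\theta,\, z}\;
      \mathcal{L}_{\text{act}}(\theta, z;\, \mathcal{D}_{\text{train}})
      + \lambda\, \mathcal{L}_{\text{proxy}}(\theta, z;\, \mathcal{D}_{\text{train}}), \\
  z^{\star} &= \arg\min_{z}\;
      \mathcal{L}_{\text{proxy}}(\hat{\theta}, z;\, \mathcal{D}_{\text{test}}),
      \qquad \hat{\theta} \text{ held fixed, starting from } z = z_0 .
\end{align}
```

The first line corresponds to joint training and the second to the test-time step, in which only z changes.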
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on TTT-VLA. The comments correctly identify needs for greater methodological transparency and empirical detail. We address each point below and will revise the manuscript accordingly.
Point-by-point responses
- Referee: [Method] Method section (description of LPO and proxy task): The proxy task is load-bearing for the entire test-time procedure, yet no definition of its formulation, loss function, or coupling to the main manipulation objective is supplied. Without this, it is impossible to evaluate whether the self-supervised signal remains informative for correcting distribution-shift errors (as required by the central claim) or whether optimization could converge to prompts that satisfy the proxy loss without improving success rates.
Authors: We agree that an explicit definition is required for the proxy task to be evaluable. In the revised manuscript we will expand the Method section to state the precise formulation of the proxy task (including input/output structure), its loss function, and the joint training objective that couples it to the primary VLA policy. This addition will directly address whether the self-supervised signal is expected to remain informative under distribution shift.
revision: yes
- Referee: [Experiments] Experiments section: The abstract states that experiments demonstrate consistent improvements and that gains arise from correcting a small number of critical decisions, but supplies no quantitative success rates, baselines, statistical tests, number of trials, or details on interaction data collection and optimization procedure. These omissions make the empirical support for the claim unverifiable from the provided text.
Authors: We acknowledge that the current text does not supply the requested quantitative details. In the revision we will augment the Experiments section with concrete success rates per task and embodiment setting, baseline comparisons, number of evaluation trials, any statistical tests performed, and a precise description of interaction-data collection and latent-prompt optimization steps. We will also add a short summary of these numbers to the abstract if length permits.
revision: yes
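When success rates are eventually reported, an interval makes the number of trials visible at a glance. The sketch below computes a Wilson score interval for a binomial success rate. The function name is hypothetical and the default z = 1.96 assumes a roughly 95% interval; nothing here reflects numbers from the paper.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval (low, high) for a binomial success rate."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    center = (p_hat + z * z / (2.0 * trials)) / denom
    half = z * math.sqrt(p_hat * (1.0 - p_hat) / trials
                         + z * z / (4.0 * trials * trials)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, center - half), min(1.0, center + half)
```

Overlapping intervals for the baseline and adapted policies on a given task would signal that more evaluation trials are needed before a difference can be called consistent.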
Circularity Check
No circularity in derivation chain
full rationale
The paper describes an empirical method for test-time latent prompt optimization using a proxy task's self-supervised signal on new interaction data, without any equations, derivations, or mathematical claims. The optimization is performed on external data at deployment and does not reduce any prediction or result to quantities fitted inside the paper. No self-citations are invoked as load-bearing premises, and the central claims rest on experimental outcomes rather than any self-referential construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The proxy task learned during training supplies a reliable self-supervised signal usable for test-time prompt optimization under distribution shift.
invented entities (1)
- Latent prompt: no independent evidence