Continuous Reasoning for Vision-Language-Action

Kei Ota; Tatsuya Matsushima; Yueh-Hua Wu

arXiv: 2606.00229 · v2 · pith:HBTGZ6DTnew · submitted 2026-05-29 · 💻 cs.RO · cs.AI · cs.LG

Pith reviewed 2026-06-28 22:04 UTC · model grok-4.3

classification: 💻 cs.RO · cs.AI · cs.LG

keywords: vision-language-action · continuous reasoning · Gaussian latent · self-verification · robot learning · policy improvement

The pith

A shared Gaussian latent for continuous thoughts lets vision-language-action policies reason without text tokens.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Natural language operates at task-level granularity while vision-language-action policies must select actions at much finer scales, so text-based reasoning remains only weakly coupled to immediate control. The paper proposes continuous reasoning as a structured set of continuous thoughts encoded in a shared Gaussian latent that serves as context for chunk-structured action generation. This latent is trained via a self-verification objective in which an exponential-moving-average teacher must successfully consume the student's reasoning to predict target actions. The resulting medium is therefore meant to be shareable across model instances and directly verifiable through downstream action improvement rather than functioning as a private shortcut. On real robots the approach raises mean subtask success over the π0.5 baseline by 40.4 percent on TX-G2 and 26.3 percent on HSR.

Core claim

Continuous reasoning predicts a structured set of continuous thoughts that serve as shared context for chunk-structured action generation; the mechanism is instantiated as a shared Gaussian latent interface and trained with a self-verification objective in which an exponential-moving-average teacher must successfully consume the student's reasoning when predicting target actions.
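The core claim can be made concrete with a toy sketch. Only the abstract is available here, so everything below is an assumption made for illustration: the linear encoder and decoder, the diagonal Gaussian with a reparameterised sample, the mean-squared-error loss, and every name (`encode`, `sample_latent`, `decode_actions`, `self_verification_loss`). It shows one plausible reading of the objective, in which the student produces the latent and the teacher's action head must predict the target action chunk from it. It is not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_params(rng, obs_dim, latent_dim, action_dim):
    """Hypothetical linear model: an encoder to a Gaussian latent plus an action head."""
    return {
        "w_mu": rng.normal(size=(obs_dim, latent_dim)) * 0.1,
        "w_log_var": rng.normal(size=(obs_dim, latent_dim)) * 0.1,
        "w_act": rng.normal(size=(obs_dim + latent_dim, action_dim)) * 0.1,
    }

def encode(params, obs):
    """Map observations to the mean and log-variance of a diagonal Gaussian."""
    return obs @ params["w_mu"], obs @ params["w_log_var"]

def sample_latent(mu, log_var, rng):
    """Reparameterised sample z = mu + sigma * eps."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def decode_actions(params, obs, z):
    """Predict an action chunk from observations with the latent as extra context."""
    return np.concatenate([obs, z], axis=-1) @ params["w_act"]

def self_verification_loss(student, teacher, obs, target_actions, rng):
    """MSE of the teacher's actions when it consumes the student's latent."""
    mu, log_var = encode(student, obs)
    z = sample_latent(mu, log_var, rng)
    predicted = decode_actions(teacher, obs, z)
    return float(np.mean((predicted - target_actions) ** 2))

obs_dim, latent_dim, action_dim = 8, 4, 6
student = init_params(rng, obs_dim, latent_dim, action_dim)
teacher = {name: w.copy() for name, w in student.items()}  # teacher starts as a copy
obs = rng.normal(size=(16, obs_dim))
target_actions = rng.normal(size=(16, action_dim))
loss = self_verification_loss(student, teacher, obs, target_actions, rng)
```

In this reading, gradients would flow only into the student's encoder, so the latent is shaped by whether a second set of weights can use it.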

What carries the argument

A shared Gaussian latent interface for continuous thoughts, enforced by a self-verification objective using an exponential-moving-average teacher.
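For readers unfamiliar with exponential-moving-average teachers, the following is a minimal sketch of the standard update, in which the teacher's weights trail the student's. The function name `ema_update`, the dictionary layout, and the decay values are illustrative assumptions; the abstract does not state the decay the paper uses.

```python
import numpy as np

def ema_update(teacher, student, decay=0.99):
    """Return new teacher weights: decay * teacher + (1 - decay) * student."""
    return {
        name: decay * teacher[name] + (1.0 - decay) * student[name]
        for name in teacher
    }

# Toy example: the teacher drifts toward a fixed student over two updates.
teacher = {"w": np.zeros(3)}
student = {"w": np.ones(3)}
for _ in range(2):
    teacher = ema_update(teacher, student, decay=0.9)
```

Because the teacher is a smoothed copy of the same student, it is never an independently trained model, which is the point the editorial analysis below turns on.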

If this is right

Mean subtask success rises 40.4 percent over the baseline on the TX-G2 platform.
Mean subtask success rises 26.3 percent over the baseline on the HSR platform.
The same internal medium supports both simulation robustness on LIBERO-PRO and real-robot control.
Reasoning in VLA is effective when it is shareable across instances and verifiable through improved action prediction rather than when it consists of extra discrete tokens.
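The abstract as quoted here does not say whether the reported gains are percentage points or relative improvements, and the two readings can differ substantially. The sketch below shows the arithmetic for both; the helper names and the 50/75 success rates are hypothetical and are not taken from the paper.

```python
def absolute_gain(baseline_pct, improved_pct):
    """Difference in percentage points between two success rates given in percent."""
    return improved_pct - baseline_pct

def relative_gain(baseline_pct, improved_pct):
    """Improvement expressed as a percentage of the baseline success rate."""
    return (improved_pct - baseline_pct) / baseline_pct * 100.0

# Hypothetical rates: a policy moving from 50% to 75% mean subtask success.
points = absolute_gain(50.0, 75.0)    # 25.0 percentage points
relative = relative_gain(50.0, 75.0)  # 50.0 percent relative improvement
```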

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same latent interface could be tested for transfer between different VLA architectures without retraining the full policy.
Inspection of the learned continuous thoughts might reveal whether they align with human-interpretable subgoals or remain opaque.
The approach suggests that other continuous-control domains could replace language-based planners with shareable latent interfaces.

Load-bearing premise

The added latent will function as shareable and verifiable reasoning rather than a model-private shortcut, which the self-verification objective is intended to ensure by making the EMA teacher consume the student's reasoning.

What would settle it

An ablation that removes the self-verification objective while retaining the latent and shows no remaining performance gains on the real-robot tasks would falsify the claim that the reasoning medium itself drives the improvement.

Original abstract

Natural language is a powerful reasoning medium for language and vision-language models, but it is mismatched to the granularity of continuous control. Text and explicit subgoals operate at task-level granularity, whereas vision-language-action (VLA) policies must choose actions at a much finer temporal scale; a single reasoning step can therefore span many action chunks while remaining only weakly coupled to the action needed now. This suggests a different question for VLA: what should play the role of language? We argue that a useful VLA reasoning medium must be shareable across model instances, verifiable through downstream action improvement, and aligned with temporally extended control structure. Based on this view, we propose Continuous Reasoning for Vision-Language-Action. Our model first predicts continuous reasoning in the form of a structured set of continuous thoughts, then reuses them as shared context for chunk-structured action generation. Better action prediction alone does not certify good reasoning: if the same internal medium cannot be shared across model instances and independently verified through improved downstream control, the added latent may simply become a model-private shortcut that helps on seen behaviors without supporting generalizable control. We therefore instantiate continuous reasoning as a shared Gaussian latent interface and train it with a self-verification objective in which an exponential-moving-average teacher must successfully consume the student's reasoning when predicting target actions. Empirically, Continuous Reasoning improves LIBERO-PRO robustness and performs strongly on real robots, raising mean subtask success over π0.5 by 40.4% on TX-G2, an AgiBot G2-compatible variant, and 26.3% on HSR. This suggests that reasoning in VLA is less about extra tokens than about a shareable, verifiable internal language for action.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper proposes a Gaussian latent for continuous reasoning in VLAs with an EMA-based self-verification objective and reports clear gains on LIBERO-PRO and real robots, but the training setup only demonstrates intra-model stability rather than the claimed cross-instance shareability.

The main takeaway is that this work tries to fix the granularity mismatch between language-style reasoning and fine-grained robot control by using a structured Gaussian latent as the internal medium instead. It trains the latent so an EMA teacher must consume it to predict actions, and the results show meaningful lifts over the π0.5 baseline on both simulation robustness and two physical platforms.

What stands out is the explicit framing that any added reasoning component must be shareable across models and verifiable through downstream action improvement, plus the choice to make the latent continuous and chunk-aligned. That is a clean conceptual step beyond token-based approaches.

The results look usable for people working on VLA policies. The gains are large enough on the reported tasks to be worth attention.

The soft spot sits in the shareability claim. The self-verification objective uses an exponential moving average of the same student parameters, which shows the latent stays consistent and helpful inside one training trajectory. It does not test whether a latent produced by one independently trained model can be consumed by a different model instance to improve its predictions on the same inputs. The reported improvements remain compatible with the latent acting as a model-specific auxiliary feature.

If the full paper includes cross-model swapping experiments or other direct tests of shareability, that would address the gap. From the description given, the central requirement is not yet strongly evidenced.

This is for researchers focused on internal representations inside embodied policies rather than general VLM audiences. It has enough structure and empirical signal to deserve referee time, though the authors should expect questions on whether the latent truly functions as a transferable medium.

Referee Report

2 major / 1 minor

Summary. The paper proposes Continuous Reasoning for Vision-Language-Action (CR-VLA), arguing that natural language is mismatched to the temporal granularity of continuous control in VLA policies. It introduces a structured set of continuous thoughts encoded as a shared Gaussian latent, which is reused as context for chunk-structured action generation. The latent is trained via a self-verification objective in which an EMA teacher must consume the student's reasoning to predict target actions. Empirical results claim improved robustness on LIBERO-PRO and real-robot tasks, with mean subtask success gains of 40.4% on TX-G2 and 26.3% on HSR over the π0.5 baseline.

Significance. If the central claim holds, namely that the Gaussian latent functions as a shareable, verifiable internal medium rather than a model-private shortcut, this could meaningfully advance VLA design by shifting emphasis from discrete token-based reasoning to continuous, temporally aligned representations. The reported real-robot gains would then indicate practical value for generalizable control.

major comments (2)

[Abstract] Abstract (paragraph 2): The claim that the latent must be 'shareable across model instances' and 'independently verified through improved downstream control' is load-bearing for distinguishing the method from a private shortcut. The self-verification objective is instantiated only with an EMA teacher (an exponential moving average of the student's own parameters). This tests intra-trajectory stability but provides no evidence that a latent produced by one independently initialized or separately trained model instance can be consumed by another distinct instance to improve action prediction on the same inputs.
[Abstract] Abstract (empirical paragraph): The reported gains (40.4% on TX-G2, 26.3% on HSR) are presented as evidence that the latent supports generalizable control. Without experiments that swap latents across independently trained models or ablate the EMA teacher against a cross-model verification protocol, these gains remain compatible with the latent acting as an auxiliary feature private to the training trajectory.
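The latent-swapping experiment the referee asks for can be sketched as a small evaluation harness. This is a hypothetical protocol, not something described in the abstract: the function `swap_report`, the zero-latent baseline, and the toy linear models are all assumptions. The idea is to compare model B's action error when it consumes its own latents, latents from an independently trained model A, and no latent information at all.

```python
import numpy as np

def swap_report(encode_a, encode_b, decode_b, obs, target_actions):
    """Action MSE for model B under three latent sources (all names hypothetical)."""
    def mse(predicted):
        return float(np.mean((predicted - target_actions) ** 2))

    z_own = encode_b(obs)
    return {
        "own": mse(decode_b(obs, z_own)),
        "swapped": mse(decode_b(obs, encode_a(obs))),
        "no_latent": mse(decode_b(obs, np.zeros_like(z_own))),
    }

# Toy models: B decodes actions linearly from the latent alone.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4))
V = rng.normal(size=(4, 6))
obs = rng.normal(size=(32, 8))
target_actions = (obs @ W) @ V

encode_b = lambda o: o @ W
decode_b = lambda o, z: z @ V
shared_a = lambda o: o @ W        # A uses the same latent convention as B
private_a = lambda o: -(o @ W)    # A uses a convention that B cannot read

shared = swap_report(shared_a, encode_b, decode_b, obs, target_actions)
private = swap_report(private_a, encode_b, decode_b, obs, target_actions)
```

In this toy setup a shareable latent gives a swapped error equal to B's own error, while a private latent does worse than supplying no latent at all. Real experiments would need separately initialized and trained policies rather than hand-built encoders.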

minor comments (1)

The abstract states empirical gains but supplies no derivation details, training equations, data exclusion rules, error bars, or exact baseline configurations beyond the π0.5 reference.
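On the missing error bars: success rates from a limited number of robot trials are binomial proportions, and one common way to report their uncertainty is the Wilson score interval. The sketch below is a generic statistical helper, not something from the paper; the function name and the 14-of-20 trial counts are hypothetical.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    centre = (p_hat + z * z / (2.0 * trials)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1.0 - p_hat) / trials + z * z / (4.0 * trials * trials))
    ) / denom
    return centre - half_width, centre + half_width

# Hypothetical example: 14 successful subtasks out of 20 attempts.
low, high = wilson_interval(14, 20)
```

With small trial counts the interval is wide, which is why per-platform trial numbers matter when comparing a method against a baseline.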

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and for identifying the gap between the stated requirements for a reasoning medium and the evidence provided by EMA-based self-verification. We respond to each major comment below.

Referee: [Abstract] Abstract (paragraph 2): The claim that the latent must be 'shareable across model instances' and 'independently verified through improved downstream control' is load-bearing for distinguishing the method from a private shortcut. The self-verification objective is instantiated only with an EMA teacher (an exponential moving average of the student's own parameters). This tests intra-trajectory stability but provides no evidence that a latent produced by one independently initialized or separately trained model instance can be consumed by another distinct instance to improve action prediction on the same inputs.

Authors: We agree that EMA verification only establishes that the latent remains useful under gradual parameter drift within a single training trajectory. It does not constitute evidence that an independently initialized model could consume the same latent to improve its own action predictions. We will revise the abstract and the corresponding paragraph in Section 3 to replace the phrasing 'shareable across model instances' with 'stable under parameter evolution within training' and to explicitly note that cross-model transfer remains untested.

revision: yes
Referee: [Abstract] Abstract (empirical paragraph): The reported gains (40.4% on TX-G2, 26.3% on HSR) are presented as evidence that the latent supports generalizable control. Without experiments that swap latents across independently trained models or ablate the EMA teacher against a cross-model verification protocol, these gains remain compatible with the latent acting as an auxiliary feature private to the training trajectory.

Authors: The reported gains are obtained under the EMA self-verification regime. We concur that these numbers alone do not rule out the possibility that the latent functions as a training-trajectory-specific auxiliary feature. We will revise the empirical paragraph in the abstract and the corresponding discussion in Section 5 to present the gains as improvements in robustness under the proposed objective, while adding a sentence acknowledging that stronger claims about cross-instance shareability would require latent-swapping or cross-model verification experiments.

revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

The paper defines continuous reasoning via a shared Gaussian latent and a self-verification objective using an EMA teacher that consumes the student's latent to predict actions. Performance claims rest on independent empirical evaluations (LIBERO-PRO robustness, TX-G2 and HSR robot success rates) rather than any quantity that reduces by construction to the training objective or to prior self-citations. No equations, fitted-input predictions, uniqueness theorems, or ansatzes are shown that equate the reported gains to the inputs; the derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Review performed on abstract only; full equations, training details, and parameter counts unavailable.

axioms (1)

domain assumption: A useful VLA reasoning medium must be shareable across model instances, verifiable through downstream action improvement, and aligned with temporally extended control structure.
Explicitly stated in the abstract as the premise motivating the design.

invented entities (1)

Structured set of continuous thoughts represented as a shared Gaussian latent · no independent evidence
purpose: To serve as the reasoning medium that is reused for chunk-structured action generation
Introduced in the abstract as the core representational choice.

pith-pipeline@v0.9.1-grok · 5849 in / 1294 out tokens · 21928 ms · 2026-06-28T22:04:59.379472+00:00 · methodology

[1] [1]

Brohan, N

A. Brohan, N. Brown, J. Carbajal, Y. Chebotar, J. Dabis, C. Finn, K. Gopalakrishnan, K. Hausman, A. Herzog, J. Hsu, J. Ibarz, B. Ichter, A. Irpan, T. Jackson, S. Jesmonth, N. J. Joshi, R. Julian, D. Kalashnikov, Y. Kuang, I. Leal, K.-H. Lee, S. Levine, Y. Lu, U. Malla, D. Manjunath, I. Mordatch, O. Nachum, C. Parada, J. Peralta, E. Perez, K. Pertsch, J. Q...

Pith/arXiv arXiv 2022

[2] [2]

Brohan, N

A. Brohan, N. Brown, J. Carbajal, Y. Chebotar, X. Chen, K. Choromanski, T. Ding, D. Driess, A. Dubey, C. Finn, P. Florence, C. Fu, M. G. Arenas, K. Gopalakrishnan, K. Han, K. Hausman, A. Herzog, J. Hsu, B. Ichter, A. Irpan, N. Joshi, R. Julian, D. Kalashnikov, Y. Kuang, I. Leal, L. Lee, T.-W. E. Lee, S. Levine, Y. Lu, H. Michalewski, I. Mordatch, K. Perts...

Pith/arXiv arXiv 2023

[3] [3]

Driess, F

D. Driess, F. Xia, M. S. M. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter, A. Wahid, J. Tompson, Q. Vuong, T. Yu, W. Huang, Y. Chebotar, P. Sermanet, D. Duckworth, S. Levine, V. Vanhoucke, K. Hausman, M. Toussaint, K. Greff, A. Zeng, I. Mordatch, and P. Florence. Palm-e: An embodied multimodal language model, 2023. URL https://arxiv.org/abs/2303.03378

Pith/arXiv arXiv 2023

[4] [4]

Ghosh, H

Octo Model Team , D. Ghosh, H. Walke, K. Pertsch, K. Black, O. Mees, S. Dasari, J. Hejna, T. Kreiman, C. Xu, J. Luo, Y. L. Tan, L. Y. Chen, P. Sanketi, Q. Vuong, T. Xiao, D. Sadigh, C. Finn, and S. Levine. Octo: An open-source generalist robot policy, 2024. URL https://arxiv.org/abs/2405.12213

Pith/arXiv arXiv 2024

[5] [5]

M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Foster, G. Lam, P. Sanketi, Q. Vuong, T. Kollar, B. Burchfiel, R. Tedrake, D. Sadigh, S. Levine, P. Liang, and C. Finn. Openvla: An open-source vision-language-action model, 2024. URL https://arxiv.org/abs/2406.09246

Pith/arXiv arXiv 2024

[6] [6]

URL https://www.roboticsproceedings.org/ rss21/p010.html

K. Black, N. Brown, D. Driess, A. Esmail, M. R. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter, S. Jakubczak, T. Jones, L. Ke, S. Levine, A. Li-Bell, M. Mothukuri, S. Nair, K. Pertsch, L. X. Shi, L. Smith, J. Tanner, Q. Vuong, A. Walling, H. Wang, and U. Zhilinsky. _0 : A Vision-Language-Action Flow Model for General Robot Control . In Proceedin...

work page doi:10.15607/rss.2025.xxi.010 2025

[7] [7]

M. Ahn, A. Brohan, N. Brown, Y. Chebotar, O. Cortes, B. David, C. Finn, C. Fu, K. Gopalakrishnan, K. Hausman, A. Herzog, D. Ho, J. Hsu, J. Ibarz, B. Ichter, A. Irpan, E. Jang, R. J. Ruano, K. Jeffrey, S. Jesmonth, N. J. Joshi, R. Julian, D. Kalashnikov, Y. Kuang, K.-H. Lee, S. Levine, Y. Lu, L. Luu, C. Parada, P. Pastor, J. Quiambao, K. Rao, J. Rettinghou...

Pith/arXiv arXiv 2022

[8] [8]

Huang, F

W. Huang, F. Xia, T. Xiao, H. Chan, J. Liang, P. Florence, A. Zeng, J. Tompson, I. Mordatch, Y. Chebotar, P. Sermanet, N. Brown, T. Jackson, L. Luu, S. Levine, K. Hausman, and B. Ichter. Inner monologue: Embodied reasoning through planning with language models, 2022. URL https://arxiv.org/abs/2207.05608

Pith/arXiv arXiv 2022

[9] [9]

Liang, W

J. Liang, W. Huang, F. Xia, P. Xu, K. Hausman, B. Ichter, P. Florence, and A. Zeng. Code as policies: Language model programs for embodied control, 2022. URL https://arxiv.org/abs/2209.07753

Pith/arXiv arXiv 2022

[10] [10]

Huang, C

W. Huang, C. Wang, R. Zhang, Y. Li, J. Wu, and L. Fei-Fei. Voxposer: Composable 3d value maps for robotic manipulation with language models, 2023. URL https://arxiv.org/abs/2307.05973

Pith/arXiv arXiv 2023

[11] [11]

Q. Zhao, Y. Lu, M. J. Kim, Z. Fu, Z. Zhang, Y. Wu, Z. Li, Q. Ma, S. Han, C. Finn, A. Handa, M.-Y. Liu, D. Xiang, G. Wetzstein, and T.-Y. Lin. Cot-vla: Visual chain-of-thought reasoning for vision-language-action models. 2025. doi:10.48550/ARXIV.2503.22020. URL https://arxiv.org/abs/2503.22020

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2503.22020 2025

[12] [12]

Q. Shou, F. Zhu, S. Chen, P. Yan, Z. Yan, Y. Miao, X. Pang, Z. Hong, R. Shi, H. Huang, J. Zhang, and S. Guo. Halo: A unified vision-language-action model for embodied multimodal chain-of-thought reasoning, 2026. URL https://arxiv.org/abs/2602.21157

arXiv 2026

[13] [13]

Zhong, J

Z. Zhong, J. Li, J. He, H. Yan, X. Gong, G. Zhao, Y. Cai, J. Gao, X. Yan, B. Liu, Y. Chen, L. Yang, and H. Li. Dualcot-vla: Visual-linguistic chain of thought via parallel reasoning for vision-language-action models, 2026. URL https://arxiv.org/abs/2603.22280

arXiv 2026

[14] [14]

J. Wen, M. Zhu, J. Liu, Z. Liu, Y. Yang, L. Zhang, S. Zhang, Y. Zhu, and Y. Xu. dvla: Diffusion vision-language-action model with multimodal chain-of-thought, 2025. URL https://arxiv.org/abs/2509.25681

[15]

C. Yin, Y. Lin, W. Xu, S. Tam, X. Zeng, Z. Liu, and Z. Yin. Deepthinkvla: Enhancing reasoning capability of vision-language-action models, 2025. URL https://arxiv.org/abs/2511.15669

[16]

C.-P. Huang, Y.-H. Wu, M.-H. Chen, Y.-C. F. Wang, and F.-E. Yang. Thinkact: Vision-language-action reasoning via reinforced visual latent planning, 2025. URL https://arxiv.org/abs/2507.16815

[17]

C.-P. Huang, Y. Man, Z. Yu, M.-H. Chen, J. Kautz, Y.-C. F. Wang, and F.-E. Yang. Fast-thinkact: Efficient vision-language-action reasoning via verbalizable latent planning, 2026. URL https://arxiv.org/abs/2601.09708

[18]

S. Bai, J. Lyu, W. Zhou, Z. Li, D. Wang, L. Xing, X. Zhao, P. Wang, Z. Wang, C. Chi, B. Chen, and S. Zhang. Latent reasoning vla: Latent thinking and prediction for vision-language-action models, 2026. URL https://arxiv.org/abs/2602.01166

[19]

Z. Liu, J. Liu, H. Chen, J. Yu, Z. Guo, C. Hou, C. Gu, X. Mi, R. Zhang, K. Wu, Z. Che, J. Tang, P.-A. Heng, and S. Zhang. Last_0: Latent spatio-temporal chain-of-thought for robotic vision-language-action model, 2026. URL https://arxiv.org/abs/2601.05248

[20]

J. Lee, J. Duan, H. Fang, Y. Deng, S. Liu, B. Li, B. Fang, J. Zhang, Y. R. Wang, S. Lee, W. Han, W. Pumacay, A. Wu, R. Hendrix, K. Farley, E. VanderBilt, A. Farhadi, D. Fox, and R. Krishna. Molmoact: Action reasoning models that can reason in space, 2025. URL https://arxiv.org/abs/2508.07917

[21]

L. Zhong, Y. Liu, Y. Wei, Z. Xiong, M. Yao, S. Liu, and G. Ren. Acot-vla: Action chain-of-thought for vision-language-action models, 2026. URL https://arxiv.org/abs/2601.11404

[22]

Y. Ling, Q. Lian, J. Li, Q. Jiang, T. Zhang, X. Jiang, C. Liu, J. Liu, and L. Zhang. Guide, think, act: Interactive embodied reasoning in vision-language-action models, 2026. URL https://arxiv.org/abs/2605.13632

[23]

T. Lanham, A. Chen, A. Radhakrishnan, B. Steiner, C. Denison, D. Hernandez, D. Li, E. Durmus, E. Hubinger, J. Kernion, K. Lukošiūtė, K. Nguyen, N. Cheng, N. Joseph, N. Schiefer, O. Rausch, R. Larson, S. McCandlish, S. Kundu, S. Kadavath, S. Yang, T. Henighan, T. Maxwell, T. Telleen-Lawton, T. Hume, Z. Hatfield-Dodds, J. Kaplan, J. Brauner, S. R. Bowman, a...

[24]

H. Mohammadi, T. Kozak, and A. Giachanou. Evaluating GRPO and DPO for faithful chain-of-thought reasoning in LLMs, 2025. URL https://arxiv.org/abs/2512.22631

[25]

Q. Yu, A. Tartaglini, P. Hase, C. Guestrin, and C. Potts. Outcome rewards do not guarantee verifiable or causally important reasoning, 2026. URL https://arxiv.org/abs/2604.22074

[26]

I. Tolstikhin, O. Bousquet, S. Gelly, and B. Schoelkopf. Wasserstein auto-encoders, 2017. URL https://arxiv.org/abs/1711.01558

[27]

A. Tarvainen and H. Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results, 2017. URL https://arxiv.org/abs/1703.01780

[28]

Z. Zhou, Y. Zhu, J. Wen, C. Shen, and Y. Xu. Chatvla-2: Vision-language-action model with open-world embodied reasoning from pretrained knowledge, 2025. URL https://arxiv.org/abs/2505.21906

[29]

S. Yang, H. Li, B. Wang, Y. Chen, Y. Tian, T. Wang, H. Wang, F. Zhao, Y. Liao, and J. Pang. Instructvla: Vision-language-action instruction tuning from understanding to manipulation, 2025. URL https://arxiv.org/abs/2507.17520

[30]

D. Qu, H. Song, Q. Chen, Y. Yao, X. Ye, Y. Ding, Z. Wang, J. Gu, B. Zhao, D. Wang, and X. Li. Spatialvla: Exploring spatial representations for visual-language-action model, 2025. URL https://arxiv.org/abs/2501.15830

[31]

M. J. Kim, C. Finn, and P. Liang. Fine-tuning vision-language-action models: Optimizing speed and success, 2025. URL https://arxiv.org/abs/2502.19645

[32]

P. Intelligence, K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, M. Y. Galliker, D. Ghosh, L. Groom, K. Hausman, B. Ichter, S. Jakubczak, T. Jones, L. Ke, D. LeBlanc, S. Levine, A. Li-Bell, M. Mothukuri, S. Nair, K. Pertsch, A. Z. Ren, L. X. Shi, L. Smith, J. T. Springenberg, K. Stachowicz, J. Tanner, Q. Vu...

[33]

T. Z. Zhao, V. Kumar, S. Levine, and C. Finn. Learning fine-grained bimanual manipulation with low-cost hardware, 2023. URL https://arxiv.org/abs/2304.13705

[34]

N. M. M. Shafiullah, Z. J. Cui, A. Altanzaya, and L. Pinto. Behavior transformers: Cloning k modes with one stone, 2022. URL https://arxiv.org/abs/2206.11251

[35]

S. Lee, Y. Wang, H. Etukuru, H. J. Kim, N. M. M. Shafiullah, and L. Pinto. Behavior generation with latent actions. 2024. doi:10.48550/ARXIV.2403.03181. URL https://arxiv.org/abs/2403.03181

[36]

C. Chi, Z. Xu, S. Feng, E. Cousineau, Y. Du, B. Burchfiel, R. Tedrake, and S. Song. Diffusion policy: Visuomotor policy learning via action diffusion, 2023. URL https://arxiv.org/abs/2303.04137

[37]

Z. Hou, T. Zhang, Y. Xiong, H. Pu, C. Zhao, R. Tong, Y. Qiao, J. Dai, and Y. Chen. Diffusion transformer policy, 2024. URL https://arxiv.org/abs/2410.15959

[38]

S. Haldar, Z. Peng, and L. Pinto. Baku: An efficient transformer for multi-task policy learning, 2024. URL https://arxiv.org/abs/2406.07539

[39]

Z. Huang, Y. Lin, F. Yang, and D. Berenson. Subgoal diffuser: Coarse-to-fine subgoal generation to guide model predictive control for robot manipulation, 2024. URL https://arxiv.org/abs/2403.13085

[40]

L. Liu, W. Wang, Y. Han, Z. Xie, P. Yi, J. Li, Y. Qin, and W. Lian. Foam: Foresight-augmented multi-task imitation policy for robotic manipulation, 2024. URL https://arxiv.org/abs/2409.19528

[41]

H. Chen, J. Guo, B. Wang, T. Zhang, X. Huang, B. Zheng, Y. Hou, C. Tie, J. Deng, and L. Shao. Goal-vla: Image-generative vlms as object-centric world models empowering zero-shot robot manipulation, 2025. URL https://arxiv.org/abs/2506.23919

[42]

C. Lynch, M. Khansari, T. Xiao, V. Kumar, J. Tompson, S. Levine, and P. Sermanet. Learning latent plans from play, 2019. URL https://arxiv.org/abs/1903.01973

[43]

J.-B. Grill, F. Strub, F. Altché, C. Tallec, P. H. Richemond, E. Buchatskaya, C. Doersch, B. A. Pires, Z. D. Guo, M. G. Azar, B. Piot, K. Kavukcuoglu, R. Munos, and M. Valko. Bootstrap your own latent: A new approach to self-supervised learning, 2020. URL https://arxiv.org/abs/2006.07733

[44]

M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, and A. Joulin. Emerging properties in self-supervised vision transformers, 2021. URL https://arxiv.org/abs/2104.14294

[45]

X. Zhou, Y. Xu, G. Tie, Y. Chen, G. Zhang, D. Chu, P. Zhou, and L. Sun. Libero-pro: Towards robust and fair evaluation of vision-language-action models beyond memorization, 2025. URL https://arxiv.org/abs/2510.03827

[46]

J. Zheng, J. Li, Z. Wang, D. Liu, X. Kang, Y. Feng, Y. Zheng, J. Zou, Y. Chen, J. Zeng, Y.-Q. Zhang, J. Pang, J. Liu, T. Wang, and X. Zhan. X-vla: Soft-prompted transformer as scalable cross-embodiment vision-language-action model, 2025. URL https://arxiv.org/abs/2510.10274

[47]

Y. Wang, P. Ding, L. Li, C. Cui, Z. Ge, X. Tong, W. Song, H. Zhao, W. Zhao, P. Hou, S. Huang, Y. Tang, W. Wang, R. Zhang, J. Liu, and D. Wang. Vla-adapter: An effective paradigm for tiny-scale vision-language-action model, 2025. URL https://arxiv.org/abs/2509.09372
