arxiv: 2606.00229 · v2 · pith:HBTGZ6DTnew · submitted 2026-05-29 · 💻 cs.RO · cs.AI · cs.LG

Continuous Reasoning for Vision-Language-Action

Pith reviewed 2026-06-28 22:04 UTC · model grok-4.3

classification: cs.RO · cs.AI · cs.LG
keywords: vision-language-action · continuous reasoning · Gaussian latent · self-verification · robot learning · policy improvement

The pith

A shared Gaussian latent for continuous thoughts lets vision-language-action policies reason without text tokens.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Natural language operates at task-level granularity while vision-language-action policies must select actions at much finer scales, so text-based reasoning remains only weakly coupled to immediate control. The paper proposes continuous reasoning as a structured set of continuous thoughts encoded in a shared Gaussian latent that serves as context for chunk-structured action generation. This latent is trained via a self-verification objective in which an exponential-moving-average teacher must successfully consume the student's reasoning to predict target actions. The resulting medium is therefore shareable across model instances and directly verifiable through downstream action improvement rather than functioning as a private shortcut. On real robots the approach raises mean subtask success over a strong baseline by 40.4 percent on one platform and 26.3 percent on another.

Core claim

Continuous reasoning predicts a structured set of continuous thoughts that serve as shared context for chunk-structured action generation; the mechanism is instantiated as a shared Gaussian latent interface and trained with a self-verification objective in which an exponential-moving-average teacher must successfully consume the student's reasoning when predicting target actions.
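
The "reason once, then reuse as shared context" flow in this claim can be pictured with a toy sketch. Everything here is an assumption made for illustration: the function names (`predict_thoughts`, `generate_chunk`, `rollout`), the scalar stand-ins for learned networks, and the chunk sizes are hypothetical and are not taken from the paper.

```python
import random
from typing import List, Tuple


def predict_thoughts(obs: List[float], num_thoughts: int,
                     rng: random.Random) -> List[float]:
    """Hypothetical reasoning head: one Gaussian sample per thought slot."""
    mean = sum(obs) / len(obs)  # stand-in for a learned mean
    std = 0.1                   # stand-in for a learned standard deviation
    # Reparameterized draw: z = mean + std * eps, with eps ~ N(0, 1).
    return [mean + std * rng.gauss(0.0, 1.0) for _ in range(num_thoughts)]


def generate_chunk(obs: List[float], thoughts: List[float],
                   chunk_index: int, chunk_len: int) -> List[float]:
    """Hypothetical action head: every chunk conditions on the same thoughts."""
    context = sum(thoughts) / len(thoughts)
    # Toy actions: shared context plus a small per-step offset.
    return [context + 0.01 * (chunk_index * chunk_len + t)
            for t in range(chunk_len)]


def rollout(obs: List[float], num_chunks: int = 3, chunk_len: int = 4,
            seed: int = 0) -> Tuple[List[float], List[List[float]]]:
    rng = random.Random(seed)
    # Reason once ...
    thoughts = predict_thoughts(obs, num_thoughts=2, rng=rng)
    # ... then reuse the same thoughts as context for every action chunk.
    chunks = [generate_chunk(obs, thoughts, k, chunk_len)
              for k in range(num_chunks)]
    return thoughts, chunks
```

The point of the sketch is only the data flow: the thoughts are sampled a single time per observation and every chunk reads the same values, rather than each chunk producing its own reasoning.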

What carries the argument

A shared Gaussian latent interface for continuous thoughts, enforced by a self-verification objective using an exponential-moving-average teacher.
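
A minimal sketch of how an EMA teacher could be made to consume the student's latent, under stated assumptions: the scalar "models", the parameter names (`enc`, `sigma`, `obs_w`, `lat_w`), and the squared-error loss are hypothetical choices for illustration. The review had access to the abstract only, so the paper's actual objective and architecture may differ.

```python
import random
from typing import Dict

Params = Dict[str, float]


def ema_update(teacher: Params, student: Params, decay: float) -> Params:
    """Standard EMA step: teacher <- decay * teacher + (1 - decay) * student."""
    return {k: decay * teacher[k] + (1.0 - decay) * student[k]
            for k in teacher}


def student_latent(params: Params, obs: float, rng: random.Random) -> float:
    """Student emits a Gaussian latent via the reparameterization z = mu + sigma * eps."""
    mu = params["enc"] * obs
    return mu + params["sigma"] * rng.gauss(0.0, 1.0)


def predict_action(params: Params, obs: float, z: float) -> float:
    """Toy action head that reads both the observation and the latent."""
    return params["obs_w"] * obs + params["lat_w"] * z


def self_verification_loss(teacher: Params, student: Params, obs: float,
                           target_action: float, rng: random.Random) -> float:
    """The teacher, not the student, must predict the target from the student's latent."""
    z = student_latent(student, obs, rng)
    pred = predict_action(teacher, obs, z)  # teacher consumes student's reasoning
    return (pred - target_action) ** 2
```

In a real training loop the gradient of this loss would presumably reach the student through `z` while the teacher is updated only by `ema_update`; the sketch computes the loss value and leaves optimization out.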

If this is right

  • Mean subtask success rises 40.4 percent over the baseline on the TX-G2 platform.
  • Mean subtask success rises 26.3 percent over the baseline on the HSR platform.
  • The same internal medium supports both simulation robustness on LIBERO-PRO and real-robot control.
  • Reasoning in VLA is effective when it is shareable across instances and verifiable through improved action prediction rather than when it consists of extra discrete tokens.
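
The abstract's "by 40.4%" and "26.3%" can be read either as relative gains or as percentage-point gains, and the text available here does not say which. The sketch below shows how far apart the two readings are. The baseline success rate of 0.50 is a hypothetical value chosen only for illustration; it is not a number from the paper.

```python
def apply_relative_gain(baseline: float, gain_pct: float) -> float:
    """Reading 1: success is multiplied by (1 + gain / 100)."""
    return baseline * (1.0 + gain_pct / 100.0)


def apply_absolute_gain(baseline: float, gain_points: float) -> float:
    """Reading 2: success increases by that many percentage points."""
    return baseline + gain_points / 100.0


# Hypothetical baseline, NOT a reported result.
HYPOTHETICAL_BASELINE = 0.50

relative_reading = apply_relative_gain(HYPOTHETICAL_BASELINE, 40.4)
absolute_reading = apply_absolute_gain(HYPOTHETICAL_BASELINE, 40.4)
```

Under this assumed baseline the relative reading gives a success rate of about 0.70 and the percentage-point reading about 0.90, so the interpretation matters when comparing against other work.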

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same latent interface could be tested for transfer between different VLA architectures without retraining the full policy.
  • Inspection of the learned continuous thoughts might reveal whether they align with human-interpretable subgoals or remain opaque.
  • The approach suggests that other continuous-control domains could replace language-based planners with shareable latent interfaces.

Load-bearing premise

The added latent will function as shareable and verifiable reasoning rather than a model-private shortcut, which the self-verification objective is intended to ensure by making the EMA teacher consume the student's reasoning.

What would settle it

An ablation that removes the self-verification objective while retaining the latent and shows no remaining performance gains on the real-robot tasks would falsify the claim that the reasoning medium itself drives the improvement.

Original abstract

Natural language is a powerful reasoning medium for language and vision-language models, but it is mismatched to the granularity of continuous control. Text and explicit subgoals operate at task-level granularity, whereas vision-language-action (VLA) policies must choose actions at a much finer temporal scale; a single reasoning step can therefore span many action chunks while remaining only weakly coupled to the action needed now. This suggests a different question for VLA: what should play the role of language? We argue that a useful VLA reasoning medium must be shareable across model instances, verifiable through downstream action improvement, and aligned with temporally extended control structure. Based on this view, we propose Continuous Reasoning for Vision-Language-Action. Our model first predicts continuous reasoning in the form of a structured set of continuous thoughts, then reuses them as shared context for chunk-structured action generation. Better action prediction alone does not certify good reasoning: if the same internal medium cannot be shared across model instances and independently verified through improved downstream control, the added latent may simply become a model-private shortcut that helps on seen behaviors without supporting generalizable control. We therefore instantiate continuous reasoning as a shared Gaussian latent interface and train it with a self-verification objective in which an exponential-moving-average teacher must successfully consume the student's reasoning when predicting target actions. Empirically, Continuous Reasoning improves LIBERO-PRO robustness and performs strongly on real robots, raising mean subtask success over π0.5 by 40.4% on TX-G2, an AgiBot G2-compatible variant, and 26.3% on HSR. This suggests that reasoning in VLA is less about extra tokens than about a shareable, verifiable internal language for action.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes Continuous Reasoning for Vision-Language-Action (CR-VLA), arguing that natural language is mismatched to the temporal granularity of continuous control in VLA policies. It introduces a structured set of continuous thoughts encoded as a shared Gaussian latent, which is reused as context for chunk-structured action generation. The latent is trained via a self-verification objective in which an EMA teacher must consume the student's reasoning to predict target actions. Empirical results claim improved robustness on LIBERO-PRO and real-robot tasks, with mean subtask success gains of 40.4% on TX-G2 and 26.3% on HSR over the π0.5 baseline.

Significance. If the central claim holds—that the Gaussian latent functions as a shareable, verifiable internal medium rather than a model-private shortcut—this could meaningfully advance VLA design by shifting emphasis from discrete token-based reasoning to continuous, temporally aligned representations. The reported real-robot gains would then indicate practical value for generalizable control.

major comments (2)
  1. [Abstract] Abstract (paragraph 2): The claim that the latent must be 'shareable across model instances' and 'independently verified through improved downstream control' is load-bearing for distinguishing the method from a private shortcut. The self-verification objective is instantiated only with an EMA teacher (an exponential moving average of the student's own parameters). This tests intra-trajectory stability but provides no evidence that a latent produced by one independently initialized or separately trained model instance can be consumed by another distinct instance to improve action prediction on the same inputs.
  2. [Abstract] Abstract (empirical paragraph): The reported gains (40.4% on TX-G2, 26.3% on HSR) are presented as evidence that the latent supports generalizable control. Without experiments that swap latents across independently trained models or ablate the EMA teacher against a cross-model verification protocol, these gains remain compatible with the latent acting as an auxiliary feature private to the training trajectory.
minor comments (1)
  1. The abstract states empirical gains but supplies no derivation details, training equations, data exclusion rules, error bars, or exact baseline configurations beyond the π0.5 reference.
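
The latent-swapping experiment requested in the major comments can be sketched as a small evaluation harness. This is an illustration of the protocol only, under stated assumptions: `latent_swap_report`, the callable signatures, and the scalar stub models are hypothetical, and the numbers it produces come from the stubs, not from any experiment.

```python
from typing import Callable, Dict, List, Optional, Tuple

Encoder = Callable[[float], float]                  # observation -> latent
Actor = Callable[[float, Optional[float]], float]   # (observation, latent) -> action
Dataset = List[Tuple[float, float]]                 # (observation, target action)


def mean_sq_error(actor: Actor, latents: List[Optional[float]],
                  data: Dataset) -> float:
    errors = [(actor(obs, z) - target) ** 2
              for (obs, target), z in zip(data, latents)]
    return sum(errors) / len(errors)


def latent_swap_report(enc_a: Encoder, enc_b: Encoder, act_b: Actor,
                       data: Dataset) -> Dict[str, float]:
    """Compare model B's action error with no latent, its own latent, and model A's latent."""
    own = [enc_b(obs) for obs, _ in data]
    swapped = [enc_a(obs) for obs, _ in data]
    none: List[Optional[float]] = [None] * len(data)
    return {
        "no_latent": mean_sq_error(act_b, none, data),
        "own_latent": mean_sq_error(act_b, own, data),
        "swapped_latent": mean_sq_error(act_b, swapped, data),
    }


# Stub models standing in for two independently trained instances.
# B's latent code is private: it only helps B, and A's code misleads B.
stub_enc_a: Encoder = lambda obs: 2.0 * obs
stub_enc_b: Encoder = lambda obs: -2.0 * obs
stub_act_b: Actor = lambda obs, z: obs if z is None else obs - 0.5 * z
stub_data: Dataset = [(1.0, 2.0), (2.0, 4.0)]

report = latent_swap_report(stub_enc_a, stub_enc_b, stub_act_b, stub_data)
```

With these stubs, B's own latent drives the error to zero while A's latent makes B worse than using no latent at all, which is the "model-private shortcut" pattern the referee describes. A shareable medium would instead show `swapped_latent` close to `own_latent` and below `no_latent`.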

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and for identifying the gap between the stated requirements for a reasoning medium and the evidence provided by EMA-based self-verification. We respond to each major comment below.

  1. Referee: [Abstract] Abstract (paragraph 2): The claim that the latent must be 'shareable across model instances' and 'independently verified through improved downstream control' is load-bearing for distinguishing the method from a private shortcut. The self-verification objective is instantiated only with an EMA teacher (an exponential moving average of the student's own parameters). This tests intra-trajectory stability but provides no evidence that a latent produced by one independently initialized or separately trained model instance can be consumed by another distinct instance to improve action prediction on the same inputs.

    Authors: We agree that EMA verification only establishes that the latent remains useful under gradual parameter drift within a single training trajectory. It does not constitute evidence that an independently initialized model could consume the same latent to improve its own action predictions. We will revise the abstract and the corresponding paragraph in Section 3 to replace the phrasing 'shareable across model instances' with 'stable under parameter evolution within training' and to explicitly note that cross-model transfer remains untested.
    Revision: yes

  2. Referee: [Abstract] Abstract (empirical paragraph): The reported gains (40.4% on TX-G2, 26.3% on HSR) are presented as evidence that the latent supports generalizable control. Without experiments that swap latents across independently trained models or ablate the EMA teacher against a cross-model verification protocol, these gains remain compatible with the latent acting as an auxiliary feature private to the training trajectory.

    Authors: The reported gains are obtained under the EMA self-verification regime. We concur that these numbers alone do not rule out the possibility that the latent functions as a training-trajectory-specific auxiliary feature. We will revise the empirical paragraph in the abstract and the corresponding discussion in Section 5 to present the gains as improvements in robustness under the proposed objective, while adding a sentence acknowledging that stronger claims about cross-instance shareability would require latent-swapping or cross-model verification experiments.
    Revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

The paper defines continuous reasoning via a shared Gaussian latent and a self-verification objective using an EMA teacher that consumes the student's latent to predict actions. Performance claims rest on independent empirical evaluations (LIBERO-PRO robustness, TX-G2 and HSR robot success rates) rather than any quantity that reduces by construction to the training objective or to prior self-citations. No equations, fitted-input predictions, uniqueness theorems, or ansatzes are shown that equate the reported gains to the inputs; the derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Review performed on abstract only; full equations, training details, and parameter counts unavailable.

axioms (1)
  • domain assumption: A useful VLA reasoning medium must be shareable across model instances, verifiable through downstream action improvement, and aligned with temporally extended control structure.
    Explicitly stated in the abstract as the premise motivating the design.
invented entities (1)
  • Structured set of continuous thoughts represented as shared Gaussian latent (no independent evidence)
    purpose: To serve as the reasoning medium that is reused for chunk-structured action generation
    Introduced in the abstract as the core representational choice.

pith-pipeline@v0.9.1-grok · 5849 in / 1294 out tokens · 21928 ms · 2026-06-28T22:04:59.379472+00:00 · methodology

