
arXiv: 2606.02735 · v2 · pith:UVOY7HFKnew · submitted 2026-06-01 · 💻 cs.RO · cs.AI · cs.LG

See Less, Specify More: Visual Evidence Budgets for Generalizable VLAs

Pith reviewed 2026-06-28 14:02 UTC · model grok-4.3

classification 💻 cs.RO · cs.AI · cs.LG
keywords: vision-language-action · robot generalization · visual evidence budget · subtask language · policy interface · real robot evaluation · multimodal control

The pith

Training VLAs with refined subtask language and an explicit visual evidence budget improves generalization on real robot tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that vision-language-action models generalize better when the executor is trained under a changed interface rather than from coarse instructions and full images. Specify More keeps the original high-level goal but adds refined trajectory- and subtask-level language to clarify the current execution mode. See Less adds an explicit visual evidence budget that forces the policy to act from task-sufficient evidence only, without any mask or region labels. A sympathetic reader would care because distractors, appearance shifts, and similar tasks currently force policies to resolve too much ambiguity on their own, and the reported gains come from altering the learning problem itself rather than scaling data or model size.

Core claim

S2 (See Less, Specify More) improves VLA generalization by training the executor to act from informative local guidance and task-sufficient visual evidence rather than recovering both from weak supervision. Specify More preserves the original instruction while relabeling each trajectory with refined language that disambiguates the execution mode. See Less imposes an explicit visual evidence budget during training so the executor relies on task-sufficient evidence instead of unconstrained visual context, without region or mask annotation. Across eight real-robot tasks on TX-G2 and HSR, mean subtask success rises from 54.2 percent with the pi0.5 baseline to 79.0 percent with S2.
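The headline numbers can be read in more than one way, so a short calculation may help. The sketch below uses only the two figures reported in the abstract; the variable names and the "failure reduction" framing are ours, not the paper's.

```python
# Headline figures as reported in the abstract (mean subtask success, percent).
baseline_pct = 54.2  # pi0.5 baseline
s2_pct = 79.0        # S2

# Absolute gain in percentage points.
absolute_gain = s2_pct - baseline_pct
# Gain expressed relative to the baseline success rate.
relative_gain = absolute_gain / baseline_pct * 100.0
# Share of the baseline's failed subtasks that S2 would recover.
failure_reduction = absolute_gain / (100.0 - baseline_pct) * 100.0

print(f"absolute gain: {absolute_gain:.1f} points")        # absolute gain: 24.8 points
print(f"relative gain: {relative_gain:.1f}%")              # relative gain: 45.8%
print(f"failure reduction: {failure_reduction:.1f}%")      # failure reduction: 54.1%
```

These are summary statistics of a mean over eight tasks; they say nothing about per-task spread, which the abstract does not report.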

What carries the argument

The S2 training interface that pairs goal-preserving refined subtask language with an imposed visual evidence budget.
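To make the two halves of this interface concrete, here is a minimal Python sketch. It is an illustration under stated assumptions only: the abstract does not say how the budget is enforced, so the top-k patch selection, the `ExecutorInput` container, the function names, the example instructions, and the relevance scores are all hypothetical stand-ins, not the authors' method.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class ExecutorInput:
    goal: str             # original high-level instruction, kept unchanged
    subtask: str          # refined local guidance for the current step
    patch_ids: List[int]  # indices of the visual patches the executor may use


def apply_evidence_budget(scores: Sequence[float], budget: int) -> List[int]:
    """Keep the `budget` highest-scoring patch indices, returned in image order.

    Top-k selection is an assumed stand-in for the unspecified mechanism.
    """
    if budget < 0:
        raise ValueError("budget must be non-negative")
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


def build_input(goal: str, subtask: str,
                scores: Sequence[float], budget: int) -> ExecutorInput:
    """Pair goal-preserving language with a budgeted subset of patches."""
    return ExecutorInput(goal=goal, subtask=subtask,
                         patch_ids=apply_evidence_budget(scores, budget))


# Hypothetical example: six patches with made-up relevance scores, budget of 3.
example = build_input(
    goal="put the cup on the shelf",
    subtask="grasp the cup by its handle",
    scores=[0.1, 0.9, 0.3, 0.8, 0.05, 0.6],
    budget=3,
)
print(example.patch_ids)  # [1, 3, 5]
```

The point of the sketch is the shape of the input: the goal string is never replaced, the subtask string is added beside it, and the visual input is capped at a fixed number of patches with no mask or region label involved.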

If this is right

  • Coarse instructions create avoidable supervision aliasing that harms generalization.
  • Goal-preserving local guidance outperforms replacing the full instruction in the reported ablations.
  • Explicit evidence budgeting reduces dependence on broad visual context beyond efficiency gains.
  • The resulting executor remains compatible with off-the-shelf VLM planners through in-context learning.
  • Changing the executor's learning problem this way produces higher success than recovering details from weak supervision.
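One way to picture the supervision aliasing named in the first bullet is to count how many distinct execution modes share a single label. The sketch below uses made-up demonstrations; the instruction strings, subtask strings, and mode names are hypothetical and do not come from the paper.

```python
from collections import defaultdict

# Hypothetical demonstrations: (instruction, refined subtask label, execution mode).
demos = [
    ("tidy the table", "pick up the red mug", "grasp"),
    ("tidy the table", "wipe the surface with the cloth", "wipe"),
    ("tidy the table", "place the mug in the bin", "place"),
]


def modes_per_label(pairs):
    """Map each language label to the number of distinct modes it supervises."""
    groups = defaultdict(set)
    for label, mode in pairs:
        groups[label].add(mode)
    return {label: len(modes) for label, modes in groups.items()}


# Coarse supervision: the instruction alone labels every segment.
coarse = modes_per_label((instr, mode) for instr, _, mode in demos)
# Goal-preserving refinement: the instruction is kept and the subtask is appended.
refined = modes_per_label((f"{instr} | {sub}", mode) for instr, sub, mode in demos)

print(coarse)                 # {'tidy the table': 3}
print(max(refined.values()))  # 1
```

Under the coarse label, one string supervises three different behaviors, so the policy has to infer the mode from the image alone. With the refined label, each string maps to a single mode while the original goal is still present in the text.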

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same interface could be tested on other vision-language control policies that currently rely on full-image attention.
  • Gains might appear in simulated environments with controlled distractors, allowing direct measurement of the budget's isolated effect.
  • The approach may lower the amount of visual data needed for training if the budget is tightened further.
  • Future experiments could check whether the refined language also helps when the planner itself is updated rather than kept fixed.

Load-bearing premise

Refined trajectory- and subtask-level language can be generated to disambiguate execution mode while preserving the original instruction, and an explicit visual evidence budget can be imposed during training without any region or mask annotations.

What would settle it

Applying the same refined instructions and evidence budget to the eight tasks and observing no improvement or a drop in subtask success rates relative to the baseline.
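A replication along these lines would need a rule for deciding whether an observed gain is distinguishable from noise across only eight tasks. As one possible check, the sketch below runs an exact one-sided sign-flip test on paired per-task success rates. The test choice and function name are our assumptions, and the numbers in the usage example are placeholders, not results from the paper.

```python
from itertools import product
from typing import Sequence


def paired_sign_flip_pvalue(baseline: Sequence[float],
                            treated: Sequence[float]) -> float:
    """Exact one-sided sign-flip test on paired differences.

    Returns the fraction of sign assignments whose summed difference is at
    least as large as the observed one (null: differences are symmetric
    around zero).
    """
    if len(baseline) != len(treated) or not baseline:
        raise ValueError("inputs must be non-empty and of equal length")
    diffs = [t - b for b, t in zip(baseline, treated)]
    observed = sum(diffs)
    hits = 0
    total = 0
    # Enumerate all 2^n sign patterns; n = 8 tasks gives 256 patterns.
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        if sum(s * d for s, d in zip(signs, diffs)) >= observed - 1e-12:
            hits += 1
    return hits / total


# Placeholder per-task success rates for eight tasks (NOT from the paper).
baseline = [0.4, 0.5, 0.6, 0.5, 0.4, 0.5, 0.6, 0.5]
treated = [b + 0.2 for b in baseline]
print(paired_sign_flip_pvalue(baseline, treated))  # 0.00390625
```

With eight tasks, the smallest attainable p-value is 1/256, reached only when every task improves. A null or negative replication would show up as a p-value near 1.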

Original abstract

Generalization remains a central bottleneck for vision-language-action (VLA) models: under distractors, appearance shifts, and semantically similar tasks, the policy must often infer local execution details from coarse instructions while also deciding which parts of the image matter for control. We present S2 (See Less, Specify More), a framework for improving VLA generalization by training the executor under a cleaner interface. Specify More preserves the original instruction as a stable high-level goal while relabeling each trajectory into refined trajectory- and subtask-level language that disambiguates the current execution mode. Unlike native attention, See Less imposes an explicit visual evidence budget, training the executor to act from task-sufficient evidence rather than unconstrained visual context, without any region or mask annotation. This interface lets the executor follow detailed guidance without relying on distracting visual patches or resolving avoidable ambiguity on its own, and it remains compatible with off-the-shelf VLM planners through in-context learning. Across our main evaluation settings, S2 improves overall generalization metrics by changing the executor's learning problem: coarse instructions induce avoidable supervision aliasing, goal-preserving local guidance outperforms instruction replacement in our main ablations, and explicit evidence budgeting reduces dependence on broad visual context beyond efficiency considerations. Across eight real-robot tasks on TX-G2 (an AgiBot G2-compatible variant) and HSR, S2 raises mean subtask success from 54.2% to 79.0% over pi0.5. Together, these results suggest that VLA generalization improves when the executor is trained to act from informative local guidance and task-sufficient visual evidence, rather than recovering both from weak supervision.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript proposes the S2 framework for vision-language-action (VLA) models to improve generalization. It introduces 'Specify More' to relabel trajectories with refined trajectory- and subtask-level language that preserves the original high-level instruction while disambiguating execution mode, and 'See Less' to impose an explicit visual evidence budget during training without requiring region or mask annotations. The central empirical claim is that this approach raises mean subtask success from 54.2% to 79.0% over the pi0.5 baseline across eight real-robot tasks on TX-G2 and HSR platforms.

Significance. If the reported performance gains can be attributed specifically to the proposed changes in the executor's learning interface rather than to unaccounted differences in supervision richness or experimental confounds, the work would offer a valuable contribution to VLA training by highlighting the role of goal-preserving local guidance and constrained visual context in reducing supervision aliasing. The compatibility with off-the-shelf VLM planners via in-context learning is noted as a strength.

major comments (3)
  1. [Abstract (Specify More description)] The procedure for generating the refined trajectory- and subtask-level language (Specify More) is not described in the provided manuscript text (e.g., manual annotation, VLM prompting details, or specific prompts used). This detail is load-bearing for the central claim that gains arise from the cleaner interface rather than richer supervision, as the 54.2% to 79.0% improvement is presented as evidence for the proposed training changes.
  2. [Abstract (See Less description)] The mechanism for imposing the explicit visual evidence budget (See Less) is not specified (e.g., random masking, attention regularization, cropping, or other implementation). Without this, it is impossible to verify that the budget is enforced without implicit cues or annotations, which is required to isolate its contribution to the reported generalization improvements.
  3. [Abstract (evaluation paragraph)] The abstract reports the key quantitative result (mean subtask success 54.2% to 79.0% on eight tasks) but provides no details on experimental design, number of trials per task, statistical significance tests, variance, or controls for confounds. This makes the empirical support for the generalization claim preliminary and load-bearing for the paper's conclusions.
minor comments (1)
  1. [Abstract] The phrase 'Across our main evaluation settings' is imprecise; cross-referencing specific tables, figures, or sections would improve readability.
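The third major comment turns on trial counts, so it may help to see how much the uncertainty of a success rate depends on them. The sketch below computes a 95% Wilson score interval for a binomial success rate. The trial counts in the usage example are hypothetical, since the abstract does not report how many trials were run per task.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for about 95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2.0 * trials)) / denom
    half = z * math.sqrt(p * (1.0 - p) / trials
                         + z * z / (4.0 * trials * trials)) / denom
    return center - half, center + half


# Hypothetical trial counts: the same 80% success rate at two sample sizes.
for successes, trials in [(8, 10), (80, 100)]:
    lo, hi = wilson_interval(successes, trials)
    print(f"{successes}/{trials}: [{lo:.3f}, {hi:.3f}]")
```

At ten trials the interval for an 80% success rate spans roughly 0.49 to 0.94, which is wide enough to overlap a baseline in the mid-50s. This is why the per-task trial count matters for judging the reported gain.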

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments correctly identify places where additional detail in the abstract and methods would strengthen the manuscript. We address each point below and will revise accordingly.

  1. Referee: [Abstract (Specify More description)] The procedure for generating the refined trajectory- and subtask-level language (Specify More) is not described in the provided manuscript text (e.g., manual annotation, VLM prompting details, or specific prompts used). This detail is load-bearing for the central claim that gains arise from the cleaner interface rather than richer supervision, as the 54.2% to 79.0% improvement is presented as evidence for the proposed training changes.

    Authors: We agree that the generation procedure must be explicit to isolate the effect of goal-preserving refinement from added supervision richness. The full manuscript describes the use of VLM prompting on trajectory data to produce subtask- and trajectory-level labels while retaining the original high-level instruction; we will revise the abstract to briefly state this VLM-based relabeling approach and expand the methods section with the exact prompting strategy and examples. This revision will directly support the claim that the interface change, rather than arbitrary additional labels, drives the observed gains.

    Revision: yes

  2. Referee: [Abstract (See Less description)] The mechanism for imposing the explicit visual evidence budget (See Less) is not specified (e.g., random masking, attention regularization, cropping, or other implementation). Without this, it is impossible to verify that the budget is enforced without implicit cues or annotations, which is required to isolate its contribution to the reported generalization improvements.

    Authors: We concur that the concrete implementation of the evidence budget must be stated to confirm it requires no region or mask annotations. The manuscript introduces the budget as an explicit constraint on visual context during training; we will revise the abstract and methods to specify the exact enforcement technique (e.g., the form of masking or regularization used) and reiterate that it operates without any spatial annotations. This addition will allow readers to verify the isolation of the visual-context effect.

    Revision: yes

  3. Referee: [Abstract (evaluation paragraph)] The abstract reports the key quantitative result (mean subtask success 54.2% to 79.0% on eight tasks) but provides no details on experimental design, number of trials per task, statistical significance tests, variance, or controls for confounds. This makes the empirical support for the generalization claim preliminary and load-bearing for the paper's conclusions.

    Authors: The abstract is necessarily concise, but the referee is correct that it should surface key experimental parameters. We will revise the abstract to include the number of trials per task and a brief note on controls; the main text already reports per-task results, variance across runs, and the experimental protocol on the two robot platforms. If additional statistical significance testing is required beyond what is currently presented, we will add it in the revision. These changes will make the empirical support more self-contained in the abstract while preserving the full details in the body.

    Revision: yes

Circularity Check

0 steps flagged

No circularity; empirical gains presented as measured outcomes of interface changes without definitional reduction.


The paper reports an empirical result (subtask success 54.2% → 79.0% on eight real-robot tasks) attributed to training under refined language guidance plus an explicit visual evidence budget. No equations, fitted parameters renamed as predictions, self-citations used as load-bearing uniqueness theorems, or ansatzes smuggled via prior work appear in the provided text. The relabeling and budgeting steps are described as methodological interventions whose effects are evaluated experimentally against a baseline; they do not reduce to the input data by construction. This is a standard non-circular empirical claim.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Because only the abstract is available, this ledger is based on the high-level concepts it describes; the full paper may reveal more parameters or assumptions.

axioms (2)
  • domain assumption: Refined subtask-level language can be created from trajectories to disambiguate execution without additional annotations.
    Mentioned as part of Specify More, but the method is not specified in the abstract.
  • domain assumption: An explicit visual evidence budget can be imposed during training to force the model to use task-sufficient evidence.
    Core to the See Less component.

pith-pipeline@v0.9.1-grok · 5845 in / 1320 out tokens · 38309 ms · 2026-06-28T14:02:16.344262+00:00 · methodology

