See Less, Specify More: Visual Evidence Budgets for Generalizable VLAs

Kei Ota; Tatsuya Matsushima; Yueh-Hua Wu

arXiv: 2606.02735 · v2 · pith:UVOY7HFKnew · submitted 2026-06-01 · 💻 cs.RO · cs.AI · cs.LG

Pith reviewed 2026-06-28 14:02 UTC · model grok-4.3

keywords: vision-language-action · robot generalization · visual evidence budget · subtask language · policy interface · real robot evaluation · multimodal control

The pith

Training VLAs with refined subtask language and an explicit visual evidence budget improves generalization on real robot tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that vision-language-action models generalize better when the executor is trained under a changed interface rather than from coarse instructions and full images. Specify More keeps the original high-level goal but adds refined trajectory- and subtask-level language to clarify the current execution mode. See Less adds an explicit visual evidence budget that forces the policy to act from task-sufficient evidence only, without any mask or region labels. A sympathetic reader would care because distractors, appearance shifts, and similar tasks currently force policies to resolve too much ambiguity on their own, and the reported gains come from altering the learning problem itself rather than scaling data or model size.

Core claim

S2 improves VLA generalization by training the executor to act from informative local guidance and task-sufficient visual evidence rather than recovering both from weak supervision. Specify More preserves the original instruction while relabeling each trajectory with refined language that disambiguates execution mode. See Less imposes an explicit visual evidence budget during training so the executor relies on task-sufficient evidence instead of unconstrained visual context, without region or mask annotation. Across eight real-robot tasks on TX-G2 and HSR, S2 raises mean subtask success from 54.2% to 79.0% over the pi0.5 baseline.
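To make the Specify More interface concrete, here is a minimal sketch of a relabeled trajectory record. It assumes the original instruction is stored verbatim next to a trajectory-level description and per-segment subtask text. The class names, field names, and prompt layout are all hypothetical; the abstract does not describe the authors' data format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Subtask:
    start: int  # first timestep of the segment (inclusive)
    end: int    # end of the segment (exclusive)
    text: str   # refined subtask-level language (hypothetical field)


@dataclass
class RelabeledTrajectory:
    goal: str             # original high-level instruction, kept unchanged
    trajectory_text: str  # refined trajectory-level description
    subtasks: List[Subtask] = field(default_factory=list)

    def prompt_at(self, t: int) -> str:
        """Build the language input at timestep t: stable goal plus local guidance."""
        base = f"Goal: {self.goal}\nPlan: {self.trajectory_text}"
        for seg in self.subtasks:
            if seg.start <= t < seg.end:
                return f"{base}\nNow: {seg.text}"
        # No segment covers t: fall back to the goal and plan only.
        return base
```

The point of the layout is that the goal line never changes across timesteps, while the last line switches with the current execution mode. This mirrors the "goal-preserving" wording of the abstract rather than any confirmed implementation.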

What carries the argument

The S2 training interface that pairs goal-preserving refined subtask language with an imposed visual evidence budget.

If this is right

Coarse instructions create avoidable supervision aliasing that harms generalization.
Goal-preserving local guidance outperforms replacing the full instruction in the reported ablations.
Explicit evidence budgeting reduces dependence on broad visual context beyond efficiency gains.
The resulting executor remains compatible with off-the-shelf VLM planners through in-context learning.
Changing the executor's learning problem this way produces higher success than recovering details from weak supervision.
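One way to read "supervision aliasing" is that the same conditioning text ends up paired with conflicting action targets. The toy sketch below illustrates that reading only: the data, the action labels, and the helper name are hypothetical, and it ignores the visual observation, which in a real policy also helps separate execution modes.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Set, Tuple


def conflicting_targets(
    samples: Iterable[Tuple[Hashable, str]],
) -> Dict[Hashable, Set[str]]:
    """Return conditioning keys that are paired with more than one action label."""
    targets: Dict[Hashable, Set[str]] = defaultdict(set)
    for key, action in samples:
        targets[key].add(action)
    return {key: acts for key, acts in targets.items() if len(acts) > 1}


# Hypothetical toy data: one coarse instruction covers two execution modes.
GOAL = "put the cup on the shelf"
coarse = [(GOAL, "close_gripper"), (GOAL, "open_gripper")]
refined = [
    ((GOAL, "grasp the cup handle"), "close_gripper"),
    ((GOAL, "release the cup on the shelf"), "open_gripper"),
]

aliased_coarse = conflicting_targets(coarse)    # the goal maps to two actions
aliased_refined = conflicting_targets(refined)  # no key has conflicting actions
```

Keeping the goal inside the refined key reflects the goal-preserving design described above: the subtask text is added to the instruction rather than replacing it.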

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same interface could be tested on other vision-language control policies that currently rely on full-image attention.
Gains might appear in simulated environments with controlled distractors, allowing direct measurement of the budget's isolated effect.
The approach may lower the amount of visual data needed for training if the budget is tightened further.
Future experiments could check whether the refined language also helps when the planner itself is updated rather than kept fixed.

Load-bearing premise

Refined trajectory- and subtask-level language can be generated to disambiguate execution mode while preserving the original instruction, and an explicit visual evidence budget can be imposed during training without any region or mask annotations.
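The abstract does not say how the budget is enforced, so the following is only a sketch of one possible form: hard top-k selection over image patches using scores the model itself produces, which needs no region or mask labels. The function names and the choice of top-k are assumptions, not the paper's mechanism.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def keep_top_k_patches(scores: Sequence[float], budget: int) -> List[int]:
    """Return indices of the `budget` highest-scoring patches, in original order."""
    if budget < 0:
        raise ValueError("budget must be non-negative")
    k = min(budget, len(scores))
    # Stable sort: ties keep their original left-to-right order.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def apply_budget(patches: Sequence[T], scores: Sequence[float], budget: int) -> List[T]:
    """Keep only the patches selected under the budget (hypothetical helper)."""
    if len(patches) != len(scores):
        raise ValueError("patches and scores must have the same length")
    return [patches[i] for i in keep_top_k_patches(scores, budget)]
```

In this sketch the budget is a fixed count of patches. A fraction of the image, or a soft penalty on how many patches receive weight, would be equally consistent with the abstract's wording.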

What would settle it

Applying the same refined instructions and evidence budget to the eight tasks and observing no improvement or a drop in subtask success rates relative to the baseline.

Original abstract

Generalization remains a central bottleneck for vision-language-action (VLA) models: under distractors, appearance shifts, and semantically similar tasks, the policy must often infer local execution details from coarse instructions while also deciding which parts of the image matter for control. We present S2 (See Less, Specify More), a framework for improving VLA generalization by training the executor under a cleaner interface. Specify More preserves the original instruction as a stable high-level goal while relabeling each trajectory into refined trajectory- and subtask-level language that disambiguates the current execution mode. Unlike native attention, See Less imposes an explicit visual evidence budget, training the executor to act from task-sufficient evidence rather than unconstrained visual context, without any region or mask annotation. This interface lets the executor follow detailed guidance without relying on distracting visual patches or resolving avoidable ambiguity on its own, and it remains compatible with off-the-shelf VLM planners through in-context learning. Across our main evaluation settings, S2 improves overall generalization metrics by changing the executor's learning problem: coarse instructions induce avoidable supervision aliasing, goal-preserving local guidance outperforms instruction replacement in our main ablations, and explicit evidence budgeting reduces dependence on broad visual context beyond efficiency considerations. Across eight real-robot tasks on TX-G2 (an AgiBot G2-compatible variant) and HSR, S2 raises mean subtask success from 54.2% to 79.0% over pi0.5. Together, these results suggest that VLA generalization improves when the executor is trained to act from informative local guidance and task-sufficient visual evidence, rather than recovering both from weak supervision.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

S2 shows real-robot gains from refined language plus visual budgeting, but the relabeling step risks confounding the claimed interface effect.

The core idea is training the VLA executor on goal-preserving subtask-level language labels instead of coarse instructions, combined with an explicit limit on visual evidence during training. This is positioned as a cleaner interface that reduces the need for the policy to resolve ambiguity or attend to irrelevant image regions on its own.

The real-robot results stand out: across eight tasks on TX-G2 and HSR platforms, mean subtask success rises from 54.2% to 79.0% over pi0.5. The ablations mentioned in the abstract (local guidance beating full instruction replacement, evidence budgeting helping beyond efficiency) give some support for the mechanism. Compatibility with off-the-shelf VLM planners via in-context learning is a practical plus.
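The headline numbers can be unpacked with simple arithmetic: 54.2% to 79.0% is a gain of 24.8 percentage points, about a 45.8% relative improvement in success, and it takes the failure rate from 45.8% down to 21.0%, which removes roughly 54% of the baseline's failures. A small helper (the name is hypothetical) reproduces these figures:

```python
from typing import Tuple


def gain_summary(baseline_pct: float, method_pct: float) -> Tuple[float, float, float]:
    """Return (absolute gain in points, relative gain, fraction of failures removed)."""
    if not 0.0 < baseline_pct < 100.0:
        raise ValueError("baseline_pct must be strictly between 0 and 100")
    absolute = method_pct - baseline_pct
    relative = absolute / baseline_pct
    failures_removed = absolute / (100.0 - baseline_pct)
    return absolute, relative, failures_removed


absolute, relative, failures_removed = gain_summary(54.2, 79.0)
```

These are means over eight tasks, so they say nothing about per-task spread.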

The main soft spot is the unspecified relabeling procedure. If the trajectory- and subtask-level labels are generated with additional model calls or human input that injects disambiguating information unavailable to the baseline, the lift cannot be cleanly attributed to the interface change rather than richer supervision. The abstract also gives no detail on how the visual budget is actually imposed without masks or annotations, nor on trial counts, variance, or statistical tests. These gaps make the evidence preliminary.

This is for people working on VLA generalization and real-world deployment. The thinking is clear and the problem is central, so it deserves a serious referee even if the methods section will need expansion to isolate the effect.

Referee Report

3 major / 1 minor

Summary. The manuscript proposes the S2 framework for vision-language-action (VLA) models to improve generalization. It introduces 'Specify More' to relabel trajectories with refined trajectory- and subtask-level language that preserves the original high-level instruction while disambiguating execution mode, and 'See Less' to impose an explicit visual evidence budget during training without requiring region or mask annotations. The central empirical claim is that this approach raises mean subtask success from 54.2% to 79.0% over the pi0.5 baseline across eight real-robot tasks on TX-G2 and HSR platforms.

Significance. If the reported performance gains can be attributed specifically to the proposed changes in the executor's learning interface rather than to unaccounted differences in supervision richness or experimental confounds, the work would offer a valuable contribution to VLA training by highlighting the role of goal-preserving local guidance and constrained visual context in reducing supervision aliasing. The compatibility with off-the-shelf VLM planners via in-context learning is noted as a strength.

major comments (3)

[Abstract (Specify More description)] The procedure for generating the refined trajectory- and subtask-level language (Specify More) is not described in the provided manuscript text (e.g., manual annotation, VLM prompting details, or specific prompts used). This detail is load-bearing for the central claim that gains arise from the cleaner interface rather than richer supervision, as the 54.2% to 79.0% improvement is presented as evidence for the proposed training changes.
[Abstract (See Less description)] The mechanism for imposing the explicit visual evidence budget (See Less) is not specified (e.g., random masking, attention regularization, cropping, or other implementation). Without this, it is impossible to verify that the budget is enforced without implicit cues or annotations, which is required to isolate its contribution to the reported generalization improvements.
[Abstract (evaluation paragraph)] The abstract reports the key quantitative result (mean subtask success 54.2% to 79.0% on eight tasks) but provides no details on experimental design, number of trials per task, statistical significance tests, variance, or controls for confounds. This makes the empirical support for the generalization claim preliminary and load-bearing for the paper's conclusions.
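To illustrate why trial counts matter for the third comment, here is a sketch of a Wilson score interval for a success rate. The trial counts used with it are hypothetical, since the abstract reports none; the formula itself is the standard Wilson interval.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2.0 * trials)) / denom
    half = z * math.sqrt(p * (1.0 - p) / trials + z * z / (4.0 * trials * trials)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical counts: 79 successes out of 100 subtask attempts.
low, high = wilson_interval(79, 100)
```

With fewer trials the interval widens quickly, which is why the same reported mean can be strong or weak evidence depending on a count the abstract omits.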

minor comments (1)

[Abstract] The phrase 'Across our main evaluation settings' is imprecise; cross-referencing specific tables, figures, or sections would improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments correctly identify places where additional detail in the abstract and methods would strengthen the manuscript. We address each point below and will revise accordingly.

Referee: [Abstract (Specify More description)] The procedure for generating the refined trajectory- and subtask-level language (Specify More) is not described in the provided manuscript text (e.g., manual annotation, VLM prompting details, or specific prompts used). This detail is load-bearing for the central claim that gains arise from the cleaner interface rather than richer supervision, as the 54.2% to 79.0% improvement is presented as evidence for the proposed training changes.

Authors: We agree that the generation procedure must be explicit to isolate the effect of goal-preserving refinement from added supervision richness. The full manuscript describes the use of VLM prompting on trajectory data to produce subtask- and trajectory-level labels while retaining the original high-level instruction; we will revise the abstract to briefly state this VLM-based relabeling approach and expand the methods section with the exact prompting strategy and examples. This revision will directly support the claim that the interface change, rather than arbitrary additional labels, drives the observed gains. revision: yes
Referee: [Abstract (See Less description)] The mechanism for imposing the explicit visual evidence budget (See Less) is not specified (e.g., random masking, attention regularization, cropping, or other implementation). Without this, it is impossible to verify that the budget is enforced without implicit cues or annotations, which is required to isolate its contribution to the reported generalization improvements.

Authors: We concur that the concrete implementation of the evidence budget must be stated to confirm it requires no region or mask annotations. The manuscript introduces the budget as an explicit constraint on visual context during training; we will revise the abstract and methods to specify the exact enforcement technique (e.g., the form of masking or regularization used) and reiterate that it operates without any spatial annotations. This addition will allow readers to verify the isolation of the visual-context effect. revision: yes
Referee: [Abstract (evaluation paragraph)] The abstract reports the key quantitative result (mean subtask success 54.2% to 79.0% on eight tasks) but provides no details on experimental design, number of trials per task, statistical significance tests, variance, or controls for confounds. This makes the empirical support for the generalization claim preliminary and load-bearing for the paper's conclusions.

Authors: The abstract is necessarily concise, but the referee is correct that it should surface key experimental parameters. We will revise the abstract to include the number of trials per task and a brief note on controls; the main text already reports per-task results, variance across runs, and the experimental protocol on the two robot platforms. If additional statistical significance testing is required beyond what is currently presented, we will add it in the revision. These changes will make the empirical support more self-contained in the abstract while preserving the full details in the body. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical gains presented as measured outcomes of interface changes without definitional reduction.

The paper reports an empirical result (subtask success 54.2% → 79.0% on eight real-robot tasks) attributed to training under refined language guidance plus an explicit visual evidence budget. No equations, fitted parameters renamed as predictions, self-citations used as load-bearing uniqueness theorems, or ansatzes smuggled via prior work appear in the provided text. The relabeling and budgeting steps are described as methodological interventions whose effects are evaluated experimentally against a baseline; they do not reduce to the input data by construction. This is a standard non-circular empirical claim.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Since only the abstract is available, the ledger is based on the high-level concepts it describes; the full paper may reveal more parameters or assumptions.

axioms (2)

domain assumption: Refined subtask-level language can be created from trajectories to disambiguate execution without additional annotations.
Mentioned as part of Specify More, but the method is not specified in the abstract.
domain assumption: An explicit visual evidence budget can be imposed during training to force the model to use task-sufficient evidence.
Core to the See Less component.

pith-pipeline@v0.9.1-grok · 5845 in / 1320 out tokens · 38309 ms · 2026-06-28T14:02:16.344262+00:00 · methodology

Reference graph

Works this paper leans on

48 extracted references · 2 canonical work pages · 1 internal anchor

[1]

RT-1: Robotics Transformer for Real-World Control at Scale

A. Brohan, N. Brown, J. Carbajal, Y. Chebotar, J. Dabis, C. Finn, K. Gopalakrishnan, K. Hausman, A. Herzog, J. Hsu, J. Ibarz, B. Ichter, A. Irpan, T. Jackson, S. Jesmonth, N. J. Joshi, R. Julian, D. Kalashnikov, Y. Kuang, I. Leal, K.-H. Lee, S. Levine, et al. Rt-1: Robotics transformer for real-world control at scale. In Robotics: Science and Systems, 202...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.15607/rss.2023.xix.025 2023
[2]

M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, et al. Openvla: An open-source vision-language-action model, 2024. URL https://arxiv.org/abs/2406.09246

Pith/arXiv arXiv 2024
[3]

K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter, S. Jakubczak, T. Jones, L. Ke, S. Levine, A. Li-Bell, M. Mothukuri, S. Nair, K. Pertsch, L. X. Shi, J. Tanner, Q. Vuong, A. Walling, H. Wang, and U. Zhilinsky. π0: A vision-language-action flow model for general robot control, 2024. URL https://arxiv.or...

Pith/arXiv arXiv 2024
[4]

M. Ahn, A. Brohan, N. Brown, et al. Do as i can, not as i say: Grounding language in robotic affordances, 2022. URL https://arxiv.org/abs/2204.01691

Pith/arXiv arXiv 2022
[5]

V. Myers, C. Zheng, O. Mees, K. Fang, and S. Levine. Policy adaptation via language optimization: Decomposing tasks for few-shot imitation. In Proceedings of the 8th Conference on Robot Learning, pages 1402--1426, 2025. URL https://proceedings.mlr.press/v270/myers25a.html

2025
[6]

L. X. Shi, B. Ichter, M. Equi, L. Ke, K. Pertsch, Q. Vuong, J. Tanner, A. Walling, H. Wang, N. Fusai, A. Li-Bell, D. Driess, L. Groom, S. Levine, and C. Finn. Hi robot: Open-ended instruction following with hierarchical vision-language-action models, 2025. URL https://arxiv.org/abs/2502.19417

Pith/arXiv arXiv 2025
[7]

M. S. Ryoo, A. J. Piergiovanni, M. Tan, and A. Angelova. Tokenlearner: What can 8 learned tokens do for images and videos?, 2021. URL https://arxiv.org/abs/2106.11297

arXiv 2021
[8]

J. Cheng, H. Wang, W. Li, G. Wang, Y. Zhang, X. Tang, J. Wu, X. Chen, Y. Liu, and W. Zhang. Vla-iap: Training-free visual token pruning via interaction alignment for vision-language-action models, 2026. URL https://arxiv.org/abs/2603.22991

arXiv 2026
[9]

E. Jang, A. Irpan, M. Khansari, D. Kappler, F. Ebert, C. Lynch, S. Levine, and C. Finn. Bc-z: Zero-shot task generalization with robotic imitation learning. In Proceedings of the 5th Conference on Robot Learning, 2021. URL https://openreview.net/forum?id=8kbp23tSGYv

2021
[10]

D. Ghosh, H. Walke, K. Pertsch, K. Black, O. Mees, S. Dasari, J. Hejna, T. Kreiman, C. Xu, J. Luo, Y. L. Tan, P. Sanketi, Q. Vuong, T. Xiao, D. Sadigh, C. Finn, S. Levine, et al. Octo: An open-source generalist robot policy, 2024. URL https://arxiv.org/abs/2405.12213

Pith/arXiv arXiv 2024
[11]

M. Shukor, D. Aubakirova, F. Capuano, P. Kooijmans, S. Palma, A. Zouitine, M. Aractingi, C. Pascal, M. Russi, A. Marafioti, S. Alibert, M. Cord, T. Wolf, and R. Cadene. Smolvla: A vision-language-action model for affordable and efficient robotics, 2025. URL https://arxiv.org/abs/2506.01844

Pith/arXiv arXiv 2025
[12]

H. Wang, C. Xiong, R. Wang, and X. Chen. Bitvla: 1-bit vision-language-action models for robotics manipulation, 2025a. URL https://arxiv.org/abs/2506.07530

arXiv
[13]

J. Zheng, J. Li, Z. Wang, D. Liu, X. Kang, Y. Feng, Y. Zheng, J. Zou, Y. Chen, J. Zeng, Y.-Q. Zhang, J. Pang, J. Liu, T. Wang, and X. Zhan. X-vla: Soft-prompted transformer as scalable cross-embodiment vision-language-action model, 2025. URL https://arxiv.org/abs/2510.10274

Pith/arXiv arXiv 2025
[14]

B. Liu, Y. Zhu, C. Gao, Y. Feng, Q. Liu, Y. Zhu, and P. Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning, 2023. URL https://arxiv.org/abs/2306.03310

Pith/arXiv arXiv 2023
[15]

O. Mees, L. Hermann, E. Rosete-Beas, and W. Burgard. Calvin: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. IEEE Robotics and Automation Letters, 7(3):7327--7334, 2022. doi:10.1109/LRA.2022.3180108. URL https://arxiv.org/abs/2112.03227

work page doi:10.1109/lra.2022.3180108 2022
[16]

J. Liang, W. Huang, F. Xia, et al. Code as policies: Language model programs for embodied control, 2022. URL https://arxiv.org/abs/2209.07753

Pith/arXiv arXiv 2022
[17]

W. Huang, F. Xia, T. Xiao, et al. Inner monologue: Embodied reasoning through planning with language models, 2022. URL https://arxiv.org/abs/2207.05608

Pith/arXiv arXiv 2022
[18]

Y. Li, Y. Deng, J. Zhang, J. Jang, M. Memmel, R. Yu, C. R. Garrett, F. Ramos, D. Fox, A. Li, A. Gupta, and A. Goyal. Hamster: Hierarchical action models for open-world robot manipulation, 2025. URL https://arxiv.org/abs/2502.05485

arXiv 2025
[19]

Q. Long, Y. Wang, J. Song, J. Zhang, P. Li, W. Wang, Y. Wang, H. Li, S. Xie, G. Yao, H. Zhang, X. Wang, Z. Wang, X. Lan, H. Liu, and X. Li. Scaling world model for hierarchical manipulation policies, 2026. URL https://arxiv.org/abs/2602.10983

arXiv 2026
[20]

L. Smith, A. Irpan, M. G. Arenas, S. Kirmani, D. Kalashnikov, D. Shah, and T. Xiao. Steer: Flexible robotic manipulation via dense language grounding, 2024. URL https://arxiv.org/abs/2411.03409

arXiv 2024
[21]

W. Chen, J. S. Bhatia, C. Glossop, N. Mathihalli, R. Doshi, A. Tang, D. Driess, K. Pertsch, and S. Levine. Steerable vision-language-action policies for embodied reasoning and hierarchical control, 2026a. URL https://arxiv.org/abs/2602.13193

Pith/arXiv arXiv 2026
[22]

Z. Chen, A. Tian, L. Wang, B. Joffe, Y. C. Lin, Y. Chen, S. Karamcheti, and D. Xu. Resteer: Quantifying and refining the steerability of multitask robot policies, 2026b. URL https://arxiv.org/abs/2603.17300

arXiv 2026
[23]

T. Xiao, H. Chan, P. Sermanet, A. Wahid, A. Brohan, K. Hausman, S. Levine, and J. Tompson. Robotic skill acquisition via instruction augmentation with vision-language models. In Robotics: Science and Systems, 2023. URL https://rss2023.github.io/rss2023-website/program/papers/029/. Introduces Data-driven Instruction Augmentation for Language-conditioned co...

2023
[24]

J. Wen, Y. Zhu, M. Zhu, J. Li, Z. Xu, Z. Che, C. Shen, Y. Peng, D. Liu, F. Feng, and J. Tang. Object-centric instruction augmentation for robotic manipulation. In Proceedings of the IEEE International Conference on Robotics and Automation, 2024. URL https://oci-robotics.github.io/

2024
[25]

N. Blank, M. Reuss, M. Rühle, Ö. E. Yağmurlu, F. Wenzel, O. Mees, and R. Lioutikov. Scaling robot policy learning via zero-shot labeling with foundation models. In Proceedings of the 8th Conference on Robot Learning, 2024. URL https://proceedings.mlr.press/v270/blank25a.html

2024
[26]

A. Kuramshin, O. Aslan, C. Neary, and G. Berseth. Task robustness via re-labelling vision-action robot data. In CoRL 2025 Robot Data Workshop, 2025. URL https://openreview.net/forum?id=M6M5W0lmaY

2025
[27]

Q. Zhao, Y. Lu, M. J. Kim, Z. Fu, Z. Zhang, Y. Wu, Z. Li, Q. Ma, S. Han, C. Finn, A. Handa, M.-Y. Liu, D. Xiang, G. Wetzstein, and T.-Y. Lin. Cot-vla: Visual chain-of-thought reasoning for vision-language-action models, 2025. URL https://arxiv.org/abs/2503.22020

Pith/arXiv arXiv 2025
[28]

J. Lee, J. Duan, H. Fang, Y. Deng, S. Liu, B. Li, B. Fang, J. Zhang, Y. R. Wang, S. Lee, W. Han, W. Pumacay, A. Wu, R. Hendrix, K. Farley, E. VanderBilt, A. Farhadi, D. Fox, and R. Krishna. Molmoact: Action reasoning models that can reason in space, 2025. URL https://arxiv.org/abs/2508.07917

Pith/arXiv arXiv 2025
[29]

C. Yin, Y. Lin, W. Xu, S. Tam, X. Zeng, Z. Liu, and Z. Yin. Deepthinkvla: Enhancing reasoning capability of vision-language-action models, 2025. URL https://arxiv.org/abs/2511.15669

Pith/arXiv arXiv 2025
[30]

C.-P. Huang, Y.-H. Wu, M.-H. Chen, Y.-C. F. Wang, and F.-E. Yang. Thinkact: Vision-language-action reasoning via reinforced visual latent planning, 2025. URL https://jasper0314-huang.github.io/thinkact-vla/

2025
[31]

Z. Chen, R. Niu, H. Kong, Q. Wang, Q. Xing, and Z. Fan. Tgrpo: Fine-tuning vision-language-action model via trajectory-wise group relative policy optimization, 2025. URL https://arxiv.org/abs/2506.08440

arXiv 2025
[32]

P. Abolghasemi, A. Mazaheri, M. Shah, and L. Bölöni. Pay attention! - robustifying a deep visuomotor policy through task-focused attention, 2018. URL https://arxiv.org/abs/1809.10093

Pith/arXiv arXiv 2018
[33]

C. Devin, P. Abbeel, T. Darrell, and S. Levine. Deep object-centric representations for generalizable robot learning, 2017. URL https://arxiv.org/abs/1708.04225

Pith/arXiv arXiv 2017
[34]

M. Shridhar, L. Manuelli, and D. Fox. Cliport: What and where pathways for robotic manipulation. In Proceedings of the Conference on Robot Learning, 2022a. URL https://arxiv.org/abs/2109.12098

arXiv 2022
[35]

M. Shridhar, L. Manuelli, and D. Fox. Perceiver-actor: A multi-task transformer for robotic manipulation, 2022b. URL https://arxiv.org/abs/2209.05451

arXiv 2022
[36]

Y. Zhao, K. Wu, T. Yi, Z. Xu, X. Ju, Z. Che, C. H. Liu, and J. Tang. Efficient training of generalizable visuomotor policies via control-aware augmentation, 2024. URL https://arxiv.org/abs/2401.09258. EAGLE

arXiv 2024
[37]

T. Zhang, Y. Hu, J. You, and Y. Gao. Leveraging locality to boost sample efficiency in robotic manipulation. In Proceedings of the 8th Conference on Robot Learning, pages 3264--3284, 2025. URL https://proceedings.mlr.press/v270/zhang25h.html

2025
[38]

A. Chapin, B. Machado, E. Dellandréa, and L. Chen. Spotlighting task-relevant features: Object-centric representations for better generalization in robotic manipulation, 2026. URL https://arxiv.org/abs/2601.21416

Pith/arXiv arXiv 2026
[39]

Y. Rao, W. Zhao, B. Liu, J. Lu, J. Zhou, and C.-J. Hsieh. Dynamicvit: Efficient vision transformers with dynamic token sparsification. In Advances in Neural Information Processing Systems, 2021. URL https://arxiv.org/abs/2106.02034

arXiv 2021
[40]

Y. Liang, G. Zhang, Z. Zhang, Y. Hu, B. Chandramouli, et al. Evit: Expediting vision transformers via token reorganizations, 2022. URL https://arxiv.org/abs/2202.07800

arXiv 2022
[41]

D. Bolya, C.-Y. Fu, X. Dai, P. Zhang, C. Feichtenhofer, and J. Hoffman. Token merging: Your vit but faster. In International Conference on Learning Representations, 2023. URL https://arxiv.org/abs/2210.09461

Pith/arXiv arXiv 2023
[42]

H. Li, W. Mao, Z. Lan, H. Xiong, H. Wang, C. Si, Z. Liu, X. Deng, and H. Chen. Bfa++: Hierarchical best-feature-aware token prune for multi-view vision language action model, 2026. URL https://arxiv.org/abs/2602.20566

arXiv 2026
[43]

Physical Intelligence, K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, M. Y. Galliker, D. Ghosh, L. Groom, K. Hausman, B. Ichter, S. Jakubczak, T. Jones, L. Ke, D. LeBlanc, S. Levine, A. Li-Bell, M. Mothukuri, S. Nair, K. Pertsch, A. Z. Ren, L. X. Shi, L. Smith, J. T. Springenberg, K. Stachowicz, J. Tanner...

Pith/arXiv arXiv 2025
[44]

M. J. Kim, C. Finn, and P. Liang. Fine-tuning vision-language-action models: Optimizing speed and success, 2025. URL https://arxiv.org/abs/2502.19645

Pith/arXiv arXiv 2025
[45]

Y. Wang, P. Ding, L. Li, C. Cui, Z. Ge, X. Tong, W. Song, H. Zhao, W. Zhao, P. Hou, S. Huang, Y. Tang, W. Wang, R. Zhang, J. Liu, and D. Wang. Vla-adapter: An effective paradigm for tiny-scale vision-language-action model, 2025b. URL https://arxiv.org/abs/2509.09372

arXiv
[46]

Kimi k2.5: Visual agentic intelligence, 2026

Kimi Team. Kimi k2.5: Visual agentic intelligence, 2026. URL https://arxiv.org/abs/2602.02276

Pith/arXiv arXiv 2026
[47]

Introducing gpt-5.4 mini and nano, Mar

OpenAI. Introducing gpt-5.4 mini and nano, Mar. 2026. URL https://openai.com/index/introducing-gpt-5-4-mini-and-nano/

2026
[48]

Y. Wang, X. Li, W. Wang, J. Zhang, Y. Li, Y. Chen, X. Wang, and Z. Zhang. Unified vision-language-action model, 2025c. URL https://arxiv.org/abs/2506.19850

arXiv

[1] [1]

RT-1: Robotics Transformer for Real-World Control at Scale

A. Brohan, N. Brown, J. Carbajal, Y. Chebotar, J. Dabis, C. Finn, K. Gopalakrishnan, K. Hausman, A. Herzog, J. Hsu, J. Ibarz, B. Ichter, A. Irpan, T. Jackson, S. Jesmonth, N. J. Joshi, R. Julian, D. Kalashnikov, Y. Kuang, I. Leal, K.-H. Lee, S. Levine, et al. Rt-1: Robotics transformer for real-world control at scale. In Robotics: Science and Systems, 202...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.15607/rss.2023.xix.025 2023

[2] [2]

M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, et al. Openvla: An open-source vision-language-action model, 2024. URL https://arxiv.org/abs/2406.09246

Pith/arXiv arXiv 2024

[3] [3]

Black, N

K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter, S. Jakubczak, T. Jones, L. Ke, S. Levine, A. Li-Bell, M. Mothukuri, S. Nair, K. Pertsch, L. X. Shi, J. Tanner, Q. Vuong, A. Walling, H. Wang, and U. Zhilinsky. _0 : A vision-language-action flow model for general robot control, 2024. URL https://arxiv.or...

Pith/arXiv arXiv 2024

[4] [4]

M. Ahn, A. Brohan, N. Brown, et al. Do as i can, not as i say: Grounding language in robotic affordances, 2022. URL https://arxiv.org/abs/2204.01691

Pith/arXiv arXiv 2022

[5] [5]

Myers, C

V. Myers, C. Zheng, O. Mees, K. Fang, and S. Levine. Policy adaptation via language optimization: Decomposing tasks for few-shot imitation. In Proceedings of the 8th Conference on Robot Learning, pages 1402--1426, 2025. URL https://proceedings.mlr.press/v270/myers25a.html

2025

[6] [6]

L. X. Shi, B. Ichter, M. Equi, L. Ke, K. Pertsch, Q. Vuong, J. Tanner, A. Walling, H. Wang, N. Fusai, A. Li-Bell, D. Driess, L. Groom, S. Levine, and C. Finn. Hi robot: Open-ended instruction following with hierarchical vision-language-action models, 2025. URL https://arxiv.org/abs/2502.19417

Pith/arXiv arXiv 2025

[7] [7]

M. S. Ryoo, A. J. Piergiovanni, M. Tan, and A. Angelova. Tokenlearner: What can 8 learned tokens do for images and videos?, 2021. URL https://arxiv.org/abs/2106.11297

arXiv 2021

[8] [8]

Cheng, H

J. Cheng, H. Wang, W. Li, G. Wang, Y. Zhang, X. Tang, J. Wu, X. Chen, Y. Liu, and W. Zhang. Vla-iap: Training-free visual token pruning via interaction alignment for vision-language-action models, 2026. URL https://arxiv.org/abs/2603.22991

arXiv 2026

[9] [9]

E. Jang, A. Irpan, M. Khansari, D. Kappler, F. Ebert, C. Lynch, S. Levine, and C. Finn. Bc-z: Zero-shot task generalization with robotic imitation learning. In Proceedings of the 5th Conference on Robot Learning, 2021. URL https://openreview.net/forum?id=8kbp23tSGYv

2021

[10] [10]

Ghosh, H

D. Ghosh, H. Walke, K. Pertsch, K. Black, O. Mees, S. Dasari, J. Hejna, T. Kreiman, C. Xu, J. Luo, Y. L. Tan, P. Sanketi, Q. Vuong, T. Xiao, D. Sadigh, C. Finn, S. Levine, et al. Octo: An open-source generalist robot policy, 2024. URL https://arxiv.org/abs/2405.12213

Pith/arXiv arXiv 2024

[11] [11]

Shukor, D

M. Shukor, D. Aubakirova, F. Capuano, P. Kooijmans, S. Palma, A. Zouitine, M. Aractingi, C. Pascal, M. Russi, A. Marafioti, S. Alibert, M. Cord, T. Wolf, and R. Cadene. Smolvla: A vision-language-action model for affordable and efficient robotics, 2025. URL https://arxiv.org/abs/2506.01844

Pith/arXiv arXiv 2025

[12] [12]

H. Wang, C. Xiong, R. Wang, and X. Chen. Bitvla: 1-bit vision-language-action models for robotics manipulation, 2025a. URL https://arxiv.org/abs/2506.07530

arXiv

[13] [13]

Zheng, J

J. Zheng, J. Li, Z. Wang, D. Liu, X. Kang, Y. Feng, Y. Zheng, J. Zou, Y. Chen, J. Zeng, Y.-Q. Zhang, J. Pang, J. Liu, T. Wang, and X. Zhan. X-vla: Soft-prompted transformer as scalable cross-embodiment vision-language-action model, 2025. URL https://arxiv.org/abs/2510.10274

Pith/arXiv arXiv 2025

[14] [14]

B. Liu, Y. Zhu, C. Gao, Y. Feng, Q. Liu, Y. Zhu, and P. Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning, 2023. URL https://arxiv.org/abs/2306.03310

Pith/arXiv arXiv 2023

[15] [15]

O. Mees, L. Hermann, E. Rosete-Beas, and W. Burgard. Calvin: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. IEEE Robotics and Automation Letters, 7 0 (3): 0 7327--7334, 2022. doi:10.1109/LRA.2022.3180108. URL https://arxiv.org/abs/2112.03227

[16]

J. Liang, W. Huang, F. Xia, et al. Code as policies: Language model programs for embodied control, 2022. URL https://arxiv.org/abs/2209.07753

[17]

W. Huang, F. Xia, T. Xiao, et al. Inner monologue: Embodied reasoning through planning with language models, 2022. URL https://arxiv.org/abs/2207.05608

[18]

Y. Li, Y. Deng, J. Zhang, J. Jang, M. Memmel, R. Yu, C. R. Garrett, F. Ramos, D. Fox, A. Li, A. Gupta, and A. Goyal. Hamster: Hierarchical action models for open-world robot manipulation, 2025. URL https://arxiv.org/abs/2502.05485

[19]

Q. Long, Y. Wang, J. Song, J. Zhang, P. Li, W. Wang, Y. Wang, H. Li, S. Xie, G. Yao, H. Zhang, X. Wang, Z. Wang, X. Lan, H. Liu, and X. Li. Scaling world model for hierarchical manipulation policies, 2026. URL https://arxiv.org/abs/2602.10983

[20]

L. Smith, A. Irpan, M. G. Arenas, S. Kirmani, D. Kalashnikov, D. Shah, and T. Xiao. Steer: Flexible robotic manipulation via dense language grounding, 2024. URL https://arxiv.org/abs/2411.03409

[21]

W. Chen, J. S. Bhatia, C. Glossop, N. Mathihalli, R. Doshi, A. Tang, D. Driess, K. Pertsch, and S. Levine. Steerable vision-language-action policies for embodied reasoning and hierarchical control, 2026a. URL https://arxiv.org/abs/2602.13193

[22]

Z. Chen, A. Tian, L. Wang, B. Joffe, Y. C. Lin, Y. Chen, S. Karamcheti, and D. Xu. Resteer: Quantifying and refining the steerability of multitask robot policies, 2026b. URL https://arxiv.org/abs/2603.17300

[23]

T. Xiao, H. Chan, P. Sermanet, A. Wahid, A. Brohan, K. Hausman, S. Levine, and J. Tompson. Robotic skill acquisition via instruction augmentation with vision-language models. In Robotics: Science and Systems, 2023. URL https://rss2023.github.io/rss2023-website/program/papers/029/. Introduces Data-driven Instruction Augmentation for Language-conditioned co...

[24]

J. Wen, Y. Zhu, M. Zhu, J. Li, Z. Xu, Z. Che, C. Shen, Y. Peng, D. Liu, F. Feng, and J. Tang. Object-centric instruction augmentation for robotic manipulation. In Proceedings of the IEEE International Conference on Robotics and Automation, 2024. URL https://oci-robotics.github.io/

[25]

N. Blank, M. Reuss, M. Rühle, Ö. E. Yağmurlu, F. Wenzel, O. Mees, and R. Lioutikov. Scaling robot policy learning via zero-shot labeling with foundation models. In Proceedings of the 8th Conference on Robot Learning, 2024. URL https://proceedings.mlr.press/v270/blank25a.html

[26]

A. Kuramshin, O. Aslan, C. Neary, and G. Berseth. Task robustness via re-labelling vision-action robot data. In CoRL 2025 Robot Data Workshop, 2025. URL https://openreview.net/forum?id=M6M5W0lmaY

[27]

Q. Zhao, Y. Lu, M. J. Kim, Z. Fu, Z. Zhang, Y. Wu, Z. Li, Q. Ma, S. Han, C. Finn, A. Handa, M.-Y. Liu, D. Xiang, G. Wetzstein, and T.-Y. Lin. Cot-vla: Visual chain-of-thought reasoning for vision-language-action models, 2025. URL https://arxiv.org/abs/2503.22020

[28]

J. Lee, J. Duan, H. Fang, Y. Deng, S. Liu, B. Li, B. Fang, J. Zhang, Y. R. Wang, S. Lee, W. Han, W. Pumacay, A. Wu, R. Hendrix, K. Farley, E. VanderBilt, A. Farhadi, D. Fox, and R. Krishna. Molmoact: Action reasoning models that can reason in space, 2025. URL https://arxiv.org/abs/2508.07917

[29]

C. Yin, Y. Lin, W. Xu, S. Tam, X. Zeng, Z. Liu, and Z. Yin. Deepthinkvla: Enhancing reasoning capability of vision-language-action models, 2025. URL https://arxiv.org/abs/2511.15669

[30]

C.-P. Huang, Y.-H. Wu, M.-H. Chen, Y.-C. F. Wang, and F.-E. Yang. Thinkact: Vision-language-action reasoning via reinforced visual latent planning, 2025. URL https://jasper0314-huang.github.io/thinkact-vla/

[31]

Z. Chen, R. Niu, H. Kong, Q. Wang, Q. Xing, and Z. Fan. Tgrpo: Fine-tuning vision-language-action model via trajectory-wise group relative policy optimization, 2025. URL https://arxiv.org/abs/2506.08440

[32]

P. Abolghasemi, A. Mazaheri, M. Shah, and L. Bölöni. Pay attention! - robustifying a deep visuomotor policy through task-focused attention, 2018. URL https://arxiv.org/abs/1809.10093

[33]

C. Devin, P. Abbeel, T. Darrell, and S. Levine. Deep object-centric representations for generalizable robot learning, 2017. URL https://arxiv.org/abs/1708.04225

[34]

M. Shridhar, L. Manuelli, and D. Fox. Cliport: What and where pathways for robotic manipulation. In Proceedings of the Conference on Robot Learning, 2022a. URL https://arxiv.org/abs/2109.12098

[35]

M. Shridhar, L. Manuelli, and D. Fox. Perceiver-actor: A multi-task transformer for robotic manipulation, 2022b. URL https://arxiv.org/abs/2209.05451

[36]

Y. Zhao, K. Wu, T. Yi, Z. Xu, X. Ju, Z. Che, C. H. Liu, and J. Tang. Efficient training of generalizable visuomotor policies via control-aware augmentation, 2024. URL https://arxiv.org/abs/2401.09258. EAGLE

[37]

T. Zhang, Y. Hu, J. You, and Y. Gao. Leveraging locality to boost sample efficiency in robotic manipulation. In Proceedings of the 8th Conference on Robot Learning, pages 3264--3284, 2025. URL https://proceedings.mlr.press/v270/zhang25h.html

[38]

A. Chapin, B. Machado, E. Dellandréa, and L. Chen. Spotlighting task-relevant features: Object-centric representations for better generalization in robotic manipulation, 2026. URL https://arxiv.org/abs/2601.21416

[39]

Y. Rao, W. Zhao, B. Liu, J. Lu, J. Zhou, and C.-J. Hsieh. Dynamicvit: Efficient vision transformers with dynamic token sparsification. In Advances in Neural Information Processing Systems, 2021. URL https://arxiv.org/abs/2106.02034

[40]

Y. Liang, G. Zhang, Z. Zhang, Y. Hu, B. Chandramouli, et al. Evit: Expediting vision transformers via token reorganizations, 2022. URL https://arxiv.org/abs/2202.07800

[41]

D. Bolya, C.-Y. Fu, X. Dai, P. Zhang, C. Feichtenhofer, and J. Hoffman. Token merging: Your vit but faster. In International Conference on Learning Representations, 2023. URL https://arxiv.org/abs/2210.09461

[42]

H. Li, W. Mao, Z. Lan, H. Xiong, H. Wang, C. Si, Z. Liu, X. Deng, and H. Chen. Bfa++: Hierarchical best-feature-aware token prune for multi-view vision language action model, 2026. URL https://arxiv.org/abs/2602.20566

[43]

Physical Intelligence, K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, M. Y. Galliker, D. Ghosh, L. Groom, K. Hausman, B. Ichter, S. Jakubczak, T. Jones, L. Ke, D. LeBlanc, S. Levine, A. Li-Bell, M. Mothukuri, S. Nair, K. Pertsch, A. Z. Ren, L. X. Shi, L. Smith, J. T. Springenberg, K. Stachowicz, J. Tanner...

[44]

M. J. Kim, C. Finn, and P. Liang. Fine-tuning vision-language-action models: Optimizing speed and success, 2025. URL https://arxiv.org/abs/2502.19645

[45]

Y. Wang, P. Ding, L. Li, C. Cui, Z. Ge, X. Tong, W. Song, H. Zhao, W. Zhao, P. Hou, S. Huang, Y. Tang, W. Wang, R. Zhang, J. Liu, and D. Wang. Vla-adapter: An effective paradigm for tiny-scale vision-language-action model, 2025b. URL https://arxiv.org/abs/2509.09372

[46]

Kimi Team. Kimi k2.5: Visual agentic intelligence, 2026. URL https://arxiv.org/abs/2602.02276

[47]

OpenAI. Introducing gpt-5.4 mini and nano, Mar. 2026. URL https://openai.com/index/introducing-gpt-5-4-mini-and-nano/

[48]

Y. Wang, X. Li, W. Wang, J. Zhang, Y. Li, Y. Chen, X. Wang, and Z. Zhang. Unified vision-language-action model, 2025c. URL https://arxiv.org/abs/2506.19850
