See Less, Specify More: Visual Evidence Budgets for Generalizable VLAs
Pith reviewed 2026-06-28 14:02 UTC · model grok-4.3
The pith
Training VLAs with refined subtask language and an explicit visual evidence budget improves generalization on real robot tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
S2 improves VLA generalization by training the executor to act from informative local guidance and task-sufficient visual evidence rather than recovering both from weak supervision. Specify More preserves the original instruction while relabeling each trajectory with refined language that disambiguates execution mode. See Less imposes an explicit visual evidence budget during training so the executor relies on task-sufficient evidence instead of unconstrained visual context, without region or mask annotation. Across eight real-robot tasks on TX-G2 and HSR this raises mean subtask success from 54.2 percent to 79.0 percent over the pi0.5 baseline.
What carries the argument
The S2 training interface that pairs goal-preserving refined subtask language with an imposed visual evidence budget.
If this is right
- Coarse instructions create avoidable supervision aliasing that harms generalization.
- Goal-preserving local guidance outperforms replacing the full instruction in the reported ablations.
- Explicit evidence budgeting reduces dependence on broad visual context beyond efficiency gains.
- The resulting executor remains compatible with off-the-shelf VLM planners through in-context learning.
- Changing the executor's learning problem this way produces higher success than recovering details from weak supervision.
Where Pith is reading between the lines
- The same interface could be tested on other vision-language control policies that currently rely on full-image attention.
- Gains might appear in simulated environments with controlled distractors, allowing direct measurement of the budget's isolated effect.
- The approach may lower the amount of visual data needed for training if the budget is tightened further.
- Future experiments could check whether the refined language also helps when the planner itself is updated rather than kept fixed.
Load-bearing premise
Refined trajectory- and subtask-level language can be generated to disambiguate execution mode while preserving the original instruction, and an explicit visual evidence budget can be imposed during training without any region or mask annotations.
What would settle it
Applying the same refined instructions and evidence budget to the eight tasks and observing no improvement or a drop in subtask success rates relative to the baseline.
read the original abstract
Generalization remains a central bottleneck for vision-language-action (VLA) models: under distractors, appearance shifts, and semantically similar tasks, the policy must often infer local execution details from coarse instructions while also deciding which parts of the image matter for control. We present S2 (See Less, Specify More), a framework for improving VLA generalization by training the executor under a cleaner interface. Specify More preserves the original instruction as a stable high-level goal while relabeling each trajectory into refined trajectory- and subtask-level language that disambiguates the current execution mode. Unlike native attention, See Less imposes an explicit visual evidence budget, training the executor to act from task-sufficient evidence rather than unconstrained visual context, without any region or mask annotation. This interface lets the executor follow detailed guidance without relying on distracting visual patches or resolving avoidable ambiguity on its own, and it remains compatible with off-the-shelf VLM planners through in-context learning. Across our main evaluation settings, S2 improves overall generalization metrics by changing the executor's learning problem: coarse instructions induce avoidable supervision aliasing, goal-preserving local guidance outperforms instruction replacement in our main ablations, and explicit evidence budgeting reduces dependence on broad visual context beyond efficiency considerations. Across eight real-robot tasks on TX-G2 (an AgiBot G2-compatible variant) and HSR, S2 raises mean subtask success from 54.2% to 79.0% over pi0.5. Together, these results suggest that VLA generalization improves when the executor is trained to act from informative local guidance and task-sufficient visual evidence, rather than recovering both from weak supervision.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes the S2 framework for vision-language-action (VLA) models to improve generalization. It introduces 'Specify More' to relabel trajectories with refined trajectory- and subtask-level language that preserves the original high-level instruction while disambiguating execution mode, and 'See Less' to impose an explicit visual evidence budget during training without requiring region or mask annotations. The central empirical claim is that this approach raises mean subtask success from 54.2% to 79.0% over the pi0.5 baseline across eight real-robot tasks on TX-G2 and HSR platforms.
Significance. If the reported performance gains can be attributed specifically to the proposed changes in the executor's learning interface rather than to unaccounted differences in supervision richness or experimental confounds, the work would offer a valuable contribution to VLA training by highlighting the role of goal-preserving local guidance and constrained visual context in reducing supervision aliasing. The compatibility with off-the-shelf VLM planners via in-context learning is noted as a strength.
major comments (3)
- [Abstract (Specify More description)] The procedure for generating the refined trajectory- and subtask-level language (Specify More) is not described in the provided manuscript text (e.g., manual annotation, VLM prompting details, or specific prompts used). This detail is load-bearing for the central claim that gains arise from the cleaner interface rather than richer supervision, as the 54.2% to 79.0% improvement is presented as evidence for the proposed training changes.
- [Abstract (See Less description)] The mechanism for imposing the explicit visual evidence budget (See Less) is not specified (e.g., random masking, attention regularization, cropping, or other implementation). Without this, it is impossible to verify that the budget is enforced without implicit cues or annotations, which is required to isolate its contribution to the reported generalization improvements.
- [Abstract (evaluation paragraph)] The abstract reports the key quantitative result (mean subtask success 54.2% to 79.0% on eight tasks) but provides no details on experimental design, number of trials per task, statistical significance tests, variance, or controls for confounds. This makes the empirical support for the generalization claim preliminary and load-bearing for the paper's conclusions.
minor comments (1)
- [Abstract] The phrase 'Across our main evaluation settings' is imprecise; cross-referencing specific tables, figures, or sections would improve readability.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. The comments correctly identify places where additional detail in the abstract and methods would strengthen the manuscript. We address each point below and will revise accordingly.
read point-by-point responses
-
Referee: [Abstract (Specify More description)] The procedure for generating the refined trajectory- and subtask-level language (Specify More) is not described in the provided manuscript text (e.g., manual annotation, VLM prompting details, or specific prompts used). This detail is load-bearing for the central claim that gains arise from the cleaner interface rather than richer supervision, as the 54.2% to 79.0% improvement is presented as evidence for the proposed training changes.
Authors: We agree that the generation procedure must be explicit to isolate the effect of goal-preserving refinement from added supervision richness. The full manuscript describes the use of VLM prompting on trajectory data to produce subtask- and trajectory-level labels while retaining the original high-level instruction; we will revise the abstract to briefly state this VLM-based relabeling approach and expand the methods section with the exact prompting strategy and examples. This revision will directly support the claim that the interface change, rather than arbitrary additional labels, drives the observed gains. revision: yes
-
Referee: [Abstract (See Less description)] The mechanism for imposing the explicit visual evidence budget (See Less) is not specified (e.g., random masking, attention regularization, cropping, or other implementation). Without this, it is impossible to verify that the budget is enforced without implicit cues or annotations, which is required to isolate its contribution to the reported generalization improvements.
Authors: We concur that the concrete implementation of the evidence budget must be stated to confirm it requires no region or mask annotations. The manuscript introduces the budget as an explicit constraint on visual context during training; we will revise the abstract and methods to specify the exact enforcement technique (e.g., the form of masking or regularization used) and reiterate that it operates without any spatial annotations. This addition will allow readers to verify the isolation of the visual-context effect. revision: yes
-
Referee: [Abstract (evaluation paragraph)] The abstract reports the key quantitative result (mean subtask success 54.2% to 79.0% on eight tasks) but provides no details on experimental design, number of trials per task, statistical significance tests, variance, or controls for confounds. This makes the empirical support for the generalization claim preliminary and load-bearing for the paper's conclusions.
Authors: The abstract is necessarily concise, but the referee is correct that it should surface key experimental parameters. We will revise the abstract to include the number of trials per task and a brief note on controls; the main text already reports per-task results, variance across runs, and the experimental protocol on the two robot platforms. If additional statistical significance testing is required beyond what is currently presented, we will add it in the revision. These changes will make the empirical support more self-contained in the abstract while preserving the full details in the body. revision: yes
Circularity Check
No circularity; empirical gains presented as measured outcomes of interface changes without definitional reduction.
full rationale
The paper reports an empirical result (subtask success 54.2% → 79.0% on eight real-robot tasks) attributed to training under refined language guidance plus an explicit visual evidence budget. No equations, fitted parameters renamed as predictions, self-citations used as load-bearing uniqueness theorems, or ansatzes smuggled via prior work appear in the provided text. The relabeling and budgeting steps are described as methodological interventions whose effects are evaluated experimentally against a baseline; they do not reduce to the input data by construction. This is a standard non-circular empirical claim.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Refined subtask-level language can be created from trajectories to disambiguate execution without additional annotations.
- domain assumption An explicit visual evidence budget can be imposed during training to force the model to use task-sufficient evidence.
Reference graph
Works this paper leans on
-
[1]
RT-1: Robotics Transformer for Real-World Control at Scale
A. Brohan, N. Brown, J. Carbajal, Y. Chebotar, J. Dabis, C. Finn, K. Gopalakrishnan, K. Hausman, A. Herzog, J. Hsu, J. Ibarz, B. Ichter, A. Irpan, T. Jackson, S. Jesmonth, N. J. Joshi, R. Julian, D. Kalashnikov, Y. Kuang, I. Leal, K.-H. Lee, S. Levine, et al. Rt-1: Robotics transformer for real-world control at scale. In Robotics: Science and Systems, 202...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.15607/rss.2023.xix.025 2023
-
[2]
M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, et al. Openvla: An open-source vision-language-action model, 2024. URL https://arxiv.org/abs/2406.09246
Pith/arXiv arXiv 2024
-
[3]
K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter, S. Jakubczak, T. Jones, L. Ke, S. Levine, A. Li-Bell, M. Mothukuri, S. Nair, K. Pertsch, L. X. Shi, J. Tanner, Q. Vuong, A. Walling, H. Wang, and U. Zhilinsky. _0 : A vision-language-action flow model for general robot control, 2024. URL https://arxiv.or...
Pith/arXiv arXiv 2024
-
[4]
M. Ahn, A. Brohan, N. Brown, et al. Do as i can, not as i say: Grounding language in robotic affordances, 2022. URL https://arxiv.org/abs/2204.01691
Pith/arXiv arXiv 2022
-
[5]
Myers, C
V. Myers, C. Zheng, O. Mees, K. Fang, and S. Levine. Policy adaptation via language optimization: Decomposing tasks for few-shot imitation. In Proceedings of the 8th Conference on Robot Learning, pages 1402--1426, 2025. URL https://proceedings.mlr.press/v270/myers25a.html
2025
-
[6]
L. X. Shi, B. Ichter, M. Equi, L. Ke, K. Pertsch, Q. Vuong, J. Tanner, A. Walling, H. Wang, N. Fusai, A. Li-Bell, D. Driess, L. Groom, S. Levine, and C. Finn. Hi robot: Open-ended instruction following with hierarchical vision-language-action models, 2025. URL https://arxiv.org/abs/2502.19417
Pith/arXiv arXiv 2025
-
[7]
M. S. Ryoo, A. J. Piergiovanni, M. Tan, and A. Angelova. Tokenlearner: What can 8 learned tokens do for images and videos?, 2021. URL https://arxiv.org/abs/2106.11297
arXiv 2021
- [8]
-
[9]
E. Jang, A. Irpan, M. Khansari, D. Kappler, F. Ebert, C. Lynch, S. Levine, and C. Finn. Bc-z: Zero-shot task generalization with robotic imitation learning. In Proceedings of the 5th Conference on Robot Learning, 2021. URL https://openreview.net/forum?id=8kbp23tSGYv
2021
-
[10]
D. Ghosh, H. Walke, K. Pertsch, K. Black, O. Mees, S. Dasari, J. Hejna, T. Kreiman, C. Xu, J. Luo, Y. L. Tan, P. Sanketi, Q. Vuong, T. Xiao, D. Sadigh, C. Finn, S. Levine, et al. Octo: An open-source generalist robot policy, 2024. URL https://arxiv.org/abs/2405.12213
Pith/arXiv arXiv 2024
-
[11]
M. Shukor, D. Aubakirova, F. Capuano, P. Kooijmans, S. Palma, A. Zouitine, M. Aractingi, C. Pascal, M. Russi, A. Marafioti, S. Alibert, M. Cord, T. Wolf, and R. Cadene. Smolvla: A vision-language-action model for affordable and efficient robotics, 2025. URL https://arxiv.org/abs/2506.01844
Pith/arXiv arXiv 2025
-
[12]
H. Wang, C. Xiong, R. Wang, and X. Chen. Bitvla: 1-bit vision-language-action models for robotics manipulation, 2025a. URL https://arxiv.org/abs/2506.07530
-
[13]
J. Zheng, J. Li, Z. Wang, D. Liu, X. Kang, Y. Feng, Y. Zheng, J. Zou, Y. Chen, J. Zeng, Y.-Q. Zhang, J. Pang, J. Liu, T. Wang, and X. Zhan. X-vla: Soft-prompted transformer as scalable cross-embodiment vision-language-action model, 2025. URL https://arxiv.org/abs/2510.10274
Pith/arXiv arXiv 2025
-
[14]
B. Liu, Y. Zhu, C. Gao, Y. Feng, Q. Liu, Y. Zhu, and P. Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning, 2023. URL https://arxiv.org/abs/2306.03310
Pith/arXiv arXiv 2023
-
[15]
O. Mees, L. Hermann, E. Rosete-Beas, and W. Burgard. Calvin: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. IEEE Robotics and Automation Letters, 7 0 (3): 0 7327--7334, 2022. doi:10.1109/LRA.2022.3180108. URL https://arxiv.org/abs/2112.03227
-
[16]
J. Liang, W. Huang, F. Xia, et al. Code as policies: Language model programs for embodied control, 2022. URL https://arxiv.org/abs/2209.07753
Pith/arXiv arXiv 2022
-
[17]
W. Huang, F. Xia, T. Xiao, et al. Inner monologue: Embodied reasoning through planning with language models, 2022. URL https://arxiv.org/abs/2207.05608
Pith/arXiv arXiv 2022
-
[18]
Y. Li, Y. Deng, J. Zhang, J. Jang, M. Memmel, R. Yu, C. R. Garrett, F. Ramos, D. Fox, A. Li, A. Gupta, and A. Goyal. Hamster: Hierarchical action models for open-world robot manipulation, 2025. URL https://arxiv.org/abs/2502.05485
arXiv 2025
-
[19]
Q. Long, Y. Wang, J. Song, J. Zhang, P. Li, W. Wang, Y. Wang, H. Li, S. Xie, G. Yao, H. Zhang, X. Wang, Z. Wang, X. Lan, H. Liu, and X. Li. Scaling world model for hierarchical manipulation policies, 2026. URL https://arxiv.org/abs/2602.10983
arXiv 2026
- [20]
-
[21]
W. Chen, J. S. Bhatia, C. Glossop, N. Mathihalli, R. Doshi, A. Tang, D. Driess, K. Pertsch, and S. Levine. Steerable vision-language-action policies for embodied reasoning and hierarchical control, 2026 a . URL https://arxiv.org/abs/2602.13193
Pith/arXiv arXiv 2026
-
[22]
Z. Chen, A. Tian, L. Wang, B. Joffe, Y. C. Lin, Y. Chen, S. Karamcheti, and D. Xu. Resteer: Quantifying and refining the steerability of multitask robot policies, 2026 b . URL https://arxiv.org/abs/2603.17300
arXiv 2026
-
[23]
T. Xiao, H. Chan, P. Sermanet, A. Wahid, A. Brohan, K. Hausman, S. Levine, and J. Tompson. Robotic skill acquisition via instruction augmentation with vision-language models. In Robotics: Science and Systems, 2023. URL https://rss2023.github.io/rss2023-website/program/papers/029/. Introduces Data-driven Instruction Augmentation for Language-conditioned co...
2023
-
[24]
J. Wen, Y. Zhu, M. Zhu, J. Li, Z. Xu, Z. Che, C. Shen, Y. Peng, D. Liu, F. Feng, and J. Tang. Object-centric instruction augmentation for robotic manipulation. In Proceedings of the IEEE International Conference on Robotics and Automation, 2024. URL https://oci-robotics.github.io/
2024
-
[25]
u hle, \
N. Blank, M. Reuss, M. R \"u hle, \"O . E. Ya g murlu, F. Wenzel, O. Mees, and R. Lioutikov. Scaling robot policy learning via zero-shot labeling with foundation models. In Proceedings of the 8th Conference on Robot Learning, 2024. URL https://proceedings.mlr.press/v270/blank25a.html
2024
-
[26]
Kuramshin, O
A. Kuramshin, O. Aslan, C. Neary, and G. Berseth. Task robustness via re-labelling vision-action robot data. In CoRL 2025 Robot Data Workshop, 2025. URL https://openreview.net/forum?id=M6M5W0lmaY
2025
-
[27]
Q. Zhao, Y. Lu, M. J. Kim, Z. Fu, Z. Zhang, Y. Wu, Z. Li, Q. Ma, S. Han, C. Finn, A. Handa, M.-Y. Liu, D. Xiang, G. Wetzstein, and T.-Y. Lin. Cot-vla: Visual chain-of-thought reasoning for vision-language-action models, 2025. URL https://arxiv.org/abs/2503.22020
Pith/arXiv arXiv 2025
-
[28]
J. Lee, J. Duan, H. Fang, Y. Deng, S. Liu, B. Li, B. Fang, J. Zhang, Y. R. Wang, S. Lee, W. Han, W. Pumacay, A. Wu, R. Hendrix, K. Farley, E. VanderBilt, A. Farhadi, D. Fox, and R. Krishna. Molmoact: Action reasoning models that can reason in space, 2025. URL https://arxiv.org/abs/2508.07917
Pith/arXiv arXiv 2025
-
[29]
C. Yin, Y. Lin, W. Xu, S. Tam, X. Zeng, Z. Liu, and Z. Yin. Deepthinkvla: Enhancing reasoning capability of vision-language-action models, 2025. URL https://arxiv.org/abs/2511.15669
Pith/arXiv arXiv 2025
-
[30]
Huang, Y.-H
C.-P. Huang, Y.-H. Wu, M.-H. Chen, Y.-C. F. Wang, and F.-E. Yang. Thinkact: Vision-language-action reasoning via reinforced visual latent planning, 2025. URL https://jasper0314-huang.github.io/thinkact-vla/
2025
-
[31]
Z. Chen, R. Niu, H. Kong, Q. Wang, Q. Xing, and Z. Fan. Tgrpo: Fine-tuning vision-language-action model via trajectory-wise group relative policy optimization, 2025. URL https://arxiv.org/abs/2506.08440
arXiv 2025
-
[32]
P. Abolghasemi, A. Mazaheri, M. Shah, and L. B \"o l \"o ni. Pay attention! - robustifying a deep visuomotor policy through task-focused attention, 2018. URL https://arxiv.org/abs/1809.10093
Pith/arXiv arXiv 2018
-
[33]
C. Devin, P. Abbeel, T. Darrell, and S. Levine. Deep object-centric representations for generalizable robot learning, 2017. URL https://arxiv.org/abs/1708.04225
Pith/arXiv arXiv 2017
-
[34]
M. Shridhar, L. Manuelli, and D. Fox. Cliport: What and where pathways for robotic manipulation. In Proceedings of the Conference on Robot Learning, 2022 a . URL https://arxiv.org/abs/2109.12098
arXiv 2022
-
[35]
M. Shridhar, L. Manuelli, and D. Fox. Perceiver-actor: A multi-task transformer for robotic manipulation, 2022 b . URL https://arxiv.org/abs/2209.05451
arXiv 2022
-
[36]
Y. Zhao, K. Wu, T. Yi, Z. Xu, X. Ju, Z. Che, C. H. Liu, and J. Tang. Efficient training of generalizable visuomotor policies via control-aware augmentation, 2024. URL https://arxiv.org/abs/2401.09258. EAGLE
arXiv 2024
-
[37]
Zhang, Y
T. Zhang, Y. Hu, J. You, and Y. Gao. Leveraging locality to boost sample efficiency in robotic manipulation. In Proceedings of the 8th Conference on Robot Learning, pages 3264--3284, 2025. URL https://proceedings.mlr.press/v270/zhang25h.html
2025
-
[38]
A. Chapin, B. Machado, E. Dellandr \'e a, and L. Chen. Spotlighting task-relevant features: Object-centric representations for better generalization in robotic manipulation, 2026. URL https://arxiv.org/abs/2601.21416
Pith/arXiv arXiv 2026
-
[39]
Y. Rao, W. Zhao, B. Liu, J. Lu, J. Zhou, and C.-J. Hsieh. Dynamicvit: Efficient vision transformers with dynamic token sparsification. In Advances in Neural Information Processing Systems, 2021. URL https://arxiv.org/abs/2106.02034
arXiv 2021
- [40]
-
[41]
D. Bolya, C.-Y. Fu, X. Dai, P. Zhang, C. Feichtenhofer, and J. Hoffman. Token merging: Your vit but faster. In International Conference on Learning Representations, 2023. URL https://arxiv.org/abs/2210.09461
Pith/arXiv arXiv 2023
-
[42]
H. Li, W. Mao, Z. Lan, H. Xiong, H. Wang, C. Si, Z. Liu, X. Deng, and H. Chen. Bfa++: Hierarchical best-feature-aware token prune for multi-view vision language action model, 2026. URL https://arxiv.org/abs/2602.20566
arXiv 2026
-
[43]
Physical Intelligence , K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, M. Y. Galliker, D. Ghosh, L. Groom, K. Hausman, B. Ichter, S. Jakubczak, T. Jones, L. Ke, D. LeBlanc, S. Levine, A. Li-Bell, M. Mothukuri, S. Nair, K. Pertsch, A. Z. Ren, L. X. Shi, L. Smith, J. T. Springenberg, K. Stachowicz, J. Tanner...
Pith/arXiv arXiv 2025
-
[44]
M. J. Kim, C. Finn, and P. Liang. Fine-tuning vision-language-action models: Optimizing speed and success, 2025. URL https://arxiv.org/abs/2502.19645
Pith/arXiv arXiv 2025
-
[45]
Y. Wang, P. Ding, L. Li, C. Cui, Z. Ge, X. Tong, W. Song, H. Zhao, W. Zhao, P. Hou, S. Huang, Y. Tang, W. Wang, R. Zhang, J. Liu, and D. Wang. Vla-adapter: An effective paradigm for tiny-scale vision-language-action model, 2025b. URL https://arxiv.org/abs/2509.09372
-
[46]
Kimi k2.5: Visual agentic intelligence, 2026
Kimi Team . Kimi k2.5: Visual agentic intelligence, 2026. URL https://arxiv.org/abs/2602.02276
Pith/arXiv arXiv 2026
-
[47]
Introducing gpt-5.4 mini and nano, Mar
OpenAI . Introducing gpt-5.4 mini and nano, Mar. 2026. URL https://openai.com/index/introducing-gpt-5-4-mini-and-nano/
2026
-
[48]
Y. Wang, X. Li, W. Wang, J. Zhang, Y. Li, Y. Chen, X. Wang, and Z. Zhang. Unified vision-language-action model, 2025c. URL https://arxiv.org/abs/2506.19850
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.