Does VLA Even Know the Basics? Measuring Commonsense and World Knowledge Retention in Vision-Language-Action Models

Albina Burlova; Aleksandr I. Panov; Alexey K. Kovalev; Andrey Kuznetsov; Andrey Moskalenko; Daria Pugacheva; Denis Shepelev; Elena Tutubalina; Matvey Skripkin; Mikhail Kolosov

arxiv: 2606.19297 · v1 · pith:XJAESYJDnew · submitted 2026-06-17 · 💻 cs.LG · cs.RO

Nikita Kachaev, Andrey Moskalenko, Matvey Skripkin, Nikita Kurlaev, Daria Pugacheva, Albina Burlova, Mikhail Kolosov, Denis Shepelev, Andrey Kuznetsov, Elena Tutubalina, Aleksandr I. Panov, Alexey K. Kovalev, Vlad Shakhuro

Pith reviewed 2026-06-26 20:53 UTC · model grok-4.3

classification 💻 cs.LG cs.RO

keywords vision-language-action models · commonsense knowledge · world knowledge · knowledge retention · Act2Answer · VQA co-training · layerwise probing · embodied AI

The pith

Vision-language-action models retain commonsense on simple concepts but show larger gaps on richer categories than their source vision-language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Act2Answer, a protocol that converts standard VLM knowledge benchmarks into short tabletop episodes where a VLA model answers by performing one object-placement action. This yields an action-grounded success rate intended to isolate retained information from low-level control problems. Across seven VLAs and nine VLM baselines, the evaluation finds solid results on basic concepts but wider shortfalls on richer semantic categories relative to the original VLMs. VQA co-training during adaptation correlates with stronger retention, and layerwise probing shows answer-relevant signals strongest in middle layers that weaken toward the output.

Core claim

VLAs show solid performance on simple concepts while exhibiting larger gaps on richer semantic categories relative to their source VLMs; VQA co-training is associated with better knowledge retention; and answer-relevant signals peak in middle VLA layers but attenuate in upper layers. These patterns are measured by converting knowledge questions into episodes that require a single object-placement action to select among candidate answers.

What carries the argument

Act2Answer, the protocol that adapts VLM knowledge benchmarks to VLA evaluation by requiring the agent to perform a single object-placement action to indicate the selected answer.
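
The conversion described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `Episode` class, `to_episode`, `success_rate`, and the two sample questions are all hypothetical names and data, and a real episode would involve a simulated scene and a robot policy rather than a function that returns a zone index.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass(frozen=True)
class Episode:
    """One knowledge question rendered as an answer-selection task (hypothetical)."""
    question: str
    candidates: Tuple[str, ...]  # one candidate answer per placement zone
    correct_index: int           # index of the zone holding the right answer


def to_episode(question: str, options: Sequence[str], answer: str) -> Episode:
    """Map a multiple-choice benchmark item onto placement zones."""
    if answer not in options:
        raise ValueError("answer must be one of the options")
    return Episode(question, tuple(options), list(options).index(answer))


def success_rate(episodes: Sequence[Episode],
                 policy: Callable[[Episode], int]) -> float:
    """Fraction of episodes where the chosen zone holds the correct answer."""
    if not episodes:
        raise ValueError("need at least one episode")
    hits = sum(policy(ep) == ep.correct_index for ep in episodes)
    return hits / len(episodes)


# Illustrative items only; they are not taken from the Act2Answer suite.
episodes = [
    to_episode("Which animal is a mammal?", ["shark", "dolphin"], "dolphin"),
    to_episode("Which shape has three sides?", ["triangle", "square"], "triangle"),
]


def always_first(ep: Episode) -> int:
    """Stand-in policy that ignores the question and picks zone 0."""
    return 0


print(success_rate(episodes, always_first))  # 0.5
```

A question-blind policy such as `always_first` gives the chance-level floor that any knowledge-driven score has to beat.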

If this is right

VLAs perform adequately on basic commonsense but lose more ground on complex semantic categories than the VLMs they are derived from.
Including VQA data during fine-tuning is linked to higher retention rates across the tested categories.
Answer-relevant information concentrates in the middle layers of the VLM backbone and declines in the upper layers.
The action-grounded protocol provides a way to rank models by retained knowledge with reduced control confounds.
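
The layerwise pattern in the list above is the kind of result a per-layer probe produces. The sketch below uses a nearest-centroid probe as a simple stand-in; the paper's own "layerwise intent probing" is not specified on this page, so the probe type, the function names, and the toy activations are all assumptions made for illustration.

```python
from statistics import fmean
from typing import Dict, List, Sequence

Vector = Sequence[float]


def centroids(X: Sequence[Vector], y: Sequence[int]) -> Dict[int, List[float]]:
    """Mean activation vector for each answer label."""
    out: Dict[int, List[float]] = {}
    for label in sorted(set(y)):
        rows = [x for x, lab in zip(X, y) if lab == label]
        out[label] = [fmean(col) for col in zip(*rows)]
    return out


def probe_accuracy(train_X, train_y, test_X, test_y) -> float:
    """Fit centroids on training activations, score on held-out activations."""
    cents = centroids(train_X, train_y)

    def predict(x: Vector) -> int:
        # Pick the label whose centroid is closest in squared Euclidean distance.
        return min(cents, key=lambda lab: sum((a - b) ** 2
                                              for a, b in zip(x, cents[lab])))

    hits = sum(predict(x) == lab for x, lab in zip(test_X, test_y))
    return hits / len(test_y)


def layerwise_profile(acts_by_layer, labels, n_train):
    """Probe accuracy per layer; the first n_train examples are used for fitting."""
    return {
        layer: probe_accuracy(X[:n_train], labels[:n_train],
                              X[n_train:], labels[n_train:])
        for layer, X in acts_by_layer.items()
    }


# Toy one-dimensional activations, invented for illustration (not paper data).
labels = [0, 1, 0, 1, 0, 1]
acts_by_layer = {
    "middle": [[0.0], [1.0], [0.1], [0.9], [0.2], [0.8]],  # label is separable
    "upper": [[0.5]] * 6,                                   # label signal is gone
}
profile = layerwise_profile(acts_by_layer, labels, n_train=4)
print(profile)  # {'middle': 1.0, 'upper': 0.5}
```

In this toy setup the "upper" layer falls to chance because its activations carry no label information, which is the shape of the attenuation the paper reports, not a reproduction of it.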

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Designers of future VLAs could add objectives that protect middle-layer representations to limit knowledge loss during action fine-tuning.
The single-action format could be extended to multi-step sequences to test whether retained knowledge supports longer reasoning chains.
Robotics tasks that depend on factual or commonsense understanding may require explicit selection of VLA checkpoints that score high on Act2Answer-style probes.
The observed layer attenuation pattern suggests that knowledge extraction methods for VLAs should target intermediate rather than final layers.

Load-bearing premise

That requiring a single object-placement action sufficiently isolates knowledge retention from low-level control confounds so that success rate directly reflects retained information.

What would settle it

A VLA model that fails an Act2Answer episode but correctly answers the identical question when evaluated directly as a VLM would indicate that the failure stems from execution rather than missing knowledge.
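
One way to run this check is to pair each episode's action outcome with the same model's direct text answer and tally the four combinations. The category names and sample results below are hypothetical; the sketch only shows the bookkeeping, under the assumption that both evaluations use the identical question.

```python
from collections import Counter
from typing import Iterable, Tuple


def attribute(action_ok: bool, text_ok: bool) -> str:
    """Label one question by comparing the action result with the text answer."""
    if action_ok and text_ok:
        return "knowledge_and_execution_ok"
    if text_ok:
        return "execution_gap"      # knows the answer, fails to act on it
    if action_ok:
        return "action_only"        # possible luck or a visual shortcut
    return "missing_knowledge"


def tally(results: Iterable[Tuple[bool, bool]]) -> Counter:
    """Count outcome labels over (action_ok, text_ok) pairs."""
    return Counter(attribute(a, t) for a, t in results)


# Invented outcomes for four questions, for illustration only.
results = [(True, True), (False, True), (False, False), (False, True)]
counts = tally(results)
print(counts["execution_gap"])  # 2
```

A large `execution_gap` count relative to `missing_knowledge` would weaken the premise that success rate tracks retained knowledge.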

Figures

Figures reproduced from arXiv: 2606.19297 by Albina Burlova, Aleksandr I. Panov, Alexey K. Kovalev, Andrey Kuznetsov, Andrey Moskalenko, Daria Pugacheva, Denis Shepelev, Elena Tutubalina, Matvey Skripkin, Mikhail Kolosov, Nikita Kachaev, Nikita Kurlaev, Vlad Shakhuro.

**Figure 2.** Act2Answer episodes examples for testing VLA models, built on top of VLM benchmark questions. In … [PITH_FULL_IMAGE:figures/full_fig_p004_2.png]

**Figure 3.** Overview of the data curation pipeline used to construct the Act2Answer task suite from VLM benchmarks, … [PITH_FULL_IMAGE:figures/full_fig_p005_3.png]

**Figure 4.** Probing results for internal representations of … [PITH_FULL_IMAGE:figures/full_fig_p008_4.png]

**Figure 5.** Additional ACT2ANSWER environment examples from the EMOTION, CELEBRITY, and LIVING WORLD categories. [PITH_FULL_IMAGE:figures/full_fig_p014_5.png]

**Figure 6.** Additional ACT2ANSWER environment examples from the TIME, TRAFFIC, and PUBLIC INFO categories. [PITH_FULL_IMAGE:figures/full_fig_p014_6.png]

… social, biological, and culturally grounded categories, (ii) temporal and public-convention categories, and (iii) physical and quantitative categories.

**Figure 7.** Additional ACT2ANSWER environment examples from the ATTRIBUTE, STATE, COLOR, SYMMETRY, SHAPE, and COUNTING categories.

B.1 Near Format-Preserving Adaptations

MLLM-CompBench. MLLM-CompBench was the cleanest source benchmark for Act2Answer. Its native format already consists of two candidate images and a comparative question, making it closely aligned with our final embodied answer-selection setup. We there…

Original abstract

Embodied Vision-Language-Action (VLA) models are typically obtained by fine-tuning powerful pretrained VLMs on robotics data, yet it is unclear how much commonsense and factual knowledge they retain after adaptation. Failures on knowledge-sensitive tasks are ambiguous, conflating missing knowledge with poor generalization of low-level control. We introduce Act2Answer, a lightweight protocol that adapts VLM knowledge benchmarks to VLA evaluation by requiring agents to answer through action. Each question becomes a short tabletop episode where the agent performs a single object-placement action to select among candidate answers, yielding an action-grounded success rate with reduced control confounds. We curate a test suite of such environments across diverse commonsense and world-knowledge categories and introduce layerwise intent probing to localize answer-relevant information across the VLM backbone and action head. In a large-scale study of 7 VLA models and 9 VLM baselines, we systematically rank models across categories, finding that VLAs show solid performance on simple concepts while exhibiting larger gaps on richer semantic categories relative to their source VLMs, that VQA co-training is associated with better knowledge retention, and that answer-relevant signals peak in middle VLA layers but attenuate in upper layers. Act2Answer is available at https://tttonyalpha.github.io/act2answer/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

Private letter to a colleague.

Act2Answer turns VLM tests into single-action placements to check retained knowledge in VLAs, with results pointing to VQA co-training as helpful and middle-layer signals as strongest, but the isolation from control confounds rests on an assumption that needs more checks.

The main thing to know is that this paper introduces Act2Answer to convert standard knowledge benchmarks into short tabletop episodes where a VLA answers by placing one object among options. Their tests across seven VLAs and nine VLMs show decent performance on simple concepts, larger gaps on richer categories, better retention when VQA co-training is used, and answer signals peaking in middle layers before fading higher up.

The protocol and the layerwise probing are the concrete additions. Comparing the adapted models directly to their source VLMs and linking outcomes to training details gives a usable ranking that robotics groups could apply.

The soft spot is the assumption that one placement action sufficiently removes control and environmental confounds. The abstract claims reduced confounds but does not detail randomization, distractor handling, or baselines that would rule out scene statistics or visual shortcuts. Post-hoc category splits could also influence the size of the reported gaps.

This is for teams working on VLA adaptation who want a diagnostic for knowledge loss. It deserves peer review because the scale is reasonable and the protocol is reusable, though the methods will need scrutiny on controls and statistical reporting.

Referee Report

1 major / 0 minor

Summary. The paper introduces Act2Answer, a protocol that converts VLM knowledge benchmarks into short tabletop episodes where VLAs answer via a single object-placement action, claiming this yields an action-grounded success rate with reduced control confounds. In a study of 7 VLA models and 9 VLM baselines across commonsense and world-knowledge categories, it reports solid VLA performance on simple concepts but larger gaps on richer semantics relative to source VLMs, better retention with VQA co-training, and answer-relevant signals peaking in middle layers before attenuating in upper layers.

Significance. If the protocol validly isolates retained knowledge, the work provides a practical evaluation method for post-adaptation knowledge in embodied models and identifies actionable patterns (VQA co-training benefits, layerwise localization) that could inform VLA training. The large-scale comparative design across multiple models and categories is a strength.

major comments (1)

[Abstract] The central claim that a single object-placement action produces 'an action-grounded success rate with reduced control confounds' is load-bearing for all reported performance gaps and layerwise findings, yet the description provides no concrete controls (scene randomization, distractor objects, non-semantic baseline comparisons, or statistical tests ruling out placement heuristics and visual shortcuts). Without these, success rates could reflect environmental biases rather than retained VLM knowledge.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the thoughtful review and for recognizing the potential value of the Act2Answer protocol and the large-scale comparative study. We address the single major comment below and will incorporate revisions to strengthen the manuscript.

Referee: [Abstract] The central claim that a single object-placement action produces 'an action-grounded success rate with reduced control confounds' is load-bearing for all reported performance gaps and layerwise findings, yet the description provides no concrete controls (scene randomization, distractor objects, non-semantic baseline comparisons, or statistical tests ruling out placement heuristics and visual shortcuts). Without these, success rates could reflect environmental biases rather than retained VLM knowledge.

Authors: We agree that the abstract is concise and omits explicit mention of the controls. Section 3 of the manuscript details the protocol, including scene randomization across episodes, inclusion of distractor objects, non-semantic baseline comparisons (e.g., random placement and color-based heuristics), and statistical tests (binomial tests and permutation tests) to rule out placement heuristics and visual shortcuts. We will revise the abstract to briefly reference these controls and point to Section 3 for details, ensuring the claim is better supported at the abstract level.

Revision: yes
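
The rebuttal mentions binomial tests against non-semantic baselines. The page does not show how the manuscript runs them, so the following is only a sketch of one common form: an exact one-sided binomial test of a success count against a fixed chance rate. The function name and the example numbers are hypothetical.

```python
from math import comb


def binomial_p_value(successes: int, trials: int, chance: float) -> float:
    """Exact one-sided p-value: P(X >= successes) for X ~ Binomial(trials, chance)."""
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")
    if not 0.0 < chance < 1.0:
        raise ValueError("chance must lie strictly between 0 and 1")
    return sum(
        comb(trials, k) * chance ** k * (1.0 - chance) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# Invented example: 8 correct placements in 10 two-candidate episodes,
# compared with a random-placement baseline (chance = 0.5).
p = binomial_p_value(8, 10, 0.5)
print(p)  # 0.0546875
```

With only 10 episodes, 8 successes is not significant at the 0.05 level, which shows why the number of episodes per category matters when reading the reported gaps.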

Circularity Check

0 steps flagged

Empirical benchmarking study with no derivation chain

The paper introduces an evaluation protocol (Act2Answer) that converts VLM benchmarks into single-action tabletop episodes and reports observed success rates and layerwise probing results across 7 VLAs and 9 VLMs. No equations, fitted parameters presented as predictions, or self-citation chains appear in the abstract or described methodology; all claims are framed as direct empirical outcomes rather than reductions to prior inputs by construction. The central assumption about reduced control confounds is an experimental design choice, not a self-referential derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper relies on standard assumptions in AI benchmarking that curated question-to-action mappings measure the intended knowledge categories and that the chosen VLM baselines are appropriate comparators; no free parameters or invented entities are introduced.

axioms (1)

Domain assumption: The curated tabletop episodes accurately map knowledge questions to object-placement actions without introducing new biases or control demands that differ across models.
The abstract states that the protocol reduces control confounds but gives no further detail on episode construction or validation.

pith-pipeline@v0.9.1-grok · 5833 in / 1384 out tokens · 33466 ms · 2026-06-26T20:53:30.068582+00:00 · methodology


2016

[26] [30]

Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

Towards vqa models that can read , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

[27] [31]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

[28] [32]

The 36th Conference on Neural Information Processing Systems (NeurIPS) , year=

Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering , author=. The 36th Conference on Neural Information Processing Systems (NeurIPS) , year=

[29] [33]

Conference on Robot Learning , pages=

Rt-2: Vision-language-action models transfer web knowledge to robotic control , author=. Conference on Robot Learning , pages=. 2023 , organization=

2023

[30] [39]

arXiv preprint arXiv:2504.16054 , year=

pi\_ \ 0.5 \ : a Vision-Language-Action Model with Open-World Generalization , author=. arXiv preprint arXiv:2504.16054 , year=

Pith/arXiv arXiv

[31] [40]

Advances in Neural Information Processing Systems , year =

IconQA: A New Benchmark for Abstract Diagram Understanding and Visual Language Reasoning , author =. Advances in Neural Information Processing Systems , year =

[32] [41]

European Conference on Computer Vision , year =

MMBench: Is Your Multi-modal Model an All-around Player? , author =. European Conference on Computer Vision , year =

[33] [42]

IEEE/CVF Conference on Computer Vision and Pattern Recognition , year =

OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge , author =. IEEE/CVF Conference on Computer Vision and Pattern Recognition , year =

[34] [44]

Optical Memory and Neural Networks , volume=

Spatial traces: Enhancing vla models with spatial-temporal understanding , author=. Optical Memory and Neural Networks , volume=. 2025 , publisher=

2025
