pith. machine review for the scientific record.

arxiv: 2604.04733 · v2 · submitted 2026-04-06 · 💻 cs.CV · cs.AI

Recognition: 2 theorem links


Discovering Failure Modes in Vision-Language Models using RL


Pith reviewed 2026-05-10 19:17 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords vision-language models · reinforcement learning · failure modes · blind spots · automatic discovery · multimodal evaluation · adaptive querying

The pith

An RL questioner agent automatically uncovers vision-language model blind spots by adapting queries to force errors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that a reinforcement learning framework can train a questioner to probe any vision-language model for weaknesses without manual effort. The agent starts with simple questions and escalates complexity by focusing on fine-grained visual details and skill combinations, based on the model's prior responses. This process surfaces failure modes in tasks like counting, spatial reasoning, and viewpoint understanding that human inspection often misses due to bias toward salient objects. If the method works, it offers a scalable, unbiased way to map VLM vulnerabilities across different models and data distributions.

Core claim

We propose a Reinforcement Learning-based framework to automatically discover the failure modes or blind spots of any candidate VLM on a given data distribution without human intervention. Our framework trains a questioner agent that adaptively generates queries based on the candidate VLM's responses to elicit incorrect answers, increasing question complexity by focusing on fine-grained visual details and distinct skill compositions as training progresses.

What carries the argument

The RL-trained questioner agent that conditions new queries on the candidate VLM's previous answers to drive progressive increases in query complexity and error elicitation.
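To make that mechanism concrete, here is a minimal sketch of the adaptive probing loop as described above; the interfaces (`questioner.generate`, `candidate_vlm.answer`, `verifier.check`) are hypothetical stand-ins, not the paper's actual API.

```python
# Hypothetical sketch of the adaptive probing loop, not the paper's code.
def probe(image, questioner, candidate_vlm, verifier, num_rounds=8):
    """Query the candidate VLM repeatedly, conditioning each new question
    on the history of its answers so probes can escalate toward the
    fine-grained details and skill combinations it handles poorly."""
    history, failures = [], []
    for _ in range(num_rounds):
        question = questioner.generate(image, history)  # sees past answers
        answer = candidate_vlm.answer(image, question)
        correct = verifier.check(image, question, answer)
        if not correct:
            failures.append((question, answer))  # candidate failure mode
        history.append((question, answer, correct))
    return failures
```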

If this is right

  • VLMs exhibit previously undocumented weaknesses in skill compositions that become visible only through adaptive, escalating probes.
  • The same framework applies across different VLM candidates without retraining the questioner from scratch.
  • Failure mode discovery no longer depends on human-selected test cases or salient-object bias.
  • Models can be ranked and improved by the specific failure modes the agent surfaces on a shared distribution.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The approach could be extended to other modalities or model types by swapping the candidate and keeping the questioner structure.
  • Discovered modes might guide targeted data collection or fine-tuning to close specific gaps rather than broad retraining.
  • If the questioner generalizes well, it could serve as a standard diagnostic tool run before deploying VLMs in new domains.

Load-bearing premise

The generated questions reveal genuine, generalizable model failures instead of artifacts created by the questioner's own training biases or limited data distribution.

What would settle it

Evaluating the discovered failure modes on held-out image-question pairs or with human raters: if the errors disappear outside the questioner's training loop, or if random non-adaptive queries surface the same modes at similar rates, the discovered failures are artifacts of the questioner rather than genuine model blind spots.
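A minimal sketch of that settling experiment, assuming a held-out probe set, a non-adaptive baseline questioner, and the same hypothetical `candidate_vlm`/`verifier` interfaces as in the loop sketch above:

```python
def failure_rate(probes, candidate_vlm, verifier):
    """Fraction of held-out (image, question) probes the candidate fails."""
    wrong = sum(
        not verifier.check(image, question, candidate_vlm.answer(image, question))
        for image, question in probes
    )
    return wrong / max(len(probes), 1)

# If rl_rate collapses toward random_rate on held-out probes, the discovered
# modes were likely questioner artifacts rather than genuine blind spots:
#   rl_rate = failure_rate(rl_probes, vlm, verifier)
#   random_rate = failure_rate(random_probes, vlm, verifier)
```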

Figures

Figures reproduced from arXiv: 2604.04733 by Aishwarya Agrawal, Kanishk Jain, Nishanth Anand, Parisa Kordjamshidi, Qian Yang, Shravan Nayak.

Figure 1: Overview of our framework for automatically discovering failure modes in VLMs.
Figure 2: Our approach consists of three models.
Figure 3: The verification pipeline for question and answer verification.
Figure 4: The failure taxonomy pipeline consists of four stages, beginning with identification of primitives.
Figure 5: Per-meta-skill comparison of Failure Rate, Semantic Diversity, and Lexical Diversity.
Figure 6: Number of unique skills explored by each method.
Figure 7: Analysis of skill diversity and question complexity across methods.
Figure 8: Comparison of cumulative failure discovery rates across different methods.
Figure 9: Meta-skill density during RL training. Dense, continuous bands indicate sustained probing.
Figure 10: Evolution of RL-generated questions for a static image at different stages of training.
original abstract

Vision-language Models (VLMs), despite achieving strong performance on multimodal benchmarks, often misinterpret straightforward visual concepts that humans identify effortlessly, such as counting, spatial reasoning, and viewpoint understanding. Previous studies manually identified these weaknesses and found that they often stem from deficits in specific skills. However, such manual efforts are costly, unscalable, and subject to human bias, which often overlooks subtle details in favour of salient objects, resulting in an incomplete understanding of a model's vulnerabilities. To address these limitations, we propose a Reinforcement Learning (RL)-based framework to automatically discover the failure modes or blind spots of any "candidate VLM" on a given data distribution without human intervention. Our framework trains a questioner agent that adaptively generates queries based on the candidate VLM's responses to elicit incorrect answers. Our approach increases question complexity by focusing on fine-grained visual details and distinct skill compositions as training progresses, consequently identifying novel failure modes in which VLMs struggle. We demonstrate the broad applicability of our framework by showcasing its generalizability across various model combinations.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, a circularity audit, and an axiom ledger. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes an RL-based framework to automatically discover failure modes in vision-language models (VLMs) by training a questioner agent that adaptively generates queries to elicit incorrect answers from a candidate VLM. The agent increases query complexity over training by focusing on fine-grained visual details and skill compositions, with the goal of identifying novel blind spots (e.g., in counting, spatial reasoning, viewpoint understanding) without human intervention. The authors claim broad applicability by demonstrating generalizability across various model combinations.

Significance. If the generated queries are shown to be semantically coherent and visually grounded, the framework could offer a scalable alternative to manual failure-mode discovery, reducing human bias and enabling systematic exploration of VLM vulnerabilities. The adaptive RL approach represents a potentially useful methodological contribution for automated evaluation of multimodal models.

major comments (3)
  1. [§3] §3 (RL Framework): The reward is defined solely as the negative of the candidate VLM's accuracy on generated queries, with no explicit term for query validity, naturalness, or distributional similarity to human questions. This leaves open the possibility that the questioner converges on malformed or out-of-distribution prompts that any model would fail, rather than revealing genuine capability deficits. A human validation study or distributional matching loss would be required to substantiate the central claim (a minimal formalization of this contrast is sketched after this list).
  2. [§4] §4 (Experiments): While qualitative examples of discovered failure modes are presented, there are no quantitative metrics (e.g., human-rated query coherence scores, comparison of failure-mode coverage against human-identified baselines, or cross-model generalization statistics) that demonstrate the modes are novel and not training artifacts. Ablation on the progressive complexity schedule is also absent.
  3. [§5] §5 (Generalizability): The claim that the framework works across 'various model combinations' is supported only by a small number of VLM pairs; no statistical analysis or larger-scale transfer experiments are reported to support broad applicability.
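To make major comment 1 concrete, here is a hedged reconstruction of the contrast the referee draws: a reward paid purely for candidate errors versus one that also prices query validity. The paper's actual reward may differ; `validity`, `lam`, and both function names are hypothetical.

```python
def reward_error_only(candidate_correct: bool) -> float:
    # The referee's reading of §3: the questioner is rewarded whenever the
    # candidate errs, even on malformed or out-of-distribution questions.
    return 0.0 if candidate_correct else 1.0

def reward_with_validity(candidate_correct: bool, validity: float,
                         lam: float = 0.5) -> float:
    # validity in [0, 1], e.g. a human- or model-rated score for coherence
    # and visual grounding; lam trades error elicitation against validity.
    base = 0.0 if candidate_correct else 1.0
    return base - lam * (1.0 - validity)
```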
minor comments (2)
  1. [Abstract] The abstract states that the method 'increases question complexity... as training progresses' but the precise mechanism (e.g., curriculum schedule or reward shaping) is not summarized; a one-sentence description would improve clarity.
  2. [§2] Notation for the questioner policy and VLM response function is introduced without an explicit equation reference in the early sections; adding a compact notation table would aid readability.

Simulated Authors' Rebuttal

3 responses · 1 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below and describe the revisions we will make to strengthen the manuscript.

point-by-point responses
  1. Referee: §3 (RL Framework): The reward is defined solely as the negative of the candidate VLM's accuracy on generated queries, with no explicit term for query validity, naturalness, or distributional similarity to human questions. This leaves open the possibility that the questioner converges on malformed or out-of-distribution prompts that any model would fail, rather than revealing genuine capability deficits. A human validation study or distributional matching loss would be required to substantiate the central claim.

    Authors: We acknowledge that the reward function relies solely on the negative accuracy of the candidate VLM and does not include an explicit penalty or reward for query naturalness or distributional alignment with human questions. The questioner is conditioned on the input image and prior responses, which provides some grounding in the visual content and data distribution. However, this does not fully address the concern about potential malformed outputs. In the revised manuscript, we will add a human validation study evaluating the semantic coherence, visual grounding, and naturalness of a sample of generated queries, along with a discussion of how the image-conditioned generation helps mitigate out-of-distribution issues. revision: yes

  2. Referee: §4 (Experiments): While qualitative examples of discovered failure modes are presented, there are no quantitative metrics (e.g., human-rated query coherence scores, comparison of failure-mode coverage against human-identified baselines, or cross-model generalization statistics) that demonstrate the modes are novel and not training artifacts. Ablation on the progressive complexity schedule is also absent.

    Authors: We agree that quantitative metrics would provide stronger evidence that the discovered failure modes are novel rather than artifacts. The current manuscript emphasizes qualitative examples to illustrate the types of blind spots found. In the revision, we will incorporate human-rated coherence scores for generated queries, a comparison of failure-mode coverage against a human-identified baseline set, and an ablation study removing the progressive complexity schedule to quantify its contribution to discovering more challenging failure modes. revision: yes

  3. Referee: §5 (Generalizability): The claim that the framework works across 'various model combinations' is supported only by a small number of VLM pairs; no statistical analysis or larger-scale transfer experiments are reported to support broad applicability.

    Authors: The manuscript reports results on multiple representative VLM pairs to illustrate applicability, with consistent patterns in the types of failure modes discovered. We recognize that a larger number of pairs and formal statistical tests would strengthen the generalizability claim. In the revised version, we will add statistical analysis (e.g., variance across pairs) of the existing results and expand the discussion of limitations and scope of applicability. Due to computational constraints, we are unable to run substantially larger-scale transfer experiments within the revision timeline. revision: partial

standing simulated objections (unresolved)
  • Larger-scale transfer experiments across many additional VLM pairs, as expanding the experimental scope significantly exceeds available computational resources for the revision.

Circularity Check

0 steps flagged

No significant circularity; procedural framework with no self-referential derivations

full rationale

The paper proposes an RL-based procedural framework to train a questioner agent that generates queries eliciting VLM errors, with failure modes identified empirically from the resulting queries and responses. No equations, fitted parameters, or mathematical derivations are described that reduce to their own inputs by construction. The approach relies on standard RL training dynamics rather than any self-definition, fitted-input-as-prediction, or load-bearing self-citation chains. No uniqueness theorems, ansatzes, or renamings of known results are invoked in a circular manner. The central claim of discovering novel failure modes is an empirical outcome of the method, not a tautological reduction; it stands or falls against external benchmarks rather than against its own definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No explicit free parameters, axioms, or invented entities are stated in the abstract; the method implicitly assumes RL can optimize for informative queries without external validation.

pith-pipeline@v0.9.0 · 5495 in / 992 out tokens · 37503 ms · 2026-05-10T19:17:40.648510+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

32 extracted references · 10 canonical work pages · 6 internal anchors

  1. Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pages 2425–2433, 2015.

  2. Andy Clark. Whatever next? Predictive brains, situated agents, and the future of cognitive science. Behavioral and Brain Sciences, 36(3):181–204, 2013.

  3. Yue Fan, Xuehai He, Diji Yang, Kaizhi Zheng, Ching-Chen Kuo, Yuting Zheng, Sravana Jyothi Narayanaraju, Xinze Guan, and Xin Eric Wang. GRIT: Teaching MLLMs to think with images. arXiv preprint arXiv:2505.15879, 2025.

  4. Dan Friedman and Adji Bousso Dieng. The Vendi score: A diversity evaluation metric for machine learning. Transactions on Machine Learning Research, 2023. ISSN 2835-8856.

  5. Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, et al. MME: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023.

  6. Ling Fu, Zhebin Kuang, Jiajun Song, Mingxin Huang, Biao Yang, Yuzhe Li, Linghao Zhu, Qidi Luo, Xinyu Wang, Hao Lu, et al. OCRBench v2: An improved benchmark for evaluating large multimodal models on visual text localization and reasoning. arXiv preprint arXiv:2501.00321, 2024.

  7. Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A. Smith, Wei-Chiu Ma, and Ranjay Krishna. BLINK: Multimodal large language models can see but not perceive. In European Conference on Computer Vision, pages 148–166. Springer, 2024.

  8. Adam Gleave, Michael Dennis, Cody Wild, Neel Kant, Sergey Levine, and Stuart Russell. Adversarial policies: Attacking deep reinforcement learning. arXiv preprint arXiv:1905.10615, 2019.

  9. Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4D: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18995–19012, 2022.

  10. Maarten Grootendorst. BERTopic: Neural topic modeling with a class-based TF-IDF procedure. arXiv preprint arXiv:2203.05794, 2022.

  11. Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, Dinesh Manocha, and Tianyi Zhou. HallusionBench: An advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.

  12. Yicheng He, Chengsong Huang, Zongxia Li, Jiaxin Huang, and Yonghui Yang. VisPlay: Self-evolving vision-language models from images. arXiv preprint arXiv:2511.15661, 2025.

  13. Zhang-Wei Hong, Idan Shenfeld, Tsun-Hsuan Wang, Yung-Sung Chuang, Aldo Pareja, James Glass, Akash Srivastava, and Pulkit Agrawal. Curiosity-driven red-teaming for large language models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=4KqkizXgXU.

  14. Irene Huang, Wei Lin, M. Jehanzeb Mirza, Jacob A. Hansen, Sivan Doveh, Victor Ion Butoi, Roei Herzig, Assaf Arbelle, Hilde Kuehne, Trevor Darrell, et al. ConMe: Rethinking evaluation of compositional reasoning for modern VLMs. arXiv preprint arXiv:2406.08164, 2024.

  15. Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, Quan Vuong, Thomas Kollar, Benjamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, and Chelsea Finn. OpenVLA: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024.

  16. Bohao Li, Yuying Ge, Yixiao Ge, Guangzhi Wang, Rui Wang, Ruimao Zhang, and Ying Shan. SEED-Bench: Benchmarking multimodal large language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13299–13308, 2024.

  17. Yijiang Li, Qingying Gao, Tianwei Zhao, Bingyang Wang, Haoran Sun, Haiyun Lyu, Robert D. Hawkins, Nuno Vasconcelos, Tal Golan, Dezhi Luo, et al. Core knowledge deficits in multimodal language models. In International Conference on Machine Learning, pages 34379–34409. PMLR, 2025.

  18. Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.

  19. Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. MMBench: Is your multi-modal model an all-around player? In European Conference on Computer Vision, pages 216–233. Springer, 2024.

  20. Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik. EgoSchema: A diagnostic benchmark for very long-form video language understanding. Advances in Neural Information Processing Systems, 36:46212–46244, 2023.

  21. Ahmed Masry, Mohammed Saidul Islam, Mahir Ahmed, Aayush Bajaj, Firoz Kabir, Aaryaman Kartha, Md Tahmid Rahman Laskar, Mizanur Rahman, Shadikur Rahman, Mehrad Shahmohammadi, et al. ChartQAPro: A more diverse and challenging benchmark for chart question answering. In Findings of the Association for Computational Linguistics: ACL 2025, pages 19123–19151, 2025.

  22. Pooyan Rahmanzadehgervi, Logan Bolton, Mohammad Reza Taesiri, and Anh Totti Nguyen. Vision language models are blind. In Proceedings of the Asian Conference on Computer Vision, pages 18–34, 2024.

  23. Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.

  24. Haozhan Shen, Peng Liu, Jingcheng Li, Chunxin Fang, Yibo Ma, Jiajia Liao, Qiaoli Shen, Zilun Zhang, Kangjia Zhao, Qianqian Zhang, et al. VLM-R1: A stable and generalizable R1-style large vision-language model. arXiv preprint arXiv:2504.07615, 2025.

  25. Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, and Saining Xie. Eyes wide shut? Exploring the visual shortcomings of multimodal LLMs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9568–9578, 2024.

  26. Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, and Ming Zhou. MiniLM: Deep self-attention distillation for task-agnostic compression of pre-trained transformers. Advances in Neural Information Processing Systems, 33:5776–5788, 2020.

  27. Penghao Wu and Saining Xie. V*: Guided visual search as a core mechanism in multimodal LLMs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13084–13094, 2024.

  28. Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. In The 2023 Conference on Empirical Methods in Natural Language Processing, 2023. URL https://openreview.net/forum?id=xozJw0kZXF.

  29. Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. MM-Vet: Evaluating large multimodal models for integrated capabilities. In International Conference on Machine Learning. PMLR, 2024.

  30. Xiang Yue, Tianyu Zheng, Yuansheng Ni, Yubo Wang, Kai Zhang, Shengbang Tong, Yuxuan Sun, Botao Yu, Ge Zhang, Huan Sun, et al. MMMU-Pro: A more robust multi-discipline multimodal understanding benchmark. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15134–15186, 2025.

  31. Yunqing Zhao, Tianyu Pang, Chao Du, Xiao Yang, Chongxuan Li, Ngai-Man Cheung, and Min Lin. On evaluating adversarial robustness of large vision-language models. Advances in Neural Information Processing Systems, 36:54111–54138, 2023.

  32. Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. RT-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, pages 2165–2183. PMLR, 2023.