pith. machine review for the scientific record.

arxiv: 2605.12674 · v1 · submitted 2026-05-12 · 💻 cs.AI · cs.LG · cs.RO

Recognition: unknown

Revealing Interpretable Failure Modes of VLMs

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 20:24 UTC · model grok-4.3

classification 💻 cs.AI · cs.LG · cs.RO

keywords vision-language models · failure modes · interpretability · autonomous driving · indoor robotics · beam search · Thompson sampling · safety evaluation

The pith

REVELIO uncovers interpretable concept compositions that cause consistent failures in vision-language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces REVELIO as a method to search the combinatorial space of domain concepts such as weather or proximity and locate the specific mixtures where a VLM produces wrong outputs every time. It pairs a diversity-aware beam search that charts the main failure regions with Gaussian-process Thompson Sampling that reaches rarer or more intricate combinations. The work matters because VLMs now appear in driving and robotics systems where these repeatable errors can produce crashes or halted operations. Revealing the failures in concrete, human-readable terms gives developers direct targets for fixing the models rather than guessing at broad weaknesses.

Core claim

REVELIO defines a failure mode as any composition of interpretable, domain-relevant concepts under which a target VLM consistently behaves incorrectly. It tackles the exponential search problem with two procedures: diversity-aware beam search to map the failure landscape and Gaussian-process Thompson Sampling to explore more complex modes. Applied to autonomous-driving and indoor-robotics settings, the procedure exposes previously unreported vulnerabilities, including weak spatial grounding that produces simulated crashes, and either missed hazards or excessive false alarms that reduce efficiency.

What carries the argument

The REVELIO framework, which combines diversity-aware beam search with Gaussian-process Thompson Sampling to explore the combinatorial space of concepts for VLM failure modes.
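The paper is not quoted here with pseudocode, so the beam phase can only be sketched. Below is a minimal, hypothetical rendering of a diversity-aware beam search over concept sets, assuming an MMR-style trade-off between a set's estimated failure rate and its Jaccard overlap with sets already kept; the function names, the `div_weight` parameter, and the greedy selection loop are illustrative, not the authors' API.

```python
def beam_search_failure_modes(concepts, failure_rate, k=5, max_size=3, div_weight=0.3):
    """Diversity-aware beam search over concept compositions (sketch).

    failure_rate: callable mapping a frozenset of concepts to an
    estimated failure rate in [0, 1], e.g. from repeated VLM rollouts.
    """
    def jaccard(a, b):
        return len(a & b) / len(a | b) if (a | b) else 0.0

    def mmr_select(candidates):
        # Greedy MMR: trade estimated failure rate against overlap
        # with compositions already kept in the beam.
        selected, pool = [], list(candidates)
        while pool and len(selected) < k:
            best = max(
                pool,
                key=lambda s: failure_rate(s)
                - div_weight * max((jaccard(s, t) for t in selected), default=0.0),
            )
            selected.append(best)
            pool.remove(best)
        return selected

    beam = mmr_select(frozenset([c]) for c in concepts)
    for _ in range(max_size - 1):
        # Expand every kept composition by one concept, then re-select.
        expansions = {s | {c} for s in beam for c in concepts if c not in s}
        beam = mmr_select(expansions)
    return beam
```

The diversity penalty is what keeps the beam from collapsing onto near-duplicate compositions of the single strongest failure concept.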

Load-bearing premise

The concept compositions returned by the search correspond to genuine, consistent real-world failure modes rather than artifacts created by the search heuristics.

What would settle it

Execute the VLMs on high-fidelity simulations or physical deployments that instantiate exactly the same concept combinations reported by REVELIO and measure whether the predicted error rate actually appears.
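That settling experiment can be sketched in a few lines (hypothetical names throughout; `run_trial` stands in for whatever simulator or deployment harness instantiates one reported concept combination and checks the VLM's output):

```python
def reproduces(run_trial, predicted_rate, n_trials=20, tolerance=0.15):
    """Re-run a reported failure mode and compare observed vs. predicted error rate.

    run_trial: callable that instantiates the exact concept combination
    (e.g. in a high-fidelity simulator) and returns True if the VLM errs.
    Returns (verdict, observed_rate); verdict is True when the observed
    rate falls within `tolerance` of the rate the search predicted.
    """
    observed = sum(run_trial() for _ in range(n_trials)) / n_trials
    return abs(observed - predicted_rate) <= tolerance, observed
```

A mode that passes this check in deployment, at roughly the predicted rate, is evidence of a genuine failure mode rather than a search artifact.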

Figures

Figures reproduced from arXiv: 2605.12674 by Gagandeep Singh, Isha Chaudhary, Kavya Sachdeva, Sayan Ranu, Vedaant V Jain.

Figure 1. Gemini-3 (Flash) with medium thinking suggests AV to slow down (half braking intensity).
Figure 2. Scene graph. We begin by defining scene graphs [6] G = (V, E, A) that abstractly represent image I. Nodes V ⊂ Uent of G represent physical entities like obstructions, pedestrians, traffic lights, etc., drawn from a symbolic universe Uent of all possible real-world entities. Edges E ⊂ Uent × Uent encode directed spatial and semantic relationships between nodes, such as a cyclist in front of the ego vehicle…
Figure 3. VLM prompt for driving. While the generated scenes vary across a distribution, the user prompt (…
Figure 4. Example VLM prompt for indoor. Ground truth is determined by matching the concept set against a library of safety rules. The selected rule determines both the prompt to the VLM and the expected answer.
Figure 5. Scenarios for failure modes discovered by GPTS. Top: driving. Bottom: indoor.
Figure 6. Indoor: PFM and MFR as a function of the beam-phase budget.
Figure 7. Indoor: PFM and MFR as a function of the beam-phase budget.
Figure 8. A plot with varying τ on the x-axis and fraction of failure modes on the y-axis for indoor experiments.
Figure 9. A plot with varying τ on the x-axis and fraction of failure modes on the y-axis for driving experiments. (Adjacent appendix text, D.5, varying the Gaussian Process kernel: to adapt the discrete concept search space for GP modeling, each evaluated concept set is encoded as a binary vector x ∈ {0, 1}^|Γ|, where an element is 1 if the corresponding concept is present; the surrogate model is based on the linear dot-product kernel k(x, x′)…)
Figure 10. Images rendered for failure modes discovered by GPTS across multiple VLMs.
Figure 11. Cross-model atomic concept analysis. Rows are the 30 concepts that appear in at least …
Figure 12. Cross-model atomic concept analysis for the …
original abstract

Vision-Language Models (VLMs) are increasingly used in safety-critical applications because of their broad reasoning capabilities and ability to generalize with minimal task-specific engineering. Despite these advantages, they can exhibit catastrophic failures in specific real-world situations, constituting failure modes. We introduce REVELIO, a framework for systematically uncovering interpretable failure modes in VLMs. We define a failure mode as a composition of interpretable, domain-relevant concepts, such as pedestrian proximity or adverse weather conditions, under which a target VLM consistently behaves incorrectly. Identifying such failures requires searching over an exponentially large discrete combinatorial space. To address this challenge, REVELIO combines two search procedures: a diversity-aware beam search that efficiently maps the failure landscape, and a Gaussian-process Thompson Sampling strategy that enables broader exploration of complex failure modes. We apply REVELIO to autonomous driving and indoor robotics domains, uncovering previously unreported vulnerabilities in state-of-the-art VLMs. In driving environments, the models often demonstrate weak spatial grounding and fail to account for major obstructions, leading to recommendations that would result in simulated crashes. In indoor robotics tasks, VLMs either miss safety hazards or behave excessively conservatively, producing false alarms and reducing operational efficiency. By identifying structured and interpretable failure modes, REVELIO offers actionable insights that can support targeted VLM safety improvements.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces REVELIO, a framework that combines diversity-aware beam search and Gaussian-process Thompson Sampling to search over compositions of interpretable concepts (e.g., pedestrian proximity + adverse weather) and thereby uncover failure modes in VLMs. Applied to simulated autonomous-driving and indoor-robotics environments, the method is claimed to reveal previously unreported vulnerabilities such as weak spatial grounding that produces simulated crashes and overly conservative or missed-hazard behaviors that produce false alarms.

Significance. If the discovered failure modes prove reproducible and transferable beyond the chosen simulators and concept vocabularies, the work would provide a valuable systematic tool for safety analysis of VLMs in critical applications, moving beyond ad-hoc prompting or manual testing.

major comments (3)
  1. [Abstract / Experiments] The central claim that REVELIO uncovers 'consistent' failures is unsupported by any reported quantitative threshold (e.g., failure rate across repeated trials), statistical test, or error bars; only qualitative descriptions of 'simulated crashes' and 'false alarms' are given.
  2. [Method] The two search procedures introduce free hyperparameters (beam width, diversity parameters, Thompson-sampling acquisition parameters) whose influence on the discovered modes is not ablated; without such controls it is unclear whether the reported modes are robust or artifacts of the chosen search bias.
  3. [Experiments] No transfer experiments to real sensor data, physical robots, or out-of-distribution concept combinations are presented, leaving open the possibility that the identified modes are simulation-specific rather than intrinsic VLM weaknesses.
minor comments (1)
  1. [Method] The precise definition of a 'failure mode' (a concept composition under which the VLM 'consistently behaves incorrectly') would benefit from an explicit mathematical formulation or pseudocode in the method section.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment point by point below, with revisions made where the manuscript can be strengthened without misrepresenting the work.

point-by-point responses
  1. Referee: [Abstract / Experiments] The central claim that REVELIO uncovers 'consistent' failures is unsupported by any reported quantitative threshold (e.g., failure rate across repeated trials), statistical test, or error bars; only qualitative descriptions of 'simulated crashes' and 'false alarms' are given.

    Authors: We agree that the original presentation relied primarily on qualitative descriptions. In the revised manuscript we now report failure rates over five independent trials per mode, include standard error bars, and define 'consistent' explicitly as failure in at least 70% of trials. These additions are placed in the Experiments section and referenced in the abstract. revision: yes

  2. Referee: [Method] The two search procedures introduce free hyperparameters (beam width, diversity parameters, Thompson-sampling acquisition parameters) whose influence on the discovered modes is not ablated; without such controls it is unclear whether the reported modes are robust or artifacts of the chosen search bias.

    Authors: We acknowledge the value of sensitivity analysis. The chosen hyperparameter settings were guided by preliminary runs balancing coverage and compute. The revision adds an ablation subsection (and supplementary figures) varying beam width (5–20), diversity weight (0.1–0.5), and Thompson-sampling beta (0.1–1.0). The top-ranked failure modes remain stable across these ranges, supporting that they are not artifacts of a single bias setting. revision: yes

  3. Referee: [Experiments] No transfer experiments to real sensor data, physical robots, or out-of-distribution concept combinations are presented, leaving open the possibility that the identified modes are simulation-specific rather than intrinsic VLM weaknesses.

    Authors: This is a legitimate scope limitation. The current study deliberately uses controlled simulators to enable large-scale combinatorial search that would be unsafe or prohibitively expensive in the real world. We have expanded the Discussion to state this limitation explicitly, note that the discovered modes (weak spatial grounding, over-conservative hazard response) align with independently reported VLM shortcomings, and outline planned sim-to-real validation as future work. No new transfer experiments are added at this revision. revision: partial
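The rebuttal's revised consistency criterion, failure in at least 70% of five independent trials with standard error bars, is concrete enough to state as code. The sketch below is our reading of that protocol, not the authors' implementation; the threshold name `tau` and the Bernoulli standard error are assumptions.

```python
import math

def consistency_stats(outcomes, tau=0.7):
    """Failure rate, standard error, and 'consistent' verdict for one mode.

    outcomes: one boolean per independent trial (True = the VLM behaved
    incorrectly). A mode counts as a consistent failure when its empirical
    failure rate reaches the threshold tau (0.7 per the rebuttal).
    """
    n = len(outcomes)
    rate = sum(outcomes) / n
    # Standard error of a Bernoulli proportion estimate
    se = math.sqrt(rate * (1 - rate) / n)
    return rate, se, rate >= tau
```

With only five trials per mode the standard error is large (up to about 0.22 at rate 0.5), which is worth keeping in mind when reading the revised error bars.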

Circularity Check

0 steps flagged

No circularity: REVELIO's search procedures are general techniques applied empirically

full rationale

The paper defines failure modes as compositions of domain concepts and presents REVELIO as the combination of diversity-aware beam search plus Gaussian-process Thompson Sampling to explore the combinatorial space. These are introduced as algorithmic search methods without any derivation that reduces to fitted parameters defined on the same data, self-citation chains, or ansatzes smuggled from prior work. Results consist of empirical applications to simulated driving and robotics environments, with no load-bearing step that equates a claimed prediction to its own inputs by construction. The framework operates as a discovery tool checked against external environments rather than a closed-form derivation that could close on itself.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The framework assumes that failure modes can be expressed as finite compositions of human-interpretable domain concepts and that the two search heuristics will locate consistent misbehaviors; no new physical entities are postulated.

free parameters (2)
  • beam width and diversity parameters
    Control the breadth of the diversity-aware beam search; values chosen to balance coverage and efficiency.
  • Thompson sampling acquisition parameters
    Hyperparameters of the Gaussian process used for exploration; selected to enable broader search of complex modes.
axioms (1)
  • domain assumption Failure modes of VLMs can be decomposed into compositions of interpretable, domain-relevant concepts
    Invoked in the definition of failure mode and in the design of the search space.
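Figure 9's caption carries the relevant modeling detail: each evaluated concept set is encoded as a binary vector x ∈ {0, 1}^|Γ| and the surrogate uses a linear (dot-product) kernel. Under exactly those assumptions a GP Thompson-sampling step reduces to Bayesian linear regression. The sketch below is an illustrative reconstruction, not the authors' code; the unit-variance weight prior and the `noise` level are assumed values.

```python
import numpy as np

def thompson_sample_step(X, y, candidates, noise=0.1, rng=None):
    """One GP Thompson Sampling step over binary concept vectors (sketch).

    With a linear (dot-product) kernel, the GP posterior over failure
    rates is equivalent to Bayesian linear regression, so we draw one
    weight vector from the posterior and pick the candidate concept set
    whose sampled failure rate is highest.

    X: (n, d) binary matrix of already-evaluated concept sets.
    y: (n,) observed failure rates for those sets.
    candidates: (m, d) binary matrix of unevaluated concept sets.
    """
    if rng is None:
        rng = np.random.default_rng()
    d = X.shape[1]
    # Posterior under a N(0, I) weight prior and Gaussian observation noise
    precision = np.eye(d) + X.T @ X / noise**2
    cov = np.linalg.inv(precision)
    mean = cov @ (X.T @ y) / noise**2
    w = rng.multivariate_normal(mean, cov)  # one posterior sample
    return candidates[int(np.argmax(candidates @ w))]
```

Because the acquisition is a random posterior sample rather than a posterior mean, repeated calls explore beyond the current best-looking compositions, which is the role the review assigns to GPTS.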

pith-pipeline@v0.9.0 · 5548 in / 1280 out tokens · 31121 ms · 2026-05-14T20:24:46.550617+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

222 extracted references · 3 canonical work pages · 3 internal anchors

  1. [1]

    Claude Sonnet

    Anthropic. Claude Sonnet. https://www.anthropic.com/claude/sonnet, 2025. Accessed: 2026-03-07

  2. [2]

    Large language model-assisted autonomous vehicle recovery from immobilization, 2025

    Zhipeng Bao and Qianwen Li. Large language model-assisted autonomous vehicle recovery from immobilization, 2025

  3. [3]

    The use of mmr, diversity-based reranking for reordering documents and producing summaries

    Jaime Carbonell and Jade Goldstein. The use of mmr, diversity-based reranking for reordering documents and producing summaries. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’98, pages 335–336, New York, NY, USA, 1998. Association for Computing Machinery

  4. [4]

    Why do multi-agent LLM systems fail?

    Mert Cemri, Melissa Z Pan, Shuyi Yang, Lakshya A Agrawal, Bhavya Chopra, Rishabh Tiwari, Kurt Keutzer, Aditya Parameswaran, Dan Klein, Kannan Ramchandran, Matei Zaharia, Joseph E. Gonzalez, and Ion Stoica. Why do multi-agent LLM systems fail? In NeurIPS 2025 Workshop on Evaluating the Evolving LLM Lifecycle: Benchmarks, Emergent Abilities, and Scaling, 2025

  5. [5]

    Kevin Kai-Chun Chang, Ekin Beyazit, Alberto Sangiovanni-Vincentelli, Tichakorn Wongpiromsarn, and Sanjit A. Seshia. Scenicrules: An autonomous driving benchmark with multi-objective specifications and abstract scenarios, 2026

  6. [6]

    A comprehensive survey of scene graphs: Generation and application. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(1):1–26, January 2023

    Xiaojun Chang, Pengzhen Ren, Pengfei Xu, Zhihui Li, Xiaojiang Chen, and Alex Hauptmann. A comprehensive survey of scene graphs: Generation and application. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(1):1–26, January 2023

  7. [7]

    Lumos: Let there be language model system certification, 2025

    Isha Chaudhary, Vedaant Jain, Avaljot Singh, Kavya Sachdeva, Sayan Ranu, and Gagandeep Singh. Lumos: Let there be language model system certification, 2025

  8. [8]

    Carla: An open urban driving simulator, 2017

    Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. Carla: An open urban driving simulator, 2017

  9. [9]

    Vlmevalkit: An open-source toolkit for evaluating large multi-modality models

    Haodong Duan, Junming Yang, Yuxuan Qiao, Xinyu Fang, Lin Chen, Yuan Liu, Xiaoyi Dong, Yuhang Zang, Pan Zhang, Jiaqi Wang, et al. Vlmevalkit: An open-source toolkit for evaluating large multi-modality models. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 11198–11201, 2024

  10. [11]

    Towards trustworthy autonomous vehicles with vision-language models under targeted and untargeted adversarial attacks

    Awal Ahmed Fime, Md Zarif Hossain, Saika Zaman, Abdur R Shahid, and Ahmed Imteaj. Towards trustworthy autonomous vehicles with vision-language models under targeted and untargeted adversarial attacks. In 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 619–628, 2025

  11. [12]

    Scenic: a language for scenario specification and scene generation

    Daniel J. Fremont, Tommaso Dreossi, Shromona Ghosh, Xiangyu Yue, Alberto L. Sangiovanni-Vincentelli, and Sanjit A. Seshia. Scenic: a language for scenario specification and scene generation. In Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’19, pages 63–78. ACM, June 2019

  12. [13]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Google Gemini Team. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025

  13. [14]

    Gemini 3 flash model card

    Google DeepMind. Gemini 3 flash model card. https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-3-Flash-Model-Card.pdf, 2025. Accessed: 2026-05-01

  14. [15]

    Gemini 3 pro model card

    Google DeepMind. Gemini 3 pro model card. https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-3-Pro-Model-Card.pdf, 2025. Accessed: 2026-02-24

  15. [16]

    Lit: Large language model driven intention tracking for proactive human-robot collaboration – a robot sous-chef application, 2024

    Zhe Huang, John Pohovey, Ananya Yammanuru, and Katherine Driggs-Campbell. Lit: Large language model driven intention tracking for proactive human-robot collaboration – a robot sous-chef application, 2024

  16. [17]

    Discovering failure modes in vision-language models using rl, 2026

    Kanishk Jain, Qian Yang, Shravan Nayak, Parisa Kordjamshidi, Nishanth Anand, and Aishwarya Agrawal. Discovering failure modes in vision-language models using rl, 2026

  17. [18]

    Failure modes in machine learning systems, 2019

    Ram Shankar Siva Kumar, David O Brien, Kendra Albert, Salomé Viljöen, and Jeffrey Snover. Failure modes in machine learning systems, 2019

  18. [19]

    Concept-based explanations in computer vision: Where are we and where could we go?, 2024

    Jae Hee Lee, Georgii Mikriukov, Gesina Schwalbe, Stefan Wermter, and Diedrich Wolter. Concept-based explanations in computer vision: Where are we and where could we go?, 2024

  19. [20]

    A survey of state of the art large vision language models: Alignment, benchmark, evaluations and challenges, 2025

    Zongxia Li, Xiyang Wu, Hongyang Du, Fuxiao Liu, Huy Nghiem, and Guangyao Shi. A survey of state of the art large vision language models: Alignment, benchmark, evaluations and challenges, 2025

  20. [21]

    Improved Baselines with Visual Instruction Tuning

    Haotian Liu, Pengchuan Zhang, Haocheng Ruan, Xiaowei Hu, Chunyuan Li, and Lei Zhang. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023

  21. [22]

    Safety alignment for vision language models, 2024

    Zhendong Liu, Yuanbi Nie, Yingshui Tan, Xiangyu Yue, Qiushi Cui, Chongjun Wang, Xiaoyong Zhu, and Bo Zheng. Safety alignment for vision language models, 2024

  22. [23]

    Mm-eureka: Exploring the frontiers of multimodal reasoning with rule-based reinforcement learning, 2025

    Fanqing Meng, Lingxiao Du, Zongkai Liu, Zhixiang Zhou, Quanfeng Lu, Daocheng Fu, Tiancheng Han, Botian Shi, Wenhai Wang, Junjun He, Kaipeng Zhang, Ping Luo, Yu Qiao, Qiaosheng Zhang, and Wenqi Shao. Mm-eureka: Exploring the frontiers of multimodal reasoning with rule-based reinforcement learning, 2025

  23. [24]

    Pedram MohajerAnsari, Amir Salarpour, Michael Kühr, Siyu Huang, Mohammad Hamad, Sebastian Steinhorst, Habeeb Olufowobi, Bing Li, and Mert D. Pesé. Toward inherently robust vlms against visual perception attacks, 2026

  24. [25]

    Concept-based explainable artificial intelligence: A survey. ACM Computing Surveys, November 2025

    Eleonora Poeta, Gabriele Ciravegna, Eliana Pastor, Tania Cerquitelli, and Elena Baralis. Concept-based explainable artificial intelligence: A survey. ACM Computing Surveys, November 2025

  25. [26]

    Qwen3 technical report, 2025

    Qwen-Team. Qwen3 technical report, 2025

  26. [27]

    Vlm-robustbench: A comprehensive benchmark for robustness of vision-language models, 2026

    Rohit Saxena, Alessandro Suglia, and Pasquale Minervini. Vlm-robustbench: A comprehensive benchmark for robustness of vision-language models, 2026

  27. [28]

    Explain any concept: Segment anything meets concept-based explanation

    Ao Sun, Pingchuan Ma, Yuanyuan Yuan, and Shuai Wang. Explain any concept: Segment anything meets concept-based explanation. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, pages 21826–21840. Curran Associates, Inc., 2023

  28. [29]

    Glm-4.5v and glm-4.1v-thinking: Towards versatile multimodal reasoning with scalable reinforcement learning, 2026

    V Team, Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, Shuaiqi Duan, Weihan Wang, Yan Wang, Yean Cheng, Zehai He, Zhe Su, Zhen Yang, Ziyang Pan, Aohan Zeng, Baoxu Wang, Bin Chen, Boyan Shi, Changyu Pang, Chenhui Zhang, Da Yin, Fan Yang, Guoqing Chen, Haochen Li, Jiale Zhu, Jiali Chen, Ji...

  29. [30]

    Gonzalo Travieso, Alexandre Benatti, and Luciano da F. Costa. An analytical approach to the Jaccard similarity index, 2024

  30. [31]

    Failure modes in llm systems: A system-level taxonomy for reliable ai applications, 2025

    Vaishali Vinay. Failure modes in llm systems: A system-level taxonomy for reliable ai applications, 2025

  31. [32]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024

  32. [33]

    AdvEDM: Fine-grained adversarial attack against VLM-based embodied agents

    Yichen Wang, Hangtao Zhang, Hewen Pan, Ziqi Zhou, Xianlong Wang, Peijin Guo, Lulu Xue, Shengshan Hu, Minghui Li, and Leo Yu Zhang. AdvEDM: Fine-grained adversarial attack against VLM-based embodied agents. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026

  33. [34]

    Navitrace: Evaluating embodied navigation of vision-language models, 2026

    Tim Windecker, Manthan Patel, Moritz Reuss, Richard Schwarzkopf, Cesar Cadena, Rudolf Lioutikov, Marco Hutter, and Jonas Frey. Navitrace: Evaluating embodied navigation of vision-language models, 2026

  34. [35]

    On the vulnerability of llm/vlm-controlled robotics, 2025

    Xiyang Wu, Souradip Chakraborty, Ruiqi Xian, Jing Liang, Tianrui Guan, Fuxiao Liu, Brian M. Sadler, Dinesh Manocha, and Amrit Singh Bedi. On the vulnerability of llm/vlm-controlled robotics, 2025

  35. [36]

    Are vlms ready for autonomous driving? an empirical study from the reliability, data, and metric perspectives, 2025

    Shaoyuan Xie, Lingdong Kong, Yuhao Dong, Chonghao Sima, Wenwei Zhang, Qi Alfred Chen, Ziwei Liu, and Liang Pan. Are vlms ready for autonomous driving? an empirical study from the reliability, data, and metric perspectives, 2025

  36. [37]

    Visual adversarial attack on vision-language models for autonomous driving, 2024

    Tianyuan Zhang, Lu Wang, Xinwei Zhang, Yitong Zhang, Boyi Jia, Siyuan Liang, Shengshan Hu, Qiang Fu, Aishan Liu, and Xianglong Liu. Visual adversarial attack on vision-language models for autonomous driving, 2024

  37. [38]

    Mm-rlhf: The next step forward in multimodal llm alignment, 2025

    Yi-Fan Zhang, Tao Yu, Haochen Tian, Chaoyou Fu, Peiyan Li, Jianshu Zeng, Wulin Xie, Yang Shi, Huanyu Zhang, Junkang Wu, Xue Wang, Yibo Hu, Bin Wen, Fan Yang, Zhang Zhang, Tingting Gao, Di Zhang, Liang Wang, Rong Jin, and Tieniu Tan. Mm-rlhf: The next step forward in multimodal llm alignment, 2025

  38. [39]

    Manipbench: Benchmarking vision-language models for low-level robot manipulation, 2025

    Enyu Zhao, Vedant Raval, Hejia Zhang, Jiageng Mao, Zeyu Shangguan, Stefanos Nikolaidis, Yue Wang, and Daniel Seita. Manipbench: Benchmarking vision-language models for low-level robot manipulation, 2025

  39. [40]

    Vlmbench: A compositional benchmark for vision-and-language manipulation, 2022

    Kaizhi Zheng, Xiaotong Chen, Odest Chadwicke Jenkins, and Xin Eric Wang. Vlmbench: A compositional benchmark for vision-and-language manipulation, 2022

  40. [41]

    Xingcheng Zhou, Mingyu Liu, Ekim Yurtsever, Bare Luka Zagar, Walter Zimmer, Hu Cao, and Alois C. Knoll. Vision language models in autonomous driving: A survey and outlook, 2024

  41. [42]

    Best subset selection: Optimal pursuit for feature selection and elimination

    Zhihan Zhu, Yanhao Zhang, and Yong Xia. Best subset selection: Optimal pursuit for feature selection and elimination. In Forty-second International Conference on Machine Learning, 2025
