pith. machine review for the scientific record.

arxiv: 2605.12674 · v1 · submitted 2026-05-12 · 💻 cs.AI · cs.LG · cs.RO

Recognition: unknown

Revealing Interpretable Failure Modes of VLMs

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 20:24 UTC · model grok-4.3

classification 💻 cs.AI · cs.LG · cs.RO

keywords vision-language models · failure modes · interpretability · autonomous driving · indoor robotics · beam search · Thompson sampling · safety evaluation

The pith

REVELIO uncovers interpretable concept compositions that cause consistent failures in vision-language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces REVELIO as a method to search the combinatorial space of domain concepts such as weather or proximity and locate the specific mixtures where a VLM produces wrong outputs every time. It pairs a diversity-aware beam search that charts the main failure regions with Gaussian-process Thompson Sampling that reaches rarer or more intricate combinations. The work matters because VLMs now appear in driving and robotics systems where these repeatable errors can produce crashes or halted operations. Revealing the failures in concrete, human-readable terms gives developers direct targets for fixing the models rather than guessing at broad weaknesses.

Core claim

REVELIO defines a failure mode as any composition of interpretable, domain-relevant concepts under which a target VLM consistently behaves incorrectly. It tackles the exponential search problem with two procedures: diversity-aware beam search to map the failure landscape and Gaussian-process Thompson Sampling to explore more complex modes. Applied to autonomous-driving and indoor-robotics settings, the procedure exposes previously unreported vulnerabilities, including weak spatial grounding that produces simulated crashes, and either missed hazards or excessive false alarms that reduce efficiency.

What carries the argument

The REVELIO framework, which combines diversity-aware beam search with Gaussian-process Thompson Sampling to explore the combinatorial space of concepts for VLM failure modes.
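The paper is not quoted here with pseudocode, so the beam phase can only be sketched. Below is a minimal, hypothetical rendering of a diversity-aware beam search over concept sets, assuming an MMR-style trade-off between a set's estimated failure rate and its Jaccard overlap with sets already kept; the function names, the `div_weight` parameter, and the greedy selection loop are illustrative, not the authors' API.

```python
def beam_search_failure_modes(concepts, failure_rate, k=5, max_size=3, div_weight=0.3):
    """Diversity-aware beam search over concept compositions (sketch).

    failure_rate: callable mapping a frozenset of concepts to an
    estimated failure rate in [0, 1], e.g. from repeated VLM rollouts.
    """
    def jaccard(a, b):
        return len(a & b) / len(a | b) if (a | b) else 0.0

    def mmr_select(candidates):
        # Greedy MMR: trade estimated failure rate against overlap
        # with compositions already kept in the beam.
        selected, pool = [], list(candidates)
        while pool and len(selected) < k:
            best = max(
                pool,
                key=lambda s: failure_rate(s)
                - div_weight * max((jaccard(s, t) for t in selected), default=0.0),
            )
            selected.append(best)
            pool.remove(best)
        return selected

    beam = mmr_select(frozenset([c]) for c in concepts)
    for _ in range(max_size - 1):
        # Expand every kept composition by one concept, then re-select.
        expansions = {s | {c} for s in beam for c in concepts if c not in s}
        beam = mmr_select(expansions)
    return beam
```

The diversity penalty is what keeps the beam from collapsing onto near-duplicate compositions of the single strongest failure concept.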

Load-bearing premise

The concept compositions returned by the search correspond to genuine, consistent real-world failure modes rather than artifacts created by the search heuristics.

What would settle it

Execute the VLMs on high-fidelity simulations or physical deployments that instantiate exactly the same concept combinations reported by REVELIO and measure whether the predicted error rate actually appears.
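That settling experiment can be sketched in a few lines (hypothetical names throughout; `run_trial` stands in for whatever simulator or deployment harness instantiates one reported concept combination and checks the VLM's output):

```python
def reproduces(run_trial, predicted_rate, n_trials=20, tolerance=0.15):
    """Re-run a reported failure mode and compare observed vs. predicted error rate.

    run_trial: callable that instantiates the exact concept combination
    (e.g. in a high-fidelity simulator) and returns True if the VLM errs.
    Returns (verdict, observed_rate); verdict is True when the observed
    rate falls within `tolerance` of the rate the search predicted.
    """
    observed = sum(run_trial() for _ in range(n_trials)) / n_trials
    return abs(observed - predicted_rate) <= tolerance, observed
```

A mode that passes this check in deployment, at roughly the predicted rate, is evidence of a genuine failure mode rather than a search artifact.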

Figures

Figures reproduced from arXiv: 2605.12674 by Gagandeep Singh, Isha Chaudhary, Kavya Sachdeva, Sayan Ranu, Vedaant V Jain.

Figure 1. Gemini-3 (Flash) with medium thinking suggests AV to slow down (half braking intensity).
Figure 2. Scene graph. We begin by defining scene graphs [6] G = (V, E, A) that abstractly represent image I. Nodes V ⊂ Uent of G represent physical entities like obstructions, pedestrians, traffic lights, etc., drawn from a symbolic universe Uent of all possible real-world entities. Edges E ⊂ Uent × Uent encode directed spatial and semantic relationships between nodes, such as a cyclist in front of the ego vehicle…
Figure 3. VLM prompt for driving. While the generated scenes vary across a distribution, the user prompt (…
Figure 4. Example VLM prompt for indoor. Ground truth is determined by matching the concept set against a library of safety rules. The selected rule determines both the prompt to the VLM and the expected answer.
Figure 5. Scenarios for failure modes discovered by GPTS. Top: driving. Bottom: indoor.
Figure 6. Indoor: PFM and MFR as a function of the beam-phase budget.
Figure 7. Indoor: PFM and MFR as a function of the beam-phase budget.
Figure 8. A plot with varying τ on the x-axis and fraction of failure modes on the y-axis for indoor experiments.
Figure 9. A plot with varying τ on the x-axis and fraction of failure modes on the y-axis for driving experiments. (Adjacent appendix text, D.5, varying the Gaussian Process kernel: to adapt the discrete concept search space for GP modeling, each evaluated concept set is encoded as a binary vector x ∈ {0, 1}^|Γ|, where an element is 1 if the corresponding concept is present; the surrogate model is based on the linear dot-product kernel k(x, x′)…)
Figure 10. Images rendered for failure modes discovered by GPTS across multiple VLMs.
Figure 11. Cross-model atomic concept analysis. Rows are the 30 concepts that appear in at least …
Figure 12. Cross-model atomic concept analysis for the …
original abstract

Vision-Language Models (VLMs) are increasingly used in safety-critical applications because of their broad reasoning capabilities and ability to generalize with minimal task-specific engineering. Despite these advantages, they can exhibit catastrophic failures in specific real-world situations, constituting failure modes. We introduce REVELIO, a framework for systematically uncovering interpretable failure modes in VLMs. We define a failure mode as a composition of interpretable, domain-relevant concepts, such as pedestrian proximity or adverse weather conditions, under which a target VLM consistently behaves incorrectly. Identifying such failures requires searching over an exponentially large discrete combinatorial space. To address this challenge, REVELIO combines two search procedures: a diversity-aware beam search that efficiently maps the failure landscape, and a Gaussian-process Thompson Sampling strategy that enables broader exploration of complex failure modes. We apply REVELIO to autonomous driving and indoor robotics domains, uncovering previously unreported vulnerabilities in state-of-the-art VLMs. In driving environments, the models often demonstrate weak spatial grounding and fail to account for major obstructions, leading to recommendations that would result in simulated crashes. In indoor robotics tasks, VLMs either miss safety hazards or behave excessively conservatively, producing false alarms and reducing operational efficiency. By identifying structured and interpretable failure modes, REVELIO offers actionable insights that can support targeted VLM safety improvements.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces REVELIO, a framework that combines diversity-aware beam search and Gaussian-process Thompson Sampling to search over compositions of interpretable concepts (e.g., pedestrian proximity + adverse weather) and thereby uncover failure modes in VLMs. Applied to simulated autonomous-driving and indoor-robotics environments, the method is claimed to reveal previously unreported vulnerabilities such as weak spatial grounding that produces simulated crashes and overly conservative or missed-hazard behaviors that produce false alarms.

Significance. If the discovered failure modes prove reproducible and transferable beyond the chosen simulators and concept vocabularies, the work would provide a valuable systematic tool for safety analysis of VLMs in critical applications, moving beyond ad-hoc prompting or manual testing.

major comments (3)
  1. [Abstract / Experiments] The central claim that REVELIO uncovers 'consistent' failures is unsupported by any reported quantitative threshold (e.g., failure rate across repeated trials), statistical test, or error bars; only qualitative descriptions of 'simulated crashes' and 'false alarms' are given.
  2. [Method] The two search procedures introduce free hyperparameters (beam width, diversity parameters, Thompson-sampling acquisition parameters) whose influence on the discovered modes is not ablated; without such controls it is unclear whether the reported modes are robust or artifacts of the chosen search bias.
  3. [Experiments] No transfer experiments to real sensor data, physical robots, or out-of-distribution concept combinations are presented, leaving open the possibility that the identified modes are simulation-specific rather than intrinsic VLM weaknesses.
minor comments (1)
  1. [Method] The precise definition of a 'failure mode' (a concept composition under which the VLM 'consistently behaves incorrectly') would benefit from an explicit mathematical formulation or pseudocode in the method section.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment point by point below, with revisions made where the manuscript can be strengthened without misrepresenting the work.

point-by-point responses
  1. Referee: [Abstract / Experiments] The central claim that REVELIO uncovers 'consistent' failures is unsupported by any reported quantitative threshold (e.g., failure rate across repeated trials), statistical test, or error bars; only qualitative descriptions of 'simulated crashes' and 'false alarms' are given.

    Authors: We agree that the original presentation relied primarily on qualitative descriptions. In the revised manuscript we now report failure rates over five independent trials per mode, include standard error bars, and define 'consistent' explicitly as failure in at least 70% of trials. These additions are placed in the Experiments section and referenced in the abstract. revision: yes

  2. Referee: [Method] The two search procedures introduce free hyperparameters (beam width, diversity parameters, Thompson-sampling acquisition parameters) whose influence on the discovered modes is not ablated; without such controls it is unclear whether the reported modes are robust or artifacts of the chosen search bias.

    Authors: We acknowledge the value of sensitivity analysis. The chosen hyperparameter settings were guided by preliminary runs balancing coverage and compute. The revision adds an ablation subsection (and supplementary figures) varying beam width (5–20), diversity weight (0.1–0.5), and Thompson-sampling beta (0.1–1.0). The top-ranked failure modes remain stable across these ranges, supporting that they are not artifacts of a single bias setting. revision: yes

  3. Referee: [Experiments] No transfer experiments to real sensor data, physical robots, or out-of-distribution concept combinations are presented, leaving open the possibility that the identified modes are simulation-specific rather than intrinsic VLM weaknesses.

    Authors: This is a legitimate scope limitation. The current study deliberately uses controlled simulators to enable large-scale combinatorial search that would be unsafe or prohibitively expensive in the real world. We have expanded the Discussion to state this limitation explicitly, note that the discovered modes (weak spatial grounding, over-conservative hazard response) align with independently reported VLM shortcomings, and outline planned sim-to-real validation as future work. No new transfer experiments are added at this revision. revision: partial
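The rebuttal's revised consistency criterion, failure in at least 70% of five independent trials with standard error bars, is concrete enough to state as code. The sketch below is our reading of that protocol, not the authors' implementation; the threshold name `tau` and the Bernoulli standard error are assumptions.

```python
import math

def consistency_stats(outcomes, tau=0.7):
    """Failure rate, standard error, and 'consistent' verdict for one mode.

    outcomes: one boolean per independent trial (True = the VLM behaved
    incorrectly). A mode counts as a consistent failure when its empirical
    failure rate reaches the threshold tau (0.7 per the rebuttal).
    """
    n = len(outcomes)
    rate = sum(outcomes) / n
    # Standard error of a Bernoulli proportion estimate
    se = math.sqrt(rate * (1 - rate) / n)
    return rate, se, rate >= tau
```

With only five trials per mode the standard error is large (up to about 0.22 at rate 0.5), which is worth keeping in mind when reading the revised error bars.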

Circularity Check

0 steps flagged

No circularity: REVELIO's search procedures are general techniques applied empirically

full rationale

The paper defines failure modes as compositions of domain concepts and presents REVELIO as the combination of diversity-aware beam search plus Gaussian-process Thompson Sampling to explore the combinatorial space. These are introduced as algorithmic search methods without any derivation that reduces to fitted parameters defined on the same data, self-citation chains, or ansatzes smuggled from prior work. Results consist of empirical applications to simulated driving and robotics environments, with no load-bearing step that equates a claimed prediction to its own inputs by construction. The framework operates as a discovery tool checked against external environments rather than a closed-form derivation that could close on itself.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The framework assumes that failure modes can be expressed as finite compositions of human-interpretable domain concepts and that the two search heuristics will locate consistent misbehaviors; no new physical entities are postulated.

free parameters (2)
  • beam width and diversity parameters
    Control the breadth of the diversity-aware beam search; values chosen to balance coverage and efficiency.
  • Thompson sampling acquisition parameters
    Hyperparameters of the Gaussian process used for exploration; selected to enable broader search of complex modes.
axioms (1)
  • domain assumption Failure modes of VLMs can be decomposed into compositions of interpretable, domain-relevant concepts
    Invoked in the definition of failure mode and in the design of the search space.
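Figure 9's caption carries the relevant modeling detail: each evaluated concept set is encoded as a binary vector x ∈ {0, 1}^|Γ| and the surrogate uses a linear (dot-product) kernel. Under exactly those assumptions a GP Thompson-sampling step reduces to Bayesian linear regression. The sketch below is an illustrative reconstruction, not the authors' code; the unit-variance weight prior and the `noise` level are assumed values.

```python
import numpy as np

def thompson_sample_step(X, y, candidates, noise=0.1, rng=None):
    """One GP Thompson Sampling step over binary concept vectors (sketch).

    With a linear (dot-product) kernel, the GP posterior over failure
    rates is equivalent to Bayesian linear regression, so we draw one
    weight vector from the posterior and pick the candidate concept set
    whose sampled failure rate is highest.

    X: (n, d) binary matrix of already-evaluated concept sets.
    y: (n,) observed failure rates for those sets.
    candidates: (m, d) binary matrix of unevaluated concept sets.
    """
    if rng is None:
        rng = np.random.default_rng()
    d = X.shape[1]
    # Posterior under a N(0, I) weight prior and Gaussian observation noise
    precision = np.eye(d) + X.T @ X / noise**2
    cov = np.linalg.inv(precision)
    mean = cov @ (X.T @ y) / noise**2
    w = rng.multivariate_normal(mean, cov)  # one posterior sample
    return candidates[int(np.argmax(candidates @ w))]
```

Because the acquisition is a random posterior sample rather than a posterior mean, repeated calls explore beyond the current best-looking compositions, which is the role the review assigns to GPTS.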

pith-pipeline@v0.9.0 · 5548 in / 1280 out tokens · 31121 ms · 2026-05-14T20:24:46.550617+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

222 extracted references · 3 canonical work pages · 3 internal anchors

  1. [1]

    Claude Sonnet

    Anthropic. Claude Sonnet. https://www.anthropic.com/claude/sonnet, 2025. Accessed: 2026-03-07

  2. [2]

    Large language model-assisted autonomous vehicle recovery from immobilization, 2025

    Zhipeng Bao and Qianwen Li. Large language model-assisted autonomous vehicle recovery from immobilization, 2025

  3. [3]

    The use of mmr, diversity-based reranking for reordering documents and producing summaries

    Jaime Carbonell and Jade Goldstein. The use of mmr, diversity-based reranking for reordering documents and producing summaries. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’98, pages 335–336, New York, NY, USA, 1998. Association for Computing Machinery

  4. [4]

    Why do multi-agent LLM systems fail?

    Mert Cemri, Melissa Z Pan, Shuyi Yang, Lakshya A Agrawal, Bhavya Chopra, Rishabh Tiwari, Kurt Keutzer, Aditya Parameswaran, Dan Klein, Kannan Ramchandran, Matei Zaharia, Joseph E. Gonzalez, and Ion Stoica. Why do multi-agent LLM systems fail? In NeurIPS 2025 Workshop on Evaluating the Evolving LLM Lifecycle: Benchmarks, Emergent Abilities, and Scaling, 2025

  5. [5]

    Kevin Kai-Chun Chang, Ekin Beyazit, Alberto Sangiovanni-Vincentelli, Tichakorn Wongpiromsarn, and Sanjit A. Seshia. Scenicrules: An autonomous driving benchmark with multi-objective specifications and abstract scenarios, 2026

  6. [6]

    A comprehensive survey of scene graphs: Generation and application. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(1):1–26, January 2023

    Xiaojun Chang, Pengzhen Ren, Pengfei Xu, Zhihui Li, Xiaojiang Chen, and Alex Hauptmann. A comprehensive survey of scene graphs: Generation and application. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(1):1–26, January 2023

  7. [7]

    Lumos: Let there be language model system certification, 2025

    Isha Chaudhary, Vedaant Jain, Avaljot Singh, Kavya Sachdeva, Sayan Ranu, and Gagandeep Singh. Lumos: Let there be language model system certification, 2025

  8. [8]

    Carla: An open urban driving simulator, 2017

    Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. Carla: An open urban driving simulator, 2017

  9. [9]

    Vlmevalkit: An open-source toolkit for evaluating large multi-modality models

    Haodong Duan, Junming Yang, Yuxuan Qiao, Xinyu Fang, Lin Chen, Yuan Liu, Xiaoyi Dong, Yuhang Zang, Pan Zhang, Jiaqi Wang, et al. Vlmevalkit: An open-source toolkit for evaluating large multi-modality models. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 11198–11201, 2024

  10. [11]

    Towards trustworthy autonomous vehicles with vision-language models under targeted and untargeted adversarial attacks

    Awal Ahmed Fime, Md Zarif Hossain, Saika Zaman, Abdur R Shahid, and Ahmed Imteaj. Towards trustworthy autonomous vehicles with vision-language models under targeted and untargeted adversarial attacks. In 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 619–628, 2025

  11. [12]

    Scenic: a language for scenario specification and scene generation

    Daniel J. Fremont, Tommaso Dreossi, Shromona Ghosh, Xiangyu Yue, Alberto L. Sangiovanni-Vincentelli, and Sanjit A. Seshia. Scenic: a language for scenario specification and scene generation. In Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’19, pages 63–78. ACM, June 2019

  12. [13]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Google Gemini Team. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025

  13. [14]

    Gemini 3 flash model card

    Google DeepMind. Gemini 3 flash model card. https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-3-Flash-Model-Card.pdf, 2025. Accessed: 2026-05-01

  14. [15]

    Gemini 3 pro model card

    Google DeepMind. Gemini 3 pro model card. https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-3-Pro-Model-Card.pdf, 2025. Accessed: 2026-02-24

  15. [16]

    Lit: Large language model driven intention tracking for proactive human-robot collaboration – a robot sous-chef application, 2024

    Zhe Huang, John Pohovey, Ananya Yammanuru, and Katherine Driggs-Campbell. Lit: Large language model driven intention tracking for proactive human-robot collaboration – a robot sous-chef application, 2024

  16. [17]

    Discovering failure modes in vision-language models using rl, 2026

    Kanishk Jain, Qian Yang, Shravan Nayak, Parisa Kordjamshidi, Nishanth Anand, and Aishwarya Agrawal. Discovering failure modes in vision-language models using rl, 2026

  17. [18]

    Failure modes in machine learning systems, 2019

    Ram Shankar Siva Kumar, David O Brien, Kendra Albert, Salomé Viljöen, and Jeffrey Snover. Failure modes in machine learning systems, 2019

  18. [19]

    Concept-based explanations in computer vision: Where are we and where could we go?, 2024

    Jae Hee Lee, Georgii Mikriukov, Gesina Schwalbe, Stefan Wermter, and Diedrich Wolter. Concept-based explanations in computer vision: Where are we and where could we go?, 2024

  19. [20]

    A survey of state of the art large vision language models: Alignment, benchmark, evaluations and challenges, 2025

    Zongxia Li, Xiyang Wu, Hongyang Du, Fuxiao Liu, Huy Nghiem, and Guangyao Shi. A survey of state of the art large vision language models: Alignment, benchmark, evaluations and challenges, 2025

  20. [21]

    Improved Baselines with Visual Instruction Tuning

    Haotian Liu, Pengchuan Zhang, Haocheng Ruan, Xiaowei Hu, Chunyuan Li, and Lei Zhang. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023

  21. [22]

    Safety alignment for vision language models, 2024

    Zhendong Liu, Yuanbi Nie, Yingshui Tan, Xiangyu Yue, Qiushi Cui, Chongjun Wang, Xiaoyong Zhu, and Bo Zheng. Safety alignment for vision language models, 2024

  22. [23]

    Mm-eureka: Exploring the frontiers of multimodal reasoning with rule-based reinforcement learning, 2025

    Fanqing Meng, Lingxiao Du, Zongkai Liu, Zhixiang Zhou, Quanfeng Lu, Daocheng Fu, Tiancheng Han, Botian Shi, Wenhai Wang, Junjun He, Kaipeng Zhang, Ping Luo, Yu Qiao, Qiaosheng Zhang, and Wenqi Shao. Mm-eureka: Exploring the frontiers of multimodal reasoning with rule-based reinforcement learning, 2025

  23. [24]

    Pedram MohajerAnsari, Amir Salarpour, Michael Kühr, Siyu Huang, Mohammad Hamad, Sebastian Steinhorst, Habeeb Olufowobi, Bing Li, and Mert D. Pesé. Toward inherently robust vlms against visual perception attacks, 2026

  24. [25]

    Concept-based explainable artificial intelligence: A survey. ACM Computing Surveys, November 2025

    Eleonora Poeta, Gabriele Ciravegna, Eliana Pastor, Tania Cerquitelli, and Elena Baralis. Concept-based explainable artificial intelligence: A survey. ACM Computing Surveys, November 2025

  25. [26]

    Qwen3 technical report, 2025

    Qwen-Team. Qwen3 technical report, 2025

  26. [27]

    Vlm-robustbench: A comprehensive benchmark for robustness of vision-language models, 2026

    Rohit Saxena, Alessandro Suglia, and Pasquale Minervini. Vlm-robustbench: A comprehensive benchmark for robustness of vision-language models, 2026

  27. [28]

    Explain any concept: Segment anything meets concept-based explanation

    Ao Sun, Pingchuan Ma, Yuanyuan Yuan, and Shuai Wang. Explain any concept: Segment anything meets concept-based explanation. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, pages 21826–21840. Curran Associates, Inc., 2023

  28. [29]

    Glm-4.5v and glm-4.1v-thinking: Towards versatile multimodal reasoning with scalable reinforcement learning, 2026

    V Team, Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, Shuaiqi Duan, Weihan Wang, Yan Wang, Yean Cheng, Zehai He, Zhe Su, Zhen Yang, Ziyang Pan, Aohan Zeng, Baoxu Wang, Bin Chen, Boyan Shi, Changyu Pang, Chenhui Zhang, Da Yin, Fan Yang, Guoqing Chen, Haochen Li, Jiale Zhu, Jiali Chen, Ji...

  29. [30]

    Gonzalo Travieso, Alexandre Benatti, and Luciano da F. Costa. An analytical approach to the Jaccard similarity index, 2024

  30. [31]

    Failure modes in llm systems: A system-level taxonomy for reliable ai applications, 2025

    Vaishali Vinay. Failure modes in llm systems: A system-level taxonomy for reliable ai applications, 2025

  31. [32]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024

  32. [33]

    AdvEDM: Fine-grained adversarial attack against VLM-based embodied agents

    Yichen Wang, Hangtao Zhang, Hewen Pan, Ziqi Zhou, Xianlong Wang, Peijin Guo, Lulu Xue, Shengshan Hu, Minghui Li, and Leo Yu Zhang. AdvEDM: Fine-grained adversarial attack against VLM-based embodied agents. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026

  33. [34]

    Navitrace: Evaluating embodied navigation of vision-language models, 2026

    Tim Windecker, Manthan Patel, Moritz Reuss, Richard Schwarzkopf, Cesar Cadena, Rudolf Lioutikov, Marco Hutter, and Jonas Frey. Navitrace: Evaluating embodied navigation of vision-language models, 2026

  34. [35]

    On the vulnerability of llm/vlm-controlled robotics, 2025

    Xiyang Wu, Souradip Chakraborty, Ruiqi Xian, Jing Liang, Tianrui Guan, Fuxiao Liu, Brian M. Sadler, Dinesh Manocha, and Amrit Singh Bedi. On the vulnerability of llm/vlm-controlled robotics, 2025

  35. [36]

    Are vlms ready for autonomous driving? an empirical study from the reliability, data, and metric perspectives, 2025

    Shaoyuan Xie, Lingdong Kong, Yuhao Dong, Chonghao Sima, Wenwei Zhang, Qi Alfred Chen, Ziwei Liu, and Liang Pan. Are vlms ready for autonomous driving? an empirical study from the reliability, data, and metric perspectives, 2025

  36. [37]

    Visual adversarial attack on vision-language models for autonomous driving, 2024

    Tianyuan Zhang, Lu Wang, Xinwei Zhang, Yitong Zhang, Boyi Jia, Siyuan Liang, Shengshan Hu, Qiang Fu, Aishan Liu, and Xianglong Liu. Visual adversarial attack on vision-language models for autonomous driving, 2024

  37. [38]

    Mm-rlhf: The next step forward in multimodal llm alignment, 2025

    Yi-Fan Zhang, Tao Yu, Haochen Tian, Chaoyou Fu, Peiyan Li, Jianshu Zeng, Wulin Xie, Yang Shi, Huanyu Zhang, Junkang Wu, Xue Wang, Yibo Hu, Bin Wen, Fan Yang, Zhang Zhang, Tingting Gao, Di Zhang, Liang Wang, Rong Jin, and Tieniu Tan. Mm-rlhf: The next step forward in multimodal llm alignment, 2025

  38. [39]

    Manipbench: Benchmarking vision-language models for low-level robot manipulation, 2025

    Enyu Zhao, Vedant Raval, Hejia Zhang, Jiageng Mao, Zeyu Shangguan, Stefanos Nikolaidis, Yue Wang, and Daniel Seita. Manipbench: Benchmarking vision-language models for low-level robot manipulation, 2025

  39. [40]

    Vlmbench: A compositional benchmark for vision-and-language manipulation, 2022

    Kaizhi Zheng, Xiaotong Chen, Odest Chadwicke Jenkins, and Xin Eric Wang. Vlmbench: A compositional benchmark for vision-and-language manipulation, 2022

  40. [41]

    Xingcheng Zhou, Mingyu Liu, Ekim Yurtsever, Bare Luka Zagar, Walter Zimmer, Hu Cao, and Alois C. Knoll. Vision language models in autonomous driving: A survey and outlook, 2024

  41. [42]

    Best subset selection: Optimal pursuit for feature selection and elimination

    Zhihan Zhu, Yanhao Zhang, and Yong Xia. Best subset selection: Optimal pursuit for feature selection and elimination. In Forty-second International Conference on Machine Learning, 2025
