What to Test Next: Interpretable Coverage Gap Discovery in Driving VLMs

Abhishek Aich; Manmohan Chandraker; Sparsh Garg; Turgun Yusuf Kashgari; Vijay Kumar BG

arxiv: 2606.01624 · v2 · pith:WHK6Y7JXnew · submitted 2026-06-01 · 💻 cs.CV · cs.SE

What to Test Next: Interpretable Coverage Gap Discovery in Driving VLMs

Abhishek Aich , Sparsh Garg , Vijay Kumar BG , Turgun Yusuf Kashgari , Manmohan Chandraker This is my paper

Pith reviewed 2026-06-28 15:34 UTC · model grok-4.3

classification 💻 cs.CV cs.SE

keywords driving VLMscoverage gapstest recommendationSliceScorerSliceNavoperational design domaininterpretable verificationvision language models

0 comments

The pith

A deterministic scorer using exposure and neighbor-failure priors recommends high-risk missing test slices for driving VLMs more effectively than prior methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SliceScorer as a simple, interpretable way to score and recommend untested conditions for verifying driving vision-language models. It prioritizes slices that are rarely exposed in current tests and those near conditions where the model has already failed. This scorer is placed inside SliceNav, an LLM-guided system that turns developer questions into sequences of verification steps while keeping all risk calculations fixed and reviewable. Tests across three different driving VLMs demonstrate that the combined approach identifies more dangerous coverage gaps and keeps suggestions varied across the full range of driving conditions.

Core claim

SliceScorer is a deterministic scoring rule that combines an exposure-based coverage prior to prioritize rare, under-tested regions with a neighbor-failure prior that propagates risk from similar tested conditions. When embedded in the SliceNav pipeline, which uses an LLM to interpret queries and select operators while preserving deterministic scoring, it surfaces high-risk coverage gaps more effectively than prior slice-discovery methods on three driving VLMs while maintaining diverse recommendations.

What carries the argument

SliceScorer, a deterministic scoring rule combining exposure-based coverage prior and neighbor-failure prior for missing-slice recommendation.

If this is right

The method identifies coverage gaps that carry higher failure risk than those found by earlier techniques.
Recommendations remain spread across the condition space rather than clustering in narrow areas.
Both the exposure and neighbor components are necessary, as shown by ablations.
The full pipeline supports traceable workflows from a natural-language query to concrete evaluations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This scoring approach could be applied to other verification tasks where test conditions form a continuous space with local similarity.
If the priors align with real failure modes, it may allow more efficient allocation of testing resources in safety validation.
The separation of LLM orchestration from deterministic scoring might reduce risks of inconsistent or unexplainable recommendations in regulated domains.

Load-bearing premise

The exposure-based coverage prior and neighbor-failure prior accurately reflect actual risk of failure in untested slices.

What would settle it

Running targeted evaluations on the slices recommended by SliceNav versus those from baseline methods and finding that the recommended slices do not exhibit higher failure rates.

Figures

Figures reproduced from arXiv: 2606.01624 by Abhishek Aich, Manmohan Chandraker, Sparsh Garg, Turgun Yusuf Kashgari, Vijay Kumar BG.

**Figure 1.** Figure 1: From stochastic prompting to structured slice reasoning. Top: A pure LLM agent, even when provided with ODD definitions, seed evaluation tables, and strong instruction prompts, produces stochastic slice suggestions with unreliable decision boundaries and incomplete coverage. Bottom: SLICENAV augments the LLM with structured instructions and a dedicated SLICESCORER, enabling deterministic slice suggestion r… view at source ↗

**Figure 2.** Figure 2: Framework overview. The agent takes a natural-language verification goal plus evaluation context and ODD specification, then orchestrates failure triage, in-ODD gap discovery, and controlled out-of-ODD stress proposal via a query-conditioned ODD expansion. It can also close the loop by acquiring data, and re-evaluating the VLM. are pattern-mining approaches such as DivExplorer [11], which frame slice disco… view at source ↗

**Figure 3.** Figure 3: Agentic workflow examples for qualitative analysis. Each row is one completed run from the 15-query qualitative sweep. Broad queries trigger vocabulary extension across multiple ODD dimensions, focused queries select targeted outside-ODD values, and acquisition queries chain slice discovery to data generation/retrieval and model evaluation (see rationales in Appendix I). 4.4 Analysis 4: Reliability and sto… view at source ↗

**Figure 5.** Figure 5: Marginal-Lift Model (SliceFinder-Inspired) [PITH_FULL_IMAGE:figures/full_fig_p015_5.png] view at source ↗

**Figure 6.** Figure 6: WiseAD SliceScorer recommendation profile. Distribution of top-ranked missing-slice recommendations across ODD attributes. clear foggy rainy snowy 0 0.1 0.2 0.3 fraction weather visibility clear cloudy sky conditions day night time city downtown city suburbs construction highway environment dry snow wet 0 0.1 0.2 0.3 fraction road condition intersection non-intersection road type dense sparse traffic type … view at source ↗

**Figure 7.** Figure 7: DriveMM SliceScorer recommendation profile. Distribution of top-ranked missing-slice recommendations across ODD attributes. Output your chosen slices as CSV only : - First line : header with exactly these columns ( comma - separated ): { csv_header } - Following lines : one row per slice , same column order , comma - separated values - Strictly Output { top_k } slices . Do not include any explanation or co… view at source ↗

**Figure 8.** Figure 8: Cosmos SliceScorer recommendation profile. Distribution of top-ranked missing-slice recommendations across ODD attributes. missing-slice ranker then returned top slices all using environment=tunnel, concentrated on foggy daytime dense-traffic cases while varying road condition and road type. A separate failure-summary pass found no tunnel or underpass groups in the existing triage failures, supporting the … view at source ↗

**Figure 9.** Figure 9: Slice-to-executable-test flow. Arrows show what each stage hands off to the next; the dashed branch is narrative interpretation of the normalized bundle (qualitative Example D: generic high-risk acquisition; Appendix I). Snow and ice coverage (Example F). Query: “What snow and ice scenarios are we missing?” Tools: within-ODD missing-slice ranking → refined top-K snow query → existing-failure summary. Vocab… view at source ↗

read the original abstract

Driving vision-language models (VLMs) must accurately understand scenes across diverse conditions defined by Operational Design Domains (ODDs), yet verification remains sparse: many slices are missing, making empirical failure rates unreliable. We propose SliceScorer, a deterministic scoring rule for missing-slice recommendation that combines (i) an exposure-based coverage prior to prioritize rare, under-tested regions, and (ii) a neighbor-failure prior that propagates risk from similar tested conditions. SliceScorer is deliberately simple - interpretable, auditable, and conservative - properties essential for safety-critical validation. For stress testing beyond the declared ODD, we embed SliceScorer within SliceNav, an LLM-orchestrated verification pipeline where the model interprets developer queries to select relevant operators (triage, scoring, acquisition, evaluation) and vocabulary extensions, composing verification workflows while keeping all scoring deterministic and auditable. Experiments on three driving VLMs (WiseAD, DriveMM, Cosmos-Reason2-2B) show that SliceNav surfaces high-risk coverage gaps more effectively than prior slice-discovery methods while maintaining diverse recommendations across the condition space. Ablations confirm both scoring components contribute, and qualitative analysis demonstrates end-to-end workflows from developer query to targeted evaluation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SliceScorer gives a clean deterministic rule for recommending test slices in driving VLMs, but the high-risk claim needs direct checks against actual model failures.

read the letter

The key takeaway is that this work gives a straightforward, deterministic method for suggesting which driving conditions to test next in VLMs, using rarity and proximity to failures, and packages it in an LLM-orchestrated system that stays auditable.

What stands out is how they keep everything simple and interpretable on purpose. SliceScorer combines an exposure prior with a neighbor-failure prior without any learned parameters, which fits safety-critical needs. They show it on three models and run ablations that indicate both parts matter. The pipeline lets developers query in natural language and get workflows that include scoring and evaluation. That combination of LLM flexibility with deterministic scoring is a reasonable engineering choice. The fact that they keep the scoring rule itself free of black-box elements is a plus for auditability in AV contexts.

The main concern is whether the high-risk part holds up. The claim is that it finds gaps where the models are more likely to fail, but the description does not show that they measured actual failure rates on the recommended slices after the fact. If the evaluation only uses the same priors or diversity metrics, then it is not clear the method beats baselines on real risk rather than just on coverage. The stress-test note flags this correctly from the abstract. If the full paper has a separate validation step with ground-truth failures, that would strengthen it a lot. Otherwise the superiority over prior slice-discovery methods rests on how well the priors match reality, which is the weakest link.

This paper is aimed at researchers and engineers working on testing and verification for autonomous driving systems that use vision-language models. Someone looking for practical tools to improve ODD coverage would find the method useful, especially if they value deterministic and explainable approaches over more complex learned alternatives.

I would recommend sending it for peer review. The core idea is clear and the experiments are on real models, even if the risk validation needs more detail to fully support the high-risk claim.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes SliceScorer, a deterministic scoring rule combining an exposure-based coverage prior (prioritizing rare under-tested regions) and a neighbor-failure prior (propagating risk from similar tested conditions) to recommend missing slices for driving VLM verification. SliceScorer is embedded in SliceNav, an LLM-orchestrated pipeline that interprets developer queries to select operators and compose auditable verification workflows. Experiments on three VLMs (WiseAD, DriveMM, Cosmos-Reason2-2B) claim SliceNav outperforms prior slice-discovery methods at surfacing high-risk gaps while maintaining diversity across the condition space; ablations confirm both scoring components contribute.

Significance. If the result holds, the work supplies a deliberately simple, interpretable, and auditable approach to coverage-gap discovery for safety-critical VLM validation in autonomous driving. The deterministic nature of the scoring rule, the explicit separation of priors from the LLM orchestration layer, and the reported ablations are concrete strengths that align with verification needs. The approach could support more systematic ODD testing if the priors are shown to track actual failure risk.

major comments (2)

[Experiments section] Experiments section (as summarized in the abstract): the claim that SliceNav 'surfaces high-risk coverage gaps more effectively' is evaluated using SliceScorer itself; no results are reported from subsequently running the three VLMs on the recommended slices and measuring empirical failure rates (e.g., incorrect scene descriptions or predictions). Without this independent signal, superiority reduces to a coverage-diversity comparison rather than a risk-identification result.
[Ablations] Ablations paragraph (abstract): the statement that 'ablations confirm both scoring components contribute' measures contribution against the composite SliceScorer score; this does not test whether the exposure or neighbor-failure prior actually correlates with higher VLM failure rates on the surfaced slices.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback, particularly the emphasis on distinguishing score-based evaluation from direct empirical validation of risk. We address each major comment below.

read point-by-point responses

Referee: [Experiments section] Experiments section (as summarized in the abstract): the claim that SliceNav 'surfaces high-risk coverage gaps more effectively' is evaluated using SliceScorer itself; no results are reported from subsequently running the three VLMs on the recommended slices and measuring empirical failure rates (e.g., incorrect scene descriptions or predictions). Without this independent signal, superiority reduces to a coverage-diversity comparison rather than a risk-identification result.

Authors: We agree that the reported experiments evaluate SliceNav by comparing SliceScorer values and diversity metrics across methods, without additional VLM inference runs on the newly recommended slices to obtain independent failure rates. The 'high-risk' designation in the manuscript is defined operationally via the exposure and neighbor-failure priors. We recognize that this constitutes an evaluation of coverage and diversity under the proposed scoring rule rather than an independent demonstration of risk identification through measured VLM errors. In the revision we will update the abstract and experiments section to describe the results more precisely as superior slice discovery according to the defined priors and metrics, and we will explicitly note the lack of direct failure-rate measurements as a limitation of the current evaluation. revision: yes
Referee: [Ablations] Ablations paragraph (abstract): the statement that 'ablations confirm both scoring components contribute' measures contribution against the composite SliceScorer score; this does not test whether the exposure or neighbor-failure prior actually correlates with higher VLM failure rates on the surfaced slices.

Authors: The ablations compare the full SliceScorer against ablated variants using the same composite scoring and ranking criteria, thereby showing that each prior improves the overall score. We acknowledge that these results do not establish a direct empirical correlation between the individual priors and observed VLM failure rates. We will revise the abstract and ablation discussion to state that the ablations confirm the contribution of both components to the composite scoring rule. revision: yes

Circularity Check

0 steps flagged

No circularity: deterministic heuristic with independent evaluation claims

full rationale

The paper defines SliceScorer explicitly as a deterministic, non-learned combination of two priors (exposure-based coverage and neighbor-failure) and embeds it in SliceNav without any equations, parameter fitting, or self-citation chains that reduce the output to the input by construction. Experiments compare recommendation quality against prior slice-discovery methods on three VLMs and include ablations on the two components; no derivation step is shown that renames a fitted quantity as a prediction or imports uniqueness from the authors' prior work. The method is therefore self-contained as a proposed auditable heuristic.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no information on free parameters, axioms, or invented entities; no details are given about how the priors are computed or normalized.

pith-pipeline@v0.9.1-grok · 5763 in / 1213 out tokens · 27646 ms · 2026-06-28T15:34:10.534483+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

64 extracted references · 12 canonical work pages · 7 internal anchors

[1]

Automated Data Slicing for Model Validation:A Big data - AI Integration Approach

Yeounoh Chung, Tim Kraska, Steven Euijong Whang, and Neoklis Polyzotis. Slice finder: Automated data slicing for model validation.arXiv preprint arXiv:1807.06068, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[2]

Sliceline: Fast, linear-algebra-based slice finding for ml model debugging

Svetlana Sagadeeva and Matthias Boehm. Sliceline: Fast, linear-algebra-based slice finding for ml model debugging. InSIGMOD, 2021

2021
[3]

Standard ISO 34503:2023, International Organization for Standardization, Geneva, CH, Aug 2023

Road vehicles — Test scenarios for automated driving systems — Specification for operational design domain. Standard ISO 34503:2023, International Organization for Standardization, Geneva, CH, Aug 2023. Accessed: 05 March 2026

2023
[4]

Road vehicles — safety of the intended func- tionality

International Organization for Standardization. Road vehicles — safety of the intended func- tionality. Standard ISO 21448:2022, International Organization for Standardization, Geneva, CH, jun 2022

2022
[5]

Coverage metrics for a scenario database for the scenario-based assessment of automated driving systems

Erwin de Gelder, Maren Buermann, and Olaf Op Den Camp. Coverage metrics for a scenario database for the scenario-based assessment of automated driving systems. In2024 IEEE International Automated Vehicle Validation Conference (IAVVC), pages 1–8. IEEE, 2024

2024
[6]

Wisead: Knowl- edge augmented end-to-end autonomous driving with vision-language model.arXiv preprint arXiv:2412.09951, 2024

Songyan Zhang, Wenhui Huang, Zihui Gao, Hao Chen, and Chen Lv. Wisead: Knowl- edge augmented end-to-end autonomous driving with vision-language model.arXiv preprint arXiv:2412.09951, 2024

work page arXiv 2024
[7]

arXiv preprint arXiv:2412.07689 (2024) 4, 8

Zhijian Huang, Chengjian Fen, Feng Yan, Baihui Xiao, Zequn Jie, Yujie Zhong, Xiaodan Liang, and Lin Ma. Drivemm: All-in-one large multimodal model for autonomous driving.arXiv preprint arXiv:2412.07689, 2024

work page arXiv 2024
[8]

Cosmos-Reason2 — NVIDIA Documentation

NVIDIA Corporation. Cosmos-Reason2 — NVIDIA Documentation. https://docs.nvidia. com/cosmos/latest/reason2/index.html, 2026. Accessed: 2026-03-05

2026
[9]

Autoslicer: Scalable automated data slicing for ml model analysis

Zifan Liu, Evan Rosen, et al. Autoslicer: Scalable automated data slicing for ml model analysis. 2022

2022
[10]

Divisi: Interactive search and visualization for scalable exploratory subgroup analysis

Venkatesh Sivaraman, Zexuan Li, and Adam Perer. Divisi: Interactive search and visualization for scalable exploratory subgroup analysis. pages 1–17, 2025

2025
[11]

Looking for trouble: Analyzing classifier behavior via pattern divergence

Eliana Pastor, Luca De Alfaro, and Elena Baralis. Looking for trouble: Analyzing classifier behavior via pattern divergence. InProceedings of the 2021 International Conference on Management of Data, pages 1400–1412, 2021

2021
[12]

Domino: Discovering systematic errors with cross-modal embeddings.arXiv preprint arXiv:2203.14960, 2022

Sabri Eyuboglu, Maya Varma, Khaled Saab, Jean-Benoit Delbrouck, Christopher Lee-Messer, Jared Dunnmon, James Zou, and Christopher Ré. Domino: Discovering systematic errors with cross-modal embeddings.arXiv preprint arXiv:2203.14960, 2022

work page arXiv 2022
[13]

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

Lianmin Zheng et al. Judging llm-as-a-judge with mt-bench and chatbot arena.arXiv preprint arXiv:2306.05685, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[14]

Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators

Yann Dubois, Balázs Galambosi, Percy Liang, and Tatsunori B. Hashimoto. Length-controlled alpacaeval: A simple way to debias automatic evaluators.arXiv preprint arXiv:2404.04475, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[15]

MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities

Weihao Yu, Zhengyuan Yang, Linjie Li, et al. Mm-vet: Evaluating large multimodal models for integrated capabilities.arXiv preprint arXiv:2308.02490, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[16]

Towards A Rigorous Science of Interpretable Machine Learning

Finale Doshi-Velez and Been Kim. Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608, 2017. 10

work page internal anchor Pith review Pith/arXiv arXiv 2017
[17]

Towards faithfully interpretable nlp systems: How should we define and evaluate faithfulness? InACL, 2020

Alon Jacovi and Yoav Goldberg. Towards faithfully interpretable nlp systems: How should we define and evaluate faithfulness? InACL, 2020

2020
[18]

Domino: Discovering systematic errors with cross-modal embeddings

Sabri Eyuboglu, Maya Varma, Khaled Saab, Jean-Benoit Delbrouck, Christopher Lee-Messer, Jared Dunnmon, James Zou, and Christopher Ré. Domino: Discovering systematic errors with cross-modal embeddings. InICLR, 2022

2022
[19]

Vlmevalkit: An open-source toolkit for evaluating large multi-modality models

Haodong Duan, Junming Yang, Yuxuan Qiao, Xinyu Fang, Lin Chen, Yuan Liu, Xiaoyi Dong, Yuhang Zang, Pan Zhang, Jiaqi Wang, et al. Vlmevalkit: An open-source toolkit for evaluating large multi-modality models. InProceedings of the 32nd ACM International Conference on Multimedia, pages 11198–11201, 2024

2024
[20]

McNee, Joseph A

Cai-Nicolas Ziegler, Sean M. McNee, Joseph A. Konstan, and Georg Lausen. Improving recommendation lists through topic diversification. InProceedings of the 14th International Conference on World Wide Web (WWW), pages 22–32, 2005

2005
[21]

ContextVLM: Zero-shot and few-shot context understanding for autonomous driving using vision language models.arXiv [cs.CV], August 2024

Sural Shounak, Naren, and Rajkumar Ragunathan. ContextVLM: Zero-shot and few-shot context understanding for autonomous driving using vision language models.arXiv [cs.CV], August 2024

2024
[22]

Drama: Joint risk localization and captioning in driving

Srikanth Malla, Chiho Choi, Isht Dwivedi, Joon Hee Choi, and Jiachen Li. Drama: Joint risk localization and captioning in driving. InProceedings of the IEEE/CVF winter conference on applications of computer vision, pages 1043–1052, 2023

2023
[23]

Drivingvqa: Analyzing visual chain-of-thought reasoning of vision language models in real- world scenarios with driving theory tests.arXiv preprint arXiv:2501.04671, 2025

Charles Corbière, Simon Roburin, Syrielle Montariol, Antoine Bosselut, and Alexandre Alahi. Drivingvqa: Analyzing visual chain-of-thought reasoning of vision language models in real- world scenarios with driving theory tests.arXiv preprint arXiv:2501.04671, 2025

work page arXiv 2025
[24]

Lingoqa: Visual question answering for autonomous driving

Ana-Maria Marcu, Long Chen, Jan Hünermann, Alice Karnsund, Benoit Hanotte, Prajwal Chidananda, Saurabh Nair, Vijay Badrinarayanan, Alex Kendall, Jamie Shotton, et al. Lingoqa: Visual question answering for autonomous driving. InEuropean Conference on Computer Vision, pages 252–269. Springer, 2024

2024
[25]

Realworldqa.https://x.ai/blog/realworldqa, November 2024

X.AI. Realworldqa.https://x.ai/blog/realworldqa, November 2024. Blog post

2024
[26]

Sutd-trafficqa: A question answering benchmark and an efficient network for video reasoning over traffic events

Li Xu, He Huang, and Jun Liu. Sutd-trafficqa: A question answering benchmark and an efficient network for video reasoning over traffic events. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9878–9888, 2021

2021
[27]

TUMTraffic- VideoQA: A benchmark for unified spatio-temporal video understanding in traffic scenes.arXiv [cs.CV], February 2025

Xingcheng Zhou, Konstantinos Larintzakis, Hao Guo, Walter Zimmer, Mingyu Liu, Hu Cao, Jia- jie Zhang, Venkatnarayanan Lakshminarasimhan, Leah Strand, and Alois C Knoll. TUMTraffic- VideoQA: A benchmark for unified spatio-temporal video understanding in traffic scenes.arXiv [cs.CV], February 2025

2025
[28]

Canadian adverse driving conditions dataset.The Interna- tional Journal of Robotics Research, 40(4-5):681–690, 2021

Matthew Pitropov, Danson Evan Garcia, Jason Rebello, Michael Smart, Carlos Wang, Krzysztof Czarnecki, and Steven Waslander. Canadian adverse driving conditions dataset.The Interna- tional Journal of Robotics Research, 40(4-5):681–690, 2021

2021
[29]

Pesotif: A challenging visual dataset for perception sotif problems in long-tail traffic scenarios

Liang Peng, Jun Li, Wenbo Shao, and Hong Wang. Pesotif: A challenging visual dataset for perception sotif problems in long-tail traffic scenarios. In2023 IEEE Intelligent Vehicles Symposium (IV), pages 1–8. IEEE, 2023

2023
[30]

Dawn: vehicle detection in adverse weather nature dataset.arXiv preprint arXiv:2008.05402, 2020

Mourad A Kenk and Mahmoud Hassaballah. Dawn: vehicle detection in adverse weather nature dataset.arXiv preprint arXiv:2008.05402, 2020

work page arXiv 2008
[31]

Zenseact open dataset: A large-scale and diverse multimodal dataset for autonomous driving

Mina Alibeigi, William Ljungbergh, Adam Tonderski, Georg Hess, Adam Lilja, Carl Lindström, Daria Motorniuk, Junsheng Fu, Jenny Widahl, and Christoffer Petersson. Zenseact open dataset: A large-scale and diverse multimodal dataset for autonomous driving. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 20178–20188, 2023

2023
[32]

Radiate: A radar dataset for automotive perception in bad weather

Marcel Sheeny, Emanuele De Pellegrin, Saptarshi Mukherjee, Alireza Ahrabian, Sen Wang, and Andrew Wallace. Radiate: A radar dataset for automotive perception in bad weather. In2021 IEEE International Conference on Robotics and Automation (ICRA), pages 1–7. IEEE, 2021. 11

2021
[33]

Video weather recognition (varg): An intensity-labeled video weather recognition dataset.Journal of imaging, 10(11):281, 2024

Himanshu Gupta, Oleksandr Kotlyar, Henrik Andreasson, and Achim J Lilienthal. Video weather recognition (varg): An intensity-labeled video weather recognition dataset.Journal of imaging, 10(11):281, 2024

2024
[34]

Wedge: A multi-weather autonomous driving dataset built from generative vision-language models

Aboli Marathe, Deva Ramanan, Rahee Walambe, and Ketan Kotecha. Wedge: A multi-weather autonomous driving dataset built from generative vision-language models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3318–3327, 2023

2023
[35]

Wilddash-creating hazard-aware benchmarks

Oliver Zendel, Katrin Honauer, Markus Murschitz, Daniel Steininger, and Gustavo Fernandez Dominguez. Wilddash-creating hazard-aware benchmarks. InProceedings of the European conference on computer vision (ECCV), pages 402–416, 2018

2018
[36]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[37]

Introducing GPT-5.2, 12 2025

OpenAI. Introducing GPT-5.2, 12 2025. Accessed: 2026-03-05

2025
[38]

Joseph L. Fleiss. Measuring nominal scale agreement among many raters.Psychological Bulletin, 76(5):378–382, 1971

1971
[39]

Sentence-bert: Sentence embeddings using siamese bert- networks

Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bert- networks. InProceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP- IJCNLP), pages 3982–3992, 2019

2019
[40]

Étude comparative de la distribution florale dans une portion des alpes et des jura

Paul Jaccard. Étude comparative de la distribution florale dans une portion des alpes et des jura. Bulletin de la Société Vaudoise des Sciences Naturelles, 37:547–579, 1901

1901
[41]

Taxonomy and definitions for terms related to on-road motor vehicle automated driving systems.SAE Standard J, 3016:1, 2014

SAE On-Road Automated Vehicle Standards Committee et al. Taxonomy and definitions for terms related to on-road motor vehicle automated driving systems.SAE Standard J, 3016:1, 2014

2014
[42]

Roadwork: A dataset and benchmark for learning to recognize, observe, analyze and drive through work zones

Anurag Ghosh, Shen Zheng, Robert Tamburo, Khiem Vuong, Juan Alvarez-Padilla, Hailiang Zhu, Michael Cardei, Nicholas Dunn, Christoph Mertz, and Srinivasa G Narasimhan. Roadwork: A dataset and benchmark for learning to recognize, observe, analyze and drive through work zones. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 6...

2025
[43]

NuPlan: A closed-loop ML-based planning benchmark for autonomous vehicles

Holger Caesar, Juraj Kabzan, Kok Seang Tan, Whye Kit Fong, Eric Wolff, Alex Lang, Luke Fletcher, Oscar Beijbom, and Sammy Omari. nuplan: A closed-loop ml-based planning benchmark for autonomous vehicles.arXiv preprint arXiv:2106.11810, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[44]

NVIDIA Alpamayo

NVIDIA. NVIDIA Alpamayo. https://developer.nvidia.com/drive/alpamayo, 2025. Accessed: 2026-05-06

2025
[45]

BDD100K: A diverse driving dataset for heterogeneous multitask learning

Fisher Yu, Haofeng Chen, Xin Wang, Wenqi Xian, Yingying Chen, Fangchen Liu, Vashisht Madhavan, and Trevor Darrell. BDD100K: A diverse driving dataset for heterogeneous multitask learning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2636–2645, 2020

2020
[46]

environmental, geographical, and time-of-day restrictions, and/or the requisite presence or absence of certain traffic or roadway characteristics

Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, Yuxiang Chen, Zecheng Tang, Zekai Zhang, Zhengyi Wang, An Yang, Bowen Yu, Chen Cheng, Dayiheng Liu, Deqing Li, Hang Zhang, Hao Meng, Hu Wei, Jingyuan Ni, Kai Chen, Kuan Cao, Liang Peng, Lin Qu, Minggang Wu, Peng Wang, Shuting Yu, Tingkun...

2025
[47]

Extract predicatesfrom SliceLine’s top-k discovered slices, each with an associated error score
[48]

Assignsthe score of its best-matching (highest-scoring) predicate

Expand to full slices:For each candidate full slice s, find all predicates whose constraints are satisfied bys. Assignsthe score of its best-matching (highest-scoring) predicate
[49]

foggy + night

Rank by specificity then discovery order:Sort matched slices first by predicate specificity (number of constrained dimensions, descending), then by predicate discovery rank (ascending), then by predicate score (descending), with slice identifier as the final deterministic tie-breaker. This expansion allows SliceLine’s partial-predicate signal to propagate...
[50]

What driving scenarios outside our current ODD should we prioritize for testing?
[51]

Identify high-risk conditions we haven’t validated yet
[52]

What blind spots exist in our current test coverage? 18
[53]

Suggest untested edge cases that could cause failures in deployment
[54]

We’re expanding to construction zones - what untested scenarios should we prioritize?
[55]

What snow and ice scenarios are we missing?
[56]

What nighttime edge cases should we test?
[57]

We’re deploying on rural highways - what scenarios haven’t we covered?
[58]

What tunnel and underpass scenarios need testing?
[59]

We’re concerned about heavy rain - what wet road scenarios should we prioritize?
[60]

What rush hour congestion scenarios are undertested?
[61]

What dawn and dusk visibility scenarios are we missing?
[62]

Find high-risk untested scenarios and acquire test samples for evaluation
[63]

Identify missing foggy highway slices, retrieve test images, and evaluate the model
[64]

Outside - ODD

Find untested night driving scenarios, generate synthetic test data, and run VLM evaluation. H LLM choose-slices baseline prompts This section reproduces the prompts used by the LLM baseline in Analysis 4. Note that the baseline conditions on acompact text summaryof the seed triage CSV - marginals, support statistics, and aggregate judge scores - not the ...

[1] [1]

Automated Data Slicing for Model Validation:A Big data - AI Integration Approach

Yeounoh Chung, Tim Kraska, Steven Euijong Whang, and Neoklis Polyzotis. Slice finder: Automated data slicing for model validation.arXiv preprint arXiv:1807.06068, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[2] [2]

Sliceline: Fast, linear-algebra-based slice finding for ml model debugging

Svetlana Sagadeeva and Matthias Boehm. Sliceline: Fast, linear-algebra-based slice finding for ml model debugging. InSIGMOD, 2021

2021

[3] [3]

Standard ISO 34503:2023, International Organization for Standardization, Geneva, CH, Aug 2023

Road vehicles — Test scenarios for automated driving systems — Specification for operational design domain. Standard ISO 34503:2023, International Organization for Standardization, Geneva, CH, Aug 2023. Accessed: 05 March 2026

2023

[4] [4]

Road vehicles — safety of the intended func- tionality

International Organization for Standardization. Road vehicles — safety of the intended func- tionality. Standard ISO 21448:2022, International Organization for Standardization, Geneva, CH, jun 2022

2022

[5] [5]

Coverage metrics for a scenario database for the scenario-based assessment of automated driving systems

Erwin de Gelder, Maren Buermann, and Olaf Op Den Camp. Coverage metrics for a scenario database for the scenario-based assessment of automated driving systems. In2024 IEEE International Automated Vehicle Validation Conference (IAVVC), pages 1–8. IEEE, 2024

2024

[6] [6]

Wisead: Knowl- edge augmented end-to-end autonomous driving with vision-language model.arXiv preprint arXiv:2412.09951, 2024

Songyan Zhang, Wenhui Huang, Zihui Gao, Hao Chen, and Chen Lv. Wisead: Knowl- edge augmented end-to-end autonomous driving with vision-language model.arXiv preprint arXiv:2412.09951, 2024

work page arXiv 2024

[7] [7]

arXiv preprint arXiv:2412.07689 (2024) 4, 8

Zhijian Huang, Chengjian Fen, Feng Yan, Baihui Xiao, Zequn Jie, Yujie Zhong, Xiaodan Liang, and Lin Ma. Drivemm: All-in-one large multimodal model for autonomous driving.arXiv preprint arXiv:2412.07689, 2024

work page arXiv 2024

[8] [8]

Cosmos-Reason2 — NVIDIA Documentation

NVIDIA Corporation. Cosmos-Reason2 — NVIDIA Documentation. https://docs.nvidia. com/cosmos/latest/reason2/index.html, 2026. Accessed: 2026-03-05

2026

[9] [9]

Autoslicer: Scalable automated data slicing for ml model analysis

Zifan Liu, Evan Rosen, et al. Autoslicer: Scalable automated data slicing for ml model analysis. 2022

2022

[10] [10]

Divisi: Interactive search and visualization for scalable exploratory subgroup analysis

Venkatesh Sivaraman, Zexuan Li, and Adam Perer. Divisi: Interactive search and visualization for scalable exploratory subgroup analysis. pages 1–17, 2025

2025

[11] [11]

Looking for trouble: Analyzing classifier behavior via pattern divergence

Eliana Pastor, Luca De Alfaro, and Elena Baralis. Looking for trouble: Analyzing classifier behavior via pattern divergence. InProceedings of the 2021 International Conference on Management of Data, pages 1400–1412, 2021

2021

[12] [12]

Domino: Discovering systematic errors with cross-modal embeddings.arXiv preprint arXiv:2203.14960, 2022

Sabri Eyuboglu, Maya Varma, Khaled Saab, Jean-Benoit Delbrouck, Christopher Lee-Messer, Jared Dunnmon, James Zou, and Christopher Ré. Domino: Discovering systematic errors with cross-modal embeddings.arXiv preprint arXiv:2203.14960, 2022

work page arXiv 2022

[13] [13]

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

Lianmin Zheng et al. Judging llm-as-a-judge with mt-bench and chatbot arena.arXiv preprint arXiv:2306.05685, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[14] [14]

Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators

Yann Dubois, Balázs Galambosi, Percy Liang, and Tatsunori B. Hashimoto. Length-controlled alpacaeval: A simple way to debias automatic evaluators.arXiv preprint arXiv:2404.04475, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[15] [15]

MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities

Weihao Yu, Zhengyuan Yang, Linjie Li, et al. Mm-vet: Evaluating large multimodal models for integrated capabilities.arXiv preprint arXiv:2308.02490, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[16] [16]

Towards A Rigorous Science of Interpretable Machine Learning

Finale Doshi-Velez and Been Kim. Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608, 2017. 10

work page internal anchor Pith review Pith/arXiv arXiv 2017

[17] [17]

Towards faithfully interpretable nlp systems: How should we define and evaluate faithfulness? InACL, 2020

Alon Jacovi and Yoav Goldberg. Towards faithfully interpretable nlp systems: How should we define and evaluate faithfulness? InACL, 2020

2020

[18] [18]

Domino: Discovering systematic errors with cross-modal embeddings

Sabri Eyuboglu, Maya Varma, Khaled Saab, Jean-Benoit Delbrouck, Christopher Lee-Messer, Jared Dunnmon, James Zou, and Christopher Ré. Domino: Discovering systematic errors with cross-modal embeddings. InICLR, 2022

2022

[19] [19]

Vlmevalkit: An open-source toolkit for evaluating large multi-modality models

Haodong Duan, Junming Yang, Yuxuan Qiao, Xinyu Fang, Lin Chen, Yuan Liu, Xiaoyi Dong, Yuhang Zang, Pan Zhang, Jiaqi Wang, et al. Vlmevalkit: An open-source toolkit for evaluating large multi-modality models. InProceedings of the 32nd ACM International Conference on Multimedia, pages 11198–11201, 2024

2024

[20] [20]

McNee, Joseph A

Cai-Nicolas Ziegler, Sean M. McNee, Joseph A. Konstan, and Georg Lausen. Improving recommendation lists through topic diversification. InProceedings of the 14th International Conference on World Wide Web (WWW), pages 22–32, 2005

2005

[21] [21]

ContextVLM: Zero-shot and few-shot context understanding for autonomous driving using vision language models.arXiv [cs.CV], August 2024

Sural Shounak, Naren, and Rajkumar Ragunathan. ContextVLM: Zero-shot and few-shot context understanding for autonomous driving using vision language models.arXiv [cs.CV], August 2024

2024

[22] [22]

Drama: Joint risk localization and captioning in driving

Srikanth Malla, Chiho Choi, Isht Dwivedi, Joon Hee Choi, and Jiachen Li. Drama: Joint risk localization and captioning in driving. InProceedings of the IEEE/CVF winter conference on applications of computer vision, pages 1043–1052, 2023

2023

[23] [23]

Drivingvqa: Analyzing visual chain-of-thought reasoning of vision language models in real- world scenarios with driving theory tests.arXiv preprint arXiv:2501.04671, 2025

Charles Corbière, Simon Roburin, Syrielle Montariol, Antoine Bosselut, and Alexandre Alahi. Drivingvqa: Analyzing visual chain-of-thought reasoning of vision language models in real- world scenarios with driving theory tests.arXiv preprint arXiv:2501.04671, 2025

work page arXiv 2025

[24] [24]

Lingoqa: Visual question answering for autonomous driving

Ana-Maria Marcu, Long Chen, Jan Hünermann, Alice Karnsund, Benoit Hanotte, Prajwal Chidananda, Saurabh Nair, Vijay Badrinarayanan, Alex Kendall, Jamie Shotton, et al. Lingoqa: Visual question answering for autonomous driving. InEuropean Conference on Computer Vision, pages 252–269. Springer, 2024

2024

[25] [25]

Realworldqa.https://x.ai/blog/realworldqa, November 2024

X.AI. Realworldqa.https://x.ai/blog/realworldqa, November 2024. Blog post

2024

[26] [26]

Sutd-trafficqa: A question answering benchmark and an efficient network for video reasoning over traffic events

Li Xu, He Huang, and Jun Liu. Sutd-trafficqa: A question answering benchmark and an efficient network for video reasoning over traffic events. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9878–9888, 2021

2021

[27] [27]

TUMTraffic- VideoQA: A benchmark for unified spatio-temporal video understanding in traffic scenes.arXiv [cs.CV], February 2025

Xingcheng Zhou, Konstantinos Larintzakis, Hao Guo, Walter Zimmer, Mingyu Liu, Hu Cao, Jia- jie Zhang, Venkatnarayanan Lakshminarasimhan, Leah Strand, and Alois C Knoll. TUMTraffic- VideoQA: A benchmark for unified spatio-temporal video understanding in traffic scenes.arXiv [cs.CV], February 2025

2025

[28] [28]

Canadian adverse driving conditions dataset.The Interna- tional Journal of Robotics Research, 40(4-5):681–690, 2021

Matthew Pitropov, Danson Evan Garcia, Jason Rebello, Michael Smart, Carlos Wang, Krzysztof Czarnecki, and Steven Waslander. Canadian adverse driving conditions dataset.The Interna- tional Journal of Robotics Research, 40(4-5):681–690, 2021

2021

[29] [29]

Pesotif: A challenging visual dataset for perception sotif problems in long-tail traffic scenarios

Liang Peng, Jun Li, Wenbo Shao, and Hong Wang. Pesotif: A challenging visual dataset for perception sotif problems in long-tail traffic scenarios. In2023 IEEE Intelligent Vehicles Symposium (IV), pages 1–8. IEEE, 2023

2023

[30] [30]

Dawn: vehicle detection in adverse weather nature dataset.arXiv preprint arXiv:2008.05402, 2020

Mourad A Kenk and Mahmoud Hassaballah. Dawn: vehicle detection in adverse weather nature dataset.arXiv preprint arXiv:2008.05402, 2020

work page arXiv 2008

[31] [31]

Zenseact open dataset: A large-scale and diverse multimodal dataset for autonomous driving

Mina Alibeigi, William Ljungbergh, Adam Tonderski, Georg Hess, Adam Lilja, Carl Lindström, Daria Motorniuk, Junsheng Fu, Jenny Widahl, and Christoffer Petersson. Zenseact open dataset: A large-scale and diverse multimodal dataset for autonomous driving. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 20178–20188, 2023

2023

[32] [32]

Radiate: A radar dataset for automotive perception in bad weather

Marcel Sheeny, Emanuele De Pellegrin, Saptarshi Mukherjee, Alireza Ahrabian, Sen Wang, and Andrew Wallace. Radiate: A radar dataset for automotive perception in bad weather. In2021 IEEE International Conference on Robotics and Automation (ICRA), pages 1–7. IEEE, 2021. 11

2021

[33] [33]

Video weather recognition (varg): An intensity-labeled video weather recognition dataset.Journal of imaging, 10(11):281, 2024

Himanshu Gupta, Oleksandr Kotlyar, Henrik Andreasson, and Achim J Lilienthal. Video weather recognition (varg): An intensity-labeled video weather recognition dataset.Journal of imaging, 10(11):281, 2024

2024

[34] [34]

Wedge: A multi-weather autonomous driving dataset built from generative vision-language models

Aboli Marathe, Deva Ramanan, Rahee Walambe, and Ketan Kotecha. Wedge: A multi-weather autonomous driving dataset built from generative vision-language models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3318–3327, 2023

2023

[35] [35]

Wilddash-creating hazard-aware benchmarks

Oliver Zendel, Katrin Honauer, Markus Murschitz, Daniel Steininger, and Gustavo Fernandez Dominguez. Wilddash-creating hazard-aware benchmarks. InProceedings of the European conference on computer vision (ECCV), pages 402–416, 2018

2018

[36] [36]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[37] [37]

Introducing GPT-5.2, 12 2025

OpenAI. Introducing GPT-5.2, 12 2025. Accessed: 2026-03-05

2025

[38] [38]

Joseph L. Fleiss. Measuring nominal scale agreement among many raters.Psychological Bulletin, 76(5):378–382, 1971

1971

[39] [39]

Sentence-bert: Sentence embeddings using siamese bert- networks

Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bert- networks. InProceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP- IJCNLP), pages 3982–3992, 2019

2019

[40] [40]

Étude comparative de la distribution florale dans une portion des alpes et des jura

Paul Jaccard. Étude comparative de la distribution florale dans une portion des alpes et des jura. Bulletin de la Société Vaudoise des Sciences Naturelles, 37:547–579, 1901

1901

[41] [41]

Taxonomy and definitions for terms related to on-road motor vehicle automated driving systems.SAE Standard J, 3016:1, 2014

SAE On-Road Automated Vehicle Standards Committee et al. Taxonomy and definitions for terms related to on-road motor vehicle automated driving systems.SAE Standard J, 3016:1, 2014

2014

[42] [42]

Roadwork: A dataset and benchmark for learning to recognize, observe, analyze and drive through work zones

Anurag Ghosh, Shen Zheng, Robert Tamburo, Khiem Vuong, Juan Alvarez-Padilla, Hailiang Zhu, Michael Cardei, Nicholas Dunn, Christoph Mertz, and Srinivasa G Narasimhan. Roadwork: A dataset and benchmark for learning to recognize, observe, analyze and drive through work zones. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 6...

2025

[43] [43]

NuPlan: A closed-loop ML-based planning benchmark for autonomous vehicles

Holger Caesar, Juraj Kabzan, Kok Seang Tan, Whye Kit Fong, Eric Wolff, Alex Lang, Luke Fletcher, Oscar Beijbom, and Sammy Omari. nuplan: A closed-loop ml-based planning benchmark for autonomous vehicles.arXiv preprint arXiv:2106.11810, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[44] [44]

NVIDIA Alpamayo

NVIDIA. NVIDIA Alpamayo. https://developer.nvidia.com/drive/alpamayo, 2025. Accessed: 2026-05-06

2025

[45] [45]

BDD100K: A diverse driving dataset for heterogeneous multitask learning

Fisher Yu, Haofeng Chen, Xin Wang, Wenqi Xian, Yingying Chen, Fangchen Liu, Vashisht Madhavan, and Trevor Darrell. BDD100K: A diverse driving dataset for heterogeneous multitask learning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2636–2645, 2020

2020

[46] [46]

environmental, geographical, and time-of-day restrictions, and/or the requisite presence or absence of certain traffic or roadway characteristics

Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, Yuxiang Chen, Zecheng Tang, Zekai Zhang, Zhengyi Wang, An Yang, Bowen Yu, Chen Cheng, Dayiheng Liu, Deqing Li, Hang Zhang, Hao Meng, Hu Wei, Jingyuan Ni, Kai Chen, Kuan Cao, Liang Peng, Lin Qu, Minggang Wu, Peng Wang, Shuting Yu, Tingkun...

2025

[47] [47]

Extract predicatesfrom SliceLine’s top-k discovered slices, each with an associated error score

[48] [48]

Assignsthe score of its best-matching (highest-scoring) predicate

Expand to full slices:For each candidate full slice s, find all predicates whose constraints are satisfied bys. Assignsthe score of its best-matching (highest-scoring) predicate

[49] [49]

foggy + night

Rank by specificity then discovery order:Sort matched slices first by predicate specificity (number of constrained dimensions, descending), then by predicate discovery rank (ascending), then by predicate score (descending), with slice identifier as the final deterministic tie-breaker. This expansion allows SliceLine’s partial-predicate signal to propagate...

[50] [50]

What driving scenarios outside our current ODD should we prioritize for testing?

[51] [51]

Identify high-risk conditions we haven’t validated yet

[52] [52]

What blind spots exist in our current test coverage? 18

[53] [53]

Suggest untested edge cases that could cause failures in deployment

[54] [54]

We’re expanding to construction zones - what untested scenarios should we prioritize?

[55] [55]

What snow and ice scenarios are we missing?

[56] [56]

What nighttime edge cases should we test?

[57] [57]

We’re deploying on rural highways - what scenarios haven’t we covered?

[58] [58]

What tunnel and underpass scenarios need testing?

[59] [59]

We’re concerned about heavy rain - what wet road scenarios should we prioritize?

[60] [60]

What rush hour congestion scenarios are undertested?

[61] [61]

What dawn and dusk visibility scenarios are we missing?

[62] [62]

Find high-risk untested scenarios and acquire test samples for evaluation

[63] [63]

Identify missing foggy highway slices, retrieve test images, and evaluate the model

[64] [64]

Outside - ODD

Find untested night driving scenarios, generate synthetic test data, and run VLM evaluation. H LLM choose-slices baseline prompts This section reproduces the prompts used by the LLM baseline in Analysis 4. Note that the baseline conditions on acompact text summaryof the seed triage CSV - marginals, support statistics, and aggregate judge scores - not the ...