pith. sign in

arxiv: 2606.01624 · v2 · pith:WHK6Y7JXnew · submitted 2026-06-01 · 💻 cs.CV · cs.SE

What to Test Next: Interpretable Coverage Gap Discovery in Driving VLMs

Pith reviewed 2026-06-28 15:34 UTC · model grok-4.3

classification 💻 cs.CV cs.SE
keywords driving VLMscoverage gapstest recommendationSliceScorerSliceNavoperational design domaininterpretable verificationvision language models
0
0 comments X

The pith

A deterministic scorer using exposure and neighbor-failure priors recommends high-risk missing test slices for driving VLMs more effectively than prior methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SliceScorer as a simple, interpretable way to score and recommend untested conditions for verifying driving vision-language models. It prioritizes slices that are rarely exposed in current tests and those near conditions where the model has already failed. This scorer is placed inside SliceNav, an LLM-guided system that turns developer questions into sequences of verification steps while keeping all risk calculations fixed and reviewable. Tests across three different driving VLMs demonstrate that the combined approach identifies more dangerous coverage gaps and keeps suggestions varied across the full range of driving conditions.

Core claim

SliceScorer is a deterministic scoring rule that combines an exposure-based coverage prior to prioritize rare, under-tested regions with a neighbor-failure prior that propagates risk from similar tested conditions. When embedded in the SliceNav pipeline, which uses an LLM to interpret queries and select operators while preserving deterministic scoring, it surfaces high-risk coverage gaps more effectively than prior slice-discovery methods on three driving VLMs while maintaining diverse recommendations.

What carries the argument

SliceScorer, a deterministic scoring rule combining exposure-based coverage prior and neighbor-failure prior for missing-slice recommendation.

If this is right

  • The method identifies coverage gaps that carry higher failure risk than those found by earlier techniques.
  • Recommendations remain spread across the condition space rather than clustering in narrow areas.
  • Both the exposure and neighbor components are necessary, as shown by ablations.
  • The full pipeline supports traceable workflows from a natural-language query to concrete evaluations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This scoring approach could be applied to other verification tasks where test conditions form a continuous space with local similarity.
  • If the priors align with real failure modes, it may allow more efficient allocation of testing resources in safety validation.
  • The separation of LLM orchestration from deterministic scoring might reduce risks of inconsistent or unexplainable recommendations in regulated domains.

Load-bearing premise

The exposure-based coverage prior and neighbor-failure prior accurately reflect actual risk of failure in untested slices.

What would settle it

Running targeted evaluations on the slices recommended by SliceNav versus those from baseline methods and finding that the recommended slices do not exhibit higher failure rates.

Figures

Figures reproduced from arXiv: 2606.01624 by Abhishek Aich, Manmohan Chandraker, Sparsh Garg, Turgun Yusuf Kashgari, Vijay Kumar BG.

Figure 1
Figure 1. Figure 1: From stochastic prompting to structured slice reasoning. Top: A pure LLM agent, even when provided with ODD definitions, seed evaluation tables, and strong instruction prompts, produces stochastic slice suggestions with unreliable decision boundaries and incomplete coverage. Bottom: SLICENAV augments the LLM with structured instructions and a dedicated SLICESCORER, enabling deterministic slice suggestion r… view at source ↗
Figure 2
Figure 2. Figure 2: Framework overview. The agent takes a natural-language verification goal plus evaluation context and ODD specification, then orchestrates failure triage, in-ODD gap discovery, and controlled out-of-ODD stress proposal via a query-conditioned ODD expansion. It can also close the loop by acquiring data, and re-evaluating the VLM. are pattern-mining approaches such as DivExplorer [11], which frame slice disco… view at source ↗
Figure 3
Figure 3. Figure 3: Agentic workflow examples for qualitative analysis. Each row is one completed run from the 15-query qualitative sweep. Broad queries trigger vocabulary extension across multiple ODD dimensions, focused queries select targeted outside-ODD values, and acquisition queries chain slice discovery to data generation/retrieval and model evaluation (see rationales in Appendix I). 4.4 Analysis 4: Reliability and sto… view at source ↗
Figure 5
Figure 5. Figure 5: Marginal-Lift Model (SliceFinder-Inspired) [PITH_FULL_IMAGE:figures/full_fig_p015_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: WiseAD SliceScorer recommendation profile. Distribution of top-ranked missing-slice recommendations across ODD attributes. clear foggy rainy snowy 0 0.1 0.2 0.3 fraction weather visibility clear cloudy sky conditions day night time city downtown city suburbs construction highway environment dry snow wet 0 0.1 0.2 0.3 fraction road condition intersection non-intersection road type dense sparse traffic type … view at source ↗
Figure 7
Figure 7. Figure 7: DriveMM SliceScorer recommendation profile. Distribution of top-ranked missing-slice recommendations across ODD attributes. Output your chosen slices as CSV only : - First line : header with exactly these columns ( comma - separated ): { csv_header } - Following lines : one row per slice , same column order , comma - separated values - Strictly Output { top_k } slices . Do not include any explanation or co… view at source ↗
Figure 8
Figure 8. Figure 8: Cosmos SliceScorer recommendation profile. Distribution of top-ranked missing-slice recommendations across ODD attributes. missing-slice ranker then returned top slices all using environment=tunnel, concentrated on foggy daytime dense-traffic cases while varying road condition and road type. A separate failure-summary pass found no tunnel or underpass groups in the existing triage failures, supporting the … view at source ↗
Figure 9
Figure 9. Figure 9: Slice-to-executable-test flow. Arrows show what each stage hands off to the next; the dashed branch is narrative interpretation of the normalized bundle (qualitative Example D: generic high-risk acquisition; Appendix I). Snow and ice coverage (Example F). Query: “What snow and ice scenarios are we missing?” Tools: within-ODD missing-slice ranking → refined top-K snow query → existing-failure summary. Vocab… view at source ↗
read the original abstract

Driving vision-language models (VLMs) must accurately understand scenes across diverse conditions defined by Operational Design Domains (ODDs), yet verification remains sparse: many slices are missing, making empirical failure rates unreliable. We propose SliceScorer, a deterministic scoring rule for missing-slice recommendation that combines (i) an exposure-based coverage prior to prioritize rare, under-tested regions, and (ii) a neighbor-failure prior that propagates risk from similar tested conditions. SliceScorer is deliberately simple - interpretable, auditable, and conservative - properties essential for safety-critical validation. For stress testing beyond the declared ODD, we embed SliceScorer within SliceNav, an LLM-orchestrated verification pipeline where the model interprets developer queries to select relevant operators (triage, scoring, acquisition, evaluation) and vocabulary extensions, composing verification workflows while keeping all scoring deterministic and auditable. Experiments on three driving VLMs (WiseAD, DriveMM, Cosmos-Reason2-2B) show that SliceNav surfaces high-risk coverage gaps more effectively than prior slice-discovery methods while maintaining diverse recommendations across the condition space. Ablations confirm both scoring components contribute, and qualitative analysis demonstrates end-to-end workflows from developer query to targeted evaluation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes SliceScorer, a deterministic scoring rule combining an exposure-based coverage prior (prioritizing rare under-tested regions) and a neighbor-failure prior (propagating risk from similar tested conditions) to recommend missing slices for driving VLM verification. SliceScorer is embedded in SliceNav, an LLM-orchestrated pipeline that interprets developer queries to select operators and compose auditable verification workflows. Experiments on three VLMs (WiseAD, DriveMM, Cosmos-Reason2-2B) claim SliceNav outperforms prior slice-discovery methods at surfacing high-risk gaps while maintaining diversity across the condition space; ablations confirm both scoring components contribute.

Significance. If the result holds, the work supplies a deliberately simple, interpretable, and auditable approach to coverage-gap discovery for safety-critical VLM validation in autonomous driving. The deterministic nature of the scoring rule, the explicit separation of priors from the LLM orchestration layer, and the reported ablations are concrete strengths that align with verification needs. The approach could support more systematic ODD testing if the priors are shown to track actual failure risk.

major comments (2)
  1. [Experiments section] Experiments section (as summarized in the abstract): the claim that SliceNav 'surfaces high-risk coverage gaps more effectively' is evaluated using SliceScorer itself; no results are reported from subsequently running the three VLMs on the recommended slices and measuring empirical failure rates (e.g., incorrect scene descriptions or predictions). Without this independent signal, superiority reduces to a coverage-diversity comparison rather than a risk-identification result.
  2. [Ablations] Ablations paragraph (abstract): the statement that 'ablations confirm both scoring components contribute' measures contribution against the composite SliceScorer score; this does not test whether the exposure or neighbor-failure prior actually correlates with higher VLM failure rates on the surfaced slices.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback, particularly the emphasis on distinguishing score-based evaluation from direct empirical validation of risk. We address each major comment below.

read point-by-point responses
  1. Referee: [Experiments section] Experiments section (as summarized in the abstract): the claim that SliceNav 'surfaces high-risk coverage gaps more effectively' is evaluated using SliceScorer itself; no results are reported from subsequently running the three VLMs on the recommended slices and measuring empirical failure rates (e.g., incorrect scene descriptions or predictions). Without this independent signal, superiority reduces to a coverage-diversity comparison rather than a risk-identification result.

    Authors: We agree that the reported experiments evaluate SliceNav by comparing SliceScorer values and diversity metrics across methods, without additional VLM inference runs on the newly recommended slices to obtain independent failure rates. The 'high-risk' designation in the manuscript is defined operationally via the exposure and neighbor-failure priors. We recognize that this constitutes an evaluation of coverage and diversity under the proposed scoring rule rather than an independent demonstration of risk identification through measured VLM errors. In the revision we will update the abstract and experiments section to describe the results more precisely as superior slice discovery according to the defined priors and metrics, and we will explicitly note the lack of direct failure-rate measurements as a limitation of the current evaluation. revision: yes

  2. Referee: [Ablations] Ablations paragraph (abstract): the statement that 'ablations confirm both scoring components contribute' measures contribution against the composite SliceScorer score; this does not test whether the exposure or neighbor-failure prior actually correlates with higher VLM failure rates on the surfaced slices.

    Authors: The ablations compare the full SliceScorer against ablated variants using the same composite scoring and ranking criteria, thereby showing that each prior improves the overall score. We acknowledge that these results do not establish a direct empirical correlation between the individual priors and observed VLM failure rates. We will revise the abstract and ablation discussion to state that the ablations confirm the contribution of both components to the composite scoring rule. revision: yes

Circularity Check

0 steps flagged

No circularity: deterministic heuristic with independent evaluation claims

full rationale

The paper defines SliceScorer explicitly as a deterministic, non-learned combination of two priors (exposure-based coverage and neighbor-failure) and embeds it in SliceNav without any equations, parameter fitting, or self-citation chains that reduce the output to the input by construction. Experiments compare recommendation quality against prior slice-discovery methods on three VLMs and include ablations on the two components; no derivation step is shown that renames a fitted quantity as a prediction or imports uniqueness from the authors' prior work. The method is therefore self-contained as a proposed auditable heuristic.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no information on free parameters, axioms, or invented entities; no details are given about how the priors are computed or normalized.

pith-pipeline@v0.9.1-grok · 5763 in / 1213 out tokens · 27646 ms · 2026-06-28T15:34:10.534483+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

64 extracted references · 12 canonical work pages · 7 internal anchors

  1. [1]

    Automated Data Slicing for Model Validation:A Big data - AI Integration Approach

    Yeounoh Chung, Tim Kraska, Steven Euijong Whang, and Neoklis Polyzotis. Slice finder: Automated data slicing for model validation.arXiv preprint arXiv:1807.06068, 2018

  2. [2]

    Sliceline: Fast, linear-algebra-based slice finding for ml model debugging

    Svetlana Sagadeeva and Matthias Boehm. Sliceline: Fast, linear-algebra-based slice finding for ml model debugging. InSIGMOD, 2021

  3. [3]

    Standard ISO 34503:2023, International Organization for Standardization, Geneva, CH, Aug 2023

    Road vehicles — Test scenarios for automated driving systems — Specification for operational design domain. Standard ISO 34503:2023, International Organization for Standardization, Geneva, CH, Aug 2023. Accessed: 05 March 2026

  4. [4]

    Road vehicles — safety of the intended func- tionality

    International Organization for Standardization. Road vehicles — safety of the intended func- tionality. Standard ISO 21448:2022, International Organization for Standardization, Geneva, CH, jun 2022

  5. [5]

    Coverage metrics for a scenario database for the scenario-based assessment of automated driving systems

    Erwin de Gelder, Maren Buermann, and Olaf Op Den Camp. Coverage metrics for a scenario database for the scenario-based assessment of automated driving systems. In2024 IEEE International Automated Vehicle Validation Conference (IAVVC), pages 1–8. IEEE, 2024

  6. [6]

    Wisead: Knowl- edge augmented end-to-end autonomous driving with vision-language model.arXiv preprint arXiv:2412.09951, 2024

    Songyan Zhang, Wenhui Huang, Zihui Gao, Hao Chen, and Chen Lv. Wisead: Knowl- edge augmented end-to-end autonomous driving with vision-language model.arXiv preprint arXiv:2412.09951, 2024

  7. [7]

    arXiv preprint arXiv:2412.07689 (2024) 4, 8

    Zhijian Huang, Chengjian Fen, Feng Yan, Baihui Xiao, Zequn Jie, Yujie Zhong, Xiaodan Liang, and Lin Ma. Drivemm: All-in-one large multimodal model for autonomous driving.arXiv preprint arXiv:2412.07689, 2024

  8. [8]

    Cosmos-Reason2 — NVIDIA Documentation

    NVIDIA Corporation. Cosmos-Reason2 — NVIDIA Documentation. https://docs.nvidia. com/cosmos/latest/reason2/index.html, 2026. Accessed: 2026-03-05

  9. [9]

    Autoslicer: Scalable automated data slicing for ml model analysis

    Zifan Liu, Evan Rosen, et al. Autoslicer: Scalable automated data slicing for ml model analysis. 2022

  10. [10]

    Divisi: Interactive search and visualization for scalable exploratory subgroup analysis

    Venkatesh Sivaraman, Zexuan Li, and Adam Perer. Divisi: Interactive search and visualization for scalable exploratory subgroup analysis. pages 1–17, 2025

  11. [11]

    Looking for trouble: Analyzing classifier behavior via pattern divergence

    Eliana Pastor, Luca De Alfaro, and Elena Baralis. Looking for trouble: Analyzing classifier behavior via pattern divergence. InProceedings of the 2021 International Conference on Management of Data, pages 1400–1412, 2021

  12. [12]

    Domino: Discovering systematic errors with cross-modal embeddings.arXiv preprint arXiv:2203.14960, 2022

    Sabri Eyuboglu, Maya Varma, Khaled Saab, Jean-Benoit Delbrouck, Christopher Lee-Messer, Jared Dunnmon, James Zou, and Christopher Ré. Domino: Discovering systematic errors with cross-modal embeddings.arXiv preprint arXiv:2203.14960, 2022

  13. [13]

    Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

    Lianmin Zheng et al. Judging llm-as-a-judge with mt-bench and chatbot arena.arXiv preprint arXiv:2306.05685, 2023

  14. [14]

    Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators

    Yann Dubois, Balázs Galambosi, Percy Liang, and Tatsunori B. Hashimoto. Length-controlled alpacaeval: A simple way to debias automatic evaluators.arXiv preprint arXiv:2404.04475, 2024

  15. [15]

    MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities

    Weihao Yu, Zhengyuan Yang, Linjie Li, et al. Mm-vet: Evaluating large multimodal models for integrated capabilities.arXiv preprint arXiv:2308.02490, 2023

  16. [16]

    Towards A Rigorous Science of Interpretable Machine Learning

    Finale Doshi-Velez and Been Kim. Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608, 2017. 10

  17. [17]

    Towards faithfully interpretable nlp systems: How should we define and evaluate faithfulness? InACL, 2020

    Alon Jacovi and Yoav Goldberg. Towards faithfully interpretable nlp systems: How should we define and evaluate faithfulness? InACL, 2020

  18. [18]

    Domino: Discovering systematic errors with cross-modal embeddings

    Sabri Eyuboglu, Maya Varma, Khaled Saab, Jean-Benoit Delbrouck, Christopher Lee-Messer, Jared Dunnmon, James Zou, and Christopher Ré. Domino: Discovering systematic errors with cross-modal embeddings. InICLR, 2022

  19. [19]

    Vlmevalkit: An open-source toolkit for evaluating large multi-modality models

    Haodong Duan, Junming Yang, Yuxuan Qiao, Xinyu Fang, Lin Chen, Yuan Liu, Xiaoyi Dong, Yuhang Zang, Pan Zhang, Jiaqi Wang, et al. Vlmevalkit: An open-source toolkit for evaluating large multi-modality models. InProceedings of the 32nd ACM International Conference on Multimedia, pages 11198–11201, 2024

  20. [20]

    McNee, Joseph A

    Cai-Nicolas Ziegler, Sean M. McNee, Joseph A. Konstan, and Georg Lausen. Improving recommendation lists through topic diversification. InProceedings of the 14th International Conference on World Wide Web (WWW), pages 22–32, 2005

  21. [21]

    ContextVLM: Zero-shot and few-shot context understanding for autonomous driving using vision language models.arXiv [cs.CV], August 2024

    Sural Shounak, Naren, and Rajkumar Ragunathan. ContextVLM: Zero-shot and few-shot context understanding for autonomous driving using vision language models.arXiv [cs.CV], August 2024

  22. [22]

    Drama: Joint risk localization and captioning in driving

    Srikanth Malla, Chiho Choi, Isht Dwivedi, Joon Hee Choi, and Jiachen Li. Drama: Joint risk localization and captioning in driving. InProceedings of the IEEE/CVF winter conference on applications of computer vision, pages 1043–1052, 2023

  23. [23]

    Drivingvqa: Analyzing visual chain-of-thought reasoning of vision language models in real- world scenarios with driving theory tests.arXiv preprint arXiv:2501.04671, 2025

    Charles Corbière, Simon Roburin, Syrielle Montariol, Antoine Bosselut, and Alexandre Alahi. Drivingvqa: Analyzing visual chain-of-thought reasoning of vision language models in real- world scenarios with driving theory tests.arXiv preprint arXiv:2501.04671, 2025

  24. [24]

    Lingoqa: Visual question answering for autonomous driving

    Ana-Maria Marcu, Long Chen, Jan Hünermann, Alice Karnsund, Benoit Hanotte, Prajwal Chidananda, Saurabh Nair, Vijay Badrinarayanan, Alex Kendall, Jamie Shotton, et al. Lingoqa: Visual question answering for autonomous driving. InEuropean Conference on Computer Vision, pages 252–269. Springer, 2024

  25. [25]

    Realworldqa.https://x.ai/blog/realworldqa, November 2024

    X.AI. Realworldqa.https://x.ai/blog/realworldqa, November 2024. Blog post

  26. [26]

    Sutd-trafficqa: A question answering benchmark and an efficient network for video reasoning over traffic events

    Li Xu, He Huang, and Jun Liu. Sutd-trafficqa: A question answering benchmark and an efficient network for video reasoning over traffic events. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9878–9888, 2021

  27. [27]

    TUMTraffic- VideoQA: A benchmark for unified spatio-temporal video understanding in traffic scenes.arXiv [cs.CV], February 2025

    Xingcheng Zhou, Konstantinos Larintzakis, Hao Guo, Walter Zimmer, Mingyu Liu, Hu Cao, Jia- jie Zhang, Venkatnarayanan Lakshminarasimhan, Leah Strand, and Alois C Knoll. TUMTraffic- VideoQA: A benchmark for unified spatio-temporal video understanding in traffic scenes.arXiv [cs.CV], February 2025

  28. [28]

    Canadian adverse driving conditions dataset.The Interna- tional Journal of Robotics Research, 40(4-5):681–690, 2021

    Matthew Pitropov, Danson Evan Garcia, Jason Rebello, Michael Smart, Carlos Wang, Krzysztof Czarnecki, and Steven Waslander. Canadian adverse driving conditions dataset.The Interna- tional Journal of Robotics Research, 40(4-5):681–690, 2021

  29. [29]

    Pesotif: A challenging visual dataset for perception sotif problems in long-tail traffic scenarios

    Liang Peng, Jun Li, Wenbo Shao, and Hong Wang. Pesotif: A challenging visual dataset for perception sotif problems in long-tail traffic scenarios. In2023 IEEE Intelligent Vehicles Symposium (IV), pages 1–8. IEEE, 2023

  30. [30]

    Dawn: vehicle detection in adverse weather nature dataset.arXiv preprint arXiv:2008.05402, 2020

    Mourad A Kenk and Mahmoud Hassaballah. Dawn: vehicle detection in adverse weather nature dataset.arXiv preprint arXiv:2008.05402, 2020

  31. [31]

    Zenseact open dataset: A large-scale and diverse multimodal dataset for autonomous driving

    Mina Alibeigi, William Ljungbergh, Adam Tonderski, Georg Hess, Adam Lilja, Carl Lindström, Daria Motorniuk, Junsheng Fu, Jenny Widahl, and Christoffer Petersson. Zenseact open dataset: A large-scale and diverse multimodal dataset for autonomous driving. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 20178–20188, 2023

  32. [32]

    Radiate: A radar dataset for automotive perception in bad weather

    Marcel Sheeny, Emanuele De Pellegrin, Saptarshi Mukherjee, Alireza Ahrabian, Sen Wang, and Andrew Wallace. Radiate: A radar dataset for automotive perception in bad weather. In2021 IEEE International Conference on Robotics and Automation (ICRA), pages 1–7. IEEE, 2021. 11

  33. [33]

    Video weather recognition (varg): An intensity-labeled video weather recognition dataset.Journal of imaging, 10(11):281, 2024

    Himanshu Gupta, Oleksandr Kotlyar, Henrik Andreasson, and Achim J Lilienthal. Video weather recognition (varg): An intensity-labeled video weather recognition dataset.Journal of imaging, 10(11):281, 2024

  34. [34]

    Wedge: A multi-weather autonomous driving dataset built from generative vision-language models

    Aboli Marathe, Deva Ramanan, Rahee Walambe, and Ketan Kotecha. Wedge: A multi-weather autonomous driving dataset built from generative vision-language models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3318–3327, 2023

  35. [35]

    Wilddash-creating hazard-aware benchmarks

    Oliver Zendel, Katrin Honauer, Markus Murschitz, Daniel Steininger, and Gustavo Fernandez Dominguez. Wilddash-creating hazard-aware benchmarks. InProceedings of the European conference on computer vision (ECCV), pages 402–416, 2018

  36. [36]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

  37. [37]

    Introducing GPT-5.2, 12 2025

    OpenAI. Introducing GPT-5.2, 12 2025. Accessed: 2026-03-05

  38. [38]

    Joseph L. Fleiss. Measuring nominal scale agreement among many raters.Psychological Bulletin, 76(5):378–382, 1971

  39. [39]

    Sentence-bert: Sentence embeddings using siamese bert- networks

    Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bert- networks. InProceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP- IJCNLP), pages 3982–3992, 2019

  40. [40]

    Étude comparative de la distribution florale dans une portion des alpes et des jura

    Paul Jaccard. Étude comparative de la distribution florale dans une portion des alpes et des jura. Bulletin de la Société Vaudoise des Sciences Naturelles, 37:547–579, 1901

  41. [41]

    Taxonomy and definitions for terms related to on-road motor vehicle automated driving systems.SAE Standard J, 3016:1, 2014

    SAE On-Road Automated Vehicle Standards Committee et al. Taxonomy and definitions for terms related to on-road motor vehicle automated driving systems.SAE Standard J, 3016:1, 2014

  42. [42]

    Roadwork: A dataset and benchmark for learning to recognize, observe, analyze and drive through work zones

    Anurag Ghosh, Shen Zheng, Robert Tamburo, Khiem Vuong, Juan Alvarez-Padilla, Hailiang Zhu, Michael Cardei, Nicholas Dunn, Christoph Mertz, and Srinivasa G Narasimhan. Roadwork: A dataset and benchmark for learning to recognize, observe, analyze and drive through work zones. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 6...

  43. [43]

    NuPlan: A closed-loop ML-based planning benchmark for autonomous vehicles

    Holger Caesar, Juraj Kabzan, Kok Seang Tan, Whye Kit Fong, Eric Wolff, Alex Lang, Luke Fletcher, Oscar Beijbom, and Sammy Omari. nuplan: A closed-loop ml-based planning benchmark for autonomous vehicles.arXiv preprint arXiv:2106.11810, 2021

  44. [44]

    NVIDIA Alpamayo

    NVIDIA. NVIDIA Alpamayo. https://developer.nvidia.com/drive/alpamayo, 2025. Accessed: 2026-05-06

  45. [45]

    BDD100K: A diverse driving dataset for heterogeneous multitask learning

    Fisher Yu, Haofeng Chen, Xin Wang, Wenqi Xian, Yingying Chen, Fangchen Liu, Vashisht Madhavan, and Trevor Darrell. BDD100K: A diverse driving dataset for heterogeneous multitask learning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2636–2645, 2020

  46. [46]

    environmental, geographical, and time-of-day restrictions, and/or the requisite presence or absence of certain traffic or roadway characteristics

    Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, Yuxiang Chen, Zecheng Tang, Zekai Zhang, Zhengyi Wang, An Yang, Bowen Yu, Chen Cheng, Dayiheng Liu, Deqing Li, Hang Zhang, Hao Meng, Hu Wei, Jingyuan Ni, Kai Chen, Kuan Cao, Liang Peng, Lin Qu, Minggang Wu, Peng Wang, Shuting Yu, Tingkun...

  47. [47]

    Extract predicatesfrom SliceLine’s top-k discovered slices, each with an associated error score

  48. [48]

    Assignsthe score of its best-matching (highest-scoring) predicate

    Expand to full slices:For each candidate full slice s, find all predicates whose constraints are satisfied bys. Assignsthe score of its best-matching (highest-scoring) predicate

  49. [49]

    foggy + night

    Rank by specificity then discovery order:Sort matched slices first by predicate specificity (number of constrained dimensions, descending), then by predicate discovery rank (ascending), then by predicate score (descending), with slice identifier as the final deterministic tie-breaker. This expansion allows SliceLine’s partial-predicate signal to propagate...

  50. [50]

    What driving scenarios outside our current ODD should we prioritize for testing?

  51. [51]

    Identify high-risk conditions we haven’t validated yet

  52. [52]

    What blind spots exist in our current test coverage? 18

  53. [53]

    Suggest untested edge cases that could cause failures in deployment

  54. [54]

    We’re expanding to construction zones - what untested scenarios should we prioritize?

  55. [55]

    What snow and ice scenarios are we missing?

  56. [56]

    What nighttime edge cases should we test?

  57. [57]

    We’re deploying on rural highways - what scenarios haven’t we covered?

  58. [58]

    What tunnel and underpass scenarios need testing?

  59. [59]

    We’re concerned about heavy rain - what wet road scenarios should we prioritize?

  60. [60]

    What rush hour congestion scenarios are undertested?

  61. [61]

    What dawn and dusk visibility scenarios are we missing?

  62. [62]

    Find high-risk untested scenarios and acquire test samples for evaluation

  63. [63]

    Identify missing foggy highway slices, retrieve test images, and evaluate the model

  64. [64]

    Outside - ODD

    Find untested night driving scenarios, generate synthetic test data, and run VLM evaluation. H LLM choose-slices baseline prompts This section reproduces the prompts used by the LLM baseline in Analysis 4. Note that the baseline conditions on acompact text summaryof the seed triage CSV - marginals, support statistics, and aggregate judge scores - not the ...