pith. sign in

arxiv: 2607.02300 · v1 · pith:WYMVJVOKnew · submitted 2026-07-02 · 💻 cs.CV · cs.SE

Search-based Testing of Vision Language Models for In-Car Scene Understanding

Pith reviewed 2026-07-03 15:32 UTC · model grok-4.3

classification 💻 cs.CV cs.SE
keywords search-based testingvision-language modelsin-car scene understandingautomated testingsynthetic data generationautomotive safetyfailure detection
0
0 comments X

The pith

ISU-Test finds up to 10 times more failures in vision-language models for in-car scenes by optimizing synthetic scenario generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ISU-Test, an automated approach that uses rendering to create synthetic in-car scenes and search-based optimization to explore parameters for testing vision-language models. It aims to address the challenge of evaluating these models without relying on costly real-world data collection. By comparing to random generation, the method shows significantly higher rates of detecting erroneous or incomplete outputs in tasks like question answering and captioning. This matters for automotive safety systems that depend on accurate scene understanding from VLMs.

Core claim

ISU-Test frames the testing of in-car scene understanding systems as an optimization problem, systematically modifying scene parameters to generate diverse scenarios that reveal more failures in VLMs than randomized testing.

What carries the argument

ISU-Test, a search-based testing framework that combines synthetic scene rendering with parameter optimization to maximize failure detection in QA and captioning tasks.

Load-bearing premise

The synthetic rendered scenes sufficiently represent real in-car conditions and the defined failure metrics for QA and captioning reliably indicate safety-relevant errors.

What would settle it

Running the same tests on actual recorded in-car data and finding that failure rates and coverage do not correlate with the synthetic results.

Figures

Figures reproduced from arXiv: 2607.02300 by Andrea Stocco, Chen Yang, Ken E. Friedl, Lev Sorokin.

Figure 1
Figure 1. Figure 1: Rendered scene in a vehicle where a loose suitcase [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: (VQA) Prompt used for a VLM in an in-car scene [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Overview of ISU-Test. Embedding Similarity (𝐹emb): The embedding similarity captures the description semantics on the sentence level. We embed the reference and predicted captions using a text encoder, i.e., all-MiniLM-L6-v2, and compute cosine similarity between the encodings. BLEU Score (𝐹bleu): BLEU measures the 𝑛-gram overlap between generated and reference text and allows us to assess the lexical alig… view at source ↗
Figure 4
Figure 4. Figure 4: Caption derived from ground-truth parameters. [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Agreement between real and simulated evaluations. Features: K1 phone_codriver_seat, K2 gender, K3 baby_seat, K4 [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Overview of two different scenes (top/bottom): rendered scene (left), augmented scene using neural style transfer [PITH_FULL_IMAGE:figures/full_fig_p009_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Example of a produced scene description with a low [PITH_FULL_IMAGE:figures/full_fig_p010_7.png] view at source ↗
read the original abstract

In the automotive domain, in-car scene understanding (ISU) enables the detection of safety-critical events, such as driver distraction, and supports drivers or passengers by analyzing the in-car scene and adapting the environment (e.g., ambient lighting). The industry is increasingly exploring vision-language models (VLMs) to interpret camera-recorded in-car scenes and extract information for downstream reasoning tasks. However, VLMs may generate incomplete, erroneous, or misleading scene descriptions, highlighting the need for systematic testing. Collecting real in-vehicle data is costly, difficult to scale, and often infeasible, particularly in early design stages. In this paper, we present ISU-Test, an automated testing approach that combines rendering-based scene generation with search-based testing to evaluate ISU systems. By framing testing as an optimization problem and systematically modifying scene parameters, our method generates diverse in-car scenarios and explores a wide range of configurations. We evaluate ISU-Test on both an industrial prototype and open-source VLMs across two case studies: question answering and captioning, comparing against randomized scenario generation. Results show that ISU-Test significantly outperforms the baseline, achieving up to 10 times higher failure rates and up to 3.6 times higher failure coverage.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces ISU-Test, a search-based testing framework that combines rendering-based synthetic scene generation with optimization to systematically evaluate vision-language models (VLMs) for in-car scene understanding. The approach is applied to question-answering and captioning tasks on both an industrial prototype and open-source VLMs, with results compared against randomized scenario generation; the abstract reports up to 10x higher failure rates and 3.6x higher failure coverage.

Significance. If the failure oracles align with safety-critical errors and the rendered scenes are sufficiently representative, the method could offer a practical, scalable alternative to real-world data collection for VLM testing in automotive settings. The framing of testing as an optimization problem to explore diverse configurations is a clear methodological strength that could generalize to other VLM evaluation domains.

major comments (2)
  1. [Abstract] Abstract: The central empirical claims (up to 10 times higher failure rates and 3.6 times higher failure coverage) rest on failure oracles for QA and captioning whose definitions, measurement procedures, and relation to safety-relevant events are not specified. Without these, it is impossible to determine whether the reported multipliers reflect meaningful improvements or artifacts of the chosen metrics.
  2. [Abstract] Abstract: No information is given on the search algorithm (e.g., the optimization technique, objective function, or termination criteria), the renderer parameter space, or the scene-generation process. These elements are load-bearing for the claim that the method 'systematically modifies scene parameters' and 'explores a wide range of configurations' more effectively than random sampling.
minor comments (1)
  1. [Abstract] The abstract would be strengthened by briefly stating the specific VLMs evaluated and the scale of the experiments (number of generated scenes or runs).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the abstract to incorporate the requested details on failure oracles and the search-based components.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The central empirical claims (up to 10 times higher failure rates and 3.6 times higher failure coverage) rest on failure oracles for QA and captioning whose definitions, measurement procedures, and relation to safety-relevant events are not specified. Without these, it is impossible to determine whether the reported multipliers reflect meaningful improvements or artifacts of the chosen metrics.

    Authors: We agree the abstract lacks sufficient detail on the failure oracles. The full manuscript defines them in the evaluation methodology: for QA, failures are incorrect or incomplete answers to questions about safety-critical elements (e.g., driver distraction); for captioning, failures are omissions or inaccuracies in describing key scene elements. Measurement uses automated string matching and semantic similarity thresholds aligned with automotive safety standards. We will add a concise summary of these definitions and their safety relevance to the abstract. revision: yes

  2. Referee: [Abstract] Abstract: No information is given on the search algorithm (e.g., the optimization technique, objective function, or termination criteria), the renderer parameter space, or the scene-generation process. These elements are load-bearing for the claim that the method 'systematically modifies scene parameters' and 'explores a wide range of configurations' more effectively than random sampling.

    Authors: The manuscript details these in the approach section: the search uses a genetic algorithm with an objective function that maximizes a weighted combination of failure rate and coverage, terminating after a fixed number of generations or convergence; the renderer parameter space covers variables such as lighting conditions, object placements, camera angles, and passenger positions; scene generation employs a physics-based renderer to produce synthetic in-car images. We will include a brief overview of the algorithm, objective, and parameter space in the abstract to support the claims. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The manuscript describes an empirical search-based testing framework (ISU-Test) that generates synthetic scenes via rendering and optimizes for failure detection in VLM QA/captioning tasks, then compares failure rate and coverage against random sampling. No equations, parameter-fitting steps presented as predictions, uniqueness theorems, or self-citation load-bearing arguments appear. The reported multipliers (10x failure rate, 3.6x coverage) are direct empirical outcomes of the described procedure on the chosen oracles and renderer; they do not reduce to the inputs by construction. The paper is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only abstract available; no free parameters, axioms, or invented entities are identifiable from the provided text.

pith-pipeline@v0.9.1-grok · 5750 in / 1026 out tokens · 20407 ms · 2026-07-03T15:32:07.801998+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

36 extracted references · 11 canonical work pages · 7 internal anchors

  1. [1]

    Pioneering in-cabin monitoring: Unmasking the power of 2d and 3d cameras through sensor fusion,

    F. Diederichs and F. IOSB, “Pioneering in-cabin monitoring: Unmasking the power of 2d and 3d cameras through sensor fusion, ” Fraunhofer Institute of Optronics, System Technologies and Image Exploitation (IOSB), Tech. Rep., — 2025, white Paper. [Online]. Available: https://www.iosb.fraunhofer.de/content/ dam/iosb/iosbtest/documents/kompetenzen/bildauswert...

  2. [2]

    In-cabin monitoring system for autonomous vehicles,

    A. Mishra, S. Lee, D. Kim, and S. Kim, “In-cabin monitoring system for autonomous vehicles, ”Sensors, vol. 22, no. 12, 2022. [Online]. Available: https://www.mdpi.com/1424-8220/22/12/4360

  3. [3]

    Assessment protocol – assisted driving: Highways & interurban assist systems. technical bulletin sd 202 – driver monitoring test procedure,

    E. N. C. A. P. E. NCAP), “Assessment protocol – assisted driving: Highways & interurban assist systems. technical bulletin sd 202 – driver monitoring test procedure, ” Euro NCAP, Tech. Rep., Mar. 2025, implementation January

  4. [4]

    Available: https://www.euroncap.com/media/85831/euro-ncap- protocol-assisted-driving-v10.pdf

    [Online]. Available: https://www.euroncap.com/media/85831/euro-ncap- protocol-assisted-driving-v10.pdf

  5. [5]

    Regulation (eu) 2019/2144 on type-approval requirements for motor vehicles,

    European Parliament and Council, “Regulation (eu) 2019/2144 on type-approval requirements for motor vehicles, ” 2019, official Journal of the European Union

  6. [6]

    The european new car assessment programme,

    Euro NCAP, “The european new car assessment programme, ” https://www. euroncap.com/en, 2025, accessed: 2025-10-26

  7. [7]

    Sviro: Synthetic vehicle interior rear seat occupancy

    S. D. D. Cruz, O. Wasenm ¨uller, H.-P. Beise, T. Stifter, and D. Stricker, “Sviro: Synthetic vehicle interior rear seat occupancy. ” TIB, dec 2024. [Online]. Available: https://service.tib.eu/ldmservice/dataset/sviro--synthetic- vehicle-interior-rear-seat-occupancy

  8. [8]

    Expressive body capture: 3D hands, face, and body from a single image,

    G. Pavlakos, V. Choutas, N. Ghorbani, T. Bolkart, A. A. A. Osman, D. Tzionas, and M. J. Black, “Expressive body capture: 3D hands, face, and body from a single image, ” inProceedings IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 10 975–10 985

  9. [9]

    STELLAR: A search-based testing framework for large language model applications,

    L. Sorokin, I. Vasilev, K. E. Friedl, and A. Stocco, “STELLAR: A search-based testing framework for large language model applications, ” inProceedings of the 33rd IEEE International Conference on Software Analysis, Evolution and Reengineering. IEEE, 2026

  10. [10]

    [Online]

    Blender Foundation,Shrinkwrap Modifier, Blender Foundation, 2023, blender 3.6 Manual, accessed 2026-04-23. [Online]. Available: https://docs.blender.org/ manual/id/3.6/modeling/modifiers/deform/shrinkwrap.html

  11. [11]

    Bleu: a method for automatic evaluation of machine translation,

    K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “Bleu: a method for automatic evaluation of machine translation, ” inProceedings of the 40th Annual Meeting on Association for Computational Linguistics, ser. ACL ’02. USA: Association for Computational Linguistics, 2002, p. 311–318

  12. [12]

    Meteor: an automatic metric for mt evaluation with high levels of correlation with human judgments,

    A. Lavie and A. Agarwal, “Meteor: an automatic metric for mt evaluation with high levels of correlation with human judgments, ” inProceedings of the Second Workshop on Statistical Machine Translation, ser. StatMT ’07. USA: Association for Computational Linguistics, 2007, p. 228–231

  13. [13]

    BERTScore: Evaluating Text Generation with BERT

    T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi, “Bertscore: Evaluating text generation with bert, ”arXiv preprint arXiv:1904.09675, 2019

  14. [14]

    Error detecting and error correcting codes,

    R. W. Hamming, “Error detecting and error correcting codes, ”Bell System Techni- cal Journal, vol. 29, no. 2, pp. 147–160, 1950

  15. [15]

    Testing of deep reinforcement learning agents with surrogate models,

    M. Biagiola and P. Tonella, “Testing of deep reinforcement learning agents with surrogate models, ”ACM Trans. Softw. Eng. Methodol., vol. 33, no. 3, 2024

  16. [16]

    Test set diameter: Quantifying the diversity of sets of test cases,

    R. Feldt, S. Poulding, D. Clark, and S. Yoo, “Test set diameter: Quantifying the diversity of sets of test cases, ” in2016 IEEE International Conference on Software Testing, Verification and Validation (ICST), 2016

  17. [17]

    Replication package,

    Anonymous, “Replication package, ” https://figshare.com/s/ cb5b0eae0411e54b1bbd

  18. [18]

    Drive&act: A multi-modal dataset for fine-grained driver behavior recognition in autonomous vehicles,

    M. Martin, A. Roitberg, M. Haurilet, M. Horne, S. Reiß, M. Voit, and R. Stiefelhagen, “Drive&act: A multi-modal dataset for fine-grained driver behavior recognition in autonomous vehicles, ” in2019 IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 2801–2810

  19. [19]

    Deep Visual-Semantic Alignments for Generating Image Descriptions

    A. Karpathy and L. Fei-Fei, “Deep visual-semantic alignments for generating image descriptions, ” 2015. [Online]. Available: https://arxiv.org/abs/1412.2306

  20. [20]

    Show and Tell: A Neural Image Caption Generator

    O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, “Show and tell: A neural image caption generator, ” 2015. [Online]. Available: https://arxiv.org/abs/1411.4555

  21. [21]

    VQA: Visual Question Answering

    A. Agrawal, J. Lu, S. Antol, M. Mitchell, C. L. Zitnick, D. Batra, and D. Parikh, “Vqa: Visual question answering, ” 2016. [Online]. Available: https://arxiv.org/abs/1505.00468

  22. [22]

    Automatic Description Generation from Images: A Survey of Models, Datasets, and Evaluation Measures

    R. Bernardi, R. Cakici, D. Elliott, A. Erdem, E. Erdem, N. Ikizler-Cinbis, F. Keller, A. Muscat, and B. Plank, “Automatic description generation from images: A survey of models, datasets, and evaluation measures, ” 2017. [Online]. Available: https://arxiv.org/abs/1601.03896

  23. [23]

    CIDEr: Consensus-based Image Description Evaluation

    R. Vedantam, C. L. Zitnick, and D. Parikh, “Cider: Consensus-based image description evaluation, ” 2015. [Online]. Available: https://arxiv.org/abs/1411.5726

  24. [24]

    Re-evaluating Automatic Metrics for Image Captioning

    M. Kilickaya, A. Erdem, N. Ikizler-Cinbis, and E. Erdem, “Re-evaluating automatic metrics for image captioning, ” 2016. [Online]. Available: https: //arxiv.org/abs/1612.07600

  25. [25]

    Individual comparisons by ranking methods,

    F. Wilcoxon, “Individual comparisons by ranking methods, ”Biometrics Bulletin, vol. 1, no. 6, 1945

  26. [26]

    A critique and improvement of the

    A. Vargha and H. D. Delaney, “A critique and improvement of the "cl" common language effect size statistics of mcgraw and wong, ”Journal of Educational and Behavioral Statistics, vol. 25, no. 2, pp. 101–132, 2000. [Online]. Available: http://www.jstor.org/stable/1165329

  27. [27]

    Image style transfer using convolutional neural networks,

    L. A. Gatys, A. S. Ecker, and M. Bethge, “Image style transfer using convolutional neural networks, ”Journal of Vision, vol. 16, no. 12, p. 326, 2016

  28. [28]

    Assessing quality metrics for neural reality gap input mitigation in autonomous driving testing,

    S. C. Lambertenghi and A. Stocco, “Assessing quality metrics for neural reality gap input mitigation in autonomous driving testing, ” inProceedings of 17th IEEE International Conference on Software Testing, Verification and Validation, ser. ICST ’24, 2024

  29. [29]

    Nano banana ai image generator,

    Nanobana, “Nano banana ai image generator, ” https://www.nanobana.net/, 2026, aI-based image generation and editing platform

  30. [30]

    Nvidia cosmos: World foundation models for physical ai,

    NVIDIA Corporation, “Nvidia cosmos: World foundation models for physical ai, ” https://www.nvidia.com/en-us/ai/cosmos/, 2026, open platform with world foundation models and data processing for robotics, autonomous systems, and physical-AI research

  31. [31]

    Efficient domain augmentation for autonomous driving testing using diffusion models,

    L. Baresi, D. Y. Xian Hu, A. Stocco, and P. Tonella, “Efficient domain augmentation for autonomous driving testing using diffusion models, ” in2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE), 2025, pp. 398–410

  32. [32]

    Advances in diffusion models for image data augmentation: A review of methods, models, evaluation metrics and future research directions,

    P. Alimisis, I. Mademlis, P. Radoglou-Grammatikis, P. Sarigiannidis, and G. T. Papadopoulos, “Advances in diffusion models for image data augmentation: A review of methods, models, evaluation metrics and future research directions, ”Artificial Intelligence Review, vol. 58, p. 112, 2025. [Online]. Available: https://doi.org/10.1007/s10462-025-11116-x

  33. [33]

    Ticam: A time-of-flight in-car cabin monitoring dataset,

    J. S. Katrolia, A. El-Sherif, H. Feld, B. Mirbach, J. R. Rambach, and D. Stricker, “Ticam: A time-of-flight in-car cabin monitoring dataset, ” in32nd British Machine Vision Conference 2021, BMVC 2021, Online, November 22-25,

  34. [34]

    BMVA Press, 2021, p. 277. [Online]. Available: https://www.bmvc2021- virtualconference.com/assets/papers/0701.pdf

  35. [35]

    Search-based dnn testing and retraining with gan-enhanced simulations,

    M. O. Attaoui, F. Pastore, and L. C. Briand, “Search-based dnn testing and retraining with gan-enhanced simulations, ”IEEE Trans. Softw. Eng., vol. 51, no. 4, p. 1086–1103, Apr. 2025. [Online]. Available: https://doi.org/10.1109/TSE.2025. 3540549

  36. [36]

    Automatic test suite generation for key-points detection dnns using many-objective search (experience paper),

    F. U. Haq, D. Shin, L. C. Briand, T. Stifter, and J. Wang, “Automatic test suite generation for key-points detection dnns using many-objective search (experience paper), ” inProceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis, ser. ISSTA ’21. ACM, Jul. 2021, p. 91–102. [Online]. Available: http://dx.doi.org/10.1145...