pith. machine review for the scientific record.

arxiv: 2605.07649 · v2 · submitted 2026-05-08 · 💻 cs.CV · cs.AI· cs.RO

Recognition: 2 theorem links · Lean Theorem

Operating Within the Operational Design Domain: Zero-Shot Perception with Vision-Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 03:30 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.RO
keywords vision-language models · zero-shot perception · operational design domain · autonomous driving · prompt engineering · chain-of-thought · safety-critical systems · ODD classification

The pith

Vision-language models can serve as zero-shot sensors for operational design domain elements in autonomous driving when using definition-anchored prompting.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether vision-language models can identify and classify elements of the Operational Design Domain without any task-specific training data. It evaluates four VLMs on both classification and detection tasks using a custom dataset plus Mapillary Vistas images, while testing multiple prompting approaches and documenting failure modes. The strongest results come from prompts that anchor responses to explicit ODD definitions, apply chain-of-thought reasoning, and assign a domain-expert persona. This setup matters because many safety regulations for automated vehicles require verifiable perception of the exact conditions under which the vehicle may operate. The authors also release reusable prompt templates to support adaptation as definitions change.

Core claim

VLMs can act as adaptable zero-shot ODD sensors for Automated Driving Systems. Empirical tests show that definition-anchored chain-of-thought prompting with persona decomposition delivers the highest performance on both classification and detection, whereas other zero-shot strategies often lower recall. The study includes failure analyses across the custom dataset and Mapillary Vistas and supplies a set of prompting templates with adaptation guidance, indicating that such models support transparent perception needed for safety-critical use and regulatory auditing.

What carries the argument

Definition-anchored chain-of-thought prompting with persona decomposition, which fixes the model's attention to explicit ODD definitions, forces sequential reasoning steps, and assigns an expert role to improve consistency and accuracy in image-based classification and detection.
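The three ingredients named above can be sketched as a prompt-assembly step. The template below is an illustrative reconstruction, not the authors' released template: the persona wording, the example definition text, and the element name "rain" are all placeholder assumptions.

```python
# Illustrative sketch of definition-anchored chain-of-thought prompting
# with a persona. The wording is hypothetical, not the paper's template.

ODD_DEFINITIONS = {
    # Placeholder definition text; a real system would quote the exact
    # taxonomy entry (e.g. from ISO 34503 or PAS 1883).
    "rain": "Precipitation as liquid water drops visible on the road "
            "surface, windshield, or in the air.",
}

def build_prompt(element: str) -> str:
    """Assemble a zero-shot classification prompt for one ODD element."""
    definition = ODD_DEFINITIONS[element]
    return (
        # Persona: anchor the model in a domain-expert role.
        "You are an expert in automated-driving safety assessment.\n"
        # Definition anchoring: restate the exact ODD definition.
        f"Definition of '{element}': {definition}\n"
        # Chain-of-thought: force explicit reasoning before the verdict.
        "Step 1: Describe the relevant visual evidence in the image.\n"
        "Step 2: Compare the evidence against the definition above.\n"
        f"Step 3: Answer YES or NO: is '{element}' present?"
    )

prompt = build_prompt("rain")
```

Because the definition is injected at prompt time rather than baked into training data, updating a revised ODD definition is a one-line change to the dictionary, which is the adaptability the paper emphasizes.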

If this is right

  • VLMs allow perception systems to update quickly when ODD definitions are revised by regulators.
  • Prompt-based methods reduce the need for collecting and labeling new training data for each ODD variant.
  • Transparent reasoning traces from the model support auditing and explainability requirements in safety cases.
  • Reusable templates lower the effort required to adapt the approach across different vehicle platforms and regions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Hybrid systems that combine these VLMs with existing geometric or rule-based ODD checkers could raise overall reliability.
  • The prompting approach may transfer to other regulated perception domains such as medical device imaging or industrial inspection.
  • Long-term deployment would benefit from continuous monitoring of prompt drift as VLM versions update.
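The last point could be operationalized as a regression gate run whenever the underlying VLM version changes. The sketch below is a hypothetical monitor, not anything from the paper; the group names, recall values, and 0.05 tolerance are illustrative assumptions.

```python
# Hypothetical prompt-drift check: compare per-group recall between two
# VLM versions and flag any ODD group whose recall drops beyond `tol`.

def recall(tp: int, fn: int) -> float:
    """Recall from true-positive and false-negative counts."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def drifted_groups(old: dict, new: dict, tol: float = 0.05) -> list:
    """Return ODD groups whose recall fell by more than `tol`."""
    return [g for g in old if old[g] - new.get(g, 0.0) > tol]

# Illustrative per-group recall from two (hypothetical) model versions.
old = {"weather": 0.91, "road_type": 0.88, "lighting": 0.95}
new = {"weather": 0.92, "road_type": 0.80, "lighting": 0.94}
flags = drifted_groups(old, new)   # → ["road_type"]
```

A gate like this would run in CI against a fixed evaluation set, blocking a model-version bump when any safety-relevant ODD group regresses.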

Load-bearing premise

Results measured on the custom dataset and Mapillary Vistas will hold in real-world, safety-critical driving conditions without further fine-tuning or verification steps.

What would settle it

Running the best prompting method on a new, independent real-world driving dataset and finding recall or accuracy substantially below levels needed for regulatory ODD compliance would falsify the claim.
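Concretely, that falsification test amounts to computing recall on an independent dataset and comparing it against a compliance bar. The sketch below is a minimal version of that check; the 0.95 threshold and the toy predictions are placeholder assumptions, since neither the paper nor current regulation fixes a number.

```python
# Sketch of the falsification test: recall of the best prompting method
# on an independent dataset vs. a (hypothetical) compliance threshold.

def recall_score(preds: list, labels: list) -> float:
    """Recall over binary present/absent labels for one ODD element."""
    tp = sum(1 for p, y in zip(preds, labels) if p and y)
    fn = sum(1 for p, y in zip(preds, labels) if not p and y)
    return tp / (tp + fn) if (tp + fn) else 0.0

COMPLIANCE_RECALL = 0.95   # placeholder bar; not from the paper

preds  = [1, 1, 0, 1, 0, 1, 1, 0]   # toy model outputs
labels = [1, 1, 1, 1, 0, 1, 0, 0]   # toy ground truth
r = recall_score(preds, labels)          # 4 TP, 1 FN → 0.8
claim_falsified = r < COMPLIANCE_RECALL  # True under these toy numbers
```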

Figures

Figures reproduced from arXiv: 2605.07649 by Berkehan Ünal, Dierend Hauke, Fazlija Dren, Plachetka Christopher.

Figure 1. High-level overview of our custom dataset ODD-TA.
Figure 2. Structure of the chained prompting pipeline as used by …
Figure 3. Performance of different zero-shot optimization strategies in relation …
Figure 4. Per-Group Recall Advantage: GPT-4o vs. Gemini 2.5 Pro (Positive = …)
read the original abstract

Over the last few years, research on autonomous systems has matured to such a degree that the field is increasingly well-positioned to translate research into practical, stakeholder-driven use cases across well-defined domains. However, for a wide-scale practical adoption of autonomous systems, adherence to safety regulations is crucial. Many regulations are influenced by the Operational Design Domain (ODD), which defines the specific conditions in which an autonomous agent can function. This is especially relevant for Automated Driving Systems (ADS), as a dependable perception of ODD elements is essential for safe implementation and auditing. Vision-language models (VLMs) integrate visual recognition and language reasoning, functioning without task-specific training data, which makes them suitable for adaptable ODD perception. To assess whether VLMs can function as zero-shot "ODD sensors" that adapt to evolving definitions, we contribute (i) an empirical study of zero-shot ODD classification and detection using four VLMs on a custom dataset and Mapillary Vistas, along with failure analyses; (ii) an ablation of zero-shot optimization strategies with a cost-performance overview; and (iii) a suite of reusable prompting templates with guidance for adaptation. Our findings indicate that definition-anchored chain-of-thought prompting with persona decomposition performs best, while other methods may result in reduced recall. Overall, our results pave the way for transparent and effective ODD-based perception in safety-critical applications.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper conducts an empirical study of zero-shot ODD classification and detection using four vision-language models on a custom dataset and Mapillary Vistas. It ablates prompting strategies including definition-anchored chain-of-thought with persona decomposition, provides failure analyses, and releases reusable prompting templates. The central finding is that definition-anchored CoT prompting with persona decomposition achieves the best performance, while other methods can reduce recall, with the work positioned as enabling transparent ODD perception for safety-critical ADS applications.

Significance. If the empirical results hold under broader validation, the work would provide a valuable, training-free approach to adapting perception to evolving regulatory ODD definitions, which is relevant for ADS auditing and compliance. The ablation with cost-performance trade-offs and the reusable templates are concrete strengths that support reproducibility and practical adoption. The inclusion of failure analyses adds transparency by identifying where zero-shot VLMs may fall short.

major comments (2)
  1. [Abstract] The claim that the results 'pave the way for transparent and effective ODD-based perception in safety-critical applications' is not supported by the evaluated regimes. The experiments rely on one custom dataset plus Mapillary Vistas and contain no quantitative comparisons against fine-tuned or supervised baselines, nor evaluation on real ADS logs or regulatory ODD definitions.
  2. [Empirical study and ablation] Generalization to real-world safety-critical scenarios remains untested. The datasets do not necessarily capture rare edge cases, sensor noise, temporal consistency, or the precise regulatory ODD specifications required for ADS auditing, leaving the assertion that zero-shot capabilities suffice without additional verification unsupported.
minor comments (2)
  1. [Methods] The description of the four VLMs and their exact versions or inference settings would benefit from additional detail to support reproducibility of the reported results.
  2. [Prompting templates] The prompting templates are described as reusable, but placing at least one full example in the main text (rather than only in supplementary material) would improve accessibility for readers.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We agree that the abstract claim and the framing of generalization require tempering to accurately reflect the empirical scope of the study. We will make targeted revisions to the abstract, discussion, and limitations sections without altering the reported results or contributions. Point-by-point responses follow.

read point-by-point responses
  1. Referee: [Abstract] The claim that the results 'pave the way for transparent and effective ODD-based perception in safety-critical applications' is not supported by the evaluated regimes. The experiments rely on one custom dataset plus Mapillary Vistas and contain no quantitative comparisons against fine-tuned or supervised baselines, nor evaluation on real ADS logs or regulatory ODD definitions.

    Authors: We acknowledge that the original abstract language overstates the direct applicability of our findings to deployed safety-critical systems. The study is an initial empirical evaluation of zero-shot VLM performance for ODD classification and detection, with the strongest results obtained via definition-anchored chain-of-thought prompting with persona decomposition. We will revise the abstract to state that the results 'provide initial evidence toward the potential of transparent and adaptable zero-shot ODD perception' rather than claiming they 'pave the way for' safety-critical applications. We will add a dedicated limitations paragraph noting the absence of quantitative comparisons to fine-tuned baselines (our emphasis is on training-free adaptability to evolving definitions) and the lack of evaluation on proprietary ADS logs or exact regulatory ODD specifications. These changes will be textual only. revision: yes

  2. Referee: [Empirical study and ablation] Generalization to real-world safety-critical scenarios remains untested. The datasets do not necessarily capture rare edge cases, sensor noise, temporal consistency, or the precise regulatory ODD specifications required for ADS auditing, leaving the assertion that zero-shot capabilities suffice without additional verification unsupported.

    Authors: We agree that the evaluated datasets, while covering diverse urban and environmental conditions from Mapillary Vistas and our custom collection, do not exhaustively represent rare edge cases, sensor noise, temporal dynamics, or precise regulatory ODD criteria. The manuscript already includes failure analyses highlighting where zero-shot performance degrades; we will expand the discussion and add an explicit limitations subsection to state that these results do not demonstrate sufficiency for ADS auditing and that further validation on real-world logs and regulatory definitions is required. The core claim will be reframed as an empirical demonstration of prompting effectiveness rather than an assertion that zero-shot methods suffice without verification. No new data collection or experiments are needed for these clarifications. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical evaluation on external datasets with no self-referential derivations or fitted predictions

full rationale

The paper presents an empirical study of zero-shot ODD classification and detection using VLMs on a custom dataset and Mapillary Vistas, plus an ablation of prompting strategies. All claims rest on reported experimental results and failure analyses rather than any derivation chain, equations, or self-citations that reduce to inputs by construction. No self-definitional elements, fitted inputs labeled as predictions, or load-bearing self-citations appear in the provided text. The central findings (e.g., definition-anchored CoT with persona decomposition performing best) are directly tied to observed performance metrics on the evaluated data, satisfying the criteria for a self-contained, non-circular empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the empirical performance of pre-trained VLMs rather than new theoretical constructs; no free parameters are fitted, and no new entities are introduced.

axioms (1)
  • domain assumption Pre-trained vision-language models can perform zero-shot classification and detection of ODD elements when given appropriate prompts.
    This assumption underpins the entire zero-shot approach and is invoked throughout the abstract's description of VLM suitability.

pith-pipeline@v0.9.0 · 5567 in / 1213 out tokens · 64647 ms · 2026-05-12T03:30:21.895346+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages · 4 internal anchors

  1. [1]

    SAE International, 2014

On-Road Automated Driving (ORAD) Committee, Taxonomy and Definitions for Terms Related to Driving Automation Systems for On-Road Motor Vehicles (J3016_202104). SAE International, 2014

  2. [2]

    Addressing systemic risks in autonomous maritime navigation: A structured stpa and odd-based methodology,

T. Nakashima, R. Kureta, and S. Khastgir, “Addressing systemic risks in autonomous maritime navigation: A structured STPA and ODD-based methodology,” Reliability Engineering and System Safety, vol. 261, no. C, 2025. [Online]. Available: https://ideas.repec.org/a/eee/reensy/v261y2025ics095183202500242x.html

  3. [3]

    From operational design domain to runtime monitoring of ai-based aviation systems,

C. Torens, S. Gupta, N. Roy, J. Sprockhoff, and U. Durak, “From operational design domain to runtime monitoring of AI-based aviation systems,” in 2024 AIAA DATC/IEEE 43rd Digital Avionics Systems Conference (DASC). IEEE, 2024, pp. 1–9

  4. [4]

    Autonomous driving’s future: Convenient and connected,

McKinsey & Company, “Autonomous driving’s future: Convenient and connected,” January 2023, accessed: 2025-09-01. [Online]. Available: https://www.mckinsey.com/industries/automotive-and-assembly/our-insights/autonomous-drivings-future-convenient-and-connected

  5. [5]

    Public perception of connected and automated vehicles: Benefits, concerns, and barriers from an australian perspective,

A. Matin and H. Dia, “Public perception of connected and automated vehicles: Benefits, concerns, and barriers from an Australian perspective,” Journal of Intelligent and Connected Vehicles, vol. 7, no. 2, pp. 108–128, 2024. [Online]. Available: https://www.sciopen.com/article/10.26599/JICV.2023.9210028

  6. [6]

International Organization for Standardization, ISO 34503:2023 - Road vehicles - Test scenarios for automated driving systems - Taxonomy for Operational Design Domain (ODD), International Organization for Standardization Std., 2023, available from https://www.iso.org/standard/78952.html

  7. [7]

PAS 1883 - Operational design domain (ODD) taxonomy for ADS specification,

    The British Standards Institution, “PAS 1883 - Operational design domain (ODD) taxonomy for ADS specification,” Available from BSI Group, London, United Kingdom, 2020, published by BSI Standards Limited. First published August 2020

  8. [8]

Nationale Betriebserlaubnis für Kraftfahrzeuge mit autonomer Fahrfunktion,

    Kraftfahrt-Bundesamt, “Nationale Betriebserlaubnis für Kraftfahrzeuge mit autonomer Fahrfunktion,” 2023. [Online]. Available: https://www.kba.de/DE/Themen/Typgenehmigung/Autonomes_automatisiertes_Fahren/nationale_Betriebserlaubnis/nationale_betriebserlaubnis_node.html

  9. [9]

International Organization for Standardization, ISO 21448:2022 - Road vehicles — Safety of the intended functionality, International Organization for Standardization Std., 2022, available from https://www.iso.org/standard/70939.html

  10. [10]

    Towards an operational design domain that supports the safety argumentation of an automated driving system,

M. Gyllenhammar, R. Johansson, F. Warg, D. Chen, H.-M. Heyn, M. Sanfridson, J. Söderberg, A. Thorsén, and S. Ursing, “Towards an operational design domain that supports the safety argumentation of an automated driving system,” in 10th European Congress on Embedded Real Time Systems (ERTS 2020), 2020

  11. [11]

On the future of transportation in an era of automated and autonomous vehicles,

    P. A. Hancock, I. Nourbakhsh, and J. Stewart, “On the future of transportation in an era of automated and autonomous vehicles,” Proceedings of the National Academy of Sciences, vol. 116, no. 16, pp. 7684–7691, 2019

  12. [12]

    Proud—public road urban driverless-car test,

A. Broggi, P. Cerri, S. Debattisti, M. C. Laghi, P. Medici, D. Molinari, M. Panciroli, and A. Prioletti, “Proud—public road urban driverless-car test,” IEEE Transactions on Intelligent Transportation Systems, vol. 16, no. 6, pp. 3508–3519, 2015

  13. [13]

    ISO 34505:2025 – Road vehicles – Test scenarios for automated driving systems – Scenario evaluation and test case generation,

    International Organization for Standardization, “ISO 34505:2025 – Road vehicles – Test scenarios for automated driving systems – Scenario evaluation and test case generation,” International Standard, Geneva, Switzerland, 2025, accessed 2025-08-14. [Online]. Available: https://www.iso.org/standard/78954.html

  14. [14]

    Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing,

P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, and G. Neubig, “Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing,” ACM Comput. Surv., vol. 55, no. 9, Jan. 2023

  15. [15]
  16. [16]

    Language models are few-shot learners,

T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei...

  17. [17]

[Online]. Available: https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf

  18. [18]

    Chain-of-thought prompting elicits reasoning in large language models,

J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. H. Chi, Q. V. Le, and D. Zhou, “Chain-of-thought prompting elicits reasoning in large language models,” in Proceedings of the 36th International Conference on Neural Information Processing Systems, ser. NIPS ’22. Red Hook, NY, USA: Curran Associates Inc., 2022

  19. [19]

    Automatic Chain of Thought Prompting in Large Language Models

Z. Zhang, A. Zhang, M. Li, and A. Smola, “Automatic chain of thought prompting in large language models,” arXiv preprint arXiv:2210.03493, 2022

  20. [20]

    Self-Consistency Improves Chain of Thought Reasoning in Language Models

    X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowdhery, and D. Zhou, “Self-consistency improves chain of thought reasoning in language models,” 2023. [Online]. Available: https://arxiv.org/abs/2203.11171

  21. [21]

    Two tales of persona in LLMs: A survey of role-playing and personalization,

Y.-M. Tseng, Y.-C. Huang, T.-Y. Hsiao, W.-L. Chen, C.-W. Huang, Y. Meng, and Y.-N. Chen, “Two tales of persona in LLMs: A survey of role-playing and personalization,” in Findings of the Association for Computational Linguistics: EMNLP 2024, Y. Al-Onaizan, M. Bansal, and Y.-N. Chen, Eds. Miami, Florida, USA: Association for Computational Linguistics,...

  22. [22]

    Picle: Eliciting diverse behaviors from large language models with persona in-context learning,

H. K. Choi and Y. Li, “Picle: Eliciting diverse behaviors from large language models with persona in-context learning,” in International Conference on Machine Learning, 2024

  23. [23]

    Helpful assistant or fruitful facilitator? investigating how personas affect language model behavior,

P. H. Luz de Araujo and B. Roth, “Helpful assistant or fruitful facilitator? Investigating how personas affect language model behavior,” PLoS ONE, vol. 20, no. 6, p. e0325664, 2025

  24. [24]

    In-context impersonation reveals large language models’ strengths and biases,

L. Salewski, S. Alaniz, I. Rio-Torto, E. Schulz, and Z. Akata, “In-context impersonation reveals large language models’ strengths and biases,” Advances in Neural Information Processing Systems, vol. 36, pp. 72044–72057, 2023

  25. [25]

Retrieval-augmented generation for knowledge-intensive NLP tasks,

    P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel et al., “Retrieval-augmented generation for knowledge-intensive NLP tasks,” Advances in Neural Information Processing Systems, vol. 33, pp. 9459–9474, 2020

  26. [26]

    Chain-of-verification reduces hallucination in large language models,

S. Dhuliawala, M. Komeili, J. Xu, R. Raileanu, X. Li, A. Celikyilmaz, and J. Weston, “Chain-of-verification reduces hallucination in large language models,” in Findings of the Association for Computational Linguistics: ACL 2024, 2024, pp. 3563–3578

  27. [27]

    Chain-of-note: Enhancing robustness in retrieval-augmented language models,

W. Yu, H. Zhang, X. Pan, P. Cao, K. Ma, J. Li, H. Wang, and D. Yu, “Chain-of-note: Enhancing robustness in retrieval-augmented language models,” in Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024, pp. 14672–14685

  28. [28]

    Contextvlm: Zero-shot and few-shot context understanding for autonomous driving using vision language models,

S. Sural, Naren, and R. R. Rajkumar, “Contextvlm: Zero-shot and few-shot context understanding for autonomous driving using vision language models,” in 2024 IEEE 27th International Conference on Intelligent Transportation Systems (ITSC), 2024, pp. 468–475

  29. [29]

    Fine-grained evaluation of large vision-language models in autonomous driving,

Y. Li, M. Tian, Z. Lin, J. Zhu, D. Zhu, H. Liu, Z. Wang, Y. Zhang, Z. Xiong, and X. Zhao, “Fine-grained evaluation of large vision-language models in autonomous driving,” arXiv preprint arXiv:2503.21505, 2025

  30. [30]

    Lego co-builder: Exploring fine-grained vision-language modeling for multimodal lego assembly assistants,

    H. Huang, J. Pei, M. Aliannejadi, X. Sun, M. Ahsan, C. Yu, Z. Ren, P. Cesar, and J. Wang, “Lego co-builder: Exploring fine-grained vision-language modeling for multimodal lego assembly assistants,”

  31. [31]
  32. [32]

Vision-language models in autonomous driving: A survey and outlook,

    X. Zhou, M. Liu, E. Yurtsever, B. L. Žagar, W. Zimmer, H. Cao, and A. C. Knoll, “Vision-language models in autonomous driving: A survey and outlook,” IEEE Transactions on Intelligent Vehicles, pp. 1–20, 2024, early access. [Online]. Available: https://doi.org/10.1109/TIV.2024.3402136

  33. [33]

Foundation models in autonomous driving: A survey on scenario generation and scenario analysis,

    Y. Gao et al., “Foundation models in autonomous driving: A survey on scenario generation and scenario analysis,” arXiv preprint arXiv:2506.11526, 2025

  34. [34]

    Multimodal large language model driven scenario testing for autonomous vehicles,

Q. Lu, X. Wang, Y. Jiang, G. Zhao, M. Ma, and S. Feng, “Multimodal large language model driven scenario testing for autonomous vehicles,” arXiv preprint arXiv:2409.06450, 2024

  35. [35]

WEDGE: A multi-weather autonomous driving dataset built from generative vision-language models,

    A. Marathe, D. Ramanan, R. Walambe, and K. Kotecha, “WEDGE: A multi-weather autonomous driving dataset built from generative vision-language models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2023

  36. [36]

SafeAuto: Knowledge-enhanced safe autonomous driving with multimodal foundation models,

    J. Zhang, X. Yang, T. Wang, Y. Yao, A. Petiushko, and B. Li, “SafeAuto: Knowledge-enhanced safe autonomous driving with multimodal foundation models,” 2025. [Online]. Available: https://arxiv.org/abs/2503.00211

  37. [37]

    GPT-4o System Card

A. Hurst, A. Lerer, A. P. Goucher, A. Perelman, A. Ramesh, A. Clark, A. Ostrow, A. Welihinda, A. Hayes, A. Radford et al., “GPT-4o system card,” arXiv preprint arXiv:2410.21276, 2024

  38. [38]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

G. Comanici, E. Bieber, M. Schaekermann, I. Pasupat, N. Sachdeva, I. Dhillon, M. Blistein, O. Ram, D. Zhang, E. Rosen et al., “Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities,” arXiv preprint arXiv:2507.06261, 2025

  39. [39]

    Introducing llama 4: Advancing multimodal intelligence,

    Meta AI, “Introducing llama 4: Advancing multimodal intelligence,” Meta AI Blog, 2025, accessed: 2025-09-08. [Online]. Available: https://ai.meta.com/blog/llama-4-multimodal-intelligence/

  40. [40]

    Molmo and pixmo: Open weights and open data for state-of-the-art vision-language models,

M. Deitke, C. Clark, S. Lee, R. Tripathi, Y. Yang, J. S. Park, M. Salehi, N. Muennighoff, K. Lo, L. Soldaini et al., “Molmo and pixmo: Open weights and open data for state-of-the-art vision-language models,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 91–104

  41. [41]

    The mapillary vistas dataset for semantic understanding of street scenes,

G. Neuhold, T. Ollmann, S. R. Bulò, and P. Kontschieder, “The mapillary vistas dataset for semantic understanding of street scenes,” in 2017 IEEE International Conference on Computer Vision (ICCV), 2017, pp. 5000–5009

  42. [42]

    Why is spatial reasoning hard for vlms? an attention mechanism perspective on focus areas,

S. Chen, T. Zhu, R. Zhou, J. Zhang, S. Gao, J. C. Niebles, M. Geva, J. He, J. Wu, and M. Li, “Why is spatial reasoning hard for vlms? An attention mechanism perspective on focus areas,” in International Conference on Machine Learning. PMLR, 2025, pp. 9910–9932