GeoDrive-Bench: Benchmarking Region-Specific Multimodal Reasoning in Autonomous Driving

Chaowei Xiao; Ming Jiang; Yingzi Ma

arxiv: 2606.02774 · v1 · pith:4W7WAU6Nnew · submitted 2026-06-01 · 💻 cs.CV

Pith reviewed 2026-06-28 14:43 UTC · model grok-4.3

keywords: vision-language models, autonomous driving, benchmark, region-specific reasoning, multimodal reasoning, traffic conventions, geo-cultural differences, driving tasks

The pith

Vision-language models for autonomous driving perform inconsistently across regions because they lack robust awareness of local traffic conventions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates GeoDrive-Bench, a set of 5,053 multiple-choice questions drawn from six countries, to test whether vision-language models can choose correct driving actions using only visual scenes plus unspoken local rules. Questions cover perception, prediction, planning, and region reasoning and deliberately omit any country names so models must infer the right behavior from the image alone. The authors also present a distillation method that transfers region-specific traffic knowledge into a model's internal representations. When nine existing models are tested, their accuracy fluctuates sharply from one driving culture to another, while the authors' adapted baselines improve across the board. A sympathetic reader would care because any model deployed worldwide must handle these hidden regional differences or risk unsafe decisions.

Core claim

GeoDrive-Bench supplies 5,053 human-validated multiple-choice questions across six countries, each requiring a model to combine visual evidence with implicit local traffic conventions to select the correct answer in one of four task categories (perception, prediction, planning, and region reasoning); no country label is provided. A distillation algorithm is introduced that injects region-specific traffic-rule knowledge directly into the model's representations. Experiments on nine state-of-the-art VLMs reveal large accuracy gaps between geo-driving cultures on every task, while the authors' baseline models show measurable gains in cross-region performance, indicating that present VLMs do not yet possess reliable region-aware driving intelligence.

What carries the argument

GeoDrive-Bench, a curated collection of 5,053 QA pairs that force inference from visual scenes plus unspoken local traffic conventions, paired with a distillation algorithm that embeds region-specific rule knowledge into VLM representations.
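The distillation step is described here only at a high level. As a rough illustration, and not the authors' algorithm, one common form of knowledge distillation trains a student that sees only the scene and question to match the answer distribution of a teacher that additionally sees the relevant rule text. The sketch below computes that KL objective for a single four-option question; the logits and the function names are hypothetical.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same options."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical logits over answer options A-D (not taken from the paper).
teacher = softmax([0.2, 3.0, 0.1, 0.1])  # assumed: conditioned on the rule text
student = softmax([1.0, 1.2, 0.9, 0.8])  # assumed: image and question only

# Minimizing this value pulls the student toward the rule-aware teacher.
distill_loss = kl_divergence(teacher, student)
```

The loss is zero only when the student already reproduces the teacher's distribution, which is why it can serve as a training signal for internalizing the rule.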

If this is right

Existing VLMs display large performance differences across the six countries on perception, prediction, planning, and region-reasoning tasks.
The proposed distillation method produces baseline models that improve geo-cultural reasoning uniformly across regions.
Current VLMs still lack the region-aware intelligence required for safe deployment in varied global driving environments.
GeoDrive-Bench functions as both a diagnostic test and a training resource for building more deployable autonomous-driving foundation models.
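The cross-region differences described above reduce to a simple computation: score each multiple-choice answer, group by country, and compare the groups. The following is a minimal sketch under assumed conventions; the record layout `(country, predicted, gold)`, the country codes, and the toy answers are hypothetical, and the country field is used only for scoring, never shown to the model.

```python
from collections import defaultdict

def accuracy_by_country(records):
    """Return {country: accuracy} for an iterable of (country, predicted, gold)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for country, predicted, gold in records:
        total[country] += 1
        correct[country] += int(predicted == gold)
    return {c: correct[c] / total[c] for c in total}

def cross_region_gap(per_country):
    """Difference between the best and worst per-country accuracy."""
    values = list(per_country.values())
    return max(values) - min(values)

# Toy records (hypothetical, not results from the paper).
records = [
    ("JP", "A", "A"), ("JP", "B", "B"), ("JP", "C", "A"), ("JP", "D", "D"),
    ("IN", "A", "B"), ("IN", "C", "C"), ("IN", "B", "D"), ("IN", "A", "A"),
]
per_country = accuracy_by_country(records)
gap = cross_region_gap(per_country)
```

A single max-minus-min gap is a coarse summary; reporting the full per-country table alongside it keeps the weakest region visible.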

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the performance gaps persist, any worldwide rollout of driving VLMs would need systematic region-by-region adaptation rather than a single global model.
Models might begin to treat subtle visual markers such as sign styles or vehicle types as implicit location signals, which could be measured in follow-up experiments.
The same curation approach could be applied to other multimodal tasks where unspoken local conventions matter, such as region-specific legal or medical image reasoning.

Load-bearing premise

The benchmark questions can be answered correctly only by combining visual evidence with implicit local traffic conventions rather than by surface-level image features or any explicit country information.

What would settle it

A single VLM that reaches near-ceiling accuracy with no statistically significant difference across all six countries on the full set of 5,053 questions, or a controlled test showing that the same questions can be solved at high accuracy using only generic visual features without any region-specific knowledge.
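One way to make "no statistically significant difference across all six countries" concrete is a Pearson chi-square test of homogeneity on correct/incorrect counts per country. The sketch below is an assumption about how such a check could be run, not a procedure taken from the paper; the counts are toy values, and the statistic is compared against the standard critical value for 5 degrees of freedom at the 0.05 level.

```python
def chi_square_homogeneity(correct, total):
    """Pearson chi-square statistic for equal accuracy across groups.

    correct[i] and total[i] are the number of correct answers and the number
    of questions for group i. Returns (statistic, degrees_of_freedom).
    Undefined when pooled accuracy is exactly 0 or 1 (division by zero).
    """
    pooled = sum(correct) / sum(total)
    stat = 0.0
    for c, n in zip(correct, total):
        expected_correct = n * pooled
        expected_wrong = n * (1 - pooled)
        stat += (c - expected_correct) ** 2 / expected_correct
        stat += ((n - c) - expected_wrong) ** 2 / expected_wrong
    return stat, len(total) - 1

CRITICAL_DF5_ALPHA05 = 11.070  # chi-square critical value, 5 df, alpha = 0.05

# Toy counts for six countries with 100 questions each (hypothetical).
even_stat, df = chi_square_homogeneity([90, 88, 91, 89, 90, 92], [100] * 6)
uneven_stat, _ = chi_square_homogeneity([90, 60, 85, 55, 88, 70], [100] * 6)

even_is_significant = even_stat > CRITICAL_DF5_ALPHA05
uneven_is_significant = uneven_stat > CRITICAL_DF5_ALPHA05
```

Failing to reject homogeneity is not proof of equal accuracy, so a settling experiment would also need enough questions per country for the test to have reasonable power.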

Figures

Figures reproduced from arXiv: 2606.02774 by Chaowei Xiao, Ming Jiang, Yingzi Ma.

**Figure 1.** Overview of GEODRIVE-BENCH. Left: radar visualization of per-country accuracy across representative VLMs, where each polygon corresponds to one country and each axis denotes a model. The results show that current VLMs exhibit highly imbalanced performance across country-specific scenarios, even when evaluated on the same driving tasks. Right: region-specific visual cues (signs, license plates, signals, vehi…

**Figure 2.** Overview of our scenario collection pipeline. To make “cultural relevance” an operational criterion rather than an intuitive judgment, we manually define 13 categories of culture-specific traffic situations, drawing on crowdsourced traffic regulations from Wikipedia and prior studies on cross-country driving behavior [21]. A category is retained only when national traffic codes diverge along at least on…

**Figure 3.** Distribution of GEODRIVE-BENCH across datasets, task categories, countries, scenarios, and region-specific topics. 3.2 Culture-relevant Driving Question-Answer Generation: Our goal is to identify VLM backbones suitable for VLA systems that operate across countries, so we focus on high-level driving-related VQA. Following prior driving benchmarks [33, 30, 42], we adopt the standard Perception / Prediction / …

**Figure 4.** Ablation on traffic rule injection across the four task categories. Rule-Given provides the correct rule, Wrong-Rule injects a mismatched rule, and Full-Handbook provides the full multi-country handbook. … closer to common pretraining priors and degrade sharply in culturally distinctive ones, so high aggregate accuracy does not imply robust region-aware reasoning. Effect of prompting settings. Comparing Dire…

**Figure 5.** Country-wise distribution of error types for InternVL3 and our DRIVEOPD ‡ under the reasoning setting. Each pie chart shows the proportion of four major error categories within a country: Visual Misperception, Geographic Misclassification, Cultural Rule Gap, and Reasoning Error. … behave more stably: a smaller gap between Rule-Given and Wrong-Rule together with competitive Full-Handbook performance suggests…

**Figure 6.** A region reasoning case study of InternVL3 on a school-warning sign question across four countries. The model identifies each country from salient cues (Japanese text, UK street layout, Indian auto-rickshaws, Chinese license plates) yet defaults to a generic yellow-diamond template, failing in the UK, India, and China. This decoupling of recognition from rule grounding motivates explicit internalizatio…

**Figure 7.** Country-wise distribution of error types for Qwen2.5-VL-7B and DRIVEOPD † under the Reasoning setting. Each pie chart shows the proportion of four error categories within a country: Visual Misperception, Geographic Misclassification, Cultural Rule Gap, and Reasoning Error. … toward Visual Misperception and Reasoning Error. This shift is itself informative: once regional rule knowledge is internalized, fine-g…

**Figure 8.** Web-based annotation tool used for human review.
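The Figure 4 caption names three rule-injection conditions: Rule-Given, Wrong-Rule, and Full-Handbook. A minimal sketch of how such prompt variants could be assembled follows. The prompt wording, the `No-Rule` baseline, the function names, and the three-entry toy handbook are all assumptions made for illustration; the paper's actual templates and handbook are not reproduced here.

```python
def build_prompt(question, options, rule_text=None):
    """Format a multiple-choice prompt, optionally prefixed with rule text."""
    lines = []
    if rule_text:
        lines.append("Traffic rules:\n" + rule_text)
    lines.append("Question: " + question)
    for label, option in zip("ABCD", options):
        lines.append(f"{label}. {option}")
    lines.append("Answer with a single letter.")
    return "\n".join(lines)

def ablation_prompts(question, options, country, handbook):
    """Build one prompt per condition for a question from `country`."""
    others = sorted(c for c in handbook if c != country)
    # Wrong-Rule: first other country whose rule actually differs.
    wrong_rule = next(handbook[c] for c in others if handbook[c] != handbook[country])
    full = "\n".join(f"[{c}] {handbook[c]}" for c in sorted(handbook))
    return {
        "No-Rule": build_prompt(question, options),
        "Rule-Given": build_prompt(question, options, handbook[country]),
        "Wrong-Rule": build_prompt(question, options, wrong_rule),
        "Full-Handbook": build_prompt(question, options, full),
    }

# Toy handbook with one rule per country (illustrative only).
handbook = {
    "JP": "Vehicles keep to the left side of the road.",
    "CN": "Vehicles keep to the right side of the road.",
    "UK": "Vehicles keep to the left side of the road.",
}
prompts = ablation_prompts(
    "Which lane should the ego vehicle enter after the turn?",
    ["Lane 1", "Lane 2", "Stop", "Reverse"],
    "JP",
    handbook,
)
```

Comparing accuracy under Rule-Given and Wrong-Rule then indicates how strongly a model leans on the supplied text rather than on knowledge it already holds.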

Original abstract

Vision-language models (VLMs) for autonomous driving have shown promising performance, but their ability to handle region-specific traffic rules remains underexplored, raising uncertainties about their deployment across diverse global settings. We therefore introduce GeoDrive-Bench, a novel benchmark that enables the systematic investigation of VLMs' geo-culturally grounded driving reasoning. We curated 5,053 human-validated multiple-choice QA pairs across six countries covering diverse driving cultures. Specifically, we emphasize four driving tasks: perception, prediction, planning, and region reasoning. Each question requires models to infer the correct driving behavior from visual evidence and local traffic conventions without explicit country labels. Beyond evaluation, we further design a distillation algorithm that injects region-specific traffic-rule knowledge into the internal representations of VLMs, enabling models to better align visual scene understanding with local driving policies. Experiments on nine state-of-the-art VLMs show substantial performance variations across geo-driving cultures for each task, while our proposed baseline models exhibit improved geo-cultural reasoning across regions. These results suggest that current VLMs still lack robust region-aware driving intelligence and highlight GeoDrive-Bench as a diagnostic and training-oriented testbed for deployable autonomous driving foundation models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

GeoDrive-Bench gives a new test set for region-specific driving reasoning in VLMs, but the evidence that its questions actually force implicit geo-cultural inference is still thin.

The main takeaway is a new benchmark with 5,053 human-validated QA pairs across six countries, focused on four tasks (perception, prediction, planning, and region reasoning), plus a distillation step meant to push VLMs toward local traffic rules. The experiments run nine current VLMs, report performance gaps by region, and show that the authors' baseline models close some of those gaps.

The novelty lies in the combination: a driving benchmark that explicitly drops country labels, covers multiple cultures, and pairs evaluation with a simple knowledge-injection method. It does a straightforward job of documenting that existing VLMs are uneven across driving cultures, which matches what people in the field already suspect about deployment.

The soft spot is the central premise. The claim that performance differences reflect missing geo-cultural reasoning requires that each question cannot be solved from visual surface features, scene statistics, or other non-driving cues. The abstract states the questions were human-validated and lack explicit labels, but it gives no numbers on inter-annotator checks for cue independence, no ablation that masks region signals, and no breakdown of how many questions remain solvable after removing obvious markers. Without that, the gaps could come from ordinary VLM weaknesses rather than the specific failure the benchmark is meant to diagnose.

This is for people working on VLMs for autonomous driving who need a practical way to measure cross-region robustness. A reader who wants data on current model behavior across countries will find usable numbers here.

It is worth sending to peer review so the curation details and distillation results can be examined directly.

Referee Report

2 major / 0 minor

Summary. The paper introduces GeoDrive-Bench, a benchmark of 5,053 human-validated multiple-choice QA pairs spanning six countries and four driving tasks (perception, prediction, planning, region reasoning). Questions are asserted to require joint visual and implicit geo-cultural inference without explicit country labels. The authors additionally propose a distillation algorithm to inject region-specific traffic knowledge into VLMs. Experiments on nine state-of-the-art VLMs report substantial cross-region performance gaps, while the authors' baseline models show improved geo-cultural reasoning; the work concludes that current VLMs lack robust region-aware driving intelligence.

Significance. If the benchmark questions demonstrably require geo-cultural inference beyond surface-level visual or textual cues, the dataset and distillation method would provide a valuable diagnostic and training resource for assessing and improving VLMs in globally deployable autonomous driving systems, where regional traffic conventions vary substantially.

major comments (2)

[Abstract] The central claim that performance variations demonstrate missing region-aware intelligence rests on the premise that the 5,053 QA pairs force inference from visual evidence plus implicit local traffic conventions rather than surface cues (sign text, vehicle models, road markings, language) or inferable labels. The abstract asserts human validation and absence of explicit country labels but supplies no quantitative check (e.g., inter-annotator agreement on cue independence, ablation of region-specific elements, or solvability after masking geo-cues) that would be required to secure this premise; without such evidence the observed gaps could arise from training-data biases or general VLM weaknesses instead.
[Abstract, curation paragraph] No details are provided on the question-construction or validation process (e.g., how annotators were instructed to avoid explicit or inferable country signals, what fraction of questions were rejected during validation, or any pilot study measuring answerability from non-driving features). This information is load-bearing for interpreting the reported cross-country variations as evidence of missing geo-cultural reasoning.
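The inter-annotator agreement check requested in the first comment is commonly reported as Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below assumes two annotators each label every question as 1 (requires a local convention beyond visible cues) or 0 (answerable from surface cues alone); that labeling scheme and the toy labels are hypothetical.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and equally long")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    # Chance agreement: product of each annotator's marginal label rates.
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)

# Toy labels for ten questions (hypothetical).
annotator_a = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
annotator_b = [1, 1, 0, 0, 1, 0, 1, 0, 0, 1]
kappa = cohens_kappa(annotator_a, annotator_b)
```

Here the annotators agree on 8 of 10 items while chance agreement is 0.5, so kappa comes out near 0.6 rather than the raw 0.8.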

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for stronger justification of the benchmark's design in the abstract. We address each point below and commit to revisions that improve clarity without overstating the current evidence.

Referee: [Abstract] The central claim that performance variations demonstrate missing region-aware intelligence rests on the premise that the 5,053 QA pairs force inference from visual evidence plus implicit local traffic conventions rather than surface cues (sign text, vehicle models, road markings, language) or inferable labels. The abstract asserts human validation and absence of explicit country labels but supplies no quantitative check (e.g., inter-annotator agreement on cue independence, ablation of region-specific elements, or solvability after masking geo-cues) that would be required to secure this premise; without such evidence the observed gaps could arise from training-data biases or general VLM weaknesses instead.

Authors: We agree that quantitative checks would further secure the premise. Section 3 of the full manuscript details the human validation protocol, where country-specific annotators were explicitly instructed to create questions requiring local traffic conventions beyond visible cues, and all questions were reviewed for absence of explicit country labels. While we did not conduct the specific ablations or cue-masking experiments suggested, the consistent cross-region gaps across nine diverse VLMs provide supporting evidence. We will revise the abstract to reference these validation steps and add a limitations paragraph discussing potential surface cues.
Revision: partial

Referee: [Abstract, curation paragraph] No details are provided on the question-construction or validation process (e.g., how annotators were instructed to avoid explicit or inferable country signals, what fraction of questions were rejected during validation, or any pilot study measuring answerability from non-driving features). This information is load-bearing for interpreting the reported cross-country variations as evidence of missing geo-cultural reasoning.

Authors: The full manuscript (Sections 3.1–3.2) describes the construction process, including annotator instructions to avoid inferable signals, the use of a pilot study to confirm answerability from driving features, and rejection criteria during validation. We will expand the abstract with a concise summary of these elements, including key statistics on the validation process.
Revision: yes

Circularity Check

0 steps flagged

No circularity: benchmark curation and empirical evaluation are independent of fitted inputs or self-citation chains

The paper presents a new dataset of 5,053 human-validated QA pairs across six countries and reports direct empirical results on nine VLMs plus a proposed distillation baseline. No equations, parameter fits, or derivations appear in the provided text. The central claim (performance variations indicate lack of region-aware intelligence) rests on the benchmark's construction and measured accuracies rather than reducing to any self-defined quantity, fitted subset renamed as prediction, or load-bearing self-citation. Curation is described as external human validation without explicit country labels, and the distillation step is presented as an added contribution rather than a circular justification of the benchmark itself. This is a standard empirical benchmark paper with no detectable circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; no free parameters, axioms, or invented entities are described.

Reference graph

Works this paper leans on

60 extracted references · 21 canonical work pages · 12 internal anchors

[1] Anthropic. Claude sonnet 4.6. https://www.anthropic.com/claude, 2025. Large language model.
[2] Hidehisa Arai, Keita Miwa, Kento Sasaki, Kohei Watanabe, Yu Yamaguchi, Shunsuke Aoki, and Issei Yamamoto. Covla: Comprehensive vision-language-action dataset for autonomous driving. In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 1933–1943. IEEE, 2025.
[3] Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report. arXiv preprint arXiv:2511.21631, 2025.
[4] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923, 2025.
[5] Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11621–11631, 2020.
[6] Kai Chen, Yanze Li, Wenhua Zhang, Yanxin Liu, Pengxiang Li, Ruiyuan Gao, Lanqing Hong, Meng Tian, Xinhai Zhao, Zhenguo Li, et al. Automated evaluation of large vision-language models on self-driving corner cases. In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 7817–7826. IEEE, 2025.
[7] Haohan Chi, Huan-ang Gao, Ziming Liu, Jianing Liu, Chenyu Liu, Jinwei Li, Kaisen Yang, Yangcheng Yu, Zeda Wang, Wenyi Li, et al. Impromptu vla: Open weights and open data for driving vision-language-action models. arXiv preprint arXiv:2505.23757, 2025.
[8] Xinpeng Ding, Jianhua Han, Hang Xu, Xiaodan Liang, Wei Zhang, and Xiaomeng Li. Holistic autonomous driving understanding by bird’s-eye-view injected multi-modal large models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13668–13677, 2024.
[9] Yongqi Dong, Chang Liu, Yiyun Wang, and Zhe Fu. Towards understanding worldwide cross-cultural differences in implicit driving cues: Review, comparative analysis, and research roadmap. In 2024 IEEE 27th International Conference on Intelligent Transportation Systems (ITSC), pages 1569–1575. IEEE, 2024.
[10] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
[11] Xianda Guo, Ruijun Zhang, Yiqun Duan, Yuhang He, Dujun Nie, Wenke Huang, Chenming Zhang, Shuai Liu, Hao Zhao, and Long Chen. Surds: Benchmarking spatial understanding and reasoning in driving scenarios with vision language models. arXiv preprint arXiv:2411.13112, 2024.
[12] Yuhan Hao, Zhengning Li, Lei Sun, Weilong Wang, Naixin Yi, Sheng Song, Caihong Qin, Mofan Zhou, Yifei Zhan, and Xianpeng Lang. Driveaction: A benchmark for exploring human-like driving decisions in vla models. arXiv preprint arXiv:2506.05667, 2025.
[13] Yuankai He and Weisong Shi. Carscenes: Semantic vlm dataset for safe autonomous driving. arXiv preprint arXiv:2511.10701, 2025.
[14] Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. Proceedings of the International Conference on Learning Representations, 2019.
[15] Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wenhai Wang, et al. Planning-oriented autonomous driving. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 17853–17862, 2023.
[16] Yuichi Inoue, Yuki Yada, Kotaro Tanahashi, and Yu Yamaguchi. Nuscenes-mqa: Integrated evaluation of captions and qa for autonomous driving datasets using markup annotations. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 930–938, 2024.
[17] Ayesha Ishaq, Jean Lahoud, Ketan More, Omkar Thawakar, Ritesh Thawkar, Dinura Dissanayake, Noor Ahsan, Yuhao Li, Fahad Shahbaz Khan, Hisham Cholakkal, et al. Drivelmm-o1: A step-by-step reasoning dataset and large multimodal model for driving scenario understanding. In 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), page...
[18] Bo Jiang, Shaoyu Chen, Qing Xu, Bencheng Liao, Jiajie Chen, Helong Zhou, Qian Zhang, Wenyu Liu, Chang Huang, and Xinggang Wang. Vad: Vectorized scene representation for efficient autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8340–8350, 2023.
[19] Aobo Kong, Wentao Ma, Shiwan Zhao, Yongbin Li, Yuchuan Wu, Ke Wang, Xiaoqian Liu, Qicheng Li, Yong Qin, and Fei Huang. Sdpo: Segment-level direct preference optimization for social agents. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12409–12423, 2025.
[20] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th symposium on operating systems principles, pages 611–626, 2023.
[21] Boyi Li, Yue Wang, Jiageng Mao, Boris Ivanovic, Sushant Veer, Karen Leung, and Marco Pavone. Driving everywhere with large language model policy adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14948–14957, 2024.
[22] Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning, ocr, and world knowledge, January 2024. URL https://llava-vl.github.io/blog/2024-01-30-llava-next/.
[23] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. In European conference on computer vision, pages 38–55. Springer, 2024.
[24] Yuechen Luo, Fang Li, Shaoqing Xu, Zhiyi Lai, Lei Yang, Qimao Chen, Ziang Luo, Zixun Xie, Shengyin Jiang, Jiaxin Liu, et al. Adathinkdrive: Adaptive thinking via reinforcement learning for autonomous driving. arXiv preprint arXiv:2509.13769, 2025.
[25] Yingzi Ma, Yulong Cao, Jiachen Sun, Marco Pavone, and Chaowei Xiao. Dolphins: Multimodal language model for driving. In European Conference on Computer Vision, pages 403–420. Springer, 2024.
[26] Jiageng Mao, Minzhe Niu, Chenhan Jiang, Xiaodan Liang, Yamin Li, Chaoqiang Ye, Wei Zhang, Zhenguo Li, Jie Yu, Chunjing Xu, et al. One million scenes for autonomous driving: Once dataset. 2021.
[27] Ana-Maria Marcu, Long Chen, Jan Hünermann, Alice Karnsund, Benoit Hanotte, Prajwal Chidananda, Saurabh Nair, Vijay Badrinarayanan, Alex Kendall, Jamie Shotton, and Oleg Sinavski. Lingoqa: Visual question answering for autonomous driving. arXiv preprint arXiv:2312.14115, 2023.
[28] Xianhui Meng, Yuchen Zhang, Zhijian Huang, Zheng Lu, Ziling Ji, Yaoyao Yin, Hongyuan Zhang, Guangfeng Jiang, Yandan Lin, Long Chen, et al. Is your vlm for autonomous driving safety-ready? a comprehensive benchmark for evaluating external and in-cabin risks. arXiv preprint arXiv:2511.14592, 2025.
[29] NVIDIA, Yan Wang, Wenjie Luo, Junjie Bai, Yulong Cao, Tong Che, Ke Chen, Yuxiao Chen, Jenna Diamond, Yifan Ding, Wenhao Ding, Liang Feng, Greg Heinrich, Jack Huang, Peter Karkus, Boyi Li, Pinyi Li, Tsung-Yi Lin, Dongran Liu, Ming-Yu Liu, Langechuan Liu, Zhijian Liu, Jason Lu, Yunxiang Mao, Pavlo Molchanov, Lindsey Pavao, Zhenghao Peng, Mike Ranzinger... Alpamayo-R1: Bridging Reasoning and Action Prediction for Generalizable Autonomous Driving in the Long Tail. 2025.
[30] Tianwen Qian, Jingjing Chen, Linhai Zhuo, Yang Jiao, and Yu-Gang Jiang. Nuscenes-qa: A multi-modal visual question answering benchmark for autonomous driving scenario. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 4542–4550, 2024.
[31] Hao Shao, Yuxuan Hu, Letian Wang, Guanglu Song, Steven L Waslander, Yu Liu, and Hongsheng Li. Lmdrive: Closed-loop end-to-end driving with large language models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15120–15130, 2024.
[32] Idan Shenfeld, Mehul Damani, Jonas Hübotter, and Pulkit Agrawal. Self-distillation enables continual learning. arXiv preprint arXiv:2601.19897, 2026.
[33] Chonghao Sima, Katrin Renz, Kashyap Chitta, Li Chen, Hanxue Zhang, Chengen Xie, Jens Beißwenger, Ping Luo, Andreas Geiger, and Hongyang Li. Drivelm: Driving with graph visual question answering. In European conference on computer vision, pages 256–274. Springer, 2024.
[34] Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, Louis Rouillard, Thomas Mesnard, Geoffrey Cideron, Jean bastien Grill, Sabela Ramos, Edouard Yvinec, Michelle Casbon, Etienne Pot, Ivo Penchev, Gaël Liu, Francesco Visin, Kathleen Kenealy, Lucas Bey... 2025.
[35] Qwen Team. Qwen3 technical report, 2025. URL https://arxiv.org/abs/2505.09388.
[36] Kexin Tian, Jingrui Mao, Yunlong Zhang, Jiwan Jiang, Yang Zhou, and Zhengzhong Tu. Nuscenes-spatialqa: A spatial understanding and reasoning benchmark for vision-language models in autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4567–4576, 2025.
[37]

DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models

Xiaoyu Tian, Junru Gu, Bailin Li, Yicheng Liu, Y ang Wang, Zhiyong Zhao, Kun Zhan, Peng Jia, Xian- peng Lang, and Hang Zhao. Drivevlm: The convergence of autonomous driving and large vision-language models. arXiv preprint arXiv:2402.12289, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[38]

Idd: A dataset for exploring problems of autonomous navigation in unconstrained environments

Girish V arma, Anbumani Subramanian, Anoop Namboodiri, Manmohan Chandraker, and CV Jawahar. Idd: A dataset for exploring problems of autonomous navigation in unconstrained environments. In 2019 IEEE winter conference on applications of computer vision (WACV) , pages 1743–1751. IEEE, 2019

2019
[39]

Omnidrive: A holistic vision-language dataset for autonomous driving with counterfac- tual reasoning

Shihao Wang, Zhiding Y u, Xiaohui Jiang, Shiyi Lan, Min Shi, Nadine Chang, Jan Kautz, Ying Li, and Jose M Alvarez. Omnidrive: A holistic vision-language dataset for autonomous driving with counterfac- tual reasoning. In Proceedings of the computer vision and pattern recognition conference , pages 22442– 22452, 2025

2025
[40] Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265, 2025.
[41] Yuting Wang, Zhaocheng He, Wangyong Xing, and Chengchuang Lin. Impact of regional driving behavior differences on traffic flow. Scientific Reports, 15(1):9027, 2025.
[42] Shaoyuan Xie, Lingdong Kong, Yuhao Dong, Chonghao Sima, Wenwei Zhang, Qi Alfred Chen, Ziwei Liu, and Liang Pan. Are vlms ready for autonomous driving? an empirical study from the reliability, data and metric perspectives. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6585–6597, 2025.
[43] Runsheng Xu, Hubert Lin, Wonseok Jeon, Hao Feng, Yuliang Zou, Liting Sun, John Gorman, Kate Tolstaya, Sarah Tang, Brandyn White, et al. Wod-e2e: Waymo open dataset for end-to-end driving in challenging long-tail scenarios. arXiv preprint arXiv:2510.26125, 2025.
[44] Zhenhua Xu, Yujia Zhang, Enze Xie, Zhen Zhao, Yong Guo, Kwan-Yee K Wong, Zhenguo Li, and Hengshuang Zhao. Drivegpt4: Interpretable end-to-end autonomous driving via large language model. IEEE Robotics and Automation Letters, 9(10):8186–8193, 2024.
[45] Shuang Zeng, Xinyuan Chang, Mengwei Xie, Xinran Liu, Yifan Bai, Zheng Pan, Mu Xu, and Xing Wei. Futuresightdrive: Thinking visually with spatio-temporal cot for autonomous driving. arXiv preprint arXiv:2505.17685, 2025.
[46] Xingcheng Zhou, Xuyuan Han, Feng Yang, Yunpu Ma, and Alois C Knoll. Opendrivevla: Towards end-to-end autonomous driving with large vision language action model. arXiv preprint arXiv:2503.23463, 2025.
[47] Zewei Zhou, Tianhui Cai, Seth Z Zhao, Yun Zhang, Zhiyu Huang, Bolei Zhou, and Jiaqi Ma. Autovla: A vision-language-action model for end-to-end autonomous driving with adaptive reasoning and reinforcement fine-tuning. arXiv preprint arXiv:2506.13757, 2025.
[48] Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025.

A Overview

Our appendix includes the following sections:
Section B: Additional Implementation Details. Prompt templates for every evaluation setting and the full training recipe of DriveOPD (algorithm, data, hyperparameters, compute).

Section C: Additional Results. Full image-perturbation table, per-category rule-context ablation, and error-type analysis on the Qwen2.5-VL family.

Section D: Benchmark Construction Details. The 13 culture-specific traffic categories, the 20-section per-country traffic-rule handbook, the counterfactual verification protocol, and the annotation tool used by human reviewers.

Section E: Extended Case Studies. Additional qualitative comparisons between base VLMs and DriveOPD across countries.

Section F: Broader Impact. Discussion of the broader implications of GeoDrive-Bench.

B Additional Implementation Details

B.1 DriveOPD Training Details

We instantiate DriveOPD on top of two open-source VLM backbones, Qwen2.5-VL-7B [4] and InternVL3-8B [48], yielding the two checkpoints denoted as DriveOPD† and DriveOPD‡ in the main paper. B...
Read scene_state to confirm what is actually visible.

Apply candidate_rule (NOT origin_rule) to the scene.

Pick the option that becomes correct under candidate_rule.

Compare against origin_gt.

Output STRICT JSON, no commentary: {"answer_under_candidate": "A|B|C|D", "differs_from_origin": true|false, "reason": "<one-sentence rationale grounded in candidate_rule>"}

Decision: a candidate QA pair is RETAINE...
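The STRICT JSON contract in the prompt above can be checked mechanically before any retention decision is made. The following is a minimal sketch, not the authors' pipeline: the function name `parse_verifier_output` is hypothetical, and the consistency check assumes that `differs_from_origin` should be true exactly when the new answer differs from `origin_gt`. The retention rule itself is truncated in the source, so it is deliberately left out.

```python
import json

VALID_OPTIONS = {"A", "B", "C", "D"}
EXPECTED_KEYS = {"answer_under_candidate", "differs_from_origin", "reason"}


def parse_verifier_output(raw: str, origin_gt: str) -> dict:
    """Parse one verifier reply and reject anything outside the stated schema.

    Hypothetical helper: `raw` is the model's reply, `origin_gt` is the
    ground-truth option under the original rule.
    """
    # json.loads raises json.JSONDecodeError (a ValueError) if the reply
    # contains commentary before or after the JSON object.
    record = json.loads(raw)
    if not isinstance(record, dict) or set(record) != EXPECTED_KEYS:
        raise ValueError("record must contain exactly the three schema keys")

    answer = record["answer_under_candidate"]
    if answer not in VALID_OPTIONS:
        raise ValueError("answer_under_candidate must be one of A, B, C, D")
    if not isinstance(record["differs_from_origin"], bool):
        raise ValueError("differs_from_origin must be a JSON boolean")
    if not isinstance(record["reason"], str) or not record["reason"].strip():
        raise ValueError("reason must be a non-empty string")

    # Assumed consistency rule: the flag should agree with the comparison
    # against origin_gt that the prompt asks the verifier to perform.
    if record["differs_from_origin"] != (answer != origin_gt):
        raise ValueError("differs_from_origin contradicts origin_gt")
    return record
```

Rejecting inconsistent records up front keeps a downstream filter from acting on a flag that disagrees with the answer it is supposed to summarize.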
Look at the image carefully.

Decide whether the provided GT_answer is supported by the image under the country-specific traffic context.

Output a JSON record with the verdict, your confidence, a one-paragraph rationale, and (for INCORRECT verdicts) the option you believe is actually correct.

Inputs:
scene_image: the candidate driving frame
country: the country governing the rules for this item
question: the multiple-choice q...
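The conditional requirement in this prompt (a corrected option only for INCORRECT verdicts) is easy to overlook when collecting review records. The sketch below illustrates one way to check it. The field names `verdict`, `confidence`, `rationale`, and `corrected_option`, as well as the four-option assumption, are hypothetical: the source names the contents of the record but its exact keys are truncated.

```python
from typing import List

# Hypothetical field names; the source lists the contents but not the keys.
REQUIRED_FIELDS = ("verdict", "confidence", "rationale")
OPTIONS = {"A", "B", "C", "D"}  # assumes four-option questions


def review_record_problems(record: dict) -> List[str]:
    """Return a list of schema problems; an empty list means the record passes."""
    problems = [
        f"missing field: {name}" for name in REQUIRED_FIELDS if name not in record
    ]
    # Only INCORRECT verdicts must name the option the reviewer believes is right.
    if record.get("verdict") == "INCORRECT":
        if record.get("corrected_option") not in OPTIONS:
            problems.append("INCORRECT verdict needs a corrected_option in A-D")
    return problems
```

Returning a list of problems rather than raising makes it simple to log every defect in a batch of reviewer outputs in a single pass.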
