nuReasoning: A Reasoning-Centric Dataset and Benchmark for Long-Tail Autonomous Driving

Hanyin Zhang; Jiali Chen; Jiaqi Ma; Johnson Liu; Mingxuan Gao; Ruining Yang; Rui Song; Tianhui Cai; Tony (Xuewei) Qi; Valeria Xu

arxiv: 2605.31572 · v1 · pith:C6FZ6FFNnew · submitted 2026-05-29 · 💻 cs.CV

nuReasoning: A Reasoning-Centric Dataset and Benchmark for Long-Tail Autonomous Driving

Zhiyu Huang , Johnson Liu , Rui Song , Zewei Zhou , Ruining Yang , Yun Zhang , Tianhui Cai , Hanyin Zhang

show 7 more authors

Mingxuan Gao Valeria Xu Jiali Chen Yishan Shen Yiluan Guo Tony (Xuewei) Qi Jiaqi Ma

This is my paper

Pith reviewed 2026-06-28 22:33 UTC · model grok-4.3

classification 💻 cs.CV

keywords autonomous drivingreasoning datasetlong-tail scenariosvisual question answeringplanning evaluationvision-language modelsnuScenes

0 comments

The pith

nuReasoning supplies human-verified reasoning annotations across 20,000 driving clips to improve both question answering and planning in long-tail autonomous driving.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces nuReasoning, a dataset of 20,000 twenty-second clips collected across cities with synchronized cameras, LiDAR, maps, object labels, and human-verified annotations for spatial, decision, and counterfactual reasoning. It positions this resource as an advance over prior perception-focused datasets by supporting direct evaluation of how reasoning supervision transfers to driving tasks. Experiments demonstrate gains in vision-language model performance on driving-specific question answering after fine-tuning and gains in vision-language-action model planning performance after reasoning-aware training. The planning improvement persists even when the model is prevented from producing textual reasoning at inference time. This combination establishes a benchmark for testing whether explicit reasoning data produces more robust autonomous driving behavior in rare scenarios.

Core claim

The central claim is that a large-scale real-world dataset with human-verified reasoning annotations for spatial relations, agent interactions, and safe decisions enables both improved reasoning evaluation and improved planning evaluation, with experiments confirming that fine-tuning vision-language models on the data raises driving question-answering accuracy while reasoning supervision during vision-language-action training raises planning performance even when textual reasoning outputs are disabled at inference.

What carries the argument

The nuReasoning dataset, which pairs multi-camera images, LiDAR, HD maps, and object annotations with three categories of human-verified reasoning annotations for each of the 20,000 clips.

If this is right

Fine-tuning vision-language models on nuReasoning substantially improves performance on driving-specific question answering.
Incorporating reasoning supervision into vision-language-action training improves planning performance.
Planning performance gains hold even when textual reasoning outputs are disabled at inference time.
The dataset structure allows a direct study of how reasoning supervision affects driving performance separate from perception or prediction tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The separation of reasoning and planning evaluation tracks could be used to measure whether reasoning supervision produces more interpretable intermediate representations inside the model.
If the annotations prove reliable, the same clip-level reasoning labels could serve as supervision for other sensor modalities or for simulation-to-real transfer.
Extending the same annotation protocol to additional cities or weather conditions would test whether the observed planning gains generalize beyond the current collection sites.

Load-bearing premise

The human-verified reasoning annotations accurately capture the commonsense knowledge, spatial relations, and inferences required for safe decisions in long-tail driving scenes.

What would settle it

An experiment in which models trained with nuReasoning reasoning supervision show no planning improvement over models trained only on perception data when both are tested on held-out long-tail scenes, or a comparison showing that the annotations diverge from judgments by experienced drivers on key counterfactual inferences.

Figures

Figures reproduced from arXiv: 2605.31572 by Hanyin Zhang, Jiali Chen, Jiaqi Ma, Johnson Liu, Mingxuan Gao, Ruining Yang, Rui Song, Tianhui Cai, Tony (Xuewei) Qi, Valeria Xu, Yiluan Guo, Yishan Shen, Yun Zhang, Zewei Zhou, Zhiyu Huang.

**Figure 1.** Figure 1: nuReasoning is a large-scale real-world long-tail driving dataset containing 20K 20-second clips across diverse scenario types. The dataset provides high-quality reasoning annotations spanning spatial reasoning, driving decisions, and counterfactual reasoning. Compared with prior datasets, nuReasoning offers substantially larger-scale long-tail driving data and richer reasoning annotations, enabling models… view at source ↗

**Figure 2.** Figure 2: Overview of the long-tail data mining and annotation pipeline. (a) Internal fleet driving [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Example of reasoning annotation in the nuReasoning dataset. The frame is annotated with [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Overview of the nuVLA baseline and the reasoning evaluation benchmark. (a) nuVLA [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5: Example of reasoning and planning results on the test set of nuReasoning. Reasoning is [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗

read the original abstract

Reasoning is essential for autonomous driving (AD) in long-tail scenarios, where vehicles must apply commonsense knowledge, understand spatial relations, infer agent interactions, and make safe decisions. However, existing AD datasets and benchmarks mainly target perception, prediction, or planning, and provide limited supervision for reasoning over realistic long-tail driving scenes. We introduce nuReasoning, a large-scale real-world dataset and benchmark for reasoning-centric AD. Following the lineage of nuScenes and nuPlan, nuReasoning advances real-world AD datasets and benchmarks toward reasoning in long-tail driving scenarios. The dataset contains 20,000 clips, each 20 seconds long, collected across multiple cities, with synchronized multi-camera images, LiDAR data, HD maps, object annotations, and human-verified reasoning annotations spanning Spatial Reasoning, Decision Reasoning, and Counterfactual Reasoning. Unlike prior datasets that focus primarily on visual question answering, nuReasoning supports both reasoning evaluation and planning evaluation, enabling a direct study of how reasoning supervision affects driving performance. Experiments show that fine-tuning VLMs on nuReasoning substantially improves driving-specific question answering, while incorporating reasoning supervision into VLA training improves planning performance even when textual reasoning outputs are disabled at inference time. These results establish nuReasoning as a foundation for evaluating and improving robust, interpretable, reasoning-driven AD systems in realistic long-tail settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

nuReasoning adds reasoning labels to nuScenes-style clips and reports gains from that supervision, but the abstract gives almost no detail on how those labels were checked.

read the letter

The main point is a new dataset of 20,000 twenty-second clips that layers Spatial, Decision, and Counterfactual reasoning annotations on top of synchronized camera, LiDAR, and map data from the nuScenes/nuPlan family. The experiments claim that fine-tuning VLMs on it lifts driving-specific QA performance and that adding the reasoning signal during VLA training improves planning even when the model is not asked to output text at inference.

What the work does cleanly is fill an obvious gap: most existing AD datasets stop at perception or trajectory prediction and give no explicit supervision for the commonsense and spatial inferences needed in long-tail scenes. Supporting both QA evaluation and planning evaluation in one benchmark is also useful for studying whether reasoning supervision transfers to control.

The soft spot is exactly the one flagged in the stress test. The abstract only says the annotations are “human-verified.” There is no mention of inter-annotator agreement, how many reviewers saw each clip, how disagreements on ambiguous long-tail cases were resolved, or any quantitative check on label quality. Without those numbers it is difficult to know whether the reported gains come from the reasoning content or simply from more data. If the full paper contains those metrics and they are reasonable, the contribution strengthens; if they are absent or weak, the central claim rests on an unverified assumption.

The paper is aimed at groups already working on VLMs or VLAs for driving who need a reasoning-oriented benchmark. It is worth sending to peer review because a dataset that explicitly targets long-tail reasoning is potentially valuable, but the referees will need to see the annotation protocol and basic agreement statistics before the results can be taken at face value.

Referee Report

3 major / 2 minor

Summary. The paper introduces nuReasoning, a dataset of 20,000 20-second real-world driving clips with synchronized multi-camera, LiDAR, HD maps, and object annotations, augmented by human-verified reasoning labels in three categories (Spatial Reasoning, Decision Reasoning, Counterfactual Reasoning). It supports both VQA-style reasoning evaluation and planning evaluation, with experiments claiming that fine-tuning VLMs on the dataset improves driving-specific question answering and that adding reasoning supervision to VLA training boosts planning metrics even when textual reasoning is disabled at inference.

Significance. If the annotations reliably capture the required commonsense and spatial inferences, nuReasoning would be a useful addition to the nuScenes/nuPlan lineage by enabling controlled study of how explicit reasoning supervision transfers to closed-loop planning. The reported transfer effect (reasoning supervision helping planning without explicit outputs) would be a concrete empirical contribution if robustly demonstrated.

major comments (3)

[§3.2 and §4.1] §3.2 (Dataset Construction) and §4.1 (Annotation Process): The claim that annotations are 'human-verified' is load-bearing for both the VLM QA and VLA planning results, yet the manuscript provides no inter-annotator agreement statistics, number of reviewers per clip, qualification criteria for annotators, or protocol for resolving disagreements on ambiguous long-tail cases. Without these, it is impossible to assess whether the supervision signal is reliable or whether gains could be explained by scale alone.
[§5.2 and Table 4] §5.2 (VLA Experiments) and Table 4: The planning improvement from reasoning supervision is presented as evidence that the annotations encode useful inferences, but the section does not report controls that isolate reasoning content from dataset size or from the base VLA architecture; e.g., no ablation comparing reasoning-augmented data against an equal-sized non-reasoning subset. This directly affects attribution of the reported gains.
[§5.1] §5.1 (VLM Experiments): The driving-specific QA gains are reported without details on the train/test split construction, whether test scenes overlap with training cities, or statistical significance testing across multiple seeds. These omissions make it difficult to judge whether the improvements generalize beyond the particular data partition used.

minor comments (2)

[Abstract and §1] The abstract and §1 use 'long-tail' without a precise operational definition or quantitative characterization of how the 20k clips were selected to emphasize tail events.
[Figure 3] Figure 3 (example annotations) would benefit from clearer indication of which reasoning category each highlighted sentence belongs to.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on annotation reliability and experimental controls. We address each major comment below and will revise the manuscript accordingly to improve clarity and rigor.

read point-by-point responses

Referee: [§3.2 and §4.1] §3.2 (Dataset Construction) and §4.1 (Annotation Process): The claim that annotations are 'human-verified' is load-bearing for both the VLM QA and VLA planning results, yet the manuscript provides no inter-annotator agreement statistics, number of reviewers per clip, qualification criteria for annotators, or protocol for resolving disagreements on ambiguous long-tail cases. Without these, it is impossible to assess whether the supervision signal is reliable or whether gains could be explained by scale alone.

Authors: We agree that these details are essential for assessing annotation quality. The annotations were produced by qualified annotators with driving domain expertise, with multiple reviewers per clip and a structured disagreement resolution process. In the revised manuscript we will report inter-annotator agreement (Cohen’s kappa), the exact number of reviewers per clip, annotator qualification criteria, and the disagreement protocol. revision: yes
Referee: [§5.2 and Table 4] §5.2 (VLA Experiments) and Table 4: The planning improvement from reasoning supervision is presented as evidence that the annotations encode useful inferences, but the section does not report controls that isolate reasoning content from dataset size or from the base VLA architecture; e.g., no ablation comparing reasoning-augmented data against an equal-sized non-reasoning subset. This directly affects attribution of the reported gains.

Authors: We acknowledge the value of an explicit size-matched ablation. The current experiments compare reasoning-augmented training against the base VLA trained on the identical data volume without reasoning labels; however, we will add a new ablation that trains on an equal-sized non-reasoning subset drawn from the same distribution and report the results in the revised §5.2 and Table 4. revision: yes
Referee: [§5.1] §5.1 (VLM Experiments): The driving-specific QA gains are reported without details on the train/test split construction, whether test scenes overlap with training cities, or statistical significance testing across multiple seeds. These omissions make it difficult to judge whether the improvements generalize beyond the particular data partition used.

Authors: We will expand §5.1 to describe the train/test split construction (including city-level separation to avoid scene overlap), confirm that test scenes are drawn from held-out cities, and report mean and standard deviation across three random seeds with statistical significance tests. revision: yes

Circularity Check

0 steps flagged

No circularity; dataset introduction and empirical results contain no derivations or self-referential reductions

full rationale

The paper presents a new dataset (nuReasoning) with human-verified annotations across Spatial/Decision/Counterfactual reasoning and reports empirical results on VLM fine-tuning and VLA planning improvements. No equations, mathematical derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The central claims rest on dataset construction and external experimental evaluation rather than any chain that reduces outputs to inputs by definition or construction. Annotation verification details are unspecified, but this concerns evidence quality rather than circularity in any derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper's contribution is dataset creation and empirical benchmarking rather than a mathematical model or derivation, so no free parameters, axioms, or invented entities are described.

pith-pipeline@v0.9.1-grok · 5827 in / 1147 out tokens · 27526 ms · 2026-06-28T22:33:19.209971+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

89 extracted references · 16 linked inside Pith

[1]

St-p3: End-to-end vision-based autonomous driving via spatial-temporal feature learning

Shengchao Hu, Li Chen, Penghao Wu, Hongyang Li, Junchi Yan, and Dacheng Tao. St-p3: End-to-end vision-based autonomous driving via spatial-temporal feature learning. InEuropean Conference on Computer Vision, pages 533–549. Springer, 2022

2022
[2]

Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chonghao Sima, Tong Lu, Qiao Yu, and Jifeng Dai. Bevformer: learning bird’s-eye-view representation from lidar-camera via spatiotemporal transformers.IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(3):2020– 2036, 2024

2020
[3]

Collaborative semantic occupancy prediction with hybrid feature fusion in connected automated vehicles

Rui Song, Chenwei Liang, Hu Cao, Zhiran Yan, Walter Zimmer, Markus Gross, Andreas Festag, and Alois Knoll. Collaborative semantic occupancy prediction with hybrid feature fusion in connected automated vehicles. In2024 IEEE/CVF International Conference on Computer Vision and Pattern Recognition (CVPR). IEEE/CVF, 2024

2024
[4]

V2XPnP: Vehicle-to-everything spatio-temporal fusion for multi-agent perception and prediction.Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025

Zewei Zhou, Hao Xiang, Zhaoliang Zheng, Seth Z Zhao, Mingyue Lei, Yun Zhang, Tianhui Cai, Xinyi Liu, Johnson Liu, Maheswari Bajji, et al. V2XPnP: Vehicle-to-everything spatio-temporal fusion for multi-agent perception and prediction.Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025

2025
[5]

Maptr: Structured modeling and learning for online vectorized hd map construction

Bencheng Liao, Shaoyu Chen, Xinggang Wang, Tianheng Cheng, Qian Zhang, Wenyu Liu, and Chang Huang. Maptr: Structured modeling and learning for online vectorized hd map construction. InThe Eleventh International Conference on Learning Representations
[6]

Gameformer: Game-theoretic modeling and learning of transformer-based interactive prediction and planning for autonomous driving

Zhiyu Huang, Haochen Liu, and Chen Lv. Gameformer: Game-theoretic modeling and learning of transformer-based interactive prediction and planning for autonomous driving. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 3903–3913, 2023

2023
[7]

Relmap: Enhancing online map construction with class-aware spatial relation and semantic priors.arXiv preprint arXiv:2507.21567, 2025

Tianhui Cai, Yun Zhang, Zewei Zhou, Zhiyu Huang, and Jiaqi Ma. Relmap: Enhancing online map construction with class-aware spatial relation and semantic priors.arXiv preprint arXiv:2507.21567, 2025

arXiv 2025
[8]

IPFormer: Visual 3d panoptic scene completion with context-adaptive instance proposals

Markus Gross, Aya Fahmy, Danit Niwattananan, Dominik Muhle, Rui Song, Daniel Cremers, and Henri Meeß. IPFormer: Visual 3d panoptic scene completion with context-adaptive instance proposals. InAdvances in Neural Information Processing Systems (NeurIPS), 2025

2025
[9]

Parting with mis- conceptions about learning-based vehicle motion planning

Daniel Dauner, Marcel Hallgarten, Andreas Geiger, and Kashyap Chitta. Parting with mis- conceptions about learning-based vehicle motion planning. InConference on Robot Learning (CoRL), 2023

2023
[10]

Mdg: Masked denois- ing generation for multi-agent behavior modeling in traffic environments.arXiv preprint arXiv:2511.17496, 2025

Zhiyu Huang, Zewei Zhou, Tianhui Cai, Yun Zhang, and Jiaqi Ma. Mdg: Masked denois- ing generation for multi-agent behavior modeling in traffic environments.arXiv preprint arXiv:2511.17496, 2025

arXiv 2025
[11]

Gen-drive: Enhancing diffusion generative driving policies with reward modeling and reinforcement learning fine-tuning

Zhiyu Huang, Xinshuo Weng, Maximilian Igl, Yuxiao Chen, Yulong Cao, Boris Ivanovic, Marco Pavone, and Chen Lv. Gen-drive: Enhancing diffusion generative driving policies with reward modeling and reinforcement learning fine-tuning. In2025 IEEE International Conference on Robotics and Automation (ICRA), pages 3445–3451. IEEE, 2025. 10

2025
[12]

End-to-end autonomous driving: Challenges and frontiers.IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(12):10164–10183, 2024

Li Chen, Penghao Wu, Kashyap Chitta, Bernhard Jaeger, Andreas Geiger, and Hongyang Li. End-to-end autonomous driving: Challenges and frontiers.IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(12):10164–10183, 2024

2024
[13]

Vad: Vectorized scene representation for efficient autonomous driving

Bo Jiang, Shaoyu Chen, Qing Xu, Bencheng Liao, Jiajie Chen, Helong Zhou, Qian Zhang, Wenyu Liu, Chang Huang, and Xinggang Wang. Vad: Vectorized scene representation for efficient autonomous driving. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 8340–8350, 2023

2023
[14]

Genad: Gen- erative end-to-end autonomous driving

Wenzhao Zheng, Ruiqi Song, Xianda Guo, Chenming Zhang, and Long Chen. Genad: Gen- erative end-to-end autonomous driving. InEuropean Conference on Computer Vision, pages 87–104. Springer, 2024

2024
[15]

Trajectory-guided control prediction for end-to-end autonomous driving: A simple yet strong baseline.Advances in Neural Information Processing Systems, 35:6119–6132, 2022

Penghao Wu, Xiaosong Jia, Li Chen, Junchi Yan, Hongyang Li, and Yu Qiao. Trajectory-guided control prediction for end-to-end autonomous driving: A simple yet strong baseline.Advances in Neural Information Processing Systems, 35:6119–6132, 2022

2022
[16]

Planning-oriented autonomous driving

Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wenhai Wang, et al. Planning-oriented autonomous driving. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 17853–17862, 2023

2023
[17]

Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving

Bencheng Liao, Shaoyu Chen, Haoran Yin, Bo Jiang, Cheng Wang, Sixu Yan, Xinbang Zhang, Xiangyu Li, Ying Zhang, Qian Zhang, et al. Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 12037–12047, 2025

2025
[18]

Drivetransformer: Unified trans- former for scalable end-to-end autonomous driving

Xiaosong Jia, Junqi You, Zhiyuan Zhang, and Junchi Yan. Drivetransformer: Unified trans- former for scalable end-to-end autonomous driving. InThe Thirteenth International Conference on Learning Representations, 2025

2025
[19]

Future-aware end-to-end driving: Bidirectional modeling of trajectory planning and scene evolution.arXiv preprint arXiv:2510.11092, 2025

Bozhou Zhang, Nan Song, Jingyu Li, Xiatian Zhu, Jiankang Deng, and Li Zhang. Future-aware end-to-end driving: Bidirectional modeling of trajectory planning and scene evolution.arXiv preprint arXiv:2510.11092, 2025

arXiv 2025
[20]

Perception in plan: Coupled perception and planning for end-to-end autonomous driving

Bozhou Zhang, Jingyu Li, Nan Song, and Li Zhang. Perception in plan: Coupled perception and planning for end-to-end autonomous driving. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 12376–12384, 2026

2026
[21]

Simscale: Learning to drive via real-world simulation at scale.arXiv preprint arXiv:2511.23369, 2025

Haochen Tian, Tianyu Li, Haochen Liu, Jiazhi Yang, Yihang Qiu, Guang Li, Junli Wang, Yinfeng Gao, Zhang Zhang, Liang Wang, et al. Simscale: Learning to drive via real-world simulation at scale.arXiv preprint arXiv:2511.23369, 2025

Pith/arXiv arXiv 2025
[22]

Generative ai for autonomous driving: Frontiers and opportunities.arXiv preprint arXiv:2505.08854, 2025

Yuping Wang, Shuo Xing, Cui Can, Renjie Li, Hongyuan Hua, Kexin Tian, Zhaobin Mo, Xiangbo Gao, Keshu Wu, Sulong Zhou, et al. Generative ai for autonomous driving: Frontiers and opportunities.arXiv preprint arXiv:2505.08854, 2025

arXiv 2025
[23]

Driving with regulation: Trustworthy and interpretable decision-making for autonomous driving with retrieval-augmented reasoning

Tianhui Cai, Yifan Liu, Zewei Zhou, Haoxuan Ma, Seth Z Zhao, Zhiwen Wu, Xu Han, Zhiyu Huang, and Jiaqi Ma. Driving with regulation: Trustworthy and interpretable decision-making for autonomous driving with retrieval-augmented reasoning. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 38287–38295, 2026

2026
[24]

Are vlms ready for autonomous driving? an empirical study from the reliability, data and metric perspectives

Shaoyuan Xie, Lingdong Kong, Yuhao Dong, Chonghao Sima, Wenwei Zhang, Qi Alfred Chen, Ziwei Liu, and Liang Pan. Are vlms ready for autonomous driving? an empirical study from the reliability, data and metric perspectives. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 6585–6597, 2025

2025
[25]

Emma: End-to-end multimodal model for autonomous driving.Transactions on Machine Learning Research, 2025

Jyh-Jing Hwang, Runsheng Xu, Hubert Lin, Wei-Chih Hung, Jingwei Ji, Kristy Choi, Di Huang, Tong He, Paul Covington, Benjamin Sapp, et al. Emma: End-to-end multimodal model for autonomous driving.Transactions on Machine Learning Research, 2025

2025
[26]

Drivelm: Driving with graph visual question answering

Chonghao Sima, Katrin Renz, Kashyap Chitta, Li Chen, Hanxue Zhang, Chengen Xie, Jens Beißwenger, Ping Luo, Andreas Geiger, and Hongyang Li. Drivelm: Driving with graph visual question answering. InEuropean conference on computer vision, pages 256–274. Springer, 2024. 11

2024
[27]

Omnidrive: A holistic vision-language dataset for autonomous driving with counterfactual reasoning

Shihao Wang, Zhiding Yu, Xiaohui Jiang, Shiyi Lan, Min Shi, Nadine Chang, Jan Kautz, Ying Li, and Jose M Alvarez. Omnidrive: A holistic vision-language dataset for autonomous driving with counterfactual reasoning. InProceedings of the computer vision and pattern recognition conference, pages 22442–22452, 2025

2025
[28]

Driving everywhere with large language model policy adaptation

Boyi Li, Yue Wang, Jiageng Mao, Boris Ivanovic, Sushant Veer, Karen Leung, and Marco Pavone. Driving everywhere with large language model policy adaptation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14948–14957, 2024

2024
[29]

S4-driver: Scalable self-supervised driving multimodal large language model with spatio-temporal visual representation

Yichen Xie, Runsheng Xu, Tong He, Jyh-Jing Hwang, Katie Luo, Jingwei Ji, Hubert Lin, Letian Chen, Yiren Lu, Zhaoqi Leng, et al. S4-driver: Scalable self-supervised driving multimodal large language model with spatio-temporal visual representation. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 1622–1632, 2025

2025
[30]

Latent chain-of-thought world modeling for end-to-end driving.arXiv preprint arXiv:2512.10226, 2025

Shuhan Tan, Kashyap Chitta, Yuxiao Chen, Ran Tian, Yurong You, Yan Wang, Wenjie Luo, Yulong Cao, Philipp Krahenbuhl, Marco Pavone, et al. Latent chain-of-thought world modeling for end-to-end driving.arXiv preprint arXiv:2512.10226, 2025

Pith/arXiv arXiv 2025
[31]

Reason2drive: Towards interpretable and chain-based reasoning for autonomous driving

Ming Nie, Renyuan Peng, Chunwei Wang, Xinyue Cai, Jianhua Han, Hang Xu, and Li Zhang. Reason2drive: Towards interpretable and chain-based reasoning for autonomous driving. In European Conference on Computer Vision, pages 292–308. Springer, 2024

2024
[32]

Vision-language-action models for autonomous driving: Past, present, and future.arXiv preprint arXiv:2512.16760, 2025

Tianshuai Hu, Xiaolu Liu, Song Wang, Yiyao Zhu, Ao Liang, Lingdong Kong, Guoyang Zhao, Zeying Gong, Jun Cen, Zhiyu Huang, et al. Vision-language-action models for autonomous driving: Past, present, and future.arXiv preprint arXiv:2512.16760, 2025

arXiv 2025
[33]

Autovla: A vision-language-action model for end-to-end autonomous driving with adaptive reasoning and reinforcement fine-tuning.arXiv preprint arXiv:2506.13757, 2025

Zewei Zhou, Tianhui Cai, Seth Z Zhao, Yun Zhang, Zhiyu Huang, Bolei Zhou, and Jiaqi Ma. Autovla: A vision-language-action model for end-to-end autonomous driving with adaptive reasoning and reinforcement fine-tuning.arXiv preprint arXiv:2506.13757, 2025

Pith/arXiv arXiv 2025
[34]

Open- drivevla: Towards end-to-end autonomous driving with large vision language action model

Xingcheng Zhou, Xuyuan Han, Feng Yang, Yunpu Ma, V olker Tresp, and Alois Knoll. Open- drivevla: Towards end-to-end autonomous driving with large vision language action model. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 13782–13790, 2026

2026
[35]

Alpamayo-r1: Bridging reasoning and action prediction for generalizable autonomous driving in the long tail.arXiv preprint arXiv:2511.00088, 2025

Yan Wang, Wenjie Luo, Junjie Bai, Yulong Cao, Tong Che, Ke Chen, Yuxiao Chen, Jenna Dia- mond, Yifan Ding, Wenhao Ding, et al. Alpamayo-r1: Bridging reasoning and action prediction for generalizable autonomous driving in the long tail.arXiv preprint arXiv:2511.00088, 2025

Pith/arXiv arXiv 2025
[36]

Futuresightdrive: Thinking visually with spatio-temporal cot for autonomous driving

Shuang Zeng, Xinyuan Chang, Mengwei Xie, Xinran Liu, Yifan Bai, Zheng Pan, Mu Xu, and Xing Wei. Futuresightdrive: Thinking visually with spatio-temporal cot for autonomous driving. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems
[37]

Colavla: Leveraging cognitive latent reasoning for hierarchical parallel trajectory planning in autonomous driving.arXiv preprint arXiv:2512.22939, 2025

Qihang Peng, Xuesong Chen, Chenye Yang, Shaoshuai Shi, and Hongsheng Li. Colavla: Leveraging cognitive latent reasoning for hierarchical parallel trajectory planning in autonomous driving.arXiv preprint arXiv:2512.22939, 2025

arXiv 2025
[38]

Uni-world vla: Interleaved world modeling and planning for autonomous driving.arXiv preprint arXiv:2603.27287, 2026

Qiqi Liu, Huan Xu, Jingyu Li, Bin Sun, Zhihui Hao, Dangen She, Xiatian Zhu, and Li Zhang. Uni-world vla: Interleaved world modeling and planning for autonomous driving.arXiv preprint arXiv:2603.27287, 2026

arXiv 2026
[39]

Minddrive: A vision-language-action model for autonomous driving via online reinforcement learning.arXiv preprint arXiv:2512.13636, 2025

Haoyu Fu, Diankun Zhang, Zongchuang Zhao, Jianfeng Cui, Hongwei Xie, Bing Wang, Guang Chen, Dingkang Liang, and Xiang Bai. Minddrive: A vision-language-action model for autonomous driving via online reinforcement learning.arXiv preprint arXiv:2512.13636, 2025

arXiv 2025
[40]

Real-ad: Towards human-like reasoning in end-to-end autonomous driving

Yuhang Lu, Jiadong Tu, Yuexin Ma, and Xinge Zhu. Real-ad: Towards human-like reasoning in end-to-end autonomous driving. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 27783–27793, 2025

2025
[41]

Lmdrive: Closed-loop end-to-end driving with large language models

Hao Shao, Yuxuan Hu, Letian Wang, Guanglu Song, Steven L Waslander, Yu Liu, and Hong- sheng Li. Lmdrive: Closed-loop end-to-end driving with large language models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15120–15130, 2024. 12

2024
[42]

nuscenes: A multimodal dataset for autonomous driving

Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh V ora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11621–11631, 2020

2020
[43]

Scalability in perception for autonomous driving: Waymo open dataset

Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, et al. Scalability in perception for autonomous driving: Waymo open dataset. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2446–2454, 2020

2020
[44]

Large scale interactive motion forecasting for autonomous driving: The waymo open motion dataset

Scott Ettinger, Shuyang Cheng, Benjamin Caine, Chenxi Liu, Hang Zhao, Sabeek Pradhan, Yuning Chai, Ben Sapp, Charles R Qi, Yin Zhou, et al. Large scale interactive motion forecasting for autonomous driving: The waymo open motion dataset. InProceedings of the IEEE/CVF international conference on computer vision, pages 9710–9719, 2021

2021
[45]

Argoverse: 3d tracking and forecasting with rich maps

Ming-Fang Chang, John Lambert, Patsorn Sangkloy, Jagjeet Singh, Slawomir Bak, Andrew Hartnett, De Wang, Peter Carr, Simon Lucey, Deva Ramanan, and James Hays. Argoverse: 3d tracking and forecasting with rich maps. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019

2019
[46]

Trends in motion predic- tion toward deployable and generalizable autonomy: A revisit and perspectives.Foundations and Trends® in Robotics, 13(1-2):1–269, 2026

Letian Wang, Marc-Antoine Lavoie, Sandro Papais, Barza Nisar, Yuxiao Chen, Wenhao Ding, Boris Ivanovic, Hao Shao, Abulikemu Abuduweili, Evan Cook, et al. Trends in motion predic- tion toward deployable and generalizable autonomy: A revisit and perspectives.Foundations and Trends® in Robotics, 13(1-2):1–269, 2026

2026
[47]

nuplan: A closed-loop ml-based planning benchmark for autonomous vehicles.arXiv preprint arXiv:2106.11810, 2021

Holger Caesar, Juraj Kabzan, Kok Seang Tan, Whye Kit Fong, Eric Wolff, Alex Lang, Luke Fletcher, Oscar Beijbom, and Sammy Omari. nuplan: A closed-loop ml-based planning benchmark for autonomous vehicles.arXiv preprint arXiv:2106.11810, 2021

Pith/arXiv arXiv 2021
[48]

Wod-e2e: Waymo open dataset for end-to-end driving in challenging long-tail scenarios.arXiv preprint arXiv:2510.26125, 2025

Runsheng Xu, Hubert Lin, Wonseok Jeon, Hao Feng, Yuliang Zou, Liting Sun, John Gorman, Ekaterina Tolstaya, Sarah Tang, Brandyn White, et al. Wod-e2e: Waymo open dataset for end-to-end driving in challenging long-tail scenarios.arXiv preprint arXiv:2510.26125, 2025

arXiv 2025
[49]

Bridgesim: Unveiling the ol-cl gap in end-to-end autonomous driving.arXiv preprint arXiv:2604.10856, 2026

Seth Z Zhao, Luobin Wang, Hongwei Ruan, Yuxin Bao, Yilan Chen, Ziyang Leng, Abhijit Ravichandran, Honglin He, Zewei Zhou, Xu Han, et al. Bridgesim: Unveiling the ol-cl gap in end-to-end autonomous driving.arXiv preprint arXiv:2604.10856, 2026

Pith/arXiv arXiv 2026
[50]

Drivee2e: Closed-loop benchmark for end-to-end autonomous driving through real-to- simulation.arXiv preprint arXiv:2509.23922, 2025

Haibao Yu, Wenxian Yang, Ruiyang Hao, Chuanye Wang, Jiaru Zhong, Ping Luo, and Zaiqing Nie. Drivee2e: Closed-loop benchmark for end-to-end autonomous driving through real-to- simulation.arXiv preprint arXiv:2509.23922, 2025

arXiv 2025
[51]

Nuplanqa: A large-scale dataset and benchmark for multi-view driving scene understanding in multi-modal large language models

Sung-Yeon Park, Can Cui, Yunsheng Ma, Ahmadreza Moradipari, Rohit Gupta, Kyungtae Han, and Ziran Wang. Nuplanqa: A large-scale dataset and benchmark for multi-view driving scene understanding in multi-modal large language models. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 8066–8076, 2025

2025
[52]

Nuscenes-qa: A multi-modal visual question answering benchmark for autonomous driving scenario

Tianwen Qian, Jingjing Chen, Linhai Zhuo, Yang Jiao, and Yu-Gang Jiang. Nuscenes-qa: A multi-modal visual question answering benchmark for autonomous driving scenario. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 4542–4550, 2024

2024
[53]

Lingoqa: Visual question answering for autonomous driving

Ana-Maria Marcu, Long Chen, Jan Hünermann, Alice Karnsund, Benoit Hanotte, Prajwal Chidananda, Saurabh Nair, Vijay Badrinarayanan, Alex Kendall, Jamie Shotton, et al. Lingoqa: Visual question answering for autonomous driving. InEuropean Conference on Computer Vision, pages 252–269. Springer, 2024

2024
[54]

Accelerating structured chain-of-thought in autonomous vehicles.arXiv preprint arXiv:2602.02864, 2026

Yi Gu, Yan Wang, Yuxiao Chen, Yurong You, Wenjie Luo, Yue Wang, Wenhao Ding, Boyi Li, Heng Yang, Boris Ivanovic, et al. Accelerating structured chain-of-thought in autonomous vehicles.arXiv preprint arXiv:2602.02864, 2026

arXiv 2026
[55]

Waymoqa: A multi-view visual question answering dataset for safety-critical reasoning in autonomous driving.arXiv preprint arXiv:2511.20022, 2025

Seungjun Yu, Seonho Lee, Namho Kim, Jaeyo Shin, Junsung Park, Wonjeong Ryu, Raehyuk Jung, and Hyunjung Shim. Waymoqa: A multi-view visual question answering dataset for safety-critical reasoning in autonomous driving.arXiv preprint arXiv:2511.20022, 2025. 13

arXiv 2025
[56]

Covla: Comprehensive vision-language-action dataset for autonomous driving

Hidehisa Arai, Keita Miwa, Kento Sasaki, Kohei Watanabe, Yu Yamaguchi, Shunsuke Aoki, and Issei Yamamoto. Covla: Comprehensive vision-language-action dataset for autonomous driving. In2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 1933–1943. IEEE, 2025

1933
[57]

Impromptu vla: Open weights and open data for driving vision-language-action models.arXiv preprint arXiv:2505.23757, 2025

Haohan Chi, Huan-ang Gao, Ziming Liu, Jianing Liu, Chenyu Liu, Jinwei Li, Kaisen Yang, Yangcheng Yu, Zeda Wang, Wenyi Li, et al. Impromptu vla: Open weights and open data for driving vision-language-action models.arXiv preprint arXiv:2505.23757, 2025

arXiv 2025
[58]

Spatial-aware vision language model for autonomous driving.arXiv preprint arXiv:2512.24331, 2025

Weijie Wei, Zhipeng Luo, Ling Feng, and Venice Erin Liong. Spatial-aware vision language model for autonomous driving.arXiv preprint arXiv:2512.24331, 2025

Pith/arXiv arXiv 2025
[59]

Spacedrive: Infusing spatial awareness into vlm-based autonomous driving.arXiv preprint arXiv:2512.10719, 2, 2025

Peizheng Li, Zhenghao Zhang, David Holtz, Hang Yu, Yutong Yang, Yuzhi Lai, Rui Song, Andreas Geiger, and Andreas Zell. Spacedrive: Infusing spatial awareness into vlm-based autonomous driving.arXiv preprint arXiv:2512.10719, 2, 2025

Pith/arXiv arXiv 2025
[60]

Reasonplan: Unified scene prediction and decision reasoning for closed-loop autonomous driving

Xueyi Liu, Zuodong Zhong, Qichao Zhang, Yuxin Guo, Yupeng Zheng, Junli Wang, Dongbin Zhao, Yun-Fu Liu, Zhiguo Su, Yinfeng Gao, et al. Reasonplan: Unified scene prediction and decision reasoning for closed-loop autonomous driving. InConference on Robot Learning, pages 3051–3068. PMLR, 2025

2025
[61]

Counterfactual vla: Self-reflective vision-language-action model with adaptive reasoning.arXiv preprint arXiv:2512.24426, 2025

Zhenghao Peng, Wenhao Ding, Yurong You, Yuxiao Chen, Wenjie Luo, Thomas Tian, Yulong Cao, Apoorva Sharma, Danfei Xu, Boris Ivanovic, et al. Counterfactual vla: Self-reflective vision-language-action model with adaptive reasoning.arXiv preprint arXiv:2512.24426, 2025

arXiv 2025
[62]

Towards learning-based planning: The nuplan benchmark for real-world autonomous driving

Napat Karnchanachari, Dimitris Geromichalos, Kok Seang Tan, Nanxiang Li, Christopher Eriksen, Shakiba Yaghoubi, Noushin Mehdipour, Gianmarco Bernasconi, Whye Kit Fong, Yiluan Guo, et al. Towards learning-based planning: The nuplan benchmark for real-world autonomous driving. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 629...

2024
[63]

Para-drive: Par- allelized architecture for real-time autonomous driving

Xinshuo Weng, Boris Ivanovic, Yan Wang, Yue Wang, and Marco Pavone. Para-drive: Par- allelized architecture for real-time autonomous driving. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15449–15458, 2024

2024
[64]

Sparsedrive: End-to-end autonomous driving via sparse scene representation

Wenchao Sun, Xuewu Lin, Yining Shi, Chuang Zhang, Haoran Wu, and Sifa Zheng. Sparsedrive: End-to-end autonomous driving via sparse scene representation. In2025 IEEE International Conference on Robotics and Automation (ICRA), pages 8795–8801. IEEE, 2025

2025
[65]

What matters for scalable and robust learning in end-to-end driving planners?arXiv preprint arXiv:2603.15185, 2026

David Holtz, Niklas Hanselmann, Simon Doll, Marius Cordts, and Bernt Schiele. What matters for scalable and robust learning in end-to-end driving planners?arXiv preprint arXiv:2603.15185, 2026

arXiv 2026
[66]

Drivevlm: The convergence of autonomous driving and large vision-language models.arXiv preprint arXiv:2402.12289, 2024

Xiaoyu Tian, Junru Gu, Bailin Li, Yicheng Liu, Yang Wang, Zhiyong Zhao, Kun Zhan, Peng Jia, Xianpeng Lang, and Hang Zhao. Drivevlm: The convergence of autonomous driving and large vision-language models.arXiv preprint arXiv:2402.12289, 2024

Pith/arXiv arXiv 2024
[67]

Simlingo: Vision-only closed-loop autonomous driving with language-action alignment

Katrin Renz, Long Chen, Elahe Arani, and Oleg Sinavski. Simlingo: Vision-only closed-loop autonomous driving with language-action alignment. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 11993–12003, 2025

2025
[68]

Orion: A holistic end-to-end autonomous driving framework by vision-language instructed action generation

Haoyu Fu, Diankun Zhang, Zongchuang Zhao, Jianfeng Cui, Dingkang Liang, Chong Zhang, Dingyuan Zhang, Hongwei Xie, Bing Wang, and Xiang Bai. Orion: A holistic end-to-end autonomous driving framework by vision-language instructed action generation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 24823–24834, 2025

2025
[69]

Reasoning-vla: A fast and general vision-language-action reasoning model for autonomous driving.arXiv preprint arXiv:2511.19912, 2025

Dapeng Zhang, Zhenlong Yuan, Zhangquan Chen, Chih-Ting Liao, Yinda Chen, Fei Shen, Qingguo Zhou, and Tat-Seng Chua. Reasoning-vla: A fast and general vision-language-action reasoning model for autonomous driving.arXiv preprint arXiv:2511.19912, 2025

arXiv 2025
[70]

Argoverse 2: Next generation datasets for self-driving perception and forecasting.arXiv preprint arXiv:2301.00493, 2023

Benjamin Wilson, William Qi, Tanmay Agarwal, John Lambert, Jagjeet Singh, Siddhesh Khandelwal, Bowen Pan, Ratnesh Kumar, Andrew Hartnett, Jhony Kaesemodel Pontes, et al. Argoverse 2: Next generation datasets for self-driving perception and forecasting.arXiv preprint arXiv:2301.00493, 2023. 14

Pith/arXiv arXiv 2023
[71]

Navsim: Data-driven non- reactive autonomous vehicle simulation and benchmarking.Advances in Neural Information Processing Systems, 37:28706–28719, 2024

Daniel Dauner, Marcel Hallgarten, Tianyu Li, Xinshuo Weng, Zhiyu Huang, Zetong Yang, Hongyang Li, Igor Gilitschenski, Boris Ivanovic, Marco Pavone, et al. Navsim: Data-driven non- reactive autonomous vehicle simulation and benchmarking.Advances in Neural Information Processing Systems, 37:28706–28719, 2024

2024
[72]

Pseudo-simulation for autonomous driving

Wei Cao, Marcel Hallgarten, Tianyu Li, Daniel Dauner, Xunjiang Gu, Caojun Wang, Yakov Miron, Marco Aiello, Hongyang Li, Igor Gilitschenski, Boris Ivanovic, Marco Pavone, Andreas Geiger, and Kashyap Chitta. Pseudo-simulation for autonomous driving. InConference on Robot Learning (CoRL), 2025

2025
[73]

Carla: An open urban driving simulator

Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. Carla: An open urban driving simulator. InConference on robot learning, pages 1–16. PMLR, 2017

2017
[74]

Bench2drive: Towards multi-ability benchmarking of closed-loop end-to-end autonomous driving.Advances in Neural Information Processing Systems, 37:819–844, 2024

Xiaosong Jia, Zhenjie Yang, Qifeng Li, Zhiyuan Zhang, and Junchi Yan. Bench2drive: Towards multi-ability benchmarking of closed-loop end-to-end autonomous driving.Advances in Neural Information Processing Systems, 37:819–844, 2024

2024
[75]

Fail2drive: Benchmarking closed-loop driving generalization.arXiv preprint arXiv:2604.08535, 2026

Simon Gerstenecker, Andreas Geiger, and Katrin Renz. Fail2drive: Benchmarking closed-loop driving generalization.arXiv preprint arXiv:2604.08535, 2026

Pith/arXiv arXiv 2026
[76]

Embodied scene understanding for vision language models via metavqa

Weizhen Wang, Chenda Duan, Zhenghao Peng, Yuxin Liu, and Bolei Zhou. Embodied scene understanding for vision language models via metavqa. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22453–22464, 2025

2025
[77]

Holistic autonomous driving understanding by bird’s-eye-view injected multi-modal large models

Xinpeng Ding, Jianhua Han, Hang Xu, Xiaodan Liang, Wei Zhang, and Xiaomeng Li. Holistic autonomous driving understanding by bird’s-eye-view injected multi-modal large models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13668–13677, 2024

2024
[78]

Are vision llms road-ready? a comprehensive benchmark for safety-critical driving video understanding

Tong Zeng, Longfeng Wu, Liang Shi, Dawei Zhou, and Feng Guo. Are vision llms road-ready? a comprehensive benchmark for safety-critical driving video understanding. InProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V . 2, pages 5972–5983, 2025

2025
[79]

Bench2drive-vl: Benchmarks for closed-loop autonomous driving with vision-language models

Xiaosong Jia, Yuqian Shao, Zhenjie Yang, Qifeng Li, Zhiyuan Zhang, and Junchi Yan. Bench2drive-vl: Benchmarks for closed-loop autonomous driving with vision-language models. arXiv preprint arXiv:2604.01259, 2026

arXiv 2026
[80]

Gemini 3.1 pro model card

Google DeepMind. Gemini 3.1 pro model card. https://deepmind.google/models/ model-cards/gemini-3-1-pro/, 2026. Accessed: 2026-05-01

2026

Showing first 80 references.

[1] [1]

St-p3: End-to-end vision-based autonomous driving via spatial-temporal feature learning

Shengchao Hu, Li Chen, Penghao Wu, Hongyang Li, Junchi Yan, and Dacheng Tao. St-p3: End-to-end vision-based autonomous driving via spatial-temporal feature learning. InEuropean Conference on Computer Vision, pages 533–549. Springer, 2022

2022

[2] [2]

Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chonghao Sima, Tong Lu, Qiao Yu, and Jifeng Dai. Bevformer: learning bird’s-eye-view representation from lidar-camera via spatiotemporal transformers.IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(3):2020– 2036, 2024

2020

[3] [3]

Collaborative semantic occupancy prediction with hybrid feature fusion in connected automated vehicles

Rui Song, Chenwei Liang, Hu Cao, Zhiran Yan, Walter Zimmer, Markus Gross, Andreas Festag, and Alois Knoll. Collaborative semantic occupancy prediction with hybrid feature fusion in connected automated vehicles. In2024 IEEE/CVF International Conference on Computer Vision and Pattern Recognition (CVPR). IEEE/CVF, 2024

2024

[4] [4]

V2XPnP: Vehicle-to-everything spatio-temporal fusion for multi-agent perception and prediction.Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025

Zewei Zhou, Hao Xiang, Zhaoliang Zheng, Seth Z Zhao, Mingyue Lei, Yun Zhang, Tianhui Cai, Xinyi Liu, Johnson Liu, Maheswari Bajji, et al. V2XPnP: Vehicle-to-everything spatio-temporal fusion for multi-agent perception and prediction.Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025

2025

[5] [5]

Maptr: Structured modeling and learning for online vectorized hd map construction

Bencheng Liao, Shaoyu Chen, Xinggang Wang, Tianheng Cheng, Qian Zhang, Wenyu Liu, and Chang Huang. Maptr: Structured modeling and learning for online vectorized hd map construction. InThe Eleventh International Conference on Learning Representations

[6] [6]

Gameformer: Game-theoretic modeling and learning of transformer-based interactive prediction and planning for autonomous driving

Zhiyu Huang, Haochen Liu, and Chen Lv. Gameformer: Game-theoretic modeling and learning of transformer-based interactive prediction and planning for autonomous driving. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 3903–3913, 2023

2023

[7] [7]

Relmap: Enhancing online map construction with class-aware spatial relation and semantic priors.arXiv preprint arXiv:2507.21567, 2025

Tianhui Cai, Yun Zhang, Zewei Zhou, Zhiyu Huang, and Jiaqi Ma. Relmap: Enhancing online map construction with class-aware spatial relation and semantic priors.arXiv preprint arXiv:2507.21567, 2025

arXiv 2025

[8] [8]

IPFormer: Visual 3d panoptic scene completion with context-adaptive instance proposals

Markus Gross, Aya Fahmy, Danit Niwattananan, Dominik Muhle, Rui Song, Daniel Cremers, and Henri Meeß. IPFormer: Visual 3d panoptic scene completion with context-adaptive instance proposals. InAdvances in Neural Information Processing Systems (NeurIPS), 2025

2025

[9] [9]

Parting with mis- conceptions about learning-based vehicle motion planning

Daniel Dauner, Marcel Hallgarten, Andreas Geiger, and Kashyap Chitta. Parting with mis- conceptions about learning-based vehicle motion planning. InConference on Robot Learning (CoRL), 2023

2023

[10] [10]

Mdg: Masked denois- ing generation for multi-agent behavior modeling in traffic environments.arXiv preprint arXiv:2511.17496, 2025

Zhiyu Huang, Zewei Zhou, Tianhui Cai, Yun Zhang, and Jiaqi Ma. Mdg: Masked denois- ing generation for multi-agent behavior modeling in traffic environments.arXiv preprint arXiv:2511.17496, 2025

arXiv 2025

[11] [11]

Gen-drive: Enhancing diffusion generative driving policies with reward modeling and reinforcement learning fine-tuning

Zhiyu Huang, Xinshuo Weng, Maximilian Igl, Yuxiao Chen, Yulong Cao, Boris Ivanovic, Marco Pavone, and Chen Lv. Gen-drive: Enhancing diffusion generative driving policies with reward modeling and reinforcement learning fine-tuning. In2025 IEEE International Conference on Robotics and Automation (ICRA), pages 3445–3451. IEEE, 2025. 10

2025

[12] [12]

End-to-end autonomous driving: Challenges and frontiers.IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(12):10164–10183, 2024

Li Chen, Penghao Wu, Kashyap Chitta, Bernhard Jaeger, Andreas Geiger, and Hongyang Li. End-to-end autonomous driving: Challenges and frontiers.IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(12):10164–10183, 2024

2024

[13] [13]

Vad: Vectorized scene representation for efficient autonomous driving

Bo Jiang, Shaoyu Chen, Qing Xu, Bencheng Liao, Jiajie Chen, Helong Zhou, Qian Zhang, Wenyu Liu, Chang Huang, and Xinggang Wang. Vad: Vectorized scene representation for efficient autonomous driving. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 8340–8350, 2023

2023

[14] [14]

Genad: Gen- erative end-to-end autonomous driving

Wenzhao Zheng, Ruiqi Song, Xianda Guo, Chenming Zhang, and Long Chen. Genad: Gen- erative end-to-end autonomous driving. InEuropean Conference on Computer Vision, pages 87–104. Springer, 2024

2024

[15] [15]

Trajectory-guided control prediction for end-to-end autonomous driving: A simple yet strong baseline.Advances in Neural Information Processing Systems, 35:6119–6132, 2022

Penghao Wu, Xiaosong Jia, Li Chen, Junchi Yan, Hongyang Li, and Yu Qiao. Trajectory-guided control prediction for end-to-end autonomous driving: A simple yet strong baseline.Advances in Neural Information Processing Systems, 35:6119–6132, 2022

2022

[16] [16]

Planning-oriented autonomous driving

Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wenhai Wang, et al. Planning-oriented autonomous driving. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 17853–17862, 2023

2023

[17] [17]

Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving

Bencheng Liao, Shaoyu Chen, Haoran Yin, Bo Jiang, Cheng Wang, Sixu Yan, Xinbang Zhang, Xiangyu Li, Ying Zhang, Qian Zhang, et al. Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 12037–12047, 2025

2025

[18] [18]

Drivetransformer: Unified trans- former for scalable end-to-end autonomous driving

Xiaosong Jia, Junqi You, Zhiyuan Zhang, and Junchi Yan. Drivetransformer: Unified trans- former for scalable end-to-end autonomous driving. InThe Thirteenth International Conference on Learning Representations, 2025

2025

[19] [19]

Future-aware end-to-end driving: Bidirectional modeling of trajectory planning and scene evolution.arXiv preprint arXiv:2510.11092, 2025

Bozhou Zhang, Nan Song, Jingyu Li, Xiatian Zhu, Jiankang Deng, and Li Zhang. Future-aware end-to-end driving: Bidirectional modeling of trajectory planning and scene evolution.arXiv preprint arXiv:2510.11092, 2025

arXiv 2025

[20] [20]

Perception in plan: Coupled perception and planning for end-to-end autonomous driving

Bozhou Zhang, Jingyu Li, Nan Song, and Li Zhang. Perception in plan: Coupled perception and planning for end-to-end autonomous driving. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 12376–12384, 2026

2026

[21] [21]

Simscale: Learning to drive via real-world simulation at scale.arXiv preprint arXiv:2511.23369, 2025

Haochen Tian, Tianyu Li, Haochen Liu, Jiazhi Yang, Yihang Qiu, Guang Li, Junli Wang, Yinfeng Gao, Zhang Zhang, Liang Wang, et al. Simscale: Learning to drive via real-world simulation at scale.arXiv preprint arXiv:2511.23369, 2025

Pith/arXiv arXiv 2025

[22] [22]

Generative ai for autonomous driving: Frontiers and opportunities.arXiv preprint arXiv:2505.08854, 2025

Yuping Wang, Shuo Xing, Cui Can, Renjie Li, Hongyuan Hua, Kexin Tian, Zhaobin Mo, Xiangbo Gao, Keshu Wu, Sulong Zhou, et al. Generative ai for autonomous driving: Frontiers and opportunities.arXiv preprint arXiv:2505.08854, 2025

arXiv 2025

[23] [23]

Driving with regulation: Trustworthy and interpretable decision-making for autonomous driving with retrieval-augmented reasoning

Tianhui Cai, Yifan Liu, Zewei Zhou, Haoxuan Ma, Seth Z Zhao, Zhiwen Wu, Xu Han, Zhiyu Huang, and Jiaqi Ma. Driving with regulation: Trustworthy and interpretable decision-making for autonomous driving with retrieval-augmented reasoning. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 38287–38295, 2026

2026

[24] [24]

Are vlms ready for autonomous driving? an empirical study from the reliability, data and metric perspectives

Shaoyuan Xie, Lingdong Kong, Yuhao Dong, Chonghao Sima, Wenwei Zhang, Qi Alfred Chen, Ziwei Liu, and Liang Pan. Are vlms ready for autonomous driving? an empirical study from the reliability, data and metric perspectives. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 6585–6597, 2025

2025

[25] [25]

Emma: End-to-end multimodal model for autonomous driving.Transactions on Machine Learning Research, 2025

Jyh-Jing Hwang, Runsheng Xu, Hubert Lin, Wei-Chih Hung, Jingwei Ji, Kristy Choi, Di Huang, Tong He, Paul Covington, Benjamin Sapp, et al. Emma: End-to-end multimodal model for autonomous driving.Transactions on Machine Learning Research, 2025

2025

[26] [26]

Drivelm: Driving with graph visual question answering

Chonghao Sima, Katrin Renz, Kashyap Chitta, Li Chen, Hanxue Zhang, Chengen Xie, Jens Beißwenger, Ping Luo, Andreas Geiger, and Hongyang Li. Drivelm: Driving with graph visual question answering. InEuropean conference on computer vision, pages 256–274. Springer, 2024. 11

2024

[27] [27]

Omnidrive: A holistic vision-language dataset for autonomous driving with counterfactual reasoning

Shihao Wang, Zhiding Yu, Xiaohui Jiang, Shiyi Lan, Min Shi, Nadine Chang, Jan Kautz, Ying Li, and Jose M Alvarez. Omnidrive: A holistic vision-language dataset for autonomous driving with counterfactual reasoning. InProceedings of the computer vision and pattern recognition conference, pages 22442–22452, 2025

2025

[28] [28]

Driving everywhere with large language model policy adaptation

Boyi Li, Yue Wang, Jiageng Mao, Boris Ivanovic, Sushant Veer, Karen Leung, and Marco Pavone. Driving everywhere with large language model policy adaptation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14948–14957, 2024

2024

[29] [29]

S4-driver: Scalable self-supervised driving multimodal large language model with spatio-temporal visual representation

Yichen Xie, Runsheng Xu, Tong He, Jyh-Jing Hwang, Katie Luo, Jingwei Ji, Hubert Lin, Letian Chen, Yiren Lu, Zhaoqi Leng, et al. S4-driver: Scalable self-supervised driving multimodal large language model with spatio-temporal visual representation. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 1622–1632, 2025

2025

[30] [30]

Latent chain-of-thought world modeling for end-to-end driving.arXiv preprint arXiv:2512.10226, 2025

Shuhan Tan, Kashyap Chitta, Yuxiao Chen, Ran Tian, Yurong You, Yan Wang, Wenjie Luo, Yulong Cao, Philipp Krahenbuhl, Marco Pavone, et al. Latent chain-of-thought world modeling for end-to-end driving.arXiv preprint arXiv:2512.10226, 2025

Pith/arXiv arXiv 2025

[31] [31]

Reason2drive: Towards interpretable and chain-based reasoning for autonomous driving

Ming Nie, Renyuan Peng, Chunwei Wang, Xinyue Cai, Jianhua Han, Hang Xu, and Li Zhang. Reason2drive: Towards interpretable and chain-based reasoning for autonomous driving. In European Conference on Computer Vision, pages 292–308. Springer, 2024

2024

[32] [32]

Vision-language-action models for autonomous driving: Past, present, and future.arXiv preprint arXiv:2512.16760, 2025

Tianshuai Hu, Xiaolu Liu, Song Wang, Yiyao Zhu, Ao Liang, Lingdong Kong, Guoyang Zhao, Zeying Gong, Jun Cen, Zhiyu Huang, et al. Vision-language-action models for autonomous driving: Past, present, and future.arXiv preprint arXiv:2512.16760, 2025

arXiv 2025

[33] [33]

Autovla: A vision-language-action model for end-to-end autonomous driving with adaptive reasoning and reinforcement fine-tuning.arXiv preprint arXiv:2506.13757, 2025

Zewei Zhou, Tianhui Cai, Seth Z Zhao, Yun Zhang, Zhiyu Huang, Bolei Zhou, and Jiaqi Ma. Autovla: A vision-language-action model for end-to-end autonomous driving with adaptive reasoning and reinforcement fine-tuning.arXiv preprint arXiv:2506.13757, 2025

Pith/arXiv arXiv 2025

[34] [34]

Open- drivevla: Towards end-to-end autonomous driving with large vision language action model

Xingcheng Zhou, Xuyuan Han, Feng Yang, Yunpu Ma, V olker Tresp, and Alois Knoll. Open- drivevla: Towards end-to-end autonomous driving with large vision language action model. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 13782–13790, 2026

2026

[35] [35]

Alpamayo-r1: Bridging reasoning and action prediction for generalizable autonomous driving in the long tail.arXiv preprint arXiv:2511.00088, 2025

Yan Wang, Wenjie Luo, Junjie Bai, Yulong Cao, Tong Che, Ke Chen, Yuxiao Chen, Jenna Dia- mond, Yifan Ding, Wenhao Ding, et al. Alpamayo-r1: Bridging reasoning and action prediction for generalizable autonomous driving in the long tail.arXiv preprint arXiv:2511.00088, 2025

Pith/arXiv arXiv 2025

[36] [36]

Futuresightdrive: Thinking visually with spatio-temporal cot for autonomous driving

Shuang Zeng, Xinyuan Chang, Mengwei Xie, Xinran Liu, Yifan Bai, Zheng Pan, Mu Xu, and Xing Wei. Futuresightdrive: Thinking visually with spatio-temporal cot for autonomous driving. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems

[37] [37]

Colavla: Leveraging cognitive latent reasoning for hierarchical parallel trajectory planning in autonomous driving.arXiv preprint arXiv:2512.22939, 2025

Qihang Peng, Xuesong Chen, Chenye Yang, Shaoshuai Shi, and Hongsheng Li. Colavla: Leveraging cognitive latent reasoning for hierarchical parallel trajectory planning in autonomous driving.arXiv preprint arXiv:2512.22939, 2025

arXiv 2025

[38] [38]

Uni-world vla: Interleaved world modeling and planning for autonomous driving.arXiv preprint arXiv:2603.27287, 2026

Qiqi Liu, Huan Xu, Jingyu Li, Bin Sun, Zhihui Hao, Dangen She, Xiatian Zhu, and Li Zhang. Uni-world vla: Interleaved world modeling and planning for autonomous driving.arXiv preprint arXiv:2603.27287, 2026

arXiv 2026

[39] [39]

Minddrive: A vision-language-action model for autonomous driving via online reinforcement learning.arXiv preprint arXiv:2512.13636, 2025

Haoyu Fu, Diankun Zhang, Zongchuang Zhao, Jianfeng Cui, Hongwei Xie, Bing Wang, Guang Chen, Dingkang Liang, and Xiang Bai. Minddrive: A vision-language-action model for autonomous driving via online reinforcement learning.arXiv preprint arXiv:2512.13636, 2025

arXiv 2025

[40] [40]

Real-ad: Towards human-like reasoning in end-to-end autonomous driving

Yuhang Lu, Jiadong Tu, Yuexin Ma, and Xinge Zhu. Real-ad: Towards human-like reasoning in end-to-end autonomous driving. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 27783–27793, 2025

2025

[41] [41]

Lmdrive: Closed-loop end-to-end driving with large language models

Hao Shao, Yuxuan Hu, Letian Wang, Guanglu Song, Steven L Waslander, Yu Liu, and Hong- sheng Li. Lmdrive: Closed-loop end-to-end driving with large language models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15120–15130, 2024. 12

2024

[42] [42]

nuscenes: A multimodal dataset for autonomous driving

Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh V ora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11621–11631, 2020

2020

[43] [43]

Scalability in perception for autonomous driving: Waymo open dataset

Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, et al. Scalability in perception for autonomous driving: Waymo open dataset. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2446–2454, 2020

2020

[44] [44]

Large scale interactive motion forecasting for autonomous driving: The waymo open motion dataset

Scott Ettinger, Shuyang Cheng, Benjamin Caine, Chenxi Liu, Hang Zhao, Sabeek Pradhan, Yuning Chai, Ben Sapp, Charles R Qi, Yin Zhou, et al. Large scale interactive motion forecasting for autonomous driving: The waymo open motion dataset. InProceedings of the IEEE/CVF international conference on computer vision, pages 9710–9719, 2021

2021

[45] [45]

Argoverse: 3d tracking and forecasting with rich maps

Ming-Fang Chang, John Lambert, Patsorn Sangkloy, Jagjeet Singh, Slawomir Bak, Andrew Hartnett, De Wang, Peter Carr, Simon Lucey, Deva Ramanan, and James Hays. Argoverse: 3d tracking and forecasting with rich maps. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019

2019

[46] [46]

Trends in motion predic- tion toward deployable and generalizable autonomy: A revisit and perspectives.Foundations and Trends® in Robotics, 13(1-2):1–269, 2026

Letian Wang, Marc-Antoine Lavoie, Sandro Papais, Barza Nisar, Yuxiao Chen, Wenhao Ding, Boris Ivanovic, Hao Shao, Abulikemu Abuduweili, Evan Cook, et al. Trends in motion predic- tion toward deployable and generalizable autonomy: A revisit and perspectives.Foundations and Trends® in Robotics, 13(1-2):1–269, 2026

2026

[47] [47]

nuplan: A closed-loop ml-based planning benchmark for autonomous vehicles.arXiv preprint arXiv:2106.11810, 2021

Holger Caesar, Juraj Kabzan, Kok Seang Tan, Whye Kit Fong, Eric Wolff, Alex Lang, Luke Fletcher, Oscar Beijbom, and Sammy Omari. nuplan: A closed-loop ml-based planning benchmark for autonomous vehicles.arXiv preprint arXiv:2106.11810, 2021

Pith/arXiv arXiv 2021

[48] [48]

Wod-e2e: Waymo open dataset for end-to-end driving in challenging long-tail scenarios.arXiv preprint arXiv:2510.26125, 2025

Runsheng Xu, Hubert Lin, Wonseok Jeon, Hao Feng, Yuliang Zou, Liting Sun, John Gorman, Ekaterina Tolstaya, Sarah Tang, Brandyn White, et al. Wod-e2e: Waymo open dataset for end-to-end driving in challenging long-tail scenarios.arXiv preprint arXiv:2510.26125, 2025

arXiv 2025

[49] [49]

Bridgesim: Unveiling the ol-cl gap in end-to-end autonomous driving.arXiv preprint arXiv:2604.10856, 2026

Seth Z Zhao, Luobin Wang, Hongwei Ruan, Yuxin Bao, Yilan Chen, Ziyang Leng, Abhijit Ravichandran, Honglin He, Zewei Zhou, Xu Han, et al. Bridgesim: Unveiling the ol-cl gap in end-to-end autonomous driving.arXiv preprint arXiv:2604.10856, 2026

Pith/arXiv arXiv 2026

[50] [50]

Drivee2e: Closed-loop benchmark for end-to-end autonomous driving through real-to- simulation.arXiv preprint arXiv:2509.23922, 2025

Haibao Yu, Wenxian Yang, Ruiyang Hao, Chuanye Wang, Jiaru Zhong, Ping Luo, and Zaiqing Nie. Drivee2e: Closed-loop benchmark for end-to-end autonomous driving through real-to- simulation.arXiv preprint arXiv:2509.23922, 2025

arXiv 2025

[51] [51]

Nuplanqa: A large-scale dataset and benchmark for multi-view driving scene understanding in multi-modal large language models

Sung-Yeon Park, Can Cui, Yunsheng Ma, Ahmadreza Moradipari, Rohit Gupta, Kyungtae Han, and Ziran Wang. Nuplanqa: A large-scale dataset and benchmark for multi-view driving scene understanding in multi-modal large language models. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 8066–8076, 2025

2025

[52] [52]

Nuscenes-qa: A multi-modal visual question answering benchmark for autonomous driving scenario

Tianwen Qian, Jingjing Chen, Linhai Zhuo, Yang Jiao, and Yu-Gang Jiang. Nuscenes-qa: A multi-modal visual question answering benchmark for autonomous driving scenario. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 4542–4550, 2024

2024

[53] [53]

Lingoqa: Visual question answering for autonomous driving

Ana-Maria Marcu, Long Chen, Jan Hünermann, Alice Karnsund, Benoit Hanotte, Prajwal Chidananda, Saurabh Nair, Vijay Badrinarayanan, Alex Kendall, Jamie Shotton, et al. Lingoqa: Visual question answering for autonomous driving. InEuropean Conference on Computer Vision, pages 252–269. Springer, 2024

2024

[54] [54]

Accelerating structured chain-of-thought in autonomous vehicles.arXiv preprint arXiv:2602.02864, 2026

Yi Gu, Yan Wang, Yuxiao Chen, Yurong You, Wenjie Luo, Yue Wang, Wenhao Ding, Boyi Li, Heng Yang, Boris Ivanovic, et al. Accelerating structured chain-of-thought in autonomous vehicles.arXiv preprint arXiv:2602.02864, 2026

arXiv 2026

[55] [55]

Waymoqa: A multi-view visual question answering dataset for safety-critical reasoning in autonomous driving.arXiv preprint arXiv:2511.20022, 2025

Seungjun Yu, Seonho Lee, Namho Kim, Jaeyo Shin, Junsung Park, Wonjeong Ryu, Raehyuk Jung, and Hyunjung Shim. Waymoqa: A multi-view visual question answering dataset for safety-critical reasoning in autonomous driving.arXiv preprint arXiv:2511.20022, 2025. 13

arXiv 2025

[56] [56]

Covla: Comprehensive vision-language-action dataset for autonomous driving

Hidehisa Arai, Keita Miwa, Kento Sasaki, Kohei Watanabe, Yu Yamaguchi, Shunsuke Aoki, and Issei Yamamoto. Covla: Comprehensive vision-language-action dataset for autonomous driving. In2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 1933–1943. IEEE, 2025

1933

[57] [57]

Impromptu vla: Open weights and open data for driving vision-language-action models.arXiv preprint arXiv:2505.23757, 2025

Haohan Chi, Huan-ang Gao, Ziming Liu, Jianing Liu, Chenyu Liu, Jinwei Li, Kaisen Yang, Yangcheng Yu, Zeda Wang, Wenyi Li, et al. Impromptu vla: Open weights and open data for driving vision-language-action models.arXiv preprint arXiv:2505.23757, 2025

arXiv 2025

[58] [58]

Spatial-aware vision language model for autonomous driving.arXiv preprint arXiv:2512.24331, 2025

Weijie Wei, Zhipeng Luo, Ling Feng, and Venice Erin Liong. Spatial-aware vision language model for autonomous driving.arXiv preprint arXiv:2512.24331, 2025

Pith/arXiv arXiv 2025

[59] [59]

Spacedrive: Infusing spatial awareness into vlm-based autonomous driving.arXiv preprint arXiv:2512.10719, 2, 2025

Peizheng Li, Zhenghao Zhang, David Holtz, Hang Yu, Yutong Yang, Yuzhi Lai, Rui Song, Andreas Geiger, and Andreas Zell. Spacedrive: Infusing spatial awareness into vlm-based autonomous driving.arXiv preprint arXiv:2512.10719, 2, 2025

Pith/arXiv arXiv 2025

[60] [60]

Reasonplan: Unified scene prediction and decision reasoning for closed-loop autonomous driving

Xueyi Liu, Zuodong Zhong, Qichao Zhang, Yuxin Guo, Yupeng Zheng, Junli Wang, Dongbin Zhao, Yun-Fu Liu, Zhiguo Su, Yinfeng Gao, et al. Reasonplan: Unified scene prediction and decision reasoning for closed-loop autonomous driving. InConference on Robot Learning, pages 3051–3068. PMLR, 2025

2025

[61] [61]

Counterfactual vla: Self-reflective vision-language-action model with adaptive reasoning.arXiv preprint arXiv:2512.24426, 2025

Zhenghao Peng, Wenhao Ding, Yurong You, Yuxiao Chen, Wenjie Luo, Thomas Tian, Yulong Cao, Apoorva Sharma, Danfei Xu, Boris Ivanovic, et al. Counterfactual vla: Self-reflective vision-language-action model with adaptive reasoning.arXiv preprint arXiv:2512.24426, 2025

arXiv 2025

[62] [62]

Towards learning-based planning: The nuplan benchmark for real-world autonomous driving

Napat Karnchanachari, Dimitris Geromichalos, Kok Seang Tan, Nanxiang Li, Christopher Eriksen, Shakiba Yaghoubi, Noushin Mehdipour, Gianmarco Bernasconi, Whye Kit Fong, Yiluan Guo, et al. Towards learning-based planning: The nuplan benchmark for real-world autonomous driving. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 629...

2024

[63] [63]

Para-drive: Par- allelized architecture for real-time autonomous driving

Xinshuo Weng, Boris Ivanovic, Yan Wang, Yue Wang, and Marco Pavone. Para-drive: Par- allelized architecture for real-time autonomous driving. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15449–15458, 2024

2024

[64] [64]

Sparsedrive: End-to-end autonomous driving via sparse scene representation

Wenchao Sun, Xuewu Lin, Yining Shi, Chuang Zhang, Haoran Wu, and Sifa Zheng. Sparsedrive: End-to-end autonomous driving via sparse scene representation. In2025 IEEE International Conference on Robotics and Automation (ICRA), pages 8795–8801. IEEE, 2025

2025

[65] [65]

What matters for scalable and robust learning in end-to-end driving planners?arXiv preprint arXiv:2603.15185, 2026

David Holtz, Niklas Hanselmann, Simon Doll, Marius Cordts, and Bernt Schiele. What matters for scalable and robust learning in end-to-end driving planners?arXiv preprint arXiv:2603.15185, 2026

arXiv 2026

[66] [66]

Drivevlm: The convergence of autonomous driving and large vision-language models.arXiv preprint arXiv:2402.12289, 2024

Xiaoyu Tian, Junru Gu, Bailin Li, Yicheng Liu, Yang Wang, Zhiyong Zhao, Kun Zhan, Peng Jia, Xianpeng Lang, and Hang Zhao. Drivevlm: The convergence of autonomous driving and large vision-language models.arXiv preprint arXiv:2402.12289, 2024

Pith/arXiv arXiv 2024

[67] [67]

Simlingo: Vision-only closed-loop autonomous driving with language-action alignment

Katrin Renz, Long Chen, Elahe Arani, and Oleg Sinavski. Simlingo: Vision-only closed-loop autonomous driving with language-action alignment. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 11993–12003, 2025

2025

[68] [68]

Orion: A holistic end-to-end autonomous driving framework by vision-language instructed action generation

Haoyu Fu, Diankun Zhang, Zongchuang Zhao, Jianfeng Cui, Dingkang Liang, Chong Zhang, Dingyuan Zhang, Hongwei Xie, Bing Wang, and Xiang Bai. Orion: A holistic end-to-end autonomous driving framework by vision-language instructed action generation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 24823–24834, 2025

2025

[69] [69]

Reasoning-vla: A fast and general vision-language-action reasoning model for autonomous driving.arXiv preprint arXiv:2511.19912, 2025

Dapeng Zhang, Zhenlong Yuan, Zhangquan Chen, Chih-Ting Liao, Yinda Chen, Fei Shen, Qingguo Zhou, and Tat-Seng Chua. Reasoning-vla: A fast and general vision-language-action reasoning model for autonomous driving.arXiv preprint arXiv:2511.19912, 2025

arXiv 2025

[70] [70]

Argoverse 2: Next generation datasets for self-driving perception and forecasting.arXiv preprint arXiv:2301.00493, 2023

Benjamin Wilson, William Qi, Tanmay Agarwal, John Lambert, Jagjeet Singh, Siddhesh Khandelwal, Bowen Pan, Ratnesh Kumar, Andrew Hartnett, Jhony Kaesemodel Pontes, et al. Argoverse 2: Next generation datasets for self-driving perception and forecasting.arXiv preprint arXiv:2301.00493, 2023. 14

Pith/arXiv arXiv 2023

[71] [71]

Navsim: Data-driven non- reactive autonomous vehicle simulation and benchmarking.Advances in Neural Information Processing Systems, 37:28706–28719, 2024

Daniel Dauner, Marcel Hallgarten, Tianyu Li, Xinshuo Weng, Zhiyu Huang, Zetong Yang, Hongyang Li, Igor Gilitschenski, Boris Ivanovic, Marco Pavone, et al. Navsim: Data-driven non- reactive autonomous vehicle simulation and benchmarking.Advances in Neural Information Processing Systems, 37:28706–28719, 2024

2024

[72] [72]

Pseudo-simulation for autonomous driving

Wei Cao, Marcel Hallgarten, Tianyu Li, Daniel Dauner, Xunjiang Gu, Caojun Wang, Yakov Miron, Marco Aiello, Hongyang Li, Igor Gilitschenski, Boris Ivanovic, Marco Pavone, Andreas Geiger, and Kashyap Chitta. Pseudo-simulation for autonomous driving. InConference on Robot Learning (CoRL), 2025

2025

[73] [73]

Carla: An open urban driving simulator

Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. Carla: An open urban driving simulator. InConference on robot learning, pages 1–16. PMLR, 2017

2017

[74] [74]

Bench2drive: Towards multi-ability benchmarking of closed-loop end-to-end autonomous driving.Advances in Neural Information Processing Systems, 37:819–844, 2024

Xiaosong Jia, Zhenjie Yang, Qifeng Li, Zhiyuan Zhang, and Junchi Yan. Bench2drive: Towards multi-ability benchmarking of closed-loop end-to-end autonomous driving.Advances in Neural Information Processing Systems, 37:819–844, 2024

2024

[75] [75]

Fail2drive: Benchmarking closed-loop driving generalization.arXiv preprint arXiv:2604.08535, 2026

Simon Gerstenecker, Andreas Geiger, and Katrin Renz. Fail2drive: Benchmarking closed-loop driving generalization.arXiv preprint arXiv:2604.08535, 2026

Pith/arXiv arXiv 2026

[76] [76]

Embodied scene understanding for vision language models via metavqa

Weizhen Wang, Chenda Duan, Zhenghao Peng, Yuxin Liu, and Bolei Zhou. Embodied scene understanding for vision language models via metavqa. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22453–22464, 2025

2025

[77] [77]

Holistic autonomous driving understanding by bird’s-eye-view injected multi-modal large models

Xinpeng Ding, Jianhua Han, Hang Xu, Xiaodan Liang, Wei Zhang, and Xiaomeng Li. Holistic autonomous driving understanding by bird’s-eye-view injected multi-modal large models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13668–13677, 2024

2024

[78] [78]

Are vision llms road-ready? a comprehensive benchmark for safety-critical driving video understanding

Tong Zeng, Longfeng Wu, Liang Shi, Dawei Zhou, and Feng Guo. Are vision llms road-ready? a comprehensive benchmark for safety-critical driving video understanding. InProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V . 2, pages 5972–5983, 2025

2025

[79] [79]

Bench2drive-vl: Benchmarks for closed-loop autonomous driving with vision-language models

Xiaosong Jia, Yuqian Shao, Zhenjie Yang, Qifeng Li, Zhiyuan Zhang, and Junchi Yan. Bench2drive-vl: Benchmarks for closed-loop autonomous driving with vision-language models. arXiv preprint arXiv:2604.01259, 2026

arXiv 2026

[80] [80]

Gemini 3.1 pro model card

Google DeepMind. Gemini 3.1 pro model card. https://deepmind.google/models/ model-cards/gemini-3-1-pro/, 2026. Accessed: 2026-05-01

2026