arxiv: 2605.31572 · v1 · pith:C6FZ6FFNnew · submitted 2026-05-29 · 💻 cs.CV

nuReasoning: A Reasoning-Centric Dataset and Benchmark for Long-Tail Autonomous Driving

Pith reviewed 2026-06-28 22:33 UTC · model grok-4.3

classification 💻 cs.CV
keywords autonomous driving, reasoning dataset, long-tail scenarios, visual question answering, planning evaluation, vision-language models, nuScenes

The pith

nuReasoning supplies human-verified reasoning annotations across 20,000 driving clips to improve both question answering and planning in long-tail autonomous driving.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces nuReasoning, a dataset of 20,000 twenty-second clips collected across cities with synchronized cameras, LiDAR, maps, object labels, and human-verified annotations for spatial, decision, and counterfactual reasoning. It positions this resource as an advance over prior perception-focused datasets by supporting direct evaluation of how reasoning supervision transfers to driving tasks. Experiments demonstrate gains in vision-language model performance on driving-specific question answering after fine-tuning and gains in vision-language-action model planning performance after reasoning-aware training. The planning improvement persists even when the model is prevented from producing textual reasoning at inference time. This combination establishes a benchmark for testing whether explicit reasoning data produces more robust autonomous driving behavior in rare scenarios.

Core claim

The central claim is that a large-scale real-world dataset with human-verified reasoning annotations for spatial relations, agent interactions, and safe decisions enables both reasoning evaluation and planning evaluation. Two experiments are offered in support: fine-tuning vision-language models on the data raises driving question-answering accuracy, and adding reasoning supervision during vision-language-action training raises planning performance even when textual reasoning outputs are disabled at inference.

What carries the argument

The nuReasoning dataset, which pairs multi-camera images, LiDAR, HD maps, and object annotations with three categories of human-verified reasoning annotations for each of the 20,000 clips.

If this is right

  • Fine-tuning vision-language models on nuReasoning substantially improves performance on driving-specific question answering.
  • Incorporating reasoning supervision into vision-language-action training improves planning performance.
  • Planning performance gains hold even when textual reasoning outputs are disabled at inference time.
  • The dataset structure allows a direct study of how reasoning supervision affects driving performance separate from perception or prediction tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The separation of reasoning and planning evaluation tracks could be used to measure whether reasoning supervision produces more interpretable intermediate representations inside the model.
  • If the annotations prove reliable, the same clip-level reasoning labels could serve as supervision for other sensor modalities or for simulation-to-real transfer.
  • Extending the same annotation protocol to additional cities or weather conditions would test whether the observed planning gains generalize beyond the current collection sites.

Load-bearing premise

The human-verified reasoning annotations accurately capture the commonsense knowledge, spatial relations, and inferences required for safe decisions in long-tail driving scenes.

What would settle it

An experiment in which models trained with nuReasoning reasoning supervision show no planning improvement over models trained only on perception data when both are tested on held-out long-tail scenes, or a comparison showing that the annotations diverge from judgments by experienced drivers on key counterfactual inferences.
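The first test described above amounts to a paired comparison on held-out scenes. A minimal sketch of one way to run it follows, assuming a per-scene planning error is available for both models. The function name `paired_bootstrap_ci`, the error metric, and the resampling settings are illustrative assumptions, not details taken from the paper.

```python
import random
from statistics import mean


def paired_bootstrap_ci(baseline_err, reasoning_err, n_resamples=2000, alpha=0.05, seed=0):
    """Bootstrap a confidence interval for the mean per-scene error reduction.

    Both lists are hypothetical per-scene planning errors on the same held-out
    scenes, in the same order. A positive difference means the
    reasoning-supervised model had lower error on that scene.
    """
    if len(baseline_err) != len(reasoning_err) or not baseline_err:
        raise ValueError("need two equal-length, non-empty lists")
    diffs = [b - r for b, r in zip(baseline_err, reasoning_err)]
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(diffs)
    # Resample scenes with replacement and record the mean difference each time.
    boot = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_resamples))
    lo = boot[int((alpha / 2) * n_resamples)]
    hi = boot[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(diffs), lo, hi
```

Under these assumptions, an interval that contains zero would be consistent with the "no planning improvement" outcome, while an interval entirely above zero would count against it.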

Figures

Figures reproduced from arXiv: 2605.31572 by Hanyin Zhang, Jiali Chen, Jiaqi Ma, Johnson Liu, Mingxuan Gao, Ruining Yang, Rui Song, Tianhui Cai, Tony (Xuewei) Qi, Valeria Xu, Yiluan Guo, Yishan Shen, Yun Zhang, Zewei Zhou, Zhiyu Huang.

Figure 1: nuReasoning is a large-scale real-world long-tail driving dataset containing 20K 20-second clips across diverse scenario types. The dataset provides high-quality reasoning annotations spanning spatial reasoning, driving decisions, and counterfactual reasoning. Compared with prior datasets, nuReasoning offers substantially larger-scale long-tail driving data and richer reasoning annotations, enabling models…

Figure 2: Overview of the long-tail data mining and annotation pipeline. (a) Internal fleet driving…

Figure 3: Example of reasoning annotation in the nuReasoning dataset. The frame is annotated with…

Figure 4: Overview of the nuVLA baseline and the reasoning evaluation benchmark. (a) nuVLA…

Figure 5: Example of reasoning and planning results on the test set of nuReasoning. Reasoning is…

Original abstract

Reasoning is essential for autonomous driving (AD) in long-tail scenarios, where vehicles must apply commonsense knowledge, understand spatial relations, infer agent interactions, and make safe decisions. However, existing AD datasets and benchmarks mainly target perception, prediction, or planning, and provide limited supervision for reasoning over realistic long-tail driving scenes. We introduce nuReasoning, a large-scale real-world dataset and benchmark for reasoning-centric AD. Following the lineage of nuScenes and nuPlan, nuReasoning advances real-world AD datasets and benchmarks toward reasoning in long-tail driving scenarios. The dataset contains 20,000 clips, each 20 seconds long, collected across multiple cities, with synchronized multi-camera images, LiDAR data, HD maps, object annotations, and human-verified reasoning annotations spanning Spatial Reasoning, Decision Reasoning, and Counterfactual Reasoning. Unlike prior datasets that focus primarily on visual question answering, nuReasoning supports both reasoning evaluation and planning evaluation, enabling a direct study of how reasoning supervision affects driving performance. Experiments show that fine-tuning VLMs on nuReasoning substantially improves driving-specific question answering, while incorporating reasoning supervision into VLA training improves planning performance even when textual reasoning outputs are disabled at inference time. These results establish nuReasoning as a foundation for evaluating and improving robust, interpretable, reasoning-driven AD systems in realistic long-tail settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces nuReasoning, a dataset of 20,000 20-second real-world driving clips with synchronized multi-camera, LiDAR, HD maps, and object annotations, augmented by human-verified reasoning labels in three categories (Spatial Reasoning, Decision Reasoning, Counterfactual Reasoning). It supports both VQA-style reasoning evaluation and planning evaluation, with experiments claiming that fine-tuning VLMs on the dataset improves driving-specific question answering and that adding reasoning supervision to VLA training boosts planning metrics even when textual reasoning is disabled at inference.

Significance. If the annotations reliably capture the required commonsense and spatial inferences, nuReasoning would be a useful addition to the nuScenes/nuPlan lineage by enabling controlled study of how explicit reasoning supervision transfers to closed-loop planning. The reported transfer effect (reasoning supervision helping planning without explicit outputs) would be a concrete empirical contribution if robustly demonstrated.

major comments (3)
  1. [§3.2 and §4.1] §3.2 (Dataset Construction) and §4.1 (Annotation Process): The claim that annotations are 'human-verified' is load-bearing for both the VLM QA and VLA planning results, yet the manuscript provides no inter-annotator agreement statistics, number of reviewers per clip, qualification criteria for annotators, or protocol for resolving disagreements on ambiguous long-tail cases. Without these, it is impossible to assess whether the supervision signal is reliable or whether gains could be explained by scale alone.
  2. [§5.2 and Table 4] §5.2 (VLA Experiments) and Table 4: The planning improvement from reasoning supervision is presented as evidence that the annotations encode useful inferences, but the section does not report controls that isolate reasoning content from dataset size or from the base VLA architecture; e.g., no ablation comparing reasoning-augmented data against an equal-sized non-reasoning subset. This directly affects attribution of the reported gains.
  3. [§5.1] §5.1 (VLM Experiments): The driving-specific QA gains are reported without details on the train/test split construction, whether test scenes overlap with training cities, or statistical significance testing across multiple seeds. These omissions make it difficult to judge whether the improvements generalize beyond the particular data partition used.
minor comments (2)
  1. [Abstract and §1] The abstract and §1 use 'long-tail' without a precise operational definition or quantitative characterization of how the 20k clips were selected to emphasize tail events.
  2. [Figure 3] Figure 3 (example annotations) would benefit from clearer indication of which reasoning category each highlighted sentence belongs to.
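Major comment 3 asks whether test scenes overlap with training cities. A minimal sketch of a city-level split and an overlap check is given here, assuming each clip carries a city tag. The clip identifiers, city names, and function names are hypothetical and are not taken from the dataset.

```python
def split_by_city(clips, test_cities):
    """Split (clip_id, city) pairs so that whole cities are held out for testing."""
    test_cities = set(test_cities)
    train_ids, test_ids = [], []
    for clip_id, city in clips:
        # A clip goes to the test set only if its entire city is held out.
        (test_ids if city in test_cities else train_ids).append(clip_id)
    return train_ids, test_ids


def city_overlap(clips, train_ids, test_ids):
    """Return the set of cities that appear in both partitions (ideally empty)."""
    city_of = dict(clips)
    return {city_of[c] for c in train_ids} & {city_of[c] for c in test_ids}


# Hypothetical clip metadata for illustration only.
clips = [("clip_001", "city_a"), ("clip_002", "city_b"),
         ("clip_003", "city_a"), ("clip_004", "city_c")]
train_ids, test_ids = split_by_city(clips, {"city_c"})
```

Reporting the result of a check like `city_overlap` alongside the split description would let readers verify the separation directly.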

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on annotation reliability and experimental controls. We address each major comment below and will revise the manuscript accordingly to improve clarity and rigor.

Point-by-point responses
  1. Referee: [§3.2 and §4.1] §3.2 (Dataset Construction) and §4.1 (Annotation Process): The claim that annotations are 'human-verified' is load-bearing for both the VLM QA and VLA planning results, yet the manuscript provides no inter-annotator agreement statistics, number of reviewers per clip, qualification criteria for annotators, or protocol for resolving disagreements on ambiguous long-tail cases. Without these, it is impossible to assess whether the supervision signal is reliable or whether gains could be explained by scale alone.

    Authors: We agree that these details are essential for assessing annotation quality. The annotations were produced by qualified annotators with driving domain expertise, with multiple reviewers per clip and a structured disagreement resolution process. In the revised manuscript we will report inter-annotator agreement (Cohen’s kappa), the exact number of reviewers per clip, annotator qualification criteria, and the disagreement protocol. revision: yes

  2. Referee: [§5.2 and Table 4] §5.2 (VLA Experiments) and Table 4: The planning improvement from reasoning supervision is presented as evidence that the annotations encode useful inferences, but the section does not report controls that isolate reasoning content from dataset size or from the base VLA architecture; e.g., no ablation comparing reasoning-augmented data against an equal-sized non-reasoning subset. This directly affects attribution of the reported gains.

    Authors: We acknowledge the value of an explicit size-matched ablation. The current experiments compare reasoning-augmented training against the base VLA trained on the identical data volume without reasoning labels; however, we will add a new ablation that trains on an equal-sized non-reasoning subset drawn from the same distribution and report the results in the revised §5.2 and Table 4. revision: yes

  3. Referee: [§5.1] §5.1 (VLM Experiments): The driving-specific QA gains are reported without details on the train/test split construction, whether test scenes overlap with training cities, or statistical significance testing across multiple seeds. These omissions make it difficult to judge whether the improvements generalize beyond the particular data partition used.

    Authors: We will expand §5.1 to describe the train/test split construction (including city-level separation to avoid scene overlap), confirm that test scenes are drawn from held-out cities, and report mean and standard deviation across three random seeds with statistical significance tests. revision: yes
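The first response promises inter-annotator agreement reported as Cohen's kappa. A minimal sketch of that statistic for two annotators follows, assuming each annotator assigns one categorical label per item. The label values and the function name are illustrative. Cohen's kappa is defined for exactly two raters, so with more than two reviewers per clip a multi-rater statistic such as Fleiss' kappa would be the usual choice.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items where both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: computed from each annotator's own label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label for every item
    return (observed - expected) / (1 - expected)
```

A value of 1 indicates perfect agreement and a value near 0 indicates agreement no better than chance, which is why the statistic is more informative than raw percent agreement when some labels dominate.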

Circularity Check

0 steps flagged

No circularity; dataset introduction and empirical results contain no derivations or self-referential reductions

full rationale

The paper presents a new dataset (nuReasoning) with human-verified annotations across Spatial/Decision/Counterfactual reasoning and reports empirical results on VLM fine-tuning and VLA planning improvements. No equations, mathematical derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The central claims rest on dataset construction and external experimental evaluation rather than any chain that reduces outputs to inputs by definition or construction. Annotation verification details are unspecified, but this concerns evidence quality rather than circularity in any derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper's contribution is dataset creation and empirical benchmarking rather than a mathematical model or derivation, so no free parameters, axioms, or invented entities are described.

pith-pipeline@v0.9.1-grok · 5827 in / 1147 out tokens · 27526 ms · 2026-06-28T22:33:19.209971+00:00 · methodology


Reference graph

Works this paper leans on

89 extracted references · 16 linked inside Pith

  1. [1]

    St-p3: End-to-end vision-based autonomous driving via spatial-temporal feature learning

    Shengchao Hu, Li Chen, Penghao Wu, Hongyang Li, Junchi Yan, and Dacheng Tao. St-p3: End-to-end vision-based autonomous driving via spatial-temporal feature learning. InEuropean Conference on Computer Vision, pages 533–549. Springer, 2022

  2. [2]

    Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chonghao Sima, Tong Lu, Qiao Yu, and Jifeng Dai. Bevformer: learning bird’s-eye-view representation from lidar-camera via spatiotemporal transformers.IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(3):2020– 2036, 2024

  3. [3]

    Collaborative semantic occupancy prediction with hybrid feature fusion in connected automated vehicles

    Rui Song, Chenwei Liang, Hu Cao, Zhiran Yan, Walter Zimmer, Markus Gross, Andreas Festag, and Alois Knoll. Collaborative semantic occupancy prediction with hybrid feature fusion in connected automated vehicles. In2024 IEEE/CVF International Conference on Computer Vision and Pattern Recognition (CVPR). IEEE/CVF, 2024

  4. [4]

    V2XPnP: Vehicle-to-everything spatio-temporal fusion for multi-agent perception and prediction.Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025

    Zewei Zhou, Hao Xiang, Zhaoliang Zheng, Seth Z Zhao, Mingyue Lei, Yun Zhang, Tianhui Cai, Xinyi Liu, Johnson Liu, Maheswari Bajji, et al. V2XPnP: Vehicle-to-everything spatio-temporal fusion for multi-agent perception and prediction.Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025

  5. [5]

    Maptr: Structured modeling and learning for online vectorized hd map construction

    Bencheng Liao, Shaoyu Chen, Xinggang Wang, Tianheng Cheng, Qian Zhang, Wenyu Liu, and Chang Huang. Maptr: Structured modeling and learning for online vectorized hd map construction. InThe Eleventh International Conference on Learning Representations

  6. [6]

    Gameformer: Game-theoretic modeling and learning of transformer-based interactive prediction and planning for autonomous driving

    Zhiyu Huang, Haochen Liu, and Chen Lv. Gameformer: Game-theoretic modeling and learning of transformer-based interactive prediction and planning for autonomous driving. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 3903–3913, 2023

  7. [7]

    Relmap: Enhancing online map construction with class-aware spatial relation and semantic priors.arXiv preprint arXiv:2507.21567, 2025

    Tianhui Cai, Yun Zhang, Zewei Zhou, Zhiyu Huang, and Jiaqi Ma. Relmap: Enhancing online map construction with class-aware spatial relation and semantic priors.arXiv preprint arXiv:2507.21567, 2025

  8. [8]

    IPFormer: Visual 3d panoptic scene completion with context-adaptive instance proposals

    Markus Gross, Aya Fahmy, Danit Niwattananan, Dominik Muhle, Rui Song, Daniel Cremers, and Henri Meeß. IPFormer: Visual 3d panoptic scene completion with context-adaptive instance proposals. InAdvances in Neural Information Processing Systems (NeurIPS), 2025

  9. [9]

    Parting with mis- conceptions about learning-based vehicle motion planning

    Daniel Dauner, Marcel Hallgarten, Andreas Geiger, and Kashyap Chitta. Parting with mis- conceptions about learning-based vehicle motion planning. InConference on Robot Learning (CoRL), 2023

  10. [10]

    Mdg: Masked denois- ing generation for multi-agent behavior modeling in traffic environments.arXiv preprint arXiv:2511.17496, 2025

    Zhiyu Huang, Zewei Zhou, Tianhui Cai, Yun Zhang, and Jiaqi Ma. Mdg: Masked denois- ing generation for multi-agent behavior modeling in traffic environments.arXiv preprint arXiv:2511.17496, 2025

  11. [11]

    Gen-drive: Enhancing diffusion generative driving policies with reward modeling and reinforcement learning fine-tuning

    Zhiyu Huang, Xinshuo Weng, Maximilian Igl, Yuxiao Chen, Yulong Cao, Boris Ivanovic, Marco Pavone, and Chen Lv. Gen-drive: Enhancing diffusion generative driving policies with reward modeling and reinforcement learning fine-tuning. In2025 IEEE International Conference on Robotics and Automation (ICRA), pages 3445–3451. IEEE, 2025. 10

  12. [12]

    End-to-end autonomous driving: Challenges and frontiers.IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(12):10164–10183, 2024

    Li Chen, Penghao Wu, Kashyap Chitta, Bernhard Jaeger, Andreas Geiger, and Hongyang Li. End-to-end autonomous driving: Challenges and frontiers.IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(12):10164–10183, 2024

  13. [13]

    Vad: Vectorized scene representation for efficient autonomous driving

    Bo Jiang, Shaoyu Chen, Qing Xu, Bencheng Liao, Jiajie Chen, Helong Zhou, Qian Zhang, Wenyu Liu, Chang Huang, and Xinggang Wang. Vad: Vectorized scene representation for efficient autonomous driving. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 8340–8350, 2023

  14. [14]

    Genad: Gen- erative end-to-end autonomous driving

    Wenzhao Zheng, Ruiqi Song, Xianda Guo, Chenming Zhang, and Long Chen. Genad: Gen- erative end-to-end autonomous driving. InEuropean Conference on Computer Vision, pages 87–104. Springer, 2024

  15. [15]

    Trajectory-guided control prediction for end-to-end autonomous driving: A simple yet strong baseline.Advances in Neural Information Processing Systems, 35:6119–6132, 2022

    Penghao Wu, Xiaosong Jia, Li Chen, Junchi Yan, Hongyang Li, and Yu Qiao. Trajectory-guided control prediction for end-to-end autonomous driving: A simple yet strong baseline.Advances in Neural Information Processing Systems, 35:6119–6132, 2022

  16. [16]

    Planning-oriented autonomous driving

    Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wenhai Wang, et al. Planning-oriented autonomous driving. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 17853–17862, 2023

  17. [17]

    Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving

    Bencheng Liao, Shaoyu Chen, Haoran Yin, Bo Jiang, Cheng Wang, Sixu Yan, Xinbang Zhang, Xiangyu Li, Ying Zhang, Qian Zhang, et al. Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 12037–12047, 2025

  18. [18]

    Drivetransformer: Unified trans- former for scalable end-to-end autonomous driving

    Xiaosong Jia, Junqi You, Zhiyuan Zhang, and Junchi Yan. Drivetransformer: Unified trans- former for scalable end-to-end autonomous driving. InThe Thirteenth International Conference on Learning Representations, 2025

  19. [19]

    Future-aware end-to-end driving: Bidirectional modeling of trajectory planning and scene evolution.arXiv preprint arXiv:2510.11092, 2025

    Bozhou Zhang, Nan Song, Jingyu Li, Xiatian Zhu, Jiankang Deng, and Li Zhang. Future-aware end-to-end driving: Bidirectional modeling of trajectory planning and scene evolution.arXiv preprint arXiv:2510.11092, 2025

  20. [20]

    Perception in plan: Coupled perception and planning for end-to-end autonomous driving

    Bozhou Zhang, Jingyu Li, Nan Song, and Li Zhang. Perception in plan: Coupled perception and planning for end-to-end autonomous driving. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 12376–12384, 2026

  21. [21]

    Simscale: Learning to drive via real-world simulation at scale.arXiv preprint arXiv:2511.23369, 2025

    Haochen Tian, Tianyu Li, Haochen Liu, Jiazhi Yang, Yihang Qiu, Guang Li, Junli Wang, Yinfeng Gao, Zhang Zhang, Liang Wang, et al. Simscale: Learning to drive via real-world simulation at scale.arXiv preprint arXiv:2511.23369, 2025

  22. [22]

    Generative ai for autonomous driving: Frontiers and opportunities.arXiv preprint arXiv:2505.08854, 2025

    Yuping Wang, Shuo Xing, Cui Can, Renjie Li, Hongyuan Hua, Kexin Tian, Zhaobin Mo, Xiangbo Gao, Keshu Wu, Sulong Zhou, et al. Generative ai for autonomous driving: Frontiers and opportunities.arXiv preprint arXiv:2505.08854, 2025

  23. [23]

    Driving with regulation: Trustworthy and interpretable decision-making for autonomous driving with retrieval-augmented reasoning

    Tianhui Cai, Yifan Liu, Zewei Zhou, Haoxuan Ma, Seth Z Zhao, Zhiwen Wu, Xu Han, Zhiyu Huang, and Jiaqi Ma. Driving with regulation: Trustworthy and interpretable decision-making for autonomous driving with retrieval-augmented reasoning. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 38287–38295, 2026

  24. [24]

    Are vlms ready for autonomous driving? an empirical study from the reliability, data and metric perspectives

    Shaoyuan Xie, Lingdong Kong, Yuhao Dong, Chonghao Sima, Wenwei Zhang, Qi Alfred Chen, Ziwei Liu, and Liang Pan. Are vlms ready for autonomous driving? an empirical study from the reliability, data and metric perspectives. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 6585–6597, 2025

  25. [25]

    Emma: End-to-end multimodal model for autonomous driving.Transactions on Machine Learning Research, 2025

    Jyh-Jing Hwang, Runsheng Xu, Hubert Lin, Wei-Chih Hung, Jingwei Ji, Kristy Choi, Di Huang, Tong He, Paul Covington, Benjamin Sapp, et al. Emma: End-to-end multimodal model for autonomous driving.Transactions on Machine Learning Research, 2025

  26. [26]

    Drivelm: Driving with graph visual question answering

    Chonghao Sima, Katrin Renz, Kashyap Chitta, Li Chen, Hanxue Zhang, Chengen Xie, Jens Beißwenger, Ping Luo, Andreas Geiger, and Hongyang Li. Drivelm: Driving with graph visual question answering. InEuropean conference on computer vision, pages 256–274. Springer, 2024. 11

  27. [27]

    Omnidrive: A holistic vision-language dataset for autonomous driving with counterfactual reasoning

    Shihao Wang, Zhiding Yu, Xiaohui Jiang, Shiyi Lan, Min Shi, Nadine Chang, Jan Kautz, Ying Li, and Jose M Alvarez. Omnidrive: A holistic vision-language dataset for autonomous driving with counterfactual reasoning. InProceedings of the computer vision and pattern recognition conference, pages 22442–22452, 2025

  28. [28]

    Driving everywhere with large language model policy adaptation

    Boyi Li, Yue Wang, Jiageng Mao, Boris Ivanovic, Sushant Veer, Karen Leung, and Marco Pavone. Driving everywhere with large language model policy adaptation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14948–14957, 2024

  29. [29]

    S4-driver: Scalable self-supervised driving multimodal large language model with spatio-temporal visual representation

    Yichen Xie, Runsheng Xu, Tong He, Jyh-Jing Hwang, Katie Luo, Jingwei Ji, Hubert Lin, Letian Chen, Yiren Lu, Zhaoqi Leng, et al. S4-driver: Scalable self-supervised driving multimodal large language model with spatio-temporal visual representation. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 1622–1632, 2025

  30. [30]

    Latent chain-of-thought world modeling for end-to-end driving.arXiv preprint arXiv:2512.10226, 2025

    Shuhan Tan, Kashyap Chitta, Yuxiao Chen, Ran Tian, Yurong You, Yan Wang, Wenjie Luo, Yulong Cao, Philipp Krahenbuhl, Marco Pavone, et al. Latent chain-of-thought world modeling for end-to-end driving.arXiv preprint arXiv:2512.10226, 2025

  31. [31]

    Reason2drive: Towards interpretable and chain-based reasoning for autonomous driving

    Ming Nie, Renyuan Peng, Chunwei Wang, Xinyue Cai, Jianhua Han, Hang Xu, and Li Zhang. Reason2drive: Towards interpretable and chain-based reasoning for autonomous driving. In European Conference on Computer Vision, pages 292–308. Springer, 2024

  32. [32]

    Vision-language-action models for autonomous driving: Past, present, and future.arXiv preprint arXiv:2512.16760, 2025

    Tianshuai Hu, Xiaolu Liu, Song Wang, Yiyao Zhu, Ao Liang, Lingdong Kong, Guoyang Zhao, Zeying Gong, Jun Cen, Zhiyu Huang, et al. Vision-language-action models for autonomous driving: Past, present, and future.arXiv preprint arXiv:2512.16760, 2025

  33. [33]

    Autovla: A vision-language-action model for end-to-end autonomous driving with adaptive reasoning and reinforcement fine-tuning.arXiv preprint arXiv:2506.13757, 2025

    Zewei Zhou, Tianhui Cai, Seth Z Zhao, Yun Zhang, Zhiyu Huang, Bolei Zhou, and Jiaqi Ma. Autovla: A vision-language-action model for end-to-end autonomous driving with adaptive reasoning and reinforcement fine-tuning.arXiv preprint arXiv:2506.13757, 2025

  34. [34]

    Open- drivevla: Towards end-to-end autonomous driving with large vision language action model

    Xingcheng Zhou, Xuyuan Han, Feng Yang, Yunpu Ma, V olker Tresp, and Alois Knoll. Open- drivevla: Towards end-to-end autonomous driving with large vision language action model. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 13782–13790, 2026

  35. [35]

    Alpamayo-r1: Bridging reasoning and action prediction for generalizable autonomous driving in the long tail.arXiv preprint arXiv:2511.00088, 2025

    Yan Wang, Wenjie Luo, Junjie Bai, Yulong Cao, Tong Che, Ke Chen, Yuxiao Chen, Jenna Dia- mond, Yifan Ding, Wenhao Ding, et al. Alpamayo-r1: Bridging reasoning and action prediction for generalizable autonomous driving in the long tail.arXiv preprint arXiv:2511.00088, 2025

  36. [36]

    Futuresightdrive: Thinking visually with spatio-temporal cot for autonomous driving

    Shuang Zeng, Xinyuan Chang, Mengwei Xie, Xinran Liu, Yifan Bai, Zheng Pan, Mu Xu, and Xing Wei. Futuresightdrive: Thinking visually with spatio-temporal cot for autonomous driving. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems

  37. [37]

    Colavla: Leveraging cognitive latent reasoning for hierarchical parallel trajectory planning in autonomous driving.arXiv preprint arXiv:2512.22939, 2025

    Qihang Peng, Xuesong Chen, Chenye Yang, Shaoshuai Shi, and Hongsheng Li. Colavla: Leveraging cognitive latent reasoning for hierarchical parallel trajectory planning in autonomous driving.arXiv preprint arXiv:2512.22939, 2025

  38. [38]

    Uni-world vla: Interleaved world modeling and planning for autonomous driving.arXiv preprint arXiv:2603.27287, 2026

    Qiqi Liu, Huan Xu, Jingyu Li, Bin Sun, Zhihui Hao, Dangen She, Xiatian Zhu, and Li Zhang. Uni-world vla: Interleaved world modeling and planning for autonomous driving.arXiv preprint arXiv:2603.27287, 2026

  39. [39]

    Minddrive: A vision-language-action model for autonomous driving via online reinforcement learning.arXiv preprint arXiv:2512.13636, 2025

    Haoyu Fu, Diankun Zhang, Zongchuang Zhao, Jianfeng Cui, Hongwei Xie, Bing Wang, Guang Chen, Dingkang Liang, and Xiang Bai. Minddrive: A vision-language-action model for autonomous driving via online reinforcement learning.arXiv preprint arXiv:2512.13636, 2025

  40. [40]

    Real-ad: Towards human-like reasoning in end-to-end autonomous driving

    Yuhang Lu, Jiadong Tu, Yuexin Ma, and Xinge Zhu. Real-ad: Towards human-like reasoning in end-to-end autonomous driving. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 27783–27793, 2025

  41. [41]

    Lmdrive: Closed-loop end-to-end driving with large language models

    Hao Shao, Yuxuan Hu, Letian Wang, Guanglu Song, Steven L Waslander, Yu Liu, and Hong- sheng Li. Lmdrive: Closed-loop end-to-end driving with large language models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15120–15130, 2024. 12

  42. [42]

    nuscenes: A multimodal dataset for autonomous driving

    Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh V ora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11621–11631, 2020

  43. [43]

    Scalability in perception for autonomous driving: Waymo open dataset

    Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, et al. Scalability in perception for autonomous driving: Waymo open dataset. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2446–2454, 2020

  44. [44]

    Large scale interactive motion forecasting for autonomous driving: The waymo open motion dataset

    Scott Ettinger, Shuyang Cheng, Benjamin Caine, Chenxi Liu, Hang Zhao, Sabeek Pradhan, Yuning Chai, Ben Sapp, Charles R Qi, Yin Zhou, et al. Large scale interactive motion forecasting for autonomous driving: The waymo open motion dataset. InProceedings of the IEEE/CVF international conference on computer vision, pages 9710–9719, 2021

  45. [45]

    Argoverse: 3d tracking and forecasting with rich maps

    Ming-Fang Chang, John Lambert, Patsorn Sangkloy, Jagjeet Singh, Slawomir Bak, Andrew Hartnett, De Wang, Peter Carr, Simon Lucey, Deva Ramanan, and James Hays. Argoverse: 3d tracking and forecasting with rich maps. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019

  46. [46]

    Trends in motion predic- tion toward deployable and generalizable autonomy: A revisit and perspectives.Foundations and Trends® in Robotics, 13(1-2):1–269, 2026

    Letian Wang, Marc-Antoine Lavoie, Sandro Papais, Barza Nisar, Yuxiao Chen, Wenhao Ding, Boris Ivanovic, Hao Shao, Abulikemu Abuduweili, Evan Cook, et al. Trends in motion predic- tion toward deployable and generalizable autonomy: A revisit and perspectives.Foundations and Trends® in Robotics, 13(1-2):1–269, 2026

  47. [47]

    nuplan: A closed-loop ml-based planning benchmark for autonomous vehicles. arXiv preprint arXiv:2106.11810, 2021

    Holger Caesar, Juraj Kabzan, Kok Seang Tan, Whye Kit Fong, Eric Wolff, Alex Lang, Luke Fletcher, Oscar Beijbom, and Sammy Omari. nuplan: A closed-loop ml-based planning benchmark for autonomous vehicles. arXiv preprint arXiv:2106.11810, 2021

  48. [48]

    Wod-e2e: Waymo open dataset for end-to-end driving in challenging long-tail scenarios. arXiv preprint arXiv:2510.26125, 2025

    Runsheng Xu, Hubert Lin, Wonseok Jeon, Hao Feng, Yuliang Zou, Liting Sun, John Gorman, Ekaterina Tolstaya, Sarah Tang, Brandyn White, et al. Wod-e2e: Waymo open dataset for end-to-end driving in challenging long-tail scenarios. arXiv preprint arXiv:2510.26125, 2025

  49. [49]

    Bridgesim: Unveiling the ol-cl gap in end-to-end autonomous driving. arXiv preprint arXiv:2604.10856, 2026

    Seth Z Zhao, Luobin Wang, Hongwei Ruan, Yuxin Bao, Yilan Chen, Ziyang Leng, Abhijit Ravichandran, Honglin He, Zewei Zhou, Xu Han, et al. Bridgesim: Unveiling the ol-cl gap in end-to-end autonomous driving. arXiv preprint arXiv:2604.10856, 2026

  50. [50]

    Drivee2e: Closed-loop benchmark for end-to-end autonomous driving through real-to-simulation. arXiv preprint arXiv:2509.23922, 2025

    Haibao Yu, Wenxian Yang, Ruiyang Hao, Chuanye Wang, Jiaru Zhong, Ping Luo, and Zaiqing Nie. Drivee2e: Closed-loop benchmark for end-to-end autonomous driving through real-to-simulation. arXiv preprint arXiv:2509.23922, 2025

  51. [51]

    Nuplanqa: A large-scale dataset and benchmark for multi-view driving scene understanding in multi-modal large language models

    Sung-Yeon Park, Can Cui, Yunsheng Ma, Ahmadreza Moradipari, Rohit Gupta, Kyungtae Han, and Ziran Wang. Nuplanqa: A large-scale dataset and benchmark for multi-view driving scene understanding in multi-modal large language models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8066–8076, 2025

  52. [52]

    Nuscenes-qa: A multi-modal visual question answering benchmark for autonomous driving scenario

    Tianwen Qian, Jingjing Chen, Linhai Zhuo, Yang Jiao, and Yu-Gang Jiang. Nuscenes-qa: A multi-modal visual question answering benchmark for autonomous driving scenario. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 4542–4550, 2024

  53. [53]

    Lingoqa: Visual question answering for autonomous driving

    Ana-Maria Marcu, Long Chen, Jan Hünermann, Alice Karnsund, Benoit Hanotte, Prajwal Chidananda, Saurabh Nair, Vijay Badrinarayanan, Alex Kendall, Jamie Shotton, et al. Lingoqa: Visual question answering for autonomous driving. In European Conference on Computer Vision, pages 252–269. Springer, 2024

  54. [54]

    Accelerating structured chain-of-thought in autonomous vehicles. arXiv preprint arXiv:2602.02864, 2026

    Yi Gu, Yan Wang, Yuxiao Chen, Yurong You, Wenjie Luo, Yue Wang, Wenhao Ding, Boyi Li, Heng Yang, Boris Ivanovic, et al. Accelerating structured chain-of-thought in autonomous vehicles. arXiv preprint arXiv:2602.02864, 2026

  55. [55]

    Waymoqa: A multi-view visual question answering dataset for safety-critical reasoning in autonomous driving. arXiv preprint arXiv:2511.20022, 2025

    Seungjun Yu, Seonho Lee, Namho Kim, Jaeyo Shin, Junsung Park, Wonjeong Ryu, Raehyuk Jung, and Hyunjung Shim. Waymoqa: A multi-view visual question answering dataset for safety-critical reasoning in autonomous driving. arXiv preprint arXiv:2511.20022, 2025

  56. [56]

    Covla: Comprehensive vision-language-action dataset for autonomous driving

    Hidehisa Arai, Keita Miwa, Kento Sasaki, Kohei Watanabe, Yu Yamaguchi, Shunsuke Aoki, and Issei Yamamoto. Covla: Comprehensive vision-language-action dataset for autonomous driving. In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 1933–1943. IEEE, 2025

  57. [57]

    Impromptu vla: Open weights and open data for driving vision-language-action models. arXiv preprint arXiv:2505.23757, 2025

    Haohan Chi, Huan-ang Gao, Ziming Liu, Jianing Liu, Chenyu Liu, Jinwei Li, Kaisen Yang, Yangcheng Yu, Zeda Wang, Wenyi Li, et al. Impromptu vla: Open weights and open data for driving vision-language-action models. arXiv preprint arXiv:2505.23757, 2025

  58. [58]

    Spatial-aware vision language model for autonomous driving. arXiv preprint arXiv:2512.24331, 2025

    Weijie Wei, Zhipeng Luo, Ling Feng, and Venice Erin Liong. Spatial-aware vision language model for autonomous driving. arXiv preprint arXiv:2512.24331, 2025

  59. [59]

    Spacedrive: Infusing spatial awareness into vlm-based autonomous driving. arXiv preprint arXiv:2512.10719, 2, 2025

    Peizheng Li, Zhenghao Zhang, David Holtz, Hang Yu, Yutong Yang, Yuzhi Lai, Rui Song, Andreas Geiger, and Andreas Zell. Spacedrive: Infusing spatial awareness into vlm-based autonomous driving. arXiv preprint arXiv:2512.10719, 2, 2025

  60. [60]

    Reasonplan: Unified scene prediction and decision reasoning for closed-loop autonomous driving

    Xueyi Liu, Zuodong Zhong, Qichao Zhang, Yuxin Guo, Yupeng Zheng, Junli Wang, Dongbin Zhao, Yun-Fu Liu, Zhiguo Su, Yinfeng Gao, et al. Reasonplan: Unified scene prediction and decision reasoning for closed-loop autonomous driving. In Conference on Robot Learning, pages 3051–3068. PMLR, 2025

  61. [61]

    Counterfactual vla: Self-reflective vision-language-action model with adaptive reasoning. arXiv preprint arXiv:2512.24426, 2025

    Zhenghao Peng, Wenhao Ding, Yurong You, Yuxiao Chen, Wenjie Luo, Thomas Tian, Yulong Cao, Apoorva Sharma, Danfei Xu, Boris Ivanovic, et al. Counterfactual vla: Self-reflective vision-language-action model with adaptive reasoning. arXiv preprint arXiv:2512.24426, 2025

  62. [62]

    Towards learning-based planning: The nuplan benchmark for real-world autonomous driving

    Napat Karnchanachari, Dimitris Geromichalos, Kok Seang Tan, Nanxiang Li, Christopher Eriksen, Shakiba Yaghoubi, Noushin Mehdipour, Gianmarco Bernasconi, Whye Kit Fong, Yiluan Guo, et al. Towards learning-based planning: The nuplan benchmark for real-world autonomous driving. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 629...

  63. [63]

    Para-drive: Parallelized architecture for real-time autonomous driving

    Xinshuo Weng, Boris Ivanovic, Yan Wang, Yue Wang, and Marco Pavone. Para-drive: Parallelized architecture for real-time autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15449–15458, 2024

  64. [64]

    Sparsedrive: End-to-end autonomous driving via sparse scene representation

    Wenchao Sun, Xuewu Lin, Yining Shi, Chuang Zhang, Haoran Wu, and Sifa Zheng. Sparsedrive: End-to-end autonomous driving via sparse scene representation. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 8795–8801. IEEE, 2025

  65. [65]

    What matters for scalable and robust learning in end-to-end driving planners? arXiv preprint arXiv:2603.15185, 2026

    David Holtz, Niklas Hanselmann, Simon Doll, Marius Cordts, and Bernt Schiele. What matters for scalable and robust learning in end-to-end driving planners? arXiv preprint arXiv:2603.15185, 2026

  66. [66]

    Drivevlm: The convergence of autonomous driving and large vision-language models. arXiv preprint arXiv:2402.12289, 2024

    Xiaoyu Tian, Junru Gu, Bailin Li, Yicheng Liu, Yang Wang, Zhiyong Zhao, Kun Zhan, Peng Jia, Xianpeng Lang, and Hang Zhao. Drivevlm: The convergence of autonomous driving and large vision-language models. arXiv preprint arXiv:2402.12289, 2024

  67. [67]

    Simlingo: Vision-only closed-loop autonomous driving with language-action alignment

    Katrin Renz, Long Chen, Elahe Arani, and Oleg Sinavski. Simlingo: Vision-only closed-loop autonomous driving with language-action alignment. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 11993–12003, 2025

  68. [68]

    Orion: A holistic end-to-end autonomous driving framework by vision-language instructed action generation

    Haoyu Fu, Diankun Zhang, Zongchuang Zhao, Jianfeng Cui, Dingkang Liang, Chong Zhang, Dingyuan Zhang, Hongwei Xie, Bing Wang, and Xiang Bai. Orion: A holistic end-to-end autonomous driving framework by vision-language instructed action generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 24823–24834, 2025

  69. [69]

    Reasoning-vla: A fast and general vision-language-action reasoning model for autonomous driving. arXiv preprint arXiv:2511.19912, 2025

    Dapeng Zhang, Zhenlong Yuan, Zhangquan Chen, Chih-Ting Liao, Yinda Chen, Fei Shen, Qingguo Zhou, and Tat-Seng Chua. Reasoning-vla: A fast and general vision-language-action reasoning model for autonomous driving. arXiv preprint arXiv:2511.19912, 2025

  70. [70]

    Argoverse 2: Next generation datasets for self-driving perception and forecasting. arXiv preprint arXiv:2301.00493, 2023

    Benjamin Wilson, William Qi, Tanmay Agarwal, John Lambert, Jagjeet Singh, Siddhesh Khandelwal, Bowen Pan, Ratnesh Kumar, Andrew Hartnett, Jhony Kaesemodel Pontes, et al. Argoverse 2: Next generation datasets for self-driving perception and forecasting. arXiv preprint arXiv:2301.00493, 2023

  71. [71]

    Navsim: Data-driven non-reactive autonomous vehicle simulation and benchmarking. Advances in Neural Information Processing Systems, 37:28706–28719, 2024

    Daniel Dauner, Marcel Hallgarten, Tianyu Li, Xinshuo Weng, Zhiyu Huang, Zetong Yang, Hongyang Li, Igor Gilitschenski, Boris Ivanovic, Marco Pavone, et al. Navsim: Data-driven non-reactive autonomous vehicle simulation and benchmarking. Advances in Neural Information Processing Systems, 37:28706–28719, 2024

  72. [72]

    Pseudo-simulation for autonomous driving

    Wei Cao, Marcel Hallgarten, Tianyu Li, Daniel Dauner, Xunjiang Gu, Caojun Wang, Yakov Miron, Marco Aiello, Hongyang Li, Igor Gilitschenski, Boris Ivanovic, Marco Pavone, Andreas Geiger, and Kashyap Chitta. Pseudo-simulation for autonomous driving. In Conference on Robot Learning (CoRL), 2025

  73. [73]

    Carla: An open urban driving simulator

    Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. Carla: An open urban driving simulator. In Conference on robot learning, pages 1–16. PMLR, 2017

  74. [74]

    Bench2drive: Towards multi-ability benchmarking of closed-loop end-to-end autonomous driving. Advances in Neural Information Processing Systems, 37:819–844, 2024

    Xiaosong Jia, Zhenjie Yang, Qifeng Li, Zhiyuan Zhang, and Junchi Yan. Bench2drive: Towards multi-ability benchmarking of closed-loop end-to-end autonomous driving. Advances in Neural Information Processing Systems, 37:819–844, 2024

  75. [75]

    Fail2drive: Benchmarking closed-loop driving generalization. arXiv preprint arXiv:2604.08535, 2026

    Simon Gerstenecker, Andreas Geiger, and Katrin Renz. Fail2drive: Benchmarking closed-loop driving generalization. arXiv preprint arXiv:2604.08535, 2026

  76. [76]

    Embodied scene understanding for vision language models via metavqa

    Weizhen Wang, Chenda Duan, Zhenghao Peng, Yuxin Liu, and Bolei Zhou. Embodied scene understanding for vision language models via metavqa. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22453–22464, 2025

  77. [77]

    Holistic autonomous driving understanding by bird’s-eye-view injected multi-modal large models

    Xinpeng Ding, Jianhua Han, Hang Xu, Xiaodan Liang, Wei Zhang, and Xiaomeng Li. Holistic autonomous driving understanding by bird’s-eye-view injected multi-modal large models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13668–13677, 2024

  78. [78]

    Are vision llms road-ready? a comprehensive benchmark for safety-critical driving video understanding

    Tong Zeng, Longfeng Wu, Liang Shi, Dawei Zhou, and Feng Guo. Are vision llms road-ready? a comprehensive benchmark for safety-critical driving video understanding. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2, pages 5972–5983, 2025

  79. [79]

    Bench2drive-vl: Benchmarks for closed-loop autonomous driving with vision-language models

    Xiaosong Jia, Yuqian Shao, Zhenjie Yang, Qifeng Li, Zhiyuan Zhang, and Junchi Yan. Bench2drive-vl: Benchmarks for closed-loop autonomous driving with vision-language models. arXiv preprint arXiv:2604.01259, 2026

  80. [80]

    Gemini 3.1 pro model card

    Google DeepMind. Gemini 3.1 pro model card. https://deepmind.google/models/model-cards/gemini-3-1-pro/, 2026. Accessed: 2026-05-01

Showing first 80 references.