arxiv: 2509.00789 · v2 · submitted 2025-08-31 · 💻 cs.CV

CogDriver: Integrating Cognitive Inertia for Temporally Coherent Planning in Autonomous Driving

Pith reviewed 2026-05-18 20:06 UTC · model grok-4.3

classification 💻 cs.CV
keywords autonomous driving, vision-language models, cognitive inertia, temporal coherence, knowledge distillation, planning, closed-loop evaluation

The pith

Adding cognitive inertia to vision-language driving agents creates a stable internal state that supports coherent long-term planning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision-language models for autonomous driving process each moment separately and therefore lose track of ongoing intentions, producing jittery decisions that break complex maneuvers. The paper supplies a large dataset of narrative annotations that describe persistent scenes and goals, then builds an agent with sparse temporal memory trained through spatiotemporal knowledge distillation to instill cognitive inertia. This internal state is meant to carry understanding forward in time rather than resetting at each frame. If the approach works, driving agents should execute multi-step actions such as lane merges or intersection crossings more reliably because they remember prior context instead of reacting only to the current view. The reported gains on closed-loop benchmarks are presented as evidence that the internal representation has become temporally coherent.

Core claim

By pairing CogDriver-Data, whose narrative annotations supply supervisory signals for temporal dynamics and persistent intent, with CogDriver-Agent, an architecture that uses sparse temporal memory and spatiotemporal knowledge distillation to enforce decision coherence, the agent maintains a stable internal representation that improves closed-loop performance.

What carries the argument

Sparse temporal memory combined with spatiotemporal knowledge distillation that explicitly teaches decision coherence from narrative supervision.
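
For orientation, the sketch below shows generic temperature-scaled knowledge distillation: a student is penalized for diverging from a teacher's softened output distribution. It is a textbook formulation, not the paper's spatiotemporal variant, which this summary does not specify. The function names, the default temperature, and the sample logits are assumptions for illustration only.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float], temperature: float) -> List[float]:
    """Numerically stable softmax of logits divided by a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits: Sequence[float],
                      teacher_logits: Sequence[float],
                      temperature: float = 2.0) -> float:
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return temperature ** 2 * kl


# Hypothetical logits over three driving actions (assumed, not from the paper).
teacher = [1.0, 2.0, 3.0]
agreeing_student = [1.0, 2.0, 3.0]
disagreeing_student = [3.0, 2.0, 1.0]

print(distillation_loss(agreeing_student, teacher))
print(distillation_loss(disagreeing_student, teacher))
```

The loss is zero when student and teacher agree and grows as their distributions diverge; a higher temperature flattens both distributions before they are compared.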

If this is right

  • The agent executes long-horizon maneuvers with less interruption from frame-to-frame changes.
  • Imitation accuracy rises because decisions remain consistent with prior observations.
  • Closed-loop driving scores improve on standard benchmarks such as Bench2Drive.
  • The approach sets a new state of the art on both Bench2Drive and nuScenes metrics.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same memory-plus-distillation pattern could be tested in other sequential tasks such as video-based robot control.
  • Replacing human narrative labels with automatically generated scene descriptions would test whether the coherence signal can be scaled without extra annotation cost.
  • Evaluating the agent under rapid environmental changes would show how much inertia the current memory actually provides.

Load-bearing premise

The narrative annotations and distillation procedure are enough to build a stable internal state that generalizes to new driving situations.

What would settle it

Ablating the temporal memory or removing narrative annotations from training and checking whether the 22 percent driving-score gain and 21 percent L2-error reduction both disappear on the same benchmarks.
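
Both headline figures are percentage changes against a baseline, so an ablation table would report the same arithmetic per removed component. The sketch below assumes the figures are relative changes, which the abstract does not state. The helper name `relative_change` and the sample scores are hypothetical placeholders, not values from the paper.

```python
def relative_change(baseline: float, value: float) -> float:
    """Signed relative change of `value` against `baseline`, as a fraction."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (value - baseline) / baseline


# Hypothetical placeholder numbers, NOT results from the paper.
ablated_score = 40.0      # e.g. driving score with temporal memory removed
full_model_score = 50.0   # e.g. driving score with all components enabled

gain = relative_change(ablated_score, full_model_score)
print(f"relative gain of full model over ablation: {gain:.1%}")
```

If removing a component leaves the score unchanged, its relative gain is zero, which is the outcome that would undercut the paper's attribution of the gains to that component.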

Figures

Figures reproduced from arXiv: 2509.00789 by Dangen She, Haipeng Liu, Jun Ma, Pei Liu, Peng Jia, Qingtian Ning, Weiliang Ma, Xianpeng Lang, Xinyan Lu.

Figure 1: Annotation pipeline. We adopt a rule-based QA labeling pipeline, where the ground truth annotations in nuScenes and …
Figure 2: OmniReason harnesses the power of open-source pre-trained language foundation models to generate context-aware …
Figure 3: The sparse temporal model employs an iterative …
Figure 4: Environment description, action & trajectory, and …
Figure 5: Lane number distributions in the Bench2Drive dataset. The left plot shows the distribution of main (ego-moving …
Figure 6: Distribution of cross lanes in the Bench2Drive dataset. The left pie chart shows the proportion of samples with …
Figure 7: 2D histograms visualizing the relationship between the number of objects and their minimum distance to the ego …
Figure 8: Ego vehicle’s future action at different current states on the nuScenes dataset.
Figure 9: Ego vehicle’s future action at different current states on the Bench2Drive dataset.
Figure 10: Word cloud of causal reasoning annotations on the nuScenes (left) and Bench2Drive (right) dataset.
Figure 11: A dynamic object crossing example from the OmniReason-Bench2Drive dataset is illustrated. The white box repre…
Figure 12: A hazard at side lane two ways example from the OmniReason-Bench2Drive dataset is illustrated. The white box …
Figure 13: A stop VQA results of OmniReason-Agent on the nuScenes dataset.
Figure 14: A left turn VQA results of OmniReason-Agent on the nuScenes dataset. The green refers to the predicted trajectory.

Original abstract

The pursuit of autonomous agents capable of temporally coherent planning is hindered by a fundamental flaw in current vision-language models (VLMs): they lack cognitive inertia. Operating on isolated snapshots, these models cannot form a continuous understanding of the environment, leading to erratic decision jitter and a failure to execute complex, multi-step maneuvers. To remedy this, we introduce CogDriver, a framework designed to build a stable internal representation by instilling this crucial cognitive property. Our work makes two key contributions: (1) We present CogDriver-Data, a large-scale vision-language-action dataset whose narrative annotations provide the supervisory signal for learning temporal dynamics and persistent intent. (2) We develop the CogDriver-Agent, an architecture featuring a sparse temporal memory to maintain a stable internal state. This is enabled by a spatiotemporal knowledge distillation approach that explicitly teaches decision coherence. Comprehensive experiments validate our paradigm: CogDriver-Agent achieves a 22% increase in the closed-loop Driving Score on Bench2Drive and a 21% reduction in mean L2 error on nuScenes, establishing a new state-of-the-art. These significant gains in both long-term decision-making and imitation accuracy provide strong evidence that our agent successfully maintains a temporally coherent internal state, bridging the gap toward more reliable autonomous driving. Project link: https://ocean-luna.github.io/CogDriver.github.io/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces CogDriver to address the absence of cognitive inertia in vision-language models for autonomous driving. It contributes (1) CogDriver-Data, a large-scale vision-language-action dataset with narrative annotations to supervise temporal dynamics and persistent intent, and (2) CogDriver-Agent, an architecture that employs sparse temporal memory together with spatiotemporal knowledge distillation to maintain a stable internal state. Experiments on Bench2Drive and nuScenes report a 22% increase in closed-loop Driving Score and a 21% reduction in mean L2 error, respectively, which the authors attribute to temporally coherent planning and present as new state-of-the-art results.

Significance. If the performance gains can be shown to arise specifically from the proposed mechanisms for instilling cognitive inertia, the work would offer a concrete route toward more reliable long-horizon decision making in VLM-based driving agents. The use of narrative annotations as a supervisory signal for persistent intent is a distinctive design choice that could transfer to other sequential prediction domains. The empirical improvements on standard closed-loop and open-loop benchmarks are sizable, but their interpretation depends on establishing causality rather than correlation with dataset or model scale.

major comments (2)
  1. [§4] §4 (Experiments): The reported 22% Driving Score lift on Bench2Drive and 21% L2 reduction on nuScenes are presented without ablation studies that isolate the sparse temporal memory, distillation procedure, or narrative annotations from confounding factors such as overall dataset size or base VLM capacity. In the absence of these controls, the central claim that the gains result from a temporally coherent internal state cannot be verified.
  2. [Abstract and §3.2] Abstract and §3.2: The assertion that the results supply 'strong evidence' for successful maintenance of a temporally coherent internal state rests solely on end-task metrics; no auxiliary probes (plan consistency under frame jitter, intent persistence across sequences, or memory-state stability metrics) are reported to confirm that the internal representation is actually coherent rather than merely more accurate at single-frame prediction.
minor comments (2)
  1. [Abstract] The abstract states performance deltas but omits any mention of the number of evaluation runs, statistical significance testing, or precise baseline implementations and data splits used for comparison.
  2. [§3.2] Notation for the memory sparsity threshold (listed among free parameters) is introduced without an explicit equation or hyper-parameter sensitivity analysis in the main text.
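
Plan consistency, one of the auxiliary probes named in major comment 2, can be approximated by counting how often the discrete decision changes between consecutive frames. The sketch below is a hypothetical metric, not one defined in the paper; the function name `decision_flip_rate` and the action labels are assumptions for illustration.

```python
from typing import Sequence


def decision_flip_rate(decisions: Sequence[str]) -> float:
    """Fraction of consecutive frame pairs whose decision label differs."""
    if len(decisions) < 2:
        return 0.0
    flips = sum(1 for prev, cur in zip(decisions, decisions[1:]) if prev != cur)
    return flips / (len(decisions) - 1)


# Hypothetical per-frame decisions during a single lane merge.
jittery = ["merge_left", "keep_lane", "merge_left", "keep_lane", "merge_left"]
steady = ["keep_lane", "merge_left", "merge_left", "merge_left", "merge_left"]

print(decision_flip_rate(jittery))
print(decision_flip_rate(steady))
```

A lower flip rate during a maneuver that needs one sustained intent would be more direct evidence of coherence than an end-task score, though the rate has to be compared against the flips the scene legitimately requires.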

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We agree that stronger controls are needed to attribute performance gains specifically to cognitive inertia mechanisms rather than scale or capacity. We address each major comment below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [§4] §4 (Experiments): The reported 22% Driving Score lift on Bench2Drive and 21% L2 reduction on nuScenes are presented without ablation studies that isolate the sparse temporal memory, distillation procedure, or narrative annotations from confounding factors such as overall dataset size or base VLM capacity. In the absence of these controls, the central claim that the gains result from a temporally coherent internal state cannot be verified.

    Authors: We acknowledge that the current experiments lack comprehensive ablations isolating each proposed component while holding dataset size and base VLM fixed. Our reported comparisons use the same underlying VLM for baselines, and the dataset construction explicitly incorporates narrative annotations to target temporal dynamics, but these do not fully rule out confounding effects. In the revised manuscript we will add ablation studies that remove or disable the sparse temporal memory, the spatiotemporal distillation loss, and the narrative supervision while controlling for total data volume and model scale. These additions will allow direct assessment of whether the gains arise from the mechanisms for maintaining a stable internal state.

    revision: yes

  2. Referee: [Abstract and §3.2] Abstract and §3.2: The assertion that the results supply 'strong evidence' for successful maintenance of a temporally coherent internal state rests solely on end-task metrics; no auxiliary probes (plan consistency under frame jitter, intent persistence across sequences, or memory-state stability metrics) are reported to confirm that the internal representation is actually coherent rather than merely more accurate at single-frame prediction.

    Authors: The referee is correct that end-task metrics alone, even in closed-loop settings, do not directly demonstrate coherence of the internal state. We will moderate the phrasing in the abstract and §3.2 from 'strong evidence' to 'supporting evidence' and add auxiliary evaluations in the revision. Specifically, we will report plan consistency under frame-level jitter, measure intent persistence by tracking predicted goals across multi-step sequences, and include simple memory-state stability metrics such as cosine similarity of memory embeddings between consecutive timesteps. These probes will provide more direct support for the claim of temporally coherent planning.

    revision: partial
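
The memory-state stability metric mentioned in the rebuttal, cosine similarity of memory embeddings between consecutive timesteps, can be sketched as follows. This is a minimal illustration under assumed inputs: the function names and the three-dimensional example vectors are hypothetical, and the paper's actual memory representation is not described in this summary.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def memory_stability(embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Cosine similarity between each pair of consecutive memory embeddings."""
    return [cosine_similarity(p, c) for p, c in zip(embeddings, embeddings[1:])]


# Hypothetical memory states at four timesteps (assumed, not from the paper).
states = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 2.0, 0.0]]
sims = memory_stability(states)
print(sims)
```

High similarity alone does not establish coherence, since a memory that never updates would also score 1.0 at every step; the metric is most informative when read alongside the task-level results.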

Circularity Check

0 steps flagged

No circularity: empirical benchmark results are independent of training inputs

The paper introduces CogDriver-Data with narrative annotations and CogDriver-Agent with sparse temporal memory plus distillation, then reports measured performance lifts (22% Driving Score on Bench2Drive, 21% L2 reduction on nuScenes) as external validation. These are standard end-task metrics on public benchmarks, not quantities defined by construction from the fitted parameters, narrative labels, or distillation loss. No equation reduces the claimed coherence or gains to a self-referential fit; the evaluation chain relies on held-out test sets and does not invoke self-citations or uniqueness theorems that collapse back to the authors' prior assumptions. The derivation is therefore self-contained.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 2 invented entities

The central claim rests on the new dataset and agent architecture; a small number of architectural choices (memory sparsity, distillation temperature) are likely tuned but not enumerated in the abstract.

free parameters (1)
  • memory sparsity threshold
    Controls which past states are retained; value not stated in abstract but required for the sparse temporal memory design.
axioms (1)
  • domain assumption: Vision-language models lack cognitive inertia and therefore produce decision jitter on isolated snapshots.
    Stated as the fundamental flaw that the framework is designed to remedy.
invented entities (2)
  • CogDriver-Data (no independent evidence)
    purpose: Supply narrative annotations that teach temporal dynamics and persistent intent.
    New dataset introduced by the authors.
  • CogDriver-Agent (no independent evidence)
    purpose: Maintain stable internal state via sparse temporal memory and distillation.
    New agent architecture proposed in the paper.
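
The ledger describes the memory sparsity threshold only as the parameter that controls which past states are retained. As a toy illustration of that idea, and not the paper's architecture, the sketch below keeps past states whose relevance score meets a threshold, capped at a fixed capacity. The class name `SparseMemory`, the scalar relevance score, and the eviction rule are all assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class SparseMemory:
    """Toy memory: retain only states whose score meets `threshold`."""
    threshold: float
    capacity: int
    slots: List[Tuple[float, str]] = field(default_factory=list)

    def write(self, score: float, state: str) -> bool:
        """Store `state` if its score passes the threshold; return whether it passed."""
        if score < self.threshold:
            return False
        self.slots.append((score, state))
        if len(self.slots) > self.capacity:
            # Evict the lowest-scoring entry (assumed rule, for illustration).
            self.slots.remove(min(self.slots, key=lambda slot: slot[0]))
        return True

    def read(self) -> List[str]:
        """Return retained states in insertion order."""
        return [state for _, state in self.slots]


# Hypothetical scores and scene descriptions (assumed, not from the paper).
memory = SparseMemory(threshold=0.5, capacity=2)
for score, state in [(0.9, "start merge"), (0.2, "empty road"),
                     (0.6, "gap found"), (0.8, "crossing lane")]:
    memory.write(score, state)
print(memory.read())
```

Raising the threshold makes the memory sparser and cheaper to attend over but risks discarding context that a later step needs, which is why the referee's request for a sensitivity analysis matters.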

pith-pipeline@v0.9.0 · 5796 in / 1476 out tokens · 73058 ms · 2026-05-18T20:06:14.614608+00:00 · methodology

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. EponaV2: Driving World Model with Comprehensive Future Reasoning

    cs.CV · 2026-05 · unverdicted · novelty 5.0

    EponaV2 advances perception-free driving world models by forecasting comprehensive future 3D geometry and semantic representations, achieving SOTA planning performance on NAVSIM benchmarks.

  2. SpanVLA: Efficient Action Bridging and Learning from Negative-Recovery Samples for Vision-Language-Action Model

    cs.CV · 2026-04 · unverdicted · novelty 5.0

    SpanVLA reduces action generation latency via flow-matching conditioned on history and improves robustness by training on negative-recovery samples with GRPO and a dedicated reasoning dataset.
