pith. machine review for the scientific record. sign in

arxiv: 2506.13757 · v3 · submitted 2025-06-16 · 💻 cs.CV

Recognition: 3 theorem links

· Lean Theorem

AutoVLA: A Vision-Language-Action Model for End-to-End Autonomous Driving with Adaptive Reasoning and Reinforcement Fine-Tuning

Authors on Pith no claims yet

Pith reviewed 2026-05-14 21:41 UTC · model grok-4.3

classification 💻 cs.CV
keywords AutoVLAvision-language-action modelend-to-end autonomous drivingtrajectory tokenizationadaptive reasoningreinforcement fine-tuningchain-of-thought planning
0
0 comments X

The pith

AutoVLA unifies semantic reasoning and trajectory planning inside one autoregressive model that reads raw images and language instructions for end-to-end driving.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents AutoVLA as a single vision-language-action model that produces both chain-of-thought reasoning and driving trajectories directly from camera images and text commands. Continuous paths are turned into discrete action tokens so the language model can generate them autoregressively. Training begins with supervised fine-tuning that creates a fast mode outputting only trajectories and a slow mode that adds explicit reasoning steps. A reinforcement stage using Group Relative Policy Optimization then trims unnecessary reasoning in simple situations. Experiments on nuPlan, nuScenes, Waymo, and CARLA show the model reaches competitive open-loop and closed-loop performance while adapting its reasoning depth to the scene.

Core claim

AutoVLA performs semantic reasoning and trajectory planning directly from raw visual inputs and language instructions by tokenizing continuous trajectories into discrete feasible actions inside a single autoregressive generation model. Supervised fine-tuning equips the model with fast-thinking (trajectory-only) and slow-thinking (chain-of-thought) modes; reinforcement fine-tuning via Group Relative Policy Optimization then reduces redundant reasoning in straightforward scenarios, yielding competitive results across real-world and simulated benchmarks in both open- and closed-loop settings.

What carries the argument

A single autoregressive vision-language-action model that outputs both reasoning text and tokenized trajectory actions from raw images and instructions.

If this is right

  • The model can switch between quick trajectory output and detailed reasoning without separate modules.
  • End-to-end training becomes possible from perception all the way to control tokens.
  • Reinforcement fine-tuning can cut computation by skipping reasoning steps when the scene is simple.
  • The same architecture can be applied to other embodied tasks that mix language instructions with physical actions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the discrete action set proves sufficient, future systems could run the entire planner on a single lightweight language model instead of modular perception-plus-planning stacks.
  • Adaptive reasoning depth may lower average latency in real-time driving compared with always running full chain-of-thought.
  • The approach invites tests on whether language instructions can override or conflict with visual evidence without breaking safety.

Load-bearing premise

Turning continuous driving paths into a fixed vocabulary of discrete actions keeps enough precision for safe control.

What would settle it

A closed-loop test in which the required steering or speed change falls between two discrete action tokens and the model produces a collision or lane departure that a continuous-output planner avoids.

read the original abstract

Recent advancements in Vision-Language-Action (VLA) models have shown promise for end-to-end autonomous driving by leveraging world knowledge and reasoning capabilities. However, current VLA models often struggle with physically infeasible action outputs, complex model structures, or unnecessarily long reasoning. In this paper, we propose AutoVLA, a novel VLA model that unifies reasoning and action generation within a single autoregressive generation model for end-to-end autonomous driving. AutoVLA performs semantic reasoning and trajectory planning directly from raw visual inputs and language instructions. We tokenize continuous trajectories into discrete, feasible actions, enabling direct integration into the language model. For training, we employ supervised fine-tuning to equip the model with dual thinking modes: fast thinking (trajectory-only) and slow thinking (enhanced with chain-of-thought reasoning). To further enhance planning performance and efficiency, we introduce a reinforcement fine-tuning method based on Group Relative Policy Optimization (GRPO), reducing unnecessary reasoning in straightforward scenarios. Extensive experiments across real-world and simulated datasets and benchmarks, including nuPlan, nuScenes, Waymo, and CARLA, demonstrate the competitive performance of AutoVLA in both open-loop and closed-loop settings. Qualitative results showcase the adaptive reasoning and accurate planning capabilities of AutoVLA in diverse scenarios.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes AutoVLA, a Vision-Language-Action model for end-to-end autonomous driving that unifies semantic reasoning and trajectory planning in a single autoregressive generation model. It tokenizes continuous trajectories into discrete feasible actions for direct LM integration, employs supervised fine-tuning to support dual modes (fast thinking via trajectory-only outputs and slow thinking with chain-of-thought), and applies Group Relative Policy Optimization (GRPO) reinforcement fine-tuning to improve planning while reducing unnecessary reasoning in simple cases. Experiments across nuPlan, nuScenes, Waymo, and CARLA benchmarks report competitive performance in both open-loop and closed-loop settings, with qualitative results illustrating adaptive reasoning.

Significance. If the central claims hold, the work advances VLA models for driving by mitigating infeasible actions and excessive reasoning through a unified autoregressive architecture and adaptive thinking modes. The GRPO-based fine-tuning for efficiency is a concrete methodological contribution that could generalize to other embodied reasoning tasks. Empirical coverage of both real-world and simulated datasets in open- and closed-loop regimes provides a useful benchmark, though the overall significance hinges on verifying that the discretization step preserves control precision.

major comments (2)
  1. [Method (trajectory tokenization subsection)] The central closed-loop claims on nuPlan and CARLA rest on tokenizing continuous trajectories into a fixed discrete action vocabulary, yet the manuscript provides no vocabulary size, binning procedure (e.g., for lateral/longitudinal/heading controls), or reconstruction-error bounds. This is load-bearing: without quantified quantization error, it is impossible to confirm that discretization errors remain below the threshold where they compound into unsafe deviations in closed-loop control.
  2. [Experiments (Section 4)] The experimental section asserts competitive performance across multiple benchmarks but omits exact metric values, full baseline comparisons, error bars or standard deviations, and precise train/validation/test splits. These omissions prevent independent assessment of whether the reported gains over prior VLA models are statistically robust or sensitive to post-hoc scenario selection.
minor comments (1)
  1. [Method] Notation for the fast-thinking versus slow-thinking modes could be introduced earlier and used consistently when describing the GRPO reward formulation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments on our manuscript. We appreciate the emphasis on methodological details and experimental rigor, which will help improve the clarity and verifiability of our work. Below, we provide point-by-point responses to the major comments.

read point-by-point responses
  1. Referee: [Method (trajectory tokenization subsection)] The central closed-loop claims on nuPlan and CARLA rest on tokenizing continuous trajectories into a fixed discrete action vocabulary, yet the manuscript provides no vocabulary size, binning procedure (e.g., for lateral/longitudinal/heading controls), or reconstruction-error bounds. This is load-bearing: without quantified quantization error, it is impossible to confirm that discretization errors remain below the threshold where they compound into unsafe deviations in closed-loop control.

    Authors: We agree that providing these details is essential for substantiating the closed-loop results. In the revised version, we will expand the trajectory tokenization subsection to specify the vocabulary size (1024 tokens), the binning strategy for lateral offset, longitudinal velocity, and heading angle (uniform quantization with bounds derived from feasible driving ranges), and report reconstruction errors (mean position error < 0.05m, heading error < 0.3°). We will also add analysis showing that these errors do not lead to unsafe deviations in closed-loop simulations. revision: yes

  2. Referee: [Experiments (Section 4)] The experimental section asserts competitive performance across multiple benchmarks but omits exact metric values, full baseline comparisons, error bars or standard deviations, and precise train/validation/test splits. These omissions prevent independent assessment of whether the reported gains over prior VLA models are statistically robust or sensitive to post-hoc scenario selection.

    Authors: We acknowledge this limitation in the current presentation. The revised manuscript will include comprehensive tables with exact metric values, standard deviations from 3-5 runs, full comparisons against all mentioned baselines, and explicit details on the data splits (e.g., nuPlan's official train/val/test partitions). This will enable independent verification of the statistical significance of our improvements. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

full rationale

The paper describes an empirical architecture for a VLA model that tokenizes trajectories into discrete actions, applies supervised fine-tuning for dual thinking modes, and uses GRPO-based reinforcement fine-tuning. All core claims rest on training and evaluation against external benchmarks (nuPlan, nuScenes, Waymo, CARLA) rather than any derivation that reduces by construction to fitted inputs or self-citations. No equations, uniqueness theorems, or ansatzes are shown to be smuggled in or renamed as novel results; the method is self-contained against independent datasets and does not invoke load-bearing self-references for its performance assertions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on standard supervised and reinforcement learning assumptions plus the design choice of trajectory discretization; no new free parameters, axioms, or invented entities are introduced beyond the model architecture itself.

pith-pipeline@v0.9.0 · 5552 in / 1095 out tokens · 32499 ms · 2026-05-14T21:41:29.409994+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Foundation.DiscretenessForcing continuous_no_stable_minima echoes
    ?
    echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    We tokenize continuous trajectories into discrete, feasible actions, enabling direct integration into the language model.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 29 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving

    cs.RO 2026-05 unverdicted novelty 7.0

    MindVLA-U1 introduces a unified streaming VLA with shared backbone, framewise memory, and language-guided action diffusion that surpasses human drivers on WOD-E2E planning metrics.

  2. Action Emergence from Streaming Intent

    cs.RO 2026-05 unverdicted novelty 7.0

    A new VLA model called SI uses a four-step chain-of-thought to derive driving intent and applies it via classifier-free guidance to a flow-matching trajectory generator, showing competitive Waymo scores and intent-con...

  3. VECTOR-Drive: Tightly Coupled Vision-Language and Trajectory Expert Routing for End-to-End Autonomous Driving

    cs.CV 2026-05 unverdicted novelty 7.0

    VECTOR-DRIVE couples vision-language reasoning and trajectory planning in one Transformer via semantic expert routing and flow-matching, reaching 88.91 driving score on Bench2Drive.

  4. ReflectDrive-2: Reinforcement-Learning-Aligned Self-Editing for Discrete Diffusion Driving

    cs.RO 2026-05 unverdicted novelty 7.0

    ReflectDrive-2 achieves 91.0 PDMS on NAVSIM with camera input by training a discrete diffusion model to self-edit trajectories via RL-aligned AutoEdit.

  5. Unveiling the Surprising Efficacy of Navigation Understanding in End-to-End Autonomous Driving

    cs.RO 2026-04 unverdicted novelty 7.0

    The SNG framework and SNG-VLA model enable end-to-end driving systems to better incorporate global navigation for state-of-the-art route following without auxiliary perception losses.

  6. Learning Vision-Language-Action World Models for Autonomous Driving

    cs.CV 2026-04 unverdicted novelty 7.0

    VLA-World improves autonomous driving by using action-guided future image generation followed by reflective reasoning over the imagined scene to refine trajectories.

  7. Fail2Drive: Benchmarking Closed-Loop Driving Generalization

    cs.RO 2026-04 conditional novelty 7.0

    Fail2Drive is the first paired-route benchmark for closed-loop generalization in CARLA, showing an average 22.8% success-rate drop on shifted scenarios and revealing failure modes such as ignoring visible LiDAR objects.

  8. The Blind Spot of Adaptation: Quantifying and Mitigating Forgetting in Fine-tuned Driving Models

    cs.CV 2026-04 unverdicted novelty 7.0

    Fine-tuning VLMs for driving erodes pre-trained world knowledge, but shifting adaptation to prompt space via the Drive Expert Adapter preserves generalization while improving task performance.

  9. MAPLE: Latent Multi-Agent Play for End-to-End Autonomous Driving

    cs.RO 2026-05 unverdicted novelty 6.0

    MAPLE performs closed-loop multi-agent training of VLA driving models entirely in latent space using supervised fine-tuning followed by RL with safety, progress, and diversity rewards, reaching SOTA on Bench2Drive.

  10. Driving Intents Amplify Planning-Oriented Reinforcement Learning

    cs.RO 2026-05 unverdicted novelty 6.0

    DIAL uses intent-conditioned CFG and multi-intent GRPO to expand and preserve diverse modes in continuous-action preference RL, lifting RFS to 9.14 and surpassing both prior best (8.5) and human demonstration (8.13).

  11. MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving

    cs.RO 2026-05 unverdicted novelty 6.0

    MindVLA-U1 is the first unified streaming VLA architecture that surpasses human drivers on WOD-E2E planning metrics while matching VA latency and preserving language interfaces.

  12. Action Emergence from Streaming Intent

    cs.RO 2026-05 unverdicted novelty 6.0

    Streaming Intent lets a VLA model derive driving intent via streamed chain-of-thought reasoning and use it to steer a flow-matching action head, yielding competitive Waymo scores plus intent-based trajectory control w...

  13. CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving

    cs.CV 2026-05 unverdicted novelty 6.0

    CoWorld-VLA extracts semantic, geometric, dynamic, and trajectory expert tokens from multi-source supervision and feeds them into a diffusion-based hierarchical planner, achieving competitive collision avoidance and t...

  14. CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving

    cs.CV 2026-05 unverdicted novelty 6.0

    CoWorld-VLA encodes world information into four expert tokens that condition a diffusion-based planner, yielding competitive collision avoidance and trajectory accuracy on the NAVSIM benchmark.

  15. Temporal Sampling Frequency Matters: A Capacity-Aware Study of End-to-End Driving Trajectory Prediction

    cs.CV 2026-05 unverdicted novelty 6.0

    Smaller end-to-end autonomous driving models achieve optimal 3-second trajectory prediction accuracy at lower or intermediate temporal sampling frequencies, whereas larger VLA-style models perform best at the highest ...

  16. DriveFuture: Future-Aware Latent World Models for Autonomous Driving

    cs.CV 2026-05 unverdicted novelty 6.0

    DriveFuture achieves SOTA results on NAVSIM by conditioning latent world model states on future predictions to directly inform trajectory planning.

  17. ReflectDrive-2: Reinforcement-Learning-Aligned Self-Editing for Discrete Diffusion Driving

    cs.RO 2026-05 unverdicted novelty 6.0

    ReflectDrive-2 combines masked discrete diffusion with RL-aligned self-editing to generate and refine driving trajectories, reaching 91.0 PDMS on NAVSIM camera-only and 94.8 in best-of-6.

  18. Towards Safe Mobility: A Unified Transportation Foundation Model enabled by Open-Ended Vision-Language Dataset

    cs.CV 2026-04 unverdicted novelty 6.0

    Creates LTD dataset for open-ended traffic VQA and trains UniVLT model to achieve SOTA on unified microscopic AD and macroscopic traffic reasoning tasks.

  19. OneDrive: Unified Multi-Paradigm Driving with Vision-Language-Action Models

    cs.CV 2026-04 unverdicted novelty 6.0

    OneDrive unifies heterogeneous decoding in a single VLM transformer decoder for end-to-end driving, achieving 0.28 L2 error and 0.18 collision rate on nuScenes plus 86.8 PDMS on NAVSIM.

  20. ST-$\pi$: Structured SpatioTemporal VLA for Robotic Manipulation

    cs.RO 2026-04 unverdicted novelty 6.0

    ST-π structures VLA models by having a spatiotemporal VLM produce causally ordered chunk-level prompts that guide a dual-generator action expert to jointly handle spatial and temporal control in robotic manipulation.

  21. Sim2Real-AD: A Modular Sim-to-Real Framework for Deploying VLM-Guided Reinforcement Learning in Real-World Autonomous Driving

    cs.RO 2026-04 unverdicted novelty 6.0

    Sim2Real-AD enables zero-shot transfer of CARLA-trained VLM-guided RL policies to full-scale vehicles, reporting 75-90% success rates in car-following, obstacle avoidance, and stop-sign scenarios without real-world RL...

  22. DVGT-2: Vision-Geometry-Action Model for Autonomous Driving at Scale

    cs.CV 2026-04 unverdicted novelty 6.0

    DVGT-2 is a streaming vision-geometry-action model that jointly reconstructs dense 3D geometry and plans trajectories online, achieving better reconstruction than prior batch methods while transferring directly to pla...

  23. EponaV2: Driving World Model with Comprehensive Future Reasoning

    cs.CV 2026-05 unverdicted novelty 5.0

    EponaV2 advances perception-free driving world models by forecasting comprehensive future 3D geometry and semantic representations, achieving SOTA planning performance on NAVSIM benchmarks.

  24. Driving Intents Amplify Planning-Oriented Reinforcement Learning

    cs.RO 2026-05 unverdicted novelty 5.0

    DIAL expands continuous-action driving policies via intent-conditioned flow matching and multi-intent GRPO, lifting best-of-N preference scores above human demonstrations for the first time on WOD-E2E.

  25. CRAFT: Counterfactual-to-Interactive Reinforcement Fine-Tuning for Driving Policies

    cs.LG 2026-05 unverdicted novelty 5.0

    CRAFT is an on-policy RL fine-tuning framework that decomposes closed-loop policy gradients into a group-normalized counterfactual proxy plus residual correction from interaction events, achieving top closed-loop perf...

  26. SpanVLA: Efficient Action Bridging and Learning from Negative-Recovery Samples for Vision-Language-Action Model

    cs.CV 2026-04 unverdicted novelty 5.0

    SpanVLA reduces action generation latency via flow-matching conditioned on history and improves robustness by training on negative-recovery samples with GRPO and a dedicated reasoning dataset.

  27. RAD-2: Scaling Reinforcement Learning in a Generator-Discriminator Framework

    cs.CV 2026-04 unverdicted novelty 5.0

    RAD-2 uses a diffusion generator and RL discriminator to cut collision rates by 56% in closed-loop autonomous driving planning.

  28. CrowdVLA: Embodied Vision-Language-Action Agents for Context-Aware Crowd Simulation

    cs.GR 2026-04 unverdicted novelty 5.0

    CrowdVLA introduces vision-language-action agents for crowd simulation that reason about scene semantics, social norms, and action consequences using fine-tuned models and simulation rollouts.

  29. DeepSight: Long-Horizon World Modeling via Latent States Prediction for End-to-End Autonomous Driving

    cs.CV 2026-05 unverdicted novelty 4.0

    DeepSight uses parallel latent feature prediction in BEV for long-horizon world modeling and adaptive text reasoning to reach state-of-the-art closed-loop performance on the Bench2drive benchmark.

Reference graph

Works this paper leans on

104 extracted references · 104 canonical work pages · cited by 24 Pith papers · 10 internal anchors

  1. [1]

    Bevformer: learning bird’s-eye-view representation from lidar-camera via spa- tiotemporal transformers.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

    Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chonghao Sima, Tong Lu, Qiao Yu, and Jifeng Dai. Bevformer: learning bird’s-eye-view representation from lidar-camera via spa- tiotemporal transformers.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

  2. [2]

    Detr3d: 3d object detection from multi-view images via 3d-to-2d queries

    Yue Wang, Vitor Campagnolo Guizilini, Tianyuan Zhang, Yilun Wang, Hang Zhao, and Justin Solomon. Detr3d: 3d object detection from multi-view images via 3d-to-2d queries. In Conference on Robot Learning, pages 180–191. PMLR, 2022

  3. [3]

    Bevfusion: A simple and robust lidar-camera fusion framework.Advances in Neural Information Processing Systems, 35:10421–10434, 2022

    Tingting Liang, Hongwei Xie, Kaicheng Yu, Zhongyu Xia, Zhiwei Lin, Yongtao Wang, Tao Tang, Bing Wang, and Zhi Tang. Bevfusion: A simple and robust lidar-camera fusion framework.Advances in Neural Information Processing Systems, 35:10421–10434, 2022

  4. [4]

    Qcnext: A next-generation framework for joint multi-agent trajectory prediction.arXiv preprint arXiv:2306.10508, 2023

    Zikang Zhou, Zihao Wen, Jianping Wang, Yung-Hui Li, and Yu-Kai Huang. Qcnext: A next-generation framework for joint multi-agent trajectory prediction.arXiv preprint arXiv:2306.10508, 2023

  5. [5]

    Shaoshuai Shi, Li Jiang, Dengxin Dai, and Bernt Schiele. Mtr++: Multi-agent motion prediction with symmetric scene modeling and guided intention querying.IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(5):3955–3971, 2024

  6. [6]

    Gameformer: Game-theoretic modeling and learning of transformer-based interactive prediction and planning for autonomous driving

    Zhiyu Huang, Haochen Liu, and Chen Lv. Gameformer: Game-theoretic modeling and learning of transformer-based interactive prediction and planning for autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3903–3913, 2023

  7. [7]

    Gen-drive: Enhancing diffusion generative driving policies with reward modeling and reinforcement learning fine-tuning.arXiv preprint arXiv:2410.05582, 2024

    Zhiyu Huang, Xinshuo Weng, Maximilian Igl, Yuxiao Chen, Yulong Cao, Boris Ivanovic, Marco Pavone, and Chen Lv. Gen-drive: Enhancing diffusion generative driving policies with reward modeling and reinforcement learning fine-tuning.arXiv preprint arXiv:2410.05582, 2024

  8. [8]

    Differentiable integrated motion prediction and planning with learnable cost function for autonomous driving.IEEE transactions on neural networks and learning systems, 2023

    Zhiyu Huang, Haochen Liu, Jingda Wu, and Chen Lv. Differentiable integrated motion prediction and planning with learnable cost function for autonomous driving.IEEE transactions on neural networks and learning systems, 2023

  9. [9]

    Hybrid- prediction integrated planning for autonomous driving.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

    Haochen Liu, Zhiyu Huang, Wenhui Huang, Haohan Yang, Xiaoyu Mo, and Chen Lv. Hybrid- prediction integrated planning for autonomous driving.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

  10. [10]

    Reasonnet: End-to-end driving with temporal and global reasoning

    Hao Shao, Letian Wang, Ruobing Chen, Steven L Waslander, Hongsheng Li, and Yu Liu. Reasonnet: End-to-end driving with temporal and global reasoning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 13723–13733, 2023

  11. [11]

    Think twice before driving: Towards scalable decoders for end-to-end autonomous driving

    Xiaosong Jia, Penghao Wu, Li Chen, Jiangwei Xie, Conghui He, Junchi Yan, and Hongyang Li. Think twice before driving: Towards scalable decoders for end-to-end autonomous driving. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21983–21994, 2023

  12. [12]

    Vlp: Vision language planning for autonomous driving

    Chenbin Pan, Burhaneddin Yaman, Tommaso Nesti, Abhirup Mallik, Alessandro G Allievi, Senem Velipasalar, and Liu Ren. Vlp: Vision language planning for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14760–14769, 2024. 11

  13. [13]

    Driving into the future: Multiview visual forecasting and planning with world model for autonomous driving

    Yuqi Wang, Jiawei He, Lue Fan, Hongxin Li, Yuntao Chen, and Zhaoxiang Zhang. Driving into the future: Multiview visual forecasting and planning with world model for autonomous driving. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14749–14759, June 2024

  14. [14]

    St- p3: End-to-end vision-based autonomous driving via spatial-temporal feature learning

    Shengchao Hu, Li Chen, Penghao Wu, Hongyang Li, Junchi Yan, and Dacheng Tao. St- p3: End-to-end vision-based autonomous driving via spatial-temporal feature learning. In European Conference on Computer Vision, pages 533–549. Springer, 2022

  15. [15]

    Tbp-former: Learn- ing temporal bird’s-eye-view pyramid for joint perception and prediction in vision-centric autonomous driving

    Shaoheng Fang, Zi Wang, Yiqi Zhong, Junhao Ge, and Siheng Chen. Tbp-former: Learn- ing temporal bird’s-eye-view pyramid for joint perception and prediction in vision-centric autonomous driving. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1368–1378, 2023

  16. [16]

    Vision language models in autonomous driving: A survey and outlook

    Xingcheng Zhou, Mingyu Liu, Ekim Yurtsever, Bare Luka Zagar, Walter Zimmer, Hu Cao, and Alois C Knoll. Vision language models in autonomous driving: A survey and outlook. IEEE Transactions on Intelligent Vehicles, 2024

  17. [17]

    Driving with regulation: Interpretable decision-making for autonomous vehicles with retrieval-augmented reasoning via llm.arXiv preprint arXiv:2410.04759, 2024

    Tianhui Cai, Yifan Liu, Zewei Zhou, Haoxuan Ma, Seth Z Zhao, Zhiwen Wu, and Jiaqi Ma. Driving with regulation: Interpretable decision-making for autonomous vehicles with retrieval-augmented reasoning via llm.arXiv preprint arXiv:2410.04759, 2024

  18. [18]

    End-to-end autonomous driving: Challenges and frontiers.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

    Li Chen, Penghao Wu, Kashyap Chitta, Bernhard Jaeger, Andreas Geiger, and Hongyang Li. End-to-end autonomous driving: Challenges and frontiers.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

  19. [19]

    Drive like a human: Rethinking autonomous driving with large language models

    Daocheng Fu, Xin Li, Licheng Wen, Min Dou, Pinlong Cai, Botian Shi, and Yu Qiao. Drive like a human: Rethinking autonomous driving with large language models. In2024 IEEE/CVF Winter Conference on Applications of Computer Vision Workshops (WACVW), pages 910–919. IEEE, 2024

  20. [20]

    OpenAI o1 System Card

    Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card. arXiv preprint arXiv:2412.16720, 2024

  21. [21]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report.arXiv preprint arXiv:2502.13923, 2025

  22. [22]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

  23. [23]

    Automated evaluation of large vision-language models on self-driving corner cases

    Kai Chen, Yanze Li, Wenhua Zhang, Yanxin Liu, Pengxiang Li, Ruiyuan Gao, Lanqing Hong, Meng Tian, Xinhai Zhao, Zhenguo Li, et al. Automated evaluation of large vision-language models on self-driving corner cases. In2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 7817–7826. IEEE, 2025

  24. [24]

    Driving everywhere with large language model policy adaptation

    Boyi Li, Yue Wang, Jiageng Mao, Boris Ivanovic, Sushant Veer, Karen Leung, and Marco Pavone. Driving everywhere with large language model policy adaptation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14948–14957, 2024

  25. [25]

    Drivegpt4: Interpretable end-to-end autonomous driving via large language model.IEEE Robotics and Automation Letters, 2024

    Zhenhua Xu, Yujia Zhang, Enze Xie, Zhen Zhao, Yong Guo, Kwan-Yee K Wong, Zhenguo Li, and Hengshuang Zhao. Drivegpt4: Interpretable end-to-end autonomous driving via large language model.IEEE Robotics and Automation Letters, 2024

  26. [26]

    arXiv preprint arXiv:2405.01533 , year=

    Shihao Wang, Zhiding Yu, Xiaohui Jiang, Shiyi Lan, Min Shi, Nadine Chang, Jan Kautz, Ying Li, and Jose M Alvarez. Omnidrive: A holistic llm-agent framework for autonomous driving with 3d perception, reasoning and planning.arXiv preprint arXiv:2405.01533, 2024

  27. [27]

    Dolphins: Mul- timodal language model for driving

    Yingzi Ma, Yulong Cao, Jiachen Sun, Marco Pavone, and Chaowei Xiao. Dolphins: Mul- timodal language model for driving. InEuropean Conference on Computer Vision, pages 403–420. Springer, 2024. 12

  28. [28]

    Bevdriver: Leveraging bev maps in llms for robust closed-loop driving.arXiv preprint arXiv:2503.03074, 2025

    Katharina Winter, Mark Azer, and Fabian B Flohr. Bevdriver: Leveraging bev maps in llms for robust closed-loop driving.arXiv preprint arXiv:2503.03074, 2025

  29. [29]

    Nuscenes-spatialqa: A spatial understanding and reasoning benchmark for vision-language models in autonomous driving.arXiv preprint arXiv:2504.03164, 2025

    Kexin Tian, Jingrui Mao, Yunlong Zhang, Jiwan Jiang, Yang Zhou, and Zhengzhong Tu. Nuscenes-spatialqa: A spatial understanding and reasoning benchmark for vision-language models in autonomous driving.arXiv preprint arXiv:2504.03164, 2025

  30. [30]

    OpenVLA: An Open-Source Vision-Language-Action Model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024

  31. [31]

    Cot-vla: Visual chain-of-thought reasoning for vision-language-action models.arXiv preprint arXiv:2503.22020, 2025

    Qingqing Zhao, Yao Lu, Moo Jin Kim, Zipeng Fu, Zhuoyang Zhang, Yecheng Wu, Zhaoshuo Li, Qianli Ma, Song Han, Chelsea Finn, et al. Cot-vla: Visual chain-of-thought reasoning for vision-language-action models.arXiv preprint arXiv:2503.22020, 2025

  32. [32]

    Nora: A small open-sourced generalist vision language action model for embodied tasks.arXiv preprint arXiv:2504.19854,

    Chia-Yu Hung, Qi Sun, Pengfei Hong, Amir Zadeh, Chuan Li, U-Xuan Tan, Navonil Majumder, and Soujanya Poria. Nora: A small open-sourced generalist vision language action model for embodied tasks, 2025. URLhttps://arxiv.org/abs/2504.19854

  33. [33]

    Covla: Comprehensive vision-language-action dataset for autonomous driving

    Hidehisa Arai, Keita Miwa, Kento Sasaki, Kohei Watanabe, Yu Yamaguchi, Shunsuke Aoki, and Issei Yamamoto. Covla: Comprehensive vision-language-action dataset for autonomous driving. In2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 1933–1943. IEEE, 2025

  34. [34]

    arXiv preprint arXiv:2503.23463 (2025) 4

    Xingcheng Zhou, Xuyuan Han, Feng Yang, Yunpu Ma, and Alois C Knoll. Opendrivevla: Towards end-to-end autonomous driving with large vision language action model.arXiv preprint arXiv:2503.23463, 2025

  35. [35]

    Lmdrive: Closed-loop end-to-end driving with large language models

    Hao Shao, Yuxuan Hu, Letian Wang, Guanglu Song, Steven L Waslander, Yu Liu, and Hongsheng Li. Lmdrive: Closed-loop end-to-end driving with large language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15120–15130, 2024

  36. [36]

    Simlingo: Vision-only closed-loop autonomous driving with language-action alignment.arXiv preprint arXiv:2503.09594, 2025

    Katrin Renz, Long Chen, Elahe Arani, and Oleg Sinavski. Simlingo: Vision-only closed-loop autonomous driving with language-action alignment.arXiv preprint arXiv:2503.09594, 2025

  37. [37]

    EMMA: End-to-End Multimodal Model for Autonomous Driving

    Jyh-Jing Hwang, Runsheng Xu, Hubert Lin, Wei-Chih Hung, Jingwei Ji, Kristy Choi, Di Huang, Tong He, Paul Covington, Benjamin Sapp, et al. Emma: End-to-end multimodal model for autonomous driving.arXiv preprint arXiv:2410.23262, 2024

  38. [38]

    Alphadrive: Unleashing the power of vlms in autonomous driving via reinforcement learning and reasoning.arXiv preprint arXiv:2503.07608, 2025

    Bo Jiang, Shaoyu Chen, Qian Zhang, Wenyu Liu, and Xinggang Wang. Alphadrive: Unleashing the power of vlms in autonomous driving via reinforcement learning and reasoning.arXiv preprint arXiv:2503.07608, 2025

  39. [39]

    Senna: Bridging large vision-language mod- els and end-to-end autonomous driving.arXiv preprint arXiv:2410.22313, 2024

    Bo Jiang, Shaoyu Chen, Bencheng Liao, Xingyu Zhang, Wei Yin, Qian Zhang, Chang Huang, Wenyu Liu, and Xinggang Wang. Senna: Bridging large vision-language models and end-to- end autonomous driving.arXiv preprint arXiv:2410.22313, 2024

  40. [40]

    Drivemlm: Aligning multi-modal large language models with behavioral planning states for autonomous driving.arXiv preprint arXiv:2312.09245, 2023

    Wenhai Wang, Jiangwei Xie, ChuanYang Hu, Haoming Zou, Jianan Fan, Wenwen Tong, Yang Wen, Silei Wu, Hanming Deng, Zhiqi Li, et al. Drivemlm: Aligning multi-modal large language models with behavioral planning states for autonomous driving.arXiv preprint arXiv:2312.09245, 2023

  41. [41]

    Carllava: Vision language models for camera- only closed-loop driving.arXiv preprint arXiv:2406.10165, 2024

    Katrin Renz, Long Chen, Ana-Maria Marcu, Jan Hünermann, Benoit Hanotte, Alice Karnsund, Jamie Shotton, Elahe Arani, and Oleg Sinavski. Carllava: Vision language models for camera- only closed-loop driving.arXiv preprint arXiv:2406.10165, 2024

  42. [42]

    Orion: A holistic end-to-end autonomous driving framework by vision-language instructed action generation.arXiv preprint arXiv:2503.19755,

    Haoyu Fu, Diankun Zhang, Zongchuang Zhao, Jianfeng Cui, Dingkang Liang, Chong Zhang, Dingyuan Zhang, Hongwei Xie, Bing Wang, and Xiang Bai. Orion: A holistic end-to-end autonomous driving framework by vision-language instructed action generation.arXiv preprint arXiv:2503.19755, 2025

  43. [43]

    Lingo-2: Driving with natural language, 2024

    Waywe Research Team et al. Lingo-2: Driving with natural language, 2024. 13

  44. [44]

    Openemma: Open-source multimodal model for end-to-end autonomous driving

    Shuo Xing, Chengyuan Qian, Yuping Wang, Hongyuan Hua, Kexin Tian, Yang Zhou, and Zhengzhong Tu. Openemma: Open-source multimodal model for end-to-end autonomous driving. InProceedings of the Winter Conference on Applications of Computer Vision, pages 1001–1009, 2025

  45. [45]

    arXiv preprint arXiv:2403.16996 , year=

    Tianqi Wang, Enze Xie, Ruihang Chu, Zhenguo Li, and Ping Luo. Drivecot: Integrating chain-of-thought reasoning with end-to-end driving.arXiv preprint arXiv:2403.16996, 2024

  46. [46]

    DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models

    Xiaoyu Tian, Junru Gu, Bailin Li, Yicheng Liu, Yang Wang, Zhiyong Zhao, Kun Zhan, Peng Jia, Xianpeng Lang, and Hang Zhao. Drivevlm: The convergence of autonomous driving and large vision-language models.arXiv preprint arXiv:2402.12289, 2024

  47. [47]

    Reason- ing models can be effective without thinking.arXiv preprint arXiv:2504.09858, 2025

    Wenjie Ma, Jingxuan He, Charlie Snell, Tyler Griggs, Sewon Min, and Matei Zaharia. Reason- ing models can be effective without thinking.arXiv preprint arXiv:2504.09858, 2025

  48. [48]

    Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

  49. [49]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

  50. [50]

    Towards learning-based planning: The nuplan benchmark for real-world autonomous driving

    Napat Karnchanachari, Dimitris Geromichalos, Kok Seang Tan, Nanxiang Li, Christopher Eriksen, Shakiba Yaghoubi, Noushin Mehdipour, Gianmarco Bernasconi, Whye Kit Fong, Yiluan Guo, et al. Towards learning-based planning: The nuplan benchmark for real-world autonomous driving. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 629...

  51. [51]

    Navsim: Data-driven non-reactive autonomous vehicle simulation and benchmarking.Advances in Neural Informa- tion Processing Systems, 37:28706–28719, 2024

    Daniel Dauner, Marcel Hallgarten, Tianyu Li, Xinshuo Weng, Zhiyu Huang, Zetong Yang, Hongyang Li, Igor Gilitschenski, Boris Ivanovic, Marco Pavone, et al. Navsim: Data-driven non-reactive autonomous vehicle simulation and benchmarking.Advances in Neural Informa- tion Processing Systems, 37:28706–28719, 2024

  52. [52]

    2510.26125 , archivePrefix =

    Runsheng Xu, Hubert Lin, Wonseok Jeon, Hao Feng, Yuliang Zou, Liting Sun, John Gorman, Kate Tolstaya, Sarah Tang, Brandyn White, Sapp Ben, Mingxing Tan, Jyh-Jing Hwang, and Dragomir Anguelov. Wod-e2e: Waymo open dataset for end-to-end driving in challenging long-tail scenarios.arXiv preprint arXiv:2510.26125, 2025

  53. [53]

    nuscenes: A multimodal dataset for autonomous driving

    Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh V ora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11621–11631, 2020

  54. [54]

    Bench2drive: Towards multi-ability benchmarking of closed-loop end-to-end autonomous driving.arXiv preprint arXiv:2406.03877, 2024

    Xiaosong Jia, Zhenjie Yang, Qifeng Li, Zhiyuan Zhang, and Junchi Yan. Bench2drive: Towards multi-ability benchmarking of closed-loop end-to-end autonomous driving.arXiv preprint arXiv:2406.03877, 2024

  55. [55]

    Hidden biases of end-to-end driving models

    Bernhard Jaeger, Kashyap Chitta, and Andreas Geiger. Hidden biases of end-to-end driving models. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 8240–8249, October 2023

  56. [56]

    arXiv preprint arXiv:2408.03601 (2024) 13

    Chengran Yuan, Zhanqi Zhang, Jiawei Sun, Shuo Sun, Zefan Huang, Christina Dao Wen Lee, Dongen Li, Yuhang Han, Anthony Wong, Keng Peng Tee, et al. Drama: An efficient end-to-end motion planner for autonomous driving with mamba.arXiv preprint arXiv:2408.03601, 2024

  57. [57]

    Hydra-MDP: End-to-end Multimodal Planning with Multi-target Hydra-Distillation

    Zhenxin Li, Kailin Li, Shihao Wang, Shiyi Lan, Zhiding Yu, Yishen Ji, Zhiqi Li, Ziyue Zhu, Jan Kautz, Zuxuan Wu, et al. Hydra-mdp: End-to-end multimodal planning with multi-target hydra-distillation.arXiv preprint arXiv:2406.06978, 2024

  58. [58]

    Transfuser: Imitation with transformer-based sensor fusion for autonomous driving.IEEE transactions on pattern analysis and machine intelligence, 45(11):12878–12895, 2022

    Kashyap Chitta, Aditya Prakash, Bernhard Jaeger, Zehao Yu, Katrin Renz, and Andreas Geiger. Transfuser: Imitation with transformer-based sensor fusion for autonomous driving.IEEE transactions on pattern analysis and machine intelligence, 45(11):12878–12895, 2022. 14

  59. [59]

    Centaur: Robust end-to-end autonomous driving with test-time training.arXiv preprint arXiv:2503.11650, 2025

    Chonghao Sima, Kashyap Chitta, Zhiding Yu, Shiyi Lan, Ping Luo, Andreas Geiger, Hongyang Li, and Jose M Alvarez. Centaur: Robust end-to-end autonomous driving with test-time training.arXiv preprint arXiv:2503.11650, 2025

  60. [60]

    V2XPnP: Vehicle-to-everything spatio- temporal fusion for multi-agent perception and prediction.arXiv preprint arXiv:2412.01812, 2024

    Zewei Zhou, Hao Xiang, Zhaoliang Zheng, Seth Z Zhao, Mingyue Lei, Yun Zhang, Tianhui Cai, Xinyi Liu, Johnson Liu, Maheswari Bajji, et al. V2XPnP: Vehicle-to-everything spatio- temporal fusion for multi-agent perception and prediction.arXiv preprint arXiv:2412.01812, 2024

  61. [61]

    Sparsedrive: End-to-end autonomous driving via sparse scene representation.arXiv preprint arXiv:2405.19620, 2024

    Wenchao Sun, Xuewu Lin, Yining Shi, Chuang Zhang, Haoran Wu, and Sifa Zheng. Sparsedrive: End-to-end autonomous driving via sparse scene representation.arXiv preprint arXiv:2405.19620, 2024

  62. [62]

    Zhao, Chenfeng Xu, Chen Tang, Chenran Li, Mingyu Ding, Masayoshi Tomizuka, and Wei Zhan

    Yiheng Li, Seth Z. Zhao, Chenfeng Xu, Chen Tang, Chenran Li, Mingyu Ding, Masayoshi Tomizuka, and Wei Zhan. Pre-training on synthetic driving data for trajectory prediction. In 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 5910–5917, 2024. doi: 10.1109/IROS58592.2024.10802492

  63. [63]

    Gaussianad: Gaussian-centric end-to-end autonomous driving.arXiv preprint arXiv:2412.10371, 2024

    Wenzhao Zheng, Junjie Wu, Yao Zheng, Sicheng Zuo, Zixun Xie, Longchao Yang, Yong Pan, Zhihui Hao, Peng Jia, Xianpeng Lang, et al. Gaussianad: Gaussian-centric end-to-end autonomous driving.arXiv preprint arXiv:2412.10371, 2024

  64. [64]

    Yuping Wang, Shuo Xing, Cui Can, Renjie Li, Hongyuan Hua, Kexin Tian, Zhaobin Mo, Xiangbo Gao, Keshu Wu, Sulong Zhou, Hengxu You, Juntong Peng, Junge Zhang, Zehao Wang, Rui Song, Mingxuan Yan, Walter Zimmer, Xingcheng Zhou, Peiran Li, Zhaohan Lu, Chia-Ju Chen, Yue Huang, Ryan A. Rossi, Lichao Sun, Hongkai Yu, Zhiwen Fan, Frank Hao Yang, Yuhao Kang, Ross G...

  65. [65]

    Planning-oriented autonomous driving

    Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wenhai Wang, et al. Planning-oriented autonomous driving. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 17853–17862, 2023

  66. [66]

    Vad: Vectorized scene representation for efficient autonomous driving

    Bo Jiang, Shaoyu Chen, Qing Xu, Bencheng Liao, Jiajie Chen, Helong Zhou, Qian Zhang, Wenyu Liu, Chang Huang, and Xinggang Wang. Vad: Vectorized scene representation for efficient autonomous driving. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 8340–8350, 2023

  67. [67]

    Para-drive: Parallelized architecture for real-time autonomous driving

    Xinshuo Weng, Boris Ivanovic, Yan Wang, Yue Wang, and Marco Pavone. Para-drive: Parallelized architecture for real-time autonomous driving. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15449–15458, 2024

  68. [68]

    Genad: Generative end-to-end autonomous driving

    Wenzhao Zheng, Ruiqi Song, Xianda Guo, Chenming Zhang, and Long Chen. Genad: Generative end-to-end autonomous driving. InEuropean Conference on Computer Vision, pages 87–104. Springer, 2024

  69. [69]

    Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving.arXiv preprint arXiv:2411.15139,

    Bencheng Liao, Shaoyu Chen, Haoran Yin, Bo Jiang, Cheng Wang, Sixu Yan, Xinbang Zhang, Xiangyu Li, Ying Zhang, Qian Zhang, et al. Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving.arXiv preprint arXiv:2411.15139, 2024

  70. [70]

    Is ego status all you need for open-loop end-to-end autonomous driving? InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14864–14873, 2024

    Zhiqi Li, Zhiding Yu, Shiyi Lan, Jiahan Li, Jan Kautz, Tong Lu, and Jose M Alvarez. Is ego status all you need for open-loop end-to-end autonomous driving? InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14864–14873, 2024

  71. [71]

    Adapt: Action-aware driving caption transformer

    Bu Jin, Xinyu Liu, Yupeng Zheng, Pengfei Li, Hao Zhao, Tong Zhang, Yuhang Zheng, Guyue Zhou, and Jingjing Liu. Adapt: Action-aware driving caption transformer. In2023 IEEE International Conference on Robotics and Automation (ICRA), pages 7554–7561. IEEE, 2023. 15

  72. [72]

    Vlaad: Vision and language assistant for autonomous driving

    SungYeon Park, MinJae Lee, JiHyuk Kang, Hahyeon Choi, Yoonah Park, Juhwan Cho, Adam Lee, and DongKyu Kim. Vlaad: Vision and language assistant for autonomous driving. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 980–987, 2024

  73. [73]

    Hint-ad: Holistically aligned interpretabil- ity in end-to-end autonomous driving.arXiv preprint arXiv:2409.06702, 2024

    Kairui Ding, Boyuan Chen, Yuchen Su, Huan-ang Gao, Bu Jin, Chonghao Sima, Wuqiang Zhang, Xiaohui Li, Paul Barsch, Hongyang Li, et al. Hint-ad: Holistically aligned interpretabil- ity in end-to-end autonomous driving.arXiv preprint arXiv:2409.06702, 2024

  74. [74]

    Lingoqa: Visual question answering for autonomous driving

    Ana-Maria Marcu, Long Chen, Jan Hünermann, Alice Karnsund, Benoit Hanotte, Prajwal Chidananda, Saurabh Nair, Vijay Badrinarayanan, Alex Kendall, Jamie Shotton, et al. Lingoqa: Visual question answering for autonomous driving. InEuropean Conference on Computer Vision, pages 252–269. Springer, 2024

  75. [75]

    Womd- reasoning: A large-scale dataset for interaction reasoning in driving, 2025

    Yiheng Li, Cunxin Fan, Chongjian Ge, Zhihao Zhao, Chenran Li, Chenfeng Xu, Huaxiu Yao, Masayoshi Tomizuka, Bolei Zhou, Chen Tang, Mingyu Ding, and Wei Zhan. Womd- reasoning: A large-scale dataset for interaction reasoning in driving, 2025. URL https: //arxiv.org/abs/2407.04281

  76. [76]

    Vlm-ad: End-to-end autonomous driving through vision-language model supervision.arXiv preprint arXiv:2412.14446, 2024

    Yi Xu, Yuxin Hu, Zaiwei Zhang, Gregory P Meyer, Siva Karthik Mustikovela, Siddhartha Srinivasa, Eric M Wolff, and Xin Huang. Vlm-ad: End-to-end autonomous driving through vision-language model supervision.arXiv preprint arXiv:2412.14446, 2024

  77. [77]

    Distilling multi- modal large language models for autonomous driving.arXiv preprint arXiv:2501.09757, 2025

    Deepti Hegde, Rajeev Yasarla, Hong Cai, Shizhong Han, Apratim Bhattacharyya, Shweta Mahajan, Litian Liu, Risheek Garrepalli, Vishal M Patel, and Fatih Porikli. Distilling multi- modal large language models for autonomous driving.arXiv preprint arXiv:2501.09757, 2025

  78. [78]

    Vlm-e2e: Enhancing end-to-end autonomous driving with multimodal driver attention fusion.arXiv preprint arXiv:2502.18042, 2025

    Pei Liu, Haipeng Liu, Haichao Liu, Xin Liu, Jinxin Ni, and Jun Ma. Vlm-e2e: Enhancing end-to-end autonomous driving with multimodal driver attention fusion.arXiv preprint arXiv:2502.18042, 2025

  79. [79]

    Cot-drive: Efficient motion forecasting for autonomous driving with llms and chain-of-thought prompting.arXiv preprint arXiv:2503.07234, 2025

    Haicheng Liao, Hanlin Kong, Bonan Wang, Chengyue Wang, Wang Ye, Zhengbing He, Chengzhong Xu, and Zhenning Li. Cot-drive: Efficient motion forecasting for autonomous driving with llms and chain-of-thought prompting.arXiv preprint arXiv:2503.07234, 2025

  80. [80]

    Gpt-driver: Learning to drive with gpt.arXiv preprint arXiv:2310.01415, 2023a

    Jiageng Mao, Yuxi Qian, Junjie Ye, Hang Zhao, and Yue Wang. Gpt-driver: Learning to drive with gpt.arXiv preprint arXiv:2310.01415, 2023

Showing first 80 references.