arxiv: 2605.22089 · v1 · pith:GVQQEZRSnew · submitted 2026-05-21 · 💻 cs.CV · cs.AI

LVDrive: Latent Visual Representation Enhanced Vision-Language-Action Autonomous Driving Model

Pith reviewed 2026-05-22 07:33 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords autonomous driving · vision-language-action models · latent visual representations · future scene prediction · trajectory generation · Bench2Drive benchmark · world models

The pith

LVDrive improves VLA autonomous driving by predicting future scenes in high-level latent space instead of reconstructing pixels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing vision-language-action models for end-to-end driving rely either on sparse action labels or on pixel-level world models that reconstruct images. LVDrive adds a future scene prediction task learned entirely in latent space, with auxiliary supervision from a pretrained vision backbone. It jointly predicts future scenes and motion in a unified embedding space within a single forward pass, then refines trajectories with a two-stage decoder that uses those latent representations. According to the paper, experiments on the Bench2Drive benchmark show better closed-loop driving performance than both purely action-supervised baselines and image-reconstruction world models. A sympathetic reader would care because this suggests that semantic future awareness can be obtained more efficiently than through full visual reconstruction.

Core claim

LVDrive introduces a future scene prediction task into the VLA paradigm where future representations are learned entirely in a high-level latent space under auxiliary supervision from a pretrained vision backbone. It jointly models future scene and motion prediction within a unified embedding space processed in a single forward pass to conduct future-aware reasoning, and designs a two-stage trajectory decoding strategy that explicitly leverages the learned latent future representations to refine trajectory generation.

What carries the argument

A future scene prediction task in a high-level latent space, trained under auxiliary supervision from a pretrained vision backbone and jointly modeled with motion prediction in a unified embedding space that is processed in one forward pass.
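
The auxiliary supervision described above can be sketched as a loss between predicted future latents and the features that a pretrained backbone extracts from the real future frame. This is a minimal illustration, not the paper's implementation: the cosine-distance form, the assumption that the backbone is frozen, the `aux_weight` coefficient, and every function name are assumptions, since the abstract does not specify the loss.

```python
import math


def cosine_distance(pred, target):
    """Return 1 - cosine similarity between two equal-length vectors."""
    dot = sum(p * t for p, t in zip(pred, target))
    norm_p = math.sqrt(sum(p * p for p in pred))
    norm_t = math.sqrt(sum(t * t for t in target))
    if norm_p == 0.0 or norm_t == 0.0:
        return 1.0  # treat a zero vector as maximally dissimilar
    return 1.0 - dot / (norm_p * norm_t)


def latent_future_loss(pred_tokens, target_tokens):
    """Mean cosine distance over future latent tokens (assumed loss form).

    pred_tokens:   latents the VLA predicts for the future frame.
    target_tokens: features a pretrained backbone extracts from the real
                   future frame, treated here as fixed targets.
    """
    if len(pred_tokens) != len(target_tokens):
        raise ValueError("prediction and target must have the same token count")
    distances = [cosine_distance(p, t) for p, t in zip(pred_tokens, target_tokens)]
    return sum(distances) / len(distances)


def total_loss(action_loss, pred_tokens, target_tokens, aux_weight=1.0):
    """Sparse action loss plus the weighted dense latent loss (hypothetical weighting)."""
    return action_loss + aux_weight * latent_future_loss(pred_tokens, target_tokens)
```

Because the targets are backbone features rather than pixels, this setup needs no image decoder, which is the efficiency argument made against pixel-level reconstruction.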

If this is right

  • LVDrive achieves significant improvements in closed-loop driving performance on the Bench2Drive benchmark.
  • It outperforms both action-supervised VLA methods and image-reconstruction-based world model approaches.
  • Joint modeling of future scene and motion in a single forward pass enables future-aware reasoning without autoregressive generation.
  • The two-stage trajectory decoder explicitly uses learned latent future representations to refine outputs.
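
The two-stage decoding in the last bullet can be pictured as a coarse pass followed by a latent-conditioned correction. The sketch below is a toy under stated assumptions: the constant-velocity coarse decoder, the residual form of the refinement, the `gain` parameter, and all names are hypothetical and are not taken from the paper.

```python
def coarse_trajectory(motion_feature, num_waypoints):
    """Stage 1 (toy): decode coarse (x, y) waypoints from a motion feature.

    The feature is read as a constant (vx, vy) velocity per step, which is
    an illustrative stand-in for a learned decoder.
    """
    vx, vy = motion_feature
    return [(vx * (k + 1), vy * (k + 1)) for k in range(num_waypoints)]


def refine_trajectory(coarse, future_latents, gain=0.1):
    """Stage 2 (toy): shift each waypoint by a residual read from the latent future.

    future_latents holds one latent vector per waypoint; its first two
    dimensions act as a hypothetical residual head output.
    """
    if len(coarse) != len(future_latents):
        raise ValueError("need one latent vector per waypoint")
    refined = []
    for (x, y), latent in zip(coarse, future_latents):
        dx, dy = latent[0], latent[1]
        refined.append((x + gain * dx, y + gain * dy))
    return refined
```

In this picture the first stage depends only on motion features, while the second stage is the only place where the predicted future scene influences the output, which makes the contribution of the latents easy to ablate.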

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This latent-space approach could reduce compute in other VLA robotics tasks by avoiding full image reconstruction.
  • It raises the question of how much visual detail is truly needed for planning versus high-level semantic forecasts.
  • Testing the same latent prediction on real-world driving datasets would check whether simulation gains transfer.
  • The unified embedding space might allow tighter integration with language instructions for conditional driving behaviors.

Load-bearing premise

Future scene representations learned entirely in a high-level latent space under auxiliary supervision from a pretrained vision backbone will provide semantically meaningful information that improves the VLA model's future-aware reasoning and trajectory generation for driving.

What would settle it

An ablation that removes the latent future prediction task and auxiliary supervision but shows equal or better closed-loop metrics on Bench2Drive would falsify the claim that these latent representations drive the performance gains.

Figures

Figures reproduced from arXiv: 2605.22089 by Dan Xu, Diankun Zhang, Guang Chen, Hangjun Ye, Hongwei Xie, Xiaodong Mei.

Figure 1: The comparison of different VLA paradigms. Standard VLA approaches, as depicted in (a), rely on sparse action supervision. Our LVDrive, illustrated in (c), performs the future visual and action representation learning jointly. Unlike VLA with the world modeling paradigm in (b), LVDrive predicts future scenes entirely in latent space, capturing rich semantic features without pixel-level reconstruction. The …

Figure 2: Overview of LVDrive. LVDrive is a VLA framework that unifies latent future scene representation learning and motion planning, with dense auxiliary supervision provided by a pretrained vision backbone. Given multi-view images, the model encodes current and historical scene features and performs future-aware reasoning to predict both latent visual representations and motion features in a single forward pass…

Figure 3: Qualitative results of our LVDrive and Mbase in an Overtaking scenario from Bench2Drive. The ego vehicle encounters an accident ahead that blocks its driving lane, while a steady stream of oncoming traffic occupies the adjacent lane. The blue line denotes the generated trajectory. The ego vehicle controlled by Mbase becomes immobilized at the accident site. In contrast, our LVDrive successfully and smoothl…

Figure 4: Qualitative results of our LVDrive and Mbase in an Emergency Brake scenario. The ego vehicle is required to first perform an unprotected left turn at an intersection without a traffic light, then yield to a bicycle crossing its path. Upon encountering the crossing bicycle, the ego vehicle should execute an emergency brake, wait for the bicycle to clear the road, and subsequently resume driving. The ego veh…

Figure 5: Qualitative results of our LVDrive and Mbase in a Merging scenario. Navigating highway exits and merging onto narrow roads requires precise perception of the road layout and fine-grained trajectory planning. The ego vehicle of Mbase fails to capture the exact road boundary, leading to a lateral deviation that results in a collision with the guardrail. In contrast, LVDrive accurately perceives the off-ramp …

Figure 6: Qualitative results of our LVDrive and Mbase in a Traffic Sign scenario. The ego vehicle is required to make a right turn at a non-signalized intersection without traffic lights or signs, while avoiding collisions with surrounding vehicles. The ego vehicle controlled by Mbase plans an inaccurate trajectory that deviates from the intended route and ultimately collides with a traffic sign. In contrast, LVDri…

Figure 7: Qualitative results of our LVDrive and Mbase in an Overtaking scenario. The ego vehicle drives forward in its lane, and the front car in the adjacent lane stops and opens the door, which blocks the driving lane. The ego vehicle of Mbase stops and gets stuck in the place. In contrast, our LVDrive successfully plans the safe trajectory to bypass the front car with the open door and drives forward continuousl…

Figure 8: Failure case of our LVDrive in a Give Way scenario. The ego vehicle is required to yield to the emergency vehicle that approaches from behind. LVDrive maintains the straight route and fails to yield to the ambulance.

Original abstract

Vision-Language-Action (VLA) models have emerged as a promising framework for end-to-end autonomous driving. However, existing VLAs typically rely on sparse action supervision, which underutilizes their powerful scene understanding and reasoning capabilities. Recent attempts to incorporate dense visual supervision via world modeling often overemphasize pixel-level image reconstruction, neglecting semantically meaningful scene representation learning. In this work, we propose LVDrive, a Latent Visual representation enhanced VLA framework for autonomous driving. LVDrive introduces a future scene prediction task into the VLA paradigm, where future representations are learned entirely in a high-level latent space under auxiliary supervision from a pretrained vision backbone. Departing from inefficient autoregressive generation, we jointly model future scene and motion prediction within a unified embedding space, processed in a single forward pass to conduct the future-aware reasoning. We further design a two-stage trajectory decoding strategy that explicitly leverages the learned latent future representations to refine trajectory generation. Extensive experiments on the challenging Bench2Drive benchmark demonstrate that LVDrive achieves significant improvements in closed-loop driving performance, outperforming both action supervised methods and image-reconstruction-based world model approaches.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes LVDrive, a Vision-Language-Action (VLA) framework for autonomous driving that augments standard action supervision with a future scene prediction task. Future representations are learned entirely in a high-level latent space under auxiliary supervision from a pretrained vision backbone. The architecture jointly models future scene and motion prediction in a unified embedding space via a single forward pass, then applies a two-stage trajectory decoder that conditions on the learned latents to refine outputs. Experiments on the Bench2Drive benchmark are reported to show significant closed-loop performance gains over both action-supervised VLAs and image-reconstruction-based world models.

Significance. If the central claims hold after addressing the noted concerns, the work would offer a computationally efficient route to dense future-aware supervision within VLA models, avoiding the overhead of pixel-level reconstruction while still leveraging semantic priors. The single-pass joint modeling and explicit conditioning in the decoder represent a clean architectural contribution that could influence subsequent VLA designs for driving. The approach directly targets the underutilization of scene understanding in sparse-action regimes, which is a timely issue in end-to-end autonomy.

major comments (2)
  1. [Method (future scene prediction and two-stage decoder)] The central claim that performance gains on Bench2Drive arise from semantically meaningful future scene representations (rather than increased capacity or the auxiliary loss itself) is load-bearing, yet the manuscript provides no probing, visualization, or ablation that demonstrates the latents encode driving-relevant elements such as object trajectories, lane topology, or traffic rules instead of generic backbone statistics. This directly affects attribution of the reported improvements to the proposed future-aware mechanism.
  2. [Experiments and results] The experimental section reports 'significant improvements' and outperformance on Bench2Drive but does not include quantitative metrics with error bars, statistical significance tests, or ablations that isolate the contribution of the latent future representations versus baseline capacity increases. Without these controls, the strength of the empirical support for the weakest assumption remains unclear.
minor comments (2)
  1. [Method] Notation for the unified embedding space and the conditioning in the two-stage decoder could be clarified with an explicit diagram or equation reference to avoid ambiguity in how the latents are injected.
  2. [Abstract] The abstract would benefit from a single concrete performance delta or metric to give readers an immediate sense of the scale of improvement.
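
The probing asked for in major comment 1 could take the form of a linear probe: keep the latents fixed, fit a small classifier on a driving-relevant binary label, and report its accuracy. The sketch below is one assumed setup in plain Python; the choice of label (for example, whether a vehicle is present ahead), the hyperparameters, and the function names are hypothetical.

```python
import math


def _sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0.0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_linear_probe(latents, labels, epochs=200, lr=0.5):
    """Fit logistic regression on fixed latents with batch gradient descent."""
    dim = len(latents[0])
    n = len(latents)
    weights = [0.0] * dim
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(latents, labels):
            z = sum(w * xi for w, xi in zip(weights, x)) + bias
            err = _sigmoid(z) - y  # gradient of the log loss w.r.t. z
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def probe_accuracy(weights, bias, latents, labels):
    """Fraction of examples whose predicted label matches the true label."""
    correct = 0
    for x, y in zip(latents, labels):
        z = sum(w * xi for w, xi in zip(weights, x)) + bias
        correct += int((z > 0.0) == bool(y))
    return correct / len(labels)
```

A probe that scores well above one trained on baseline features would suggest the latents encode the labeled property. It would not by itself show that the property is what drives the closed-loop gains, so it complements rather than replaces the capacity-matched ablation.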

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and insightful comments on our manuscript. We address each major comment below and outline the revisions we will make to strengthen the empirical support and attribution of our results.

  1. Referee: [Method (future scene prediction and two-stage decoder)] The central claim that performance gains on Bench2Drive arise from semantically meaningful future scene representations (rather than increased capacity or the auxiliary loss itself) is load-bearing, yet the manuscript provides no probing, visualization, or ablation that demonstrates the latents encode driving-relevant elements such as object trajectories, lane topology, or traffic rules instead of generic backbone statistics. This directly affects attribution of the reported improvements to the proposed future-aware mechanism.

    Authors: We agree that direct evidence linking the learned latents to driving-specific semantics would strengthen attribution of the gains. The auxiliary supervision from the pretrained vision backbone is intended to promote semantic alignment rather than generic statistics, but we acknowledge the manuscript lacks explicit probing or visualizations to confirm this. In the revised version, we will add t-SNE visualizations of the latent space, probing classifiers for elements like object presence and lane topology, and an ablation comparing against a capacity-matched model without the future prediction objective. revision: yes

  2. Referee: [Experiments and results] The experimental section reports 'significant improvements' and outperformance on Bench2Drive but does not include quantitative metrics with error bars, statistical significance tests, or ablations that isolate the contribution of the latent future representations versus baseline capacity increases. Without these controls, the strength of the empirical support for the weakest assumption remains unclear.

    Authors: We concur that reporting error bars, statistical significance, and capacity-controlled ablations would improve the rigor of the results. The current experiments compare against both action-supervised VLAs and image-reconstruction baselines, but do not fully isolate capacity effects. In the revision, we will include standard deviations over multiple random seeds, paired t-tests or similar for key comparisons, and an additional ablation where baseline models are scaled to match LVDrive's parameter count while removing the latent future prediction component. revision: yes
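
The seed-level reporting promised in the second response can be sketched with the standard library alone. The code below computes a mean and sample standard deviation per method and an exact paired sign-flip permutation test, used here as a distribution-free stand-in for the "paired t-tests or similar" that the authors mention. The function names are hypothetical, and any scores passed in are the caller's own; none come from the paper.

```python
from itertools import product
from statistics import mean, stdev


def summarize(scores):
    """Return (mean, sample standard deviation) over per-seed scores."""
    return mean(scores), stdev(scores)


def paired_sign_flip_pvalue(scores_a, scores_b):
    """Exact two-sided paired permutation test on the summed difference.

    Under the null hypothesis the sign of each per-seed difference is
    arbitrary, so every one of the 2**n sign assignments is enumerated.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("scores must be paired seed by seed")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    at_least_as_extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        stat = abs(sum(s * d for s, d in zip(signs, diffs)))
        if stat >= observed - 1e-12:  # small tolerance for float rounding
            at_least_as_extreme += 1
    return at_least_as_extreme / total
```

The smallest two-sided p-value this test can return is 2 / 2**n for n paired seeds, so at least six seeds are needed before a result below 0.05 is even possible. That constraint is worth stating alongside any reported significance.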

Circularity Check

0 steps flagged

LVDrive derivation is self-contained with no reductions to fitted inputs or self-definitions

The paper introduces architectural components (latent-space future scene prediction under auxiliary backbone supervision, unified embedding for joint scene-motion modeling in one forward pass, and two-stage trajectory decoder) and evaluates them empirically on Bench2Drive. No equations, loss terms, or claimed predictions are shown to equal their own inputs by construction, nor does any load-bearing step rely on self-citation chains that collapse to unverified priors. The performance claims rest on benchmark comparisons rather than tautological re-labeling of fitted quantities.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The framework rests on standard machine learning assumptions about the utility of pretrained vision models for semantic supervision and the value of latent representations for efficient prediction. No free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)
  • domain assumption: A pretrained vision backbone can provide effective auxiliary supervision for learning semantically meaningful high-level scene representations in latent space.
    Invoked to justify the future scene prediction task departing from pixel-level reconstruction.

pith-pipeline@v0.9.0 · 5738 in / 1399 out tokens · 65022 ms · 2026-05-22T07:33:17.614193+00:00 · methodology

Reference graph

Works this paper leans on

67 extracted references · 67 canonical work pages · 19 internal anchors

  1. [1]

    Deep learning using rectified linear units (relu)

    Abien Fred Agarap. Deep learning using rectified linear units (relu). 2018. 15

  2. [2]

    Rabbat, Yann LeCun, and Nicolas Ballas

    Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael G. Rabbat, Yann LeCun, and Nicolas Ballas. Self-supervised learning from images with a joint- embedding predictive architecture.2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15619–15629, 2023. 2

  3. [3]

    V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

    Mahmoud Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Mojtaba Komeili, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, Sergio Arnaud, Abha Gejji, Ada Martin, Francois Hogan, Daniel Dugas, Piotr Bojanowski, Vasil Khalidov, Patrick Labatut, Francisco Massa, Marc Szafraniec, Kapil Krishnakumar, Yong Li, Xiaodong...

  4. [4]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond.arXiv preprint arXiv:2308.12966, 2023. 3

  5. [5]

    PaliGemma: A versatile 3B VLM for transfer

    Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel M. Salz, Maxim Neumann, Ibrahim M. Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, Thomas Unterthiner, Daniel Keysers, Skanda Koppula, Fangyu Liu, Adam Grycner, Alexey A. Gritsenko, Neil Houlsby, Manoj Kumar, Keran Rong, Julian Martin Eisenschlos, Rishabh Kab...

  6. [6]

    Pluto: Pushing the limit of imitation learning- based planning for autonomous driving.ArXiv, abs/2404.14327, 2024

    Jie Cheng, Yingbing Chen, and Qifeng Chen. Pluto: Pushing the limit of imitation learning- based planning for autonomous driving.ArXiv, abs/2404.14327, 2024. 6

  7. [7]

    Hint-ad: Holistically aligned interpretability in end-to-end autonomous driving

    Kairui Ding, Boyuan Chen, Yuchen Su, Huan ang Gao, Bu Jin, Chonghao Sima, Wuqiang Zhang, Xiaohui Li, Paul Barsch, Hongyang Li, and Hao Zhao. Hint-ad: Holistically aligned interpretability in end-to-end autonomous driving. InConference on Robot Learning, 2024. 3

  8. [8]

    López, and Vladlen Koltun

    Alexey Dosovitskiy, Germán Ros, Felipe Codevilla, Antonio M. López, and Vladlen Koltun. Carla: An open urban driving simulator. InConference on Robot Learning, 2017. 5

  9. [9]

    Taming transformers for high-resolution image synthesis, 2020

    Patrick Esser, Robin Rombach, and Björn Ommer. Taming transformers for high-resolution image synthesis, 2020. 6, 8, 15 10

  10. [10]

    Eva-02: A visual representation for neon genesis.Image Vis

    Yuxin Fang, Quan Sun, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. Eva-02: A visual representation for neon genesis.Image Vis. Comput., 149:105171, 2023. 6, 15

  11. [11]

    Orion: A holistic end-to-end autonomous driving framework by vision-language instructed action generation

    Haoyu Fu, Diankun Zhang, Zongchuang Zhao, Jianfeng Cui, Dingkang Liang, Chong Zhang, Dingyuan Zhang, Hongwei Xie, Bing Wang, and Xiang Bai. Orion: A holistic end-to-end autonomous driving framework by vision-language instructed action generation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 24823–24834, 2025. 3, 5, 6, 7, 8, 15

  12. [12]

    Minddrive: A vision-language-action model for autonomous driving via online reinforcement learning.arXiv Preprint arXiv:2512.13636, 2025

    Haoyu Fu, Diankun Zhang, Zongchuang Zhao, Jianfeng Cui, Hongwei Xie, Bing Wang, Guang Chen, Dingkang Liang, and Xiang Bai. Minddrive: A vision-language-action model for autonomous driving via online reinforcement learning.arXiv Preprint arXiv:2512.13636, 2025. 3, 6, 7

  13. [13]

    Vista: A generalizable driving world model with high fidelity and versatile controllability.ArXiv, abs/2405.17398, 2024

    Shenyuan Gao, Jiazhi Yang, Li Chen, Kashyap Chitta, Yihang Qiu, Andreas Geiger, Jun Zhang, and Hongyang Li. Vista: A generalizable driving world model with high fidelity and versatile controllability.ArXiv, abs/2405.17398, 2024. 3

  14. [14]

    World Models

    David R Ha and Jürgen Schmidhuber. World models.ArXiv, abs/1803.10122, 2018. 3

  15. [15]

    Mariam Hassan, Sebastian Stapf, Ahmad Rahimi, Pedro Martelleto Bressane Rezende, Yasaman Haghighi, David Brüggemann, Isinsu Katircioglu, Lin Zhang, Xiaoran Chen, Suman Saha, Marco Cannici, Elie Aljalbout, Botao Ye, Xi Wang, Aram Davtyan, Mathieu Salzmann, Davide Scaramuzza, Marc Pollefeys, Paolo Favaro, and Alexandre Alahi. Gem: A generalizable ego-vision...

  16. [16]

    Gaussian error linear units (gelus).arXiv: Learning, 2016

    Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (gelus).arXiv: Learning, 2016. 15

  17. [17]

    GAIA-1: A Generative World Model for Autonomous Driving

    Anthony Hu, Lloyd Russell, Hudson Yeo, Zak Murez, George Fedoseev, Alex Kendall, Jamie Shotton, and Gianluca Corrado. Gaia-1: A generative world model for autonomous driving. ArXiv, abs/2309.17080, 2023. 3

  18. [18]

    LoRA: Low-Rank Adaptation of Large Language Models

    J. Edward Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models.ArXiv, abs/2106.09685,

  19. [19]

    Vision-language-action models for autonomous driving: Past, present, and future.arXiv preprint arXiv:2512.16760, 2025

    Tianshuai Hu, Xiaolu Liu, Song Wang, Yiyao Zhu, Ao Liang, Lingdong Kong, Guoyang Zhao, Zeying Gong, Jun Cen, Zhiyu Huang, Xiaoshuai Hao, Linfeng Li, Hang Song, Xiangtai Li, Jun Ma, Shaojie Shen, Jianke Zhu, Dacheng Tao, Ziwei Liu, and Junwei Liang. Vision-language- action models for autonomous driving: Past, present, and future.ArXiv, abs/2512.16760, 2025. 1

  20. [20]

    Planning-oriented autonomous driving

    Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wenhai Wang, Lewei Lu, Xiaosong Jia, Qiang Liu, Jifeng Dai, Yu Qiao, and Hongyang Li. Planning-oriented autonomous driving. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023. 6, 7

  21. [21]

    Emma: End-to-end multimodal model for autonomous driving.Trans

    Jyh-Jing Hwang, Runsheng Xu, Hubert Lin, Wei-Chih Hung, Jingwei Ji, Kristy Choi, Di Huang, Tong He, Paul Covington, Benjamin Sapp, James Guo, Drago Anguelov, and Mingxing Tan. Emma: End-to-end multimodal model for autonomous driving.Trans. Mach. Learn. Res., 2025,

  22. [22]

    Driveworld-vla: Unified latent-space world modeling with vision-language-action for au- tonomous driving.ArXiv, abs/2602.06521, 2026

    Feiyang Jia, Lin Liu, Ziying Song, Caiyan Jia, Hangjun Ye, Xiaoshuai Hao, and Long Chen. Driveworld-vla: Unified latent-space world modeling with vision-language-action for au- tonomous driving.ArXiv, abs/2602.06521, 2026. 1, 2, 3

  23. [23]

    Xiaosong Jia, Yulu Gao, Li Chen, Junchi Yan, Patrick Langechuan Liu, and Hongyang Li. Driveadapter: Breaking the coupling barrier of perception and planning in end-to-end au- tonomous driving.2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 7919–7929, 2023. 6, 7 11

  24. [24]

    Think twice before driving: Towards scalable decoders for end-to-end autonomous driving

    Xiaosong Jia, Penghao Wu, Li Chen, Jiangwei Xie, Conghui He, Junchi Yan, and Hongyang Li. Think twice before driving: Towards scalable decoders for end-to-end autonomous driving. In CVPR, 2023. 6, 7

  25. [25]

    Bench2drive: Towards multi-ability benchmarking of closed-loop end-to-end autonomous driving

    Xiaosong Jia, Zhenjie Yang, Qifeng Li, Zhiyuan Zhang, and Junchi Yan. Bench2drive: Towards multi-ability benchmarking of closed-loop end-to-end autonomous driving. InNeurIPS 2024 Datasets and Benchmarks Track, 2024. 5

  26. [26]

    Drivetransformer: Unified trans- former for scalable end-to-end autonomous driving

    Xiaosong Jia, Junqi You, Zhiyuan Zhang, and Junchi Yan. Drivetransformer: Unified trans- former for scalable end-to-end autonomous driving. InThe Thirteenth International Conference on Learning Representations, 2025. 6, 7

  27. [27]

    Diffvla: Vision- language guided diffusion planning for autonomous driving.ArXiv, abs/2505.19381, 2025

    Anqing Jiang, Yu Gao, Zhigang Sun, Yiru Wang, Jijun Wang, Jinghao Chai, Qian Cao, Yuwen Heng, Hao Jiang, Zongzheng Zhang, Xianda Guo, Hao Sun, and Hao Zhao. Diffvla: Vision- language guided diffusion planning for autonomous driving.ArXiv, abs/2505.19381, 2025. 3

  28. [28]

    Vad: Vectorized scene representation for efficient autonomous driving.2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 8306–8316, 2023

    Bo Jiang, Shaoyu Chen, Qing Xu, Bencheng Liao, Jiajie Chen, Helong Zhou, Qian Zhang, Wenyu Liu, Chang Huang, and Xinggang Wang. Vad: Vectorized scene representation for efficient autonomous driving.2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 8306–8316, 2023. 5, 6, 7

  29. [29]

    Senna: Bridging Large Vision-Language Models and End-to-End Autonomous Driving

    Bo Jiang, Shaoyu Chen, Bencheng Liao, Xingyu Zhang, Wei Yin, Qian Zhang, Chang Huang, Wenyu Liu, and Xinggang Wang. Senna: Bridging large vision-language models and end-to-end autonomous driving.ArXiv, abs/2410.22313, 2024. 1, 3

  30. [30]

    Enhancing End-to-End Autonomous Driving with Latent World Model

    Yingyan Li, Lue Fan, Jiawei He, Yu-Quan Wang, Yuntao Chen, Zhaoxiang Zhang, and Tieniu Tan. Enhancing end-to-end autonomous driving with latent world model.ArXiv, abs/2406.08481,

  31. [31]

    DriveVLA-W0: World Models Amplify Data Scaling Law in Autonomous Driving

    Yingyan Li, Shuyao Shang, Weisong Liu, Bing Zhan, Haochen Wang, Yu-Quan Wang, Yun- tao Chen, Xiaoman Wang, Yasong An, Chufeng Tang, Lu Hou, Lue Fan, and Zhaoxiang Zhang. Drivevla-w0: World models amplify data scaling law in autonomous driving.ArXiv, abs/2510.12796, 2025. 1, 2, 3

  32. [32]

    ReCogDrive: A Reinforced Cognitive Framework for End-to-End Autonomous Driving

    Yongkang Li, Kaixin Xiong, Xiangyu Guo, Fang Li, Sixu Yan, Gangwei Xu, Lijun Zhou, Long Chen, Haiyang Sun, Bing Wang, et al. Recogdrive: A reinforced cognitive framework for end-to-end autonomous driving.arXiv preprint arXiv:2506.08052, 2025. 1, 3, 6, 7

  33. [33]

    Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving

    Bencheng Liao, Shaoyu Chen, Haoran Yin, Bo Jiang, Cheng Wang, Sixu Yan, Xinbang Zhang, Xiangyu Li, Ying Zhang, Qian Zhang, et al. Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 12037–12047, 2025. 3, 6

  34. [34]

    Futurex: Enhance end-to-end autonomous driving via latent chain-of-thought world model

    Hongbin Lin, Yiming Yang, Yifan Zhang, Chaoda Zheng, Jie Feng, Sheng Wang, Zhennan Wang, Shijia Chen, Boyang Wang, Yu Zhang, Xianming Liu, Shuguang Cui, and Zhen Li. Futurex: Enhance end-to-end autonomous driving via latent chain-of-thought world model. ArXiv, abs/2512.11226, 2025. 3

  35. [35]

    Visual instruction tuning, 2023

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning, 2023. 3

  36. [36]

    DriveVA: Video Action Models are Zero-Shot Drivers

    Mengmeng Liu, Diankun Zhang, Jiuming Liu, Jianfeng Cui, Hongwei Xie, Guang Chen, Hangjun Ye, Michael Ying Yang, Francesco Nex, and Hao Cheng. Driveva: Video action models are zero-shot drivers.arXiv preprint arXiv:2604.04198, 2026. 3

  37. [37]

    Unleashing vla potentials in autonomous driving via explicit learning from failures

    Yuechen Luo, Qimao Chen, Fang Li, Shaoqing Xu, Jaxin Liu, Ziying Song, Zhi-xin Yang, and Fuxi Wen. Unleashing vla potentials in autonomous driving via explicit learning from failures. arXiv preprint arXiv:2603.01063, 2026. 3

  38. [38]

    Last-vla: Thinking in latent spatio-temporal space for vision-language-action in autonomous driving.arXiv preprint arXiv:2603.01928, 2026

    Yuechen Luo, Fang Li, Shaoqing Xu, Yang Ji, Zehan Zhang, Bing Wang, Yuannan Shen, Jianwei Cui, Long Chen, Guang Chen, et al. Last-vla: Thinking in latent spatio-temporal space for vision-language-action in autonomous driving.arXiv preprint arXiv:2603.01928, 2026. 3 12

  39. [39]

    Xiaodong Mei, Sheng Wang, Jie Cheng, Yingbing Chen, and Dan Xu. Hamf: A hybrid attention- mamba framework for joint scene context understanding and future motion representation learning.2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4107–4114, 2025. 6

  40. [40]

    Alpamayo-R1: Bridging Reasoning and Action Prediction for Generalizable Autonomous Driving in the Long Tail

    NVIDIA, Yan Wang, Wenjie Luo, Junjie Bai, Yulong Cao, Tong Che, Ke Chen, Yuxiao Chen, Jenna Diamond, Yifan Ding, Wenhao Ding, Liang Feng, Greg Heinrich, Jack Huang, Peter Karkus, Boyi Li, Pinyi Li, Tsung-Yi Lin, Dongran Liu, Ming-Yu Liu, Langechuan Liu, Zhijian Liu, Jason Lu, Yunxiang Mao, Pavlo Molchanov, Lindsey Pavao, Zhenghao Peng, Mike Ranzinger, Ed ...

  41. [41]

    Carllava: Vision language models for camera- only closed-loop driving.ArXiv, abs/2406.10165, 2024

    Katrin Renz, Long Chen, Ana-Maria Marcu, Jan Hünermann, Benoît Hanotte, Alice Karnsund, Jamie Shotton, Elahe Arani, and Oleg Sinavski. Carllava: Vision language models for camera- only closed-loop driving.ArXiv, abs/2406.10165, 2024. 3

  42. [42]

    Katrin Renz, Long Chen, Elahe Arani, and Oleg Sinavski. Simlingo: Vision-only closed-loop autonomous driving with language-action alignment.2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11993–12003, 2025. 1, 3

  43. [43]

    Drivedpo: Policy learning via safety dpo for end-to-end autonomous driving.arXiv preprint arXiv:2509.17940, 2025

    Shuyao Shang, Yuntao Chen, Yu-Quan Wang, Yingyan Li, and Zhaoxiang Zhang. Drivedpo: Policy learning via safety dpo for end-to-end autonomous driving.ArXiv, abs/2509.17940, 2025. 6

  44. [44]

    Oriane Sim’eoni, Huy V . V o, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michael Ramamonjisoa, Francisco Massa, Daniel Haziza, Luca Wehrstedt, Jianyuan Wang, Timothée Darcet, Théo Moutakanni, Leonel Sentana, Claire Roberts, Andrea Vedaldi, Jamie Tolan, John Brandt, Camille Couprie, Juli...

  45. [45]

    Don’t shake the wheel: Momentum-aware planning in end-to-end autonomous driving. 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 22432–22441, 2025

    Ziying Song, Caiyan Jia, Lin Liu, Hongyu Pan, Yongchang Zhang, Junming Wang, Xingyu Zhang, Shaoqing Xu, Lei Yang, and Yadan Luo. Don’t shake the wheel: Momentum-aware planning in end-to-end autonomous driving. 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 22432–22441, 2025. 6

  46. [46]

    Latent Chain-of-Thought World Modeling for End-to-End Driving

    Shuhan Tan, Kashyap Chitta, Yuxiao Chen, Ran Tian, Yurong You, Yan Wang, Wenjie Luo, Yulong Cao, Philipp Krahenbuhl, Marco Pavone, et al. Latent chain-of-thought world modeling for end-to-end driving. arXiv preprint arXiv:2512.10226, 2025. 3

  47. [47]

    DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models

    Xiaoyu Tian, Junru Gu, Bailin Li, Yicheng Liu, Chenxu Hu, Yang Wang, Kun Zhan, Peng Jia, Xianpeng Lang, and Hang Zhao. Drivevlm: The convergence of autonomous driving and large vision-language models. ArXiv, abs/2402.12289, 2024. 3

  48. [48]

    Diffad: A unified diffusion modeling approach for autonomous driving, 2025

    Tao Wang, Cong Zhang, Xingguang Qu, Kun Li, Weiwei Liu, and Chang Huang. Diffad: A unified diffusion modeling approach for autonomous driving, 2025. 6, 7

  49. [49]

    Drivemlm: aligning multi-modal large language models with behavioral planning states for autonomous driving. Visual Intelligence, 3, 2023

    Wenhai Wang, Jiangwei Xie, Chuanyan Hu, Haoming Zou, Jianan Fan, Wenwen Tong, Yang Wen, Silei Wu, Hanming Deng, Zhiqi Li, Hao Tian, Lewei Lu, Xizhou Zhu, Xiaogang Wang, Yu Qiao, and Jifeng Dai. Drivemlm: aligning multi-modal large language models with behavioral planning states for autonomous driving. Visual Intelligence, 3, 2023. 3

  50. [50]

    Emu3: Next-Token Prediction is All You Need

    Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, Yingli Zhao, Yulong Ao, Xuebin Min, Tao Li, Boya Wu, Bo Zhao, Bowen Zhang, Lian zi Wang, Guang Liu, Zheqi He, Xi Yang, Jingjing Liu, Yonghua Lin, Tiejun Huang, and Zhongyuan Wang. Emu3: Next-token prediction is all you need. ArXiv, ...

  51. [51]

    Trajectory-guided control prediction for end-to-end autonomous driving: A simple yet strong baseline

    Penghao Wu, Xiaosong Jia, Li Chen, Junchi Yan, Hongyang Li, and Yu Qiao. Trajectory-guided control prediction for end-to-end autonomous driving: A simple yet strong baseline. In NeurIPS,

  52. [52]

    Unidrive-wm: Unified understanding, planning and generation world model for autonomous driving

    Zhexiao Xiong, Xin Ye, B. Yaman, Shen Cheng, Yiren Lu, Jingru Luo, Nathan Jacobs, and Liu Ren. Unidrive-wm: Unified understanding, planning and generation world model for autonomous driving.ArXiv, abs/2601.04453, 2026. 2, 3, 6, 7

  53. [53]

    Occ-llm: Enhancing autonomous driving with occupancy-based large language models. 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 8434–8441, 2025

    Tianshuo Xu, Hao Lu, Xu Yan, Yingjie Cai, Bingbing Liu, and Yingcong Chen. Occ-llm: Enhancing autonomous driving with occupancy-based large language models. 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 8434–8441, 2025. 3

  54. [54]

    Drivegpt4: Interpretable end-to-end autonomous driving via large language model. IEEE Robotics and Automation Letters, 9:8186–8193, 2023

    Zhenhua Xu, Yujia Zhang, Enze Xie, Zhen Zhao, Yong Guo, Kwan-Yee Kenneth Wong, Zhenguo Li, and Hengshuang Zhao. Drivegpt4: Interpretable end-to-end autonomous driving via large language model. IEEE Robotics and Automation Letters, 9:8186–8193, 2023. 1, 3

  55. [55]

    Visual point cloud forecasting enables scalable autonomous driving. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14673–14684, 2023

    Zetong Yang, Li Chen, Yanan Sun, and Hongyang Li. Visual point cloud forecasting enables scalable autonomous driving. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14673–14684, 2023. 3

  56. [56]

    DriveMoE: Mixture-of-Experts for Vision-Language-Action Model in End-to-End Autonomous Driving

    Zhenjie Yang, Yilin Chai, Xiaosong Jia, Qifeng Li, Yuqian Shao, Xuekai Zhu, Haisheng Su, and Junchi Yan. Drivemoe: Mixture-of-experts for vision-language-action model in end-to-end autonomous driving. ArXiv, abs/2505.16278, 2025. 3, 6, 7

  57. [57]

    Raw2drive: Reinforcement learning with aligned world models for end-to-end autonomous driving (in CARLA v2)

    Zhenjie Yang, Xiaosong Jia, Qifeng Li, Xue Yang, Maoqing Yao, and Junchi Yan. Raw2drive: Reinforcement learning with aligned world models for end-to-end autonomous driving (in CARLA v2). In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. 6, 7

  58. [58]

    FutureSightDrive: Thinking Visually with Spatio-Temporal CoT for Autonomous Driving

    Shuang Zeng, Xinyuan Chang, Mengwei Xie, Xinran Liu, Yifan Bai, Zheng Pan, Mu Xu, and Xing Wei. Futuresightdrive: Thinking visually with spatio-temporal cot for autonomous driving. ArXiv, abs/2505.17685, 2025. 2, 3

  59. [59]

    Epona: Autoregressive diffusion world model for autonomous driving. ArXiv, abs/2506.24113, 2025

    Kaiwen Zhang, Zhenyu Tang, Xiaotao Hu, Xingang Pan, Xiaoyang Guo, Yuan Liu, Jing Huang, Li Yuan, Qian Zhang, Xiaoxiao Long, Xun Cao, and Wei Yin. Epona: Autoregressive diffusion world model for autonomous driving. ArXiv, abs/2506.24113, 2025. 2, 3

  60. [60]

    Copilot4d: Learning unsupervised world models for autonomous driving via discrete diffusion

    Lunjun Zhang, Yuwen Xiong, Ze Yang, Sergio Casas, Rui Hu, and Raquel Urtasun. Copilot4d: Learning unsupervised world models for autonomous driving via discrete diffusion. In International Conference on Learning Representations, 2023. 3

  61. [61]

    Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Haotong Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena. ArXiv, abs/2306.05685, 2023. 6

  62. [62]

    Genad: Generative end-to-end autonomous driving

    Wenzhao Zheng, Ruiqi Song, Xianda Guo, and Long Chen. Genad: Generative end-to-end autonomous driving. In European Conference on Computer Vision, 2024. 5

  63. [63]

    Doe-1: Closed-loop autonomous driving with large world model. ArXiv, abs/2412.09627, 2024

    Wenzhao Zheng, Zetian Xia, Yuanhui Huang, Sicheng Zuo, Jie Zhou, and Jiwen Lu. Doe-1: Closed-loop autonomous driving with large world model. ArXiv, abs/2412.09627, 2024. 2, 3

  64. [64]

    World4drive: End-to-end autonomous driving via intention-aware physical latent world model. ArXiv, abs/2507.00603,

    Yupeng Zheng, Pengxuan Yang, Zebin Xing, Qichao Zhang, Yuhang Zheng, Yinfeng Gao, Pengfei Li, Teng Zhang, Zhongpu Xia, Peng Jia, and Dongbin Zhao. World4drive: End-to-end autonomous driving via intention-aware physical latent world model. ArXiv, abs/2507.00603,

  65. [65]

    Opendrivevla: Towards end-to-end autonomous driving with large vision language action model

    Xingcheng Zhou, Xu Han, Feng Yang, Yunpu Ma, and Alois Knoll. Opendrivevla: Towards end-to-end autonomous driving with large vision language action model. ArXiv, abs/2503.23463,

  66. [66]

    Hermes: A unified self-driving world model for simultaneous 3d scene understanding and generation

    Xin Zhou, Dingkang Liang, Sifan Tu, Xiwu Chen, Yikang Ding, Dingyuan Zhang, Feiyang Tan, Hengshuang Zhao, and Xiang Bai. Hermes: A unified self-driving world model for simultaneous 3d scene understanding and generation. ArXiv, abs/2501.14729, 2025. 2

  67. [67]

    AutoVLA: A Vision-Language-Action Model for End-to-End Autonomous Driving with Adaptive Reasoning and Reinforcement Fine-Tuning

    Zewei Zhou, Tianhui Cai, Seth Z. Zhao, Yun Zhang, Zhiyu Huang, Bolei Zhou, and Jiaqi Ma. Autovla: A vision-language-action model for end-to-end autonomous driving with adaptive reasoning and reinforcement fine-tuning. ArXiv, abs/2506.13757, 2025. 3

A Technical appendices and supplementary material

We first provide more detailed implementations of our LV...