LVDrive: Latent Visual Representation Enhanced Vision-Language-Action Autonomous Driving Model

Dan Xu; Diankun Zhang; Guang Chen; Hangjun Ye; Hongwei Xie; Xiaodong Mei

arxiv: 2605.22089 · v1 · pith:GVQQEZRSnew · submitted 2026-05-21 · 💻 cs.CV · cs.AI

Pith reviewed 2026-05-22 07:33 UTC · model grok-4.3

keywords autonomous driving · vision-language-action models · latent visual representations · future scene prediction · trajectory generation · Bench2Drive benchmark · world models

The pith

LVDrive improves VLA autonomous driving by predicting future scenes in high-level latent space instead of reconstructing pixels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing vision-language-action models for end-to-end driving rely on sparse action labels or pixel-level world models that reconstruct images. LVDrive adds a future scene prediction task learned entirely in latent space with auxiliary supervision from a pretrained vision backbone. It jointly predicts future scenes and motion in one unified embedding space and single forward pass, then refines trajectories with a two-stage decoder that uses those latent representations. Experiments on the Bench2Drive benchmark show better closed-loop driving performance than both pure action-supervised baselines and image-reconstruction world models. A sympathetic reader would care because this suggests semantic future awareness can be obtained more efficiently than full visual reconstruction.

Core claim

LVDrive introduces a future scene prediction task into the VLA paradigm where future representations are learned entirely in a high-level latent space under auxiliary supervision from a pretrained vision backbone. It jointly models future scene and motion prediction within a unified embedding space processed in a single forward pass to conduct future-aware reasoning, and designs a two-stage trajectory decoding strategy that explicitly leverages the learned latent future representations to refine trajectory generation.

What carries the argument

Future scene prediction task in high-level latent space under auxiliary supervision from a pretrained vision backbone, jointly modeled with motion prediction in a unified embedding space processed in one forward pass.
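As a rough illustration of what supervising future representations in latent space can look like, the sketch below scores predicted future tokens against target features by mean cosine distance. The loss form, the function names `cosine_distance` and `latent_future_loss`, and the toy vectors are all assumptions made for illustration; the summary above does not state which loss LVDrive uses.

```python
import math

def cosine_distance(pred, target):
    """Return 1 - cosine similarity between two equal-length vectors."""
    dot = sum(p * t for p, t in zip(pred, target))
    norm_p = math.sqrt(sum(p * p for p in pred))
    norm_t = math.sqrt(sum(t * t for t in target))
    if norm_p == 0.0 or norm_t == 0.0:
        return 1.0  # treat a zero vector as maximally dissimilar
    return 1.0 - dot / (norm_p * norm_t)

def latent_future_loss(pred_tokens, target_tokens):
    """Mean cosine distance over paired tokens (assumed loss, not the paper's).

    pred_tokens: latents the driving model predicts for a future frame.
    target_tokens: features a frozen pretrained backbone would extract from
        the real future frame (hypothetical stand-in, treated as constants).
    """
    if len(pred_tokens) != len(target_tokens):
        raise ValueError("token counts must match")
    distances = [cosine_distance(p, t) for p, t in zip(pred_tokens, target_tokens)]
    return sum(distances) / len(distances)

# Toy example with two 3-dimensional tokens (made-up numbers).
target = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
aligned = [[3.0, 0.0, 0.0], [0.0, 0.5, 0.0]]     # same directions, other scale
orthogonal = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]  # unrelated directions

loss_aligned = latent_future_loss(aligned, target)
loss_orthogonal = latent_future_loss(orthogonal, target)
```

No pixels are decoded anywhere in this objective: the target is a feature vector, so the prediction head only has to match semantic directions rather than reconstruct an image.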

If this is right

LVDrive achieves significant improvements in closed-loop driving performance on the Bench2Drive benchmark.
It outperforms both action-supervised VLA methods and image-reconstruction-based world model approaches.
Joint modeling of future scene and motion in a single forward pass enables future-aware reasoning without autoregressive generation.
The two-stage trajectory decoder explicitly uses learned latent future representations to refine outputs.
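One way to picture a coarse-then-refine decoder that reads from future latents is the toy sketch below. It is not the paper's decoder: the evenly spaced coarse stage, the dot-product attention, the `gain` parameter, and every function name are hypothetical choices made only to show how a second stage could condition on predicted latents.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, tokens):
    """Dot-product attention: weighted average of tokens, weights from query."""
    weights = softmax([sum(q * t for q, t in zip(query, tok)) for tok in tokens])
    dim = len(tokens[0])
    return [sum(w * tok[i] for w, tok in zip(weights, tokens)) for i in range(dim)]

def coarse_stage(motion_feature, num_waypoints):
    """Stage 1 (hypothetical): spread a 2-D displacement evenly over waypoints."""
    dx, dy = motion_feature[0], motion_feature[1]
    return [(dx * (k + 1) / num_waypoints, dy * (k + 1) / num_waypoints)
            for k in range(num_waypoints)]

def refine_stage(coarse, motion_feature, future_latents, gain=0.1):
    """Stage 2 (hypothetical): shift every waypoint by a residual that is read
    from the future latents through attention."""
    context = attend(motion_feature, future_latents)
    return [(x + gain * context[0], y + gain * context[1]) for x, y in coarse]

# Made-up inputs: one 2-D motion feature and two 2-D future latent tokens.
motion_feature = [4.0, 0.0]
future_latents = [[1.0, 2.0], [-1.0, 0.0]]

coarse = coarse_stage(motion_feature, num_waypoints=4)
refined = refine_stage(coarse, motion_feature, future_latents)
```

Both stages here run once per call, with no step-by-step rollout, which mirrors the single-forward-pass idea only at the level of control flow.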

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This latent-space approach could reduce compute in other VLA robotics tasks by avoiding full image reconstruction.
It raises the question of how much visual detail is truly needed for planning versus high-level semantic forecasts.
Testing the same latent prediction on real-world driving datasets would check whether simulation gains transfer.
The unified embedding space might allow tighter integration with language instructions for conditional driving behaviors.

Load-bearing premise

Future scene representations learned entirely in a high-level latent space under auxiliary supervision from a pretrained vision backbone will provide semantically meaningful information that improves the VLA model's future-aware reasoning and trajectory generation for driving.

What would settle it

An ablation that removes the latent future prediction task and auxiliary supervision but shows equal or better closed-loop metrics on Bench2Drive would falsify the claim that these latent representations drive the performance gains.
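The falsification test above reduces to a simple comparison once both runs are scored. The sketch below is a minimal version of that check, assuming every metric is higher-is-better; the metric names and all numbers are placeholders for illustration, not results from the paper.

```python
def ablation_falsifies_claim(full, ablated, tolerance=0.0):
    """Return True if the ablated model matches or beats the full model on
    every metric, which would undercut the claim that the latent future
    task drives the gains. Higher is assumed to be better for all metrics."""
    if full.keys() != ablated.keys():
        raise ValueError("both runs must report the same metrics")
    return all(ablated[name] >= full[name] - tolerance for name in full)

# Placeholder numbers for illustration only; not results from the paper.
full_model = {"driving_score": 0.80, "success_rate": 0.60}
ablated_worse = {"driving_score": 0.70, "success_rate": 0.50}
ablated_equal = {"driving_score": 0.80, "success_rate": 0.61}

verdict_worse = ablation_falsifies_claim(full_model, ablated_worse)
verdict_equal = ablation_falsifies_claim(full_model, ablated_equal)
```

A non-zero `tolerance` would let run-to-run noise count as "equal", which matters in closed-loop evaluation where repeated runs rarely give identical scores.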

Figures

Figures reproduced from arXiv: 2605.22089 by Dan Xu, Diankun Zhang, Guang Chen, Hangjun Ye, Hongwei Xie, Xiaodong Mei.

**Figure 1.** The comparison of different VLA paradigms. Standard VLA approaches, as depicted in (a), rely on sparse action supervision. Our LVDrive, illustrated in (c), performs the future visual and action representation learning jointly. Unlike VLA with the world modeling paradigm in (b), LVDrive predicts future scenes entirely in latent space, capturing rich semantic features without pixel-level reconstruction. The …

**Figure 2.** Overview of LVDrive. LVDrive is a VLA framework that unifies latent future scene representation learning and motion planning, with dense auxiliary supervision provided by a pretrained vision backbone. Given multi-view images, the model encodes current and historical scene features and performs future-aware reasoning to predict both latent visual representations and motion features in a single forward pass…

**Figure 3.** Qualitative results of our LVDrive and Mbase in an Overtaking scenario from Bench2Drive. The ego vehicle encounters an accident ahead that blocks its driving lane, while a steady stream of oncoming traffic occupies the adjacent lane. The blue line denotes the generated trajectory. The ego vehicle controlled by Mbase becomes immobilized at the accident site. In contrast, our LVDrive successfully and smoothl…

**Figure 4.** Qualitative results of our LVDrive and Mbase in an Emergency Brake scenario. The ego vehicle is required to first perform an unprotected left turn at an intersection without a traffic light, then yield to a bicycle crossing its path. Upon encountering the crossing bicycle, the ego vehicle should execute an emergency brake, wait for the bicycle to clear the road, and subsequently resume driving. The ego veh…

**Figure 5.** Qualitative results of our LVDrive and Mbase in a Merging scenario. Navigating highway exits and merging onto narrow roads requires precise perception of the road layout and fine-grained trajectory planning. The ego vehicle of Mbase fails to capture the exact road boundary, leading to a lateral deviation that results in a collision with the guardrail. In contrast, LVDrive accurately perceives the off-ramp …

**Figure 6.** Qualitative results of our LVDrive and Mbase in a Traffic Sign scenario. The ego vehicle is required to make a right turn at a non-signalized intersection without traffic lights or signs, while avoiding collisions with surrounding vehicles. The ego vehicle controlled by Mbase plans an inaccurate trajectory that deviates from the intended route and ultimately collides with a traffic sign. In contrast, LVDri…

**Figure 7.** Qualitative results of our LVDrive and Mbase in an Overtaking scenario. The ego vehicle drives forward in its lane, and the front car in the adjacent lane stops and opens the door, which blocks the driving lane. The ego vehicle of Mbase stops and gets stuck in the place. In contrast, our LVDrive successfully plans the safe trajectory to bypass the front car with the open door and drives forward continuousl…

**Figure 8.** Failure case of our LVDrive in a Give Way scenario. The ego vehicle is required to yield to the emergency vehicle that approaches from behind. LVDrive maintains the straight route and fails to yield to the ambulance.

Original abstract

Vision-Language-Action (VLA) models have emerged as a promising framework for end-to-end autonomous driving. However, existing VLAs typically rely on sparse action supervision, which underutilizes their powerful scene understanding and reasoning capabilities. Recent attempts to incorporate dense visual supervision via world modeling often overemphasize pixel-level image reconstruction, neglecting semantically meaningful scene representation learning. In this work, we propose LVDrive, a Latent Visual representation enhanced VLA framework for autonomous driving. LVDrive introduces a future scene prediction task into the VLA paradigm, where future representations are learned entirely in a high-level latent space under auxiliary supervision from a pretrained vision backbone. Departing from inefficient autoregressive generation, we jointly model future scene and motion prediction within a unified embedding space, processed in a single forward pass to conduct the future-aware reasoning. We further design a two-stage trajectory decoding strategy that explicitly leverages the learned latent future representations to refine trajectory generation. Extensive experiments on the challenging Bench2Drive benchmark demonstrate that LVDrive achieves significant improvements in closed-loop driving performance, outperforming both action supervised methods and image-reconstruction-based world model approaches.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

LVDrive adds latent future scene prediction to VLA driving models via joint embedding and two-stage decoding, but the abstract leaves the performance gains and semantic value of those latents under-supported.

The main point is that LVDrive tries to improve VLA models for autonomous driving by adding a future scene prediction task entirely in latent space, supervised by a pretrained vision backbone. It jointly handles scene and motion in one forward pass inside a shared embedding and then applies a two-stage decoder that conditions trajectory output on those future latents. This moves away from both pure action supervision and pixel-level world model reconstruction, which is the clearest departure from prior work in the area.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes LVDrive, a Vision-Language-Action (VLA) framework for autonomous driving that augments standard action supervision with a future scene prediction task. Future representations are learned entirely in a high-level latent space under auxiliary supervision from a pretrained vision backbone. The architecture jointly models future scene and motion prediction in a unified embedding space via a single forward pass, then applies a two-stage trajectory decoder that conditions on the learned latents to refine outputs. Experiments on the Bench2Drive benchmark are reported to show significant closed-loop performance gains over both action-supervised VLAs and image-reconstruction-based world models.

Significance. If the central claims hold after addressing the noted concerns, the work would offer a computationally efficient route to dense future-aware supervision within VLA models, avoiding the overhead of pixel-level reconstruction while still leveraging semantic priors. The single-pass joint modeling and explicit conditioning in the decoder represent a clean architectural contribution that could influence subsequent VLA designs for driving. The approach directly targets the underutilization of scene understanding in sparse-action regimes, which is a timely issue in end-to-end autonomy.

major comments (2)

[Method (future scene prediction and two-stage decoder)] The central claim that performance gains on Bench2Drive arise from semantically meaningful future scene representations (rather than increased capacity or the auxiliary loss itself) is load-bearing, yet the manuscript provides no probing, visualization, or ablation that demonstrates the latents encode driving-relevant elements such as object trajectories, lane topology, or traffic rules instead of generic backbone statistics. This directly affects attribution of the reported improvements to the proposed future-aware mechanism.
[Experiments and results] The experimental section reports 'significant improvements' and outperformance on Bench2Drive but does not include quantitative metrics with error bars, statistical significance tests, or ablations that isolate the contribution of the latent future representations versus baseline capacity increases. Without these controls, the strength of the empirical support for the weakest assumption remains unclear.

minor comments (2)

[Method] Notation for the unified embedding space and the conditioning in the two-stage decoder could be clarified with an explicit diagram or equation reference to avoid ambiguity in how the latents are injected.
[Abstract] The abstract would benefit from a single concrete performance delta or metric to give readers an immediate sense of the scale of improvement.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and insightful comments on our manuscript. We address each major comment below and outline the revisions we will make to strengthen the empirical support and attribution of our results.

Referee: [Method (future scene prediction and two-stage decoder)] The central claim that performance gains on Bench2Drive arise from semantically meaningful future scene representations (rather than increased capacity or the auxiliary loss itself) is load-bearing, yet the manuscript provides no probing, visualization, or ablation that demonstrates the latents encode driving-relevant elements such as object trajectories, lane topology, or traffic rules instead of generic backbone statistics. This directly affects attribution of the reported improvements to the proposed future-aware mechanism.

Authors: We agree that direct evidence linking the learned latents to driving-specific semantics would strengthen attribution of the gains. The auxiliary supervision from the pretrained vision backbone is intended to promote semantic alignment rather than generic statistics, but we acknowledge the manuscript lacks explicit probing or visualizations to confirm this. In the revised version, we will add t-SNE visualizations of the latent space, probing classifiers for elements like object presence and lane topology, and an ablation comparing against a capacity-matched model without the future prediction objective.

revision: yes

Referee: [Experiments and results] The experimental section reports 'significant improvements' and outperformance on Bench2Drive but does not include quantitative metrics with error bars, statistical significance tests, or ablations that isolate the contribution of the latent future representations versus baseline capacity increases. Without these controls, the strength of the empirical support for the weakest assumption remains unclear.

Authors: We concur that reporting error bars, statistical significance, and capacity-controlled ablations would improve the rigor of the results. The current experiments compare against both action-supervised VLAs and image-reconstruction baselines, but do not fully isolate capacity effects. In the revision, we will include standard deviations over multiple random seeds, paired t-tests or similar for key comparisons, and an additional ablation where baseline models are scaled to match LVDrive's parameter count while removing the latent future prediction component.

revision: yes
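The seed-level statistics mentioned in this response can be computed with the Python standard library alone. The sketch below reports the mean and sample standard deviation across seeds and a paired t statistic for two methods evaluated on the same seeds. The helper names and the score lists are placeholders for illustration; they are not numbers from the paper.

```python
import math
import statistics

def summarize(scores):
    """Mean and sample standard deviation across seeds."""
    return statistics.mean(scores), statistics.stdev(scores)

def paired_t_statistic(a, b):
    """t statistic for paired samples a[i], b[i] (same seed or same routes).

    Degrees of freedom are len(a) - 1; the p-value comes from a t table or
    a statistics package.
    """
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    spread = statistics.stdev(diffs)
    if spread == 0.0:
        raise ValueError("all paired differences are identical")
    return statistics.mean(diffs) / (spread / math.sqrt(len(diffs)))

# Placeholder per-seed scores for illustration only; not results from the paper.
method_scores = [5.0, 7.0, 6.0, 8.0]
baseline_scores = [4.0, 5.0, 5.0, 5.0]

mean_method, std_method = summarize(method_scores)
t_stat = paired_t_statistic(method_scores, baseline_scores)
```

Pairing by seed removes variance shared by both methods, so it is usually the more sensitive choice when the same evaluation routes are reused across models.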

Circularity Check

0 steps flagged

LVDrive derivation is self-contained with no reductions to fitted inputs or self-definitions

The paper introduces architectural components (latent-space future scene prediction under auxiliary backbone supervision, unified embedding for joint scene-motion modeling in one forward pass, and two-stage trajectory decoder) and evaluates them empirically on Bench2Drive. No equations, loss terms, or claimed predictions are shown to equal their own inputs by construction, nor does any load-bearing step rely on self-citation chains that collapse to unverified priors. The performance claims rest on benchmark comparisons rather than tautological re-labeling of fitted quantities.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on standard machine learning assumptions about the utility of pretrained vision models for semantic supervision and the value of latent representations for efficient prediction. No free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)

domain assumption: A pretrained vision backbone can provide effective auxiliary supervision for learning semantically meaningful high-level scene representations in latent space.
Invoked to justify the future scene prediction task departing from pixel-level reconstruction.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

The tag after each theorem gives the relation between the paper passage and the cited Recognition theorem.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: future representations are learned entirely in a high-level latent space under auxiliary supervision from a pretrained vision backbone... jointly model future scene and motion prediction within a unified embedding space... two-stage trajectory decoding strategy

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: LVDrive achieves significant improvements in closed-loop driving performance on Bench2Drive

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

67 extracted references · 67 canonical work pages · 19 internal anchors

[1]

Deep learning using rectified linear units (relu)

Abien Fred Agarap. Deep learning using rectified linear units (relu). 2018. 15

work page 2018
[2]

Rabbat, Yann LeCun, and Nicolas Ballas

Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael G. Rabbat, Yann LeCun, and Nicolas Ballas. Self-supervised learning from images with a joint- embedding predictive architecture.2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15619–15629, 2023. 2

work page 2023
[3]

V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

Mahmoud Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Mojtaba Komeili, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, Sergio Arnaud, Abha Gejji, Ada Martin, Francois Hogan, Daniel Dugas, Piotr Bojanowski, Vasil Khalidov, Patrick Labatut, Francisco Massa, Marc Szafraniec, Kapil Krishnakumar, Yong Li, Xiaodong...

work page internal anchor Pith review Pith/arXiv arXiv
[4]

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond.arXiv preprint arXiv:2308.12966, 2023. 3

work page internal anchor Pith review Pith/arXiv arXiv 2023
[5]

PaliGemma: A versatile 3B VLM for transfer

Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel M. Salz, Maxim Neumann, Ibrahim M. Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, Thomas Unterthiner, Daniel Keysers, Skanda Koppula, Fangyu Liu, Adam Grycner, Alexey A. Gritsenko, Neil Houlsby, Manoj Kumar, Keran Rong, Julian Martin Eisenschlos, Rishabh Kab...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[6]

Pluto: Pushing the limit of imitation learning- based planning for autonomous driving.ArXiv, abs/2404.14327, 2024

Jie Cheng, Yingbing Chen, and Qifeng Chen. Pluto: Pushing the limit of imitation learning- based planning for autonomous driving.ArXiv, abs/2404.14327, 2024. 6

work page arXiv 2024
[7]

Hint-ad: Holistically aligned interpretability in end-to-end autonomous driving

Kairui Ding, Boyuan Chen, Yuchen Su, Huan ang Gao, Bu Jin, Chonghao Sima, Wuqiang Zhang, Xiaohui Li, Paul Barsch, Hongyang Li, and Hao Zhao. Hint-ad: Holistically aligned interpretability in end-to-end autonomous driving. InConference on Robot Learning, 2024. 3

work page 2024
[8]

López, and Vladlen Koltun

Alexey Dosovitskiy, Germán Ros, Felipe Codevilla, Antonio M. López, and Vladlen Koltun. Carla: An open urban driving simulator. InConference on Robot Learning, 2017. 5

work page 2017
[9]

Taming transformers for high-resolution image synthesis, 2020

Patrick Esser, Robin Rombach, and Björn Ommer. Taming transformers for high-resolution image synthesis, 2020. 6, 8, 15 10

work page 2020
[10]

Eva-02: A visual representation for neon genesis.Image Vis

Yuxin Fang, Quan Sun, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. Eva-02: A visual representation for neon genesis.Image Vis. Comput., 149:105171, 2023. 6, 15

work page 2023
[11]

Orion: A holistic end-to-end autonomous driving framework by vision-language instructed action generation

Haoyu Fu, Diankun Zhang, Zongchuang Zhao, Jianfeng Cui, Dingkang Liang, Chong Zhang, Dingyuan Zhang, Hongwei Xie, Bing Wang, and Xiang Bai. Orion: A holistic end-to-end autonomous driving framework by vision-language instructed action generation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 24823–24834, 2025. 3, 5, 6, 7, 8, 15

work page 2025
[12]

Minddrive: A vision-language-action model for autonomous driving via online reinforcement learning.arXiv Preprint arXiv:2512.13636, 2025

Haoyu Fu, Diankun Zhang, Zongchuang Zhao, Jianfeng Cui, Hongwei Xie, Bing Wang, Guang Chen, Dingkang Liang, and Xiang Bai. Minddrive: A vision-language-action model for autonomous driving via online reinforcement learning.arXiv Preprint arXiv:2512.13636, 2025. 3, 6, 7

work page arXiv 2025
[13]

Vista: A generalizable driving world model with high fidelity and versatile controllability.ArXiv, abs/2405.17398, 2024

Shenyuan Gao, Jiazhi Yang, Li Chen, Kashyap Chitta, Yihang Qiu, Andreas Geiger, Jun Zhang, and Hongyang Li. Vista: A generalizable driving world model with high fidelity and versatile controllability.ArXiv, abs/2405.17398, 2024. 3

work page arXiv 2024
[14]

World Models

David R Ha and Jürgen Schmidhuber. World models.ArXiv, abs/1803.10122, 2018. 3

work page internal anchor Pith review Pith/arXiv arXiv 2018
[15]

Mariam Hassan, Sebastian Stapf, Ahmad Rahimi, Pedro Martelleto Bressane Rezende, Yasaman Haghighi, David Brüggemann, Isinsu Katircioglu, Lin Zhang, Xiaoran Chen, Suman Saha, Marco Cannici, Elie Aljalbout, Botao Ye, Xi Wang, Aram Davtyan, Mathieu Salzmann, Davide Scaramuzza, Marc Pollefeys, Paolo Favaro, and Alexandre Alahi. Gem: A generalizable ego-vision...

work page 2025
[16]

Gaussian error linear units (gelus).arXiv: Learning, 2016

Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (gelus).arXiv: Learning, 2016. 15

work page 2016
[17]

GAIA-1: A Generative World Model for Autonomous Driving

Anthony Hu, Lloyd Russell, Hudson Yeo, Zak Murez, George Fedoseev, Alex Kendall, Jamie Shotton, and Gianluca Corrado. Gaia-1: A generative world model for autonomous driving. ArXiv, abs/2309.17080, 2023. 3

work page internal anchor Pith review Pith/arXiv arXiv 2023
[18]

LoRA: Low-Rank Adaptation of Large Language Models

J. Edward Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models.ArXiv, abs/2106.09685,

work page internal anchor Pith review Pith/arXiv arXiv
[19]

Vision-language-action models for autonomous driving: Past, present, and future.arXiv preprint arXiv:2512.16760, 2025

Tianshuai Hu, Xiaolu Liu, Song Wang, Yiyao Zhu, Ao Liang, Lingdong Kong, Guoyang Zhao, Zeying Gong, Jun Cen, Zhiyu Huang, Xiaoshuai Hao, Linfeng Li, Hang Song, Xiangtai Li, Jun Ma, Shaojie Shen, Jianke Zhu, Dacheng Tao, Ziwei Liu, and Junwei Liang. Vision-language- action models for autonomous driving: Past, present, and future.ArXiv, abs/2512.16760, 2025. 1

work page arXiv 2025
[20]

Planning-oriented autonomous driving

Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wenhai Wang, Lewei Lu, Xiaosong Jia, Qiang Liu, Jifeng Dai, Yu Qiao, and Hongyang Li. Planning-oriented autonomous driving. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023. 6, 7

work page 2023
[21]

Emma: End-to-end multimodal model for autonomous driving.Trans

Jyh-Jing Hwang, Runsheng Xu, Hubert Lin, Wei-Chih Hung, Jingwei Ji, Kristy Choi, Di Huang, Tong He, Paul Covington, Benjamin Sapp, James Guo, Drago Anguelov, and Mingxing Tan. Emma: End-to-end multimodal model for autonomous driving.Trans. Mach. Learn. Res., 2025,

work page 2025
[22]

Driveworld-vla: Unified latent-space world modeling with vision-language-action for au- tonomous driving.ArXiv, abs/2602.06521, 2026

Feiyang Jia, Lin Liu, Ziying Song, Caiyan Jia, Hangjun Ye, Xiaoshuai Hao, and Long Chen. Driveworld-vla: Unified latent-space world modeling with vision-language-action for au- tonomous driving.ArXiv, abs/2602.06521, 2026. 1, 2, 3

work page arXiv 2026
[23]

Xiaosong Jia, Yulu Gao, Li Chen, Junchi Yan, Patrick Langechuan Liu, and Hongyang Li. Driveadapter: Breaking the coupling barrier of perception and planning in end-to-end au- tonomous driving.2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 7919–7929, 2023. 6, 7 11

work page 2023
[24]

Think twice before driving: Towards scalable decoders for end-to-end autonomous driving

Xiaosong Jia, Penghao Wu, Li Chen, Jiangwei Xie, Conghui He, Junchi Yan, and Hongyang Li. Think twice before driving: Towards scalable decoders for end-to-end autonomous driving. In CVPR, 2023. 6, 7

work page 2023
[25]

Bench2drive: Towards multi-ability benchmarking of closed-loop end-to-end autonomous driving

Xiaosong Jia, Zhenjie Yang, Qifeng Li, Zhiyuan Zhang, and Junchi Yan. Bench2drive: Towards multi-ability benchmarking of closed-loop end-to-end autonomous driving. InNeurIPS 2024 Datasets and Benchmarks Track, 2024. 5

work page 2024
[26]

Drivetransformer: Unified trans- former for scalable end-to-end autonomous driving

Xiaosong Jia, Junqi You, Zhiyuan Zhang, and Junchi Yan. Drivetransformer: Unified trans- former for scalable end-to-end autonomous driving. InThe Thirteenth International Conference on Learning Representations, 2025. 6, 7

work page 2025
[27]

Diffvla: Vision- language guided diffusion planning for autonomous driving.ArXiv, abs/2505.19381, 2025

Anqing Jiang, Yu Gao, Zhigang Sun, Yiru Wang, Jijun Wang, Jinghao Chai, Qian Cao, Yuwen Heng, Hao Jiang, Zongzheng Zhang, Xianda Guo, Hao Sun, and Hao Zhao. Diffvla: Vision- language guided diffusion planning for autonomous driving.ArXiv, abs/2505.19381, 2025. 3

work page arXiv 2025
[28]

Vad: Vectorized scene representation for efficient autonomous driving.2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 8306–8316, 2023

Bo Jiang, Shaoyu Chen, Qing Xu, Bencheng Liao, Jiajie Chen, Helong Zhou, Qian Zhang, Wenyu Liu, Chang Huang, and Xinggang Wang. Vad: Vectorized scene representation for efficient autonomous driving.2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 8306–8316, 2023. 5, 6, 7

work page 2023
[29]

Senna: Bridging Large Vision-Language Models and End-to-End Autonomous Driving

Bo Jiang, Shaoyu Chen, Bencheng Liao, Xingyu Zhang, Wei Yin, Qian Zhang, Chang Huang, Wenyu Liu, and Xinggang Wang. Senna: Bridging large vision-language models and end-to-end autonomous driving.ArXiv, abs/2410.22313, 2024. 1, 3

work page internal anchor Pith review Pith/arXiv arXiv 2024
[30]

Enhancing End-to-End Autonomous Driving with Latent World Model

Yingyan Li, Lue Fan, Jiawei He, Yu-Quan Wang, Yuntao Chen, Zhaoxiang Zhang, and Tieniu Tan. Enhancing end-to-end autonomous driving with latent world model.ArXiv, abs/2406.08481,

work page internal anchor Pith review Pith/arXiv arXiv
[31]

DriveVLA-W0: World Models Amplify Data Scaling Law in Autonomous Driving

Yingyan Li, Shuyao Shang, Weisong Liu, Bing Zhan, Haochen Wang, Yu-Quan Wang, Yun- tao Chen, Xiaoman Wang, Yasong An, Chufeng Tang, Lu Hou, Lue Fan, and Zhaoxiang Zhang. Drivevla-w0: World models amplify data scaling law in autonomous driving.ArXiv, abs/2510.12796, 2025. 1, 2, 3

work page internal anchor Pith review Pith/arXiv arXiv 2025
[32]

ReCogDrive: A Reinforced Cognitive Framework for End-to-End Autonomous Driving

Yongkang Li, Kaixin Xiong, Xiangyu Guo, Fang Li, Sixu Yan, Gangwei Xu, Lijun Zhou, Long Chen, Haiyang Sun, Bing Wang, et al. Recogdrive: A reinforced cognitive framework for end-to-end autonomous driving.arXiv preprint arXiv:2506.08052, 2025. 1, 3, 6, 7

work page internal anchor Pith review Pith/arXiv arXiv 2025
[33]

Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving

Bencheng Liao, Shaoyu Chen, Haoran Yin, Bo Jiang, Cheng Wang, Sixu Yan, Xinbang Zhang, Xiangyu Li, Ying Zhang, Qian Zhang, et al. Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 12037–12047, 2025. 3, 6

work page 2025
[34]

Futurex: Enhance end-to-end autonomous driving via latent chain-of-thought world model

Hongbin Lin, Yiming Yang, Yifan Zhang, Chaoda Zheng, Jie Feng, Sheng Wang, Zhennan Wang, Shijia Chen, Boyang Wang, Yu Zhang, Xianming Liu, Shuguang Cui, and Zhen Li. Futurex: Enhance end-to-end autonomous driving via latent chain-of-thought world model. ArXiv, abs/2512.11226, 2025. 3

work page arXiv 2025
[35]

Visual instruction tuning, 2023

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning, 2023. 3

work page 2023
[36]

DriveVA: Video Action Models are Zero-Shot Drivers

Mengmeng Liu, Diankun Zhang, Jiuming Liu, Jianfeng Cui, Hongwei Xie, Guang Chen, Hangjun Ye, Michael Ying Yang, Francesco Nex, and Hao Cheng. Driveva: Video action models are zero-shot drivers.arXiv preprint arXiv:2604.04198, 2026. 3

work page internal anchor Pith review Pith/arXiv arXiv 2026
[37]

Unleashing vla potentials in autonomous driving via explicit learning from failures

Yuechen Luo, Qimao Chen, Fang Li, Shaoqing Xu, Jaxin Liu, Ziying Song, Zhi-xin Yang, and Fuxi Wen. Unleashing vla potentials in autonomous driving via explicit learning from failures. arXiv preprint arXiv:2603.01063, 2026. 3

work page arXiv 2026
[38]

Last-vla: Thinking in latent spatio-temporal space for vision-language-action in autonomous driving.arXiv preprint arXiv:2603.01928, 2026

Yuechen Luo, Fang Li, Shaoqing Xu, Yang Ji, Zehan Zhang, Bing Wang, Yuannan Shen, Jianwei Cui, Long Chen, Guang Chen, et al. Last-vla: Thinking in latent spatio-temporal space for vision-language-action in autonomous driving.arXiv preprint arXiv:2603.01928, 2026. 3 12

[39]

Xiaodong Mei, Sheng Wang, Jie Cheng, Yingbing Chen, and Dan Xu. Hamf: A hybrid attention-mamba framework for joint scene context understanding and future motion representation learning. 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4107–4114, 2025. 6

[40]

Alpamayo-R1: Bridging Reasoning and Action Prediction for Generalizable Autonomous Driving in the Long Tail

NVIDIA, Yan Wang, Wenjie Luo, Junjie Bai, Yulong Cao, Tong Che, Ke Chen, Yuxiao Chen, Jenna Diamond, Yifan Ding, Wenhao Ding, Liang Feng, Greg Heinrich, Jack Huang, Peter Karkus, Boyi Li, Pinyi Li, Tsung-Yi Lin, Dongran Liu, Ming-Yu Liu, Langechuan Liu, Zhijian Liu, Jason Lu, Yunxiang Mao, Pavlo Molchanov, Lindsey Pavao, Zhenghao Peng, Mike Ranzinger, Ed ...

[41]

Katrin Renz, Long Chen, Ana-Maria Marcu, Jan Hünermann, Benoît Hanotte, Alice Karnsund, Jamie Shotton, Elahe Arani, and Oleg Sinavski. Carllava: Vision language models for camera-only closed-loop driving. ArXiv, abs/2406.10165, 2024. 3

[42]

Katrin Renz, Long Chen, Elahe Arani, and Oleg Sinavski. Simlingo: Vision-only closed-loop autonomous driving with language-action alignment. 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11993–12003, 2025. 1, 3

[43]

Shuyao Shang, Yuntao Chen, Yu-Quan Wang, Yingyan Li, and Zhaoxiang Zhang. Drivedpo: Policy learning via safety dpo for end-to-end autonomous driving. ArXiv, abs/2509.17940, 2025. 6

[44]

Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michael Ramamonjisoa, Francisco Massa, Daniel Haziza, Luca Wehrstedt, Jianyuan Wang, Timothée Darcet, Théo Moutakanni, Leonel Sentana, Claire Roberts, Andrea Vedaldi, Jamie Tolan, John Brandt, Camille Couprie, Juli...

[45]

Ziying Song, Caiyan Jia, Lin Liu, Hongyu Pan, Yongchang Zhang, Junming Wang, Xingyu Zhang, Shaoqing Xu, Lei Yang, and Yadan Luo. Don’t shake the wheel: Momentum-aware planning in end-to-end autonomous driving. 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 22432–22441, 2025. 6

[46]

Shuhan Tan, Kashyap Chitta, Yuxiao Chen, Ran Tian, Yurong You, Yan Wang, Wenjie Luo, Yulong Cao, Philipp Krahenbuhl, Marco Pavone, et al. Latent chain-of-thought world modeling for end-to-end driving. arXiv preprint arXiv:2512.10226, 2025. 3

[47]

Xiaoyu Tian, Junru Gu, Bailin Li, Yicheng Liu, Chenxu Hu, Yang Wang, Kun Zhan, Peng Jia, Xianpeng Lang, and Hang Zhao. Drivevlm: The convergence of autonomous driving and large vision-language models. ArXiv, abs/2402.12289, 2024. 3

[48]

Tao Wang, Cong Zhang, Xingguang Qu, Kun Li, Weiwei Liu, and Chang Huang. Diffad: A unified diffusion modeling approach for autonomous driving, 2025. 6, 7

[49]

Wenhai Wang, Jiangwei Xie, Chuanyan Hu, Haoming Zou, Jianan Fan, Wenwen Tong, Yang Wen, Silei Wu, Hanming Deng, Zhiqi Li, Hao Tian, Lewei Lu, Xizhou Zhu, Xiaogang Wang, Yu Qiao, and Jifeng Dai. Drivemlm: aligning multi-modal large language models with behavioral planning states for autonomous driving. Visual Intelligence, 3, 2023. 3

[50]

Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, Yingli Zhao, Yulong Ao, Xuebin Min, Tao Li, Boya Wu, Bo Zhao, Bowen Zhang, Lian zi Wang, Guang Liu, Zheqi He, Xi Yang, Jingjing Liu, Yonghua Lin, Tiejun Huang, and Zhongyuan Wang. Emu3: Next-token prediction is all you need. ArXiv, ...

[51]

Penghao Wu, Xiaosong Jia, Li Chen, Junchi Yan, Hongyang Li, and Yu Qiao. Trajectory-guided control prediction for end-to-end autonomous driving: A simple yet strong baseline. In NeurIPS,

[52]

Zhexiao Xiong, Xin Ye, B. Yaman, Shen Cheng, Yiren Lu, Jingru Luo, Nathan Jacobs, and Liu Ren. Unidrive-wm: Unified understanding, planning and generation world model for autonomous driving. ArXiv, abs/2601.04453, 2026. 2, 3, 6, 7

[53]

Tianshuo Xu, Hao Lu, Xu Yan, Yingjie Cai, Bingbing Liu, and Yingcong Chen. Occ-llm: Enhancing autonomous driving with occupancy-based large language models. 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 8434–8441, 2025. 3

[54]

Zhenhua Xu, Yujia Zhang, Enze Xie, Zhen Zhao, Yong Guo, Kwan-Yee Kenneth Wong, Zhenguo Li, and Hengshuang Zhao. Drivegpt4: Interpretable end-to-end autonomous driving via large language model. IEEE Robotics and Automation Letters, 9:8186–8193, 2023. 1, 3

[55]

Zetong Yang, Li Chen, Yanan Sun, and Hongyang Li. Visual point cloud forecasting enables scalable autonomous driving. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14673–14684, 2023. 3

[56]

Zhenjie Yang, Yilin Chai, Xiaosong Jia, Qifeng Li, Yuqian Shao, Xuekai Zhu, Haisheng Su, and Junchi Yan. Drivemoe: Mixture-of-experts for vision-language-action model in end-to-end autonomous driving. ArXiv, abs/2505.16278, 2025. 3, 6, 7

[57]

Zhenjie Yang, Xiaosong Jia, Qifeng Li, Xue Yang, Maoqing Yao, and Junchi Yan. Raw2drive: Reinforcement learning with aligned world models for end-to-end autonomous driving (in CARLA v2). In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. 6, 7

[58]

Shuang Zeng, Xinyuan Chang, Mengwei Xie, Xinran Liu, Yifan Bai, Zheng Pan, Mu Xu, and Xing Wei. Futuresightdrive: Thinking visually with spatio-temporal cot for autonomous driving. ArXiv, abs/2505.17685, 2025. 2, 3

[59]

Kaiwen Zhang, Zhenyu Tang, Xiaotao Hu, Xingang Pan, Xiaoyang Guo, Yuan Liu, Jing Huang, Li Yuan, Qian Zhang, Xiaoxiao Long, Xun Cao, and Wei Yin. Epona: Autoregressive diffusion world model for autonomous driving. ArXiv, abs/2506.24113, 2025. 2, 3

[60]

Lunjun Zhang, Yuwen Xiong, Ze Yang, Sergio Casas, Rui Hu, and Raquel Urtasun. Copilot4d: Learning unsupervised world models for autonomous driving via discrete diffusion. In International Conference on Learning Representations, 2023. 3

[61]

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Haotong Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena. ArXiv, abs/2306.05685, 2023. 6

[62]

Wenzhao Zheng, Ruiqi Song, Xianda Guo, and Long Chen. Genad: Generative end-to-end autonomous driving. In European Conference on Computer Vision, 2024. 5

[63]

Wenzhao Zheng, Zetian Xia, Yuanhui Huang, Sicheng Zuo, Jie Zhou, and Jiwen Lu. Doe-1: Closed-loop autonomous driving with large world model. ArXiv, abs/2412.09627, 2024. 2, 3

[64]

Yupeng Zheng, Pengxuan Yang, Zebin Xing, Qichao Zhang, Yuhang Zheng, Yinfeng Gao, Pengfei Li, Teng Zhang, Zhongpu Xia, Peng Jia, and Dongbin Zhao. World4drive: End-to-end autonomous driving via intention-aware physical latent world model. ArXiv, abs/2507.00603,

[65]

Xingcheng Zhou, Xu Han, Feng Yang, Yunpu Ma, and Alois Knoll. Opendrivevla: Towards end-to-end autonomous driving with large vision language action model. ArXiv, abs/2503.23463,

[66]

Xin Zhou, Dingkang Liang, Sifan Tu, Xiwu Chen, Yikang Ding, Dingyuan Zhang, Feiyang Tan, Hengshuang Zhao, and Xiang Bai. Hermes: A unified self-driving world model for simultaneous 3d scene understanding and generation. ArXiv, abs/2501.14729, 2025. 2

[67]

Zewei Zhou, Tianhui Cai, Seth Z. Zhao, Yun Zhang, Zhiyu Huang, Bolei Zhou, and Jiaqi Ma. Autovla: A vision-language-action model for end-to-end autonomous driving with adaptive reasoning and reinforcement fine-tuning. ArXiv, abs/2506.13757, 2025. 3

A Technical appendices and supplementary material

We first provide more detailed implementations of our LV...

[33] [33]

Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving

Bencheng Liao, Shaoyu Chen, Haoran Yin, Bo Jiang, Cheng Wang, Sixu Yan, Xinbang Zhang, Xiangyu Li, Ying Zhang, Qian Zhang, et al. Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 12037–12047, 2025. 3, 6

work page 2025

[34] [34]

Futurex: Enhance end-to-end autonomous driving via latent chain-of-thought world model

Hongbin Lin, Yiming Yang, Yifan Zhang, Chaoda Zheng, Jie Feng, Sheng Wang, Zhennan Wang, Shijia Chen, Boyang Wang, Yu Zhang, Xianming Liu, Shuguang Cui, and Zhen Li. Futurex: Enhance end-to-end autonomous driving via latent chain-of-thought world model. ArXiv, abs/2512.11226, 2025. 3

work page arXiv 2025

[35] [35]

Visual instruction tuning, 2023

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning, 2023. 3

work page 2023

[36] [36]

DriveVA: Video Action Models are Zero-Shot Drivers

Mengmeng Liu, Diankun Zhang, Jiuming Liu, Jianfeng Cui, Hongwei Xie, Guang Chen, Hangjun Ye, Michael Ying Yang, Francesco Nex, and Hao Cheng. Driveva: Video action models are zero-shot drivers.arXiv preprint arXiv:2604.04198, 2026. 3

work page internal anchor Pith review Pith/arXiv arXiv 2026

[37] [37]

Unleashing vla potentials in autonomous driving via explicit learning from failures

Yuechen Luo, Qimao Chen, Fang Li, Shaoqing Xu, Jaxin Liu, Ziying Song, Zhi-xin Yang, and Fuxi Wen. Unleashing vla potentials in autonomous driving via explicit learning from failures. arXiv preprint arXiv:2603.01063, 2026. 3

work page arXiv 2026

[38] [38]

Last-vla: Thinking in latent spatio-temporal space for vision-language-action in autonomous driving.arXiv preprint arXiv:2603.01928, 2026

Yuechen Luo, Fang Li, Shaoqing Xu, Yang Ji, Zehan Zhang, Bing Wang, Yuannan Shen, Jianwei Cui, Long Chen, Guang Chen, et al. Last-vla: Thinking in latent spatio-temporal space for vision-language-action in autonomous driving.arXiv preprint arXiv:2603.01928, 2026. 3 12

work page arXiv 2026

[39] [39]

Xiaodong Mei, Sheng Wang, Jie Cheng, Yingbing Chen, and Dan Xu. Hamf: A hybrid attention- mamba framework for joint scene context understanding and future motion representation learning.2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4107–4114, 2025. 6

work page 2025

[40] [40]

Alpamayo-R1: Bridging Reasoning and Action Prediction for Generalizable Autonomous Driving in the Long Tail

NVIDIA, Yan Wang, Wenjie Luo, Junjie Bai, Yulong Cao, Tong Che, Ke Chen, Yuxiao Chen, Jenna Diamond, Yifan Ding, Wenhao Ding, Liang Feng, Greg Heinrich, Jack Huang, Peter Karkus, Boyi Li, Pinyi Li, Tsung-Yi Lin, Dongran Liu, Ming-Yu Liu, Langechuan Liu, Zhijian Liu, Jason Lu, Yunxiang Mao, Pavlo Molchanov, Lindsey Pavao, Zhenghao Peng, Mike Ranzinger, Ed ...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[41] [41]

Carllava: Vision language models for camera- only closed-loop driving.ArXiv, abs/2406.10165, 2024

Katrin Renz, Long Chen, Ana-Maria Marcu, Jan Hünermann, Benoît Hanotte, Alice Karnsund, Jamie Shotton, Elahe Arani, and Oleg Sinavski. Carllava: Vision language models for camera- only closed-loop driving.ArXiv, abs/2406.10165, 2024. 3

work page arXiv 2024

[42] [42]

Katrin Renz, Long Chen, Elahe Arani, and Oleg Sinavski. Simlingo: Vision-only closed-loop autonomous driving with language-action alignment.2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11993–12003, 2025. 1, 3

work page 2025

[43] [43]

Drivedpo: Policy learning via safety dpo for end-to-end autonomous driving.arXiv preprint arXiv:2509.17940, 2025

Shuyao Shang, Yuntao Chen, Yu-Quan Wang, Yingyan Li, and Zhaoxiang Zhang. Drivedpo: Policy learning via safety dpo for end-to-end autonomous driving.ArXiv, abs/2509.17940, 2025. 6

work page arXiv 2025

[44] [44]

Oriane Sim’eoni, Huy V . V o, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michael Ramamonjisoa, Francisco Massa, Daniel Haziza, Luca Wehrstedt, Jianyuan Wang, Timothée Darcet, Théo Moutakanni, Leonel Sentana, Claire Roberts, Andrea Vedaldi, Jamie Tolan, John Brandt, Camille Couprie, Juli...

work page 2025

[45] [45]

Don’t shake the wheel: Momentum-aware planning in end-to-end autonomous driving.2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 22432–22441, 2025

Ziying Song, Caiyan Jia, Lin Liu, Hongyu Pan, Yongchang Zhang, Junming Wang, Xingyu Zhang, Shaoqing Xu, Lei Yang, and Yadan Luo. Don’t shake the wheel: Momentum-aware planning in end-to-end autonomous driving.2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 22432–22441, 2025. 6

work page 2025

[46] [46]

Latent Chain-of-Thought World Modeling for End-to-End Driving

Shuhan Tan, Kashyap Chitta, Yuxiao Chen, Ran Tian, Yurong You, Yan Wang, Wenjie Luo, Yulong Cao, Philipp Krahenbuhl, Marco Pavone, et al. Latent chain-of-thought world modeling for end-to-end driving.arXiv preprint arXiv:2512.10226, 2025. 3

work page internal anchor Pith review Pith/arXiv arXiv 2025

[47] [47]

DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models

Xiaoyu Tian, Junru Gu, Bailin Li, Yicheng Liu, Chenxu Hu, Yang Wang, Kun Zhan, Peng Jia, Xianpeng Lang, and Hang Zhao. Drivevlm: The convergence of autonomous driving and large vision-language models.ArXiv, abs/2402.12289, 2024. 3

work page internal anchor Pith review Pith/arXiv arXiv 2024

[48] [48]

Diffad: A unified diffusion modeling approach for autonomous driving, 2025

Tao Wang, Cong Zhang, Xingguang Qu, Kun Li, Weiwei Liu, and Chang Huang. Diffad: A unified diffusion modeling approach for autonomous driving, 2025. 6, 7

work page 2025

[49] [49]

Drivemlm: aligning multi-modal large language models with behavioral planning states for autonomous driving.Visual Intelligence, 3, 2023

Wenhai Wang, Jiangwei Xie, Chuanyan Hu, Haoming Zou, Jianan Fan, Wenwen Tong, Yang Wen, Silei Wu, Hanming Deng, Zhiqi Li, Hao Tian, Lewei Lu, Xizhou Zhu, Xiaogang Wang, Yu Qiao, and Jifeng Dai. Drivemlm: aligning multi-modal large language models with behavioral planning states for autonomous driving.Visual Intelligence, 3, 2023. 3

work page 2023

[50] [50]

Emu3: Next-Token Prediction is All You Need

Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, Yingli Zhao, Yulong Ao, Xuebin Min, Tao Li, Boya Wu, Bo Zhao, Bowen Zhang, Lian zi Wang, Guang Liu, Zheqi He, Xi Yang, Jingjing Liu, Yonghua Lin, Tiejun Huang, and Zhongyuan Wang. Emu3: Next-token prediction is all you need.ArXiv, ...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[51] [51]

Trajectory-guided control prediction for end-to-end autonomous driving: A simple yet strong baseline

Penghao Wu, Xiaosong Jia, Li Chen, Junchi Yan, Hongyang Li, and Yu Qiao. Trajectory-guided control prediction for end-to-end autonomous driving: A simple yet strong baseline. InNeurIPS,

work page

[52] [52]

Yaman, Shen Cheng, Yiren Lu, Jingru Luo, Nathan Jacobs, and Liu Ren

Zhexiao Xiong, Xin Ye, B. Yaman, Shen Cheng, Yiren Lu, Jingru Luo, Nathan Jacobs, and Liu Ren. Unidrive-wm: Unified understanding, planning and generation world model for autonomous driving.ArXiv, abs/2601.04453, 2026. 2, 3, 6, 7

work page arXiv 2026

[53] [53]

Occ-llm: Enhancing autonomous driving with occupancy-based large language models.2025 IEEE International Conference on Robotics and Automation (ICRA), pages 8434–8441, 2025

Tianshuo Xu, Hao Lu, Xu Yan, Yingjie Cai, Bingbing Liu, and Yingcong Chen. Occ-llm: Enhancing autonomous driving with occupancy-based large language models.2025 IEEE International Conference on Robotics and Automation (ICRA), pages 8434–8441, 2025. 3

work page 2025

[54] [54]

Drivegpt4: Interpretable end-to-end autonomous driving via large language model.IEEE Robotics and Automation Letters, 9:8186–8193, 2023

Zhenhua Xu, Yujia Zhang, Enze Xie, Zhen Zhao, Yong Guo, Kwan-Yee Kenneth Wong, Zhenguo Li, and Hengshuang Zhao. Drivegpt4: Interpretable end-to-end autonomous driving via large language model.IEEE Robotics and Automation Letters, 9:8186–8193, 2023. 1, 3

work page 2023

[55] [55]

Visual point cloud forecasting enables scalable autonomous driving.2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14673–14684, 2023

Zetong Yang, Li Chen, Yanan Sun, and Hongyang Li. Visual point cloud forecasting enables scalable autonomous driving.2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14673–14684, 2023. 3

work page 2024

[56] [56]

DriveMoE: Mixture-of-Experts for Vision-Language-Action Model in End-to-End Autonomous Driving

Zhenjie Yang, Yilin Chai, Xiaosong Jia, Qifeng Li, Yuqian Shao, Xuekai Zhu, Haisheng Su, and Junchi Yan. Drivemoe: Mixture-of-experts for vision-language-action model in end-to-end autonomous driving.ArXiv, abs/2505.16278, 2025. 3, 6, 7

work page internal anchor Pith review Pith/arXiv arXiv 2025

[57] [57]

Raw2drive: Reinforcement learning with aligned world models for end-to-end autonomous driving (in CARLA v2)

Zhenjie Yang, Xiaosong Jia, Qifeng Li, Xue Yang, Maoqing Yao, and Junchi Yan. Raw2drive: Reinforcement learning with aligned world models for end-to-end autonomous driving (in CARLA v2). InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. 6, 7

[58]

Shuang Zeng, Xinyuan Chang, Mengwei Xie, Xinran Liu, Yifan Bai, Zheng Pan, Mu Xu, and Xing Wei. Futuresightdrive: Thinking visually with spatio-temporal cot for autonomous driving. ArXiv, abs/2505.17685, 2025. 2, 3

[59]

Kaiwen Zhang, Zhenyu Tang, Xiaotao Hu, Xingang Pan, Xiaoyang Guo, Yuan Liu, Jing Huang, Li Yuan, Qian Zhang, Xiaoxiao Long, Xun Cao, and Wei Yin. Epona: Autoregressive diffusion world model for autonomous driving. ArXiv, abs/2506.24113, 2025. 2, 3

[60]

Lunjun Zhang, Yuwen Xiong, Ze Yang, Sergio Casas, Rui Hu, and Raquel Urtasun. Copilot4d: Learning unsupervised world models for autonomous driving via discrete diffusion. In International Conference on Learning Representations, 2023. 3

[61]

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Haotong Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena. ArXiv, abs/2306.05685, 2023. 6

[62]

Wenzhao Zheng, Ruiqi Song, Xianda Guo, and Long Chen. Genad: Generative end-to-end autonomous driving. In European Conference on Computer Vision, 2024. 5

[63]

Wenzhao Zheng, Zetian Xia, Yuanhui Huang, Sicheng Zuo, Jie Zhou, and Jiwen Lu. Doe-1: Closed-loop autonomous driving with large world model. ArXiv, abs/2412.09627, 2024. 2, 3

[64]

Yupeng Zheng, Pengxuan Yang, Zebin Xing, Qichao Zhang, Yuhang Zheng, Yinfeng Gao, Pengfei Li, Teng Zhang, Zhongpu Xia, Peng Jia, and Dongbin Zhao. World4drive: End-to-end autonomous driving via intention-aware physical latent world model. ArXiv, abs/2507.00603,

[65]

Xingcheng Zhou, Xu Han, Feng Yang, Yunpu Ma, and Alois Knoll. Opendrivevla: Towards end-to-end autonomous driving with large vision language action model. ArXiv, abs/2503.23463,

[66]

Xin Zhou, Dingkang Liang, Sifan Tu, Xiwu Chen, Yikang Ding, Dingyuan Zhang, Feiyang Tan, Hengshuang Zhao, and Xiang Bai. Hermes: A unified self-driving world model for simultaneous 3d scene understanding and generation. ArXiv, abs/2501.14729, 2025. 2

[67]

Zewei Zhou, Tianhui Cai, Seth Z. Zhao, Yun Zhang, Zhiyu Huang, Bolei Zhou, and Jiaqi Ma. Autovla: A vision-language-action model for end-to-end autonomous driving with adaptive reasoning and reinforcement fine-tuning. ArXiv, abs/2506.13757, 2025. 3

A Technical appendices and supplementary material

We first provide more detailed implementations of our LV...
