LVDrive: Latent Visual Representation Enhanced Vision-Language-Action Autonomous Driving Model
Pith reviewed 2026-05-22 07:33 UTC · model grok-4.3
The pith
LVDrive improves VLA autonomous driving by predicting future scenes in high-level latent space instead of reconstructing pixels.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LVDrive introduces a future scene prediction task into the VLA paradigm where future representations are learned entirely in a high-level latent space under auxiliary supervision from a pretrained vision backbone. It jointly models future scene and motion prediction within a unified embedding space processed in a single forward pass to conduct future-aware reasoning, and designs a two-stage trajectory decoding strategy that explicitly leverages the learned latent future representations to refine trajectory generation.
What carries the argument
Future scene prediction task in high-level latent space under auxiliary supervision from a pretrained vision backbone, jointly modeled with motion prediction in a unified embedding space processed in one forward pass.
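To make the latent-space supervision concrete, here is a minimal sketch of one way such an auxiliary objective could be written. It is not taken from the paper: the cosine-distance loss, the names `latent_future_loss` and `total_loss`, and the weight `aux_weight` are all assumptions for illustration. The only idea carried over from the text is that predicted future latents are compared against features of the real future frame produced by a frozen pretrained backbone, rather than against pixels.

```python
import math

def cosine_distance(pred, target):
    """Return 1 - cosine similarity between two equal-length feature vectors."""
    dot = sum(p * t for p, t in zip(pred, target))
    norm_p = math.sqrt(sum(p * p for p in pred))
    norm_t = math.sqrt(sum(t * t for t in target))
    return 1.0 - dot / (norm_p * norm_t + 1e-8)

def latent_future_loss(pred_tokens, target_tokens):
    """Average cosine distance over future-scene tokens.

    pred_tokens: latents the driving model predicts for the future frame.
    target_tokens: features of the real future frame from a frozen backbone,
    treated as constants (hypothetical setup, not the paper's exact recipe).
    """
    assert len(pred_tokens) == len(target_tokens)
    distances = [cosine_distance(p, t) for p, t in zip(pred_tokens, target_tokens)]
    return sum(distances) / len(distances)

def total_loss(action_loss, pred_tokens, target_tokens, aux_weight=0.5):
    """Sparse action loss plus the weighted dense latent loss (aux_weight is assumed)."""
    return action_loss + aux_weight * latent_future_loss(pred_tokens, target_tokens)
```

Because the target is a feature vector rather than an image, no decoder back to pixel space is needed in this sketch, which is the efficiency argument the summary attributes to the method.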
If this is right
- LVDrive achieves significant improvements in closed-loop driving performance on the Bench2Drive benchmark.
- It outperforms both action-supervised VLA methods and image-reconstruction-based world model approaches.
- Joint modeling of future scene and motion in a single forward pass enables future-aware reasoning without autoregressive generation.
- The two-stage trajectory decoder explicitly uses learned latent future representations to refine outputs.
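The two-stage decoding idea in the last bullet can be illustrated with a toy sketch. Everything here is hypothetical: the functions `coarse_decode` and `refine`, the 2-D motion embedding, the mean pooling, and the `gain` parameter are stand-ins chosen for readability, and the paper's actual decoder is presumably a learned network. The sketch only shows the control flow: a first pass proposes waypoints, and a second pass adjusts them using the predicted future latents.

```python
def mean_pool(tokens):
    """Average a list of equal-length token vectors into one vector."""
    n = len(tokens)
    return [sum(tok[i] for tok in tokens) / n for i in range(len(tokens[0]))]

def coarse_decode(motion_embedding, horizon):
    """Stage 1 (toy): extrapolate a straight path from a 2-D velocity-like embedding."""
    vx, vy = motion_embedding
    return [(vx * (k + 1), vy * (k + 1)) for k in range(horizon)]

def refine(coarse, future_latents, gain=0.1):
    """Stage 2 (toy): shift each waypoint by a residual derived from pooled future latents."""
    dx, dy = mean_pool(future_latents)[:2]
    return [
        (x + gain * dx * (k + 1), y + gain * dy * (k + 1))
        for k, (x, y) in enumerate(coarse)
    ]
```

For example, `refine(coarse_decode((1.0, 0.0), 3), [[0.0, 1.0], [0.0, 3.0]])` bends a straight path sideways in proportion to the pooled latent, while all-zero latents leave the coarse path unchanged.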
Where Pith is reading between the lines
- This latent-space approach could reduce compute in other VLA robotics tasks by avoiding full image reconstruction.
- It raises the question of how much visual detail is truly needed for planning versus high-level semantic forecasts.
- Testing the same latent prediction on real-world driving datasets would check whether simulation gains transfer.
- The unified embedding space might allow tighter integration with language instructions for conditional driving behaviors.
Load-bearing premise
Future scene representations learned entirely in a high-level latent space under auxiliary supervision from a pretrained vision backbone will provide semantically meaningful information that improves the VLA model's future-aware reasoning and trajectory generation for driving.
What would settle it
An ablation that removes the latent future prediction task and auxiliary supervision but shows equal or better closed-loop metrics on Bench2Drive would falsify the claim that these latent representations drive the performance gains.
Original abstract
Vision-Language-Action (VLA) models have emerged as a promising framework for end-to-end autonomous driving. However, existing VLAs typically rely on sparse action supervision, which underutilizes their powerful scene understanding and reasoning capabilities. Recent attempts to incorporate dense visual supervision via world modeling often overemphasize pixel-level image reconstruction, neglecting semantically meaningful scene representation learning. In this work, we propose LVDrive, a Latent Visual representation enhanced VLA framework for autonomous driving. LVDrive introduces a future scene prediction task into the VLA paradigm, where future representations are learned entirely in a high-level latent space under auxiliary supervision from a pretrained vision backbone. Departing from inefficient autoregressive generation, we jointly model future scene and motion prediction within a unified embedding space, processed in a single forward pass to conduct the future-aware reasoning. We further design a two-stage trajectory decoding strategy that explicitly leverages the learned latent future representations to refine trajectory generation. Extensive experiments on the challenging Bench2Drive benchmark demonstrate that LVDrive achieves significant improvements in closed-loop driving performance, outperforming both action supervised methods and image-reconstruction-based world model approaches.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes LVDrive, a Vision-Language-Action (VLA) framework for autonomous driving that augments standard action supervision with a future scene prediction task. Future representations are learned entirely in a high-level latent space under auxiliary supervision from a pretrained vision backbone. The architecture jointly models future scene and motion prediction in a unified embedding space via a single forward pass, then applies a two-stage trajectory decoder that conditions on the learned latents to refine outputs. Experiments on the Bench2Drive benchmark are reported to show significant closed-loop performance gains over both action-supervised VLAs and image-reconstruction-based world models.
Significance. If the central claims hold after addressing the noted concerns, the work would offer a computationally efficient route to dense future-aware supervision within VLA models, avoiding the overhead of pixel-level reconstruction while still leveraging semantic priors. The single-pass joint modeling and explicit conditioning in the decoder represent a clean architectural contribution that could influence subsequent VLA designs for driving. The approach directly targets the underutilization of scene understanding in sparse-action regimes, which is a timely issue in end-to-end autonomy.
major comments (2)
- [Method (future scene prediction and two-stage decoder)] The central claim that performance gains on Bench2Drive arise from semantically meaningful future scene representations (rather than increased capacity or the auxiliary loss itself) is load-bearing, yet the manuscript provides no probing, visualization, or ablation that demonstrates the latents encode driving-relevant elements such as object trajectories, lane topology, or traffic rules instead of generic backbone statistics. This directly affects attribution of the reported improvements to the proposed future-aware mechanism.
- [Experiments and results] The experimental section reports 'significant improvements' and outperformance on Bench2Drive but does not include quantitative metrics with error bars, statistical significance tests, or ablations that isolate the contribution of the latent future representations versus baseline capacity increases. Without these controls, the strength of the empirical support for the weakest assumption remains unclear.
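The first major comment asks for probing evidence. As a rough illustration of what a linear probe does, the sketch below trains a logistic-regression classifier on frozen latent vectors to predict a binary scene attribute (for instance, whether a vehicle is present). This is a generic technique, not the authors' protocol; the function names, hyperparameters, and the choice of attribute are assumptions.

```python
import math

def sigmoid(z):
    """Logistic function with clamping to avoid overflow in math.exp."""
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))

def train_linear_probe(latents, labels, epochs=200, lr=0.5):
    """Fit a logistic-regression probe on fixed latents with plain SGD.

    latents: list of feature vectors (kept frozen, only the probe is trained).
    labels: list of 0/1 attribute labels, one per latent.
    """
    w = [0.0] * len(latents[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(latents, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = p - y  # gradient of the log loss with respect to the logit
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

def probe_accuracy(w, b, latents, labels):
    """Fraction of latents whose attribute the probe predicts correctly."""
    correct = 0
    for x, y in zip(latents, labels):
        pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
        correct += int(pred == y)
    return correct / len(labels)
```

High held-out probe accuracy on driving-relevant attributes, compared with a probe on latents from a model trained without the future prediction task, would be one kind of evidence the comment is requesting.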
minor comments (2)
- [Method] Notation for the unified embedding space and the conditioning in the two-stage decoder could be clarified with an explicit diagram or equation reference to avoid ambiguity in how the latents are injected.
- [Abstract] The abstract would benefit from a single concrete performance delta or metric to give readers an immediate sense of the scale of improvement.
Simulated Author's Rebuttal
We thank the referee for the constructive and insightful comments on our manuscript. We address each major comment below and outline the revisions we will make to strengthen the empirical support and attribution of our results.
- Referee: [Method (future scene prediction and two-stage decoder)] The central claim that performance gains on Bench2Drive arise from semantically meaningful future scene representations (rather than increased capacity or the auxiliary loss itself) is load-bearing, yet the manuscript provides no probing, visualization, or ablation that demonstrates the latents encode driving-relevant elements such as object trajectories, lane topology, or traffic rules instead of generic backbone statistics. This directly affects attribution of the reported improvements to the proposed future-aware mechanism.
  Authors: We agree that direct evidence linking the learned latents to driving-specific semantics would strengthen attribution of the gains. The auxiliary supervision from the pretrained vision backbone is intended to promote semantic alignment rather than generic statistics, but we acknowledge the manuscript lacks explicit probing or visualizations to confirm this. In the revised version, we will add t-SNE visualizations of the latent space, probing classifiers for elements like object presence and lane topology, and an ablation comparing against a capacity-matched model without the future prediction objective.
  Revision: yes
- Referee: [Experiments and results] The experimental section reports 'significant improvements' and outperformance on Bench2Drive but does not include quantitative metrics with error bars, statistical significance tests, or ablations that isolate the contribution of the latent future representations versus baseline capacity increases. Without these controls, the strength of the empirical support for the weakest assumption remains unclear.
  Authors: We concur that reporting error bars, statistical significance, and capacity-controlled ablations would improve the rigor of the results. The current experiments compare against both action-supervised VLAs and image-reconstruction baselines, but do not fully isolate capacity effects. In the revision, we will include standard deviations over multiple random seeds, paired t-tests or similar for key comparisons, and an additional ablation where baseline models are scaled to match LVDrive's parameter count while removing the latent future prediction component.
  Revision: yes
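The seed-level statistics promised in the second response can be sketched with the standard library alone. The helper names below are hypothetical, and any scores passed in are whatever the experimenter measured; no numbers from the paper are assumed. The sketch reports mean and sample standard deviation per model and the paired t-statistic for two models evaluated on the same seeds.

```python
import math
import statistics

def mean_and_std(scores):
    """Mean and sample standard deviation of per-seed scores (for error bars)."""
    return statistics.mean(scores), statistics.stdev(scores)

def paired_t_statistic(scores_a, scores_b):
    """Paired t-statistic and degrees of freedom for two models on matched seeds.

    scores_a[i] and scores_b[i] must come from the same seed or route split.
    """
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need at least two matched score pairs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    t = statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1
```

Turning the statistic into a p-value requires the Student t distribution's CDF, which the standard library does not provide; a statistics package would be needed for that step.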
Circularity Check
LVDrive derivation is self-contained with no reductions to fitted inputs or self-definitions
The paper introduces architectural components (latent-space future scene prediction under auxiliary backbone supervision, unified embedding for joint scene-motion modeling in one forward pass, and two-stage trajectory decoder) and evaluates them empirically on Bench2Drive. No equations, loss terms, or claimed predictions are shown to equal their own inputs by construction, nor does any load-bearing step rely on self-citation chains that collapse to unverified priors. The performance claims rest on benchmark comparisons rather than tautological re-labeling of fitted quantities.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: A pretrained vision backbone can provide effective auxiliary supervision for learning semantically meaningful high-level scene representations in latent space.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem.
  Passage: "future representations are learned entirely in a high-level latent space under auxiliary supervision from a pretrained vision backbone... jointly model future scene and motion prediction within a unified embedding space... two-stage trajectory decoding strategy"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem.
  Passage: "LVDrive achieves significant improvements in closed-loop driving performance on Bench2Drive"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.