LWDrive: Layer-Wise World-Model-Guided Vision-Language Model Planning for Autonomous Driving

Chen Yang; Delin Ouyang; Guofa Li; Jie Li; Lingfeng Qi; Shuang Liang; Yuhao Wei; Ze Xu; Ziheng Zou

arxiv: 2606.29879 · v2 · pith:5FFYFQRNnew · submitted 2026-06-29 · 💻 cs.CV · cs.AI


Pith reviewed 2026-07-01 06:51 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords autonomous driving · vision-language models · world model · trajectory planning · end-to-end planning · foresight cascade planner · NAVSIM benchmark · layer-wise guidance


The pith

Layer-wise world-model guidance refines coarse VLM trajectories into geometrically precise autonomous driving plans.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that Vision-Language Models can supply high-level driving intentions but need additional structure to produce trajectories that are spatially accurate and grounded in future scene dynamics. It introduces future-frame generation as training supervision so that the model's internal hidden states begin to carry predictive information, then routes those states through a cascade of refinement steps that progressively correct positions and motion using multi-view cues. A sympathetic reader would care because this hybrid keeps the commonsense reasoning of the large model while adding the geometric precision required for safe end-to-end control. The result is reported as benchmark scores of 92.0 on NAVSIM and 89.6 on NAVSIM-v2.

Core claim

LWDrive treats the VLM output as an intent-aware coarse plan rather than a final trajectory, expands candidate trajectories around it, and refines them progressively with the Foresight Cascade Planner; the planner draws on VLM features from multiple layers together with historical temporal states, Action-Query representations, and current-frame multi-view BEV features, after the VLM has been trained with future-frame generation supervision to embed planning-relevant predictive dynamics in its hidden states.

What carries the argument

The Foresight Cascade Planner (FCP), which performs coarse-to-fine refinement by integrating VLM hidden states across layers with temporal and multi-view BEV features.
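
The coarse-to-fine idea can be illustrated with a minimal sketch, assuming each cascade stage reads one layer's feature and predicts a residual waypoint update. The names `refine_stage` and `cascade_refine`, the scalar `gain`, and the toy "layer feature" (a list of suggested waypoint positions) are hypothetical stand-ins. This is not the paper's FCP, which is described as combining VLM hidden states with temporal states and multi-view BEV features.

```python
from typing import List, Tuple

Waypoint = Tuple[float, float]
Trajectory = List[Waypoint]


def refine_stage(traj: Trajectory, layer_feature: Trajectory, gain: float) -> Trajectory:
    """Apply one residual update: move each waypoint a fraction `gain`
    toward the position suggested by this stage's (toy) layer feature."""
    return [
        (x + gain * (fx - x), y + gain * (fy - y))
        for (x, y), (fx, fy) in zip(traj, layer_feature)
    ]


def cascade_refine(
    coarse: Trajectory, layer_features: List[Trajectory], gain: float = 0.5
) -> List[Trajectory]:
    """Run the stages in order and return the trajectory after every stage."""
    history = [coarse]
    for feature in layer_features:
        history.append(refine_stage(history[-1], feature, gain))
    return history


# Toy inputs (assumptions): a coarse plan drifting laterally, and three
# stages that all suggest the lane centre y = 0.
coarse_plan = [(0.0, 0.0), (4.0, 1.0), (8.0, 2.0)]
target = [(0.0, 0.0), (4.0, 0.0), (8.0, 0.0)]
stages = cascade_refine(coarse_plan, [target, target, target])
final = stages[-1]
# final == [(0.0, 0.0), (4.0, 0.125), (8.0, 0.25)]
```

With `gain = 0.5` the lateral error halves at each stage, so the longitudinal intent of the coarse plan is kept while the geometry is corrected step by step.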

If this is right

The refined candidates preserve the high-level driving intention from the VLM while correcting spatial positions and motion trends.
Multi-view BEV features ground the refinement process at each cascade stage.
A final score head selects the best refined trajectory as the planning output.
The approach yields 92.0 on NAVSIM and 89.6 on NAVSIM-v2.
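
The expand-then-select steps in the list above can be sketched as follows. This is an illustration under stated assumptions: `expand_candidates`, `select_best`, and `toy_score` are hypothetical names, the Gaussian lateral shift is an assumed expansion rule, and the refinement stage between expansion and scoring is left out.

```python
import random
from typing import Callable, List, Tuple

Trajectory = List[Tuple[float, float]]


def expand_candidates(
    coarse: Trajectory, num: int, lateral_std: float, seed: int = 0
) -> List[Trajectory]:
    """Build a candidate pool: the coarse plan plus laterally shifted copies.
    The shift grows linearly with the waypoint index, so the start pose stays fixed."""
    rng = random.Random(seed)
    pool = [list(coarse)]
    last = max(len(coarse) - 1, 1)
    for _ in range(num - 1):
        shift = rng.gauss(0.0, lateral_std)
        pool.append([(x, y + shift * i / last) for i, (x, y) in enumerate(coarse)])
    return pool


def select_best(pool: List[Trajectory], score: Callable[[Trajectory], float]) -> Trajectory:
    """Stand-in for the score head: return the highest-scoring candidate."""
    return max(pool, key=score)


def toy_score(traj: Trajectory) -> float:
    """Hypothetical score: penalize the final lateral offset from the lane centre y = 0."""
    return -abs(traj[-1][1])


coarse_plan = [(0.0, 0.0), (4.0, 0.5), (8.0, 1.0)]
pool = expand_candidates(coarse_plan, num=16, lateral_std=1.0)
best = select_best(pool, toy_score)
```

Because the coarse plan itself is always in the pool, the selected trajectory can never score worse than the VLM's own output under the chosen score function.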

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same layer-wise supervision pattern could be tested on VLM planning for other embodied tasks such as robotic manipulation.
If intermediate-layer features prove consistently useful, future VLM training for control might routinely include auxiliary prediction heads at multiple depths.
The coarse-to-fine cascade structure suggests a general template for turning any high-level generative model output into a set of low-level control candidates.

Load-bearing premise

Future-frame generation supervision will cause the VLM hidden states to encode predictive dynamics that the Foresight Cascade Planner can then exploit for geometric refinement.
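
One way to write this premise down, as an assumed form rather than the paper's stated objective, is a planning loss plus a weighted future-frame term. The symbols $\lambda$, $\mathcal{L}_{\text{plan}}$, $\mathcal{L}_{\text{future}}$, the predicted and reference trajectories $\hat{\tau}, \tau^{*}$, and the predicted and observed next frames $\hat{I}_{t+1}, I_{t+1}$ are illustrative and do not come from the source.

```latex
\mathcal{L}_{\text{total}}
  = \mathcal{L}_{\text{plan}}\bigl(\hat{\tau}, \tau^{*}\bigr)
  + \lambda \, \mathcal{L}_{\text{future}}\bigl(\hat{I}_{t+1}, I_{t+1}\bigr)
```

Under this reading, the premise is that the gradient of the second term shapes the hidden states the planner later consumes, and $\lambda = 0$ corresponds to training with no future-frame supervision.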

What would settle it

Training the same VLM architecture without the future-frame generation loss and measuring whether the Foresight Cascade Planner still improves trajectory accuracy on the NAVSIM benchmark.
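
The comparison described here amounts to a 2x2 ablation. A minimal sketch, assuming one benchmark score per variant, is given below. The names `ablation_configs`, `fcp_gain`, and `supervision_effect` are hypothetical, and the scores must come from real training runs since none are supplied here.

```python
from itertools import product
from typing import Dict, List, Tuple

# (future_frame_loss, use_fcp): both switches are assumptions about how
# the variants would be configured.
Config = Tuple[bool, bool]


def ablation_configs() -> List[Config]:
    """All four training/evaluation variants of the 2x2 ablation."""
    return list(product([True, False], repeat=2))


def fcp_gain(scores: Dict[Config, float], future_frame_loss: bool) -> float:
    """Benchmark gain from adding the planner, at a fixed supervision setting."""
    return scores[(future_frame_loss, True)] - scores[(future_frame_loss, False)]


def supervision_effect(scores: Dict[Config, float]) -> float:
    """How much of the planner's gain depends on future-frame supervision.
    A value near zero means the planner helps equally with or without the loss."""
    return fcp_gain(scores, True) - fcp_gain(scores, False)
```

A clearly positive `supervision_effect` would support the premise. A value near zero would suggest the planner's benefit comes from elsewhere, for example candidate expansion or BEV grounding.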

Figures

Figures reproduced from arXiv: 2606.29879 by Chen Yang, Delin Ouyang, Guofa Li, Jie Li, Lingfeng Qi, Shuang Liang, Yuhao Wei, Ze Xu, Ziheng Zou.

**Figure 1.** Comparison of VLM planning paradigms for autonomous driving. (a) Direct VLM-to-trajectory decoding lacks fine-grained correction. (b) Single-stage VLM/backbone fusion injects VLM semantics only once. (c) LWDrive performs world-model-guided coarse-to-fine refinement.

**Figure 2.** Overall architecture of LWDrive. LWDrive first leverages future-frame world-model supervision to guide the VLM toward predictive scene representations and an intent-aware coarse trajectory. Based on this coarse plan, it constructs a candidate trajectory pool and progressively refines it with the Foresight Cascade Planner by integrating layer-wise foresight features, temporal states, action-query memories, …

**Figure 3.** Foresight Cascade Planner. Bridge Attention injects proposal interaction, action-query memory, ego-state context, and layer-wise VLM foresight features, while BEV refinement grounds the proposals with multi-view geometric cues and predicts residual updates.

**Figure 4.** Qualitative visualization of LWDrive. For each scene, we show the current front-view image, the future frame predicted by the world-model head, the trajectory planning result projected onto the front-view image, and the corresponding BEV trajectory planning result.

Original abstract

Vision-Language Models (VLMs) provide powerful semantic understanding and commonsense reasoning for End-to-End Autonomous Driving (E2E-AD) planning. However, trajectories directly generated by VLMs often encode only coarse driving intentions and remain insufficient for geometrically accurate, future-aware, and multi-view-grounded planning. To address these limitations, we develop the Layer-Wise World-Model-Guided Driving framework (LWDrive). LWDrive is a VLM planning framework that refines coarse trajectories through layer-wise world-model guidance. Instead of treating the VLM output as the final trajectory, LWDrive uses it as an intent-aware coarse plan, expands a diverse candidate space around it, and progressively refines the candidates through a Foresight Cascade Planner (FCP). Specifically, we introduce future-frame generation supervision to encourage the VLM to learn forward-looking scene representations, thereby injecting planning-relevant predictive dynamics into its internal hidden states. Built upon these world-model-supervised representations, FCP exploits VLM features across multiple layers and integrates historical temporal states, Action-Query representations, and current-frame multi-view Bird's-Eye-View (BEV) features to refine candidate trajectories in a coarse-to-fine manner. This design enables progressive correction of spatial positions and motion trends while grounding trajectory refinement with multi-view scene cues and preserving the high-level driving intention produced by the large model. Finally, a score head evaluates the refined candidates and selects the best trajectory as the final planning output. Experiments show that LWDrive achieves a score of 92.0 on the NAVSIM benchmark and 89.6 on NAVSIM-v2. Code and models will be made publicly available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

LWDrive's NAVSIM scores rest on an untested claim that future-frame supervision actually puts usable predictive dynamics into the VLM hidden states.

The paper's main move is to supervise a VLM with future-frame generation so its hidden states pick up forward-looking scene dynamics, then feed those states layer by layer into a Foresight Cascade Planner that expands and refines coarse VLM trajectories using historical states, action queries, and multi-view BEV features.

What is new is the concrete combination of that supervision objective with the cascade refinement that tries to keep the VLM's high-level intent while fixing spatial and motion errors. The description of how FCP progressively corrects candidates across layers is clear enough to follow.

The paper does a decent job stating the problem with direct VLM trajectories and sketching a pipeline that adds geometric grounding without discarding the semantic reasoning.

The soft spot is exactly the one in the stress-test note. The abstract reports 92.0 on NAVSIM and 89.6 on NAVSIM-v2 but supplies no ablation that removes the future-frame loss, no probe of whether the hidden states actually improve at future prediction, and no isolation of the multi-layer FCP contribution versus the base VLM plus candidate expansion. Without those checks the performance numbers cannot be attributed to the claimed world-model guidance.

This is for people working on VLM planners for driving who want to see one pattern for adding predictive supervision and layered refinement. A reader looking for integration ideas would get something useful from the design even if the evidence for the mechanism stays thin.

I would send it to peer review so the experiments and any ablations can be checked directly.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces LWDrive, a VLM-based framework for end-to-end autonomous driving planning. It treats VLM-generated trajectories as coarse, intent-aware plans, expands candidate trajectories around them, and refines them in a coarse-to-fine manner via the Foresight Cascade Planner (FCP). Future-frame generation supervision is added during training to encourage forward-looking representations in VLM hidden states; FCP then exploits these states across multiple layers together with historical temporal states, Action-Query representations, and current multi-view BEV features. A final score head selects the best refined trajectory. The method reports 92.0 on NAVSIM and 89.6 on NAVSIM-v2.

Significance. If the claimed mechanism is shown to be responsible for the gains, the work would offer a concrete route to combine the commonsense reasoning of large VLMs with geometrically precise, future-aware planning. The planned public release of code and models is a positive contribution that would allow the community to build on the layer-wise guidance idea.

major comments (2)

[Abstract] The central claim that future-frame generation supervision 'injects planning-relevant predictive dynamics into its internal hidden states' which FCP then exploits is load-bearing, yet the manuscript supplies no ablation that removes this supervision, no probing of hidden-state predictive accuracy, and no isolation of the multi-layer FCP contribution versus the coarse VLM plan alone. Without these checks the reported benchmark scores cannot be attributed to the asserted world-model guidance mechanism.
[§4 Experiments] The manuscript reports final scores of 92.0 / 89.6 but provides no error bars, no statistical significance tests across runs, and no ablation tables that would allow readers to verify whether the layer-wise guidance, rather than candidate expansion or the score head, drives the improvement.

minor comments (2)

[Abstract] The abstract introduces the acronym FCP before its full expansion; a parenthetical definition on first use would improve readability.
[§3] Notation for the Action-Query representations and the precise integration of historical temporal states inside FCP is only sketched at a high level; a diagram or pseudocode block would clarify the data flow.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We agree that additional ablations and statistical reporting are needed to strengthen attribution of gains to the world-model guidance mechanism, and we will revise the manuscript to address these points.

Referee: [Abstract] The central claim that future-frame generation supervision 'injects planning-relevant predictive dynamics into its internal hidden states' which FCP then exploits is load-bearing, yet the manuscript supplies no ablation that removes this supervision, no probing of hidden-state predictive accuracy, and no isolation of the multi-layer FCP contribution versus the coarse VLM plan alone. Without these checks the reported benchmark scores cannot be attributed to the asserted world-model guidance mechanism.

Authors: We acknowledge that the manuscript does not currently include the requested ablations or probing experiments. In the revised version we will add: (1) an ablation removing future-frame generation supervision, (2) analysis of hidden-state predictive accuracy (e.g., via probing or reconstruction metrics), and (3) comparisons isolating the multi-layer FCP contribution against the coarse VLM plan alone. These additions will allow direct attribution of performance gains to the claimed mechanism.

Revision: yes

Referee: [§4 Experiments] The manuscript reports final scores of 92.0 / 89.6 but provides no error bars, no statistical significance tests across runs, and no ablation tables that would allow readers to verify whether the layer-wise guidance, rather than candidate expansion or the score head, drives the improvement.

Authors: We agree that the current experimental section lacks error bars, significance testing, and sufficiently granular ablations. The revision will report results over multiple random seeds with standard deviations, include statistical significance tests, and expand the ablation tables to isolate the contributions of layer-wise guidance, candidate expansion around the VLM plan, and the final score head.

Revision: yes
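
The multi-seed reporting promised in this response could look roughly like the sketch below. It assumes one benchmark score per seed for each variant and uses a paired sign-flip permutation test as one reasonable choice of significance test. The function names are hypothetical, and nothing here reflects what the authors will actually run.

```python
import random
from statistics import mean, stdev
from typing import List, Tuple


def summarize(scores: List[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation over seeds (needs at least two scores)."""
    return mean(scores), stdev(scores)


def paired_permutation_pvalue(
    a: List[float], b: List[float], trials: int = 10000, seed: int = 0
) -> float:
    """Two-sided sign-flip permutation test on per-seed differences a[i] - b[i].
    Under the null hypothesis the sign of each difference is arbitrary."""
    if len(a) != len(b):
        raise ValueError("a and b must have one score per shared seed")
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(mean(diffs))
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(mean(flipped)) >= observed:
            hits += 1
    # Add-one smoothing keeps the estimate strictly positive.
    return (hits + 1) / (trials + 1)
```

With n seeds the sign-flip test cannot return a p-value much below 2 / 2^n, so a handful of seeds limits how strong a significance claim can be.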

Circularity Check

0 steps flagged

No circularity in claimed derivation chain

The paper describes a VLM-based planning framework (LWDrive) that adds future-frame generation supervision to encourage forward-looking representations in hidden states, then applies a Foresight Cascade Planner (FCP) for coarse-to-fine refinement of candidate trajectories, reporting empirical scores of 92.0 and 89.6 on external NAVSIM benchmarks. No equations, fitted parameters, self-citations, or ansatzes are present that reduce any claimed prediction or result to its inputs by construction. The performance claims rest on benchmark evaluation rather than internal re-derivation, so the chain is self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Abstract-only; no explicit free parameters, axioms, or invented entities beyond the named components are described.

invented entities (1)

Foresight Cascade Planner (FCP) · no independent evidence
Purpose: progressively refines candidate trajectories using multi-layer VLM features, historical states, and multi-view BEV features.
New component introduced to perform the coarse-to-fine refinement step.

pith-pipeline@v0.9.1-grok · 5863 in / 1185 out tokens · 38673 ms · 2026-07-01T06:51:55.080867+00:00 · methodology


TransFuser: Imitation with Transformer-Based Sensor Fusion for Autonomous Driving , author =. IEEE Transactions on Pattern Analysis and Machine Intelligence , volume =

[12] [12]

Hu, Shengchao and Chen, Li and Wu, Penghao and Li, Hongyang and Yan, Junchi and Tao, Dacheng , booktitle =

[13] [13]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages =

Planning-Oriented Autonomous Driving , author =. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages =

[14] [14]

Jiang, Bo and Chen, Shaoyu and Xu, Qing and Liao, Bencheng and Chen, Jiajie and Zhou, Helong and Zhang, Qian and Liu, Wenyu and Huang, Chang and Wang, Xinggang , booktitle =

[15] [15]

2025 IEEE International Conference on Robotics and Automation , pages =

SparseDrive: End-to-End Autonomous Driving via Sparse Scene Representation , author =. 2025 IEEE International Conference on Robotics and Automation , pages =

2025

[16] [16]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages =

DiffusionDrive: Truncated Diffusion Model for End-to-End Autonomous Driving , author =. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages =

[17] [17]

arXiv preprint arXiv:2503.12820 (2025)

Li, Keyu and Li, Zhiqi and Lan, Shiyi and Xie, Yichen and Zhang, Ziyang and Liu, Jiaming and Wu, Zuxuan and Yu, Zehui and Alvarez, Jose M. , year =. Hydra-. 2503.12820 , archivePrefix =

work page arXiv

[18] [18]

ipad: Iterative proposal-centric end-to-end autonomous driving.arXiv preprint arXiv:2505.15111, 2025

Guo, Ke and Liu, Haochen and Wu, Xiaojun and Pan, Jia and Lv, Chen , year =. 2505.15111 , archivePrefix =

work page arXiv

[19] [19]

PRIX: Learning to Plan from Raw Pixels for End-to-End Autonomous Driving

Wozniak, Maciej K. and Liu, Lianhang and Cai, Yixi and Jensfelt, Patric , year =. 2507.17596 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv

[20] [20]

DIVER: Reinforced Diffusion Breaks Imitation Bottlenecks in End-to-End Autonomous Driving

Song, Ziying and Liu, Lin and Pan, Hongyu and Liao, Bencheng and Guo, Mingzhe and Yang, Lei and Zhang, Yongchang and Xu, Shaoqing and Jia, Caiyan and Luo, Yadan , year =. 2507.04049 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv

[21] [21]

arXiv preprint arXiv:2408.03601 (2024)

Yuan, Chengran and Zhang, Zhanqi and Sun, Jiawei and Sun, Shuo and Huang, Zefan and Lee, Christina Dao Wen and Li, Dongen and Han, Yuhang and Wong, Anthony and Tee, Keng Peng and Ang, Marcelo H. , year =. 2408.03601 , archivePrefix =

work page arXiv

[22] [22]

2505.19239 , archivePrefix =

Shi, Chen and Shi, Shaoshuai and Sheng, Kehua and Zhang, Bo and Jiang, Li , year =. 2505.19239 , archivePrefix =

work page arXiv

[23] [23]

Proceedings of the AAAI Conference on Artificial Intelligence , volume =

DriveSuprim: Towards Precise Trajectory Selection for End-to-End Planning , author =. Proceedings of the AAAI Conference on Artificial Intelligence , volume =

[24] [24]

Proceedings of the AAAI Conference on Artificial Intelligence , volume =

DiffRefiner: Coarse to Fine Trajectory Planning via Diffusion Refinement with Semantic Interaction for End-to-End Autonomous Driving , author =. Proceedings of the AAAI Conference on Artificial Intelligence , volume =

[25] [25]

2504.19580 , archivePrefix =

Feng, Rui and Xi, Ning and Chu, Duanfeng and Wang, Rukang and Deng, Zejian and Wang, Anzheng and Lu, Liping and Wang, Jinxiang and Huang, Yanjun , year =. 2504.19580 , archivePrefix =

work page arXiv

[26] [26]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages =

Is Ego Status All You Need for Open-Loop End-to-End Autonomous Driving? , author =. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages =

[27] [27]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages =

DriveWorld: 4D Pre-Trained Scene Understanding via World Models for Autonomous Driving , author =. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages =

[28] [28]

2024 , eprint =

Enhancing End-to-End Autonomous Driving with Latent World Model , author =. 2024 , eprint =

2024

[29] [29]

End-to-End Driving with Online Trajectory Evaluation via

Li, Yingyan and Wang, Yuqi and Liu, Yang and He, Jiawei and Fan, Lue and Zhang, Zhaoxiang , year =. End-to-End Driving with Online Trajectory Evaluation via. 2504.01941 , archivePrefix =

work page arXiv

[30] [30]

Proceedings of the IEEE/CVF International Conference on Computer Vision , year =

Epona: Autoregressive Diffusion World Model for Autonomous Driving , author =. Proceedings of the IEEE/CVF International Conference on Computer Vision , year =

[31] [31]

Proceedings of the IEEE/CVF International Conference on Computer Vision , pages =

DrivingGPT: Unifying Driving World Modeling and Planning with Multimodal Autoregressive Transformers , author =. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages =

[32] [32]

Zheng, Wenzhao and Chen, Weiliang and Huang, Yuanhui and Zhang, Borui and Duan, Yueqi and Lu, Jiwen , booktitle =

[33] [33]

2025 , eprint =

World4Drive: End-to-End Autonomous Driving via Intention-Aware Physical Latent World Model , author =. 2025 , eprint =

2025

[34] [34]

2602.06521 , archivePrefix =

Jia, Feiyang and Liu, Lin and Song, Ziying and Jia, Caiyan and Ye, Hangjun and Hao, Xiaoshuai and Chen, Long , year =. 2602.06521 , archivePrefix =

work page arXiv

[35] [35]

2501.14729 , archivePrefix =

Zhou, Xinyu and Liang, Dingkang and Tu, Shuyuan and Chen, Xinyu and Ding, Yuhang and Zhang, Dong and Tan, Fei and Zhao, Hang and Bai, Xiang , year =. 2501.14729 , archivePrefix =

work page arXiv

[36] [36]

2023 , eprint =

Copilot4D: Learning Unsupervised World Models for Autonomous Driving via Discrete Diffusion , author =. 2023 , eprint =

2023

[37] [37]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages =

Visual Point Cloud Forecasting Enables Scalable Autonomous Driving , author =. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages =

[38] [38]

and Liu, Yu and Li, Hongsheng , booktitle =

Shao, Hao and Hu, Yuxuan and Wang, Letian and Song, Guanglu and Waslander, Steven L. and Liu, Yu and Li, Hongsheng , booktitle =

[39] [39]

Tian, Xiaoyu and Gu, Junru and Li, Boyuan and Liu, Yang and Wang, Yuxuan and Zhao, Zhiyuan and Zhan, Kai and Jia, Peng and Lang, Xianpeng and Zhao, Hang , booktitle =

[40] [40]

GPT-Driver: Learning to Drive with GPT

Mao, Jiageng and Qian, Yuxi and Ye, Junjie and Zhao, Hang and Wang, Yue , year =. 2310.01415 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv

[41] [41]

AutoVLA: A Vision-Language-Action Model for End-to-End Autonomous Driving with Adaptive Reasoning and Reinforcement Fine-Tuning

Zhou, Zewei and Cai, Tianhui and Zhao, Seth Z. and Zhang, Yun and Huang, Zhiyu and Zhou, Bolei and Ma, Jiaqi , year =. 2506.13757 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv

[42] [42]

DriveVLA-W0: World Models Amplify Data Scaling Law in Autonomous Driving

Li, Yingyan and Shang, Shuyao and Liu, Weisong and Zhan, Bing and Wang, Haochen and Wang, Yuqi and Chen, Yuntao and Wang, Xiaoman and An, Yasong and Tang, Chufeng and Hou, Lu and Fan, Lue and Zhang, Zhaoxiang , year =. 2510.12796 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv

[43] [43]

2025 , eprint =

ReCogDrive: A Reinforced Cognitive Framework for End-to-End Autonomous Driving , author =. 2025 , eprint =

2025

[44] [44]

Sgdrive: Scene-to-goal hierarchical world cognition for autonomous driving.arXiv preprint arXiv:2601.05640, 2026

Li, Jingyu and Wu, Junjie and Hu, Dongnan and Huang, Xiangkai and Sun, Bin and Hao, Zhihui and Lang, Xianpeng and Zhu, Xiatian and Zhang, Li , year =. 2601.05640 , archivePrefix =

work page arXiv

[45] [45]

SpanVLA: Efficient Action Bridging and Learning from Negative-Recovery Samples for Vision-Language-Action Model

Zhou, Zewei and Yang, Ruining and Qi, Xuewei and Guo, Yiluan and Chen, Sherry X. and Feng, Tao and Pistunova, Kateryna and Shen, Yishan and Su, Lili and Ma, Jiaqi , year =. 2604.19710 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv

[46] [46]

2025 , eprint =

Unified Vision-Language-Action Model , author =. 2025 , eprint =

2025

[47] [47]

FutureSightDrive: Thinking Visually with Spatio-Temporal CoT for Autonomous Driving

Zeng, Shuang and Chang, Xinyuan and Xie, Mengwei and Liu, Xinran and Bai, Yifan and Pan, Zheng and Xu, Mu and Wei, Xing , year =. FutureSightDrive: Thinking Visually with Spatio-Temporal. 2505.17685 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv

[48] [48]

Proceedings of the AAAI Conference on Artificial Intelligence , volume =

OpenDriveVLA: Towards End-to-End Autonomous Driving with Large Vision Language Action Model , author =. Proceedings of the AAAI Conference on Artificial Intelligence , volume =

[49] [49]

Fu, Haoyu and Zhang, Diankun and Zhao, Zongchuang and Cui, Jianfeng and Liang, Dingkang and Zhang, Chong and Zhang, Dingyuan and Xie, Hongwei and Wang, Bing and Bai, Xiang , booktitle =

[50] [50]

Diffvla: Vision-language guided diffusion planning for autonomous driving

Jiang, Anqing and Gao, Yu and Sun, Zhigang and Wang, Yiru and Wang, Jijun and Chai, Jinghao and Cao, Qian and Heng, Yuweng and Jiang, Hao and Zhang, Zongzheng and Guo, Xianda and Sun, Hao and Zhao, Hao , year =. 2505.19381 , archivePrefix =

work page arXiv

[51] [51]

2512.11872 , archivePrefix =

Xu, Mengwei and others , year =. 2512.11872 , archivePrefix =

work page arXiv

[52] [52]

DriveFine: Refining-Augmented Masked Diffusion

Dang, Cheng and others , year =. DriveFine: Refining-Augmented Masked Diffusion. 2602.14577 , archivePrefix =

work page arXiv

[53] [53]

Hwang, Jyh-Jing and Xu, Runsheng and Lin, Hao and Hung, Wei-Chih and Ji, Jingwei and Choi, Kyle and Huang, Dengxin and He, Tong and Covington, Paul and Sapp, Benjamin and Zhou, Yin and Guo, Jiong and Anguelov, Dragomir and Tan, Mingxing , booktitle =

[54] [54]

2024 , eprint =

OmniDrive: A Holistic Vision-Language Dataset for Autonomous Driving with Counterfactual Reasoning , author =. 2024 , eprint =

2024

[55] [55]

2401.03641 , archivePrefix =

Han, Wencheng and Guo, Dongqian and Xu, Cheng-Zhong and Shen, Jianbing , year =. 2401.03641 , archivePrefix =

work page arXiv

[56] [56]

Proceedings of the 38th International Conference on Machine Learning , pages =

Learning Transferable Visual Models from Natural Language Supervision , author =. Proceedings of the 38th International Conference on Machine Learning , pages =

[57] [57]

Advances in Neural Information Processing Systems , volume =

Flamingo: A Visual Language Model for Few-Shot Learning , author =. Advances in Neural Information Processing Systems , volume =

[58] [58]

Li, Junnan and Li, Dongxu and Savarese, Silvio and Hoi, Steven , booktitle =

[59] [59]

Advances in Neural Information Processing Systems , volume =

Visual Instruction Tuning , author =. Advances in Neural Information Processing Systems , volume =

[60] [60]

Computer Vision -- ECCV 2024 , pages =

Marcu, Ana-Maria and Chen, Long and H. Computer Vision -- ECCV 2024 , pages =

2024

[61] [61]

Qian, Tianwen and Chen, Jingjing and Zhuo, Linhai and Jiao, Yang and Jiang, Yu-Gang , booktitle =

[62] [62]

Computer Vision -- ECCV 2024 , pages =

Reason2Drive: Towards Interpretable and Chain-Based Reasoning for Autonomous Driving , author =. Computer Vision -- ECCV 2024 , pages =

2024

[63] [63]

Sima, Chonghao and Renz, Katrin and Chitta, Kashyap and Chen, Long and Zhang, Han and Xie, Chunjing and Beisswenger, Jan and Luo, Ping and Geiger, Andreas and Li, Hongyang , booktitle =

[64] [64]

Computer Vision -- ECCV 2024 , year =

Dolphins: Multimodal Language Model for Driving , author =. Computer Vision -- ECCV 2024 , year =

2024

[65] [65]

and Velipasalar, Senem and Ren, Liu , booktitle =

Pan, Chenbin and Yaman, Burhaneddin and Nesti, Tommaso and Mallik, Abhirup and Allievi, Alessandro G. and Velipasalar, Senem and Ren, Liu , booktitle =

[66] [66]

Proceedings of the AAAI Conference on Artificial Intelligence , volume =

Generative Planning with 3D-Vision Language Pre-training for End-to-End Autonomous Driving , author =. Proceedings of the AAAI Conference on Artificial Intelligence , volume =

[67] [67]

Song, Wenxuan and Zhou, Ziyang and Zhao, Han and Chen, Jiayi and Ding, Pengxiang and Yan, Haodong and Huang, Yuxin and Tang, Feilong and Wang, Donglin and Li, Haoang , booktitle =

[68] [68]

Proceedings of The 7th Conference on Robot Learning , series =

Parting with Misconceptions about Learning-Based Vehicle Motion Planning , author =. Proceedings of The 7th Conference on Robot Learning , series =

[69] [69]

2025 , eprint =

Impromptu VLA: Open Weights and Open Data for Driving Vision-Language-Action Models , author =. 2025 , eprint =

2025

[70] [70]

2025 , eprint =

InsightDrive: Insight Scene Representation for End-to-End Autonomous Driving , author =. 2025 , eprint =

2025

[71] [71]

2024 , eprint =

Doe-1: Closed-Loop Autonomous Driving with Large World Model , author =. 2024 , eprint =

2024

[72] [72]

2024 , eprint =

Making Large Language Models Better Planners with Reasoning-Decision Alignment , author =. 2024 , eprint =

2024

[73] [73]

Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision Workshops , pages =

OpenEMMA: Open-Source Multimodal Model for End-to-End Autonomous Driving , author =. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision Workshops , pages =

[74] [74]

International Conference on Learning Representations , year =

Decoupled Weight Decay Regularization , author =. International Conference on Learning Representations , year =

[75] [75]

2026 , eprint =

Bridging Scene Generation and Planning: Driving with World Model via Unifying Vision and Motion Representation , author =. 2026 , eprint =

2026