pith. machine review for the scientific record.

arxiv: 2605.09701 · v1 · submitted 2026-05-10 · 💻 cs.CV

Recognition: 3 theorem links · Lean theorems

DriveFuture: Future-Aware Latent World Models for Autonomous Driving

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 02:55 UTC · model grok-4.3

classification 💻 cs.CV
keywords latent world models · autonomous driving · future conditioning · trajectory planning · diffusion planner · cross-attention · foresight in decisions

The pith

Conditioning current latent states on future world states improves trajectory planning in autonomous driving.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that latent world models succeed when they use future states to shape the current representation rather than treating futures only as separate prediction targets. It does this by first forecasting future latents from the present state and ego action, then refining those forecasts against real future observations through cross-attention. The refined future-aware latent then directly conditions a diffusion planner. This separation of concerns during training allows the model to carry foresight into inference, where only its own predictions are available. A reader would care because the method promises decisions that are explicitly forward-looking without mixing current and future information in the same latent space.

Core claim

DriveFuture predicts future latent world states from the current latent state and ego action, then refines the prediction against the ground-truth future latent state via cross-attention. The resulting future-aware latent serves as an explicit condition for a diffusion-based trajectory planner. During inference the model substitutes its own predicted future latent for the ground-truth version.
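
To make the claim concrete, here is a minimal sketch of the conditioning pattern, assuming PyTorch; the module and dimension choices (a small MLP standing in for the paper's Latent Dynamics Predictor, a single cross-attention layer for its Future Alignment Adapter) are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class FutureAwareLatent(nn.Module):
    """Sketch: forecast a future latent, then (training only) refine it
    against the ground-truth future latent via cross-attention."""

    def __init__(self, d_latent=256, d_action=32, n_heads=8):
        super().__init__()
        # stand-in for the Latent Dynamics Predictor: (Z_t, ego action) -> Z^_{t+T}
        self.dynamics = nn.Sequential(
            nn.Linear(d_latent + d_action, d_latent), nn.GELU(),
            nn.Linear(d_latent, d_latent),
        )
        # stand-in for the Future Alignment Adapter: the prediction queries
        # the encoded real future observation
        self.refine = nn.MultiheadAttention(d_latent, n_heads, batch_first=True)

    def forward(self, z_t, action, z_future_gt=None):
        z_pred = self.dynamics(torch.cat([z_t, action], dim=-1))
        if z_future_gt is None:
            # inference: only the model's own forecast is available
            return z_pred, z_pred
        # training: cross-attention grounds the forecast in the real future
        q, kv = z_pred.unsqueeze(1), z_future_gt.unsqueeze(1)
        z_aware, _ = self.refine(q, kv, kv)
        return z_pred, z_aware.squeeze(1)
```

Whichever architecture the paper actually uses, the second output is the future-aware latent handed to the diffusion planner as a condition, and the z_future_gt=None branch is exactly the train-to-inference substitution the core claim rests on.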

What carries the argument

Cross-attention refinement of predicted future latents against ground-truth futures, which produces a planning-oriented future-aware latent used to condition the trajectory planner.

If this is right

  • Current and future features become less entangled because the refinement step forces the model to treat futures as an explicit conditioning signal.
  • The diffusion planner receives a latent that already encodes planning-relevant foresight rather than raw scene dynamics.
  • Performance remains high when ground-truth futures are replaced by model predictions, showing the training procedure transfers to deployment.
  • The same conditioning pattern can be applied to other latent world models that currently treat future states only as auxiliary targets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same future-conditioning pattern might reduce the need for very long prediction horizons by letting short-term futures already shape immediate actions.
  • Extending the refinement step to multiple future time steps could allow planners to balance short-term safety with longer-term goals.
  • The approach may generalize to non-driving sequential tasks where decisions must anticipate downstream states without explicit supervision on those states.

Load-bearing premise

The cross-attention step during training creates a latent encoding that still extracts useful future information when the model must rely on its own imperfect predictions at inference time.

What would settle it

An ablation that removes the cross-attention refinement step and shows no drop in planning performance on the same driving benchmarks would falsify the claim that future conditioning is the key mechanism.
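
In terms of the hypothetical module sketched under the core claim, that decisive ablation is a single switch that bypasses refinement while leaving the dynamics predictor and planner untouched; this is an illustrative sketch, not the authors' code.

```python
def plan(model, planner, z_t, action, z_future_gt, refine=True):
    # refine=False trains the identical pipeline without cross-attention
    # refinement, isolating future conditioning as the variable under test
    _, z_aware = model(z_t, action, z_future_gt if refine else None)
    return planner(z_aware)  # diffusion planner held fixed across both runs
```

If the refine=False variant matched the full model's EPDMS on the same benchmarks, future conditioning would be doing no work.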

Figures

Figures reproduced from arXiv: 2605.09701 by Lei Yang, Lin Liu, Shaoqing Xu, Xiangpo Zhou, Xiaotian Zhou, Yadan Luo, Yingyan Li, Yufeng Hong, Ziying Song.

Figure 1
Figure 1. Motivation of DriveFuture. (a) Existing latent world models [17–19, 23, 20–22] primarily simulate future latent states and use them as prediction targets or supervision signals, without explicitly shaping the current representation for planning. (b) DriveFuture uses future latent states as direct conditions for the planning process. It adopts GT future states during training and predicted future states dur… view at source ↗
Figure 2
Figure 2. Overview of DriveFuture. Multi-view observations at time t are encoded by a shared Perception Encoder into a scene latent Z_t. The Latent Dynamics Predictor conditions on Z_t and a tokenised trajectory intent to produce a predicted future latent Ẑ_{t+T}. During training, the future observation at t+T is encoded by the same Perception Encoder into Z_{t+T}, the Future Alignment Adapter grounds Ẑ_{t+T} via cross-a… view at source ↗
Figure 3
Figure 3. Visual comparison between latent world model World4Drive… view at source ↗
Figure 4
Figure 4. Compared with World4Drive [18], DriveFuture performs better in common failure cases, including collision, inefficient, and braking. This shows that conditioning current decision representations on future states is more effective than using future states merely as prediction targets. view at source ↗
Figure 4
Figure 4. DriveFuture across multiple driving scenarios from NAVSIM-v2 navhard [1]. view at source ↗
read the original abstract

Existing latent world models for autonomous driving have opened a promising path toward future-aware driving intelligence. However, they typically treat future latent states as prediction targets or auxiliary signals, rather than directly conditioning trajectory planning. This can entangle current and future features in latent space. In this work, we propose DriveFuture, a future-aware latent world modeling framework for autonomous driving that explicitly learns planning-oriented foresight by conditioning the current latent state modeling process on future world states. Specifically, during training, the model first predicts future latent world states from the current latent state and ego action, and then refines the prediction against the ground-truth future latent state via cross-attention. The resulting future-aware latent serves as an explicit condition for a diffusion-based trajectory planner. During inference, DriveFuture conditions on the predicted future latent state instead of the ground-truth future state. DriveFuture achieves SOTA performance on the public NAVSIM benchmarks, reaching 55.5 EPDMS on NAVSIM-v2 navhard, 89.9 EPDMS on NAVSIM-v2 navtest, and 90.7 PDMS on NAVSIM-v1 navtest, respectively. These results suggest that the key to latent world modeling lies not merely in simulating future states, but more importantly in conditioning current decision-making on future states. Notably, as of April 2026, DriveFuture ranks 1st on the NAVSIM-v2 navhard leaderboard (https://huggingface.co/spaces/AGC2025/e2e-driving-navhard) and achieves SOTA performance on NAVSIM-v1 navtest (https://huggingface.co/spaces/AGC2024-P/e2e-driving-navtest).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces DriveFuture, a latent world model for autonomous driving that predicts future latent states from the current latent and ego action, then refines the prediction via cross-attention against ground-truth future latents exclusively during training. The resulting future-aware latent explicitly conditions a diffusion-based trajectory planner. At inference the planner receives only the raw predicted future latent. The method reports SOTA EPDMS scores of 55.5 on NAVSIM-v2 navhard, 89.9 on NAVSIM-v2 navtest, and 90.7 PDMS on NAVSIM-v1 navtest, claiming first place on the navhard leaderboard and arguing that the key advance is conditioning current decisions on future states rather than treating futures only as targets.

Significance. If the reported gains prove robust to the train-inference mismatch and are attributable to the explicit future-conditioning mechanism, the work would offer a concrete demonstration that foresight should directly shape current planning in latent world models. The SOTA numbers on public NAVSIM benchmarks would then indicate a practical step toward more anticipatory end-to-end driving policies.

major comments (2)
  1. Abstract and §3 (method description): The cross-attention refinement is performed only against ground-truth future latents during training, yet inference conditions the planner on unrefined predicted latents. This introduces an unquantified distribution shift. No measurements of latent prediction error, cosine similarity between refined and raw latents, or error propagation to the planner are supplied, leaving the central claim that future-state conditioning is the key driver unverified.
  2. Experiments section and results tables: The SOTA EPDMS figures (55.5 navhard, etc.) are presented without an ablation that removes the GT-refinement step while keeping the future prediction and diffusion planner fixed. Without this control, it remains possible that the gains arise from the latent encoder, diffusion architecture, or training data rather than the future-aware conditioning mechanism.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects of the training-inference consistency and the need for stronger isolation of the future-conditioning contribution. We address each major comment below and outline revisions to strengthen the paper.

read point-by-point responses
  1. Referee: Abstract and §3 (method description): The cross-attention refinement is performed only against ground-truth future latents during training, yet inference conditions the planner on unrefined predicted latents. This introduces an unquantified distribution shift. No measurements of latent prediction error, cosine similarity between refined and raw latents, or error propagation to the planner are supplied, leaving the central claim that future-state conditioning is the key driver unverified.

    Authors: We acknowledge the train-inference discrepancy introduced by the training-only cross-attention refinement. The refinement step is intended to improve the quality of the learned future latent representations by aligning predictions more closely with ground-truth futures during optimization, thereby enabling the model to produce better raw predictions at inference time. While the original submission did not include quantitative analysis of latent prediction error or cosine similarity, the strong benchmark results suggest the approach is effective. To directly address the concern and verify the central claim, we will add measurements of latent prediction error, cosine similarity between refined and raw predicted latents, and an analysis of error propagation to the planner in the revised manuscript (a sketch of such diagnostics follows these responses). revision: yes

  2. Referee: Experiments section and results tables: The SOTA EPDMS figures (55.5 navhard, etc.) are presented without an ablation that removes the GT-refinement step while keeping the future prediction and diffusion planner fixed. Without this control, it remains possible that the gains arise from the latent encoder, diffusion architecture, or training data rather than the future-aware conditioning mechanism.

    Authors: We agree that an ablation isolating the GT-refinement step is necessary to attribute performance gains specifically to the future-aware conditioning mechanism. In the revised version, we will include a controlled ablation that disables the cross-attention refinement during training while retaining the future prediction module and diffusion-based planner unchanged. This will allow direct comparison of EPDMS scores and clarify whether the explicit future-state conditioning is the primary driver of the reported SOTA results. revision: yes
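
The measurements promised in the first response reduce to a few lines; the following is a hedged sketch in the same illustrative PyTorch setting as above, with metric choices that are our assumption rather than the authors' protocol.

```python
import torch.nn.functional as F

def latent_diagnostics(z_pred, z_refined, z_future_gt):
    # how far the raw forecast sits from the encoded real future
    pred_mse = F.mse_loss(z_pred, z_future_gt).item()
    # how much training-only refinement moves the latent; a low value here
    # would quantify the train-inference distribution shift the referee flags
    cos_refined_raw = F.cosine_similarity(z_refined, z_pred, dim=-1).mean().item()
    cos_refined_gt = F.cosine_similarity(z_refined, z_future_gt, dim=-1).mean().item()
    return {"pred_mse": pred_mse,
            "cos(refined, raw)": cos_refined_raw,
            "cos(refined, gt)": cos_refined_gt}
```

Feeding both z_refined and z_pred to a frozen planner and comparing the resulting trajectories would complete the error-propagation analysis the referee requests.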

Circularity Check

0 steps flagged

No circularity: empirical architecture validated on benchmarks

full rationale

The paper proposes DriveFuture as an architectural framework: it predicts future latent states from current latent + ego action, applies cross-attention refinement against ground-truth future latents exclusively during training to produce a future-aware latent, and conditions a diffusion planner on that latent. At inference the planner uses the raw predicted latent. The central claim is that explicitly conditioning current decision-making on future states (rather than treating futures only as targets) yields better planning, supported by reported SOTA EPDMS scores on public NAVSIM benchmarks. No equations, parameter-fitting steps, uniqueness theorems, or self-citation chains appear in the provided text; the result is presented as an empirical engineering outcome rather than a derivation that reduces to its inputs by construction. The training/inference distinction is explicitly stated, so no load-bearing step collapses into a tautology or fitted input renamed as prediction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract does not introduce new mathematical axioms or free parameters; the framework relies on standard latent world model and diffusion planner components whose details are not supplied here.

pith-pipeline@v0.9.0 · 5681 in / 1099 out tokens · 25381 ms · 2026-05-12T02:55:47.781445+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

69 extracted references · 69 canonical work pages · 12 internal anchors

  1. [1]

Navsim: Data-driven non-reactive autonomous vehicle simulation and benchmarking

Daniel Dauner, Marcel Hallgarten, Tianyu Li, Xinshuo Weng, Zhiyu Huang, Zetong Yang, Hongyang Li, Igor Gilitschenski, Boris Ivanovic, Marco Pavone, Andreas Geiger, and Kashyap Chitta. Navsim: Data-driven non-reactive autonomous vehicle simulation and benchmarking.

  2. [2]

https://arxiv.org/abs/2406.15349

  3. [3]

Progressive robustness-aware world models in autonomous driving: A review and outlook. Authorea Preprints, 2025

Feiyang Jia, Caiyan Jia, Ziying Song, Zhicheng Bao, Lin Liu, Shaoqing Xu, Yan Gong, Lei Yang, Xinyu Zhang, Bin Sun, et al. Progressive robustness-aware world models in autonomous driving: A review and outlook. Authorea Preprints, 2025

  4. [4]

A survey of world models for autonomous driving. arXiv preprint arXiv:2501.11260, 2025

    Tuo Feng, Wenguan Wang, and Yi Yang. A survey of world models for autonomous driving. arXiv preprint arXiv:2501.11260, 2025

  5. [5]

The role of world models in shaping autonomous driving: A comprehensive survey. arXiv preprint arXiv:2502.10498, 2025

Sifan Tu, Xin Zhou, Dingkang Liang, Xingyu Jiang, Yumeng Zhang, Xiaofan Li, and Xiang Bai. The role of world models in shaping autonomous driving: A comprehensive survey. arXiv preprint arXiv:2502.10498, 2025

  6. [6]

    Robustness-aware 3d object detection in autonomous driving: A review and outlook

    Ziying Song, Lin Liu, Feiyang Jia, Yadan Luo, Caiyan Jia, Guoxin Zhang, Lei Yang, and Li Wang. Robustness-aware 3d object detection in autonomous driving: A review and outlook. IEEE Transactions on Intelligent Transportation Systems, 25(11):15407–15436, 2024

  7. [7]

Unleashing VLA potentials in autonomous driving via explicit learning from failures. arXiv preprint arXiv:2603.01063, 2026

    Yuechen Luo, Qimao Chen, Fang Li, Shaoqing Xu, Jaxin Liu, Ziying Song, Zhi-xin Yang, and Fuxi Wen. Unleashing vla potentials in autonomous driving via explicit learning from failures. arXiv preprint arXiv:2603.01063, 2026

  8. [8]

    Planning-oriented autonomous driving

Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wenhai Wang, Lewei Lu, Xiaosong Jia, Qiang Liu, Jifeng Dai, Yu Qiao, and Hongyang Li. Planning-oriented autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 17853–17862, June 2023

  9. [9]

    Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving

Bencheng Liao, Shaoyu Chen, Haoran Yin, Bo Jiang, Cheng Wang, Sixu Yan, Xinbang Zhang, Xiangyu Li, Ying Zhang, Qian Zhang, et al. Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 12037–12047, 2025

  10. [10]

    Sparsedrive: End-to-end autonomous driving via sparse scene representation

Wenchao Sun, Xuewu Lin, Yining Shi, Chuang Zhang, Haoran Wu, and Sifa Zheng. Sparsedrive: End-to-end autonomous driving via sparse scene representation. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 8795–8801. IEEE, 2025

  11. [11]

Goalflow: Goal-driven flow matching for multimodal trajectories generation in end-to-end autonomous driving

Zebin Xing, Xingyu Zhang, Yang Hu, Bo Jiang, Tong He, Qian Zhang, Xiaoxiao Long, and Wei Yin. Goalflow: Goal-driven flow matching for multimodal trajectories generation in end-to-end autonomous driving. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 1602–1611, 2025

  12. [12]

    GuideFlow: Constraint-guided flow matching for planning in end-to-end autonomous driving

Lin Liu, Caiyan Jia, Guanyi Yu, Ziying Song, JunQiao Li, Feiyang Jia, Peiliang Wu, Xiaoshuai Hao, and Yandan Luo. Guideflow: Constraint-guided flow matching for planning in end-to-end autonomous driving. arXiv preprint arXiv:2511.18729, 2025

  13. [13]

    FocalAD: Local Motion Planning for End-to-End Autonomous Driving

Bin Sun, Boao Zhang, Jiayi Lu, Xinjie Feng, Jiachen Shang, Rui Cao, Mengchao Zheng, Chuanye Wang, Shichun Yang, Yaoguang Cao, et al. Focalad: Local motion planning for end-to-end autonomous driving. arXiv preprint arXiv:2506.11419, 2025

  14. [14]

    Don’t shake the wheel: Momentum-aware planning in end-to-end autonomous driving

Ziying Song, Caiyan Jia, Lin Liu, Hongyu Pan, Yongchang Zhang, Junming Wang, Xingyu Zhang, Shaoqing Xu, Lei Yang, and Yadan Luo. Don’t shake the wheel: Momentum-aware planning in end-to-end autonomous driving. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 22432–22441, 2025

  15. [15]

Minddrive: An all-in-one framework bridging world models and vision-language model for end-to-end autonomous driving. arXiv preprint arXiv:2512.04441, 2025

Bin Sun, Yaoguang Cao, Yan Wang, Rui Wang, Jiachen Shang, Xiejie Feng, Jiayi Lu, Jia Shi, Shichun Yang, Xiaoyu Yan, et al. Minddrive: An all-in-one framework bridging world models and vision-language model for end-to-end autonomous driving. arXiv preprint arXiv:2512.04441, 2025

  16. [16]

    VADv2: End-to-End Vectorized Autonomous Driving via Probabilistic Planning

Shaoyu Chen, Bo Jiang, Hao Gao, Bencheng Liao, Qing Xu, Qian Zhang, Chang Huang, Wenyu Liu, and Xinggang Wang. Vadv2: End-to-end vectorized autonomous driving via probabilistic planning. arXiv preprint arXiv:2402.13243, 2024

  17. [17]

    ReCogDrive: A Reinforced Cognitive Framework for End-to-End Autonomous Driving

Yongkang Li, Kaixin Xiong, Xiangyu Guo, Fang Li, Sixu Yan, Gangwei Xu, Lijun Zhou, Long Chen, Haiyang Sun, Bing Wang, et al. Recogdrive: A reinforced cognitive framework for end-to-end autonomous driving. arXiv preprint arXiv:2506.08052, 2025

  18. [18]

Enhancing end-to-end autonomous driving with latent world model. arXiv preprint arXiv:2406.08481, 2024

Yingyan Li, Lue Fan, Jiawei He, Yuqi Wang, Yuntao Chen, Zhaoxiang Zhang, and Tieniu Tan. Enhancing end-to-end autonomous driving with latent world model. arXiv preprint arXiv:2406.08481, 2024

  19. [19]

    World4drive: End-to-end autonomous driving via intention-aware physical latent world model

Yupeng Zheng, Pengxuan Yang, Zebin Xing, Qichao Zhang, Yuhang Zheng, Yinfeng Gao, Pengfei Li, Teng Zhang, Zhongpu Xia, Peng Jia, et al. World4drive: End-to-end autonomous driving via intention-aware physical latent world model. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 28632–28642, 2025

  20. [20]

    Worldrft: Latent world model planning with reinforcement fine-tuning for autonomous driving

Pengxuan Yang, Ben Lu, Zhongpu Xia, Chao Han, Yinfeng Gao, Teng Zhang, Kun Zhan, XianPeng Lang, Yupeng Zheng, and Qichao Zhang. Worldrft: Latent world model planning with reinforcement fine-tuning for autonomous driving. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 11649–11657, 2026

  21. [21]

DriveWorld-VLA: Unified latent-space world modeling with vision-language-action for autonomous driving. arXiv preprint arXiv:2602.06521, 2026

Lin Liu, Ziying Song, Caiyan Jia, Hangjun Ye, Xiaoshuai Hao, Long Chen, et al. Driveworld-vla: Unified latent-space world modeling with vision-language-action for autonomous driving. arXiv preprint arXiv:2602.06521, 2026

  22. [22]

DriveVLA-W0: World models amplify data scaling law in autonomous driving. arXiv preprint arXiv:2510.12796, 2025

Yingyan Li, Shuyao Shang, Weisong Liu, Bing Zhan, Haochen Wang, Yuqi Wang, Yuntao Chen, Xiaoman Wang, Yasong An, Chufeng Tang, et al. Drivevla-w0: World models amplify data scaling law in autonomous driving. arXiv preprint arXiv:2510.12796, 2025

  23. [23]

DriveLaW: Unifying Planning and Video Generation in a Latent Driving World

Tianze Xia, Yongkang Li, Lijun Zhou, Jingfeng Yao, Kaixin Xiong, Haiyang Sun, Bing Wang, Kun Ma, Guang Chen, Hangjun Ye, et al. Drivelaw: Unifying planning and video generation in a latent driving world. arXiv preprint arXiv:2512.23421, 2025

  24. [24]

    Driveworld: 4d pre-trained scene understanding via world models for autonomous driving

Chen Min, Dawei Zhao, Liang Xiao, Jian Zhao, Xinli Xu, Zheng Zhu, Lei Jin, Jianshu Li, Yulan Guo, Junliang Xing, et al. Driveworld: 4d pre-trained scene understanding via world models for autonomous driving. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15522–15533, 2024

  25. [25]

    GAIA-1: A Generative World Model for Autonomous Driving

    Anthony Hu, Lloyd Russell, Hudson Yeo, Zak Murez, George Fedoseev, Alex Kendall, Jamie Shotton, and Gianluca Corrado. Gaia-1: A generative world model for autonomous driving. arXiv preprint arXiv:2309.17080, 2023

  26. [26]

Drivedreamer: Towards real-world-drive world models for autonomous driving

Xiaofeng Wang, Zheng Zhu, Guan Huang, Xinze Chen, Jiagang Zhu, and Jiwen Lu. Drivedreamer: Towards real-world-drive world models for autonomous driving. In European conference on computer vision, pages 55–72. Springer, 2024

  27. [27]

    Vista: A generalizable driving world model with high fidelity and versatile controllability.Advances in Neural Information Processing Systems, 37:91560–91596, 2024

    Shenyuan Gao, Jiazhi Yang, Li Chen, Kashyap Chitta, Yihang Qiu, Andreas Geiger, Jun Zhang, and Hongyang Li. Vista: A generalizable driving world model with high fidelity and versatile controllability.Advances in Neural Information Processing Systems, 37:91560–91596, 2024

  28. [28]

    Epona: Autoregressive diffusion world model for autonomous driving

Kaiwen Zhang, Zhenyu Tang, Xiaotao Hu, Xingang Pan, Xiaoyang Guo, Yuan Liu, Jingwei Huang, Li Yuan, Qian Zhang, Xiao-Xiao Long, et al. Epona: Autoregressive diffusion world model for autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 27220–27230, 2025

  29. [29]

    ReSim: Reliable World Simulation for Autonomous Driving

Jiazhi Yang, Kashyap Chitta, Shenyuan Gao, Long Chen, Yuqian Shao, Xiaosong Jia, Hongyang Li, Andreas Geiger, Xiangyu Yue, and Li Chen. Resim: Reliable world simulation for autonomous driving. arXiv preprint arXiv:2506.09981, 2025

  30. [30]

Consisdrive: Identity-preserving driving world models for video generation by instance mask. arXiv preprint arXiv:2602.03213, 2026

Zhuoran Yang and Yanyong Zhang. Consisdrive: Identity-preserving driving world models for video generation by instance mask. arXiv preprint arXiv:2602.03213, 2026

  31. [31]

    Driving into the future: Multiview visual forecasting and planning with world model for autonomous driving

Yuqi Wang, Jiawei He, Lue Fan, Hongxin Li, Yuntao Chen, and Zhaoxiang Zhang. Driving into the future: Multiview visual forecasting and planning with world model for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14749–14759, 2024

  32. [32]

Adriver-i: A general world model for autonomous driving

Fan Jia, Weixin Mao, Yingfei Liu, Yucheng Zhao, Yuqing Wen, Chi Zhang, Xiangyu Zhang, and Tiancai Wang. Adriver-i: A general world model for autonomous driving. arXiv preprint arXiv:2311.13549, 2023

  33. [33]

    Adawm: Adaptive world model based planning for autonomous driving

    Hang Wang, Xin Ye, Feng Tao, Chenbin Pan, Abhirup Mallik, Burhaneddin Yaman, Liu Ren, and Junshan Zhang. Adawm: Adaptive world model based planning for autonomous driving. arXiv preprint arXiv:2501.13072, 2025

  34. [34]

    Occworld: Learning a 3d occupancy world model for autonomous driving

Wenzhao Zheng, Weiliang Chen, Yuanhui Huang, Borui Zhang, Yueqi Duan, and Jiwen Lu. Occworld: Learning a 3d occupancy world model for autonomous driving. In European conference on computer vision, pages 55–72. Springer, 2024

  35. [35]

    Driving in the occupancy world: Vision-centric 4d occupancy forecasting and planning via world models for autonomous driving

Yu Yang, Jianbiao Mei, Yukai Ma, Siliang Du, Wenqing Chen, Yijie Qian, Yuxiang Feng, and Yong Liu. Driving in the occupancy world: Vision-centric 4d occupancy forecasting and planning via world models for autonomous driving. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 9327–9335, 2025

  36. [36]

Occsora: 4d occupancy generation models as world simulators for autonomous driving

    Lening Wang, Wenzhao Zheng, Yilong Ren, Han Jiang, Zhiyong Cui, Haiyang Yu, and Jiwen Lu. Occsora: 4d occupancy generation models as world simulators for autonomous driving. arXiv preprint arXiv:2405.20337, 2024

  37. [37]

Occllama: An occupancy-language-action generative world model for autonomous driving

    Julong Wei, Shanshuai Yuan, Pengfei Li, Qingda Hu, Zhongxue Gan, and Wenchao Ding. Occllama: An occupancy-language-action generative world model for autonomous driving. arXiv preprint arXiv:2409.03272, 2024

  38. [38]

Bevworld: A multimodal world model for autonomous driving via unified bev latent space

    Yumeng Zhang, Shi Gong, Kaixin Xiong, Xiaoqing Ye, Xiaofan Li, Xiao Tan, Fan Wang, Jizhou Huang, Hua Wu, and Haifeng Wang. Bevworld: A multimodal world simulator for autonomous driving via scene-level bev latents.arXiv preprint arXiv:2407.05679, 2024

  39. [39]

    End-to-end driving with online trajectory evaluation via bev world model

Yingyan Li, Yuqi Wang, Yang Liu, Jiawei He, Lue Fan, and Zhaoxiang Zhang. End-to-end driving with online trajectory evaluation via bev world model. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 27137–27146, 2025

  40. [40]

Transfuser: Imitation with transformer-based sensor fusion for autonomous driving. IEEE transactions on pattern analysis and machine intelligence, 45(11):12878–12895, 2022

Kashyap Chitta, Aditya Prakash, Bernhard Jaeger, Zehao Yu, Katrin Renz, and Andreas Geiger. Transfuser: Imitation with transformer-based sensor fusion for autonomous driving. IEEE transactions on pattern analysis and machine intelligence, 45(11):12878–12895, 2022

  41. [41]

    Vad: Vectorized scene representation for efficient autonomous driving

Bo Jiang, Shaoyu Chen, Qing Xu, Bencheng Liao, Jiajie Chen, Helong Zhou, Qian Zhang, Wenyu Liu, Chang Huang, and Xinggang Wang. Vad: Vectorized scene representation for efficient autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8340–8350, 2023

  42. [42]

M2da: Multi-modal fusion transformer incorporating driver attention for autonomous driving. arXiv preprint arXiv:2403.12552, 2024

Dongyang Xu, Haokun Li, Qingfan Wang, Ziying Song, Lei Chen, and Hanming Deng. M2da: Multi-modal fusion transformer incorporating driver attention for autonomous driving. arXiv preprint arXiv:2403.12552, 2024

  43. [43]

End-to-end autonomous driving without costly modularization and 3d manual annotation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

Mingzhe Guo, Zhipeng Zhang, Yuan He, Ke Wang, Liping Jing, and Haibin Ling. End-to-end autonomous driving without costly modularization and 3d manual annotation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

  44. [44]

    Hydra-MDP: End-to-end Multimodal Planning with Multi-target Hydra-Distillation

Zhenxin Li, Kailin Li, Shihao Wang, Shiyi Lan, Zhiding Yu, Yishen Ji, Zhiqi Li, Ziyue Zhu, Jan Kautz, Zuxuan Wu, et al. Hydra-mdp: End-to-end multimodal planning with multi-target hydra-distillation. arXiv preprint arXiv:2406.06978, 2024

  45. [45]

Fully unified motion planning for end-to-end autonomous driving. arXiv preprint arXiv:2504.12667, 2025

Lin Liu, Caiyan Jia, Ziying Song, Hongyu Pan, Bencheng Liao, Wenchao Sun, Yongchang Zhang, Lei Yang, Yandan Luo, et al. Fully unified motion planning for end-to-end autonomous driving. arXiv preprint arXiv:2504.12667, 2025

  46. [46]

Genad: Generative end-to-end autonomous driving

Wenzhao Zheng, Ruiqi Song, Xianda Guo, Chenming Zhang, and Long Chen. Genad: Generative end-to-end autonomous driving. In European Conference on Computer Vision, pages 87–104. Springer, 2024

  47. [47]

    Bridging past and future: End-to-end autonomous driving with historical prediction and planning

Bozhou Zhang, Nan Song, Xin Jin, and Li Zhang. Bridging past and future: End-to-end autonomous driving with historical prediction and planning. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 6854–6863, 2025

  48. [48]

Reinforced refinement with self-aware expansion for end-to-end autonomous driving. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2026

Haochen Liu, Tianyu Li, Haohan Yang, Li Chen, Caojun Wang, Ke Guo, Haochen Tian, Hongchen Li, Hongyang Li, and Chen Lv. Reinforced refinement with self-aware expansion for end-to-end autonomous driving. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2026

  49. [49]

Diver: Reinforced diffusion breaks imitation bottlenecks in end-to-end autonomous driving. arXiv preprint arXiv:2507.04049, 2025

Ziying Song, Lin Liu, Hongyu Pan, Bencheng Liao, Mingzhe Guo, Lei Yang, Yongchang Zhang, Shaoqing Xu, Caiyan Jia, and Yadan Luo. Diver: Reinforced diffusion breaks imitation bottlenecks in end-to-end autonomous driving. arXiv preprint arXiv:2507.04049, 2025

  50. [50]

Generalized trajectory scoring for end-to-end multimodal planning. arXiv preprint arXiv:2506.06664, 2025

Zhenxin Li, Wenhao Yao, Zi Wang, Xinglong Sun, Joshua Chen, Nadine Chang, Maying Shen, Zuxuan Wu, Shiyi Lan, and Jose M Alvarez. Generalized trajectory scoring for end-to-end multimodal planning. arXiv preprint arXiv:2506.06664, 2025

  51. [51]

Openscene: The largest up-to-date 3d occupancy prediction benchmark in autonomous driving. https://github.com/OpenDriveLab/OpenScene, 2023

OpenScene Contributors. Openscene: The largest up-to-date 3d occupancy prediction benchmark in autonomous driving. https://github.com/OpenDriveLab/OpenScene, 2023

  52. [52]

    NuPlan: A closed-loop ML-based planning benchmark for autonomous vehicles

Holger Caesar, Juraj Kabzan, Kok Seang Tan, Whye Kit Fong, Eric Wolff, Alex Lang, Luke Fletcher, Oscar Beijbom, and Sammy Omari. Nuplan: A closed-loop ml-based planning benchmark for autonomous vehicles, 2022. URL https://arxiv.org/abs/2106.11810

  53. [53]

Pseudo-simulation for autonomous driving. arXiv preprint arXiv:2506.04218, 2025

    Wei Cao, Marcel Hallgarten, Tianyu Li, Daniel Dauner, Xunjiang Gu, Caojun Wang, Yakov Miron, Marco Aiello, Hongyang Li, Igor Gilitschenski, et al. Pseudo-simulation for autonomous driving.arXiv preprint arXiv:2506.04218, 2025

  54. [54]

Senna: Bridging large vision-language models and end-to-end autonomous driving. arXiv preprint arXiv:2410.22313, 2024

Bo Jiang, Shaoyu Chen, Bencheng Liao, Xingyu Zhang, Wei Yin, Qian Zhang, Chang Huang, Wenyu Liu, and Xinggang Wang. Senna: Bridging large vision-language models and end-to-end autonomous driving. arXiv preprint arXiv:2410.22313, 2024

  55. [55]

Drivesuprim: Towards precise trajectory selection for end-to-end planning

Wenhao Yao, Zhenxin Li, Shiyi Lan, Zi Wang, Xinglong Sun, Jose M. Alvarez, and Zuxuan Wu. Drivesuprim: Towards precise trajectory selection for end-to-end planning. arXiv preprint arXiv:2506.06659, 2025

  56. [56]

Zhenxin Li, Wenhao Yao, Zi Wang, Xinglong Sun, Joshua Chen, Nadine Chang, Maying Shen, Zuxuan Wu, Shiyi Lan, and Jose M. Alvarez. Ztrs: Zero-imitation end-to-end autonomous driving with trajectory scoring. arXiv preprint arXiv:2510.24108, 2025

  57. [57]

H. Tian, T. Li, H. Liu, J. Yang, Y. Qiu, G. Li, J. Wang, Y. Gao, Z. Zhang, et al. Simscale: Learning to drive via real-world simulation at scale. arXiv preprint arXiv:2511.23369, 2025

  58. [58]

Driving on registers. arXiv preprint arXiv:2601.05083, 2026

Ellington Kirby, Alexandre Boulch, Yihong Xu, Yuan Yin, Gilles Puy, Eloi Zablocki, Andrei Bursuc, Spyros Gidaris, Renaud Marlet, Florent Bartoccioni, Anh-Quan Cao, Nermin Samet, Tuan-Hung Vu, and Matthieu Cord. Driving on registers. arXiv preprint arXiv:2601.05083, 2026

  59. [59]

    SpanVLA: Efficient Action Bridging and Learning from Negative-Recovery Samples for Vision-Language-Action Model

Zewei Zhou, Ruining Yang, Xuewei Qi, Yiluan Guo, Sherry X. Chen, Tao Feng, Kateryna Pistunova, Yishan Shen, et al. Spanvla: Efficient action bridging and learning from negative-recovery samples for vision-language-action model. arXiv preprint arXiv:2604.19710, 2026

  60. [60]

Diffvla: Vision-language guided diffusion planning for autonomous driving. arXiv preprint arXiv:2505.19381, 2025

Anqing Jiang, Yu Gao, Zhigang Sun, Yiru Wang, Jijun Wang, Jinghao Chai, Qian Cao, Yuweng Heng, Hao Jiang, Zongzheng Zhang, Xianda Guo, Hao Sun, and Hao Zhao. Diffvla: Vision-language guided diffusion planning for autonomous driving. arXiv preprint arXiv:2505.19381, 2025

  61. [61]

Kailin Li, Zhenxin Li, Shiyi Lan, Y. Xie, Z. Zhang, J. Liu, Z. Wu, Z. Yu, and Jose M. Alvarez. Hydra-mdp++: Advancing end-to-end driving via expert-guided hydra-distillation. arXiv e-prints, 2025

  62. [62]

    Artemis: Autoregressive end-to-end trajectory planning with mixture of experts for autonomous driving

Renju Feng, Ning Xi, Duanfeng Chu, Rukang Wang, Zejian Deng, Anzheng Wang, Liping Lu, Jinxiang Wang, Yanjun Huang, et al. Artemis: Autoregressive end-to-end trajectory planning with mixture of experts for autonomous driving. arXiv preprint arXiv:2504.19580, 2025

  63. [63]

DiffusionDriveV2: Reinforcement learning-constrained truncated diffusion modeling in end-to-end autonomous driving. arXiv preprint arXiv:2512.07745, 2025

Jialv Zou, Shaoyu Chen, Bencheng Liao, Zhiyu Zheng, Yuehao Song, Lefei Zhang, Qian Zhang, Wenyu Liu, et al. Diffusiondrivev2: Reinforcement learning-constrained truncated diffusion modeling in end-to-end autonomous driving. arXiv preprint arXiv:2512.07745, 2025

  64. [64]

    Latent-wam: Latent world action modeling for end-to-end autonomous driving

Linbo Wang, Yupeng Zheng, Qiang Chen, Shiwei Li, Yichen Zhang, Zebin Xing, Qichao Zhang, Xiang Li, Deheng Qian, et al. Latent-wam: Latent world action modeling for end-to-end autonomous driving. arXiv preprint arXiv:2603.24581, 2026

  65. [65]

    PARA-Drive: Parallelized architecture for real-time autonomous driving

Xinshuo Weng et al. PARA-Drive: Parallelized architecture for real-time autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

  66. [66]

C. Yuan, Z. Zhang, J. Sun, S. Sun, Z. Huang, C. D. W. Lee, D. Li, Y. Han, A. Wong, K. P. Tee, et al. Drama: An efficient end-to-end motion planner for autonomous driving with mamba. In International Symposium on Robotics Research, 2024

  67. [67]

    AutoVLA: A Vision-Language-Action Model for End-to-End Autonomous Driving with Adaptive Reasoning and Reinforcement Fine-Tuning

Zewei Zhou, Tianhui Cai, Seth Z. Zhao, Yun Zhang, Zhiyu Huang, Bolei Zhou, and Jiaqi Ma. Autovla: A vision-language-action model for end-to-end autonomous driving with adaptive reasoning and reinforcement fine-tuning. arXiv preprint arXiv:2506.13757, 2025

  68. [68]

Transfuser: Imitation with transformer-based sensor fusion for autonomous driving

    Kashyap Chitta, Aditya Prakash, Bernhard Jaeger, Zehao Yu, Katrin Renz, and Andreas Geiger. Transfuser: Imitation with transformer-based sensor fusion for autonomous driving, 2022. URL https://arxiv.org/abs/2205.15997

  69. [69]

Sparsedrivev2: Scoring is all you need for end-to-end autonomous driving

Wenchao Sun, Xuewu Lin, Keyu Chen, Zixiang Pei, Xiang Li, Yining Shi, and Sifa Zheng. Sparsedrivev2: Scoring is all you need for end-to-end autonomous driving. arXiv preprint arXiv:2603.29163, 2026