
arxiv: 2512.23421 · v3 · submitted 2025-12-29 · 💻 cs.CV

DriveLaW:Unifying Planning and Video Generation in a Latent Driving World

Pith reviewed 2026-05-16 19:31 UTC · model grok-4.3

classification 💻 cs.CV
keywords autonomous driving · world models · video generation · motion planning · diffusion planner · latent representations · trajectory planning · NAVSIM benchmark

The pith

DriveLaW unifies video generation and motion planning by injecting the video model's latent representations directly into a diffusion planner.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes DriveLaW to combine the tasks of generating future driving videos and planning vehicle trajectories within a single system. It achieves this by taking the internal latent codes produced by its video generation component and feeding them straight into the planning module. This direct connection is meant to guarantee that the imagined futures align with the chosen actions without extra adjustments. A sympathetic reader would care because current approaches treat prediction and planning separately, which can lead to inconsistencies when handling rare real-world driving scenarios.

Core claim

DriveLaW unifies planning and video generation in a latent driving world by directly injecting the latent representation from its video generator into the planner. This ensures inherent consistency between high-fidelity future generation and reliable trajectory planning. The system includes DriveLaW-Video for high-fidelity forecasting with expressive latents and DriveLaW-Act as a diffusion planner that uses those latents to generate trajectories, trained via a three-stage progressive strategy, leading to state-of-the-art performance on video prediction and the NAVSIM planning benchmark.

What carries the argument

Direct injection of the latent representation from the DriveLaW-Video generator into the DriveLaW-Act diffusion planner.
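
The mechanism named above can be pictured with a toy sketch: a diffusion sampler for waypoints whose noise predictor receives a video latent at every denoising step. Everything here is an assumption made for illustration, not the paper's architecture. The function names, the random-weight "network", the linear noise schedule, and the latent and trajectory sizes are all hypothetical, and a standard DDPM reverse step stands in for whatever sampler DriveLaW-Act actually uses.

```python
import numpy as np

def make_toy_denoiser(latent_dim, traj_dim, seed=0):
    """Return a toy noise predictor eps(x_t, t, z) with fixed random weights.

    Hypothetical stand-in for a learned planner network, not the paper's model.
    """
    rng = np.random.default_rng(seed)
    w_x = rng.normal(scale=0.1, size=(traj_dim, traj_dim))
    w_z = rng.normal(scale=0.1, size=(latent_dim, traj_dim))

    def eps_model(x_t, t, z):
        # The video latent z enters every denoising step as conditioning.
        return np.tanh(x_t @ w_x + z @ w_z + 0.01 * t)

    return eps_model

def sample_trajectory(eps_model, z, horizon=8, steps=50, seed=0):
    """DDPM-style ancestral sampling of (x, y) waypoints conditioned on z."""
    rng = np.random.default_rng(seed)
    traj_dim = horizon * 2
    betas = np.linspace(1e-4, 0.02, steps)  # assumed linear schedule
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)

    x = rng.normal(size=traj_dim)  # start from pure noise
    for t in reversed(range(steps)):
        eps = eps_model(x, t, z)
        # Mean of the reverse step given the predicted noise.
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:
            x = x + np.sqrt(betas[t]) * rng.normal(size=traj_dim)
    return x.reshape(horizon, 2)
```

With `eps = make_toy_denoiser(16, 16)`, calling `sample_trajectory(eps, z)` for two different latents `z` yields two different waypoint sets from the same starting noise. That dependence on the latent is the only property the sketch is meant to show.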

If this is right

  • Video prediction improves by 33.3% in FID and 1.8% in FVD over prior best-performing work.
  • The system sets a new record on the NAVSIM planning benchmark.
  • High-fidelity future generation and trajectory planning become inherently consistent without separate alignment steps.
  • The unified approach addresses long-tail challenges in autonomous driving through integrated world modeling.
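
To make the percentage claims concrete, the sketch below computes a relative improvement for a lower-is-better metric such as FID or FVD. It assumes the reported gains are relative reductions against the prior best score, which the summary does not spell out. The helper name and the example scores are hypothetical and are not taken from the paper.

```python
def relative_improvement(baseline, new):
    """Fractional reduction of a lower-is-better metric (e.g. FID, FVD)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - new) / baseline

# Hypothetical scores, chosen only to illustrate the arithmetic.
example = relative_improvement(baseline=6.0, new=4.0)
print(f"{example:.1%}")  # prints 33.3%
```

Under this reading, a 33.3% gain means the new score is two thirds of the baseline, while a 1.8% gain leaves the score at 98.2% of the baseline.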

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The three-stage progressive training may allow balancing the video and planning objectives more effectively than joint end-to-end training.
  • This latent-injection pattern could apply to other sequential decision domains where future prediction must stay aligned with control outputs.
  • Removing the need for post-hoc consistency checks between modules could simplify deployment pipelines in real vehicles.

Load-bearing premise

The latent representations produced by the video generator are sufficient by themselves to drive a diffusion planner that produces reliable trajectories without loss of performance in either component.

What would settle it

An experiment showing that a planner trained without the injected video latents achieves equal or better trajectory quality on the NAVSIM benchmark would falsify the need for the direct unification.
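
One way such a comparison could be scored is a paired bootstrap over per-scenario planning scores, sketched below. It assumes both planners are evaluated on the same scenarios and that a higher score is better. The function name and inputs are hypothetical, and this is not an evaluation protocol described by the paper or by NAVSIM.

```python
import numpy as np

def paired_bootstrap_ci(scores_with, scores_without, n_boot=2000, alpha=0.05, seed=0):
    """Bootstrap CI for the mean per-scenario score difference (with - without)."""
    a = np.asarray(scores_with, dtype=float)
    b = np.asarray(scores_without, dtype=float)
    if a.shape != b.shape or a.ndim != 1:
        raise ValueError("expected two 1-D arrays of equal length")
    diff = a - b
    rng = np.random.default_rng(seed)
    # Resample scenarios with replacement, keeping each pair together.
    idx = rng.integers(0, diff.size, size=(n_boot, diff.size))
    boot_means = diff[idx].mean(axis=1)
    lo, hi = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return diff.mean(), lo, hi
```

If the interval sits clearly above zero, the latent-conditioned planner is ahead. An interval that covers zero or sits below it would be the outcome that counts against the need for direct injection.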

Figures

Figures reproduced from arXiv: 2512.23421 by Bing Wang, Guang Chen, Haiyang Sun, Hangjun Ye, Jingfeng Yao, Kaixin Xiong, Kun Ma, Lijun Zhou, Tianze Xia, Wenyu Liu, Xinggang Wang, Yongkang Li.

Figure 1
Figure 2: Overview of the overall architecture of DriveLaW. The model first encodes historical observations (images, actions) into a unified latent world representation through a powerful video diffusion model. In order to improve the generation quality, we introduced the Noise Reinjection mechanism to explore and select the optimal generation path in the early stage of denoising. The denoised video latents produced…
Figure 3: Restoring Structural and Temporal Consistency via Noise Reinjection. This comparison highlights the impact of our method. The baseline generation shows significant degradation, including (a) blurring, (b) structural inconsistency, and (c) artifacts. By integrating noise reinjection, our model preserves sharp details, maintains object structures, and produces clean, artifact-free frames, demonstrating a cru…
Figure 4: Qualitative Comparison with state-of-the-art driving world model. We compare DriveLaW with Epona [80] on nuScenes validation set. DriveLaW generates videos with (1) clearer vehicle details and more stable structural integrity, (2) well-preserved pedestrian shapes that remain easily identifiable, and (3) correct recognition and maintenance of inconspicuous objects (e.g., the yellow van), demonstrating super…
Figure 5: Qualitative analysis of latent representations. We visualize the quality of latent representations from three different feature sources: BEV features extracted from BEVFormer [45]’s ResNet-101 backbone, VLM features from the pretrained Qwen2.5-VL model in ReCogDrive [44], and VGM (Video Generation Model) features from our DriveLaW-Video. To enable visual comparison, we apply PCA to reduce each representati…
Figure 6: Qualitative examples of DriveLaW video generation on the nuScenes dataset. (a) Conventional urban driving scenarios, showing stable lane keeping and interactions with surrounding traffic. (b) Complex urban driving scenarios involving dense multi-agent interactions, turning maneuvers, and occlusions. (c) Night driving scenarios, demonstrating the model’s robustness to low-light conditions while preserving t…
Figure 7: Qualitative results on the Navtest benchmark.
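
The Figure 5 caption mentions applying PCA to feature maps so they can be compared visually. A minimal sketch of that kind of projection follows. It assumes a single (H, W, C) feature map with at least three channels and a reduction to three components shown as RGB. The caption is truncated before it states the target dimension, so the choice of three, the min-max scaling, and the function name are all assumptions.

```python
import numpy as np

def pca_to_rgb(features):
    """Project an (H, W, C) feature map to (H, W, 3) values in [0, 1] via PCA."""
    h, w, c = features.shape
    flat = features.reshape(-1, c).astype(float)
    flat = flat - flat.mean(axis=0, keepdims=True)  # center each channel
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(flat, full_matrices=False)
    proj = flat @ vt[:3].T
    # Min-max scale each component so it can be displayed as a color channel.
    lo = proj.min(axis=0, keepdims=True)
    span = proj.max(axis=0, keepdims=True) - lo
    proj = (proj - lo) / np.where(span > 0, span, 1.0)
    return proj.reshape(h, w, 3)
```

Regions that share a color in the output have similar leading components, which is the sense in which such plots are usually read as showing how well a representation separates scene structure.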
Original abstract

World models have become crucial for autonomous driving, as they learn how scenarios evolve over time to address the long-tail challenges of the real world. However, current approaches relegate world models to limited roles: they operate within ostensibly unified architectures that still keep world prediction and motion planning as decoupled processes. To bridge this gap, we propose DriveLaW, a novel paradigm that unifies video generation and motion planning. By directly injecting the latent representation from its video generator into the planner, DriveLaW ensures inherent consistency between high-fidelity future generation and reliable trajectory planning. Specifically, DriveLaW consists of two core components: DriveLaW-Video, our powerful world model that generates high-fidelity forecasting with expressive latent representations, and DriveLaW-Act, a diffusion planner that generates consistent and reliable trajectories from the latent of DriveLaW-Video, with both components optimized by a three-stage progressive training strategy. The power of our unified paradigm is demonstrated by new state-of-the-art results across both tasks. DriveLaW not only advances video prediction significantly, surpassing best-performing work by 33.3% in FID and 1.8% in FVD, but also achieves a new record on the NAVSIM planning benchmark.
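
FID and FVD, the two generation metrics quoted in the abstract, are both Fréchet distances between Gaussians fitted to deep features of real and generated samples. The sketch below computes that distance for two feature arrays with NumPy alone. The feature extractor is deliberately left out (FID conventionally uses image-network features and FVD uses video-network features), and the function name and the requirement of at least two feature dimensions are assumptions of this example rather than details from the paper.

```python
import numpy as np

def frechet_distance(feats_a, feats_b):
    """Squared Fréchet distance between Gaussians fitted to two feature sets.

    feats_a, feats_b: arrays of shape (n_samples, dim) with dim >= 2.
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((cov_a cov_b)^(1/2)) equals the sum of square roots of the eigenvalues
    # of cov_a @ cov_b, which are real and non-negative for PSD inputs.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)
```

Lower is better: identical feature sets give a distance of zero, and shifting every feature by a constant adds the squared length of that shift.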

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript proposes DriveLaW, a unified framework that integrates video generation and motion planning for autonomous driving. It features DriveLaW-Video as a world model producing high-fidelity forecasts and expressive latent representations, and DriveLaW-Act as a diffusion planner that generates trajectories by directly injecting these latents. The components are trained progressively in three stages, yielding state-of-the-art performance on video prediction metrics (FID improved by 33.3%, FVD by 1.8%) and the NAVSIM planning benchmark.

Significance. Should the unification via latent injection prove robust, the work would represent a meaningful step toward consistent world models in driving, potentially reducing discrepancies between predicted futures and planned actions. The empirical SOTA claims on both generation and planning tasks suggest broad applicability if the evidence holds under scrutiny.

major comments (1)
  1. [Abstract] Abstract: The central unification claim—that direct injection of the latent representation from DriveLaW-Video into the diffusion planner DriveLaW-Act ensures inherent consistency without performance loss—rests on an unexamined assumption that the latents are information-complete for reliable trajectory generation. No ablation isolating the latent-only case, no analysis of compression loss, and no comparison against additional conditioning (e.g., ego-state or map features) are supplied, leaving the sufficiency of the latent representation unsupported.
minor comments (1)
  1. Title: missing space after colon ('DriveLaW:Unifying' should read 'DriveLaW: Unifying').

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive review and the recommendation for major revision. We address the single major comment point-by-point below. The suggested additions will strengthen the empirical support for our unification claim, and we will incorporate them in the revised manuscript.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The central unification claim—that direct injection of the latent representation from DriveLaW-Video into the diffusion planner DriveLaW-Act ensures inherent consistency without performance loss—rests on an unexamined assumption that the latents are information-complete for reliable trajectory generation. No ablation isolating the latent-only case, no analysis of compression loss, and no comparison against additional conditioning (e.g., ego-state or map features) are supplied, leaving the sufficiency of the latent representation unsupported.

    Authors: We appreciate this observation. While the current results (SOTA video metrics and new NAVSIM record) are achieved precisely with latent-only injection into the diffusion planner, we agree that the manuscript would benefit from explicit verification of the latent representation's sufficiency. In the revised version we will add: (1) an ablation comparing latent-only conditioning against the full model, (2) quantitative analysis of information retention and compression loss in the latent space (e.g., reconstruction FID/FVD and dimensionality studies), and (3) controlled comparisons that augment the latent with ego-state or map features. These experiments will directly test whether the latents are information-complete for reliable planning. revision: yes

Circularity Check

0 steps flagged

No circularity detected in derivation chain

Full rationale

The provided abstract and context describe an architectural unification via latent injection and a three-stage training strategy, but contain no equations, parameter fits, or self-citations that reduce any claimed prediction or consistency result to an input by construction. The strongest claim (inherent consistency from direct latent injection) is presented as an empirical outcome of the model design rather than a definitional or fitted tautology. No load-bearing steps match the enumerated circularity patterns; the derivation remains self-contained against external benchmarks like NAVSIM and FID/FVD metrics.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review prevents identification of concrete free parameters, axioms, or invented entities; the three-stage training and latent-injection design are methodological choices whose details are not supplied.

pith-pipeline@v0.9.0 · 5556 in / 1094 out tokens · 39301 ms · 2026-05-16T19:31:33.714777+00:00 · methodology



Forward citations

Cited by 9 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Beyond Imitation: Learning Safe End-to-End Autonomous Driving from Hard Negatives

    cs.RO 2026-05 unverdicted novelty 6.0

    BeyondDrive augments imitation learning with synthesized safety-critical negative trajectories and a repulsive loss to improve safety in autonomous driving, reporting 89.7 PDMS on NAVSIMv1 and generalization to other models.

  2. Xiaomi Auto World Model: A Joint World Model Integrating Reconstruction and Generation for Autonomous Driving

    cs.CV 2026-05 unverdicted novelty 6.0

    Xiaomi EV World Model integrates WorldRec for sparse-query 3D Gaussian reconstruction and WorldGen for fast causal video generation via bidirectional pretraining and causal fine-tuning to support autonomous driving si...

  3. CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving

    cs.CV 2026-05 unverdicted novelty 6.0

    CoWorld-VLA encodes world information into four expert tokens that condition a diffusion-based planner, yielding competitive collision avoidance and trajectory accuracy on the NAVSIM benchmark.

  4. CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving

    cs.CV 2026-05 unverdicted novelty 6.0

    CoWorld-VLA extracts semantic, geometric, dynamic, and trajectory expert tokens from multi-source supervision and feeds them into a diffusion-based hierarchical planner, achieving competitive collision avoidance and t...

  5. DriveFuture: Future-Aware Latent World Models for Autonomous Driving

    cs.CV 2026-05 unverdicted novelty 6.0

    DriveFuture achieves SOTA results on NAVSIM by conditioning latent world model states on future predictions to directly inform trajectory planning.

  6. Video Generation Models as World Models: Efficient Paradigms, Architectures and Algorithms

    eess.IV 2026-03 unverdicted novelty 6.0

    Video generation models can function as world simulators if efficiency gaps in spatiotemporal modeling are bridged via organized paradigms, architectures, and algorithms.

  7. EponaV2: Driving World Model with Comprehensive Future Reasoning

    cs.CV 2026-05 unverdicted novelty 5.0

    EponaV2 advances perception-free driving world models by forecasting comprehensive future 3D geometry and semantic representations, achieving SOTA planning performance on NAVSIM benchmarks.

  8. CRAFT: Counterfactual-to-Interactive Reinforcement Fine-Tuning for Driving Policies

    cs.LG 2026-05 unverdicted novelty 5.0

    CRAFT is an on-policy RL fine-tuning framework that decomposes closed-loop policy gradients into a group-normalized counterfactual proxy plus residual correction from interaction events, achieving top closed-loop perf...

  9. RAD-2: Scaling Reinforcement Learning in a Generator-Discriminator Framework

    cs.CV 2026-04 unverdicted novelty 5.0

    RAD-2 uses a diffusion generator and RL discriminator to cut collision rates by 56% in closed-loop autonomous driving planning.

Reference graph

Works this paper leans on

85 extracted references · 85 canonical work pages · cited by 8 Pith papers · 26 internal anchors

  1. [1]

    Principal component analysis

    Hervé Abdi and Lynne J Williams. Principal component analysis. Wiley interdisciplinary reviews: computational statistics, 2(4):433–459, 2010.

  2. [2]

    World Simulation with Video Foundation Models for Physical AI

    Arslan Ali, Junjie Bai, Maciej Bala, Yogesh Balaji, Aaron Blakeman, Tiffany Cai, Jiaxin Cao, Tianshi Cao, Elizabeth Cha, Yu-Wei Chao, et al. World simulation with video foundation models for physical ai. arXiv preprint arXiv:2511.00062,

  3. [3]

    V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

    Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, et al. V-jepa 2: Self-supervised video models enable understanding, prediction and planning. arXiv preprint arXiv:2506.09985, 2025.

  4. [4]

    Vavim and vavam: Autonomous driving through video generative modeling

    Florent Bartoccioni, Elias Ramzi, Victor Besnier, Shashanka Venkataramanan, Tuan-Hung Vu, Yihong Xu, Loick Chambon, Spyros Gidaris, Serkan Odabas, David Hurych, et al. Vavim and vavam: Autonomous driving through video generative modeling. arXiv preprint arXiv:2502.15672, 2025.

  5. [5]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A vision-language-action flow model for general robot control. CoRR, abs/2410.24164, 2024. doi: 10.48550/arXiv.2410.24164.

  6. [6]

    Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023.

  7. [7]

    Video generation models as world simulators

    Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, et al. Video generation models as world simulators. OpenAI Blog, 1(8):1, 2024.

  8. [8]

    Genie: Generative interactive environments

    Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. Genie: Generative interactive environments. In Forty-first International Conference on Machine Learning, 2024.

  9. [9]

    nuscenes: A multimodal dataset for autonomous driving

    Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11621–11631, 2020.

  10. [10]

    NuPlan: A closed-loop ML-based planning benchmark for autonomous vehicles

    Holger Caesar, Juraj Kabzan, Kok Seang Tan, Whye Kit Fong, Eric Wolff, Alex Lang, Luke Fletcher, Oscar Beijbom, and Sammy Omari. nuplan: A closed-loop ml-based planning benchmark for autonomous vehicles. arXiv preprint arXiv:2106.11810, 2021.

  11. [11]

    PixArt-$\alpha$: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis

    Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, et al. Pixart-alpha: Fast training of diffusion transformer for photorealistic text-to-image synthesis. arXiv preprint arXiv:2310.00426, 2023.

  12. [12]

    VADv2: End-to-End Vectorized Autonomous Driving via Probabilistic Planning

    Shaoyu Chen, Bo Jiang, Hao Gao, Bencheng Liao, Qing Xu, Qian Zhang, Chang Huang, Wenyu Liu, and Xinggang Wang. Vadv2: End-to-end vectorized autonomous driving via probabilistic planning. arXiv preprint arXiv:2402.13243,

  13. [13]

    Drivinggpt: Unifying driving world modeling and planning with multi-modal autoregressive transformers

    Yuntao Chen, Yuqi Wang, and Zhaoxiang Zhang. Drivinggpt: Unifying driving world modeling and planning with multi-modal autoregressive transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 26890–26900, 2025.

  14. [14]

    Diffusion policy: Visuomotor policy learning via action diffusion

    Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, 44(10-11):1684–1704, 2025.

  15. [15]

    Transfuser: Imitation with transformer-based sensor fusion for autonomous driving

    Kashyap Chitta, Aditya Prakash, Bernhard Jaeger, Zehao Yu, Katrin Renz, and Andreas Geiger. Transfuser: Imitation with transformer-based sensor fusion for autonomous driving. IEEE transactions on pattern analysis and machine intelligence, 45(11):12878–12895, 2022.

  16. [16]

    Navsim: Data-driven non-reactive autonomous vehicle simulation and benchmarking

    Daniel Dauner, Marcel Hallgarten, Tianyu Li, Xinshuo Weng, Zhiyu Huang, Zetong Yang, Hongyang Li, Igor Gilitschenski, Boris Ivanovic, Marco Pavone, et al. Navsim: Data-driven non-reactive autonomous vehicle simulation and benchmarking. Advances in Neural Information Processing Systems, 37:28706–28719, 2024.

  17. [17]

    Rad: Training an end-to-end driving policy via large-scale 3dgs-based reinforcement learning

    Hao Gao, Shaoyu Chen, Bo Jiang, Bencheng Liao, Yiang Shi, Xiaoyang Guo, Yuechuan Pu, Haoran Yin, Xiangyu Li, Xinbang Zhang, et al. Rad: Training an end-to-end driving policy via large-scale 3dgs-based reinforcement learning. arXiv preprint arXiv:2502.13144, 2025.

  18. [18]

    Magicdrive: Street view generation with diverse 3d geometry control

    Ruiyuan Gao, Kai Chen, Enze Xie, Lanqing Hong, Zhenguo Li, Dit-Yan Yeung, and Qiang Xu. Magicdrive: Street view generation with diverse 3d geometry control. arXiv preprint arXiv:2310.02601, 2023.

  19. [19]

    Magicdrive-v2: High-resolution long video generation for autonomous driving with adaptive control

    Ruiyuan Gao, Kai Chen, Bo Xiao, Lanqing Hong, Zhenguo Li, and Qiang Xu. Magicdrive-v2: High-resolution long video generation for autonomous driving with adaptive control. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 28135–28144, 2025.

  20. [20]

    Vista: A generalizable driving world model with high fidelity and versatile controllability

    Shenyuan Gao, Jiazhi Yang, Li Chen, Kashyap Chitta, Yihang Qiu, Andreas Geiger, Jun Zhang, and Hongyang Li. Vista: A generalizable driving world model with high fidelity and versatile controllability. Advances in Neural Information Processing Systems, 37:91560–91596, 2024.

  21. [21]

    Dist-4d: Disentangled spatiotemporal diffusion with metric depth for 4d driving scene generation

    Jiazhe Guo, Yikang Ding, Xiwu Chen, Shuo Chen, Bohan Li, Yingshuang Zou, Xiaoyang Lyu, Feiyang Tan, Xiaojuan Qi, Zhiheng Li, et al. Dist-4d: Disentangled spatiotemporal diffusion with metric depth for 4d driving scene generation. arXiv preprint arXiv:2503.15208, 2025

  22. [22]

    Genesis: Multimodal driving scene generation with spatio-temporal and cross-modal consistency

    Xiangyu Guo, Zhanqian Wu, Kaixin Xiong, Ziyang Xu, Lijun Zhou, Gangwei Xu, Shaoqing Xu, Haiyang Sun, Bing Wang, Guang Chen, et al. Genesis: Multimodal driving scene generation with spatio-temporal and cross-modal consistency. arXiv preprint arXiv:2506.07497, 2025.

  23. [23]

    Are video models ready as zero-shot reasoners? an empirical study with the mme-cof benchmark

    Ziyu Guo, Xinyan Chen, Renrui Zhang, Ruichuan An, Yu Qi, Dongzhi Jiang, Xiangtai Li, Manyuan Zhang, Hongsheng Li, and Pheng-Ann Heng. Are video models ready as zero-shot reasoners? an empirical study with the mme-cof benchmark. arXiv preprint arXiv:2510.26802, 2025.

  24. [24]

    LTX-Video: Realtime Video Latent Diffusion

    Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richardson, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, et al. Ltx-video: Realtime video latent diffusion. arXiv preprint arXiv:2501.00103, 2024.

  25. [25]

    Flexible diffusion modeling of long videos

    William Harvey, Saeid Naderiparizi, Vaden Masrani, Christian Weilbach, and Frank Wood. Flexible diffusion modeling of long videos. Advances in neural information processing systems, 35:27953–27965, 2022.

  26. [26]

    Gem: A generalizable ego-vision multimodal world model for fine-grained ego-motion, object dynamics, and scene composition control

    Mariam Hassan, Sebastian Stapf, Ahmad Rahimi, Pedro Rezende, Yasaman Haghighi, David Brüggemann, Isinsu Katircioglu, Lin Zhang, Xiaoran Chen, Suman Saha, et al. Gem: A generalizable ego-vision multimodal world model for fine-grained ego-motion, object dynamics, and scene composition control. In Proceedings of the Computer Vision and Pattern Recognition C...

  27. [27]

    Gans trained by a two time-scale update rule converge to a local nash equilibrium

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017.

  28. [28]

    Imagen Video: High Definition Video Generation with Diffusion Models

    Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022.

  29. [29]

    GAIA-1: A Generative World Model for Autonomous Driving

    Anthony Hu, Lloyd Russell, Hudson Yeo, Zak Murez, George Fedoseev, Alex Kendall, Jamie Shotton, and Gianluca Corrado. Gaia-1: A generative world model for autonomous driving. arXiv preprint arXiv:2309.17080, 2023.

  30. [30]

    Drivingworld: Constructing world model for autonomous driving via video gpt

    Xiaotao Hu, Wei Yin, Mingkai Jia, Junyuan Deng, Xiaoyang Guo, Qian Zhang, Xiaoxiao Long, and Ping Tan. Drivingworld: Constructing world model for autonomous driving via video gpt. arXiv preprint arXiv:2412.19505, 2024.

  31. [31]

    Planning-oriented autonomous driving

    Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wenhai Wang, et al. Planning-oriented autonomous driving. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 17853–17862, 2023.

  32. [32]

    BEVDet: High-performance Multi-camera 3D Object Detection in Bird-Eye-View

    Junjie Huang, Guan Huang, Zheng Zhu, Yun Ye, and Dalong Du. Bevdet: High-performance multi-camera 3d object detection in bird-eye-view. arXiv preprint arXiv:2112.11790,

  33. [33]

    Diffuseslide: Training-free high frame rate video generation diffusion

    Geunmin Hwang, Hyun kyu Ko, Younghyun Kim, Seungryong Lee, and Eunbyung Park. Diffuseslide: Training-free high frame rate video generation diffusion, 2025.

  34. [34]

    Vad: Vectorized scene representation for efficient autonomous driving

    Bo Jiang, Shaoyu Chen, Qing Xu, Bencheng Liao, Jiajie Chen, Helong Zhou, Qian Zhang, Wenyu Liu, Chang Huang, and Xinggang Wang. Vad: Vectorized scene representation for efficient autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8340–8350, 2023.

  35. [35]

    Drivegan: Towards a controllable high-quality neural simulation

    Seung Wook Kim, Jonah Philion, Antonio Torralba, and Sanja Fidler. Drivegan: Towards a controllable high-quality neural simulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5820–5829,

  36. [36]

    HunyuanVideo: A Systematic Framework For Large Video Generative Models

    Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024.

  37. [37]

    Pyramidflow: High-resolution defect contrastive localization using pyramid normalizing flow

    Jiarui Lei, Xiaobo Hu, Yue Wang, and Dong Liu. Pyramidflow: High-resolution defect contrastive localization using pyramid normalizing flow. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14143–14152, 2023.

  38. [38]

    Uniscene: Unified occupancy-centric driving scene generation

    Bohan Li, Jiazhe Guo, Hongsi Liu, Yingshuang Zou, Yikang Ding, Xiwu Chen, Hu Zhu, Feiyang Tan, Chi Zhang, Tiancai Wang, et al. Uniscene: Unified occupancy-centric driving scene generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 11971–11981, 2025.

  39. [39]

    Omninwm: Omniscient driving navigation world models

    Bohan Li, Zhuang Ma, Dalong Du, Baorui Peng, Zhujin Liang, Zhenqiang Liu, Chao Ma, Yueming Jin, Hao Zhao, Wenjun Zeng, et al. Omninwm: Omniscient driving navigation world models. arXiv preprint arXiv:2510.18313, 2025.

  40. [40]

    Drivingdiffusion: layout-guided multi-view driving scenarios video generation with latent diffusion model

    Xiaofan Li, Yifu Zhang, and Xiaoqing Ye. Drivingdiffusion: layout-guided multi-view driving scenarios video generation with latent diffusion model. In European Conference on Computer Vision, pages 469–485. Springer, 2024.

  41. [41]

    Enhancing End-to-End Autonomous Driving with Latent World Model

    Yingyan Li, Lue Fan, Jiawei He, Yuqi Wang, Yuntao Chen, Zhaoxiang Zhang, and Tieniu Tan. Enhancing end-to-end autonomous driving with latent world model. arXiv preprint arXiv:2406.08481, 2024.

  42. [42]

    DriveVLA-W0: World Models Amplify Data Scaling Law in Autonomous Driving

    Yingyan Li, Shuyao Shang, Weisong Liu, Bing Zhan, Haochen Wang, Yuqi Wang, Yuntao Chen, Xiaoman Wang, Yasong An, Chufeng Tang, et al. Drivevla-w0: World models amplify data scaling law in autonomous driving. arXiv preprint arXiv:2510.12796, 2025.

  43. [43]

    End-to-end driving with online trajectory evaluation via bev world model

    Yingyan Li, Yuqi Wang, Yang Liu, Jiawei He, Lue Fan, and Zhaoxiang Zhang. End-to-end driving with online trajectory evaluation via bev world model. arXiv preprint arXiv:2504.01941, 2025.

  44. [44]

    ReCogDrive: A Reinforced Cognitive Framework for End-to-End Autonomous Driving

    Yongkang Li, Kaixin Xiong, Xiangyu Guo, Fang Li, Sixu Yan, Gangwei Xu, Lijun Zhou, Long Chen, Haiyang Sun, Bing Wang, et al. Recogdrive: A reinforced cognitive framework for end-to-end autonomous driving. arXiv preprint arXiv:2506.08052, 2025.

  45. [45]

    Bevformer: learning bird’s-eye-view representation from lidar-camera via spatiotemporal transformers

    Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chonghao Sima, Tong Lu, Qiao Yu, and Jifeng Dai. Bevformer: learning bird’s-eye-view representation from lidar-camera via spatiotemporal transformers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024. 1, 9

  46. [46]

    Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models

    Weixin Liang, Lili Yu, Liang Luo, Srinivasan Iyer, Ning Dong, Chunting Zhou, Gargi Ghosh, Mike Lewis, Wen-tau Yih, Luke Zettlemoyer, et al. Mixture-of-transformers: A sparse and scalable architecture for multi-modal foundation models. arXiv preprint arXiv:2411.04996, 2024. 3

  47. [47]

    Maptr: Structured modeling and learning for online vectorized hd map construction

    Bencheng Liao, Shaoyu Chen, Xinggang Wang, Tianheng Cheng, Qian Zhang, Wenyu Liu, and Chang Huang. Maptr: Structured modeling and learning for online vectorized hd map construction. arXiv preprint arXiv:2208.14437, 2022. 1

  48. [48]

    Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving

    Bencheng Liao, Shaoyu Chen, Haoran Yin, Bo Jiang, Cheng Wang, Sixu Yan, Xinbang Zhang, Xiangyu Li, Ying Zhang, Qian Zhang, et al. Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 12037–12047, 2025. 1, 3, 5, 6, 7

  49. [49]

    Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation

    Yue Liao, Pengfei Zhou, Siyuan Huang, Donglin Yang, Shengcong Chen, Yuxin Jiang, Yue Hu, Jingbin Cai, Si Liu, Jianlan Luo, et al. Genie envisioner: A unified world foundation platform for robotic manipulation. arXiv preprint arXiv:2508.05635, 2025. 2

  50. [50]

    Flow Matching for Generative Modeling

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022. 5, 6

  51. [51]

    Deep Predictive Coding Networks for Video Prediction and Unsupervised Learning

    William Lotter, Gabriel Kreiman, and David Cox. Deep predictive coding networks for video prediction and unsupervised learning. arXiv preprint arXiv:1605.08104, 2016. 3

  52. [52]

    Wovogen: World volume-aware diffusion for controllable multi-camera driving scene generation

    Jiachen Lu, Ze Huang, Zeyu Yang, Jiahui Zhang, and Li Zhang. Wovogen: World volume-aware diffusion for controllable multi-camera driving scene generation. In European Conference on Computer Vision, pages 329–345. Springer,

  53. [53]

    Unleashing generalization of end-to-end autonomous driving with controllable long video generation

    Enhui Ma, Lijun Zhou, Tao Tang, Zhan Zhang, Dong Han, Junpeng Jiang, Kun Zhan, Peng Jia, Xianpeng Lang, Haiyang Sun, et al. Unleashing generalization of end-to-end autonomous driving with controllable long video generation. arXiv preprint arXiv:2406.01349, 2024. 2

  54. [54]

    Driveworld: 4d pre-trained scene understanding via world models for autonomous driving

    Chen Min, Dawei Zhao, Liang Xiao, Jian Zhao, Xinli Xu, Zheng Zhu, Lei Jin, Jianshu Li, Yulan Guo, Junliang Xing, et al. Driveworld: 4d pre-trained scene understanding via world models for autonomous driving. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15522–15533, 2024. 1, 7

  55. [55]

    Recondreamer-rl: Enhancing reinforcement learning via diffusion-based scene reconstruction

    Chaojun Ni, Guosheng Zhao, Xiaofeng Wang, Zheng Zhu, Wenkang Qin, Xinze Chen, Guanghong Jia, Guan Huang, and Wenjun Mei. Recondreamer-rl: Enhancing reinforcement learning via diffusion-based scene reconstruction. arXiv preprint arXiv:2508.08170, 2025. 2

  56. [56]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205,

  57. [57]

    Openscene: 3d scene understanding with open vocabularies

    Songyou Peng, Kyle Genova, Chiyu Jiang, Andrea Tagliasacchi, Marc Pollefeys, Thomas Funkhouser, et al. Openscene: 3d scene understanding with open vocabularies. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 815–824, 2023. 6

  58. [58]

    Movie Gen: A Cast of Media Foundation Models

    Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, Bowen Shi, Chih-Yao Ma, Ching-Yao Chuang, et al. Movie gen: A cast of media foundation models. arXiv preprint arXiv:2410.13720,

  59. [59]

    Cosmos-drive-dreams: Scalable synthetic driving data generation with world foundation models

    Xuanchi Ren, Yifan Lu, Tianshi Cao, Ruiyuan Gao, Shengyu Huang, Amirmojtaba Sabour, Tianchang Shen, Tobias Pfaff, Jay Zhangjie Wu, Runjian Chen, et al. Cosmos-drive-dreams: Scalable synthetic driving data generation with world foundation models. arXiv preprint arXiv:2506.09042, 2025. 3

  60. [60]

    GAIA-2: A Controllable Multi-View Generative World Model for Autonomous Driving

    Lloyd Russell, Anthony Hu, Lorenzo Bertoni, George Fedoseev, Jamie Shotton, Elahe Arani, and Gianluca Corrado. Gaia-2: A controllable multi-view generative world model for autonomous driving. arXiv preprint arXiv:2503.20523,

  61. [61]

    Roformer: Enhanced transformer with rotary position embedding

    Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063,

  62. [62]

    Towards Accurate Generative Models of Video: A New Metric & Challenges

    Thomas Unterthiner, Sjoerd Van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717, 2018. 6

  63. [63]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017. 4

  64. [64]

    Decomposing Motion and Content for Natural Video Sequence Prediction

    Ruben Villegas, Jimei Yang, Seunghoon Hong, Xunyu Lin, and Honglak Lee. Decomposing motion and content for natural video sequence prediction. arXiv preprint arXiv:1706.08033, 2017. 3

  65. [65]

    Mila: Multi-view intensive-fidelity long-term video generation world model for autonomous driving

    Haiguang Wang, Daqi Liu, Hongwei Xie, Haisong Liu, Enhui Ma, Kaicheng Yu, Limin Wang, and Bing Wang. Mila: Multi-view intensive-fidelity long-term video generation world model for autonomous driving. arXiv preprint arXiv:2503.15875, 2025. 2, 3

  66. [66]

    Terasim-world: Worldwide safety-critical data synthesis for end-to-end autonomous driving

    Jiawei Wang, Haowei Sun, Xintao Yan, Shuo Feng, Jun Gao, and Henry X Liu. Terasim-world: Worldwide safety-critical data synthesis for end-to-end autonomous driving. arXiv preprint arXiv:2509.13164, 2025. 3

  67. [67]

    Occsora: 4d occupancy generation models as world simulators for autonomous driving

    Lening Wang, Wenzhao Zheng, Yilong Ren, Han Jiang, Zhiyong Cui, Haiyang Yu, and Jiwen Lu. Occsora: 4d occupancy generation models as world simulators for autonomous driving. arXiv preprint arXiv:2405.20337, 2024. 2

  68. [68]

    Drivedreamer: Towards real-world-drive world models for autonomous driving

    Xiaofeng Wang, Zheng Zhu, Guan Huang, Xinze Chen, Jiagang Zhu, and Jiwen Lu. Drivedreamer: Towards real-world-drive world models for autonomous driving. In European conference on computer vision, pages 55–72. Springer, 2024. 1, 2, 3, 7

  69. [69]

    Driving into the future: Multiview visual forecasting and planning with world model for autonomous driving

    Yuqi Wang, Jiawei He, Lue Fan, Hongxin Li, Yuntao Chen, and Zhaoxiang Zhang. Driving into the future: Multiview visual forecasting and planning with world model for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14749–14759, 2024. 1, 2, 3

  70. [70]

    Panacea: Panoramic and controllable video generation for autonomous driving

    Yuqing Wen, Yucheng Zhao, Yingfei Liu, Fan Jia, Yanhui Wang, Chong Luo, Chi Zhang, Tiancai Wang, Xiaoyan Sun, and Xiangyu Zhang. Panacea: Panoramic and controllable video generation for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6902–6912, 2024. 3

  71. [71]

    Para-drive: Parallelized architecture for real-time autonomous driving

    Xinshuo Weng, Boris Ivanovic, Yan Wang, Yue Wang, and Marco Pavone. Para-drive: Parallelized architecture for real-time autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15449–15458, 2024. 7

  72. [72]

    Video models are zero-shot learners and reasoners

    Thaddäus Wiedemer, Yuxuan Li, Paul Vicol, Shixiang Shane Gu, Nick Matarese, Kevin Swersky, Been Kim, Priyank Jaini, and Robert Geirhos. Video models are zero-shot learners and reasoners. arXiv preprint arXiv:2509.20328, 2025. 3

  73. [73]

    Goalflow: Goal-driven flow matching for multimodal trajectories generation in end-to-end autonomous driving

    Zebin Xing, Xingyu Zhang, Yang Hu, Bo Jiang, Tong He, Qian Zhang, Xiaoxiao Long, and Wei Yin. Goalflow: Goal-driven flow matching for multimodal trajectories generation in end-to-end autonomous driving. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 1602–1611, 2025. 3

  74. [74]

    Street gaussians: Modeling dynamic urban scenes with gaussian splatting

    Yunzhi Yan, Haotong Lin, Chenxu Zhou, Weijie Wang, Haiyang Sun, Kun Zhan, Xianpeng Lang, Xiaowei Zhou, and Sida Peng. Street gaussians: Modeling dynamic urban scenes with gaussian splatting. In European Conference on Computer Vision, pages 156–173. Springer, 2024. 2

  75. [75]

    Bevformer v2: Adapting modern image backbones to bird’s-eye-view recognition via perspective supervision

    Chenyu Yang, Yuntao Chen, Hao Tian, Chenxin Tao, Xizhou Zhu, Zhaoxiang Zhang, Gao Huang, Hongyang Li, Yu Qiao, Lewei Lu, et al. Bevformer v2: Adapting modern image backbones to bird’s-eye-view recognition via perspective supervision. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 17830–17839,

  76. [76]

    Generalized predictive model for autonomous driving

    Jiazhi Yang, Shenyuan Gao, Yihang Qiu, Li Chen, Tianyu Li, Bo Dai, Kashyap Chitta, Penghao Wu, Jia Zeng, Ping Luo, et al. Generalized predictive model for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14662–14672, 2024. 3

  77. [77]

    ReSim: Reliable World Simulation for Autonomous Driving

    Jiazhi Yang, Kashyap Chitta, Shenyuan Gao, Long Chen, Yuqian Shao, Xiaosong Jia, Hongyang Li, Andreas Geiger, Xiangyu Yue, and Li Chen. Resim: Reliable world simulation for autonomous driving. arXiv preprint arXiv:2506.09981,

  78. [78]

    CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

    Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024. 4

  79. [79]

    Root mean square layer normalization

    Biao Zhang and Rico Sennrich. Root mean square layer normalization. Advances in neural information processing systems, 32, 2019. 4

  80. [80]

    Epona: Autoregressive diffusion world model for autonomous driving

    Kaiwen Zhang, Zhenyu Tang, Xiaotao Hu, Xingang Pan, Xiaoyang Guo, Yuan Liu, Jingwei Huang, Li Yuan, Qian Zhang, Xiao-Xiao Long, et al. Epona: Autoregressive diffusion world model for autonomous driving. arXiv preprint arXiv:2506.24113, 2025. 2, 3, 6, 7, 8

Showing first 80 references.