DriveLaW:Unifying Planning and Video Generation in a Latent Driving World

Bing Wang; Guang Chen; Haiyang Sun; Hangjun Ye; Jingfeng Yao; Kaixin Xiong; Kun Ma; Lijun Zhou; Tianze Xia; Wenyu Liu; Xinggang Wang; Yongkang Li

arxiv: 2512.23421 · v3 · submitted 2025-12-29 · 💻 cs.CV

Tianze Xia, Yongkang Li, Lijun Zhou, Jingfeng Yao, Kaixin Xiong, Haiyang Sun, Bing Wang, Kun Ma, Guang Chen, Hangjun Ye, Wenyu Liu, Xinggang Wang

Pith reviewed 2026-05-16 19:31 UTC · model grok-4.3

classification 💻 cs.CV

keywords autonomous driving · world models · video generation · motion planning · diffusion planner · latent representations · trajectory planning · NAVSIM benchmark

The pith

DriveLaW unifies video generation and motion planning by injecting the video model's latent representations directly into a diffusion planner.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes DriveLaW to combine the tasks of generating future driving videos and planning vehicle trajectories within a single system. It achieves this by taking the internal latent codes produced by its video generation component and feeding them straight into the planning module. This direct connection is meant to guarantee that the imagined futures align with the chosen actions without extra adjustments. A sympathetic reader would care because current approaches treat prediction and planning separately, which can lead to inconsistencies when handling rare real-world driving scenarios.

Core claim

DriveLaW unifies planning and video generation in a latent driving world by directly injecting the latent representation from its video generator into the planner. This ensures inherent consistency between high-fidelity future generation and reliable trajectory planning. The system includes DriveLaW-Video for high-fidelity forecasting with expressive latents and DriveLaW-Act as a diffusion planner that uses those latents to generate trajectories, trained via a three-stage progressive strategy, leading to state-of-the-art performance on video prediction and the NAVSIM planning benchmark.

What carries the argument

Direct injection of the latent representation from the DriveLaW-Video generator into the DriveLaW-Act diffusion planner.
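One way to picture latent injection is as cross-attention, where noisy trajectory tokens query the video latents at each denoising step. The page does not say how DriveLaW-Act consumes the latents, so the attention mechanism, the function names, and the toy update rule below are all assumptions made for illustration, not the authors' design.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, latents):
    """Each trajectory token (query) attends over the video latent tokens.

    Assumption: keys and values are the raw latents, with no learned
    projections, so the sketch stays dependency-free.
    """
    dim = len(latents[0])
    contexts = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in latents
        ]
        weights = softmax(scores)
        contexts.append(
            [sum(w * k[d] for w, k in zip(weights, latents)) for d in range(dim)]
        )
    return contexts


def denoise_step(noisy_traj, latents, step_size=0.1):
    """Hypothetical planner update: move each noisy waypoint toward its context."""
    contexts = cross_attend(noisy_traj, latents)
    return [
        [x + step_size * (c - x) for x, c in zip(point, ctx)]
        for point, ctx in zip(noisy_traj, contexts)
    ]


if __name__ == "__main__":
    video_latents = [[0.0, 0.0], [2.0, 2.0]]  # made-up 2-D latent tokens
    trajectory = [[1.0, 0.0], [0.5, 1.5]]     # made-up noisy waypoints
    print(denoise_step(trajectory, video_latents))
```

In this toy version the planner has no input other than the latents, which mirrors the latent-only conditioning that the load-bearing premise depends on.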

If this is right

Video prediction improves by 33.3% in FID and 1.8% in FVD over prior best-performing work.
The system sets a new record on the NAVSIM planning benchmark.
High-fidelity future generation and trajectory planning become inherently consistent without separate alignment steps.
The unified approach addresses long-tail challenges in autonomous driving through integrated world modeling.
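The percentage gains above are most naturally read as relative reductions, since FID and FVD are lower-is-better metrics. The page does not give the absolute scores, so the numbers in this sketch are hypothetical and only show how such a percentage is computed.

```python
def relative_improvement(baseline, new):
    """Percent reduction of a lower-is-better metric such as FID or FVD.

    A positive result means `new` is better (lower) than `baseline`.
    """
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (baseline - new) / baseline


if __name__ == "__main__":
    # Hypothetical scores, not taken from the paper.
    print(round(relative_improvement(6.0, 4.0), 1))
```

Under this reading, a 1.8% FVD gain is a far smaller margin than the 33.3% FID gain, so the two figures should not be weighed equally.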

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The three-stage progressive training may allow balancing the video and planning objectives more effectively than joint end-to-end training.
This latent-injection pattern could apply to other sequential decision domains where future prediction must stay aligned with control outputs.
Removing the need for post-hoc consistency checks between modules could simplify deployment pipelines in real vehicles.

Load-bearing premise

The latent representations produced by the video generator are sufficient by themselves to drive a diffusion planner that produces reliable trajectories without loss of performance in either component.

What would settle it

An experiment showing that a planner trained without the injected video latents achieves equal or better trajectory quality on the NAVSIM benchmark would falsify the need for the direct unification.
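A comparison of this kind could be scored with a paired bootstrap over per-scenario results. The sketch below is one possible analysis rather than anything the paper describes: the function name, the score lists, and the choice of a percentile interval are assumptions.

```python
import random


def paired_bootstrap_ci(scores_with, scores_without, n_resamples=2000,
                        alpha=0.05, seed=0):
    """Mean paired difference (with minus without) and a percentile interval.

    Both lists hold one score per scenario, in the same scenario order.
    """
    if len(scores_with) != len(scores_without):
        raise ValueError("score lists must have equal length")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_with, scores_without)]
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        # Resample scenarios with replacement and record the mean difference.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = n_resamples - 1 - lo_idx
    return sum(diffs) / n, means[lo_idx], means[hi_idx]


if __name__ == "__main__":
    # Hypothetical per-scenario planning scores, not real benchmark results.
    with_latents = [0.91, 0.88, 0.93, 0.85, 0.90]
    without_latents = [0.89, 0.88, 0.90, 0.86, 0.87]
    print(paired_bootstrap_ci(with_latents, without_latents))
```

An interval that sits entirely above zero would favor the latent-conditioned planner. An interval that includes or falls below zero would be the outcome described above as falsifying the need for direct unification.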

Figures

Figures reproduced from arXiv: 2512.23421 by Bing Wang, Guang Chen, Haiyang Sun, Hangjun Ye, Jingfeng Yao, Kaixin Xiong, Kun Ma, Lijun Zhou, Tianze Xia, Wenyu Liu, Xinggang Wang, Yongkang Li.

**Figure 2.** Overview of the overall architecture of DriveLaW. The model first encodes historical observations (images, actions) into a unified latent world representation through a powerful video diffusion model. In order to improve the generation quality, we introduced the Noise Reinjection mechanism to explore and select the optimal generation path in the early stage of denoising. The denoised video latents produced…

**Figure 3.** Restoring Structural and Temporal Consistency via Noise Reinjection. This comparison highlights the impact of our method. The baseline generation shows significant degradation, including (a) blurring, (b) structural inconsistency, and (c) artifacts. By integrating noise reinjection, our model preserves sharp details, maintains object structures, and produces clean, artifact-free frames, demonstrating a cru…

**Figure 4.** Qualitative Comparison with state-of-the-art driving world model. We compare DriveLaW with Epona [80] on nuScenes validation set. DriveLaW generates videos with (1) clearer vehicle details and more stable structural integrity, (2) well-preserved pedestrian shapes that remain easily identifiable, and (3) correct recognition and maintenance of inconspicuous objects (e.g., the yellow van), demonstrating super…

**Figure 5.** Qualitative analysis of latent representations. We visualize the quality of latent representations from three different feature sources: BEV features extracted from BEVFormer [45]’s ResNet-101 backbone, VLM features from the pretrained Qwen2.5-VL model in ReCogDrive [44], and VGM (Video Generation Model) features from our DriveLaW-Video. To enable visual comparison, we apply PCA to reduce each representati…

**Figure 6.** Qualitative examples of DriveLaW video generation on the nuScenes dataset. (a) Conventional urban driving scenarios, showing stable lane keeping and interactions with surrounding traffic. (b) Complex urban driving scenarios involving dense multi-agent interactions, turning maneuvers, and occlusions. (c) Night driving scenarios, demonstrating the model’s robustness to low-light conditions while preserving t…

**Figure 7.** Qualitative results on the Navtest benchmark.

Original abstract

World models have become crucial for autonomous driving, as they learn how scenarios evolve over time to address the long-tail challenges of the real world. However, current approaches relegate world models to limited roles: they operate within ostensibly unified architectures that still keep world prediction and motion planning as decoupled processes. To bridge this gap, we propose DriveLaW, a novel paradigm that unifies video generation and motion planning. By directly injecting the latent representation from its video generator into the planner, DriveLaW ensures inherent consistency between high-fidelity future generation and reliable trajectory planning. Specifically, DriveLaW consists of two core components: DriveLaW-Video, our powerful world model that generates high-fidelity forecasting with expressive latent representations, and DriveLaW-Act, a diffusion planner that generates consistent and reliable trajectories from the latent of DriveLaW-Video, with both components optimized by a three-stage progressive training strategy. The power of our unified paradigm is demonstrated by new state-of-the-art results across both tasks. DriveLaW not only advances video prediction significantly, surpassing best-performing work by 33.3% in FID and 1.8% in FVD, but also achieves a new record on the NAVSIM planning benchmark.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

DriveLaW's latent-injection trick links video generation and diffusion planning for driving, delivering reported SOTA gains, but the claim that those latents carry everything needed still needs direct checks.

DriveLaW's main contribution is piping the latent codes from its video world model straight into a diffusion planner instead of running the two tasks separately. The abstract and setup describe a three-stage training process in which the video model first learns to forecast, then the planner is trained on those latents, and finally both are refined together. This produces the claimed consistency between generated futures and planned trajectories, along with the reported numbers: a 33.3% improvement in FID and a 1.8% improvement in FVD on video prediction, and a new record on the NAVSIM planning benchmark.
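The three stages described in this letter can be written down as a schedule of which module receives gradient updates in each stage. This is a sketch of that reading only: the module names are hypothetical, and freezing the video model during stage 2 is an assumption, since the page says only that the planner is trained on the latents.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    train_video: bool    # update the video world model in this stage
    train_planner: bool  # update the diffusion planner in this stage


# Assumed schedule; the paper's actual freezing policy is not given on this page.
SCHEDULE = (
    Stage("stage 1: video forecasting", train_video=True, train_planner=False),
    Stage("stage 2: planner on video latents", train_video=False, train_planner=True),
    Stage("stage 3: joint refinement", train_video=True, train_planner=True),
)


def trainable_modules(stage):
    """Return the names of the modules that are updated in the given stage."""
    modules = []
    if stage.train_video:
        modules.append("video_model")
    if stage.train_planner:
        modules.append("planner")
    return modules


if __name__ == "__main__":
    for stage in SCHEDULE:
        print(stage.name, "->", trainable_modules(stage))
```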

Referee Report

1 major / 1 minor

Summary. The manuscript proposes DriveLaW, a unified framework that integrates video generation and motion planning for autonomous driving. It features DriveLaW-Video as a world model producing high-fidelity forecasts and expressive latent representations, and DriveLaW-Act as a diffusion planner that generates trajectories by directly injecting these latents. The components are trained progressively in three stages, yielding state-of-the-art performance on video prediction metrics (FID improved by 33.3%, FVD by 1.8%) and the NAVSIM planning benchmark.

Significance. Should the unification via latent injection prove robust, the work would represent a meaningful step toward consistent world models in driving, potentially reducing discrepancies between predicted futures and planned actions. The empirical SOTA claims on both generation and planning tasks suggest broad applicability if the evidence holds under scrutiny.

major comments (1)

[Abstract] The central unification claim, that direct injection of the latent representation from DriveLaW-Video into the diffusion planner DriveLaW-Act ensures inherent consistency without performance loss, rests on an unexamined assumption that the latents are information-complete for reliable trajectory generation. No ablation isolating the latent-only case, no analysis of compression loss, and no comparison against additional conditioning (e.g., ego-state or map features) are supplied, leaving the sufficiency of the latent representation unsupported.

minor comments (1)

Title: missing space after colon ('DriveLaW:Unifying' should read 'DriveLaW: Unifying').

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive review and the recommendation for major revision. We address the single major comment point-by-point below. The suggested additions will strengthen the empirical support for our unification claim, and we will incorporate them in the revised manuscript.

Referee: [Abstract] The central unification claim, that direct injection of the latent representation from DriveLaW-Video into the diffusion planner DriveLaW-Act ensures inherent consistency without performance loss, rests on an unexamined assumption that the latents are information-complete for reliable trajectory generation. No ablation isolating the latent-only case, no analysis of compression loss, and no comparison against additional conditioning (e.g., ego-state or map features) are supplied, leaving the sufficiency of the latent representation unsupported.

Authors: We appreciate this observation. While the current results (SOTA video metrics and new NAVSIM record) are achieved precisely with latent-only injection into the diffusion planner, we agree that the manuscript would benefit from explicit verification of the latent representation's sufficiency. In the revised version we will add: (1) an ablation comparing latent-only conditioning against the full model, (2) quantitative analysis of information retention and compression loss in the latent space (e.g., reconstruction FID/FVD and dimensionality studies), and (3) controlled comparisons that augment the latent with ego-state or map features. These experiments will directly test whether the latents are information-complete for reliable planning.

Revision: yes

Circularity Check

0 steps flagged

No circularity detected in derivation chain

The provided abstract and context describe an architectural unification via latent injection and a three-stage training strategy, but contain no equations, parameter fits, or self-citations that reduce any claimed prediction or consistency result to an input by construction. The strongest claim (inherent consistency from direct latent injection) is presented as an empirical outcome of the model design rather than a definitional or fitted tautology. No load-bearing steps match the enumerated circularity patterns; the derivation remains self-contained against external benchmarks like NAVSIM and FID/FVD metrics.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review prevents identification of concrete free parameters, axioms, or invented entities; the three-stage training and latent-injection design are methodological choices whose details are not supplied.

pith-pipeline@v0.9.0 · 5556 in / 1094 out tokens · 39301 ms · 2026-05-16T19:31:33.714777+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Each tag gives the relation between the paper passage and the cited Recognition theorem.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "By directly injecting the latent representation from its video generator into the planner, DriveLaW ensures inherent consistency between high-fidelity future generation and reliable trajectory planning... three-stage progressive training strategy"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: "DriveLaW-Video... spatiotemporal VAE... Video DiT... Action DiT... noise reinjection"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 9 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Beyond Imitation: Learning Safe End-to-End Autonomous Driving from Hard Negatives
cs.RO · 2026-05 · unverdicted · novelty 6.0
BeyondDrive augments imitation learning with synthesized safety-critical negative trajectories and a repulsive loss to improve safety in autonomous driving, reporting 89.7 PDMS on NAVSIMv1 and generalization to other models.

Xiaomi Auto World Model: A Joint World Model Integrating Reconstruction and Generation for Autonomous Driving
cs.CV · 2026-05 · unverdicted · novelty 6.0
Xiaomi EV World Model integrates WorldRec for sparse-query 3D Gaussian reconstruction and WorldGen for fast causal video generation via bidirectional pretraining and causal fine-tuning to support autonomous driving si...

CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving
cs.CV · 2026-05 · unverdicted · novelty 6.0
CoWorld-VLA encodes world information into four expert tokens that condition a diffusion-based planner, yielding competitive collision avoidance and trajectory accuracy on the NAVSIM benchmark.

CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving
cs.CV · 2026-05 · unverdicted · novelty 6.0
CoWorld-VLA extracts semantic, geometric, dynamic, and trajectory expert tokens from multi-source supervision and feeds them into a diffusion-based hierarchical planner, achieving competitive collision avoidance and t...

DriveFuture: Future-Aware Latent World Models for Autonomous Driving
cs.CV · 2026-05 · unverdicted · novelty 6.0
DriveFuture achieves SOTA results on NAVSIM by conditioning latent world model states on future predictions to directly inform trajectory planning.

Video Generation Models as World Models: Efficient Paradigms, Architectures and Algorithms
eess.IV · 2026-03 · unverdicted · novelty 6.0
Video generation models can function as world simulators if efficiency gaps in spatiotemporal modeling are bridged via organized paradigms, architectures, and algorithms.

EponaV2: Driving World Model with Comprehensive Future Reasoning
cs.CV · 2026-05 · unverdicted · novelty 5.0
EponaV2 advances perception-free driving world models by forecasting comprehensive future 3D geometry and semantic representations, achieving SOTA planning performance on NAVSIM benchmarks.

CRAFT: Counterfactual-to-Interactive Reinforcement Fine-Tuning for Driving Policies
cs.LG · 2026-05 · unverdicted · novelty 5.0
CRAFT is an on-policy RL fine-tuning framework that decomposes closed-loop policy gradients into a group-normalized counterfactual proxy plus residual correction from interaction events, achieving top closed-loop perf...

RAD-2: Scaling Reinforcement Learning in a Generator-Discriminator Framework
cs.CV · 2026-04 · unverdicted · novelty 5.0
RAD-2 uses a diffusion generator and RL discriminator to cut collision rates by 56% in closed-loop autonomous driving planning.

Reference graph

Works this paper leans on

85 extracted references · 85 canonical work pages · cited by 8 Pith papers · 26 internal anchors

[1] Hervé Abdi and Lynne J Williams. Principal component analysis. Wiley interdisciplinary reviews: computational statistics, 2(4):433–459, 2010.

[2] Arslan Ali, Junjie Bai, Maciej Bala, Yogesh Balaji, Aaron Blakeman, Tiffany Cai, Jiaxin Cao, Tianshi Cao, Elizabeth Cha, Yu-Wei Chao, et al. World simulation with video foundation models for physical ai. arXiv preprint arXiv:2511.00062,

[3] Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, et al. V-jepa 2: Self-supervised video models enable understanding, prediction and planning. arXiv preprint arXiv:2506.09985, 2025.

[4] Florent Bartoccioni, Elias Ramzi, Victor Besnier, Shashanka Venkataramanan, Tuan-Hung Vu, Yihong Xu, Loick Chambon, Spyros Gidaris, Serkan Odabas, David Hurych, et al. Vavim and vavam: Autonomous driving through video generative modeling. arXiv preprint arXiv:2502.15672, 2025.

[5] Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A vision-language-action flow model for general robot control. CoRR, abs/2410.24164, 2024. doi: 10.48550/arXiv.2410.24164.
[6] Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023.

[7] Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, et al. Video generation models as world simulators. OpenAI Blog, 1(8):1, 2024.

[8] Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. Genie: Generative interactive environments. In Forty-first International Conference on Machine Learning, 2024.

[9] Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11621–11631, 2020.

[10] Holger Caesar, Juraj Kabzan, Kok Seang Tan, Whye Kit Fong, Eric Wolff, Alex Lang, Luke Fletcher, Oscar Beijbom, and Sammy Omari. nuplan: A closed-loop ml-based planning benchmark for autonomous vehicles. arXiv preprint arXiv:2106.11810, 2021.
[11] Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, et al. Pixart-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis. arXiv preprint arXiv:2310.00426, 2023.

[12] Shaoyu Chen, Bo Jiang, Hao Gao, Bencheng Liao, Qing Xu, Qian Zhang, Chang Huang, Wenyu Liu, and Xinggang Wang. Vadv2: End-to-end vectorized autonomous driving via probabilistic planning. arXiv preprint arXiv:2402.13243,

[13] Yuntao Chen, Yuqi Wang, and Zhaoxiang Zhang. Drivinggpt: Unifying driving world modeling and planning with multi-modal autoregressive transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 26890–26900, 2025.

[14] Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, 44(10-11):1684–1704, 2025.

[15] Kashyap Chitta, Aditya Prakash, Bernhard Jaeger, Zehao Yu, Katrin Renz, and Andreas Geiger. Transfuser: Imitation with transformer-based sensor fusion for autonomous driving. IEEE transactions on pattern analysis and machine intelligence, 45(11):12878–12895, 2022.
[16] Daniel Dauner, Marcel Hallgarten, Tianyu Li, Xinshuo Weng, Zhiyu Huang, Zetong Yang, Hongyang Li, Igor Gilitschenski, Boris Ivanovic, Marco Pavone, et al. Navsim: Data-driven non-reactive autonomous vehicle simulation and benchmarking. Advances in Neural Information Processing Systems, 37:28706–28719, 2024.

[17] Hao Gao, Shaoyu Chen, Bo Jiang, Bencheng Liao, Yiang Shi, Xiaoyang Guo, Yuechuan Pu, Haoran Yin, Xiangyu Li, Xinbang Zhang, et al. Rad: Training an end-to-end driving policy via large-scale 3dgs-based reinforcement learning. arXiv preprint arXiv:2502.13144, 2025.

[18] Ruiyuan Gao, Kai Chen, Enze Xie, Lanqing Hong, Zhenguo Li, Dit-Yan Yeung, and Qiang Xu. Magicdrive: Street view generation with diverse 3d geometry control. arXiv preprint arXiv:2310.02601, 2023.

[19] Ruiyuan Gao, Kai Chen, Bo Xiao, Lanqing Hong, Zhenguo Li, and Qiang Xu. Magicdrive-v2: High-resolution long video generation for autonomous driving with adaptive control. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 28135–28144, 2025.

[20] Shenyuan Gao, Jiazhi Yang, Li Chen, Kashyap Chitta, Yihang Qiu, Andreas Geiger, Jun Zhang, and Hongyang Li. Vista: A generalizable driving world model with high fidelity and versatile controllability. Advances in Neural Information Processing Systems, 37:91560–91596, 2024.
[21] Jiazhe Guo, Yikang Ding, Xiwu Chen, Shuo Chen, Bohan Li, Yingshuang Zou, Xiaoyang Lyu, Feiyang Tan, Xiaojuan Qi, Zhiheng Li, et al. Dist-4d: Disentangled spatiotemporal diffusion with metric depth for 4d driving scene generation. arXiv preprint arXiv:2503.15208, 2025.

[22] Xiangyu Guo, Zhanqian Wu, Kaixin Xiong, Ziyang Xu, Lijun Zhou, Gangwei Xu, Shaoqing Xu, Haiyang Sun, Bing Wang, Guang Chen, et al. Genesis: Multimodal driving scene generation with spatio-temporal and cross-modal consistency. arXiv preprint arXiv:2506.07497, 2025.

[23] Ziyu Guo, Xinyan Chen, Renrui Zhang, Ruichuan An, Yu Qi, Dongzhi Jiang, Xiangtai Li, Manyuan Zhang, Hongsheng Li, and Pheng-Ann Heng. Are video models ready as zero-shot reasoners? an empirical study with the mme-cof benchmark. arXiv preprint arXiv:2510.26802, 2025.

[24] Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richardson, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, et al. Ltx-video: Realtime video latent diffusion. arXiv preprint arXiv:2501.00103, 2024.

[25] William Harvey, Saeid Naderiparizi, Vaden Masrani, Christian Weilbach, and Frank Wood. Flexible diffusion modeling of long videos. Advances in neural information processing systems, 35:27953–27965, 2022.
[26] Mariam Hassan, Sebastian Stapf, Ahmad Rahimi, Pedro Rezende, Yasaman Haghighi, David Brüggemann, Isinsu Katircioglu, Lin Zhang, Xiaoran Chen, Suman Saha, et al. Gem: A generalizable ego-vision multimodal world model for fine-grained ego-motion, object dynamics, and scene composition control. In Proceedings of the Computer Vision and Pattern Recognition C...

[27] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017.

[28] Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022.

[29] Anthony Hu, Lloyd Russell, Hudson Yeo, Zak Murez, George Fedoseev, Alex Kendall, Jamie Shotton, and Gianluca Corrado. Gaia-1: A generative world model for autonomous driving. arXiv preprint arXiv:2309.17080, 2023.

[30] Xiaotao Hu, Wei Yin, Mingkai Jia, Junyuan Deng, Xiaoyang Guo, Qian Zhang, Xiaoxiao Long, and Ping Tan. Drivingworld: Constructing world model for autonomous driving via video gpt. arXiv preprint arXiv:2412.19505, 2024.
[31] Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wenhai Wang, et al. Planning-oriented autonomous driving. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 17853–17862, 2023.

[32] Junjie Huang, Guan Huang, Zheng Zhu, Yun Ye, and Dalong Du. Bevdet: High-performance multi-camera 3d object detection in bird-eye-view. arXiv preprint arXiv:2112.11790,

[33] Geunmin Hwang, Hyun kyu Ko, Younghyun Kim, Seungryong Lee, and Eunbyung Park. Diffuseslide: Training-free high frame rate video generation diffusion, 2025.

[34] Bo Jiang, Shaoyu Chen, Qing Xu, Bencheng Liao, Jiajie Chen, Helong Zhou, Qian Zhang, Wenyu Liu, Chang Huang, and Xinggang Wang. Vad: Vectorized scene representation for efficient autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8340–8350, 2023.

[35] Seung Wook Kim, Jonah Philion, Antonio Torralba, and Sanja Fidler. Drivegan: Towards a controllable high-quality neural simulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5820–5829,
[36]

HunyuanVideo: A Systematic Framework For Large Video Generative Models

Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, 2024. 4

work page internal anchor Pith review Pith/arXiv arXiv 2024
[37]

Pyramidflow: High-resolution defect contrastive localization using pyramid normalizing flow

Jiarui Lei, Xiaobo Hu, Yue Wang, and Dong Liu. Pyramidflow: High-resolution defect contrastive localization using pyramid normalizing flow. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14143–14152, 2023. 4
[38]

Uniscene: Unified occupancy-centric driving scene generation

Bohan Li, Jiazhe Guo, Hongsi Liu, Yingshuang Zou, Yikang Ding, Xiwu Chen, Hu Zhu, Feiyang Tan, Chi Zhang, Tiancai Wang, et al. Uniscene: Unified occupancy-centric driving scene generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 11971–11981, 2025. 2
[39]

Omninwm: Omniscient driving navigation world models

Bohan Li, Zhuang Ma, Dalong Du, Baorui Peng, Zhujin Liang, Zhenqiang Liu, Chao Ma, Yueming Jin, Hao Zhao, Wenjun Zeng, et al. Omninwm: Omniscient driving navigation world models. arXiv preprint arXiv:2510.18313, 2025. 1, 2, 3
[40]

Drivingdiffusion: layout-guided multi-view driving scenarios video generation with latent diffusion model

Xiaofan Li, Yifu Zhang, and Xiaoqing Ye. Drivingdiffusion: layout-guided multi-view driving scenarios video generation with latent diffusion model. In European Conference on Computer Vision, pages 469–485. Springer, 2024. 1, 3
[41]

Enhancing End-to-End Autonomous Driving with Latent World Model

Yingyan Li, Lue Fan, Jiawei He, Yuqi Wang, Yuntao Chen, Zhaoxiang Zhang, and Tieniu Tan. Enhancing end-to-end autonomous driving with latent world model. arXiv preprint arXiv:2406.08481, 2024. 1, 2, 7
[42]

DriveVLA-W0: World Models Amplify Data Scaling Law in Autonomous Driving

Yingyan Li, Shuyao Shang, Weisong Liu, Bing Zhan, Haochen Wang, Yuqi Wang, Yuntao Chen, Xiaoman Wang, Yasong An, Chufeng Tang, et al. Drivevla-w0: World models amplify data scaling law in autonomous driving. arXiv preprint arXiv:2510.12796, 2025. 1, 2, 3, 6, 7
[43]

End-to-end driving with online trajectory evaluation via bev world model

Yingyan Li, Yuqi Wang, Yang Liu, Jiawei He, Lue Fan, and Zhaoxiang Zhang. End-to-end driving with online trajectory evaluation via bev world model. arXiv preprint arXiv:2504.01941, 2025. 2, 7
[44]

ReCogDrive: A Reinforced Cognitive Framework for End-to-End Autonomous Driving

Yongkang Li, Kaixin Xiong, Xiangyu Guo, Fang Li, Sixu Yan, Gangwei Xu, Lijun Zhou, Long Chen, Haiyang Sun, Bing Wang, et al. Recogdrive: A reinforced cognitive framework for end-to-end autonomous driving. arXiv preprint arXiv:2506.08052, 2025. 1, 3, 5, 6, 7, 9
[45]

Bevformer: learning bird’s-eye-view representation from lidar-camera via spatiotemporal transformers

Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chonghao Sima, Tong Lu, Qiao Yu, and Jifeng Dai. Bevformer: learning bird’s-eye-view representation from lidar-camera via spatiotemporal transformers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024. 1, 9
[46]

Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models

Weixin Liang, Lili Yu, Liang Luo, Srinivasan Iyer, Ning Dong, Chunting Zhou, Gargi Ghosh, Mike Lewis, Wen-tau Yih, Luke Zettlemoyer, et al. Mixture-of-transformers: A sparse and scalable architecture for multi-modal foundation models. arXiv preprint arXiv:2411.04996, 2024. 3
[47]

Maptr: Structured modeling and learning for online vectorized hd map construction

Bencheng Liao, Shaoyu Chen, Xinggang Wang, Tianheng Cheng, Qian Zhang, Wenyu Liu, and Chang Huang. Maptr: Structured modeling and learning for online vectorized hd map construction. arXiv preprint arXiv:2208.14437, 2022. 1
[48]

Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving

Bencheng Liao, Shaoyu Chen, Haoran Yin, Bo Jiang, Cheng Wang, Sixu Yan, Xinbang Zhang, Xiangyu Li, Ying Zhang, Qian Zhang, et al. Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 12037–12047, 2025. 1, 3, 5, 6, 7
[49]

Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation

Yue Liao, Pengfei Zhou, Siyuan Huang, Donglin Yang, Shengcong Chen, Yuxin Jiang, Yue Hu, Jingbin Cai, Si Liu, Jianlan Luo, et al. Genie envisioner: A unified world foundation platform for robotic manipulation. arXiv preprint arXiv:2508.05635, 2025. 2
[50]

Flow Matching for Generative Modeling

Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022. 5, 6

[51]

Deep Predictive Coding Networks for Video Prediction and Unsupervised Learning

William Lotter, Gabriel Kreiman, and David Cox. Deep predictive coding networks for video prediction and unsupervised learning. arXiv preprint arXiv:1605.08104, 2016. 3
[52]

Wovogen: World volume-aware diffusion for controllable multi-camera driving scene generation

Jiachen Lu, Ze Huang, Zeyu Yang, Jiahui Zhang, and Li Zhang. Wovogen: World volume-aware diffusion for controllable multi-camera driving scene generation. In European Conference on Computer Vision, pages 329–345. Springer,
[53]

Unleashing generalization of end-to-end autonomous driving with controllable long video generation

Enhui Ma, Lijun Zhou, Tao Tang, Zhan Zhang, Dong Han, Junpeng Jiang, Kun Zhan, Peng Jia, Xianpeng Lang, Haiyang Sun, et al. Unleashing generalization of end-to-end autonomous driving with controllable long video generation. arXiv preprint arXiv:2406.01349, 2024. 2
[54]

Driveworld: 4d pre-trained scene understanding via world models for autonomous driving

Chen Min, Dawei Zhao, Liang Xiao, Jian Zhao, Xinli Xu, Zheng Zhu, Lei Jin, Jianshu Li, Yulan Guo, Junliang Xing, et al. Driveworld: 4d pre-trained scene understanding via world models for autonomous driving. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15522–15533, 2024. 1, 7
[55]

Recondreamer-rl: Enhancing reinforcement learning via diffusion-based scene reconstruction

Chaojun Ni, Guosheng Zhao, Xiaofeng Wang, Zheng Zhu, Wenkang Qin, Xinze Chen, Guanghong Jia, Guan Huang, and Wenjun Mei. Recondreamer-rl: Enhancing reinforcement learning via diffusion-based scene reconstruction. arXiv preprint arXiv:2508.08170, 2025. 2
[56]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205,
[57]

Openscene: 3d scene understanding with open vocabularies

Songyou Peng, Kyle Genova, Chiyu Jiang, Andrea Tagliasacchi, Marc Pollefeys, Thomas Funkhouser, et al. Openscene: 3d scene understanding with open vocabularies. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 815–824, 2023. 6
[58]

Movie Gen: A Cast of Media Foundation Models

Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, Bowen Shi, Chih-Yao Ma, Ching-Yao Chuang, et al. Movie gen: A cast of media foundation models. arXiv preprint arXiv:2410.13720,
[59]

Cosmos-drive-dreams: Scalable synthetic driving data generation with world foundation models

Xuanchi Ren, Yifan Lu, Tianshi Cao, Ruiyuan Gao, Shengyu Huang, Amirmojtaba Sabour, Tianchang Shen, Tobias Pfaff, Jay Zhangjie Wu, Runjian Chen, et al. Cosmos-drive-dreams: Scalable synthetic driving data generation with world foundation models. arXiv preprint arXiv:2506.09042, 2025. 3
[60]

GAIA-2: A Controllable Multi-View Generative World Model for Autonomous Driving

Lloyd Russell, Anthony Hu, Lorenzo Bertoni, George Fedoseev, Jamie Shotton, Elahe Arani, and Gianluca Corrado. Gaia-2: A controllable multi-view generative world model for autonomous driving. arXiv preprint arXiv:2503.20523,
[61]

Roformer: Enhanced transformer with rotary position embedding

Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063,
[62]

Towards Accurate Generative Models of Video: A New Metric & Challenges

Thomas Unterthiner, Sjoerd Van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717, 2018. 6
[63]

Attention is all you need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017. 4
[64]

Decomposing Motion and Content for Natural Video Sequence Prediction

Ruben Villegas, Jimei Yang, Seunghoon Hong, Xunyu Lin, and Honglak Lee. Decomposing motion and content for natural video sequence prediction. arXiv preprint arXiv:1706.08033, 2017. 3
[65]

Mila: Multi-view intensive-fidelity long-term video generation world model for autonomous driving

Haiguang Wang, Daqi Liu, Hongwei Xie, Haisong Liu, Enhui Ma, Kaicheng Yu, Limin Wang, and Bing Wang. Mila: Multi-view intensive-fidelity long-term video generation world model for autonomous driving. arXiv preprint arXiv:2503.15875, 2025. 2, 3
[66]

Terasim-world: Worldwide safety-critical data synthesis for end-to-end autonomous driving

Jiawei Wang, Haowei Sun, Xintao Yan, Shuo Feng, Jun Gao, and Henry X Liu. Terasim-world: Worldwide safety-critical data synthesis for end-to-end autonomous driving. arXiv preprint arXiv:2509.13164, 2025. 3
[67]

Occsora: 4d occupancy generation models as world simulators for autonomous driving

Lening Wang, Wenzhao Zheng, Yilong Ren, Han Jiang, Zhiyong Cui, Haiyang Yu, and Jiwen Lu. Occsora: 4d occupancy generation models as world simulators for autonomous driving. arXiv preprint arXiv:2405.20337, 2024. 2
[68]

Drivedreamer: Towards real-world-drive world models for autonomous driving

Xiaofeng Wang, Zheng Zhu, Guan Huang, Xinze Chen, Jiagang Zhu, and Jiwen Lu. Drivedreamer: Towards real-world-drive world models for autonomous driving. In European conference on computer vision, pages 55–72. Springer, 2024. 1, 2, 3, 7
[69]

Driving into the future: Multiview visual forecasting and planning with world model for autonomous driving

Yuqi Wang, Jiawei He, Lue Fan, Hongxin Li, Yuntao Chen, and Zhaoxiang Zhang. Driving into the future: Multiview visual forecasting and planning with world model for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14749–14759, 2024. 1, 2, 3
[70]

Panacea: Panoramic and controllable video generation for autonomous driving

Yuqing Wen, Yucheng Zhao, Yingfei Liu, Fan Jia, Yanhui Wang, Chong Luo, Chi Zhang, Tiancai Wang, Xiaoyan Sun, and Xiangyu Zhang. Panacea: Panoramic and controllable video generation for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6902–6912, 2024. 3
[71]

Para-drive: Parallelized architecture for real-time autonomous driving

Xinshuo Weng, Boris Ivanovic, Yan Wang, Yue Wang, and Marco Pavone. Para-drive: Parallelized architecture for real-time autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15449–15458, 2024. 7
[72]

Video models are zero-shot learners and reasoners

Thaddäus Wiedemer, Yuxuan Li, Paul Vicol, Shixiang Shane Gu, Nick Matarese, Kevin Swersky, Been Kim, Priyank Jaini, and Robert Geirhos. Video models are zero-shot learners and reasoners. arXiv preprint arXiv:2509.20328, 2025. 3
[73]

Goalflow: Goal-driven flow matching for multimodal trajectories generation in end-to-end autonomous driving

Zebin Xing, Xingyu Zhang, Yang Hu, Bo Jiang, Tong He, Qian Zhang, Xiaoxiao Long, and Wei Yin. Goalflow: Goal-driven flow matching for multimodal trajectories generation in end-to-end autonomous driving. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 1602–1611, 2025. 3
[74]

Street gaussians: Modeling dynamic urban scenes with gaussian splatting

Yunzhi Yan, Haotong Lin, Chenxu Zhou, Weijie Wang, Haiyang Sun, Kun Zhan, Xianpeng Lang, Xiaowei Zhou, and Sida Peng. Street gaussians: Modeling dynamic urban scenes with gaussian splatting. In European Conference on Computer Vision, pages 156–173. Springer, 2024. 2
[75]

Bevformer v2: Adapting modern image backbones to bird’s-eye-view recognition via perspective supervision

Chenyu Yang, Yuntao Chen, Hao Tian, Chenxin Tao, Xizhou Zhu, Zhaoxiang Zhang, Gao Huang, Hongyang Li, Yu Qiao, Lewei Lu, et al. Bevformer v2: Adapting modern image backbones to bird’s-eye-view recognition via perspective supervision. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 17830–17839,
[76]

Generalized predictive model for autonomous driving

Jiazhi Yang, Shenyuan Gao, Yihang Qiu, Li Chen, Tianyu Li, Bo Dai, Kashyap Chitta, Penghao Wu, Jia Zeng, Ping Luo, et al. Generalized predictive model for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14662–14672, 2024. 3
[77]

ReSim: Reliable World Simulation for Autonomous Driving

Jiazhi Yang, Kashyap Chitta, Shenyuan Gao, Long Chen, Yuqian Shao, Xiaosong Jia, Hongyang Li, Andreas Geiger, Xiangyu Yue, and Li Chen. Resim: Reliable world simulation for autonomous driving. arXiv preprint arXiv:2506.09981,
[78]

CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024. 4
[79]

Root mean square layer normalization

Biao Zhang and Rico Sennrich. Root mean square layer normalization. Advances in neural information processing systems, 32, 2019. 4
[80]

Epona: Autoregressive diffusion world model for autonomous driving

Kaiwen Zhang, Zhenyu Tang, Xiaotao Hu, Xingang Pan, Xiaoyang Guo, Yuan Liu, Jingwei Huang, Li Yuan, Qian Zhang, Xiao-Xiao Long, et al. Epona: Autoregressive diffusion world model for autonomous driving. arXiv preprint arXiv:2506.24113, 2025. 2, 3, 6, 7, 8


[1]

Principal component analysis

Hervé Abdi and Lynne J Williams. Principal component analysis. Wiley interdisciplinary reviews: computational statistics, 2(4):433–459, 2010. 7

[2]

World Simulation with Video Foundation Models for Physical AI

Arslan Ali, Junjie Bai, Maciej Bala, Yogesh Balaji, Aaron Blakeman, Tiffany Cai, Jiaxin Cao, Tianshi Cao, Elizabeth Cha, Yu-Wei Chao, et al. World simulation with video foundation models for physical ai. arXiv preprint arXiv:2511.00062,

[3]

V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, et al. V-jepa 2: Self-supervised video models enable understanding, prediction and planning. arXiv preprint arXiv:2506.09985, 2025. 2

[4]

Vavim and vavam: Autonomous driving through video generative modeling

Florent Bartoccioni, Elias Ramzi, Victor Besnier, Shashanka Venkataramanan, Tuan-Hung Vu, Yihong Xu, Loick Chambon, Spyros Gidaris, Serkan Odabas, David Hurych, et al. Vavim and vavam: Autonomous driving through video generative modeling. arXiv preprint arXiv:2502.15672, 2025. 3

[5]

$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A vision-language-action flow model for general robot control. CoRR, abs/2410.24164, 2024. doi: 10.48550/arXiv.2410.24164. 3

[6]

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023. 3

[7]

Video generation models as world simulators

Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, et al. Video generation models as world simulators. OpenAI Blog, 1(8):1, 2024. 2

[8]

Genie: Generative interactive environments

Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. Genie: Generative interactive environments. In Forty-first International Conference on Machine Learning, 2024. 2, 3

[9]

nuscenes: A multimodal dataset for autonomous driving

Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11621–11631, 2020. 2, 6

[10]

NuPlan: A closed-loop ML-based planning benchmark for autonomous vehicles

Holger Caesar, Juraj Kabzan, Kok Seang Tan, Whye Kit Fong, Eric Wolff, Alex Lang, Luke Fletcher, Oscar Beijbom, and Sammy Omari. nuplan: A closed-loop ml-based planning benchmark for autonomous vehicles. arXiv preprint arXiv:2106.11810, 2021. 6

[11]

PixArt-$\alpha$: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis

Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, et al. Pixart-alpha: Fast training of diffusion transformer for photorealistic text-to-image synthesis. arXiv preprint arXiv:2310.00426, 2023. 4

[12]

VADv2: End-to-End Vectorized Autonomous Driving via Probabilistic Planning

Shaoyu Chen, Bo Jiang, Hao Gao, Bencheng Liao, Qing Xu, Qian Zhang, Chang Huang, Wenyu Liu, and Xinggang Wang. Vadv2: End-to-end vectorized autonomous driving via probabilistic planning. arXiv preprint arXiv:2402.13243,

[13]

Drivinggpt: Unifying driving world modeling and planning with multi-modal autoregressive transformers

Yuntao Chen, Yuqi Wang, and Zhaoxiang Zhang. Drivinggpt: Unifying driving world modeling and planning with multi-modal autoregressive transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 26890–26900, 2025. 3, 7

[14]

Diffusion policy: Visuomotor policy learning via action diffusion

Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, 44(10-11):1684–1704, 2025. 3

[15]

Transfuser: Imitation with transformer-based sensor fusion for autonomous driving

Kashyap Chitta, Aditya Prakash, Bernhard Jaeger, Zehao Yu, Katrin Renz, and Andreas Geiger. Transfuser: Imitation with transformer-based sensor fusion for autonomous driving. IEEE transactions on pattern analysis and machine intelligence, 45(11):12878–12895, 2022. 7

[16]

Navsim: Data-driven non-reactive autonomous vehicle simulation and benchmarking

Daniel Dauner, Marcel Hallgarten, Tianyu Li, Xinshuo Weng, Zhiyu Huang, Zetong Yang, Hongyang Li, Igor Gilitschenski, Boris Ivanovic, Marco Pavone, et al. Navsim: Data-driven non-reactive autonomous vehicle simulation and benchmarking. Advances in Neural Information Processing Systems, 37:28706–28719, 2024. 2, 6

[17]

Rad: Training an end-to-end driving policy via large-scale 3dgs-based reinforcement learning

Hao Gao, Shaoyu Chen, Bo Jiang, Bencheng Liao, Yiang Shi, Xiaoyang Guo, Yuechuan Pu, Haoran Yin, Xiangyu Li, Xinbang Zhang, et al. Rad: Training an end-to-end driving policy via large-scale 3dgs-based reinforcement learning. arXiv preprint arXiv:2502.13144, 2025. 1, 2

[18]

Magicdrive: Street view generation with diverse 3d geometry control

Ruiyuan Gao, Kai Chen, Enze Xie, Lanqing Hong, Zhenguo Li, Dit-Yan Yeung, and Qiang Xu. Magicdrive: Street view generation with diverse 3d geometry control. arXiv preprint arXiv:2310.02601, 2023. 1, 2, 3

[19]

Magicdrive-v2: High-resolution long video generation for autonomous driving with adaptive control

Ruiyuan Gao, Kai Chen, Bo Xiao, Lanqing Hong, Zhenguo Li, and Qiang Xu. Magicdrive-v2: High-resolution long video generation for autonomous driving with adaptive control. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 28135–28144, 2025. 3

[20]

Vista: A generalizable driving world model with high fidelity and versatile controllability

Shenyuan Gao, Jiazhi Yang, Li Chen, Kashyap Chitta, Yihang Qiu, Andreas Geiger, Jun Zhang, and Hongyang Li. Vista: A generalizable driving world model with high fidelity and versatile controllability. Advances in Neural Information Processing Systems, 37:91560–91596, 2024. 1, 2, 7

[21]

Dist-4d: Disentangled spatiotemporal diffusion with metric depth for 4d driving scene generation

Jiazhe Guo, Yikang Ding, Xiwu Chen, Shuo Chen, Bohan Li, Yingshuang Zou, Xiaoyang Lyu, Feiyang Tan, Xiaojuan Qi, Zhiheng Li, et al. Dist-4d: Disentangled spatiotemporal diffusion with metric depth for 4d driving scene generation. arXiv preprint arXiv:2503.15208, 2025

[22]

Genesis: Multimodal driving scene generation with spatio-temporal and cross-modal consistency

Xiangyu Guo, Zhanqian Wu, Kaixin Xiong, Ziyang Xu, Lijun Zhou, Gangwei Xu, Shaoqing Xu, Haiyang Sun, Bing Wang, Guang Chen, et al. Genesis: Multimodal driving scene generation with spatio-temporal and cross-modal consistency. arXiv preprint arXiv:2506.07497, 2025. 2

[23]

Are video models ready as zero-shot reasoners? an empirical study with the mme-cof benchmark

Ziyu Guo, Xinyan Chen, Renrui Zhang, Ruichuan An, Yu Qi, Dongzhi Jiang, Xiangtai Li, Manyuan Zhang, Hongsheng Li, and Pheng-Ann Heng. Are video models ready as zero-shot reasoners? an empirical study with the mme-cof benchmark. arXiv preprint arXiv:2510.26802, 2025. 3

[24]

LTX-Video: Realtime Video Latent Diffusion

Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richardson, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, et al. Ltx-video: Realtime video latent diffusion. arXiv preprint arXiv:2501.00103, 2024. 4, 6

[25]

Flexible diffusion modeling of long videos

William Harvey, Saeid Naderiparizi, Vaden Masrani, Christian Weilbach, and Frank Wood. Flexible diffusion modeling of long videos. Advances in neural information processing systems, 35:27953–27965, 2022. 3

[26]

Gem: A generalizable ego-vision multimodal world model for fine-grained ego-motion, object dynamics, and scene composition control

Mariam Hassan, Sebastian Stapf, Ahmad Rahimi, Pedro Rezende, Yasaman Haghighi, David Brüggemann, Isinsu Katircioglu, Lin Zhang, Xiaoran Chen, Suman Saha, et al. Gem: A generalizable ego-vision multimodal world model for fine-grained ego-motion, object dynamics, and scene composition control. In Proceedings of the Computer Vision and Pattern Recognition C...

[27]

Gans trained by a two time-scale update rule converge to a local nash equilibrium

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017. 6

[28]

Imagen Video: High Definition Video Generation with Diffusion Models

Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022. 3

[29]

GAIA-1: A Generative World Model for Autonomous Driving

Anthony Hu, Lloyd Russell, Hudson Yeo, Zak Murez, George Fedoseev, Alex Kendall, Jamie Shotton, and Gianluca Corrado. Gaia-1: A generative world model for autonomous driving. arXiv preprint arXiv:2309.17080, 2023. 1, 2, 3

[30]

Drivingworld: Constructing world model for autonomous driving via video gpt

Xiaotao Hu, Wei Yin, Mingkai Jia, Junyuan Deng, Xiaoyang Guo, Qian Zhang, Xiaoxiao Long, and Ping Tan. Drivingworld: Constructing world model for autonomous driving via video gpt. arXiv preprint arXiv:2412.19505, 2024. 3

[31]

Planning-oriented autonomous driving

Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wenhai Wang, et al. Planning-oriented autonomous driving. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 17853–17862, 2023. 1, 7

[32] [32]

BEVDet: High-performance Multi-camera 3D Object Detection in Bird-Eye-View

Junjie Huang, Guan Huang, Zheng Zhu, Yun Ye, and Dalong Du. Bevdet: High-performance multi-camera 3d object de- tection in bird-eye-view.arXiv preprint arXiv:2112.11790,

work page internal anchor Pith review Pith/arXiv arXiv

[33] [33]

Diffuseslide: Training-free high frame rate video generation diffusion, 2025

Geunmin Hwang, Hyun kyu Ko, Younghyun Kim, Seungry- ong Lee, and Eunbyung Park. Diffuseslide: Training-free high frame rate video generation diffusion, 2025. 5

work page 2025

[34] [34]

Vad: Vectorized scene representation for ef- ficient autonomous driving

Bo Jiang, Shaoyu Chen, Qing Xu, Bencheng Liao, Jiajie Chen, Helong Zhou, Qian Zhang, Wenyu Liu, Chang Huang, and Xinggang Wang. Vad: Vectorized scene representation for ef- ficient autonomous driving. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 8340– 8350, 2023. 1

work page 2023

[35] [35]

Drivegan: Towards a controllable high-quality neural simulation

Seung Wook Kim, Jonah Philion, Antonio Torralba, and Sanja Fidler. Drivegan: Towards a controllable high-quality neural simulation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5820–5829,

work page

[36] [36]

HunyuanVideo: A Systematic Framework For Large Video Generative Models

Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, 2024. 4

work page internal anchor Pith review Pith/arXiv arXiv 2024

[37] [37]

Pyramid- flow: High-resolution defect contrastive localization using pyramid normalizing flow

Jiarui Lei, Xiaobo Hu, Yue Wang, and Dong Liu. Pyramid- flow: High-resolution defect contrastive localization using pyramid normalizing flow. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14143–14152, 2023. 4

work page 2023

[38] [38]

Uniscene: Unified occupancy-centric driving scene generation

Bohan Li, Jiazhe Guo, Hongsi Liu, Yingshuang Zou, Yikang Ding, Xiwu Chen, Hu Zhu, Feiyang Tan, Chi Zhang, Tiancai Wang, et al. Uniscene: Unified occupancy-centric driving scene generation. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 11971–11981, 2025. 2

work page 2025

[39] [39]

Omninwm: Omniscient driving navigation world models

Bohan Li, Zhuang Ma, Dalong Du, Baorui Peng, Zhujin Liang, Zhenqiang Liu, Chao Ma, Yueming Jin, Hao Zhao, Wenjun Zeng, et al. Omninwm: Omniscient driving naviga- tion world models.arXiv preprint arXiv:2510.18313, 2025. 1, 2, 3

work page arXiv 2025

[40] [40]

Drivingdiffusion: layout-guided multi-view driving scenarios video generation with latent diffusion model

Xiaofan Li, Yifu Zhang, and Xiaoqing Ye. Drivingdiffusion: layout-guided multi-view driving scenarios video generation with latent diffusion model. InEuropean Conference on Computer Vision, pages 469–485. Springer, 2024. 1, 3

work page 2024

[41] [41]

Enhancing End-to-End Autonomous Driving with Latent World Model

Yingyan Li, Lue Fan, Jiawei He, Yuqi Wang, Yuntao Chen, Zhaoxiang Zhang, and Tieniu Tan. Enhancing end-to-end autonomous driving with latent world model.arXiv preprint arXiv:2406.08481, 2024. 1, 2, 7

work page internal anchor Pith review arXiv 2024

[42] [42]

DriveVLA-W0: World Models Amplify Data Scaling Law in Autonomous Driving

Yingyan Li, Shuyao Shang, Weisong Liu, Bing Zhan, Haochen Wang, Yuqi Wang, Yuntao Chen, Xiaoman Wang, Yasong An, Chufeng Tang, et al. Drivevla-w0: World mod- els amplify data scaling law in autonomous driving.arXiv preprint arXiv:2510.12796, 2025. 1, 2, 3, 6, 7

work page internal anchor Pith review arXiv 2025

[43] [43]

turn left

Yingyan Li, Yuqi Wang, Yang Liu, Jiawei He, Lue Fan, and Zhaoxiang Zhang. End-to-end driving with online tra- jectory evaluation via bev world model.arXiv preprint arXiv:2504.01941, 2025. 2, 7

work page arXiv 2025

[44] [44]

ReCogDrive: A Reinforced Cognitive Framework for End-to-End Autonomous Driving

Yongkang Li, Kaixin Xiong, Xiangyu Guo, Fang Li, Sixu Yan, Gangwei Xu, Lijun Zhou, Long Chen, Haiyang Sun, Bing Wang, et al. Recogdrive: A reinforced cognitive frame- work for end-to-end autonomous driving.arXiv preprint arXiv:2506.08052, 2025. 1, 3, 5, 6, 7, 9

work page internal anchor Pith review Pith/arXiv arXiv 2025

[45] [45]

Bevformer: learning bird’s-eye-view representation from lidar-camera via spatiotemporal transformers

Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chonghao Sima, Tong Lu, Qiao Yu, and Jifeng Dai. Bevformer: learning bird’s-eye-view representation from lidar-camera via spatiotemporal transformers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024. 1, 9

work page 2024

[46] [46]

Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models

Weixin Liang, Lili Yu, Liang Luo, Srinivasan Iyer, Ning Dong, Chunting Zhou, Gargi Ghosh, Mike Lewis, Wen-tau Yih, Luke Zettlemoyer, et al. Mixture-of-transformers: A sparse and scalable architecture for multi-modal foundation models. arXiv preprint arXiv:2411.04996, 2024. 3

work page internal anchor Pith review arXiv 2024

[47] [47]

Maptr: Structured modeling and learning for online vectorized hd map construction

Bencheng Liao, Shaoyu Chen, Xinggang Wang, Tianheng Cheng, Qian Zhang, Wenyu Liu, and Chang Huang. Maptr: Structured modeling and learning for online vectorized hd map construction. arXiv preprint arXiv:2208.14437, 2022. 1

work page arXiv 2022

[48] [48]

Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving

Bencheng Liao, Shaoyu Chen, Haoran Yin, Bo Jiang, Cheng Wang, Sixu Yan, Xinbang Zhang, Xiangyu Li, Ying Zhang, Qian Zhang, et al. Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 12037–12047, 2025. 1, 3, 5, 6, 7

work page 2025

[49] [49]

Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation

Yue Liao, Pengfei Zhou, Siyuan Huang, Donglin Yang, Shengcong Chen, Yuxin Jiang, Yue Hu, Jingbin Cai, Si Liu, Jianlan Luo, et al. Genie envisioner: A unified world foundation platform for robotic manipulation. arXiv preprint arXiv:2508.05635, 2025. 2

work page internal anchor Pith review Pith/arXiv arXiv 2025

[50] [50]

Flow Matching for Generative Modeling

Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022. 5, 6

work page internal anchor Pith review Pith/arXiv arXiv 2022

[51] [51]

Deep Predictive Coding Networks for Video Prediction and Unsupervised Learning

William Lotter, Gabriel Kreiman, and David Cox. Deep predictive coding networks for video prediction and unsupervised learning. arXiv preprint arXiv:1605.08104, 2016. 3

work page internal anchor Pith review Pith/arXiv arXiv 2016

[52] [52]

Wovogen: World volume-aware diffusion for controllable multi-camera driving scene generation

Jiachen Lu, Ze Huang, Zeyu Yang, Jiahui Zhang, and Li Zhang. Wovogen: World volume-aware diffusion for controllable multi-camera driving scene generation. In European Conference on Computer Vision, pages 329–345. Springer,

work page

[53] [53]

Unleashing generalization of end-to-end autonomous driving with controllable long video generation

Enhui Ma, Lijun Zhou, Tao Tang, Zhan Zhang, Dong Han, Junpeng Jiang, Kun Zhan, Peng Jia, Xianpeng Lang, Haiyang Sun, et al. Unleashing generalization of end-to-end autonomous driving with controllable long video generation. arXiv preprint arXiv:2406.01349, 2024. 2

work page arXiv 2024

[54] [54]

Driveworld: 4d pre-trained scene understanding via world models for autonomous driving

Chen Min, Dawei Zhao, Liang Xiao, Jian Zhao, Xinli Xu, Zheng Zhu, Lei Jin, Jianshu Li, Yulan Guo, Junliang Xing, et al. Driveworld: 4d pre-trained scene understanding via world models for autonomous driving. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15522–15533, 2024. 1, 7

work page 2024

[55] [55]

Recondreamer-rl: Enhancing reinforcement learning via diffusion-based scene reconstruction

Chaojun Ni, Guosheng Zhao, Xiaofeng Wang, Zheng Zhu, Wenkang Qin, Xinze Chen, Guanghong Jia, Guan Huang, and Wenjun Mei. Recondreamer-rl: Enhancing reinforcement learning via diffusion-based scene reconstruction. arXiv preprint arXiv:2508.08170, 2025. 2

work page arXiv 2025

[56] [56]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205,

work page

[57] [57]

Openscene: 3d scene understanding with open vocabularies

Songyou Peng, Kyle Genova, Chiyu Jiang, Andrea Tagliasacchi, Marc Pollefeys, Thomas Funkhouser, et al. Openscene: 3d scene understanding with open vocabularies. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 815–824, 2023. 6

work page 2023

[58] [58]

Movie Gen: A Cast of Media Foundation Models

Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, Bowen Shi, Chih-Yao Ma, Ching-Yao Chuang, et al. Movie gen: A cast of media foundation models. arXiv preprint arXiv:2410.13720,

work page internal anchor Pith review Pith/arXiv arXiv

[59] [59]

Cosmos-drive-dreams: Scalable synthetic driving data generation with world foundation models

Xuanchi Ren, Yifan Lu, Tianshi Cao, Ruiyuan Gao, Shengyu Huang, Amirmojtaba Sabour, Tianchang Shen, Tobias Pfaff, Jay Zhangjie Wu, Runjian Chen, et al. Cosmos-drive-dreams: Scalable synthetic driving data generation with world foundation models. arXiv preprint arXiv:2506.09042, 2025. 3

work page arXiv 2025

[60] [60]

GAIA-2: A Controllable Multi-View Generative World Model for Autonomous Driving

Lloyd Russell, Anthony Hu, Lorenzo Bertoni, George Fedoseev, Jamie Shotton, Elahe Arani, and Gianluca Corrado. Gaia-2: A controllable multi-view generative world model for autonomous driving. arXiv preprint arXiv:2503.20523,

work page internal anchor Pith review Pith/arXiv arXiv

[61] [61]

Roformer: Enhanced transformer with rotary position embedding

Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063,

work page

[62] [62]

Towards Accurate Generative Models of Video: A New Metric & Challenges

Thomas Unterthiner, Sjoerd Van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717, 2018. 6

work page internal anchor Pith review Pith/arXiv arXiv 2018

[63] [63]

Attention is all you need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017. 4

work page 2017

[64] [64]

Decomposing Motion and Content for Natural Video Sequence Prediction

Ruben Villegas, Jimei Yang, Seunghoon Hong, Xunyu Lin, and Honglak Lee. Decomposing motion and content for natural video sequence prediction. arXiv preprint arXiv:1706.08033, 2017. 3

work page internal anchor Pith review Pith/arXiv arXiv 2017

[65] [65]

Mila: Multi-view intensive-fidelity long-term video generation world model for autonomous driving

Haiguang Wang, Daqi Liu, Hongwei Xie, Haisong Liu, Enhui Ma, Kaicheng Yu, Limin Wang, and Bing Wang. Mila: Multi-view intensive-fidelity long-term video generation world model for autonomous driving. arXiv preprint arXiv:2503.15875, 2025. 2, 3

work page arXiv 2025

[66] [66]

Terasim-world: Worldwide safety-critical data synthesis for end-to-end autonomous driving

Jiawei Wang, Haowei Sun, Xintao Yan, Shuo Feng, Jun Gao, and Henry X Liu. Terasim-world: Worldwide safety-critical data synthesis for end-to-end autonomous driving. arXiv preprint arXiv:2509.13164, 2025. 3

work page arXiv 2025

[67] [67]

Occsora: 4d occupancy generation models as world simulators for autonomous driving

Lening Wang, Wenzhao Zheng, Yilong Ren, Han Jiang, Zhiyong Cui, Haiyang Yu, and Jiwen Lu. Occsora: 4d occupancy generation models as world simulators for autonomous driving. arXiv preprint arXiv:2405.20337, 2024. 2

work page arXiv 2024

[68] [68]

Drivedreamer: Towards real-world-drive world models for autonomous driving

Xiaofeng Wang, Zheng Zhu, Guan Huang, Xinze Chen, Jiagang Zhu, and Jiwen Lu. Drivedreamer: Towards real-world-drive world models for autonomous driving. In European conference on computer vision, pages 55–72. Springer, 2024. 1, 2, 3, 7

work page 2024

[69] [69]

Driving into the future: Multiview visual forecasting and planning with world model for autonomous driving

Yuqi Wang, Jiawei He, Lue Fan, Hongxin Li, Yuntao Chen, and Zhaoxiang Zhang. Driving into the future: Multiview visual forecasting and planning with world model for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14749–14759, 2024. 1, 2, 3

work page 2024

[70] [70]

Panacea: Panoramic and controllable video generation for autonomous driving

Yuqing Wen, Yucheng Zhao, Yingfei Liu, Fan Jia, Yanhui Wang, Chong Luo, Chi Zhang, Tiancai Wang, Xiaoyan Sun, and Xiangyu Zhang. Panacea: Panoramic and controllable video generation for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6902–6912, 2024. 3

work page 2024

[71] [71]

Para-drive: Parallelized architecture for real-time autonomous driving

Xinshuo Weng, Boris Ivanovic, Yan Wang, Yue Wang, and Marco Pavone. Para-drive: Parallelized architecture for real-time autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15449–15458, 2024. 7

work page 2024

[72] [72]

Video models are zero-shot learners and reasoners

Thaddäus Wiedemer, Yuxuan Li, Paul Vicol, Shixiang Shane Gu, Nick Matarese, Kevin Swersky, Been Kim, Priyank Jaini, and Robert Geirhos. Video models are zero-shot learners and reasoners. arXiv preprint arXiv:2509.20328, 2025. 3

work page internal anchor Pith review Pith/arXiv arXiv 2025

[73] [73]

Goalflow: Goal-driven flow matching for multimodal trajectories generation in end-to-end autonomous driving

Zebin Xing, Xingyu Zhang, Yang Hu, Bo Jiang, Tong He, Qian Zhang, Xiaoxiao Long, and Wei Yin. Goalflow: Goal-driven flow matching for multimodal trajectories generation in end-to-end autonomous driving. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 1602–1611, 2025. 3

work page 2025

[74] [74]

Street gaussians: Modeling dynamic urban scenes with gaussian splatting

Yunzhi Yan, Haotong Lin, Chenxu Zhou, Weijie Wang, Haiyang Sun, Kun Zhan, Xianpeng Lang, Xiaowei Zhou, and Sida Peng. Street gaussians: Modeling dynamic urban scenes with gaussian splatting. In European Conference on Computer Vision, pages 156–173. Springer, 2024. 2

work page 2024

[75] [75]

Bevformer v2: Adapting modern image backbones to bird’s-eye-view recognition via perspective supervision

Chenyu Yang, Yuntao Chen, Hao Tian, Chenxin Tao, Xizhou Zhu, Zhaoxiang Zhang, Gao Huang, Hongyang Li, Yu Qiao, Lewei Lu, et al. Bevformer v2: Adapting modern image backbones to bird’s-eye-view recognition via perspective supervision. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 17830–17839,

work page

[76] [76]

Generalized predictive model for autonomous driving

Jiazhi Yang, Shenyuan Gao, Yihang Qiu, Li Chen, Tianyu Li, Bo Dai, Kashyap Chitta, Penghao Wu, Jia Zeng, Ping Luo, et al. Generalized predictive model for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14662–14672, 2024. 3

work page 2024

[77] [77]

ReSim: Reliable World Simulation for Autonomous Driving

Jiazhi Yang, Kashyap Chitta, Shenyuan Gao, Long Chen, Yuqian Shao, Xiaosong Jia, Hongyang Li, Andreas Geiger, Xiangyu Yue, and Li Chen. Resim: Reliable world simulation for autonomous driving. arXiv preprint arXiv:2506.09981,

work page internal anchor Pith review Pith/arXiv arXiv

[78] [78]

CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024. 4

work page internal anchor Pith review Pith/arXiv arXiv 2024

[79] [79]

Root mean square layer normalization

Biao Zhang and Rico Sennrich. Root mean square layer normalization. Advances in neural information processing systems, 32, 2019. 4

work page 2019

[80] [80]

Epona: Autoregressive diffusion world model for autonomous driving

Kaiwen Zhang, Zhenyu Tang, Xiaotao Hu, Xingang Pan, Xiaoyang Guo, Yuan Liu, Jingwei Huang, Li Yuan, Qian Zhang, Xiao-Xiao Long, et al. Epona: Autoregressive diffusion world model for autonomous driving. arXiv preprint arXiv:2506.24113, 2025. 2, 3, 6, 7, 8

work page arXiv 2025