DriveWAM: Video Generative Priors Enable Scalable World-Action Modeling for Autonomous Driving

Bo Zhang; Chen Shi; Jinrui Xu; Kehua Sheng; Li Jiang; Shaoshuai Shi

arxiv: 2605.28544 · v1 · pith:4YOKZKTWnew · submitted 2026-05-27 · 💻 cs.CV

Chen Shi, Jinrui Xu, Shaoshuai Shi, Kehua Sheng, Bo Zhang, Li Jiang

classification 💻 cs.CV

keywords driving, video, drivewam, pretrained, action, autonomous, models, priors

Pretrained foundation models have become an important basis for end-to-end autonomous driving. In contrast to vision-language models pretrained primarily on static image-text pairs, video generative models capture temporal dynamics and motion priors that are naturally suited for driving. We present DriveWAM, a driving world-action model that adapts a pretrained video diffusion transformer into an autoregressive video-action policy. DriveWAM organizes video and action streams into a unified temporal token sequence and trains them under a joint flow-matching objective, preserving the pretrained video-generation architecture while adapting its large-scale video priors to action generation. To incorporate high-level scene understanding, we introduce scene-evolving driving guidance, where a frozen VLM produces chunk-specific semantic intent to guide video-action generation. To keep long-horizon rollout bounded, we further introduce selective KV memory, which maintains bounded modality-aware video and action memory pools through relevance-redundancy cache selection at inference time. Experiments on NAVSIM and the PhysicalAI-Autonomous-Vehicles benchmark show that DriveWAM achieves strong planning performance, and a data-scaling study from 4k to 100k driving clips further confirms the scaling potential of world-action modeling for end-to-end autonomous driving.
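
The abstract describes selective KV memory only at a high level: bounded memory pools maintained through relevance-redundancy cache selection. As a rough illustration of what such a selection rule could look like, the sketch below uses a greedy, MMR-style criterion that keeps cache entries similar to the current query while penalizing entries similar to ones already kept. This is an assumption for illustration, not the authors' method: the function names `cosine` and `select_cache`, the `trade_off` weight, and the use of cosine similarity are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_cache(keys, query, budget, trade_off=0.5):
    """Greedy relevance-redundancy selection over cached key vectors.

    Hypothetical sketch (MMR-style), not the rule used in the paper.
    Returns the sorted indices of at most `budget` entries to keep.
    """
    relevance = [cosine(k, query) for k in keys]
    candidates = list(range(len(keys)))
    kept = []

    def score(i):
        # Redundancy is the highest similarity to any entry already kept.
        redundancy = max((cosine(keys[i], keys[j]) for j in kept), default=0.0)
        return trade_off * relevance[i] - (1.0 - trade_off) * redundancy

    while candidates and len(kept) < budget:
        best = max(candidates, key=score)
        kept.append(best)
        candidates.remove(best)
    return sorted(kept)


# Two near-duplicate keys and one distinct key, with room for two entries.
keys = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0]]
query = [1.0, 0.2]
kept_indices = select_cache(keys, query, budget=2)
```

With these toy inputs, the rule keeps one of the two near-duplicate keys plus the distinct one, so the memory stays within budget without storing redundant entries. A modality-aware variant could call `select_cache` separately on a video pool and an action pool, each with its own budget; how the paper actually splits and scores the pools is not stated in the abstract.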

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Kamera: Unified Position-Invariant Multimodal KV Cache for Training-Free Reuse
cs.DC 2026-06 unverdicted novelty 6.0

Kamera stores a low-rank patch with each position-free KV chunk to restore cross-chunk conditioning lost in naive reuse, enabling cheap reordering, sliding windows, and recall across attention mechanisms.
Diffusion Transformer World-Action Model for AV Scene Prediction
cs.CV 2026-06 unverdicted novelty 6.0

A Diffusion Transformer world model in V-JEPA2 latent space predicts action-conditioned future scenes on nuScenes, outperforming regression on KID/FID while preserving steering controllability and adding a jump model ...
Discrete-WAM: Unified Discrete Vision-Action Token Editing for World-Policy Learning
cs.RO 2026-06 unverdicted novelty 5.0

Discrete-WAM unifies world modeling and policy learning for autonomous driving by representing observations, states, decisions, and actions as tokens in one space and using hierarchical token editing for planning.
World Action Models: A Survey
cs.RO 2026-06 unverdicted novelty 3.0

A survey that clarifies boundaries and organizes World Action Models by generation requirements and predictive substrates, identifying a trend toward generating less of the future.