DriveVA: Video Action Models are Zero-Shot Drivers

Diankun Zhang; Francesco Nex; Guang Chen; Hangjun Ye; Hao Cheng; Hongwei Xie; Jianfeng Cui; Jiuming Liu; Mengmeng Liu; Michael Ying Yang

arxiv: 2604.04198 · v1 · submitted 2026-04-05 · 💻 cs.CV · cs.RO

keywords: driveva, action, future, generalization, planning, video, autonomous, challenge

Abstract

Generalization is a central challenge in autonomous driving, as real-world deployment requires robust performance under unseen scenarios, sensor domains, and environmental conditions. Recent world-model-based planning methods have shown strong capabilities in scene understanding and multi-modal future prediction, yet their generalization across datasets and sensor configurations remains limited. In addition, their loosely coupled planning paradigm often leads to poor video-trajectory consistency during visual imagination. To overcome these limitations, we propose DriveVA, a novel autonomous driving world model that jointly decodes future visual forecasts and action sequences in a shared latent generative process. DriveVA inherits rich priors on motion dynamics and physical plausibility from large-scale pretrained video generation models to capture continuous spatiotemporal evolution and causal interaction patterns. To this end, DriveVA employs a DiT-based decoder to jointly predict future action sequences (trajectories) and videos, enabling tighter alignment between planning and scene evolution. We also introduce a video continuation strategy to strengthen long-duration rollout consistency. DriveVA achieves a closed-loop PDM score of 90.9 on the challenging NAVSIM benchmark. Extensive experiments also demonstrate the zero-shot capability and cross-domain generalization of DriveVA: compared with the state-of-the-art world-model-based planner, it reduces average L2 error and collision rate by 78.9% and 83.3%, respectively, on nuScenes, and by 52.5% and 52.4%, respectively, on Bench2Drive, which is built on CARLA v2.
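The abstract reports percentage reductions in average L2 error without defining the metric. As a rough illustration only, the sketch below assumes the common reading: the mean Euclidean distance between predicted and reference waypoints, with the reduction measured relative to a baseline planner. The function names and the toy waypoints are hypothetical and are not taken from the paper, whose exact evaluation protocol may differ (for example, in how horizons are averaged).

```python
import math


def average_l2_error(predicted, reference):
    """Mean Euclidean distance between matching (x, y) waypoints."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("trajectories must be non-empty and equally long")
    distances = [math.dist(p, r) for p, r in zip(predicted, reference)]
    return sum(distances) / len(distances)


def relative_reduction_percent(baseline, new):
    """Percentage by which `new` is lower than `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (baseline - new) / baseline


# Toy waypoints in metres, invented for illustration (not paper data).
reference = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
baseline_plan = [(1.0, 1.0), (2.0, 1.0), (3.0, 1.0)]      # 1.0 m off at every step
improved_plan = [(1.0, 0.25), (2.0, 0.25), (3.0, 0.25)]   # 0.25 m off at every step

baseline_err = average_l2_error(baseline_plan, reference)
improved_err = average_l2_error(improved_plan, reference)
reduction = relative_reduction_percent(baseline_err, improved_err)
print(f"L2 error: {baseline_err:.2f} m -> {improved_err:.2f} m ({reduction:.1f}% lower)")
```

Under this reading, a figure such as 78.9% means the new planner's error is about one fifth of the baseline's, not that the error dropped by 78.9 units.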

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

LVDrive: Latent Visual Representation Enhanced Vision-Language-Action Autonomous Driving Model
cs.CV 2026-05 unverdicted novelty 5.0

LVDrive improves closed-loop driving on Bench2Drive by adding latent future scene prediction to VLA models via unified embedding space processing and two-stage trajectory decoding.