VLA-World improves autonomous driving by using action-guided future image generation followed by reflective reasoning over the imagined scene to refine trajectories.
Canonical reference
Omninwm: Omniscient driving navigation world models
Canonical reference. 100% of citing Pith papers cite this work as background.
abstract
Autonomous driving world models are expected to work effectively across three core dimensions: state, action, and reward. However, existing methods are typically restricted to fragmented modality modeling, short-horizon drift, and imprecise action control, while lacking intrinsic mechanisms for policy evaluation. In this paper, we introduce OmniNWM, an Omniscient panoramic Navigation World Model that addresses all three dimensions within a consistent probabilistic framework. For State, OmniNWM generates panoramic videos of RGB, semantics, metric depth, and 3D occupancy, ensuring pixel-level alignment across modalities with joint distribution modeling. To mitigate autoregressive exposure bias, we propose a structured panoramic forcing strategy to stabilize long-horizon generation via stochastic manifold thickening. For Action, we introduce canonical geometric action encoding with normalized panoramic Pl\"ucker ray-maps. This representation decouples motion dynamics from sensor intrinsics, enabling precise, zero-shot trajectory control across heterogeneous datasets and camera configurations. For Reward, we derive intrinsic occupancy-grounded dense rewards directly from generated 3D volumes, establishing a reliable closed-loop simulation cycle for evaluating diverse planning agents. Extensive experiments demonstrate that OmniNWM achieves SOTA performance in generation fidelity and control precision, with remarkable zero-shot robustness to novel scenes on NuPlan and in-house datasets with distinct camera rigs. Project page is available at https://arlo0o.github.io/OmniNWM/.
citation-role summary
citation-polarity summary
fields
cs.CV 9roles
background 5polarities
background 5representative citing papers
PanoWorld adds depth consistency and trajectory consistency losses plus spherical adaptations to a pre-trained video model, plus a new PanoGeo dataset, to produce geometry-consistent 360 video.
SceneScribe-1M is a new dataset of 1 million videos with semantic text, camera parameters, dense depth, and consistent 3D point tracks to support monocular depth estimation, scene reconstruction, point tracking, and text-to-video synthesis.
DriveLaW unifies video world modeling and trajectory planning by injecting video-generator latents into a diffusion planner, achieving SOTA video prediction and a new record on the NAVSIM planning benchmark.
ReWorld applies future-predictive, cross-modal, and hard-negative supervision directly to intermediate representations in Video and Action DiTs for WAMs, reporting 23.9% FVD improvement and PDMS rise from 89.1 to 90.4 on nuScenes and NAVSIM.
RAD-2 uses a diffusion generator and RL discriminator to cut collision rates by 56% in closed-loop autonomous driving planning.
This survey reviews trends, challenges, benchmarks, and future directions in action-conditioned interactive world modeling for video and 3D generation.
citing papers explorer
-
Learning Vision-Language-Action World Models for Autonomous Driving
VLA-World improves autonomous driving by using action-guided future image generation followed by reflective reasoning over the imagined scene to refine trajectories.
-
PanoWorld: Geometry-Consistent Panoramic Video World Modeling
PanoWorld adds depth consistency and trajectory consistency losses plus spherical adaptations to a pre-trained video model, plus a new PanoGeo dataset, to produce geometry-consistent 360 video.
-
SceneScribe-1M: A Large-Scale Video Dataset with Comprehensive Geometric and Semantic Annotations
SceneScribe-1M is a new dataset of 1 million videos with semantic text, camera parameters, dense depth, and consistent 3D point tracks to support monocular depth estimation, scene reconstruction, point tracking, and text-to-video synthesis.
-
DriveLaW:Unifying Planning and Video Generation in a Latent Driving World
DriveLaW unifies video world modeling and trajectory planning by injecting video-generator latents into a diffusion planner, achieving SOTA video prediction and a new record on the NAVSIM planning benchmark.
-
ReWorld: Learning Better Representations for World Action Models
ReWorld applies future-predictive, cross-modal, and hard-negative supervision directly to intermediate representations in Video and Action DiTs for WAMs, reporting 23.9% FVD improvement and PDMS rise from 89.1 to 90.4 on nuScenes and NAVSIM.
-
RAD-2: Scaling Reinforcement Learning in a Generator-Discriminator Framework
RAD-2 uses a diffusion generator and RL discriminator to cut collision rates by 56% in closed-loop autonomous driving planning.
-
Towards Interactive Video World Modeling: Frontiers, Challenges, Benchmarks, and Future Trends
This survey reviews trends, challenges, benchmarks, and future directions in action-conditioned interactive world modeling for video and 3D generation.
- OpenWorldLib: A Unified Codebase and Definition of Advanced World Models
- ExploreVLA: Dense World Modeling and Exploration for End-to-End Autonomous Driving