Drive-JEPA: Video JEPA Meets Multimodal Trajectory Distillation for End-to-End Driving
End-to-end autonomous driving increasingly leverages self-supervised video pretraining to learn transferable planning representations. However, pretraining video world models for scene understanding has so far brought only limited improvements. This limitation is compounded by the inherent ambiguity of driving: each scene typically provides only a single human trajectory, making it difficult to learn multimodal behaviors. In this work, we propose Drive-JEPA, a framework that integrates Video Joint-Embedding Predictive Architecture (V-JEPA) with multimodal trajectory distillation for end-to-end driving. First, we adapt V-JEPA for end-to-end driving, pretraining a ViT encoder on large-scale driving videos to produce predictive representations aligned with trajectory planning. Second, we introduce a proposal-centric planner that distills diverse simulator-generated trajectories alongside human trajectories, with a momentum-aware selection mechanism to promote stable and safe behavior. When evaluated on NAVSIM, the V-JEPA representation combined with a simple transformer-based decoder outperforms prior methods by 3 PDMS in the perception-free setting. The complete Drive-JEPA framework achieves 93.3 PDMS on v1 and 87.8 EPDMS on v2, setting a new state-of-the-art.
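The abstract does not say how the momentum-aware selection mechanism works. As a rough illustration only, one plausible reading is that candidate trajectories are ranked by a score and then penalized for deviating from the trajectory chosen at the previous step, which discourages frame-to-frame flip-flopping. Everything in the sketch below is an assumption made for illustration and is not taken from the paper: the function names `mean_displacement` and `select_with_momentum`, the `momentum_weight` parameter, the distance measure, and the example scores.

```python
from math import hypot


def mean_displacement(a, b):
    """Average pointwise Euclidean distance between two equal-length trajectories."""
    return sum(hypot(ax - bx, ay - by) for (ax, ay), (bx, by) in zip(a, b)) / len(a)


def select_with_momentum(proposals, scores, previous=None, momentum_weight=0.5):
    """Return the index of the proposal with the best momentum-adjusted score.

    Hypothetical rule: adjusted = score - momentum_weight * distance_to_previous.
    With no previous trajectory, this reduces to picking the highest raw score.
    """
    best_index, best_value = 0, float("-inf")
    for i, (trajectory, score) in enumerate(zip(proposals, scores)):
        penalty = 0.0
        if previous is not None:
            penalty = momentum_weight * mean_displacement(trajectory, previous)
        value = score - penalty
        if value > best_value:
            best_index, best_value = i, value
    return best_index


# Two candidate trajectories as (x, y) waypoints in metres (made-up values).
straight = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
swerve = [(1.0, 0.5), (2.0, 1.0), (3.0, 1.5)]
proposals = [straight, swerve]
scores = [0.80, 0.85]  # made-up scorer outputs; swerve is marginally preferred

first_choice = select_with_momentum(proposals, scores)
stable_choice = select_with_momentum(proposals, scores, previous=straight)
```

In this toy setup the raw scores slightly favour the swerving proposal, but once the previously selected straight trajectory is supplied, the deviation penalty outweighs the small score gap and the straight proposal is kept. The actual mechanism in Drive-JEPA may differ substantially.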
Forward citations
Cited by 6 Pith papers
- NTR: Neural Token Reconstruction for Scene Token Bottleneck in End-to-End Driving
  NTR adds a self-distillation masked latent reconstruction objective that uses only scene tokens to reconstruct masked patch features, improving visual representation quality and planning performance in end-to-end auto...
- The DAWN of World-Action Interactive Models
  DAWN couples a world predictor with a world-conditioned action denoiser in latent space so that each refines the other recursively, yielding strong planning and safety results on autonomous driving benchmarks.
- Human Cognition in Machines: A Unified Perspective of World Models
  The paper introduces a unified framework for world models that fully incorporates all cognitive functions from Cognitive Architecture Theory, highlights under-researched areas in motivation and meta-cognition, and pro...
- Zero-Label Driving Scenario Complexity Detection via Joint Embedding Predictive Architecture
  A self-supervised JEPA model on nuPlan data uses temporal prediction error to score driving scenario complexity without labels, assigning higher scores to turns and pedestrian interactions and achieving AP 0.512 in an...
- Discrete-WAM: Unified Discrete Vision-Action Token Editing for World-Policy Learning
  Discrete-WAM unifies world modeling and policy learning for autonomous driving by representing observations, states, decisions, and actions as tokens in one space and using hierarchical token editing for planning.
- CLEAR: Cognition and Latent Evaluation for Adaptive Routing in End-to-End Autonomous Driving
  CLEAR achieves state-of-the-art PDMS of 93.7 on NAVSIM v1 by combining single-step VAE latent drift with Qwen 3.5-guided adaptive scheduling and trajectory scoring for end-to-end driving.