arxiv: 2601.22032 · v2 · pith:JHOSYE3Ynew · submitted 2026-01-29 · 💻 cs.CV

Drive-JEPA: Video JEPA Meets Multimodal Trajectory Distillation for End-to-End Driving

classification 💻 cs.CV
keywords: driving, end-to-end, trajectory, video, drive-jepa, multimodal, pretraining, v-jepa

End-to-end autonomous driving increasingly leverages self-supervised video pretraining to learn transferable planning representations. However, pretraining video world models for scene understanding has so far brought only limited improvements. This limitation is compounded by the inherent ambiguity of driving: each scene typically provides only a single human trajectory, making it difficult to learn multimodal behaviors. In this work, we propose Drive-JEPA, a framework that integrates Video Joint-Embedding Predictive Architecture (V-JEPA) with multimodal trajectory distillation for end-to-end driving. First, we adapt V-JEPA for end-to-end driving, pretraining a ViT encoder on large-scale driving videos to produce predictive representations aligned with trajectory planning. Second, we introduce a proposal-centric planner that distills diverse simulator-generated trajectories alongside human trajectories, with a momentum-aware selection mechanism to promote stable and safe behavior. When evaluated on NAVSIM, the V-JEPA representation combined with a simple transformer-based decoder outperforms prior methods by 3 PDMS in the perception-free setting. The complete Drive-JEPA framework achieves 93.3 PDMS on v1 and 87.8 EPDMS on v2, setting a new state-of-the-art.
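The abstract does not spell out how V-JEPA is adapted for driving, so the following is only a toy sketch of the general JEPA-style objective it builds on: predict target-encoder embeddings of masked patches from the visible context, and let the target encoder follow the context encoder through an exponential moving average. The linear "encoders", the mean-pooling predictor, and the names `encode`, `jepa_step`, and `ema_update` are illustrative assumptions, not the authors' implementation.

```python
def encode(weights, patch):
    # Toy linear "encoder" (assumption): one output value per weight row.
    return [sum(w * x for w, x in zip(row, patch)) for row in weights]


def l1_latent_loss(predicted, target):
    # Mean absolute error between predicted and target embeddings.
    count = sum(len(p) for p in predicted)
    total = sum(abs(a - b)
                for p, t in zip(predicted, target)
                for a, b in zip(p, t))
    return total / count


def ema_update(target_w, context_w, decay=0.99):
    # The target encoder tracks the context encoder via an exponential
    # moving average; it is not updated by gradients.
    return [[decay * t + (1.0 - decay) * c for t, c in zip(t_row, c_row)]
            for t_row, c_row in zip(target_w, context_w)]


def jepa_step(context_w, target_w, patches, masked_idx):
    # Encode only the visible patches with the context encoder.
    visible = [p for i, p in enumerate(patches) if i not in masked_idx]
    ctx = [encode(context_w, p) for p in visible]
    # Hypothetical predictor: the mean context embedding stands in for a
    # learned predictor network and is used for every masked patch.
    dim = len(context_w)
    mean_ctx = [sum(e[d] for e in ctx) / len(ctx) for d in range(dim)]
    predicted = [mean_ctx for _ in masked_idx]
    # Targets are embeddings of the masked patches from the target encoder.
    target = [encode(target_w, patches[i]) for i in sorted(masked_idx)]
    return l1_latent_loss(predicted, target)


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    patches = [[1.0, 1.0], [3.0, 3.0], [5.0, 5.0]]
    print(jepa_step(identity, identity, patches, {2}))
```

In a real system the encoders would be ViTs over video patches and the predictor a learned network; the point here is only that the loss is computed in embedding space rather than pixel space.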

Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. NTR: Neural Token Reconstruction for Scene Token Bottleneck in End-to-End Driving

    cs.CV 2026-05 unverdicted novelty 6.0

    NTR adds a self-distillation masked latent reconstruction objective that uses only scene tokens to reconstruct masked patch features, improving visual representation quality and planning performance in end-to-end auto...

  2. The DAWN of World-Action Interactive Models

    cs.CV 2026-05 unverdicted novelty 6.0

    DAWN couples a world predictor with a world-conditioned action denoiser in latent space so that each refines the other recursively, yielding strong planning and safety results on autonomous driving benchmarks.

  3. Human Cognition in Machines: A Unified Perspective of World Models

    cs.RO 2026-04 unverdicted novelty 6.0

    The paper introduces a unified framework for world models that fully incorporates all cognitive functions from Cognitive Architecture Theory, highlights under-researched areas in motivation and meta-cognition, and pro...

  4. Zero-Label Driving Scenario Complexity Detection via Joint Embedding Predictive Architecture

    cs.CV 2026-06 unverdicted novelty 5.0

    A self-supervised JEPA model on nuPlan data uses temporal prediction error to score driving scenario complexity without labels, assigning higher scores to turns and pedestrian interactions and achieving AP 0.512 in an...

  5. Discrete-WAM: Unified Discrete Vision-Action Token Editing for World-Policy Learning

    cs.RO 2026-06 unverdicted novelty 5.0

    Discrete-WAM unifies world modeling and policy learning for autonomous driving by representing observations, states, decisions, and actions as tokens in one space and using hierarchical token editing for planning.

  6. CLEAR: Cognition and Latent Evaluation for Adaptive Routing in End-to-End Autonomous Driving

    cs.RO 2026-06 unverdicted novelty 4.0

    CLEAR achieves state-of-the-art PDMS of 93.7 on NAVSIM v1 by combining single-step VAE latent drift with Qwen 3.5-guided adaptive scheduling and trajectory scoring for end-to-end driving.