hub

V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Mojtaba · 2025 · cs.AI · arXiv 2506.09985

64 Pith papers cite this work. Polarity classification is still indexing.

64 Pith papers citing it

open full Pith review browse 64 citing papers arXiv PDF

abstract

A major challenge for modern AI is to learn to understand the world and learn to act largely by observation. This paper explores a self-supervised approach that combines internet-scale video data with a small amount of interaction data (robot trajectories), to develop models capable of understanding, predicting, and planning in the physical world. We first pre-train an action-free joint-embedding-predictive architecture, V-JEPA 2, on a video and image dataset comprising over 1 million hours of internet video. V-JEPA 2 achieves strong performance on motion understanding (77.3 top-1 accuracy on Something-Something v2) and state-of-the-art performance on human action anticipation (39.7 recall-at-5 on Epic-Kitchens-100) surpassing previous task-specific models. Additionally, after aligning V-JEPA 2 with a large language model, we demonstrate state-of-the-art performance on multiple video question-answering tasks at the 8 billion parameter scale (e.g., 84.0 on PerceptionTest, 76.9 on TempCompass). Finally, we show how self-supervised learning can be applied to robotic planning tasks by post-training a latent action-conditioned world model, V-JEPA 2-AC, using less than 62 hours of unlabeled robot videos from the Droid dataset. We deploy V-JEPA 2-AC zero-shot on Franka arms in two different labs and enable picking and placing of objects using planning with image goals. Notably, this is achieved without collecting any data from the robots in these environments, and without any task-specific training or reward. This work demonstrates how self-supervised learning from web-scale data and a small amount of robot interaction data can yield a world model capable of planning in the physical world.

hub tools

JSON dossier citing papers JSON arXiv source

claims ledger

abstract A major challenge for modern AI is to learn to understand the world and learn to act largely by observation. This paper explores a self-supervised approach that combines internet-scale video data with a small amount of interaction data (robot trajectories), to develop models capable of understanding, predicting, and planning in the physical world. We first pre-train an action-free joint-embedding-predictive architecture, V-JEPA 2, on a video and image dataset comprising over 1 million hours of internet video. V-JEPA 2 achieves strong performance on motion understanding (77.3 top-1 accuracy on

co-cited works

representative citing papers

Open-H-Embodiment: A Large-Scale Dataset for Enabling Foundation Models in Medical Robotics

cs.RO · 2026-04-22 · conditional · novelty 8.0

Open-H-Embodiment is the largest open multi-embodiment medical robotics dataset, used to train GR00T-H, the first open vision-language-action model that achieves end-to-end suturing completion where prior models fail.

JEDI: Joint Embedding Diffusion World Model for Online Model-Based Reinforcement Learning

cs.LG · 2026-05-13 · unverdicted · novelty 7.0

JEDI is the first online end-to-end latent diffusion world model that trains latents from denoising loss rather than reconstruction, achieving competitive Atari100k results with 43% less VRAM and over 3x faster sampling than pixel diffusion baselines.

Tracing the Arrow of Time: Diagnosing Temporal Information Flow in Video-LLMs

cs.CV · 2026-05-08 · unverdicted · novelty 7.0

Temporal information in Video-LLMs is encoded well by video-centric encoders but disrupted by standard projectors; time-preserved MLPs plus AoT supervision yield 98.1% accuracy on arrow-of-time and gains on other temporal tasks.

Learning Visual Feature-Based World Models via Residual Latent Action

cs.CV · 2026-05-08 · unverdicted · novelty 7.0

RLA-WM predicts residual latent actions via flow matching to create visual feature world models that outperform prior feature-based and diffusion approaches while enabling offline video-based robot RL.

A foundation model of vision, audition, and language for in-silico neuroscience

q-bio.NC · 2026-05-05 · unverdicted · novelty 7.0

TRIBE v2 is a multimodal AI model that predicts human brain activity more accurately than linear encoding models and recovers established neuroscientific findings through in-silico testing.

Latent State Design for World Models under Sufficiency Constraints

cs.AI · 2026-05-03 · unverdicted · novelty 7.0

World models succeed when their latent states are built to meet task-specific sufficiency constraints rather than preserving the maximum amount of information.

Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond

cs.AI · 2026-04-24 · unverdicted · novelty 7.0

Proposes a levels x laws taxonomy for world models in AI agents, defining L1-L3 capabilities across physical, digital, social, and scientific regimes while reviewing over 400 works to outline a roadmap for advanced agentic modeling.

Mask World Model: Predicting What Matters for Robust Robot Policy Learning

cs.RO · 2026-04-21 · unverdicted · novelty 7.0

Mask World Model predicts semantic mask dynamics with video diffusion and integrates it with a diffusion policy head, outperforming RGB world models on LIBERO and RLBench while showing better real-world generalization and texture robustness.

RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation

cs.RO · 2026-04-21 · unverdicted · novelty 7.0

RoboWM-Bench evaluates video world models by converting their outputs into executable robot actions and running them on manipulation tasks, showing that physical inconsistencies remain common.

AnimationBench: Are Video Models Good at Character-Centric Animation?

cs.CV · 2026-04-16 · unverdicted · novelty 7.0

AnimationBench is the first benchmark that operationalizes the twelve basic principles of animation and IP preservation into scalable, VLM-assisted metrics for animation-style I2V generation.

GTASA: Ground Truth Annotations for Spatiotemporal Analysis, Evaluation and Training of Video Models

cs.CV · 2026-04-12 · unverdicted · novelty 7.0

GTASA supplies annotated multi-actor videos with exact 3D spatial and temporal ground truth that outperforms neural video generators in physical and semantic validity while enabling new probes of video encoders.

Action Images: End-to-End Policy Learning via Multiview Video Generation

cs.CV · 2026-04-07 · unverdicted · novelty 7.0

Action Images turn robot arm motions into interpretable multiview pixel videos, letting video backbones serve as zero-shot policies for end-to-end robot learning.

StarVLA: A Lego-like Codebase for Vision-Language-Action Model Developing

cs.RO · 2026-04-06 · conditional · novelty 7.0

StarVLA delivers a Lego-like open-source framework for VLA models with swappable backbones and action heads, reusable training methods, and unified evaluation across major benchmarks.

Contrastive Learning under Noisy Temporal Self-Supervision for Colonoscopy Videos

cs.CV · 2026-05-12 · unverdicted · novelty 6.0

A noise-aware contrastive loss built on temporal self-supervision learns polyp tracklet representations from 27 videos that outperform prior self-supervised and supervised baselines and match foundation models on retrieval, re-identification, size estimation, and histology classification.

Do Enterprise Systems Need Learned World Models? The Importance of Context to Infer Dynamics

cs.AI · 2026-05-12 · unverdicted · novelty 6.0

In configurable enterprise systems, runtime discovery of transition dynamics from system configuration is more robust to deployment shifts than offline-trained world models.

The DAWN of World-Action Interactive Models

cs.CV · 2026-05-12 · unverdicted · novelty 6.0

DAWN couples a world predictor with a world-conditioned action denoiser in latent space so that each refines the other recursively, yielding strong planning and safety results on autonomous driving benchmarks.

Clin-JEPA: A Multi-Phase Co-Training Framework for Joint-Embedding Predictive Pretraining on EHR Patient Trajectories

cs.LG · 2026-05-11 · unverdicted · novelty 6.0 · 2 refs

A five-phase co-training framework enables stable JEPA pretraining on EHR trajectories, producing converging latent rollouts and higher multi-task AUROC than baselines on MIMIC-IV ICU data.

RePO-VLA: Recovery-Driven Policy Optimization for Vision-Language-Action Models

cs.RO · 2026-05-10 · unverdicted · novelty 6.0

RePO-VLA raises average adversarial success rates in VLA manipulation from 20% to 75% by using recovery-aware initialization, a progress-aware semantic value function, and value-conditioned refinement on success and corrective trajectories.

Latent Geometry Beyond Search: Amortizing Planning in World Models

cs.RO · 2026-05-09 · unverdicted · novelty 6.0

In regularized latent spaces of world models, planning can be amortized into a goal-conditioned inverse dynamics model that matches CEM performance at 100-130x lower per-decision cost.

Predictive but Not Plannable: RC-aux for Latent World Models

cs.LG · 2026-05-08 · unverdicted · novelty 6.0

RC-aux corrects spatiotemporal mismatch in reconstruction-free latent world models by adding multi-horizon prediction and reachability supervision, improving planning performance on goal-conditioned pixel-control tasks.

3D MRI Image Pretraining via Controllable 2D Slice Navigation Task

cs.CV · 2026-05-07 · unverdicted · novelty 6.0

Converting 3D MRI volumes into action-conditioned 2D slice navigation sequences offers a complementary self-supervised pretraining signal for learning anatomical and spatial representations.

ConsisVLA-4D: Advancing Spatiotemporal Consistency in Efficient 3D-Perception and 4D-Reasoning for Robotic Manipulation

cs.RO · 2026-05-06 · unverdicted · novelty 6.0

ConsisVLA-4D adds cross-view semantic alignment, cross-object geometric fusion, and cross-scene dynamic reasoning to VLA models, delivering 21.6% and 41.5% gains plus 2.3x and 2.4x speedups on LIBERO and real-world tasks.

Understanding Self-Supervised Learning via Latent Distribution Matching

cs.LG · 2026-05-05 · unverdicted · novelty 6.0

Self-supervised learning is cast as latent distribution matching that aligns representations to a model while enforcing uniformity, unifying multiple SSL families and proving identifiability for predictive variants even with nonlinear predictors.

Text-Conditional JEPA for Learning Semantically Rich Visual Representations

cs.LG · 2026-05-05 · unverdicted · novelty 6.0

TC-JEPA conditions masked feature prediction on text captions via sparse cross-attention to produce more semantically rich visual representations and outperforms contrastive methods on fine-grained tasks.

citing papers explorer

Showing 1 of 1 citing paper after filters.

Watching Physics: the Generative Science of Matter and Motion cs.CE · 2026-04-18 · unverdicted · none · ref 2 · internal anchor
Generative video models recover physical quantities like surface strain from visible motion when coupled with experiments and simulations, but fail when internal variables dominate, defining a new Generative Science of Matter and Motion.

V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

hub tools

claims ledger

co-cited works

fields

years

verdicts

representative citing papers

citing papers explorer