pith. machine review for the scientific record.

V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

64 Pith papers cite this work. Polarity classification is still indexing.

abstract

A major challenge for modern AI is to learn to understand the world and to act largely by observation. This paper explores a self-supervised approach that combines internet-scale video data with a small amount of interaction data (robot trajectories) to develop models capable of understanding, predicting, and planning in the physical world. We first pre-train an action-free joint-embedding-predictive architecture, V-JEPA 2, on a video and image dataset comprising over 1 million hours of internet video. V-JEPA 2 achieves strong performance on motion understanding (77.3 top-1 accuracy on Something-Something v2) and state-of-the-art performance on human action anticipation (39.7 recall-at-5 on Epic-Kitchens-100), surpassing previous task-specific models. Additionally, after aligning V-JEPA 2 with a large language model, we demonstrate state-of-the-art performance on multiple video question-answering tasks at the 8-billion-parameter scale (e.g., 84.0 on PerceptionTest, 76.9 on TempCompass). Finally, we show how self-supervised learning can be applied to robotic planning tasks by post-training a latent action-conditioned world model, V-JEPA 2-AC, using less than 62 hours of unlabeled robot videos from the Droid dataset. We deploy V-JEPA 2-AC zero-shot on Franka arms in two different labs and enable picking and placing of objects via planning with image goals. Notably, this is achieved without collecting any data from the robots in these environments and without any task-specific training or reward. This work demonstrates how self-supervised learning from web-scale data and a small amount of robot interaction data can yield a world model capable of planning in the physical world.
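The planning loop the abstract describes — optimizing action sequences against an image goal with an action-conditioned latent predictor — can be sketched with the cross-entropy method (CEM). This is a minimal sketch, not the paper's implementation: the encoder and predictor are toy linear stand-ins, and all names (`encode`, `predict`, `cem_plan`) are illustrative.

```python
import numpy as np

# Hypothetical stand-ins for the paper's components: an encoder mapping
# images to latents, and an action-conditioned predictor z' = f(z, a).
# Both are toy linear maps here so the sketch runs end to end.
rng = np.random.default_rng(0)
LATENT_DIM, ACTION_DIM, HORIZON = 8, 4, 5

W_enc = rng.normal(size=(LATENT_DIM, LATENT_DIM))           # "encoder"
W_z = np.eye(LATENT_DIM)                                    # predictor: latent part
W_a = rng.normal(scale=0.3, size=(LATENT_DIM, ACTION_DIM))  # predictor: action part

def encode(image_vec):
    return W_enc @ image_vec

def predict(z, a):
    # One latent step, conditioned on action a.
    return W_z @ z + W_a @ a

def rollout(z0, actions):
    z = z0
    for a in actions:
        z = predict(z, a)
    return z

def cem_plan(z0, z_goal, pop=64, elites=8, iters=10):
    """Cross-entropy method over action sequences: sample candidates,
    score by distance of the predicted final latent to the goal latent,
    then refit the sampling distribution to the elite set."""
    mu = np.zeros((HORIZON, ACTION_DIM))
    sigma = np.ones((HORIZON, ACTION_DIM))
    for _ in range(iters):
        cand = mu + sigma * rng.normal(size=(pop, HORIZON, ACTION_DIM))
        scores = np.array([np.linalg.norm(rollout(z0, seq) - z_goal)
                           for seq in cand])
        elite = cand[np.argsort(scores)[:elites]]
        mu, sigma = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mu  # planned action sequence

# Plan from a current "image" toward a goal "image", entirely in latent space.
current, goal = rng.normal(size=LATENT_DIM), rng.normal(size=LATENT_DIM)
z0, zg = encode(current), encode(goal)
plan = cem_plan(z0, zg)
d0 = np.linalg.norm(z0 - zg)
d1 = np.linalg.norm(rollout(z0, plan) - zg)
print(f"distance to goal latent: {d0:.2f} -> {d1:.2f}")
```

In a receding-horizon (MPC) deployment only the first planned action would be executed before re-encoding the new observation and replanning, which is the usual way such open-loop plans are made robust.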


years: 2026 (62) · 2025 (2)


representative citing papers

Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond

cs.AI · 2026-04-24 · unverdicted · novelty 7.0

Proposes a levels × laws taxonomy for world models in AI agents, defining L1-L3 capabilities across physical, digital, social, and scientific regimes while reviewing over 400 works to outline a roadmap for advanced agentic modeling.

Contrastive Learning under Noisy Temporal Self-Supervision for Colonoscopy Videos

cs.CV · 2026-05-12 · unverdicted · novelty 6.0

A noise-aware contrastive loss built on temporal self-supervision learns polyp tracklet representations from 27 videos that outperform prior self-supervised and supervised baselines and match foundation models on retrieval, re-identification, size estimation, and histology classification.

The DAWN of World-Action Interactive Models

cs.CV · 2026-05-12 · unverdicted · novelty 6.0

DAWN couples a world predictor with a world-conditioned action denoiser in latent space so that each refines the other recursively, yielding strong planning and safety results on autonomous driving benchmarks.

Predictive but Not Plannable: RC-aux for Latent World Models

cs.LG · 2026-05-08 · unverdicted · novelty 6.0

RC-aux corrects spatiotemporal mismatch in reconstruction-free latent world models by adding multi-horizon prediction and reachability supervision, improving planning performance on goal-conditioned pixel-control tasks.

Understanding Self-Supervised Learning via Latent Distribution Matching

cs.LG · 2026-05-05 · unverdicted · novelty 6.0

Self-supervised learning is cast as latent distribution matching that aligns representations to a model while enforcing uniformity, unifying multiple SSL families and proving identifiability for predictive variants even with nonlinear predictors.

citing papers explorer

Showing 50 of 58 citing papers after filters.

  • JEDI: Joint Embedding Diffusion World Model for Online Model-Based Reinforcement Learning cs.LG · 2026-05-13 · unverdicted · none · ref 21 · internal anchor

    JEDI is the first online end-to-end latent diffusion world model that trains latents from denoising loss rather than reconstruction, achieving competitive Atari100k results with 43% less VRAM and over 3x faster sampling than pixel diffusion baselines.

  • Tracing the Arrow of Time: Diagnosing Temporal Information Flow in Video-LLMs cs.CV · 2026-05-08 · unverdicted · none · ref 1 · internal anchor

    Temporal information in Video-LLMs is encoded well by video-centric encoders but disrupted by standard projectors; time-preserved MLPs plus AoT supervision yield 98.1% accuracy on arrow-of-time and gains on other temporal tasks.

  • Learning Visual Feature-Based World Models via Residual Latent Action cs.CV · 2026-05-08 · unverdicted · none · ref 17 · internal anchor

    RLA-WM predicts residual latent actions via flow matching to create visual feature world models that outperform prior feature-based and diffusion approaches while enabling offline video-based robot RL.

  • A foundation model of vision, audition, and language for in-silico neuroscience q-bio.NC · 2026-05-05 · unverdicted · none · ref 67 · internal anchor

    TRIBE v2 is a multimodal AI model that predicts human brain activity more accurately than linear encoding models and recovers established neuroscientific findings through in-silico testing.

  • Latent State Design for World Models under Sufficiency Constraints cs.AI · 2026-05-03 · unverdicted · none · ref 4 · internal anchor

    World models succeed when their latent states are built to meet task-specific sufficiency constraints rather than preserving the maximum amount of information.

  • Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond cs.AI · 2026-04-24 · unverdicted · none · ref 15 · internal anchor

    Proposes a levels × laws taxonomy for world models in AI agents, defining L1-L3 capabilities across physical, digital, social, and scientific regimes while reviewing over 400 works to outline a roadmap for advanced agentic modeling.

  • Mask World Model: Predicting What Matters for Robust Robot Policy Learning cs.RO · 2026-04-21 · unverdicted · none · ref 1 · internal anchor

    Mask World Model predicts semantic mask dynamics with video diffusion and integrates it with a diffusion policy head, outperforming RGB world models on LIBERO and RLBench while showing better real-world generalization and texture robustness.

  • RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation cs.RO · 2026-04-21 · unverdicted · none · ref 3 · internal anchor

    RoboWM-Bench evaluates video world models by converting their outputs into executable robot actions and running them on manipulation tasks, showing that physical inconsistencies remain common.

  • AnimationBench: Are Video Models Good at Character-Centric Animation? cs.CV · 2026-04-16 · unverdicted · none · ref 1 · internal anchor

    AnimationBench is the first benchmark that operationalizes the twelve basic principles of animation and IP preservation into scalable, VLM-assisted metrics for animation-style I2V generation.

  • GTASA: Ground Truth Annotations for Spatiotemporal Analysis, Evaluation and Training of Video Models cs.CV · 2026-04-12 · unverdicted · none · ref 5 · internal anchor

    GTASA supplies annotated multi-actor videos with exact 3D spatial and temporal ground truth that outperforms neural video generators in physical and semantic validity while enabling new probes of video encoders.

  • Action Images: End-to-End Policy Learning via Multiview Video Generation cs.CV · 2026-04-07 · unverdicted · none · ref 2 · internal anchor

    Action Images turn robot arm motions into interpretable multiview pixel videos, letting video backbones serve as zero-shot policies for end-to-end robot learning.

  • Contrastive Learning under Noisy Temporal Self-Supervision for Colonoscopy Videos cs.CV · 2026-05-12 · unverdicted · none · ref 2 · internal anchor

    A noise-aware contrastive loss built on temporal self-supervision learns polyp tracklet representations from 27 videos that outperform prior self-supervised and supervised baselines and match foundation models on retrieval, re-identification, size estimation, and histology classification.

  • Do Enterprise Systems Need Learned World Models? The Importance of Context to Infer Dynamics cs.AI · 2026-05-12 · unverdicted · none · ref 1 · internal anchor

    In configurable enterprise systems, runtime discovery of transition dynamics from system configuration is more robust to deployment shifts than offline-trained world models.

  • The DAWN of World-Action Interactive Models cs.CV · 2026-05-12 · unverdicted · none · ref 2 · internal anchor

    DAWN couples a world predictor with a world-conditioned action denoiser in latent space so that each refines the other recursively, yielding strong planning and safety results on autonomous driving benchmarks.

  • Clin-JEPA: A Multi-Phase Co-Training Framework for Joint-Embedding Predictive Pretraining on EHR Patient Trajectories cs.LG · 2026-05-11 · unverdicted · none · ref 9 · 2 links · internal anchor

    A five-phase co-training framework enables stable JEPA pretraining on EHR trajectories, producing converging latent rollouts and higher multi-task AUROC than baselines on MIMIC-IV ICU data.

  • RePO-VLA: Recovery-Driven Policy Optimization for Vision-Language-Action Models cs.RO · 2026-05-10 · unverdicted · none · ref 2 · internal anchor

    RePO-VLA raises average adversarial success rates in VLA manipulation from 20% to 75% by using recovery-aware initialization, a progress-aware semantic value function, and value-conditioned refinement on success and corrective trajectories.

  • Latent Geometry Beyond Search: Amortizing Planning in World Models cs.RO · 2026-05-09 · unverdicted · none · ref 1 · internal anchor

    In regularized latent spaces of world models, planning can be amortized into a goal-conditioned inverse dynamics model that matches CEM performance at 100-130x lower per-decision cost.

  • Predictive but Not Plannable: RC-aux for Latent World Models cs.LG · 2026-05-08 · unverdicted · none · ref 2 · internal anchor

    RC-aux corrects spatiotemporal mismatch in reconstruction-free latent world models by adding multi-horizon prediction and reachability supervision, improving planning performance on goal-conditioned pixel-control tasks.

  • 3D MRI Image Pretraining via Controllable 2D Slice Navigation Task cs.CV · 2026-05-07 · unverdicted · none · ref 3 · internal anchor

    Converting 3D MRI volumes into action-conditioned 2D slice navigation sequences offers a complementary self-supervised pretraining signal for learning anatomical and spatial representations.

  • ConsisVLA-4D: Advancing Spatiotemporal Consistency in Efficient 3D-Perception and 4D-Reasoning for Robotic Manipulation cs.RO · 2026-05-06 · unverdicted · none · ref 2 · internal anchor

    ConsisVLA-4D adds cross-view semantic alignment, cross-object geometric fusion, and cross-scene dynamic reasoning to VLA models, delivering 21.6% and 41.5% gains plus 2.3x and 2.4x speedups on LIBERO and real-world tasks.

  • Understanding Self-Supervised Learning via Latent Distribution Matching cs.LG · 2026-05-05 · unverdicted · none · ref 2 · internal anchor

    Self-supervised learning is cast as latent distribution matching that aligns representations to a model while enforcing uniformity, unifying multiple SSL families and proving identifiability for predictive variants even with nonlinear predictors.

  • Text-Conditional JEPA for Learning Semantically Rich Visual Representations cs.LG · 2026-05-05 · unverdicted · none · ref 1 · internal anchor

    TC-JEPA conditions masked feature prediction on text captions via sparse cross-attention to produce more semantically rich visual representations and outperforms contrastive methods on fine-grained tasks.

  • Divide and Conquer: Decoupled Representation Alignment for Multimodal World Models cs.CV · 2026-05-03 · unverdicted · none · ref 2 · internal anchor

    M²-REPA decouples modality-specific features inside a diffusion model and aligns each to its matching expert foundation model via an alignment loss plus a decoupling regularizer, yielding better visual quality and long-term consistency in multi-modal video generation.

  • Alethia: A Foundational Encoder for Voice Deepfakes cs.SD · 2026-04-30 · unverdicted · none · ref 1 · internal anchor

    Alethia is a pretrained audio encoder using continuous embedding prediction and generative flow-matching reconstruction that outperforms existing speech foundation models on voice deepfake tasks with better robustness and zero-shot generalization.

  • LA-Pose: Latent Action Pretraining Meets Pose Estimation cs.CV · 2026-04-30 · unverdicted · none · ref 3 · internal anchor

    LA-Pose achieves over 10% higher pose accuracy than recent feed-forward methods on Waymo and PandaSet benchmarks by repurposing latent actions from self-supervised inverse-dynamics pretraining while using orders of magnitude less labeled 3D data.

  • Learning Human-Intention Priors from Large-Scale Human Demonstrations for Robotic Manipulation cs.RO · 2026-04-27 · unverdicted · none · ref 44 · internal anchor

    MoT-HRA learns embodiment-agnostic human-intention priors from the HA-2.2M dataset of 2.2M human video episodes through a three-expert hierarchy to improve robotic motion plausibility and robustness under distribution shift.

  • Only Brains Align with Brains: Cross-Region Alignment Patterns Expose Limits of Normative Models q-bio.NC · 2026-04-23 · unverdicted · none · ref 5 · internal anchor

    Alignment pattern analysis reveals that models aligned to individual brain ROIs do not reproduce the stable cross-region alignment profiles observed across human subjects.

  • Exploring High-Order Self-Similarity for Video Understanding cs.CV · 2026-04-22 · unverdicted · none · ref 2 · internal anchor

    The MOSS module learns and combines multi-order space-time self-similarity features to enhance temporal dynamics modeling in videos across action recognition, VQA, and robotic tasks.

  • Active World-Model with 4D-informed Retrieval for Exploration and Awareness cs.CV · 2026-04-17 · unverdicted · none · ref 2 · internal anchor

    AW4RE is a generative world model that estimates action-conditioned observations via 4D-informed evidence retrieval, geometric support, and conditional completion to enable better exploration under partial observability.

  • Human Cognition in Machines: A Unified Perspective of World Models cs.RO · 2026-04-17 · unverdicted · none · ref 9 · internal anchor

    The paper introduces a unified framework for world models that fully incorporates all cognitive functions from Cognitive Architecture Theory, highlights under-researched areas in motivation and meta-cognition, and proposes Epistemic World Models as a new category for scientific discovery agents.

  • Representations Before Pixels: Semantics-Guided Hierarchical Video Prediction cs.CV · 2026-04-13 · unverdicted · none · ref 5 · internal anchor

    Re2Pix decomposes video prediction into semantic feature forecasting followed by representation-conditioned diffusion synthesis, with nested dropout and mixed supervision to handle prediction errors.

  • Zero-shot World Models Are Developmentally Efficient Learners cs.AI · 2026-04-11 · unverdicted · none · ref 25 · internal anchor

    A zero-shot visual world model trained on one child's experience achieves broad competence on physical understanding benchmarks while matching developmental behavioral patterns.

  • Veo-Act: How Far Can Frontier Video Models Advance Generalizable Robot Manipulation? cs.RO · 2026-04-06 · unverdicted · none · ref 1 · internal anchor

    Veo-3 video predictions enable approximate task-level robot trajectories in zero-shot settings but require hierarchical integration with low-level VLA policies for reliable manipulation performance.

  • Hierarchical Planning with Latent World Models cs.LG · 2026-04-03 · unverdicted · none · ref 2 · internal anchor

    Hierarchical planning over multi-scale latent world models enables 70% success on real robotic pick-and-place with goal-only input where flat models achieve 0%, while cutting planning compute up to 4x in simulations.

  • World Action Models are Zero-shot Policies cs.RO · 2026-02-17 · unverdicted · none · ref 4 · internal anchor

    DreamZero uses a 14B video diffusion model as a World Action Model to achieve over 2x better zero-shot generalization on real robots than state-of-the-art VLAs, real-time 7Hz closed-loop control, and cross-embodiment transfer with 10-30 minutes of data.

  • Video models are zero-shot learners and reasoners cs.LG · 2025-09-24 · unverdicted · none · ref 54 · internal anchor

    Generative video models exhibit emergent zero-shot capabilities across perception, manipulation, and basic reasoning tasks.

  • Towards Effective Theory of LLMs: A Representation Learning Approach cs.LG · 2026-05-10 · unverdicted · none · ref 7 · internal anchor

    RET learns temporally consistent macrovariables from LLM activations via self-supervised learning to support interpretability, early behavioral prediction, and causal intervention.

  • Pan-FM: A Pan-Organ Foundation Model with Saliency-Guided Masking for Missing Robustness cs.CV · 2026-05-08 · unverdicted · none · ref 4 · internal anchor

    Pan-FM learns balanced representations across seven organs by adaptively masking dominant organs during pre-training, yielding stronger disease prediction and missing-organ robustness than single-organ or naive multimodal baselines on UK Biobank.

  • HaM-World: Soft-Hamiltonian World Models with Selective Memory for Planning cs.AI · 2026-05-07 · unverdicted · none · ref 40 · internal anchor

    HaM-World integrates soft-Hamiltonian dynamics with selective state-space memory to reduce long-horizon rollout error by 55% and achieve top returns under 12 OOD perturbations on DeepMind Control Suite tasks.

  • Video Generation with Predictive Latents cs.CV · 2026-05-04 · unverdicted · none · ref 2 · internal anchor

    PV-VAE improves video latent spaces for generation by unifying reconstruction with future-frame prediction, reporting 52% faster convergence and 34.42 FVD gain over Wan2.2 VAE on UCF101.

  • Embody4D: A Generalist 4D World Model for Embodied AI cs.CV · 2026-05-03 · unverdicted · none · ref 2 · internal anchor

    Embody4D generates high-fidelity, view-consistent novel views from monocular videos for embodied scenarios via 3D-aware data synthesis, adaptive noise injection, and interaction-aware attention.

  • Lifting Embodied World Models for Planning and Control cs.CV · 2026-04-28 · unverdicted · none · ref 2 · internal anchor

    Composing a policy that maps 2D waypoints to joint actions with a frozen world model yields a lifted world model that achieves 3.8 times lower mean joint error than direct low-level search while being more compute-efficient and generalizing to unseen environments.

  • Sapiens2 cs.CV · 2026-04-23 · unverdicted · none · ref 3 · internal anchor

    Sapiens2 improves pretraining, data scale, and architecture over its predecessor to set new state-of-the-art results on human pose estimation, body-part segmentation, normal estimation, and new tasks like pointmap and albedo estimation.

  • Cortex 2.0: Grounding World Models in Real-World Industrial Deployment cs.RO · 2026-04-22 · unverdicted · none · ref 31 · internal anchor

    Cortex 2.0 introduces world-model-based planning that generates and scores future trajectories to outperform reactive vision-language-action baselines on industrial robotic tasks including pick-and-place, sorting, and unpacking.

  • Stylistic-STORM (ST-STORM) : Perceiving the Semantic Nature of Appearance cs.CV · 2026-04-17 · unverdicted · none · ref 26 · internal anchor

    ST-STORM introduces a dual-branch SSL framework that disentangles semantic content from stylistic appearance using gated latent streams, JEPA for content invariance, and adversarial constraints for style capture.

  • NTIRE 2026 Challenge on Video Saliency Prediction: Methods and Results cs.CV · 2026-04-16 · unverdicted · none · ref 5 · internal anchor

    The NTIRE 2026 Challenge released a public dataset of 2,000 videos with crowdsourced saliency maps and reported results from participating teams using standard quality metrics.

  • Observe Less, Understand More: Cost-aware Cross-scale Observation for Remote Sensing Understanding cs.CV · 2026-04-13 · unverdicted · none · ref 2 · internal anchor

    A unified cost-aware formulation couples fine-grained high-resolution sampling decisions with cross-patch representation prediction to achieve superior performance-cost trade-offs on remote sensing recognition and retrieval tasks using a new 10M-image benchmark.

  • Phantom: Physics-Infused Video Generation via Joint Modeling of Visual and Latent Physical Dynamics cs.CV · 2026-04-09 · unverdicted · none · ref 5 · internal anchor

    Phantom generates visually realistic and physically consistent videos by jointly modeling visual content and latent physical dynamics via an abstract physics-aware representation.

  • A Machine Learning Framework for Turbofan Health Estimation via Inverse Problem Formulation cs.LG · 2026-04-09 · unverdicted · none · ref 5 · internal anchor

    A new turbofan dataset with realistic maintenance patterns is used to benchmark Bayesian filters as strong baselines against self-supervised learning representations for component health estimation.

  • The Cartesian Cut in Agentic AI cs.AI · 2026-04-09 · unverdicted · none · ref 5 · internal anchor

    LLM agents use a Cartesian split between learned prediction and engineered control, enabling modularity but creating sensitivity and bottlenecks unlike integrated biological systems.