super hub Mixed citations

V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

Adrien Bardes, David Fan, Mido Assran, Mojtaba, Quentin Garrido, Russell Howes · 2025 · cs.AI · arXiv 2506.09985

Mixed citation behavior. Most common role is background (62%).

144 Pith papers citing it

Background 62% of classified citations

open full Pith review browse 144 citing papers more from Adrien Bardes arXiv PDF

abstract

A major challenge for modern AI is to learn to understand the world and learn to act largely by observation. This paper explores a self-supervised approach that combines internet-scale video data with a small amount of interaction data (robot trajectories), to develop models capable of understanding, predicting, and planning in the physical world. We first pre-train an action-free joint-embedding-predictive architecture, V-JEPA 2, on a video and image dataset comprising over 1 million hours of internet video. V-JEPA 2 achieves strong performance on motion understanding (77.3 top-1 accuracy on Something-Something v2) and state-of-the-art performance on human action anticipation (39.7 recall-at-5 on Epic-Kitchens-100) surpassing previous task-specific models. Additionally, after aligning V-JEPA 2 with a large language model, we demonstrate state-of-the-art performance on multiple video question-answering tasks at the 8 billion parameter scale (e.g., 84.0 on PerceptionTest, 76.9 on TempCompass). Finally, we show how self-supervised learning can be applied to robotic planning tasks by post-training a latent action-conditioned world model, V-JEPA 2-AC, using less than 62 hours of unlabeled robot videos from the Droid dataset. We deploy V-JEPA 2-AC zero-shot on Franka arms in two different labs and enable picking and placing of objects using planning with image goals. Notably, this is achieved without collecting any data from the robots in these environments, and without any task-specific training or reward. This work demonstrates how self-supervised learning from web-scale data and a small amount of robot interaction data can yield a world model capable of planning in the physical world.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 35 method 9 baseline 6

citation-polarity summary

background 31 use method 9 baseline 6 unclear 4

claims ledger

abstract A major challenge for modern AI is to learn to understand the world and learn to act largely by observation. This paper explores a self-supervised approach that combines internet-scale video data with a small amount of interaction data (robot trajectories), to develop models capable of understanding, predicting, and planning in the physical world. We first pre-train an action-free joint-embedding-predictive architecture, V-JEPA 2, on a video and image dataset comprising over 1 million hours of internet video. V-JEPA 2 achieves strong performance on motion understanding (77.3 top-1 accuracy on

authors

Adrien Bardes David Fan Mido Assran Mojtaba Quentin Garrido Russell Howes

co-cited works

representative citing papers

When Does LeJEPA Learn a World Model?

stat.ML · 2026-05-25 · unverdicted · novelty 8.0

LeJEPA achieves linear identifiability of latent variables uniquely when the latents are Gaussian in worlds with stationary additive-noise transitions.

MemLearner: Learning to Query Context memory for Video World Models

cs.CV · 2026-06-30 · unverdicted · novelty 7.0

MemLearner introduces a learning-based adaptive context query method using query tokens in video world models to improve long-term scene consistency over rule-based retrieval.

LIBERO-Safety: A Comprehensive Benchmark for Physical and Semantic Safety in Vision-Language-Action Models

cs.RO · 2026-06-22 · unverdicted · novelty 7.0

LIBERO-Safety supplies a scalable benchmark, data-generation pipeline, and 19,664-demonstration dataset that exposes a generalization-safety tension in current VLA models where diverse training improves collision avoidance but task success stays limited by trajectory quality and semantic understandi

X-Tokenizer: A Multimodal Action Tokenizer for Vision-Language-Action Pretraining

cs.CV · 2026-06-07 · unverdicted · novelty 7.0

X-Tokenizer creates semantic action tokens via asymmetric residual quantization and contrastive pretraining on large trajectory data, outperforming prior methods like FAST on robotic tasks.

Chess-World-Model: A 10M-Game Benchmark for Exact State Tracking from Chess Move Sequences

cs.LG · 2026-05-28 · unverdicted · novelty 7.0

Introduces Chess-World-Model benchmark from 10M chess games showing recurrent models (SLiCE, Mamba-3, Gated DeltaNet) outperform Transformers on exact state tracking, with random-play split remaining hard at larger scales.

OmniRetriever: Any-to-Any Audio-Video-Text Retrieval via Fusion-as-Teacher Distillation

cs.CV · 2026-05-26 · unverdicted · novelty 7.0

OmniRetriever-7B uses fusion-as-teacher distillation plus Tuple-InfoNCE to improve any-to-any audio-video-text retrieval over prior open and closed models.

Demo-JEPA: Joint-Embedding Predictive Architecture for One-shot Cross-Embodiment Imitation

cs.RO · 2026-05-20 · unverdicted · novelty 7.0

Demo-JEPA enables one-shot cross-embodiment imitation by mapping visual demonstrations to shared latent future trajectories that serve as subgoals for the target agent's own forward dynamics planning.

PEIRA: Learning Predictive Encoders through Inter-View Regressor Alignment

cs.LG · 2026-05-17 · unverdicted · novelty 7.0

PEIRA learns predictive encoders by optimizing the trace of the optimal inter-view linear regressor, with only nontrivial global minimizers as stable equilibria that recover leading nonlinear canonical correlation subspaces.

JEDI: Joint Embedding Diffusion World Model for Online Model-Based Reinforcement Learning

cs.LG · 2026-05-13 · unverdicted · novelty 7.0

JEDI is the first online end-to-end latent diffusion world model that trains latents from denoising loss rather than reconstruction, achieving competitive Atari100k results with 43% less VRAM and over 3x faster sampling than pixel diffusion baselines.

Tracing the Arrow of Time: Diagnosing Temporal Information Flow in Video-LLMs

cs.CV · 2026-05-08 · unverdicted · novelty 7.0

Temporal information in Video-LLMs is encoded well by video-centric encoders but disrupted by standard projectors; time-preserved MLPs plus AoT supervision yield 98.1% accuracy on arrow-of-time and gains on other temporal tasks.

Learning Visual Feature-Based World Models via Residual Latent Action

cs.CV · 2026-05-08 · unverdicted · novelty 7.0

RLA-WM predicts residual latent actions via flow matching to create visual feature world models that outperform prior feature-based and diffusion approaches while enabling offline video-based robot RL.

A foundation model of vision, audition, and language for in-silico neuroscience

q-bio.NC · 2026-05-05 · unverdicted · novelty 7.0

TRIBE v2 is a multimodal AI model that predicts human brain activity more accurately than linear encoding models and recovers established neuroscientific findings through in-silico testing.

Latent State Design for World Models under Sufficiency Constraints

cs.AI · 2026-05-03 · unverdicted · novelty 7.0

World models succeed when their latent states are built to meet task-specific sufficiency constraints rather than preserving the maximum amount of information.

Mask World Model: Predicting What Matters for Robust Robot Policy Learning

cs.RO · 2026-04-21 · unverdicted · novelty 7.0

Mask World Model predicts semantic mask dynamics with video diffusion and integrates it with a diffusion policy head, outperforming RGB world models on LIBERO and RLBench while showing better real-world generalization and texture robustness.

RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation

cs.RO · 2026-04-21 · unverdicted · novelty 7.0 · 2 refs

RoboWM-Bench evaluates video world models by converting their manipulation video predictions into executable actions validated in simulation, showing that visual plausibility does not guarantee physical executability.

AnimationBench: Are Video Models Good at Character-Centric Animation?

cs.CV · 2026-04-16 · unverdicted · novelty 7.0

AnimationBench is the first benchmark that operationalizes the twelve basic principles of animation and IP preservation into scalable, VLM-assisted metrics for animation-style I2V generation.

GTASA: Ground Truth Annotations for Spatiotemporal Analysis, Evaluation and Training of Video Models

cs.CV · 2026-04-12 · unverdicted · novelty 7.0

GTASA supplies annotated multi-actor videos with exact 3D spatial and temporal ground truth that outperforms neural video generators in physical and semantic validity while enabling new probes of video encoders.

Action Images: End-to-End Policy Learning via Multiview Video Generation

cs.CV · 2026-04-07 · unverdicted · novelty 7.0

Action Images turn robot arm motions into interpretable multiview pixel videos, letting video backbones serve as zero-shot policies for end-to-end robot learning.

StarVLA: A Lego-like Codebase for Vision-Language-Action Model Developing

cs.RO · 2026-04-06 · conditional · novelty 7.0

StarVLA delivers a Lego-like open-source framework for VLA models with swappable backbones and action heads, reusable training methods, and unified evaluation across major benchmarks.

Factorization Regret mediates compositional generalization in latent space

cs.LG · 2026-03-28 · unverdicted · novelty 7.0

Factorization Regret measures how latent variable interactions affect performance, and RCCs enable learning them to achieve compositional generalization in partially observable tasks.

Emergent Compositional Communication for Latent World Properties

cs.MA · 2026-03-18 · conditional · novelty 7.0

Multi-agent iterated learning produces emergent positionally disentangled communication protocols for latent physical properties from unsupervised video features.

PlayWorld: Learning Robot World Models from Autonomous Play

cs.RO · 2026-03-09 · unverdicted · novelty 7.0

PlayWorld learns high-fidelity robot world models from unsupervised self-play, producing physically consistent video predictions that outperform models trained on human data and enabling 65% better real-world policy performance via model-based RL.

Space-Time Forecasting of Dynamic Scenes with Motion-aware Gaussian Grouping

cs.CV · 2026-02-25 · unverdicted · novelty 7.0

MoGaF groups Gaussians by motion in 4D splatting representations to enable stable long-term forecasting of dynamic scenes.

DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos

cs.RO · 2026-02-06 · unverdicted · novelty 7.0

DreamDojo is a foundation world model pretrained on the largest human video dataset to date that uses continuous latent actions to transfer interaction knowledge and achieves controllable physics simulation after robot post-training.

citing papers explorer

Showing 50 of 144 citing papers.

When Does LeJEPA Learn a World Model? stat.ML · 2026-05-25 · unverdicted · none · ref 6 · internal anchor
LeJEPA achieves linear identifiability of latent variables uniquely when the latents are Gaussian in worlds with stationary additive-noise transitions.
MemLearner: Learning to Query Context memory for Video World Models cs.CV · 2026-06-30 · unverdicted · none · ref 4 · internal anchor
MemLearner introduces a learning-based adaptive context query method using query tokens in video world models to improve long-term scene consistency over rule-based retrieval.
LIBERO-Safety: A Comprehensive Benchmark for Physical and Semantic Safety in Vision-Language-Action Models cs.RO · 2026-06-22 · unverdicted · none · ref 2 · internal anchor
LIBERO-Safety supplies a scalable benchmark, data-generation pipeline, and 19,664-demonstration dataset that exposes a generalization-safety tension in current VLA models where diverse training improves collision avoidance but task success stays limited by trajectory quality and semantic understandi
X-Tokenizer: A Multimodal Action Tokenizer for Vision-Language-Action Pretraining cs.CV · 2026-06-07 · unverdicted · none · ref 6 · internal anchor
X-Tokenizer creates semantic action tokens via asymmetric residual quantization and contrastive pretraining on large trajectory data, outperforming prior methods like FAST on robotic tasks.
Chess-World-Model: A 10M-Game Benchmark for Exact State Tracking from Chess Move Sequences cs.LG · 2026-05-28 · unverdicted · none · ref 1 · internal anchor
Introduces Chess-World-Model benchmark from 10M chess games showing recurrent models (SLiCE, Mamba-3, Gated DeltaNet) outperform Transformers on exact state tracking, with random-play split remaining hard at larger scales.
OmniRetriever: Any-to-Any Audio-Video-Text Retrieval via Fusion-as-Teacher Distillation cs.CV · 2026-05-26 · unverdicted · none · ref 5 · internal anchor
OmniRetriever-7B uses fusion-as-teacher distillation plus Tuple-InfoNCE to improve any-to-any audio-video-text retrieval over prior open and closed models.
Demo-JEPA: Joint-Embedding Predictive Architecture for One-shot Cross-Embodiment Imitation cs.RO · 2026-05-20 · unverdicted · none · ref 17 · internal anchor
Demo-JEPA enables one-shot cross-embodiment imitation by mapping visual demonstrations to shared latent future trajectories that serve as subgoals for the target agent's own forward dynamics planning.
PEIRA: Learning Predictive Encoders through Inter-View Regressor Alignment cs.LG · 2026-05-17 · unverdicted · none · ref 4 · internal anchor
PEIRA learns predictive encoders by optimizing the trace of the optimal inter-view linear regressor, with only nontrivial global minimizers as stable equilibria that recover leading nonlinear canonical correlation subspaces.
JEDI: Joint Embedding Diffusion World Model for Online Model-Based Reinforcement Learning cs.LG · 2026-05-13 · unverdicted · none · ref 21 · internal anchor
JEDI is the first online end-to-end latent diffusion world model that trains latents from denoising loss rather than reconstruction, achieving competitive Atari100k results with 43% less VRAM and over 3x faster sampling than pixel diffusion baselines.
Tracing the Arrow of Time: Diagnosing Temporal Information Flow in Video-LLMs cs.CV · 2026-05-08 · unverdicted · none · ref 1 · internal anchor
Temporal information in Video-LLMs is encoded well by video-centric encoders but disrupted by standard projectors; time-preserved MLPs plus AoT supervision yield 98.1% accuracy on arrow-of-time and gains on other temporal tasks.
Learning Visual Feature-Based World Models via Residual Latent Action cs.CV · 2026-05-08 · unverdicted · none · ref 17 · internal anchor
RLA-WM predicts residual latent actions via flow matching to create visual feature world models that outperform prior feature-based and diffusion approaches while enabling offline video-based robot RL.
A foundation model of vision, audition, and language for in-silico neuroscience q-bio.NC · 2026-05-05 · unverdicted · none · ref 67 · internal anchor
TRIBE v2 is a multimodal AI model that predicts human brain activity more accurately than linear encoding models and recovers established neuroscientific findings through in-silico testing.
Latent State Design for World Models under Sufficiency Constraints cs.AI · 2026-05-03 · unverdicted · none · ref 4 · internal anchor
World models succeed when their latent states are built to meet task-specific sufficiency constraints rather than preserving the maximum amount of information.
Mask World Model: Predicting What Matters for Robust Robot Policy Learning cs.RO · 2026-04-21 · unverdicted · none · ref 1 · internal anchor
Mask World Model predicts semantic mask dynamics with video diffusion and integrates it with a diffusion policy head, outperforming RGB world models on LIBERO and RLBench while showing better real-world generalization and texture robustness.
RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation cs.RO · 2026-04-21 · unverdicted · none · ref 3 · 2 links · internal anchor
RoboWM-Bench evaluates video world models by converting their manipulation video predictions into executable actions validated in simulation, showing that visual plausibility does not guarantee physical executability.
AnimationBench: Are Video Models Good at Character-Centric Animation? cs.CV · 2026-04-16 · unverdicted · none · ref 1 · internal anchor
AnimationBench is the first benchmark that operationalizes the twelve basic principles of animation and IP preservation into scalable, VLM-assisted metrics for animation-style I2V generation.
GTASA: Ground Truth Annotations for Spatiotemporal Analysis, Evaluation and Training of Video Models cs.CV · 2026-04-12 · unverdicted · none · ref 5 · internal anchor
GTASA supplies annotated multi-actor videos with exact 3D spatial and temporal ground truth that outperforms neural video generators in physical and semantic validity while enabling new probes of video encoders.
Action Images: End-to-End Policy Learning via Multiview Video Generation cs.CV · 2026-04-07 · unverdicted · none · ref 2 · internal anchor
Action Images turn robot arm motions into interpretable multiview pixel videos, letting video backbones serve as zero-shot policies for end-to-end robot learning.
StarVLA: A Lego-like Codebase for Vision-Language-Action Model Developing cs.RO · 2026-04-06 · conditional · none · ref 1 · internal anchor
StarVLA delivers a Lego-like open-source framework for VLA models with swappable backbones and action heads, reusable training methods, and unified evaluation across major benchmarks.
Factorization Regret mediates compositional generalization in latent space cs.LG · 2026-03-28 · unverdicted · none · ref 4 · internal anchor
Factorization Regret measures how latent variable interactions affect performance, and RCCs enable learning them to achieve compositional generalization in partially observable tasks.
Emergent Compositional Communication for Latent World Properties cs.MA · 2026-03-18 · conditional · none · ref 2 · internal anchor
Multi-agent iterated learning produces emergent positionally disentangled communication protocols for latent physical properties from unsupervised video features.
PlayWorld: Learning Robot World Models from Autonomous Play cs.RO · 2026-03-09 · unverdicted · none · ref 51 · internal anchor
PlayWorld learns high-fidelity robot world models from unsupervised self-play, producing physically consistent video predictions that outperform models trained on human data and enabling 65% better real-world policy performance via model-based RL.
Space-Time Forecasting of Dynamic Scenes with Motion-aware Gaussian Grouping cs.CV · 2026-02-25 · unverdicted · none · ref 1 · internal anchor
MoGaF groups Gaussians by motion in 4D splatting representations to enable stable long-term forecasting of dynamic scenes.
DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos cs.RO · 2026-02-06 · unverdicted · none · ref 3 · internal anchor
DreamDojo is a foundation world model pretrained on the largest human video dataset to date that uses continuous latent actions to transfer interaction knowledge and achieves controllable physics simulation after robot post-training.
Recurrent Video Masked Autoencoders cs.CV · 2025-12-15 · unverdicted · none · ref 4 · internal anchor
RVM uses recurrent computation inside a masked autoencoder to learn video representations that match or exceed prior video and image models on classification, tracking, and dense spatial tasks with up to 30x better parameter efficiency.
POMA-3D: The Point Map Way to 3D Scene Understanding cs.CV · 2025-11-20 · unverdicted · none · ref 3 · internal anchor
POMA-3D learns self-supervised 3D scene representations from point maps and improves performance on geometric 3D tasks including navigation and scene retrieval.
UniTacVLA: Unified Tactile Understanding and Prediction in Vision Language Action Models cs.RO · 2026-06-30 · unverdicted · none · ref 37 · internal anchor
UniTacVLA builds a state-aware and dynamics-aware tactile prior via unified latent space, tactile chain-of-thought, and mixed real/predicted feedback controller to boost dexterous manipulation performance.
Delta-JEPA: Learning Action-Sensitive World Models via Latent Difference Decoding cs.AI · 2026-06-30 · unverdicted · none · ref 1 · internal anchor
Delta-JEPA augments latent forward prediction with a Latent Difference Action Decoder that reconstructs actions from embedding displacements, yielding action-sensitive world models that improve planning on four visual continuous-control tasks over JEPA baselines.
Sequential Planning via Anchored Robotic Keypoints cs.RO · 2026-06-29 · unverdicted · none · ref 51 · internal anchor
SPARK reaches 43.7% success on six LIBERO-PRO cells by LLM-generated typed behavior trees plus multi-prompt perception and recovery, more than doubling CaP-Agent0 and VLA baselines.
Flow Matching in Feature Space for Stochastic World Modeling cs.CV · 2026-06-27 · unverdicted · none · ref 6 · internal anchor
FlowWM applies flow matching directly in pretrained feature space with a one-step projection mechanism, improving perception accuracy, mode coverage, and horizon robustness on synthetic and real-world benchmarks.
J-LAW: Joint Localization and Actionable World Modeling via Coupled Latent Factor Graphs cs.RO · 2026-06-27 · unverdicted · none · ref 30 · internal anchor
J-LAW introduces a coupled latent factor graph that jointly optimizes metric poses, latent states, and landmark embeddings to produce maps that are both metric and actionable for planning.
Frequency-Aware Self-Supervised Music Representation Learning cs.SD · 2026-06-24 · unverdicted · none · ref 14 · internal anchor
PupuJEPA applies a visual JEPA framework to 2D spectrograms with music-specific adaptations and outperforms 1D SSL models on the MARBLE benchmark for multiple MIR tasks.
BRo-JEPA: Learning Modular Arithmetic in Latent Space cs.LG · 2026-05-31 · unverdicted · none · ref 2 · internal anchor
A block-rotation predictor inside a JEPA model imposes the circular geometry of modular arithmetic on latent representations of MNIST digits and yields strong zero-shot generalization to unseen operations.
Beyond Task Success: Behavioral and Representational Diagnostics for WAM and VLA cs.RO · 2026-05-31 · unverdicted · none · ref 19 · internal anchor
Empirical study introduces behavioral and representational diagnostics showing architecture-dependent gains in object targeting and predictive structure for WAMs over VLAs on LIBERO and RoboTwin2.0.
SKIP: Sparse Keyframe Interpolation Paradigm for Efficient Embodied World Models cs.RO · 2026-05-30 · unverdicted · none · ref 24 · internal anchor
SKIP achieves 4.16x faster dense video rollouts for robot world models by synthesizing only multimodal-identified keyframes and interpolating the rest, preserving policy training effectiveness with minimal success rate drops.
NTR: Neural Token Reconstruction for Scene Token Bottleneck in End-to-End Driving cs.CV · 2026-05-29 · unverdicted · none · ref 22 · 2 links · internal anchor
NTR adds a self-distillation masked latent reconstruction objective that uses only scene tokens to reconstruct masked patch features, improving visual representation quality and planning performance in end-to-end autonomous driving.
Feat2Go: Visual Feature-Grounded Value Estimation for Embodied Reinforcement Learning cs.RO · 2026-05-29 · unverdicted · none · ref 1 · internal anchor
Feat2Go uses patch-level similarity from a visual world model and trend-based clustering to create progress targets for training value models that improve reward shaping in embodied RL for VLA policies, yielding large gains on ManiSkill3 and RoboTwin benchmarks.
MIRAGE: Adaptive Multimodal Gating for Whole-Brain fMRI Encoding cs.LG · 2026-05-28 · unverdicted · none · ref 19 · internal anchor
MIRAGE uses adaptive multimodal gating on native multimodal backbones plus a transformer encoder to achieve state-of-the-art whole-brain fMRI prediction for naturalistic audiovisual stimuli, outperforming post-hoc unimodal aggregation.
Proprio: Latent Self-Scoring and Inference-Time Refinement for Physically Plausible Video Generation cs.CV · 2026-05-27 · unverdicted · none · ref 3 · internal anchor
Proprio uses flow residuals from latent perturbations in frozen video generators as a self-scoring signal for physical plausibility, yielding reported gains of 16.5% on Physics-IQ and 20.6% on VideoPhy2-hard.
LaMo: Self-Supervised Latent Motion Priors for Physical Realism in Video Generation cs.CV · 2026-05-22 · unverdicted · none · ref 2 · internal anchor
LaMo adds self-supervised latent motion priors via a motion drift loss during training and motion prior guidance during sampling to boost physical fidelity in video diffusion models like CogVideoX.
The TIME Machine: On The Power of Motion for Efficient Perception cs.CV · 2026-05-21 · unverdicted · none · ref 2 · internal anchor
TIME is a motion-based embedding from point tracks, trained only on synthetic data via masked autoencoding, that matches state-of-the-art video model performance with up to 10,000x less training data.
Sensor2Sensor: Cross-Embodiment Sensor Conversion for Autonomous Driving cs.CV · 2026-05-21 · unverdicted · none · ref 3 · 2 links · internal anchor
Sensor2Sensor uses 4D Gaussian Splatting to create synthetic training pairs and a diffusion model to convert monocular dashcam videos into high-fidelity multi-modal AV sensor data.
Efficient Agentic Reasoning Through Self-Regulated Simulative Planning cs.AI · 2026-05-21 · unverdicted · none · ref 5 · internal anchor
SR²AM achieves competitive Pass@1 accuracy on diverse tasks with 25.8-95.3% fewer reasoning tokens than much larger models by using self-regulated simulative planning trained via supervised learning and RL.
How You Move Tells What You'll Do: Trajectory-Conditioned Egocentric Prediction cs.CV · 2026-05-19 · unverdicted · none · ref 2 · internal anchor
TrajPilot predicts candidate future trajectories from egocentric context and uses them to condition action prediction in an embedding space, outperforming VLM and planner baselines on Ego-Exo4D, Ego4D, and other datasets with gains increasing at longer horizons.
World-Ego Modeling for Long-Horizon Evolution in Hybrid Embodied Tasks cs.CV · 2026-05-19 · unverdicted · none · ref 24 · internal anchor
Proposes World-Ego Modeling with WEM using CP-MoE diffusion and a new HTEWorld benchmark, claiming SOTA on hybrid navigation-manipulation tasks.
LACE: Latent Visual Representation for Cross-Embodiment Learning cs.RO · 2026-05-16 · unverdicted · none · ref 41 · internal anchor
LACE aligns human-robot visual features via semantic distribution matching on corresponding body parts plus Gram loss, yielding 65% better zero-shot policy transfer than baseline DINO.
DiLA: Disentangled Latent Action World Models cs.CV · 2026-05-15 · unverdicted · none · ref 1 · internal anchor
DiLA uses content-structure disentanglement driven by predictive bottlenecks to create semantically structured latent actions for high-fidelity video world models.
Latent Video Prediction Learns Better World Models cs.CV · 2026-05-15 · unverdicted · none · ref 3 · internal anchor
Latent prediction video models exhibit a distinct robustness profile across corruption, occlusion, fine-grained discrimination, and temporal sensitivity compared to other self-supervised video models when used as world models.
EgoExo-WM: Unlocking Exo Video for Ego World Models cs.CV · 2026-05-14 · unverdicted · none · ref 4 · 2 links · internal anchor
Method converts exocentric videos to egocentric format via body-pose extraction and kinematics to improve egocentric world-model prediction and planning.
Entity-Centric World Models: Interaction-Aware Masking for Causal Video Prediction cs.CV · 2026-05-14 · unverdicted · none · ref 2 · 2 links · internal anchor
IA-JEPA applies motion-centric masking in JEPA to focus on entity interactions, reporting 14.26% causal reasoning accuracy on CLEVRER versus 3.22% for standard baselines plus higher latent entropy and R²=0.43 energy linearization.

V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer