Back to the features: DINO as a foundation for video world models

Federico Baldassarre, Marc Szafraniec, Basile Terver, Vasil Khalidov, Francisco Massa, Yann LeCun, Patrick Labatut, Maximilian Seitzer, Piotr Bojanowski · 2025 · arXiv 2507.19468

8 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary: background 4

citation-polarity summary: background 4

representative citing papers

Learning Visual Feature-Based World Models via Residual Latent Action

cs.CV · 2026-05-08 · unverdicted · novelty 7.0

RLA-WM predicts residual latent actions via flow matching to create visual feature world models that outperform prior feature-based and diffusion approaches while enabling offline video-based robot RL.

Text-Conditional JEPA for Learning Semantically Rich Visual Representations

cs.LG · 2026-05-05 · unverdicted · novelty 6.0

TC-JEPA conditions masked feature prediction on text captions via sparse cross-attention to produce more semantically rich visual representations and outperforms contrastive methods on fine-grained tasks.

Learning Long-term Motion Embeddings for Efficient Kinematics Generation

cs.CV · 2026-04-13 · unverdicted · novelty 6.0

A 64x temporally compressed motion embedding learned from trackers enables efficient conditional flow-matching generation of long-term motions that outperform video models and task-specific methods.

Representations Before Pixels: Semantics-Guided Hierarchical Video Prediction

cs.CV · 2026-04-13 · unverdicted · novelty 6.0

Re2Pix decomposes video prediction into semantic feature forecasting followed by representation-conditioned diffusion synthesis, with nested dropout and mixed supervision to handle prediction errors.

Back to Parsimonious Latents: Learning Task-Centric World Models from Visual Foundations

cs.AI · 2026-05-25 · unverdicted · novelty 5.0

TC-WM converts foundation-model visual embeddings into parsimonious task-sufficient world model latents via linear projection, contrastive physical-state alignment, and embedding reconstruction, with a theoretical identification guarantee.

Reconstruction or Semantics? What Makes a Latent Space Useful for Robotic World Models

cs.CV · 2026-05-07 · unverdicted · novelty 5.0

Semantic latent spaces from pretrained encoders outperform reconstruction-based spaces for robotic world models on planning and downstream policy performance.

Video Generation with Predictive Latents

cs.CV · 2026-05-04 · unverdicted · novelty 5.0

PV-VAE improves video latent spaces for generation by unifying reconstruction with future-frame prediction, reporting 52% faster convergence and 34.42 FVD gain over Wan2.2 VAE on UCF101.

What Drives Success in Physical Planning with Joint-Embedding Predictive World Models?

cs.AI · 2025-12-30 · unverdicted · novelty 5.0

An empirical study of JEPA world models identifies architecture, training objective, and planning choices that yield a model outperforming DINO-WM and V-JEPA-2-AC on navigation and manipulation tasks.

citing papers explorer

Showing 4 of 4 citing papers after filters.

Learning Visual Feature-Based World Models via Residual Latent Action
cs.CV · 2026-05-08 · unverdicted · none · ref 21
RLA-WM predicts residual latent actions via flow matching to create visual feature world models that outperform prior feature-based and diffusion approaches while enabling offline video-based robot RL.
Representations Before Pixels: Semantics-Guided Hierarchical Video Prediction
cs.CV · 2026-04-13 · unverdicted · none · ref 6
Re2Pix decomposes video prediction into semantic feature forecasting followed by representation-conditioned diffusion synthesis, with nested dropout and mixed supervision to handle prediction errors.
Reconstruction or Semantics? What Makes a Latent Space Useful for Robotic World Models
cs.CV · 2026-05-07 · unverdicted · none · ref 8
Semantic latent spaces from pretrained encoders outperform reconstruction-based spaces for robotic world models on planning and downstream policy performance.
Video Generation with Predictive Latents
cs.CV · 2026-05-04 · unverdicted · none · ref 3
PV-VAE improves video latent spaces for generation by unifying reconstruction with future-frame prediction, reporting 52% faster convergence and 34.42 FVD gain over Wan2.2 VAE on UCF101.
