The paper formulates JEPA pretraining as conditional spectral graph learning equivalent to low-rank factorization of an action-conditioned co-occurrence matrix and derives a finite-sample generalization bound connecting pretraining error to downstream planning regret.
hub Mixed citations
LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels
Mixed citation behavior. Most common role is background (44%).
abstract
Joint Embedding Predictive Architectures (JEPAs) offer a compelling framework for learning world models in compact latent spaces, yet existing methods remain fragile, relying on complex multi-term losses, exponential moving averages, pre-trained encoders, or auxiliary supervision to avoid representation collapse. In this work, we introduce LeWorldModel (LeWM), the first JEPA that trains stably end-to-end from raw pixels using only two loss terms: a next-embedding prediction loss and a regularizer enforcing Gaussian-distributed latent embeddings. This reduces tunable loss hyperparameters from six to one compared to the only existing end-to-end alternative. With ~15M parameters trainable on a single GPU in a few hours, LeWM plans up to 48x faster than foundation-model-based world models while remaining competitive across diverse 2D and 3D control tasks. Beyond control, we show that LeWM's latent space encodes meaningful physical structure through probing of physical quantities. Surprise evaluation confirms that the model reliably detects physically implausible events.
hub tools
citation-role summary
citation-polarity summary
years
2026 69representative citing papers
LeJEPA achieves linear identifiability of latent variables uniquely when the latents are Gaussian in worlds with stationary additive-noise transitions.
LeVLJEPA is the first non-contrastive vision-language pretraining method that learns via cross-modal prediction without negatives, producing stronger dense features than contrastive baselines on VQA and segmentation tasks.
Equilibrium World Models are a deep-learning solver that enforces exact equilibrium conditions on broad model-generated state distributions to globally solve dynamic stochastic models featuring rare disasters, binding constraints, and counterfactual states.
SkyJEPA learns long-horizon latent dynamics for quadrotors via JEPA plus a physics prober, enabling zero-shot sim-to-real control with sampling-based MPC and automated sim data generation.
VLWMs learn variable-length action-conditioned dynamics in latent space with curriculum training, yielding 13% average gains over prior latent world models on long-horizon tasks.
World models introduce a stealthy poisoning vector into robot learning pipelines where malicious prompts or dynamics in teleoperated data activate only during synthetic trajectory generation, enabling backdoors in downstream policies.
X-Tokenizer creates semantic action tokens via asymmetric residual quantization and contrastive pretraining on large trajectory data, outperforming prior methods like FAST on robotic tasks.
Cross-trajectory negative sampling in contrastive predictive objectives causes encoding of slow noise over dynamics; intra-trajectory sampling eliminates the shortcut and recovers dynamical variables even under strong noise.
Latent prediction SSL recovers latent trees from PCFG data with sample complexity constant in hierarchy depth L (up to logs), unlike exponential for token-level or supervised methods.
JEDI is the first online end-to-end latent diffusion world model that trains latents from denoising loss rather than reconstruction, achieving competitive Atari100k results with 43% less VRAM and over 3x faster sampling than pixel diffusion baselines.
Masked-position MLM plus JEPA latent prediction outperforms MLM-only pretraining on 10-11 of 16 downstream tasks for 35M-150M protein models while JEPA alone fails.
AGWM improves world model accuracy in compositional environments by learning an explicit DAG of action affordance prerequisites to handle dynamic executability.
NOVA represents world states as INR weights for decoder-free rendering, compactness, and unsupervised disentanglement of background, foreground, and motion in video world models.
NEO is a probabilistic neural model that induces compositional programs as a learned Language of Thought from non-textual observations and executes them via a shared transition model to enable explanation-driven generalization.
World models succeed when their latent states are built to meet task-specific sufficiency constraints rather than preserving the maximum amount of information.
3D-ALP achieves 0.65 success on memory-dependent 5-step robotic reach tasks versus near-zero for reactive baselines by anchoring MCTS planning to a persistent 3D camera-to-world frame.
ACID improves decision-time planning in world models by adding per-step action consistency residuals from an inverse dynamics model to the planning cost via an adaptive weight, yielding better performance with less compute across manipulation and navigation tasks.
UniTacVLA builds a state-aware and dynamics-aware tactile prior via unified latent space, tactile chain-of-thought, and mixed real/predicted feedback controller to boost dexterous manipulation performance.
Delta-JEPA augments latent forward prediction with a Latent Difference Action Decoder that reconstructs actions from embedding displacements, yielding action-sensitive world models that improve planning on four visual continuous-control tasks over JEPA baselines.
ScaleAware-JEPA combines Constrained Diffusion Decomposition with a scale-tied JEPA objective to learn label-free latent coordinates that recover coherent morphology in multiscale fields such as MHD turbulence and interstellar gas.
FlowWM applies flow matching directly in pretrained feature space with a one-step projection mechanism, improving perception accuracy, mode coverage, and horizon robustness on synthetic and real-world benchmarks.
A self-supervised framework learns implicit 3D physics by lifting V-JEPA features into voxels and performing volumetric feature advection conditioned on actions.
Fast-LeWM uses action-prefix encoding and parallel latent prediction to replace sequential rollout, improving success rates and cutting planning time in LeWorldModel tasks.
citing papers explorer
-
Learning Sparse Latent Predictive Foundation Model for Multimodal Neuroimaging
Neuro-JEPA is a sparse multimodal foundation model pretrained on 1,551,862 brain MRI scans that shows stronger and more consistent performance than existing models and CNN baselines across 47 tasks from clinical and public datasets.
-
PLUME: Probabilistic Latent Unified World Modeling and Parameter Estimation for Multi-Finger Manipulation
PLUME jointly models parameter beliefs and conditioned dynamics in a latent space for dexterous manipulation, enabling zero-shot sim-to-real transfer that outperforms offline RL and behavior cloning baselines on turning, lifting, and flicking tasks.
-
PRISM: PRior-guided Imagination Sampling in world Models
PRISM derives a state-conditioned action prior from a world model's encoder and integrates it into sampling-based planning via product-of-Gaussians fusion, claiming 35 and 32 percentage point gains on Cube and PushT tasks.
-
Generalization of World Models under Environmental Variability for Vision-based Quadrotor Navigation
Robustness of world models during cross-environment SSL pretraining predicts sim-to-real transfer success for quadrotor navigation, with discrete latent size and training sequence length as dominant factors.
-
IMWM: Intuition Models Complement World Models for Latent Planning
IMWM combines a world model with an intuition model from demonstrations to improve sample-based latent planning success rates over world-model-only baselines on pixel control tasks.
-
Beyond Euclidean Proximity: Repairing Latent World Models with Horizon-Matched Trajectory Reachability Metrics
TRM trains a small horizon-matched pairwise head on trajectory data to improve terminal-state ranking in latent MPC, raising success from 7% to 97% on TwoRoom and 32.7% to 84% on PLDM without changing the encoder or dynamics.
-
ChronoMedicalWorld: A Medical World Model for Learning Patient Trajectories from Longitudinal Care Data
CMWM is a recurrent latent world model for forecasting patient trajectories like annual eGFR in CKD, reporting 7.28% lower MAE than a tuned GPT-5.5 baseline on a 2232-patient cohort with gains from dialogue data.
-
stable-worldmodel: A Platform for Reproducible World Modeling Research and Evaluation
The paper presents stable-worldmodel (swm), a platform with high-performance data layer, modern world model baselines, planning solvers, and extended environments for reproducible research and generalization evaluation.
-
Pelican-Unify 1.0: A Unified Embodied Intelligence Model for Understanding, Reasoning, Imagination and Action
A unified embodied foundation model uses one VLM for understanding and reasoning plus a joint video-action future generator, reporting competitive scores on VLM, world modeling, and robot benchmarks without apparent compromise.
-
Is the Future Compatible? Diagnosing Dynamic Consistency in World Action Models
Action-state consistency in World Action Models distinguishes successful from failed imagined futures and supports value-free selection of better rollouts via consensus among predictions.
-
ST-Gen4D: Embedding 4D Spatiotemporal Cognition into World Model for 4D Generation
ST-Gen4D uses a world model that fuses global appearance and local dynamic graphs into a 4D cognition representation to guide consistent 4D Gaussian generation.
-
Detecting is Easy, Adapting is Hard: Local Expert Growth for Visual Model-Based Reinforcement Learning under Distribution Shift
JEPA-Indexed Local Expert Growth adds local action corrections for detected shift clusters and yields statistically significant OOD gains on four shift conditions while keeping in-distribution performance intact.
-
EA-WM: Event-Aware World Models with Task-Specification Grounding for Long-Horizon Manipulation
EA-WM adds task-specification-grounded event prediction and verification to frozen visual-feature world models for improved long-horizon robot manipulation planning.
-
Towards World Models in Biomedical Research
Proposes biomedical world models that learn latent states and intervention-conditioned dynamics to enable simulation of future biological trajectories for discovery in virtual cells, organoids, patients, and surgery.
-
WALL-WM: Carving World Action Modeling at the Event Joints
WALL-WM introduces event-grounded Vision-Language-Action pretraining that uses semantic events as the atomic unit to address granularity mismatch in world action models and reports state-of-the-art generalization.
-
Silent Failures in Physical AI: A Literature Review of Runtime Action Authorization for Autonomous Systems
A literature review that defines silent physical-action failures in Physical AI and identifies the lack of complete runtime authorization boundaries across surveyed technical streams.
-
World Model for Robot Learning: A Comprehensive Survey
A comprehensive survey that organizes the literature on world models in robot learning, their roles in policy learning, planning, simulation, and video-based generation, with connections to navigation, driving, datasets, and benchmarks.
-
From World Models to World Action Models: A Concise Tutorial for Robotics
A tutorial taxonomizes world models for robotics into observation-space and state-space types and introduces world action models via four paradigms linking predictions to executable actions.
- IntentScore: Intent-Conditioned Action Evaluation for Computer-Use Agents