LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels

Damien Scieur; Lucas Maes; Quentin Le Lidec; Randall Balestriero; Yann LeCun

arxiv: 2603.19312 · v3 · pith:OA7KLRV4new · submitted 2026-03-13 · 💻 cs.LG · cs.AI

Lucas Maes, Quentin Le Lidec, Damien Scieur, Yann LeCun, Randall Balestriero

Pith reviewed 2026-05-15 04:03 UTC · model grok-4.3

keywords: joint embedding predictive architecture · world models · representation learning · latent embeddings · control tasks · gaussian regularizer · end-to-end training

The pith

LeWorldModel trains the first stable end-to-end JEPA from raw pixels using only two loss terms.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LeWorldModel as a joint-embedding predictive architecture that learns world models directly from pixel inputs. It achieves stable training by pairing a next-embedding prediction loss with a regularizer that forces latent embeddings to follow a Gaussian distribution. This cuts the number of tunable loss hyperparameters from six to one and permits training a 15-million-parameter model on a single GPU in hours. The resulting models plan up to 48 times faster than larger foundation-model alternatives while matching performance on 2D and 3D control tasks and encoding detectable physical structure in their latents.

Core claim

LeWM is the first JEPA that trains stably end-to-end from raw pixels using only two loss terms: a next-embedding prediction loss and a regularizer enforcing Gaussian-distributed latent embeddings. This reduces tunable loss hyperparameters from six to one compared to the only existing end-to-end alternative. With ~15M parameters trainable on a single GPU in a few hours, LeWM plans up to 48x faster than foundation-model-based world models while remaining competitive across diverse 2D and 3D control tasks. Beyond control, we show that LeWM's latent space encodes meaningful physical structure through probing of physical quantities. Surprise evaluation confirms that the model reliably detects physically implausible events.

What carries the argument

The Gaussian regularizer on latent embeddings, which keeps representations from collapsing by enforcing a Gaussian distribution during end-to-end training from pixels.
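To make the two-term recipe concrete, here is a minimal NumPy sketch of an objective of this shape. It is an illustration under stated assumptions, not the paper's implementation: the regularizer below is a simple moment-matching stand-in (the KL divergence from the Gaussian fitted to a batch of embeddings to a standard normal), and the names `prediction_loss`, `gaussian_regularizer`, `two_term_loss`, and the weight `lam` are hypothetical.

```python
import numpy as np

def prediction_loss(pred_next, target_next):
    """Mean squared error between predicted and encoded next-step embeddings."""
    return float(np.mean(np.sum((pred_next - target_next) ** 2, axis=1)))

def gaussian_regularizer(z, eps=1e-4):
    """KL( N(mu, Sigma) || N(0, I) ) for the Gaussian fitted to a batch z of shape (n, d).

    Assumption: a moment-matching stand-in, not the regularizer used in the paper.
    """
    n, d = z.shape
    mu = z.mean(axis=0)
    centered = z - mu
    # eps * I keeps the covariance invertible even for a fully collapsed batch.
    sigma = centered.T @ centered / n + eps * np.eye(d)
    _, logdet = np.linalg.slogdet(sigma)
    return float(0.5 * (np.trace(sigma) + mu @ mu - d - logdet))

def two_term_loss(pred_next, target_next, z, lam=1.0):
    """Prediction term plus one weighted regularizer; lam is the single loss weight."""
    return prediction_loss(pred_next, target_next) + lam * gaussian_regularizer(z)

rng = np.random.default_rng(0)
z_gauss = rng.standard_normal((4096, 8))                        # well-spread embeddings
z_collapsed = np.tile(rng.standard_normal((1, 8)), (4096, 1))   # every row identical

reg_gauss = gaussian_regularizer(z_gauss)          # close to zero
reg_collapsed = gaussian_regularizer(z_collapsed)  # large: collapse is penalized
```

In this toy setting a collapsed batch makes the prediction term trivially small while the regularizer grows, which is the trade-off the single weight has to balance.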

If this is right

World-model training becomes feasible with only one tunable hyperparameter instead of six.
Models with 15 million parameters can be trained on a single GPU and still produce competitive policies.
Planning speed improves by up to 48 times relative to larger foundation-model world models.
Latent embeddings can be probed to recover physical quantities such as positions and velocities.
Surprise signals in the latent space reliably flag physically implausible transitions.
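The surprise signal in the last item can be sketched as a latent prediction error. The definition below (squared distance between the predicted and the encoded next embedding) is an assumption for illustration, and the constant-velocity matrix `A` is a toy stand-in for a learned predictor; neither is taken from the paper.

```python
import numpy as np

def surprise(predict_next, z_t, z_next):
    """Squared latent prediction error per transition (assumed definition)."""
    return np.sum((predict_next(z_t) - z_next) ** 2, axis=1)

# Toy stand-in for a learned predictor: constant-velocity dynamics
# on a 2-d latent [position, velocity].
A = np.array([[1.0, 0.1],
              [0.0, 1.0]])

def predict_next(z):
    return z @ A.T

# Roll out a physically plausible trajectory of 21 latent states.
states = [np.array([0.0, 1.0])]
for _ in range(20):
    states.append(A @ states[-1])
traj = np.stack(states)

# Inject an implausible event: from step 11 onward the object has
# "teleported" 5 units, so only the transition 10 -> 11 is inconsistent.
traj[11:, 0] += 5.0

scores = surprise(predict_next, traj[:-1], traj[1:])
flagged = int(np.argmax(scores))  # index of the most surprising transition
```

Thresholding `scores` rather than taking the maximum would turn this into a detector; how such a threshold is chosen is not something the abstract specifies.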

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same two-term recipe may generalize to video prediction or robotic manipulation domains beyond the current control benchmarks.
If the Gaussian constraint preserves physical structure, it could serve as a lightweight prior for other latent-space predictive models.
Removing the need for pre-trained encoders opens the door to fully self-supervised world-model learning on raw sensor streams.
Faster planning combined with physical interpretability could enable real-time model-based control on embedded hardware.

Load-bearing premise

The Gaussian regularizer alone is sufficient to prevent representation collapse across diverse 2D and 3D control tasks without auxiliary supervision or pre-trained encoders.

What would settle it

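Training LeWM with the Gaussian regularizer on a new suite of control tasks and still observing representation collapse would falsify the claim that the regularizer alone guarantees stability. Conversely, stable training on those tasks without the regularizer would show that it is not what carries the result.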
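A collapse check of this kind needs an operational definition of "collapse". One possible sketch, using diagnostics that are assumptions here rather than the paper's metrics, reports the smallest per-dimension standard deviation and an entropy-based effective rank of the embedding covariance. The function name `collapse_diagnostics` is hypothetical.

```python
import numpy as np

def collapse_diagnostics(z):
    """Return simple collapse indicators for a batch of embeddings z of shape (n, d)."""
    centered = z - z.mean(axis=0)
    cov = centered.T @ centered / len(z)
    # Eigenvalues of a covariance matrix are non-negative up to rounding error.
    eig = np.clip(np.linalg.eigvalsh(cov), 0.0, None)
    total = eig.sum()
    if total < 1e-12:
        # All embeddings are (numerically) identical: complete collapse.
        return {"min_std": 0.0, "effective_rank": 0.0}
    p = eig / total
    p = p[p > 0]
    entropy = -np.sum(p * np.log(p))
    return {
        "min_std": float(centered.std(axis=0).min()),
        # exp(entropy) equals d when variance is spread evenly over d directions.
        "effective_rank": float(np.exp(entropy)),
    }

rng = np.random.default_rng(0)
healthy = collapse_diagnostics(rng.standard_normal((4096, 8)))
# Dimensional collapse: all variance lies along a single direction.
rank_one = collapse_diagnostics(rng.standard_normal((4096, 1)) @ rng.standard_normal((1, 8)))
constant = collapse_diagnostics(np.ones((100, 8)))
```

The rank-one case shows why per-dimension variance alone is not enough: every coordinate still varies, yet the effective rank drops to about one.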

Original abstract

Joint Embedding Predictive Architectures (JEPAs) offer a compelling framework for learning world models in compact latent spaces, yet existing methods remain fragile, relying on complex multi-term losses, exponential moving averages, pre-trained encoders, or auxiliary supervision to avoid representation collapse. In this work, we introduce LeWorldModel (LeWM), the first JEPA that trains stably end-to-end from raw pixels using only two loss terms: a next-embedding prediction loss and a regularizer enforcing Gaussian-distributed latent embeddings. This reduces tunable loss hyperparameters from six to one compared to the only existing end-to-end alternative. With ~15M parameters trainable on a single GPU in a few hours, LeWM plans up to 48x faster than foundation-model-based world models while remaining competitive across diverse 2D and 3D control tasks. Beyond control, we show that LeWM's latent space encodes meaningful physical structure through probing of physical quantities. Surprise evaluation confirms that the model reliably detects physically implausible events.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

LeWM simplifies JEPA training from pixels to two losses, but the Gaussian regularizer's ability to prevent collapse needs direct verification beyond the abstract.

LeWM claims to be the first end-to-end JEPA from raw pixels that uses only a next-embedding prediction loss plus a Gaussian regularizer on the embeddings. This reduces the number of tunable loss hyperparameters to one. That is the key simplification over the only existing end-to-end alternative, which needs six, and over other JEPAs that rely on EMAs, pre-trained encoders, or auxiliary supervision.

The work does well on the practical side. The model has about 15 million parameters, trains on a single GPU in a few hours, and delivers planning speeds up to 48 times faster than foundation-model world models while matching performance on a mix of 2D and 3D control tasks. The additional checks that the latent space encodes physical quantities and detects implausible events through surprise evaluation are solid extras that show the representations are not arbitrary.

The main soft spot is the reliance on the Gaussian regularizer to avoid collapse. Prior JEPA papers added multiple terms because basic regularizers often let embeddings degenerate on some tasks. The abstract does not include training curves, embedding distribution plots, or ablations that would confirm the latents stay Gaussian and useful without extra help. If the single hyperparameter still requires careful per-task adjustment, the claimed stability edge is less clear.

This paper targets researchers building compact world models for model-based planning in robotics and simulation. Anyone tired of heavy pretrained encoders or complex loss schedules will find the recipe worth trying. The results are on relevant benchmarks, so the work deserves a serious referee to examine the training details and robustness tests. I recommend sending it to peer review. The idea is clean enough that referees can directly test whether the two-loss setup delivers the promised stability.

Referee Report

3 major / 1 minor

Summary. The manuscript introduces LeWorldModel (LeWM), a Joint-Embedding Predictive Architecture (JEPA) that claims to be the first to train stably end-to-end from raw pixels using only two loss terms: a next-embedding prediction loss and a regularizer enforcing Gaussian-distributed latent embeddings. This reduces tunable loss hyperparameters from six to one. The ~15M-parameter model trains on a single GPU in hours, plans up to 48x faster than foundation-model baselines, performs competitively on diverse 2D and 3D control tasks, encodes physical quantities in its latent space (via probing), and detects implausible events through surprise evaluation.

Significance. If the empirical claims hold, the work would be significant for simplifying JEPA training in self-supervised world-model learning, removing reliance on multi-term losses, EMAs, or pretrained encoders. The single-hyperparameter design and efficiency could broaden accessibility for control applications, while the physical-structure probing offers a concrete advance beyond task performance metrics.
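The probing mentioned here is commonly done with a linear readout from frozen latents. The sketch below assumes that setup and uses synthetic data, since the paper's probe, quantities, and data are not given in the abstract; `fit_linear_probe` and `r2_score` are hypothetical helper names.

```python
import numpy as np

def fit_linear_probe(z, y, ridge=1e-6):
    """Ridge-regularized least-squares map (with bias) from latents z to targets y."""
    zb = np.hstack([z, np.ones((len(z), 1))])
    gram = zb.T @ zb + ridge * np.eye(zb.shape[1])
    return np.linalg.solve(gram, zb.T @ y)

def r2_score(w, z, y):
    """Coefficient of determination per target column on the given data."""
    zb = np.hstack([z, np.ones((len(z), 1))])
    ss_res = np.sum((y - zb @ w) ** 2, axis=0)
    ss_tot = np.sum((y - y.mean(axis=0)) ** 2, axis=0)
    return 1.0 - ss_res / ss_tot

# Synthetic stand-in: 16-d "latents" that linearly mix [position, velocity] plus noise.
rng = np.random.default_rng(0)
state = rng.standard_normal((2000, 2))
mix = rng.standard_normal((2, 16))
z = state @ mix + 0.05 * rng.standard_normal((2000, 16))

train, test = slice(0, 1500), slice(1500, 2000)
w = fit_linear_probe(z[train], state[train])
r2_probe = r2_score(w, z[test], state[test])

# Control: shuffled targets should not be recoverable from the latents.
shuffled = state[rng.permutation(len(state))]
w_ctrl = fit_linear_probe(z[train], shuffled[train])
r2_control = r2_score(w_ctrl, z[test], shuffled[test])
```

Reporting the held-out score next to a shuffled-target control helps separate structure that is actually encoded in the latents from what a flexible probe could fit by chance.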

major comments (3)

Abstract: The claim that the Gaussian regularizer alone suffices to prevent representation collapse (the weakest assumption) is load-bearing for the 'first stable two-loss JEPA' assertion, yet no formulation of the regularizer, its weight schedule, or embedding statistics (variance, mode coverage) across tasks is provided; without this, it is impossible to verify whether it replaces the auxiliary terms used in prior JEPAs.
Experiments section: No ablation table or figure isolates the effect of the Gaussian regularizer versus the prediction loss alone, nor reports the single tunable hyperparameter value per task; this undermines the reduction-from-six-to-one claim, especially given that prior work required additional terms precisely because simpler regularizers often led to collapse on similar 2D/3D benchmarks.
Results: The competitive performance and 48x planning speedup are stated without reference to specific baseline tables, error bars, or statistical tests; the abstract-only presentation leaves the soundness of these quantitative claims unverifiable.

minor comments (1)

Abstract: The ~15M parameter count and single-GPU training time should be tied to a specific model diagram or experimental-setup paragraph for reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight important aspects of clarity and verifiability. We address each major point below and will incorporate revisions to strengthen the manuscript.

Referee: Abstract: The claim that the Gaussian regularizer alone suffices to prevent representation collapse (the weakest assumption) is load-bearing for the 'first stable two-loss JEPA' assertion, yet no formulation of the regularizer, its weight schedule, or embedding statistics (variance, mode coverage) across tasks is provided; without this, it is impossible to verify whether it replaces the auxiliary terms used in prior JEPAs.

Authors: We agree that explicit details on the regularizer are necessary to support the stability claim. In the revised manuscript, we will add the precise mathematical formulation of the Gaussian regularizer (including its implementation as a KL-divergence term to a standard normal), the weighting schedule used during training, and quantitative embedding statistics (mean variance, effective mode coverage, and collapse metrics) across all 2D and 3D tasks. These additions will allow direct verification that the two-loss formulation suffices without auxiliary terms.
Revision: yes

Referee: Experiments section: No ablation table or figure isolates the effect of the Gaussian regularizer versus the prediction loss alone, nor reports the single tunable hyperparameter value per task; this undermines the reduction-from-six-to-one claim, especially given that prior work required additional terms precisely because simpler regularizers often led to collapse on similar 2D/3D benchmarks.

Authors: We acknowledge this gap in the experimental presentation. The revised version will include a new ablation table and accompanying figure that directly compares training with only the prediction loss against the full two-loss objective (prediction + Gaussian regularizer). We will also tabulate the single tunable hyperparameter value used for each task and environment, along with sensitivity analysis showing stability across a narrow range around the reported value. This will substantiate the hyperparameter reduction claim.
Revision: yes

Referee: Results: The competitive performance and 48x planning speedup are stated without reference to specific baseline tables, error bars, or statistical tests; the abstract-only presentation leaves the soundness of these quantitative claims unverifiable.

Authors: We will update the results section to explicitly reference the relevant baseline comparison tables (currently in the supplementary material but now moved to the main text), include error bars computed over multiple random seeds, and add statistical significance tests (e.g., paired t-tests) for the reported performance metrics and planning speedups. The abstract will be revised to point to these tables, ensuring all quantitative claims are directly verifiable from the main paper.
Revision: yes
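For readers unfamiliar with the paired test named in this response, the following standard-library sketch computes a paired t statistic over matched seeds. The per-seed numbers are made up purely for illustration and are not results from the paper; `paired_t_statistic`, `method_a`, and `method_b` are hypothetical names.

```python
import math
import statistics

def paired_t_statistic(a, b):
    """Return (t, degrees_of_freedom) for paired samples a and b of equal length."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long samples with at least two pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = statistics.stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    n = len(diffs)
    return statistics.fmean(diffs) / (sd / math.sqrt(n)), n - 1

# Made-up success rates for five matched seeds (illustration only).
method_a = [0.80, 0.75, 0.90, 0.85, 0.70]
method_b = [0.70, 0.70, 0.80, 0.80, 0.65]

t_stat, dof = paired_t_statistic(method_a, method_b)
# Compare |t_stat| with the two-sided 5% critical value for dof = 4, about 2.776.
```

Pairing by seed removes seed-to-seed variance from the comparison, which matters when only a handful of seeds are available.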

Circularity Check

0 steps flagged

No circularity: empirical claims without derivation chain

The manuscript introduces LeWorldModel as an empirical architecture that trains end-to-end from pixels using a next-embedding prediction loss plus a Gaussian regularizer on latent embeddings. No equations, formal derivations, or proof steps are presented that would allow any claimed prediction or result to reduce by construction to fitted inputs, self-citations, or ansatzes. The central assertions (stable training, reduced hyperparameters, competitive performance on 2D/3D tasks, and physical structure in latents) are supported solely by experimental outcomes rather than any self-referential mathematical structure. This leaves the work self-contained against external benchmarks with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 0 invented entities

Abstract-only review yields minimal ledger entries; the single tunable hyperparameter is the only explicit free parameter mentioned.

free parameters (1)

single tunable loss hyperparameter
Abstract states reduction from six to one tunable loss hyperparameter, implying one remains that must be chosen for the Gaussian regularizer or prediction loss.

pith-pipeline@v0.9.0 · 5486 in / 1130 out tokens · 23333 ms · 2026-05-15T04:03:29.162452+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

LawOfExistence defect_zero_iff_one echoes

ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

a regularizer enforcing Gaussian-distributed latent embeddings, promoting feature diversity... to prevent trivial collapse
Cost Jcost_nonneg echoes

ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

SIGReg regularization term enforces Gaussian-distributed latent embeddings

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 28 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

JEDI: Joint Embedding Diffusion World Model for Online Model-Based Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 7.0

JEDI is the first online end-to-end latent diffusion world model that trains latents from denoising loss rather than reconstruction, achieving competitive Atari100k results with 43% less VRAM and over 3x faster sampli...
ProteinJEPA: Latent prediction complements protein language models
cs.LG 2026-05 unverdicted novelty 7.0

Masked-position MLM plus JEPA latent prediction outperforms MLM-only pretraining on 10-11 of 16 downstream tasks for 35M-150M protein models while JEPA alone fails.
AGWM: Affordance-Grounded World Models for Environments with Compositional Prerequisites
cs.AI 2026-05 unverdicted novelty 7.0

AGWM improves world model accuracy in compositional environments by learning an explicit DAG of action affordance prerequisites to handle dynamic executability.
Render, Don't Decode: Weight-Space World Models with Latent Structural Disentanglement
cs.CV 2026-05 unverdicted novelty 7.0

NOVA represents world states as INR weights for decoder-free rendering, compactness, and unsupervised disentanglement of background, foreground, and motion in video world models.
Latent State Design for World Models under Sufficiency Constraints
cs.AI 2026-05 unverdicted novelty 7.0

World models succeed when their latent states are built to meet task-specific sufficiency constraints rather than preserving the maximum amount of information.
3D-Anchored Lookahead Planning for Persistent Robotic Scene Memory via World-Model-Based MCTS
cs.RO 2026-04 unverdicted novelty 7.0

3D-ALP achieves 0.65 success on memory-dependent 5-step robotic reach tasks versus near-zero for reactive baselines by anchoring MCTS planning to a persistent 3D camera-to-world frame.
LaMo: Self-Supervised Latent Motion Priors for Physical Realism in Video Generation
cs.CV 2026-05 unverdicted novelty 6.0

LaMo adds self-supervised latent motion priors via a motion drift loss during training and motion prior guidance during sampling to boost physical fidelity in video diffusion models like CogVideoX.
Pelican-Unify 1.0: A Unified Embodied Intelligence Model for Understanding, Reasoning, Imagination and Action
cs.RO 2026-05 unverdicted novelty 6.0

Pelican-Unified 1.0 trains a single VLM plus Unified Future Generator to jointly optimize understanding, reasoning, future video prediction, and action generation, reporting top-tier scores on VLM, WorldArena, and Rob...
SCAR: Self-Supervised Continuous Action Representation Learning
cs.RO 2026-05 unverdicted novelty 6.0

SCAR proposes a joint inverse-forward dynamics framework to learn transferable continuous action representations across embodiments from visual data using regularization and adversarial invariance.
Do multimodal models imagine electric sheep?
cs.CV 2026-05 conditional novelty 6.0

Fine-tuning VLMs to output action sequences for puzzles causes emergent internal visual representations that improve performance when integrated into reasoning.
Predictive but Not Plannable: RC-aux for Latent World Models
cs.LG 2026-05 unverdicted novelty 6.0

RC-aux corrects spatiotemporal mismatch in reconstruction-free latent world models by adding multi-horizon prediction and reachability supervision, improving planning performance on goal-conditioned pixel-control tasks.
Render, Don't Decode: Weight-Space World Models with Latent Structural Disentanglement
cs.CV 2026-05 unverdicted novelty 6.0

NOVA represents scene states as INR weights for analytical rendering without decoders and achieves structural disentanglement of content and dynamics in video world models.
AeroJEPA: Learning Semantic Latent Representations for Scalable 3D Aerodynamic Field Modeling
cs.LG 2026-05 unverdicted novelty 6.0

AeroJEPA applies joint-embedding predictive learning to produce scalable, semantically organized latent representations for 3D aerodynamic fields that support both field reconstruction and downstream design tasks.
Learning to Theorize the World from Observation
cs.LG 2026-05 unverdicted novelty 6.0

NEO induces compositional latent programs as world theories from observations and executes them to enable explanation-driven generalization.
Information bottleneck for learning the phase space of dynamics from high-dimensional experimental data
physics.data-an 2026-04 unverdicted novelty 6.0

DySIB recovers a two-dimensional representation matching the phase space of a physical pendulum from high-dimensional video data by maximizing predictive mutual information in latent space.
Sonata: A Hybrid World Model for Inertial Kinematics under Clinical Data Scarcity
cs.LG 2026-04 unverdicted novelty 6.0

Sonata is a small hybrid world model pre-trained to predict future IMU states that outperforms autoregressive baselines on clinical discrimination, fall-risk prediction, and cross-cohort transfer while fitting on-devi...
IntentScore: Intent-Conditioned Action Evaluation for Computer-Use Agents
cs.AI 2026-04 unverdicted novelty 6.0

IntentScore is a plan-aware reward model trained on 398K GUI steps using contrastive and ranking objectives that reaches 97.5% pairwise accuracy and raises task success by 6.9 points on an unseen agent and environment.
IntentScore: Intent-Conditioned Action Evaluation for Computer-Use Agents
cs.AI 2026-04 unverdicted novelty 6.0

IntentScore learns intent-conditioned action scores from offline GUI trajectories and raises task success by 6.9 points on an unseen agent and environment.
Metriplector: From Field Theory to Neural Architecture
cs.AI 2026-03 unverdicted novelty 6.0

Metriplector treats neural computation as coupled metriplectic field dynamics whose stress-energy tensor readout achieves competitive results on vision, control, Sudoku, language modeling, and pathfinding with small p...
Video Generation Models as World Models: Efficient Paradigms, Architectures and Algorithms
eess.IV 2026-03 unverdicted novelty 6.0

Video generation models can function as world simulators if efficiency gaps in spatiotemporal modeling are bridged via organized paradigms, architectures, and algorithms.
Beyond Euclidean Proximity: Repairing Latent World Models with Horizon-Matched Trajectory Reachability Metrics
cs.LG 2026-05 unverdicted novelty 5.0

TRM trains a small horizon-matched pairwise head on trajectory data to improve terminal-state ranking in latent MPC, raising success from 7% to 97% on TwoRoom and 32.7% to 84% on PLDM without changing the encoder or dynamics.
ChronoMedicalWorld: A Medical World Model for Learning Patient Trajectories from Longitudinal Care Data
cs.LG 2026-05 unverdicted novelty 5.0

CMWM is a recurrent latent world model for forecasting patient trajectories like annual eGFR in CKD, reporting 7.28% lower MAE than a tuned GPT-5.5 baseline on a 2232-patient cohort with gains from dialogue data.
stable-worldmodel: A Platform for Reproducible World Modeling Research and Evaluation
cs.LG 2026-05 unverdicted novelty 5.0

The paper presents stable-worldmodel (swm), a platform with high-performance data layer, modern world model baselines, planning solvers, and extended environments for reproducible research and generalization evaluation.
Pelican-Unify 1.0: A Unified Embodied Intelligence Model for Understanding, Reasoning, Imagination and Action
cs.RO 2026-05 unverdicted novelty 5.0

A unified embodied foundation model uses one VLM for understanding and reasoning plus a joint video-action future generator, reporting competitive scores on VLM, world modeling, and robot benchmarks without apparent c...
Is the Future Compatible? Diagnosing Dynamic Consistency in World Action Models
cs.RO 2026-05 unverdicted novelty 5.0

Action-state consistency in World Action Models distinguishes successful from failed imagined futures and supports value-free selection of better rollouts via consensus among predictions.
ST-Gen4D: Embedding 4D Spatiotemporal Cognition into World Model for 4D Generation
cs.CV 2026-05 unverdicted novelty 5.0

ST-Gen4D uses a world model that fuses global appearance and local dynamic graphs into a 4D cognition representation to guide consistent 4D Gaussian generation.
Detecting is Easy, Adapting is Hard: Local Expert Growth for Visual Model-Based Reinforcement Learning under Distribution Shift
cs.LG 2026-04 unverdicted novelty 5.0

JEPA-Indexed Local Expert Growth adds local action corrections for detected shift clusters and yields statistically significant OOD gains on four shift conditions while keeping in-distribution performance intact.
World Model for Robot Learning: A Comprehensive Survey
cs.RO 2026-04 unverdicted novelty 3.0

A comprehensive survey that organizes the literature on world models in robot learning, their roles in policy learning, planning, simulation, and video-based generation, with connections to navigation, driving, datase...