Delta-JEPA: Learning Action-Sensitive World Models via Latent Difference Decoding

Zhenghao Zhang, Yuanxiang Wang, Zhenyu Guan, Yujia Yang, Bingkang Shi, Tianyu Zong, Hongzhu Yi, Guoqing Chao, Xingchen Chen, Tiankun Yang, Chenxi Bao, Tao Yu, Jingjing Zhou, Jungang Xu

arXiv: 2606.31232 · v1 · pith:7CBP4H6Cnew · submitted 2026-06-30 · 💻 cs.AI

Pith reviewed 2026-07-01 05:45 UTC · model grok-4.3

classification 💻 cs.AI

keywords: world models, latent dynamics, action decoding, JEPA, visual control, representation learning, reinforcement learning


The pith

Reconstructing actions from latent displacements between observations prevents collapse in action-sensitive world models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Delta-JEPA to build visual world models whose latent dynamics stay sensitive to actions. Joint-embedding approaches often collapse into action-insensitive representations that hurt planning. Delta-JEPA adds a decoder that reconstructs the action solely from the vector difference between two consecutive latent states rather than from their concatenation. This forces the latent space to encode distinct transitions for distinct actions while using only forward prediction and action reconstruction. On four visual continuous-control tasks the resulting models plan better than JEPA baselines, and ablations confirm the displacement decoder outperforms endpoint concatenation.

Core claim

Delta-JEPA augments latent forward prediction with a Latent Difference Action Decoder (LDAD) that reconstructs the executed action from the latent displacement between consecutive observations. This displacement-level supervision directly regularizes transition geometry: adjacent embeddings cannot collapse without losing action information, and different actions are encouraged to induce distinguishable latent changes for rollout-based planning. The method uses only latent prediction and action reconstruction, avoiding pixel reconstruction and distribution-matching regularizers, and improves planning across four visual continuous-control tasks.
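The contrast between displacement decoding and endpoint concatenation can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a linear decoder and squared-error losses, and the names `displacement_input`, `concat_input`, `linear_decode`, `total_loss`, and the weight `act_weight` are hypothetical. The paper's actual architecture and loss weighting are not given in this summary.

```python
from typing import List

Vector = List[float]


def displacement_input(z_t: Vector, z_next: Vector) -> Vector:
    """LDAD-style decoder input: only the latent displacement z_next - z_t."""
    return [b - a for a, b in zip(z_t, z_next)]


def concat_input(z_t: Vector, z_next: Vector) -> Vector:
    """Endpoint-concatenation input, as used by conventional inverse decoders."""
    return z_t + z_next


def linear_decode(weights: List[Vector], bias: Vector, x: Vector) -> Vector:
    """Hypothetical linear action decoder: one weight row per action dimension."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def mse(pred: Vector, target: Vector) -> float:
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def total_loss(z_pred, z_next, a_hat, action, act_weight=1.0):
    """Assumed objective: latent prediction error plus weighted action reconstruction."""
    return mse(z_pred, z_next) + act_weight * mse(a_hat, action)


z_t, z_next = [1.0, 2.0], [3.0, 5.0]
delta = displacement_input(z_t, z_next)    # decoder sees only the change
endpoints = concat_input(z_t, z_next)      # decoder sees both absolute positions
identity = [[1.0, 0.0], [0.0, 1.0]]        # toy decoder weights
a_hat = linear_decode(identity, [0.0, 0.0], delta)
loss = total_loss(z_pred=[3.0, 5.0], z_next=z_next, a_hat=a_hat, action=[2.0, 3.0])
```

In this toy setup the displacement input is all zeros whenever `z_next` equals `z_t`, which is the property the core claim relies on: a collapsed pair of embeddings carries no signal from which the action could be reconstructed.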

What carries the argument

Latent Difference Action Decoder (LDAD), which reconstructs the executed action from the latent displacement between consecutive observations to regularize transition geometry.

If this is right

Planning performance improves over JEPA-based and representation-learning world model baselines on four visual continuous-control tasks.
Displacement-based action decoding is consistently more effective than endpoint concatenation in ablations.
Action-sensitivity analyses show clearer action-conditioned latent responses.
Only latent prediction and action reconstruction suffice, without pixel reconstruction or distribution-matching regularizers.
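One way to probe the action-sensitivity consequence is sketched below. The paper's own metric is not specified in this summary, so the measure here (mean pairwise distance between next latents predicted from one state under different actions) is an assumption chosen for illustration, and `action_sensitivity` and both toy predictors are hypothetical names.

```python
import math
from itertools import combinations
from typing import Callable, List, Sequence

Vector = List[float]


def action_sensitivity(predict: Callable[[Vector, Vector], Vector],
                       z: Vector, actions: Sequence[Vector]) -> float:
    """Mean pairwise Euclidean distance between next latents predicted
    from the same state z under each of the given actions."""
    preds = [predict(z, a) for a in actions]
    pairs = list(combinations(preds, 2))
    return sum(math.dist(p, q) for p, q in pairs) / len(pairs)


def sensitive_predictor(z: Vector, a: Vector) -> Vector:
    # Toy dynamics: the action shifts the latent directly.
    return [zi + ai for zi, ai in zip(z, a)]


def insensitive_predictor(z: Vector, a: Vector) -> Vector:
    # Toy collapsed dynamics: the action is ignored.
    return list(z)


z0 = [0.0, 0.0]
actions = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0]]
score_sensitive = action_sensitivity(sensitive_predictor, z0, actions)
score_insensitive = action_sensitivity(insensitive_predictor, z0, actions)
```

A predictor that ignores the action scores exactly zero under this measure, while any predictor that responds to the action scores above zero, so the number gives a rough scalar handle on "clearer action-conditioned latent responses".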

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same displacement supervision could be applied to discrete-action environments to test whether collapse resistance generalizes.
If latent displacements reliably encode action effects, the approach might reduce reliance on auxiliary regularizers in other representation-learning pipelines for control.
Measuring the correlation between displacement magnitude and action scale in physical simulators would test whether the geometry learned here reflects real dynamics.
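The last suggestion, correlating displacement magnitude with action scale, could be checked along the following lines. This is a sketch on synthetic data, not an experiment from the paper: the toy trajectory assumes each latent moves by half the action vector, and `displacement_action_correlation` is a hypothetical helper.

```python
import math
import random
from typing import List, Sequence

Vector = List[float]


def norm(v: Vector) -> float:
    """Euclidean norm of a vector."""
    return math.sqrt(sum(x * x for x in v))


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def displacement_action_correlation(latents: List[Vector],
                                    actions: List[Vector]) -> float:
    """Correlate ||z_{t+1} - z_t|| with ||a_t|| along one trajectory.
    Expects len(latents) == len(actions) + 1."""
    disp = [norm([b - a for a, b in zip(z0, z1)])
            for z0, z1 in zip(latents, latents[1:])]
    return pearson(disp, [norm(a) for a in actions])


# Synthetic trajectory (assumption): each step moves the latent by 0.5 * action.
rng = random.Random(0)
actions = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(50)]
latents = [[0.0, 0.0]]
for a in actions:
    latents.append([z + 0.5 * ai for z, ai in zip(latents[-1], a)])

corr = displacement_action_correlation(latents, actions)
```

On real encoder outputs the correlation would be expected to fall below the near-perfect value this synthetic case produces; how far below is exactly what the proposed measurement would reveal.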

Load-bearing premise

Reconstructing the executed action from the latent displacement between consecutive observations will reliably prevent collapse and produce distinguishable latent changes for different actions across the evaluated tasks.

What would settle it

Train the same architecture without the displacement decoder, then check on the same four tasks whether the latent states collapse to action-insensitive representations and whether planning performance drops.

Figures

Figures reproduced from arXiv: 2606.31232 by Bingkang Shi, Chenxi Bao, Guoqing Chao, Hongzhu Yi, Jingjing Zhou, Jungang Xu, Tao Yu, Tiankun Yang, Tianyu Zong, Xingchen Chen, Yuanxiang Wang, Yujia Yang, Zhenghao Zhang, Zhenyu Guan.

**Figure 2.** Illustration of LDAD-induced action-sensitive la…

**Figure 3.** Sensitivity of Push-T planning success to the ac…

**Figure 5.** PCA visualization of two Two-Room latent trajectories with nearby initial states but different endpoints, shown across…

**Figure 6.** PCA visualization of action-conditioned predictor…

**Figure 7.** Attention rollout visualizations on Push-T (top) and Two-Room (bottom) using intermediate layers 4–6 of the ViT…

**Figure 8.** Layer-wise specialization of attention maps on OGB-Cube. Layer 5 highlights the target cube, while layer 7 more…

Original abstract

Learning visual world models for planning requires compact latent dynamics that remain sensitive to actions, yet reconstruction-free joint-embedding objectives can collapse to action-insensitive representations. We propose Delta-JEPA, an end-to-end reconstruction-free world model that augments latent forward prediction with a Latent Difference Action Decoder (LDAD). Unlike inverse decoders that infer actions from concatenated endpoint embeddings, LDAD reconstructs the executed action from the latent displacement between consecutive observations. This displacement-level supervision directly regularizes transition geometry: adjacent embeddings cannot collapse without losing action information, and different actions are encouraged to induce distinguishable latent changes for rollout-based planning. Delta-JEPA uses only latent prediction and action reconstruction, avoiding pixel reconstruction and distribution-matching regularizers. Across four visual continuous-control tasks, Delta-JEPA improves planning over JEPA-based and representation-learning world model baselines. Ablations show that displacement-based action decoding is consistently more effective than endpoint concatenation, and action-sensitivity analyses show clearer action-conditioned latent responses. These results indicate that supervising latent differences is a simple and effective mechanism for collapse-resistant and action-sensitive world model learning.
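The abstract describes the objective only in words. One plausible way to write it down is given below, as a sketch under stated assumptions: the symbols $E$ (encoder), $f_\theta$ (latent predictor), $g_\phi$ (action decoder), the weight $\lambda$, and the choice of squared-error terms are all introduced here for illustration and are not taken from the paper.

```latex
z_t = E(o_t), \qquad
\mathcal{L}
  = \underbrace{\bigl\lVert f_\theta(z_t, a_t) - z_{t+1} \bigr\rVert_2^2}_{\text{latent forward prediction}}
  + \lambda \,
    \underbrace{\bigl\lVert g_\phi(z_{t+1} - z_t) - a_t \bigr\rVert_2^2}_{\text{action reconstruction from the displacement}}
```

Under the same notation, the endpoint-concatenation baseline mentioned in the abstract would replace $g_\phi(z_{t+1} - z_t)$ with $g_\phi([z_t; z_{t+1}])$. If $z_{t+1} = z_t$ for every transition, the displacement term reduces to $\lVert g_\phi(0) - a_t \rVert_2^2$, which cannot vary with $a_t$.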

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Delta-JEPA adds a latent-difference action decoder to JEPA and reports planning gains on four tasks, with ablations favoring the displacement approach over endpoint concatenation.


The main takeaway is that this paper introduces LDAD, which reconstructs actions from the latent displacement between consecutive observations instead of from concatenated embeddings. That change is positioned as a direct regularizer on transition geometry so that collapse would hurt action prediction and different actions produce more distinguishable deltas.

They keep the model reconstruction-free, relying only on latent forward prediction plus this action decoder. The abstract says this beats JEPA-based and other representation-learning baselines on planning across four visual continuous-control tasks, and the ablations show displacement decoding outperforming the concatenated version. They also report clearer action-conditioned latent responses.

The logic holds up without obvious circularity: if adjacent latents collapse, action reconstruction error rises, which matches the stated goal. The stress-test note confirms no load-bearing gaps in the argument as described.
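The collapse argument can be checked numerically on a toy case. The sketch below assumes scalar actions, scalar displacements, and a one-parameter decoder. None of this comes from the paper, and `best_constant_mse` and `decoder_mse` are hypothetical helpers.

```python
import random
from statistics import fmean
from typing import List


def best_constant_mse(actions: List[float]) -> float:
    """Lowest MSE any decoder can reach when its input is identical for
    every sample: it must output a constant, and the best constant is the mean."""
    mean_a = fmean(actions)
    return fmean((a - mean_a) ** 2 for a in actions)


def decoder_mse(deltas: List[float], actions: List[float], scale: float) -> float:
    """MSE of a one-parameter decoder a_hat = scale * delta."""
    return fmean((scale * d - a) ** 2 for d, a in zip(deltas, actions))


rng = random.Random(0)
actions = [rng.uniform(-1.0, 1.0) for _ in range(1000)]

# Collapsed encoder: z_{t+1} == z_t, so every displacement is zero.
collapsed_deltas = [0.0 for _ in actions]
collapsed_floor = best_constant_mse(actions)

# Action-sensitive toy encoder: the displacement is proportional to the action.
sensitive_deltas = [0.5 * a for a in actions]
sensitive_mse = decoder_mse(sensitive_deltas, actions, scale=2.0)
```

With collapsed latents the reconstruction error cannot fall below the variance of the actions, whereas the action-sensitive toy encoder admits a decoder with zero error. That gap is the pressure the letter describes.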

Soft spots are the usual ones for an abstract-level view: no quantitative effect sizes or variance numbers are visible here, so the practical size of the gains is unclear. The task set is narrow, and it would help to know how sensitive the result is to architecture choices or longer horizons. Nothing in the description suggests the improvements are artifacts of post-hoc fitting.

This is for readers already working on JEPA-style or latent world models for planning in robotics and simulation. Someone looking for a simple, verifiable regularizer to keep dynamics action-sensitive would find the mechanism worth testing.

I would send it to peer review. The core idea is straightforward, the internal consistency is there, and the reported ablations give a concrete place to start checking the claims.

Referee Report

0 major / 3 minor

Summary. The paper introduces Delta-JEPA, an end-to-end reconstruction-free world model that augments latent forward prediction with a Latent Difference Action Decoder (LDAD). LDAD reconstructs the executed action from the latent displacement between consecutive observations rather than from concatenated endpoint embeddings. The approach is claimed to directly regularize transition geometry, preventing collapse to action-insensitive representations while encouraging distinguishable latent changes for different actions. Experiments across four visual continuous-control tasks show improved planning performance over JEPA-based and representation-learning baselines, with ablations indicating that displacement-based decoding outperforms endpoint concatenation and yields clearer action-conditioned latent responses.

Significance. If the results hold, the LDAD mechanism provides a simple, reconstruction-free regularization of latent dynamics that directly ties action reconstruction error to embedding collapse, offering a lightweight alternative to pixel reconstruction or distribution-matching losses in world-model learning for planning. The internal logic is consistent (collapse increases action reconstruction error) and the reported ablations plus action-sensitivity analyses supply concrete support for the central mechanism.

minor comments (3)

[Abstract] The abstract states improvements on 'four visual continuous-control tasks' without naming them; listing the specific environments (with references to their standard implementations) would aid immediate assessment of scope and reproducibility.
[§3 or §4] The loss formulation for the LDAD term and its weighting relative to the latent prediction objective are not visible in the provided abstract; including the precise equations (e.g., the action reconstruction loss and any hyperparameters) in §3 or §4 would strengthen reproducibility.
Action-sensitivity analyses are mentioned but the quantitative metric (e.g., mutual information, classification accuracy, or distance between action-conditioned displacements) is not specified; adding this detail would make the supporting evidence easier to interpret.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary of Delta-JEPA, the assessment of its significance, and the recommendation for minor revision. The report lists no major comments, so we have no point-by-point responses to provide.

Circularity Check

0 steps flagged

No significant circularity in derivation chain


The paper defines Delta-JEPA by augmenting latent prediction with an explicit LDAD decoder that reconstructs actions from latent displacements; this is a deliberate architectural choice whose effect on collapse resistance and action sensitivity follows directly from the supervision objective rather than reducing to a fitted input or self-citation by construction. No equations, parameter-fitting procedures, or load-bearing self-citations are described that would equate the claimed improvements to the method's own inputs. The reported ablations and task results constitute independent empirical content.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; no free parameters, axioms, or invented physical entities are described.

pith-pipeline@v0.9.1-grok · 5767 in / 968 out tokens · 21416 ms · 2026-07-01T05:45:09.279844+00:00 · methodology


