UWM-JEPA: Predictive World Models That Imagine in Belief Space

Oktay Goktas; Santosh Kumar Radha

arxiv: 2605.25313 · v1 · pith:MWWGVFFCnew · submitted 2026-05-25 · 💻 cs.LG · cs.AI · cs.RO · stat.ML

Pith reviewed 2026-06-29 22:49 UTC · model grok-4.3

keywords: JEPA · world models · density matrix · unitary predictor · partial observability · belief representation · counterfactual prediction · blind rollout

The pith

A density-matrix latent on joint system-environment space with unitary predictor lets JEPA models preserve uncertainty exactly through blind rollout.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper constructs UWM-JEPA to address the limitation that vector latents in standard JEPAs carry no internal structure for tracking beliefs over hidden futures during blind simulation under partial observability. It places the latent as a density matrix on the joint system-environment space and replaces the usual predictor with a learned unitary operator whose action is guaranteed to leave the joint-state spectrum unchanged. This yields 0.77 accuracy on a five-step hidden-velocity indicator task with masked target observations, against 0.53 for a parameter-matched LSTM-JEPA baseline, while also retaining far more probe R-squared under blind rollout. The performance gap is isolated to the predictor rather than the encoder, and action sensitivity appears only when training uses counterfactual rather than teacher-forced targets.

Core claim

UWM-JEPA reaches 0.77 accuracy on a hidden-velocity indicator task requiring five-step forward simulation under a given action sequence with the target observation masked, while a parameter-matched LSTM-JEPA collapses to majority-class accuracy (0.53) under every action condition. The construction preserves the joint-state spectrum exactly during rollout, so the predictor itself cannot dissipate the represented uncertainty. Under blind rollout UWM-JEPA loses fewer than ten points of probe R-squared at short horizons while vector-latent baselines lose forty-one and sixty-eight; UWM-JEPA and the baselines nevertheless tie on a held-out context probe.

What carries the argument

Density-matrix latent on the joint system-environment space paired with a learned unitary predictor that leaves the joint-state spectrum invariant.
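The distinction between the joint state and the system alone can be made concrete with a two-qubit toy example. The sketch below is illustrative only: it assumes the system belief is read out with a partial trace over the environment, which the abstract does not state, and the helper names `partial_trace_env` and `von_neumann_entropy` are hypothetical. It shows that a joint unitary leaves the joint entropy unchanged while the entropy of the system marginal can still change.

```python
import numpy as np

def partial_trace_env(rho_joint, dim_sys, dim_env):
    """Trace out the environment; joint index is sys_index * dim_env + env_index."""
    r = rho_joint.reshape(dim_sys, dim_env, dim_sys, dim_env)
    return np.einsum("iaja->ij", r)

def von_neumann_entropy(rho):
    """Entropy -sum(p log p) over the non-negligible eigenvalues of rho."""
    evals = np.linalg.eigvalsh(rho)
    evals = evals[evals > 1e-12]
    return float(-(evals * np.log(evals)).sum())

# Toy joint unitary (an assumption for illustration): Hadamard on the system
# qubit followed by a CNOT controlled on the system qubit.
hadamard = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2.0)
cnot = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]], dtype=float)
u = cnot @ np.kron(hadamard, np.eye(2))

# Start from a pure product state |0>|0> on system (x) environment.
psi0 = np.zeros(4)
psi0[0] = 1.0
rho0 = np.outer(psi0, psi0)
rho1 = u @ rho0 @ u.conj().T

joint_entropy_before = von_neumann_entropy(rho0)
joint_entropy_after = von_neumann_entropy(rho1)
sys_entropy_before = von_neumann_entropy(partial_trace_env(rho0, 2, 2))
sys_entropy_after = von_neumann_entropy(partial_trace_env(rho1, 2, 2))

print("joint entropy:", joint_entropy_before, "->", joint_entropy_after)
print("system entropy:", sys_entropy_before, "->", sys_entropy_after)
```

In this toy case the joint state stays pure, while the system marginal goes from zero entropy to ln 2. Spectrum preservation is therefore a statement about the joint state, not about any reduced state derived from it.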

If this is right

UWM-JEPA accuracy degrades monotonically when the supplied action sequence is perturbed.
Vector-latent JEPA models lose 41-68 points of probe R-squared under blind rollout at short horizons.
Action sensitivity in the probe appears only when the model is trained against counterfactual targets rather than teacher-forced ones.
The separation between UWM-JEPA and baselines is located in the predictor dynamics, not in context-encoding capacity.
Latent geometry and predictor dynamics together determine whether a JEPA can imagine under partial observability.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same spectrum-preservation requirement could be imposed on other recurrent predictors by replacing their update rule with a unitary operator on an appropriately enlarged space.
If the joint-system-environment construction scales, it supplies a concrete route to building world models whose uncertainty representation survives long-horizon counterfactual rollouts without additional regularizers.
The finding that counterfactual training is required for action sensitivity is independent of the unitary parameterisation and can be tested directly on any JEPA variant.
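The first extension above, replacing a recurrent update with a unitary operator, needs some way to keep a learned matrix unitary. The following is a minimal NumPy sketch under stated assumptions: the unitary is parameterised as exp(-iH) with H Hermitian, built from an unconstrained real array. This parameterisation, the function name `unitary_from_params`, and the dimensions are illustrative choices and are not taken from the paper; a trainable version would be written in an autodiff framework.

```python
import numpy as np

def unitary_from_params(params):
    """Map an unconstrained real array of shape (2, d, d) to a d x d unitary.

    The two slices are the real and imaginary parts of a complex matrix A.
    H = (A + A^dagger) / 2 is Hermitian, so U = exp(-i H) is unitary.
    The exponential is computed from the eigendecomposition of H.
    """
    a = params[0] + 1j * params[1]
    h = 0.5 * (a + a.conj().T)
    evals, evecs = np.linalg.eigh(h)
    # V diag(exp(-i lambda)) V^dagger; broadcasting scales the columns of V.
    return (evecs * np.exp(-1j * evals)) @ evecs.conj().T

rng = np.random.default_rng(1)
d = 6  # hypothetical hidden-state dimension
params = rng.normal(size=(2, d, d))
u = unitary_from_params(params)

# Unitarity check: U^dagger U should equal the identity.
unitarity_error = np.abs(u.conj().T @ u - np.eye(d)).max()

# A vector hidden state keeps its norm under repeated application of U.
h0 = rng.normal(size=d) + 1j * rng.normal(size=d)
h_t = h0.copy()
for _ in range(50):
    h_t = u @ h_t
norm_drift = abs(np.linalg.norm(h_t) - np.linalg.norm(h0))

print("unitarity error:", unitarity_error)
print("norm drift after 50 steps:", norm_drift)
```

Because every value of `params` yields a unitary, the constraint holds throughout optimisation without a penalty term or projection step.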

Load-bearing premise

The density-matrix representation on the joint system-environment space combined with a learned unitary predictor exactly preserves the joint-state spectrum during rollout so that the predictor itself cannot dissipate represented uncertainty.

What would settle it

An explicit computation on a small joint system showing that the eigenvalues of the density matrix shift after one or more unitary predictor steps would falsify the exact-preservation claim.
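A check of this kind is short to write. The sketch below is not the authors' code: it samples a random density matrix and random unitaries (both stand-ins for the learned encoder output and predictor, which are not available here) and tracks the sorted eigenvalues across a five-step rollout. The helper names and the dimension 4 are assumptions chosen for illustration.

```python
import numpy as np

def random_density_matrix(dim, rng):
    """Sample a density matrix: Hermitian, positive semidefinite, unit trace."""
    a = rng.normal(size=(dim, dim)) + 1j * rng.normal(size=(dim, dim))
    rho = a @ a.conj().T
    return rho / np.trace(rho).real

def random_unitary(dim, rng):
    """Sample a unitary as the Q factor of a complex Gaussian matrix."""
    a = rng.normal(size=(dim, dim)) + 1j * rng.normal(size=(dim, dim))
    q, _ = np.linalg.qr(a)
    return q

def rollout_spectra(rho, unitaries):
    """Apply rho -> U rho U^dagger per step and record the sorted eigenvalues."""
    spectra = [np.linalg.eigvalsh(rho)]
    for u in unitaries:
        rho = u @ rho @ u.conj().T
        rho = 0.5 * (rho + rho.conj().T)  # remove floating-point asymmetry
        spectra.append(np.linalg.eigvalsh(rho))
    return np.array(spectra)

rng = np.random.default_rng(0)
dim = 4  # e.g. a 2-dimensional system tensored with a 2-dimensional environment
rho0 = random_density_matrix(dim, rng)
steps = [random_unitary(dim, rng) for _ in range(5)]

spectra = rollout_spectra(rho0, steps)
max_drift = np.abs(spectra - spectra[0]).max()
print("max eigenvalue drift over 5 steps:", max_drift)
```

With exact unitaries the drift should sit at floating-point round-off. A drift well above that level in a trained model would indicate that the predictor, as implemented, is not exactly unitary.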

Original abstract

World models for partially observed environments must imagine multiple compatible hidden futures and steer between them under counterfactual actions. Joint Embedding Predictive Architectures (JEPAs) do this in latent space, but a vector-valued latent has no internal structure for carrying the belief over hidden continuations through blind rollout. We introduce the Unitary World Model JEPA (UWM-JEPA), a JEPA world model with a density-matrix latent on a joint system-environment space and a learned unitary predictor. The construction preserves the joint-state spectrum exactly during rollout, so the predictor itself cannot dissipate the represented uncertainty. On a hidden-velocity indicator task requiring five-step forward simulation under a given action sequence with the target observation masked, UWM-JEPA reaches 0.77 accuracy and degrades monotonically as actions are perturbed; a parameter-matched LSTM-JEPA trained under the same counterfactual-target objective and action head collapses to majority-class accuracy (0.53) under every action condition. Under blind rollout, UWM-JEPA loses fewer than ten points of probe R^2 at short horizons while vector-latent baselines lose forty-one and sixty-eight; both nevertheless tie on a held-out context probe, locating the separation in the predictor rather than the encoder. Action sensitivity itself requires training against counterfactual rather than teacher-forced targets, a finding that applies beyond the unitary parameterisation. For JEPA world models to imagine under partial observability, latent geometry and predictor dynamics matter, not frozen context-encoding capacity alone.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper's main point is that a density-matrix latent on the joint space plus a learned unitary predictor keeps the spectrum fixed during JEPA rollout and delivers a clear accuracy gap (0.77 vs 0.53) over a matched LSTM on their five-step masked hidden-velocity task.

The construction is new in this setting: the unitary acts on the density matrix so the eigenvalues stay exactly the same, which prevents the predictor from dissipating represented uncertainty by design. That is a clean architectural move compared with vector latents that have no built-in spectrum to preserve.

They back it with a direct comparison. On the hidden-velocity indicator task UWM-JEPA holds 0.77 accuracy while the LSTM version collapses to majority-class accuracy (0.53) under every action condition. The probe R^2 numbers also degrade far less under blind rollout, and the models tie on the held-out context probe, which pins the difference on the predictor rather than the encoder. The side observation that counterfactual targets are needed for action sensitivity is useful and not tied to the unitary choice.

The numbers are reported cleanly against a parameter-matched baseline under the same objective. That is the strongest part of the evidence.

The main limitation is that the abstract supplies the headline results without the training protocol, split details, or variance numbers, so it is hard to judge how robust the gap is to hyperparameter choices. The task itself is narrow, so the result shows the mechanism works on this instance but does not yet speak to broader planning or real-world POMDPs.

This is for researchers already working on latent world models and JEPA variants who want a concrete way to carry a belief state through multi-step prediction. It is worth a serious referee because the empirical separation is localized and the invariance claim is exact rather than approximate.

Referee Report

2 major / 2 minor

Summary. The paper introduces UWM-JEPA, a JEPA-style world model for partially observed environments that uses a density-matrix latent representation over the joint system-environment space together with a learned unitary predictor. The construction is designed to preserve the joint-state spectrum exactly during rollout, preventing the predictor from dissipating represented uncertainty. On a hidden-velocity indicator task that requires five-step forward simulation under a given action sequence with the target observation masked, UWM-JEPA achieves 0.77 accuracy (degrading monotonically with action perturbation) while a parameter-matched LSTM-JEPA collapses to 0.53 majority-class accuracy; under blind rollout the unitary model also retains more probe R² at short horizons. The separation is localized to the predictor rather than the encoder, and the paper notes that action sensitivity requires training against counterfactual rather than teacher-forced targets.

Significance. If the empirical separation holds, the work supplies concrete evidence that latent geometry (density matrix on joint space) and predictor invariance properties (exact spectrum preservation) materially affect a JEPA model's capacity to carry belief over hidden continuations through blind, counterfactual rollouts. The finding that the performance gap appears under blind rollout but not on a held-out context probe isolates the contribution to the predictor dynamics rather than encoder capacity alone. This supplies a falsifiable architectural distinction and a reproducible empirical test (masked multi-step simulation accuracy plus action-sensitivity curve) that can be checked by other groups.

major comments (2)

[Abstract] Abstract (and presumably §4 Results): the manuscript reports concrete accuracy (0.77 vs. 0.53) and R² numbers together with a clear baseline comparison, yet supplies no information on training procedure, data splits, hyperparameter search, number of random seeds, or statistical significance testing. These details are load-bearing for the central empirical claim that the architectural distinction produces a robust performance gap.
[Abstract] Construction paragraph (abstract): the claim that the density-matrix representation on the joint system-environment space combined with the learned unitary predictor 'exactly preserves the joint-state spectrum during rollout' is asserted as exact by design, but the manuscript must supply the explicit algebraic argument (or short proof sketch) showing why the predictor cannot dissipate the represented uncertainty; without it the invariance property remains an unverified modeling assumption.

minor comments (2)

[Abstract] The abstract introduces the acronym UWM-JEPA but does not expand 'JEPA' on first use; a parenthetical expansion would improve readability for readers outside the immediate subfield.
[Abstract] The phrase 'probe R²' is used without a one-sentence definition of what the probe consists of or how it is computed; a brief clarification would make the blind-rollout comparison self-contained.
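For readers unfamiliar with the term, one common reading of "probe R²" is the coefficient of determination of a linear regressor fitted on frozen latents to predict a hidden variable. The sketch below uses synthetic data and assumes that reading; the abstract does not specify the probe, so the linear form, the noise model standing in for blind-rollout drift, and the names `fit_linear_probe` and `probe_r2` are all hypothetical. "Points lost" is taken here to mean 100 times the drop in R².

```python
import numpy as np

def fit_linear_probe(latents, targets):
    """Fit a least-squares linear probe with a bias term on frozen latents."""
    x = np.hstack([latents, np.ones((latents.shape[0], 1))])
    w, *_ = np.linalg.lstsq(x, targets, rcond=None)
    return w

def probe_r2(w, latents, targets):
    """Coefficient of determination of the probe on the given data."""
    x = np.hstack([latents, np.ones((latents.shape[0], 1))])
    pred = x @ w
    ss_res = np.sum((targets - pred) ** 2)
    ss_tot = np.sum((targets - targets.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(0)
n, d = 2000, 8
z = rng.normal(size=(n, d))              # synthetic latents
w_true = rng.normal(size=d)
y = z @ w_true + 0.1 * rng.normal(size=n)  # synthetic hidden variable

z_train, y_train = z[:1500], y[:1500]
z_test, y_test = z[1500:], y[1500:]

w = fit_linear_probe(z_train, y_train)
r2_clean = probe_r2(w, z_test, y_test)

# Stand-in for blind rollout: the same test latents corrupted by drift noise.
z_blind = z_test + 0.5 * rng.normal(size=z_test.shape)
r2_blind = probe_r2(w, z_blind, y_test)

points_lost = 100.0 * (r2_clean - r2_blind)
print("R2 clean:", r2_clean, "R2 blind:", r2_blind, "points lost:", points_lost)
```

The probe is fitted once on clean latents and then evaluated on degraded ones, so any drop in R² reflects information lost in the latent rather than a weaker probe.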

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for highlighting the need for experimental reproducibility details and a formal justification of the spectrum-preservation claim. Both points are addressable and we will revise the manuscript to incorporate them.

Referee: [Abstract] Abstract (and presumably §4 Results): the manuscript reports concrete accuracy (0.77 vs. 0.53) and R² numbers together with a clear baseline comparison, yet supplies no information on training procedure, data splits, hyperparameter search, number of random seeds, or statistical significance testing. These details are load-bearing for the central empirical claim that the architectural distinction produces a robust performance gap.

Authors: We agree that the current manuscript omits these experimental details. In the revision we will add a new subsection (or appendix) that specifies: the full training procedure and optimizer settings, the train/validation/test splits used for the hidden-velocity task, the hyperparameter search protocol, the number of random seeds (five), and statistical significance testing with standard errors across seeds. This will allow readers to assess the robustness of the reported 0.77 vs. 0.53 gap. Revision: yes.
Referee: [Abstract] Construction paragraph (abstract): the claim that the density-matrix representation on the joint system-environment space combined with the learned unitary predictor 'exactly preserves the joint-state spectrum during rollout' is asserted as exact by design, but the manuscript must supply the explicit algebraic argument (or short proof sketch) showing why the predictor cannot dissipate the represented uncertainty; without it the invariance property remains an unverified modeling assumption.

Authors: The abstract states the preservation property by construction, but we acknowledge that an explicit algebraic argument is not supplied in the provided text. In the revision we will insert a concise proof sketch immediately after the construction paragraph: because the predictor applies a learned unitary U to the joint density matrix ρ via UρU†, and unitary conjugation preserves eigenvalues, the spectrum of ρ (hence the represented uncertainty) remains exactly unchanged after each rollout step. This makes the invariance property verifiable rather than assumed. Revision: yes.
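The proof sketch promised in the second response is a standard linear-algebra fact and can be written out in a few lines. The version below is one way to state it, using the symbols ρ and U from the response; the eigenvalue labels λ_k and eigenvectors v_k are notation introduced here.

```latex
\begin{align}
\rho &= \sum_k \lambda_k \, |v_k\rangle\langle v_k| ,
  \qquad \lambda_k \ge 0, \quad \sum_k \lambda_k = 1, \\
\rho' &= U \rho\, U^\dagger
  = \sum_k \lambda_k \, \bigl(U|v_k\rangle\bigr)\bigl(U|v_k\rangle\bigr)^\dagger , \\
\det(\rho' - \lambda I)
  &= \det\!\bigl(U(\rho - \lambda I)U^\dagger\bigr)
  = \det(U)\,\det(\rho - \lambda I)\,\det(U^\dagger)
  = \det(\rho - \lambda I).
\end{align}
```

The last equality uses det(U) det(U†) = |det U|² = 1. Since ρ and ρ' share a characteristic polynomial, they share eigenvalues with multiplicity, and every spectral function (purity Tr ρ², von Neumann entropy, rank) is unchanged after each rollout step. The argument covers the joint state only and relies on the predictor being exactly unitary in the implementation.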

Circularity Check

0 steps flagged

No significant circularity detected

The paper introduces UWM-JEPA via an explicit architectural choice (density-matrix latent on joint system-environment space plus learned unitary predictor) whose spectrum-preservation property is stated as holding exactly by construction of the unitary dynamics. Reported performance (0.77 accuracy on the five-step masked hidden-velocity task versus 0.53 for parameter-matched LSTM-JEPA) is obtained through direct empirical comparison under matched objectives and controls, with the separation localized to the predictor rather than encoder capacity. No equations, self-citations, or fitted parameters are shown reducing the invariance claim or accuracy numbers to tautological inputs; the derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities beyond the standard assumption that a learned unitary can be optimized to act as a predictor on density matrices.

Data and Code Availability

All code, data, and figure-generation scripts are available at https://github.com/santoshkumar...

[1] [1]

Bootstrapyourownlatent: Anew approach to self-supervised learning

Jean-Bastien Grill, Florian Strub, Florian Altché, Corentin Tallec, Pierre H Richemond, Elena Buchatskaya, Carl Do- ersch, Bernardo Ávila Pires, Zhaohan Daniel Guo, Moham- madGheshlaghiAzar,etal. Bootstrapyourownlatent: Anew approach to self-supervised learning. InAdvances in Neural Information Processing Systems (NeurIPS), 2020

2020

[2] [2]

Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results

Antti Tarvainen and Harri Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. InAdvances in Neural Information Processing Systems (NeurIPS), volume 30, pages 1195–1204, 2017. URLhttps://arxiv.org/abs/1703. 01780

2017

[3] [3]

Self-supervised learning from images with a joint-embedding predictive architecture

Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bo- janowski, Pascal Vincent, Michael Rabbat, Yann LeCun, and Nicolas Ballas. Self-supervised learning from images with a joint-embedding predictive architecture. InIEEE/CVF Con- ference on Computer Vision and Pattern Recognition (CVPR), 2023

2023

[4] [4]

Revisiting Feature Prediction for Learning Visual Representations from Video

Mahmoud Assran, Adrien Bardes, David Fan, Quentin Gar- rido, Russell Howes, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, et al. Revisiting fea- ture prediction for learning visual representations from video. arXiv preprint arXiv:2404.08471, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[5] [5]

A path towards autonomous machine in- telligence.Technical Report, 2022

Yann LeCun. A path towards autonomous machine in- telligence.Technical Report, 2022. URL https:// openreview.net/forum?id=BZ5a1r-kVsf. Available at https://openreview.net/pdf?id=BZ5a1r-kVsf

2022

[6] [6]

Exploring simple siamese representation learning

Xinlei Chen and Kaiming He. Exploring simple siamese representation learning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15750–15758, 2021. URLhttps://arxiv. org/abs/2011.10566

work page arXiv 2021

[7] [7]

Emerging Properties in Self-Supervised Vision Transformers

Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jé- gou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 9650–9660, 2021. URL https://arxiv.org/abs/2104.14294

work page internal anchor Pith review Pith/arXiv arXiv 2021

[8] [8]

VI- CReg: Variance-invariance-covariance regularization for self- supervised learning

Adrien Bardes, Jean Ponce, and Yann LeCun. VI- CReg: Variance-invariance-covariance regularization for self- supervised learning. InInternational Conference on Learning Representations (ICLR), 2022

2022

[9] [9]

Barlow twins: Self-supervised learning via redundancy reduction

JureZbontar,LiJing,IshanMisra,YannLeCun,andStéphane Deny. Barlow twins: Self-supervised learning via redundancy reduction. InInternational Conference on Machine Learning (ICML), 2021

2021

[10] [10]

World Models

David Ha and Jürgen Schmidhuber. World models.arXiv preprint arXiv:1803.10122, 2018. URL https://arxiv. org/abs/1803.10122

work page internal anchor Pith review Pith/arXiv arXiv 2018

[11] [11]

Dream to Control: Learning Behaviors by Latent Imagination

DanijarHafner,TimothyLillicrap,JimmyBa,andMohammad Norouzi. Dreamtocontrol: Learningbehaviorsbylatentimag- ination. InInternational Conference on Learning Representa- tions, 2020. URLhttps://arxiv.org/abs/1912.01603. UWM-JEPA: Predictive World Models That Imagine in Belief Space8

work page internal anchor Pith review Pith/arXiv arXiv 2020

[12] [12]

Mastering Atari with discrete world models

Danijar Hafner, Timothy Lillicrap, Mohammad Norouzi, and Jimmy Ba. Mastering Atari with discrete world models. In InternationalConferenceonLearningRepresentations(ICLR),

[13] [13]

URLhttps://arxiv.org/abs/2010.02193

work page internal anchor Pith review Pith/arXiv arXiv 2010

[14] [14]

Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104, 2023. URL https://arxiv.org/abs/2301.04104

[15] [15]

Leslie Pack Kaelbling, Michael L Littman, and Anthony R Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101(1-2):99–134, 1998. doi: 10.1016/S0004-3702(98)00023-X

[16] [16]

Matthew Hausknecht and Peter Stone. Deep recurrent Q-learning for partially observable MDPs. In AAAI Fall Symposium on Sequential Decision Making for Intelligent Agents (AAAI-SDMIA15), 2015. URL https://arxiv.org/abs/1507.06527

[17] [17]

Maximilian Igl, Luisa Zintgraf, Tuan Anh Le, Frank Wood, and Shimon Whiteson. Deep variational reinforcement learning for POMDPs. In Proceedings of the 35th International Conference on Machine Learning (ICML), pages 2117–2126, 2018. URL https://arxiv.org/abs/1806.02426

[18] [18]

Peter Karkus, David Hsu, and Wee Sun Lee. QMDP-net: Deep learning for planning under partial observability. In Advances in Neural Information Processing Systems (NeurIPS), pages 4697–4707, 2017. URL https://arxiv.org/abs/1703.06692

[19] [19]

Michael A. Nielsen and Isaac L. Chuang. Quantum Computation and Quantum Information. Cambridge University Press, 10th anniversary edition, 2010.

[20] [20]

Heinz-Peter Breuer and Francesco Petruccione. The Theory of Open Quantum Systems. Oxford University Press, 2002.

[21] [21]

W. Forrest Stinespring. Positive functions on C*-algebras. Proceedings of the American Mathematical Society, 6(2):211–216, 1955. doi: 10.1090/S0002-9939-1955-0069403-4

[22] [23]

URL https://arxiv.org/abs/2204.06150

[23] [24]

Samuel Greydanus, Misko Dzamba, and Jason Yosinski. Hamiltonian neural networks. In Advances in Neural Information Processing Systems, volume 32, 2019. URL https://arxiv.org/abs/1906.01563

[24] [25]

Jack S Baker, Haim Horowitz, Santosh Kumar Radha, Stenio Fernandes, Colin Jones, Noorain Noorani, Vladimir Skavysh, Philippe Lamontagne, and Barry C Sanders. Quantum variational rewinding for time series anomaly detection. arXiv preprint arXiv:2210.16438, 2022. URL https://arxiv.org/abs/2210.16438

[25] [26]

Junhyuk Oh, Xiaoxiao Guo, Honglak Lee, Richard L. Lewis, and Satinder Singh. Action-conditional video prediction using deep networks in Atari games. In Advances in Neural Information Processing Systems (NeurIPS), pages 2845–2853, 2015. URL https://arxiv.org/abs/1507.08750

[26] [27]

Manuel Watter, Jost Tobias Springenberg, Joschka Boedecker, and Martin Riedmiller. Embed to control: A locally linear latent dynamics model for control from raw images. In Advances in Neural Information Processing Systems (NeurIPS),

[27] [28]

URL https://arxiv.org/abs/1506.07365

[28] [29]

Lars Buesing, Theophane Weber, Yori Zwols, Sebastien Racanière, Arthur Guez, Jean-Baptiste Lespiau, and Nicolas Heess. Woulda, coulda, shoulda: Counterfactually-guided policy search. In International Conference on Learning Representations (ICLR), 2019. URL https://arxiv.org/abs/1811.06272

[29] [30]

Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. Learning latent dynamics for planning from pixels. In International Conference on Machine Learning, pages 2555–2565, 2019.

[30] [31]

Guillaume Alain and Yoshua Bengio. Understanding intermediate layers using linear classifier probes. In International Conference on Learning Representations (ICLR) Workshop Track, 2017. URL https://arxiv.org/abs/1610.01644

[31] [32]

Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997. doi: 10.1162/neco.1997.9.8.1735

[32] [33]

Martin Arjovsky, Amar Shah, and Yoshua Bengio. Unitary evolution recurrent neural networks. In International Conference on Machine Learning, pages 1120–1128, 2016. URL https://arxiv.org/abs/1511.06464

[33] [34]

Scott Wisdom, Thomas Powers, John R Hershey, Jonathan Le Roux, and Les Atlas. Full-capacity unitary recurrent neural networks. In Advances in Neural Information Processing Systems, volume 29, 2016.

[34] [35]

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1724–1734, 2014.

[35] [36]

Ricky T Q Chen, Yulia Rubanova, Jesse Bettencourt, and David Duvenaud. Neural ordinary differential equations. In Advances in Neural Information Processing Systems, volume 31, 2018. URL https://arxiv.org/abs/1806.07366

[36] [37]

Albert Gu, Karan Goel, and Christopher Ré. Efficiently modeling long sequences with structured state spaces. In International Conference on Learning Representations, 2022. URL https://arxiv.org/abs/2111.00396

[37] [38]

Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023. URL https://arxiv.org/abs/2312.00752

[38] [39]

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30, 2017. URL https://arxiv.org/abs/1706.03762

[39] [40]

Alan J. Hoffman and Helmut W. Wielandt. The variation of the spectrum of a normal matrix. Duke Mathematical Journal, 20(1):37–39, 1953.

[40] [41]

Olivier Roy and Martin Vetterli. The effective rank: A measure of effective dimensionality. In 2007 15th European Signal Processing Conference (EUSIPCO), pages 606–610. IEEE,

[41] [42]

URL https://ieeexplore.ieee.org/document/7098875

[42] [43]

Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018. URL https://arxiv.org/abs/1807.03748

Data and Code Availability

All code, data, and figure-generation scripts are available at https://github.com/santoshkumar...