pith. machine review for the scientific record.

arxiv: 2604.26841 · v1 · submitted 2026-04-29 · 💻 cs.LG · cs.AI · cs.CL

Recognition: unknown

Language Diffusion Models are Associative Memories Capable of Retrieving Unseen Data

Bao Pham, Dmitry Krotov, Luca Ambrogioni, Matteo Negri, Mohammed J. Zaki

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 12:15 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords language diffusion models · associative memories · memorization to generalization transition · conditional entropy · basins of attraction · token recovery · discrete diffusion · creative capabilities
0 comments

The pith

Uniform-based discrete diffusion models function as associative memories that recover both training data and unseen examples through basins of attraction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that uniform-based discrete diffusion models act like associative memories by forming basins of attraction around data points through conditional likelihood maximization rather than energy functions. As training dataset size increases, recovery of training examples decreases while recovery of unseen test examples increases until both stabilize at comparable levels. This shift marks a transition from memorization to generalization and can be identified by tracking the conditional entropy of the model's token predictions, which drops to near zero in the memorization regime but stays finite during generalization. A sympathetic reader would care because the work supplies a concrete diagnostic for when these models are recalling stored data versus producing novel outputs.
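
The diagnostic described above can be sketched in a few lines: given a model's predicted distribution over the vocabulary at each position, compute the per-token conditional entropy and test whether it collapses toward zero. The function names and the 0.05 cutoff below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def token_conditional_entropy(probs: np.ndarray) -> np.ndarray:
    """Per-token entropy H = -sum_v p(v) log p(v) for an array of
    shape (seq_len, vocab_size) of predicted token distributions."""
    p = np.clip(probs, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=-1)

def regime(probs: np.ndarray, eps: float = 0.05) -> str:
    """Label a sequence 'memorization' when most token entropies
    collapse to (near) zero, else 'generalization'. The eps cutoff
    is an illustrative choice, not a value from the paper."""
    h = token_conditional_entropy(probs)
    return "memorization" if np.median(h) < eps else "generalization"

# A peaked (near one-hot) prediction vs. a spread-out one:
peaked = np.tile([0.997, 0.001, 0.001, 0.001], (8, 1))
spread = np.tile([0.4, 0.3, 0.2, 0.1], (8, 1))
print(regime(peaked))  # memorization
print(regime(spread))  # generalization
```

In practice `probs` would come from softmaxed model logits at a chosen diffusion time; the point here is only that the probe needs nothing beyond the model's own token predictions.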

Core claim

Uniform-based Discrete Diffusion Models behave as associative memories with emergent creative capabilities by forming distinct basins of attraction around both training and test data points through conditional likelihood maximization. As the size of the training dataset increases, basins around training examples shrink while those around unseen test examples expand until token recovery rates for both converge. The transition between memorization and generalization regimes can be detected solely by the conditional entropy of predicted token sequences, with vanishing entropy characterizing memorization and finite entropy indicating the generalization regime.

What carries the argument

Basins of attraction created by conditional likelihood maximization during the diffusion process, which enable reliable recovery of stored points without requiring an explicit energy function.

If this is right

  • Recovery performance on training data falls while performance on unseen test data rises as the training set enlarges.
  • Conditional entropy of token sequences provides a direct indicator of whether the model is in a memorization or generalization regime.
  • At sufficiently large dataset sizes the model recovers training and test examples at comparable rates.
  • The same conditional-likelihood mechanism that creates stable attractors around training points also creates attractors around unseen points.
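
The recovery measurement behind these predictions can be illustrated with a toy perturb-and-denoise loop. The `toy_denoise` step below is a deliberate caricature (snap to the nearest stored sequence under Hamming distance), not the paper's learned reverse process; vocabulary size, sequence length, and the t = 0.5 noise level are arbitrary choices that only loosely echo the figures' setup, and the toy shows how recovery is measured, not the memorization-to-generalization transition itself.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, LEN = 16, 32

def perturb(seq: np.ndarray, t: float = 0.5) -> np.ndarray:
    """Uniform corruption: each token is resampled uniformly with
    probability t, mimicking a uniform forward process at noise level t."""
    mask = rng.random(seq.shape) < t
    noise = rng.integers(0, VOCAB, seq.shape)
    return np.where(mask, noise, seq)

def toy_denoise(seq: np.ndarray, stored: np.ndarray) -> np.ndarray:
    """Stand-in reverse process: return the stored sequence nearest in
    Hamming distance -- a caricature of basin-of-attraction recall."""
    d = (stored != seq).sum(axis=1)
    return stored[d.argmin()]

def recovery_rate(examples: np.ndarray, stored: np.ndarray, t: float = 0.5) -> float:
    """Fraction of tokens recovered after perturb-then-denoise."""
    hits = [(toy_denoise(perturb(x, t), stored) == x).mean() for x in examples]
    return float(np.mean(hits))

train = rng.integers(0, VOCAB, size=(20, LEN))
rate = recovery_rate(train, train)
print(f"training token recovery: {rate:.2f}")
```

With random, well-separated stored sequences the toy recovers training tokens essentially perfectly; the paper's claim is precisely that a trained UDDM loses this behavior on training data, and gains it on test data, as the dataset grows.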

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The entropy-based probe could be applied to monitor regime shifts during training without storing the full dataset.
  • Similar basin-expansion dynamics may occur in other likelihood-based generative models when dataset size is scaled.
  • Adjusting training set size or monitoring entropy could offer a practical lever for balancing recall and novelty in deployed systems.

Load-bearing premise

Observed changes in how often the model recovers training versus test tokens as dataset size grows are produced by the growth and shrinkage of attraction basins under conditional likelihood maximization.

What would settle it

An experiment that varies training set size while holding model capacity fixed and finds that token recovery rates for training and test data do not follow the predicted opposing trends or that conditional entropy does not separate the two regimes.

Figures

Figures reproduced from arXiv: 2604.26841 by Bao Pham, Dmitry Krotov, Luca Ambrogioni, Matteo Negri, Mohammed J. Zaki.

Figure 1: Basins around training examples shrink and basins around test examples expand as the … view at source ↗

Figure 2: The convergence of corrupted training and test token recovery rates marks the phase … view at source ↗

Figure 3: Token-level conditional entropy highlights different token recovery behaviors. view at source ↗

Figure 4: Sequence conditional entropy highlights the memorization to generalization transition. view at source ↗

Figure 5: Average conditional entropy for training versus synthetic sequences. view at source ↗

Figure 9: Shrinkage and expansion of training and test samples’ basins of attraction during the … view at source ↗

Figure 10: An illustration of the model’s ability to recover tokens from perturbed sequences on training examples at three different fractions of the training dataset and sizes of the UDDMs. Perturbation is computed at t = 0.5 and the typical stochastic reverse process is performed afterwards. As the training dataset size increases, the model’s ability to recover perturbed tokens becomes worse and, in contrast, its g… view at source ↗

Figure 11: An illustration of the model’s ability to recover tokens from perturbed sequences on test examples at different fractions of the training dataset and sizes of UDDMs. Perturbation is computed at t = 0.5 and the stochastic reverse process is performed afterwards. In the beginning, the model is unable to recognize unperturbed unseen test tokens, which it often changes to other tokens. However, as the t… view at source ↗

Figure 12: An illustration of the density of conditional entropy for two categories of tokens, recovered and unrecovered, computed at t = 0.25 for the Tiny model. The subplots are ordered by the fraction of the training dataset, ranging from 10⁻⁴ (top-left) to 1.0 (bottom-right). As the fraction of training data increases, recovered tokens concentrate near zero entropy (high confidence), while unrecovered tokens exhibit… view at source ↗

Figure 13: An illustration of the density of conditional entropy for two categories of tokens, recovered and unrecovered, computed at t = 0.25 for the Small model. The subplots are ordered by the fraction of the training dataset, ranging from 10⁻⁴ (top-left) to 1.0 (bottom-right). As the fraction of training data increases, recovered tokens concentrate near zero entropy (high confidence), while unrecovered tokens exhibit… view at source ↗

Figure 14: An illustration of the density of conditional entropy for two categories of tokens, recovered and unrecovered, computed at t = 0.25 for the Medium model. The subplots are ordered by the fraction of the training dataset, ranging from 10⁻⁴ (top-left) to 1.0 (bottom-right). As the fraction of training data increases, recovered tokens concentrate near zero entropy (high confidence), while unrecovered tokens exhibit… view at source ↗

Figure 15: An illustration of the evolution of the density of the average conditional entropy for the probabilities of training and synthetic sequences respectively, computed at t = 10⁻⁵ using the Tiny models, as the training dataset size grows. When the fraction of the training set is small, there exists a separation in the average conditional entropies of training and synthetic samples. However, as the training dataset… view at source ↗

Figure 16: An illustration of the evolution of the density of the average conditional entropy for the probabilities of training and synthetic sequences respectively, computed at t = 10⁻⁵ using the Small models, as the training dataset size grows. When the fraction of the training set is small, there exists a separation in the average conditional entropies of training and synthetic samples. However, as the training dataset… view at source ↗

Figure 17: An illustration of the evolution of the density of the average conditional entropy for the probabilities of training and synthetic sequences respectively, computed at t = 10⁻⁵ using the Medium models, as the training dataset size grows. When the fraction of the training set is small, there exists a separation in the average conditional entropies of training and synthetic samples. However, as the training dataset… view at source ↗
read the original abstract

When do language diffusion models memorize their training data, and how to quantitatively assess their true generative regime? We address these questions by showing that Uniform-based Discrete Diffusion Models (UDDMs) fundamentally behave as Associative Memories (AMs) $\textit{with emergent creative capabilities}$. The core idea of an AM is to reliably recover stored data points as $\textit{memories}$ by establishing distinct basins of attraction around them. Historically, models like Hopfield networks use an explicit energy function to guarantee these stable attractors. We broaden this perspective by leveraging the observation that energy is not strictly necessary, as basins of attraction can also be formed via conditional likelihood maximization. By evaluating token recovery of $\textit{training}$ and $\textit{test}$ examples, we identify in UDDMs a sharp memorization-to-generalization transition governed by the size of the training dataset: as it increases, basins around training examples shrink and basins around unseen test examples expand, until both later converge to the same level. Crucially, we can detect this transition using only the conditional entropy of predicted token sequences: memorization is characterized by vanishing conditional entropy, while in the generalization regime the conditional entropy of most tokens remains finite. Thus, conditional entropy offers a practical probe for the memorization-to-generalization transition in deployed models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 0 minor

Summary. The manuscript claims that Uniform-based Discrete Diffusion Models (UDDMs) fundamentally behave as associative memories (AMs) with emergent creative capabilities. Basins of attraction are formed around data points via conditional likelihood maximization (without explicit energy functions). Experiments on token recovery of training and test examples reveal a sharp memorization-to-generalization transition as training dataset size grows: basins around training points shrink while those around unseen test points expand until recovery rates converge. This transition is detectable using only the conditional entropy of predicted token sequences, with vanishing entropy indicating memorization and finite entropy indicating generalization.

Significance. If the empirical results and basin interpretation hold, the work offers a unifying perspective that connects discrete diffusion models to classical associative memory models such as Hopfield networks. The conditional-entropy probe could provide a practical, training-data-free diagnostic for memorization regimes in deployed generative models. The framing also highlights how scaling dataset size can induce generalization in diffusion-based language models, which may inform training strategies and evaluation practices.

major comments (3)
  1. Abstract and experimental results: Token-recovery rates after diffusion corruption are reported, but no tests of stability under additional perturbations, no comparisons to non-diffusion baselines, and no direct derivation of basin geometry from the conditional likelihood are provided. These omissions leave the central attribution of the observed transition to shrinking/expanding basins of attraction unsupported; the data are also consistent with ordinary scaling-law behavior.
  2. Abstract: The manuscript states that the transition 'can be detected using only the conditional entropy of predicted token sequences,' yet no quantitative validation (e.g., correlation coefficients, statistical significance, or ablation against other metrics) is described to establish entropy as a reliable basin detector rather than a correlated symptom of memorization.
  3. Experimental evaluation: No details are supplied on model sizes, dataset construction, statistical tests, controls for confounds such as optimization dynamics or capacity effects, or the precise definition of the 'sharp' transition threshold. Without these, the claimed sharpness and causal link to basin dynamics cannot be assessed.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment point by point below and indicate the revisions that will be incorporated into the next version of the manuscript.

read point-by-point responses
  1. Referee: Abstract and experimental results: Token-recovery rates after diffusion corruption are reported, but no tests of stability under additional perturbations, no comparisons to non-diffusion baselines, and no direct derivation of basin geometry from the conditional likelihood are provided. These omissions leave the central attribution of the observed transition to shrinking/expanding basins of attraction unsupported; the data are also consistent with ordinary scaling-law behavior.

    Authors: The reported recovery rates exhibit a distinctive crossover: as training set size grows, recovery of training examples declines while recovery of unseen test examples rises until the two converge. This relative decline for training points is inconsistent with standard scaling laws, which predict monotonic gains on both seen and unseen data. The basin interpretation follows directly from the model's objective of conditional likelihood maximization, which creates stable attractors around high-likelihood sequences without requiring an explicit energy function. In revision we will add (i) a formal derivation linking conditional likelihood maximization to basin geometry, (ii) stability tests under additional perturbations such as random token substitutions, and (iii) comparisons against non-diffusion baselines (e.g., a standard autoregressive transformer) to isolate diffusion-specific effects. revision: yes

  2. Referee: Abstract: The manuscript states that the transition 'can be detected using only the conditional entropy of predicted token sequences,' yet no quantitative validation (e.g., correlation coefficients, statistical significance, or ablation against other metrics) is described to establish entropy as a reliable basin detector rather than a correlated symptom of memorization.

    Authors: We agree that quantitative validation is needed. The revised manuscript will report Pearson correlation coefficients and p-values between conditional entropy and token-recovery accuracy across dataset sizes, together with ablations comparing entropy against alternative metrics such as prediction variance and KL divergence. These additions will establish entropy as a reliable, training-data-free probe for the memorization-to-generalization transition. revision: yes
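
    The promised correlation analysis is simple to sketch. Everything below is synthetic: the entropy and recovery values are invented to show the expected negative relationship across dataset sizes, and no result from the actual revision is implied.

    ```python
    import numpy as np

    def pearson_r(x, y) -> float:
        """Plain Pearson correlation coefficient via centered sums."""
        x, y = np.asarray(x, float), np.asarray(y, float)
        xc, yc = x - x.mean(), y - y.mean()
        return float((xc * yc).sum() / np.sqrt((xc**2).sum() * (yc**2).sum()))

    # Invented illustration: as mean conditional entropy rises across
    # dataset sizes, training-token recovery falls.
    entropy  = [0.01, 0.05, 0.20, 0.60, 1.10, 1.30]
    recovery = [1.00, 0.98, 0.85, 0.60, 0.42, 0.40]
    r = pearson_r(entropy, recovery)
    print(f"Pearson r = {r:.3f}")  # strongly negative
    ```

    A real validation would compute this over measured (entropy, recovery) pairs per dataset size and model, alongside the significance tests and metric ablations the rebuttal promises.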

  3. Referee: Experimental evaluation: No details are supplied on model sizes, dataset construction, statistical tests, controls for confounds such as optimization dynamics or capacity effects, or the precise definition of the 'sharp' transition threshold. Without these, the claimed sharpness and causal link to basin dynamics cannot be assessed.

    Authors: We will supply all omitted details in the revision: model architectures and sizes, dataset construction and split procedures, statistical tests with multiple random seeds and error bars, controls for confounds (fixed capacity while varying data size, hyperparameter sweeps), and a precise definition of the transition threshold (e.g., the dataset size at which train and test recovery rates differ by less than a fixed epsilon). These additions will allow readers to evaluate the sharpness of the transition and its link to basin dynamics. revision: yes
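
    The epsilon-based threshold proposed here is easy to make concrete; the dataset sizes and recovery rates below are invented for illustration, and `eps = 0.02` is an arbitrary choice.

    ```python
    def transition_size(sizes, train_rec, test_rec, eps: float = 0.02):
        """Smallest dataset size at which train and test token-recovery
        rates differ by less than eps -- one concrete reading of the
        proposed transition threshold (all numbers illustrative)."""
        for n, tr, te in zip(sizes, train_rec, test_rec):
            if abs(tr - te) < eps:
                return n
        return None  # no transition within the scanned range

    sizes     = [1e3, 1e4, 1e5, 1e6, 1e7]
    train_rec = [1.00, 0.97, 0.85, 0.62, 0.55]
    test_rec  = [0.10, 0.25, 0.48, 0.61, 0.55]
    print(transition_size(sizes, train_rec, test_rec))  # 1000000.0
    ```

    With multiple seeds, the spread of this threshold across runs would also quantify how "sharp" the transition actually is.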

Circularity Check

0 steps flagged

No circularity: empirical recovery rates and entropy measurements are independent of the AM basin interpretation.

full rationale

The paper's central claims rest on direct empirical observations of token recovery accuracy for training versus test examples as a function of dataset size, together with conditional entropy of predicted tokens. These quantities are measured independently and then interpreted as evidence for shrinking/expanding basins of attraction formed by conditional likelihood maximization. No equations or definitions reduce the observed recovery rates or entropy values to the target AM behavior by construction. No self-citations, fitted parameters renamed as predictions, or uniqueness theorems are invoked as load-bearing steps. The derivation chain is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Only the abstract is available, so the ledger is necessarily incomplete. The central broadening of the associative-memory definition is treated as an axiom.

axioms (1)
  • domain assumption Basins of attraction can be formed via conditional likelihood maximization without requiring an explicit energy function.
    This is the explicit broadening of the classical associative-memory perspective stated in the abstract.

pith-pipeline@v0.9.0 · 5548 in / 1398 out tokens · 80773 ms · 2026-05-07T12:15:44.661350+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and pith papers without signing in.

Reference graph

Works this paper leans on

55 extracted references · 16 canonical work pages
