Recognition: 2 theorem links · Lean Theorem
Factual recall in linear associative memories: sharp asymptotics and mechanistic insights
Pith reviewed 2026-05-12 04:09 UTC · model grok-4.3
The pith
Linear associative memories store up to p associations, where p log p ≈ d²/2.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In the decoupled model each input has its own independent set of competing outputs. Statistical physics then shows that the maximum number of storable associations p_c satisfies p_c log p_c / d² = 1/2. Numerical and analytical checks indicate that this decoupled model reproduces the storage capacity, weight spectra, and learning mechanism of the original linear associative memory. The optimal solution raises the correct output score just above the extreme-value threshold of the competing outputs, rather than relying on broad fluctuations as Hebbian learning does. The computation of p_c generalizes directly to linear two-layer architectures.
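For orientation on what this scaling means numerically, the relation p_c log p_c = d²/2 can be inverted with the Lambert W function. The sketch below is not taken from the paper; the dimensions are arbitrary and scipy's lambertw is used purely for convenience.

```python
import numpy as np
from scipy.special import lambertw

def critical_capacity(d):
    """Invert p * log(p) = d**2 / 2 for p via the Lambert W function.

    This merely restates the asymptotic relation quoted above; the paper's
    own derivation is a statistical-physics (replica) calculation.
    """
    c = d**2 / 2
    return float(np.real(c / lambertw(c)))

for d in (64, 256, 1024):
    p_c = critical_capacity(d)
    print(f"d={d:5d}  p_c ≈ {p_c:12.0f}  check: p_c*log(p_c)/d^2 = "
          f"{p_c * np.log(p_c) / d**2:.3f}")
```

Read off this way, p_c grows like d²/(4 log d) up to lower-order corrections, far beyond the d associations a d×d linear map could interpolate exactly.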
What carries the argument
The decoupled model, in which each input is paired with an independent set of competing targets, allowing separate treatment of the p constraints per association.
If this is right
- The storage capacity scales asymptotically as p_c ≈ d² / (2 log p_c).
- Optimal learning raises correct scores just above the extreme-value threshold of competitors rather than boosting alignments with broad fluctuations (see the sketch after this list).
- The same capacity calculation applies to linear two-layer networks.
- The decoupled construction supplies a baseline for capacity in deeper or nonlinear networks.
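A toy numerical illustration of that contrast, not the paper's experiment: Gaussian input and target embeddings are assumed, and full-batch subgradient descent on a hinge loss over the separation constraints stands in for the optimal solution; the sizes, margin, and learning rate below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d, p = 64, 200                                   # p*log(p)/d^2 ≈ 0.26, below the claimed threshold of 1/2
X = rng.standard_normal((p, d)) / np.sqrt(d)     # input embeddings
Y = rng.standard_normal((p, d)) / np.sqrt(d)     # target embeddings

def stored_fraction(W):
    """Fraction of associations whose correct target outscores every competitor."""
    S = (X @ W.T) @ Y.T                          # S[i, j] = y_j . (W x_i)
    correct = np.diag(S).copy()
    np.fill_diagonal(S, -np.inf)
    return np.mean(correct > S.max(axis=1))

# Naive Hebbian rule: W = sum_i y_i x_i^T
W_hebb = Y.T @ X

# Margin-trained stand-in for the optimal solution: subgradient descent on
# sum_{i, j != i} max(0, kappa - (y_i - y_j) . (W x_i)).
W, kappa, lr = np.zeros((d, d)), 0.05, 0.5
for _ in range(2000):
    S = (X @ W.T) @ Y.T
    viol = (S - np.diag(S)[:, None] + kappa) > 0   # competitors j violating the margin for input i
    np.fill_diagonal(viol, False)
    grad = (viol.astype(float) @ Y).T @ X - (Y * viol.sum(axis=1, keepdims=True)).T @ X
    W -= lr * grad / p

print("Hebbian fraction stored:", stored_fraction(W_hebb))
print("Trained fraction stored:", stored_fraction(W))
```

If the run behaves as intended, the margin-trained matrix stores a noticeably larger fraction of the associations than the Hebbian one at the same size, which is the qualitative gap described in the bullet above.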
Where Pith is reading between the lines
- The same extreme-value threshold mechanism may appear in the output layers of trained transformers when factual recall is examined; the numerical check after this list illustrates the threshold in question.
- Decoupling competing outputs could simplify capacity bounds for nonlinear activation functions.
- Weight matrices of trained linear models can be checked for the predicted threshold-raising pattern rather than uniform Hebbian scaling.
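For concreteness on what "extreme-value threshold" means here: for roughly Gaussian competitor scores, the maximum of p independent draws concentrates near sqrt(2 log p). This is a standard extreme-value fact rather than a result taken from the paper, and the Gaussian-score assumption is ours.

```python
import numpy as np

rng = np.random.default_rng(1)
# Empirical maxima of p i.i.d. standard Gaussian "competitor scores" versus the
# leading-order extreme-value prediction sqrt(2 log p).  Convergence is slow
# (there are log log p corrections), so the agreement is only approximate.
for p in (10**3, 10**5, 10**7):
    maxima = [rng.standard_normal(p).max() for _ in range(20)]
    print(f"p={p:>9d}  mean max = {np.mean(maxima):.3f}  "
          f"sqrt(2 log p) = {np.sqrt(2 * np.log(p)):.3f}")
```

The mechanistic reading is that an optimal solution only needs to lift each correct score just above this slowly growing level, which is the contrast with Hebbian broad fluctuations drawn above.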
Load-bearing premise
The decoupled model is equivalent to the original associative memory in capacity, weight spectra, and storage mechanism.
What would settle it
A simulation for large d in which the measured ratio p log p / d² in the original model deviates from 1/2 while the decoupled model stays at 1/2 would falsify the claimed equivalence.
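At small sizes, such a check can be run directly as a linear-programming feasibility problem over the separation constraints. The sketch below is a brute-force stand-in, not the paper's procedure: Gaussian embeddings, the margin, the box bound on W, and the tiny dimensions are all assumptions, and finite-size effects at these sizes are large.

```python
import numpy as np
from scipy.optimize import linprog

def storable(X, Y, decoupled=False, margin=1e-3, bound=10.0, rng=None):
    """LP feasibility: does some W satisfy (y_i - t) . (W x_i) >= margin for
    every competitor t of association i?  In the original model the competitors
    are the other true targets; in the decoupled variant each input gets its
    own freshly drawn competitors.  Only viable at small d and p."""
    p, d = X.shape
    rows = []
    for i in range(p):
        comp = (rng.standard_normal((p - 1, d)) / np.sqrt(d) if decoupled
                else np.delete(Y, i, axis=0))
        diff = Y[i] - comp                                          # (p-1, d)
        rows.append((diff[:, :, None] * X[i][None, None, :]).reshape(p - 1, d * d))
    A = np.concatenate(rows)                                        # one row per constraint, acting on vec(W)
    res = linprog(np.zeros(d * d), A_ub=-A, b_ub=-margin * np.ones(len(A)),
                  bounds=(-bound, bound), method="highs")
    return res.status == 0                                          # 0 = a feasible W was found

rng = np.random.default_rng(0)
d = 8
for p in (10, 20, 40):
    X = rng.standard_normal((p, d)) / np.sqrt(d)
    Y = rng.standard_normal((p, d)) / np.sqrt(d)
    print(f"p={p:3d}  p*log(p)/d^2 = {p * np.log(p) / d**2:.2f}  "
          f"original: {storable(X, Y, rng=rng)}  "
          f"decoupled: {storable(X, Y, decoupled=True, rng=rng)}")
```

Repeating this over many disorder samples while increasing d, and locating where each variant flips from storable to unstorable, is the finite-size version of the comparison described above; the equivalence claim predicts the two thresholds track each other.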
Original abstract
Large language models demonstrate remarkable ability in factual recall, yet the fundamental limits of storing and retrieving input--output associations with neural networks remain unclear. We study these limits in a minimal setting: a linear associative memory that maps $p$ input embeddings in $\mathbb{R}^d$ to their corresponding~$d$-dimensional targets via a single layer, requiring each mapped input to be well separated from all other targets. Unlike in supervised classification, this strict separation induces~$p$ constraints per association and produces strong correlations between constraints that make a direct characterisation of the storage capacity difficult. Here, we provide a precise characterisation of this capacity in the following way. We first introduce a decoupled model in which each input has its own independent set of competing outputs, and provide numerical and analytical evidence that this decoupled model is equivalent to the original model in terms of storage capacity, spectra of the learnt weights, and storage mechanism. Using tools from statistical physics, we show that the decoupled model can store up to $p_c \log p_c / d^2 = 1 / 2$ associations, and generalise the computation of $p_c$ to linear two-layer architectures. Our analysis also gives mechanistic insight into how the optimal solution improves over a na\"ive Hebbian learning rule: rather than boosting input-output alignments with broad fluctuations, the optimal solution raises the correct scores just above the extreme-value threshold set by the competing outputs. These findings give a sharp statistical-physics characterisation of factual storage in linear networks and provide a baseline for understanding the memory capacity of more realistic neural architectures.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that a decoupled model for linear associative memories (mapping p inputs in R^d to d-dimensional targets with strict separation) is equivalent to the original model in storage capacity, weight spectra, and mechanism, based on numerical and analytical evidence. It then applies statistical-physics tools to the decoupled model to derive the sharp capacity p_c log p_c / d^2 = 1/2 and generalizes the computation of p_c to linear two-layer architectures, while providing mechanistic insight that optimal solutions raise correct scores just above the extreme-value threshold of competitors (unlike naive Hebbian rules).
Significance. If the equivalence holds rigorously in the p, d → ∞ limit with p log p / d² fixed, the work supplies a precise statistical-physics characterization of factual-recall capacity in linear networks and a mechanistic baseline for more complex architectures. The derivation of sharp asymptotics via stat-phys methods and the explicit comparison to Hebbian learning are strengths that would be valuable if the central transfer of results is justified.
major comments (2)
- [Abstract] The equivalence of the decoupled model to the original (in capacity, spectra, and mechanism) is supported only by 'numerical and analytical evidence.' The headline capacity p_c log p_c / d^2 = 1/2 is obtained exclusively on the decoupled model; without a rigorous demonstration that the constraint distributions or partition functions converge in the relevant scaling limit (e.g., via total-variation bounds or coupling), the exact constant 1/2 does not necessarily transfer to the original associative memory.
- [Section introducing the decoupled model and equivalence evidence] The paper asserts equivalence on the basis of evidence whose independence and scaling-regime validity are not detailed. If this evidence is only finite-size or approximate, the mechanistic claim that 'the optimal solution raises the correct scores just above the extreme-value threshold' and the generalization of p_c to two-layer nets cannot be taken as applying to the original model.
minor comments (1)
- [Abstract] The escaped character in 'na\"ive' should be rendered as 'naïve'.
Simulated Author's Rebuttal
We thank the referee for their thorough review and valuable feedback on our manuscript. We address the concerns regarding the rigor of the equivalence between the original and decoupled models in our point-by-point responses below. While we maintain that the numerical and analytical evidence strongly supports the equivalence in the relevant scaling limits, we agree that additional clarifications will improve the presentation.
Point-by-point responses
-
Referee: [Abstract] The equivalence of the decoupled model to the original (in capacity, spectra, and mechanism) is supported only by 'numerical and analytical evidence.' The headline capacity p_c log p_c / d^2 = 1/2 is obtained exclusively on the decoupled model; without a rigorous demonstration that the constraint distributions or partition functions converge in the relevant scaling limit (e.g., via total-variation bounds or coupling), the exact constant 1/2 does not necessarily transfer to the original associative memory.
Authors: We appreciate this point and acknowledge that our demonstration of equivalence relies on numerical simulations across a range of system sizes and analytical arguments showing matching statistics in the large-d limit, rather than a fully rigorous proof of convergence (such as total-variation bounds on the constraint distributions). The analytical evidence includes showing that the partition functions and effective constraint distributions become equivalent under the scaling p log p ∼ d²/2. However, a complete rigorous transfer is indeed not provided. In the revised manuscript, we will modify the abstract and relevant sections to state more precisely that the sharp asymptotic capacity is derived for the decoupled model, with the original model expected to share the same capacity on the basis of the supporting evidence. We will also add a paragraph discussing the limitations and the strength of the evidence for equivalence. revision: partial
-
Referee: [Section introducing the decoupled model and equivalence evidence] The paper asserts equivalence on the basis of evidence whose independence and scaling-regime validity are not detailed. If this evidence is only finite-size or approximate, the mechanistic claim that 'the optimal solution raises the correct scores just above the extreme-value threshold' and the generalization of p_c to two-layer nets cannot be taken as applying to the original model.
Authors: We agree that the section could benefit from more explicit details on the independence of the constraints in the decoupled model and the scaling regime. The evidence includes both finite-size numerics showing convergence as d increases (with p log p / d² fixed) and analytical calculations demonstrating that the weight spectra and optimization mechanisms match. The mechanistic insight is derived from the decoupled model but supported by matching behavior in the original. To address this, we will expand the section with additional details on the scaling validity, including plots of capacity convergence with system size, and clarify that while the claims are proved for the decoupled model, the equivalence evidence justifies applying the insights to the original. This revision will make the applicability clearer. revision: yes
Circularity Check
No significant circularity in derivation chain
full rationale
The paper derives the capacity result p_c log p_c / d^2 = 1/2 exclusively for the decoupled model via standard statistical-physics analysis. Equivalence to the original model (in capacity, spectra, and mechanism) is asserted separately on the basis of numerical simulations and analytical arguments, which are presented as independent supporting evidence rather than a definitional reduction or self-referential fit. No load-bearing step reduces by construction to its own inputs, no self-citation chain is invoked for uniqueness or ansatz, and no known result is merely renamed. The derivation remains self-contained.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: replica-symmetric or mean-field assumptions typical of statistical-physics analyses of high-dimensional systems
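For orientation, this is the generic shape that assumption takes in such a calculation, written in standard replica notation rather than the paper's specific formulas (V stands for the volume of weight matrices satisfying all separation constraints, matching the V_DP(E, U) excerpt quoted in the next section):

```latex
% Quenched free entropy of the constraint-satisfying volume V (generic form):
\[
  \phi_d \;=\; \frac{1}{d^{2}}\,\mathbb{E}\log V,
  \qquad
  \mathbb{E}\log V \;=\; \lim_{n\to 0}\frac{1}{n}\,\log \mathbb{E}\,V^{n}.
\]
% The replica-symmetric ansatz takes the overlap between any two distinct
% replicas a != b to be a single value q, with a common diagonal value Q:
\[
  Q_{ab} \;=\; Q\,\delta_{ab} \;+\; q\,\bigl(1-\delta_{ab}\bigr).
\]
```

Whether replica symmetry is exact for this problem is precisely the domain assumption the ledger flags.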
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear · "We first introduce a decoupled model ... Using tools from statistical physics, we show that the decoupled model can store up to p_c log p_c / d² = 1/2 associations"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear · "replica-symmetric assumption ... φ_d := d^{-2} E log V_DP(E,U)"
Reference graph
Works this paper leans on
-
[1]
Petroni, F.et al. Language Models as Knowledge Bases?inProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)(eds Inui, K., Jiang, J., Ng, V. & Wan, X.) (Association for Computational Linguistics, Hong Kong, China, Nov. 2019), 2463–2473...
work page 2019
-
[2]
Geva, M., Schuster, R., Berant, J. & Levy, O.Transformer feed-forward layers are key-value memories inProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing(2021), 5484–5495 (cit. on p. 1)
work page 2021
-
[3]
Meng, K., Bau, D., Andonian, A. & Belinkov, Y. Locating and editing factual associations in gpt. Advances in neural information processing systems35,17359–17372 (2022) (cit. on p. 1)
work page 2022
-
[4]
Roberts, A., Raffel, C. & Shazeer, N.How Much Knowledge Can You Pack Into the Parameters of a Language Model?inProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)(eds Webber, B., Cohn, T., He, Y. & Liu, Y.) (Association for Computational Linguistics, Online, Nov. 2020), 5418–5426 (cit. on p. 1)
work page 2020
-
[5]
Allen-Zhu, Z. & Li, Y.Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Lawsin International Conference on Learning Representations(eds Yue, Y., Garg, A., Peng, N., Sha, F. & Yu, R.) 2025(2025), 14937–14946 (cit. on p. 1)
work page 2025
-
[6]
Cabannes, V., Dohmatob, E. & Bietti, A.Scaling Laws for Associative MemoriesinInternational Conference on Learning Representations(eds Kim, B.et al.)2024(2024), 9718–9747 (cit. on pp. 2, 4)
work page 2024
-
[7]
Cabannes, V., Simsek, B. & Bietti, A.Learning Associative Memories with Gradient Descentin Proceedings of the 41st International Conference on Machine Learning(eds Salakhutdinov, R.et al.) 235(PMLR, 2024), 5114–5134 (cit. on p. 2)
work page 2024
-
[8]
Nichani, E., Lee, J. & Bietti, A.Understanding Factual Recall in Transformers via Associative Memories inInternational Conference on Learning Representations(eds Yue, Y., Garg, A., Peng, N., Sha, F. & Yu, R.)2025(2025), 4166–4207 (cit. on pp. 2, 4)
work page 2025
-
[9]
Vural, N. M., Bietti, A., Soltanolkotabi, M. & Wu, D.Learning to Recall with Transformers Beyond Orthogonal EmbeddingsinThe Fourteenth International Conference on Learning Representations (2026) (cit. on p. 2)
work page 2026
-
[10]
Kim, J., Nichani, E., Wu, D., Bietti, A. & Lee, J. D.Sharp Capacity Scaling of Spectral Optimizers in Learning Associative Memory2026. arXiv:2603.26554 [cs.LG](cit. on pp. 2, 10)
work page · Pith review · arXiv
-
[11]
Gardner, E. & Derrida, B. Optimal storage properties of neural network models.Journal of Physics A: Mathematical and Theoretical21,271–284 (1988) (cit. on pp. 2–4, 9)
work page 1988
-
[12]
Cover, T. M. Geometrical and Statistical Properties of Systems of Linear Inequalities with Applications in Pattern Recognition.IEEE Transactions on Electronic ComputersEC-14,326–334 (1965) (cit. on p. 3)
work page 1965
-
[13]
Krauth, W. & Mézard, M. Storage capacity of memory networks with binary couplings.Journal de Physique50,3057–3066 (1989) (cit. on p. 3)
work page 1989
-
[14]
Franz, S., Parisi, G., Sevelev, M., Urbani, P. & Zamponi, F. Universality of the SAT-UNSAT (jamming) threshold in non-convex continuous constraint satisfaction problems.SciPost Physics2,019 (2017) (cit. on p. 3)
work page 2017
- [15]
-
[16]
Huang, B.Capacity threshold for the Ising perceptronin2024 IEEE 65th Annual Symposium on Foundations of Computer Science (FOCS)(2024), 1126–1136 (cit. on p. 3)
work page 2024
-
[17]
Hopfield, J. J. Neural networks and physical systems with emergent collective computational abilities.Proceedings of the National Academy of Sciences79,2554–2558. eprint: https://www.pnas.org/doi/pdf/10.1073/pnas.79.8.2554 (1982) (cit. on p. 3)
-
[18]
Amit, D. J., Gutfreund, H. & Sompolinsky, H. Spin-glass models of neural networks.Physical Review A32,1007 (1985) (cit. on p. 3)
work page 1985
-
[19]
Krotov, D. & Hopfield, J. J. Dense associative memory for pattern recognition.Advances in neural information processing systems29(2016) (cit. on p. 3)
work page 2016
-
[20]
Demircigil, M., Heusel, J., Löwe, M., Upgang, S. & Vermet, F. On a Model of Associative Memory with Huge Storage Capacity.Journal of Statistical Physics168,288–299 (May 2017) (cit. on p. 3)
work page 2017
-
[21]
Krotov, D. & Hopfield, J. J.Large Associative Memory Problem in Neurobiology and Machine Learning inInternational Conference on Learning Representations(2021) (cit. on p. 3)
work page 2021
-
[22]
Ramsauer, H.et al. Hopfield Networks is All You NeedinInternational Conference on Learning Representations(2021) (cit. on p. 3)
work page 2021
-
[23]
Lucibello, C. & Mézard, M. Exponential capacity of dense associative memories.Physical Review Letters132,077301 (2024) (cit. on p. 3)
work page 2024
-
[24]
Anderson, J. A. A simple neural network generating an interactive memory.Mathematical Biosciences14,197–220 (1972) (cit. on p. 3)
work page 1972
-
[25]
Kohonen, T. Correlation Matrix Memories.IEEE Transactions on Computers21,353–359 (Apr. 1972) (cit. on p. 3)
work page 1972
-
[26]
Kosko, B. Bidirectional associative memories.IEEE Transactions on Systems, Man, and Cybernetics 18,49–60 (1988) (cit. on p. 3)
work page 1988
-
[27]
Weston, J., Chopra, S. & Bordes, A.Memory Networksin3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings(eds Bengio, Y. & LeCun, Y.) (2015) (cit. on p. 3)
work page 2015
-
[28]
Barnfield, N., Kim, J., Nichani, E., Lee, J. D. & Lu, Y. M.Sharp Capacity Thresholds in Linear Associative Memory: From Winner-Take-All to Listwise Retrieval2026. arXiv: 2605.05189 [stat.ML] (cit. on pp. 4, 7)
work page · Pith review · arXiv
-
[29]
Hu, H. & Lu, Y. M. Universality Laws for High-Dimensional Learning With Random Features.IEEE Transactions on Information Theory69,1932–1964 (2023) (cit. on pp. 5, 20)
work page 2023
-
[30]
Bandeira, A. S. & Maillard, A. Exact threshold for approximate ellipsoid fitting of random points. Electronic Journal of Probability30,1–46 (2025) (cit. on pp. 5, 10, 20)
work page 2025
-
[31]
Goldt, S.et al. The Gaussian equivalence of generative models for learning with shallow neural networksinProceedings of the 2nd Mathematical and Scientific Machine Learning Conference(eds Bruna, J., Hesthaven, J. & Zdeborova, L.)145(PMLR, 2022), 426–471 (cit. on pp. 5, 20)
work page 2022
-
[32]
Montanari, A. & Saeed, B. N.Universality of empirical risk minimizationinConference on Learning Theory(2022), 4310–4312 (cit. on pp. 5, 10, 20)
work page 2022
-
[33]
Mézard, M., Parisi, G. & Virasoro, M. A.Spin glass theory and beyond: An Introduction to the Replica Method and Its Applications(World Scientific Publishing Company, 1987) (cit. on pp. 9, 18, 21)
work page 1987
-
[34]
Gabrié, M. Mean-field inference methods for neural networks.Journal of Physics A: Mathematical and Theoretical53,223002 (2020) (cit. on p. 9)
work page 2020
-
[35]
Charbonneau, P.et al. Spin Glass Theory and Far Beyond: Replica Symmetry Breaking after 40 Years (World Scientific, 2023) (cit. on pp. 9, 18)
work page 2023
-
[36]
Maillard, A., Troiani, E., Martin, S., Krzakala, F. & Zdeborová, L. Bayes-optimal learning of an extensive-width neural network from quadratically many samples.Advances in Neural Information Processing Systems37,82085–82132 (2024) (cit. on pp. 9, 10, 20)
work page 2024
-
[37]
Barbier, J., Camilli, F., Nguyen, M.-T., Pastore, M. & Skerk, R. Statistical physics of deep learning: Optimal learning of a multi-layer perceptron near interpolation.Phys. Rev. X(2026) (cit. on pp. 9, 10)
work page 2026
-
[38]
Guerra, F. & Toninelli, F. L. The thermodynamic limit in mean field spin glass models.Communications in Mathematical Physics230,71–79 (2002) (cit. on p. 9)
work page 2002
-
[39]
Talagrand, M. The Parisi formula.Annals of mathematics,221–263 (2006) (cit. on p. 9)
work page 2006
-
[40]
Panchenko, D. The Parisi ultrametricity conjecture.Annals of Mathematics,383–393 (2013) (cit. on p. 9)
work page 2013
-
[41]
Barbier, J., Krzakala, F., Macris, N., Miolane, L. & Zdeborová, L. Optimal errors and phase transitions in high-dimensional generalized linear models.Proceedings of the National Academy of Sciences 116,5451–5460 (2019) (cit. on p. 9)
work page 2019
-
[42]
Gerbelot, C., Abbara, A. & Krzakala, F. Asymptotic Errors for Teacher-Student Convex Generalized Linear Models (Or: How to Prove Kabashima’s Replica Formula).IEEE Transactions on Information Theory69,1824–1852 (2023) (cit. on p. 9)
work page 2023
-
[43]
Vilucchio, M., Dandi, Y., Rossignol, M. P., Gerbelot, C. & Krzakala, F. Asymptotics of non-convex generalized linear models in high-dimensions: A proof of the replica formula.arXiv preprint arXiv:2502.20003(2025) (cit. on p. 9)
-
[44]
Talagrand, M.Mean field models for spin glasses: Volume I: Basic examples(Springer Science & Business Media, 2010) (cit. on pp. 9, 21)
work page 2010
-
[45]
Barbier, J., Panchenko, D. & Sáenz, M. Strong replica symmetry for high-dimensional disordered log-concave Gibbs measures.Information and Inference: A Journal of the IMA11,1079–1108 (2022) (cit. on pp. 9, 21)
work page 2022
-
[46]
Maillard, A., Bandeira, A. S., Belius, D., Dokmanić, I. & Nakajima, S. Injectivity of relu networks: perspectives from statistical physics.Applied and Computational Harmonic Analysis76,101736 (2025) (cit. on pp. 10, 19, 22, 27)
work page 2025
-
[47]
Xu, Y., Maillard, A., Krzakala, F. & Zdeborová, L.Fundamental Limits of Matrix Sensing: Exact Asymptotics, Universality, and Applicationsin38th Annual Conference on Learning Theory(2025) (cit. on pp. 10, 20)
work page 2025
-
[48]
Erba, V., Troiani, E., Zdeborova, L. & Krzakala, F.The Nuclear Route: Sharp Asymptotics of ERM in Overparameterized Quadratic NetworksinThe Thirty-ninth Annual Conference on Neural Information Processing Systems(2026) (cit. on p. 10)
work page 2026
- [49]
-
[50]
Engel, A. & Van den Broeck, C.Statistical Mechanics of Learning(Cambridge University Press, 2001) (cit. on p. 19)
work page 2001
- [51]
-
[52]
Gerace, F., Krzakala, F., Loureiro, B., Stephan, L. & Zdeborová, L. Gaussian universality of perceptrons with random labels.Physical Review E109,034305 (2024) (cit. on p. 20)
work page 2024
-
[53]
Wen, G. G., Hu, H., Lu, Y. M., Fan, Z. & Misiakiewicz, T. When does Gaussian equivalence fail and how to fix it: Non-universal behavior of random features with quadratic scaling.arXiv preprint arXiv:2512.03325(2025) (cit. on p. 20)
-
[54]
Maillard, A. & Kunisky, D. Fitting an ellipsoid to random points: predictions using the replica method.IEEE Transactions on Information Theory(2024) (cit. on pp. 20, 28–30)
work page 2024
-
[55]
Anderson, G. W., Guionnet, A. & Zeitouni, O.An Introduction to Random Matrices(Cambridge University Press, 2009) (cit. on pp. 20, 27–29)
work page 2009
-
[56]
Vershynin, R.High-dimensional probability: An introduction with applications in data science(Cambridge University Press, 2018) (cit. on p. 20)
work page 2018
-
[57]
Potters, M. & Bouchaud, J.-P.A First Course in Random Matrix Theory: for Physicists, Engineers and Data Scientists(Cambridge University Press, 2020) (cit. on pp. 27, 33)
work page 2020
-
[58]
Erba, V., Troiani, E., Biggio, L., Maillard, A. & Zdeborová, L. Bilinear sequence regression: A model for learning from long sequences of high-dimensional tokens.Physical Review X15,021092 (2025) (cit. on pp. 28, 30)
work page 2025
- [59]
-
[60]
Bun, J., Bouchaud, J.-P., Majumdar, S. N. & Potters, M. Instanton Approach to Large N Harish-Chandra-Itzykson-Zuber Integrals.Physical Review Letters113,070201 (2014) (cit. on p. 31)
work page 2014