pith. machine review for the scientific record.

arxiv: 2605.10795 · v1 · submitted 2026-05-11 · 📊 stat.ML · cond-mat.dis-nn · cond-mat.stat-mech · cs.LG

Recognition: 2 theorem links


Factual recall in linear associative memories: sharp asymptotics and mechanistic insights

Alessio Giorlandino, Antoine Maillard, Sebastian Goldt

Pith reviewed 2026-05-12 04:09 UTC · model grok-4.3

classification 📊 stat.ML · cond-mat.dis-nn · cond-mat.stat-mech · cs.LG
keywords linear associative memory · storage capacity · decoupled model · statistical physics · factual recall · Hebbian rule · extreme value statistics

The pith

Linear associative memories store up to p_c associations, where p_c log p_c ≈ d²/2.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper determines the maximum number of input-output associations a linear network can store while keeping each mapped input well separated from all other targets. It introduces a decoupled model where each input competes against its own independent set of outputs, then uses statistical physics to compute the capacity exactly. This yields the relation p_c log p_c / d² = 1/2. The analysis also shows how the optimal weights differ from a simple Hebbian rule by lifting correct scores just above the highest competitor scores set by extreme-value statistics. The same capacity formula extends to linear two-layer networks.
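The capacity relation only defines p_c implicitly. As a quick numerical illustration (an editorial sketch, not code from the paper), the fixed point of p = d²/(2 log p) can be found by simple iteration:

```python
import math

def capacity(d, iters=50):
    """Solve p * log(p) = d**2 / 2 for p by fixed-point iteration.

    Illustrates the asymptotic relation p_c log p_c / d^2 = 1/2,
    rearranged as p = d^2 / (2 log p)."""
    p = float(d) ** 2  # initial guess
    for _ in range(iters):
        p = d ** 2 / (2 * math.log(p))
    return p

for d in (100, 1000):
    p_c = capacity(d)
    print(f"d={d}: p_c ≈ {p_c:.0f}, p_c log p_c / d^2 = {p_c * math.log(p_c) / d**2:.4f}")
```

For d = 100 this gives p_c on the order of 750, i.e. well below the naive d²/2 count of parameters divided by two.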

Core claim

In the decoupled model each input has its own independent set of competing outputs. Statistical physics then shows that the maximum number of storable associations satisfies p log p / d² = 1/2. Numerical and analytical checks indicate this decoupled model reproduces the storage capacity, weight spectra, and learning mechanism of the original linear associative memory. The optimal solution raises the correct output score just above the extreme-value threshold of the competing outputs rather than using broad fluctuations as in Hebbian learning. The calculation of the constant p_c generalizes directly to linear two-layer architectures.
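Read literally, the relation pins p_c only up to its own logarithm; it can be inverted to leading order (an editorial expansion, consistent with the p ≈ d²/(2 log p) form, not a formula stated in the review):

```latex
p_c \log p_c = \frac{d^2}{2}
\quad\Longrightarrow\quad
p_c = \frac{d^2}{2\log p_c},
\qquad
\log p_c = 2\log d - \log(2\log p_c) \sim 2\log d,
```

so that p_c ∼ d²/(4 log d) to leading order as d → ∞.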

What carries the argument

The decoupled model, in which each input is paired with an independent set of competing targets, allowing separate treatment of the p constraints per association.

If this is right

  • The storage capacity scales asymptotically as p ≈ d² / (2 log p).
  • Optimal learning raises correct scores just above the extreme-value threshold of competitors rather than boosting alignments with broad fluctuations.
  • The same capacity calculation applies to linear two-layer networks.
  • The decoupled construction supplies a baseline for capacity in deeper or nonlinear networks.
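The extreme-value threshold in the second bullet can be checked directly: the maximum of p independent standard-normal competitor scores concentrates near √(2 log p), a standard extreme-value fact. The simulation below is an editorial sketch, not the paper's code:

```python
import math
import random

random.seed(0)

p = 20000   # number of competing (non-target) scores
trials = 50

# Empirical mean of the maximum of p i.i.d. N(0, 1) scores.
avg_max = sum(
    max(random.gauss(0.0, 1.0) for _ in range(p))
    for _ in range(trials)
) / trials

threshold = math.sqrt(2 * math.log(p))  # leading-order extreme-value scale
print(f"mean max = {avg_max:.2f}, sqrt(2 log p) = {threshold:.2f}")
```

The empirical mean sits slightly below the leading-order scale because of the known log-log correction to the Gaussian maximum.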

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same extreme-value threshold mechanism may appear in the output layers of trained transformers when factual recall is examined.
  • Decoupling competing outputs could simplify capacity bounds for nonlinear activation functions.
  • Weight matrices of trained linear models can be checked for the predicted threshold-raising pattern rather than uniform Hebbian scaling.
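The last bullet can be prototyped in a few lines: under a naive Hebbian rule with random Gaussian embeddings, target scores fluctuate broadly around their mean instead of sitting pinned just above the competitor maximum. The setup below is hypothetical (the dimensions and the Hebbian form W = Y Xᵀ are illustrative assumptions, not the paper's experiment):

```python
import numpy as np

rng = np.random.default_rng(1)
d, p = 60, 300
X = rng.standard_normal((d, p)) / np.sqrt(d)  # input embeddings
Y = rng.standard_normal((d, p)) / np.sqrt(d)  # target embeddings

W_hebb = Y @ X.T                  # naive Hebbian weights: sum of y_mu x_mu^T
S = Y.T @ (W_hebb @ X)            # scores: S[nu, mu] = y_nu . (W x_mu)
target = np.diag(S)               # correct-association scores
non_target = S[~np.eye(p, dtype=bool)]

print(f"target: {target.mean():.2f} ± {target.std():.2f}; "
      f"non-target std: {non_target.std():.2f}")
```

The broad spread of the target scores is the "broad fluctuations" the review attributes to Hebbian learning; the claimed optimal solution would instead show target scores clustered just above the non-target maximum.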

Load-bearing premise

The decoupled model is equivalent to the original associative memory in capacity, weight spectra, and storage mechanism.

What would settle it

A simulation for large d in which the measured ratio p log p / d² in the original model deviates from 1/2 while the decoupled model stays at 1/2 would falsify the claimed equivalence.
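A minimal finite-size version of such a check can be sketched as follows (editorial only: it stores associations with a least-squares fit rather than the paper's margin-optimal weights, so the transition observed here need not sit exactly at α = 1/2):

```python
import numpy as np

rng = np.random.default_rng(0)

def recall_accuracy(d, p):
    """Winner-take-all recall of p random associations stored by a
    least-squares fit. Editorial proxy only: the paper's capacity is
    derived for margin-optimal weights."""
    X = rng.standard_normal((d, p)) / np.sqrt(d)  # input embeddings (keys)
    Y = rng.standard_normal((d, p)) / np.sqrt(d)  # output embeddings (values)
    W = Y @ np.linalg.pinv(X)                     # least-squares weight matrix
    S = Y.T @ (W @ X)                             # S[nu, mu] = y_nu . (W x_mu)
    return float(np.mean(S.argmax(axis=0) == np.arange(p)))

d = 40
for p in (30, 100, 400):
    alpha = p * np.log(p) / d**2
    print(f"alpha = {alpha:.2f}, accuracy = {recall_accuracy(d, p):.2f}")
```

Accuracy should be near one at small load and degrade as α grows; locating the decay of the margin-optimal solution at large d, and comparing it between the original and decoupled models, is the decisive version of this experiment.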

Figures

Figures reproduced from arXiv: 2605.10795 by Alessio Giorlandino, Antoine Maillard, Sebastian Goldt.

Figure 1. Left: The task is to memorise associations between inputs (keys) and outputs (values), illustrated here as cities and their countries. In the original problem, all inputs share a common set of outputs. In the decoupled problem, each input has its own independent set of competing outputs. Right: Empirical accuracy as a function of the load parameter α = p log p/d², shown for the original problem (1) (solid…)
Figure 2. Empirical accuracy as a function of the load parameter…
Figure 3. Distribution of the non-zero singular values at the capacity threshold in the…
Figure 4. Histograms of the target scores (red) and non-target scores (grey) aggregated over indices…
Original abstract

Large language models demonstrate remarkable ability in factual recall, yet the fundamental limits of storing and retrieving input--output associations with neural networks remain unclear. We study these limits in a minimal setting: a linear associative memory that maps $p$ input embeddings in $\mathbb{R}^d$ to their corresponding $d$-dimensional targets via a single layer, requiring each mapped input to be well separated from all other targets. Unlike in supervised classification, this strict separation induces $p$ constraints per association and produces strong correlations between constraints that make a direct characterisation of the storage capacity difficult. Here, we provide a precise characterisation of this capacity in the following way. We first introduce a decoupled model in which each input has its own independent set of competing outputs, and provide numerical and analytical evidence that this decoupled model is equivalent to the original model in terms of storage capacity, spectra of the learnt weights, and storage mechanism. Using tools from statistical physics, we show that the decoupled model can store up to $p_c \log p_c / d^2 = 1 / 2$ associations, and generalise the computation of $p_c$ to linear two-layer architectures. Our analysis also gives mechanistic insight into how the optimal solution improves over a naïve Hebbian learning rule: rather than boosting input-output alignments with broad fluctuations, the optimal solution raises the correct scores just above the extreme-value threshold set by the competing outputs. These findings give a sharp statistical-physics characterisation of factual storage in linear networks and provide a baseline for understanding the memory capacity of more realistic neural architectures.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript claims that a decoupled model for linear associative memories (mapping p inputs in R^d to d-dimensional targets with strict separation) is equivalent to the original model in storage capacity, weight spectra, and mechanism, based on numerical and analytical evidence. It then applies statistical-physics tools to the decoupled model to derive the sharp capacity p_c log p_c / d^2 = 1/2 and generalizes the computation of p_c to linear two-layer architectures, while providing mechanistic insight that optimal solutions raise correct scores just above the extreme-value threshold of competitors (unlike naive Hebbian rules).

Significance. If the equivalence holds rigorously in the p, d → ∞ limit with p/d^2 fixed, the work supplies a precise statistical-physics characterization of factual-recall capacity in linear networks and a mechanistic baseline for more complex architectures. The derivation of sharp asymptotics via stat-phys methods and the explicit comparison to Hebbian learning are strengths that would be valuable if the central transfer of results is justified.

major comments (2)
  1. [Abstract] The equivalence of the decoupled model to the original (in capacity, spectra, and mechanism) is supported only by 'numerical and analytical evidence.' The headline capacity p_c log p_c / d^2 = 1/2 is obtained exclusively on the decoupled model; without a rigorous demonstration that the constraint distributions or partition functions converge in the relevant scaling limit (e.g., via total-variation bounds or coupling), the exact constant 1/2 does not necessarily transfer to the original associative memory.
  2. [Section introducing the decoupled model] The paper asserts equivalence on the basis of evidence whose independence and scaling-regime validity are not detailed. If this evidence is only finite-size or approximate, the mechanistic claim that 'the optimal solution raises the correct scores just above the extreme-value threshold' and the generalization of p_c to two-layer nets cannot be taken as applying to the original model.
minor comments (1)
  1. [Abstract] The escaped LaTeX sequence in 'na\"ive' should be rendered as 'naïve' in the displayed abstract.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and valuable feedback on our manuscript. We address the concerns regarding the rigor of the equivalence between the original and decoupled models in our point-by-point responses below. While we maintain that the numerical and analytical evidence strongly supports the equivalence in the relevant scaling limits, we agree that additional clarifications will improve the presentation.

Point-by-point responses
  1. Referee: [Abstract] The equivalence of the decoupled model to the original (in capacity, spectra, and mechanism) is supported only by 'numerical and analytical evidence.' The headline capacity p_c log p_c / d^2 = 1/2 is obtained exclusively on the decoupled model; without a rigorous demonstration that the constraint distributions or partition functions converge in the relevant scaling limit (e.g., via total-variation bounds or coupling), the exact constant 1/2 does not necessarily transfer to the original associative memory.

    Authors: We appreciate this point and acknowledge that our demonstration of equivalence relies on numerical simulations across a range of system sizes and analytical arguments showing matching statistics in the large-d limit, rather than a fully rigorous proof of convergence (such as total-variation bounds on the constraint distributions). The analytical evidence includes showing that the partition functions and effective constraint distributions become equivalent under the scaling p ~ d^2 log d. However, a complete rigorous transfer is indeed not provided. In the revised manuscript, we will modify the abstract and relevant sections to more precisely state that the sharp asymptotic capacity is derived for the decoupled model, with the original model expected to share the same capacity based on the supporting evidence. We will also include a new paragraph discussing the limitations and the strength of the evidence for equivalence. revision: partial

  2. Referee: [Section introducing the decoupled model] The paper asserts equivalence on the basis of evidence whose independence and scaling-regime validity are not detailed. If this evidence is only finite-size or approximate, the mechanistic claim that 'the optimal solution raises the correct scores just above the extreme-value threshold' and the generalization of p_c to two-layer nets cannot be taken as applying to the original model.

    Authors: We agree that the section could benefit from more explicit details on the independence of the constraints in the decoupled model and the scaling regime. The evidence includes both finite-size numerics showing convergence as d increases (with p/d^2 fixed) and analytical calculations demonstrating that the weight spectra and optimization mechanisms match. The mechanistic insight is derived from the decoupled model but supported by matching behavior in the original. To address this, we will expand the section with additional details on the scaling validity, including plots of capacity convergence with system size, and clarify that while the claims are for the decoupled model, the equivalence evidence justifies applying the insights to the original. This revision will make the applicability clearer. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

Full rationale

The paper derives the capacity result p_c log p_c / d^2 = 1/2 exclusively for the decoupled model via standard statistical-physics analysis. Equivalence to the original model (in capacity, spectra, and mechanism) is asserted separately on the basis of numerical simulations and analytical arguments, which are presented as independent supporting evidence rather than a definitional reduction or self-referential fit. No load-bearing step reduces by construction to its own inputs, no self-citation chain is invoked for uniqueness or ansatz, and no known result is merely renamed. The derivation remains self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The derivation relies on standard statistical-physics approximations whose details are not visible in the abstract; no free parameters or new entities are explicitly introduced.

axioms (1)
  • domain assumption Replica-symmetric or mean-field assumptions typical of statistical-physics analysis of high-dimensional systems
    Invoked to obtain the closed-form capacity in the decoupled model.

pith-pipeline@v0.9.0 · 5602 in / 1301 out tokens · 51986 ms · 2026-05-12T04:09:38.962889+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.
