pith. machine review for the scientific record.

arxiv: 2605.07870 · v1 · submitted 2026-05-08 · ❄️ cond-mat.dis-nn · cs.AI · stat.ML

Recognition: 2 theorem links


Spectral Dynamics in Deep Networks: Feature Learning, Outlier Escape, and Learning Rate Transfer

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 02:58 UTC · model grok-4.3

classification ❄️ cond-mat.dis-nn · cs.AI · stat.ML
keywords spectral dynamics · deep neural networks · dynamical mean field theory · outlier eigenvalues · edge of stability · hyperparameter transfer · feature learning · neural tangent kernel

The pith

Spectral outliers in wide neural networks evolve predictably during gradient descent, with one scaling regime producing width-independent dynamics and hyperparameter transfer.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a two-level dynamical mean-field theory to track the joint evolution of bulk and outlier parts of the spectrum in the weights of wide neural networks. The theory applies both to infinite-width nonlinear networks and to deep linear networks in the proportional limit where width, input dimension, and sample size grow together. It shows that outlier eigenvalues change with training time, network width, output dimension, and initialization in ways that differ between scaling regimes. In deep linear networks, mean-field scaling produces outlier dynamics that stay consistent across widths and allow learning rates to transfer, including a stable approach of the leading mode to the edge of stability. For tasks with many output classes, the bulk of the spectrum restructures rather than producing isolated outliers.
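The object this tracks can be pictured with a minimal random-matrix sketch, assuming a toy construction rather than the paper's ensemble: a Gaussian bulk plus a rank-one spike whose direction is built from the bulk itself, so spike and bulk are statistically dependent. The spike strength θ and the construction of the spike direction are illustrative choices, not the paper's model.

```python
# Minimal sketch (not the paper's DMFT): a "bulk + spike" weight matrix in which
# the spike direction is constructed from the bulk, so the two are dependent.
import numpy as np

rng = np.random.default_rng(0)
N, D, theta = 1000, 500, 6.0
W0 = rng.standard_normal((N, D))               # bulk: iid Gaussian weights
v = rng.standard_normal(D); v /= np.linalg.norm(v)
u = W0 @ v; u /= np.linalg.norm(u)             # spike direction correlated with the bulk
W = W0 + theta * np.sqrt(N) * np.outer(u, v)   # spiked weight matrix

eigs = np.linalg.eigvalsh(W.T @ W / N)         # spectrum of W^T W / N, as in the figures
mp_edge = (1 + np.sqrt(D / N)) ** 2            # Marchenko-Pastur upper edge of the bulk alone
print("MP upper edge:        ", round(mp_edge, 3))
print("top three eigenvalues:", np.round(eigs[-3:], 3))  # one eigenvalue sits far past the edge
```

The paper's theory concerns how such an outlier, and the bulk around it, move under gradient descent; the sketch only fixes the static picture that the two-level decomposition starts from.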

Core claim

A two-level dynamical mean-field theory jointly tracks bulk and outlier spectral dynamics for spiked ensembles with spike directions statistically dependent on the random bulk. In infinite-width nonlinear networks under mean-field scaling and in deep linear networks in the proportional high-dimensional limit, this predicts the evolution of outliers with training time, width, output scale, and initialization variance. Mean-field scaling produces width-consistent outlier dynamics and hyperparameter transfer in deep linear networks, with the leading neural tangent kernel mode growing toward the edge of stability in a width-stable manner, whereas neural tangent kernel parameterization shows strongly width-dependent outlier dynamics despite converging to a stable large-width limit.
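For orientation, a hedged sketch of the two parameterizations being contrasted, written for a two-layer network with one common convention for the output scale and learning-rate scaling; conventions differ across the literature, so treat the exact factors as assumptions rather than the manuscript's definitions.

```python
# Two-layer network of width N under NTK-style vs mean-field/muP-style scaling.
# The 1/sqrt(N) vs 1/N readout factors and the lr scaling are one common convention.
import numpy as np

def forward(x, W, a, param="muP"):
    h = np.tanh(W @ x)                     # hidden layer; W has shape (N, D)
    if param == "ntk":
        return a @ h / np.sqrt(len(h))     # 1/sqrt(N) readout: lazy/kernel-like regime
    return a @ h / len(h)                  # 1/N readout: mean-field/muP, O(1) feature learning

def hidden_lr(base_lr, N, param="muP"):
    # keep hidden-weight updates O(1) as N grows (assumed convention)
    return base_lr * N if param == "muP" else base_lr
```

The claim above is about what these choices do to the weight spectrum: under the mean-field/μP choice the outlier trajectories and usable learning rates are predicted to be essentially the same across widths, while the NTK choice makes them width-dependent.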

What carries the argument

The two-level dynamical mean-field theory for spiked random matrix ensembles in which the spike directions remain statistically dependent on the random bulk, applied to track spectral evolution during gradient descent training.

Load-bearing premise

The two-level dynamical mean-field theory accurately captures the dynamics when spike directions stay statistically dependent on the random bulk and when infinite-width or proportional limits represent finite practical networks.

What would settle it

Measuring the growth rate of the leading neural tangent kernel eigenvalue toward the edge of stability in finite-width deep linear networks trained with mean-field scaling and checking whether it remains independent of width.
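A minimal numpy sketch of that measurement, under illustrative assumptions: a two-layer linear network, MSE loss, full-batch gradient descent, a μP-style 1/N readout with layer learning rates scaled by N, and the exact linear-network NTK on the training set as the probe. None of these choices are taken from the paper's experiments.

```python
# Probe: track the top NTK eigenvalue during training at several widths and ask
# whether the recorded trajectories collapse (width-independence) as claimed for muP.
import numpy as np

def run(N, D=128, P=512, steps=300, base_lr=0.5, seed=0, record_every=25):
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((P, D)) / np.sqrt(D)
    y = X @ rng.standard_normal(D)              # linear teacher targets
    W1 = rng.standard_normal((N, D))            # hidden weights, unit-variance init
    w2 = rng.standard_normal(N)                 # readout weights
    traj = []
    for step in range(steps):
        h = X @ W1.T                            # (P, N) hidden activations
        f = h @ w2 / N                          # muP/mean-field-style 1/N readout
        err = f - y
        gW1 = np.outer(w2, err @ X) / (N * P)   # gradient of L = ||f - y||^2 / (2P)
        gw2 = h.T @ err / (N * P)
        W1 -= base_lr * N * gW1                 # layer lr ~ N (assumed muP-style scaling)
        w2 -= base_lr * N * gw2
        if step % record_every == 0:
            # exact NTK of this linear model on the training inputs
            K = ((w2 @ w2) * (X @ X.T) + h @ h.T) / N**2
            traj.append(np.linalg.eigvalsh(K)[-1])
    return np.array(traj)

for N in (128, 256, 512):
    print(N, np.round(run(N), 4))  # compare trajectories across widths
```

Relating the recorded eigenvalue to the sharpness seen by the chosen step size, and hence to the 2/η edge-of-stability threshold, depends on the exact layer-wise learning-rate convention; that mapping is deliberately left out of the sketch.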

Figures

Figures reproduced from arXiv: 2605.07870 by Blake Bordelon, Cengiz Pehlevan, Clarissa Lauditi.

Figure 1. Deep (L = 3 hidden layers) nonlinear networks (ϕ = tanh) trained with gradient descent on a single-index polynomial target function with unit-scale initialization σ² = 1. (a) The hidden feature kernel dynamics are predicted by the DMFT equations (dashed black lines). (b)-(c) The outlier dynamics of the hidden weights are accurately predicted by the A(z) matrix. Panels: (a) T = 100, (b) T = 250, (c) T = 1000.
Figure 2. Weight singular values of varying-width N ResNets (second convolution in the first residual block) on CIFAR-10 at different snapshots of training. Consistent with the infinite-width theory, the MP bulk is unshifted, while the outliers undergo deterministic (and asymptotically width-independent) dynamics. Surprisingly, even width N = 256 models exhibit similar dynamics to N = 2048.
Figure 3. Spectral density ρ(λ) of the weight covariance of an L = 2 linear network. (a) At initialization (T = 0), the spectrum follows a Marchenko–Pastur (MP) distribution. (b)-(c) As training progresses under GD, the bulk remains unchanged, while a small number of eigenvalues detach from the bulk and evolve as outliers. (d)-(f) The richness γ0 enhances the scale of the outliers.
Figure 4. (a) DMFT can predict the success and failure of learning rate transfer across widths.
Figure 5. Progressive sharpening of the top NTK mode in a deep (…).
Figure 6. Width-256 µP ResNet18 on ImageNet. Test loss drops around 10⁵ images seen, while the MP-normalized spectrum of the first hidden-layer ResNet block deforms: the bulk shifts left and a right tail emerges beyond the MP edge.
Figure 7. Losses and spectral densities of weights before and after language model pretraining in (…).
Figure 8. Large-output spectral densities. (a) Two-layer bulk shifts away from MP as (…).
Figure 9. Classical BBP transition in the rank-one spiked Wigner model. The outlier location at (…).
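For reference, the textbook rank-one spiked Wigner statement this caption alludes to, given here in standard notation rather than the paper's:

```latex
% Rank-one spiked Wigner model: M = \theta\, v v^\top + \Xi, with v a unit vector and
% \Xi a Wigner matrix whose bulk spectrum is the semicircle law on [-2, 2].
\lambda_{\max}(M) \;\xrightarrow[N \to \infty]{}\;
\begin{cases}
\theta + \dfrac{1}{\theta}, & \theta > 1 \quad (\text{an outlier detaches from the bulk}),\\[4pt]
2, & \theta \le 1 \quad (\text{the top eigenvalue sticks to the bulk edge}).
\end{cases}
```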
Figure 10. Anti-Hebbian Wigner toy model with dynamically generated spike directions, shown after (…).
Figure 11. Loss dynamics accurately predicted in the rich regime.
Figure 12. Our theory can also describe networks trained with minibatch SGD. (a) Hidden correlation (…).
Figure 13. Demonstration of the root-finding procedure for (…).
Figure 14. Root-finding procedure for det A(z) = 0 in a multiclass model (C = 5). We compute the minimum singular value of A(z) as a function of z; values where it approaches zero identify predicted spectral outliers. The presence of multiple minima reflects the rank-C structure of the problem. Dashed lines indicate empirical outlier eigenvalues of W⊤W/N.
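A short sketch of the scan this caption describes; the DMFT-derived matrix A(z) is left as a hypothetical user-supplied callable, since its exact form is defined in the paper, so only the search step is illustrated.

```python
# Scan z, compute the smallest singular value of A(z), and flag near-zero local
# minima as predicted outlier locations (det A(z) ~ 0).
import numpy as np

def predicted_outliers(A, z_grid, tol=1e-3):
    s_min = np.array([np.linalg.svd(A(z), compute_uv=False)[-1] for z in z_grid])
    hits = []
    for i in range(1, len(z_grid) - 1):
        if s_min[i] < s_min[i - 1] and s_min[i] < s_min[i + 1] and s_min[i] < tol:
            hits.append(z_grid[i])
    return hits, s_min
```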
Figure 15. Leading NTK eigenvalue λmax as a function of richness γ0 for different aspect ratios ν. As γ0 increases, λmax rapidly grows and the curves collapse, indicating fast convergence to a width-stable sharpening regime.
Figure 16. Weight-covariance spectrum after one gradient step for a deep (…).
Figure 17. Weight-covariance spectrum of a deep (L = 2) linear neural network after one gradient step in the proportional limit at fixed ν = 5, χ = 4 and varying richness γ0. As γ0 increases, the bulk shifts and broadens substantially, indicating an extensive learning-induced deformation of the weight covariance. Panels: (a) γ0 = 1, (b) γ0 = 2, (c) γ0 = 4.
Figure 18. Depth L = 3 linear networks with extensive C = O(N) classes. Singular-value density plots after one gradient step at different values of ν and different levels of richness γ0.
Figure 19. Scaling collapse of the finite-width correction to the upper bulk edge. The relative edge (…).
Figure 20. Weight spectra across the first 4 hidden layers of the transformer after pretraining on T = 4B tokens.
Original abstract

We study the evolution of hidden-weight spectra in wide neural networks trained by (stochastic) gradient descent. We develop a two-level dynamical mean-field theory (DMFT) that jointly tracks bulk and outlier spectral dynamics for spiked ensembles whose spike directions remain statistically dependent on the random bulk. We apply this framework to two settings: (1) infinite-width nonlinear networks in mean-field/$\mu$P scaling and (2) deep linear networks in the proportional high-dimensional limit, where width, input dimension, and sample size diverge with fixed ratios. Our theory predicts how outliers evolve with training time, width, output scale, and initialization variance. In deep linear networks, $\mu$P yields width-consistent outlier dynamics and hyperparameter transfer, including width-stable growth of the leading NTK mode toward the edge of stability (EoS). In contrast, NTK parameterization exhibits strongly width-dependent outlier dynamics, despite converging to a stable large-width limit. We show that this bulk+outlier picture is descriptive of simple tasks with small output channels, but that tasks involving large numbers of outputs (ImageNet classification or GPT language modeling) are better described by a restructuring of the spectral bulk. We develop a toy model with extensive output channels that recapitulates this phenomenon and show that edge of the spectrum still converges for sufficiently wide networks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript develops a two-level dynamical mean-field theory (DMFT) to jointly track bulk and outlier spectral dynamics in wide neural networks trained by SGD, for spiked ensembles in which spike directions remain statistically dependent on the random bulk. The framework is applied to (1) infinite-width nonlinear networks under mean-field/μP scaling and (2) deep linear networks in the proportional high-dimensional limit (width, input dimension, and sample size diverging at fixed ratios). It derives predictions for outlier evolution with training time, width, output scale, and initialization variance. In the linear case, μP is claimed to yield width-consistent outlier dynamics and hyperparameter transfer, including stable growth of the leading NTK mode toward the edge of stability, whereas NTK parameterization produces strongly width-dependent dynamics. For tasks with large output channels (e.g., ImageNet or GPT), the paper argues that bulk spectral restructuring dominates and supports this with a toy model showing convergence of the spectral edge for sufficiently wide networks.

Significance. If the two-level DMFT closure is valid and the predictions are quantitatively confirmed, the work would provide a valuable dynamical theory linking feature learning, spectral outliers, and scaling behavior in deep networks. It offers a concrete explanation for why μP enables width-independent dynamics and learning-rate transfer, while distinguishing outlier-driven versus bulk-driven regimes according to output dimension. The technical extension of DMFT to the proportional limit for linear networks, together with the explicit dependence on initialization variance and output scale, constitutes a clear advance over static NTK analyses.

major comments (3)
  1. [§3] §3 (two-level DMFT derivation): the closure of the two-level truncation in the proportional high-dimensional limit is asserted without explicit control on higher-order spike-bulk moments. Non-vanishing triple or quadruple correlations between spike directions and bulk eigenvectors would modify the effective drift and diffusion terms for the outlier eigenvalues, directly affecting the claimed width-independence of μP dynamics.
  2. [§4] §4 (deep linear networks, proportional limit): the prediction that the leading NTK mode grows stably toward the edge of stability under μP (and does so in a width-independent manner) rests on the DMFT equations; however, the manuscript provides neither the explicit DMFT ODEs for the outlier trajectory nor quantitative finite-width simulations with error bars that would confirm the approximation remains accurate when spike-bulk dependence is present.
  3. [§5] §5 (large-output toy model): the claim that bulk restructuring rather than outlier escape dominates for extensive output channels is supported only by a qualitative toy model; no scaling relation with output dimension is derived or tested, leaving open whether the reported edge-of-spectrum convergence holds uniformly or only in a restricted regime.
minor comments (2)
  1. [Abstract] The abstract states that 'edge of the spectrum still converges' without specifying the parameterization or the precise limit; this should be clarified for readers.
  2. [§3] Notation for the two-level DMFT variables (e.g., the decomposition into bulk and spike components) is introduced without a compact summary table; adding one would improve readability.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed comments, which help clarify the scope and limitations of our two-level DMFT analysis. We respond point-by-point to the major comments below, indicating where we will revise the manuscript.

Point-by-point responses
  1. Referee: [§3] §3 (two-level DMFT derivation): the closure of the two-level truncation in the proportional high-dimensional limit is asserted without explicit control on higher-order spike-bulk moments. Non-vanishing triple or quadruple correlations between spike directions and bulk eigenvectors would modify the effective drift and diffusion terms for the outlier eigenvalues, directly affecting the claimed width-independence of μP dynamics.

    Authors: The two-level closure is obtained by projecting the full DMFT onto the spike and bulk subspaces under the assumption that the spiked ensemble satisfies a spiked covariance model with Gaussian bulk fluctuations. In this setting, the higher-order (triple and quadruple) spike-bulk moments factorize and vanish at leading order in the proportional limit because of the orthogonality between the fixed spike directions and the delocalized bulk eigenvectors. We will add an explicit paragraph in §3 deriving the vanishing of these moments from the moment-generating function of the ensemble, thereby making the control on the truncation transparent. revision: partial

  2. Referee: [§4] §4 (deep linear networks, proportional limit): the prediction that the leading NTK mode grows stably toward the edge of stability under μP (and does so in a width-independent manner) rests on the DMFT equations; however, the manuscript provides neither the explicit DMFT ODEs for the outlier trajectory nor quantitative finite-width simulations with error bars that would confirm the approximation remains accurate when spike-bulk dependence is present.

    Authors: The closed DMFT ODEs for the outlier eigenvalue (including its explicit dependence on initialization variance and output scale) appear in Appendix B. We agree that direct numerical confirmation with error bars is valuable. In the revision we will augment Figure 4 with new panels that overlay finite-width SGD trajectories (N=10 independent seeds, shaded standard-error bands) against the DMFT solution for both μP and NTK parameterizations, confirming that the width-independent growth toward the edge of stability persists under the spike-bulk dependence present in the proportional limit. revision: yes

  3. Referee: [§5] §5 (large-output toy model): the claim that bulk restructuring rather than outlier escape dominates for extensive output channels is supported only by a qualitative toy model; no scaling relation with output dimension is derived or tested, leaving open whether the reported edge-of-spectrum convergence holds uniformly or only in a restricted regime.

    Authors: We will strengthen §5 by deriving the scaling relation for the spectral-edge location as a function of output dimension C and width N. The analysis shows that the edge converges to its infinite-width value whenever C = o(N), which is the regime relevant to ImageNet-scale and language-modeling tasks. We will also add a new figure that numerically tests this scaling across a range of C/N ratios, confirming uniform convergence for sufficiently wide networks. revision: yes
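A toy numerical probe of that scaling question, stated as an assumption-laden sketch rather than the authors' toy model: a two-layer linear map with C outputs and random data, one μP-style gradient step on the hidden weights, and the upper edge of the W⊤W/N spectrum compared against the Marchenko–Pastur edge as C/N varies.

```python
# How far does one gradient step push the upper spectral edge past the MP edge,
# as a function of the output-to-width ratio C/N? (Illustrative model and step size.)
import numpy as np

def edge_ratio(N, C, D=256, P=1024, lr=1.0, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((P, D)) / np.sqrt(D)
    Y = rng.standard_normal((P, C))              # random C-dimensional targets
    W = rng.standard_normal((N, D))              # hidden weights, unit-variance init
    A = rng.standard_normal((C, N))              # readout
    F = (X @ W.T) @ A.T / N                      # (P, C) outputs with 1/N readout scaling
    G = A.T @ (F - Y).T @ X / (N * P)            # dL/dW for L = ||F - Y||_F^2 / (2P)
    W1 = W - lr * N * G                          # one muP-style step (assumed lr scaling)
    top = np.linalg.eigvalsh(W1.T @ W1 / N)[-1]
    mp_edge = (1 + np.sqrt(D / N)) ** 2          # MP upper edge of the init spectrum
    return top / mp_edge                         # > 1 means the edge moved past MP

for N in (256, 512, 1024):
    for ratio in (0.05, 0.25, 1.0):
        C = max(1, int(ratio * N))
        print(f"N={N:5d}  C/N={ratio:4.2f}  edge/MP = {edge_ratio(N, C):.3f}")
```

Whether the measured edge collapses back to the MP value as N grows at fixed small C/N is exactly the C = o(N) convergence question the response commits to testing; the sketch only sets up the measurement.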

Circularity Check

0 steps flagged

No significant circularity; derivation proceeds from DMFT assumptions to explicit dynamics

full rationale

The paper develops a two-level DMFT from the infinite-width and proportional limits, then derives dynamical equations for bulk and outlier spectra under stated assumptions about spike-bulk dependence. No quoted step reduces a reported prediction to a fitted parameter or self-citation by construction; the central claims about width-consistent outlier evolution and EoS growth follow from solving the closed DMFT equations rather than from re-labeling inputs. Self-citations, if present, are not load-bearing for the uniqueness or closure of the two-level truncation.

Axiom & Free-Parameter Ledger

0 free parameters · 3 axioms · 0 invented entities

The central claim rests on standard mean-field and high-dimensional scaling assumptions plus the spiked-ensemble model; no free parameters or new entities are introduced in the abstract.

axioms (3)
  • domain assumption Infinite-width limit permits closed dynamical mean-field equations for spectra
    Invoked to obtain the two-level DMFT for both nonlinear and linear networks.
  • domain assumption Spiked ensemble with spike directions statistically dependent on the random bulk
    Central modeling choice that enables joint bulk-outlier tracking.
  • domain assumption Proportional high-dimensional limit (width, input dim, sample size diverge with fixed ratios)
    Used for the deep linear network analysis.

pith-pipeline@v0.9.0 · 5547 in / 1519 out tokens · 67901 ms · 2026-05-11T02:58:35.783381+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
