Scaling Laws from Sequential Feature Recovery: A Solvable Hierarchical Model
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-15 01:35 UTC · model grok-4.3
The pith
A layer-wise spectral algorithm on a hierarchical target with power-law feature weights recovers latent directions sequentially and aggregates their sharp thresholds into an explicit power-law decay of prediction error.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In the solvable hierarchical model the layer-wise spectral algorithm recovers the latent directions sequentially: strong features become detectable at small sample sizes while weaker features require larger samples. Sharp feature-wise recovery thresholds are proved via resolvent-based perturbation arguments that give matching upper and lower bounds. Aggregating these thresholds produces an explicit power-law decay of the prediction error, with improved scaling relative to shallow non-adaptive methods.
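A toy numerical check of the aggregation mechanism described above, not taken from the paper: it assumes weights a_i = i^{-γ} and per-feature thresholds n_i = d^q / a_i² (matching the scalings quoted in the Lean-theorem excerpt further down), with illustrative constants γ = 1, d = 100, q = 2 and a hard step-function error model.

```python
# Toy aggregation of sharp per-feature recovery thresholds into a smooth power law.
# Assumptions (illustrative, not the paper's exact model): weights a_i = i^(-gamma),
# thresholds n_i = d^q / a_i^2, and a feature contributes its full weight a_i^2
# to the error until the sample size n exceeds n_i, after which it contributes nothing.
import numpy as np

gamma, d, q = 1.0, 100, 2            # hypothetical exponent and dimensions
i = np.arange(1, 200_001)
a2 = i ** (-2 * gamma)               # squared feature weights
n_i = d**q / a2                      # per-feature recovery thresholds

for n in (10**5, 10**6, 10**7, 10**8):
    mse = a2[n < n_i].sum()          # features not yet recovered still contribute
    predicted = (n / d**q) ** (-1 + 1 / (2 * gamma))
    print(f"n={n:>9}  simulated MSE={mse:.3e}  predicted scaling={predicted:.3e}")
# The staircase of sharp transitions tracks the smooth power law
# MSE(n) ~ (n / d^q)^(-1 + 1/(2*gamma)) up to constants.
```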
What carries the argument
The layer-wise spectral algorithm adapted to the compositional structure, together with resolvent-based perturbation bounds that establish sharp eigenvector recovery thresholds.
If this is right
- Strong features become detectable at small sample sizes while weaker features require more data.
- Aggregating the sequence of sharp thresholds produces an explicit power-law decay of prediction error.
- The layer-wise method outperforms shallow non-adaptive kernels in scaling.
- Finite-size effects smooth the thresholds but preserve the overall power-law trend.
Where Pith is reading between the lines
- Networks trained on real data with similar hierarchical structure may display analogous sequential feature acquisition.
- The same resolvent technique could be applied to other layer-wise or greedy algorithms to obtain explicit thresholds.
- Removing the power-law weight assumption should eliminate the smooth scaling and leave only abrupt jumps.
Load-bearing premise
The high-dimensional target can be written as a combination of latent compositional features whose weights decrease as a power law, and the algorithm is specifically designed to exploit that structure.
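One hypothetical instantiation of that premise, for illustration only (the orthonormal latent directions, the tanh nonlinearity, and the exponent are assumptions; the paper's exact compositional construction is not reproduced here):

```python
# Sketch of a target that is a power-law-weighted combination of latent features.
import numpy as np

rng = np.random.default_rng(1)
d, k, gamma = 50, 10, 1.5
U = np.linalg.qr(rng.standard_normal((d, k)))[0]   # orthonormal latent directions u_i
a = np.arange(1, k + 1) ** (-gamma)                # power-law feature weights a_i

def target(x):
    # weighted combination of nonlinear latent features <u_i, x>
    return float(a @ np.tanh(U.T @ x))

print(target(rng.standard_normal(d)))
```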
What would settle it
Numerical or analytic evidence that the aggregated error decay deviates from power-law form, or that individual feature thresholds are not sharp, would falsify the central claim.
read the original abstract
We propose a simple mechanism by which scaling laws emerge from feature learning in multi-layer networks. We study a high-dimensional hierarchical target that is a globally high-degree function, but that can be represented by a combination of latent compositional features whose weights decrease as a power law. We show that a layer-wise spectral algorithm adapted to this compositional structure achieves improved scaling relative to shallow, non-adaptive methods, and recovers the latent directions sequentially: strong features become detectable at small sample sizes, while weaker features require more data. We prove sharp feature-wise recovery thresholds and show that aggregating these transitions yields an explicit power-law decay of the prediction error. Technically, the analysis relies on random matrix methods and a resolvent-based perturbation argument, which gives matching upper and lower bounds for individual eigenvector recovery beyond what standard gap-based perturbation bounds provide. Numerical experiments confirm the predicted sequential recovery, finite-size smoothing of the thresholds, and separation from non-hierarchical kernel baselines. Together, these results show how smooth scaling laws can emerge from a cascade of sharp feature-learning transitions.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a solvable hierarchical model in which a high-dimensional target is expressed as a combination of latent compositional features whose weights decay as a power law. It analyzes a layer-wise spectral algorithm that recovers these features sequentially, deriving sharp per-feature recovery thresholds via random-matrix resolvent perturbation that supply matching upper and lower bounds. Aggregation of the ordered thresholds then produces an explicit power-law decay of the prediction error, offering a mechanistic account of scaling laws. Numerical experiments confirm the sequential recovery, finite-size smoothing, and advantage over non-hierarchical kernel baselines.
Significance. If the central claims hold, the work supplies a concrete, analytically tractable example in which smooth scaling laws emerge directly from the summation of sharp, theoretically derived feature-recovery transitions rather than from post-hoc fitting. The resolvent-based perturbation technique yields tighter eigenvector bounds than standard gap arguments, and the explicit construction of the power-law hierarchy is a technical strength. The separation from kernel baselines underscores the benefit of the adapted multi-layer procedure.
major comments (2)
- [§3] Resolvent perturbation argument: the matching upper and lower bounds on eigenvector recovery are derived under the assumption that the minimal eigenvalue gap scales with the power-law exponent; the dependence of the perturbation remainder on this gap is not stated explicitly, even though it is load-bearing for the subsequent aggregation into a clean power-law error decay.
- [§4.2] Numerical validation: the experiments illustrate finite-size smoothing of the thresholds, yet no quantitative bound or scaling for the width of the transition region is provided; without this, it is difficult to verify that the asymptotic power law remains a good approximation at the moderate sample sizes shown in the figures.
minor comments (2)
- The definition of the latent compositional features and the precise form of the layer-wise spectral update could be stated once in the main text (rather than only in the appendix) to improve readability for readers unfamiliar with the construction.
- A short high-level proof sketch of the resolvent perturbation step would help readers follow the argument without immediately consulting the appendix.
Simulated Author's Rebuttal
We are grateful to the referee for the thorough review and the encouraging assessment of our manuscript. The comments highlight important points for clarification, which we address below. We believe these revisions will further strengthen the presentation of our results.
read point-by-point responses
- Referee: [§3] Resolvent perturbation argument: the matching upper and lower bounds on eigenvector recovery are derived under the assumption that the minimal eigenvalue gap scales with the power-law exponent; the dependence of the perturbation remainder on this gap is not stated explicitly, even though it is load-bearing for the subsequent aggregation into a clean power-law error decay.
  Authors: We thank the referee for pointing this out. The resolvent perturbation bound does depend on the eigenvalue gap, which in our power-law model scales as λ_k - λ_{k+1} ∼ k^{-α-1}. We will revise §3 to state explicitly how the remainder term depends on this gap and to show that it remains controlled under the power-law assumption, making the step to the aggregated power-law error decay fully rigorous and transparent. We will also add a short appendix remark if needed. (Revision: yes)
- Referee: [§4.2] Numerical validation: the experiments illustrate finite-size smoothing of the thresholds, yet no quantitative bound or scaling for the width of the transition region is provided; without this, it is difficult to verify that the asymptotic power law remains a good approximation at the moderate sample sizes shown in the figures.
  Authors: We agree that quantifying the transition width would be valuable. A fully rigorous bound on the smoothing width is technically challenging and outside the main scope, as it would require more refined large-deviation estimates. We will instead add a discussion in §4.2 citing standard random-matrix results, in which the transition width for eigenvector perturbations typically scales as O(n^{-1/2}), and include a supplementary figure showing the empirical width as a function of the sample size n, supporting that the asymptotic regime is approached at the depicted sizes. (Revision: partial)
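As a minimal illustration of the finite-size smoothing discussed in this exchange, one can watch an eigenvector-recovery transition sharpen as the matrix size grows. The surrogate below is an assumption, not the paper's model: a rank-one spike θ·vvᵀ plus Wigner noise, whose top-eigenvector overlap jumps at θ = 1 in the infinite-size limit.

```python
# Finite-size smoothing of a sharp eigenvector-recovery threshold (spiked Wigner surrogate).
import numpy as np

rng = np.random.default_rng(0)

def top_overlap(theta, size, reps=4):
    """Mean squared overlap between the planted spike and the top eigenvector."""
    vals = []
    for _ in range(reps):
        v = rng.standard_normal(size)
        v /= np.linalg.norm(v)
        W = rng.standard_normal((size, size))
        W = (W + W.T) / np.sqrt(2 * size)        # Wigner noise, bulk edge at 2
        M = theta * np.outer(v, v) + W
        top = np.linalg.eigh(M)[1][:, -1]        # eigenvector of the largest eigenvalue
        vals.append(np.dot(top, v) ** 2)
    return float(np.mean(vals))

thetas = np.linspace(0.5, 1.5, 11)
for size in (100, 200, 400):
    curve = [top_overlap(t, size) for t in thetas]
    print(size, np.round(curve, 2))
# In the infinite-size limit the overlap is 0 below theta = 1 and 1 - 1/theta^2 above;
# at finite size the jump is smoothed, and the transition steepens as size grows.
```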
Circularity Check
No significant circularity identified
full rationale
The derivation begins from an explicit model assumption that the target is a combination of latent compositional features with power-law decaying weights. Sharp per-feature recovery thresholds are obtained via resolvent perturbation and random-matrix analysis, with matching upper and lower bounds. Aggregating these thresholds then produces an explicit power-law decay for the prediction error as a direct summation over the ordered transitions. This step is a standard analytic consequence of the input power-law weights and the derived thresholds; it does not involve fitting the final exponent, renaming a known result, or reducing to a self-citation chain. No load-bearing self-citations, self-definitional steps, or fitted inputs presented as predictions appear in the chain. The result remains self-contained once the hierarchical model is granted.
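A compressed version of that summation, under the same assumed forms (a_i ∝ i^{-γ} for the weights and n_i ≍ d^q / a_i² for the thresholds, as quoted in the Lean-theorem excerpt below); the final step requires γ > 1/2:

\[
\mathrm{MSE}(n) \;\asymp\; \sum_{i\,:\,n_i > n} a_i^2
\;\asymp\; \sum_{i > (n/d^q)^{1/(2\gamma)}} i^{-2\gamma}
\;\asymp\; \left(\frac{n}{d^q}\right)^{-1+\frac{1}{2\gamma}}.
\]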
Axiom & Free-Parameter Ledger
free parameters (1)
- power-law exponent of feature weights
axioms (1)
- standard math: high-dimensional limit with random matrix concentration for eigenvector recovery
invented entities (1)
- latent compositional features with power-law weights (no independent evidence)
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Paper passage: "We prove sharp feature-wise recovery thresholds and show that aggregating these transitions yields an explicit power-law decay of the prediction error... n_i ≍ d^q / a_i² ... MSE(n) ≍ (n/d^q)^{-1+1/(2γ)}"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Paper passage: "the target admits a representation as a combination of latent compositional features whose weights decrease as a power law"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.