pith. machine review for the scientific record.

arxiv: 2605.15082 · v1 · submitted 2026-05-14 · 📊 stat.ML · cs.LG · math.ST · stat.TH

Recognition: 2 Lean theorem links

Average Gradient Outer Product in kernel regression provably recovers the central subspace for multi-index models

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 03:05 UTC · model grok-4.3

classification 📊 stat.ML · cs.LG · math.ST · stat.TH
keywords kernel ridge regression · average gradient outer product · central subspace · multi-index models · subspace recovery · sample complexity · representation learning

The pith

The average gradient outer product from kernel ridge regression recovers the central subspace of multi-index models in sample regimes too small for accurate prediction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that fitting kernel ridge regression to samples from a multi-index polynomial and extracting the average gradient outer product of the fitted function produces a matrix whose leading eigenvectors span the unknown low-dimensional subspace on which the target depends. Recovery holds even when the number of samples is far below what is needed for the predictor to achieve small error, because a low-degree component of the target already encodes the relevant directions. This creates an explicit separation between learning a representation and achieving accurate function approximation. The result supplies a theoretical basis for why iterative kernel procedures can extract useful structure efficiently in practice.

Core claim

For a target function of the form f^*(x) = h(Ux), where U is an unknown r × d matrix with r much smaller than d, the top r-dimensional eigenspace of the average gradient outer product computed from a kernel ridge regression fit recovers the central subspace, i.e. the row space of U. When a degree-p component of f^* already carries the relevant directions, this recovery occurs with n on the order of d^{p+δ} samples for any δ in (0,1), whereas accurate prediction requires n on the order of d^{p^*}, where p^* is the full degree of f^*.
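A compact restatement of the claim (a paraphrase in the abstract's notation, not the paper's theorem statement; the sin Θ subspace distance matches the quantity plotted in the figures):

```latex
% Setting: f^*(x) = h(Ux), U in R^{r x d}, r << d; \hat f is the KRR fit on n samples.
\mathrm{AGOP}(\hat f) \;=\; \frac{1}{n}\sum_{i=1}^{n} \nabla \hat f(x_i)\,\nabla \hat f(x_i)^{\top}
\;\in\; \mathbb{R}^{d\times d}.
% Claimed separation: accurate prediction needs n \asymp d^{p^*}, but if the
% degree-p component of f^* already carries all relevant directions, then for any
% \delta \in (0,1), with \widehat{U} spanning the top-r eigenspace of AGOP(\hat f),
n \asymp d^{\,p+\delta}
\quad\Longrightarrow\quad
\bigl\|\sin\Theta\bigl(\widehat{U},\,\mathrm{row}(U)\bigr)\bigr\|_{\mathrm{op}} \;=\; o_d(1).
```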

What carries the argument

Average Gradient Outer Product (AGOP) of the kernel ridge regression predictor, defined as the average over training points of the outer product of its gradient with itself; its leading r-dimensional eigenspace aligns with the central subspace.
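A minimal sketch of this object in code (not the paper's implementation): fit KRR with a Gaussian kernel and form the AGOP from the closed-form gradient of the fitted predictor. The bandwidth sigma, the ridge parameter lam, and the convention alpha = (K + n·lam·I)^{-1} y are illustrative assumptions.

```python
# Minimal sketch, not the paper's code: AGOP of a Gaussian-kernel KRR fit.
import numpy as np

def gaussian_kernel(A, B, sigma):
    # k(a, b) = exp(-||a - b||^2 / (2 sigma^2)), returned as an |A| x |B| matrix
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma ** 2))

def krr_fit(X, y, sigma, lam):
    # dual coefficients of kernel ridge regression: alpha = (K + n*lam*I)^{-1} y
    n = len(X)
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + n * lam * np.eye(n), y)

def agop(X_train, alpha, X_eval, sigma):
    # f_hat(x) = sum_j alpha_j k(x, x_j); for the Gaussian kernel
    #   grad f_hat(x) = sum_j alpha_j k(x, x_j) (x_j - x) / sigma^2
    K = gaussian_kernel(X_eval, X_train, sigma)          # m x n
    W = K * alpha[None, :]                               # m x n weights alpha_j k(x_i, x_j)
    grads = (W @ X_train - W.sum(1, keepdims=True) * X_eval) / sigma ** 2  # m x d
    return grads.T @ grads / len(X_eval)                 # d x d average of grad grad^T
```

With these helpers, the check sketched under "What would settle it" below is only a few extra lines.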

If this is right

  • Subspace identification succeeds with far fewer samples than those required for low prediction error.
  • Iterative procedures that repeatedly apply AGOP can extract features sample-efficiently without first solving the full regression problem; a minimal sketch of such a loop follows this list.
  • Representation recovery and function approximation are separable tasks under multi-index structure.
  • The same mechanism explains why recursive feature machines remain practical even when the kernel bandwidth has not yet reached the interpolation regime.
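Below is a minimal sketch of one such iterative loop in the spirit of recursive feature machines, under assumptions of our own: a Gaussian kernel reweighted by a Mahalanobis matrix M, with M reset to the trace-normalized AGOP after each fit. Actual RFM implementations differ in the kernel family, the normalization, and whether a matrix power of the AGOP is used.

```python
# Minimal sketch of an RFM-style loop (illustrative, not the paper's implementation).
import numpy as np

def mahalanobis_gaussian_kernel(A, B, M, sigma):
    # k_M(a, b) = exp(-(a - b)^T M (a - b) / (2 sigma^2)) for symmetric PSD M
    diff = A[:, None, :] - B[None, :, :]
    sq = np.einsum('ijk,kl,ijl->ij', diff, M, diff)
    return np.exp(-sq / (2 * sigma ** 2))

def rfm(X, y, sigma=1.0, lam=1e-3, iters=5):
    n, d = X.shape
    M = np.eye(d)                                     # start from the isotropic metric
    for _ in range(iters):
        K = mahalanobis_gaussian_kernel(X, X, M, sigma)
        alpha = np.linalg.solve(K + n * lam * np.eye(n), y)
        # grad f_hat(x_i) = (1/sigma^2) M sum_j alpha_j K_ij (x_j - x_i)
        W = K * alpha[None, :]
        grads = (W @ X - W.sum(1, keepdims=True) * X) @ M.T / sigma ** 2
        M = grads.T @ grads / n                       # AGOP becomes the next metric
        M = M / max(np.trace(M), 1e-12)               # illustrative normalization
    return M, alpha
```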

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Gradient outer products may serve as an early-training diagnostic for hidden low-dimensional structure in other nonparametric estimators.
  • The separation could extend to neural networks if their early gradient statistics similarly isolate relevant directions before loss decreases substantially.
  • Designing kernels or regularization that amplify low-degree directional signals might further reduce the sample threshold for subspace recovery.

Load-bearing premise

A low-degree component of the target function already contains all the directions needed for the central subspace, together with standard regularity conditions on the kernel and the multi-index structure.
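One way to spell the premise out (the projection notation below is ours; the paper's formal condition may differ): writing P_{≤p} f^* for the degree-at-most-p component of f^* in the orthogonal-polynomial expansion for the input distribution, the requirement is that the low-degree part alone already has full directional signal on the central subspace.

```latex
% Hypothesized form of the premise (our paraphrase, not the paper's statement):
\mathrm{span}\Bigl(
  \mathbb{E}_{x}\bigl[\nabla (P_{\le p} f^{*})(x)\,\nabla (P_{\le p} f^{*})(x)^{\top}\bigr]
\Bigr)
\;=\; \mathrm{row}(U),
% i.e. the expected gradient outer product of the degree-<= p component has rank r and
% spans the central subspace, so no degree > p term is needed to identify the directions.
```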

What would settle it

Generate synthetic data from a known multi-index polynomial of degree p^* greater than p, fit kernel ridge regression with n ≈ d^{p+δ} for a small δ in (0,1), compute the top-r AGOP eigenspace, and check whether the sine of the largest principal angle between it and the row space of U falls below a small constant.
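A toy version of this check follows; the scales, target function, bandwidth, and ridge parameter are illustrative choices, not taken from the paper. Here the degree-2 term z_1·z_2 already carries both relevant directions, while the cubic term pushes the full degree to p^* = 3.

```python
# Toy falsification check (illustrative scales; not the paper's experiment).
import numpy as np

rng = np.random.default_rng(0)
d, r, n = 30, 2, 2500                               # n ~ d^{2.3}, between d^p (p=2) and d^{p*} (p*=3)
Q, _ = np.linalg.qr(rng.standard_normal((d, r)))
U = Q.T                                             # r x d with orthonormal rows (central subspace)
X = rng.standard_normal((n, d))
z = X @ U.T
y = z[:, 0] * z[:, 1] + 0.5 * z[:, 0] ** 3          # degree-2 part already spans both directions

sigma, lam = np.sqrt(d), 1e-3                       # illustrative bandwidth / ridge choices
sq = (X ** 2).sum(1)[:, None] + (X ** 2).sum(1)[None, :] - 2 * X @ X.T
K = np.exp(-sq / (2 * sigma ** 2))                  # Gaussian kernel matrix
alpha = np.linalg.solve(K + n * lam * np.eye(n), y)

W = K * alpha[None, :]                              # gradients of the KRR fit at the training points
grads = (W @ X - W.sum(1, keepdims=True) * X) / sigma ** 2
A = grads.T @ grads / n                             # AGOP, d x d

V = np.linalg.eigh(A)[1][:, -r:]                    # top-r eigenspace estimate
sin_theta = np.linalg.norm(Q @ Q.T - V @ V.T, 2)    # operator norm = sin of largest principal angle
print("||sin Theta||_op =", sin_theta)              # recovery succeeds if this is small
```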

Figures

Figures reproduced from arXiv: 2605.15082 by Damek Davis, Dmitriy Drusvyatskiy, Libin Zhu, Maryam Fazel.

All twelve figures share the same layout: RFM is trained for five iterations, and the panels display, from left to right, the test loss, the sine of the largest principal angle ||sin Θ(U, Û)||_op, and the largest, second-largest, and third-largest eigenvalues of the AGOP, plotted over iterations 0-4 for a range of sample-size exponents.

Figure 1. Target function (L1), RFM with a Gaussian kernel on hypercube data.
Figure 2. Target function (L2), RFM with a Gaussian kernel on hypercube data.
Figure 3. Target function (L1), RFM with a Gaussian kernel on dense Gaussian data.
Figure 4. Target function (L2), RFM with a Gaussian kernel on dense Gaussian data.
Figure 5. Target function (L1), RFM with a Gaussian kernel on sparse hypercube data.
Figure 6. Target function (L2), RFM with a Gaussian kernel on sparse hypercube data.
Figure 7. Target function (L1), RFM with a Laplacian kernel on dense Gaussian data.
Figure 8. Target function (L2), RFM with a Laplacian kernel on dense Gaussian data.
Figure 9. Target function (L1), RFM with a Laplacian kernel on dense hypercube data.
Figure 10. Target function (L2), RFM with a Laplacian kernel on dense hypercube data.
Figure 11. Target function (L1), RFM with a Laplacian kernel on sparse hypercube data.
Figure 12. Target function (L2), RFM with a Laplacian kernel on sparse hypercube data.
read the original abstract

We study a prototypical situation when a learned predictor can discover useful low-dimensional structure in data, while using fewer samples than are needed for accurate prediction. Specifically, we consider the problem of recovering a multi-index polynomial $f^*(x)=h(Ux)$, with $U\in\mathbb{R}^{r\times d}$ and $r\ll d$, from finitely many data/label pairs. Importantly, the target function depends on input $x$ only through the projection onto an unknown $r$-dimensional central subspace. The algorithm we analyze is appealingly simple: fit kernel ridge regression (KRR) to the data and compute the Average Gradient Outer Product (AGOP) from the fitted predictor. Our main results show that under reasonable assumptions the top $r$-dimensional eigenspace of AGOP provably recovers the central subspace, even in regimes when the prediction error remains large. Specifically, if the target function $f^*$ has degree $p^*$, it is known that $n\asymp d^{p^*}$ samples are necessary for KRR to achieve accurate prediction. In contrast, we show that if a low degree $p$ component of $f^*$ already carries all relevant directions for prediction, subspace recovery occurs in the much lower sample regime $n\asymp d^{p+\delta}$ for any $\delta\in(0,1)$. Our results thus demonstrate a separation between prediction and representation, and provide an explanation for why iterative kernel methods such as Recursive Feature Machines (RFM) can be sample-efficient in practice.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper studies recovering the central r-dimensional subspace of multi-index models f^*(x) = h(Ux) (r ≪ d) by computing the Average Gradient Outer Product (AGOP) of a kernel ridge regression (KRR) estimator. It claims that, under reasonable assumptions on the kernel and multi-index structure, the top-r eigenspace of AGOP(f̂) recovers span(U) at sample sizes n ≍ d^{p+δ} (any δ ∈ (0,1)), where p is the degree of a low-degree component of h that already contains all relevant directions, even in regimes where the KRR prediction error remains large.

Significance. If the central claim holds, the result supplies a concrete separation between subspace recovery and accurate prediction for kernel methods, offering a theoretical account for the practical sample efficiency of iterative procedures such as Recursive Feature Machines. It would also furnish falsifiable predictions about the sample complexity at which representation learning succeeds for polynomial multi-index targets.

major comments (2)
  1. [Abstract] The phrase 'reasonable assumptions' is used without an explicit list. The proof that the population AGOP of the finite-sample KRR estimator remains close to the AGOP of the degree-p projection of f* (with the perturbation controlled at a quantified rate in n) requires concrete bounds on the Mercer eigenvalues, the RKHS projection of f* onto degree > p polynomials, and the operator norm of the gradient map on the orthogonal complement; without these the claimed eigenspace alignment at n ≍ d^{p+δ} cannot be verified.
  2. [Main result] The main theorem on AGOP recovery: the argument that the top-r eigenspace of AGOP(f̂) recovers span(U) while the prediction error stays large must explicitly control higher-degree leakage in the KRR fit. If the kernel allows non-negligible mass on degree > p terms, the gradient outer-product perturbation need not vanish at the stated rate, undermining the separation between representation and prediction.
minor comments (1)
  1. [Abstract] Notation: distinguish p* (full degree of f*) from p (degree of the relevant component) more clearly in the abstract and introduction.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and detailed report. The comments highlight important points on making assumptions explicit and strengthening the control of higher-degree terms. We will revise the manuscript to address both issues directly while preserving the core claims.

read point-by-point responses
  1. Referee: [Abstract] The phrase 'reasonable assumptions' is used without an explicit list. The proof that the population AGOP of the finite-sample KRR estimator remains close to the AGOP of the degree-p projection of f* (with the perturbation controlled at a quantified rate in n) requires concrete bounds on the Mercer eigenvalues, the RKHS projection of f* onto degree > p polynomials, and the operator norm of the gradient map on the orthogonal complement; without these the claimed eigenspace alignment at n ≍ d^{p+δ} cannot be verified.

    Authors: We agree the abstract phrasing is imprecise. In the revision we will replace 'reasonable assumptions' with an explicit list of the three conditions used in the proofs: (i) the Mercer eigenvalues of the kernel decay at rate at least k^{-α} for α > 1, (ii) the RKHS norm of the projection of f* onto degree > p polynomials is bounded by a constant independent of d, and (iii) the gradient operator is Lipschitz on the orthogonal complement with constant independent of d. The existing proof already derives the perturbation rate O(n^{-1/2} + λ) under these conditions; we will add a short lemma (new Lemma 3.4) that isolates the operator-norm bound on the orthogonal complement and states the explicit constants, making the n ≍ d^{p+δ} claim directly verifiable from the stated hypotheses. revision: yes

  2. Referee: [Main result] The main theorem on AGOP recovery: the argument that the top-r eigenspace of AGOP(f̂) recovers span(U) while the prediction error stays large must explicitly control higher-degree leakage in the KRR fit. If the kernel allows non-negligible mass on degree > p terms, the gradient outer-product perturbation need not vanish at the stated rate, undermining the separation between representation and prediction.

    Authors: The analysis already restricts to kernels whose RKHS inner products on degree > p monomials are controlled by the regularization λ (chosen as λ ≍ d^{-p} in the regime of interest). The higher-degree leakage in the AGOP is bounded by O(λ + n^{-1/2}) via the same concentration argument used for the low-degree part; this term is o(1) precisely when n ≍ d^{p+δ}. To make the control fully explicit we will insert a new proposition (Proposition 4.2) that decomposes AGOP(f̂) = AGOP(low-degree projection) + error term, with the error term bounded in operator norm by the RKHS mass on degree > p times a factor that vanishes at the claimed rate. This decomposition preserves the separation: accurate prediction requires controlling all degrees simultaneously, while subspace recovery only needs the low-degree directions to dominate the AGOP. revision: yes
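Schematically, the decomposition the rebuttal proposes could take the following shape (our rendering of the rebuttal, not a statement from the paper):

```latex
% Schematic form of the proposed Proposition 4.2 (paraphrase of the rebuttal above):
\mathrm{AGOP}(\hat f)
  \;=\; \mathrm{AGOP}\bigl(P_{\le p} f^{*}\bigr) \;+\; E,
\qquad
\|E\|_{\mathrm{op}} \;\lesssim\; \bigl\|P_{> p} f^{*}\bigr\|_{\mathcal{H}} \cdot \bigl(\lambda + n^{-1/2}\bigr).
% With \lambda \asymp d^{-p} and n \asymp d^{p+\delta} the error term is o_d(1), so the
% top-r eigenspace of AGOP(\hat f) is governed by the low-degree term, which spans row(U).
```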

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained under external assumptions

full rationale

The paper derives that the top-r eigenspace of AGOP(f̂) from KRR recovers the central subspace of a multi-index model f* = h(Ux) at n ≍ d^{p+δ} (with p < p* the degree carrying the directions), even when the prediction error is large. This follows from an analysis of the population AGOP operator applied to the KRR estimator, controlling its deviation from the AGOP of the degree-p projection of f* via Mercer eigenvalue decay and gradient bounds on the orthogonal complement. No step reduces the claim to a fitted parameter renamed as a prediction, relies on a self-definitional equivalence, or leans on an unverified load-bearing self-citation; the assumptions on the kernel and multi-index structure are stated as external and sufficient for the separation result. The derivation therefore does not presuppose its conclusion.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the target being a multi-index polynomial whose informative part has degree p, plus standard kernel and concentration assumptions that are not enumerated in the abstract.

axioms (2)
  • domain assumption: the target function is a multi-index polynomial of known degree structure (stated in the abstract as f*(x) = h(Ux))
  • domain assumption: kernel ridge regression is performed with a suitable kernel (the algorithm analyzed is KRR)

pith-pipeline@v0.9.0 · 5598 in / 1277 out tokens · 47033 ms · 2026-05-15T03:05:53.439476+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
