pith. machine review for the scientific record.

arxiv: 2605.15082 · v1 · submitted 2026-05-14 · 📊 stat.ML · cs.LG · math.ST · stat.TH

Recognition: 2 Lean theorem links

Average Gradient Outer Product in kernel regression provably recovers the central subspace for multi-index models

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 03:05 UTC · model grok-4.3

classification 📊 stat.ML · cs.LG · math.ST · stat.TH
keywords kernel ridge regression · average gradient outer product · central subspace · multi-index models · subspace recovery · sample complexity · representation learning

The pith

The average gradient outer product from kernel ridge regression recovers the central subspace of multi-index models in sample regimes too small for accurate prediction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that fitting kernel ridge regression to samples from a multi-index polynomial and extracting the average gradient outer product of the fitted function produces a matrix whose leading eigenvectors span the unknown low-dimensional subspace on which the target depends. Recovery holds even when the number of samples is far below what is needed for the predictor to achieve small error, because a low-degree component of the target already encodes the relevant directions. This creates an explicit separation between learning a representation and achieving accurate function approximation. The result supplies a theoretical basis for why iterative kernel procedures can extract useful structure efficiently in practice.

Core claim

For a target function of the form f^*(x) = h(Ux), where U is an unknown r × d matrix with r much smaller than d, the top r-dimensional eigenspace of the average gradient outer product computed from a kernel ridge regression fit recovers the central subspace, i.e. the row space of U. When a degree-p component of f^* already carries the relevant directions, this recovery occurs with n on the order of d^{p+δ} samples for any δ in (0,1), whereas accurate prediction requires n on the order of d^{p^*}, where p^* is the full degree of f^*.
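A compact restatement of the claim (a paraphrase in the abstract's notation, not the paper's theorem statement; the sin Θ subspace distance matches the quantity plotted in the figures):

```latex
% Setting: f^*(x) = h(Ux), U in R^{r x d}, r << d; \hat f is the KRR fit on n samples.
\mathrm{AGOP}(\hat f) \;=\; \frac{1}{n}\sum_{i=1}^{n} \nabla \hat f(x_i)\,\nabla \hat f(x_i)^{\top}
\;\in\; \mathbb{R}^{d\times d}.
% Claimed separation: accurate prediction needs n \asymp d^{p^*}, but if the
% degree-p component of f^* already carries all relevant directions, then for any
% \delta \in (0,1), with \widehat{U} spanning the top-r eigenspace of AGOP(\hat f),
n \asymp d^{\,p+\delta}
\quad\Longrightarrow\quad
\bigl\|\sin\Theta\bigl(\widehat{U},\,\mathrm{row}(U)\bigr)\bigr\|_{\mathrm{op}} \;=\; o_d(1).
```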

What carries the argument

Average Gradient Outer Product (AGOP) of the kernel ridge regression predictor, defined as the average over training points of the outer product of its gradient with itself; its leading r-dimensional eigenspace aligns with the central subspace.
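A minimal sketch of this object in code (not the paper's implementation): fit KRR with a Gaussian kernel and form the AGOP from the closed-form gradient of the fitted predictor. The bandwidth sigma, the ridge parameter lam, and the convention alpha = (K + n·lam·I)^{-1} y are illustrative assumptions.

```python
# Minimal sketch, not the paper's code: AGOP of a Gaussian-kernel KRR fit.
import numpy as np

def gaussian_kernel(A, B, sigma):
    # k(a, b) = exp(-||a - b||^2 / (2 sigma^2)), returned as an |A| x |B| matrix
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma ** 2))

def krr_fit(X, y, sigma, lam):
    # dual coefficients of kernel ridge regression: alpha = (K + n*lam*I)^{-1} y
    n = len(X)
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + n * lam * np.eye(n), y)

def agop(X_train, alpha, X_eval, sigma):
    # f_hat(x) = sum_j alpha_j k(x, x_j); for the Gaussian kernel
    #   grad f_hat(x) = sum_j alpha_j k(x, x_j) (x_j - x) / sigma^2
    K = gaussian_kernel(X_eval, X_train, sigma)          # m x n
    W = K * alpha[None, :]                               # m x n weights alpha_j k(x_i, x_j)
    grads = (W @ X_train - W.sum(1, keepdims=True) * X_eval) / sigma ** 2  # m x d
    return grads.T @ grads / len(X_eval)                 # d x d average of grad grad^T
```

With these helpers, the check sketched under "What would settle it" below is only a few extra lines.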

If this is right

  • Subspace identification succeeds with far fewer samples than those required for low prediction error.
  • Iterative procedures that repeatedly apply AGOP can extract features sample-efficiently without first solving the full regression problem; a minimal sketch of such a loop follows this list.
  • Representation recovery and function approximation are separable tasks under multi-index structure.
  • The same mechanism explains why recursive feature machines remain practical even when the kernel bandwidth has not yet reached the interpolation regime.
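Below is a minimal sketch of one such iterative loop in the spirit of recursive feature machines, under assumptions of our own: a Gaussian kernel reweighted by a Mahalanobis matrix M, with M reset to the trace-normalized AGOP after each fit. Actual RFM implementations differ in the kernel family, the normalization, and whether a matrix power of the AGOP is used.

```python
# Minimal sketch of an RFM-style loop (illustrative, not the paper's implementation).
import numpy as np

def mahalanobis_gaussian_kernel(A, B, M, sigma):
    # k_M(a, b) = exp(-(a - b)^T M (a - b) / (2 sigma^2)) for symmetric PSD M
    diff = A[:, None, :] - B[None, :, :]
    sq = np.einsum('ijk,kl,ijl->ij', diff, M, diff)
    return np.exp(-sq / (2 * sigma ** 2))

def rfm(X, y, sigma=1.0, lam=1e-3, iters=5):
    n, d = X.shape
    M = np.eye(d)                                     # start from the isotropic metric
    for _ in range(iters):
        K = mahalanobis_gaussian_kernel(X, X, M, sigma)
        alpha = np.linalg.solve(K + n * lam * np.eye(n), y)
        # grad f_hat(x_i) = (1/sigma^2) M sum_j alpha_j K_ij (x_j - x_i)
        W = K * alpha[None, :]
        grads = (W @ X - W.sum(1, keepdims=True) * X) @ M.T / sigma ** 2
        M = grads.T @ grads / n                       # AGOP becomes the next metric
        M = M / max(np.trace(M), 1e-12)               # illustrative normalization
    return M, alpha
```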

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Gradient outer products may serve as an early-training diagnostic for hidden low-dimensional structure in other nonparametric estimators.
  • The separation could extend to neural networks if their early gradient statistics similarly isolate relevant directions before loss decreases substantially.
  • Designing kernels or regularization that amplify low-degree directional signals might further reduce the sample threshold for subspace recovery.

Load-bearing premise

A low-degree component of the target function already contains all the directions needed for the central subspace, together with standard regularity conditions on the kernel and the multi-index structure.
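One way to spell the premise out (the projection notation below is ours; the paper's formal condition may differ): writing P_{≤p} f^* for the degree-at-most-p component of f^* in the orthogonal-polynomial expansion for the input distribution, the requirement is that the low-degree part alone already has full directional signal on the central subspace.

```latex
% Hypothesized form of the premise (our paraphrase, not the paper's statement):
\mathrm{span}\Bigl(
  \mathbb{E}_{x}\bigl[\nabla (P_{\le p} f^{*})(x)\,\nabla (P_{\le p} f^{*})(x)^{\top}\bigr]
\Bigr)
\;=\; \mathrm{row}(U),
% i.e. the expected gradient outer product of the degree-<= p component has rank r and
% spans the central subspace, so no degree > p term is needed to identify the directions.
```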

What would settle it

Generate synthetic data from a known multi-index polynomial of degree p^* greater than p, fit kernel ridge regression with n ≈ d^{p+δ} for a small δ in (0,1), compute the top-r AGOP eigenspace, and check whether the sine of the largest principal angle between it and the row space of U falls below a small constant.
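A toy version of this check follows; the scales, target function, bandwidth, and ridge parameter are illustrative choices, not taken from the paper. Here the degree-2 term z_1·z_2 already carries both relevant directions, while the cubic term pushes the full degree to p^* = 3.

```python
# Toy falsification check (illustrative scales; not the paper's experiment).
import numpy as np

rng = np.random.default_rng(0)
d, r, n = 30, 2, 2500                               # n ~ d^{2.3}, between d^p (p=2) and d^{p*} (p*=3)
Q, _ = np.linalg.qr(rng.standard_normal((d, r)))
U = Q.T                                             # r x d with orthonormal rows (central subspace)
X = rng.standard_normal((n, d))
z = X @ U.T
y = z[:, 0] * z[:, 1] + 0.5 * z[:, 0] ** 3          # degree-2 part already spans both directions

sigma, lam = np.sqrt(d), 1e-3                       # illustrative bandwidth / ridge choices
sq = (X ** 2).sum(1)[:, None] + (X ** 2).sum(1)[None, :] - 2 * X @ X.T
K = np.exp(-sq / (2 * sigma ** 2))                  # Gaussian kernel matrix
alpha = np.linalg.solve(K + n * lam * np.eye(n), y)

W = K * alpha[None, :]                              # gradients of the KRR fit at the training points
grads = (W @ X - W.sum(1, keepdims=True) * X) / sigma ** 2
A = grads.T @ grads / n                             # AGOP, d x d

V = np.linalg.eigh(A)[1][:, -r:]                    # top-r eigenspace estimate
sin_theta = np.linalg.norm(Q @ Q.T - V @ V.T, 2)    # operator norm = sin of largest principal angle
print("||sin Theta||_op =", sin_theta)              # recovery succeeds if this is small
```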

Figures

Figures reproduced from arXiv: 2605.15082 by Damek Davis, Dmitriy Drusvyatskiy, Libin Zhu, Maryam Fazel.

All twelve figures share the same layout: RFM is trained for five iterations, and the panels display, from left to right, the test loss, the sine of the largest principal angle ||sin Θ(U, Û)||_op, and the largest, second-largest, and third-largest eigenvalues of the AGOP, plotted over iterations 0-4 for a range of sample-size exponents.

Figure 1. Target function (L1), RFM with a Gaussian kernel on hypercube data.
Figure 2. Target function (L2), RFM with a Gaussian kernel on hypercube data.
Figure 3. Target function (L1), RFM with a Gaussian kernel on dense Gaussian data.
Figure 4. Target function (L2), RFM with a Gaussian kernel on dense Gaussian data.
Figure 5. Target function (L1), RFM with a Gaussian kernel on sparse hypercube data.
Figure 6. Target function (L2), RFM with a Gaussian kernel on sparse hypercube data.
Figure 7. Target function (L1), RFM with a Laplacian kernel on dense Gaussian data.
Figure 8. Target function (L2), RFM with a Laplacian kernel on dense Gaussian data.
Figure 9. Target function (L1), RFM with a Laplacian kernel on dense hypercube data.
Figure 10. Target function (L2), RFM with a Laplacian kernel on dense hypercube data.
Figure 11. Target function (L1), RFM with a Laplacian kernel on sparse hypercube data.
Figure 12. Target function (L2), RFM with a Laplacian kernel on sparse hypercube data.
read the original abstract

We study a prototypical situation when a learned predictor can discover useful low-dimensional structure in data, while using fewer samples than are needed for accurate prediction. Specifically, we consider the problem of recovering a multi-index polynomial $f^*(x)=h(Ux)$, with $U\in\mathbb{R}^{r\times d}$ and $r\ll d$, from finitely many data/label pairs. Importantly, the target function depends on input $x$ only through the projection onto an unknown $r$-dimensional central subspace. The algorithm we analyze is appealingly simple: fit kernel ridge regression (KRR) to the data and compute the Average Gradient Outer Product (AGOP) from the fitted predictor. Our main results show that under reasonable assumptions the top $r$-dimensional eigenspace of AGOP provably recovers the central subspace, even in regimes when the prediction error remains large. Specifically, if the target function $f^*$ has degree $p^*$, it is known that $n\asymp d^{p^*}$ samples are necessary for KRR to achieve accurate prediction. In contrast, we show that if a low degree $p$ component of $f^*$ already carries all relevant directions for prediction, subspace recovery occurs in the much lower sample regime $n\asymp d^{p+\delta}$ for any $\delta\in(0,1)$. Our results thus demonstrate a separation between prediction and representation, and provide an explanation for why iterative kernel methods such as Recursive Feature Machines (RFM) can be sample-efficient in practice.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper studies recovering the central r-dimensional subspace of multi-index models f^*(x) = h(Ux) (r ≪ d) by computing the Average Gradient Outer Product (AGOP) of a kernel ridge regression (KRR) estimator. It claims that, under reasonable assumptions on the kernel and multi-index structure, the top-r eigenspace of AGOP(f̂) recovers span(U) at sample sizes n ≍ d^{p+δ} (any δ ∈ (0,1)), where p is the degree of a low-degree component of h that already contains all relevant directions, even in regimes where the KRR prediction error remains large.

Significance. If the central claim holds, the result supplies a concrete separation between subspace recovery and accurate prediction for kernel methods, offering a theoretical account for the practical sample efficiency of iterative procedures such as Recursive Feature Machines. It would also furnish falsifiable predictions about the sample complexity at which representation learning succeeds for polynomial multi-index targets.

major comments (2)
  1. [Abstract] The phrase 'reasonable assumptions' is used without an explicit list. The proof that the population AGOP of the finite-sample KRR estimator remains close to the AGOP of the degree-p projection of f* (with the perturbation controlled at a quantified rate in n) requires concrete bounds on the Mercer eigenvalues, the RKHS projection of f* onto degree > p polynomials, and the operator norm of the gradient map on the orthogonal complement; without these the claimed eigenspace alignment at n ≍ d^{p+δ} cannot be verified.
  2. [Main result] The main theorem on AGOP recovery: the argument that the top-r eigenspace of AGOP(f̂) recovers span(U) while the prediction error stays large must explicitly control higher-degree leakage in the KRR fit. If the kernel allows non-negligible mass on degree > p terms, the gradient outer-product perturbation need not vanish at the stated rate, undermining the separation between representation and prediction.
minor comments (1)
  1. [Abstract] Notation: distinguish p* (full degree of f*) from p (degree of the relevant component) more clearly in the abstract and introduction.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and detailed report. The comments highlight important points on making assumptions explicit and strengthening the control of higher-degree terms. We will revise the manuscript to address both issues directly while preserving the core claims.

read point-by-point responses
  1. Referee: [Abstract] The phrase 'reasonable assumptions' is used without an explicit list. The proof that the population AGOP of the finite-sample KRR estimator remains close to the AGOP of the degree-p projection of f* (with the perturbation controlled at a quantified rate in n) requires concrete bounds on the Mercer eigenvalues, the RKHS projection of f* onto degree > p polynomials, and the operator norm of the gradient map on the orthogonal complement; without these the claimed eigenspace alignment at n ≍ d^{p+δ} cannot be verified.

    Authors: We agree the abstract phrasing is imprecise. In the revision we will replace 'reasonable assumptions' with an explicit list of the three conditions used in the proofs: (i) the Mercer eigenvalues of the kernel decay at rate at least k^{-α} for α > 1, (ii) the RKHS norm of the projection of f* onto degree > p polynomials is bounded by a constant independent of d, and (iii) the gradient operator is Lipschitz on the orthogonal complement with constant independent of d. The existing proof already derives the perturbation rate O(n^{-1/2} + λ) under these conditions; we will add a short lemma (new Lemma 3.4) that isolates the operator-norm bound on the orthogonal complement and states the explicit constants, making the n ≍ d^{p+δ} claim directly verifiable from the stated hypotheses. revision: yes

  2. Referee: [Main result] The main theorem on AGOP recovery: the argument that the top-r eigenspace of AGOP(f̂) recovers span(U) while the prediction error stays large must explicitly control higher-degree leakage in the KRR fit. If the kernel allows non-negligible mass on degree > p terms, the gradient outer-product perturbation need not vanish at the stated rate, undermining the separation between representation and prediction.

    Authors: The analysis already restricts to kernels whose RKHS inner products on degree > p monomials are controlled by the regularization λ (chosen as λ ≍ d^{-p} in the regime of interest). The higher-degree leakage in the AGOP is bounded by O(λ + n^{-1/2}) via the same concentration argument used for the low-degree part; this term is o(1) precisely when n ≍ d^{p+δ}. To make the control fully explicit we will insert a new proposition (Proposition 4.2) that decomposes AGOP(f̂) = AGOP(low-degree projection) + error term, with the error term bounded in operator norm by the RKHS mass on degree > p times a factor that vanishes at the claimed rate. This decomposition preserves the separation: accurate prediction requires controlling all degrees simultaneously, while subspace recovery only needs the low-degree directions to dominate the AGOP. revision: yes
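Schematically, the decomposition the rebuttal proposes could take the following shape (our rendering of the rebuttal, not a statement from the paper):

```latex
% Schematic form of the proposed Proposition 4.2 (paraphrase of the rebuttal above):
\mathrm{AGOP}(\hat f)
  \;=\; \mathrm{AGOP}\bigl(P_{\le p} f^{*}\bigr) \;+\; E,
\qquad
\|E\|_{\mathrm{op}} \;\lesssim\; \bigl\|P_{> p} f^{*}\bigr\|_{\mathcal{H}} \cdot \bigl(\lambda + n^{-1/2}\bigr).
% With \lambda \asymp d^{-p} and n \asymp d^{p+\delta} the error term is o_d(1), so the
% top-r eigenspace of AGOP(\hat f) is governed by the low-degree term, which spans row(U).
```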

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained under external assumptions

full rationale

The paper derives that the top-r eigenspace of AGOP(f̂) from KRR recovers the central subspace of a multi-index model f* = h(Ux) at n ≍ d^{p+δ} (with p < p* the degree carrying the directions), even when the prediction error is large. This follows from an analysis of the population AGOP operator applied to the KRR estimator, controlling its deviation from the AGOP of the degree-p projection of f* via Mercer eigenvalue decay and gradient bounds on the orthogonal complement. No step reduces the claim to a fitted parameter renamed as a prediction, relies on a self-definitional equivalence, or leans on an unverified load-bearing self-citation; the assumptions on the kernel and multi-index structure are stated as external and sufficient for the separation result. The derivation therefore does not presuppose its conclusion.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the target being a multi-index polynomial whose informative part has degree p, plus standard kernel and concentration assumptions that are not enumerated in the abstract.

axioms (2)
  • domain assumption: the target function is a multi-index polynomial of known degree structure (stated in the abstract as f*(x) = h(Ux))
  • domain assumption: kernel ridge regression is performed with a suitable kernel (the algorithm analyzed is KRR)

pith-pipeline@v0.9.0 · 5598 in / 1277 out tokens · 47033 ms · 2026-05-15T03:05:53.439476+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
