pith. machine review for the scientific record.

arxiv: 2605.06314 · v2 · submitted 2026-05-07 · 💻 cs.LG

Recognition: no theorem link

When Does ℓ₂-Boosting Overfit Benignly? High-Dimensional Risk Asymptotics and the ℓ₁ Implicit Bias

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 06:41 UTC · model grok-4.3

classification 💻 cs.LG
keywords benign overfitting · boosting · high-dimensional risk · implicit bias · Lasso equivalence · greedy coordinate selection · risk asymptotics

The pith

In high dimensions, ℓ₂-boosting fails to overfit benignly at the linear rate because greedy selection localizes noise into sparse active sets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper analyzes the high-dimensional risk of continuous-time ℓ₂-boosting over p features and n samples. It establishes that, under an isotropic pure-noise model, the method does not achieve the fast linear-rate benign overfitting known from ℓ₂ geometries. Instead, the ℓ₁ implicit bias of greedy coordinate selection concentrates noise into sparse subsets, so excess variance decays only logarithmically as Θ(σ²/log(p/n)). The analysis also supplies a tuning-free early stopping rule derived from the boosting flow's subdifferential dynamics; under a bounded ℓ₁-path condition this rule recovers the Lasso basic inequality and attains minimax-optimal empirical prediction rates. A sympathetic reader cares because the result clarifies when an implicitly regularized greedy method remains safe in overparameterized regimes and when explicit tuning is still required.
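
For orientation, the flow referred to here is the infinitesimal forward-stagewise limit; a standard way to write it (our normalization, not quoted from the paper) is

    $\rho_j(t) = \frac{1}{n} x_j^\top (y - X\beta(t)), \qquad j = 1, \dots, p$
    $\dot\beta_j(t) = \mathrm{sign}(\rho_j(t)) \cdot \mathbf{1}\{ |\rho_j(t)| = \max_k |\rho_k(t)| \}$ (up to tie-breaking on the argmax set)

Only coordinates carrying the maximal residual correlation move, so the elapsed time equals the ℓ₁ path length of the iterate; that identity is the entry point for the ℓ₁ implicit bias.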

Core claim

Under an isotropic pure-noise model, continuous-time ℓ₂-Boosting does not exhibit benign overfitting at the linear rate. Greedy selection localizes noise into sparse active sets, causing the excess variance to decay at the logarithmic rate Θ(σ²/log(p/n)). For spiked-isotropic designs whose tail dimension r₂ satisfies r₂ ≫ n, risk still converges to zero but only at the slower logarithmic rate Θ(σ²/log(r₂/n)). The non-smooth subdifferential dynamics of the boosting flow yield a tuning-free early stopping rule that, under a bounded ℓ₁-path condition, recovers the Lasso basic inequality and attains the minimax-optimal empirical prediction rate for ℓ₁-bounded signals.

What carries the argument

The continuous-time ℓ₂-boosting flow whose greedy coordinate selection induces an ℓ₁ implicit bias; this bias is resolved analytically by coupling the Convex Gaussian Minimax Theorem with asymptotic expansions of double-sided truncated Gaussian moments.
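
A minimal discrete sketch of this machinery, assuming the usual ε-stagewise approximation whose ε → 0 limit is the continuous flow (function name and step size are ours, not the paper's):

    import numpy as np

    def l2_boost(X, y, eps=1e-3, n_steps=20_000):
        """Epsilon-stagewise approximation of continuous-time l2-boosting:
        each step moves only the coordinate with the largest absolute
        residual correlation, by a fixed infinitesimal amount eps."""
        n, p = X.shape
        beta = np.zeros(p)
        l1_path = 0.0
        for _ in range(n_steps):
            rho = X.T @ (y - X @ beta) / n      # residual correlations
            j = int(np.argmax(np.abs(rho)))     # greedy coordinate selection
            beta[j] += eps * np.sign(rho[j])
            l1_path += eps                      # elapsed time = l1 path length
        return beta, l1_path

Because each update touches a single coordinate, the active set grows slowly and the iterate's ℓ₁ norm is bounded by the elapsed time, mirroring the localization mechanism in the core claim.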

If this is right

  • Excess variance decays only logarithmically instead of linearly with the overparameterization ratio.
  • For designs with large tail dimensions r₂ ≫ n the risk converges to zero but at a logarithmic rate slower than in ridge regression.
  • A simple early stopping rule based on the boosting path length recovers the Lasso basic inequality without parameter tuning (see the sketch after this list).
  • The localization of noise into sparse active sets persists as long as the ℓ₁-path remains bounded.
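
Figure 4 below describes the stopping rule as halting when the maximum residual correlation ρ(t) first drops below a noise threshold 2λₙ. A minimal sketch of that rule, assuming the standard universal choice λₙ = σ√(2 log p / n); the paper's exact definition of λₙ may differ:

    import numpy as np

    def stopped_boost(X, y, sigma, eps=1e-3, max_steps=200_000):
        """Run the epsilon-stagewise flow and halt at the first step where
        max_j |rho_j| <= 2 * lambda_n (the analytical stopping time t*)."""
        n, p = X.shape
        lam = sigma * np.sqrt(2 * np.log(p) / n)  # assumed universal threshold
        beta = np.zeros(p)
        for _ in range(max_steps):
            rho = X.T @ (y - X @ beta) / n
            j = int(np.argmax(np.abs(rho)))
            if np.abs(rho[j]) <= 2 * lam:          # stop before the rebound
                break
            beta[j] += eps * np.sign(rho[j])
        return beta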

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same localization mechanism may continue to operate even when weak signals are present, making boosting inherently sparsity-promoting.
  • In practice, monitoring the cumulative ℓ₁ length of the boosting path could serve as an automatic safeguard against slow noise accumulation (sketched after this list).
  • The contrast with linear-rate benign overfitting in ℓ₂ methods suggests that greedy coordinate algorithms trade faster noise adaptation for built-in sparsity bias.
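
A hypothetical guard for the second point above: abort the flow once the cumulative ℓ₁ path length exceeds a budget B. Both the guard and its calibration are illustrative, not taken from the paper:

    import numpy as np

    def budgeted_boost(X, y, budget, eps=1e-3):
        """Abort the epsilon-stagewise flow once the cumulative l1 path
        length exceeds `budget`, guarding against slow noise accumulation."""
        n, p = X.shape
        beta = np.zeros(p)
        l1_path = 0.0
        while l1_path < budget:
            rho = X.T @ (y - X @ beta) / n
            j = int(np.argmax(np.abs(rho)))
            beta[j] += eps * np.sign(rho[j])
            l1_path += eps
        return beta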

Load-bearing premise

The data must be generated from an isotropic pure-noise model with no underlying signal, and the boosting trajectory must obey a bounded ℓ₁-path condition.

What would settle it

Generate high-dimensional pure-noise data with p ≫ n, run continuous-time ℓ₂-boosting, and check whether the excess prediction risk decays proportionally to 1/log(p/n) rather than linearly with n/p.
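
A minimal sketch of that check, assuming an isotropic Gaussian design so that the excess prediction risk of an estimator β̂ of the zero signal is simply ‖β̂‖²; the sizes, step, and tolerance below are illustrative:

    import numpy as np

    def excess_variance(n, p, sigma=1.0, eps=1e-2, tol=1e-2, max_steps=500_000):
        """Pure-noise experiment: y is noise, boost toward interpolation,
        report ||beta||^2, the excess risk under an isotropic design."""
        rng = np.random.default_rng(0)
        X = rng.standard_normal((n, p))
        y = sigma * rng.standard_normal(n)
        beta = np.zeros(p)
        r = y.copy()
        steps = 0
        while np.linalg.norm(r) / np.sqrt(n) > tol and steps < max_steps:
            rho = X.T @ r / n
            j = int(np.argmax(np.abs(rho)))
            beta[j] += eps * np.sign(rho[j])
            r = y - X @ beta
            steps += 1
        return float(beta @ beta)

    # If the claim holds, these values should scale like 1/log(p/n),
    # not like n/p, as the aspect ratio grows:
    # for p in (800, 3200, 12800):
    #     print(p, excess_variance(n=200, p=p))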

Figures

Figures reproduced from arXiv: 2605.06314 by Jian Li, Ye Su, Yong Liu.

Figure 1
Figure 1: Empirical verification under isotropic Gaussian design. view at source ↗
Figure 2
Figure 2: Empirical evaluation of ℓ₁ risk decay across distinct feature distributions (n = 800). (a) The excess variance trajectories overlap closely, showing insensitivity to feature tail properties. (b) OLS regressions against 1/log(p/n) yield R² > 0.999 for all ensembles, verifying that the logarithmic scaling law is a universal property of the ℓ₁ geometry. view at source ↗
Figure 3
Figure 3: Empirical excess variance under spiked-isotropic designs across multiple system sizes. view at source ↗
Figure 4
Figure 4: Dynamic path analysis of the continuous ℓ₂-Boosting flow (n = 800, p = 1600). The analytical stopping time t* is triggered when the maximum correlation ρ(t) intercepts the noise threshold 2λₙ. This tuning-free stopping rule mathematically separates the safe signal-recovery regime from the subsequent noise-interpolating rebound. The analytically stopped estimator achieves the optimum risk of the LassoCV ba… view at source ↗
Figure 5
Figure 5: Adaptive early stopping of the ℓ₂-Boosting flow (n = 800, p = 1600). The data-driven threshold 2λ̂ₙ, computed via RCV, aligns with the oracle threshold. The resulting algorithmic halt t̂* intercepts the overfitting rebound. The estimator at t̂* matches the optimum risk of the LassoCV baseline and exhibits predictive performance equivalent to the 5-fold CV stopped estimator without requiring repetitive mod… view at source ↗
Figure 6
Figure 6: Empirical evaluation of XGBoost interpolating pure noise (… view at source ↗
read the original abstract

Benign overfitting is well-characterized in $\ell_2$ geometries, but its behavior under the $\ell_1$ implicit bias of greedy ensembles remains challenging. The analytical barrier stems from the non-linear coupling of coordinate selection thresholds, which invalidates standard spectral resolvent tools. To isolate this algorithmic bias, we characterize the high-dimensional risk of continuous-time $\ell_2$-Boosting over $p$ features and $n$ samples. By coupling the Convex Gaussian Minimax Theorem with delicate asymptotic expansions of double-sided truncated Gaussian moments, we analytically resolve the non-smooth $\ell_1$ interpolant. Under an isotropic pure-noise model, we prove that benign overfitting fails at the linear rate: greedy selection localizes noise into sparse active sets, and the excess variance decays at a logarithmic rate $\Theta(\sigma^2/\log(p/n))$ for noise variance $\sigma^2$. We remark that while this localization mechanism should persist in the presence of signals, the exact signal-noise decomposition remains an open problem. For spiked-isotropic designs with $k^*$ head eigenvalues and $r_2 = p - k^*$ tail dimensions, the risk converges to zero when $r_{2} \gg n$, but only at a logarithmic rate $\Theta(\sigma^2/\log(r_2/n))$, which is slower than the linear decay observed in $\ell_2$ geometries. To avoid this slow convergence, we analyze the non-smooth subdifferential dynamics of the boosting flow. This yields a tuning-free early stopping rule that, under a bounded $\ell_1$-path condition, recovers the Lasso basic inequality and attains the minimax-optimal empirical prediction rate for $\ell_1$-bounded signals.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper analyzes the high-dimensional risk of continuous-time ℓ₂-Boosting over p features and n samples. It couples the Convex Gaussian Minimax Theorem (CGMT) with asymptotic expansions of double-sided truncated Gaussian moments to resolve the non-smooth ℓ₁ interpolant induced by greedy coordinate selection. Under an isotropic pure-noise model, it claims that benign overfitting fails at the linear rate: noise localizes into sparse active sets and excess variance decays as Θ(σ²/log(p/n)). For spiked-isotropic designs it obtains a slower logarithmic rate Θ(σ²/log(r₂/n)) when r₂ ≫ n. It further derives a tuning-free early-stopping rule that, under a bounded ℓ₁-path condition, recovers the Lasso basic inequality and minimax-optimal empirical prediction rates.

Significance. If the central claims hold, the work clarifies how the ℓ₁ implicit bias of boosting prevents the linear-rate benign overfitting familiar from ℓ₂ geometries, while still permitting slow (logarithmic) convergence to zero risk. The technical device of coupling CGMT to path-dependent truncated-Gaussian expansions is a genuine advance for analyzing non-smooth greedy flows. The proposed early-stopping rule supplies a practical, tuning-free method for attaining optimal rates under ℓ₁-bounded signals. These contributions would be of clear interest to the high-dimensional statistics and optimization communities.

major comments (2)
  1. [CGMT coupling and moment expansions (proof of the logarithmic excess-risk rate)] The claimed excess-variance rate Θ(σ²/log(p/n)) is obtained by integrating the CGMT representation along the continuous-time boosting path whose coordinate-selection thresholds evolve over an interval of width Θ(log(p/n)). The manuscript invokes asymptotic expansions of double-sided truncated Gaussian moments but supplies no explicit uniform remainder bound over this path-dependent range of truncation levels. If the integrated error is not o(1/log(p/n)), it contaminates the leading term at the same order, rendering the precise rate unproven. This is load-bearing for the main pure-noise theorem.
  2. [analysis of the non-smooth subdifferential dynamics and early-stopping rule] The early-stopping rule is asserted to recover the Lasso basic inequality and minimax rates once a bounded ℓ₁-path condition is imposed. The paper does not state or verify the design conditions under which this path bound holds (isotropic, spiked, or otherwise), nor does it quantify how often the condition is satisfied in the regimes of interest. Because the optimality claim rests on this assumption, its scope must be clarified.
minor comments (2)
  1. [spiked-design paragraph] The transition from the pure-noise result to the spiked-isotropic case (r₂ = p - k*) would benefit from an explicit statement of which parts of the CGMT argument carry over unchanged and which require new expansions.
  2. [introduction and notation] Notation for the tail dimension r₂ is introduced only in the abstract and late in the text; an earlier definition would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive review. The comments identify key points where the proofs and assumptions require additional rigor, which we will address in the revision.

read point-by-point responses
  1. Referee: [CGMT coupling and moment expansions (proof of the logarithmic excess-risk rate)] The claimed excess-variance rate Θ(σ²/log(p/n)) is obtained by integrating the CGMT representation along the continuous-time boosting path whose coordinate-selection thresholds evolve over an interval of width Θ(log(p/n)). The manuscript invokes asymptotic expansions of double-sided truncated Gaussian moments but supplies no explicit uniform remainder bound over this path-dependent range of truncation levels. If the integrated error is not o(1/log(p/n)), it contaminates the leading term at the same order, rendering the precise rate unproven. This is load-bearing for the main pure-noise theorem.

    Authors: We agree that an explicit uniform remainder bound is required to rigorously justify the rate. In the revised manuscript we will add a new appendix lemma establishing a uniform bound on the approximation error of the double-sided truncated Gaussian moments over the full path-dependent interval of truncation levels of width Θ(log(p/n)). The lemma shows that the integrated remainder is o(1/log(p/n)), so that it does not affect the leading Θ(σ²/log(p/n)) term. This directly strengthens the proof of the pure-noise theorem. revision: yes

  2. Referee: [analysis of the non-smooth subdifferential dynamics and early-stopping rule] The early-stopping rule is asserted to recover the Lasso basic inequality and minimax rates once a bounded ℓ₁-path condition is imposed. The paper does not state or verify the design conditions under which this path bound holds (isotropic, spiked, or otherwise), nor does it quantify how often the condition is satisfied in the regimes of interest. Because the optimality claim rests on this assumption, its scope must be clarified.

    Authors: We will revise the text to state the precise design conditions under which the bounded ℓ₁-path condition holds with high probability. For isotropic designs the condition is satisfied when the signal is ℓ₁-bounded by a dimension-independent constant; for spiked-isotropic designs it holds whenever r₂ ≫ n. We will add a proposition that quantifies the success probability of the path bound, showing that it tends to 1 in the high-dimensional regime under the paper’s assumptions. This clarifies the scope of the early-stopping optimality claim. revision: yes

Circularity Check

0 steps flagged

No circularity: risk asymptotics derived from external CGMT and moment expansions

full rationale

The derivation couples the Convex Gaussian Minimax Theorem (external) to the continuous-time boosting flow via asymptotic expansions of double-sided truncated Gaussian moments. These expansions are applied to path-dependent thresholds but are not shown to reduce the excess-risk expression to a fitted parameter or self-referential definition. The early-stopping rule recovers the Lasso basic inequality under an explicit bounded ℓ₁-path assumption rather than by construction. No self-citation is load-bearing for the main isotropic pure-noise result, and the logarithmic rate follows from the analysis without renaming or smuggling an ansatz. The paper is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Central claims rest on the isotropic pure-noise model and bounded ℓ₁-path condition as domain assumptions, with no free parameters fitted or new entities invented; CGMT is treated as a standard external theorem.

axioms (2)
  • domain assumption Isotropic pure-noise model
    Invoked for the main risk characterization and logarithmic decay proof under p features and n samples.
  • domain assumption Bounded ℓ₁-path condition
    Required for the non-smooth subdifferential dynamics to recover the Lasso basic inequality and minimax rates.

pith-pipeline@v0.9.0 · 5630 in / 1406 out tokens · 69880 ms · 2026-05-13T06:41:56.623880+00:00 · methodology

