Characterizing the Generalization Error of Random Feature Regression with Arbitrary Data Augmentation
Pith reviewed 2026-05-12 04:48 UTC · model grok-4.3
The pith
The test error of random feature regression with arbitrary data augmentation admits a tight characterization using only the population quantities of the true data and the first- and second-order statistics of the augmentation scheme.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In the proportional regime, the test mean squared error for random feature regression with arbitrary data augmentation is given by a closed-form expression depending solely on the population quantities of the true data together with the first- and second-order statistics of the augmentation scheme. This holds under misspecified feature maps and for any architecture in which only the readout layer is trained while the rest of the network is frozen or randomly initialized. When the data are Gaussian, the asymptotic formula is tight.
What carries the argument
The asymptotic formula for the mean squared test error, expressed solely in terms of the true data's population quantities and the augmentation scheme's first- and second-order statistics.
If this is right
- The benefit of any augmentation scheme can be predicted in advance from its induced moments without retraining the model.
- Different augmentation procedures can be ranked or optimized by comparing only their first- and second-order statistics (see the sketch after this list).
- Misspecification between the feature map and the true data distribution does not invalidate the error formula.
- The regularization effect of augmentation is isolated to these low-order statistics even when the underlying network architecture varies.
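To make the second bullet concrete, here is a minimal Monte Carlo sketch, ours rather than the paper's, that estimates the first- and second-order statistics of two common augmentation schemes (Gaussian jitter and cutout-style masking); under the paper's claim, these moments alone would determine each scheme's effect on the test error. The base covariance, noise scale, and masking rate are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_mc = 20, 200_000

# Base data: centered Gaussian with a decaying spectrum (illustrative choice).
cov = np.diag(1.0 / np.arange(1, d + 1))
x = rng.multivariate_normal(np.zeros(d), cov, size=n_mc)

def jitter(x, sigma=0.3):
    """Additive isotropic Gaussian noise."""
    return x + sigma * rng.standard_normal(x.shape)

def mask(x, q=0.1):
    """Zero out each coordinate with probability q, rescaled to preserve the mean."""
    keep = rng.random(x.shape) > q
    return x * keep / (1 - q)

for name, aug in [("jitter", jitter), ("mask", mask)]:
    z = aug(x)
    mu = z.mean(axis=0)                        # first-order statistic
    Sigma = (z - mu).T @ (z - mu) / n_mc       # second-order statistic
    print(f"{name}: |mu| = {np.linalg.norm(mu):.3f}, tr(Sigma) = {np.trace(Sigma):.3f}")
```

Jitter inflates the trace additively (by d·σ²) while masking inflates it multiplicatively (by 1/(1−q)), so the two schemes can be matched or ranked purely at the level of these statistics.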
Where Pith is reading between the lines
- Designers could engineer augmentation distributions to achieve target regularization levels by solving for desired moment values (a toy example follows this list).
- The same moment-based reduction may apply to other convex losses or to settings beyond pure regression if analogous proportional limits are derived.
- In overparameterized regimes the result implies that higher-order properties of the augmentation become irrelevant for generalization.
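For instance, on the first bullet: a toy moment-matching computation, ours and not a procedure from the paper, assuming isotropic Gaussian jitter whose augmented covariance is Σ + σ²I, so the jitter scale follows directly from a target trace.

```python
import numpy as np

# Toy moment engineering: aug(x) = x + sigma * eps with eps ~ N(0, I)
# has augmented covariance Sigma + sigma^2 * I; solve for sigma from a target.
Sigma = np.diag(1.0 / np.arange(1, 11))   # assumed population covariance (toy)
target_trace = 1.5 * np.trace(Sigma)      # hypothetical regularization target
d = Sigma.shape[0]
sigma = np.sqrt((target_trace - np.trace(Sigma)) / d)
print(f"jitter scale sigma = {sigma:.4f}")
```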
Load-bearing premise
The dimension must grow proportionally with the number of samples, and only the final readout layer is trained while the preceding features remain frozen or randomly initialized.
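A minimal sketch of an estimator satisfying this premise, with a ReLU feature map and a ridge-regularized readout as our illustrative assumptions (the paper may instantiate both differently):

```python
import numpy as np

rng = np.random.default_rng(1)

def fit_readout(X, y, W, lam=1e-2):
    """Train only the linear readout on frozen random features phi(x) = max(Wx, 0)."""
    Phi = np.maximum(X @ W.T, 0.0)             # W is never updated
    p = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(p), Phi.T @ y)

n, d, p = 400, 100, 200                        # proportional regime: ratios held fixed
W = rng.standard_normal((p, d)) / np.sqrt(d)   # frozen, randomly initialized layer
X = rng.standard_normal((n, d))
y = X @ (rng.standard_normal(d) / np.sqrt(d)) + 0.1 * rng.standard_normal(n)
a = fit_readout(X, y, W)                       # the only trained parameters
```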
What would settle it
Generate large finite samples from a known Gaussian distribution, apply a concrete augmentation with known first and second moments, train the random feature model, and check whether the observed test MSE converges to the predicted formula as dimension and sample size grow proportionally.
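A harness for the empirical half of that check might look as follows; it is our sketch under illustrative assumptions (Gaussian jitter augmentation, ReLU features, ridge readout), and the paper's closed-form prediction would be compared against the printed values.

```python
import numpy as np

rng = np.random.default_rng(2)

def test_mse(n, d, p, sigma_aug=0.3, lam=1e-2, k_aug=4):
    """Empirical test MSE of a ridge readout on random ReLU features,
    trained on jitter-augmented copies of a Gaussian sample."""
    beta = rng.standard_normal(d) / np.sqrt(d)
    X = rng.standard_normal((n, d))
    y = X @ beta + 0.1 * rng.standard_normal(n)
    # Augmentation: k_aug jittered copies per point, labels reused.
    Xa = np.vstack([X + sigma_aug * rng.standard_normal((n, d)) for _ in range(k_aug)])
    ya = np.tile(y, k_aug)
    W = rng.standard_normal((p, d)) / np.sqrt(d)     # frozen features
    Phi = np.maximum(Xa @ W.T, 0.0)
    a = np.linalg.solve(Phi.T @ Phi + lam * np.eye(p), Phi.T @ ya)
    Xt = rng.standard_normal((2000, d))              # clean test points
    return np.mean((np.maximum(Xt @ W.T, 0.0) @ a - Xt @ beta) ** 2)

# Grow n, d, p at fixed ratios; if the characterization is correct, the observed
# MSE should concentrate around a deterministic limit computable from the data's
# population covariance and the jitter's first two moments.
for s in [1, 2, 4]:
    n, d, p = 200 * s, 100 * s, 150 * s
    print(n, d, p, round(test_mse(n, d, p), 4))
```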
Original abstract
This paper aims at analyzing the regularization effect that data augmentation induces on supervised regression methods in the proportional regime, where the number of covariates grows proportionally to the number of samples. We provide a tight characterization of the test error, measured in mean squared error, in terms only of the population quantities of the true data, as well as first and second order statistics of the augmentation scheme. Our results are valid under misspecified feature maps, and for any network architecture where only the last readout layer is trained, and the rest of the network is either frozen or randomly initialized. We specify our results in the case of Gaussian data, and show that our asymptotic characterization is tight in this setting.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript derives an asymptotic characterization of the test MSE for random feature regression with arbitrary data augmentation in the proportional regime. The formula is expressed solely in terms of population covariances of the true data and the first- and second-order moments of the augmentation distribution. The analysis covers misspecified (possibly nonlinear) feature maps with only the readout trained, and the characterization is shown to be tight when the underlying data are Gaussian.
Significance. If correct, the result supplies a practical tool for quantifying the regularization induced by data augmentation via low-order moments alone, extending random-matrix methods to augmented random-feature models while accommodating feature misspecification. The explicit tightness proof under Gaussian data is a concrete strength that allows direct validation of the formulas.
major comments (1)
- [Abstract and §3, main theorem] The claim that the test-error characterization depends only on first- and second-order augmentation statistics for arbitrary augmentations and nonlinear feature maps φ is not obviously consistent with the fact that E[φ(aug(x))φ(aug(x))ᵀ] is in general a functional of the full law of aug(x), not merely of its mean and covariance. The derivation therefore appears to require either linearity of φ or Gaussianity of the augmented data to close the argument; the paper should state the precise assumptions under which the general (non-Gaussian) formula holds.
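The tension in this comment can be probed numerically. The sketch below, ours rather than the paper's, draws two jitter laws with identical mean and covariance, Gaussian and Rademacher, and checks that the feature second moment E[φ(z)φ(z)ᵀ] nevertheless differs under a ReLU map, so low-order moments alone cannot pin it down without extra assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
d, p, n_mc = 3, 20, 400_000
W = rng.standard_normal((p, d)) / np.sqrt(d)
x = np.ones(d)                                          # one fixed data point

# Two noise laws with identical mean (zero) and covariance (identity):
z_gauss = x + rng.standard_normal((n_mc, d))            # Gaussian jitter
z_rade = x + rng.choice([-1.0, 1.0], size=(n_mc, d))    # Rademacher jitter

def second_moment(z):
    Phi = np.maximum(z @ W.T, 0.0)                      # nonlinear ReLU features
    return Phi.T @ Phi / len(z)

M_g, M_r = second_moment(z_gauss), second_moment(z_rade)
# Relative gap is well above Monte Carlo error for generic W at this small d.
print(np.linalg.norm(M_g - M_r) / np.linalg.norm(M_g))
```

At small d the projected noise law is far from Gaussian, so the gap is visible; as d grows, a central-limit effect shrinks it, which is one route by which a Gaussian-equivalence argument could rescue the general claim.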
minor comments (1)
- [§2] The statement of the proportional regime (n,d→∞ with fixed ratio) and the precise definition of the random-feature map (frozen vs. randomly initialized) could be repeated explicitly in the theorem statements for clarity.
Simulated Author's Rebuttal
We thank the referee for their careful reading and constructive comments. We address the major comment point by point below.
Point-by-point responses
Referee: [Abstract and §3, main theorem] The claim that the test-error characterization depends only on first- and second-order augmentation statistics for arbitrary augmentations and nonlinear feature maps φ is not obviously consistent with the fact that E[φ(aug(x))φ(aug(x))ᵀ] is in general a functional of the full law of aug(x), not merely of its mean and covariance. The derivation therefore appears to require either linearity of φ or Gaussianity of the augmented data to close the argument; the paper should state the precise assumptions under which the general (non-Gaussian) formula holds.
Authors: We thank the referee for this observation. The manuscript already states that the results are specified and shown to be tight in the Gaussian data case (see the abstract and the main theorem in §3). Under the Gaussian assumption, the law of aug(x) is fully determined by its first- and second-order moments, so that for any measurable (possibly nonlinear) feature map φ the expectation E[φ(aug(x))φ(aug(x))ᵀ] depends only on those moments. We agree that the role of the Gaussian assumption should be stated more explicitly to avoid any ambiguity about non-Gaussian settings. We will revise the abstract and §3 accordingly. revision: yes
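In symbols, the closure step in the rebuttal is the elementary fact that a Gaussian law is determined by its first two moments (notation ours, not the paper's):

```latex
\operatorname{aug}(x) \sim \mathcal{N}(\mu, \Sigma)
\;\Longrightarrow\;
\mathbb{E}\!\left[\varphi(\operatorname{aug}(x))\,\varphi(\operatorname{aug}(x))^{\top}\right]
= \int \varphi(z)\,\varphi(z)^{\top}\,\mathcal{N}(z;\mu,\Sigma)\,\mathrm{d}z ,
```

which is a functional of (μ, Σ) alone, for any measurable φ.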
Circularity Check
No circularity: characterization expressed via independent population quantities
full rationale
The paper's central result is an asymptotic MSE characterization for random-feature regression under arbitrary data augmentation, expressed directly in terms of population covariances of the data-generating process together with the first- and second-order moments of the augmentation map. These quantities are defined independently of the trained readout weights and of the fitted model itself; the derivation therefore does not reduce any claimed prediction to a tautological re-expression of its own inputs. The analysis is further restricted to the Gaussian-data case where the stated tightness is verified, and the provided abstract and claims contain no load-bearing self-citations, uniqueness theorems imported from prior author work, or ansatzes smuggled via citation. The derivation chain is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Proportional regime: the number of features p and samples n satisfy p/n → γ ∈ (0, ∞)
- domain assumption Only the final readout layer is trained; the feature map is frozen or randomly initialized