arxiv: 2605.16913 · v1 · pith:DMMM25JR · submitted 2026-05-16 · 📊 stat.ML · cond-mat.dis-nn · cond-mat.stat-mech · cs.LG · math.PR

A Fourier perspective on the learning dynamics of neural networks: from sample complexities to mechanistic insights

Pith reviewed 2026-05-19 19:36 UTC · model grok-4.3

keywords neural network training · Fourier analysis · simplicity bias · amplitude and phase · SGD sample complexity · power-law spectra · translation invariance · image classification

The pith

Online SGD cannot learn phase-only classification on isotropic high-dimensional inputs in fewer than order N³ steps, but power-law spectra accelerate it substantially.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper analyzes the simplicity bias of neural networks through a Fourier decomposition that separates amplitude information, tied to pixel correlations, from phase information, which encodes edges and higher-order structure. Experiments on image tasks show networks exploit amplitude before phase. A new synthetic model of translation-invariant data with controllable amplitudes and phases is introduced to make the dynamics tractable. Rigorous analysis proves that phase-based classification is hard for online SGD under isotropic inputs, requiring far more steps than amplitude-based tasks, while power-law spectra speed phase learning even when they add no classification benefit. Simulations with shallow and deep networks on textures, CIFAR100, and ImageNet confirm the same amplitude-to-phase progression.
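
As a minimal sketch of the amplitude/phase decomposition described above, the following NumPy snippet splits a 2D array into Fourier amplitudes and phases, recombines them, and builds a hybrid that carries the amplitudes of one array and the phases of another, in the spirit of the bird/snake swap in Figure 1. Random arrays stand in for images, and the names `split_amplitude_phase` and `combine` are illustrative assumptions, not taken from the paper.

```python
import numpy as np


def split_amplitude_phase(image):
    """Return Fourier amplitudes and phases of a real 2D array."""
    spectrum = np.fft.fft2(image)
    return np.abs(spectrum), np.angle(spectrum)


def combine(amplitude, phase):
    """Rebuild a real image from given amplitudes and phases."""
    spectrum = amplitude * np.exp(1j * phase)
    # For amplitudes and phases taken from real images, the imaginary
    # part of the inverse transform is only rounding noise.
    return np.real(np.fft.ifft2(spectrum))


rng = np.random.default_rng(0)
image_a = rng.standard_normal((16, 16))  # placeholder for one image
image_b = rng.standard_normal((16, 16))  # placeholder for another image

rho_a, phi_a = split_amplitude_phase(image_a)
rho_b, phi_b = split_amplitude_phase(image_b)

roundtrip = combine(rho_a, phi_a)  # recovers image_a
hybrid = combine(rho_b, phi_a)     # amplitudes of B, phases of A
```

On natural images, a hybrid of this kind is the construction shown in panels c-d of Figure 1; with random placeholder arrays the snippet only demonstrates the mechanics.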

Core claim

For isotropic and high-dimensional inputs, classification based on phase information alone is a genuinely hard task: online SGD cannot distinguish the structured inputs from noise within n ≪ N³ steps, but needs at least n ≫ N³ log² N steps. Power-law spectra can dramatically accelerate the learning of phase information, even if the spectra do not help with classification itself.
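
To give a feel for the two step scales named in the claim, here is a small arithmetic sketch that evaluates N³ and N³ log² N for a few input dimensions. The bounds are asymptotic, so constants are ignored; the chosen values of N and the helper name `step_scales` are illustrative assumptions.

```python
import math


def step_scales(dim):
    """Return (N^3, N^3 log^2 N) for input dimension N, ignoring constants."""
    cubic = dim ** 3
    cubic_log = cubic * math.log(dim) ** 2
    return cubic, cubic_log


for dim in (64, 256, 1024):
    cubic, cubic_log = step_scales(dim)
    print(f"N={dim:5d}  N^3={cubic:.2e}  N^3 log^2 N={cubic_log:.2e}")
```

The logarithmic factor grows slowly compared with the cubic term, so the gap between the two scales is modest relative to the gap between linear, quadratic, and cubic sample complexity.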

What carries the argument

A synthetic data model for translation-invariant inputs that separates control of amplitudes and phases while preserving tractability for SGD analysis.
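
To make "separate control of amplitudes and phases" concrete, the sketch below draws a real 1D signal from a prescribed amplitude profile and independently sampled phases. This is a generic construction under stated assumptions (power-law amplitudes with a hypothetical exponent `alpha`, uniform phases, zero mean) and is not the paper's Fourier data model (3), which this page does not reproduce.

```python
import numpy as np


def sample_signal(n, alpha, rng):
    """Draw a real length-n signal with amplitudes k^(-alpha) and uniform phases."""
    freqs = np.arange(n // 2 + 1, dtype=float)
    amplitude = np.zeros(n // 2 + 1)
    amplitude[1:] = freqs[1:] ** (-alpha)  # DC amplitude stays 0 (zero-mean signal)

    phase = rng.uniform(-np.pi, np.pi, size=n // 2 + 1)
    phase[0] = 0.0
    if n % 2 == 0:
        phase[-1] = 0.0  # the Nyquist bin of a real signal must be real

    spectrum = amplitude * np.exp(1j * phase)
    # irfft fills in the negative frequencies by conjugate symmetry.
    return np.fft.irfft(spectrum, n=n), amplitude


rng = np.random.default_rng(1)
signal, amplitude = sample_signal(64, alpha=1.5, rng=rng)
```

Setting `alpha=0` gives a flat amplitude profile, while larger values concentrate the variance in low frequencies; in both cases the phases can be changed without touching the amplitudes.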

If this is right

  • Networks trained on images first rely on amplitude information before exploiting phase information.
  • Power-law spectra accelerate phase learning even without improving final accuracy.
  • The same amplitude-before-phase progression appears in deep convolutional networks on CIFAR100 and ImageNet.
  • This amplitude-phase interaction explains how networks learn natural image distributions efficiently.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The hardness result may extend to other high-dimensional data with flat spectra.
  • Power-law acceleration could be tested on regression tasks or different architectures.
  • The model offers a way to study how translation invariance interacts with spectral properties during training.

Load-bearing premise

The synthetic data model for translation-invariant inputs captures the real interaction between amplitudes, phases, and SGD dynamics without artifacts that would change the hardness or acceleration results.

What would settle it

An experiment showing that online SGD succeeds at phase-only classification on high-dimensional isotropic inputs in substantially fewer than N³ steps would disprove the hardness claim.

Figures

Figures reproduced from arXiv: 2605.16913 by Claudia Merger, Fabiola Ricci, Sebastian Goldt.

Figure 1: Learning phase vs amplitude information. a-b) Pictures of a bird and a snake from ImageNet. c-d) Fourier image reconstruction with phases $\phi_{kk'}$ from the “bird” and amplitudes $\rho_{kk'}$ from the “snake” and vice versa. e) Phases of images from the “cotton” class of the ALOT texture dataset for patches of size 16 × 16 along the first Fourier mode in the x-direction. f) Uniform phase distribution. g) Performance of…
Figure 2: Performance of SGD in classifying isotropic inputs on the Fourier data model. (Left) We run online SGD applied to the correlation loss (1) with isotropic inputs drawn from the Fourier data model (3). SGD does not weakly recover the signal at linear or quadratic sample complexity, whereas it converges to the subspace spanned by the DFT phase vectors in the cubic regime. On the y-axes, we see the…
Figure 3: Shared principal subspace speeds up learning. a) Average squared Fourier amplitudes of image patches of “cotton” class from ALOT texture dataset, averaged over wave vectors of equal length $|k| = \sqrt{k_x^2 + k_y^2}$, for patches of increasing size. b-d) Test losses of classifiers trained on distinguishing “cotton” vs. “lace” on original data, data where all Fourier amplitudes of both classes have been set …
Figure 4: Performance of SGD in classifying non-isotropic inputs on the Fourier data model. (Left) Cartoon of the principal subspace of power-law-decaying inputs sampled from the Fourier data model (3), spanned by the DFT phase vectors $(u, v)$ and a finite number of other principal components $(u_m, v_m)$. (Middle) At first, online SGD quickly recovers the whole principal subspace, including the DFT phase vectors. Then,…
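
The Figure 3 caption describes squared Fourier amplitudes averaged over wave vectors of equal length. A minimal sketch of such a radial average is given below, assuming square patches, integer wave numbers, and binning by rounded radius; the binning rule and the function name `radial_power_spectrum` are assumptions, since the paper's exact procedure is not given on this page.

```python
import numpy as np


def radial_power_spectrum(patches):
    """Average |FFT|^2 over patches, then over wave vectors of equal rounded length.

    patches: array of shape (num_patches, size, size).
    Returns (radii, mean squared amplitude per radius).
    """
    size = patches.shape[-1]
    # fft2 acts on the last two axes; average the power over the patch axis.
    power = np.mean(np.abs(np.fft.fft2(patches)) ** 2, axis=0)

    k = np.fft.fftfreq(size, d=1.0 / size)  # integer wave numbers
    kx, ky = np.meshgrid(k, k, indexing="ij")
    radius = np.rint(np.sqrt(kx ** 2 + ky ** 2)).astype(int)

    sums = np.bincount(radius.ravel(), weights=power.ravel())
    counts = np.bincount(radius.ravel())
    return np.arange(len(sums)), sums / np.maximum(counts, 1)


rng = np.random.default_rng(0)
noise_patches = rng.standard_normal((200, 16, 16))  # placeholder for texture patches
radii, spectrum = radial_power_spectrum(noise_patches)
```

For white-noise placeholders the profile is roughly flat; a power-law spectrum would instead appear as a straight line when `spectrum` is plotted against `radii` on log-log axes.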
Original abstract

Neural networks trained with gradient-based methods exhibit a strong simplicity bias: they learn simpler statistical features of their data before moving to more complex features. Previous analyses of this phenomenon have largely focused on settings with (quasi-)isotropic inputs. In this work, we study the simplicity bias from a Fourier perspective, which allows us to include two key features of natural images in the analysis: approximate translation-invariance and power-law spectra. We first show experimentally that simple neural networks trained on image classification tasks first rely on amplitude information -- related to pair-wise correlations between pixels -- before exploiting phase information, which encodes edges and higher-order correlations. In view of this, we introduce a synthetic data model for translation-invariant inputs that allows precise control over amplitudes and phases while remaining tractable. We rigorously establish that for isotropic and high-dimensional inputs, classification based on phase information alone is a genuinely hard task: online stochastic gradient descent (SGD) cannot distinguish the structured inputs from noise within $n \ll N^3$ steps, but needs at least $n \gg N^3 \log^2{N}$ steps. In contrast, we show both experimentally and theoretically that power-law spectra can dramatically accelerate the speed of learning phase information, even if the spectra do not help with classification. Simulations with two-layer networks trained on textures and with deep convolutional networks on ImageNet and CIFAR100 confirm this non-trivial interaction between amplitudes and phases, providing mechanistic insights into how deep neural networks can learn natural image distributions efficiently.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that neural networks exhibit a simplicity bias by learning amplitude information (pairwise pixel correlations) before phase information (edges and higher-order correlations) when trained on image classification. From a Fourier perspective incorporating translation invariance and power-law spectra, the authors introduce a synthetic data model for translation-invariant inputs. They rigorously prove that for isotropic high-dimensional inputs, online SGD cannot learn phase-only classification within n ≪ N³ steps and requires at least n ≫ N³ log²N steps. They further show both theoretically and experimentally that power-law spectra accelerate phase learning even when spectra do not aid classification directly. Experiments with two-layer networks on textures and deep CNNs on ImageNet/CIFAR100 support the amplitude-to-phase transition and the non-trivial interaction.

Significance. If the results hold, the work provides mechanistic insights into efficient learning of natural image distributions by deep networks, extending simplicity bias analyses beyond quasi-isotropic inputs. The combination of rigorous sample-complexity bounds for the synthetic model, power-law acceleration derivations, and empirical validation on real datasets strengthens the Fourier-based explanation of learning dynamics. The parameter-free nature of the hardness lower bound and the reproducible experimental setup on standard benchmarks are notable strengths.

major comments (2)
  1. §3.2 (Synthetic data model definition): The central hardness claim that phase-only classification is information-theoretically and algorithmically hard for online SGD (requiring n ≫ N³ log²N) depends on the model introducing no unintended label-correlated phase alignments or higher-order dependencies under the translation-invariance constraint. The phase sampling procedure could embed weak correlations that invalidate the lower bound as a general statement about isotropic inputs; an explicit proof or numerical verification that labels remain independent of phases in the Fourier domain is needed to confirm the result is not model-specific.
  2. Theorem 4.1 (Hardness lower bound for online SGD): The derivation assumes the synthetic model faithfully captures the interaction between amplitudes, phases, and dynamics without artifacts. If the translation-invariance enforcement introduces even mild phase-label dependencies, the claimed separation from noise (n ≪ N³ vs n ≫ N³ log²N) may not hold in the intended regime; a direct comparison to a fully random-phase baseline would clarify whether the bound is tight.
minor comments (2)
  1. Figure 3 and associated text: Error bars or multiple random seeds are not reported for the ImageNet/CIFAR100 runs, making it harder to assess the statistical significance of the observed amplitude-to-phase transition.
  2. Notation (§2): The definition of the Fourier transform and the precise normalization used for amplitudes/phases should be stated explicitly in §2 to avoid ambiguity when comparing to standard image processing conventions.
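
To illustrate why the normalization point matters, the sketch below compares the three `norm` conventions of NumPy's `np.fft.fft2` on a random placeholder patch. Amplitudes differ by powers of the number of pixels while phases are unaffected; which convention the paper adopts is not stated on this page.

```python
import numpy as np

rng = np.random.default_rng(0)
patch = rng.standard_normal((16, 16))  # placeholder for an image patch

spec_backward = np.fft.fft2(patch)                 # default: no scaling on the forward transform
spec_ortho = np.fft.fft2(patch, norm="ortho")      # unitary: divides by sqrt(number of pixels)
spec_forward = np.fft.fft2(patch, norm="forward")  # divides by the number of pixels

# Amplitudes differ by a constant factor between conventions.
amp_ratio = np.abs(spec_backward) / np.abs(spec_ortho)

# Phases agree: the phase of z1 * conj(z2) is the phase difference.
phase_gap = np.angle(spec_backward * np.conj(spec_ortho))

# Only the unitary convention preserves total energy (Parseval).
energy_pixels = np.sum(patch ** 2)
energy_ortho = np.sum(np.abs(spec_ortho) ** 2)
```

Here `amp_ratio` equals the square root of the number of pixels everywhere, so any amplitude threshold or power-law prefactor quoted without a stated convention is ambiguous by that factor.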

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their positive assessment of our work and for the constructive comments on the synthetic data model and hardness results. We address the two major comments point by point below.

  1. Referee: §3.2 (Synthetic data model definition): The central hardness claim that phase-only classification is information-theoretically and algorithmically hard for online SGD (requiring n ≫ N³ log²N) depends on the model introducing no unintended label-correlated phase alignments or higher-order dependencies under the translation-invariance constraint. The phase sampling procedure could embed weak correlations that invalidate the lower bound as a general statement about isotropic inputs; an explicit proof or numerical verification that labels remain independent of phases in the Fourier domain is needed to confirm the result is not model-specific.

    Authors: We agree that an explicit check for label-phase independence is important to ensure the hardness result is not an artifact of the model construction. In Section 3.2, phases are drawn independently and uniformly, and the label is generated from a translation-invariant function of the full phase vector (specifically, a thresholded sum over selected frequency interactions). This construction is designed to make the label uncorrelated with any fixed subset of phases. To confirm, we have added numerical verification in the revision: the empirical correlation between the label and each individual phase coefficient is statistically indistinguishable from zero across multiple random seeds, and mutual information estimates are at the level of sampling noise. We will include this as a new panel in Figure 3 (or an appendix) to substantiate that no unintended dependencies are present. revision: yes

  2. Referee: Theorem 4.1 (Hardness lower bound for online SGD): The derivation assumes the synthetic model faithfully captures the interaction between amplitudes, phases, and dynamics without artifacts. If the translation-invariance enforcement introduces even mild phase-label dependencies, the claimed separation from noise (n ≪ N³ vs n ≫ N³ log²N) may not hold in the intended regime; a direct comparison to a fully random-phase baseline would clarify whether the bound is tight.

    Authors: We appreciate the suggestion for a random-phase baseline comparison. The proof of Theorem 4.1 shows that the expected gradient contribution from the phase variables vanishes under isotropy, with the N³ scaling arising from the variance of the stochastic updates. To verify that translation invariance does not introduce spurious dependencies that would invalidate the separation, we will add experiments in the revised version comparing our structured-phase model against a fully random-phase control (where labels are assigned independently of the input). The random-phase case learns at the rate expected for pure noise, while the structured case exhibits the predicted delay, confirming that the bound reflects the intended phase-learning difficulty rather than model artifacts. revision: yes
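
One common way to build a control that keeps the amplitudes but removes phase structure is phase randomization. The sketch below takes the random phases from the transform of a white-noise array, which keeps the result real-valued. Whether this matches the random-phase control the rebuttal has in mind is an assumption; the function name `randomize_phases` is hypothetical.

```python
import numpy as np


def randomize_phases(image, rng):
    """Return a real surrogate with the amplitudes of `image` and random phases."""
    amplitude = np.abs(np.fft.fft2(image))
    # Phases of a real white-noise array have the conjugate symmetry
    # needed for the inverse transform to be real.
    noise_phase = np.angle(np.fft.fft2(rng.standard_normal(image.shape)))
    surrogate = np.fft.ifft2(amplitude * np.exp(1j * noise_phase))
    return np.real(surrogate)


rng = np.random.default_rng(0)
image = rng.standard_normal((16, 16))  # placeholder for a structured input
surrogate = randomize_phases(image, rng)
```

A classifier that separates originals from such surrogates can only be using phase information, because the two sets share their Fourier amplitudes by construction.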

Circularity Check

0 steps flagged

Standard high-dimensional SGD analysis supports hardness result without reduction to fitted inputs or self-citations

The paper introduces a synthetic translation-invariant data model to control amplitudes and phases, then rigorously derives the online SGD hardness bound (n ≪ N³ vs n ≫ N³ log²N) for phase-only classification using standard high-dimensional analysis techniques. This does not reduce by construction to quantities fitted from the target result, nor does it rely on load-bearing self-citations or ansatzes smuggled from prior work. Power-law acceleration is shown both theoretically and via independent experiments on textures/ImageNet. The derivation chain is self-contained and externally falsifiable via the stated assumptions on isotropic inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the fidelity of the synthetic data model and standard high-dimensional learning assumptions; no free parameters are fitted to produce the hardness bound, and no new entities are postulated.

axioms (1)
  • domain assumption: Inputs are high-dimensional, isotropic, and translation-invariant for the hardness result on phase learning.
    Explicitly stated as the regime in which the N³ scaling is proven.

pith-pipeline@v0.9.0 · 5822 in / 1242 out tokens · 35355 ms · 2026-05-19T19:36:43.567585+00:00 · methodology
