pith. machine review for the scientific record

arxiv: 2605.05436 · v1 · submitted 2026-05-06 · 📊 stat.ML · cs.LG

Recognition: unknown

Estimating Implicit Regularization in Deep Learning

Giles Hooker, Joseph H. Rudoler, Kevin Tan, Konrad P. Kording


Pith reviewed 2026-05-08 15:51 UTC · model grok-4.3

classification 📊 stat.ML cs.LG
keywords implicit regularization · deep learning · gradient matching · dropout · early stopping · L2 penalty · neural network training

The pith

Gradient matching methods can empirically recover the implicit regularization effects induced by complex training procedures in deep networks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops gradient matching techniques that compare observed weight updates against pure loss gradients to identify equivalent penalty terms. This approach recovers explicit penalties such as L1 and L2 when they are known, and it reproduces the quadratic weight penalty known to arise from early stopping in gradient descent. Applied to dropout, the method reveals an implicit L2-like effect in deep networks. Because the technique is empirical rather than analytical, it applies to arbitrary architectures and training modifications where closed-form derivations are unavailable. A sympathetic reader would value this for turning opaque training choices into measurable regularization strengths.

Core claim

By solving for the penalty term whose gradient best explains the difference between actual parameter updates and the loss gradient alone, one obtains an estimate of the implicit regularization at work during training; this recovers known explicit and implicit penalties and characterizes dropout as inducing an L2 effect in deep networks.
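In symbols, the claim can be written as an optimization over candidate penalties (notation ours for illustration, not necessarily the paper's): an observed update is modeled as a gradient step on the loss plus a penalty, and the penalty is fit so its gradient best explains the residual deviation.

```latex
% Model of the observed update (illustrative notation):
%   \theta_{t+1} = \theta_t - \eta\,\big(\nabla L(\theta_t) + \nabla R(\theta_t)\big)
% Gradient matching over a penalty family \mathcal{R} (e.g. R(\theta)=\lambda\|\theta\|_2^2/2):
\hat{R} \;=\; \arg\min_{R \in \mathcal{R}} \;
\sum_t \Big\| \tfrac{\theta_t - \theta_{t+1}}{\eta}
\;-\; \nabla L(\theta_t) \;-\; \nabla R(\theta_t) \Big\|_2^2
```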

What carries the argument

Gradient matching, which finds a penalty function whose gradient aligns observed weight updates with the sum of the loss gradient plus that penalty gradient.
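A minimal sketch of that idea (ours, not the paper's code): train a linear model with a known ℓ2 penalty, then recover its coefficient by least-squares matching of the observed update deviations. All names and settings here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear-regression loss L(w) = ||Xw - y||^2 / (2n).
n, p = 200, 5
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + 0.1 * rng.normal(size=n)

def loss_grad(w):
    return X.T @ (X @ w - y) / n

# Trainer with a *known* penalty R(w) = lam_true * ||w||^2 / 2, so each
# observed update equals -eta * (grad L + lam_true * w).
eta, lam_true = 0.1, 0.05
w = np.ones(p)  # start away from zero so deviations are informative
devs, ws = [], []
for _ in range(50):
    g = loss_grad(w)
    update = -eta * (g + lam_true * w)  # what the modified trainer actually does
    devs.append(-update / eta - g)      # deviation from the pure loss gradient
    ws.append(w.copy())
    w = w + update

# Gradient matching: least-squares fit of lam so that lam * w explains the
# deviations along the trajectory: lam_hat = <w, dev> / <w, w>.
D, W = np.concatenate(devs), np.concatenate(ws)
lam_hat = (W @ D) / (W @ W)
print(f"true lambda = {lam_true}, recovered = {lam_hat:.4f}")
```

In this noiseless setting the deviation is exactly lam_true * w, so the fit recovers the coefficient; the paper's harder cases (minibatching, dropout) add noise on top of this signal.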

If this is right

  • The method verifies analytical predictions of implicit bias such as the quadratic penalty from early stopping.
  • Dropout in deep networks produces an implicit L2 regularization effect that the matching procedure can quantify.
  • Practitioners can apply the technique to interpret the net regularization strength of their chosen training modifications.
  • The empirical nature of the approach allows characterization of implicit biases in networks too complex for analytic derivation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same matching idea could be used to compare regularization strength across different optimizers or architectures in controlled experiments.
  • Quantified implicit penalties might guide automated hyperparameter search by treating regularization strength as an observable target.
  • Extending the method to measure how regularization evolves over training epochs could reveal time-varying bias effects not captured by static penalties.

Load-bearing premise

The deviation between actual weight updates and pure loss gradients can be matched to the gradient of some explicit penalty term.

What would settle it

Running the method on early-stopped gradient descent and failing to recover a quadratic weight penalty would falsify the claim that the matching procedure identifies the implicit regularization.
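A toy version of that check can be run directly. Ali et al. relate early-stopped least-squares gradient descent to ridge regression with strength roughly λ ≈ 1/t at effective time t = ηT; a gross mismatch between the two solutions would be the kind of failure that falsifies the matching story. Setup and tolerance below are illustrative, and the correspondence is approximate, not exact.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 500, 4
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + 0.1 * rng.normal(size=n)
A, b = X.T @ X / n, X.T @ y / n  # quadratic loss: gradient is A w - b

# Early-stopped gradient descent on the unpenalized least-squares loss.
eta, T = 0.01, 500
w = np.zeros(p)
for _ in range(T):
    w -= eta * (A @ w - b)

# Ridge solution at the approximately matched strength lambda ~ 1/(eta*T).
lam = 1.0 / (eta * T)
w_ridge = np.linalg.solve(A + lam * np.eye(p), b)

# The two should be close; a large gap would contradict the claimed
# quadratic-penalty interpretation of early stopping.
rel_err = np.linalg.norm(w - w_ridge) / np.linalg.norm(w_ridge)
print(f"relative gap between early-stopped GD and ridge: {rel_err:.3f}")
```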

Figures

Figures reproduced from arXiv: 2605.05436 by Giles Hooker, Joseph H. Rudoler, Kevin Tan, Konrad P. Kording.

Figure 1
Figure 1: Deviations from loss minima contain information about regularization. Regularization modifies the objective function, leading to different optima than strictly minimizing empirical loss. This gap must be explained by a regularizer (∇R) counteracting any nonzero loss gradient (−∇L) – this is the key logic of the paper.
Figure 2
Figure 2: Gradient deviations along the trajectory. In a 2-parameter toy model, SGD mini-batching causes weight updates along the trajectory to deviate from the full-batch gradient. We can model the implicit regularization underlying these deviations.
Figure 3
Figure 3: Recovering explicit regularization. Elastic-net recovery at β = 10⁻³ on a 6 × 6 grid of true (λ1, λ2) with 10 dataset resamples per cell. Each panel plots the estimated λ̂i against the true λi on log–log axes; lines show the mean across seeds and error bars give ±1 standard error. Left: λ̂1 vs. true λ1, one line per value of true λ2. Right: λ̂2 vs. true λ2, one line per value of true λ1.
Figure 4
Figure 4: Implicit regularization due to early stopping. (A) The theoretical regularization matrix Λ(t) predicted by Ali et al. at the early-stopped iterate t = 500. (B) The full symmetric estimator Λ̂(t) using m = 10 endpoints, each trained to the same fixed iterate t = 500. (C) The underparametrized diagonal estimator diag(Λ̂(t)) fit from a single endpoint.
Figure 5
Figure 5: Implicit regularization of dropout. Estimated ℓ2 regularization strength (λ̂) as a function of dropout rate for MNIST classifiers. Each point is one seed; columns vary width, rows vary depth. Color indicates gradient-matching loss (darker = better fit). The monotonic increase with dropout rate is consistent with the theoretical prediction that dropout acts as adaptive weight decay.
Figure 6
Figure 6: Recovering implicit gradient regularization from discrete gradient steps. Barrett and Dherin [5] derived λ = ηp/4 for the gradient-penalty regularizer induced by discrete GD. Left: Value of RIG, the averaged squared gradient, against the estimated scaling λ̂. Middle: Test accuracy against λ̂. Right: Direct recovery check comparing λ̂ to the theoretical value ηp/4.
Figure 7
Figure 7: Bootstrap recovery of the early-stopped GD regularizer.
read the original abstract

Deep learning systems are known to exhibit implicit regularization (alt. implicit bias), favoring simple solutions instead of merely minimizing the loss function. In some cases, we can analytically derive the implicit regularization -- connecting it to an equivalent penalty that augments the learning objective. However, modern deep learning systems are complex, carrying modifications to the training procedure and architecture (e.g. early stopping, minibatching, dropout) whose effects are not always directly interpretable. Although estimating the resulting implicit regularization could aid theorists in algorithm design and practitioners in interpreting their hyperparameter choices, this problem has received little direct attention. It is also tractable: regularization makes weight updates deviate from loss gradients, promising a signal for identifying implicit bias. Here we provide gradient matching methods that can be used to empirically estimate the implicit regularization. Our method works on networks with known regularization, recovering popular explicit penalties like $\ell_1$ and $\ell_2$. It also replicates known implicit effects, like the quadratic weight penalty induced by early stopping in gradient descent, demonstrating that it can be used to test theories of implicit regularization. Crucially, because our method is empirical, it can handle implicit regularization in arbitrary networks. We demonstrate this use by characterizing the effects of dropout in deep networks, showing implicit $\ell_2$ effects in this popular method. Our work shows that practitioners can use gradient matching to understand regularization in networks with implicit biases that are too complicated to derive analytically.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper proposes gradient matching methods to empirically estimate implicit regularization (implicit bias) in deep networks with complex training modifications. It validates the approach by recovering known explicit penalties (ℓ1, ℓ2) and known implicit effects (quadratic penalty from early stopping in GD), then applies it to characterize dropout as inducing implicit ℓ2 regularization, claiming the empirical method works for arbitrary networks where analytic derivation is intractable.

Significance. If the method can reliably isolate implicit bias from stochastic gradient noise, it would offer a practical, falsifiable tool for quantifying regularization effects in real training pipelines, aiding both theoretical understanding and hyperparameter interpretation. The grounding in recovery of independently known penalties is a strength, as is the focus on replicable effects like early stopping.

major comments (2)
  1. [§5 (dropout experiments)] The central claim that the method characterizes implicit regularization in arbitrary modified procedures (including dropout) is load-bearing on the assertion that gradient deviations can be matched to an equivalent penalty even when minibatch and dropout noise are present. The validation recovers penalties only in low-noise/deterministic cases (explicit ℓ1/ℓ2, early stopping); no ablation, noise-only baseline, or variance decomposition is reported to show the fitted ℓ2 term reflects bias rather than absorbing irreducible stochastic variance whose statistics depend on batch size, dropout rate, and weights. This directly affects whether the dropout result supports the claim for complex networks.
  2. [Method section (gradient matching procedure)] The fitting of an equivalent penalty term to observed update deviations assumes the deviation is a clean signal of implicit bias. No analysis is given of how the matching accuracy or recovered parameters degrade as a function of batch size or dropout rate, nor is there a quantitative metric (e.g., residual error after fitting, cross-validation against held-out updates) that would confirm the procedure separates bias from noise when both are simultaneously present.
minor comments (3)
  1. [Abstract] The abstract states the method 'replicates known implicit effects' but does not specify the quantitative criterion used to declare successful replication (e.g., parameter recovery error, R² of the fit).
  2. [Method] Notation for the estimated penalty (e.g., how the equivalent regularization term is parameterized and optimized) should be introduced earlier and used consistently when reporting recovered coefficients for ℓ1/ℓ2 and dropout cases.
  3. [Figures (dropout results)] Figure captions for the dropout results should include error bars or multiple random seeds to indicate variability in the recovered penalty strength.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights important aspects of validating our gradient matching approach under stochastic conditions. We agree that additional evidence is needed to confirm the method isolates implicit bias from noise in the dropout setting, and we will revise the manuscript accordingly to strengthen these claims.

read point-by-point responses
  1. Referee: [§5 (dropout experiments)] the central claim that the method characterizes implicit regularization in arbitrary modified procedures (including dropout) is load-bearing on the assertion that gradient deviations can be matched to an equivalent penalty even when minibatch and dropout noise are present. The validation recovers penalties only in low-noise/deterministic cases (explicit ℓ1/ℓ2, early stopping); no ablation, noise-only baseline, or variance decomposition is reported to show the fitted ℓ2 term reflects bias rather than absorbing irreducible stochastic variance whose statistics depend on batch size, dropout rate, and weights. This directly affects whether the dropout result supports the claim for complex networks.

    Authors: We agree that the current validation leaves open the question of whether the fitted ℓ2 term in the dropout experiments primarily captures implicit bias or absorbs stochastic variance. While the observed scaling of the recovered regularization strength with dropout rate is consistent with known theoretical predictions for dropout, we acknowledge the absence of targeted ablations. In the revision we will add a noise-only baseline (fitting to stochastic gradients from a network without dropout), ablations over batch size and dropout rate that track the stability of the recovered parameter, and reporting of residual fitting error to quantify the portion of variance explained by the bias term versus irreducible noise. These additions will directly test whether the procedure separates the two effects. revision: yes

  2. Referee: [Method section (gradient matching procedure)] the fitting of an equivalent penalty term to observed update deviations assumes the deviation is a clean signal of implicit bias. No analysis is given of how the matching accuracy or recovered parameters degrade as a function of batch size or dropout rate, nor is there a quantitative metric (e.g., residual error after fitting, cross-validation against held-out updates) that would confirm the procedure separates bias from noise when both are simultaneously present.

    Authors: The referee is correct that the manuscript does not yet provide a quantitative characterization of robustness to noise. We will expand the method section to include: (i) plots of recovered regularization strength and mean-squared residual error as functions of batch size and dropout probability; (ii) a cross-validation procedure that fits the penalty on one set of observed updates and evaluates predictive accuracy on held-out gradient deviations; and (iii) explicit discussion of the conditions (e.g., sufficient averaging over multiple runs) under which the matching procedure is expected to isolate bias. These changes will give readers clear guidance on the method's operating regime. revision: yes

Circularity Check

0 steps flagged

Empirical gradient-matching method is self-contained with no circular derivation steps

full rationale

The paper introduces an empirical procedure for estimating implicit regularization via matching observed deviations between weight updates and loss gradients to equivalent penalty terms. Validation proceeds by recovering externally known explicit penalties (ℓ1, ℓ2) and independently established implicit effects (quadratic penalty from early stopping), none of which are defined or derived from the method's own equations. Application to dropout is presented as a characterization exercise rather than a closed prediction. No load-bearing step reduces by construction to a fitted parameter, self-citation, or ansatz imported from the authors' prior work; the central claim remains falsifiable against external benchmarks and does not loop back to its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that implicit regularization manifests as a deviation from loss gradients that can be matched to an equivalent explicit penalty; no free parameters, invented entities, or additional axioms are specified in the abstract.

axioms (1)
  • domain assumption: Implicit regularization can be represented as an equivalent penalty that augments the learning objective and causes observable deviations in weight updates.
    Invoked in the abstract when stating that regularization makes weight updates deviate from loss gradients.

pith-pipeline@v0.9.0 · 5560 in / 1202 out tokens · 35935 ms · 2026-05-08T15:51:14.265097+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

52 extracted references · 34 canonical work pages · 2 internal anchors

  1. [1]

    Do Neural Networks Need Gradient Descent to Generalize? A Theoretical Study, June 2025

    Yotam Alexander, Yonatan Slutzky, Yuval Ran-Milo, and Nadav Cohen. Do Neural Networks Need Gradient Descent to Generalize? A Theoretical Study, June 2025. URL http://arxiv.org/abs/2506.03931. arXiv:2506.03931 [cs]

  2. [2]

    A continuous-time view of early stopping for least squares regression

    Alnur Ali, J Zico Kolter, and Ryan J Tibshirani. A continuous-time view of early stopping for least squares regression. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1370–1378. PMLR, 2019

  3. [3]

    On the Optimization of Deep Networks: Implicit Acceleration by Overparameterization, June 2018

    Sanjeev Arora, Nadav Cohen, and Elad Hazan. On the Optimization of Deep Networks: Implicit Acceleration by Overparameterization, June 2018. URL http://arxiv.org/abs/1802.06509. arXiv:1802.06509 [cs]

  4. [4]

    Implicit Regularization in Deep Matrix Factorization, October 2019

    Sanjeev Arora, Nadav Cohen, Wei Hu, and Yuping Luo. Implicit Regularization in Deep Matrix Factorization, October 2019. URLhttp://arxiv.org/abs/1905.13655. arXiv:1905.13655 [cs]

  5. [5]

    Implicit Gradient Regularization, July 2022

    David G. T. Barrett and Benoit Dherin. Implicit Gradient Regularization, July 2022. URL http://arxiv.org/abs/2009.11162. arXiv:2009.11162 [cs]

  6. [6]

    Deep learning: a statistical viewpoint

    Peter L. Bartlett, Andrea Montanari, and Alexander Rakhlin. Deep learning: a statistical viewpoint. Acta Numerica, 30:87–201, May 2021. doi: 10.1017/S0962492921000027

  7. [7]

    Towards Exact Computation of Inductive Bias, June 2024

    Akhilan Boopathy, William Yue, Jaedong Hwang, Abhiram Iyer, and Ila Fiete. Towards Exact Computation of Inductive Bias, June 2024. URLhttp://arxiv.org/abs/2406.15941. arXiv:2406.15941 [cs]

  8. [8]

    Spectral bias and task-model alignment explain generalization in kernel regression and infinitely wide neural networks

    Abdulkadir Canatar, Blake Bordelon, and Cengiz Pehlevan. Spectral bias and task-model alignment explain generalization in kernel regression and infinitely wide neural networks. Nature Communications, 12(1):2914, May 2021

  9. [9]

    doi: 10.1038/s41467-021-23103-1. URL https://www.nature.com/articles/s41467-021-23103-1

  10. [10]

    Dropout as a Low-Rank Regularizer for Matrix Factorization

    Jacopo Cavazza, Pietro Morerio, Benjamin Haeffele, Connor Lane, Vittorio Murino, and Rene Vidal. Dropout as a Low-Rank Regularizer for Matrix Factorization. InProceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, pages 435–444. PMLR, March 2018. URLhttps://proceedings.mlr.press/v84/cavazza18a.html

  11. [11]

    Entropy-sgd: Biasing gradient descent into wide valleys

    Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Yann LeCun, Carlo Baldassi, Christian Borgs, Jennifer Chayes, Levent Sagun, and Riccardo Zecchina. Entropy-SGD: Biasing Gradient Descent Into Wide Valleys, April 2017. URL http://arxiv.org/abs/1611.01838. arXiv:1611.01838 [cs]

  12. [12]

    Implicit Bias of Gradient Descent for Wide Two-layer Neural Networks Trained with the Logistic Loss

    Lénaïc Chizat and Francis Bach. Implicit Bias of Gradient Descent for Wide Two-layer Neural Networks Trained with the Logistic Loss. In Proceedings of Thirty Third Conference on Learning Theory, pages 1305–1338. PMLR, July 2020. URL https://proceedings.mlr.press/v125/chizat20a.html

  13. [13]

    Sharpness-Aware Minimization for Efficiently Improving Generalization, April 2021

    Pierre Foret, Ariel Kleiner, Hossein Mobahi, and Behnam Neyshabur. Sharpness-Aware Minimization for Efficiently Improving Generalization, April 2021. URL http://arxiv.org/abs/2010.01412. arXiv:2010.01412 [cs]

  14. [14]

    Implicit Regularization in Matrix Factorization, May 2017

    Suriya Gunasekar, Blake Woodworth, Srinadh Bhojanapalli, Behnam Neyshabur, and Nathan Srebro. Implicit Regularization in Matrix Factorization, May 2017. URLhttp://arxiv. org/abs/1705.09280. arXiv:1705.09280 [stat]

  15. [15]

    Comparing Biases for Minimal Network Construction with Back-Propagation

    Stephen Hanson and Lorien Pratt. Comparing Biases for Minimal Network Construction with Back-Propagation. In Advances in Neural Information Processing Systems, volume 1. Morgan-Kaufmann, 1988. URL https://proceedings.neurips.cc/paper/1988/hash/1c9ac0159c94d8d0cbedc973445af2da-Abstract.html

  16. [16]

    On the inductive bias of dropout

    David P. Helmbold and Philip M. Long. On the inductive bias of dropout. J. Mach. Learn. Res., 16(1):3403–3454, January 2015

  17. [17]

    Flat Minima

    Sepp Hochreiter and Jürgen Schmidhuber. Flat Minima. Neural Computation, 9(1):1–42, January 1997. doi: 10.1162/neco.1997.9.1.1. URL https://doi.org/10.1162/neco.1997.9.1.1

  18. [18]

    Probing as Quantifying Inductive Bias

    Alexander Immer, Lucas Torroba Hennigen, Vincent Fortuin, and Ryan Cotterell. Probing as Quantifying Inductive Bias. In Smaranda Muresan, Preslav Nakov, and Aline Villavicen- cio, editors,Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1839–1851, Dublin, Ireland, May 2022. Asso- ciatio...

  19. [19]

    Simran Kaur, Jeremy Cohen, and Zachary C. Lipton. On the Maximum Hessian Eigenvalue and Generalization, June 2022. URLhttps://arxiv.org/abs/2206.10654v3

  20. [20]

    On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima

    Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima, February 2017. URLhttp://arxiv.org/abs/1609.04836. arXiv:1609.04836 [cs]

  21. [21]

    Gradient-based learning applied to document recognition

    Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, November 1998. doi: 10.1109/5.726791. URL https://ieeexplore.ieee.org/document/726791

  22. [22]

    MNIST handwritten digit database.ATT Labs [Online]

    Yann LeCun, Corinna Cortes, and CJ Burges. MNIST handwritten digit database.ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2, 2010

  23. [23]

    On the Principles of Parsimony and Self- Consistency for the Emergence of Intelligence, July 2022

    Yi Ma, Doris Tsao, and Heung-Yeung Shum. On the Principles of Parsimony and Self- Consistency for the Emergence of Intelligence, July 2022. URLhttp://arxiv.org/abs/ 2207.04630. arXiv:2207.04630 [cs, math]

  24. [24]

    D. J.C. MacKay. Probable networks and plausible predictions-a review of practical Bayesian methods for supervised neural networks.Network: Computation in Neural Systems, 6(3):469, August 1995. ISSN 0954-898X. doi: 10.1088/0954-898X/6/3/011. URLhttps://doi.org/ 10.1088/0954-898X/6/3/011

  25. [25]

    On Dropout and Nuclear Norm Regularization

    Poorya Mianjy and Raman Arora. On Dropout and Nuclear Norm Regularization. In Proceedings of the 36th International Conference on Machine Learning, pages 4575–4584. PMLR, May 2019. URL https://proceedings.mlr.press/v97/mianjy19a.html

  26. [26]

    On the Implicit Bias of Dropout

    Poorya Mianjy, Raman Arora, and Rene Vidal. On the Implicit Bias of Dropout. In Proceedings of the 35th International Conference on Machine Learning, pages 3540–3548. PMLR, July 2018. URL https://proceedings.mlr.press/v80/mianjy18b.html

  27. [27]

    Convergence and Implicit Bias of Gradient Flow on Overparametrized Linear Networks, May 2022

    Hancheng Min, Salma Tarmoun, Rene Vidal, and Enrique Mallada. Convergence and Implicit Bias of Gradient Flow on Overparametrized Linear Networks, May 2022. URLhttp:// arxiv.org/abs/2105.06351. arXiv:2105.06351 [cs]

  28. [28]

    Implicit Bias of the Step Size in Linear Diagonal Neural Networks

    Mor Shpigel Nacson, Kavya Ravichandran, Nathan Srebro, and Daniel Soudry. Implicit Bias of the Step Size in Linear Diagonal Neural Networks. InProceedings of the 39th International Conference on Machine Learning, pages 16270–16295. PMLR, June 2022. URLhttps: //proceedings.mlr.press/v162/nacson22a.html

  29. [29]

    SGD on Neural Networks Learns Functions of Increasing Complexity, May 2019

    Preetum Nakkiran, Gal Kaplun, Dimitris Kalimeris, Tristan Yang, Benjamin L. Edelman, Fred Zhang, and Boaz Barak. SGD on Neural Networks Learns Functions of Increasing Complexity, May 2019. URL http://arxiv.org/abs/1905.11604. arXiv:1905.11604 [cs]

  30. [30]

    Bayesian Learning for Neural Networks

    Radford M. Neal. Bayesian Learning for Neural Networks, volume 118 of Lecture Notes in Statistics. Springer, New York, NY, 1996. doi: 10.1007/978-1-4612-0745-0. URL http://link.springer.com/10.1007/978-1-4612-0745-0

  31. [31]

    In Search of the Real Inductive Bias: On the Role of Implicit Regularization in Deep Learning

    Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. In Search of the Real Inductive Bias: On the Role of Implicit Regularization in Deep Learning, April 2015. URLhttp: //arxiv.org/abs/1412.6614. arXiv:1412.6614 [cs]

  32. [32]

    Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets

    Alethea Power, Yuri Burda, Harri Edwards, Igor Babuschkin, and Vedant Misra. Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets, January 2022. URLhttp: //arxiv.org/abs/2201.02177. arXiv:2201.02177 [cs]

  33. [33]

    On the Spectral Bias of Neural Networks

    Nasim Rahaman, Aristide Baratin, Devansh Arpit, Felix Draxler, Min Lin, Fred Hamprecht, Yoshua Bengio, and Aaron Courville. On the Spectral Bias of Neural Networks. InProceed- ings of the 36th International Conference on Machine Learning, pages 5301–5310. PMLR, May 2019. URLhttps://proceedings.mlr.press/v97/rahaman19a.html

  34. [34]

    Reginaldo J. Santos. Equivalence of regularization and truncated iteration for general ill-posed problems.Linear Algebra and its Applications, 236:25–33, March 1996. ISSN 0024-3795. doi: 10.1016/0024-3795(94)00114-6. URLhttps://www.sciencedirect.com/science/ article/pii/0024379594001146

  35. [35]

    On the Origin of Implicit Regularization in Stochastic Gradient Descent, January 2021

    Samuel L. Smith, Benoit Dherin, David G. T. Barrett, and Soham De. On the Origin of Implicit Regularization in Stochastic Gradient Descent, January 2021. URL http://arxiv.org/abs/2101.12176. arXiv:2101.12176 [cs]

  36. [36]

    The Implicit Bias of Gradient Descent on Separable Data, November 2018

    Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro. The Implicit Bias of Gradient Descent on Separable Data, November 2018. URLhttp://arxiv. org/abs/1710.10345. arXiv:1710.10345 [stat]

  37. [37]

    Dropout: A Simple Way to Prevent Neural Networks from Overfitting.Journal of Machine Learning Research, 15(56):1929–1958, 2014

    Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhut- dinov. Dropout: A Simple Way to Prevent Neural Networks from Overfitting.Journal of Machine Learning Research, 15(56):1929–1958, 2014. ISSN 1533-7928. URLhttp: //jmlr.org/papers/v15/srivastava14a.html

  38. [38]

    Deep learning generalizes because the parameter-function map is biased towards simple functions, April 2019

    Guillermo Valle-Pérez, Chico Q. Camargo, and Ard A. Louis. Deep learning generalizes because the parameter-function map is biased towards simple functions, April 2019. URL http://arxiv.org/abs/1805.08522. arXiv:1805.08522 [stat]

  39. [39]

    On Margin Maximization in Linear and ReLU Networks, October 2022

    Gal Vardi, Ohad Shamir, and Nathan Srebro. On Margin Maximization in Linear and ReLU Networks, October 2022. URLhttp://arxiv.org/abs/2110.02732. arXiv:2110.02732 [cs]

  40. [40]

    Dropout Training as Adaptive Regularization, November 2013

    Stefan Wager, Sida Wang, and Percy Liang. Dropout Training as Adaptive Regularization, November 2013. URLhttp://arxiv.org/abs/1307.1493. arXiv:1307.1493 [stat]

  41. [41]

    The Implicit and Explicit Regularization Effects of Dropout

    Colin Wei, Sham Kakade, and Tengyu Ma. The Implicit and Explicit Regularization Effects of Dropout. InProceedings of the 37th International Conference on Machine Learning, pages 10181–10192. PMLR, November 2020. URLhttps://proceedings.mlr.press/v119/ wei20d.html

  42. [42]

    Deep Learning is Not So Mysterious or Different, March 2025

    Andrew Gordon Wilson. Deep Learning is Not So Mysterious or Different, March 2025. URL http://arxiv.org/abs/2503.02113. arXiv:2503.02113 [cs]

  43. [43]

    Bayesian Deep Learning and a Probabilistic Perspective of Generalization, March 2022

    Andrew Gordon Wilson and Pavel Izmailov. Bayesian Deep Learning and a Probabilistic Perspective of Generalization, March 2022. URL http://arxiv.org/abs/2002.08791. arXiv:2002.08791 [cs]

  44. [44]

    Kernel and Rich Regimes in Overparametrized Models

    Blake Woodworth, Suriya Gunasekar, Jason D. Lee, Edward Moroshko, Pedro Savarese, Itay Golan, Daniel Soudry, and Nathan Srebro. Kernel and Rich Regimes in Overparametrized Models. In Proceedings of Thirty Third Conference on Learning Theory, pages 3635–3673. PMLR, July 2020. URL https://proceedings.mlr.press/v125/woodworth20a.html

  45. [45]

    On Early Stopping in Gradient Descent Learning.Constructive Approximation, 26(2):289–315, August 2007

    Yuan Yao, Lorenzo Rosasco, and Andrea Caponnetto. On Early Stopping in Gradient Descent Learning. Constructive Approximation, 26(2):289–315, August 2007. doi: 10.1007/s00365-006-0663-2. URL https://doi.org/10.1007/s00365-006-0663-2

  46. [46]

    The Law of Parsimony in Gradient Descent for Learning Deep Linear Networks, June 2023

    Can Yaras, Peng Wang, Wei Hu, Zhihui Zhu, Laura Balzano, and Qing Qu. The Law of Parsimony in Gradient Descent for Learning Deep Linear Networks, June 2023. URLhttp: //arxiv.org/abs/2306.01154. arXiv:2306.01154 [cs]

  47. [47]

    Understanding deep learning (still) requires rethinking generalization

    Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning (still) requires rethinking generalization. Communications of the ACM, 64(3):107–115, February 2021. doi: 10.1145/3446776. URL https://doi.org/10.1145/3446776

  48. [48]

    Gradient Norm Aware Minimization Seeks First-Order Flatness and Improves Generalization

    Xingxuan Zhang, Renzhe Xu, Han Yu, Hao Zou, and Peng Cui. Gradient Norm Aware Minimization Seeks First-Order Flatness and Improves Generalization. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 20247–20257, Vancouver, BC, Canada, June 2023. IEEE. doi: 10.1109/CVPR52729.2023.01939

  49. [49]

    Additionally, (Sub-Gaussian correlated errors) E[ξ|Φ] = 0, and there exists σ ≥ 0 so that E[exp(t v⊤ξ)|Φ] ≤ exp(σ²t²v⊤Σv/2) for all v ∈ ℝᵖ

    To obtain a bound on the error, we make the following assumptions: Assumption 2 (Nonparametric endpoint well-specification). b = b⋆ + ξ, where b⋆ is the predictable endpoint bias and ξ is mean-zero correlated error. Additionally: (Sub-Gaussian correlated errors) E[ξ|Φ] = 0, and there exists σ ≥ 0 so that E[exp(t v⊤ξ)|Φ] ≤ exp(σ²t²v⊤Σv/2) for all v ∈ ℝᵖ. (Bounded log-co...

  50. [50]

    propose a Bayesian framework for quantifying theamountof inductive bias needed to achieve generalization on some prediction task. They define inductive bias of a task as the negative log probability that a hypothesishachieves some test error rateε– intuitively, if a randomly sampled hypothesish∼p h is unlikely (small probability) then a large inductive bi...

  51. [51]

    propose a method of quantifying inductive bias based on probing intermediate representations. They consider Bayesian evidence as a proxy for inductive bias, formalizing it as the maximum evidence (how likely it is that a particular dataset could have been generated by a given model) for some intermediate representation, over all possible probes in a funct...

  52. [52]

    We repeat each configuration across 10 random seeds

    by gradient matching, fitting the single parameter with Adam at learning rate 0.05 for up to 2000 epochs and bias-fit patience 100. We repeat each configuration across 10 random seeds. H.5 Recovering Barrett implicit gradient regularization. This appendix details the known-coefficient recovery experiment for the implicit gradient regularizer derived by Barrett an...