pith. machine review for the scientific record

arxiv: 2605.05436 · v1 · submitted 2026-05-06 · 📊 stat.ML · cs.LG

Recognition: unknown

Estimating Implicit Regularization in Deep Learning

Giles Hooker, Joseph H. Rudoler, Kevin Tan, Konrad P. Kording


Pith reviewed 2026-05-08 15:51 UTC · model grok-4.3

classification 📊 stat.ML cs.LG
keywords implicit regularization · deep learning · gradient matching · dropout · early stopping · L2 penalty · neural network training

The pith

Gradient matching methods can empirically recover the implicit regularization effects induced by complex training procedures in deep networks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops gradient matching techniques that compare observed weight updates against pure loss gradients to identify equivalent penalty terms. This approach recovers explicit penalties such as L1 and L2 when they are known, and it reproduces the quadratic weight penalty known to arise from early stopping in gradient descent. Applied to dropout, the method reveals an implicit L2-like effect in deep networks. Because the technique is empirical rather than analytical, it applies to arbitrary architectures and training modifications where closed-form derivations are unavailable. A sympathetic reader would value this for turning opaque training choices into measurable regularization strengths.

Core claim

By solving for the penalty term whose gradient best explains the difference between actual parameter updates and the loss gradient alone, one obtains an estimate of the implicit regularization at work during training; this recovers known explicit and implicit penalties and characterizes dropout as inducing an L2 effect in deep networks.
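In symbols, the claim can be written as an optimization over candidate penalties (notation ours for illustration, not necessarily the paper's): an observed update is modeled as a gradient step on the loss plus a penalty, and the penalty is fit so its gradient best explains the residual deviation.

```latex
% Model of the observed update (illustrative notation):
%   \theta_{t+1} = \theta_t - \eta\,\big(\nabla L(\theta_t) + \nabla R(\theta_t)\big)
% Gradient matching over a penalty family \mathcal{R} (e.g. R(\theta)=\lambda\|\theta\|_2^2/2):
\hat{R} \;=\; \arg\min_{R \in \mathcal{R}} \;
\sum_t \Big\| \tfrac{\theta_t - \theta_{t+1}}{\eta}
\;-\; \nabla L(\theta_t) \;-\; \nabla R(\theta_t) \Big\|_2^2
```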

What carries the argument

Gradient matching, which finds a penalty function whose gradient aligns observed weight updates with the sum of the loss gradient plus that penalty gradient.
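A minimal sketch of that idea (ours, not the paper's code): train a linear model with a known ℓ2 penalty, then recover its coefficient by least-squares matching of the observed update deviations. All names and settings here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear-regression loss L(w) = ||Xw - y||^2 / (2n).
n, p = 200, 5
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + 0.1 * rng.normal(size=n)

def loss_grad(w):
    return X.T @ (X @ w - y) / n

# Trainer with a *known* penalty R(w) = lam_true * ||w||^2 / 2, so each
# observed update equals -eta * (grad L + lam_true * w).
eta, lam_true = 0.1, 0.05
w = np.ones(p)  # start away from zero so deviations are informative
devs, ws = [], []
for _ in range(50):
    g = loss_grad(w)
    update = -eta * (g + lam_true * w)  # what the modified trainer actually does
    devs.append(-update / eta - g)      # deviation from the pure loss gradient
    ws.append(w.copy())
    w = w + update

# Gradient matching: least-squares fit of lam so that lam * w explains the
# deviations along the trajectory: lam_hat = <w, dev> / <w, w>.
D, W = np.concatenate(devs), np.concatenate(ws)
lam_hat = (W @ D) / (W @ W)
print(f"true lambda = {lam_true}, recovered = {lam_hat:.4f}")
```

In this noiseless setting the deviation is exactly lam_true * w, so the fit recovers the coefficient; the paper's harder cases (minibatching, dropout) add noise on top of this signal.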

If this is right

  • The method verifies analytical predictions of implicit bias such as the quadratic penalty from early stopping.
  • Dropout in deep networks produces an implicit L2 regularization effect that the matching procedure can quantify.
  • Practitioners can apply the technique to interpret the net regularization strength of their chosen training modifications.
  • The empirical nature of the approach allows characterization of implicit biases in networks too complex for analytic derivation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same matching idea could be used to compare regularization strength across different optimizers or architectures in controlled experiments.
  • Quantified implicit penalties might guide automated hyperparameter search by treating regularization strength as an observable target.
  • Extending the method to measure how regularization evolves over training epochs could reveal time-varying bias effects not captured by static penalties.

Load-bearing premise

The deviation between actual weight updates and pure loss gradients can be matched to the gradient of some explicit penalty term.

What would settle it

Running the method on early-stopped gradient descent and failing to recover a quadratic weight penalty would falsify the claim that the matching procedure identifies the implicit regularization.
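A toy version of that check can be run directly. Ali et al. relate early-stopped least-squares gradient descent to ridge regression with strength roughly λ ≈ 1/t at effective time t = ηT; a gross mismatch between the two solutions would be the kind of failure that falsifies the matching story. Setup and tolerance below are illustrative, and the correspondence is approximate, not exact.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 500, 4
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + 0.1 * rng.normal(size=n)
A, b = X.T @ X / n, X.T @ y / n  # quadratic loss: gradient is A w - b

# Early-stopped gradient descent on the unpenalized least-squares loss.
eta, T = 0.01, 500
w = np.zeros(p)
for _ in range(T):
    w -= eta * (A @ w - b)

# Ridge solution at the approximately matched strength lambda ~ 1/(eta*T).
lam = 1.0 / (eta * T)
w_ridge = np.linalg.solve(A + lam * np.eye(p), b)

# The two should be close; a large gap would contradict the claimed
# quadratic-penalty interpretation of early stopping.
rel_err = np.linalg.norm(w - w_ridge) / np.linalg.norm(w_ridge)
print(f"relative gap between early-stopped GD and ridge: {rel_err:.3f}")
```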

Figures

Figures reproduced from arXiv: 2605.05436 by Giles Hooker, Joseph H. Rudoler, Kevin Tan, Konrad P. Kording.

Figure 1
Figure 1: Deviations from loss minima contain information about regularization. Regularization modifies the objective function, leading to different optima than strictly minimizing empirical loss. This gap must be explained by a regularizer (∇R) counteracting any nonzero loss gradient (−∇L) – this is the key logic of the paper.
Figure 2
Figure 2: Gradient deviations along the trajectory. In a 2-parameter toy model, SGD mini-batching causes weight updates along the trajectory to deviate from the full-batch gradient. We can model the implicit regularization underlying these deviations.
Figure 3
Figure 3: Recovering explicit regularization. Elastic-net recovery at β = 10⁻³ on a 6 × 6 grid of true (λ1, λ2) with 10 dataset resamples per cell. Each panel plots the estimated λ̂i against the true λi on log–log axes; lines show the mean across seeds and error bars give ±1 standard error. Left: λ̂1 vs. true λ1, one line per value of true λ2. Right: λ̂2 vs. true λ2, one line per value of true λ1.
Figure 4
Figure 4: Implicit regularization due to early stopping. (A) The theoretical regularization matrix Λ(t) predicted by Ali et al. at the early-stopped iterate t = 500. (B) The full symmetric estimator Λ̂(t) using m = 10 endpoints, each trained to the same fixed iterate t = 500. (C) The underparametrized diagonal estimator diag(Λ̂(t)) fit from a single endpoint.
Figure 5
Figure 5: Implicit regularization of dropout. Estimated ℓ2 regularization strength (λ̂) as a function of dropout rate for MNIST classifiers. Each point is one seed; columns vary width, rows vary depth. Color indicates gradient-matching loss (darker = better fit). The monotonic increase with dropout rate is consistent with the theoretical prediction that dropout acts as adaptive weight decay.
Figure 6
Figure 6: Recovering implicit gradient regularization from discrete gradient steps. Barrett and Dherin [5] derived λ = ηp/4 for the gradient-penalty regularizer induced by discrete GD. Left: Value of RIG, the averaged squared gradient, against the estimated scaling λ̂. Middle: Test accuracy against λ̂. Right: Direct recovery check comparing λ̂ to the theoretical value ηp/4.
Figure 7
Figure 7: Bootstrap recovery of the early-stopped GD regularizer.
read the original abstract

Deep learning systems are known to exhibit implicit regularization (alt. implicit bias), favoring simple solutions instead of merely minimizing the loss function. In some cases, we can analytically derive the implicit regularization -- connecting it to an equivalent penalty that augments the learning objective. However, modern deep learning systems are complex, carrying modifications to the training procedure and architecture (e.g. early stopping, minibatching, dropout) whose effects are not always directly interpretable. Although estimating the resulting implicit regularization could aid theorists in algorithm design and practitioners in interpreting their hyperparameter choices, this problem has received little direct attention. It is also tractable: regularization makes weight updates deviate from loss gradients, promising a signal for identifying implicit bias. Here we provide gradient matching methods that can be used to empirically estimate the implicit regularization. Our method works on networks with known regularization, recovering popular explicit penalties like $\ell_1$ and $\ell_2$. It also replicates known implicit effects, like the quadratic weight penalty induced by early stopping in gradient descent, demonstrating that it can be used to test theories of implicit regularization. Crucially, because our method is empirical, it can handle implicit regularization in arbitrary networks. We demonstrate this use by characterizing the effects of dropout in deep networks, showing implicit $\ell_2$ effects in this popular method. Our work shows that practitioners can use gradient matching to understand regularization in networks with implicit biases that are too complicated to derive analytically.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper proposes gradient matching methods to empirically estimate implicit regularization (implicit bias) in deep networks with complex training modifications. It validates the approach by recovering known explicit penalties (ℓ1, ℓ2) and known implicit effects (quadratic penalty from early stopping in GD), then applies it to characterize dropout as inducing implicit ℓ2 regularization, claiming the empirical method works for arbitrary networks where analytic derivation is intractable.

Significance. If the method can reliably isolate implicit bias from stochastic gradient noise, it would offer a practical, falsifiable tool for quantifying regularization effects in real training pipelines, aiding both theoretical understanding and hyperparameter interpretation. The grounding in recovery of independently known penalties is a strength, as is the focus on replicable effects like early stopping.

major comments (2)
  1. [§5 (dropout experiments)] The central claim that the method characterizes implicit regularization in arbitrary modified procedures (including dropout) is load-bearing on the assertion that gradient deviations can be matched to an equivalent penalty even when minibatch and dropout noise are present. The validation recovers penalties only in low-noise/deterministic cases (explicit ℓ1/ℓ2, early stopping); no ablation, noise-only baseline, or variance decomposition is reported to show the fitted ℓ2 term reflects bias rather than absorbing irreducible stochastic variance whose statistics depend on batch size, dropout rate, and weights. This directly affects whether the dropout result supports the claim for complex networks.
  2. [Method section (gradient matching procedure)] The fitting of an equivalent penalty term to observed update deviations assumes the deviation is a clean signal of implicit bias. No analysis is given of how the matching accuracy or recovered parameters degrade as a function of batch size or dropout rate, nor is there a quantitative metric (e.g., residual error after fitting, cross-validation against held-out updates) that would confirm the procedure separates bias from noise when both are simultaneously present.
minor comments (3)
  1. [Abstract] The abstract states the method 'replicates known implicit effects' but does not specify the quantitative criterion used to declare successful replication (e.g., parameter recovery error, R² of the fit).
  2. [Method] Notation for the estimated penalty (e.g., how the equivalent regularization term is parameterized and optimized) should be introduced earlier and used consistently when reporting recovered coefficients for ℓ1/ℓ2 and dropout cases.
  3. [Figures (dropout results)] Figure captions for the dropout results should include error bars or multiple random seeds to indicate variability in the recovered penalty strength.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights important aspects of validating our gradient matching approach under stochastic conditions. We agree that additional evidence is needed to confirm the method isolates implicit bias from noise in the dropout setting, and we will revise the manuscript accordingly to strengthen these claims.

read point-by-point responses
  1. Referee: [§5 (dropout experiments)] the central claim that the method characterizes implicit regularization in arbitrary modified procedures (including dropout) is load-bearing on the assertion that gradient deviations can be matched to an equivalent penalty even when minibatch and dropout noise are present. The validation recovers penalties only in low-noise/deterministic cases (explicit ℓ1/ℓ2, early stopping); no ablation, noise-only baseline, or variance decomposition is reported to show the fitted ℓ2 term reflects bias rather than absorbing irreducible stochastic variance whose statistics depend on batch size, dropout rate, and weights. This directly affects whether the dropout result supports the claim for complex networks.

    Authors: We agree that the current validation leaves open the question of whether the fitted ℓ2 term in the dropout experiments primarily captures implicit bias or absorbs stochastic variance. While the observed scaling of the recovered regularization strength with dropout rate is consistent with known theoretical predictions for dropout, we acknowledge the absence of targeted ablations. In the revision we will add a noise-only baseline (fitting to stochastic gradients from a network without dropout), ablations over batch size and dropout rate that track the stability of the recovered parameter, and reporting of residual fitting error to quantify the portion of variance explained by the bias term versus irreducible noise. These additions will directly test whether the procedure separates the two effects. revision: yes

  2. Referee: [Method section (gradient matching procedure)] the fitting of an equivalent penalty term to observed update deviations assumes the deviation is a clean signal of implicit bias. No analysis is given of how the matching accuracy or recovered parameters degrade as a function of batch size or dropout rate, nor is there a quantitative metric (e.g., residual error after fitting, cross-validation against held-out updates) that would confirm the procedure separates bias from noise when both are simultaneously present.

    Authors: The referee is correct that the manuscript does not yet provide a quantitative characterization of robustness to noise. We will expand the method section to include: (i) plots of recovered regularization strength and mean-squared residual error as functions of batch size and dropout probability; (ii) a cross-validation procedure that fits the penalty on one set of observed updates and evaluates predictive accuracy on held-out gradient deviations; and (iii) explicit discussion of the conditions (e.g., sufficient averaging over multiple runs) under which the matching procedure is expected to isolate bias. These changes will give readers clear guidance on the method's operating regime. revision: yes

Circularity Check

0 steps flagged

Empirical gradient-matching method is self-contained with no circular derivation steps

full rationale

The paper introduces an empirical procedure for estimating implicit regularization via matching observed deviations between weight updates and loss gradients to equivalent penalty terms. Validation proceeds by recovering externally known explicit penalties (ℓ1, ℓ2) and independently established implicit effects (quadratic penalty from early stopping), none of which are defined or derived from the method's own equations. Application to dropout is presented as a characterization exercise rather than a closed prediction. No load-bearing step reduces by construction to a fitted parameter, self-citation, or ansatz imported from the authors' prior work; the central claim remains falsifiable against external benchmarks and does not loop back to its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that implicit regularization manifests as a deviation from loss gradients that can be matched to an equivalent explicit penalty; no free parameters, invented entities, or additional axioms are specified in the abstract.

axioms (1)
  • domain assumption: Implicit regularization can be represented as an equivalent penalty that augments the learning objective and causes observable deviations in weight updates.
    Invoked in the abstract when stating that regularization makes weight updates deviate from loss gradients.

pith-pipeline@v0.9.0 · 5560 in / 1202 out tokens · 35935 ms · 2026-05-08T15:51:14.265097+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

52 extracted references · 34 canonical work pages · 2 internal anchors

  1. [1]

    Do Neural Networks Need Gradient Descent to Generalize? A Theoretical Study, June 2025

    Yotam Alexander, Yonatan Slutzky, Yuval Ran-Milo, and Nadav Cohen. Do Neural Networks Need Gradient Descent to Generalize? A Theoretical Study, June 2025. URL http://arxiv.org/abs/2506.03931. arXiv:2506.03931 [cs]

  2. [2]

    A continuous-time view of early stopping for least squares regression

    Alnur Ali, J Zico Kolter, and Ryan J Tibshirani. A continuous-time view of early stopping for least squares regression. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1370–1378. PMLR, 2019

  3. [3]

    On the Optimization of Deep Networks: Implicit Acceleration by Overparameterization, June 2018

    Sanjeev Arora, Nadav Cohen, and Elad Hazan. On the Optimization of Deep Networks: Implicit Acceleration by Overparameterization, June 2018. URL http://arxiv.org/abs/1802.06509. arXiv:1802.06509 [cs]

  4. [4]

    Implicit Regularization in Deep Matrix Factorization, October 2019

    Sanjeev Arora, Nadav Cohen, Wei Hu, and Yuping Luo. Implicit Regularization in Deep Matrix Factorization, October 2019. URLhttp://arxiv.org/abs/1905.13655. arXiv:1905.13655 [cs]

  5. [5]

    Implicit Gradient Regularization, July 2022

    David G. T. Barrett and Benoit Dherin. Implicit Gradient Regularization, July 2022. URL http://arxiv.org/abs/2009.11162. arXiv:2009.11162 [cs]

  6. [6]

    Deep learning: a statistical viewpoint

    Peter L. Bartlett, Andrea Montanari, and Alexander Rakhlin. Deep learning: a statistical viewpoint. Acta Numerica, 30:87–201, May 2021. doi: 10.1017/S0962492921000027

  7. [7]

    Towards Exact Computation of Inductive Bias, June 2024

    Akhilan Boopathy, William Yue, Jaedong Hwang, Abhiram Iyer, and Ila Fiete. Towards Exact Computation of Inductive Bias, June 2024. URLhttp://arxiv.org/abs/2406.15941. arXiv:2406.15941 [cs]

  8. [8]

    Spectral bias and task-model alignment explain generalization in kernel regression and infinitely wide neural networks

    Abdulkadir Canatar, Blake Bordelon, and Cengiz Pehlevan. Spectral bias and task-model alignment explain generalization in kernel regression and infinitely wide neural networks. Nature Communications, 12(1):2914, May 2021

  9. [9]

    doi: 10.1038/s41467-021-23103-1. URL https://www.nature.com/articles/s41467-021-23103-1

  10. [10]

    Dropout as a Low-Rank Regularizer for Matrix Factorization

    Jacopo Cavazza, Pietro Morerio, Benjamin Haeffele, Connor Lane, Vittorio Murino, and Rene Vidal. Dropout as a Low-Rank Regularizer for Matrix Factorization. InProceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, pages 435–444. PMLR, March 2018. URLhttps://proceedings.mlr.press/v84/cavazza18a.html

  11. [11]

    Entropy-sgd: Biasing gradient descent into wide valleys

    Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Yann LeCun, Carlo Baldassi, Christian Borgs, Jennifer Chayes, Levent Sagun, and Riccardo Zecchina. Entropy-SGD: Biasing Gradient Descent Into Wide Valleys, April 2017. URL http://arxiv.org/abs/1611.01838. arXiv:1611.01838 [cs]

  12. [12]

    Implicit Bias of Gradient Descent for Wide Two-layer Neural Networks Trained with the Logistic Loss

    Lénaïc Chizat and Francis Bach. Implicit Bias of Gradient Descent for Wide Two-layer Neural Networks Trained with the Logistic Loss. In Proceedings of Thirty Third Conference on Learning Theory, pages 1305–1338. PMLR, July 2020. URL https://proceedings.mlr.press/v125/chizat20a.html

  13. [13]

    Sharpness-Aware Minimization for Efficiently Improving Generalization, April 2021

    Pierre Foret, Ariel Kleiner, Hossein Mobahi, and Behnam Neyshabur. Sharpness-Aware Minimization for Efficiently Improving Generalization, April 2021. URL http://arxiv.org/abs/2010.01412. arXiv:2010.01412 [cs]

  14. [14]

    Implicit Regularization in Matrix Factorization, May 2017

    Suriya Gunasekar, Blake Woodworth, Srinadh Bhojanapalli, Behnam Neyshabur, and Nathan Srebro. Implicit Regularization in Matrix Factorization, May 2017. URLhttp://arxiv. org/abs/1705.09280. arXiv:1705.09280 [stat]

  15. [15]

    Comparing Biases for Minimal Network Construction with Back-Propagation

    Stephen Hanson and Lorien Pratt. Comparing Biases for Minimal Network Construction with Back-Propagation. In Advances in Neural Information Processing Systems, volume 1. Morgan-Kaufmann, 1988. URL https://proceedings.neurips.cc/paper/1988/hash/1c9ac0159c94d8d0cbedc973445af2da-Abstract.html

  16. [16]

    On the inductive bias of dropout

    David P. Helmbold and Philip M. Long. On the inductive bias of dropout. J. Mach. Learn. Res., 16(1):3403–3454, January 2015

  17. [17]

    Flat Minima

    Sepp Hochreiter and Jürgen Schmidhuber. Flat Minima. Neural Computation, 9(1):1–42, January 1997. doi: 10.1162/neco.1997.9.1.1. URL https://doi.org/10.1162/neco.1997.9.1.1

  18. [18]

    Probing as Quantifying Inductive Bias

    Alexander Immer, Lucas Torroba Hennigen, Vincent Fortuin, and Ryan Cotterell. Probing as Quantifying Inductive Bias. In Smaranda Muresan, Preslav Nakov, and Aline Villavicen- cio, editors,Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1839–1851, Dublin, Ireland, May 2022. Asso- ciatio...

  19. [19]

    Simran Kaur, Jeremy Cohen, and Zachary C. Lipton. On the Maximum Hessian Eigenvalue and Generalization, June 2022. URLhttps://arxiv.org/abs/2206.10654v3

  20. [20]

    On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima

    Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima, February 2017. URLhttp://arxiv.org/abs/1609.04836. arXiv:1609.04836 [cs]

  21. [21]

    Gradient-based learning applied to document recognition

    Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, November 1998. doi: 10.1109/5.726791. URL https://ieeexplore.ieee.org/document/726791

  22. [22]

    MNIST handwritten digit database.ATT Labs [Online]

    Yann LeCun, Corinna Cortes, and CJ Burges. MNIST handwritten digit database.ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2, 2010

  23. [23]

    On the Principles of Parsimony and Self- Consistency for the Emergence of Intelligence, July 2022

    Yi Ma, Doris Tsao, and Heung-Yeung Shum. On the Principles of Parsimony and Self- Consistency for the Emergence of Intelligence, July 2022. URLhttp://arxiv.org/abs/ 2207.04630. arXiv:2207.04630 [cs, math]

  24. [24]

    D. J.C. MacKay. Probable networks and plausible predictions-a review of practical Bayesian methods for supervised neural networks.Network: Computation in Neural Systems, 6(3):469, August 1995. ISSN 0954-898X. doi: 10.1088/0954-898X/6/3/011. URLhttps://doi.org/ 10.1088/0954-898X/6/3/011

  25. [25]

    On Dropout and Nuclear Norm Regularization

    Poorya Mianjy and Raman Arora. On Dropout and Nuclear Norm Regularization. In Proceedings of the 36th International Conference on Machine Learning, pages 4575–4584. PMLR, May 2019. URL https://proceedings.mlr.press/v97/mianjy19a.html

  26. [26]

    On the Implicit Bias of Dropout

    Poorya Mianjy, Raman Arora, and Rene Vidal. On the Implicit Bias of Dropout. In Proceedings of the 35th International Conference on Machine Learning, pages 3540–3548. PMLR, July 2018. URL https://proceedings.mlr.press/v80/mianjy18b.html

  27. [27]

    Convergence and Implicit Bias of Gradient Flow on Overparametrized Linear Networks, May 2022

    Hancheng Min, Salma Tarmoun, Rene Vidal, and Enrique Mallada. Convergence and Implicit Bias of Gradient Flow on Overparametrized Linear Networks, May 2022. URLhttp:// arxiv.org/abs/2105.06351. arXiv:2105.06351 [cs]

  28. [28]

    Implicit Bias of the Step Size in Linear Diagonal Neural Networks

    Mor Shpigel Nacson, Kavya Ravichandran, Nathan Srebro, and Daniel Soudry. Implicit Bias of the Step Size in Linear Diagonal Neural Networks. InProceedings of the 39th International Conference on Machine Learning, pages 16270–16295. PMLR, June 2022. URLhttps: //proceedings.mlr.press/v162/nacson22a.html

  29. [29]

    SGD on Neural Networks Learns Functions of Increasing Complexity, May 2019

    Preetum Nakkiran, Gal Kaplun, Dimitris Kalimeris, Tristan Yang, Benjamin L. Edelman, Fred Zhang, and Boaz Barak. SGD on Neural Networks Learns Functions of Increasing Complexity, May 2019. URL http://arxiv.org/abs/1905.11604. arXiv:1905.11604 [cs]

  30. [30]

    Bayesian Learning for Neural Networks

    Radford M. Neal. Bayesian Learning for Neural Networks, volume 118 of Lecture Notes in Statistics. Springer, New York, NY, 1996. doi: 10.1007/978-1-4612-0745-0. URL http://link.springer.com/10.1007/978-1-4612-0745-0

  31. [31]

    In Search of the Real Inductive Bias: On the Role of Implicit Regularization in Deep Learning

    Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. In Search of the Real Inductive Bias: On the Role of Implicit Regularization in Deep Learning, April 2015. URLhttp: //arxiv.org/abs/1412.6614. arXiv:1412.6614 [cs]

  32. [32]

    Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets

    Alethea Power, Yuri Burda, Harri Edwards, Igor Babuschkin, and Vedant Misra. Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets, January 2022. URLhttp: //arxiv.org/abs/2201.02177. arXiv:2201.02177 [cs]

  33. [33]

    On the Spectral Bias of Neural Networks

    Nasim Rahaman, Aristide Baratin, Devansh Arpit, Felix Draxler, Min Lin, Fred Hamprecht, Yoshua Bengio, and Aaron Courville. On the Spectral Bias of Neural Networks. InProceed- ings of the 36th International Conference on Machine Learning, pages 5301–5310. PMLR, May 2019. URLhttps://proceedings.mlr.press/v97/rahaman19a.html

  34. [34]

    Reginaldo J. Santos. Equivalence of regularization and truncated iteration for general ill-posed problems.Linear Algebra and its Applications, 236:25–33, March 1996. ISSN 0024-3795. doi: 10.1016/0024-3795(94)00114-6. URLhttps://www.sciencedirect.com/science/ article/pii/0024379594001146

  35. [35]

    On the Origin of Implicit Regularization in Stochastic Gradient Descent, January 2021

    Samuel L. Smith, Benoit Dherin, David G. T. Barrett, and Soham De. On the Origin of Implicit Regularization in Stochastic Gradient Descent, January 2021. URL http://arxiv.org/abs/2101.12176. arXiv:2101.12176 [cs]

  36. [36]

    The Implicit Bias of Gradient Descent on Separable Data, November 2018

    Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro. The Implicit Bias of Gradient Descent on Separable Data, November 2018. URLhttp://arxiv. org/abs/1710.10345. arXiv:1710.10345 [stat]

  37. [37]

    Dropout: A Simple Way to Prevent Neural Networks from Overfitting.Journal of Machine Learning Research, 15(56):1929–1958, 2014

    Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhut- dinov. Dropout: A Simple Way to Prevent Neural Networks from Overfitting.Journal of Machine Learning Research, 15(56):1929–1958, 2014. ISSN 1533-7928. URLhttp: //jmlr.org/papers/v15/srivastava14a.html

  38. [38]

    Deep learning generalizes because the parameter-function map is biased towards simple functions, April 2019

    Guillermo Valle-Pérez, Chico Q. Camargo, and Ard A. Louis. Deep learning generalizes because the parameter-function map is biased towards simple functions, April 2019. URL http://arxiv.org/abs/1805.08522. arXiv:1805.08522 [stat]

  39. [39]

    On Margin Maximization in Linear and ReLU Networks, October 2022

    Gal Vardi, Ohad Shamir, and Nathan Srebro. On Margin Maximization in Linear and ReLU Networks, October 2022. URLhttp://arxiv.org/abs/2110.02732. arXiv:2110.02732 [cs]

  40. [40]

    Dropout Training as Adaptive Regularization, November 2013

    Stefan Wager, Sida Wang, and Percy Liang. Dropout Training as Adaptive Regularization, November 2013. URLhttp://arxiv.org/abs/1307.1493. arXiv:1307.1493 [stat]

  41. [41]

    The Implicit and Explicit Regularization Effects of Dropout

    Colin Wei, Sham Kakade, and Tengyu Ma. The Implicit and Explicit Regularization Effects of Dropout. InProceedings of the 37th International Conference on Machine Learning, pages 10181–10192. PMLR, November 2020. URLhttps://proceedings.mlr.press/v119/ wei20d.html

  42. [42]

    Deep Learning is Not So Mysterious or Different, March 2025

    Andrew Gordon Wilson. Deep Learning is Not So Mysterious or Different, March 2025. URL http://arxiv.org/abs/2503.02113. arXiv:2503.02113 [cs]

  43. [43]

    Bayesian Deep Learning and a Probabilistic Perspective of Generalization, March 2022

    Andrew Gordon Wilson and Pavel Izmailov. Bayesian Deep Learning and a Probabilistic Perspective of Generalization, March 2022. URL http://arxiv.org/abs/2002.08791. arXiv:2002.08791 [cs]

  44. [44]

    Kernel and Rich Regimes in Overparametrized Models

    Blake Woodworth, Suriya Gunasekar, Jason D. Lee, Edward Moroshko, Pedro Savarese, Itay Golan, Daniel Soudry, and Nathan Srebro. Kernel and Rich Regimes in Overparametrized Models. In Proceedings of Thirty Third Conference on Learning Theory, pages 3635–3673. PMLR, July 2020. URL https://proceedings.mlr.press/v125/woodworth20a.html

  45. [45]

    On Early Stopping in Gradient Descent Learning.Constructive Approximation, 26(2):289–315, August 2007

    Yuan Yao, Lorenzo Rosasco, and Andrea Caponnetto. On Early Stopping in Gradient Descent Learning. Constructive Approximation, 26(2):289–315, August 2007. doi: 10.1007/s00365-006-0663-2. URL https://doi.org/10.1007/s00365-006-0663-2

  46. [46]

    The Law of Parsimony in Gradient Descent for Learning Deep Linear Networks, June 2023

    Can Yaras, Peng Wang, Wei Hu, Zhihui Zhu, Laura Balzano, and Qing Qu. The Law of Parsimony in Gradient Descent for Learning Deep Linear Networks, June 2023. URLhttp: //arxiv.org/abs/2306.01154. arXiv:2306.01154 [cs]

  47. [47]

    Understanding deep learning (still) requires rethinking generalization

    Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning (still) requires rethinking generalization. Communications of the ACM, 64(3):107–115, February 2021. doi: 10.1145/3446776. URL https://doi.org/10.1145/3446776

  48. [48]

    Gradient Norm Aware Minimization Seeks First-Order Flatness and Improves Generalization

    Xingxuan Zhang, Renzhe Xu, Han Yu, Hao Zou, and Peng Cui. Gradient Norm Aware Minimization Seeks First-Order Flatness and Improves Generalization. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 20247–20257, Vancouver, BC, Canada, June 2023. IEEE. doi: 10.1109/CVPR52729.2023.01939

  49. [49]

    Additionally, (Sub-Gaussian correlated errors) E[ξ|Φ] = 0, and there exists σ ≥ 0 so that E[exp(t v⊤ξ)|Φ] ≤ exp(σ²t²v⊤Σv/2) for all v ∈ ℝᵖ

    To obtain a bound on the error, we make the following assumptions: Assumption 2 (Nonparametric endpoint well-specification). b = b⋆ + ξ, where b⋆ is the predictable endpoint bias and ξ is mean-zero correlated error. Additionally: (Sub-Gaussian correlated errors) E[ξ|Φ] = 0, and there exists σ ≥ 0 so that E[exp(t v⊤ξ)|Φ] ≤ exp(σ²t²v⊤Σv/2) for all v ∈ ℝᵖ. (Bounded log-co...

  50. [50]

    propose a Bayesian framework for quantifying theamountof inductive bias needed to achieve generalization on some prediction task. They define inductive bias of a task as the negative log probability that a hypothesishachieves some test error rateε– intuitively, if a randomly sampled hypothesish∼p h is unlikely (small probability) then a large inductive bi...

  51. [51]

    propose a method of quantifying inductive bias based on probing intermediate representations. They consider Bayesian evidence as a proxy for inductive bias, formalizing it as the maximum evidence (how likely it is that a particular dataset could have been generated by a given model) for some intermediate representation, over all possible probes in a funct...

  52. [52]

    We repeat each configuration across 10 random seeds

    by gradient matching, fitting the single parameter with Adam at learning rate 0.05 for up to 2000 epochs and bias-fit patience 100. We repeat each configuration across 10 random seeds. H.5 Recovering Barrett implicit gradient regularization. This appendix details the known-coefficient recovery experiment for the implicit gradient regularizer derived by Barrett an...