arxiv: 2602.01642 · v2 · submitted 2026-02-02 · 💻 cs.LG · cs.AI· math.OC· stat.CO· stat.ML

Recognition: no theorem link

The Effect of Mini-Batch Noise on the Implicit Bias of Adam

Matias D. Cattaneo , Boris Shigida

Authors on Pith no claims yet

Pith reviewed 2026-05-16 08:00 UTC · model grok-4.3

classification 💻 cs.LG cs.AImath.OCstat.COstat.ML

keywords Adam optimizermini-batch noiseimplicit biasgeneralizationmomentum parametersbatch sizeloss landscapemulti-epoch training

0 comments

The pith

Mini-batch noise reverses whether Adam's higher β2 pushes toward sharper or flatter minima.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that mini-batch noise interacts with Adam's momentum memory to control an implicit bias toward sharp or flat regions of the loss surface. In large-batch regimes, raising β2 strengthens anti-regularization and tends to hurt generalization. When batches shrink, the dependence flips so that higher β2 instead favors flatter minima that support better generalization. A parallel but opposite reversal occurs for β1. These patterns matter for multi-epoch training on limited data, where the choice of β1 and β2 can shift validation accuracy without any change in explicit regularization.

Core claim

In the case of large batch sizes, higher β2 increases the magnitude of anti-regularization by memory (hurting generalization), but as the batch size becomes smaller, the dependence of (anti-)regularization on β2 is reversed. A similar monotonicity shift (in the opposite direction) happens in β1. The commonly default pair (β1, β2) = (0.9, 0.999) is a good choice if batches are small; for larger batches, moving β1 closer to β2 is much better in terms of validation accuracy in multi-epoch training. The scale of the batch size at which the shift happens connects to the scale of the critical batch size.

What carries the argument

The interaction between mini-batch noise and Adam's momentum memory parameters β1 and β2, which sets the direction and strength of implicit bias toward sharper or flatter loss minima.

If this is right

Large-batch training benefits from setting β1 closer to β2 to reduce anti-regularization.
Small-batch training performs well with the standard default values β1=0.9 and β2=0.999.
The batch-size scale at which the monotonicity reversal occurs tracks the critical batch size.
The effect is observable in the about-to-overfit multi-epoch regime on small-scale data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Batch size should be treated as a first-class hyperparameter when choosing momentum values for Adam.
The reversal implies that noise can counteract memory-driven anti-regularization in ways standard bias analyses miss.
Similar batch-size-dependent reversals may appear in other momentum-based adaptive methods.
Dynamic schedules that adjust β1 or β2 as effective batch size changes could improve generalization.

Load-bearing premise

The interaction between mini-batch noise and momentum memory can be isolated from other optimization dynamics, and that flatness of reached minima correlates with generalization in the multi-epoch regime.

What would settle it

An experiment that trains Adam with fixed β2 across a range of batch sizes, measures the curvature of the reached minima, and finds either no reversal in the β2-flatness dependence or no connection between that curvature and validation accuracy.

Figures

Figures reproduced from arXiv: 2602.01642 by Boris Shigida, Matias D. Cattaneo.

**Figure 2.** Figure 2: Minimal validation perplexity (before overfitting) of Transformer-XL trained with Adam on [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗

**Figure 3.** Figure 3: Minimal validation perplexity (before overfitting) of Transformer-XL trained with Adam on [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

**Figure 4.** Figure 4: The estimated simple noise scale Bbsimple for different training runs of Transformer-XL on WikiText-2. 27 [PITH_FULL_IMAGE:figures/full_fig_p027_4.png] view at source ↗

read the original abstract

With limited high-quality data and growing compute, multi-epoch training is gaining back its importance across sub-areas of deep learning. Adam(W), versions of which are go-to optimizers for many tasks such as next token prediction, has two momentum hyperparameters $(\beta_1, \beta_2)$ controlling memory and one very important hyperparameter, batch size, controlling (in particular) the amount mini-batch noise. We introduce a theoretical framework to understand how mini-batch noise influences the implicit bias of memory in Adam (depending on $\beta_1$, $\beta_2$) towards sharper or flatter regions of the loss landscape, which is commonly observed to correlate with the generalization gap in multi-epoch training. We find that in the case of large batch sizes, higher $\beta_2$ increases the magnitude of anti-regularization by memory (hurting generalization), but as the batch size becomes smaller, the dependence of (anti-)regulariation on $\beta_2$ is reversed. A similar monotonicity shift (in the opposite direction) happens in $\beta_1$. In particular, the commonly "default" pair $(\beta_1, \beta_2) = (0.9, 0.999)$ is a good choice if batches are small; for larger batches, in many settings moving $\beta_1$ closer to $\beta_2$ is much better in terms of validation accuracy in multi-epoch training. Moreover, our theoretical derivations connect the scale of the batch size at which the shift happens to the scale of the critical batch size. We illustrate this effect in experiments with small-scale data in the about-to-overfit regime.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper claims mini-batch noise reverses how Adam's β2 (and oppositely β1) affects implicit bias toward sharp or flat minima, with the flip tied to critical batch size, but the abstract gives no derivations or bounds so the claim stays provisional.

read the letter

The main thing to know is that this work links mini-batch noise to a sign flip in the implicit bias from Adam's momentum memory: higher β2 pushes toward sharper regions (worse generalization) at large batch sizes but the dependence reverses at small sizes, with a similar but opposite shift for β1. They suggest the default (0.9, 0.999) suits small batches while larger ones may benefit from β1 closer to β2, and they connect the transition to critical batch size in multi-epoch training on limited data.

Referee Report

3 major / 2 minor

Summary. The paper introduces a theoretical framework to analyze how mini-batch noise influences the implicit bias induced by Adam's momentum parameters (β1, β2) toward sharper or flatter regions of the loss landscape in multi-epoch training. It claims that the dependence of this (anti-)regularization effect on β2 reverses with batch size: higher β2 increases anti-regularization for large batches but the dependence reverses for smaller batches, with an opposite monotonicity shift for β1. The framework connects the reversal scale to the critical batch size, and the practical implication is that the default (0.9, 0.999) pair is suitable for small batches while moving β1 closer to β2 is preferable for larger batches. These predictions are illustrated via experiments on small-scale data in the about-to-overfit regime.

Significance. If the isolation of noise-memory coupling holds with the required domination bounds, the result supplies a principled, batch-size-dependent rule for tuning Adam hyperparameters to improve generalization in multi-epoch regimes that are regaining importance under data constraints. The explicit linkage of the reversal point to critical batch size constitutes a falsifiable prediction and a strength of the work. The significance remains conditional on verifying that the modeling assumptions do not introduce artifacts near the critical scale.

major comments (3)

[Theoretical Framework] Theoretical Framework: the monotonicity reversal for β2 (and opposite shift for β1) is stated without derivation details, error bounds, or explicit assumptions on the loss landscape; it is therefore impossible to verify whether the predicted reversal is independent of the data used to illustrate it or reduces to a quantity fitted from the same observations.
[Theoretical Framework] Theoretical Framework: no domination bounds or regime conditions are supplied to guarantee that the mini-batch noise–momentum memory interaction dominates curvature evolution, gradient alignment changes, and multi-epoch landscape drift; without these, the reversal could be an artifact of the isolation choice rather than a robust prediction.
[Experiments] Experiments section: the experiments are characterized only as “small-scale” and “about-to-overfit” with no reported controls for post-hoc hyperparameter choices or checks that the observed correlation between flatness and generalization persists outside this narrow regime, weakening support for the practical recommendation on default β values.

minor comments (2)

[Abstract] Abstract: the phrase “anti-regularization by memory” is used without a concise definition or pointer to the relevant equation, which may hinder readers outside the implicit-bias literature.
[Notation] Notation: introduce the precise definition of the critical batch size and its relation to the reversal threshold in a dedicated paragraph or equation early in the theoretical section.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below, committing to revisions that add derivation details, regime bounds, and experimental clarifications while preserving the core contributions.

read point-by-point responses

Referee: [Theoretical Framework] Theoretical Framework: the monotonicity reversal for β2 (and opposite shift for β1) is stated without derivation details, error bounds, or explicit assumptions on the loss landscape; it is therefore impossible to verify whether the predicted reversal is independent of the data used to illustrate it or reduces to a quantity fitted from the same observations.

Authors: We will include the complete derivation in the appendix, starting from the Adam second-moment update with additive mini-batch noise. Assumptions are local quadratic loss approximation and bounded noise variance. The reversal follows analytically from the closed-form bias term coupling noise scale to β2, with the transition point tied directly to critical batch size; no data fitting is involved. Error bounds on the approximation will be stated explicitly. revision: yes
Referee: [Theoretical Framework] Theoretical Framework: no domination bounds or regime conditions are supplied to guarantee that the mini-batch noise–momentum memory interaction dominates curvature evolution, gradient alignment changes, and multi-epoch landscape drift; without these, the reversal could be an artifact of the isolation choice rather than a robust prediction.

Authors: We agree and will add a dedicated subsection deriving explicit regime conditions. These include noise variance dominating curvature evolution rate (by factor Ω(1/√B)) and memory decay outpacing alignment drift over epochs. The bounds confirm the noise-memory term governs the reversal within the stated regime, ruling out isolation artifacts. revision: yes
Referee: [Experiments] Experiments section: the experiments are characterized only as “small-scale” and “about-to-overfit” with no reported controls for post-hoc hyperparameter choices or checks that the observed correlation between flatness and generalization persists outside this narrow regime, weakening support for the practical recommendation on default β values.

Authors: Experiments target the about-to-overfit regime where implicit bias is most visible in multi-epoch settings. We will expand the section with full hyperparameter grid results to eliminate post-hoc concerns and add explicit discussion of regime limitations. The practical β recommendations rest primarily on the theory; experiments remain illustrative, and broader validation is noted as future work. revision: partial

Circularity Check

0 steps flagged

Derivation chain self-contained without reduction to inputs

full rationale

The paper introduces a theoretical framework analyzing mini-batch noise effects on Adam momentum memory (β1, β2) and its implicit bias toward flat/sharp regions, deriving monotonicity reversals in β dependence as batch size varies and linking the transition scale to critical batch size. No equations, self-citations, fitted parameters presented as predictions, or imported uniqueness theorems appear in the provided text to reduce the central claims to inputs by construction. The isolation of noise-memory coupling is stated as an assumption within the framework rather than a self-referential fit, leaving the derivations independent and self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review prevents enumeration of specific free parameters or axioms. The framework implicitly assumes a separation between noise-induced bias and other Adam dynamics, plus a flatness-generalization correlation that is treated as given.

pith-pipeline@v0.9.0 · 5616 in / 1208 out tokens · 30637 ms · 2026-05-16T08:00:22.069745+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

83 extracted references · 83 canonical work pages · 9 internal anchors

[1]

A Modern Look at the Relationship between Sharpness and Generalization

Maksym Andriushchenko et al. “A Modern Look at the Relationship between Sharpness and Generalization”.Proceedings of the 40th International Conference on Machine Learning. Vol. 202. Proceedings of Machine Learning Research. PMLR, 2023 (cit. on p. 3)

work page 2023
[2]

Maksym Andriushchenko et al.Why Do We Need Weight Decay in Modern Deep Learning?2024 (cit. on p. 3). 10

work page 2024
[3]

PaLM 2 Technical Report

Rohan Anil et al. “Palm 2 technical report”.arXiv preprint arXiv:2305.10403(2023) (cit. on p. 1)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[4]

Understanding Gradient Descent on the Edge of Stability in Deep Learning

Sanjeev Arora, Zhiyuan Li, and Abhishek Panigrahi. “Understanding Gradient Descent on the Edge of Stability in Deep Learning”.Proceedings of the 39th International Conference on Machine Learning. Vol. 162. Proceedings of Machine Learning Research. PMLR, 2022 (cit. on p. 3)

work page 2022
[5]

Implicit Gradient Regularization

David Barrett and Benoit Dherin. “Implicit Gradient Regularization”.International Conference on Learning Representations. 2021 (cit. on p. 3)

work page 2021
[6]

On the Trajectories of SGD Without Replacement

Pierfrancesco Beneventano. “On the Trajectories of SGD Without Replacement”.arXiv preprint arXiv:2312.16143(2023) (cit. on p. 3)

work page arXiv 2023
[7]

Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling

Stella Biderman et al. “Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling”.Proceedings of the 40th International Conference on Machine Learning. Vol. 202. Proceedings of Machine Learning Research. PMLR, 2023 (cit. on p. 1)

work page 2023
[9]

Language Models are Few-Shot Learners

Tom Brown et al. “Language Models are Few-Shot Learners”.Advances in Neural Information Processing Systems. Vol. 33. Curran Associates, Inc., 2020 (cit. on p. 1)

work page 2020
[10]

Modified Loss of Momentum Gradient Descent: Fine-Grained Analysis

Matias D Cattaneo and Boris Shigida. “Modified Loss of Momentum Gradient Descent: Fine-Grained Analysis”.arXiv preprint arXiv:2509.08483(2025) (cit. on p. 4)

work page arXiv 2025
[11]

On the Implicit Bias of Adam

Matias D. Cattaneo, Jason Matthew Klusowski, and Boris Shigida. “On the Implicit Bias of Adam”. Proceedings of the 41st International Conference on Machine Learning. Vol. 235. Proceedings of Machine Learning Research. PMLR, 2024 (cit. on pp. 2, 3, 6, 8)

work page 2024
[12]

How Memory in Optimization Algorithms Implicitly Modifies the Loss

Matias D. Cattaneo and Boris Shigida. “How Memory in Optimization Algorithms Implicitly Modifies the Loss”.The Thirty-ninth Annual Conference on Neural Information Processing Systems. 2025 (cit. on pp. 2–4)

work page 2025
[13]

Adaptive Methods through the Lens of SDEs: Theoretical Insights on the Role of Noise

Enea Monzio Compagnoni et al. “Adaptive Methods through the Lens of SDEs: Theoretical Insights on the Role of Noise”.The Thirteenth International Conference on Learning Representations. 2025 (cit. on p. 2)

work page 2025
[14]

Transformer-XL: Attentive Language Models beyond a Fixed-Length Context

Zihang Dai et al. “Transformer-XL: Attentive Language Models beyond a Fixed-Length Context”. Annual Meeting of the Association for Computational Linguistics. 2019 (cit. on p. 10)

work page 2019
[15]

Label Noise SGD Provably Prefers Flat Global Minimizers

Alex Damian, Tengyu Ma, and Jason D Lee. “Label Noise SGD Provably Prefers Flat Global Minimizers”.Advances in Neural Information Processing Systems. Vol. 34. Curran Associates, Inc., 2021 (cit. on p. 3)

work page 2021
[16]

Sharp Minima Can Generalize For Deep Nets

Laurent Dinh et al. “Sharp Minima Can Generalize For Deep Nets”.Proceedings of the 34th International Conference on Machine Learning. Vol. 70. Proceedings of Machine Learning Research. PMLR, 2017 (cit. on p. 3)

work page 2017
[17]

Sharpness-Aware Training for Free

Jiawei Du et al. “Sharpness-Aware Training for Free”.Advances in Neural Information Processing Systems. Vol. 35. Curran Associates, Inc., 2022 (cit. on p. 3)

work page 2022
[18]

The Llama 3 Herd of Models

Abhimanyu Dubey et al. “The llama 3 herd of models”.arXiv preprint arXiv:2407.21783(2024) (cit. on p. 1)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[19]

Multiscale analysis of accelerated gradient methods

Mohammad Farazmand. “Multiscale analysis of accelerated gradient methods”.SIAM Journal on Optimization30.3 (2020) (cit. on p. 3)

work page 2020
[20]

Sharpness-aware Minimization for Efficiently Improving Generalization

Pierre Foret et al. “Sharpness-aware Minimization for Efficiently Improving Generalization”.Inter- national Conference on Learning Representations. 2021 (cit. on pp. 2, 3, 5)

work page 2021
[21]

Implicit regularization in Heavy-ball momentum accelerated stochastic gradient descent

Avrajit Ghosh et al. “Implicit regularization in Heavy-ball momentum accelerated stochastic gradient descent”.The Eleventh International Conference on Learning Representations. 2023 (cit. on pp. 3, 6)

work page 2023
[22]

Characterizing implicit bias in terms of optimization geometry

Suriya Gunasekar et al. “Characterizing implicit bias in terms of optimization geometry”.Interna- tional Conference on Machine Learning. PMLR. 2018 (cit. on p. 3)

work page 2018
[23]

Implicit bias of gradient descent on linear convolutional networks

Suriya Gunasekar et al. “Implicit bias of gradient descent on linear convolutional networks”.Advances in neural information processing systems31 (2018) (cit. on p. 3). 11

work page 2018
[24]

SIMPLIFYING NEURAL NETS BY DISCOVERING FLAT MINIMA

Sepp Hochreiter and J¨ urgen Schmidhuber. “SIMPLIFYING NEURAL NETS BY DISCOVERING FLAT MINIMA”.Advances in Neural Information Processing Systems. Vol. 7. MIT Press, 1994 (cit. on p. 2)

work page 1994
[25]

Directional convergence and alignment in deep learning

Ziwei Ji and Matus Telgarsky. “Directional convergence and alignment in deep learning”.Advances in Neural Information Processing Systems33 (2020) (cit. on p. 3)

work page 2020
[26]

Gradient descent aligns the layers of deep linear networks

Ziwei Ji and Matus Telgarsky. “Gradient descent aligns the layers of deep linear networks”.arXiv preprint arXiv:1810.02032(2018) (cit. on p. 3)

work page internal anchor Pith review Pith/arXiv arXiv 2018
[27]

Risk and parameter convergence of logistic regression

Ziwei Ji and Matus Telgarsky. “Risk and parameter convergence of logistic regression”.arXiv preprint arXiv:1803.07300(2018) (cit. on p. 3)

work page internal anchor Pith review Pith/arXiv arXiv 2018
[28]

The implicit bias of gradient descent on nonseparable data

Ziwei Ji and Matus Telgarsky. “The implicit bias of gradient descent on nonseparable data”. Conference on Learning Theory. PMLR. 2019 (cit. on p. 3)

work page 2019
[29]

Fantastic Generalization Measures and Where to Find Them

Yiding Jiang et al. “Fantastic Generalization Measures and Where to Find Them”.International Conference on Learning Representations. 2020 (cit. on pp. 2, 5)

work page 2020
[30]

On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima

Nitish Shirish Keskar et al. “On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima”.International Conference on Learning Representations. 2017 (cit. on p. 2)

work page 2017
[31]

2025 (cit

Konwoo Kim et al.Pre-training under infinite compute. 2025 (cit. on p. 1)

work page 2025
[32]

Fisher SAM: Information Geometry and Sharpness Aware Minimisation

Minyoung Kim et al. “Fisher SAM: Information Geometry and Sharpness Aware Minimisation”. Proceedings of the 39th International Conference on Machine Learning. Vol. 162. Proceedings of Machine Learning Research. PMLR, 2022 (cit. on p. 3)

work page 2022
[33]

Adam: A Method for Stochastic Optimization

Diederik P Kingma and Jimmy Ba. “Adam: A method for stochastic optimization”.arXiv preprint arXiv:1412.6980(2014) (cit. on pp. 1, 5)

work page internal anchor Pith review Pith/arXiv arXiv 2014
[34]

Weight decay induces low-rank attention layers

Seijin Kobayashi, Yassir Akram, and Johannes von Oswald. “Weight decay induces low-rank attention layers”.The Thirty-eighth Annual Conference on Neural Information Processing Systems. 2024 (cit. on p. 3)

work page 2024
[35]

Continuous time analysis of momentum methods

Nikola B Kovachki and Andrew M Stuart. “Continuous time analysis of momentum methods”. Journal of Machine Learning Research22.17 (2021) (cit. on pp. 3, 6)

work page 2021
[36]

Noise Is Not the Main Factor Behind the Gap Between Sgd and Adam on Transformers, But Sign Descent Might Be

Frederik Kunstner et al. “Noise Is Not the Main Factor Behind the Gap Between Sgd and Adam on Transformers, But Sign Descent Might Be”.The Eleventh International Conference on Learning Representations. 2023 (cit. on p. 10)

work page 2023
[37]

ASAM: Adaptive Sharpness-Aware Minimization for Scale-Invariant Learning of Deep Neural Networks

Jungmin Kwon et al. “ASAM: Adaptive Sharpness-Aware Minimization for Scale-Invariant Learning of Deep Neural Networks”.Proceedings of the 38th International Conference on Machine Learning. Vol. 139. Proceedings of Machine Learning Research. PMLR, 2021 (cit. on p. 3)

work page 2021
[38]

Enhancing Sharpness-Aware Optimization Through Variance Suppression

Bingcong Li and Georgios B. Giannakis. “Enhancing Sharpness-Aware Optimization Through Variance Suppression”.Thirty-seventh Conference on Neural Information Processing Systems. 2023 (cit. on p. 3)

work page 2023
[39]

Friendly Sharpness-Aware Minimization

Tao Li et al. “Friendly Sharpness-Aware Minimization”.2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2024 (cit. on p. 3)

work page 2024
[40]

Towards Efficient and Scalable Sharpness-Aware Minimization

Yong Liu et al. “Towards Efficient and Scalable Sharpness-Aware Minimization”.2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2022 (cit. on p. 3)

work page 2022
[41]

2019 (cit

Ilya Loshchilov and Frank Hutter.Decoupled Weight Decay Regularization. 2019 (cit. on p. 1)

work page 2019
[42]

Gradient descent maximizes the margin of homogeneous neural networks

Kaifeng Lyu and Jian Li. “Gradient descent maximizes the margin of homogeneous neural networks”. arXiv preprint arXiv:1906.05890(2019) (cit. on p. 3)

work page arXiv 1906
[43]

A Qualitative Study of the Dynamic Behavior for Adaptive Gradient Algorithms

Chao Ma, Lei Wu, and Weinan E. “A Qualitative Study of the Dynamic Behavior for Adaptive Gradient Algorithms”.Proceedings of the 2nd Mathematical and Scientific Machine Learning Conference. Vol. 145. Proceedings of Machine Learning Research. PMLR, 2022 (cit. on p. 2)

work page 2022
[44]

On the SDEs and Scaling Rules for Adaptive Gradient Algorithms

Sadhika Malladi et al. “On the SDEs and Scaling Rules for Adaptive Gradient Algorithms”.Advances in Neural Information Processing Systems. Vol. 35. Curran Associates, Inc., 2022 (cit. on p. 2)

work page 2022
[45]

Small Batch Size Training for Language Models: When Vanilla SGD Works, and Why Gradient Accumulation is Wasteful

Martin Marek et al. “Small Batch Size Training for Language Models: When Vanilla SGD Works, and Why Gradient Accumulation is Wasteful”.The Thirty-ninth Annual Conference on Neural Information Processing Systems. 2025 (cit. on p. 2). 12

work page 2025
[46]

2018 (cit

Sam McCandlish et al.An Empirical Model of Large-Batch Training. 2018 (cit. on pp. 8, 10)

work page 2018
[47]

Pointer Sentinel Mixture Models

Stephen Merity et al. “Pointer Sentinel Mixture Models”.International Conference on Learning Representations. 2017 (cit. on p. 10)

work page 2017
[48]

Toward Equation of Motion for Deep Neural Networks: Continuous-time Gradient Descent and Discretization Error Analysis

Taiki Miyagawa. “Toward Equation of Motion for Deep Neural Networks: Continuous-time Gradient Descent and Discretization Error Analysis”.Advances in Neural Information Processing Systems. 2022 (cit. on p. 3)

work page 2022
[49]

Stochastic gradient descent on separable data: Exact convergence with a fixed learning rate

Mor Shpigel Nacson, Nathan Srebro, and Daniel Soudry. “Stochastic gradient descent on separable data: Exact convergence with a fixed learning rate”.The 22nd International Conference on Artificial Intelligence and Statistics. PMLR. 2019 (cit. on p. 3)

work page 2019
[50]

Convergence of gradient descent on separable data

Mor Shpigel Nacson et al. “Convergence of gradient descent on separable data”.The 22nd Interna- tional Conference on Artificial Intelligence and Statistics. PMLR. 2019 (cit. on p. 3)

work page 2019
[51]

Lexicographic and depth-sensitive margins in homogeneous and non- homogeneous deep models

Mor Shpigel Nacson et al. “Lexicographic and depth-sensitive margins in homogeneous and non- homogeneous deep models”.International Conference on Machine Learning. PMLR. 2019 (cit. on p. 3)

work page 2019
[52]

In Search of Adam’s Secret Sauce

Antonio Orvieto and Robert M. Gower. “In Search of Adam’s Secret Sauce”.The Thirty-ninth Annual Conference on Neural Information Processing Systems. 2025 (cit. on p. 2)

work page 2025
[53]

The AdEMAMix Optimizer: Better, Faster, Older

Matteo Pagliardini, Pierre Ablin, and David Grangier. “The AdEMAMix Optimizer: Better, Faster, Older”.The Thirteenth International Conference on Learning Representations. 2025 (cit. on p. 2)

work page 2025
[54]

Resolving Discrepancies in Compute-Optimal Scaling of Language Models

Tomer Porian et al. “Resolving Discrepancies in Compute-Optimal Scaling of Language Models”. The Thirty-eighth Annual Conference on Neural Information Processing Systems. 2024 (cit. on p. 2)

work page 2024
[55]

The implicit bias of adagrad on separable data

Qian Qian and Xiaoyuan Qian. “The implicit bias of adagrad on separable data”.Advances in Neural Information Processing Systems32 (2019) (cit. on p. 3)

work page 2019
[56]

A Scale Invariant Measure of Flatness for Deep Network Minima

Akshay Rangamani et al. “A Scale Invariant Measure of Flatness for Deep Network Minima”. ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2021 (cit. on p. 3)

work page 2021
[57]

On a continuous time model of gradient descent dynamics and instability in deep learning

Mihaela Rosca et al. “On a continuous time model of gradient descent dynamics and instability in deep learning”.Transactions on Machine Learning Research(2023).issn: 2835-8856 (cit. on p. 3)

work page 2023
[58]

Descending through a Crowded Valley - Benchmarking Deep Learning Optimizers

Robin M Schmidt, Frank Schneider, and Philipp Hennig. “Descending through a Crowded Valley - Benchmarking Deep Learning Optimizers”.Proceedings of the 38th International Conference on Machine Learning. Vol. 139. Proceedings of Machine Learning Research. PMLR, 2021 (cit. on p. 2)

work page 2021
[59]

Adafactor: Adaptive learning rates with sublinear memory cost

Noam Shazeer and Mitchell Stern. “Adafactor: Adaptive learning rates with sublinear memory cost”. International Conference on Machine Learning. PMLR. 2018 (cit. on p. 1)

work page 2018
[60]

Optimizer benchmarking needs to account for hyperparameter tuning

Prabhu Teja Sivaprasad et al. “Optimizer benchmarking needs to account for hyperparameter tuning”.International conference on machine learning. PMLR. 2020 (cit. on p. 1)

work page 2020
[61]

On the Origin of Implicit Regularization in Stochastic Gradient Descent

Samuel L Smith et al. “On the Origin of Implicit Regularization in Stochastic Gradient Descent”. International Conference on Learning Representations. 2021 (cit. on pp. 3, 4)

work page 2021
[62]

The implicit bias of gradient descent on separable data

Daniel Soudry et al. “The implicit bias of gradient descent on separable data”.The Journal of Machine Learning Research19.1 (2018) (cit. on p. 3)

work page 2018
[63]

A Universal Class of Sharpness-Aware Minimization Algorithms

Behrooz Tahmasebi et al. “A Universal Class of Sharpness-Aware Minimization Algorithms”.Forty- first International Conference on Machine Learning. 2024 (cit. on p. 3)

work page 2024
[64]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron et al. “Llama 2: Open foundation and fine-tuned chat models”.arXiv preprint arXiv:2307.09288(2023) (cit. on p. 1)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[65]

Normalized Flat Minima: Exploring Scale Invariant Definition of Flat Minima for Neural Networks Using PAC-Bayesian Analysis

Yusuke Tsuzuku, Issei Sato, and Masashi Sugiyama. “Normalized Flat Minima: Exploring Scale Invariant Definition of Flat Minima for Neural Networks Using PAC-Bayesian Analysis”.Proceedings of the 37th International Conference on Machine Learning. Vol. 119. Proceedings of Machine Learning Research. PMLR, 2020 (cit. on p. 3)

work page 2020
[66]

Position: Will we run out of data? Limits of LLM scaling based on human- generated data

Pablo Villalobos et al. “Position: Will we run out of data? Limits of LLM scaling based on human- generated data”.Forty-first International Conference on Machine Learning. 2024 (cit. on p. 1)

work page 2024
[67]

Does Momentum Change the Implicit Regularization on Separable Data?

Bohan Wang et al. “Does Momentum Change the Implicit Regularization on Separable Data?” Advances in Neural Information Processing Systems35 (2022) (cit. on p. 3). 13

work page 2022
[68]

The Implicit Bias for Adaptive Optimization Algorithms on Homogeneous Neural Networks

Bohan Wang et al. “The Implicit Bias for Adaptive Optimization Algorithms on Homogeneous Neural Networks”.Proceedings of the 38th International Conference on Machine Learning. Vol. 139. Proceedings of Machine Learning Research. PMLR, 2021 (cit. on p. 3)

work page 2021
[69]

2025 (cit

Kaiyue Wen et al.Fantastic Pretraining Optimizers and Where to Find Them. 2025 (cit. on p. 2)

work page 2025
[70]

2025 (cit

Lechao Xiao.Rethinking Conventional Wisdom in Machine Learning: From Generalization to Scaling. 2025 (cit. on p. 1)

work page 2025
[71]

Implicit Bias of AdamW: ℓ∞-Norm Constrained Optimization

Shuo Xie and Zhiyuan Li. “Implicit Bias of AdamW: ℓ∞-Norm Constrained Optimization”.Proceed- ings of the 41st International Conference on Machine Learning. Vol. 235. Proceedings of Machine Learning Research. PMLR, 2024 (cit. on p. 3)

work page 2024
[72]

SAMPa: Sharpness-aware Minimization Par- allelized

Wanyun Xie, Thomas Pethick, and Volkan Cevher. “SAMPa: Sharpness-aware Minimization Par- allelized”.The Thirty-eighth Annual Conference on Neural Information Processing Systems. 2024 (cit. on p. 3)

work page 2024
[73]

Adaptive Inertia: Disentangling the Effects of Adaptive Learning Rate and Momen- tum

Zeke Xie et al. “Adaptive Inertia: Disentangling the Effects of Adaptive Learning Rate and Momen- tum”.Proceedings of the 39th International Conference on Machine Learning. Vol. 162. Proceedings of Machine Learning Research. PMLR, 2022 (cit. on p. 2)

work page 2022
[74]

Positively Scale-Invariant Flatness of ReLU Neural Networks

Mingyang Yi et al. “Positively Scale-Invariant Flatness of ReLU Neural Networks”.arXiv preprint arXiv:1903.02237(2019) (cit. on p. 3)

work page internal anchor Pith review Pith/arXiv arXiv 1903
[75]

GLM-130B: An Open Bilingual Pre-trained Model

Aohan Zeng et al. “Glm-130b: An open bilingual pre-trained model”.arXiv preprint arXiv:2210.02414 (2022) (cit. on p. 1)

work page internal anchor Pith review Pith/arXiv arXiv 2022
[76]

Three Mechanisms of Weight Decay Regularization

Guodong Zhang et al. “Three Mechanisms of Weight Decay Regularization”.International Conference on Learning Representations. 2019 (cit. on p. 3)

work page 2019
[77]

How Does Critical Batch Size Scale in Pre-training?

Hanlin Zhang et al. “How Does Critical Batch Size Scale in Pre-training?”The Thirteenth Interna- tional Conference on Learning Representations. 2025 (cit. on p. 2)

work page 2025
[78]

Why are adaptive methods good for attention models?

Jingzhao Zhang et al. “Why are adaptive methods good for attention models?”Advances in Neural Information Processing Systems33 (2020) (cit. on p. 10)

work page 2020
[79]

OPT: Open Pre-trained Transformer Language Models

Susan Zhang et al. “Opt: Open pre-trained transformer language models”.arXiv preprint arX- iv:2205.01068(2022) (cit. on p. 1)

work page internal anchor Pith review Pith/arXiv arXiv 2022
[80]

Deconstructing What Makes a Good Optimizer for Autoregressive Language Models

Rosie Zhao et al. “Deconstructing What Makes a Good Optimizer for Autoregressive Language Models”.The Thirteenth International Conference on Learning Representations. 2025 (cit. on pp. 2, 26)

work page 2025
[81]

Regularizing Neural Networks via Adversarial Model Perturbation

Yaowei Zheng, Richong Zhang, and Yongyi Mao. “Regularizing Neural Networks via Adversarial Model Perturbation”.2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2021 (cit. on p. 3)

work page 2021

Showing first 80 references.