pith. machine review for the scientific record. sign in

arxiv: 2602.01642 · v2 · submitted 2026-02-02 · 💻 cs.LG · cs.AI· math.OC· stat.CO· stat.ML

Recognition: no theorem link

The Effect of Mini-Batch Noise on the Implicit Bias of Adam

Authors on Pith no claims yet

Pith reviewed 2026-05-16 08:00 UTC · model grok-4.3

classification 💻 cs.LG cs.AImath.OCstat.COstat.ML
keywords Adam optimizermini-batch noiseimplicit biasgeneralizationmomentum parametersbatch sizeloss landscapemulti-epoch training
0
0 comments X

The pith

Mini-batch noise reverses whether Adam's higher β2 pushes toward sharper or flatter minima.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that mini-batch noise interacts with Adam's momentum memory to control an implicit bias toward sharp or flat regions of the loss surface. In large-batch regimes, raising β2 strengthens anti-regularization and tends to hurt generalization. When batches shrink, the dependence flips so that higher β2 instead favors flatter minima that support better generalization. A parallel but opposite reversal occurs for β1. These patterns matter for multi-epoch training on limited data, where the choice of β1 and β2 can shift validation accuracy without any change in explicit regularization.

Core claim

In the case of large batch sizes, higher β2 increases the magnitude of anti-regularization by memory (hurting generalization), but as the batch size becomes smaller, the dependence of (anti-)regularization on β2 is reversed. A similar monotonicity shift (in the opposite direction) happens in β1. The commonly default pair (β1, β2) = (0.9, 0.999) is a good choice if batches are small; for larger batches, moving β1 closer to β2 is much better in terms of validation accuracy in multi-epoch training. The scale of the batch size at which the shift happens connects to the scale of the critical batch size.

What carries the argument

The interaction between mini-batch noise and Adam's momentum memory parameters β1 and β2, which sets the direction and strength of implicit bias toward sharper or flatter loss minima.

If this is right

  • Large-batch training benefits from setting β1 closer to β2 to reduce anti-regularization.
  • Small-batch training performs well with the standard default values β1=0.9 and β2=0.999.
  • The batch-size scale at which the monotonicity reversal occurs tracks the critical batch size.
  • The effect is observable in the about-to-overfit multi-epoch regime on small-scale data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Batch size should be treated as a first-class hyperparameter when choosing momentum values for Adam.
  • The reversal implies that noise can counteract memory-driven anti-regularization in ways standard bias analyses miss.
  • Similar batch-size-dependent reversals may appear in other momentum-based adaptive methods.
  • Dynamic schedules that adjust β1 or β2 as effective batch size changes could improve generalization.

Load-bearing premise

The interaction between mini-batch noise and momentum memory can be isolated from other optimization dynamics, and that flatness of reached minima correlates with generalization in the multi-epoch regime.

What would settle it

An experiment that trains Adam with fixed β2 across a range of batch sizes, measures the curvature of the reached minima, and finds either no reversal in the β2-flatness dependence or no connection between that curvature and validation accuracy.

Figures

Figures reproduced from arXiv: 2602.01642 by Boris Shigida, Matias D. Cattaneo.

Figure 1
Figure 1. Figure 1: Schematic illustration: validation accuracy vs. [PITH_FULL_IMAGE:figures/full_fig_p006_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Minimal validation perplexity (before overfitting) of Transformer-XL trained with Adam on [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Minimal validation perplexity (before overfitting) of Transformer-XL trained with Adam on [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: The estimated simple noise scale Bbsimple for different training runs of Transformer-XL on WikiText-2. 27 [PITH_FULL_IMAGE:figures/full_fig_p027_4.png] view at source ↗
read the original abstract

With limited high-quality data and growing compute, multi-epoch training is gaining back its importance across sub-areas of deep learning. Adam(W), versions of which are go-to optimizers for many tasks such as next token prediction, has two momentum hyperparameters $(\beta_1, \beta_2)$ controlling memory and one very important hyperparameter, batch size, controlling (in particular) the amount mini-batch noise. We introduce a theoretical framework to understand how mini-batch noise influences the implicit bias of memory in Adam (depending on $\beta_1$, $\beta_2$) towards sharper or flatter regions of the loss landscape, which is commonly observed to correlate with the generalization gap in multi-epoch training. We find that in the case of large batch sizes, higher $\beta_2$ increases the magnitude of anti-regularization by memory (hurting generalization), but as the batch size becomes smaller, the dependence of (anti-)regulariation on $\beta_2$ is reversed. A similar monotonicity shift (in the opposite direction) happens in $\beta_1$. In particular, the commonly "default" pair $(\beta_1, \beta_2) = (0.9, 0.999)$ is a good choice if batches are small; for larger batches, in many settings moving $\beta_1$ closer to $\beta_2$ is much better in terms of validation accuracy in multi-epoch training. Moreover, our theoretical derivations connect the scale of the batch size at which the shift happens to the scale of the critical batch size. We illustrate this effect in experiments with small-scale data in the about-to-overfit regime.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces a theoretical framework to analyze how mini-batch noise influences the implicit bias induced by Adam's momentum parameters (β1, β2) toward sharper or flatter regions of the loss landscape in multi-epoch training. It claims that the dependence of this (anti-)regularization effect on β2 reverses with batch size: higher β2 increases anti-regularization for large batches but the dependence reverses for smaller batches, with an opposite monotonicity shift for β1. The framework connects the reversal scale to the critical batch size, and the practical implication is that the default (0.9, 0.999) pair is suitable for small batches while moving β1 closer to β2 is preferable for larger batches. These predictions are illustrated via experiments on small-scale data in the about-to-overfit regime.

Significance. If the isolation of noise-memory coupling holds with the required domination bounds, the result supplies a principled, batch-size-dependent rule for tuning Adam hyperparameters to improve generalization in multi-epoch regimes that are regaining importance under data constraints. The explicit linkage of the reversal point to critical batch size constitutes a falsifiable prediction and a strength of the work. The significance remains conditional on verifying that the modeling assumptions do not introduce artifacts near the critical scale.

major comments (3)
  1. [Theoretical Framework] Theoretical Framework: the monotonicity reversal for β2 (and opposite shift for β1) is stated without derivation details, error bounds, or explicit assumptions on the loss landscape; it is therefore impossible to verify whether the predicted reversal is independent of the data used to illustrate it or reduces to a quantity fitted from the same observations.
  2. [Theoretical Framework] Theoretical Framework: no domination bounds or regime conditions are supplied to guarantee that the mini-batch noise–momentum memory interaction dominates curvature evolution, gradient alignment changes, and multi-epoch landscape drift; without these, the reversal could be an artifact of the isolation choice rather than a robust prediction.
  3. [Experiments] Experiments section: the experiments are characterized only as “small-scale” and “about-to-overfit” with no reported controls for post-hoc hyperparameter choices or checks that the observed correlation between flatness and generalization persists outside this narrow regime, weakening support for the practical recommendation on default β values.
minor comments (2)
  1. [Abstract] Abstract: the phrase “anti-regularization by memory” is used without a concise definition or pointer to the relevant equation, which may hinder readers outside the implicit-bias literature.
  2. [Notation] Notation: introduce the precise definition of the critical batch size and its relation to the reversal threshold in a dedicated paragraph or equation early in the theoretical section.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below, committing to revisions that add derivation details, regime bounds, and experimental clarifications while preserving the core contributions.

read point-by-point responses
  1. Referee: [Theoretical Framework] Theoretical Framework: the monotonicity reversal for β2 (and opposite shift for β1) is stated without derivation details, error bounds, or explicit assumptions on the loss landscape; it is therefore impossible to verify whether the predicted reversal is independent of the data used to illustrate it or reduces to a quantity fitted from the same observations.

    Authors: We will include the complete derivation in the appendix, starting from the Adam second-moment update with additive mini-batch noise. Assumptions are local quadratic loss approximation and bounded noise variance. The reversal follows analytically from the closed-form bias term coupling noise scale to β2, with the transition point tied directly to critical batch size; no data fitting is involved. Error bounds on the approximation will be stated explicitly. revision: yes

  2. Referee: [Theoretical Framework] Theoretical Framework: no domination bounds or regime conditions are supplied to guarantee that the mini-batch noise–momentum memory interaction dominates curvature evolution, gradient alignment changes, and multi-epoch landscape drift; without these, the reversal could be an artifact of the isolation choice rather than a robust prediction.

    Authors: We agree and will add a dedicated subsection deriving explicit regime conditions. These include noise variance dominating curvature evolution rate (by factor Ω(1/√B)) and memory decay outpacing alignment drift over epochs. The bounds confirm the noise-memory term governs the reversal within the stated regime, ruling out isolation artifacts. revision: yes

  3. Referee: [Experiments] Experiments section: the experiments are characterized only as “small-scale” and “about-to-overfit” with no reported controls for post-hoc hyperparameter choices or checks that the observed correlation between flatness and generalization persists outside this narrow regime, weakening support for the practical recommendation on default β values.

    Authors: Experiments target the about-to-overfit regime where implicit bias is most visible in multi-epoch settings. We will expand the section with full hyperparameter grid results to eliminate post-hoc concerns and add explicit discussion of regime limitations. The practical β recommendations rest primarily on the theory; experiments remain illustrative, and broader validation is noted as future work. revision: partial

Circularity Check

0 steps flagged

Derivation chain self-contained without reduction to inputs

full rationale

The paper introduces a theoretical framework analyzing mini-batch noise effects on Adam momentum memory (β1, β2) and its implicit bias toward flat/sharp regions, deriving monotonicity reversals in β dependence as batch size varies and linking the transition scale to critical batch size. No equations, self-citations, fitted parameters presented as predictions, or imported uniqueness theorems appear in the provided text to reduce the central claims to inputs by construction. The isolation of noise-memory coupling is stated as an assumption within the framework rather than a self-referential fit, leaving the derivations independent and self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review prevents enumeration of specific free parameters or axioms. The framework implicitly assumes a separation between noise-induced bias and other Adam dynamics, plus a flatness-generalization correlation that is treated as given.

pith-pipeline@v0.9.0 · 5616 in / 1208 out tokens · 30637 ms · 2026-05-16T08:00:22.069745+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

83 extracted references · 83 canonical work pages · 9 internal anchors

  1. [1]

    A Modern Look at the Relationship between Sharpness and Generalization

    Maksym Andriushchenko et al. “A Modern Look at the Relationship between Sharpness and Generalization”.Proceedings of the 40th International Conference on Machine Learning. Vol. 202. Proceedings of Machine Learning Research. PMLR, 2023 (cit. on p. 3)

  2. [2]

    Maksym Andriushchenko et al.Why Do We Need Weight Decay in Modern Deep Learning?2024 (cit. on p. 3). 10

  3. [3]

    PaLM 2 Technical Report

    Rohan Anil et al. “Palm 2 technical report”.arXiv preprint arXiv:2305.10403(2023) (cit. on p. 1)

  4. [4]

    Understanding Gradient Descent on the Edge of Stability in Deep Learning

    Sanjeev Arora, Zhiyuan Li, and Abhishek Panigrahi. “Understanding Gradient Descent on the Edge of Stability in Deep Learning”.Proceedings of the 39th International Conference on Machine Learning. Vol. 162. Proceedings of Machine Learning Research. PMLR, 2022 (cit. on p. 3)

  5. [5]

    Implicit Gradient Regularization

    David Barrett and Benoit Dherin. “Implicit Gradient Regularization”.International Conference on Learning Representations. 2021 (cit. on p. 3)

  6. [6]

    On the Trajectories of SGD Without Replacement

    Pierfrancesco Beneventano. “On the Trajectories of SGD Without Replacement”.arXiv preprint arXiv:2312.16143(2023) (cit. on p. 3)

  7. [7]

    Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling

    Stella Biderman et al. “Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling”.Proceedings of the 40th International Conference on Machine Learning. Vol. 202. Proceedings of Machine Learning Research. PMLR, 2023 (cit. on p. 1)

  8. [9]

    Language Models are Few-Shot Learners

    Tom Brown et al. “Language Models are Few-Shot Learners”.Advances in Neural Information Processing Systems. Vol. 33. Curran Associates, Inc., 2020 (cit. on p. 1)

  9. [10]

    Modified Loss of Momentum Gradient Descent: Fine-Grained Analysis

    Matias D Cattaneo and Boris Shigida. “Modified Loss of Momentum Gradient Descent: Fine-Grained Analysis”.arXiv preprint arXiv:2509.08483(2025) (cit. on p. 4)

  10. [11]

    On the Implicit Bias of Adam

    Matias D. Cattaneo, Jason Matthew Klusowski, and Boris Shigida. “On the Implicit Bias of Adam”. Proceedings of the 41st International Conference on Machine Learning. Vol. 235. Proceedings of Machine Learning Research. PMLR, 2024 (cit. on pp. 2, 3, 6, 8)

  11. [12]

    How Memory in Optimization Algorithms Implicitly Modifies the Loss

    Matias D. Cattaneo and Boris Shigida. “How Memory in Optimization Algorithms Implicitly Modifies the Loss”.The Thirty-ninth Annual Conference on Neural Information Processing Systems. 2025 (cit. on pp. 2–4)

  12. [13]

    Adaptive Methods through the Lens of SDEs: Theoretical Insights on the Role of Noise

    Enea Monzio Compagnoni et al. “Adaptive Methods through the Lens of SDEs: Theoretical Insights on the Role of Noise”.The Thirteenth International Conference on Learning Representations. 2025 (cit. on p. 2)

  13. [14]

    Transformer-XL: Attentive Language Models beyond a Fixed-Length Context

    Zihang Dai et al. “Transformer-XL: Attentive Language Models beyond a Fixed-Length Context”. Annual Meeting of the Association for Computational Linguistics. 2019 (cit. on p. 10)

  14. [15]

    Label Noise SGD Provably Prefers Flat Global Minimizers

    Alex Damian, Tengyu Ma, and Jason D Lee. “Label Noise SGD Provably Prefers Flat Global Minimizers”.Advances in Neural Information Processing Systems. Vol. 34. Curran Associates, Inc., 2021 (cit. on p. 3)

  15. [16]

    Sharp Minima Can Generalize For Deep Nets

    Laurent Dinh et al. “Sharp Minima Can Generalize For Deep Nets”.Proceedings of the 34th International Conference on Machine Learning. Vol. 70. Proceedings of Machine Learning Research. PMLR, 2017 (cit. on p. 3)

  16. [17]

    Sharpness-Aware Training for Free

    Jiawei Du et al. “Sharpness-Aware Training for Free”.Advances in Neural Information Processing Systems. Vol. 35. Curran Associates, Inc., 2022 (cit. on p. 3)

  17. [18]

    The Llama 3 Herd of Models

    Abhimanyu Dubey et al. “The llama 3 herd of models”.arXiv preprint arXiv:2407.21783(2024) (cit. on p. 1)

  18. [19]

    Multiscale analysis of accelerated gradient methods

    Mohammad Farazmand. “Multiscale analysis of accelerated gradient methods”.SIAM Journal on Optimization30.3 (2020) (cit. on p. 3)

  19. [20]

    Sharpness-aware Minimization for Efficiently Improving Generalization

    Pierre Foret et al. “Sharpness-aware Minimization for Efficiently Improving Generalization”.Inter- national Conference on Learning Representations. 2021 (cit. on pp. 2, 3, 5)

  20. [21]

    Implicit regularization in Heavy-ball momentum accelerated stochastic gradient descent

    Avrajit Ghosh et al. “Implicit regularization in Heavy-ball momentum accelerated stochastic gradient descent”.The Eleventh International Conference on Learning Representations. 2023 (cit. on pp. 3, 6)

  21. [22]

    Characterizing implicit bias in terms of optimization geometry

    Suriya Gunasekar et al. “Characterizing implicit bias in terms of optimization geometry”.Interna- tional Conference on Machine Learning. PMLR. 2018 (cit. on p. 3)

  22. [23]

    Implicit bias of gradient descent on linear convolutional networks

    Suriya Gunasekar et al. “Implicit bias of gradient descent on linear convolutional networks”.Advances in neural information processing systems31 (2018) (cit. on p. 3). 11

  23. [24]

    SIMPLIFYING NEURAL NETS BY DISCOVERING FLAT MINIMA

    Sepp Hochreiter and J¨ urgen Schmidhuber. “SIMPLIFYING NEURAL NETS BY DISCOVERING FLAT MINIMA”.Advances in Neural Information Processing Systems. Vol. 7. MIT Press, 1994 (cit. on p. 2)

  24. [25]

    Directional convergence and alignment in deep learning

    Ziwei Ji and Matus Telgarsky. “Directional convergence and alignment in deep learning”.Advances in Neural Information Processing Systems33 (2020) (cit. on p. 3)

  25. [26]

    Gradient descent aligns the layers of deep linear networks

    Ziwei Ji and Matus Telgarsky. “Gradient descent aligns the layers of deep linear networks”.arXiv preprint arXiv:1810.02032(2018) (cit. on p. 3)

  26. [27]

    Risk and parameter convergence of logistic regression

    Ziwei Ji and Matus Telgarsky. “Risk and parameter convergence of logistic regression”.arXiv preprint arXiv:1803.07300(2018) (cit. on p. 3)

  27. [28]

    The implicit bias of gradient descent on nonseparable data

    Ziwei Ji and Matus Telgarsky. “The implicit bias of gradient descent on nonseparable data”. Conference on Learning Theory. PMLR. 2019 (cit. on p. 3)

  28. [29]

    Fantastic Generalization Measures and Where to Find Them

    Yiding Jiang et al. “Fantastic Generalization Measures and Where to Find Them”.International Conference on Learning Representations. 2020 (cit. on pp. 2, 5)

  29. [30]

    On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima

    Nitish Shirish Keskar et al. “On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima”.International Conference on Learning Representations. 2017 (cit. on p. 2)

  30. [31]

    2025 (cit

    Konwoo Kim et al.Pre-training under infinite compute. 2025 (cit. on p. 1)

  31. [32]

    Fisher SAM: Information Geometry and Sharpness Aware Minimisation

    Minyoung Kim et al. “Fisher SAM: Information Geometry and Sharpness Aware Minimisation”. Proceedings of the 39th International Conference on Machine Learning. Vol. 162. Proceedings of Machine Learning Research. PMLR, 2022 (cit. on p. 3)

  32. [33]

    Adam: A Method for Stochastic Optimization

    Diederik P Kingma and Jimmy Ba. “Adam: A method for stochastic optimization”.arXiv preprint arXiv:1412.6980(2014) (cit. on pp. 1, 5)

  33. [34]

    Weight decay induces low-rank attention layers

    Seijin Kobayashi, Yassir Akram, and Johannes von Oswald. “Weight decay induces low-rank attention layers”.The Thirty-eighth Annual Conference on Neural Information Processing Systems. 2024 (cit. on p. 3)

  34. [35]

    Continuous time analysis of momentum methods

    Nikola B Kovachki and Andrew M Stuart. “Continuous time analysis of momentum methods”. Journal of Machine Learning Research22.17 (2021) (cit. on pp. 3, 6)

  35. [36]

    Noise Is Not the Main Factor Behind the Gap Between Sgd and Adam on Transformers, But Sign Descent Might Be

    Frederik Kunstner et al. “Noise Is Not the Main Factor Behind the Gap Between Sgd and Adam on Transformers, But Sign Descent Might Be”.The Eleventh International Conference on Learning Representations. 2023 (cit. on p. 10)

  36. [37]

    ASAM: Adaptive Sharpness-Aware Minimization for Scale-Invariant Learning of Deep Neural Networks

    Jungmin Kwon et al. “ASAM: Adaptive Sharpness-Aware Minimization for Scale-Invariant Learning of Deep Neural Networks”.Proceedings of the 38th International Conference on Machine Learning. Vol. 139. Proceedings of Machine Learning Research. PMLR, 2021 (cit. on p. 3)

  37. [38]

    Enhancing Sharpness-Aware Optimization Through Variance Suppression

    Bingcong Li and Georgios B. Giannakis. “Enhancing Sharpness-Aware Optimization Through Variance Suppression”.Thirty-seventh Conference on Neural Information Processing Systems. 2023 (cit. on p. 3)

  38. [39]

    Friendly Sharpness-Aware Minimization

    Tao Li et al. “Friendly Sharpness-Aware Minimization”.2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2024 (cit. on p. 3)

  39. [40]

    Towards Efficient and Scalable Sharpness-Aware Minimization

    Yong Liu et al. “Towards Efficient and Scalable Sharpness-Aware Minimization”.2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2022 (cit. on p. 3)

  40. [41]

    2019 (cit

    Ilya Loshchilov and Frank Hutter.Decoupled Weight Decay Regularization. 2019 (cit. on p. 1)

  41. [42]

    Gradient descent maximizes the margin of homogeneous neural networks

    Kaifeng Lyu and Jian Li. “Gradient descent maximizes the margin of homogeneous neural networks”. arXiv preprint arXiv:1906.05890(2019) (cit. on p. 3)

  42. [43]

    A Qualitative Study of the Dynamic Behavior for Adaptive Gradient Algorithms

    Chao Ma, Lei Wu, and Weinan E. “A Qualitative Study of the Dynamic Behavior for Adaptive Gradient Algorithms”.Proceedings of the 2nd Mathematical and Scientific Machine Learning Conference. Vol. 145. Proceedings of Machine Learning Research. PMLR, 2022 (cit. on p. 2)

  43. [44]

    On the SDEs and Scaling Rules for Adaptive Gradient Algorithms

    Sadhika Malladi et al. “On the SDEs and Scaling Rules for Adaptive Gradient Algorithms”.Advances in Neural Information Processing Systems. Vol. 35. Curran Associates, Inc., 2022 (cit. on p. 2)

  44. [45]

    Small Batch Size Training for Language Models: When Vanilla SGD Works, and Why Gradient Accumulation is Wasteful

    Martin Marek et al. “Small Batch Size Training for Language Models: When Vanilla SGD Works, and Why Gradient Accumulation is Wasteful”.The Thirty-ninth Annual Conference on Neural Information Processing Systems. 2025 (cit. on p. 2). 12

  45. [46]

    2018 (cit

    Sam McCandlish et al.An Empirical Model of Large-Batch Training. 2018 (cit. on pp. 8, 10)

  46. [47]

    Pointer Sentinel Mixture Models

    Stephen Merity et al. “Pointer Sentinel Mixture Models”.International Conference on Learning Representations. 2017 (cit. on p. 10)

  47. [48]

    Toward Equation of Motion for Deep Neural Networks: Continuous-time Gradient Descent and Discretization Error Analysis

    Taiki Miyagawa. “Toward Equation of Motion for Deep Neural Networks: Continuous-time Gradient Descent and Discretization Error Analysis”.Advances in Neural Information Processing Systems. 2022 (cit. on p. 3)

  48. [49]

    Stochastic gradient descent on separable data: Exact convergence with a fixed learning rate

    Mor Shpigel Nacson, Nathan Srebro, and Daniel Soudry. “Stochastic gradient descent on separable data: Exact convergence with a fixed learning rate”.The 22nd International Conference on Artificial Intelligence and Statistics. PMLR. 2019 (cit. on p. 3)

  49. [50]

    Convergence of gradient descent on separable data

    Mor Shpigel Nacson et al. “Convergence of gradient descent on separable data”.The 22nd Interna- tional Conference on Artificial Intelligence and Statistics. PMLR. 2019 (cit. on p. 3)

  50. [51]

    Lexicographic and depth-sensitive margins in homogeneous and non- homogeneous deep models

    Mor Shpigel Nacson et al. “Lexicographic and depth-sensitive margins in homogeneous and non- homogeneous deep models”.International Conference on Machine Learning. PMLR. 2019 (cit. on p. 3)

  51. [52]

    In Search of Adam’s Secret Sauce

    Antonio Orvieto and Robert M. Gower. “In Search of Adam’s Secret Sauce”.The Thirty-ninth Annual Conference on Neural Information Processing Systems. 2025 (cit. on p. 2)

  52. [53]

    The AdEMAMix Optimizer: Better, Faster, Older

    Matteo Pagliardini, Pierre Ablin, and David Grangier. “The AdEMAMix Optimizer: Better, Faster, Older”.The Thirteenth International Conference on Learning Representations. 2025 (cit. on p. 2)

  53. [54]

    Resolving Discrepancies in Compute-Optimal Scaling of Language Models

    Tomer Porian et al. “Resolving Discrepancies in Compute-Optimal Scaling of Language Models”. The Thirty-eighth Annual Conference on Neural Information Processing Systems. 2024 (cit. on p. 2)

  54. [55]

    The implicit bias of adagrad on separable data

    Qian Qian and Xiaoyuan Qian. “The implicit bias of adagrad on separable data”.Advances in Neural Information Processing Systems32 (2019) (cit. on p. 3)

  55. [56]

    A Scale Invariant Measure of Flatness for Deep Network Minima

    Akshay Rangamani et al. “A Scale Invariant Measure of Flatness for Deep Network Minima”. ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2021 (cit. on p. 3)

  56. [57]

    On a continuous time model of gradient descent dynamics and instability in deep learning

    Mihaela Rosca et al. “On a continuous time model of gradient descent dynamics and instability in deep learning”.Transactions on Machine Learning Research(2023).issn: 2835-8856 (cit. on p. 3)

  57. [58]

    Descending through a Crowded Valley - Benchmarking Deep Learning Optimizers

    Robin M Schmidt, Frank Schneider, and Philipp Hennig. “Descending through a Crowded Valley - Benchmarking Deep Learning Optimizers”.Proceedings of the 38th International Conference on Machine Learning. Vol. 139. Proceedings of Machine Learning Research. PMLR, 2021 (cit. on p. 2)

  58. [59]

    Adafactor: Adaptive learning rates with sublinear memory cost

    Noam Shazeer and Mitchell Stern. “Adafactor: Adaptive learning rates with sublinear memory cost”. International Conference on Machine Learning. PMLR. 2018 (cit. on p. 1)

  59. [60]

    Optimizer benchmarking needs to account for hyperparameter tuning

    Prabhu Teja Sivaprasad et al. “Optimizer benchmarking needs to account for hyperparameter tuning”.International conference on machine learning. PMLR. 2020 (cit. on p. 1)

  60. [61]

    On the Origin of Implicit Regularization in Stochastic Gradient Descent

    Samuel L Smith et al. “On the Origin of Implicit Regularization in Stochastic Gradient Descent”. International Conference on Learning Representations. 2021 (cit. on pp. 3, 4)

  61. [62]

    The implicit bias of gradient descent on separable data

    Daniel Soudry et al. “The implicit bias of gradient descent on separable data”.The Journal of Machine Learning Research19.1 (2018) (cit. on p. 3)

  62. [63]

    A Universal Class of Sharpness-Aware Minimization Algorithms

    Behrooz Tahmasebi et al. “A Universal Class of Sharpness-Aware Minimization Algorithms”.Forty- first International Conference on Machine Learning. 2024 (cit. on p. 3)

  63. [64]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron et al. “Llama 2: Open foundation and fine-tuned chat models”.arXiv preprint arXiv:2307.09288(2023) (cit. on p. 1)

  64. [65]

    Normalized Flat Minima: Exploring Scale Invariant Definition of Flat Minima for Neural Networks Using PAC-Bayesian Analysis

    Yusuke Tsuzuku, Issei Sato, and Masashi Sugiyama. “Normalized Flat Minima: Exploring Scale Invariant Definition of Flat Minima for Neural Networks Using PAC-Bayesian Analysis”.Proceedings of the 37th International Conference on Machine Learning. Vol. 119. Proceedings of Machine Learning Research. PMLR, 2020 (cit. on p. 3)

  65. [66]

    Position: Will we run out of data? Limits of LLM scaling based on human- generated data

    Pablo Villalobos et al. “Position: Will we run out of data? Limits of LLM scaling based on human- generated data”.Forty-first International Conference on Machine Learning. 2024 (cit. on p. 1)

  66. [67]

    Does Momentum Change the Implicit Regularization on Separable Data?

    Bohan Wang et al. “Does Momentum Change the Implicit Regularization on Separable Data?” Advances in Neural Information Processing Systems35 (2022) (cit. on p. 3). 13

  67. [68]

    The Implicit Bias for Adaptive Optimization Algorithms on Homogeneous Neural Networks

    Bohan Wang et al. “The Implicit Bias for Adaptive Optimization Algorithms on Homogeneous Neural Networks”.Proceedings of the 38th International Conference on Machine Learning. Vol. 139. Proceedings of Machine Learning Research. PMLR, 2021 (cit. on p. 3)

  68. [69]

    2025 (cit

    Kaiyue Wen et al.Fantastic Pretraining Optimizers and Where to Find Them. 2025 (cit. on p. 2)

  69. [70]

    2025 (cit

    Lechao Xiao.Rethinking Conventional Wisdom in Machine Learning: From Generalization to Scaling. 2025 (cit. on p. 1)

  70. [71]

    Implicit Bias of AdamW: ℓ∞-Norm Constrained Optimization

    Shuo Xie and Zhiyuan Li. “Implicit Bias of AdamW: ℓ∞-Norm Constrained Optimization”.Proceed- ings of the 41st International Conference on Machine Learning. Vol. 235. Proceedings of Machine Learning Research. PMLR, 2024 (cit. on p. 3)

  71. [72]

    SAMPa: Sharpness-aware Minimization Par- allelized

    Wanyun Xie, Thomas Pethick, and Volkan Cevher. “SAMPa: Sharpness-aware Minimization Par- allelized”.The Thirty-eighth Annual Conference on Neural Information Processing Systems. 2024 (cit. on p. 3)

  72. [73]

    Adaptive Inertia: Disentangling the Effects of Adaptive Learning Rate and Momen- tum

    Zeke Xie et al. “Adaptive Inertia: Disentangling the Effects of Adaptive Learning Rate and Momen- tum”.Proceedings of the 39th International Conference on Machine Learning. Vol. 162. Proceedings of Machine Learning Research. PMLR, 2022 (cit. on p. 2)

  73. [74]

    Positively Scale-Invariant Flatness of ReLU Neural Networks

    Mingyang Yi et al. “Positively Scale-Invariant Flatness of ReLU Neural Networks”.arXiv preprint arXiv:1903.02237(2019) (cit. on p. 3)

  74. [75]

    GLM-130B: An Open Bilingual Pre-trained Model

    Aohan Zeng et al. “Glm-130b: An open bilingual pre-trained model”.arXiv preprint arXiv:2210.02414 (2022) (cit. on p. 1)

  75. [76]

    Three Mechanisms of Weight Decay Regularization

    Guodong Zhang et al. “Three Mechanisms of Weight Decay Regularization”.International Conference on Learning Representations. 2019 (cit. on p. 3)

  76. [77]

    How Does Critical Batch Size Scale in Pre-training?

    Hanlin Zhang et al. “How Does Critical Batch Size Scale in Pre-training?”The Thirteenth Interna- tional Conference on Learning Representations. 2025 (cit. on p. 2)

  77. [78]

    Why are adaptive methods good for attention models?

    Jingzhao Zhang et al. “Why are adaptive methods good for attention models?”Advances in Neural Information Processing Systems33 (2020) (cit. on p. 10)

  78. [79]

    OPT: Open Pre-trained Transformer Language Models

    Susan Zhang et al. “Opt: Open pre-trained transformer language models”.arXiv preprint arX- iv:2205.01068(2022) (cit. on p. 1)

  79. [80]

    Deconstructing What Makes a Good Optimizer for Autoregressive Language Models

    Rosie Zhao et al. “Deconstructing What Makes a Good Optimizer for Autoregressive Language Models”.The Thirteenth International Conference on Learning Representations. 2025 (cit. on pp. 2, 26)

  80. [81]

    Regularizing Neural Networks via Adversarial Model Perturbation

    Yaowei Zheng, Richong Zhang, and Yongyi Mao. “Regularizing Neural Networks via Adversarial Model Perturbation”.2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2021 (cit. on p. 3)

Showing first 80 references.