Compute Efficiency and Serial Runtime Tradeoffs for Stochastic Momentum Methods

Alexandru Meterez; Depen Morwani; Pranav Nair; Sham Kakade

arxiv: 2606.19179 · v1 · pith:DVGSE5RPnew · submitted 2026-06-17 · 💻 cs.LG · cs.AI· math.OC· stat.ML

Compute Efficiency and Serial Runtime Tradeoffs for Stochastic Momentum Methods

Depen Morwani , Alexandru Meterez , Pranav Nair , Sham Kakade This is my paper

Pith reviewed 2026-06-26 21:29 UTC · model grok-4.3

classification 💻 cs.LG cs.AImath.OCstat.ML

keywords stochastic momentumheavy ballcompute efficiencyserial runtimebatch sizeaccelerated SGDlinear regressionlower bounds

0 comments

The pith

Heavy ball preserves SGD-level compute efficiency over a batch-size window up to sqrt(kappa) larger than SGD's critical size.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proves finite-dimensional lower bounds for stochastic heavy-ball and accelerated SGD in consistent linear regression with Gaussian covariates. It shows that heavy ball does not raise the compute-efficiency frontier above SGD for arbitrary spectra but keeps the same efficiency level across a wider batch-size range. This wider window, larger by a factor of sqrt(kappa), lets practitioners use bigger batches to shorten serial runtime without increasing total gradient cost until the method reaches its deterministic accelerated regime. For accelerated SGD the picture depends on the eigenvalue spectrum: rapid power-law decay yields better small-batch efficiency that is later traded for runtime gains as batches grow. Synthetic experiments confirm the predicted qualitative regimes including spectrum-dependent overlap or tradeoff behavior.

Core claim

For consistent linear regression with Gaussian covariates, stochastic heavy ball does not improve the CE frontier over SGD for arbitrary spectra; rather, it preserves SGD-level CE over a larger batch-size window allowing larger batches to reduce serial runtime until HB reaches its deterministic accelerated scale. This window can be a factor sqrt(kappa) larger than the SGD critical batch size. For ASGD the picture is more spectrum-dependent: for rapidly decaying power-law spectra ASGD improves small-batch CE over HB/SGD but trades this CE advantage for improved serial runtime as batch size grows.

What carries the argument

Finite-dimensional discrete-time lower bounds on batch-size tradeoffs between compute efficiency and serial runtime for stochastic HB and ASGD.

If this is right

HB allows larger batches to cut serial runtime while holding total compute cost at the SGD level.
The critical batch size at which CE begins to degrade is up to sqrt(kappa) times larger for HB than for SGD.
ASGD gains a CE advantage over HB and SGD at small batches when the spectrum follows a rapidly decaying power law.
As batch size increases for rapidly decaying spectra, ASGD converts its CE advantage into further serial-runtime reduction.
For slowly decaying spectra the CE-serial runtime curves of ASGD and HB nearly overlap.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Knowing the eigenvalue decay rate of a problem could inform whether to start with ASGD at small batches or switch to HB for larger ones.
The Gaussian linear-regression setting provides a clean test bed; checking whether the same batch-size windows appear in non-linear models would test robustness.
If the contraction gap scales linearly with batch size in other stochastic settings, similar CE preservation windows may exist for higher-order momentum variants.

Load-bearing premise

The analysis assumes consistent linear regression with Gaussian covariates in a finite-dimensional discrete-time setting.

What would settle it

An experiment on the same linear regression model that finds heavy ball achieving strictly higher compute efficiency than SGD at some batch size beyond the SGD critical size would falsify the no-improvement claim.

Figures

Figures reproduced from arXiv: 2606.19179 by Alexandru Meterez, Depen Morwani, Pranav Nair, Sham Kakade.

read the original abstract

Stochastic momentum methods such as heavy ball (HB), Nesterov momentum, and variants of Accelerated SGD (ASGD) [Kidambi et al., 2018] are widely used in modern training, but their stochastic benefits depend on two distinct quantities: serial runtime, the number of iterations needed to reach a target accuracy, and compute efficiency (CE), the inverse total gradient-query or FLOP cost. Larger batches reduce serial runtime without hurting CE only when the contraction gap grows linearly with batch size. We study stochastic HB and ASGD for consistent linear regression with Gaussian covariates and prove finite-dimensional, discrete-time lower bounds on their batch-size tradeoffs. Our first result shows that HB does not improve the CE frontier over SGD for arbitrary spectra; rather, it preserves SGD-level CE over a larger batch-size window, allowing larger batches to reduce serial runtime until HB reaches its deterministic accelerated scale. This window can be a factor $\sqrt{\kappa}$ larger than the SGD critical batch size. For ASGD, the picture is more spectrum-dependent: for rapidly decaying power-law spectra, ASGD improves small-batch CE over HB/SGD, but as batch size grows it trades this CE advantage for improved serial runtime. Synthetic linear-regression experiments verify these qualitative regimes, including near-overlap of ASGD and HB for slowly decaying spectra and the predicted CE--serial tradeoff for rapidly decaying spectra.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

HB keeps the same CE frontier as SGD but stretches the usable batch-size window by √κ in this linear model; ASGD trades small-batch CE gains for runtime on fast spectra.

read the letter

The core result is that heavy-ball momentum does not beat SGD on the compute-efficiency frontier for arbitrary spectra in consistent linear regression with Gaussian covariates. Instead it maintains SGD-level CE out to a batch-size window roughly √κ larger than the SGD critical size, which lets you cut serial runtime further before hitting the deterministic regime. ASGD improves CE at small batches when the spectrum decays rapidly but loses that edge as batch size grows, trading it for better iteration count. Both sets of lower bounds are stated in finite-dimensional discrete time.

The paper earns credit for making these tradeoffs explicit with lower bounds rather than just upper bounds or heuristics, and the synthetic experiments line up with the predicted regimes for slow versus fast spectra. That gives a concrete picture of when momentum buys you extra batch scaling without extra total compute.

The obvious limitation is the narrow setting: everything is scoped to consistent linear regression with Gaussian covariates. If the data are inconsistent or the covariates are non-Gaussian, the stated windows need not apply, and the paper does not claim otherwise. Without the full derivations it is hard to judge how tight the constants are or whether any spectrum assumptions are hidden, but the stress-test note confirms the strongest claims stay inside the stated model.

This is useful reading for anyone who cares about batch-size scaling rules for momentum in large-model training. A reader who already works on stochastic optimization lower bounds or practical large-batch training will get the most out of the explicit windows and the spectrum dependence. It is worth sending to peer review; the claims are scoped and the experiments provide a basic check, so referees can focus on whether the proofs hold.

Referee Report

2 major / 2 minor

Summary. The paper analyzes stochastic momentum methods (heavy ball HB, Nesterov, and ASGD) in the setting of consistent linear regression with Gaussian covariates in finite dimensions and discrete time. It proves lower bounds showing that HB does not improve the compute-efficiency (CE) frontier over SGD for arbitrary spectra but preserves SGD-level CE over a batch-size window that is a factor √κ larger than the SGD critical batch size, thereby permitting larger batches to reduce serial runtime until the deterministic accelerated regime. For ASGD the behavior is spectrum-dependent: it improves small-batch CE for rapidly decaying power-law spectra but trades CE for serial-runtime gains as batch size grows. Synthetic linear-regression experiments are used to verify the predicted qualitative regimes.

Significance. If the lower bounds hold, the work supplies a precise, model-specific characterization of the batch-size regimes in which momentum methods improve or trade off compute efficiency versus serial runtime. The restriction to consistent linear regression with Gaussian covariates permits exact finite-dimensional analysis, and the synthetic experiments that reproduce the predicted spectrum-dependent regimes add concrete support. The results clarify when larger batches can be used without CE loss and therefore have direct implications for large-batch training practice.

major comments (2)

[Abstract] Abstract: the central quantitative claim that the HB batch-size window is a factor √κ larger than the SGD critical batch size is load-bearing for the first result; the manuscript must exhibit the explicit derivation (or the equation that produces this scaling) for arbitrary spectra so that the lower-bound argument can be verified.
The proofs are stated to be finite-dimensional and discrete-time; if the derivation relies on post-hoc spectrum assumptions that are not stated in the model definition, the generality of the lower bounds for arbitrary spectra is at risk and must be clarified with a concrete counter-example or tightness argument.

minor comments (2)

[Abstract] The abstract refers to 'near-overlap of ASGD and HB for slowly decaying spectra'; the main text should include a quantitative metric (e.g., measured CE ratio or batch-size window overlap) rather than a qualitative description.
Synthetic experiments are said to 'verify these qualitative regimes'; the manuscript should report the exact spectra used, the measured critical batch sizes, and the observed √κ scaling factor so that the verification can be reproduced.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful review and the recommendation for minor revision. The comments highlight opportunities to strengthen the clarity of the central claims. We address each point below and indicate the revisions we will make.

read point-by-point responses

Referee: [Abstract] Abstract: the central quantitative claim that the HB batch-size window is a factor √κ larger than the SGD critical batch size is load-bearing for the first result; the manuscript must exhibit the explicit derivation (or the equation that produces this scaling) for arbitrary spectra so that the lower-bound argument can be verified.

Authors: We agree that the √κ scaling is central and should be traceable. The scaling arises from comparing the critical batch sizes at which the contraction gap ceases to grow linearly with batch size for SGD versus HB; specifically, the HB critical batch size scales as √κ times the SGD value because the momentum-augmented variance term in the finite-dimensional recurrence permits a larger batch before the deterministic regime dominates. In the revision we will insert a parenthetical reference in the abstract (or a short footnote) to the precise equation in Section 3 that produces this factor for arbitrary spectra, allowing direct verification of the lower-bound argument. revision: yes
Referee: [—] The proofs are stated to be finite-dimensional and discrete-time; if the derivation relies on post-hoc spectrum assumptions that are not stated in the model definition, the generality of the lower bounds for arbitrary spectra is at risk and must be clarified with a concrete counter-example or tightness argument.

Authors: The lower bounds are derived directly from the finite-dimensional discrete-time recurrence under the stated model (consistent linear regression with Gaussian covariates) and hold for arbitrary spectra without additional post-hoc assumptions. The analysis tracks the evolution of the second-moment matrix for any positive-definite covariance; the worst-case behavior over spectra is obtained by optimizing over eigenvalue distributions within that model. In the revision we will add a clarifying remark after the model definition and a short tightness paragraph showing that the √κ factor is achieved for both flat and power-law spectra, confirming generality. revision: yes

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper derives its lower bounds and CE/serial-runtime tradeoffs via explicit finite-dimensional discrete-time proofs inside the consistent linear-regression model with Gaussian covariates. These derivations start from the model assumptions and produce the stated results (HB preserves SGD-level CE over a √κ-larger window; spectrum-dependent ASGD behavior) without reducing to fitted parameters, self-definitions, or load-bearing self-citations. The single external citation (Kidambi et al. 2018) merely names ASGD variants and carries no weight for the proved claims. The analysis is therefore self-contained against the scoped setting.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Central claims rest on the modeling choice of consistent linear regression with Gaussian covariates and finite-dimensional discrete-time dynamics; no free parameters, invented entities, or additional axioms are identifiable from the abstract alone.

pith-pipeline@v0.9.1-grok · 5794 in / 1166 out tokens · 22565 ms · 2026-06-26T21:29:50.144578+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages · 11 internal anchors

[1]

Momentum Further Constrains Sharpness at the Edge of Stochastic Stability

Arseniy Andreyev, Advikar Ananthkumar, Marc Walden, Tomaso Poggio, and Pierfrancesco Ben- eventano. Momentum further constrains sharpness at the edge of stochastic stability.arXiv preprint arXiv:2604.14108,

work page internal anchor Pith review Pith/arXiv arXiv
[2]

Theory of Optimal Learning Rate Schedules and Scaling Laws for a Random Feature Model

Blake Bordelon and Francesco Mori. Theory of optimal learning rate schedules and scaling laws for a random feature model.arXiv preprint arXiv:2602.04774,

work page internal anchor Pith review Pith/arXiv arXiv
[3]

Gradient descent on neural networks typically occurs at the edge of stability.arXiv preprint arXiv:2103.00065,

Jeremy M Cohen, Simran Kaur, Yuanzhi Li, J Zico Kolter, and Ameet Talwalkar. Gradient descent on neural networks typically occurs at the edge of stability.arXiv preprint arXiv:2103.00065,

work page arXiv
[4]

Adaptive gradient methods at the edge of stability.arXiv preprint arXiv:2207.14484,

Jeremy M Cohen, Behrooz Ghorbani, Shankar Krishnan, Naman Agarwal, Sourabh Medapati, Michal Badura, Daniel Suo, David Cardoze, Zachary Nado, George E Dahl, et al. Adaptive gradient methods at the edge of stability.arXiv preprint arXiv:2207.14484,

work page arXiv
[5]

Dimension- adapted momentum outscales sgd.arXiv preprint arXiv:2505.16098,

Damien Ferbach, Katie Everett, Gauthier Gidel, Elliot Paquette, and Courtney Paquette. Dimension- adapted momentum outscales sgd.arXiv preprint arXiv:2505.16098,

work page arXiv
[6]

When and why momentum accelerates sgd: An empirical study.arXiv preprint arXiv:2306.09000,

Jingwen Fu, Bohan Wang, Huishuai Zhang, Zhizheng Zhang, Wei Chen, and Nanning Zheng. When and why momentum accelerates sgd: An empirical study.arXiv preprint arXiv:2306.09000,

work page arXiv
[7]

The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783,

work page internal anchor Pith review Pith/arXiv arXiv
[8]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948,

work page internal anchor Pith review Pith/arXiv arXiv
[9]

Adam: A Method for Stochastic Optimization

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization.arXiv preprint arXiv:1412.6980,

work page internal anchor Pith review Pith/arXiv arXiv
[10]

Noise is not the main factor behind the gap between sgd and adam on transformers, but sign descent might be

Frederik Kunstner, Jacques Chen, Jonathan Wilder Lavington, and Mark Schmidt. Noise is not the main factor behind the gap between sgd and adam on transformers, but sign descent might be. arXiv preprint arXiv:2304.13960,

work page arXiv
[11]

Accelerating sgd with momentum for over-parameterized learning

Chaoyue Liu and Mikhail Belkin. Accelerating sgd with momentum for over-parameterized learning. arXiv preprint arXiv:1810.13395,

work page arXiv
[12]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101,

work page internal anchor Pith review Pith/arXiv arXiv
[13]

Aggregated Momentum: Stability Through Passive Damping

James Lucas, Shengyang Sun, Richard Zemel, and Roger Grosse. Aggregated momentum: Stability through passive damping.arXiv preprint arXiv:1804.00325,

work page internal anchor Pith review Pith/arXiv arXiv
[14]

Quasi-hyperbolic momentum and Adam for deep learning

Jerry Ma and Denis Yarats. Quasi-hyperbolic momentum and adam for deep learning.arXiv preprint arXiv:1810.06801,

work page internal anchor Pith review Pith/arXiv arXiv
[15]

Small batch size training for language models: When vanilla sgd works, and why gradient accumulation is wasteful.arXiv preprint arXiv:2507.07101,

Martin Marek, Sanae Lotfi, Aditya Somasundaram, Andrew Gordon Wilson, and Micah Goldblum. Small batch size training for language models: When vanilla sgd works, and why gradient accumulation is wasteful.arXiv preprint arXiv:2507.07101,

work page arXiv
[16]

Critical batch size revisited: A simple empirical approach to large-batch language model training.arXiv preprint arXiv:2505.23971,

William Merrill, Shane Arora, Dirk Groeneveld, and Hannaneh Hajishirzi. Critical batch size revisited: A simple empirical approach to large-batch language model training.arXiv preprint arXiv:2505.23971,

work page arXiv
[17]

A simplified analysis of sgd for linear regression with weight averaging.arXiv preprint arXiv:2506.15535, 2025a

Alexandru Meterez, Depen Morwani, Costin-Andrei Oncescu, Jingfeng Wu, Cengiz Pehlevan, and Sham Kakade. A simplified analysis of sgd for linear regression with weight averaging.arXiv preprint arXiv:2506.15535, 2025a. Alexandru Meterez, Depen Morwani, Jingfeng Wu, Costin-Andrei Oncescu, Cengiz Pehlevan, and Sham Kakade. Seesaw: Accelerating training by bal...

work page arXiv
[18]

The ademamix optimizer: Better, faster, older

Matteo Pagliardini, Pierre Ablin, and David Grangier. The ademamix optimizer: Better, faster, older. arXiv preprint arXiv:2409.03137,

work page arXiv
[19]

Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, SH Cai, Yuan Cao, Y Charles, HS Che, Cheng Chen, Guanduo Chen, et al. Kimi k2. 5: Visual agentic intelligence.arXiv preprint arXiv:2602.02276,

work page internal anchor Pith review Pith/arXiv arXiv
[20]

SOAP: Improving and Stabilizing Shampoo using Adam

Nikhil Vyas, Depen Morwani, Rosie Zhao, Mujin Kwun, Itai Shapira, David Brandfonbrener, Lucas Janson, and Sham Kakade. Soap: Improving and stabilizing shampoo using adam.arXiv preprint arXiv:2409.11321,

work page internal anchor Pith review Pith/arXiv arXiv
[21]

The marginal value of momentum for small learning rate sgd.arXiv preprint arXiv:2307.15196,

Runzhe Wang, Sadhika Malladi, Tianhao Wang, Kaifeng Lyu, and Zhiyuan Li. The marginal value of momentum for small learning rate sgd.arXiv preprint arXiv:2307.15196,

work page arXiv
[22]

Qwen3 Technical Report

Jingfeng Wu, Difan Zou, Vladimir Braverman, Quanquan Gu, and Sham Kakade. Last iterate risk bounds of sgd with decaying stepsize for overparameterized linear regression. InInternational conference on machine learning, pages 24280–24314. PMLR, 2022a. Jingfeng Wu, Difan Zou, Vladimir Braverman, Quanquan Gu, and Sham Kakade. The power and limitation of pretr...

work page internal anchor Pith review Pith/arXiv arXiv
[23]

How does critical batch size scale in pre-training?arXiv preprint arXiv:2410.21676,

Hanlin Zhang, Depen Morwani, Nikhil Vyas, Jingfeng Wu, Difan Zou, Udaya Ghai, Dean Foster, and Sham Kakade. How does critical batch size scale in pre-training?arXiv preprint arXiv:2410.21676,

work page arXiv

[1] [1]

Momentum Further Constrains Sharpness at the Edge of Stochastic Stability

Arseniy Andreyev, Advikar Ananthkumar, Marc Walden, Tomaso Poggio, and Pierfrancesco Ben- eventano. Momentum further constrains sharpness at the edge of stochastic stability.arXiv preprint arXiv:2604.14108,

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

Theory of Optimal Learning Rate Schedules and Scaling Laws for a Random Feature Model

Blake Bordelon and Francesco Mori. Theory of optimal learning rate schedules and scaling laws for a random feature model.arXiv preprint arXiv:2602.04774,

work page internal anchor Pith review Pith/arXiv arXiv

[3] [3]

Gradient descent on neural networks typically occurs at the edge of stability.arXiv preprint arXiv:2103.00065,

Jeremy M Cohen, Simran Kaur, Yuanzhi Li, J Zico Kolter, and Ameet Talwalkar. Gradient descent on neural networks typically occurs at the edge of stability.arXiv preprint arXiv:2103.00065,

work page arXiv

[4] [4]

Adaptive gradient methods at the edge of stability.arXiv preprint arXiv:2207.14484,

Jeremy M Cohen, Behrooz Ghorbani, Shankar Krishnan, Naman Agarwal, Sourabh Medapati, Michal Badura, Daniel Suo, David Cardoze, Zachary Nado, George E Dahl, et al. Adaptive gradient methods at the edge of stability.arXiv preprint arXiv:2207.14484,

work page arXiv

[5] [5]

Dimension- adapted momentum outscales sgd.arXiv preprint arXiv:2505.16098,

Damien Ferbach, Katie Everett, Gauthier Gidel, Elliot Paquette, and Courtney Paquette. Dimension- adapted momentum outscales sgd.arXiv preprint arXiv:2505.16098,

work page arXiv

[6] [6]

When and why momentum accelerates sgd: An empirical study.arXiv preprint arXiv:2306.09000,

Jingwen Fu, Bohan Wang, Huishuai Zhang, Zhizheng Zhang, Wei Chen, and Nanning Zheng. When and why momentum accelerates sgd: An empirical study.arXiv preprint arXiv:2306.09000,

work page arXiv

[7] [7]

The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783,

work page internal anchor Pith review Pith/arXiv arXiv

[8] [8]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948,

work page internal anchor Pith review Pith/arXiv arXiv

[9] [9]

Adam: A Method for Stochastic Optimization

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization.arXiv preprint arXiv:1412.6980,

work page internal anchor Pith review Pith/arXiv arXiv

[10] [10]

Noise is not the main factor behind the gap between sgd and adam on transformers, but sign descent might be

Frederik Kunstner, Jacques Chen, Jonathan Wilder Lavington, and Mark Schmidt. Noise is not the main factor behind the gap between sgd and adam on transformers, but sign descent might be. arXiv preprint arXiv:2304.13960,

work page arXiv

[11] [11]

Accelerating sgd with momentum for over-parameterized learning

Chaoyue Liu and Mikhail Belkin. Accelerating sgd with momentum for over-parameterized learning. arXiv preprint arXiv:1810.13395,

work page arXiv

[12] [12]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101,

work page internal anchor Pith review Pith/arXiv arXiv

[13] [13]

Aggregated Momentum: Stability Through Passive Damping

James Lucas, Shengyang Sun, Richard Zemel, and Roger Grosse. Aggregated momentum: Stability through passive damping.arXiv preprint arXiv:1804.00325,

work page internal anchor Pith review Pith/arXiv arXiv

[14] [14]

Quasi-hyperbolic momentum and Adam for deep learning

Jerry Ma and Denis Yarats. Quasi-hyperbolic momentum and adam for deep learning.arXiv preprint arXiv:1810.06801,

work page internal anchor Pith review Pith/arXiv arXiv

[15] [15]

Small batch size training for language models: When vanilla sgd works, and why gradient accumulation is wasteful.arXiv preprint arXiv:2507.07101,

Martin Marek, Sanae Lotfi, Aditya Somasundaram, Andrew Gordon Wilson, and Micah Goldblum. Small batch size training for language models: When vanilla sgd works, and why gradient accumulation is wasteful.arXiv preprint arXiv:2507.07101,

work page arXiv

[16] [16]

Critical batch size revisited: A simple empirical approach to large-batch language model training.arXiv preprint arXiv:2505.23971,

William Merrill, Shane Arora, Dirk Groeneveld, and Hannaneh Hajishirzi. Critical batch size revisited: A simple empirical approach to large-batch language model training.arXiv preprint arXiv:2505.23971,

work page arXiv

[17] [17]

A simplified analysis of sgd for linear regression with weight averaging.arXiv preprint arXiv:2506.15535, 2025a

Alexandru Meterez, Depen Morwani, Costin-Andrei Oncescu, Jingfeng Wu, Cengiz Pehlevan, and Sham Kakade. A simplified analysis of sgd for linear regression with weight averaging.arXiv preprint arXiv:2506.15535, 2025a. Alexandru Meterez, Depen Morwani, Jingfeng Wu, Costin-Andrei Oncescu, Cengiz Pehlevan, and Sham Kakade. Seesaw: Accelerating training by bal...

work page arXiv

[18] [18]

The ademamix optimizer: Better, faster, older

Matteo Pagliardini, Pierre Ablin, and David Grangier. The ademamix optimizer: Better, faster, older. arXiv preprint arXiv:2409.03137,

work page arXiv

[19] [19]

Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, SH Cai, Yuan Cao, Y Charles, HS Che, Cheng Chen, Guanduo Chen, et al. Kimi k2. 5: Visual agentic intelligence.arXiv preprint arXiv:2602.02276,

work page internal anchor Pith review Pith/arXiv arXiv

[20] [20]

SOAP: Improving and Stabilizing Shampoo using Adam

Nikhil Vyas, Depen Morwani, Rosie Zhao, Mujin Kwun, Itai Shapira, David Brandfonbrener, Lucas Janson, and Sham Kakade. Soap: Improving and stabilizing shampoo using adam.arXiv preprint arXiv:2409.11321,

work page internal anchor Pith review Pith/arXiv arXiv

[21] [21]

The marginal value of momentum for small learning rate sgd.arXiv preprint arXiv:2307.15196,

Runzhe Wang, Sadhika Malladi, Tianhao Wang, Kaifeng Lyu, and Zhiyuan Li. The marginal value of momentum for small learning rate sgd.arXiv preprint arXiv:2307.15196,

work page arXiv

[22] [22]

Qwen3 Technical Report

Jingfeng Wu, Difan Zou, Vladimir Braverman, Quanquan Gu, and Sham Kakade. Last iterate risk bounds of sgd with decaying stepsize for overparameterized linear regression. InInternational conference on machine learning, pages 24280–24314. PMLR, 2022a. Jingfeng Wu, Difan Zou, Vladimir Braverman, Quanquan Gu, and Sham Kakade. The power and limitation of pretr...

work page internal anchor Pith review Pith/arXiv arXiv

[23] [23]

How does critical batch size scale in pre-training?arXiv preprint arXiv:2410.21676,

Hanlin Zhang, Depen Morwani, Nikhil Vyas, Jingfeng Wu, Difan Zou, Udaya Ghai, Dean Foster, and Sham Kakade. How does critical batch size scale in pre-training?arXiv preprint arXiv:2410.21676,

work page arXiv