arxiv: 2409.20325 · v2 · submitted 2024-09-30 · 💻 cs.LG · math.OC

Old Optimizer, New Norm: An Anthology

Jeremy Bernstein, Laker Newhouse

Pith reviewed 2026-05-16 07:23 UTC · model grok-4.3

keywords optimizers · steepest descent · operator norms · neural network training · Adam · Shampoo · first-order methods · role-specific norms

The pith

Deep learning optimizers like Adam are equivalent to steepest descent under particular norms once exponential moving averages are disabled.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reinterprets Adam, Shampoo, and Prodigy as first-order methods by showing each reduces to steepest descent in a specific norm after the exponential moving averages are turned off. This reframing avoids convexity assumptions and instead treats the choice of norm as the key design choice. The authors generalize the observation to propose that tensors playing different roles in a network, such as linear layers versus embeddings, should receive different operator norms even when they occupy the same weight space. The hope is that deliberate metrization of the architecture will produce more stable and faster training.

Core claim

After switching off exponential moving averages, each method is equivalent to steepest descent under a particular norm. By generalizing this observation, different operator norms should be assigned to different tensors based on the role that the tensor plays within the network.
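
The Adam case can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the paper's code: `adam_step` is a hypothetical helper that omits bias correction (which has no effect when both decay rates are zero) and sets `eps` to zero, and the gradient values are made up. With both moving averages switched off, the update direction collapses to `sign(g)`, which is also the vector in the unit ball of the max norm that has the largest inner product with the gradient.

```python
import math
from itertools import product

def adam_step(grad, m, v, beta1, beta2, eps=0.0):
    """One Adam-style update direction (hypothetical helper, no bias correction)."""
    m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, grad)]
    v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, grad)]
    # With eps = 0 this divides by |g_i|, so every gradient entry must be nonzero.
    direction = [mi / (math.sqrt(vi) + eps) for mi, vi in zip(m, v)]
    return direction, m, v

def sign(x):
    return float((x > 0) - (x < 0))

grad = [0.5, -2.0, 3.0]  # made-up gradient with no zero entries
zeros = [0.0] * len(grad)

# Switching off both exponential moving averages means beta1 = beta2 = 0.
direction, _, _ = adam_step(grad, zeros, zeros, beta1=0.0, beta2=0.0)
sign_direction = [sign(g) for g in grad]

# A linear function on the cube {t : max|t_i| <= 1} is maximized at a vertex,
# so checking the 2^n vertices is enough to find the steepest direction.
best_vertex = max(
    product([-1.0, 1.0], repeat=len(grad)),
    key=lambda t: sum(g * ti for g, ti in zip(grad, t)),
)

print(direction)          # [1.0, -1.0, 1.0]
print(sign_direction)     # [1.0, -1.0, 1.0]
print(list(best_vertex))  # [1.0, -1.0, 1.0]
```

The brute-force vertex search is only there to make the steepest-descent reading concrete; in practice the sign is computed directly.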

What carries the argument

Steepest descent under a role-specific operator norm, where the norm is chosen according to whether a tensor implements a linear layer, embedding, or other network component.

If this is right

Linear and embedding layers, despite sharing the same weight space $\mathbb{R}^{m\times n}$, should be assigned different norms because they play different roles in the network.
Training algorithms can be constructed by selecting an appropriate norm for each tensor type rather than relying on a single global rule.
The resulting first-order methods remain valid without convexity assumptions on the objective.
Careful choice of per-tensor norms may improve stability, scalability, and speed over current uniform-norm optimizers.
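
As a rough illustration of how one weight space can carry two different metrics, the sketch below computes two operator norms of the same matrix in plain Python. The pairing is an assumption and is not stated in the text above: an ℓ1→RMS norm for a tensor used as an embedding (its inputs are one-hot, so only the largest column matters) and an RMS→RMS norm for a tensor used as a linear layer (a rescaled spectral norm). All function names and the example matrix are hypothetical.

```python
import math

def l1_to_rms_norm(W):
    """Largest RMS value of any column: max over ||x||_1 = 1 of RMS(W x)."""
    m, n = len(W), len(W[0])
    return max(
        math.sqrt(sum(W[i][j] ** 2 for i in range(m)) / m) for j in range(n)
    )

def spectral_norm(W, iters=200):
    """Largest singular value, estimated by power iteration on W^T W."""
    m, n = len(W), len(W[0])
    x = [1.0] * n
    for _ in range(iters):
        y = [sum(W[i][j] * x[j] for j in range(n)) for i in range(m)]  # y = W x
        z = [sum(W[i][j] * y[i] for i in range(m)) for j in range(n)]  # z = W^T y
        scale = math.sqrt(sum(zj * zj for zj in z))
        x = [zj / scale for zj in z]
    y = [sum(W[i][j] * x[j] for j in range(n)) for i in range(m)]
    return math.sqrt(sum(yi * yi for yi in y))

def rms_to_rms_norm(W):
    """max of RMS(W x) / RMS(x), which equals sqrt(n / m) times the spectral norm."""
    m, n = len(W), len(W[0])
    return math.sqrt(n / m) * spectral_norm(W)

# One 2 x 3 matrix (m = 2 outputs, n = 3 inputs), measured in two ways.
W = [[3.0, 0.0, 0.0],
     [0.0, 1.0, 1.0]]

as_embedding = l1_to_rms_norm(W)   # 3 / sqrt(2)
as_linear = rms_to_rms_norm(W)     # 3 * sqrt(3 / 2)
print(as_embedding, as_linear)
```

The two numbers differ even though the matrix is identical, so a step of fixed size in one norm is a different step in the other.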

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Role-specific norms could be tested on modern transformer blocks to check whether attention and feed-forward tensors benefit from distinct choices.
The same norm-assignment principle might extend to convolutional or recurrent layers once their functional roles are defined.
If norm choice proves decisive, existing adaptive methods could be re-derived by first selecting the right norm and then adding minimal smoothing.

Load-bearing premise

Disabling exponential moving averages preserves the essential behavior of the methods, and assigning different norms to tensors according to their roles will improve training without further assumptions on the loss surface.

What would settle it

A controlled experiment in which linear and embedding layers receive identical norms, yet training stability and speed match those obtained with role-specific norms, would undermine the claim that role-specific norms matter.

Original abstract

Deep learning optimizers are often motivated through a mix of convex and approximate second-order theory. We select three such methods -- Adam, Shampoo and Prodigy -- and argue that each method can instead be understood as a squarely first-order method without convexity assumptions. In fact, after switching off exponential moving averages, each method is equivalent to steepest descent under a particular norm. By generalizing this observation, we chart a new design space for training algorithms. Different operator norms should be assigned to different tensors based on the role that the tensor plays within the network. For example, while linear and embedding layers may have the same weight space of $\mathbb{R}^{m\times n}$, these layers play different roles and should be assigned different norms. We hope that this idea of carefully metrizing the neural architecture might lead to more stable, scalable and indeed faster training.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper reframes Adam, Shampoo, and Prodigy as steepest descent under specific norms once EMAs are removed, then uses that to propose assigning different norms to tensors by their network role.

The core move is to drop the exponential moving averages from Adam, Shampoo, and Prodigy and show each one becomes steepest descent under its own norm. That reduction is the new piece; it is not how these methods are usually derived or explained. From there the authors open a design space in which you choose the norm according to the tensor's job inside the network rather than just its shape. Linear layers and embeddings can have the same dimensions but get different metrics because they play different roles. That is a direct, first-order way to think about optimizer construction without convexity assumptions, and it is worth having on the table for people who tune large models.

Referee Report

2 major / 0 minor

Summary. The paper claims that after disabling exponential moving averages, the updates of Adam, Shampoo, and Prodigy each reduce exactly to steepest descent under a particular (instantaneous) norm; it then generalizes this observation into a design space in which distinct operator norms are assigned to tensors according to their architectural role (e.g., linear versus embedding layers) in order to improve stability and speed.

Significance. If the claimed exact equivalence can be established and the role-specific norm assignment yields measurable gains, the work would supply a clean first-order reinterpretation of several widely used optimizers and open a systematic route to architecture-aware metric design that does not rely on convexity or second-order approximations.

major comments (2)

[Abstract] The central equivalence claim (that disabling EMAs renders each method identical to steepest descent under a fixed norm) is stated without any derivation, explicit norm definition, or verification that the resulting single-step operator coincides with projection onto the unit ball of that norm for arbitrary loss surfaces.
The subsequent design-space generalization (role-specific norms per tensor) rests on the equivalence being exact rather than heuristic; without an explicit construction of the norm from the per-step statistics once EMA is removed, it is unclear whether the modified update remains a true steepest-descent step or merely an approximation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and precise comments. The concerns about the abstract's brevity and the need for explicit construction of the norms are well-taken. We address each point below and will revise the manuscript accordingly to make the derivations and constructions fully explicit.

Referee: [Abstract] The central equivalence claim (that disabling EMAs renders each method identical to steepest descent under a fixed norm) is stated without any derivation, explicit norm definition, or verification that the resulting single-step operator coincides with projection onto the unit ball of that norm for arbitrary loss surfaces.

Authors: We agree that the abstract, constrained by length, omits the derivation and explicit norm definitions. The body of the manuscript (Sections 3.1–3.3) derives the equivalence for each optimizer by removing the EMA terms and showing that the resulting update is exactly the steepest-descent step: the direction that minimizes the directional derivative over the unit ball of the chosen norm. For arbitrary differentiable losses this holds by definition of steepest descent in a norm (no convexity required). We will expand the abstract to include the three explicit norm definitions and a one-sentence statement that the single-step operator is the minimizer of the linearized loss over the unit ball of that norm. revision: yes
Referee: [—] The subsequent design-space generalization (role-specific norms per tensor) rests on the equivalence being exact rather than heuristic; without an explicit construction of the norm from the per-step statistics once EMA is removed, it is unclear whether the modified update remains a true steepest-descent step or merely an approximation.

Authors: The design space is constructed directly from the exact equivalences derived earlier. After EMA removal, the per-step statistics (e.g., the element-wise squared gradients for Adam, the Kronecker factors for Shampoo) define the norm at each step; the update is then precisely the steepest-descent direction in that instantaneous norm. Because the construction uses only the first-order gradient and the chosen norm, the equivalence is exact for any differentiable loss surface. We will add a new subsection (3.4) that explicitly maps each optimizer’s statistics to its norm and verifies that the resulting operator is the true steepest-descent step rather than an approximation. revision: yes
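
For Shampoo, the mapping from per-step statistics to a norm can be written out in a few lines. This is a sketch under stated assumptions rather than a quotation from the manuscript: it assumes the standard Shampoo update with exponent $-1/4$ on each Kronecker factor, replaces the accumulated factors by their single-step values $L = G G^\top$ and $R = G^\top G$, and reads inverse powers as pseudo-inverses when the gradient $G$ is rank-deficient. Here $G = U \Sigma V^\top$ is a reduced singular value decomposition.

```latex
\begin{align*}
G &= U \Sigma V^\top, \qquad
L = G G^\top = U \Sigma^2 U^\top, \qquad
R = G^\top G = V \Sigma^2 V^\top, \\
L^{-1/4} \, G \, R^{-1/4}
  &= \left( U \Sigma^{-1/2} U^\top \right)
     \left( U \Sigma V^\top \right)
     \left( V \Sigma^{-1/2} V^\top \right)
   = U V^\top, \\
U V^\top &\in \arg\max_{\|T\|_{2 \to 2} \le 1} \langle G, T \rangle,
  \qquad
  \langle G, U V^\top \rangle = \operatorname{tr}(\Sigma) .
\end{align*}
```

The last line says that $U V^\top$ attains the largest inner product with the gradient among matrices of spectral norm at most one, and that this largest value is the sum of the singular values of $G$ (the nuclear norm, which is dual to the spectral norm). Under these assumptions the EMA-free Shampoo step is therefore a steepest-descent step in the spectral norm.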

Circularity Check

0 steps flagged

Algebraic equivalence after EMA removal follows from direct manipulation without self-referential reduction

The central claim equates modified (EMA-disabled) Adam/Shampoo/Prodigy to steepest descent under particular norms via algebraic rewriting of the update rules. This manipulation starts from the standard optimizer equations and substitutes zero decay, yielding an instantaneous-norm step; the resulting identity is a direct consequence of the definitions rather than a fitted parameter or imported uniqueness theorem. No load-bearing self-citation chain or ansatz smuggling is required for the equivalence itself. The subsequent design-space proposal (assigning different norms to different tensor roles) is a generalization that rests on the algebraic observation but does not collapse back into it. Minor self-citation may exist for background but is not used to justify the core equivalence. Hence only a low score is warranted.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The argument rests on the standard definition of steepest descent with respect to an operator norm; no free parameters, ad-hoc axioms, or new entities are introduced in the abstract.

axioms (1)

standard math: Steepest descent is the direction of maximal decrease in the linearized loss per unit of step length, where length is measured in a chosen norm on the parameter space.
Invoked to equate the optimizers to first-order methods.

pith-pipeline@v0.9.0 · 5437 in / 1098 out tokens · 28197 ms · 2026-05-16T07:23:30.832679+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Cost.FunctionalEquation (Jcost uniqueness), LedgerCanonicality (J-symmetry to double-entry), HierarchyEmergence (uniform scaling from zero-parameter composition) · bilinear_family_forced (RCL from d'Alembert), Jcost_symm (reciprocity), locality_forces_additive_composition (Fibonacci from minimal recurrence) · echoes

echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

after switching off exponential moving averages, each method is equivalent to steepest descent under a particular norm... Different operator norms should be assigned to different tensors based on the role that the tensor plays within the network

Cost.JcostCore (J as recognition cost), DiscretenessForcing (J-minima force discrete stabilization) · Jcost_nonneg (defect ≥0), existence_economically_inevitable (unique minimizer at unity) · echoes

Proposition 1 (Steepest descent)... $\arg\min_{\Delta w} \left[ g^\top \Delta w + \frac{\lambda}{2} \|\Delta w\|^2 \right] = -\frac{\|g\|^{*}}{\lambda} \cdot \arg\max_{\|t\|=1} g^\top t$
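
The proposition can be checked numerically for one concrete norm. The sketch below is an illustration under stated assumptions, not part of the paper: it picks the max norm (whose dual is the ℓ1 norm), uses a made-up gradient `g` and sharpness `lam`, builds the closed-form step "minus dual norm over sharpness, times the maximizing unit vector", and confirms that no random perturbation of that step achieves a lower value of the model "linear term plus sharpness-weighted squared norm".

```python
import random

def sign(x):
    return float((x > 0) - (x < 0))

def objective(g, dw, lam):
    """g . dw + (lam / 2) * ||dw||_inf^2, the quantity being minimized."""
    linear = sum(gi * di for gi, di in zip(g, dw))
    norm_inf = max(abs(di) for di in dw)
    return linear + 0.5 * lam * norm_inf ** 2

g = [0.5, -2.0, 3.0]  # made-up gradient
lam = 4.0             # made-up sharpness

# Under the max norm, the unit vector maximizing g . t is sign(g),
# and the dual norm of g is its l1 norm.
dual_norm = sum(abs(gi) for gi in g)
step = [-(dual_norm / lam) * sign(gi) for gi in g]
best_value = objective(g, step, lam)

# Compare against random perturbations of the closed-form step.
random.seed(0)
beaten = 0
for _ in range(10_000):
    candidate = [si + random.gauss(0.0, 1.0) for si in step]
    if objective(g, candidate, lam) < best_value - 1e-12:
        beaten += 1

print(best_value, beaten)
```

Random search cannot prove optimality, but a zero count is consistent with the closed form, whose value here equals minus the squared dual norm divided by twice the sharpness.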

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Phases of Muon: When Muon Eclipses SignSGD
math.OC 2026-05 unverdicted novelty 7.0

On power-law covariance least squares problems, SignSVD (Muon) and SignSGD (Adam proxy) show three phases of relative performance depending on data exponent α and target exponent β.
Intrinsic Muon: Spectral Optimization on Riemannian Matrix Manifolds
cs.LG 2026-05 unverdicted novelty 7.0

Intrinsic Muon provides closed-form linear maximization oracles on multiple Riemannian matrix manifolds for unitarily invariant norms, with convergence rates depending only on manifold dimension or rank.
Flatness and Gradient Alignment Are Both Necessary: Spectral-Aware Gradient-Aligned Exploration for Multi-Distribution Learning
cs.LG 2026-05 unverdicted novelty 7.0

Excess risk decomposes into independent alignment (trace of inverse average Hessian times gradient covariance) and curvature terms, so both flatness and gradient alignment are required; SAGE achieves this and sets new...
Layerwise LQR for Geometry-Aware Optimization of Deep Networks
cs.LG 2026-05 unverdicted novelty 7.0

Steepest descent under divergence-induced quadratic models equals an LQR problem, enabling learning of diagonal or Kronecker-factored inverse preconditioners via a global layerwise objective for scalable geometry-awar...
A unified convergence theory for adaptive first-order methods in the nonconvex case, including AdaNorm, full and diagonal AdaGrad, Shampoo and Muo
cs.LG 2026-04 unverdicted novelty 7.0

A unified stochastic convergence theory is developed for adaptive preconditioned first-order methods including AdaGrad variants, Shampoo, and Muon in nonconvex optimization.
Sharp Capacity Scaling of Spectral Optimizers in Learning Associative Memory
cs.LG 2026-03 unverdicted novelty 7.0

Muon achieves higher storage capacity than SGD and matches Newton's method in one-step recovery rates for associative memory under power-law distributions, while saturating at larger critical batch sizes and showing f...
Rescaled Asynchronous SGD: Optimal Distributed Optimization under Data and System Heterogeneity
cs.LG 2026-05 unverdicted novelty 6.0

Rescaled ASGD recovers convergence to the true global objective by rescaling worker stepsizes proportional to computation times, matching the known time lower bound in the leading term under non-convex smoothness and ...
Spectral Flattening Is All Muon Needs: How Orthogonalization Controls Learning Rate and Convergence
cs.LG 2026-05 unverdicted novelty 6.0

Muon achieves faster convergence and larger stable learning rates by flattening the singular value spectrum of the momentum buffer through orthogonalization, scaling step size with average rather than maximum singular values.
Optimistic Dual Averaging Unifies Modern Optimizers
cs.LG 2026-05 unverdicted novelty 6.0

SODA unifies several modern optimizers under optimistic dual averaging and supplies a 1/k decay wrapper that improves performance without weight decay tuning.
PolarAdamW: Disentangling Spectral Control and Schur Gauge-Equivariance in Matrix Optimisation
cs.LG 2026-05 unverdicted novelty 6.0

PolarAdamW disentangles spectral control from gauge-equivariance in matrix optimizers, with experiments demonstrating their distinct roles on standard versus symmetry-aware neural networks.
Optimizer-Model Consistency: Full Finetuning with the Same Optimizer as Pretraining Forgets Less
cs.LG 2026-05 unverdicted novelty 6.0

Full finetuning with the pretraining optimizer reduces forgetting compared to other optimizers or LoRA while achieving comparable new-task performance.
Demystifying Manifold Constraints in LLM Pre-training
cs.LG 2026-05 unverdicted novelty 6.0

Manifold constraints via the new MACRO optimizer independently bound activation scales and enforce rotational equilibrium in LLM pre-training, subsuming RMS normalization and decoupled weight decay while delivering co...
Optimal Projection-Free Adaptive SGD for Matrix Optimization
math.OC 2026-04 unverdicted novelty 6.0

Proving stability of Leon's preconditioner enables the first tuning-free Nesterov-accelerated projection-free adaptive SGD variant with improved non-smooth non-convex rates.
MuonEq: Balancing Before Orthogonalization with Lightweight Equilibration
cs.LG 2026-03 unverdicted novelty 6.0

MuonEq introduces pre-orthogonalization equilibration schemes that improve Muon optimizer performance during large language model pretraining.
Pion: A Spectrum-Preserving Optimizer via Orthogonal Equivalence Transformation
cs.LG 2026-05 unverdicted novelty 5.0

Pion is an optimizer that preserves the singular values of weight matrices in LLM training by applying orthogonal equivalence transformations.
MuonQ: Enhancing Low-Bit Muon Quantization via Directional Fidelity Optimization
cs.LG 2026-05 unverdicted novelty 5.0

MuonQ achieves stable 4-bit quantization of Muon optimizer states via pre-quantization normalization, singular component decomposition with power iteration, and μ-law companding, matching full-precision loss and accur...
Representation Before Training: A Fixed-Budget Benchmark for Generative Medical Event Models
cs.LG 2026-04 unverdicted novelty 5.0

Fused code-value tokenization improves mortality AUROC from 0.891 to 0.915 and other clinical outcome predictions, while certain temporal encodings like event order match or exceed time tokens with shorter sequences.
Can Muon Fine-tune Adam-Pretrained Models?
cs.LG 2026-05 unverdicted novelty 4.0

Constraining fine-tuning updates with LoRA mitigates performance degradation when switching from Adam to Muon on pretrained models.
