pith. machine review for the scientific record.

arxiv: 2409.20325 · v2 · submitted 2024-09-30 · 💻 cs.LG · math.OC

Old Optimizer, New Norm: An Anthology

Pith reviewed 2026-05-16 07:23 UTC · model grok-4.3

classification 💻 cs.LG math.OC
keywords optimizers · steepest descent · operator norms · neural network training · Adam · Shampoo · first-order methods · role-specific norms

The pith

Deep learning optimizers like Adam are equivalent to steepest descent under particular norms once exponential moving averages are disabled.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reinterprets Adam, Shampoo, and Prodigy as first-order methods by showing each reduces to steepest descent in a specific norm after the exponential moving averages are turned off. This reframing avoids convexity assumptions and instead treats the choice of norm as the key design choice. The authors generalize the observation to propose that tensors playing different roles in a network, such as linear layers versus embeddings, should receive different operator norms even when they occupy the same weight space. The hope is that deliberate metrization of the architecture will produce more stable and faster training.

Core claim

After switching off exponential moving averages, each method is equivalent to steepest descent under a particular norm. By generalizing this observation, different operator norms should be assigned to different tensors based on the role that the tensor plays within the network.
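As a rough illustration of the Adam case, the sketch below runs one textbook Adam update with both decay rates and epsilon set to zero and checks that it collapses to sign descent. Sign descent is commonly identified with steepest descent under an ℓ∞-type norm; this summary does not name the norm, so treat that identification as an assumption. The function `adam_step` and its argument names are illustrative, not taken from the paper.

```python
import numpy as np

def adam_step(g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One textbook Adam update. Returns (delta_w, new_m, new_v)."""
    m = beta1 * m + (1 - beta1) * g          # first-moment EMA
    v = beta2 * v + (1 - beta2) * g**2       # second-moment EMA
    m_hat = m / (1 - beta1**t)               # bias correction
    v_hat = v / (1 - beta2**t)
    return -lr * m_hat / (np.sqrt(v_hat) + eps), m, v

rng = np.random.default_rng(0)
g = rng.standard_normal((3, 4))              # stand-in gradient (illustrative data)
lr = 0.1

# Switch the EMAs off: beta1 = beta2 = 0, and drop eps (assumes no exact zeros in g).
delta, _, _ = adam_step(g, np.zeros_like(g), np.zeros_like(g), t=1,
                        lr=lr, beta1=0.0, beta2=0.0, eps=0.0)

# With no averaging, m = g and v = g**2, so the update is -lr * g / |g|.
sign_update = -lr * np.sign(g)
print(np.allclose(delta, sign_update))
```

With the averaging removed, every entry moves by the same magnitude `lr`, and only the sign of the gradient entry matters.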

What carries the argument

Steepest descent under a role-specific operator norm, where the norm is chosen according to whether a tensor implements a linear layer, embedding, or other network component.
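For reference, one standard way to write steepest descent under a norm is given below. The symbols are assumptions introduced here, not notation quoted from this page: $g$ is the gradient, $\lambda > 0$ is a sharpness parameter, and $\|\cdot\|^\dagger$ is the dual norm.

```latex
\[
\Delta w^\star
  = \arg\min_{\Delta w}\Big[\, g^\top \Delta w + \tfrac{\lambda}{2}\,\|\Delta w\|^{2} \Big]
  = -\frac{\|g\|^{\dagger}}{\lambda}\;\arg\max_{\|t\| = 1} g^\top t,
\qquad
\|g\|^{\dagger} = \max_{\|t\| = 1} g^\top t .
\]
```

Writing $\Delta w = c\,t$ with $\|t\| = 1$ and $c \ge 0$ separates the problem: the direction $t$ depends only on the chosen norm, and minimizing $-c\,\|g\|^\dagger + \tfrac{\lambda}{2}c^2$ gives the step size $c = \|g\|^\dagger / \lambda$. Changing the norm therefore changes the update direction without requiring convexity of the loss.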

If this is right

  • Linear and embedding layers, despite sharing the same weight space $\mathbb{R}^{m\times n}$, must be assigned different norms because they perform different functions.
  • Training algorithms can be constructed by selecting an appropriate norm for each tensor type rather than relying on a single global rule.
  • The resulting first-order methods remain valid without convexity assumptions on the objective.
  • Careful choice of per-tensor norms may improve stability, scalability, and speed over current uniform-norm optimizers.
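The points above can be sketched as a lookup from tensor role to update rule. This is a minimal sketch under stated assumptions: the specific pairing used here (spectral norm for linear layers, a column-wise RMS norm for embeddings) is an illustrative choice that this page does not spell out, dimension-dependent scaling factors are omitted, and the names `spectral_direction`, `column_rms_direction`, `NORM_BY_ROLE`, and `role_aware_step` are hypothetical.

```python
import numpy as np

def spectral_direction(G):
    """Steepest-descent direction under the spectral norm: set all singular values to 1."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

def column_rms_direction(G, tiny=1e-12):
    """Direction for a max-over-columns RMS norm: rescale each column to unit RMS."""
    rms = np.sqrt(np.mean(G**2, axis=0, keepdims=True))
    return G / (rms + tiny)

# Hypothetical role-to-rule table; same weight space, different geometry.
NORM_BY_ROLE = {"linear": spectral_direction, "embedding": column_rms_direction}

def role_aware_step(W, G, role, lr):
    """Apply one update using the direction associated with the tensor's role."""
    return W - lr * NORM_BY_ROLE[role](G)

rng = np.random.default_rng(0)
G = rng.standard_normal((4, 6))              # one gradient, shape shared by both roles
W = np.zeros_like(G)

W_linear = role_aware_step(W, G, "linear", lr=0.1)
W_embed = role_aware_step(W, G, "embedding", lr=0.1)
print(np.allclose(W_linear, W_embed))        # the two roles produce different updates
```

The same gradient matrix yields two different updates depending on the declared role, which is the design space the bullets describe.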

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Role-specific norms could be tested on modern transformer blocks to check whether attention and feed-forward tensors benefit from distinct choices.
  • The same norm-assignment principle might extend to convolutional or recurrent layers once their functional roles are defined.
  • If norm choice proves decisive, existing adaptive methods could be re-derived by first selecting the right norm and then adding minimal smoothing.

Load-bearing premise

Disabling exponential moving averages preserves the essential behavior of the methods, and assigning different norms to tensors according to their roles will improve training without further assumptions on the loss surface.

What would settle it

A controlled experiment in which assigning linear and embedding layers the same norm yields training stability and speed indistinguishable from a role-specific assignment would undermine the claim that role-specific norms matter.

Original abstract

Deep learning optimizers are often motivated through a mix of convex and approximate second-order theory. We select three such methods -- Adam, Shampoo and Prodigy -- and argue that each method can instead be understood as a squarely first-order method without convexity assumptions. In fact, after switching off exponential moving averages, each method is equivalent to steepest descent under a particular norm. By generalizing this observation, we chart a new design space for training algorithms. Different operator norms should be assigned to different tensors based on the role that the tensor plays within the network. For example, while linear and embedding layers may have the same weight space of $\mathbb{R}^{m\times n}$, these layers play different roles and should be assigned different norms. We hope that this idea of carefully metrizing the neural architecture might lead to more stable, scalable and indeed faster training.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper claims that after disabling exponential moving averages, the updates of Adam, Shampoo, and Prodigy each reduce exactly to steepest descent under a particular (instantaneous) norm; it then generalizes this observation into a design space in which distinct operator norms are assigned to tensors according to their architectural role (e.g., linear versus embedding layers) in order to improve stability and speed.

Significance. If the claimed exact equivalence can be established and the role-specific norm assignment yields measurable gains, the work would supply a clean first-order reinterpretation of several widely used optimizers and open a systematic route to architecture-aware metric design that does not rely on convexity or second-order approximations.

major comments (2)
  1. [Abstract] Abstract: the central equivalence claim (that disabling EMAs renders each method identical to steepest descent under a fixed norm) is stated without any derivation, explicit norm definition, or verification that the resulting single-step operator coincides with projection onto the unit ball of that norm for arbitrary loss surfaces.
  2. The subsequent design-space generalization (role-specific norms per tensor) rests on the equivalence being exact rather than heuristic; without an explicit construction of the norm from the per-step statistics once EMA is removed, it is unclear whether the modified update remains a true steepest-descent step or merely an approximation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and precise comments. The concerns about the abstract's brevity and the need for explicit construction of the norms are well-taken. We address each point below and will revise the manuscript accordingly to make the derivations and constructions fully explicit.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the central equivalence claim (that disabling EMAs renders each method identical to steepest descent under a fixed norm) is stated without any derivation, explicit norm definition, or verification that the resulting single-step operator coincides with projection onto the unit ball of that norm for arbitrary loss surfaces.

    Authors: We agree that the abstract, constrained by length, omits the derivation and explicit norm definitions. The body of the manuscript (Sections 3.1–3.3) derives the equivalence for each optimizer by removing the EMA terms and showing that the resulting update is exactly the steepest-descent step: the direction that minimizes the directional derivative over the unit ball of the induced norm. For arbitrary differentiable losses this holds by definition of steepest descent in a norm (no convexity required). We will expand the abstract to include the three explicit norm definitions and a one-sentence statement that the single-step operator matches the projection onto the unit ball of the dual norm. revision: yes

  2. Referee: [—] The subsequent design-space generalization (role-specific norms per tensor) rests on the equivalence being exact rather than heuristic; without an explicit construction of the norm from the per-step statistics once EMA is removed, it is unclear whether the modified update remains a true steepest-descent step or merely an approximation.

    Authors: The design space is constructed directly from the exact equivalences derived earlier. After EMA removal, the per-step statistics (e.g., the element-wise squared gradients for Adam, the Kronecker factors for Shampoo) define the norm at each step; the update is then precisely the steepest-descent direction in that instantaneous norm. Because the construction uses only the first-order gradient and the chosen norm, the equivalence is exact for any differentiable loss surface. We will add a new subsection (3.4) that explicitly maps each optimizer’s statistics to its norm and verifies that the resulting operator is the true steepest-descent step rather than an approximation. revision: yes
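The Shampoo case mentioned in this response can be checked numerically. The sketch below assumes the usual form of the Shampoo update with Kronecker factors built from the current gradient only, $L = GG^\top$ and $R = G^\top G$, and a square full-rank gradient; the helper names are hypothetical. Under those assumptions the preconditioned gradient equals $UV^\top$ from the SVD of $G$, which is the steepest-descent direction under the spectral norm.

```python
import numpy as np

def inv_fourth_root(A):
    """A^{-1/4} for a symmetric positive-definite matrix, via eigendecomposition."""
    w, Q = np.linalg.eigh(A)
    return (Q * w ** -0.25) @ Q.T

def shampoo_no_accumulation(G):
    """Shampoo-style direction with factors computed from a single gradient."""
    L = G @ G.T                              # left Kronecker factor
    R = G.T @ G                              # right Kronecker factor
    return inv_fourth_root(L) @ G @ inv_fourth_root(R)

# Build a well-conditioned 4x4 gradient with known singular values (illustrative data).
rng = np.random.default_rng(0)
Q1, _ = np.linalg.qr(rng.standard_normal((4, 4)))
Q2, _ = np.linalg.qr(rng.standard_normal((4, 4)))
G = Q1 @ np.diag([3.0, 2.0, 1.0, 0.5]) @ Q2.T

direction = shampoo_no_accumulation(G)
U, _, Vt = np.linalg.svd(G)
print(np.allclose(direction, U @ Vt))        # singular values are replaced by ones
```

Substituting $G = U\Sigma V^\top$ gives $L^{-1/4} = U\Sigma^{-1/2}U^\top$ and $R^{-1/4} = V\Sigma^{-1/2}V^\top$, so the product telescopes to $UV^\top$. For rank-deficient gradients a pseudo-inverse root would be needed, which this sketch does not handle.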

Circularity Check

0 steps flagged

Algebraic equivalence after EMA removal follows from direct manipulation without self-referential reduction

Full rationale

The central claim equates modified (EMA-disabled) Adam/Shampoo/Prodigy to steepest descent under role-specific norms via algebraic rewriting of the update rules. This manipulation starts from the standard optimizer equations and substitutes zero decay, yielding an instantaneous-norm step; the resulting identity is a direct consequence of the definitions rather than a fitted parameter or imported uniqueness theorem. No load-bearing self-citation chain or ansatz smuggling is required for the equivalence itself. The subsequent design-space proposal (assigning different norms to different tensor roles) is a generalization that rests on the algebraic observation but does not collapse back into it. Minor self-citation may exist for background but is not used to justify the core equivalence. Hence only a low score is warranted.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The argument rests on the standard definition of steepest descent with respect to an operator norm; no free parameters, ad-hoc axioms, or new entities are introduced in the abstract.

axioms (1)
  • [standard math] Steepest descent is the direction of maximal decrease measured in a chosen norm on the parameter space.
    Invoked to equate the optimizers to first-order methods.
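A small numerical check can make this axiom concrete for one norm. The sketch below takes the ℓ∞ norm as an example (an illustrative choice, not one named in this ledger) and confirms that the sign of the gradient attains the largest inner product with the gradient among points in the ℓ∞ unit ball, with value equal to the ℓ1 norm of the gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
g = rng.standard_normal(8)                   # stand-in gradient (illustrative data)

# Candidate direction of maximal change under the l-infinity norm.
t_star = np.sign(g)
best = g @ t_star                            # equals the l1 norm of g (the dual norm)

# Random points in the l-infinity unit ball never score higher.
candidates = rng.uniform(-1.0, 1.0, size=(10_000, 8))
scores = candidates @ g

print(np.isclose(best, np.abs(g).sum()), scores.max() <= best)
```

Stepping along `-t_star` is then the direction of maximal first-order decrease per unit of ℓ∞ length, which is the sense of "steepest" used in the axiom.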

pith-pipeline@v0.9.0 · 5437 in / 1098 out tokens · 28197 ms · 2026-05-16T07:23:30.832679+00:00 · methodology

Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Phases of Muon: When Muon Eclipses SignSGD

    math.OC 2026-05 unverdicted novelty 7.0

    On power-law covariance least squares problems, SignSVD (Muon) and SignSGD (Adam proxy) show three phases of relative performance depending on data exponent α and target exponent β.

  2. Intrinsic Muon: Spectral Optimization on Riemannian Matrix Manifolds

    cs.LG 2026-05 unverdicted novelty 7.0

    Intrinsic Muon provides closed-form linear maximization oracles on multiple Riemannian matrix manifolds for unitarily invariant norms, with convergence rates depending only on manifold dimension or rank.

  3. Flatness and Gradient Alignment Are Both Necessary: Spectral-Aware Gradient-Aligned Exploration for Multi-Distribution Learning

    cs.LG 2026-05 unverdicted novelty 7.0

    Excess risk decomposes into independent alignment (trace of inverse average Hessian times gradient covariance) and curvature terms, so both flatness and gradient alignment are required; SAGE achieves this and sets new...

  4. Layerwise LQR for Geometry-Aware Optimization of Deep Networks

    cs.LG 2026-05 unverdicted novelty 7.0

    Steepest descent under divergence-induced quadratic models equals an LQR problem, enabling learning of diagonal or Kronecker-factored inverse preconditioners via a global layerwise objective for scalable geometry-awar...

  5. A unified convergence theory for adaptive first-order methods in the nonconvex case, including AdaNorm, full and diagonal AdaGrad, Shampoo and Muo

    cs.LG 2026-04 unverdicted novelty 7.0

    A unified stochastic convergence theory is developed for adaptive preconditioned first-order methods including AdaGrad variants, Shampoo, and Muon in nonconvex optimization.

  6. Sharp Capacity Scaling of Spectral Optimizers in Learning Associative Memory

    cs.LG 2026-03 unverdicted novelty 7.0

    Muon achieves higher storage capacity than SGD and matches Newton's method in one-step recovery rates for associative memory under power-law distributions, while saturating at larger critical batch sizes and showing f...

  7. Rescaled Asynchronous SGD: Optimal Distributed Optimization under Data and System Heterogeneity

    cs.LG 2026-05 unverdicted novelty 6.0

    Rescaled ASGD recovers convergence to the true global objective by rescaling worker stepsizes proportional to computation times, matching the known time lower bound in the leading term under non-convex smoothness and ...

  8. Spectral Flattening Is All Muon Needs: How Orthogonalization Controls Learning Rate and Convergence

    cs.LG 2026-05 unverdicted novelty 6.0

    Muon achieves faster convergence and larger stable learning rates by flattening the singular value spectrum of the momentum buffer through orthogonalization, scaling step size with average rather than maximum singular values.

  9. Optimistic Dual Averaging Unifies Modern Optimizers

    cs.LG 2026-05 unverdicted novelty 6.0

    SODA unifies several modern optimizers under optimistic dual averaging and supplies a 1/k decay wrapper that improves performance without weight decay tuning.

  10. PolarAdamW: Disentangling Spectral Control and Schur Gauge-Equivariance in Matrix Optimisation

    cs.LG 2026-05 unverdicted novelty 6.0

    PolarAdamW disentangles spectral control from gauge-equivariance in matrix optimizers, with experiments demonstrating their distinct roles on standard versus symmetry-aware neural networks.

  11. Optimizer-Model Consistency: Full Finetuning with the Same Optimizer as Pretraining Forgets Less

    cs.LG 2026-05 unverdicted novelty 6.0

    Full finetuning with the pretraining optimizer reduces forgetting compared to other optimizers or LoRA while achieving comparable new-task performance.

  12. Demystifying Manifold Constraints in LLM Pre-training

    cs.LG 2026-05 unverdicted novelty 6.0

    Manifold constraints via the new MACRO optimizer independently bound activation scales and enforce rotational equilibrium in LLM pre-training, subsuming RMS normalization and decoupled weight decay while delivering co...

  13. Optimal Projection-Free Adaptive SGD for Matrix Optimization

    math.OC 2026-04 unverdicted novelty 6.0

    Proving stability of Leon's preconditioner enables the first tuning-free Nesterov-accelerated projection-free adaptive SGD variant with improved non-smooth non-convex rates.

  14. MuonEq: Balancing Before Orthogonalization with Lightweight Equilibration

    cs.LG 2026-03 unverdicted novelty 6.0

    MuonEq introduces pre-orthogonalization equilibration schemes that improve Muon optimizer performance during large language model pretraining.

  15. Pion: A Spectrum-Preserving Optimizer via Orthogonal Equivalence Transformation

    cs.LG 2026-05 unverdicted novelty 5.0

    Pion is an optimizer that preserves the singular values of weight matrices in LLM training by applying orthogonal equivalence transformations.

  16. MuonQ: Enhancing Low-Bit Muon Quantization via Directional Fidelity Optimization

    cs.LG 2026-05 unverdicted novelty 5.0

    MuonQ achieves stable 4-bit quantization of Muon optimizer states via pre-quantization normalization, singular component decomposition with power iteration, and μ-law companding, matching full-precision loss and accur...

  17. Representation Before Training: A Fixed-Budget Benchmark for Generative Medical Event Models

    cs.LG 2026-04 unverdicted novelty 5.0

    Fused code-value tokenization improves mortality AUROC from 0.891 to 0.915 and other clinical outcome predictions, while certain temporal encodings like event order match or exceed time tokens with shorter sequences.

  18. Can Muon Fine-tune Adam-Pretrained Models?

    cs.LG 2026-05 unverdicted novelty 4.0

    Constraining fine-tuning updates with LoRA mitigates performance degradation when switching from Adam to Muon on pretrained models.
