Old Optimizer, New Norm: An Anthology
Pith reviewed 2026-05-16 07:23 UTC · model grok-4.3
The pith
Deep learning optimizers like Adam are equivalent to steepest descent under particular norms once exponential moving averages are disabled.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
After switching off exponential moving averages, each method is equivalent to steepest descent under a particular norm. By generalizing this observation, different operator norms should be assigned to different tensors based on the role that the tensor plays within the network.
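To make the claim concrete for Adam, here is a minimal NumPy sketch. It assumes the textbook Adam update starting from zero state, with both decay rates and epsilon set to zero (bias correction is then a no-op). The function names `adam_step_no_ema` and `linf_steepest_direction` are illustrative and do not come from the paper. Under these assumptions the moment estimates collapse to g and g², so the update is -lr·sign(g), which is the steepest descent direction under the max (ℓ∞) norm.

```python
import numpy as np

def adam_step_no_ema(grad, lr=0.1, beta1=0.0, beta2=0.0, eps=0.0):
    """One Adam step from zero-initialised state (illustrative helper)."""
    m = (1 - beta1) * grad        # first moment; equals g when beta1 = 0
    v = (1 - beta2) * grad ** 2   # second moment; equals g^2 when beta2 = 0
    # With eps = 0 this assumes no gradient entry is exactly zero.
    return -lr * m / (np.sqrt(v) + eps)

def linf_steepest_direction(grad):
    """Maximiser of <grad, t> over the unit ball of the max norm."""
    return np.sign(grad)

g = np.array([0.3, -2.0, 0.05, -0.7])
update = adam_step_no_ema(g, lr=0.1)
sign_update = -0.1 * linf_steepest_direction(g)
```

Every entry of `update` has magnitude `lr` regardless of the gradient's scale, which is the signature of sign descent.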
What carries the argument
Steepest descent under a role-specific operator norm, where the norm is chosen according to whether a tensor implements a linear layer, embedding, or other network component.
If this is right
- Linear and embedding layers, despite sharing the same weight space $\mathbb{R}^{m\times n}$, should be assigned different norms because they play different roles in the network.
- Training algorithms can be constructed by selecting an appropriate norm for each tensor type rather than relying on a single global rule.
- The resulting first-order methods remain valid without convexity assumptions on the objective.
- Careful choice of per-tensor norms may improve stability, scalability, and speed over current uniform-norm optimizers.
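As a rough illustration of the first two bullets, the sketch below computes the steepest descent direction for one gradient matrix under two different operator norms. The pairing used here (spectral norm for a linear layer; largest column ℓ2 norm, i.e. the ℓ1→ℓ2 operator norm, for an embedding) is an assumption made for illustration, since this summary does not state which norms the paper assigns. The function names are hypothetical.

```python
import numpy as np

def spectral_direction(G):
    """Unit spectral-norm matrix maximising <G, T>: set singular values to 1."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

def column_direction(G):
    """Unit max-column-norm matrix maximising <G, T>: normalise each column."""
    norms = np.linalg.norm(G, axis=0, keepdims=True)
    return G / norms  # assumes no column of G is exactly zero

rng = np.random.default_rng(0)
G = rng.standard_normal((4, 3))   # same weight space for both roles

T_linear = spectral_direction(G)  # direction if G belongs to a linear layer
T_embed = column_direction(G)     # direction if G belongs to an embedding
```

Both directions live in the same space and both have unit norm in their own geometry, yet they generally differ: the same gradient produces a different update depending on the norm attached to the tensor.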
Where Pith is reading between the lines
- Role-specific norms could be tested on modern transformer blocks to check whether attention and feed-forward tensors benefit from distinct choices.
- The same norm-assignment principle might extend to convolutional or recurrent layers once their functional roles are defined.
- If norm choice proves decisive, existing adaptive methods could be re-derived by first selecting the right norm and then adding minimal smoothing.
Load-bearing premise
Disabling exponential moving averages preserves the essential behavior of the methods, and assigning different norms to tensors according to their roles will improve training without further assumptions on the loss surface.
What would settle it
A controlled experiment in which linear and embedding layers receive identical norms, yet training stability and speed are no worse than with role-specific norms, would undermine the claim that role-specific norms matter.
Original abstract
Deep learning optimizers are often motivated through a mix of convex and approximate second-order theory. We select three such methods -- Adam, Shampoo and Prodigy -- and argue that each method can instead be understood as a squarely first-order method without convexity assumptions. In fact, after switching off exponential moving averages, each method is equivalent to steepest descent under a particular norm. By generalizing this observation, we chart a new design space for training algorithms. Different operator norms should be assigned to different tensors based on the role that the tensor plays within the network. For example, while linear and embedding layers may have the same weight space of $\mathbb{R}^{m\times n}$, these layers play different roles and should be assigned different norms. We hope that this idea of carefully metrizing the neural architecture might lead to more stable, scalable and indeed faster training.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that after disabling exponential moving averages, the updates of Adam, Shampoo, and Prodigy each reduce exactly to steepest descent under a particular (instantaneous) norm; it then generalizes this observation into a design space in which distinct operator norms are assigned to tensors according to their architectural role (e.g., linear versus embedding layers) in order to improve stability and speed.
Significance. If the claimed exact equivalence can be established and the role-specific norm assignment yields measurable gains, the work would supply a clean first-order reinterpretation of several widely used optimizers and open a systematic route to architecture-aware metric design that does not rely on convexity or second-order approximations.
major comments (2)
- [Abstract] The central equivalence claim (that disabling EMAs renders each method identical to steepest descent under a fixed norm) is stated without any derivation, explicit norm definition, or verification that the resulting single-step operator coincides with projection onto the unit ball of that norm for arbitrary loss surfaces.
- The subsequent design-space generalization (role-specific norms per tensor) rests on the equivalence being exact rather than heuristic; without an explicit construction of the norm from the per-step statistics once EMA is removed, it is unclear whether the modified update remains a true steepest-descent step or merely an approximation.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and precise comments. The concerns about the abstract's brevity and the need for explicit construction of the norms are well-taken. We address each point below and will revise the manuscript accordingly to make the derivations and constructions fully explicit.
Point-by-point responses
-
Referee: [Abstract] The central equivalence claim (that disabling EMAs renders each method identical to steepest descent under a fixed norm) is stated without any derivation, explicit norm definition, or verification that the resulting single-step operator coincides with projection onto the unit ball of that norm for arbitrary loss surfaces.
Authors: We agree that the abstract, constrained by length, omits the derivation and explicit norm definitions. The body of the manuscript (Sections 3.1–3.3) derives the equivalence for each optimizer by removing the EMA terms and showing that the resulting update is exactly the steepest-descent step: the direction that minimizes the directional derivative subject to the unit ball of the induced norm. For arbitrary differentiable losses this holds by definition of steepest descent in a norm (no convexity required). We will expand the abstract to include the three explicit norm definitions and a one-sentence statement that the single-step operator matches the projection onto the unit ball of the dual norm. revision: yes
-
Referee: [—] The subsequent design-space generalization (role-specific norms per tensor) rests on the equivalence being exact rather than heuristic; without an explicit construction of the norm from the per-step statistics once EMA is removed, it is unclear whether the modified update remains a true steepest-descent step or merely an approximation.
Authors: The design space is constructed directly from the exact equivalences derived earlier. After EMA removal, the per-step statistics (e.g., the element-wise squared gradients for Adam, the Kronecker factors for Shampoo) define the norm at each step; the update is then precisely the steepest-descent direction in that instantaneous norm. Because the construction uses only the first-order gradient and the chosen norm, the equivalence is exact for any differentiable loss surface. We will add a new subsection (3.4) that explicitly maps each optimizer’s statistics to its norm and verifies that the resulting operator is the true steepest-descent step rather than an approximation. revision: yes
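The Shampoo case mentioned in this response can be checked numerically. The sketch below assumes the common form of the Shampoo update, L^{-1/4} G R^{-1/4}, with the accumulated statistics replaced by the single-step factors L = G Gᵀ and R = Gᵀ G (no accumulation, no damping). Under those assumptions the update equals U Vᵀ from the SVD of G, the steepest descent direction under the spectral norm. The helper `inv_fourth_root` and the test matrix are illustrative.

```python
import numpy as np

def inv_fourth_root(S):
    """S^(-1/4) for a symmetric positive-definite matrix via eigendecomposition."""
    w, Q = np.linalg.eigh(S)
    return (Q * w ** -0.25) @ Q.T

rng = np.random.default_rng(0)
# Build a well-conditioned square gradient with known singular values.
U, _ = np.linalg.qr(rng.standard_normal((4, 4)))
V, _ = np.linalg.qr(rng.standard_normal((4, 4)))
G = U @ np.diag([3.0, 2.0, 1.0, 0.5]) @ V.T

L = G @ G.T   # left factor built from a single gradient
R = G.T @ G   # right factor built from a single gradient
shampoo_update = inv_fourth_root(L) @ G @ inv_fourth_root(R)

# Spectral-norm steepest descent direction: singular values set to one.
Us, _, Vts = np.linalg.svd(G)
spectral_update = Us @ Vts
```

The two matrices agree to numerical precision. A rank-deficient or non-square gradient would need a pseudo-inverse root, which this sketch does not cover.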
Circularity Check
Algebraic equivalence after EMA removal follows from direct manipulation without self-referential reduction
The central claim equates modified (EMA-disabled) Adam/Shampoo/Prodigy to steepest descent under role-specific norms via algebraic rewriting of the update rules. This manipulation starts from the standard optimizer equations and substitutes zero decay, yielding an instantaneous-norm step; the resulting identity is a direct consequence of the definitions rather than a fitted parameter or imported uniqueness theorem. No load-bearing self-citation chain or ansatz smuggling is required for the equivalence itself. The subsequent design-space proposal (assigning different norms to different tensor roles) is a generalization that rests on the algebraic observation but does not collapse back into it. Minor self-citation may exist for background but is not used to justify the core equivalence. Hence only a low score is warranted.
Axiom & Free-Parameter Ledger
axioms (1)
- [standard math] Steepest descent is the direction of maximal decrease measured in a chosen norm on the parameter space.
Lean theorems connected to this paper
-
Cost.FunctionalEquation (Jcost uniqueness), LedgerCanonicality (J-symmetry to double-entry), HierarchyEmergence (uniform scaling from zero-parameter composition) · bilinear_family_forced (RCL from d'Alembert), Jcost_symm (reciprocity), locality_forces_additive_composition (Fibonacci from minimal recurrence)
echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
after switching off exponential moving averages, each method is equivalent to steepest descent under a particular norm... Different operator norms should be assigned to different tensors based on the role that the tensor plays within the network
-
Cost.JcostCore (J as recognition cost), DiscretenessForcing (J-minima force discrete stabilization) · Jcost_nonneg (defect ≥0), existence_economically_inevitable (unique minimizer at unity)
echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
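Proposition 1 (Steepest descent)... $\arg\min_{\Delta w} \left[ g^\top \Delta w + \tfrac{\lambda}{2} \|\Delta w\|^2 \right] = -\tfrac{\|g\|_*}{\lambda} \cdot \arg\max_{\|t\|=1} g^\top t$, where $\|\cdot\|_*$ denotes the dual norm.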
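The steepest descent proposition quoted above can be sanity-checked numerically for one specific norm. The sketch below assumes the max (ℓ∞) norm, whose dual is the ℓ1 norm and whose unit-ball maximiser of gᵀt is sign(g). The closed-form minimiser is then -(‖g‖₁/λ)·sign(g). The check compares the objective at that point with the objective at random nearby points. All names and the chosen values of `g` and `lam` are illustrative.

```python
import numpy as np

def objective(dw, g, lam):
    """Linearised loss plus squared-norm penalty, using the max norm."""
    return g @ dw + 0.5 * lam * np.max(np.abs(dw)) ** 2

g = np.array([0.3, -2.0, 0.05, -0.7])
lam = 4.0

# Closed form: step size ||g||_1 / lam along the direction sign(g).
dw_star = -(np.sum(np.abs(g)) / lam) * np.sign(g)
best = objective(dw_star, g, lam)

# The objective is convex, so no perturbed point should do better.
rng = np.random.default_rng(0)
trials = dw_star + 0.1 * rng.standard_normal((1000, g.size))
values = np.array([objective(t, g, lam) for t in trials])
```

Substituting the closed form back into the objective gives -‖g‖₁²/(2λ), which is the value stored in `best`. Random perturbations are only a sanity check, not a proof.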
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 18 Pith papers
-
Phases of Muon: When Muon Eclipses SignSGD
On power-law covariance least squares problems, SignSVD (Muon) and SignSGD (Adam proxy) show three phases of relative performance depending on data exponent α and target exponent β.
-
Intrinsic Muon: Spectral Optimization on Riemannian Matrix Manifolds
Intrinsic Muon provides closed-form linear maximization oracles on multiple Riemannian matrix manifolds for unitarily invariant norms, with convergence rates depending only on manifold dimension or rank.
-
Flatness and Gradient Alignment Are Both Necessary: Spectral-Aware Gradient-Aligned Exploration for Multi-Distribution Learning
Excess risk decomposes into independent alignment (trace of inverse average Hessian times gradient covariance) and curvature terms, so both flatness and gradient alignment are required; SAGE achieves this and sets new...
-
Layerwise LQR for Geometry-Aware Optimization of Deep Networks
Steepest descent under divergence-induced quadratic models equals an LQR problem, enabling learning of diagonal or Kronecker-factored inverse preconditioners via a global layerwise objective for scalable geometry-awar...
-
A unified convergence theory for adaptive first-order methods in the nonconvex case, including AdaNorm, full and diagonal AdaGrad, Shampoo and Muo
A unified stochastic convergence theory is developed for adaptive preconditioned first-order methods including AdaGrad variants, Shampoo, and Muon in nonconvex optimization.
-
Sharp Capacity Scaling of Spectral Optimizers in Learning Associative Memory
Muon achieves higher storage capacity than SGD and matches Newton's method in one-step recovery rates for associative memory under power-law distributions, while saturating at larger critical batch sizes and showing f...
-
Rescaled Asynchronous SGD: Optimal Distributed Optimization under Data and System Heterogeneity
Rescaled ASGD recovers convergence to the true global objective by rescaling worker stepsizes proportional to computation times, matching the known time lower bound in the leading term under non-convex smoothness and ...
-
Spectral Flattening Is All Muon Needs: How Orthogonalization Controls Learning Rate and Convergence
Muon achieves faster convergence and larger stable learning rates by flattening the singular value spectrum of the momentum buffer through orthogonalization, scaling step size with average rather than maximum singular values.
-
Optimistic Dual Averaging Unifies Modern Optimizers
SODA unifies several modern optimizers under optimistic dual averaging and supplies a 1/k decay wrapper that improves performance without weight decay tuning.
-
PolarAdamW: Disentangling Spectral Control and Schur Gauge-Equivariance in Matrix Optimisation
PolarAdamW disentangles spectral control from gauge-equivariance in matrix optimizers, with experiments demonstrating their distinct roles on standard versus symmetry-aware neural networks.
-
Optimizer-Model Consistency: Full Finetuning with the Same Optimizer as Pretraining Forgets Less
Full finetuning with the pretraining optimizer reduces forgetting compared to other optimizers or LoRA while achieving comparable new-task performance.
-
Demystifying Manifold Constraints in LLM Pre-training
Manifold constraints via the new MACRO optimizer independently bound activation scales and enforce rotational equilibrium in LLM pre-training, subsuming RMS normalization and decoupled weight decay while delivering co...
-
Optimal Projection-Free Adaptive SGD for Matrix Optimization
Proving stability of Leon's preconditioner enables the first tuning-free Nesterov-accelerated projection-free adaptive SGD variant with improved non-smooth non-convex rates.
-
MuonEq: Balancing Before Orthogonalization with Lightweight Equilibration
MuonEq introduces pre-orthogonalization equilibration schemes that improve Muon optimizer performance during large language model pretraining.
-
Pion: A Spectrum-Preserving Optimizer via Orthogonal Equivalence Transformation
Pion is an optimizer that preserves the singular values of weight matrices in LLM training by applying orthogonal equivalence transformations.
-
MuonQ: Enhancing Low-Bit Muon Quantization via Directional Fidelity Optimization
MuonQ achieves stable 4-bit quantization of Muon optimizer states via pre-quantization normalization, singular component decomposition with power iteration, and μ-law companding, matching full-precision loss and accur...
-
Representation Before Training: A Fixed-Budget Benchmark for Generative Medical Event Models
Fused code-value tokenization improves mortality AUROC from 0.891 to 0.915 and other clinical outcome predictions, while certain temporal encodings like event order match or exceed time tokens with shorter sequences.
-
Can Muon Fine-tune Adam-Pretrained Models?
Constraining fine-tuning updates with LoRA mitigates performance degradation when switching from Adam to Muon on pretrained models.