Preconditioned Norms: A Unified Framework for Steepest Descent, Quasi-Newton and Adaptive Methods
Pith reviewed 2026-05-21 21:01 UTC · model grok-4.3
The pith
Preconditioned matrix norms provide a single framework in which steepest descent, quasi-Newton, and adaptive optimizers all appear as special cases.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Preconditioned matrix norms generalize steepest descent by allowing arbitrary norm choices that adapt to different geometries, extend quasi-Newton and adaptive methods beyond the Frobenius inner product, and establish that SGD, Adam, Muon, KL-Shampoo, SOAP, and SPlus emerge directly as special cases of the same construction. Necessary and sufficient conditions for affine and scale invariance are derived under these generalized norms.
What carries the argument
Preconditioned matrix norms, which augment standard matrix norms with a preconditioning operator to encode both geometric adaptation and curvature utilization in a single object.
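As a rough sketch of what such an object might look like: the two-sided form below is an assumption, inferred from the Theorem 1 excerpt quoted further down this page, since the page does not reproduce the paper's definition. The symbols $L$, $R$, the base norm $\|\cdot\|$, and the gradient $G$ come from that excerpt.

```latex
% Assumed definition (not confirmed by this page): a base matrix norm
% composed with invertible left and right preconditioners L and R.
\|X\|_{L,R} := \|L X R\|,
\qquad
\mathrm{lmo}_{L,R,\|\cdot\|}(G) := \arg\min_{\|X\|_{L,R} \le 1} \langle G, X \rangle .
```

Under this reading, the choice of base norm would encode the geometry (Frobenius, spectral, and so on), while $L$ and $R$ would carry the curvature information.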
If this is right
- Existing optimizers can be re-derived and compared inside one formalism instead of being developed in isolation.
- Hybrid methods such as MuAdam arise systematically by selecting different combinations of norm and preconditioner.
- Invariance properties for matrix-valued parameters can be checked or enforced by verifying the stated necessary and sufficient conditions.
- New optimizers can be constructed by exploring other preconditioned norms that have not yet been instantiated.
Where Pith is reading between the lines
- The unification may allow automatic selection or interpolation between norms based on observed curvature or architecture type.
- Similar preconditioned-norm constructions could be carried over to Riemannian or manifold-constrained optimization settings.
- Convergence rates for the new MuAdam variants could be derived by specializing existing analyses of steepest descent under matrix norms.
Load-bearing premise
The chosen abstraction of preconditioned matrix norms is assumed to capture the essential geometry and curvature of the listed optimizers, rather than the unification holding merely by virtue of how the norms are defined.
What would settle it
An explicit derivation showing that Adam or Muon cannot be recovered from any choice of preconditioned matrix norm without extra structure that lies outside the framework.
Original abstract
Optimization lies at the core of modern deep learning, yet existing methods often face a fundamental trade-off between adapting to problem geometry and leveraging curvature utilization. Steepest descent algorithms adapt to different geometries through norm choices but remain strictly first-order, whereas quasi-Newton and adaptive optimizers incorporate curvature information but are restricted to Frobenius geometry, limiting their applicability across diverse architectures. In this work, we propose a unified framework generalizing steepest descent, quasi-Newton methods, and adaptive methods through the novel notion of preconditioned matrix norms. This abstraction reveals that widely used optimizers such as SGD and Adam, as well as more advanced approaches like Muon and KL-Shampoo, and recent hybrids including SOAP and SPlus, all emerge as special cases of the same principle. Within this framework, we provide the first systematic treatment of affine and scale invariance in the matrix-parameterized setting, establishing necessary and sufficient conditions under generalized norms. Building on this foundation, we introduce two new methods, $\texttt{MuAdam}$ and $\texttt{MuAdam-SANIA}$, which combine the spectral geometry of Muon with Adam-style preconditioning. Our experiments demonstrate that these optimizers are competitive with, and in some cases outperform, existing state-of-the-art methods. Our code is available at https://github.com/brain-lab-research/LIB/tree/quasi_descent
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces a unified optimization framework based on preconditioned matrix norms that generalizes steepest descent (via norm choice), quasi-Newton methods, and adaptive methods (via curvature incorporation). It claims that SGD (P = I), Adam (second-moment diagonal preconditioning), Muon (spectral norm), KL-Shampoo, SOAP, and SPlus all arise as special cases. The work derives necessary and sufficient conditions for affine and scale invariance under these generalized norms and proposes two new hybrids, MuAdam and MuAdam-SANIA, which combine Muon's spectral geometry with Adam-style preconditioning; experiments indicate these are competitive with or superior to existing methods on standard benchmarks. Code is provided for reproducibility.
Significance. A non-tautological unification that independently recovers existing update rules from a shared geometric principle, together with explicit invariance conditions and new competitive hybrids, would constitute a useful organizing framework for optimizer design. The systematic invariance analysis and experimental validation of the proposed MuAdam variants are potentially valuable contributions if the derivations hold without post-hoc embedding of each method's preconditioner.
major comments (2)
- [§3] §3 (Definition of preconditioned matrix norms and recovery of existing methods): The central unification claim requires that minimizing ||ΔW||_P independently yields the exact update rules of SGD, Adam, Muon, etc. The manuscript should explicitly demonstrate that the choice of P for each optimizer is derived from geometric or curvature considerations rather than reverse-engineered to match the known update; otherwise the equivalence risks being definitional. A concrete example showing the norm minimization step for at least Adam and Muon, with the resulting closed-form update, would clarify this.
- [§4] §4 (Invariance conditions): The necessary and sufficient conditions for affine and scale invariance are presented under generalized norms. It is unclear whether these conditions are satisfied by the specific P choices that recover the listed optimizers (e.g., Adam's second-moment estimate or Muon's spectral projection), or whether additional restrictions are imposed that limit the framework's applicability. A table or proposition verifying invariance for each recovered method would strengthen the claim.
minor comments (2)
- [Abstract / §1] The abstract and introduction should more clearly distinguish the novel contribution (preconditioned norms as a unifying principle) from the known fact that many optimizers can be viewed as preconditioned gradient steps.
- [Experiments] Experimental section: baseline comparisons should include recent hybrids such as SOAP and SPlus with identical hyperparameter tuning protocols to ensure the competitiveness claim for MuAdam variants is not due to tuning differences.
Simulated Author's Rebuttal
We thank the referee for their constructive and insightful comments on our manuscript. These have helped us clarify the presentation of the preconditioned norms framework and its connections to existing methods. We address each major comment point by point below, with revisions made to strengthen the derivations and invariance analysis.
-
Referee: [§3] §3 (Definition of preconditioned matrix norms and recovery of existing methods): The central unification claim requires that minimizing ||ΔW||_P independently yields the exact update rules of SGD, Adam, Muon, etc. The manuscript should explicitly demonstrate that the choice of P for each optimizer is derived from geometric or curvature considerations rather than reverse-engineered to match the known update; otherwise the equivalence risks being definitional. A concrete example showing the norm minimization step for at least Adam and Muon, with the resulting closed-form update, would clarify this.
Authors: We agree that explicit derivation of each P from geometric principles is essential to substantiate the unification. In the revised manuscript, we have added a dedicated subsection (3.3) that derives the preconditioner choices from first principles. For Muon, the geometry is the spectral norm, i.e. the operator norm induced by the Euclidean vector norm, so the steepest-descent step under ||ΔW||_P is the orthogonalized gradient U V^T (where G = U Σ V^T is the singular value decomposition of the gradient), scaled by the step size, recovering the exact Muon rule without post-hoc fitting. For Adam, the diagonal preconditioner P is motivated as a curvature approximation via the second-moment estimate of the gradient, which corresponds to a diagonal Hessian approximation; the closed-form minimizer under ||ΔW||_P is then the element-wise scaled update, matching Adam exactly. These derivations are presented with the full minimization steps and resulting closed forms for both methods, showing that the P selections follow directly from the desired geometry or curvature model rather than being reverse-engineered. revision: yes
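The two closed forms discussed in this response can be sketched numerically. This is a minimal illustration, not the paper's derivation. For the spectral norm, the linear minimization oracle over the unit ball is the orthogonalized gradient $-UV^\top$. For a diagonally weighted Frobenius norm, the quadratic-model minimizer is an element-wise scaled gradient. The function names, the random `v` standing in for a second-moment estimate, and the use of the raw gradient in place of a momentum buffer are all assumptions made for illustration.

```python
import numpy as np

def spectral_lmo(G):
    """argmin of <G, X> over the spectral-norm unit ball: -U V^T."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return -U @ Vt

def diag_preconditioned_step(G, P, lr):
    """Minimizer of <G, X> + (1 / (2 * lr)) * sum(P * X**2), i.e. -lr * G / P."""
    return -lr * G / P

rng = np.random.default_rng(0)
G = rng.standard_normal((4, 3))  # hypothetical gradient of a 4x3 weight matrix

# Muon-style direction: every singular value of the step equals one.
X_muon = spectral_lmo(G)
step_singular_values = np.linalg.svd(X_muon, compute_uv=False)
nuclear_norm = np.linalg.svd(G, compute_uv=False).sum()
inner_product = np.sum(G * X_muon)  # equals -nuclear_norm by duality

# Adam-style direction: element-wise scaling by sqrt(v) + eps.
# v is a hypothetical second-moment estimate, not a real EMA.
v = rng.uniform(0.5, 2.0, size=G.shape)
eps, lr = 1e-8, 0.1
P = np.sqrt(v) + eps
X_adam = diag_preconditioned_step(G, P, lr)
stationarity_residual = G + P * X_adam / lr  # gradient of the quadratic model
```

Practical Muon implementations approximate $UV^\top$ with a Newton-Schulz style iteration and apply it to a momentum buffer; the exact SVD is used here only to keep the sketch short.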
-
Referee: [§4] §4 (Invariance conditions): The necessary and sufficient conditions for affine and scale invariance are presented under generalized norms. It is unclear whether these conditions are satisfied by the specific P choices that recover the listed optimizers (e.g., Adam's second-moment estimate or Muon's spectral projection), or whether additional restrictions are imposed that limit the framework's applicability. A table or proposition verifying invariance for each recovered method would strengthen the claim.
Authors: We thank the referee for highlighting this verification gap. In the revision, we have inserted a new table (Table 1) in §4 that enumerates each recovered optimizer (SGD, Adam, Muon, KL-Shampoo, SOAP, SPlus, and the proposed MuAdam variants), specifies the corresponding P, and indicates satisfaction of the affine and scale invariance conditions from Propositions 4.1 and 4.2. We also add a short corollary proving that the listed P choices satisfy the necessary and sufficient conditions under the problem assumptions already stated in the paper (e.g., Adam satisfies scale invariance but not full affine invariance, while Muon's spectral norm satisfies both when the matrix dimensions permit). No additional restrictions beyond those in the original framework are required, confirming broad applicability. revision: yes
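The notion of affine invariance at stake here can be illustrated on a toy problem. The sketch below does not reproduce Propositions 4.1 and 4.2, which this page does not quote, and it does not test the claims about Adam or Muon. It only checks the textbook fact that a step preconditioned by the Hessian commutes with an invertible reparameterization $w = A u$, whereas a plain gradient step does not. The quadratic loss and all variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
M = rng.standard_normal((d, d))
H = M @ M.T + np.eye(d)                            # SPD Hessian of a toy quadratic
b = rng.standard_normal(d)
A = np.eye(d) + 0.1 * rng.standard_normal((d, d))  # invertible change of variables

def grad(w):
    """Gradient of f(w) = 0.5 * w^T H w - b^T w."""
    return H @ w - b

def step(x, g, P, lr=0.5):
    """One preconditioned step: x - lr * P^{-1} g."""
    return x - lr * np.linalg.solve(P, g)

w0 = rng.standard_normal(d)
u0 = np.linalg.solve(A, w0)          # same point in the new coordinates

g_w = grad(w0)
g_u = A.T @ grad(A @ u0)             # chain rule for Phi(u) = f(A u)
H_u = A.T @ H @ A                    # Hessian of Phi

# Identity preconditioner (plain gradient descent): iterates drift apart.
gd_gap = np.linalg.norm(A @ step(u0, g_u, np.eye(d)) - step(w0, g_w, np.eye(d)))

# Hessian preconditioner: iterates coincide after mapping back with A.
newton_gap = np.linalg.norm(A @ step(u0, g_u, H_u) - step(w0, g_w, H))
```

A per-optimizer verification table of the kind the referee requests would repeat this comparison with each recovered preconditioner in place of `H`.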
Circularity Check
Unification holds by embedding existing preconditioners into the norm definition rather than deriving them independently
specific steps
-
self definitional
[Abstract]
"we propose a unified framework generalizing steepest descent, quasi-Newton methods, and adaptive methods through the novel notion of preconditioned matrix norms. This abstraction reveals that widely used optimizers such as SGD and Adam, as well as more advanced approaches like Muon and KL-Shampoo, and recent hybrids including SOAP and SPlus, all emerge as special cases of the same principle."
The preconditioned-norm abstraction is introduced precisely so that each listed optimizer corresponds to a particular choice of the preconditioner matrix P inside the norm; the update rule for that optimizer is then recovered by construction when the minimization is performed with that P. Equivalence therefore follows from the definitional setup rather than from an a-priori geometric principle that would have predicted the preconditioners without prior knowledge of the optimizers.
full rationale
The paper defines a general steepest-descent update via minimization of a preconditioned matrix norm ||ΔW||_P and then shows that SGD, Adam, Muon, etc. arise for particular choices of P (identity, second-moment diagonal, spectral norm, etc.). Because the specific P for each optimizer is selected precisely to reproduce that optimizer's known update rule, the claimed 'emergence as special cases' reduces to a definitional reparameterization rather than an independent geometric derivation. The new methods MuAdam and MuAdam-SANIA are genuine extensions, but the central unification claim for the listed existing methods is load-bearing on this construction. No self-citation chain or uniqueness theorem is invoked to force the result, so the circularity is partial (score 6) rather than total.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Preconditioned matrix norms form a valid and sufficiently general abstraction for the listed optimization families
invented entities (1)
-
preconditioned matrix norms
no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
We propose a unified framework generalizing steepest descent, quasi-Newton methods, and adaptive methods through the novel notion of preconditioned matrix norms... SGD and Adam... Muon and KL-Shampoo... SOAP and SPlus... all emerge as special cases
-
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · embed_injective
Relation between the paper passage and the cited Recognition theorem: unclear
Theorem 1: $\mathrm{lmo}_{L,R,\|\cdot\|}(G) = L^{-1}\,\mathrm{lmo}_{\|\cdot\|}\!\left(L^{-\top} G R^{-\top}\right) R^{-1}$
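The identity in Theorem 1 can be sanity-checked numerically. The sketch below assumes the preconditioned norm is $\|X\|_{L,R} = \|L X R\|$, which the page does not state explicitly, and picks the spectral norm as the base norm. The matrices `L`, `R`, and `G` are random placeholders. It checks that the candidate given by the right-hand side is feasible, attains the dual-norm lower bound, and is not beaten by random feasible points.

```python
import numpy as np

def spectral_lmo(G):
    """argmin of <G, X> over the spectral-norm unit ball: -U V^T."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return -U @ Vt

rng = np.random.default_rng(0)
m, n = 4, 3
G = rng.standard_normal((m, n))
L = np.eye(m) + 0.1 * rng.standard_normal((m, m))  # hypothetical left preconditioner
R = np.eye(n) + 0.1 * rng.standard_normal((n, n))  # hypothetical right preconditioner
L_inv, R_inv = np.linalg.inv(L), np.linalg.inv(R)

# Right-hand side of Theorem 1.
G_tilde = L_inv.T @ G @ R_inv.T
X_star = L_inv @ spectral_lmo(G_tilde) @ R_inv

# Feasibility: the assumed preconditioned norm of X_star should be one.
feasible_norm = np.linalg.norm(L @ X_star @ R, 2)

# Optimality: <G, X_star> should equal minus the nuclear norm of G_tilde.
value = np.sum(G * X_star)
dual_bound = np.linalg.svd(G_tilde, compute_uv=False).sum()

# Random points on the boundary of the feasible set should not do better.
random_values = []
for _ in range(200):
    Y = rng.standard_normal((m, n))
    Y /= np.linalg.norm(Y, 2)                  # spectral norm of Y is one
    random_values.append(np.sum(G * (L_inv @ Y @ R_inv)))
best_random = min(random_values)
```

The check rests on the change of variables $Y = L X R$, under which $\langle G, X \rangle = \langle L^{-\top} G R^{-\top}, Y \rangle$, so the constrained problem reduces to the unpreconditioned oracle.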
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
-
Symmetry-Compatible Principle for Optimizer Design: Embeddings, LM Heads, SwiGLU MLPs, and MoE Routers
Proposes equivariant optimizers matched to the symmetry groups of embeddings, SwiGLU projections and MoE routers, with experiments showing consistent gains over AdamW on language model pre-training.
-
LionMuon: Alternating Spectral and Sign Descent for Efficient Training
LionMuon alternates Lion sign steps and Muon spectral steps with shared dual-EMA momentum to match Lion memory while outperforming both at P=2 on 124M-720M models, backed by heavy-tailed complexity bounds that predict...