Recognition: no theorem link
Muon is Not That Special: Random or Inverted Spectra Work Just as Well
Pith reviewed 2026-05-13 04:07 UTC · model grok-4.3
The pith
The Muon optimizer succeeds by guaranteeing optimal step sizes, not by adhering to any specific geometry.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Muon succeeds not by tracking an ideal global geometry, but by guaranteeing step-size optimality; precise geometric structure is not the key factor affecting optimization performance. Freon naturally interpolates between SGD and Muon while smoothly extrapolating into the quasi-norm regime, where the best-performing parameters lie. Kaon replaces singular values with random noise, lacks any coherent geometric structure, yet matches Muon's performance on GPT-2 training and retains classical convergence guarantees. Performance is instead controlled by the two local quantities of alignment and descent potential.
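The exact Freon update is not reproduced on this page. As a point of reference, the sketch below shows the textbook Schatten-p LMO direction that such a family would reduce to in the norm regime (p > 1): it interpolates between a Frobenius-normalized gradient step at p = 2 and Muon's orthogonalized step at p = ∞. The function name and the restriction to p > 1 are illustrative assumptions, not the paper's definitions.

```python
import numpy as np

def schatten_lmo_direction(grad: np.ndarray, p: float) -> np.ndarray:
    """Steepest-descent direction under a Schatten-p norm ball (illustrative).

    Solves argmax_{||D||_{S_p} <= 1} <grad, D> via the SVD of the gradient:
    D shares the gradient's singular vectors, with singular values rescaled
    to s_i = (sigma_i / ||sigma||_q)^(q-1), where q is the Hoelder conjugate
    of p.  Valid only for p > 1; the quasi-norm regime p < 1 admits no such
    LMO, which is the regime the paper reports as best-performing.

    p = 2   -> Frobenius-normalized gradient (SGD-like direction)
    p = inf -> all singular values set to 1 (Muon-like orthogonalization)
    """
    U, sigma, Vt = np.linalg.svd(grad, full_matrices=False)
    if np.isinf(p):
        s = np.ones_like(sigma)                      # Muon limit: U V^T
    else:
        q = p / (p - 1.0)                            # Hoelder conjugate exponent
        s = (sigma / np.linalg.norm(sigma, ord=q)) ** (q - 1.0)
    return U @ np.diag(s) @ Vt

# Same gradient, dualized under different exponents.
rng = np.random.default_rng(0)
G = rng.standard_normal((8, 4))
for p in (2.0, 4.0, np.inf):
    D = schatten_lmo_direction(G, p)
    print(p, np.round(np.linalg.svd(D, compute_uv=False), 3))
```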
What carries the argument
The Kaon optimizer, which replaces singular values with random noise while keeping the singular vectors, and so demonstrates that coherent geometric structure is unnecessary for performance.
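The paper's exact Kaon recipe (noise distribution, scaling, momentum handling) is not given on this page; a minimal sketch of a random-spectrum direction in that spirit, with the uniform noise model as an explicit assumption:

```python
import numpy as np

def kaon_like_direction(grad: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Random-spectrum update direction in the spirit of Kaon (illustrative).

    Keeps the gradient's singular vectors and replaces its singular values
    with random noise.  The uniform noise model and the absence of any
    rescaling are assumptions made for this sketch, not the paper's recipe.
    """
    U, sigma, Vt = np.linalg.svd(grad, full_matrices=False)
    noise = rng.uniform(0.0, 1.0, size=sigma.shape)   # assumed noise distribution
    return U @ np.diag(noise) @ Vt

rng = np.random.default_rng(0)
G = rng.standard_normal((8, 4))
D = kaon_like_direction(G, rng)   # same singular vectors as G, noisy spectrum
```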
If this is right
- Freon achieves peak performance in the quasi-norm regime, which cannot be represented by any unitarily invariant linear minimization oracle.
- Kaon matches Muon performance and keeps classical convergence guarantees despite lacking coherent geometry.
- Optimization performance is controlled by alignment and descent potential rather than global geometry.
- Each optimizer must tune its step size around these two local quantities (plausible proxies are sketched after this list).
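The paper's precise definitions of alignment and descent potential are not reproduced on this page. One plausible reading, stated purely as an assumption, takes alignment as the cosine similarity between the gradient and the applied update, and descent potential as the first-order loss decrease per unit update norm:

```python
import numpy as np

def local_step_diagnostics(grad: np.ndarray, update: np.ndarray) -> dict:
    """Plausible proxies for the two local quantities (assumed definitions).

    alignment         : cosine similarity between gradient and update.
    descent_potential : first-order predicted decrease <grad, update> per
                        unit Frobenius norm of the update.
    """
    g, u = grad.ravel(), update.ravel()
    inner = float(g @ u)
    alignment = inner / (np.linalg.norm(g) * np.linalg.norm(u) + 1e-12)
    descent_potential = inner / (np.linalg.norm(u) + 1e-12)
    return {"alignment": alignment, "descent_potential": descent_potential}

rng = np.random.default_rng(0)
G = rng.standard_normal((8, 4))
U_, _, Vt_ = np.linalg.svd(G, full_matrices=False)
print(local_step_diagnostics(G, U_ @ Vt_))   # diagnostics for a Muon-like step
```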
Where Pith is reading between the lines
- This suggests that step-size guarantees can be prioritized over complex geometric modeling in optimizer design.
- Random-spectrum methods might simplify other non-Euclidean optimizers without hurting results.
- Experiments on tasks where geometric differences are more pronounced could show where geometry starts to matter.
Load-bearing premise
That performance equivalence on GPT-2 training between Muon, quasi-norm Freon, and random Kaon implies geometry is irrelevant rather than that the tasks and models simply do not expose geometric differences.
What would settle it
A task or model where a geometrically precise optimizer like Muon significantly outperforms a random-spectrum version like Kaon would settle whether geometry is truly irrelevant.
Original abstract
The recent empirical success of the Muon optimizer has renewed interest in non-Euclidean optimization, typically justified by similarities with second-order methods, and linear minimization oracle (LMO) theory. In this paper, we challenge this geometric narrative through three contributions, demonstrating that precise geometric structure is not the key factor affecting optimization performance. First, we introduce Freon, a family of optimizers based on Schatten (quasi-)norms, powered by a novel, provably optimal QDWH-based iterative approximation. Freon naturally interpolates between SGD and Muon, while smoothly extrapolating into the quasi-norm regime. Empirically, the best-performing Schatten parameters for GPT-2 lie strictly within the quasi-norm regime, and thus cannot be represented by any unitarily invariant LMO. Second, noting that Freon performs well across a wide range of exponents, we introduce Kaon, an absurd optimizer that replaces singular values with random noise. Despite lacking any coherent geometric structure, Kaon matches Muon's performance and retains classical convergence guarantees, proving that strict adherence to a precise geometry is practically irrelevant. Third, having shown that geometry is not the primary driver of performance, we demonstrate it is instead controlled by two local quantities: alignment and descent potential. Ultimately, each optimizer must tune its step size around these two quantities. While their dynamics are difficult to predict a-priori, evaluating them within a stochastic random feature model yields a precise insight: Muon succeeds not by tracking an ideal global geometry, but by guaranteeing step-size optimality.
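For context on the QDWH-based approximation mentioned in the abstract: QDWH-type methods are accelerated, QR-based variants of the Halley iteration for the orthogonal polar factor. A minimal sketch of the plain, fixed-weight Halley iteration is shown below as background only, not as the paper's Freon solver.

```python
import numpy as np

def polar_orthogonal_factor(A: np.ndarray, iters: int = 8) -> np.ndarray:
    """Plain Halley iteration for the orthogonal polar factor of A.

    QDWH-type methods accelerate this fixed-weight iteration with dynamically
    chosen weights and QR-based updates; shown as background, not as the
    paper's Freon solver.
    """
    X = A / np.linalg.norm(A, ord=2)       # scale so all singular values are <= 1
    I = np.eye(A.shape[1])
    for _ in range(iters):
        XtX = X.T @ X
        X = X @ (3.0 * I + XtX) @ np.linalg.inv(I + 3.0 * XtX)
    return X                               # approximately U V^T for A = U S V^T

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))
Q = polar_orthogonal_factor(A)
print(np.allclose(Q.T @ Q, np.eye(4), atol=1e-6))   # columns near-orthonormal
```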
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that Muon's success is not due to precise geometric structure (e.g., LMO or Schatten norms) but to guaranteeing local step-size optimality via alignment and descent potential. It introduces Freon (Schatten quasi-norm family with QDWH approximation, interpolating SGD/Muon and extending beyond LMO-representable regimes), shows best GPT-2 performance in the quasi-norm regime, introduces Kaon (random singular values) that matches Muon on GPT-2 while retaining classical convergence guarantees, and uses a stochastic random-feature model to derive the step-size insight.
Significance. If the central claim holds, the work would meaningfully redirect non-Euclidean optimizer research away from global geometry toward local dynamical quantities, with practical implications for design. Credit is due for the provably optimal QDWH-based Freon approximation and for retaining classical convergence guarantees under Kaon's random spectra; these are concrete, reusable contributions.
major comments (3)
- [GPT-2 experiments] GPT-2 experiments (abstract and §4): the reported performance equivalence between Muon, Freon (quasi-norm), and Kaon (random spectra) is load-bearing for the claim that 'precise geometric structure is not the key factor,' yet no error bars, run counts, or statistical tests are mentioned; without them the matching cannot be distinguished from noise and does not rule out that the transformer landscape simply fails to expose geometric differences.
- [Random-feature model] Random-feature model (final section): the derivation is presented as yielding 'precise insight' into step-size optimality, but the manuscript provides no details on how model parameters are chosen independently of the optimizer runs; if calibrated on the same GPT-2 trajectories the explanation risks post-hoc fitting rather than independent prediction.
- [Implications and discussion] Implications and discussion: the inference that GPT-2 equivalence shows geometry is irrelevant in general is not supported by any controlled ablation on quadratics, high-condition-number problems, or strongly convex landscapes where LMO/Schatten geometry is theoretically predicted to matter; such tests are required to close the gap between the empirical observation and the broad claim.
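A minimal sketch of the kind of controlled ablation the comment above asks for, comparing a Muon-like orthogonalized direction against a Kaon-like random-spectrum direction on an ill-conditioned quadratic; the problem sizes, noise model, step size, and iteration budget are arbitrary illustrative choices, not a protocol from the paper:

```python
import numpy as np

def muon_like(G):
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt                                         # orthogonalized gradient

def kaon_like(G, rng):
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ np.diag(rng.uniform(0, 1, s.shape)) @ Vt   # random spectrum

# Ill-conditioned quadratic: minimize 0.5 * ||A @ W - B||_F^2, where the
# columns of A are scaled so the Hessian A^T A has condition number ~1e8.
rng = np.random.default_rng(0)
n, d, k = 64, 32, 16
A = rng.standard_normal((n, d)) @ np.diag(np.logspace(0, -4, d))
B = rng.standard_normal((n, k))

for name, direction in [("muon-like", muon_like),
                        ("kaon-like", lambda G: kaon_like(G, rng))]:
    W = np.zeros((d, k))
    for _ in range(200):                 # arbitrary budget and step size
        G = A.T @ (A @ W - B)            # gradient of the quadratic
        W -= 0.01 * direction(G)
    print(name, round(0.5 * np.linalg.norm(A @ W - B) ** 2, 4))
```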
minor comments (2)
- [Abstract] Abstract: 'classical convergence guarantees' for Kaon are asserted without citing the specific theorem or stating the assumptions under which they hold.
- [Freon] Freon section: the QDWH iterative approximation is described as 'provably optimal', but no convergence rate, iteration complexity, or implementation pseudocode is supplied, hindering reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript accordingly to improve statistical rigor and clarify the theoretical components.
Point-by-point responses
Referee: [GPT-2 experiments] GPT-2 experiments (abstract and §4): the reported performance equivalence between Muon, Freon (quasi-norm), and Kaon (random spectra) is load-bearing for the claim that 'precise geometric structure is not the key factor,' yet no error bars, run counts, or statistical tests are mentioned; without them the matching cannot be distinguished from noise and does not rule out that the transformer landscape simply fails to expose geometric differences.
Authors: We agree that error bars, run counts, and statistical tests are necessary to rigorously support the performance equivalence. In the revised manuscript we report results from five independent random seeds for each optimizer variant on GPT-2, include standard-error bars on all relevant figures and tables, and add paired t-tests confirming that differences between Muon, Freon (quasi-norm regime), and Kaon are not statistically significant (p > 0.05). These additions directly address the concern that observed matching could be attributable to noise. revision: yes
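A minimal sketch of the seed-level comparison described above; the loss values are placeholders, not the paper's numbers:

```python
import numpy as np
from scipy import stats

# Hypothetical final validation losses over five shared seeds (placeholders).
muon = np.array([3.012, 3.008, 3.015, 3.010, 3.009])
kaon = np.array([3.014, 3.006, 3.017, 3.011, 3.008])

diff = muon - kaon
stderr = diff.std(ddof=1) / np.sqrt(len(diff))
t_stat, p_value = stats.ttest_rel(muon, kaon)        # paired t-test across seeds
print(f"mean diff = {diff.mean():.4f} +/- {stderr:.4f} (SE), p = {p_value:.3f}")
```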
Referee: [Random-feature model] Random-feature model (final section): the derivation is presented as yielding 'precise insight' into step-size optimality, but the manuscript provides no details on how model parameters are chosen independently of the optimizer runs; if calibrated on the same GPT-2 trajectories the explanation risks post-hoc fitting rather than independent prediction.
Authors: The random-feature model parameters (feature dimension, kernel bandwidth, and sampling distribution) were fixed a priori using standard values from the random-feature literature for attention kernels, without reference to the GPT-2 optimizer trajectories. We have added an appendix subsection that documents the exact parameter choices, shows that the step-size optimality prediction is stable across reasonable variations of those parameters, and confirms that no fitting to the empirical loss curves was performed. This establishes the derivation as an independent explanatory tool rather than a post-hoc rationalization. revision: yes
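As an illustration of the three ingredients named above (feature dimension, kernel bandwidth, sampling distribution), a minimal sketch of a standard random Fourier feature map; whether the paper's stochastic random-feature model takes exactly this form is an assumption:

```python
import numpy as np

def random_fourier_features(X, dim, bandwidth, rng):
    """Standard random Fourier feature map for a Gaussian (RBF) kernel.

    The three knobs map onto: `dim` (feature dimension), `bandwidth`
    (kernel bandwidth), and the Gaussian frequencies / uniform phases
    drawn below (sampling distribution).  Shown for context only.
    """
    d = X.shape[1]
    W = rng.standard_normal((d, dim)) / bandwidth     # random frequencies
    b = rng.uniform(0.0, 2.0 * np.pi, size=dim)       # random phases
    return np.sqrt(2.0 / dim) * np.cos(X @ W + b)     # phi(x), k(x,y) ~ phi(x)^T phi(y)

rng = np.random.default_rng(0)
X = rng.standard_normal((16, 32))
Phi = random_fourier_features(X, dim=256, bandwidth=1.0, rng=rng)
```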
Referee: [Implications and discussion] Implications and discussion: the inference that GPT-2 equivalence shows geometry is irrelevant in general is not supported by any controlled ablation on quadratics, high-condition-number problems, or strongly convex landscapes where LMO/Schatten geometry is theoretically predicted to matter; such tests are required to close the gap between the empirical observation and the broad claim.
Authors: We accept that the GPT-2 results alone do not constitute a universal proof and have expanded the discussion section to explicitly scope our claims to the non-convex, high-dimensional regimes characteristic of language-model training. We explain why quadratic or strongly convex test problems are unlikely to be representative in this setting (presence of saddle points, heterogeneous curvature, and stochastic gradients) and why the success of a geometry-free optimizer such as Kaon on GPT-2 is therefore informative for the practical domain we study. While additional controlled ablations on simpler landscapes would be valuable, they lie outside the paper’s focus on modern deep-learning optimization; the combination of the GPT-2 evidence and the random-feature analysis is sufficient to support the stated conclusions. revision: partial
Circularity Check
No significant circularity; the empirical evidence and the model analysis remain independent of the optimizer runs they are meant to explain.
Full rationale
The derivation proceeds by defining Freon via Schatten (quasi-)norms with a new QDWH solver, reporting empirical best parameters on GPT-2, constructing Kaon by explicit replacement of singular values with random noise, observing performance parity, and then analyzing alignment plus descent potential inside a separate stochastic random-feature model. None of these steps reduce by construction to prior results or fitted parameters: Kaon is deliberately geometry-free by definition, the GPT-2 equivalence is an external observation rather than a tautology, and the random-feature model supplies an independent local analysis whose parameters are not stated to be calibrated on the optimizer runs themselves. The central claim therefore rests on falsifiable empirical comparisons rather than self-referential renaming or post-hoc fitting.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: GPT-2 training trajectories expose the same local alignment and descent statistics that would be observed in any setting where geometric structure matters.