pith. machine review for the scientific record.

arxiv: 2603.26554 · v2 · submitted 2026-03-27 · 💻 cs.LG · stat.ML

Recognition: 2 theorem links

· Lean Theorem

Sharp Capacity Scaling of Spectral Optimizers in Learning Associative Memory

Pith reviewed 2026-05-14 23:34 UTC · model grok-4.3

classification 💻 cs.LG stat.ML
keywords: Muon optimizer, spectral preconditioning, associative memory, storage capacity, power law distribution, logistic regression, SGD, Newton method

The pith

Muon matches the storage capacity of Newton's method while using only first-order information

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines spectral optimizers such as Muon through the linear associative memory problem, which models factual recall in transformers with Gaussian inputs and outputs. It sharply characterizes one-step recovery rates of Muon, SGD, and Newton's method applied to logistic regression loss under power-law frequency distributions. The central result establishes that Muon exceeds SGD in storage capacity and matches Newton's method despite using only first-order updates. Muon also sustains performance to larger critical batch sizes. Multi-step analysis under a thresholded gradient approximation shows faster initial recovery for Muon, with both methods reaching the information-theoretic limit at comparable later rates.

Core claim

Our main result sharply characterizes the recovery rates of one step of Muon, SGD, and Newton's method on the logistic regression loss under a power law frequency distribution. We show that the storage capacity of Muon significantly exceeds that of SGD, and even matches Newton's method while only using first-order information. Moreover, Muon saturates at a larger critical batch size. We further analyze the multi-step dynamics under a thresholded gradient approximation and show that Muon achieves a substantially faster initial recovery rate than SGD, while both methods eventually converge to the information-theoretic limit at comparable speeds.
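The one-step comparison in this claim can be illustrated with a small simulation. The following is a minimal sketch, not the paper's code: it assumes scores of the form v_j^T W u_i, a frequency-weighted softmax cross-entropy loss, initialization at W = 0, and an idealized Muon step that replaces the gradient with its orthogonal polar factor (no momentum, exact SVD). The function names and the default sizes are hypothetical.

```python
import numpy as np

def make_problem(d, N, alpha, seed=0):
    """Gaussian input/output embeddings and power-law frequencies (assumed setup)."""
    rng = np.random.default_rng(seed)
    U = rng.standard_normal((N, d)) / np.sqrt(d)  # row i is the input embedding u_i
    V = rng.standard_normal((N, d)) / np.sqrt(d)  # row i is the output embedding v_i
    p = np.arange(1, N + 1, dtype=float) ** (-alpha)
    return U, V, p / p.sum()

def gradient_at_zero(U, V, p):
    """Gradient at W = 0 of sum_i p_i * CE(softmax_j(v_j^T W u_i), i).

    At W = 0 the softmax is uniform, so the gradient equals
    -sum_i p_i (v_i - mean_j v_j) u_i^T.
    """
    return -((V - V.mean(axis=0)) * p[:, None]).T @ U

def recalled(W, U, V):
    """Number of items i whose top-scoring output under W is v_i."""
    scores = U @ W.T @ V.T  # scores[i, j] = v_j^T W u_i
    return int(np.sum(np.argmax(scores, axis=1) == np.arange(len(U))))

def one_step_recovery(d=32, N=400, alpha=1.5, seed=0):
    """Items recalled after one GD step and one idealized Muon-style step."""
    U, V, p = make_problem(d, N, alpha, seed)
    G = gradient_at_zero(U, V, p)
    # One GD step from W = 0; any positive step size gives the same argmax.
    W_gd = -G
    # Idealized Muon-style step: replace G by its orthogonal polar factor.
    A, _, Bt = np.linalg.svd(G)
    W_muon = -(A @ Bt)
    return recalled(W_gd, U, V), recalled(W_muon, U, V)

gd_count, muon_count = one_step_recovery()
print(f"recalled after one step: GD={gd_count}, Muon-style={muon_count}")
```

Sweeping d and alpha in such a sketch gives a rough feel for the comparison, though the paper's sharp rates concern asymptotic scaling that a single small run cannot confirm.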

What carries the argument

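The spectral preconditioner in Muon, which in its idealized form replaces the gradient matrix G with its orthogonal polar factor (equivalently, preconditions G by (G Gᵀ)^{-1/2}), so that signals from low-frequency associations in the power-law setting are amplified relative to the dominant high-frequency ones.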
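To make the mechanism concrete, the sketch below shows what orthogonalizing a gradient does to its spectrum. It is an illustration under stated assumptions: the matrix G is a hypothetical full-rank gradient with hand-picked singular values, the polar factor is computed by exact SVD (practical Muon implementations typically approximate it iteratively and add momentum), and the helper names are not from the paper.

```python
import numpy as np

def orthogonalize(G):
    """Polar factor of G: keep the singular vectors, set every singular value to 1."""
    A, _, Bt = np.linalg.svd(G, full_matrices=False)
    return A @ Bt

def inv_sqrt_precondition(G):
    """Compute (G G^T)^{-1/2} G by eigendecomposition; assumes G has full row rank."""
    evals, evecs = np.linalg.eigh(G @ G.T)
    return (evecs / np.sqrt(evals)) @ evecs.T @ G

rng = np.random.default_rng(0)
d = 4
# Hypothetical gradient with a steep spectrum: one strong direction, several weak ones.
A, _ = np.linalg.qr(rng.standard_normal((d, d)))
B, _ = np.linalg.qr(rng.standard_normal((d, d)))
s = np.array([10.0, 1.0, 0.1, 0.01])
G = A @ np.diag(s) @ B.T

O = orthogonalize(G)
print("singular values of G:", np.linalg.svd(G, compute_uv=False))
print("singular values of orthogonalized G:", np.linalg.svd(O, compute_uv=False))
print("matches (G G^T)^{-1/2} G:", np.allclose(O, inv_sqrt_precondition(G), atol=1e-6))
```

The weak directions, which a plain gradient step would move along only slightly, receive the same weight as the strongest one after orthogonalization.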

If this is right

  • Muon achieves substantially faster initial recovery than SGD in multi-step dynamics.
  • Muon and SGD both converge to the same information-theoretic limit after the initial phase.
  • The predicted scaling laws hold in synthetic task experiments.
  • Muon supports larger critical batch sizes before performance saturates.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Spectral preconditioning may explain empirical gains of Muon in full-scale language model training.
  • The analysis framework could be extended to study non-Gaussian embeddings or deeper transformer layers.
  • Similar first-order spectral methods might be designed for other associative or retrieval tasks.

Load-bearing premise

The linear associative memory problem with Gaussian inputs and outputs accurately models factual recall in transformer-based models.

What would settle it

An experiment measuring one-step recovery rates on logistic regression with Gaussian embeddings and power-law frequencies where Muon fails to exceed SGD capacity or match Newton's method would falsify the main scaling claim.
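Such an experiment ultimately compares scaling exponents, which are commonly read off with a log-log least-squares fit of capacity against dimension. The sketch below shows that fit on synthetic placeholder data; the numbers are generated from an arbitrary power law chosen for illustration and are not results from the paper, and `fit_exponent` is a hypothetical helper.

```python
import numpy as np

def fit_exponent(dims, capacities):
    """Fit capacity ~ prefactor * d^gamma by least squares in log-log space."""
    slope, intercept = np.polyfit(np.log(dims), np.log(capacities), 1)
    return slope, np.exp(intercept)

# Placeholder measurements: an arbitrary power law 3 * d^1.25 with mild multiplicative noise.
dims = np.array([64, 128, 256, 512, 1024], dtype=float)
rng = np.random.default_rng(0)
capacities = 3.0 * dims ** 1.25 * np.exp(0.02 * rng.standard_normal(dims.size))

gamma, prefactor = fit_exponent(dims, capacities)
print(f"fitted exponent: {gamma:.3f}, fitted prefactor: {prefactor:.3f}")
```

In practice one would replace the placeholder array with measured capacities averaged over several seeds, then compare the fitted exponent for each optimizer against the predicted one.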

Figures

Figures reproduced from arXiv: 2603.26554 by Alberto Bietti, Denny Wu, Eshaan Nichani, Jason D. Lee, Juno Kim.

Figure 1
Figure 1: (a) Capacity achieved by one Muon and GD step on the population objective; Muon improves the storage capacity when frequency is power-law distributed with exponent α > 1. (b) Critical batch size for the first Muon and SGD step (α = 1.5); the Muon capacity saturates at a much larger batch size than SGD. Theorem 1.2 (Informal version of Theorems 5.4, 5.5). Under the thresholded update, t steps of Muon recove…
Figure 2
Figure 2: Capacity scaling after one population Muon and GD step. We set N = 100,000 and vary d, α. Each experiment is repeated 16 times. For each α, we fit the dimension exponents of the mean capacity d^{C_α} (dashed lines), and then find the best fit of exponents C_α in the form of C_α = c_1 + c_2 α (solid lines). Observe that Muon achieves much higher storage than GD, and the exponents are consistent with Theorems 4.1,…
Figure 3
Figure 3: Capacity scaling after one Muon and SGD step on empirical loss. We set N = 100,000, α = 1.5, and vary the minibatch size B. Each experiment is repeated 16 times. The dashed red line indicates the information-theoretic rate, and the horizontal dashed lines in Figure 3b correspond to the d^{1 + 1/(2α)} ceiling; the predicted critical batch sizes are given by their intersections. Observe that Muon offers capacity …
Figure 4
Figure 4: Capacity after T Muon steps on the population cross-entropy loss. We set N = 250,000, η = 2√d. Figures 4a, 4b, 4c report the capacity at T = 2, 3, 4, respectively (see Figure 2b for T = 1); Figure 4d presents the capacity at large T: we run Muon for up to 500 steps and early stop when the capacity improvement over 10 steps drops below 0.5%. Figure 4e compares the fitted dimension exponents against predict…
Figure 5
Figure 5: Capacity scaling of multi-step Muon and GD. We set N = 100,000, α = 1.5. (a) Population update: for GD we implement an increasing learning rate schedule (see Theorem 5.5) with η_1 = 0.01√d; for Muon we use a fixed step size η = √d. Observe that the benefit of Muon is most visible in the “early phase” of training (the initial plateau of GD in the first 3 steps is due to small η_1 chosen for numerical stabi…
Figure 6
Figure 6: Capacity scaling after one (population) Muon and Newton step in the anisotropic setting: we choose u_i ∼ N(0, (1/d) I_d), v_i ∼ N(0, Ξ_v), where Ξ_v is a trace-normalized diagonal matrix with λ_i(Ξ_v) ≍ i^{-κ}, κ ≥ 0. We set N = 100,000 and vary d, α. For Newton’s method we add a ridge regularization λ = 10^{-8} for numerical stability when the preconditioner is rank-deficient. Observe that when κ = 0 (isotropic, Fi…
Figure 7
Figure 7: ID (left two) and OOD (right two) accuracy on the in-context recall task as a function of model dimension, for Muon, AdamW, and SGD, with batch size 256 at iterations 128 and 1024. For each (dim, optimizer) pair, the learning rate and batch size are chosen to maximize accuracy.
Figure 8
Figure 8: OOD and memory recall accuracy as a function of batch size B, for Muon, AdamW, and SGD (columns left to right), with different curves per model dimension, at iteration 1024. For Figure 8b, we use a two-layer transformer with no feed-forward layers to avoid redundancies between the value matrix and the subsequent MLP layer. For each (B, dim) pair, the learning rate is chosen to maximize accuracy.
Figure 9
Figure 9: OOD accuracy as a function of model dimension for Muon, AdamW, and SGD (columns left to right), with batch size 256 at iteration 512. Each curve corresponds to a different power-law exponent for (a) the output distribution α0; (b) the trigger distribution αt, with α = 0 being the uniform distribution and larger α concentrating probability mass on fewer tokens (where we expect adaptive optimizers to be ben…
Original abstract

Spectral optimizers such as Muon have recently shown strong empirical performance in large-scale language model training, but the source and extent of their advantage remain poorly understood. We study this question through the linear associative memory problem, a tractable model for factual recall in transformer-based models. In particular, we go beyond orthogonal embeddings and consider Gaussian inputs and outputs, which allows the number of stored associations to greatly exceed the embedding dimension. Our main result sharply characterizes the recovery rates of one step of Muon, SGD, and Newton's method on the logistic regression loss under a power law frequency distribution. We show that the storage capacity of Muon significantly exceeds that of SGD, and even matches Newton's method while only using first-order information. Moreover, Muon saturates at a larger critical batch size. We further analyze the multi-step dynamics under a thresholded gradient approximation and show that Muon achieves a substantially faster initial recovery rate than SGD, while both methods eventually converge to the information-theoretic limit at comparable speeds. Experiments on synthetic tasks validate the predicted scaling laws. Our analysis provides a quantitative understanding of the signal amplification of spectral preconditioners and lays the groundwork for establishing scaling laws across more practical language modeling tasks and optimizers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper studies spectral optimizers such as Muon through the linear associative memory problem with Gaussian inputs and outputs under power-law frequency distributions. It derives sharp one-step recovery rates for Muon, SGD, and Newton's method applied to logistic regression loss, showing that Muon achieves significantly higher storage capacity than SGD while matching Newton's method using only first-order information. The work further analyzes multi-step dynamics via a thresholded gradient approximation, claiming faster initial recovery for Muon, and validates the scaling laws on synthetic tasks.

Significance. If the one-step derivations hold, the results provide a quantitative explanation for the empirical advantages of spectral preconditioners in capacity scaling for associative memory tasks, which model factual recall in transformers. This could inform optimizer design and scaling laws beyond the current setting, particularly by highlighting how first-order spectral methods can approach second-order performance.

major comments (2)
  1. [§4] §4 (Multi-step dynamics): The thresholded-gradient approximation is invoked to argue faster initial Muon recovery and comparable asymptotic convergence, but no quantitative bound is provided on the accumulated approximation error over iterations, especially in the regime where the number of associations exceeds the embedding dimension and interacts with the power-law tail. This approximation error could systematically affect the claimed comparative advantage over SGD at the stated precision.
  2. [§3.1] §3.1, main recovery-rate theorems: The sharp characterizations for one-step Muon/SGD/Newton rely on the Gaussian association model; while the one-step formulas appear derived independently, the manuscript does not explicitly state the error-control assumptions needed to extend the capacity scaling claims beyond the orthogonal-embedding case to the non-orthogonal Gaussian setting.
minor comments (2)
  1. [§2] Notation for the power-law exponent and frequency distribution should be introduced with a dedicated definition early in §2 to avoid ambiguity when comparing recovery rates across methods.
  2. [Figure 3] Figure 3 (synthetic validation plots): axis labels and legend entries are too small for readability; consider increasing font size and adding error bars from multiple random seeds.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive review. We address the two major comments point by point below and will revise the manuscript to improve clarity on assumptions and to discuss the limitations of the multi-step approximation.

Point-by-point responses
  1. Referee: [§4] §4 (Multi-step dynamics): The thresholded-gradient approximation is invoked to argue faster initial Muon recovery and comparable asymptotic convergence, but no quantitative bound is provided on the accumulated approximation error over iterations, especially in the regime where the number of associations exceeds the embedding dimension and interacts with the power-law tail. This approximation error could systematically affect the claimed comparative advantage over SGD at the stated precision.

    Authors: We agree that a rigorous quantitative bound on the accumulated error would be desirable. Deriving such a bound is technically involved because of the dependence structure induced by the power-law frequencies and the iterative updates. In the revision we will add a paragraph in §4 that explicitly discusses this limitation, provides a heuristic error analysis based on the one-step concentration, and reports additional synthetic experiments confirming that the qualitative advantage of Muon over SGD persists even when moderate approximation errors are present. The primary capacity claims will continue to rest on the exact one-step results. revision: partial

  2. Referee: [§3.1] §3.1, main recovery-rate theorems: The sharp characterizations for one-step Muon/SGD/Newton rely on the Gaussian association model; while the one-step formulas appear derived independently, the manuscript does not explicitly state the error-control assumptions needed to extend the capacity scaling claims beyond the orthogonal-embedding case to the non-orthogonal Gaussian setting.

    Authors: The one-step theorems are derived directly for the non-orthogonal Gaussian model (as stated in the problem formulation and abstract). The proofs control deviations via standard sub-Gaussian concentration and random-matrix bounds that hold without orthogonality. We will revise §3.1 to state these error-control assumptions explicitly in the theorem statements and proof sketches, making the passage from the orthogonal case to the general Gaussian case transparent. revision: yes

Circularity Check

0 steps flagged

Derivations follow directly from the model equations; no step reduces to its inputs by construction.

full rationale

The paper's core results characterize one-step recovery rates for Muon, SGD, and Newton's method on logistic loss under Gaussian inputs and power-law frequencies by direct analysis of the linear associative memory model equations. The multi-step section invokes a thresholded-gradient approximation to compare initial rates, but presents this explicitly as an approximation rather than an exact claim that collapses to fitted parameters or self-referential definitions. No self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citations appear; the storage capacity comparisons follow from the stated assumptions independently of the target scaling laws.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central claims rest on modeling assumptions about Gaussian embeddings and power-law frequencies as inputs to the recovery rate derivations, plus the thresholded gradient approximation for multi-step analysis.

free parameters (1)
  • power law exponent
    The frequency distribution follows a power law whose specific exponent is an input to the scaling derivations.
axioms (2)
  • domain assumption: Linear associative memory with Gaussian inputs/outputs models factual recall in transformers
    Invoked to justify the tractable setup that allows more associations than embedding dimension.
  • domain assumption: Thresholded gradient approximation captures multi-step optimizer dynamics
    Used to analyze convergence rates beyond the single-step case.
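
The free parameter in this ledger, the power-law exponent, can be made concrete with a short sketch. It assumes frequencies of the form p_i proportional to i^(-α) over N ranked items, which is one common reading of a "power law frequency distribution"; the helper names, the tail-mass summary, and the batch sampler are illustrative choices rather than anything specified in the paper.

```python
import numpy as np

def power_law_frequencies(N, alpha):
    """Normalized frequencies p_i proportional to i^(-alpha) for ranks i = 1..N."""
    ranks = np.arange(1, N + 1, dtype=float)
    weights = ranks ** (-alpha)
    return weights / weights.sum()

def tail_mass(p, k):
    """Total probability of all items ranked after the top k."""
    return float(p[k:].sum())

def sample_batch(p, batch_size, seed=0):
    """Draw item indices i.i.d. from p, as a minibatch of associations would be."""
    rng = np.random.default_rng(seed)
    return rng.choice(len(p), size=batch_size, p=p)

p = power_law_frequencies(N=100_000, alpha=1.5)
for k in (10, 100, 1000):
    print(f"probability mass beyond the top {k} items: {tail_mass(p, k):.4f}")

batch = sample_batch(p, batch_size=256)
print("distinct items in a batch of 256:", np.unique(batch).size)
```

A larger α concentrates mass on the head, so rare items contribute very small terms to the population gradient and are seldom seen in a minibatch, which is the regime the capacity comparison is about.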

pith-pipeline@v0.9.0 · 5523 in / 1426 out tokens · 49086 ms · 2026-05-14T23:34:30.192673+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Factual recall in linear associative memories: sharp asymptotics and mechanistic insights

    stat.ML 2026-05 unverdicted novelty 7.0

    Linear associative memories store up to p_c log p_c / d^2 = 1/2 associations, with optimal weights pushing correct scores just above the extreme value of competing outputs.

  2. Phases of Muon: When Muon Eclipses SignSGD

    math.OC 2026-05 unverdicted novelty 7.0

    On power-law covariance least squares problems, SignSVD (Muon) and SignSGD (Adam proxy) show three phases of relative performance depending on data exponent α and target exponent β.

  3. Sharp Capacity Thresholds in Linear Associative Memory: From Winner-Take-All to Listwise Retrieval

    stat.ML 2026-05 unverdicted novelty 7.0

    Winner-take-all linear memory capacity scales as d² ~ n log n due to extreme values; listwise retrieval via Tail-Average Margin yields d² ~ n with exact asymptotic theory.

Reference graph

Works this paper leans on

64 extracted references · 64 canonical work pages · cited by 3 Pith papers · 7 internal anchors

  1. [1]

    Physics of language models: Part 3.3, knowledge capacity scaling laws.arXiv preprint arXiv:2404.05405, 2024

    Zeyuan Allen-Zhu and Yuanzhi Li. Physics of language models: Part 3.3, knowledge capacity scaling laws.arXiv preprint arXiv:2404.05405, 2024

  2. [2]

    Newton-Schulz, 2024

    Jeremy Bernstein. Newton-Schulz, 2024. URL https://docs.modula.systems/algorithms/ newton-schulz/. Modula documentation

  3. [3]

    Old optimizer, new norm: An anthology.arXiv preprint arXiv:2409.20325, 2024

    Jeremy Bernstein and Laker Newhouse. Old optimizer, new norm: An anthology.arXiv preprint arXiv:2409.20325, 2024

  4. [4]

    Birth of a transformer: A memory viewpoint.Advances in Neural Information Processing Systems, 2023

    Alberto Bietti, Vivien Cabannes, Diane Bouchacourt, Herve Jegou, and Leon Bottou. Birth of a transformer: A memory viewpoint.Advances in Neural Information Processing Systems, 2023

  5. [5]

    A dynamical model of neural scaling laws.arXiv preprint arXiv:2402.01092, 2024

    Blake Bordelon, Alexander Atanasov, and Cengiz Pehlevan. A dynamical model of neural scaling laws.arXiv preprint arXiv:2402.01092, 2024

  6. [6]

    Scaling laws for associative memories

    Vivien Cabannes, Elvis Dohmatob, and Alberto Bietti. Scaling laws for associative memories. InThe Twelfth International Conference on Learning Representations, 2024

  7. [7]

    Optimal rates for the regularized least-squares algorithm.Foundations of Computational mathematics, 7(3):331–368, 2007

    Andrea Caponnetto and Ernesto De Vito. Optimal rates for the regularized least-squares algorithm.Foundations of Computational mathematics, 7(3):331–368, 2007

  8. [8]

    Muon optimizes under spectral norm constraints

    Lizhang Chen, Jonathan Li, and Qiang Liu. Muon optimizes under spectral norm constraints. arXiv preprint arXiv:2506.15054, 2025

  9. [9]

    When do spectral gradient updates help in deep learning?arXiv preprint arXiv:2512.04299, 2025

    Damek Davis and Dmitriy Drusvyatskiy. When do spectral gradient updates help in deep learning?arXiv preprint arXiv:2512.04299, 2025

  10. [10]

    Adaptive subgradient methods for online learn- ing and stochastic optimization.Journal of Machine Learning Research, 12(61):2121–2159, 2011

    John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learn- ing and stochastic optimization.Journal of Machine Learning Research, 12(61):2121–2159, 2011. 20

  11. [11]

    Toy models of superposition.Transformer Circuits Thread, 2022

    Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, Roger Grosse, Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, and Christopher Olah. Toy models of superposition.Transformer Circuits Thread, 2022. https://transformer- circuits.pub/2022/t...

  12. [12]

    Implicit bias of spectral descent and Muon on multiclass separable data.arXiv preprint arXiv:2502.04664, 2025

    Chen Fan, Mark Schmidt, and Christos Thrampoulidis. Implicit bias of spectral descent and Muon on multiclass separable data.arXiv preprint arXiv:2502.04664, 2025

  13. [13]

    Dimension-adapted momentum outscales SGD.arXiv preprint arXiv:2505.16098, 2025

    Damien Ferbach, Katie Everett, Gauthier Gidel, Elliot Paquette, and Courtney Paquette. Dimension-adapted momentum outscales SGD.arXiv preprint arXiv:2505.16098, 2025

  14. [14]

    Transformer feed-forward layers are key-value memories

    Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. Transformer feed-forward layers are key-value memories. InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 5484–5495, 2021

  15. [15]

    Dissecting recall of fac- tual associations in auto-regressive language models

    Mor Geva, Jasmijn Bastings, Katja Filippova, and Amir Globerson. Dissecting recall of fac- tual associations in auto-regressive language models. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12216–12235, 2023

  16. [16]

    Insights on Muon from simple quadratics.arXiv preprint arXiv:2602.11948, 2026

    Antoine Gonon, Andreea-Alexandra Mus ¸at, and Nicolas Boumal. Insights on Muon from simple quadratics.arXiv preprint arXiv:2602.11948, 2026

  17. [17]

    Shampoo: Preconditioned stochastic tensor optimization

    Vineet Gupta, Tomer Koren, and Yoram Singer. Shampoo: Preconditioned stochastic tensor optimization. InInternational Conference on Machine Learning, pages 1842–1850. PMLR, 2018

  18. [18]

    SIAM, 2008

    Nicholas J Higham.Functions of matrices: Theory and computation. SIAM, 2008

  19. [19]

    Neural networks and physical systems with emergent collective computa- tional abilities.Proceedings of the national academy of sciences, 79(8):2554–2558, 1982

    John J Hopfield. Neural networks and physical systems with emergent collective computa- tional abilities.Proceedings of the national academy of sciences, 79(8):2554–2558, 1982

  20. [20]

    Adaptive matrix online learning through smoothing with guarantees for nonsmooth nonconvex optimization

    Ruichen Jiang, Zakaria Mhammedi, Mehryar Mohri, and Aryan Mokhtari. Adaptive matrix online learning through smoothing with guarantees for nonsmooth nonconvex optimization. arXiv preprint arXiv:2602.08232, 2026

  21. [21]

    Yibo Jiang, Goutham Rajendran, Pradeep Ravikumar, and Bryon Aragam. Do LLMs dream of elephants (when told not to)? Latent concept association and associative memory in trans- formers.Advances in Neural Information Processing Systems, 37:67712–67757, 2024

  22. [22]

    Muon: An optimizer for hidden layers in neural networks, 2024

    Keller Jordan, Yuchen Jin, Vlado Boza, Jiacheng You, Franz Cesista, Laker Newhouse, and Jeremy Bernstein. Muon: An optimizer for hidden layers in neural networks, 2024. URL https://kellerjordan.github.io/posts/muon/

  23. [23]

    Convergence of muon with newton-schulz, 2026

    Gyu Yeol Kim and Min-hwan Oh. Convergence of Muon with Newton-Schulz.arXiv preprint arXiv:2601.19156, 2026

  24. [24]

    Scaling laws of SignSGD in linear regression: When does it outperform SGD? InThe Fourteenth International Conference on Learning Representations, 2026

    Jihwan Kim, Dogyoon Song, and Chulhee Yun. Scaling laws of SignSGD in linear regression: When does it outperform SGD? InThe Fourteenth International Conference on Learning Representations, 2026. 21

  25. [25]

    Adam: A Method for Stochastic Optimization

    Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization.arXiv preprint arXiv:1412.6980, 2014

  26. [26]

    On Lipschitz functions of normal operators.Proceedings of the American Mathematical Society, 94(3):416–418, 1985

    Fuad Kittaneh. On Lipschitz functions of normal operators.Proceedings of the American Mathematical Society, 94(3):416–418, 1985. ISSN 00029939, 10886826

  27. [27]

    Correlation matrix memories.IEEE Transactions on Computers, C-21: 353–359, 1972

    Teuvo Kohonen. Correlation matrix memories.IEEE Transactions on Computers, C-21: 353–359, 1972. URL https://api.semanticscholar.org/CorpusID:21483100

  28. [28]

    Scaling laws for gradient descent and sign descent for linear bigram models under Zipf’s law.arXiv preprint arXiv:2505.19227, 2025

    Frederik Kunstner and Francis Bach. Scaling laws for gradient descent and sign descent for linear bigram models under Zipf’s law.arXiv preprint arXiv:2505.19227, 2025

  29. [29]

    Heavy- tailed class imbalance and why adam outperforms gradient descent on language models.Ad- vances in Neural Information Processing Systems, 37:30106–30148, 2024

    Frederik Kunstner, Alan Milligan, Robin Yadav, Mark Schmidt, and Alberto Bietti. Heavy- tailed class imbalance and why adam outperforms gradient descent on language models.Ad- vances in Neural Information Processing Systems, 37:30106–30148, 2024

  30. [30]

    Polargrad: A class of matrix-gradient optimizers from a unifying preconditioning perspective.arXiv preprint arXiv:2505.21799, 2025

    Tim Tsz-Kit Lau, Qi Long, and Weijie Su. Polargrad: A class of matrix-gradient optimizers from a unifying preconditioning perspective.arXiv preprint arXiv:2505.21799, 2025

  31. [31]

    Muon in associative memory learning: Training dynamics and scaling laws.arXiv preprint arXiv:2602.05725, 2026

    Binghui Li, Kaifei Wang, Han Zhong, Pinyan Lu, and Liwei Wang. Muon in associative memory learning: Training dynamics and scaling laws.arXiv preprint arXiv:2602.05725, 2026

  32. [32]

    Scaling laws in linear regression: Compute, parameters, and data.arXiv preprint arXiv:2406.08466, 2024

    Licong Lin, Jingfeng Wu, Sham M Kakade, Peter L Bartlett, and Jason D Lee. Scaling laws in linear regression: Compute, parameters, and data.arXiv preprint arXiv:2406.08466, 2024

  33. [33]

    Muon is Scalable for LLM Training

    Jingyuan Liu, Jianlin Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, Yidao Qin, Weixin Xu, Enzhe Lu, Junjie Yan, et al. Muon is scalable for LLM training.arXiv preprint arXiv:2502.16982, 2025

  34. [34]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization, 2019. URL https: //arxiv.org/abs/1711.05101

  35. [35]

    Preconditioning benefits of spectral orthogonalization in Muon.arXiv preprint arXiv:2601.13474, 2026

    Jianhao Ma, Yu Huang, Yuejie Chi, and Yuxin Chen. Preconditioning benefits of spectral orthogonalization in Muon.arXiv preprint arXiv:2601.13474, 2026

  36. [36]

    Optimizing neural networks with kronecker-factored ap- proximate curvature

    James Martens and Roger Grosse. Optimizing neural networks with kronecker-factored ap- proximate curvature. InInternational conference on machine learning, pages 2408–2417. PMLR, 2015

  37. [37]

    Locating and editing factual associations in GPT.Advances in neural information processing systems, 35:17359–17372, 2022

    Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in GPT.Advances in neural information processing systems, 35:17359–17372, 2022

  38. [38]

    The quantization model of neural scaling.Advances in Neural Information Processing Systems, 36, 2023

    Eric Michaud, Ziming Liu, Uzay Girit, and Max Tegmark. The quantization model of neural scaling.Advances in Neural Information Processing Systems, 36, 2023

  39. [39]

    Understanding factual recall in transformers via associative memories.arXiv preprint arXiv:2412.06538, 2024

    Eshaan Nichani, Jason D Lee, and Alberto Bietti. Understanding factual recall in transformers via associative memories.arXiv preprint arXiv:2412.06538, 2024. 22

  40. [40]

    In-context Learning and Induction Heads

    Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Scott Johnston, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Ka- plan, Sam McCandlish,...

  41. [41]

    4+3 phases of compute-optimal neural scaling laws.Advances in Neural Information Processing Systems, 2024

    Elliot Paquette, Courtney Paquette, Lechao Xiao, and Jeffrey Pennington. 4+3 phases of compute-optimal neural scaling laws.Advances in Neural Information Processing Systems, 2024

  42. [42]

    Piantadosi

    Steven T. Piantadosi. Zipf’s word frequency law in natural language: A critical review and future directions.Psychonomic Bulletin & Review, 21:1112–1130, 2014

  43. [43]

    arXiv:2504.19983 , year=

    Yunwei Ren, Eshaan Nichani, Denny Wu, and Jason D Lee. Emergence and scaling laws in SGD learning of shallow neural networks.arXiv preprint arXiv:2504.19983, 2025

  44. [44]

    Adam Roberts, Colin Raffel, and Noam Shazeer. How much knowledge can you pack into the parameters of a language model? InProceedings of the 2020 conference on empirical methods in natural language processing (EMNLP), pages 5418–5426, 2020

  45. [45]

    The smallest singular value of a random rectangular matrix

    Mark Rudelson and Roman Vershynin. The smallest singular value of a random rectangular matrix.arXiv preprint arXiv:0802.3956, 2009

  46. [46]

    Non-asymptotic theory of random matrices: extreme singular values

    Mark Rudelson and Roman Vershynin. Non-asymptotic theory of random matrices: extreme singular values. InProceedings of the International Congress of Mathematicians 2010 (ICM 2010), pages 1576–1602, 2010

  47. [47]

    Transformers, parallel computation, and logarithmic depth

    Clayton Sanford, Daniel Hsu, and Matus Telgarsky. Transformers, parallel computation, and logarithmic depth.arXiv preprint arXiv:2402.09268, 2024

  48. [48]

    Benchmarking optimizers for large language model pretraining.arXiv preprint arXiv:2509.01440, 2025

    Andrei Semenov, Matteo Pagliardini, and Martin Jaggi. Benchmarking optimizers for large language model pretraining.arXiv preprint arXiv:2509.01440, 2025

  49. [49]

    On the Convergence Analysis of Muon

    Wei Shen, Ruichuan Huang, Minhui Huang, Cong Shen, and Jiawei Zhang. On the conver- gence analysis of Muon.arXiv preprint arXiv:2505.23737, 2025

  50. [50]

    Isotropic curvature model for understanding deep learning optimization: Is gradi- ent orthogonalization optimal?arXiv preprint arXiv:2511.00674, 2025

    Weijie Su. Isotropic curvature model for understanding deep learning optimization: Is gradi- ent orthogonalization optimal?arXiv preprint arXiv:2511.00674, 2025

  51. [51]

    How Muon’s spectral design benefits generalization: A study on imbalanced data

    Bhavya Vasudeva, Puneesh Deora, Yize Zhao, Vatsal Sharan, and Christos Thrampoulidis. How Muon’s spectral design benefits generalization: A study on imbalanced data. arXiv preprint arXiv:2510.22980, 2025

  52. [52]

    High-dimensional probability: An introduction with applications in data science

    R. Vershynin. High-dimensional probability: An introduction with applications in data science. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2nd edition, 2018

  53. [53]

    Learning to recall with transformers beyond orthogonal embeddings

    Nuri Mert Vural, Alberto Bietti, Mahdi Soltanolkotabi, and Denny Wu. Learning to recall with transformers beyond orthogonal embeddings. In International Conference on Learning Representations, 2026

  54. [54]

    SOAP: Improving and Stabilizing Shampoo using Adam

    Nikhil Vyas, Depen Morwani, Rosie Zhao, Mujin Kwun, Itai Shapira, David Brandfonbrener, Lucas Janson, and Sham Kakade. SOAP: Improving and stabilizing Shampoo using Adam. arXiv preprint arXiv:2409.11321, 2024

  55. [55]

    High-dimensional statistics: A non-asymptotic viewpoint

    Martin J. Wainwright. High-dimensional statistics: A non-asymptotic viewpoint. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2019

  56. [56]

    High-dimensional isotropic scaling dynamics of Muon and SGD

    Guangyuan Wang, Elliot Paquette, and Atish Agarwala. High-dimensional isotropic scaling dynamics of Muon and SGD. In OPT 2025: Optimization for Machine Learning, 2025

  57. [57]

    Muon outperforms Adam in tail-end associative memory learning

    Shuche Wang, Fengzhuo Zhang, Jiaxiang Li, Cunxiao Du, Chao Du, Tianyu Pang, Zhuoran Yang, Mingyi Hong, and Vincent YF Tan. Muon outperforms Adam in tail-end associative memory learning. arXiv preprint arXiv:2509.26030, 2025

  58. [58]

    Learning compositional functions with transformers from easy-to-hard data

    Zixuan Wang, Eshaan Nichani, Alberto Bietti, Alex Damian, Daniel Hsu, Jason D Lee, and Denny Wu. Learning compositional functions with transformers from easy-to-hard data. arXiv preprint arXiv:2505.23683, 2025

  59. [59]

    Fantastic pretraining optimizers and where to find them

    Kaiyue Wen, David Hall, Tengyu Ma, and Percy Liang. Fantastic pretraining optimizers and where to find them. arXiv preprint arXiv:2509.02046, 2025

  60. [60]

    Non-holographic associative memory

    David J Willshaw, O Peter Buneman, and Hugh Christopher Longuet-Higgins. Non-holographic associative memory. Nature, 222(5197):960–962, 1969

  61. [61]

    Structured preconditioners in adaptive optimization: A unified analysis

    Shuo Xie, Tianhao Wang, Sashank Reddi, Sanjiv Kumar, and Zhiyuan Li. Structured preconditioners in adaptive optimization: A unified analysis. arXiv preprint arXiv:2503.10537, 2025

  62. [62]

    Provable benefit of sign descent: A minimal model under heavy-tailed class imbalance

    Robin Yadav, Shuo Xie, Tianhao Wang, and Zhiyuan Li. Provable benefit of sign descent: A minimal model under heavy-tailed class imbalance. arXiv preprint arXiv:2512.00763, 2025

    Contents: 1 Introduction; 2 Related Work; 3 Setting: Associative Memory; 4 One Step of Muon; 4.1 One-step recovery of Muon ...

  63. [63]

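    Now suppose $B \gtrsim d^{\alpha}$

    It follows that $\operatorname{rank}(M) < \frac{d}{2}$ and so $\lambda_{d/2}(M) = 0$. Now suppose $B \gtrsim d^{\alpha}$. Choose a positive integer $K \asymp \frac{1}{d} B^{1/\alpha}$ and define the sets $I_k := \{(k-1)d + \frac{d}{2}, \dots, kd + \frac{d}{2} - 1\}$ for $k \ge 1$. Consider the decomposition \[ M = \underbrace{\sum_{i=1}^{d/2-1} q_i^2 u_i u_i^\top}_{=: M_0} + \sum_{k \in [K]} \underbrace{\sum_{i \in I_k} q_i^2 u_i u_i^\top}_{=: M_k} + \underbrace{\sum_{i=(K+1/2)d}^{N} q_i^2 u_i u_i^\top}_{=: M_{\mathrm{tail}}}. \] Since $\operatorname{rank}(M_0) < \frac{d}{2}$, we have $\lambda_{d/2}(M_0)$ ...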

  64. [64]

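    $\sup_{(u,v) \in T} X_{u,v} \Big] \le \mathbb{E}$

    Observe that each $K_{ij:n}$ is a multilinear polynomial of degree at most $2n$ in the entries $u_{k\ell}, v_{k\ell}$, thus by Gaussian hypercontractivity, \[ \mathbb{E}\big[K_{ij:n}^{L}\big]^{1/L} \le (L-1)^{n}\, \mathbb{E}\big[K_{ij:n}^{2}\big]^{1/2} \lesssim \sqrt{dL}\, r\, (C\rho L)^{n-1/2} =: \frac{t}{\sqrt{L}}. \] By Markov’s inequality, \[ \Pr(|K_{ij:n}| > t) \le t^{-L}\, \mathbb{E}\big[K_{ij:n}^{L}\big] \lesssim L^{-L/2} = d^{-\omega(1)}. \] Therefore, union bounding over all $1 \le i, j \le d$ with $i \ne j$ and $n \lesssim (\log d)^{2}$, we conclude: $|\tilde{K}_{ij}$...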