pith. sign in

arxiv: 2606.27715 · v1 · pith:4KGQ6XSWnew · submitted 2026-06-26 · 💻 cs.LG

Aurora: A Leverage-Aware Spectral Optimizer

Pith reviewed 2026-06-29 05:22 UTC · model grok-4.3

classification 💻 cs.LG
keywords Aurora optimizerMuon optimizerspectral optimizationrow normalizationMLP layerspre-trainingoptimizer geometryleverage-aware
0
0 comments X

The pith

Aurora enforces row-uniformity of matrix parameter updates while respecting Muon's polar factor geometry.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

For tall matrix parameters such as projection matrices in MLP layers, the Muon update can produce arbitrarily non-uniform row norms. This creates a self-reinforcing loop in which some neurons receive persistently small updates and stop contributing to network outputs. Aurora applies an additional row normalization step that preserves the polar factor of the momentum matrix, unlike prior methods that shift away from this geometry. The approach outperforms Muon in pre-training runs, and the size of the gains increases with the MLP expansion factor.

Core claim

Aurora is a spectral optimizer that enforces row-uniformity of matrix parameter updates while respecting Muon's polar factor geometry. It addresses non-uniform row norms in tall matrices without deviating from the polar factor of the momentum matrix, which prior row-normalization techniques do. In pre-training experiments Aurora outperforms Muon, reaches state-of-the-art results among spectral optimizers when combined with existing methods, and shows gains that scale with MLP expansion factor.

What carries the argument

Row normalization step that preserves the polar factor of the momentum matrix while enforcing uniform row norms.

If this is right

  • Aurora outperforms Muon on pre-training tasks.
  • Combined with existing methods it reaches state-of-the-art results among spectral optimizers.
  • Performance gains over Muon increase with MLP expansion factor.
  • The method supports effective training of wider MLP layers.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same row-uniformity correction could be tested on other momentum-based matrix factorizations beyond Muon.
  • If the observed scaling continues, wider hidden dimensions may become trainable without extra regularization techniques.
  • The identified feedback loop may appear in any optimizer that applies matrix polar factors to tall parameter blocks.

Load-bearing premise

That moving the update away from the polar factor of the momentum matrix is undesirable and that preserving this geometry while enforcing row uniformity is both feasible and beneficial.

What would settle it

A controlled pre-training run on a model with large MLP expansion factor in which Aurora fails to outperform Muon or in which a version that drops polar-factor preservation performs better.

Figures

Figures reproduced from arXiv: 2606.27715 by Alec Dewulf, Ashley Zhang, Ben Keigwin, Dhruv Pai, Li Yang.

Figure 1
Figure 1. Figure 1: Plotting CANS-12, quintic-5 and PE-8 for singular values between zero and one. Quintic-5 [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Loss curves and polar approximation error graphs for different NS iterations in our 340M [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Effect of unit row normalization on orthogonality for random [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: U-NorMuon achieves better downstream loss than both our Muon and NorMuon baselines [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Each square on the grid corresponds to a row in the parameter matrix; darker neurons have [PITH_FULL_IMAGE:figures/full_fig_p010_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: We plot the correlation between the leverage scores and row norms of the momentum buffer [PITH_FULL_IMAGE:figures/full_fig_p011_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Coefficient of variation (CV) of momentum row norms (left) and row leverage scores (right) [PITH_FULL_IMAGE:figures/full_fig_p011_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: We find a large percentage of dead neurons in Rnj-1 on FineWeb-Edu in early layers. MLPs [PITH_FULL_IMAGE:figures/full_fig_p012_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Comparison of the alignment and row-norm coefficient-of-variation trade-off for Aurora, [PITH_FULL_IMAGE:figures/full_fig_p018_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: modded-nanoGPT speedrun convergence curves. Aurora and Contra-Muon reaches the 3.28 validation loss target at step 3175, setting a new state-of-the-art on the optimization track. 18 [PITH_FULL_IMAGE:figures/full_fig_p018_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Aurora’s advantage over Muon scales with MLP expansion factor. We train a model with [PITH_FULL_IMAGE:figures/full_fig_p019_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Leverage scores for all 5632 neurons (up projection, layer 12) in a ReLU [PITH_FULL_IMAGE:figures/full_fig_p024_12.png] view at source ↗
read the original abstract

We show that for tall matrix parameters, like projection matrices in the MLP layers, the Muon update can have row norms that are arbitrarily non-uniform. This can lead to a self-reinforcing feedback loop whereby neurons receive persistently small updates and eventually do not contribute meaningfully to network outputs. This problem is effectively mitigated by an additional row normalization step, but current methods do this in a way that moves the Muon update geometry away from the polar factor of the momentum matrix, which we find is undesirable. We propose Aurora, an optimizer that enforces row-uniformity of matrix parameter updates while respecting Muon's polar factor geometry. Aurora outperforms Muon in our pre-training experiments and, when combined with existing methods, achieves state-of-the-art performance among spectral optimizers on the optimizer track of the modded-nanoGPT speedrun. Additionally, we find that Aurora's empirical gains over Muon scale with the MLP expansion factor, suggesting that Aurora may allow for effective training of very wide MLP layers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript identifies non-uniform row norms in Muon updates for tall matrices (e.g., MLP projections) that can create a self-reinforcing loop of persistently small neuron updates. It proposes Aurora, which computes the polar factor of the momentum matrix and then applies per-row scaling to enforce uniformity while preserving the polar geometry; this is realized via an explicit two-step procedure. Experiments show Aurora outperforming Muon, with gains increasing with MLP expansion factor, and state-of-the-art results among spectral optimizers on the modded-nanoGPT speedrun when combined with existing methods.

Significance. If the central claims hold, the result is significant for spectral optimizers in large-scale training, especially wide MLPs. The paper explicitly credits the two-step construction that isolates geometry preservation, direct ablations separating the row-normalization effect from polar-factor changes, and the falsifiable scaling prediction with expansion factor. These elements strengthen the contribution beyond empirical reporting.

minor comments (3)
  1. §3.2, Algorithm 1: the two-step procedure is clearly stated, but the pseudocode omits the handling of zero-norm rows after polar factorization; a one-sentence clarification would prevent ambiguity in implementation.
  2. Table 3: the caption does not state the number of independent runs or whether error bars reflect standard deviation across seeds; this is needed to assess the reported gains over Muon.
  3. §5.3: the claim that gains 'scale with the MLP expansion factor' is supported by the plotted trend, but the text does not discuss whether the trend continues beyond the tested range or saturates.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary of our work and for recommending minor revision. No specific major comments were provided in the report, so we have no point-by-point responses to address. We are happy to incorporate any minor changes the editor deems necessary based on the overall assessment.

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained construction

full rationale

The abstract presents Aurora as an explicit two-step construction (extract polar factor of momentum matrix, then apply per-row scaling) to enforce row uniformity while preserving Muon geometry. No equations, fitted parameters, or self-citations are shown as load-bearing for the central claim. The skeptic note confirms the manuscript supplies an explicit feasible procedure and ablations without reduction to inputs by definition. No self-definitional, fitted-prediction, or self-citation patterns are present in the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no free parameters, axioms, or invented entities are described or can be extracted.

pith-pipeline@v0.9.1-grok · 5704 in / 1018 out tokens · 17307 ms · 2026-06-29T05:22:30.451472+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

46 extracted references · 17 canonical work pages · 8 internal anchors

  1. [1]

    Laguna M.1/XS.2 Technical Report

    Julien Abadji et al. Laguna m.1/xs.2 technical report.arXiv preprint arXiv:2605.27605, 2026

  2. [2]

    Noah Amsel, David Persson, Christopher Musco, and Robert M. Gower. The polar express: Optimal matrix sign methods and their application to the muon algorithm, 2025

  3. [3]

    PIQA: Reasoning about physical commonsense in natural language

    Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. PIQA: Reasoning about physical commonsense in natural language. InProceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 7432–7439, 2020. doi: 10.1609/aaai.v34i05.6239. URL https://ojs.aaai.org/index.php/AAAI/article/view/6239

  4. [4]

    MuonEq: Balancing Before Orthogonalization with Lightweight Equilibration

    Da Chang, Qiankun Shi, Lvgang Zhang, Yu Li, Ruijie Zhang, Yao Lu, Yongxiang Liu, and Ganzhao Yuan. Muoneq: Balancing before orthogonalization with lightweight equilibration. arXiv preprint arXiv:2603.28254, 2026

  5. [5]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try ARC, the AI2 reasoning challenge.arXiv preprint arXiv:1803.05457, 2018

  6. [6]

    DeepSeek-V3 Technical Report

    DeepSeek-AI. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024

  7. [7]

    Deepseek-v4: Towards highly efficient million-token context intelligence, 2026

    DeepSeek-AI. Deepseek-v4: Towards highly efficient million-token context intelligence, 2026

  8. [8]

    RMNP: Row-momentum normalized preconditioning for scalable matrix-based optimization, 2026

    Shenyang Deng, Zhuoli Ouyang, Tianyu Pang, Zihang Liu, Ruochen Jin, Shuhua Yu, and Yaoqing Yang. RMNP: Row-momentum normalized preconditioning for scalable matrix-based optimization, 2026

  9. [9]

    Fernando Hernandez-Garcia, Qingfeng Lan, Parash Rahman, A

    Shibhansh Dohare, J. Fernando Hernandez-Garcia, Qingfeng Lan, Parash Rahman, A. Rupam Mahmood, and Richard S. Sutton. Loss of plasticity in deep continual learning.Nature, 632: 768–774, 2024. doi: 10.1038/s41586-024-07711-7

  10. [10]

    A minimalist optimizer design for LLM pretraining, 2025

    Athanasios Glentis, Jiaxiang Li, Andi Han, and Mingyi Hong. A minimalist optimizer design for LLM pretraining, 2025

  11. [11]

    GLM-5: from Vibe Coding to Agentic Engineering

    GLM-5-Team. Glm-5: from vibe coding to agentic engineering, 2026. URLhttps://arxiv. org/abs/2602.15763

  12. [12]

    Accelerating newton-schulz iteration for orthogonalization via chebyshev-type polynomials, 2025

    Ekaterina Grishina, Matvey Smirnov, and Maxim Rakhuba. Accelerating newton-schulz iteration for orthogonalization via chebyshev-type polynomials, 2025

  13. [13]

    Measuring massive multitask language understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. InInternational Conference on Learning Representations, 2021. URLhttps://openreview.net/forum?id=d7KBjmI3GmQ

  14. [14]

    Muon: An optimizer for hidden layers in neural networks, 2024

    Keller Jordan. Muon: An optimizer for hidden layers in neural networks, 2024. URLhttps: //kellerjordan.github.io/posts/muon/

  15. [15]

    modded-nanogpt: Speedrunning the nanogpt baseline, 2024

    Keller Jordan, Jeremy Bernstein, Brendan Rappazzo, @fernbear.bsky.social, Boza Vlado, You Jiacheng, Franz Cesista, Braden Koszarsky, and @Grad62304977. modded-nanogpt: Speedrunning the nanogpt baseline, 2024. URLhttps://github.com/KellerJordan/modded-nanogpt

  16. [16]

    Convergence of muon with newton-schulz, 2026

    Gyu Yeol Kim and Min hwan Oh. Convergence of muon with newton-schulz, 2026. URL https://arxiv.org/abs/2601.19156. 20

  17. [17]

    Kimi k2.5: Visual agentic intelligence, 2026

    Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, et al. Kimi k2.5: Visual agentic intelligence, 2026

  18. [18]

    Namhoon Lee, Thalaiyasingam Ajanthan, and Philip H. S. Torr. SNIP: Single-shot network pruning based on connection sensitivity. InInternational Conference on Learning Representations,

  19. [19]

    URLhttps://openreview.net/forum?id=B1VZqjAcYX

  20. [20]

    Intrinsic Muon: Spectral Optimization on Riemannian Matrix Manifolds

    Yibang Li, Bihari Lal Pandey, Ravi Sah, Andi Han, Cyrus Mostajeran, Pratik Jawanpuria, and Bamdev Mishra. Intrinsic muon: Spectral optimization on riemannian matrix manifolds.arXiv preprint arXiv:2605.09238, 2026

  21. [21]

    Normuon: Making muon more efficient and scalable, 2025

    Zichong Li, Liming Liu, Chen Liang, Weizhu Chen, and Tuo Zhao. Normuon: Making muon more efficient and scalable, 2025. URLhttps://arxiv.org/abs/2510.05491

  22. [22]

    Muon is Scalable for LLM Training

    Jingyuan Liu, Jianlin Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, Yidao Qin, Weixin Xu, Enzhe Lu, Junjie Yan, et al. Muon is scalable for llm training.arXiv preprint arXiv:2502.16982, 2025

  23. [23]

    Dying ReLU and initialization: Theory and numerical examples.Communications in Computational Physics, 28(5):1671–1706,

    Lu Lu, Yeonjong Shin, Yanhui Su, and George Em Karniadakis. Dying ReLU and initialization: Theory and numerical examples.Communications in Computational Physics, 28(5):1671–1706,

  24. [24]

    doi: 10.4208/cicp.OA-2020-0165

  25. [25]

    SWAN: SGD with normalization and whitening enables stateless LLM training, 2024

    Chao Ma, Wenbo Gong, Meyer Scetbon, and Edward Meeds. SWAN: SGD with normalization and whitening enables stateless LLM training, 2024

  26. [26]

    Can a suit of armor conduct electricity? a new dataset for open book question answering

    Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. InProceedings of EMNLP, 2018

  27. [27]

    Importanceestimation for neural network pruning

    PavloMolchanov, ArunMallya, StephenTyree, IuriFrosio, andJanKautz. Importanceestimation for neural network pruning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11264–11272, 2019. doi: 10.1109/CVPR.2019.01152

  28. [28]

    Kimi k2.6, 2026

    Moonshot AI. Kimi k2.6, 2026. Model card

  29. [29]

    Contra-muon and soft-muon, May 2026

    Nilin. Contra-muon and soft-muon, May 2026. URL https://nilin.github.io/ contra-muon-and-soft-muon/. First version: 2026-05-04; edited: 2026-05-14

  30. [30]

    Nemotron-CC-v2, 2025

    NVIDIA. Nemotron-CC-v2, 2025. URL https://huggingface.co/datasets/nvidia/ Nemotron-CC-v2. Dataset, version 1.0, released 2025-08-18

  31. [31]

    Nemotron-cc-code-v1

    NVIDIA Corporation. Nemotron-cc-code-v1. https://huggingface.co/datasets/nvidia/ Nemotron-CC-Code-v1, December 2025. Hugging Face dataset

  32. [32]

    The lambada dataset: Word prediction requiring a broad discourse context

    Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Ngoc Quan Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernández. The lambada dataset: Word prediction requiring a broad discourse context. InProceedings of ACL, 2016

  33. [33]

    The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale

    Guilherme Penedo, Hynek Kydlíček, Loubna Ben Allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro von Werra, and Thomas Wolf. The fineweb datasets: Decanting the web for the finest text data at scale, 2024. URLhttps://arxiv.org/abs/2406.17557

  34. [34]

    Winogrande: An adversarial winograd schema challenge at scale.Proceedings of the AAAI Conference on Artificial Intelligence, 2020

    Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale.Proceedings of the AAAI Conference on Artificial Intelligence, 2020. 21

  35. [35]

    Gradient multi-normalization for stateless and scalable LLM training, 2025

    Meyer Scetbon, Chao Ma, Wenbo Gong, and Edward Meeds. Gradient multi-normalization for stateless and scalable LLM training, 2025

  36. [36]

    GLU variants improve transformer, 2020

    Noam Shazeer. GLU variants improve transformer, 2020

  37. [37]

    Le, Geoffrey E

    Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc V. Le, Geoffrey E. Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of- experts layer. InInternational Conference on Learning Representations, 2017

  38. [38]

    Adamuon: Adaptive muon optimizer, 2025

    Chongjie Si, Debing Zhang, and Wei Shen. Adamuon: Adaptive muon optimizer, 2025. URL https://arxiv.org/abs/2507.11005

  39. [39]

    Arcee trinity large technical report.arXiv preprint arXiv:2602.17004, 2026

    Varun Singh, Lucas Krauss, Sami Jaghouar, Matej Sirovatka, Charles Goddard, Fares Obied, Jack Min Ong, Jannik Straube, Aria Harley, Conner Stewart, et al. Arcee trinity large technical report.arXiv preprint arXiv:2602.17004, 2026

  40. [40]

    The dormant neuron phenomenon in deep reinforcement learning

    Ghada Sokar, Rishabh Agarwal, Pablo Samuel Castro, and Utku Evci. The dormant neuron phenomenon in deep reinforcement learning. InProceedings of the 40th International Conference on Machine Learning, volume 202 ofProceedings of Machine Learning Research, pages 32145– 32168. PMLR, 2023. URLhttps://proceedings.mlr.press/v202/sokar23a.html

  41. [41]

    Rnj-1, 2025

    Ashish Vaswani, Mike Callahan, Adarsh Chaluvaraju, Aleksa Gordić, Devaansh Gupta, Yash Jain, Divya Mansingka, Philip Monk, Khoi Nguyen, Mohit Parmar, Michael Pust, Tim Romanski, Peter Rushton, Ali Shehper, Divya Shivaprasad, Somanshu Singla, Kurt Smith, Saurabh Srivastava, Anil Thomas, Alok Tripathy, Yash Vanjani, Ameya Velingker, and Essential AI. Rnj-1,...

  42. [42]

    Picking winning tickets before training by preserving gradient flow

    Chaoqi Wang, Guodong Zhang, and Roger Grosse. Picking winning tickets before training by preserving gradient flow. InInternational Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=SkgsACVKPH

  43. [43]

    SRON: State-free LLM training via row-wise gradient normalization, 2025

    Zhenrui Wen, Yilei Shi, Jipeng Wang, Ping Luo, Liang Qiao, Dongsheng Li, and Tianxiang Sun. SRON: State-free LLM training via row-wise gradient normalization, 2025. URLhttps: //openreview.net/forum?id=BtQLBWr6zI

  44. [44]

    On the width scaling of neural optimizers under matrix operator norms i: Row/column normalization and hyperparameter transfer, 2026

    Ruihan Xu, Jiajin Li, and Yiping Lu. On the width scaling of neural optimizers under matrix operator norms i: Row/column normalization and hyperparameter transfer, 2026

  45. [45]

    HellaSwag: Can a machine really finish your sentence? InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791–4800, 2019

    Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a machine really finish your sentence? InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791–4800, 2019. doi: 10.18653/v1/P19-1472. URLhttps://aclanthology.org/P19-1472

  46. [46]

    UltraData-Math, 2026

    Chuyue Zhou, Hongya Lyu, Xinle Lin, Hengyu Zhao, Junshao Guo, Xueren Zhang, Shuaikang Xue, Qiang Ma, Jie Zhou, Yudong Wang, and Zhiyuan Liu. UltraData-Math, 2026. URL https://huggingface.co/datasets/openbmb/UltraData-Math. A U-NorMuon Ablation Study We ablate all the modifications U-NorMuon applies onto NorMuon individually. We use the term statefulto ref...