Aurora: A Leverage-Aware Spectral Optimizer

Alec Dewulf; Ashley Zhang; Ben Keigwin; Dhruv Pai; Li Yang

arxiv: 2606.27715 · v1 · pith:4KGQ6XSWnew · submitted 2026-06-26 · 💻 cs.LG

Aurora: A Leverage-Aware Spectral Optimizer

Alec Dewulf , Dhruv Pai , Li Yang , Ashley Zhang , Ben Keigwin This is my paper

Pith reviewed 2026-06-29 05:22 UTC · model grok-4.3

classification 💻 cs.LG

keywords Aurora optimizerMuon optimizerspectral optimizationrow normalizationMLP layerspre-trainingoptimizer geometryleverage-aware

0 comments

The pith

Aurora enforces row-uniformity of matrix parameter updates while respecting Muon's polar factor geometry.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

For tall matrix parameters such as projection matrices in MLP layers, the Muon update can produce arbitrarily non-uniform row norms. This creates a self-reinforcing loop in which some neurons receive persistently small updates and stop contributing to network outputs. Aurora applies an additional row normalization step that preserves the polar factor of the momentum matrix, unlike prior methods that shift away from this geometry. The approach outperforms Muon in pre-training runs, and the size of the gains increases with the MLP expansion factor.

Core claim

Aurora is a spectral optimizer that enforces row-uniformity of matrix parameter updates while respecting Muon's polar factor geometry. It addresses non-uniform row norms in tall matrices without deviating from the polar factor of the momentum matrix, which prior row-normalization techniques do. In pre-training experiments Aurora outperforms Muon, reaches state-of-the-art results among spectral optimizers when combined with existing methods, and shows gains that scale with MLP expansion factor.

What carries the argument

Row normalization step that preserves the polar factor of the momentum matrix while enforcing uniform row norms.

If this is right

Aurora outperforms Muon on pre-training tasks.
Combined with existing methods it reaches state-of-the-art results among spectral optimizers.
Performance gains over Muon increase with MLP expansion factor.
The method supports effective training of wider MLP layers.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same row-uniformity correction could be tested on other momentum-based matrix factorizations beyond Muon.
If the observed scaling continues, wider hidden dimensions may become trainable without extra regularization techniques.
The identified feedback loop may appear in any optimizer that applies matrix polar factors to tall parameter blocks.

Load-bearing premise

That moving the update away from the polar factor of the momentum matrix is undesirable and that preserving this geometry while enforcing row uniformity is both feasible and beneficial.

What would settle it

A controlled pre-training run on a model with large MLP expansion factor in which Aurora fails to outperform Muon or in which a version that drops polar-factor preservation performs better.

Figures

Figures reproduced from arXiv: 2606.27715 by Alec Dewulf, Ashley Zhang, Ben Keigwin, Dhruv Pai, Li Yang.

**Figure 2.** Figure 2: Loss curves and polar approximation error graphs for different NS iterations in our 340M [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗

**Figure 3.** Figure 3: Effect of unit row normalization on orthogonality for random [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: U-NorMuon achieves better downstream loss than both our Muon and NorMuon baselines [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: Each square on the grid corresponds to a row in the parameter matrix; darker neurons have [PITH_FULL_IMAGE:figures/full_fig_p010_5.png] view at source ↗

**Figure 6.** Figure 6: We plot the correlation between the leverage scores and row norms of the momentum buffer [PITH_FULL_IMAGE:figures/full_fig_p011_6.png] view at source ↗

**Figure 7.** Figure 7: Coefficient of variation (CV) of momentum row norms (left) and row leverage scores (right) [PITH_FULL_IMAGE:figures/full_fig_p011_7.png] view at source ↗

**Figure 8.** Figure 8: We find a large percentage of dead neurons in Rnj-1 on FineWeb-Edu in early layers. MLPs [PITH_FULL_IMAGE:figures/full_fig_p012_8.png] view at source ↗

**Figure 9.** Figure 9: Comparison of the alignment and row-norm coefficient-of-variation trade-off for Aurora, [PITH_FULL_IMAGE:figures/full_fig_p018_9.png] view at source ↗

**Figure 10.** Figure 10: modded-nanoGPT speedrun convergence curves. Aurora and Contra-Muon reaches the 3.28 validation loss target at step 3175, setting a new state-of-the-art on the optimization track. 18 [PITH_FULL_IMAGE:figures/full_fig_p018_10.png] view at source ↗

**Figure 11.** Figure 11: Aurora’s advantage over Muon scales with MLP expansion factor. We train a model with [PITH_FULL_IMAGE:figures/full_fig_p019_11.png] view at source ↗

**Figure 12.** Figure 12: Leverage scores for all 5632 neurons (up projection, layer 12) in a ReLU [PITH_FULL_IMAGE:figures/full_fig_p024_12.png] view at source ↗

read the original abstract

We show that for tall matrix parameters, like projection matrices in the MLP layers, the Muon update can have row norms that are arbitrarily non-uniform. This can lead to a self-reinforcing feedback loop whereby neurons receive persistently small updates and eventually do not contribute meaningfully to network outputs. This problem is effectively mitigated by an additional row normalization step, but current methods do this in a way that moves the Muon update geometry away from the polar factor of the momentum matrix, which we find is undesirable. We propose Aurora, an optimizer that enforces row-uniformity of matrix parameter updates while respecting Muon's polar factor geometry. Aurora outperforms Muon in our pre-training experiments and, when combined with existing methods, achieves state-of-the-art performance among spectral optimizers on the optimizer track of the modded-nanoGPT speedrun. Additionally, we find that Aurora's empirical gains over Muon scale with the MLP expansion factor, suggesting that Aurora may allow for effective training of very wide MLP layers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Aurora gives Muon a clean row-normalization fix that keeps the polar geometry and shows width-dependent gains in pre-training.

read the letter

The core contribution is a two-step procedure that pulls the polar factor from the momentum matrix first, then applies per-row scaling to enforce uniform norms without shifting the update direction. That combination is not in the Muon papers they cite, and the paper supplies the explicit algorithm plus ablations that isolate the geometry preservation step.

The experiments look straightforward: direct comparisons on pre-training runs, scaling curves with MLP expansion factor, and a combined result that hits the top of the modded-nanoGPT optimizer track. The gains growing with width is the most useful empirical signal; it suggests the fix matters more as layers get wider, which matches the stated motivation.

The main soft spot is that the baselines and statistical details are still narrow—mostly Muon variants and a couple of other spectral methods on one benchmark family. No obvious contradictions in the math or fitting artifacts, but the claims would be stronger with more diverse tasks or larger-scale runs.

This is useful reading for anyone tuning spectral optimizers on wide MLPs. It is not a broad framework shift, but the construction is concrete and the evidence is reproducible enough to warrant referee time. I would send it out for review.

Referee Report

0 major / 3 minor

Summary. The manuscript identifies non-uniform row norms in Muon updates for tall matrices (e.g., MLP projections) that can create a self-reinforcing loop of persistently small neuron updates. It proposes Aurora, which computes the polar factor of the momentum matrix and then applies per-row scaling to enforce uniformity while preserving the polar geometry; this is realized via an explicit two-step procedure. Experiments show Aurora outperforming Muon, with gains increasing with MLP expansion factor, and state-of-the-art results among spectral optimizers on the modded-nanoGPT speedrun when combined with existing methods.

Significance. If the central claims hold, the result is significant for spectral optimizers in large-scale training, especially wide MLPs. The paper explicitly credits the two-step construction that isolates geometry preservation, direct ablations separating the row-normalization effect from polar-factor changes, and the falsifiable scaling prediction with expansion factor. These elements strengthen the contribution beyond empirical reporting.

minor comments (3)

§3.2, Algorithm 1: the two-step procedure is clearly stated, but the pseudocode omits the handling of zero-norm rows after polar factorization; a one-sentence clarification would prevent ambiguity in implementation.
Table 3: the caption does not state the number of independent runs or whether error bars reflect standard deviation across seeds; this is needed to assess the reported gains over Muon.
§5.3: the claim that gains 'scale with the MLP expansion factor' is supported by the plotted trend, but the text does not discuss whether the trend continues beyond the tested range or saturates.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary of our work and for recommending minor revision. No specific major comments were provided in the report, so we have no point-by-point responses to address. We are happy to incorporate any minor changes the editor deems necessary based on the overall assessment.

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained construction

full rationale

The abstract presents Aurora as an explicit two-step construction (extract polar factor of momentum matrix, then apply per-row scaling) to enforce row uniformity while preserving Muon geometry. No equations, fitted parameters, or self-citations are shown as load-bearing for the central claim. The skeptic note confirms the manuscript supplies an explicit feasible procedure and ablations without reduction to inputs by definition. No self-definitional, fitted-prediction, or self-citation patterns are present in the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no free parameters, axioms, or invented entities are described or can be extracted.

pith-pipeline@v0.9.1-grok · 5704 in / 1018 out tokens · 17307 ms · 2026-06-29T05:22:30.451472+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

46 extracted references · 17 canonical work pages · 8 internal anchors

[1]

Laguna M.1/XS.2 Technical Report

Julien Abadji et al. Laguna m.1/xs.2 technical report.arXiv preprint arXiv:2605.27605, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[2]

Noah Amsel, David Persson, Christopher Musco, and Robert M. Gower. The polar express: Optimal matrix sign methods and their application to the muon algorithm, 2025

2025
[3]

PIQA: Reasoning about physical commonsense in natural language

Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. PIQA: Reasoning about physical commonsense in natural language. InProceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 7432–7439, 2020. doi: 10.1609/aaai.v34i05.6239. URL https://ojs.aaai.org/index.php/AAAI/article/view/6239

work page doi:10.1609/aaai.v34i05.6239 2020
[4]

MuonEq: Balancing Before Orthogonalization with Lightweight Equilibration

Da Chang, Qiankun Shi, Lvgang Zhang, Yu Li, Ruijie Zhang, Yao Lu, Yongxiang Liu, and Ganzhao Yuan. Muoneq: Balancing before orthogonalization with lightweight equilibration. arXiv preprint arXiv:2603.28254, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[5]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try ARC, the AI2 reasoning challenge.arXiv preprint arXiv:1803.05457, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[6]

DeepSeek-V3 Technical Report

DeepSeek-AI. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[7]

Deepseek-v4: Towards highly efficient million-token context intelligence, 2026

DeepSeek-AI. Deepseek-v4: Towards highly efficient million-token context intelligence, 2026

2026
[8]

RMNP: Row-momentum normalized preconditioning for scalable matrix-based optimization, 2026

Shenyang Deng, Zhuoli Ouyang, Tianyu Pang, Zihang Liu, Ruochen Jin, Shuhua Yu, and Yaoqing Yang. RMNP: Row-momentum normalized preconditioning for scalable matrix-based optimization, 2026

2026
[9]

Fernando Hernandez-Garcia, Qingfeng Lan, Parash Rahman, A

Shibhansh Dohare, J. Fernando Hernandez-Garcia, Qingfeng Lan, Parash Rahman, A. Rupam Mahmood, and Richard S. Sutton. Loss of plasticity in deep continual learning.Nature, 632: 768–774, 2024. doi: 10.1038/s41586-024-07711-7

work page doi:10.1038/s41586-024-07711-7 2024
[10]

A minimalist optimizer design for LLM pretraining, 2025

Athanasios Glentis, Jiaxiang Li, Andi Han, and Mingyi Hong. A minimalist optimizer design for LLM pretraining, 2025

2025
[11]

GLM-5: from Vibe Coding to Agentic Engineering

GLM-5-Team. Glm-5: from vibe coding to agentic engineering, 2026. URLhttps://arxiv. org/abs/2602.15763

work page internal anchor Pith review Pith/arXiv arXiv 2026
[12]

Accelerating newton-schulz iteration for orthogonalization via chebyshev-type polynomials, 2025

Ekaterina Grishina, Matvey Smirnov, and Maxim Rakhuba. Accelerating newton-schulz iteration for orthogonalization via chebyshev-type polynomials, 2025

2025
[13]

Measuring massive multitask language understanding

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. InInternational Conference on Learning Representations, 2021. URLhttps://openreview.net/forum?id=d7KBjmI3GmQ

2021
[14]

Muon: An optimizer for hidden layers in neural networks, 2024

Keller Jordan. Muon: An optimizer for hidden layers in neural networks, 2024. URLhttps: //kellerjordan.github.io/posts/muon/

2024
[15]

modded-nanogpt: Speedrunning the nanogpt baseline, 2024

Keller Jordan, Jeremy Bernstein, Brendan Rappazzo, @fernbear.bsky.social, Boza Vlado, You Jiacheng, Franz Cesista, Braden Koszarsky, and @Grad62304977. modded-nanogpt: Speedrunning the nanogpt baseline, 2024. URLhttps://github.com/KellerJordan/modded-nanogpt

2024
[16]

Convergence of muon with newton-schulz, 2026

Gyu Yeol Kim and Min hwan Oh. Convergence of muon with newton-schulz, 2026. URL https://arxiv.org/abs/2601.19156. 20

work page arXiv 2026
[17]

Kimi k2.5: Visual agentic intelligence, 2026

Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, et al. Kimi k2.5: Visual agentic intelligence, 2026

2026
[18]

Namhoon Lee, Thalaiyasingam Ajanthan, and Philip H. S. Torr. SNIP: Single-shot network pruning based on connection sensitivity. InInternational Conference on Learning Representations,
[19]

URLhttps://openreview.net/forum?id=B1VZqjAcYX
[20]

Intrinsic Muon: Spectral Optimization on Riemannian Matrix Manifolds

Yibang Li, Bihari Lal Pandey, Ravi Sah, Andi Han, Cyrus Mostajeran, Pratik Jawanpuria, and Bamdev Mishra. Intrinsic muon: Spectral optimization on riemannian matrix manifolds.arXiv preprint arXiv:2605.09238, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[21]

Normuon: Making muon more efficient and scalable, 2025

Zichong Li, Liming Liu, Chen Liang, Weizhu Chen, and Tuo Zhao. Normuon: Making muon more efficient and scalable, 2025. URLhttps://arxiv.org/abs/2510.05491

work page arXiv 2025
[22]

Muon is Scalable for LLM Training

Jingyuan Liu, Jianlin Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, Yidao Qin, Weixin Xu, Enzhe Lu, Junjie Yan, et al. Muon is scalable for llm training.arXiv preprint arXiv:2502.16982, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[23]

Dying ReLU and initialization: Theory and numerical examples.Communications in Computational Physics, 28(5):1671–1706,

Lu Lu, Yeonjong Shin, Yanhui Su, and George Em Karniadakis. Dying ReLU and initialization: Theory and numerical examples.Communications in Computational Physics, 28(5):1671–1706,
[24]

doi: 10.4208/cicp.OA-2020-0165

work page doi:10.4208/cicp.oa-2020-0165 2020
[25]

SWAN: SGD with normalization and whitening enables stateless LLM training, 2024

Chao Ma, Wenbo Gong, Meyer Scetbon, and Edward Meeds. SWAN: SGD with normalization and whitening enables stateless LLM training, 2024

2024
[26]

Can a suit of armor conduct electricity? a new dataset for open book question answering

Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. InProceedings of EMNLP, 2018

2018
[27]

Importanceestimation for neural network pruning

PavloMolchanov, ArunMallya, StephenTyree, IuriFrosio, andJanKautz. Importanceestimation for neural network pruning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11264–11272, 2019. doi: 10.1109/CVPR.2019.01152

work page doi:10.1109/cvpr.2019.01152 2019
[28]

Kimi k2.6, 2026

Moonshot AI. Kimi k2.6, 2026. Model card

2026
[29]

Contra-muon and soft-muon, May 2026

Nilin. Contra-muon and soft-muon, May 2026. URL https://nilin.github.io/ contra-muon-and-soft-muon/. First version: 2026-05-04; edited: 2026-05-14

2026
[30]

Nemotron-CC-v2, 2025

NVIDIA. Nemotron-CC-v2, 2025. URL https://huggingface.co/datasets/nvidia/ Nemotron-CC-v2. Dataset, version 1.0, released 2025-08-18

2025
[31]

Nemotron-cc-code-v1

NVIDIA Corporation. Nemotron-cc-code-v1. https://huggingface.co/datasets/nvidia/ Nemotron-CC-Code-v1, December 2025. Hugging Face dataset

2025
[32]

The lambada dataset: Word prediction requiring a broad discourse context

Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Ngoc Quan Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernández. The lambada dataset: Word prediction requiring a broad discourse context. InProceedings of ACL, 2016

2016
[33]

The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale

Guilherme Penedo, Hynek Kydlíček, Loubna Ben Allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro von Werra, and Thomas Wolf. The fineweb datasets: Decanting the web for the finest text data at scale, 2024. URLhttps://arxiv.org/abs/2406.17557

work page internal anchor Pith review Pith/arXiv arXiv 2024
[34]

Winogrande: An adversarial winograd schema challenge at scale.Proceedings of the AAAI Conference on Artificial Intelligence, 2020

Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale.Proceedings of the AAAI Conference on Artificial Intelligence, 2020. 21

2020
[35]

Gradient multi-normalization for stateless and scalable LLM training, 2025

Meyer Scetbon, Chao Ma, Wenbo Gong, and Edward Meeds. Gradient multi-normalization for stateless and scalable LLM training, 2025

2025
[36]

GLU variants improve transformer, 2020

Noam Shazeer. GLU variants improve transformer, 2020

2020
[37]

Le, Geoffrey E

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc V. Le, Geoffrey E. Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of- experts layer. InInternational Conference on Learning Representations, 2017

2017
[38]

Adamuon: Adaptive muon optimizer, 2025

Chongjie Si, Debing Zhang, and Wei Shen. Adamuon: Adaptive muon optimizer, 2025. URL https://arxiv.org/abs/2507.11005

work page arXiv 2025
[39]

Arcee trinity large technical report.arXiv preprint arXiv:2602.17004, 2026

Varun Singh, Lucas Krauss, Sami Jaghouar, Matej Sirovatka, Charles Goddard, Fares Obied, Jack Min Ong, Jannik Straube, Aria Harley, Conner Stewart, et al. Arcee trinity large technical report.arXiv preprint arXiv:2602.17004, 2026

work page arXiv 2026
[40]

The dormant neuron phenomenon in deep reinforcement learning

Ghada Sokar, Rishabh Agarwal, Pablo Samuel Castro, and Utku Evci. The dormant neuron phenomenon in deep reinforcement learning. InProceedings of the 40th International Conference on Machine Learning, volume 202 ofProceedings of Machine Learning Research, pages 32145– 32168. PMLR, 2023. URLhttps://proceedings.mlr.press/v202/sokar23a.html

2023
[41]

Rnj-1, 2025

Ashish Vaswani, Mike Callahan, Adarsh Chaluvaraju, Aleksa Gordić, Devaansh Gupta, Yash Jain, Divya Mansingka, Philip Monk, Khoi Nguyen, Mohit Parmar, Michael Pust, Tim Romanski, Peter Rushton, Ali Shehper, Divya Shivaprasad, Somanshu Singla, Kurt Smith, Saurabh Srivastava, Anil Thomas, Alok Tripathy, Yash Vanjani, Ameya Velingker, and Essential AI. Rnj-1,...

2025
[42]

Picking winning tickets before training by preserving gradient flow

Chaoqi Wang, Guodong Zhang, and Roger Grosse. Picking winning tickets before training by preserving gradient flow. InInternational Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=SkgsACVKPH

2020
[43]

SRON: State-free LLM training via row-wise gradient normalization, 2025

Zhenrui Wen, Yilei Shi, Jipeng Wang, Ping Luo, Liang Qiao, Dongsheng Li, and Tianxiang Sun. SRON: State-free LLM training via row-wise gradient normalization, 2025. URLhttps: //openreview.net/forum?id=BtQLBWr6zI

2025
[44]

On the width scaling of neural optimizers under matrix operator norms i: Row/column normalization and hyperparameter transfer, 2026

Ruihan Xu, Jiajin Li, and Yiping Lu. On the width scaling of neural optimizers under matrix operator norms i: Row/column normalization and hyperparameter transfer, 2026

2026
[45]

HellaSwag: Can a machine really finish your sentence? InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791–4800, 2019

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a machine really finish your sentence? InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791–4800, 2019. doi: 10.18653/v1/P19-1472. URLhttps://aclanthology.org/P19-1472

work page doi:10.18653/v1/p19-1472 2019
[46]

UltraData-Math, 2026

Chuyue Zhou, Hongya Lyu, Xinle Lin, Hengyu Zhao, Junshao Guo, Xueren Zhang, Shuaikang Xue, Qiang Ma, Jie Zhou, Yudong Wang, and Zhiyuan Liu. UltraData-Math, 2026. URL https://huggingface.co/datasets/openbmb/UltraData-Math. A U-NorMuon Ablation Study We ablate all the modifications U-NorMuon applies onto NorMuon individually. We use the term statefulto ref...

2026

[1] [1]

Laguna M.1/XS.2 Technical Report

Julien Abadji et al. Laguna m.1/xs.2 technical report.arXiv preprint arXiv:2605.27605, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[2] [2]

Noah Amsel, David Persson, Christopher Musco, and Robert M. Gower. The polar express: Optimal matrix sign methods and their application to the muon algorithm, 2025

2025

[3] [3]

PIQA: Reasoning about physical commonsense in natural language

Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. PIQA: Reasoning about physical commonsense in natural language. InProceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 7432–7439, 2020. doi: 10.1609/aaai.v34i05.6239. URL https://ojs.aaai.org/index.php/AAAI/article/view/6239

work page doi:10.1609/aaai.v34i05.6239 2020

[4] [4]

MuonEq: Balancing Before Orthogonalization with Lightweight Equilibration

Da Chang, Qiankun Shi, Lvgang Zhang, Yu Li, Ruijie Zhang, Yao Lu, Yongxiang Liu, and Ganzhao Yuan. Muoneq: Balancing before orthogonalization with lightweight equilibration. arXiv preprint arXiv:2603.28254, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[5] [5]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try ARC, the AI2 reasoning challenge.arXiv preprint arXiv:1803.05457, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[6] [6]

DeepSeek-V3 Technical Report

DeepSeek-AI. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[7] [7]

Deepseek-v4: Towards highly efficient million-token context intelligence, 2026

DeepSeek-AI. Deepseek-v4: Towards highly efficient million-token context intelligence, 2026

2026

[8] [8]

RMNP: Row-momentum normalized preconditioning for scalable matrix-based optimization, 2026

Shenyang Deng, Zhuoli Ouyang, Tianyu Pang, Zihang Liu, Ruochen Jin, Shuhua Yu, and Yaoqing Yang. RMNP: Row-momentum normalized preconditioning for scalable matrix-based optimization, 2026

2026

[9] [9]

Fernando Hernandez-Garcia, Qingfeng Lan, Parash Rahman, A

Shibhansh Dohare, J. Fernando Hernandez-Garcia, Qingfeng Lan, Parash Rahman, A. Rupam Mahmood, and Richard S. Sutton. Loss of plasticity in deep continual learning.Nature, 632: 768–774, 2024. doi: 10.1038/s41586-024-07711-7

work page doi:10.1038/s41586-024-07711-7 2024

[10] [10]

A minimalist optimizer design for LLM pretraining, 2025

Athanasios Glentis, Jiaxiang Li, Andi Han, and Mingyi Hong. A minimalist optimizer design for LLM pretraining, 2025

2025

[11] [11]

GLM-5: from Vibe Coding to Agentic Engineering

GLM-5-Team. Glm-5: from vibe coding to agentic engineering, 2026. URLhttps://arxiv. org/abs/2602.15763

work page internal anchor Pith review Pith/arXiv arXiv 2026

[12] [12]

Accelerating newton-schulz iteration for orthogonalization via chebyshev-type polynomials, 2025

Ekaterina Grishina, Matvey Smirnov, and Maxim Rakhuba. Accelerating newton-schulz iteration for orthogonalization via chebyshev-type polynomials, 2025

2025

[13] [13]

Measuring massive multitask language understanding

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. InInternational Conference on Learning Representations, 2021. URLhttps://openreview.net/forum?id=d7KBjmI3GmQ

2021

[14] [14]

Muon: An optimizer for hidden layers in neural networks, 2024

Keller Jordan. Muon: An optimizer for hidden layers in neural networks, 2024. URLhttps: //kellerjordan.github.io/posts/muon/

2024

[15] [15]

modded-nanogpt: Speedrunning the nanogpt baseline, 2024

Keller Jordan, Jeremy Bernstein, Brendan Rappazzo, @fernbear.bsky.social, Boza Vlado, You Jiacheng, Franz Cesista, Braden Koszarsky, and @Grad62304977. modded-nanogpt: Speedrunning the nanogpt baseline, 2024. URLhttps://github.com/KellerJordan/modded-nanogpt

2024

[16] [16]

Convergence of muon with newton-schulz, 2026

Gyu Yeol Kim and Min hwan Oh. Convergence of muon with newton-schulz, 2026. URL https://arxiv.org/abs/2601.19156. 20

work page arXiv 2026

[17] [17]

Kimi k2.5: Visual agentic intelligence, 2026

Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, et al. Kimi k2.5: Visual agentic intelligence, 2026

2026

[18] [18]

Namhoon Lee, Thalaiyasingam Ajanthan, and Philip H. S. Torr. SNIP: Single-shot network pruning based on connection sensitivity. InInternational Conference on Learning Representations,

[19] [19]

URLhttps://openreview.net/forum?id=B1VZqjAcYX

[20] [20]

Intrinsic Muon: Spectral Optimization on Riemannian Matrix Manifolds

Yibang Li, Bihari Lal Pandey, Ravi Sah, Andi Han, Cyrus Mostajeran, Pratik Jawanpuria, and Bamdev Mishra. Intrinsic muon: Spectral optimization on riemannian matrix manifolds.arXiv preprint arXiv:2605.09238, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[21] [21]

Normuon: Making muon more efficient and scalable, 2025

Zichong Li, Liming Liu, Chen Liang, Weizhu Chen, and Tuo Zhao. Normuon: Making muon more efficient and scalable, 2025. URLhttps://arxiv.org/abs/2510.05491

work page arXiv 2025

[22] [22]

Muon is Scalable for LLM Training

Jingyuan Liu, Jianlin Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, Yidao Qin, Weixin Xu, Enzhe Lu, Junjie Yan, et al. Muon is scalable for llm training.arXiv preprint arXiv:2502.16982, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[23] [23]

Dying ReLU and initialization: Theory and numerical examples.Communications in Computational Physics, 28(5):1671–1706,

Lu Lu, Yeonjong Shin, Yanhui Su, and George Em Karniadakis. Dying ReLU and initialization: Theory and numerical examples.Communications in Computational Physics, 28(5):1671–1706,

[24] [24]

doi: 10.4208/cicp.OA-2020-0165

work page doi:10.4208/cicp.oa-2020-0165 2020

[25] [25]

SWAN: SGD with normalization and whitening enables stateless LLM training, 2024

Chao Ma, Wenbo Gong, Meyer Scetbon, and Edward Meeds. SWAN: SGD with normalization and whitening enables stateless LLM training, 2024

2024

[26] [26]

Can a suit of armor conduct electricity? a new dataset for open book question answering

Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. InProceedings of EMNLP, 2018

2018

[27] [27]

Importanceestimation for neural network pruning

PavloMolchanov, ArunMallya, StephenTyree, IuriFrosio, andJanKautz. Importanceestimation for neural network pruning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11264–11272, 2019. doi: 10.1109/CVPR.2019.01152

work page doi:10.1109/cvpr.2019.01152 2019

[28] [28]

Kimi k2.6, 2026

Moonshot AI. Kimi k2.6, 2026. Model card

2026

[29] [29]

Contra-muon and soft-muon, May 2026

Nilin. Contra-muon and soft-muon, May 2026. URL https://nilin.github.io/ contra-muon-and-soft-muon/. First version: 2026-05-04; edited: 2026-05-14

2026

[30] [30]

Nemotron-CC-v2, 2025

NVIDIA. Nemotron-CC-v2, 2025. URL https://huggingface.co/datasets/nvidia/ Nemotron-CC-v2. Dataset, version 1.0, released 2025-08-18

2025

[31] [31]

Nemotron-cc-code-v1

NVIDIA Corporation. Nemotron-cc-code-v1. https://huggingface.co/datasets/nvidia/ Nemotron-CC-Code-v1, December 2025. Hugging Face dataset

2025

[32] [32]

The lambada dataset: Word prediction requiring a broad discourse context

Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Ngoc Quan Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernández. The lambada dataset: Word prediction requiring a broad discourse context. InProceedings of ACL, 2016

2016

[33] [33]

The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale

Guilherme Penedo, Hynek Kydlíček, Loubna Ben Allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro von Werra, and Thomas Wolf. The fineweb datasets: Decanting the web for the finest text data at scale, 2024. URLhttps://arxiv.org/abs/2406.17557

work page internal anchor Pith review Pith/arXiv arXiv 2024

[34] [34]

Winogrande: An adversarial winograd schema challenge at scale.Proceedings of the AAAI Conference on Artificial Intelligence, 2020

Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale.Proceedings of the AAAI Conference on Artificial Intelligence, 2020. 21

2020

[35] [35]

Gradient multi-normalization for stateless and scalable LLM training, 2025

Meyer Scetbon, Chao Ma, Wenbo Gong, and Edward Meeds. Gradient multi-normalization for stateless and scalable LLM training, 2025

2025

[36] [36]

GLU variants improve transformer, 2020

Noam Shazeer. GLU variants improve transformer, 2020

2020

[37] [37]

Le, Geoffrey E

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc V. Le, Geoffrey E. Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of- experts layer. InInternational Conference on Learning Representations, 2017

2017

[38] [38]

Adamuon: Adaptive muon optimizer, 2025

Chongjie Si, Debing Zhang, and Wei Shen. Adamuon: Adaptive muon optimizer, 2025. URL https://arxiv.org/abs/2507.11005

work page arXiv 2025

[39] [39]

Arcee trinity large technical report.arXiv preprint arXiv:2602.17004, 2026

Varun Singh, Lucas Krauss, Sami Jaghouar, Matej Sirovatka, Charles Goddard, Fares Obied, Jack Min Ong, Jannik Straube, Aria Harley, Conner Stewart, et al. Arcee trinity large technical report.arXiv preprint arXiv:2602.17004, 2026

work page arXiv 2026

[40] [40]

The dormant neuron phenomenon in deep reinforcement learning

Ghada Sokar, Rishabh Agarwal, Pablo Samuel Castro, and Utku Evci. The dormant neuron phenomenon in deep reinforcement learning. InProceedings of the 40th International Conference on Machine Learning, volume 202 ofProceedings of Machine Learning Research, pages 32145– 32168. PMLR, 2023. URLhttps://proceedings.mlr.press/v202/sokar23a.html

2023

[41] [41]

Rnj-1, 2025

Ashish Vaswani, Mike Callahan, Adarsh Chaluvaraju, Aleksa Gordić, Devaansh Gupta, Yash Jain, Divya Mansingka, Philip Monk, Khoi Nguyen, Mohit Parmar, Michael Pust, Tim Romanski, Peter Rushton, Ali Shehper, Divya Shivaprasad, Somanshu Singla, Kurt Smith, Saurabh Srivastava, Anil Thomas, Alok Tripathy, Yash Vanjani, Ameya Velingker, and Essential AI. Rnj-1,...

2025

[42] [42]

Picking winning tickets before training by preserving gradient flow

Chaoqi Wang, Guodong Zhang, and Roger Grosse. Picking winning tickets before training by preserving gradient flow. InInternational Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=SkgsACVKPH

2020

[43] [43]

SRON: State-free LLM training via row-wise gradient normalization, 2025

Zhenrui Wen, Yilei Shi, Jipeng Wang, Ping Luo, Liang Qiao, Dongsheng Li, and Tianxiang Sun. SRON: State-free LLM training via row-wise gradient normalization, 2025. URLhttps: //openreview.net/forum?id=BtQLBWr6zI

2025

[44] [44]

On the width scaling of neural optimizers under matrix operator norms i: Row/column normalization and hyperparameter transfer, 2026

Ruihan Xu, Jiajin Li, and Yiping Lu. On the width scaling of neural optimizers under matrix operator norms i: Row/column normalization and hyperparameter transfer, 2026

2026

[45] [45]

HellaSwag: Can a machine really finish your sentence? InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791–4800, 2019

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a machine really finish your sentence? InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791–4800, 2019. doi: 10.18653/v1/P19-1472. URLhttps://aclanthology.org/P19-1472

work page doi:10.18653/v1/p19-1472 2019

[46] [46]

UltraData-Math, 2026

Chuyue Zhou, Hongya Lyu, Xinle Lin, Hengyu Zhao, Junshao Guo, Xueren Zhang, Shuaikang Xue, Qiang Ma, Jie Zhou, Yudong Wang, and Zhiyuan Liu. UltraData-Math, 2026. URL https://huggingface.co/datasets/openbmb/UltraData-Math. A U-NorMuon Ablation Study We ablate all the modifications U-NorMuon applies onto NorMuon individually. We use the term statefulto ref...

2026