pith. machine review for the scientific record.

arxiv: 2603.20527 · v3 · submitted 2026-03-20 · 💻 cs.LG

Recognition: no theorem link

RMNP: Row-Momentum Normalized Preconditioning for Scalable Matrix-Based Optimization

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 07:46 UTC · model grok-4.3

classification 💻 cs.LG
keywords RMNP · preconditioning · Muon optimizer · row-wise normalization · transformer training · non-convex convergence · LLM pretraining · adaptive optimization

The pith

RMNP replaces Newton-Schulz orthogonalization with row-wise L2 normalization to match Muon performance at linear cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Preconditioned methods such as Muon capture curvature information that speeds up deep network training, but they rely on iterative orthogonalization whose cost for an m×n weight matrix is O(mn·min(m,n)). RMNP replaces this step with a direct row-wise ℓ2 normalization, applied along the input dimension, of the momentum-adjusted gradient. The substitution rests on the empirically observed block-diagonal structure of transformer Hessians, under which the two operations become asymptotically equivalent. The change lowers per-step preconditioning complexity to O(mn), and the paper proves non-convex convergence rates matching Muon's that attain minimax optimality. Experiments on large language model pretraining show that training curves remain competitive with Muon while preconditioning wall-clock time drops substantially.

Core claim

RMNP shows that row-momentum normalized preconditioning, implemented as a simple row-wise ℓ2 normalization along the input dimension, delivers optimization performance competitive with Muon while reducing preconditioning complexity from O(mn·min(m,n)) to O(mn) and preserving the same minimax-optimal convergence complexity for non-convex problems.

What carries the argument

Row-wise ℓ2 normalization of momentum-adjusted gradient matrices, used as a direct surrogate for Newton-Schulz orthogonalization under the observed diagonal-block Hessian structure of transformer layers.
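
To make the substitution concrete, here is a minimal sketch that sets a Muon-style Newton-Schulz orthogonalization next to a row-normalized momentum step of the kind the paper describes. It is an illustration, not the authors' released implementation: the function names, momentum coefficient, and learning rate are placeholders, and the quintic coefficients follow the public Muon reference code rather than anything specified in this paper.

```python
import torch

def newton_schulz_orthogonalize(M, steps=5, eps=1e-7):
    # Muon-style Newton-Schulz iteration (sketch): pushes the singular values of M
    # toward 1. Each step is a few m x n matrix products, so the whole routine costs
    # O(mn * min(m, n)). Assumes m <= n for brevity (Muon transposes otherwise).
    X = M / (M.norm() + eps)
    a, b, c = 3.4445, -4.7750, 2.0315  # quintic coefficients from the public Muon code
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X

def row_normalized_update(W, grad, momentum, beta=0.95, lr=0.02, eps=1e-8):
    # Hypothetical sketch of the RMNP idea: accumulate momentum, then normalize each
    # row of the momentum matrix (length d_in) to unit L2 norm, an O(mn) operation,
    # and take a step. Hyperparameters here are placeholders, not the paper's values.
    momentum.mul_(beta).add_(grad)
    row_norms = momentum.norm(dim=1, keepdim=True)
    W.add_(momentum / (row_norms + eps), alpha=-lr)
    return W, momentum
```

In a Muon-style loop the orthogonalized matrix would serve as the update direction in place of the row-normalized one; everything else stays the same, which is what lets the paper reuse Muon-style convergence arguments once the two operations are treated as equivalent.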

If this is right

  • Per-iteration preconditioning cost drops from O(mn·min(m,n)) to O(mn) for each m×n weight matrix.
  • Convergence guarantees in the non-convex setting remain identical to those established for Muon.
  • Wall-clock preconditioning time decreases while training progress on large language models stays comparable.
  • The method applies directly to any matrix-based update in deep networks that exhibit similar Hessian structure.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same normalization shortcut may work for other architectures once their Hessian block structure is verified.
  • Lower per-step overhead could allow larger batch sizes or longer context lengths on fixed hardware.
  • Full orthogonalization may be unnecessary overhead in many practical loss landscapes.
  • The approach invites direct comparisons with even simpler first-order normalizers such as Adam variants.

Load-bearing premise

Orthogonalization and row-wise L2 normalization become equivalent for the block-diagonal Hessians that arise in transformer layers.

What would settle it

A side-by-side run on a non-transformer architecture whose Hessian lacks the claimed block-diagonal structure, checking whether RMNP's optimization performance falls measurably behind Muon.
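
Short of that experiment, the premise can be probed directly with the diagnostic the paper itself tracks: the row-wise diagonal dominance ratio of equations (10)-(13) in its appendix (excerpted in the reference graph below). The sketch assumes $G = V_t V_t^\top$, which is consistent with the stated $G_{ii} = \|V_{t,i:}\|_2^2$; the function name and return format are illustrative, not the authors' code.

```python
import torch

def diagonal_dominance_ratios(V):
    # Row-wise diagonal dominance ratios as defined in the paper's appendix:
    #   r_i = G_ii / ( (1/(m-1)) * sum_{j != i} |G_ij| ),  with G_ii = ||V_i,:||_2^2.
    # Assumes G = V @ V.T, consistent with that definition of G_ii.
    m = V.shape[0]
    G = V @ V.T
    diag = G.diagonal()
    off_diag_mean = (G.abs().sum(dim=1) - diag.abs()) / (m - 1)
    r = diag / off_diag_mean
    return r.mean().item(), r.min().item(), r.max().item()  # r_avg, r_min, r_max
```

Ratios sitting well above 1, as in Figures 3, 4, and 7 through 10, are the empirical signature of the structure the equivalence argument leans on; an architecture where these ratios hover near or below 1 would be the natural stress test.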

Figures

Figures reproduced from arXiv: 2603.20527 by Ruochen Jin, Shenyang Deng, Shuhua Yu, Tianyu Pang, Yaoqing Yang, Zhuoli Ouyang, Zihang Liu.

Figure 1: Time overhead comparison. The figure illustrates the wall-clock time of 100 computation steps for the preconditioning process of RMNP versus Muon.

Figure 2: Comparison among Transformer layerwise Hessian, preconditioner for …

Figure 3: Per-parameter diagonal dominance ratios r_avg, r_min, r_max (rows) for three representative matrix parameters (columns) during GPT-2 Small (125M), GPT-2 Medium (355M), and GPT-2 Large (770M) pre-training. Transparent curves: raw values; solid curves: smoothed with window size 50. Red dashed line: y = 1 threshold.

Figure 4: Global diagonal dominance ratios r_avg, r_min, r_max averaged across all matrix parameters during GPT-2 Small (125M), GPT-2 Medium (355M), and GPT-2 Large (770M) pre-training. Y-axis in log scale. Transparent curves: raw values; solid curves: smoothed with window size 50. Red dashed line: y = 1 threshold. The metrics quickly rise above 1 after warm-up and remain mostly above 1, confirming strong diagonal domi…

Figure 6: Results for LLaMA: 60M trained with 1B to…

Figure 7: Per-parameter diagonal dominance ratios …

Figure 8: Global diagonal dominance ratios r_avg, r_min, r_max averaged across all matrix parameters during GPT-2 Small (125M), GPT-2 Medium (355M), and GPT-2 Large (770M) pre-training. Y-axis in log scale. Transparent curves: raw values; solid curves: smoothed with window size 50. Red dashed line: y = 1 threshold.

Figure 9: Per-parameter diagonal dominance ratios …

Figure 10: Global diagonal dominance ratios r_avg, r_min, r_max averaged across all matrix parameters during LLaMA 60M, LLaMA 130M, and LLaMA 350M pre-training. Y-axis in log scale. Transparent curves: raw values; solid curves: smoothed with window size 50. Red dashed line: y = 1 threshold.

Figure 11: Results for GPT-2 on FineWeb-Edu-100B: Small (125M) trained with 5B tokens; Medium (355M) …
original abstract

Preconditioned adaptive methods have gained significant attention for training deep neural networks, as they capture rich curvature information of the loss landscape. The central challenge in this field lies in balancing preconditioning effectiveness with computational efficiency of implementing the preconditioner. Among recent advances, Muon stands out by using Newton-Schulz iteration to obtain preconditioned updates without explicitly constructing the preconditioning matrix. Despite its advantages, the efficiency of Muon still leaves room for further improvement. In this paper, we introduce RMNP (Row Momentum Normalized Preconditioning), an optimizer that replaces Newton-Schulz iteration with a simple row-wise ($d_{\text{in}}$) $\ell_2$ normalization operation, motivated by the empirically observed diagonal block structure of the Transformer layerwise Hessian. We empirically verified that orthogonalization and row-wise (on input dim) $\ell_2$ normalization are asymptotically equivalent in the case of the transformer. This substitution reduces the per-iteration computational complexity from ${O}(mn\cdot\min(m,n))$ to ${O}(mn)$ for an $m\times n$ weight matrix while maintaining comparable optimization performance. Theoretically, we establish convergence guarantees for RMNP in the non-convex setting that match recent results for Muon optimizers, achieving the minimax optimal complexity. Extensive experiments on large language model pretraining show that RMNP delivers competitive optimization performance compared with Muon while substantially reducing preconditioning wall-clock time. Our code is available at https://github.com/Dominator-Index/RMNP.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces RMNP, an optimizer that replaces the Newton-Schulz iteration in Muon-style preconditioning with row-wise (d_in) ℓ2 normalization of the momentum-adjusted gradient of each weight matrix. Motivated by the empirically observed diagonal-block structure of the Transformer layerwise Hessian and an empirical verification that the two operations are asymptotically equivalent for transformers, RMNP reduces per-iteration complexity from O(mn min(m,n)) to O(mn). It establishes non-convex convergence guarantees matching recent Muon results at minimax-optimal complexity and reports competitive LLM pretraining performance with substantially lower preconditioning wall-clock time.

Significance. If the empirical equivalence between Newton-Schulz orthogonalization and row-wise normalization holds with small error on the targeted architectures, RMNP would provide a simpler, faster drop-in replacement for Muon while preserving its theoretical guarantees and practical effectiveness. The O(mn) complexity and matching convergence rate would be a meaningful efficiency gain for large-scale matrix-based optimization in deep learning.

major comments (3)
  1. [Motivation and §3 (Method)] The substitution of Newton-Schulz orthogonalization by row-wise ℓ2 normalization rests on the claim of asymptotic equivalence for transformers, justified by the observed diagonal-block Hessian structure. No quantitative bounds on ||U_NS - U_row|| (operator or Frobenius norm; a measurement sketch follows the comment list below) or formal proof that the difference vanishes under the Hessian assumption are supplied; this equivalence is load-bearing for the headline claim that RMNP matches Muon performance at reduced cost.
  2. [Theorem 1] Theorem 1 (convergence analysis): the proof is stated to be direct for RMNP and independent of the specific normalization once equivalence is granted, yet the manuscript does not clarify whether the analysis requires the preconditioner to be exactly orthogonal or only approximately so; without this, the minimax optimality claim for the deployed RMNP variant is not fully supported.
  3. [§5 (Experiments)] §5 (Experiments), timing tables: the reported wall-clock reductions lack error bars, standard deviations, or statistics over multiple independent runs, making it difficult to assess the reliability of the 'substantially reducing' preconditioning time claim.
minor comments (1)
  1. [Abstract] The abstract states code availability but omits license information or instructions for exact reproduction of the reported timing and performance numbers.
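
On major comment 1: a minimal sketch of how the requested gap could be tracked during training, reusing any Newton-Schulz routine (such as the one sketched under "What carries the argument"). The function name and the relative-Frobenius formulation are illustrative choices, not something the paper or the report specifies.

```python
import torch

def ns_vs_rownorm_gap(M, orthogonalize, eps=1e-8):
    # Relative Frobenius gap between the Newton-Schulz orthogonalized update and the
    # row-normalized update for the same momentum matrix M. Logging this per layer and
    # per step would quantify the approximation that the equivalence claim relies on.
    U_ns = orthogonalize(M)
    U_row = M / (M.norm(dim=1, keepdim=True) + eps)
    return ((U_ns - U_row).norm() / (U_ns.norm() + eps)).item()
```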

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments on our manuscript. We address each major comment point by point below, indicating the revisions we will incorporate.

point-by-point responses
  1. Referee: The substitution of Newton-Schulz orthogonalization by row-wise ℓ2 normalization rests on the claim of asymptotic equivalence for transformers, justified by the observed diagonal-block Hessian structure. No quantitative bounds on ||U_NS - U_row|| (operator or Frobenius norm) or formal proof that the difference vanishes under the Hessian assumption are supplied; this equivalence is load-bearing for the headline claim that RMNP matches Muon performance at reduced cost.

    Authors: We acknowledge that the manuscript relies on empirical verification rather than quantitative bounds or a formal proof of the asymptotic equivalence. The equivalence is motivated by the observed diagonal-block Hessian structure in transformers and is supported by our empirical checks showing near-identical behavior in practice. In the revised manuscript, we will expand §3 with additional quantitative measurements of ||U_NS - U_row|| (both Frobenius and operator norms) across layers, model scales, and training stages to better characterize the approximation error. A rigorous theoretical proof remains an open question for future work. revision: partial

  2. Referee: Theorem 1 (convergence analysis): the proof is stated to be direct for RMNP and independent of the specific normalization once equivalence is granted, yet the manuscript does not clarify whether the analysis requires the preconditioner to be exactly orthogonal or only approximately so; without this, the minimax optimality claim for the deployed RMNP variant is not fully supported.

    Authors: We agree that the manuscript should explicitly address the exact versus approximate orthogonality requirement. The proof of Theorem 1 follows the Muon analysis and assumes an exactly orthogonal preconditioner to obtain the stated minimax-optimal rates. In the revision, we will add a clarifying remark and short extension in the theorem statement and proof sketch noting that the guarantees extend to preconditioners with bounded deviation from orthogonality, with the additional error controlled by the empirical approximation quality observed for RMNP. This will support the practical claim of matching performance at reduced cost. revision: yes

  3. Referee: §5 (Experiments), timing tables: the reported wall-clock reductions lack error bars, standard deviations, or statistics over multiple independent runs, making it difficult to assess the reliability of the 'substantially reducing' preconditioning time claim.

    Authors: We agree that statistical reporting would improve the reliability assessment of the timing results. The current tables reflect single-run measurements. In the revised version, we will re-execute the preconditioning timing benchmarks over at least three independent runs and report means together with standard deviations in the tables of §5. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper presents the core substitution of Newton-Schulz orthogonalization by row-wise ℓ2 normalization as an empirical approximation justified by observed diagonal-block Hessian structure in transformers and direct verification of asymptotic equivalence on those models. Convergence guarantees are derived directly for the RMNP update rule and stated to match existing Muon results without any self-referential definitions, fitted parameters renamed as predictions, or load-bearing steps that reduce to the paper's own inputs by construction. The complexity reduction follows immediately from replacing the iterative orthogonalization with a single normalization pass. No self-citation chains or ansatzes smuggled via prior work appear in the provided derivation; the central claims remain independent of the specific normalization once the empirical equivalence is granted.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on two domain assumptions about transformer Hessians and the asymptotic equivalence of orthogonalization and row normalization; no free parameters or new invented entities are introduced.

axioms (2)
  • domain assumption Transformer layerwise Hessians exhibit a diagonal block structure.
    Used to motivate why row-wise normalization suffices.
  • domain assumption Orthogonalization and row-wise (input-dim) L2 normalization are asymptotically equivalent for transformers.
    Empirically verified in the paper and used to justify the substitution.

pith-pipeline@v0.9.0 · 5591 in / 1235 out tokens · 43328 ms · 2026-05-15T07:46:35.514216+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Nora: Normalized Orthogonal Row Alignment for Scalable Matrix Optimizer

    cs.LG 2026-05 unverdicted novelty 4.0

    Nora is a matrix optimizer that stabilizes weight norms and angular velocities through row-wise momentum projection onto the orthogonal complement of the weights while approximating structured preconditioning with O(m...

Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages · cited by 1 Pith paper · 7 internal anchors

  1. [1]

    Adaptive subgradient methods for online learning and stochastic optimization.Journal of Machine Learning Research, 12(61):2121–2159, 2011

    John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization.Journal of Machine Learning Research, 12(61):2121–2159, 2011

  2. [2]

    Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2):26–31, 2012

    Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2):26–31, 2012

  3. [3]

    Adam: A Method for Stochastic Optimization

    Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization.arXiv preprint arXiv:1412.6980, 2014

  4. [4]

    Decoupled weight decay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. InInternational Conference on Learning Representations, 2019

  5. [5]

    The Potential of Second-Order Optimization for LLMs: A Study with Full Gauss-Newton

    Natalie Abreu, Nikhil Vyas, Sham M. Kakade, and Depen Morwani. The potential of second-order optimization for LLMs: A study with full Gauss-Newton.arXiv preprint arXiv:2510.09378, 2025

  6. [6]

    Optimizing neural networks with Kronecker-factored approximate curvature

    James Martens and Roger Grosse. Optimizing neural networks with Kronecker-factored approximate curvature. InInternational Conference on Machine Learning (ICML), volume 37 ofProceedings of Machine Learning Research, pages 2408–2417. PMLR, 2015

  7. [7]

    Preconditioned stochastic gradient descent.IEEE Transactions on Neural Networks and Learning Systems, 29(5):1454–1466, 2018

    Xi-Lin Li. Preconditioned stochastic gradient descent.IEEE Transactions on Neural Networks and Learning Systems, 29(5):1454–1466, 2018

  8. [8]

    Shampoo: Preconditioned stochastic tensor optimization

    Vineet Gupta, Tomer Koren, and Yoram Singer. Shampoo: Preconditioned stochastic tensor optimization. InInternational Conference on Machine Learning (ICML), volume 80 ofProceedings of Machine Learning Research, pages 1842–1850. PMLR, 2018

  9. [9]

    Kronecker-factored quasi-newton methods for deep learning.arXiv preprint arXiv:2102.06737, 2021

    Yi Ren, Achraf Bahamou, and Donald Goldfarb. Kronecker-factored quasi-newton methods for deep learning.arXiv preprint arXiv:2102.06737, 2021

  10. [10]

    ASGO: Adaptive structured gradient optimization

    Kang An, Yuxing Liu, Rui Pan, Yi Ren, Shiqian Ma, Donald Goldfarb, and Tong Zhang. ASGO: Adaptive structured gradient optimization. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URLhttps://openreview.net/forum?id=fru52tkjHf

  11. [11]

    Muon: An optimizer for hidden layers in neural networks

    Keller Jordan, Yuchen Jin, Vlado Boza, Jiacheng You, Franz Cesista, Laker Newhouse, and Jeremy Bernstein. Muon: An optimizer for hidden layers in neural networks. https://kellerjordan.github.io/posts/muon/, 2024

  12. [12]

    Ran Tian and Ankur P. Parikh. Amos: An Adam-style optimizer with adaptive weight decay towards model-oriented scale.arXiv preprint arXiv:2210.11693, 2022

  13. [13]

    Nikhil Vyas, Depen Morwani, Rosie Zhao, Itai Shapira, David Brandfonbrener, Lucas Janson, and Sham M. Kakade. SOAP: Improving and stabilizing shampoo using adam for language modeling. InThe Thirteenth International Conference on Learning Representations, 2025. URLhttps://openreview. net/forum?id=IDxZhXrpNf

  14. [14]

    AdaMuon: Adaptive Muon optimizer.arXiv preprint arXiv:2507.11005, 2025

    Chongjie Si, Daquan Zhang, and Wei Shen. AdaMuon: Adaptive Muon optimizer.arXiv preprint arXiv:2507.11005, 2025

  15. [15]

    COSMOS: A hybrid adaptive optimizer for memory-efficient training of LLMs.arXiv preprint arXiv:2502.17410, 2025

    Liming Liu, Zhenghao Xu, Zixuan Zhang, Hao Kang, Zichong Li, Chen Liang, Weizhu Chen, and Tuo Zhao. COSMOS: A hybrid adaptive optimizer for memory-efficient training of LLMs.arXiv preprint arXiv:2502.17410, 2025

  16. [16]

    On the Convergence Analysis of Muon

    Wei Shen, Ruichuan Huang, Minhui Huang, Cong Shen, and Jiawei Zhang. On the convergence analysis of muon.arXiv preprint arXiv:2505.23737, 2025

  17. [17]

    Convergence of Muon with Newton-Schulz, 2026

    Gyu Yeol Kim and Min hwan Oh. Convergence of Muon with Newton-Schulz, 2026. URL https://arxiv.org/abs/2601.19156

  18. [18]

    Why transformers need adam: A hessian perspective.Advances in neural information processing systems, 37:131786–131823, 2024

    Yushun Zhang, Congliang Chen, Tian Ding, Ziniu Li, Ruoyu Sun, and Zhiquan Luo. Why transformers need adam: A hessian perspective.Advances in neural information processing systems, 37:131786–131823, 2024

  19. [19]

    Towards quantifying the Hessian structure of neural networks. arXiv preprint arXiv:2505.02809, 2025

    Zhaorui Dong, Yushun Zhang, Jianfeng Yao, and Ruoyu Sun. Towards quantifying the Hessian structure of neural networks. arXiv preprint arXiv:2505.02809, 2025

  20. [20]

    Lower bounds for non-convex stochastic optimization.Mathematical Programming, 199(1):165–214, 2023

    Yossi Arjevani, Yair Carmon, John C Duchi, Dylan J Foster, Nathan Srebro, and Blake Woodworth. Lower bounds for non-convex stochastic optimization.Mathematical Programming, 199(1):165–214, 2023

  21. [21]

    Black box lie group preconditioners for sgd.arXiv preprint arXiv:2211.04422, 2022

    Xi-Lin Li. Black box lie group preconditioners for sgd.arXiv preprint arXiv:2211.04422, 2022

  22. [22]

    A distributed data-parallel PyTorch implementation of the distributed Shampoo optimizer for training neural networks at-scale. arXiv preprint arXiv:2309.06497, 2023

    Hao-Jun Michael Shi, Tsung-Hsien Lee, Shintaro Iwasaki, Jose Gallego-Posada, Zhijing Li, Kaushik Rangadurai, Dheevatsa Mudigere, and Michael Rabbat. A distributed data-parallel PyTorch implementation of the distributed Shampoo optimizer for training neural networks at-scale. arXiv preprint arXiv:2309.06497, 2023

  23. [23]

    Muon is Scalable for LLM Training

    Jingyuan Liu, Jianlin Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, Yidao Qin, Weixin Xu, Enzhe Lu, Junjie Yan, et al. Muon is scalable for LLM training.arXiv preprint arXiv:2502.16982, 2025

  24. [24]

    Normuon: Making muon more efficient and scalable.arXiv preprint arXiv:2510.05491, 2025

    Zichong Li, Liming Liu, Chen Liang, Weizhu Chen, and Tuo Zhao. Normuon: Making muon more efficient and scalable.arXiv preprint arXiv:2510.05491, 2025

  25. [25]

    Fantastic pretraining optimizers and where to find them.arXiv preprint arXiv:2509.02046, 2025

    Kaiyue Wen, David Hall, Tengyu Ma, and Percy Liang. Fantastic pretraining optimizers and where to find them.arXiv preprint arXiv:2509.02046, 2025

  26. [26]

    Htmuon: Improving muon via heavy-tailed spectral correction, 2026

    Tianyu Pang, Yujie Fang, Zihang Liu, Shenyang Deng, Lei Hsiung, Shuhua Yu, and Yaoqing Yang. Htmuon: Improving muon via heavy-tailed spectral correction, 2026. URL https://arxiv.org/abs/2603.10067

  27. [27]

    Visualizing the loss landscape of neural nets.Advances in Neural Information Processing Systems, 31, 2018

    Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. Visualizing the loss landscape of neural nets.Advances in Neural Information Processing Systems, 31, 2018

  28. [28]

    Eigenvalues of the Hessian in Deep Learning: Singularity and Beyond

    Levent Sagun, Léon Bottou, and Yann LeCun. Eigenvalues of the Hessian in deep learning: Singularity and beyond.arXiv preprint arXiv:1611.07476, 2016

  29. [29]

    Empirical Analysis of the Hessian of Over-Parametrized Neural Networks

    Levent Sagun, Utku Evci, V Ugur Guney, Yann Dauphin, and Leon Bottou. Empirical analysis of the Hessian of over-parametrized neural networks.arXiv preprint arXiv:1706.04454, 2017

  30. [30]

    Suspicious alignment of SGD: A fine-grained step size condition analysis, 2026

    Shenyang Deng, Boyao Liao, Zhuoli Ouyang, Tianyu Pang, Minhak Song, and Yaoqing Yang. Suspicious alignment of SGD: A fine-grained step size condition analysis, 2026. URL https://arxiv.org/abs/2601.11789

  31. [31]

    Depth, not data: An analysis of hessian spectral bifurcation, 2026

    Shenyang Deng, Boyao Liao, Zhuoli Ouyang, Tianyu Pang, and Yaoqing Yang. Depth, not data: An analysis of hessian spectral bifurcation, 2026. URLhttps://arxiv.org/abs/2602.00545

  32. [32]

    An investigation into neural net optimization via hessian eigenvalue density

    Behrooz Ghorbani, Shankar Krishnan, and Ying Xiao. An investigation into neural net optimization via hessian eigenvalue density. InInternational Conference on Machine Learning, pages 2232–2241. PMLR, 2019

  33. [33]

    Dissecting Hessian: Understanding common structure of Hessian in neural networks

    Yikai Wu, Xingyu Zhu, Chenwei Wu, Annie Wang, and Rong Ge. Dissecting Hessian: Understanding common structure of Hessian in neural networks. InAdvances in Neural Information Processing Systems, volume 33, pages 10193–10204, 2020

  34. [34]

    Analytic insights into structure and rank of neural network Hessian maps.Advances in Neural Information Processing Systems, 34:23914–23927, 2021

    Sidak Pal Singh, Gregor Bachmann, and Thomas Hofmann. Analytic insights into structure and rank of neural network Hessian maps.Advances in Neural Information Processing Systems, 34:23914–23927, 2021

  35. [35]

    Hessian eigenspectra of more realistic nonlinear models.Advances in Neural Information Processing Systems, 34:20104–20117, 2021

    Zhenyu Liao and Michael W Mahoney. Hessian eigenspectra of more realistic nonlinear models.Advances in Neural Information Processing Systems, 34:20104–20117, 2021

  36. [36]

    Towards practical Adam: Non-convexity, convergence theory, and mini-batch acceleration. Journal of Machine Learning Research, 23(229):1–47, 2022

    Congliang Chen, Li Shen, Fangyu Zou, and Wei Liu. Towards practical Adam: Non-convexity, convergence theory, and mini-batch acceleration. Journal of Machine Learning Research, 23(229):1–47, 2022

  37. [37]

    On the O(√d/T^{1/4}) convergence rate of RMSProp and its momentum extension measured by ℓ1 norm. arXiv preprint arXiv:2402.00389, 2024

    Huan Li and Zhouchen Lin. On the O(√d/T^{1/4}) convergence rate of RMSProp and its momentum extension measured by ℓ1 norm. arXiv preprint arXiv:2402.00389, 2024

  38. [38]

    Adam exploits $\ell_\infty$-geometry of loss landscape via coordinate-wise adaptivity

    Shuo Xie, Mohamad Amin Mohamadi, and Zhiyuan Li. Adam exploits $\ell_\infty$-geometry of loss landscape via coordinate-wise adaptivity. InThe Thirteenth International Conference on Learning Representations, 2025. URLhttps://openreview.net/forum?id=PUnD86UEK5

  39. [39]

    Training deep learning models with norm-constrained LMOs

    Thomas Pethick, Wanyun Xie, Kimon Antonakopoulos, Zhenyu Zhu, Antonio Silveti-Falls, and Volkan Cevher. Training deep learning models with norm-constrained LMOs. InForty-second International Conference on Machine Learning, 2025. URLhttps://openreview.net/forum?id=2Oqm2IzTy9

  40. [40]

    OpenWebText corpus

    Aaron Gokaslan, Vanya Cohen, Ellie Pavlick, and Stefanie Tellex. OpenWebText corpus. http://Skylion007.github.io/OpenWebTextCorpus, 2019

  41. [41]

    The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale

    Guilherme Penedo, Hynek Kydlíček, Loubna Ben allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro Von Werra, and Thomas Wolf. The fineweb datasets: Decanting the web for the finest text data at scale, 2024. URLhttps://arxiv.org/abs/2406.17557

  42. [42]

    Exploring the limits of transfer learning with a unified text-to-text transformer.Journal of machine learning research, 21(140):1–67, 2020

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer.Journal of machine learning research, 21(140):1–67, 2020

  43. [43]

    MARS: Unleashing the power of variance reduction for training large models

    Huizhuo Yuan, Yifeng Liu, Shuang Wu, zhou Xun, and Quanquan Gu. MARS: Unleashing the power of variance reduction for training large models. InForty-second International Conference on Machine Learning, 2025. URLhttps://openreview.net/forum?id=NrcKQ3ASLZ

  44. [44]

    Alphadecay: Module-wise weight decay for heavy-tailed balancing in llms.arXiv preprint arXiv:2506.14562, 2025

    Di He, Songjun Tu, Ajay Jaiswal, Li Shen, Ganzhao Yuan, Shiwei Liu, and Lu Yin. Alphadecay: Module-wise weight decay for heavy-tailed balancing in llms.arXiv preprint arXiv:2506.14562, 2025

  45. [45]

    Fineweb-edu-100b-shuffle

    Andrej Karpathy. Fineweb-edu-100b-shuffle. https://huggingface.co/datasets/karpathy/fineweb-edu-100b-shuffle, 2024

  46. [46]

    Row-wise ratio calculation: For each row $i \in \{1, \dots, m\}$, we compute the ratio $r_i$ between the diagonal element and the average magnitude of the off-diagonal elements: $r_i = \frac{G_{ii}}{\frac{1}{m-1}\sum_{j \neq i} |G_{ij}|}$ (10), where $G_{ii} = \|V_{t,i:}\|_2^2$ is the squared norm of the $i$-th row of $V_t$.

  47. [47]

    Per-parameter aggregation: For each matrix parameter, we aggregate the row-wise ratios into three statistics: $r_{\mathrm{avg}} = \frac{1}{m}\sum_{i=1}^{m} r_i$ (11), $r_{\min} = \min_{i\in\{1,\dots,m\}} r_i$ (12), and $r_{\max} = \max_{i\in\{1,\dots,m\}} r_i$ (13).

  48. [48]

    Global aggregation: The global statistics $r_{\mathrm{avg}}$, $r_{\min}$, and $r_{\max}$ are computed by averaging the corresponding per-parameter metrics across all $K$ matrix parameters in the network: $r_{\mathrm{avg}} = \frac{1}{K}\sum_{k=1}^{K} r^{(k)}_{\mathrm{avg}}$ (14), $r_{\min} = \frac{1}{K}\sum_{k=1}^{K} r^{(k)}_{\min}$ (15), and $r_{\max} = \frac{1}{K}\sum_{k=1}^{K} r^{(k)}_{\max}$ (16), where the superscript $(k)$ denotes the metric for the $k$-th matrix parameter. Logging configuration: We use Weights & Biases (wandb) for metric tracking.