Recognition: no theorem link
RMNP: Row-Momentum Normalized Preconditioning for Scalable Matrix-Based Optimization
Pith reviewed 2026-05-15 07:46 UTC · model grok-4.3
The pith
RMNP replaces Newton-Schulz orthogonalization with row-wise ℓ2 normalization, matching Muon's optimization performance at a preconditioning cost linear in the size of each weight matrix.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
RMNP shows that row-momentum normalized preconditioning, implemented as simple row-wise ℓ2 normalization over the input dimension, delivers optimization performance competitive with Muon while reducing preconditioning complexity from O(mn min(m,n)) to O(mn) and preserving the same minimax-optimal convergence complexity for non-convex problems.
What carries the argument
Row-wise ℓ2 normalization of momentum-adjusted gradient matrices, used as a direct surrogate for Newton-Schulz orthogonalization under the observed diagonal-block Hessian structure of transformer layers.
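To make the mechanism concrete, here is a minimal sketch of one such update step, assuming a Muon-style momentum buffer and taking dim=1 as the input (d_in) axis; the function name and hyperparameter values are illustrative rather than taken from the paper's released code.

```python
# Sketch of a row-momentum-normalized update for one m x n weight matrix.
# Every operation touches each entry a constant number of times, so the
# preconditioning cost is O(mn), versus O(mn * min(m, n)) for Newton-Schulz.
import torch

def rmnp_style_step(weight, grad, momentum, lr=0.02, beta=0.95, eps=1e-8):
    momentum.mul_(beta).add_(grad)                  # momentum-adjusted gradient V_t
    row_norms = momentum.norm(dim=1, keepdim=True)  # one l2 norm per row (d_in axis, assumed dim=1)
    weight.add_(momentum / (row_norms + eps), alpha=-lr)
    return weight, momentum
```

Whether the momentum form and the eps placement match the released RMNP code is an assumption; the point is only that a per-row normalization replaces the iterative orthogonalization pass.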
If this is right
- Per-iteration preconditioning cost drops from quadratic to linear in the dimensions of each weight matrix.
- Convergence guarantees in the non-convex setting remain identical to those established for Muon.
- Wall-clock preconditioning time decreases while training progress on large language models stays comparable.
- The method applies directly to any matrix-based update in deep networks that exhibit similar Hessian structure.
Where Pith is reading between the lines
- The same normalization shortcut may work for other architectures once their Hessian block structure is verified.
- Lower per-step overhead could allow larger batch sizes or longer context lengths on fixed hardware.
- Full orthogonalization may be unnecessary overhead in many practical loss landscapes.
- The approach invites direct comparisons with even simpler first-order normalizers such as Adam variants.
Load-bearing premise
Orthogonalization and row-wise ℓ2 normalization become asymptotically equivalent for the block-diagonal Hessians that arise in transformer layers.
What would settle it
A side-by-side run on a non-transformer architecture whose Hessian lacks the claimed block-diagonal structure, checking whether RMNP's optimization performance falls measurably behind Muon.
Original abstract
Preconditioned adaptive methods have gained significant attention for training deep neural networks, as they capture rich curvature information of the loss landscape. The central challenge in this field lies in balancing preconditioning effectiveness with computational efficiency of implementing the preconditioner. Among recent advances, Muon stands out by using Newton-Schulz iteration to obtain preconditioned updates without explicitly constructing the preconditioning matrix. Despite its advantages, the efficiency of Muon still leaves room for further improvement. In this paper, we introduce RMNP (Row Momentum Normalized Preconditioning), an optimizer that replaces Newton-Schulz iteration with a simple row-wise ($d_{\text{in}}$) $\ell_2$ normalization operation, motivated by the empirically observed diagonal block structure of the Transformer layerwise Hessian. We empirically verified that orthogonalization and row-wise (on input dim) $\ell_2$ normalization are asymptotically equivalent in the case of the transformer. This substitution reduces the per-iteration computational complexity from ${O}(mn\cdot\min(m,n))$ to ${O}(mn)$ for an $m\times n$ weight matrix while maintaining comparable optimization performance. Theoretically, we establish convergence guarantees for RMNP in the non-convex setting that match recent results for Muon optimizers, achieving the minimax optimal complexity. Extensive experiments on large language model pretraining show that RMNP delivers competitive optimization performance compared with Muon while substantially reducing preconditioning wall-clock time. Our code is available at https://github.com/Dominator-Index/RMNP.
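For contrast with the abstract's complexity claim, a minimal sketch of the Newton-Schulz orthogonalization pass that Muon applies and RMNP removes; the quintic coefficients below follow the publicly available Muon reference implementation, not this paper, and are included only to show where the O(mn·min(m,n)) term comes from.

```python
# Newton-Schulz orthogonalization as used by Muon-style optimizers (sketch).
# Each iteration forms min(m, n) x min(m, n) Gram products, which is the
# source of the O(mn * min(m, n)) per-iteration cost.
import torch

def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    a, b, c = 3.4445, -4.7750, 2.0315   # coefficients from the Muon reference code (assumption)
    X = G / (G.norm() + eps)            # scale so singular values lie in [0, 1]
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T                         # keep the Gram matrix at the smaller dimension
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X
```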
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces RMNP, an optimizer that replaces the Newton-Schulz iteration in Muon-style preconditioning with row-wise (d_in) ℓ2 normalization of the momentum-based update for each weight matrix. Motivated by the empirically observed diagonal-block structure of the Transformer layerwise Hessian and an empirical verification that the two operations are asymptotically equivalent for transformers, RMNP reduces per-iteration complexity from O(mn min(m,n)) to O(mn). The paper establishes non-convex convergence guarantees matching recent Muon results at minimax-optimal complexity and reports competitive LLM pretraining performance with substantially lower preconditioning wall-clock time.
Significance. If the empirical equivalence between Newton-Schulz orthogonalization and row-wise normalization holds with small error on the targeted architectures, RMNP would provide a simpler, faster drop-in replacement for Muon while preserving its theoretical guarantees and practical effectiveness. The O(mn) complexity and matching convergence rate would be a meaningful efficiency gain for large-scale matrix-based optimization in deep learning.
major comments (3)
- [Motivation and §3 (Method)] The substitution of Newton-Schulz orthogonalization by row-wise ℓ2 normalization rests on the claim of asymptotic equivalence for transformers, justified by the observed diagonal-block Hessian structure. No quantitative bounds on ||U_NS - U_row|| (operator or Frobenius norm) or formal proof that the difference vanishes under the Hessian assumption are supplied; this equivalence is load-bearing for the headline claim that RMNP matches Muon performance at reduced cost.
- [Theorem 1] Convergence analysis: the proof is stated to be direct for RMNP and independent of the specific normalization once equivalence is granted, yet the manuscript does not clarify whether the analysis requires the preconditioner to be exactly orthogonal or only approximately so; without this, the minimax optimality claim for the deployed RMNP variant is not fully supported.
- [§5 (Experiments)] Timing tables: the reported wall-clock reductions lack error bars, standard deviations, or statistics over multiple independent runs, making it difficult to assess the reliability of the 'substantially reducing' preconditioning time claim.
minor comments (1)
- [Abstract] The abstract states code availability but omits license information or instructions for exact reproduction of the reported timing and performance numbers.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments on our manuscript. We address each major comment point by point below, indicating the revisions we will incorporate.
Point-by-point responses
-
Referee: The substitution of Newton-Schulz orthogonalization by row-wise ℓ2 normalization rests on the claim of asymptotic equivalence for transformers, justified by the observed diagonal-block Hessian structure. No quantitative bounds on ||U_NS - U_row|| (operator or Frobenius norm) or formal proof that the difference vanishes under the Hessian assumption are supplied; this equivalence is load-bearing for the headline claim that RMNP matches Muon performance at reduced cost.
Authors: We acknowledge that the manuscript relies on empirical verification rather than quantitative bounds or a formal proof of the asymptotic equivalence. The equivalence is motivated by the observed diagonal-block Hessian structure in transformers and is supported by our empirical checks showing near-identical behavior in practice. In the revised manuscript, we will expand §3 with additional quantitative measurements of ||U_NS - U_row|| (both Frobenius and operator norms) across layers, model scales, and training stages to better characterize the approximation error. A rigorous theoretical proof remains an open question for future work. revision: partial
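A sketch of the kind of per-layer measurement this response promises, assuming the two candidate updates U_ns (Newton-Schulz output) and U_row (row-normalized output) have already been computed for the same momentum matrix; the function name and the choice of a relative Frobenius error are illustrative.

```python
# Quantify the gap between the orthogonalized and row-normalized updates
# for one layer, in both Frobenius and operator (spectral) norm.
import torch

def equivalence_gap(U_ns, U_row):
    diff = U_ns - U_row
    fro = torch.linalg.matrix_norm(diff, ord="fro")
    return {
        "frobenius": fro.item(),
        "operator": torch.linalg.matrix_norm(diff, ord=2).item(),
        "relative_frobenius": (fro / torch.linalg.matrix_norm(U_ns, ord="fro")).item(),
    }
```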
-
Referee: Theorem 1 (convergence analysis): the proof is stated to be direct for RMNP and independent of the specific normalization once equivalence is granted, yet the manuscript does not clarify whether the analysis requires the preconditioner to be exactly orthogonal or only approximately so; without this, the minimax optimality claim for the deployed RMNP variant is not fully supported.
Authors: We agree that the manuscript should explicitly address the exact versus approximate orthogonality requirement. The proof of Theorem 1 follows the Muon analysis and assumes an exactly orthogonal preconditioner to obtain the stated minimax-optimal rates. In the revision, we will add a clarifying remark and short extension in the theorem statement and proof sketch noting that the guarantees extend to preconditioners with bounded deviation from orthogonality, with the additional error controlled by the empirical approximation quality observed for RMNP. This will support the practical claim of matching performance at reduced cost. revision: yes
-
Referee: §5 (Experiments), timing tables: the reported wall-clock reductions lack error bars, standard deviations, or statistics over multiple independent runs, making it difficult to assess the reliability of the 'substantially reducing' preconditioning time claim.
Authors: We agree that statistical reporting would improve the reliability assessment of the timing results. The current tables reflect single-run measurements. In the revised version, we will re-execute the preconditioning timing benchmarks over at least three independent runs and report means together with standard deviations in the tables of §5. revision: yes
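A minimal sketch of that protocol, assuming the preconditioning step can be isolated as a callable; the run and iteration counts, and the example step in the trailing comment, are placeholders.

```python
# Time a preconditioning step over several independent runs and report
# mean and standard deviation, as the revised Section 5 tables would.
import statistics
import time
import torch

def time_preconditioner(step_fn, G, runs=3, iters=100):
    per_run = []
    for _ in range(runs):
        if G.is_cuda:
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            step_fn(G)
        if G.is_cuda:
            torch.cuda.synchronize()
        per_run.append((time.perf_counter() - start) / iters)
    return statistics.mean(per_run), statistics.stdev(per_run)

# Example (placeholder step):
# time_preconditioner(lambda M: M / (M.norm(dim=1, keepdim=True) + 1e-8), torch.randn(4096, 4096))
```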
Circularity Check
No significant circularity in derivation chain
Full rationale
The paper presents the core substitution of Newton-Schulz orthogonalization by row-wise ℓ2 normalization as an empirical approximation justified by observed diagonal-block Hessian structure in transformers and direct verification of asymptotic equivalence on those models. Convergence guarantees are derived directly for the RMNP update rule and stated to match existing Muon results without any self-referential definitions, fitted parameters renamed as predictions, or load-bearing steps that reduce to the paper's own inputs by construction. The complexity reduction follows immediately from replacing the iterative orthogonalization with a single normalization pass. No self-citation chains or ansatzes smuggled via prior work appear in the provided derivation; the central claims remain independent of the specific normalization once the empirical equivalence is granted.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Transformer layerwise Hessians exhibit a diagonal block structure.
- domain assumption Orthogonalization and row-wise (input-dim) ℓ2 normalization are asymptotically equivalent for transformers.
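One concrete way to probe these assumptions, following the row-wise diagonal-dominance ratio described in the paper's appendix: for the Gram matrix G = V Vᵀ of a momentum matrix V, compare each diagonal entry with the mean magnitude of the off-diagonal entries in its row; ratios well above 1 indicate nearly orthogonal rows, the regime in which orthogonalization and row-wise ℓ2 normalization coincide. A minimal sketch (function name illustrative):

```python
# Row-wise diagonal-dominance ratios r_i = G_ii / mean_{j != i} |G_ij| for
# G = V @ V.T, following the diagnostic described in the paper's appendix.
# Large ratios mean the rows of V are close to orthogonal.
import torch

def row_dominance_ratios(V):
    G = V @ V.T                                       # (m x m) Gram matrix of rows
    m = G.shape[0]
    diag = G.diagonal()                               # G_ii = ||V_i,:||_2^2
    off_mean = (G.abs().sum(dim=1) - diag.abs()) / (m - 1)
    r = diag / off_mean
    return r.mean().item(), r.min().item(), r.max().item()
```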
Forward citations
Cited by 1 Pith paper
- Nora: Normalized Orthogonal Row Alignment for Scalable Matrix Optimizer. Nora is a matrix optimizer that stabilizes weight norms and angular velocities through row-wise momentum projection onto the orthogonal complement of the weights while approximating structured preconditioning with O(m...
Reference graph
Works this paper leans on
- [1] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(61):2121–2159, 2011.
- [2] Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2):26–31, 2012.
- [3] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- [4] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019.
- [5] Natalie Abreu, Nikhil Vyas, Sham M. Kakade, and Depen Morwani. The potential of second-order optimization for LLMs: A study with full Gauss-Newton. arXiv preprint arXiv:2510.09378, 2025.
- [6] James Martens and Roger Grosse. Optimizing neural networks with Kronecker-factored approximate curvature. In International Conference on Machine Learning (ICML), volume 37 of Proceedings of Machine Learning Research, pages 2408–2417. PMLR, 2015.
- [7] Xi-Lin Li. Preconditioned stochastic gradient descent. IEEE Transactions on Neural Networks and Learning Systems, 29(5):1454–1466, 2018.
- [8] Vineet Gupta, Tomer Koren, and Yoram Singer. Shampoo: Preconditioned stochastic tensor optimization. In International Conference on Machine Learning (ICML), volume 80 of Proceedings of Machine Learning Research, pages 1842–1850. PMLR, 2018.
- [9] Yi Ren, Achraf Bahamou, and Donald Goldfarb. Kronecker-factored quasi-Newton methods for deep learning. arXiv preprint arXiv:2102.06737, 2021.
- [10] Kang An, Yuxing Liu, Rui Pan, Yi Ren, Shiqian Ma, Donald Goldfarb, and Tong Zhang. ASGO: Adaptive structured gradient optimization. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=fru52tkjHf.
- [11] Keller Jordan, Yuchen Jin, Vlado Boza, Jiacheng You, Franz Cesista, Laker Newhouse, and Jeremy Bernstein. Muon: An optimizer for hidden layers in neural networks. https://kellerjordan.github.io/posts/muon/, 2024.
- [12]
- [13] Nikhil Vyas, Depen Morwani, Rosie Zhao, Itai Shapira, David Brandfonbrener, Lucas Janson, and Sham M. Kakade. SOAP: Improving and stabilizing Shampoo using Adam for language modeling. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=IDxZhXrpNf.
- [14] Chongjie Si, Daquan Zhang, and Wei Shen. AdaMuon: Adaptive Muon optimizer. arXiv preprint arXiv:2507.11005, 2025.
- [15] Liming Liu, Zhenghao Xu, Zixuan Zhang, Hao Kang, Zichong Li, Chen Liang, Weizhu Chen, and Tuo Zhao. COSMOS: A hybrid adaptive optimizer for memory-efficient training of LLMs. arXiv preprint arXiv:2502.17410, 2025.
- [16] Wei Shen, Ruichuan Huang, Minhui Huang, Cong Shen, and Jiawei Zhang. On the convergence analysis of Muon. arXiv preprint arXiv:2505.23737, 2025.
- [17] Gyu Yeol Kim and Min hwan Oh. Convergence of Muon with Newton-Schulz, 2026. URL https://arxiv.org/abs/2601.19156.
- [18] Yushun Zhang, Congliang Chen, Tian Ding, Ziniu Li, Ruoyu Sun, and Zhiquan Luo. Why transformers need Adam: A Hessian perspective. Advances in Neural Information Processing Systems, 37:131786–131823, 2024.
- [19] Zhaorui Dong, Yushun Zhang, Jianfeng Yao, and Ruoyu Sun. Towards quantifying the Hessian structure of neural networks. arXiv preprint arXiv:2505.02809, 2025.
- [20] Yossi Arjevani, Yair Carmon, John C. Duchi, Dylan J. Foster, Nathan Srebro, and Blake Woodworth. Lower bounds for non-convex stochastic optimization. Mathematical Programming, 199(1):165–214, 2023.
- [21] Xi-Lin Li. Black box Lie group preconditioners for SGD. arXiv preprint arXiv:2211.04422, 2022.
- [22] Hao-Jun Michael Shi, Tsung-Hsien Lee, Shintaro Iwasaki, Jose Gallego-Posada, Zhijing Li, Kaushik Rangadurai, Dheevatsa Mudigere, and Michael Rabbat. A distributed data-parallel PyTorch implementation of the distributed Shampoo optimizer for training neural networks at-scale. arXiv preprint arXiv:2309.06497, 2023.
- [23] Jingyuan Liu, Jianlin Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, Yidao Qin, Weixin Xu, Enzhe Lu, Junjie Yan, et al. Muon is scalable for LLM training. arXiv preprint arXiv:2502.16982, 2025.
- [24] Zichong Li, Liming Liu, Chen Liang, Weizhu Chen, and Tuo Zhao. NorMuon: Making Muon more efficient and scalable. arXiv preprint arXiv:2510.05491, 2025.
- [25] Kaiyue Wen, David Hall, Tengyu Ma, and Percy Liang. Fantastic pretraining optimizers and where to find them. arXiv preprint arXiv:2509.02046, 2025.
- [26] Tianyu Pang, Yujie Fang, Zihang Liu, Shenyang Deng, Lei Hsiung, Shuhua Yu, and Yaoqing Yang. HtMuon: Improving Muon via heavy-tailed spectral correction, 2026. URL https://arxiv.org/abs/2603.10067.
- [27] Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. Visualizing the loss landscape of neural nets. Advances in Neural Information Processing Systems, 31, 2018.
- [28] Levent Sagun, Léon Bottou, and Yann LeCun. Eigenvalues of the Hessian in deep learning: Singularity and beyond. arXiv preprint arXiv:1611.07476, 2016.
- [29] Levent Sagun, Utku Evci, V. Ugur Guney, Yann Dauphin, and Leon Bottou. Empirical analysis of the Hessian of over-parametrized neural networks. arXiv preprint arXiv:1706.04454, 2017.
- [30] Shenyang Deng, Boyao Liao, Zhuoli Ouyang, Tianyu Pang, Minhak Song, and Yaoqing Yang. Suspicious alignment of SGD: A fine-grained step size condition analysis, 2026. URL https://arxiv.org/abs/2601.11789.
- [31] Shenyang Deng, Boyao Liao, Zhuoli Ouyang, Tianyu Pang, and Yaoqing Yang. Depth, not data: An analysis of Hessian spectral bifurcation, 2026. URL https://arxiv.org/abs/2602.00545.
- [32] Behrooz Ghorbani, Shankar Krishnan, and Ying Xiao. An investigation into neural net optimization via Hessian eigenvalue density. In International Conference on Machine Learning, pages 2232–2241. PMLR, 2019.
- [33] Yikai Wu, Xingyu Zhu, Chenwei Wu, Annie Wang, and Rong Ge. Dissecting Hessian: Understanding common structure of Hessian in neural networks. In Advances in Neural Information Processing Systems, volume 33, pages 10193–10204, 2020.
- [34] Sidak Pal Singh, Gregor Bachmann, and Thomas Hofmann. Analytic insights into structure and rank of neural network Hessian maps. Advances in Neural Information Processing Systems, 34:23914–23927, 2021.
- [35] Zhenyu Liao and Michael W. Mahoney. Hessian eigenspectra of more realistic nonlinear models. Advances in Neural Information Processing Systems, 34:20104–20117, 2021.
- [36] Congliang Chen, Li Shen, Fangyu Zou, and Wei Liu. Towards practical Adam: Non-convexity, convergence theory, and mini-batch acceleration. Journal of Machine Learning Research, 23(229):1–47, 2022.
- [37] Huan Li and Zhouchen Lin. On the $O(\sqrt{d}/T^{1/4})$ convergence rate of RMSProp and its momentum extension measured by $\ell_1$ norm. arXiv preprint arXiv:2402.00389, 2024.
- [38] Shuo Xie, Mohamad Amin Mohamadi, and Zhiyuan Li. Adam exploits $\ell_\infty$-geometry of loss landscape via coordinate-wise adaptivity. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=PUnD86UEK5.
- [39] Thomas Pethick, Wanyun Xie, Kimon Antonakopoulos, Zhenyu Zhu, Antonio Silveti-Falls, and Volkan Cevher. Training deep learning models with norm-constrained LMOs. In Forty-second International Conference on Machine Learning, 2025. URL https://openreview.net/forum?id=2Oqm2IzTy9.
- [40] Aaron Gokaslan, Vanya Cohen, Ellie Pavlick, and Stefanie Tellex. OpenWebText corpus. http://Skylion007.github.io/OpenWebTextCorpus, 2019.
- [41] Guilherme Penedo, Hynek Kydlíček, Loubna Ben Allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro Von Werra, and Thomas Wolf. The FineWeb datasets: Decanting the web for the finest text data at scale, 2024. URL https://arxiv.org/abs/2406.17557.
- [42] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020.
- [43] Huizhuo Yuan, Yifeng Liu, Shuang Wu, Xun Zhou, and Quanquan Gu. MARS: Unleashing the power of variance reduction for training large models. In Forty-second International Conference on Machine Learning, 2025. URL https://openreview.net/forum?id=NrcKQ3ASLZ.
- [44] Di He, Songjun Tu, Ajay Jaiswal, Li Shen, Ganzhao Yuan, Shiwei Liu, and Lu Yin. AlphaDecay: Module-wise weight decay for heavy-tailed balancing in LLMs. arXiv preprint arXiv:2506.14562, 2025.
- [45] Andrej Karpathy. Fineweb-edu-100b-shuffle. https://huggingface.co/datasets/karpathy/fineweb-edu-100b-shuffle, 2024.