RMNP: Row-Momentum Normalized Preconditioning for Scalable Matrix-Based Optimization

Shenyang Deng , Zhuoli Ouyang , Tianyu Pang , Zihang Liu , Ruochen Jin , Shuhua Yu , Yaoqing Yang

classification 💻 cs.LG

keywords: preconditioning, rmnp, muon, optimization, complexity, computational, efficiency, empirically

Preconditioned adaptive methods have gained significant attention for training deep neural networks, as they capture rich curvature information of the loss landscape. The central challenge in this field lies in balancing the effectiveness of the preconditioner against the computational cost of implementing it. Among recent advances, Muon stands out by using Newton-Schulz iteration to obtain preconditioned updates without explicitly constructing the preconditioning matrix. Despite these advantages, Muon's efficiency still leaves room for improvement. In this paper, we introduce RMNP (Row Momentum Normalized Preconditioning), an optimizer that replaces Newton-Schulz iteration with a simple row-wise $\ell_2$ normalization along the input dimension $d_{\text{in}}$, motivated by the empirically observed diagonal block structure of the Transformer layerwise Hessian. We empirically verify that orthogonalization and row-wise $\ell_2$ normalization (along the input dimension) are asymptotically equivalent for Transformers. This substitution reduces the per-iteration computational complexity from $O(mn\cdot\min(m,n))$ to $O(mn)$ for an $m\times n$ weight matrix while maintaining comparable optimization performance. Theoretically, we establish convergence guarantees for RMNP in the non-convex setting that match recent results for Muon optimizers, achieving the minimax optimal complexity. Extensive experiments on large language model pretraining show that RMNP delivers competitive optimization performance compared with Muon while substantially reducing preconditioning wall-clock time. Our code is available at https://github.com/Dominator-Index/RMNP.
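
The row-wise normalization described in the abstract can be illustrated with a short NumPy sketch. This is not the authors' implementation (see the linked repository for that); it is a minimal illustration under several assumptions: the weight matrix has shape $m\times n$ with the input dimension along each row, the momentum is a plain accumulation `beta * momentum + grad`, and the names `row_normalize`, `rmnp_like_step`, `lr`, `beta`, and `eps` are hypothetical. The point it shows is that the preconditioning step touches each entry once, which is where the $O(mn)$ cost comes from.

```python
import numpy as np

def row_normalize(M, eps=1e-8):
    """Scale each row of M to unit l2 norm (assumed: rows span the input dimension)."""
    # One pass over the m*n entries to compute the m row norms: O(mn) work.
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    # eps guards against division by zero for all-zero rows.
    return M / (norms + eps)

def rmnp_like_step(W, grad, momentum, lr=0.02, beta=0.95):
    """One illustrative update: momentum accumulation, then row normalization.

    The momentum rule and the hyperparameter values are assumptions made
    for this sketch, not values taken from the paper.
    """
    momentum = beta * momentum + grad
    update = row_normalize(momentum)  # stands in for Newton-Schulz orthogonalization
    W = W - lr * update
    return W, momentum

# Toy example with a random 4 x 6 "weight matrix" and "gradient".
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 6))
grad = rng.standard_normal((4, 6))
momentum = np.zeros_like(W)

W_new, momentum = rmnp_like_step(W, grad, momentum)
row_norms = np.linalg.norm(row_normalize(momentum), axis=1)
```

After the step, every row of the normalized update has (approximately) unit $\ell_2$ norm, so each output row of the weight matrix moves by the same distance `lr` regardless of the gradient's scale on that row.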

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Nora: Normalized Orthogonal Row Alignment for Scalable Matrix Optimizer
cs.LG 2026-05 unverdicted novelty 4.0

Nora is a matrix optimizer that stabilizes weight norms and angular velocities through row-wise momentum projection onto the orthogonal complement of the weights while approximating structured preconditioning with O(m...