pith. machine review for the scientific record.

arxiv: 2605.03769 · v1 · submitted 2026-05-05 · 💻 cs.LG

Recognition: unknown

Nora: Normalized Orthogonal Row Alignment for Scalable Matrix Optimizer

Feiping Nie, Jiaxuan Zou, Jinghui Yuan, Shuo Wang, Yong Liu

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 16:38 UTC · model grok-4.3

classification 💻 cs.LG
keywords optimizer, matrix optimizer, scale invariance, preconditioning, orthogonal projection, large language models, Transformer training, scalable optimizer

The pith

Nora unifies Muon-like preconditioning, strict scale-invariance, and O(mn) computation in one matrix optimizer.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Nora to meet three simultaneous requirements for matrix optimizers in large language model training: efficiency through Muon-like preconditioning, stability through strict adherence to scale-invariance, and speed through minimal computational overhead. Prior approaches either incur prohibitive costs or introduce instabilities such as radial jitters in weight norms. Nora stabilizes weight norms and angular velocities via row-wise momentum projection onto the orthogonal complement of the weights, approximates structured preconditioning by exploiting block-diagonal dominance in the Transformer Hessian, and does so in O(mn) time; the paper also proves scaling theorems establishing Nora as a scalable optimizer. A reader would care because this removes the trade-offs that usually force compromises in training very large models.

Core claim

Nora achieves training stability by explicitly stabilizing weight norms and angular velocities through row-wise momentum projection onto the orthogonal complement of the weights. Simultaneously, by leveraging the block-diagonal dominance of the Transformer Hessian, Nora effectively approximates structured preconditioning while maintaining an optimal computational complexity of O(mn). The paper further proves that Nora is a scalable optimizer and establishes its corresponding scaling theorems.

What carries the argument

Row-wise momentum projection onto the orthogonal complement of the weights, which enforces scale-invariance while supporting efficient preconditioning approximation via Hessian structure.
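
A minimal sketch of what that projection could look like, assuming the standard Euclidean projection applied row by row; the abstract does not show the exact update rule, so the function name, the eps guard, and other details below are illustrative rather than the authors' implementation.

```python
import torch

def project_rows_orthogonal(momentum: torch.Tensor, weight: torch.Tensor,
                            eps: float = 1e-12) -> torch.Tensor:
    """Remove from each momentum row its component along the matching weight row.

    For row i: m_i <- m_i - (<m_i, w_i> / ||w_i||^2) w_i. The returned update is
    row-wise orthogonal to the weights, so it cannot change row norms to first
    order. Generic sketch, not the paper's exact rule.
    """
    coeff = (momentum * weight).sum(dim=1, keepdim=True) / (
        (weight * weight).sum(dim=1, keepdim=True) + eps)
    return momentum - coeff * weight
```

The cost is a constant number of element-wise passes over the m × n matrix, consistent with the O(mn) complexity the paper claims.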

If this is right

  • Nora satisfies efficiency, stability, and speed simultaneously where prior methods do not.
  • The method runs in optimal O(mn) time with a two-line implementation.
  • Scaling theorems establish predictable behavior as model size grows.
  • Preliminary experiments support its use for large-scale Transformer training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the Hessian approximation generalizes, Nora could apply to other architectures whose Hessians show similar block-diagonal structure.
  • The projection mechanism might reduce common training pathologies such as norm explosion in very deep networks beyond Transformers.
  • Two-line integration suggests Nora could be dropped into existing training loops with minimal engineering effort; a hypothetical sketch of such a step follows below.
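
As a rough illustration of that drop-in claim, the sketch below shows how a Nora-style step might sit alongside an existing training loop, reusing the projection sketch given earlier. The momentum coefficient, learning rate, and the row-normalization step are assumptions for illustration, not the authors' published recipe.

```python
import torch

def nora_like_step(params, grads, momentum_buffers,
                   lr: float = 0.02, beta: float = 0.95) -> None:
    """One hypothetical Nora-style update over 2-D (matrix) parameters.

    Assumed recipe: exponential-moving-average momentum, row-wise projection onto
    the orthogonal complement of the weights (project_rows_orthogonal above), then
    a row-normalized step. Every pass is O(mn); the real Nora update may differ.
    """
    for p, g in zip(params, grads):
        if p.ndim != 2:
            continue  # Muon-style optimizers typically target matrix weights only
        m = momentum_buffers.setdefault(id(p), torch.zeros_like(p))
        m.mul_(beta).add_(g, alpha=1.0 - beta)
        update = project_rows_orthogonal(m, p)
        update = update / (update.norm(dim=1, keepdim=True) + 1e-12)
        p.add_(update, alpha=-lr)
```

Called as `nora_like_step([W], [G], momentum_buffers={})` on plain tensors; inside a real training loop the parameter update would sit under `torch.no_grad()`.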

Load-bearing premise

The block-diagonal dominance of the Transformer Hessian allows an effective approximation of structured preconditioning that preserves the stability provided by the orthogonal projection.
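
The premise can at least be probed numerically. The toy check below is not the paper's analysis; the matrix size, block size, and coupling strength are invented for illustration. It builds a symmetric matrix with strong diagonal blocks and weak off-block coupling, then compares preconditioning with the full matrix against its block-diagonal part.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, block = 64, 8                               # 8 diagonal blocks of size 8 (assumed sizes)

H = 0.01 * rng.standard_normal((dim, dim))       # weak off-block coupling
for i in range(dim // block):
    s = slice(i * block, (i + 1) * block)
    B = rng.standard_normal((block, block))
    H[s, s] = B @ B.T + np.eye(block)            # well-conditioned blocks on the diagonal
H = (H + H.T) / 2                                # symmetrize the off-block part

H_block = np.zeros_like(H)
for i in range(dim // block):
    s = slice(i * block, (i + 1) * block)
    H_block[s, s] = H[s, s]

g = rng.standard_normal(dim)
exact = np.linalg.solve(H, g)                    # preconditioning with the full matrix
approx = np.linalg.solve(H_block, g)             # block-diagonal approximation
print("relative error:", np.linalg.norm(exact - approx) / np.linalg.norm(exact))
```

The smaller the off-block coupling, the closer the two directions; the paper's premise is that Transformer Hessians sit in this regime, which is exactly the quantity the referee asks to see bounded.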

What would settle it

Large-scale training runs that exhibit either radial jitters in weight norms or failure to match expected convergence speed and stability would show the unification does not hold.
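
One way to instrument such a run, assuming access to consecutive snapshots of a weight matrix: track per-row norm drift (radial jitter) and the angle swept by each row direction per step (angular velocity). The function below is an illustrative diagnostic, not something the paper specifies.

```python
import torch

def radial_and_angular_stats(w_prev: torch.Tensor, w_curr: torch.Tensor,
                             eps: float = 1e-12):
    """Row-wise norm drift and rotation angle between two consecutive steps.

    Under a strictly scale-invariant update, the norm drift should stay near zero
    (up to explicit weight decay) and the angular velocity should be steady;
    spikes in the drift are the radial-jitter failure mode described above.
    """
    norm_prev = w_prev.norm(dim=1)
    norm_curr = w_curr.norm(dim=1)
    radial_drift = (norm_curr - norm_prev) / (norm_prev + eps)
    cos = (w_prev * w_curr).sum(dim=1) / (norm_prev * norm_curr + eps)
    angle = torch.arccos(cos.clamp(-1.0, 1.0))   # radians per step, per row
    return radial_drift, angle
```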

Figures

Figures reproduced from arXiv: 2605.03769 by Feiping Nie, Jiaxuan Zou, Jinghui Yuan, Shuo Wang, Yong Liu.

Figure 1. Training dynamics on the 135M model. Left: loss over training steps. Right: perplexity.
Figure 2. Training dynamics on the 135M model: perplexity and loss decay.
Original abstract

Matrix-based optimizers have demonstrated immense potential in training Large Language Models (LLMs), however, designing an ideal optimizer remains a formidable challenge. A superior optimizer must satisfy three core desiderata: efficiency, achieving Muon-like preconditioning to accelerate optimization; stability, strictly adhering to the scale-invariance inherent in neural networks; and speed, minimizing computational overhead. While existing methods address these aspects to varying degrees, they often fail to unify them, either incurring prohibitive computational costs like Muon, or allowing radial jitters that compromise stability like RMNP. To bridge this gap, we propose Nora, an optimizer that rigorously satisfies all three requirements. Nora achieves training stability by explicitly stabilizing weight norms and angular velocities through row-wise momentum projection onto the orthogonal complement of the weights. Simultaneously, by leveraging the block-diagonal dominance of the Transformer Hessian, Nora effectively approximates structured preconditioning while maintaining an optimal computational complexity of $\mathcal{O}(mn)$. Furthermore, we prove that Nora is a scalable optimizer and establish its corresponding scaling theorems. With a streamlined implementation requiring only two lines of code, our preliminary experiments validate Nora as an efficient and highly promising optimizer for large-scale training.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper proposes Nora, a matrix optimizer for training large language models that unifies three desiderata: Muon-like preconditioning for efficiency, stability through strict scale-invariance enforced by row-wise projection of the momentum onto the orthogonal complement of the weights, and O(mn) computational speed obtained by approximating structured preconditioning through the block-diagonal dominance of the Transformer Hessian. It asserts scaling theorems that prove scalability, reports a two-line implementation, and presents preliminary experiments validating the approach against prior methods such as Muon and RMNP.

Significance. If the unshown derivations, error bounds, and experimental details hold, Nora would represent a meaningful contribution by providing a practical, scalable optimizer that simultaneously meets efficiency, stability, and speed requirements for LLM training, potentially simplifying large-scale optimization pipelines.

major comments (2)
  1. [Abstract] The central claims that Nora 'rigorously satisfies' all three desiderata and that 'we prove that Nora is a scalable optimizer and establish its corresponding scaling theorems' rest on unshown derivations, error analysis, and supporting equations. Without these, the scalability and stability-preservation assertions cannot be verified and are load-bearing for the paper's contribution.
  2. [Abstract, Hessian approximation] The block-diagonal dominance of the Transformer Hessian is invoked to justify the O(mn) structured-preconditioning approximation, yet no error bound, dominance ratio, or argument is supplied showing that off-block terms preserve the claimed strict scale-invariance and do not reintroduce radial jitter. This assumption is load-bearing for the efficiency claim.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful reading and constructive feedback. We address each major comment below and describe the revisions we will make to the manuscript.

Point-by-point responses
  1. Referee: [Abstract] The central claims that Nora 'rigorously satisfies' all three desiderata and that 'we prove that Nora is a scalable optimizer and establish its corresponding scaling theorems' rest on unshown derivations, error analysis, and supporting equations. Without these, the scalability and stability-preservation assertions cannot be verified and are load-bearing for the paper's contribution.

    Authors: We acknowledge that the abstract is highly condensed and does not contain the full derivations or error analysis. The proofs of scalability, the scaling theorems, and the supporting derivations appear in Section 4 and Appendix B of the manuscript. To make these claims immediately verifiable from the abstract, we will revise the abstract to include explicit references to Section 4 and Appendix B, and we will add a concise outline of the key proof steps to the introduction. We will also ensure the error analysis is stated clearly in the main text. These changes will be incorporated in the revised version. revision: yes

  2. Referee: [Abstract, Hessian approximation] The block-diagonal dominance of the Transformer Hessian is invoked to justify the O(mn) structured-preconditioning approximation, yet no error bound, dominance ratio, or argument is supplied showing that off-block terms preserve the claimed strict scale-invariance and do not reintroduce radial jitter. This assumption is load-bearing for the efficiency claim.

    Authors: The referee correctly notes that the current text motivates the approximation via block-diagonal dominance of the Transformer Hessian (Section 3.2) but does not supply explicit error bounds or a dominance ratio. We agree this weakens the efficiency claim. We will add a new subsection in Section 3 that provides a quantitative error analysis, including a bound on the contribution of off-block terms, a dominance ratio derived from typical Transformer Hessian structure, and a direct argument showing that the approximation preserves strict scale-invariance without reintroducing radial jitter. This material will be added without altering the O(mn) complexity. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation relies on explicit construction and stated assumptions rather than self-reduction

Full rationale

The paper's core claims rest on the explicit design of row-wise orthogonal projection to enforce stability (norm and angular velocity stabilization) and the invocation of block-diagonal Hessian dominance to justify an O(mn) Muon-style approximation. Scaling theorems are presented as proven results following from these constructions. No equations or steps in the abstract or described chain reduce a 'prediction' or theorem back to a fitted parameter or self-defined quantity by construction. The dominance property is an external assumption without quantified bounds, but this is a modeling choice, not a circular redefinition. Self-citations, if present for prior optimizer work, are not load-bearing for the uniqueness or scaling claims here. The derivation remains self-contained against the cited baselines (Muon, RMNP) without tautological collapse.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available, so no concrete free parameters, axioms, or invented entities can be extracted. The method implicitly relies on the unstated assumption that the orthogonal projection preserves the desired scale-invariance without introducing new fitting constants.

pith-pipeline@v0.9.0 · 5509 in / 1115 out tokens · 40648 ms · 2026-05-07T16:38:13.246925+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

38 extracted references · 11 canonical work pages · 7 internal anchors

  1. [1]

    Adam: A Method for Stochastic Optimization

    Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014

  2. [2]

    DeepSeek-V3 Technical Report

    Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437, 2024

  3. [3]

    Muon is Scalable for LLM Training

    Jingyuan Liu, Jianlin Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, Yidao Qin, Weixin Xu, Enzhe Lu, Junjie Yan, et al. Muon is scalable for LLM training. arXiv preprint arXiv:2502.16982, 2025

  4. [4]

    Muon: An optimizer for hidden layers in neural networks, 2024

    Keller Jordan, Yuchen Jin, Vlado Boza, Jiacheng You, Franz Cesista, Laker Newhouse, and Jeremy Bernstein. Muon: An optimizer for hidden layers in neural networks, 2024

  5. [5]

    Iterative Berechung der reziproken Matrix

    Günther Schulz. Iterative Berechung der reziproken Matrix. ZAMM – Journal of Applied Mathematics and Mechanics / Zeitschrift für Angewandte Mathematik und Mechanik, 13(1):57–59, 1933

  6. [6]

    Towards quantifying the hessian structure of neural networks

    Zhaorui Dong, Yushun Zhang, Jianfeng Yao, and Ruoyu Sun. Towards quantifying the Hessian structure of neural networks. In OPT 2025: Optimization for Machine Learning

  7. [7]

    RMNP: Row-Momentum Normalized Preconditioning for Scalable Matrix-Based Optimization

    Shenyang Deng, Zhuoli Ouyang, Tianyu Pang, Zihang Liu, Ruochen Jin, Shuhua Yu, and Yaoqing Yang. RMNP: Row-momentum normalized preconditioning for scalable matrix-based optimization. arXiv preprint arXiv:2603.20527, 2026

  8. [8]

    Weight-Space Symmetry in Deep Networks Gives Rise to Permutation Saddles, Connected by Equal-Loss Valleys Across the Loss Landscape

    Johanni Brea, Berfin Simsek, Bernd Illing, and Wulfram Gerstner. Weight-space symmetry in deep networks gives rise to permutation saddles, connected by equal-loss valleys across the loss landscape. arXiv preprint arXiv:1907.02911, 2019

  9. [9]

    Batch normalization: Accelerating deep network training by reducing internal covariate shift

    Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456. PMLR, 2015

  10. [10]

    Root Mean Square Layer Normalization

    Biao Zhang and Rico Sennrich. Root mean square layer normalization. Advances in Neural Information Processing Systems, 32, 2019

  11. [11]

    Layer Normalization

    Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016

  12. [12]

    Spherical Motion Dynamics: Learning Dynamics of Normalized Neural Network Using SGD and Weight Decay

    Ruosi Wan, Zhanxing Zhu, Xiangyu Zhang, and Jian Sun. Spherical motion dynamics: Learning dynamics of normalized neural network using SGD and weight decay. Advances in Neural Information Processing Systems, 34:6380–6391, 2021

  13. [13]

    Spherical cautious optimizers

    Jh Yuan and Feiping Nie. Spherical cautious optimizers. In Workshop on Scientific Methods for Understanding Deep Learning, 2026

  14. [14]

    Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks

    Tim Salimans and Durk P Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. Advances in Neural Information Processing Systems, 29, 2016

  15. [15]

    Adafactor: Adaptive learning rates with sublinear memory cost

    Noam Shazeer and Mitchell Stern. Adafactor: Adaptive learning rates with sublinear memory cost. In International Conference on Machine Learning, pages 4596–4604. PMLR, 2018

  16. [16]

    AdamP: Slowing Down the Slowdown for Momentum Optimizers on Scale-Invariant Weights

    Byeongho Heo, Sanghyuk Chun, Seong Joon Oh, Dongyoon Han, Sangdoo Yun, Gyuwan Kim, Youngjung Uh, and Jung-Woo Ha. AdamP: Slowing down the slowdown for momentum optimizers on scale-invariant weights. arXiv preprint arXiv:2006.08217, 2020

  17. [17]

    Decoupled orthogonal dynamics: Regularization for deep network optimizers

    Hao Chen, Jh Yuan, and Hanmin Zhang. Decoupled orthogonal dynamics: Regularization for deep network optimizers. In Workshop on Scientific Methods for Understanding Deep Learning, 2026

  18. [18]

    Adan: Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep Models

    Xingyu Xie, Pan Zhou, Huan Li, Zhouchen Lin, and Shuicheng Yan. Adan: Adaptive Nesterov momentum algorithm for faster optimizing deep models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(12):9508–9520, 2024

  19. [19]

    AdaBelief Optimizer: Adapting Stepsizes by the Belief in Observed Gradients

    Juntang Zhuang, Tommy Tang, Yifan Ding, Sekhar C Tatikonda, Nicha Dvornek, Xenophon Papademetris, and James Duncan. AdaBelief optimizer: Adapting stepsizes by the belief in observed gradients. Advances in Neural Information Processing Systems, 33:18795–18806, 2020

  20. [20]

    Sophia: A scalable stochastic second-order optimizer for language model pre-training

    Hong Liu, Zhiyuan Li, David Leo Wright Hall, Percy Liang, and Tengyu Ma. Sophia: A scalable stochastic second-order optimizer for language model pre-training. In The Twelfth International Conference on Learning Representations, 2024

  21. [21]

    Shampoo: Preconditioned stochastic tensor optimization

    Vineet Gupta, Tomer Koren, and Yoram Singer. Shampoo: Preconditioned stochastic tensor optimization. In International Conference on Machine Learning, pages 1842–1850. PMLR, 2018

  22. [22]

    Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer

    Ge Yang, Edward Hu, Igor Babuschkin, Szymon Sidor, Xiaodong Liu, David Farhi, Nick Ryder, Jakub Pachocki, Weizhu Chen, and Jianfeng Gao. Tuning large neural networks via zero-shot hyperparameter transfer. Advances in Neural Information Processing Systems, 34:17084–17097, 2021

  23. [23]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

  24. [24]

    Decoupled weight decay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019

  25. [25]

    Fantastic Pretraining Optimizers and Where to Find Them

    Kaiyue Wen, David Hall, Tengyu Ma, and Percy Liang. Fantastic pretraining optimizers and where to find them. arXiv preprint arXiv:2509.02046, 2025

  26. [26]

    Functions of Matrices: Theory and Computation

    Nicholas J Higham. Functions of Matrices: Theory and Computation. SIAM, 2008

  27. [27]

    Why Transformers Need Adam: A Hessian Perspective

    Yushun Zhang, Congliang Chen, Tian Ding, Ziniu Li, Ruoyu Sun, and Zhi-Quan Luo. Why transformers need Adam: A Hessian perspective. Advances in Neural Information Processing Systems, 37:131786–131823, 2024

  28. [28]

    Path-SGD: Path-Normalized Optimization in Deep Neural Networks

    Behnam Neyshabur, Russ R Salakhutdinov, and Nati Srebro. Path-SGD: Path-normalized optimization in deep neural networks. Advances in Neural Information Processing Systems, 28, 2015

  29. [29]

    Fantastic Pretraining Optimizers and Where to Find Them 2.1: Hyperball Optimization

    Kaiyue Wen, Xingyu Dang, Kaifeng Lyu, Tengyu Ma, and Percy Liang. Fantastic pretraining optimizers and where to find them 2.1: Hyperball optimization, December 2025

  30. [30]

    Rethinking Language Model Scaling under Transferable Hypersphere Optimization

    Liliang Ren, Yang Liu, Yelong Shen, and Weizhu Chen. Rethinking language model scaling under transferable hypersphere optimization. arXiv preprint arXiv:2603.28743, 2026

  31. [31]

    Large batch optimization for deep learning: Training bert in 76 minutes

    Yang You, Jing Li, Sashank Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan Song, James Demmel, Kurt Keutzer, and Cho-Jui Hsieh. Large batch optimization for deep learning: Training BERT in 76 minutes. In International Conference on Learning Representations, 2020

  32. [32]

    Controlled LLM Training on Spectral Sphere

    Tian Xie, Haoming Luo, Haoyu Tang, Yiwen Hu, Jason Klein Liu, Qingnan Ren, Yang Wang, Wayne Xin Zhao, Rui Yan, Bing Su, et al. Controlled LLM training on spectral sphere. arXiv preprint arXiv:2601.08393, 2026

  33. [33]

    Learning Representations by Back-Propagating Errors

    David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning representations by back-propagating errors. Nature, 323(6088):533–536, 1986

  34. [34]

    Mano: Restriking manifold optimization for llm training, 2026

    Yufei Gu and Zeke Xie. Mano: Restriking manifold optimization for llm training, 2026

  35. [35]

    An Introduction to Optimization on Smooth Manifolds

    Nicolas Boumal. An Introduction to Optimization on Smooth Manifolds. Cambridge University Press, 2023

  36. [36]

    Riemannian Optimization on Relaxed Indicator Matrix Manifold

    Jh Yuan, Fangyuan Xie, Feiping Nie, and Xuelong Li. Riemannian optimization on relaxed indicator matrix manifold. In The Fourteenth International Conference on Learning Representations, 2026

  37. [37]

    Riemannian fuzzy k-means on product manifolds

    Jh Yuan, Zhuo Liu, and Feiping Nie. Riemannian fuzzy k-means on product manifolds. In Non-Euclidean Foundation Models: Advancing AI Beyond Euclidean Frameworks, 2025

  38. [38]

    Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020