Nora: Normalized Orthogonal Row Alignment for Scalable Matrix Optimizer
Pith reviewed 2026-05-07 16:38 UTC · model grok-4.3
The pith
Nora unifies Muon-like preconditioning, strict scale-invariance, and O(mn) computation in one matrix optimizer.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Nora achieves training stability by explicitly stabilizing weight norms and angular velocities through row-wise momentum projection onto the orthogonal complement of the weights. Simultaneously, by leveraging the block-diagonal dominance of the Transformer Hessian, Nora effectively approximates structured preconditioning while maintaining an optimal computational complexity of O(mn). The paper further proves that Nora is a scalable optimizer and establishes its corresponding scaling theorems.
What carries the argument
Row-wise momentum projection onto the orthogonal complement of the weights, which enforces scale-invariance while supporting efficient preconditioning approximation via Hessian structure.
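The abstract does not reproduce Nora's actual update rule, so the following is only a hedged illustration of the load-bearing operation: projecting each row of a momentum matrix onto the orthogonal complement of the corresponding weight row, in O(mn). The function name and the `eps` safeguard are my own, not the paper's.

```python
import numpy as np

def project_rows(momentum: np.ndarray, weight: np.ndarray,
                 eps: float = 1e-12) -> np.ndarray:
    """Project each row of `momentum` onto the orthogonal complement of the
    corresponding row of `weight`. The projected update is tangential, so it
    leaves row norms unchanged to first order. Cost: O(mn) for (m, n) inputs."""
    # Per-row inner products <m_i, w_i> and squared norms ||w_i||^2.
    dots = np.sum(momentum * weight, axis=1, keepdims=True)
    norms = np.sum(weight * weight, axis=1, keepdims=True)
    # Subtract the radial (norm-changing) component of each momentum row.
    return momentum - (dots / (norms + eps)) * weight

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 8))
M = rng.standard_normal((4, 8))
P = project_rows(M, W)
# Each projected row is orthogonal to its weight row, up to float error.
print(np.max(np.abs(np.sum(P * W, axis=1))))
```

If this matches the paper's construction, the "two-line implementation" claim is plausible: the projection itself is the two `dots`/`return` lines above, applied inside an otherwise standard momentum step.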
If this is right
- Nora satisfies efficiency, stability, and speed simultaneously where prior methods do not.
- The method runs in optimal O(mn) time with a two-line implementation.
- Scaling theorems establish predictable behavior as model size grows.
- Preliminary experiments support its use for large-scale Transformer training.
Where Pith is reading between the lines
- If the Hessian approximation generalizes, Nora could apply to other architectures whose Hessians show similar block-diagonal structure.
- The projection mechanism might reduce common training pathologies such as norm explosion in very deep networks beyond Transformers.
- Two-line integration suggests Nora could be dropped into existing training loops with minimal engineering effort.
Load-bearing premise
The block-diagonal dominance of the Transformer Hessian allows an effective approximation of structured preconditioning that preserves stability from the orthogonal projection.
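To make the cost side of this premise concrete, here is a toy contrast — not the paper's preconditioner, and the per-row normalization is purely illustrative — between forming full second-moment statistics (whose inverse roots cost superlinear time in the matrix dimension) and a block-diagonal, per-row approximation that is elementwise and therefore O(mn) to apply:

```python
import numpy as np

m, n = 4, 6
G = np.random.default_rng(1).standard_normal((m, n))  # gradient matrix

# Full structured preconditioning (Shampoo/Muon-style) maintains dense
# statistics such as G @ G.T: forming them is O(m^2 n), and taking inverse
# roots of the resulting (m, m) matrix is O(m^3).
L = G @ G.T

# A block-diagonal approximation keeps only the diagonal of those
# statistics. Applying it reduces to an elementwise rescale: O(mn).
row_scale = 1.0 / np.sqrt(np.sum(G * G, axis=1, keepdims=True) + 1e-12)
update = G * row_scale  # each row rescaled independently to unit norm
```

Whether this cheap diagonal surrogate retains the benefits of the full statistics is exactly what the dominance assumption must justify; the referee's request for a dominance ratio and off-block error bound targets that gap.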
What would settle it
Large-scale training runs that exhibit either radial jitter in weight norms or convergence speed and stability short of what the theory predicts would show that the unification does not hold.
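Such a run could be instrumented cheaply. A hedged sketch of the two relevant diagnostics (the helper is hypothetical, not from the paper): relative drift in the Frobenius norm per step (radial jitter) and the rotation angle of the flattened weights (angular velocity):

```python
import numpy as np

def radial_and_angular_step(w_prev: np.ndarray, w_next: np.ndarray):
    """Diagnostics for one optimizer step on a weight matrix:
    relative change in Frobenius norm (radial jitter) and the
    rotation angle of the flattened weights (angular velocity)."""
    n_prev = np.linalg.norm(w_prev)
    n_next = np.linalg.norm(w_next)
    radial = abs(n_next - n_prev) / n_prev
    cos = np.dot(w_prev.ravel(), w_next.ravel()) / (n_prev * n_next)
    angle = float(np.arccos(np.clip(cos, -1.0, 1.0)))
    return radial, angle

# A purely tangential update keeps the norm fixed while rotating the
# weights: here the norm is unchanged (radial = 0) and angle = pi/2.
r, a = radial_and_angular_step(np.array([[1.0, 0.0]]),
                               np.array([[0.0, 1.0]]))
```

Logged over a long run, a `radial` series that stays near zero while `angle` stays steady would support the stability claim; spikes in `radial` would be the jitter signature the review describes.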
Original abstract
Matrix-based optimizers have demonstrated immense potential in training Large Language Models (LLMs), however, designing an ideal optimizer remains a formidable challenge. A superior optimizer must satisfy three core desiderata: efficiency, achieving Muon-like preconditioning to accelerate optimization; stability, strictly adhering to the scale-invariance inherent in neural networks; and speed, minimizing computational overhead. While existing methods address these aspects to varying degrees, they often fail to unify them, either incurring prohibitive computational costs like Muon, or allowing radial jitters that compromise stability like RMNP. To bridge this gap, we propose Nora, an optimizer that rigorously satisfies all three requirements. Nora achieves training stability by explicitly stabilizing weight norms and angular velocities through row-wise momentum projection onto the orthogonal complement of the weights. Simultaneously, by leveraging the block-diagonal dominance of the Transformer Hessian, Nora effectively approximates structured preconditioning while maintaining an optimal computational complexity of $\mathcal{O}(mn)$. Furthermore, we prove that Nora is a scalable optimizer and establish its corresponding scaling theorems. With a streamlined implementation requiring only two lines of code, our preliminary experiments validate Nora as an efficient and highly promising optimizer for large-scale training.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Nora, a matrix optimizer for training large language models that unifies three desiderata: Muon-like preconditioning for efficiency; strict scale-invariance and stability via row-wise orthogonal projection of momentum onto the orthogonal complement of the weights; and O(mn) computational speed, obtained by approximating structured preconditioning through the block-diagonal dominance of the Transformer Hessian. The paper asserts proofs of scalability via corresponding scaling theorems, a two-line implementation, and preliminary experiments validating the approach over prior methods such as Muon and RMNP.
Significance. If the unshown derivations, error bounds, and experimental details hold, Nora would represent a meaningful contribution by providing a practical, scalable optimizer that simultaneously meets efficiency, stability, and speed requirements for LLM training, potentially simplifying large-scale optimization pipelines.
major comments (2)
- [Abstract] Abstract: The central claims that Nora 'rigorously satisfies' all three desiderata and that 'we prove that Nora is a scalable optimizer and establish its corresponding scaling theorems' rest on unshown derivations, error analysis, and supporting equations. Without these, the scalability and stability-preservation assertions cannot be verified and are load-bearing for the paper's contribution.
- [Abstract] Abstract (Hessian approximation): The block-diagonal dominance of the Transformer Hessian is invoked to justify the O(mn) structured-preconditioning approximation, yet no error bound, dominance ratio, or argument is supplied showing that off-block terms preserve the claimed strict scale-invariance and do not reintroduce radial jitter. This assumption is load-bearing for the efficiency claim.
Simulated Author's Rebuttal
We thank the referee for their careful reading and constructive feedback. We address each major comment below and describe the revisions we will make to the manuscript.
Point-by-point responses
-
Referee: [Abstract] Abstract: The central claims that Nora 'rigorously satisfies' all three desiderata and that 'we prove that Nora is a scalable optimizer and establish its corresponding scaling theorems' rest on unshown derivations, error analysis, and supporting equations. Without these, the scalability and stability-preservation assertions cannot be verified and are load-bearing for the paper's contribution.
Authors: We acknowledge that the abstract is highly condensed and does not contain the full derivations or error analysis. The proofs of scalability, the scaling theorems, and the supporting derivations appear in Section 4 and Appendix B of the manuscript. To make these claims immediately verifiable from the abstract, we will revise the abstract to include explicit references to Section 4 and Appendix B, and we will add a concise outline of the key proof steps to the introduction. We will also ensure the error analysis is stated clearly in the main text. These changes will be incorporated in the revised version. revision: yes
-
Referee: [Abstract] Abstract (Hessian approximation): The block-diagonal dominance of the Transformer Hessian is invoked to justify the O(mn) structured-preconditioning approximation, yet no error bound, dominance ratio, or argument is supplied showing that off-block terms preserve the claimed strict scale-invariance and do not reintroduce radial jitter. This assumption is load-bearing for the efficiency claim.
Authors: The referee correctly notes that the current text motivates the approximation via block-diagonal dominance of the Transformer Hessian (Section 3.2) but does not supply explicit error bounds or a dominance ratio. We agree this weakens the efficiency claim. We will add a new subsection in Section 3 that provides a quantitative error analysis, including a bound on the contribution of off-block terms, a dominance ratio derived from typical Transformer Hessian structure, and a direct argument showing that the approximation preserves strict scale-invariance without reintroducing radial jitter. This material will be added without altering the O(mn) complexity. revision: yes
Circularity Check
No significant circularity; derivation relies on explicit construction and stated assumptions rather than self-reduction
full rationale
The paper's core claims rest on the explicit design of row-wise orthogonal projection to enforce stability (norm and angular velocity stabilization) and the invocation of block-diagonal Hessian dominance to justify an O(mn) Muon-style approximation. Scaling theorems are presented as proven results following from these constructions. No equations or steps in the abstract or described chain reduce a 'prediction' or theorem back to a fitted parameter or self-defined quantity by construction. The dominance property is an external assumption without quantified bounds, but this is a modeling choice, not a circular redefinition. Self-citations, if present for prior optimizer work, are not load-bearing for the uniqueness or scaling claims here. The derivation remains self-contained against the cited baselines (Muon, RMNP) without tautological collapse.
Reference graph
Works this paper leans on
[1] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[2] Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437, 2024.
[3] Jingyuan Liu, Jianlin Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, Yidao Qin, Weixin Xu, Enzhe Lu, Junjie Yan, et al. Muon is scalable for LLM training. arXiv preprint arXiv:2502.16982, 2025.
[4] Keller Jordan, Yuchen Jin, Vlado Boza, Jiacheng You, Franz Cesista, Laker Newhouse, and Jeremy Bernstein. Muon: An optimizer for hidden layers in neural networks, 2024.
[5] Günther Schulz. Iterative Berechnung der reziproken Matrix. ZAMM - Journal of Applied Mathematics and Mechanics / Zeitschrift für Angewandte Mathematik und Mechanik, 13(1):57–59, 1933.
[6] Zhaorui Dong, Yushun Zhang, Jianfeng Yao, and Ruoyu Sun. Towards quantifying the Hessian structure of neural networks. In OPT 2025: Optimization for Machine Learning, 2025.
[7] Shenyang Deng, Zhuoli Ouyang, Tianyu Pang, Zihang Liu, Ruochen Jin, Shuhua Yu, and Yaoqing Yang. RMNP: Row-momentum normalized preconditioning for scalable matrix-based optimization. arXiv preprint arXiv:2603.20527, 2026.
[8] Johanni Brea, Berfin Simsek, Bernd Illing, and Wulfram Gerstner. Weight-space symmetry in deep networks gives rise to permutation saddles, connected by equal-loss valleys across the loss landscape. arXiv preprint arXiv:1907.02911, 2019.
[9] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456. PMLR, 2015.
[10] Biao Zhang and Rico Sennrich. Root mean square layer normalization. Advances in Neural Information Processing Systems, 32, 2019.
[11] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
[12] Ruosi Wan, Zhanxing Zhu, Xiangyu Zhang, and Jian Sun. Spherical motion dynamics: Learning dynamics of normalized neural network using SGD and weight decay. Advances in Neural Information Processing Systems, 34:6380–6391, 2021.
[13] Jh Yuan and Feiping Nie. Spherical cautious optimizers. In Workshop on Scientific Methods for Understanding Deep Learning, 2026.
[14] Tim Salimans and Durk P. Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. Advances in Neural Information Processing Systems, 29, 2016.
[15] Noam Shazeer and Mitchell Stern. Adafactor: Adaptive learning rates with sublinear memory cost. In International Conference on Machine Learning, pages 4596–4604. PMLR, 2018.
[16] Byeongho Heo, Sanghyuk Chun, Seong Joon Oh, Dongyoon Han, Sangdoo Yun, Gyuwan Kim, Youngjung Uh, and Jung-Woo Ha. AdamP: Slowing down the slowdown for momentum optimizers on scale-invariant weights. arXiv preprint arXiv:2006.08217, 2020.
[17] Hao Chen, Jh Yuan, and Hanmin Zhang. Decoupled orthogonal dynamics: Regularization for deep network optimizers. In Workshop on Scientific Methods for Understanding Deep Learning, 2026.
[18] Xingyu Xie, Pan Zhou, Huan Li, Zhouchen Lin, and Shuicheng Yan. Adan: Adaptive Nesterov momentum algorithm for faster optimizing deep models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(12):9508–9520, 2024.
[19] Juntang Zhuang, Tommy Tang, Yifan Ding, Sekhar C. Tatikonda, Nicha Dvornek, Xenophon Papademetris, and James Duncan. AdaBelief optimizer: Adapting stepsizes by the belief in observed gradients. Advances in Neural Information Processing Systems, 33:18795–18806, 2020.
[20] Hong Liu, Zhiyuan Li, David Leo Wright Hall, Percy Liang, and Tengyu Ma. Sophia: A scalable stochastic second-order optimizer for language model pre-training. In The Twelfth International Conference on Learning Representations, 2024.
[21] Vineet Gupta, Tomer Koren, and Yoram Singer. Shampoo: Preconditioned stochastic tensor optimization. In International Conference on Machine Learning, pages 1842–1850. PMLR, 2018.
[22] Ge Yang, Edward Hu, Igor Babuschkin, Szymon Sidor, Xiaodong Liu, David Farhi, Nick Ryder, Jakub Pachocki, Weizhu Chen, and Jianfeng Gao. Tuning large neural networks via zero-shot hyperparameter transfer. Advances in Neural Information Processing Systems, 34:17084–17097, 2021.
[23] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
[24] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019.
[25] Kaiyue Wen, David Hall, Tengyu Ma, and Percy Liang. Fantastic pretraining optimizers and where to find them. arXiv preprint arXiv:2509.02046, 2025.
[26] Nicholas J. Higham. Functions of Matrices: Theory and Computation. SIAM, 2008.
[27] Yushun Zhang, Congliang Chen, Tian Ding, Ziniu Li, Ruoyu Sun, and Zhi-Quan Luo. Why Transformers need Adam: A Hessian perspective. Advances in Neural Information Processing Systems, 37:131786–131823, 2024.
[28] Behnam Neyshabur, Russ R. Salakhutdinov, and Nati Srebro. Path-SGD: Path-normalized optimization in deep neural networks. Advances in Neural Information Processing Systems, 28, 2015.
[29] Kaiyue Wen, Xingyu Dang, Kaifeng Lyu, Tengyu Ma, and Percy Liang. Fantastic pretraining optimizers and where to find them 2.1: Hyperball optimization, December 2025.
[30] Liliang Ren, Yang Liu, Yelong Shen, and Weizhu Chen. Rethinking language model scaling under transferable hypersphere optimization. arXiv preprint arXiv:2603.28743, 2026.
[31] Yang You, Jing Li, Sashank Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan Song, James Demmel, Kurt Keutzer, and Cho-Jui Hsieh. Large batch optimization for deep learning: Training BERT in 76 minutes. In International Conference on Learning Representations, 2020.
[32] Tian Xie, Haoming Luo, Haoyu Tang, Yiwen Hu, Jason Klein Liu, Qingnan Ren, Yang Wang, Wayne Xin Zhao, Rui Yan, Bing Su, et al. Controlled LLM training on spectral sphere. arXiv preprint arXiv:2601.08393, 2026.
[33] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning representations by back-propagating errors. Nature, 323(6088):533–536, 1986.
[34] Yufei Gu and Zeke Xie. Mano: Restriking manifold optimization for LLM training, 2026.
[35] Nicolas Boumal. An Introduction to Optimization on Smooth Manifolds. Cambridge University Press, 2023.
[36] Jh Yuan, Fangyuan Xie, Feiping Nie, and Xuelong Li. Riemannian optimization on relaxed indicator matrix manifold. In The Fourteenth International Conference on Learning Representations, 2026.
[37] Jh Yuan, Zhuo Liu, and Feiping Nie. Riemannian fuzzy k-means on product manifolds. In Non-Euclidean Foundation Models: Advancing AI Beyond Euclidean Frameworks, 2025.
[38] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020.