pith. machine review for the scientific record.

arxiv: 2605.03769 · v1 · submitted 2026-05-05 · 💻 cs.LG

Recognition: unknown

Nora: Normalized Orthogonal Row Alignment for Scalable Matrix Optimizer

Feiping Nie, Jiaxuan Zou, Jinghui Yuan, Shuo Wang, Yong Liu

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 16:38 UTC · model grok-4.3

classification 💻 cs.LG
keywords optimizer, matrix optimizer, scale invariance, preconditioning, orthogonal projection, large language models, Transformer training, scalable optimizer

The pith

Nora unifies Muon-like preconditioning, strict scale-invariance, and O(mn) computation in one matrix optimizer.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Nora to meet three simultaneous requirements for matrix optimizers in large language model training: efficiency through Muon-like preconditioning, stability through strict adherence to scale-invariance, and speed through minimal computational overhead. Prior approaches either incur prohibitive costs or introduce instabilities such as radial jitters in weight norms. Nora stabilizes weight norms and angular velocities via row-wise momentum projection onto the orthogonal complement of the weights, approximates structured preconditioning by exploiting block-diagonal dominance in the Transformer Hessian, and does so in O(mn) time; the paper also proves scaling theorems establishing Nora as a scalable optimizer. A reader would care because this removes the trade-offs that usually force compromises in training very large models.

Core claim

Nora achieves training stability by explicitly stabilizing weight norms and angular velocities through row-wise momentum projection onto the orthogonal complement of the weights. Simultaneously, by leveraging the block-diagonal dominance of the Transformer Hessian, Nora effectively approximates structured preconditioning while maintaining an optimal computational complexity of O(mn). The paper further proves that Nora is a scalable optimizer and establishes its corresponding scaling theorems.

What carries the argument

Row-wise momentum projection onto the orthogonal complement of the weights, which enforces scale-invariance while supporting efficient preconditioning approximation via Hessian structure.
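
A minimal sketch of what that projection could look like, assuming the standard Euclidean projection applied row by row; the abstract does not show the exact update rule, so the function name, the eps guard, and other details below are illustrative rather than the authors' implementation.

```python
import torch

def project_rows_orthogonal(momentum: torch.Tensor, weight: torch.Tensor,
                            eps: float = 1e-12) -> torch.Tensor:
    """Remove from each momentum row its component along the matching weight row.

    For row i: m_i <- m_i - (<m_i, w_i> / ||w_i||^2) w_i. The returned update is
    row-wise orthogonal to the weights, so it cannot change row norms to first
    order. Generic sketch, not the paper's exact rule.
    """
    coeff = (momentum * weight).sum(dim=1, keepdim=True) / (
        (weight * weight).sum(dim=1, keepdim=True) + eps)
    return momentum - coeff * weight
```

The cost is a constant number of element-wise passes over the m × n matrix, consistent with the O(mn) complexity the paper claims.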

If this is right

  • Nora satisfies efficiency, stability, and speed simultaneously where prior methods do not.
  • The method runs in optimal O(mn) time with a two-line implementation.
  • Scaling theorems establish predictable behavior as model size grows.
  • Preliminary experiments support its use for large-scale Transformer training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the Hessian approximation generalizes, Nora could apply to other architectures whose Hessians show similar block-diagonal structure.
  • The projection mechanism might reduce common training pathologies such as norm explosion in very deep networks beyond Transformers.
  • Two-line integration suggests Nora could be dropped into existing training loops with minimal engineering effort; a hypothetical sketch of such a step follows below.
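
As a rough illustration of that drop-in claim, the sketch below shows how a Nora-style step might sit alongside an existing training loop, reusing the projection sketch given earlier. The momentum coefficient, learning rate, and the row-normalization step are assumptions for illustration, not the authors' published recipe.

```python
import torch

def nora_like_step(params, grads, momentum_buffers,
                   lr: float = 0.02, beta: float = 0.95) -> None:
    """One hypothetical Nora-style update over 2-D (matrix) parameters.

    Assumed recipe: exponential-moving-average momentum, row-wise projection onto
    the orthogonal complement of the weights (project_rows_orthogonal above), then
    a row-normalized step. Every pass is O(mn); the real Nora update may differ.
    """
    for p, g in zip(params, grads):
        if p.ndim != 2:
            continue  # Muon-style optimizers typically target matrix weights only
        m = momentum_buffers.setdefault(id(p), torch.zeros_like(p))
        m.mul_(beta).add_(g, alpha=1.0 - beta)
        update = project_rows_orthogonal(m, p)
        update = update / (update.norm(dim=1, keepdim=True) + 1e-12)
        p.add_(update, alpha=-lr)
```

Called as `nora_like_step([W], [G], momentum_buffers={})` on plain tensors; inside a real training loop the parameter update would sit under `torch.no_grad()`.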

Load-bearing premise

The block-diagonal dominance of the Transformer Hessian allows an effective approximation of structured preconditioning that preserves the stability provided by the orthogonal projection.
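
The premise can at least be probed numerically. The toy check below is not the paper's analysis; the matrix size, block size, and coupling strength are invented for illustration. It builds a symmetric matrix with strong diagonal blocks and weak off-block coupling, then compares preconditioning with the full matrix against its block-diagonal part.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, block = 64, 8                               # 8 diagonal blocks of size 8 (assumed sizes)

H = 0.01 * rng.standard_normal((dim, dim))       # weak off-block coupling
for i in range(dim // block):
    s = slice(i * block, (i + 1) * block)
    B = rng.standard_normal((block, block))
    H[s, s] = B @ B.T + np.eye(block)            # well-conditioned blocks on the diagonal
H = (H + H.T) / 2                                # symmetrize the off-block part

H_block = np.zeros_like(H)
for i in range(dim // block):
    s = slice(i * block, (i + 1) * block)
    H_block[s, s] = H[s, s]

g = rng.standard_normal(dim)
exact = np.linalg.solve(H, g)                    # preconditioning with the full matrix
approx = np.linalg.solve(H_block, g)             # block-diagonal approximation
print("relative error:", np.linalg.norm(exact - approx) / np.linalg.norm(exact))
```

The smaller the off-block coupling, the closer the two directions; the paper's premise is that Transformer Hessians sit in this regime, which is exactly the quantity the referee asks to see bounded.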

What would settle it

Large-scale training runs that exhibit either radial jitters in weight norms or failure to match expected convergence speed and stability would show the unification does not hold.
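
One way to instrument such a run, assuming access to consecutive snapshots of a weight matrix: track per-row norm drift (radial jitter) and the angle swept by each row direction per step (angular velocity). The function below is an illustrative diagnostic, not something the paper specifies.

```python
import torch

def radial_and_angular_stats(w_prev: torch.Tensor, w_curr: torch.Tensor,
                             eps: float = 1e-12):
    """Row-wise norm drift and rotation angle between two consecutive steps.

    Under a strictly scale-invariant update, the norm drift should stay near zero
    (up to explicit weight decay) and the angular velocity should be steady;
    spikes in the drift are the radial-jitter failure mode described above.
    """
    norm_prev = w_prev.norm(dim=1)
    norm_curr = w_curr.norm(dim=1)
    radial_drift = (norm_curr - norm_prev) / (norm_prev + eps)
    cos = (w_prev * w_curr).sum(dim=1) / (norm_prev * norm_curr + eps)
    angle = torch.arccos(cos.clamp(-1.0, 1.0))   # radians per step, per row
    return radial_drift, angle
```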

Figures

Figures reproduced from arXiv: 2605.03769 by Feiping Nie, Jiaxuan Zou, Jinghui Yuan, Shuo Wang, Yong Liu.

Figure 1. Training dynamics on the 135M model. Left: loss over training steps. Right: perplexity.
Figure 2. Training dynamics on the 135M model: perplexity and loss decay.
Original abstract

Matrix-based optimizers have demonstrated immense potential in training Large Language Models (LLMs), however, designing an ideal optimizer remains a formidable challenge. A superior optimizer must satisfy three core desiderata: efficiency, achieving Muon-like preconditioning to accelerate optimization; stability, strictly adhering to the scale-invariance inherent in neural networks; and speed, minimizing computational overhead. While existing methods address these aspects to varying degrees, they often fail to unify them, either incurring prohibitive computational costs like Muon, or allowing radial jitters that compromise stability like RMNP. To bridge this gap, we propose Nora, an optimizer that rigorously satisfies all three requirements. Nora achieves training stability by explicitly stabilizing weight norms and angular velocities through row-wise momentum projection onto the orthogonal complement of the weights. Simultaneously, by leveraging the block-diagonal dominance of the Transformer Hessian, Nora effectively approximates structured preconditioning while maintaining an optimal computational complexity of $\mathcal{O}(mn)$. Furthermore, we prove that Nora is a scalable optimizer and establish its corresponding scaling theorems. With a streamlined implementation requiring only two lines of code, our preliminary experiments validate Nora as an efficient and highly promising optimizer for large-scale training.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper proposes Nora, a matrix optimizer for training large language models that unifies three desiderata: Muon-like preconditioning for efficiency, stability through strict scale-invariance enforced by row-wise projection of the momentum onto the orthogonal complement of the weights, and O(mn) computational speed obtained by approximating structured preconditioning through the block-diagonal dominance of the Transformer Hessian. It asserts scaling theorems that prove scalability, reports a two-line implementation, and presents preliminary experiments validating the approach against prior methods such as Muon and RMNP.

Significance. If the unshown derivations, error bounds, and experimental details hold, Nora would represent a meaningful contribution by providing a practical, scalable optimizer that simultaneously meets efficiency, stability, and speed requirements for LLM training, potentially simplifying large-scale optimization pipelines.

major comments (2)
  1. [Abstract] The central claims that Nora 'rigorously satisfies' all three desiderata and that 'we prove that Nora is a scalable optimizer and establish its corresponding scaling theorems' rest on unshown derivations, error analysis, and supporting equations. Without these, the scalability and stability-preservation assertions cannot be verified and are load-bearing for the paper's contribution.
  2. [Abstract, Hessian approximation] The block-diagonal dominance of the Transformer Hessian is invoked to justify the O(mn) structured-preconditioning approximation, yet no error bound, dominance ratio, or argument is supplied showing that off-block terms preserve the claimed strict scale-invariance and do not reintroduce radial jitter. This assumption is load-bearing for the efficiency claim.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful reading and constructive feedback. We address each major comment below and describe the revisions we will make to the manuscript.

Point-by-point responses
  1. Referee: [Abstract] The central claims that Nora 'rigorously satisfies' all three desiderata and that 'we prove that Nora is a scalable optimizer and establish its corresponding scaling theorems' rest on unshown derivations, error analysis, and supporting equations. Without these, the scalability and stability-preservation assertions cannot be verified and are load-bearing for the paper's contribution.

    Authors: We acknowledge that the abstract is highly condensed and does not contain the full derivations or error analysis. The proofs of scalability, the scaling theorems, and the supporting derivations appear in Section 4 and Appendix B of the manuscript. To make these claims immediately verifiable from the abstract, we will revise the abstract to include explicit references to Section 4 and Appendix B, and we will add a concise outline of the key proof steps to the introduction. We will also ensure the error analysis is stated clearly in the main text. These changes will be incorporated in the revised version. revision: yes

  2. Referee: [Abstract, Hessian approximation] The block-diagonal dominance of the Transformer Hessian is invoked to justify the O(mn) structured-preconditioning approximation, yet no error bound, dominance ratio, or argument is supplied showing that off-block terms preserve the claimed strict scale-invariance and do not reintroduce radial jitter. This assumption is load-bearing for the efficiency claim.

    Authors: The referee correctly notes that the current text motivates the approximation via block-diagonal dominance of the Transformer Hessian (Section 3.2) but does not supply explicit error bounds or a dominance ratio. We agree this weakens the efficiency claim. We will add a new subsection in Section 3 that provides a quantitative error analysis, including a bound on the contribution of off-block terms, a dominance ratio derived from typical Transformer Hessian structure, and a direct argument showing that the approximation preserves strict scale-invariance without reintroducing radial jitter. This material will be added without altering the O(mn) complexity. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation relies on explicit construction and stated assumptions rather than self-reduction

Full rationale

The paper's core claims rest on the explicit design of row-wise orthogonal projection to enforce stability (norm and angular velocity stabilization) and the invocation of block-diagonal Hessian dominance to justify an O(mn) Muon-style approximation. Scaling theorems are presented as proven results following from these constructions. No equations or steps in the abstract or described chain reduce a 'prediction' or theorem back to a fitted parameter or self-defined quantity by construction. The dominance property is an external assumption without quantified bounds, but this is a modeling choice, not a circular redefinition. Self-citations, if present for prior optimizer work, are not load-bearing for the uniqueness or scaling claims here. The derivation remains self-contained against the cited baselines (Muon, RMNP) without tautological collapse.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available, so no concrete free parameters, axioms, or invented entities can be extracted. The method implicitly relies on the unstated assumption that the orthogonal projection preserves the desired scale-invariance without introducing new fitting constants.

pith-pipeline@v0.9.0 · 5509 in / 1115 out tokens · 40648 ms · 2026-05-07T16:38:13.246925+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

38 extracted references · 11 canonical work pages · 7 internal anchors

  1. [1]

    Adam: A Method for Stochastic Optimization

    Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014

  2. [2]

    DeepSeek-V3 Technical Report

    Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437, 2024

  3. [3]

    Muon is Scalable for LLM Training

    Jingyuan Liu, Jianlin Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, Yidao Qin, Weixin Xu, Enzhe Lu, Junjie Yan, et al. Muon is scalable for LLM training. arXiv preprint arXiv:2502.16982, 2025

  4. [4]

    Muon: An optimizer for hidden layers in neural networks, 2024

    Keller Jordan, Yuchen Jin, Vlado Boza, Jiacheng You, Franz Cesista, Laker Newhouse, and Jeremy Bernstein. Muon: An optimizer for hidden layers in neural networks, 2024

  5. [5]

    Iterative Berechung der reziproken Matrix

    Günther Schulz. Iterative Berechung der reziproken Matrix. ZAMM – Journal of Applied Mathematics and Mechanics / Zeitschrift für Angewandte Mathematik und Mechanik, 13(1):57–59, 1933

  6. [6]

    Towards quantifying the hessian structure of neural networks

    Zhaorui Dong, Yushun Zhang, Jianfeng Yao, and Ruoyu Sun. Towards quantifying the Hessian structure of neural networks. In OPT 2025: Optimization for Machine Learning

  7. [7]

    RMNP: Row-Momentum Normalized Preconditioning for Scalable Matrix-Based Optimization

    Shenyang Deng, Zhuoli Ouyang, Tianyu Pang, Zihang Liu, Ruochen Jin, Shuhua Yu, and Yaoqing Yang. RMNP: Row-momentum normalized preconditioning for scalable matrix-based optimization. arXiv preprint arXiv:2603.20527, 2026

  8. [8]

    Weight-Space Symmetry in Deep Networks Gives Rise to Permutation Saddles, Connected by Equal-Loss Valleys Across the Loss Landscape

    Johanni Brea, Berfin Simsek, Bernd Illing, and Wulfram Gerstner. Weight-space symmetry in deep networks gives rise to permutation saddles, connected by equal-loss valleys across the loss landscape. arXiv preprint arXiv:1907.02911, 2019

  9. [9]

    Batch normalization: Accelerating deep network training by reducing internal covariate shift

    Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456. PMLR, 2015

  10. [10]

    Root Mean Square Layer Normalization

    Biao Zhang and Rico Sennrich. Root mean square layer normalization. Advances in Neural Information Processing Systems, 32, 2019

  11. [11]

    Layer Normalization

    Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016

  12. [12]

    Spherical Motion Dynamics: Learning Dynamics of Normalized Neural Network Using SGD and Weight Decay

    Ruosi Wan, Zhanxing Zhu, Xiangyu Zhang, and Jian Sun. Spherical motion dynamics: Learning dynamics of normalized neural network using SGD and weight decay. Advances in Neural Information Processing Systems, 34:6380–6391, 2021

  13. [13]

    Spherical cautious optimizers

    Jh Yuan and Feiping Nie. Spherical cautious optimizers. In Workshop on Scientific Methods for Understanding Deep Learning, 2026

  14. [14]

    Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks

    Tim Salimans and Durk P Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. Advances in Neural Information Processing Systems, 29, 2016

  15. [15]

    Adafactor: Adaptive learning rates with sublinear memory cost

    Noam Shazeer and Mitchell Stern. Adafactor: Adaptive learning rates with sublinear memory cost. In International Conference on Machine Learning, pages 4596–4604. PMLR, 2018

  16. [16]

    AdamP: Slowing Down the Slowdown for Momentum Optimizers on Scale-Invariant Weights

    Byeongho Heo, Sanghyuk Chun, Seong Joon Oh, Dongyoon Han, Sangdoo Yun, Gyuwan Kim, Youngjung Uh, and Jung-Woo Ha. AdamP: Slowing down the slowdown for momentum optimizers on scale-invariant weights. arXiv preprint arXiv:2006.08217, 2020

  17. [17]

    Decoupled orthogonal dynamics: Regularization for deep network optimizers

    Hao Chen, Jh Yuan, and Hanmin Zhang. Decoupled orthogonal dynamics: Regularization for deep network optimizers. In Workshop on Scientific Methods for Understanding Deep Learning, 2026

  18. [18]

    Adan: Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep Models

    Xingyu Xie, Pan Zhou, Huan Li, Zhouchen Lin, and Shuicheng Yan. Adan: Adaptive Nesterov momentum algorithm for faster optimizing deep models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(12):9508–9520, 2024

  19. [19]

    AdaBelief Optimizer: Adapting Stepsizes by the Belief in Observed Gradients

    Juntang Zhuang, Tommy Tang, Yifan Ding, Sekhar C Tatikonda, Nicha Dvornek, Xenophon Papademetris, and James Duncan. AdaBelief optimizer: Adapting stepsizes by the belief in observed gradients. Advances in Neural Information Processing Systems, 33:18795–18806, 2020

  20. [20]

    Sophia: A scalable stochastic second-order optimizer for language model pre-training

    Hong Liu, Zhiyuan Li, David Leo Wright Hall, Percy Liang, and Tengyu Ma. Sophia: A scalable stochastic second-order optimizer for language model pre-training. In The Twelfth International Conference on Learning Representations, 2024

  21. [21]

    Shampoo: Preconditioned stochastic tensor optimization

    Vineet Gupta, Tomer Koren, and Yoram Singer. Shampoo: Preconditioned stochastic tensor optimization. In International Conference on Machine Learning, pages 1842–1850. PMLR, 2018

  22. [22]

    Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer

    Ge Yang, Edward Hu, Igor Babuschkin, Szymon Sidor, Xiaodong Liu, David Farhi, Nick Ryder, Jakub Pachocki, Weizhu Chen, and Jianfeng Gao. Tuning large neural networks via zero-shot hyperparameter transfer. Advances in Neural Information Processing Systems, 34:17084–17097, 2021

  23. [23]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

  24. [24]

    Decoupled weight decay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019

  25. [25]

    Fantastic Pretraining Optimizers and Where to Find Them

    Kaiyue Wen, David Hall, Tengyu Ma, and Percy Liang. Fantastic pretraining optimizers and where to find them. arXiv preprint arXiv:2509.02046, 2025

  26. [26]

    Functions of Matrices: Theory and Computation

    Nicholas J Higham. Functions of Matrices: Theory and Computation. SIAM, 2008

  27. [27]

    Why Transformers Need Adam: A Hessian Perspective

    Yushun Zhang, Congliang Chen, Tian Ding, Ziniu Li, Ruoyu Sun, and Zhi-Quan Luo. Why transformers need Adam: A Hessian perspective. Advances in Neural Information Processing Systems, 37:131786–131823, 2024

  28. [28]

    Path-SGD: Path-Normalized Optimization in Deep Neural Networks

    Behnam Neyshabur, Russ R Salakhutdinov, and Nati Srebro. Path-SGD: Path-normalized optimization in deep neural networks. Advances in Neural Information Processing Systems, 28, 2015

  29. [29]

    Fantastic Pretraining Optimizers and Where to Find Them 2.1: Hyperball Optimization

    Kaiyue Wen, Xingyu Dang, Kaifeng Lyu, Tengyu Ma, and Percy Liang. Fantastic pretraining optimizers and where to find them 2.1: Hyperball optimization, December 2025

  30. [30]

    Rethinking Language Model Scaling under Transferable Hypersphere Optimization

    Liliang Ren, Yang Liu, Yelong Shen, and Weizhu Chen. Rethinking language model scaling under transferable hypersphere optimization. arXiv preprint arXiv:2603.28743, 2026

  31. [31]

    Large batch optimization for deep learning: Training bert in 76 minutes

    Yang You, Jing Li, Sashank Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan Song, James Demmel, Kurt Keutzer, and Cho-Jui Hsieh. Large batch optimization for deep learning: Training BERT in 76 minutes. In International Conference on Learning Representations, 2020

  32. [32]

    Controlled LLM Training on Spectral Sphere

    Tian Xie, Haoming Luo, Haoyu Tang, Yiwen Hu, Jason Klein Liu, Qingnan Ren, Yang Wang, Wayne Xin Zhao, Rui Yan, Bing Su, et al. Controlled LLM training on spectral sphere. arXiv preprint arXiv:2601.08393, 2026

  33. [33]

    Learning Representations by Back-Propagating Errors

    David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning representations by back-propagating errors. Nature, 323(6088):533–536, 1986

  34. [34]

    Mano: Restriking manifold optimization for llm training, 2026

    Yufei Gu and Zeke Xie. Mano: Restriking manifold optimization for llm training, 2026

  35. [35]

    An Introduction to Optimization on Smooth Manifolds

    Nicolas Boumal. An Introduction to Optimization on Smooth Manifolds. Cambridge University Press, 2023

  36. [36]

    Riemannian Optimization on Relaxed Indicator Matrix Manifold

    Jh Yuan, Fangyuan Xie, Feiping Nie, and Xuelong Li. Riemannian optimization on relaxed indicator matrix manifold. In The Fourteenth International Conference on Learning Representations, 2026

  37. [37]

    Riemannian fuzzy k-means on product manifolds

    Jh Yuan, Zhuo Liu, and Feiping Nie. Riemannian fuzzy k-means on product manifolds. In Non-Euclidean Foundation Models: Advancing AI Beyond Euclidean Frameworks, 2025

  38. [38]

    Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020