pith. machine review for the scientific record.

arxiv: 2603.20527 · v3 · submitted 2026-03-20 · 💻 cs.LG

Recognition: no theorem link

RMNP: Row-Momentum Normalized Preconditioning for Scalable Matrix-Based Optimization

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 07:46 UTC · model grok-4.3

classification 💻 cs.LG
keywords RMNP · preconditioning · Muon optimizer · row-wise normalization · transformer training · non-convex convergence · LLM pretraining · adaptive optimization

The pith

RMNP replaces Newton-Schulz orthogonalization with row-wise L2 normalization to match Muon performance at linear cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Preconditioned methods such as Muon capture curvature information that speeds up deep network training, but they rely on iterative orthogonalization whose cost for an m×n weight matrix is O(mn·min(m,n)). RMNP replaces this step with a direct row-wise ℓ2 normalization, applied along the input dimension, of the momentum-adjusted gradient. The substitution rests on the empirically observed block-diagonal structure of transformer Hessians, under which the two operations become asymptotically equivalent. The change lowers per-step preconditioning complexity to O(mn), and the paper proves non-convex convergence rates matching Muon's that attain minimax optimality. Experiments on large language model pretraining show that training curves remain competitive with Muon while preconditioning wall-clock time drops substantially.

Core claim

RMNP shows that row-momentum normalized preconditioning, implemented as a simple row-wise ℓ2 normalization along the input dimension, delivers optimization performance competitive with Muon while reducing preconditioning complexity from O(mn·min(m,n)) to O(mn) and preserving the same minimax-optimal convergence complexity for non-convex problems.

What carries the argument

Row-wise ℓ2 normalization of momentum-adjusted gradient matrices, used as a direct surrogate for Newton-Schulz orthogonalization under the observed diagonal-block Hessian structure of transformer layers.
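
To make the substitution concrete, here is a minimal sketch that sets a Muon-style Newton-Schulz orthogonalization next to a row-normalized momentum step of the kind the paper describes. It is an illustration, not the authors' released implementation: the function names, momentum coefficient, and learning rate are placeholders, and the quintic coefficients follow the public Muon reference code rather than anything specified in this paper.

```python
import torch

def newton_schulz_orthogonalize(M, steps=5, eps=1e-7):
    # Muon-style Newton-Schulz iteration (sketch): pushes the singular values of M
    # toward 1. Each step is a few m x n matrix products, so the whole routine costs
    # O(mn * min(m, n)). Assumes m <= n for brevity (Muon transposes otherwise).
    X = M / (M.norm() + eps)
    a, b, c = 3.4445, -4.7750, 2.0315  # quintic coefficients from the public Muon code
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X

def row_normalized_update(W, grad, momentum, beta=0.95, lr=0.02, eps=1e-8):
    # Hypothetical sketch of the RMNP idea: accumulate momentum, then normalize each
    # row of the momentum matrix (length d_in) to unit L2 norm, an O(mn) operation,
    # and take a step. Hyperparameters here are placeholders, not the paper's values.
    momentum.mul_(beta).add_(grad)
    row_norms = momentum.norm(dim=1, keepdim=True)
    W.add_(momentum / (row_norms + eps), alpha=-lr)
    return W, momentum
```

In a Muon-style loop the orthogonalized matrix would serve as the update direction in place of the row-normalized one; everything else stays the same, which is what lets the paper reuse Muon-style convergence arguments once the two operations are treated as equivalent.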

If this is right

  • Per-iteration preconditioning cost drops from O(mn·min(m,n)) to O(mn) for each m×n weight matrix.
  • Convergence guarantees in the non-convex setting remain identical to those established for Muon.
  • Wall-clock preconditioning time decreases while training progress on large language models stays comparable.
  • The method applies directly to any matrix-based update in deep networks that exhibit similar Hessian structure.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same normalization shortcut may work for other architectures once their Hessian block structure is verified.
  • Lower per-step overhead could allow larger batch sizes or longer context lengths on fixed hardware.
  • Full orthogonalization may be unnecessary overhead in many practical loss landscapes.
  • The approach invites direct comparisons with even simpler first-order normalizers such as Adam variants.

Load-bearing premise

Orthogonalization and row-wise L2 normalization become equivalent for the block-diagonal Hessians that arise in transformer layers.

What would settle it

A side-by-side run on a non-transformer architecture whose Hessian lacks the claimed block-diagonal structure, checking whether RMNP's optimization performance falls measurably behind Muon.
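
Short of that experiment, the premise can be probed directly with the diagnostic the paper itself tracks: the row-wise diagonal dominance ratio of equations (10)-(13) in its appendix (excerpted in the reference graph below). The sketch assumes $G = V_t V_t^\top$, which is consistent with the stated $G_{ii} = \|V_{t,i:}\|_2^2$; the function name and return format are illustrative, not the authors' code.

```python
import torch

def diagonal_dominance_ratios(V):
    # Row-wise diagonal dominance ratios as defined in the paper's appendix:
    #   r_i = G_ii / ( (1/(m-1)) * sum_{j != i} |G_ij| ),  with G_ii = ||V_i,:||_2^2.
    # Assumes G = V @ V.T, consistent with that definition of G_ii.
    m = V.shape[0]
    G = V @ V.T
    diag = G.diagonal()
    off_diag_mean = (G.abs().sum(dim=1) - diag.abs()) / (m - 1)
    r = diag / off_diag_mean
    return r.mean().item(), r.min().item(), r.max().item()  # r_avg, r_min, r_max
```

Ratios sitting well above 1, as in Figures 3, 4, and 7 through 10, are the empirical signature of the structure the equivalence argument leans on; an architecture where these ratios hover near or below 1 would be the natural stress test.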

Figures

Figures reproduced from arXiv: 2603.20527 by Ruochen Jin, Shenyang Deng, Shuhua Yu, Tianyu Pang, Yaoqing Yang, Zhuoli Ouyang, Zihang Liu.

Figure 1: Time overhead comparison. The figure illustrates the wall-clock time of 100 computation steps for the preconditioning process of RMNP versus Muon.

Figure 2: Comparison among Transformer layerwise Hessian, preconditioner for …

Figure 3: Per-parameter diagonal dominance ratios r_avg, r_min, r_max (rows) for three representative matrix parameters (columns) during GPT-2 Small (125M), GPT-2 Medium (355M), and GPT-2 Large (770M) pre-training. Transparent curves: raw values; solid curves: smoothed with window size 50. Red dashed line: y = 1 threshold.

Figure 4: Global diagonal dominance ratios r_avg, r_min, r_max averaged across all matrix parameters during GPT-2 Small (125M), GPT-2 Medium (355M), and GPT-2 Large (770M) pre-training. Y-axis in log scale. Transparent curves: raw values; solid curves: smoothed with window size 50. Red dashed line: y = 1 threshold. The metrics quickly rise above 1 after warm-up and remain mostly above 1, confirming strong diagonal domi…

Figure 6: Results for LLaMA: 60M trained with 1B to…

Figure 7: Per-parameter diagonal dominance ratios …

Figure 8: Global diagonal dominance ratios r_avg, r_min, r_max averaged across all matrix parameters during GPT-2 Small (125M), GPT-2 Medium (355M), and GPT-2 Large (770M) pre-training. Y-axis in log scale. Transparent curves: raw values; solid curves: smoothed with window size 50. Red dashed line: y = 1 threshold.

Figure 9: Per-parameter diagonal dominance ratios …

Figure 10: Global diagonal dominance ratios r_avg, r_min, r_max averaged across all matrix parameters during LLaMA 60M, LLaMA 130M, and LLaMA 350M pre-training. Y-axis in log scale. Transparent curves: raw values; solid curves: smoothed with window size 50. Red dashed line: y = 1 threshold.

Figure 11: Results for GPT-2 on FineWeb-Edu-100B: Small (125M) trained with 5B tokens; Medium (355M) …
original abstract

Preconditioned adaptive methods have gained significant attention for training deep neural networks, as they capture rich curvature information of the loss landscape. The central challenge in this field lies in balancing preconditioning effectiveness with computational efficiency of implementing the preconditioner. Among recent advances, Muon stands out by using Newton-Schulz iteration to obtain preconditioned updates without explicitly constructing the preconditioning matrix. Despite its advantages, the efficiency of Muon still leaves room for further improvement. In this paper, we introduce RMNP (Row Momentum Normalized Preconditioning), an optimizer that replaces Newton-Schulz iteration with a simple row-wise ($d_{\text{in}}$) $\ell_2$ normalization operation, motivated by the empirically observed diagonal block structure of the Transformer layerwise Hessian. We empirically verified that orthogonalization and row-wise (on input dim) $\ell_2$ normalization are asymptotically equivalent in the case of the transformer. This substitution reduces the per-iteration computational complexity from ${O}(mn\cdot\min(m,n))$ to ${O}(mn)$ for an $m\times n$ weight matrix while maintaining comparable optimization performance. Theoretically, we establish convergence guarantees for RMNP in the non-convex setting that match recent results for Muon optimizers, achieving the minimax optimal complexity. Extensive experiments on large language model pretraining show that RMNP delivers competitive optimization performance compared with Muon while substantially reducing preconditioning wall-clock time. Our code is available at https://github.com/Dominator-Index/RMNP.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces RMNP, an optimizer that replaces the Newton-Schulz iteration in Muon-style preconditioning with row-wise (d_in) ℓ2 normalization of the momentum-adjusted gradient of each weight matrix. Motivated by the empirically observed diagonal-block structure of the Transformer layerwise Hessian and an empirical verification that the two operations are asymptotically equivalent for transformers, RMNP reduces per-iteration complexity from O(mn min(m,n)) to O(mn). It establishes non-convex convergence guarantees matching recent Muon results at minimax-optimal complexity and reports competitive LLM pretraining performance with substantially lower preconditioning wall-clock time.

Significance. If the empirical equivalence between Newton-Schulz orthogonalization and row-wise normalization holds with small error on the targeted architectures, RMNP would provide a simpler, faster drop-in replacement for Muon while preserving its theoretical guarantees and practical effectiveness. The O(mn) complexity and matching convergence rate would be a meaningful efficiency gain for large-scale matrix-based optimization in deep learning.

major comments (3)
  1. [Motivation and §3 (Method)] The substitution of Newton-Schulz orthogonalization by row-wise ℓ2 normalization rests on the claim of asymptotic equivalence for transformers, justified by the observed diagonal-block Hessian structure. No quantitative bounds on ||U_NS - U_row|| (operator or Frobenius norm; a measurement sketch follows the comment list below) or formal proof that the difference vanishes under the Hessian assumption are supplied; this equivalence is load-bearing for the headline claim that RMNP matches Muon performance at reduced cost.
  2. [Theorem 1] Theorem 1 (convergence analysis): the proof is stated to be direct for RMNP and independent of the specific normalization once equivalence is granted, yet the manuscript does not clarify whether the analysis requires the preconditioner to be exactly orthogonal or only approximately so; without this, the minimax optimality claim for the deployed RMNP variant is not fully supported.
  3. [§5 (Experiments)] §5 (Experiments), timing tables: the reported wall-clock reductions lack error bars, standard deviations, or statistics over multiple independent runs, making it difficult to assess the reliability of the 'substantially reducing' preconditioning time claim.
minor comments (1)
  1. [Abstract] The abstract states code availability but omits license information or instructions for exact reproduction of the reported timing and performance numbers.
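
On major comment 1: a minimal sketch of how the requested gap could be tracked during training, reusing any Newton-Schulz routine (such as the one sketched under "What carries the argument"). The function name and the relative-Frobenius formulation are illustrative choices, not something the paper or the report specifies.

```python
import torch

def ns_vs_rownorm_gap(M, orthogonalize, eps=1e-8):
    # Relative Frobenius gap between the Newton-Schulz orthogonalized update and the
    # row-normalized update for the same momentum matrix M. Logging this per layer and
    # per step would quantify the approximation that the equivalence claim relies on.
    U_ns = orthogonalize(M)
    U_row = M / (M.norm(dim=1, keepdim=True) + eps)
    return ((U_ns - U_row).norm() / (U_ns.norm() + eps)).item()
```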

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments on our manuscript. We address each major comment point by point below, indicating the revisions we will incorporate.

point-by-point responses
  1. Referee: The substitution of Newton-Schulz orthogonalization by row-wise ℓ2 normalization rests on the claim of asymptotic equivalence for transformers, justified by the observed diagonal-block Hessian structure. No quantitative bounds on ||U_NS - U_row|| (operator or Frobenius norm) or formal proof that the difference vanishes under the Hessian assumption are supplied; this equivalence is load-bearing for the headline claim that RMNP matches Muon performance at reduced cost.

    Authors: We acknowledge that the manuscript relies on empirical verification rather than quantitative bounds or a formal proof of the asymptotic equivalence. The equivalence is motivated by the observed diagonal-block Hessian structure in transformers and is supported by our empirical checks showing near-identical behavior in practice. In the revised manuscript, we will expand §3 with additional quantitative measurements of ||U_NS - U_row|| (both Frobenius and operator norms) across layers, model scales, and training stages to better characterize the approximation error. A rigorous theoretical proof remains an open question for future work. revision: partial

  2. Referee: Theorem 1 (convergence analysis): the proof is stated to be direct for RMNP and independent of the specific normalization once equivalence is granted, yet the manuscript does not clarify whether the analysis requires the preconditioner to be exactly orthogonal or only approximately so; without this, the minimax optimality claim for the deployed RMNP variant is not fully supported.

    Authors: We agree that the manuscript should explicitly address the exact versus approximate orthogonality requirement. The proof of Theorem 1 follows the Muon analysis and assumes an exactly orthogonal preconditioner to obtain the stated minimax-optimal rates. In the revision, we will add a clarifying remark and short extension in the theorem statement and proof sketch noting that the guarantees extend to preconditioners with bounded deviation from orthogonality, with the additional error controlled by the empirical approximation quality observed for RMNP. This will support the practical claim of matching performance at reduced cost. revision: yes

  3. Referee: §5 (Experiments), timing tables: the reported wall-clock reductions lack error bars, standard deviations, or statistics over multiple independent runs, making it difficult to assess the reliability of the 'substantially reducing' preconditioning time claim.

    Authors: We agree that statistical reporting would improve the reliability assessment of the timing results. The current tables reflect single-run measurements. In the revised version, we will re-execute the preconditioning timing benchmarks over at least three independent runs and report means together with standard deviations in the tables of §5. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper presents the core substitution of Newton-Schulz orthogonalization by row-wise ℓ2 normalization as an empirical approximation justified by observed diagonal-block Hessian structure in transformers and direct verification of asymptotic equivalence on those models. Convergence guarantees are derived directly for the RMNP update rule and stated to match existing Muon results without any self-referential definitions, fitted parameters renamed as predictions, or load-bearing steps that reduce to the paper's own inputs by construction. The complexity reduction follows immediately from replacing the iterative orthogonalization with a single normalization pass. No self-citation chains or ansatzes smuggled via prior work appear in the provided derivation; the central claims remain independent of the specific normalization once the empirical equivalence is granted.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on two domain assumptions about transformer Hessians and the asymptotic equivalence of orthogonalization and row normalization; no free parameters or new invented entities are introduced.

axioms (2)
  • domain assumption Transformer layerwise Hessians exhibit a diagonal block structure.
    Used to motivate why row-wise normalization suffices.
  • domain assumption Orthogonalization and row-wise (input-dim) L2 normalization are asymptotically equivalent for transformers.
    Empirically verified in the paper and used to justify the substitution.

pith-pipeline@v0.9.0 · 5591 in / 1235 out tokens · 43328 ms · 2026-05-15T07:46:35.514216+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Nora: Normalized Orthogonal Row Alignment for Scalable Matrix Optimizer

    cs.LG 2026-05 unverdicted novelty 4.0

    Nora is a matrix optimizer that stabilizes weight norms and angular velocities through row-wise momentum projection onto the orthogonal complement of the weights while approximating structured preconditioning with O(m...

Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages · cited by 1 Pith paper · 7 internal anchors

  1. [1]

    Adaptive subgradient methods for online learning and stochastic optimization.Journal of Machine Learning Research, 12(61):2121–2159, 2011

    John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization.Journal of Machine Learning Research, 12(61):2121–2159, 2011

  2. [2]

    Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2):26–31, 2012

    Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2):26–31, 2012

  3. [3]

    Adam: A Method for Stochastic Optimization

    Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization.arXiv preprint arXiv:1412.6980, 2014

  4. [4]

    Decoupled weight decay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. InInternational Conference on Learning Representations, 2019

  5. [5]

    The Potential of Second-Order Optimization for LLMs: A Study with Full Gauss-Newton

    Natalie Abreu, Nikhil Vyas, Sham M. Kakade, and Depen Morwani. The potential of second-order optimization for LLMs: A study with full Gauss-Newton.arXiv preprint arXiv:2510.09378, 2025

  6. [6]

    Optimizing neural networks with Kronecker-factored approximate curvature

    James Martens and Roger Grosse. Optimizing neural networks with Kronecker-factored approximate curvature. InInternational Conference on Machine Learning (ICML), volume 37 ofProceedings of Machine Learning Research, pages 2408–2417. PMLR, 2015

  7. [7]

    Preconditioned stochastic gradient descent.IEEE Transactions on Neural Networks and Learning Systems, 29(5):1454–1466, 2018

    Xi-Lin Li. Preconditioned stochastic gradient descent.IEEE Transactions on Neural Networks and Learning Systems, 29(5):1454–1466, 2018

  8. [8]

    Shampoo: Preconditioned stochastic tensor optimization

    Vineet Gupta, Tomer Koren, and Yoram Singer. Shampoo: Preconditioned stochastic tensor optimization. InInternational Conference on Machine Learning (ICML), volume 80 ofProceedings of Machine Learning Research, pages 1842–1850. PMLR, 2018

  9. [9]

    Kronecker-factored quasi-newton methods for deep learning.arXiv preprint arXiv:2102.06737, 2021

    Yi Ren, Achraf Bahamou, and Donald Goldfarb. Kronecker-factored quasi-newton methods for deep learning.arXiv preprint arXiv:2102.06737, 2021

  10. [10]

    ASGO: Adaptive structured gradient optimization

    Kang An, Yuxing Liu, Rui Pan, Yi Ren, Shiqian Ma, Donald Goldfarb, and Tong Zhang. ASGO: Adaptive structured gradient optimization. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URLhttps://openreview.net/forum?id=fru52tkjHf

  11. [11]

    Muon: An optimizer for hidden layers in neural networks

    Keller Jordan, Yuchen Jin, Vlado Boza, Jiacheng You, Franz Cesista, Laker Newhouse, and Jeremy Bernstein. Muon: An optimizer for hidden layers in neural networks. https://kellerjordan.github.io/posts/muon/, 2024

  12. [12]

    Ran Tian and Ankur P. Parikh. Amos: An Adam-style optimizer with adaptive weight decay towards model-oriented scale.arXiv preprint arXiv:2210.11693, 2022

  13. [13]

    Nikhil Vyas, Depen Morwani, Rosie Zhao, Itai Shapira, David Brandfonbrener, Lucas Janson, and Sham M. Kakade. SOAP: Improving and stabilizing shampoo using adam for language modeling. InThe Thirteenth International Conference on Learning Representations, 2025. URLhttps://openreview. net/forum?id=IDxZhXrpNf

  14. [14]

    AdaMuon: Adaptive Muon optimizer.arXiv preprint arXiv:2507.11005, 2025

    Chongjie Si, Daquan Zhang, and Wei Shen. AdaMuon: Adaptive Muon optimizer.arXiv preprint arXiv:2507.11005, 2025

  15. [15]

    COSMOS: A hybrid adaptive optimizer for memory-efficient training of LLMs.arXiv preprint arXiv:2502.17410, 2025

    Liming Liu, Zhenghao Xu, Zixuan Zhang, Hao Kang, Zichong Li, Chen Liang, Weizhu Chen, and Tuo Zhao. COSMOS: A hybrid adaptive optimizer for memory-efficient training of LLMs.arXiv preprint arXiv:2502.17410, 2025

  16. [16]

    On the Convergence Analysis of Muon

    Wei Shen, Ruichuan Huang, Minhui Huang, Cong Shen, and Jiawei Zhang. On the convergence analysis of muon.arXiv preprint arXiv:2505.23737, 2025

  17. [17]

    Convergence of Muon with Newton-Schulz, 2026

    Gyu Yeol Kim and Min hwan Oh. Convergence of Muon with Newton-Schulz, 2026. URL https://arxiv.org/abs/2601.19156

  18. [18]

    Why transformers need adam: A hessian perspective.Advances in neural information processing systems, 37:131786–131823, 2024

    Yushun Zhang, Congliang Chen, Tian Ding, Ziniu Li, Ruoyu Sun, and Zhiquan Luo. Why transformers need adam: A hessian perspective.Advances in neural information processing systems, 37:131786–131823, 2024

  19. [19]

    Towards quantifying the Hessian structure of neural networks. arXiv preprint arXiv:2505.02809, 2025

    Zhaorui Dong, Yushun Zhang, Jianfeng Yao, and Ruoyu Sun. Towards quantifying the Hessian structure of neural networks. arXiv preprint arXiv:2505.02809, 2025

  20. [20]

    Lower bounds for non-convex stochastic optimization.Mathematical Programming, 199(1):165–214, 2023

    Yossi Arjevani, Yair Carmon, John C Duchi, Dylan J Foster, Nathan Srebro, and Blake Woodworth. Lower bounds for non-convex stochastic optimization.Mathematical Programming, 199(1):165–214, 2023

  21. [21]

    Black box lie group preconditioners for sgd.arXiv preprint arXiv:2211.04422, 2022

    Xi-Lin Li. Black box lie group preconditioners for sgd.arXiv preprint arXiv:2211.04422, 2022

  22. [22]

    A distributed data-parallel PyTorch implementation of the distributed Shampoo optimizer for training neural networks at-scale. arXiv preprint arXiv:2309.06497, 2023

    Hao-Jun Michael Shi, Tsung-Hsien Lee, Shintaro Iwasaki, Jose Gallego-Posada, Zhijing Li, Kaushik Rangadurai, Dheevatsa Mudigere, and Michael Rabbat. A distributed data-parallel PyTorch implementation of the distributed Shampoo optimizer for training neural networks at-scale. arXiv preprint arXiv:2309.06497, 2023

  23. [23]

    Muon is Scalable for LLM Training

    Jingyuan Liu, Jianlin Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, Yidao Qin, Weixin Xu, Enzhe Lu, Junjie Yan, et al. Muon is scalable for LLM training.arXiv preprint arXiv:2502.16982, 2025

  24. [24]

    Normuon: Making muon more efficient and scalable.arXiv preprint arXiv:2510.05491, 2025

    Zichong Li, Liming Liu, Chen Liang, Weizhu Chen, and Tuo Zhao. Normuon: Making muon more efficient and scalable.arXiv preprint arXiv:2510.05491, 2025

  25. [25]

    Fantastic pretraining optimizers and where to find them.arXiv preprint arXiv:2509.02046, 2025

    Kaiyue Wen, David Hall, Tengyu Ma, and Percy Liang. Fantastic pretraining optimizers and where to find them.arXiv preprint arXiv:2509.02046, 2025

  26. [26]

    Htmuon: Improving muon via heavy-tailed spectral correction, 2026

    Tianyu Pang, Yujie Fang, Zihang Liu, Shenyang Deng, Lei Hsiung, Shuhua Yu, and Yaoqing Yang. Htmuon: Improving muon via heavy-tailed spectral correction, 2026. URL https://arxiv.org/abs/2603.10067

  27. [27]

    Visualizing the loss landscape of neural nets.Advances in Neural Information Processing Systems, 31, 2018

    Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. Visualizing the loss landscape of neural nets.Advances in Neural Information Processing Systems, 31, 2018

  28. [28]

    Eigenvalues of the Hessian in Deep Learning: Singularity and Beyond

    Levent Sagun, Léon Bottou, and Yann LeCun. Eigenvalues of the Hessian in deep learning: Singularity and beyond.arXiv preprint arXiv:1611.07476, 2016

  29. [29]

    Empirical Analysis of the Hessian of Over-Parametrized Neural Networks

    Levent Sagun, Utku Evci, V Ugur Guney, Yann Dauphin, and Leon Bottou. Empirical analysis of the Hessian of over-parametrized neural networks.arXiv preprint arXiv:1706.04454, 2017

  30. [30]

    Suspicious alignment of SGD: A fine-grained step size condition analysis, 2026

    Shenyang Deng, Boyao Liao, Zhuoli Ouyang, Tianyu Pang, Minhak Song, and Yaoqing Yang. Suspicious alignment of SGD: A fine-grained step size condition analysis, 2026. URL https://arxiv.org/abs/2601.11789

  31. [31]

    Depth, not data: An analysis of hessian spectral bifurcation, 2026

    Shenyang Deng, Boyao Liao, Zhuoli Ouyang, Tianyu Pang, and Yaoqing Yang. Depth, not data: An analysis of hessian spectral bifurcation, 2026. URLhttps://arxiv.org/abs/2602.00545

  32. [32]

    An investigation into neural net optimization via hessian eigenvalue density

    Behrooz Ghorbani, Shankar Krishnan, and Ying Xiao. An investigation into neural net optimization via hessian eigenvalue density. InInternational Conference on Machine Learning, pages 2232–2241. PMLR, 2019

  33. [33]

    Dissecting Hessian: Understanding common structure of Hessian in neural networks

    Yikai Wu, Xingyu Zhu, Chenwei Wu, Annie Wang, and Rong Ge. Dissecting Hessian: Understanding common structure of Hessian in neural networks. InAdvances in Neural Information Processing Systems, volume 33, pages 10193–10204, 2020

  34. [34]

    Analytic insights into structure and rank of neural network Hessian maps.Advances in Neural Information Processing Systems, 34:23914–23927, 2021

    Sidak Pal Singh, Gregor Bachmann, and Thomas Hofmann. Analytic insights into structure and rank of neural network Hessian maps.Advances in Neural Information Processing Systems, 34:23914–23927, 2021

  35. [35]

    Hessian eigenspectra of more realistic nonlinear models.Advances in Neural Information Processing Systems, 34:20104–20117, 2021

    Zhenyu Liao and Michael W Mahoney. Hessian eigenspectra of more realistic nonlinear models.Advances in Neural Information Processing Systems, 34:20104–20117, 2021

  36. [36]

    Towards practical Adam: Non-convexity, convergence theory, and mini-batch acceleration. Journal of Machine Learning Research, 23(229):1–47, 2022

    Congliang Chen, Li Shen, Fangyu Zou, and Wei Liu. Towards practical Adam: Non-convexity, convergence theory, and mini-batch acceleration. Journal of Machine Learning Research, 23(229):1–47, 2022

  37. [37]

    On the O(√d/T^{1/4}) convergence rate of RMSProp and its momentum extension measured by ℓ1 norm. arXiv preprint arXiv:2402.00389, 2024

    Huan Li and Zhouchen Lin. On the O(√d/T^{1/4}) convergence rate of RMSProp and its momentum extension measured by ℓ1 norm. arXiv preprint arXiv:2402.00389, 2024

  38. [38]

    Adam exploits $\ell_\infty$-geometry of loss landscape via coordinate-wise adaptivity

    Shuo Xie, Mohamad Amin Mohamadi, and Zhiyuan Li. Adam exploits $\ell_\infty$-geometry of loss landscape via coordinate-wise adaptivity. InThe Thirteenth International Conference on Learning Representations, 2025. URLhttps://openreview.net/forum?id=PUnD86UEK5

  39. [39]

    Training deep learning models with norm-constrained LMOs

    Thomas Pethick, Wanyun Xie, Kimon Antonakopoulos, Zhenyu Zhu, Antonio Silveti-Falls, and Volkan Cevher. Training deep learning models with norm-constrained LMOs. InForty-second International Conference on Machine Learning, 2025. URLhttps://openreview.net/forum?id=2Oqm2IzTy9

  40. [40]

    OpenWebText corpus

    Aaron Gokaslan, Vanya Cohen, Ellie Pavlick, and Stefanie Tellex. OpenWebText corpus. http://Skylion007.github.io/OpenWebTextCorpus, 2019

  41. [41]

    The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale

    Guilherme Penedo, Hynek Kydlíček, Loubna Ben allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro Von Werra, and Thomas Wolf. The fineweb datasets: Decanting the web for the finest text data at scale, 2024. URLhttps://arxiv.org/abs/2406.17557

  42. [42]

    Exploring the limits of transfer learning with a unified text-to-text transformer.Journal of machine learning research, 21(140):1–67, 2020

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer.Journal of machine learning research, 21(140):1–67, 2020

  43. [43]

    MARS: Unleashing the power of variance reduction for training large models

    Huizhuo Yuan, Yifeng Liu, Shuang Wu, zhou Xun, and Quanquan Gu. MARS: Unleashing the power of variance reduction for training large models. InForty-second International Conference on Machine Learning, 2025. URLhttps://openreview.net/forum?id=NrcKQ3ASLZ

  44. [44]

    Alphadecay: Module-wise weight decay for heavy-tailed balancing in llms.arXiv preprint arXiv:2506.14562, 2025

    Di He, Songjun Tu, Ajay Jaiswal, Li Shen, Ganzhao Yuan, Shiwei Liu, and Lu Yin. Alphadecay: Module-wise weight decay for heavy-tailed balancing in llms.arXiv preprint arXiv:2506.14562, 2025

  45. [45]

    Fineweb-edu-100b-shuffle

    Andrej Karpathy. Fineweb-edu-100b-shuffle. https://huggingface.co/datasets/karpathy/fineweb-edu-100b-shuffle, 2024

  46. [46]

    Row-wise ratio calculation: For each row $i \in \{1, \dots, m\}$, we compute the ratio $r_i$ between the diagonal element and the average magnitude of the off-diagonal elements: $r_i = \frac{G_{ii}}{\frac{1}{m-1}\sum_{j \neq i} |G_{ij}|}$ (10), where $G_{ii} = \|V_{t,i:}\|_2^2$ is the squared norm of the $i$-th row of $V_t$.

  47. [47]

    Per-parameter aggregation: For each matrix parameter, we aggregate the row-wise ratios into three statistics: $r_{\mathrm{avg}} = \frac{1}{m}\sum_{i=1}^{m} r_i$ (11), $r_{\min} = \min_{i\in\{1,\dots,m\}} r_i$ (12), and $r_{\max} = \max_{i\in\{1,\dots,m\}} r_i$ (13).

  48. [48]

    Global aggregation: The global statistics $r_{\mathrm{avg}}$, $r_{\min}$, and $r_{\max}$ are computed by averaging the corresponding per-parameter metrics across all $K$ matrix parameters in the network: $r_{\mathrm{avg}} = \frac{1}{K}\sum_{k=1}^{K} r^{(k)}_{\mathrm{avg}}$ (14), $r_{\min} = \frac{1}{K}\sum_{k=1}^{K} r^{(k)}_{\min}$ (15), and $r_{\max} = \frac{1}{K}\sum_{k=1}^{K} r^{(k)}_{\max}$ (16), where the superscript $(k)$ denotes the metric for the $k$-th matrix parameter. Logging configuration: We use Weights & Biases (wandb) for metric tracking.