Hierarchical Muon: Tiled Newton-Schulz Updates for Efficient Muon Optimization

Tianshi Xu; Yousef Saad; Yuanzhe Xi; Ziyuan Tang

arXiv: 2606.27216 · v1 · pith:YMDU4HY2new · submitted 2026-06-25 · math.NA · cs.LG · cs.NA

Ziyuan Tang, Tianshi Xu, Yousef Saad, Yuanzhe Xi

Pith reviewed 2026-06-26 03:29 UTC · model grok-4.3

classification: math.NA · cs.LG · cs.NA

keywords: Muon optimizer · Newton-Schulz iteration · tiled matrix operations · neural network optimization · hierarchical methods · matrix functions · efficient training

The pith

By applying Newton-Schulz independently to each T by T tile of momentum-gradient matrices, Hierarchical Muon defines a local update rule that reduces work to O(H W T K).

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Muon-type optimizers apply a finite Newton-Schulz iteration to momentum-gradient matrices to form update directions for neural network weights. Hierarchical Muon partitions these matrices into T by T tiles and runs the iteration separately on each tile before reassembling. This produces a local matrix-function map that keeps spectral interactions inside tiles but severs them across tile boundaries. For an H by W matrix with r = min{H, W} and s = max{H, W}, the leading cost of K Newton-Schulz steps drops from O(r² s K) to O(H W T K), and the work splits into independent small dense operations. Reported transformer training runs show optimizer-step efficiency gains, with training trajectories remaining close to those of the original full-matrix Muon in the tested regimes.

Core claim

Hierarchical Muon partitions each momentum-gradient matrix into T × T tiles, applies the same finite Newton-Schulz map independently to each tile, and reassembles the results. For finite T below the matrix dimensions, HiMuon defines a local matrix-function map rather than a convergent approximation to the full-matrix update: spectral interactions are preserved within tiles and discarded across tile boundaries. For fixed finite T, the leading Newton-Schulz work decreases to O(H W T K), and the computation decomposes into independent small dense matrix operations.
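The partition, per-tile map, and reassembly described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the classical cubic Newton-Schulz polynomial stands in for whatever polynomial the paper uses, the Frobenius normalization is an assumed preprocessing step, T is assumed to divide both H and W, and the names `newton_schulz` and `himuon_update` are hypothetical.

```python
import numpy as np

def newton_schulz(X, steps=5, eps=1e-7):
    """Finite cubic Newton-Schulz iteration (assumed variant, not the paper's)."""
    X = np.asarray(X, dtype=np.float64)
    # Frobenius normalization bounds all singular values by 1 (assumption).
    X = X / (np.linalg.norm(X) + eps)
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T  # iterate on the orientation with the smaller Gram matrix
    for _ in range(steps):
        A = X @ X.T                # Gram product, couples all rows of X
        X = 1.5 * X - 0.5 * A @ X  # cubic update toward the polar factor
    return X.T if transposed else X

def himuon_update(M, tile, steps=5):
    """Apply the same finite map independently to each tile and reassemble."""
    M = np.asarray(M, dtype=np.float64)
    H, W = M.shape
    if H % tile or W % tile:
        # How the paper treats ragged edges is not stated here; we just refuse.
        raise ValueError("tile must divide both matrix dimensions")
    out = np.empty_like(M)
    for i in range(0, H, tile):
        for j in range(0, W, tile):
            out[i:i + tile, j:j + tile] = newton_schulz(
                M[i:i + tile, j:j + tile], steps
            )
    return out
```

With `tile` equal to the full matrix size the sketch reduces to the full-matrix map, and perturbing one tile of the input changes only the matching tile of the output, which is the locality property the core claim describes.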

What carries the argument

Independent application of the finite Newton-Schulz iteration to each T by T tile of the momentum-gradient matrix

If this is right

The Newton-Schulz work decreases to O(H W T K) for fixed finite T.
The computation decomposes into independent small dense matrix operations.
This structure enables tile-size-dependent GPU kernels, cross-layer batching, memory-bounded chunking, and runtime tile-size schedules.
Experiments show improved optimizer-step efficiency while keeping training behavior close to full-matrix Muon.
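Because the tiles are independent, tiles from several layers can be stacked into one batch and processed with batched matrix products, optionally in bounded-size chunks. The sketch below illustrates that idea only; it is not one of the paper's GPU kernels. It assumes the cubic Newton-Schulz polynomial, per-tile Frobenius normalization, and a tile size that divides every matrix dimension. All function names and the `chunk` parameter are hypothetical.

```python
import numpy as np

def to_tiles(M, T):
    """Reshape an (H, W) matrix into a stack of (H/T * W/T) tiles of size T x T."""
    H, W = M.shape
    if H % T or W % T:
        raise ValueError("tile size must divide both dimensions")
    return M.reshape(H // T, T, W // T, T).transpose(0, 2, 1, 3).reshape(-1, T, T)

def from_tiles(tiles, H, W, T):
    """Inverse of to_tiles: reassemble the stack into an (H, W) matrix."""
    return tiles.reshape(H // T, W // T, T, T).transpose(0, 2, 1, 3).reshape(H, W)

def batched_newton_schulz(X, steps=5, eps=1e-7):
    """Cubic Newton-Schulz on a stack of tiles (assumed variant)."""
    norms = np.linalg.norm(X, axis=(1, 2), keepdims=True)  # per-tile Frobenius norm
    X = X / (norms + eps)
    for _ in range(steps):
        A = X @ X.transpose(0, 2, 1)  # batched Gram products, one per tile
        X = 1.5 * X - 0.5 * A @ X
    return X

def himuon_batched(mats, T, steps=5, chunk=None):
    """Tile several layers, run one batched pass (optionally chunked), and split back."""
    shapes = [m.shape for m in mats]
    counts = [(h // T) * (w // T) for h, w in shapes]
    batch = np.concatenate([to_tiles(np.asarray(m, dtype=np.float64), T) for m in mats])
    out = np.empty_like(batch)
    step = chunk or len(batch)  # chunking bounds the size of each batched pass
    for s in range(0, len(batch), step):
        out[s:s + step] = batched_newton_schulz(batch[s:s + step], steps)
    results, start = [], 0
    for (h, w), n in zip(shapes, counts):
        results.append(from_tiles(out[start:start + n], h, w, T))
        start += n
    return results
```

Chunking changes only how many tiles are processed at once, not the result, since no tile depends on any other.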

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The local tile map could be combined with other matrix-function based optimizers.
Different tile sizes per layer might be chosen based on matrix aspect ratios to further reduce cost.
The tile independence opens the possibility of processing tiles in parallel across multiple devices.

Load-bearing premise

The local tile-wise Newton-Schulz map preserves enough spectral coupling for optimizer behavior to remain close to the full-matrix version.

What would settle it

A controlled experiment on a small transformer where varying the tile size T produces a statistically significant change in final validation loss compared to the full-matrix baseline.

Original abstract

Muon-type optimizers construct update directions for dense neural-network weights by applying a finite Newton-Schulz map to momentum-gradient matrices. For an $H \times W$ matrix, with $r=\min\{H,W\}$ and $s=\max\{H,W\}$, $K$ steps of the full-matrix Newton-Schulz update require $O(r^2 s K)$ work and couple all rows and columns through repeated Gram matrix products. We introduce Hierarchical Muon (HiMuon), a tiled Newton-Schulz scheme for Muon-type optimization. HiMuon partitions each momentum-gradient matrix into $T \times T$ tiles, applies the same finite Newton-Schulz map independently to each tile, and reassembles the results. For finite $T$ below the matrix dimensions, HiMuon defines a local matrix-function map rather than a convergent approximation to the full-matrix update: spectral interactions are preserved within tiles and discarded across tile boundaries. For fixed finite $T$, the leading Newton-Schulz work decreases to $O(H W T K)$, and the computation decomposes into independent small dense matrix operations. This structure enables tile-size-dependent GPU kernels, cross-layer batching, memory-bounded chunking, and runtime tile-size schedules. Experiments on transformer training and controlled matrix-function diagnostics show that HiMuon improves optimizer-step efficiency while keeping training behavior close to full-matrix Muon in the tested regimes.
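The $O(H W T K)$ count can be recovered by a short tally. The derivation below is a sketch under two assumptions that the abstract does not state: $T$ divides both $H$ and $W$, and one Newton-Schulz step on an $a \times b$ matrix with $a \le b$ costs $c\,a^2 b$ to leading order, with the same constant $c$ for the full matrix and for a tile.

```latex
% Assumes T divides H and W; c is set by the iteration polynomial.
\begin{align*}
\#\text{tiles} &= \frac{H}{T}\cdot\frac{W}{T} = \frac{HW}{T^{2}}, \\
\text{work per tile} &= c\,T^{3}K, \\
\text{total tiled work} &= \frac{HW}{T^{2}}\cdot c\,T^{3}K = c\,HWTK, \\
\frac{\text{tiled work}}{\text{full work}} &= \frac{c\,HWTK}{c\,r^{2}sK} = \frac{T}{r}
\qquad\text{since } HW = rs .
\end{align*}
```

For example, with $H = W = 4096$ and $T = 256$ the leading-order ratio is $T/r = 1/16$. Lower-order terms and kernel efficiency on small tiles are ignored in this tally.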

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

HiMuon runs the Newton-Schulz step independently on each tile to cut the cost from O(r²sK) to O(HWTK), but the claim that training stays close rests on thin evidence that the lost cross-tile coupling does not matter much.

The new piece is the explicit construction of a local, tile-independent Newton-Schulz map for Muon. The paper states up front that this is not an approximation to the global iterate; spectral coupling stops at tile boundaries. That distinction is useful, and the arithmetic count follows directly from running the same fixed-T iteration on each small tile, which is reproducible from the description.

The practical gain is real on paper: for fixed T the work scales linearly with the matrix area H W rather than as r² s (cubically in the dimension for a square matrix), and the work decomposes into independent small dense operations that map to existing GPU kernels. The transformer runs are presented as evidence that optimizer behavior remains close enough for the tested regimes.

The soft spot is the missing quantitative check on how much the update directions actually change. The concern is concrete: each tile computes its own Gram products, so any singular vector with support across tiles is replaced by a local version. The abstract points to training experiments but gives no cosine similarities, no step-size deviation statistics, and no error bars. Without those, the “close” claim is plausible but not yet strongly supported.

This is for readers who already use or want to scale matrix-sign optimizers on large dense layers. A serious referee should see it; the complexity reduction and the local-map framing are clear enough to merit review even if the experiments need tightening.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces Hierarchical Muon (HiMuon), a tiled Newton-Schulz scheme for Muon-type optimizers. Each H×W momentum-gradient matrix is partitioned into T×T tiles; the same finite Newton-Schulz iteration is applied independently inside each tile and the results are reassembled. For fixed finite T the leading arithmetic cost drops from O(r² s K) to O(H W T K), the work decomposes into independent small dense matrix operations, and the method is explicitly characterized as a local matrix-function map rather than a convergent approximation to the global iterate. Transformer training runs and controlled matrix-function diagnostics are reported to show improved optimizer-step efficiency while keeping training behavior close to full-matrix Muon.

Significance. If the empirical observation that training dynamics remain close holds under wider conditions, the tiling construction supplies a concrete route to hardware-efficient implementations (tile-size-dependent kernels, cross-layer batching, memory-bounded chunking). The complexity reduction follows directly from counting arithmetic inside independent tiles and does not rely on fitted constants or self-referential definitions.

major comments (2)

[Abstract] The central practical claim of 'training behavior close to full-matrix Muon' is supported only by the reported transformer runs; no quantitative diagnostics (cosine similarity of update directions, deviation in effective step-size distributions, or spectral-norm difference between HiMuon and full-Muon matrices) are described that would bound the effect of the discarded cross-tile singular-vector coupling.
[Abstract / Experiments] The assertion that intra-tile spectral information is sufficient for optimizer behavior rests on an unproven premise; the manuscript correctly notes that finite-T HiMuon is a local map, yet provides no derivation or a priori bound showing that the resulting update matrix preserves the orthogonality or scaling properties required by the Muon step.
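The diagnostics the referee asks for are cheap to compute once both update matrices are available. The helper below is a hypothetical sketch, not something taken from the paper. It takes a full-matrix update and a tiled update of the same shape and reports the Frobenius cosine similarity, the relative spectral-norm difference, and the ratio of Frobenius norms. The last of these is only an assumed proxy for "effective step size"; the manuscript's own definition is not given here.

```python
import numpy as np

def update_diagnostics(U_full, U_tiled):
    """Compare two update matrices of equal shape (hypothetical helper)."""
    U_full = np.asarray(U_full, dtype=np.float64)
    U_tiled = np.asarray(U_tiled, dtype=np.float64)
    if U_full.shape != U_tiled.shape:
        raise ValueError("updates must have the same shape")
    fro_full = np.linalg.norm(U_full)    # Frobenius norm
    fro_tiled = np.linalg.norm(U_tiled)
    # Cosine of the angle between the updates under the Frobenius inner product.
    cosine = np.sum(U_full * U_tiled) / (fro_full * fro_tiled)
    # Spectral norm of the difference, relative to the full update's spectral norm.
    rel_spec = np.linalg.norm(U_full - U_tiled, ord=2) / np.linalg.norm(U_full, ord=2)
    return {
        "cosine": float(cosine),
        "rel_spectral_diff": float(rel_spec),
        "fro_norm_ratio": float(fro_tiled / fro_full),  # assumed step-size proxy
    }
```

Reporting these values per layer and per tile size, with spread across seeds, would address the missing error bars as well.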

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each major comment point by point below.

Referee: [Abstract] The central practical claim of 'training behavior close to full-matrix Muon' is supported only by the reported transformer runs; no quantitative diagnostics (cosine similarity of update directions, deviation in effective step-size distributions, or spectral-norm difference between HiMuon and full-Muon matrices) are described that would bound the effect of the discarded cross-tile singular-vector coupling.

Authors: We agree that the current manuscript supports the closeness claim primarily through transformer training runs together with the mentioned controlled matrix-function diagnostics, without the specific quantitative bounds listed. In the revision we will add cosine similarity of the resulting update directions, spectral-norm differences between HiMuon and full-Muon matrices, and statistics on effective step-size distributions across layers. These additions will directly quantify the impact of the discarded cross-tile coupling.
Revision: yes

Referee: [Abstract / Experiments] The assertion that intra-tile spectral information is sufficient for optimizer behavior rests on an unproven premise; the manuscript correctly notes that finite-T HiMuon is a local map, yet provides no derivation or a priori bound showing that the resulting update matrix preserves the orthogonality or scaling properties required by the Muon step.

Authors: The manuscript explicitly defines finite-T HiMuon as a local matrix-function map and does not assert that it converges to or exactly preserves the orthogonality/scaling properties of the global Newton-Schulz iterate. No a priori bound is derived because the construction deliberately discards cross-tile singular-vector interactions; the claim of practical sufficiency is supported only by the reported empirical evidence (training dynamics and matrix diagnostics). We therefore do not plan to add a theoretical derivation that would contradict the local character of the operator.
Revision: no
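A rank-one example shows how the scaling of a tiled update can differ from the full-matrix one. It is an idealized illustration, not a result from the paper: $\mathcal{P}$ denotes the exact polar factor (the limit a convergent Newton-Schulz iteration targets) rather than the finite-$K$ map, $T$ is assumed to divide $H$ and $W$ with $p = H/T$ and $q = W/T$, and every block $u_i \in \mathbb{R}^{T}$, $v_j \in \mathbb{R}^{T}$ of $u$ and $v$ is assumed nonzero.

```latex
% Idealization: exact polar factor in place of the finite-K Newton-Schulz map.
\begin{align*}
G &= u v^{\top}, &
\mathcal{P}(G) &= \frac{u}{\|u\|}\,\frac{v^{\top}}{\|v\|}, &
\|\mathcal{P}(G)\|_{2} &= 1, \\
G_{ij} &= u_i v_j^{\top}, &
\mathcal{P}(G_{ij}) &= \hat u_i \hat v_j^{\top}, &
\hat u_i &= \frac{u_i}{\|u_i\|},\quad \hat v_j = \frac{v_j}{\|v_j\|}, \\
U_{\text{tiled}} &= \tilde u\,\tilde v^{\top}, &
\tilde u &= (\hat u_1,\dots,\hat u_p), &
\tilde v &= (\hat v_1,\dots,\hat v_q),
\end{align*}
\begin{align*}
\|U_{\text{tiled}}\|_{2} &= \|\tilde u\|\,\|\tilde v\| = \sqrt{pq}, \\
\cos\angle\bigl(U_{\text{tiled}},\,\mathcal{P}(G)\bigr)
  &= \frac{\bigl(\sum_{i}\|u_i\|\bigr)\bigl(\sum_{j}\|v_j\|\bigr)}{\sqrt{pq}\,\|u\|\,\|v\|} \le 1 .
\end{align*}
```

In this idealized case the tiled update has spectral norm $\sqrt{pq}$ instead of 1. By the Cauchy-Schwarz inequality its direction matches the full polar factor only when all blocks of $u$ have equal norm and all blocks of $v$ have equal norm.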

Circularity Check

0 steps flagged

No circularity: complexity follows directly from tile partitioning definition

The paper defines HiMuon by partitioning each momentum-gradient matrix into independent T×T tiles and applying the finite Newton-Schulz map to each tile separately. The stated complexity reduction to O(H W T K) is obtained by direct arithmetic counting on the smaller per-tile Gram products and matrix multiplies; no parameter is fitted to data and then relabeled as a prediction, no self-citation supplies a uniqueness theorem, and no ansatz is smuggled in. The claim that training behavior remains close is presented as an empirical observation from transformer runs rather than a derived equality, so the central derivation chain contains no self-referential reduction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The method rests on standard dense linear algebra operations and the user-chosen tile size T; no new physical or mathematical entities are postulated.

free parameters (1)

T
Tile side length chosen by the user; controls locality and arithmetic cost.

axioms (1)

standard math: Standard matrix multiplication and addition are associative and distributive as defined in linear algebra.
Invoked implicitly when the Newton-Schulz iteration is applied to each tile.

