Hierarchical Muon: Tiled Newton-Schulz Updates for Efficient Muon Optimization
Pith reviewed 2026-06-26 03:29 UTC · model grok-4.3
The pith
By applying Newton-Schulz independently to each T × T tile of the momentum-gradient matrices, Hierarchical Muon defines a local update rule that reduces the leading Newton-Schulz work from O(r² s K), with r = min{H, W} and s = max{H, W}, to O(H W T K).
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Hierarchical Muon partitions each momentum-gradient matrix into T × T tiles, applies the same finite Newton-Schulz map independently to each tile, and reassembles the results. For finite T below the matrix dimensions, HiMuon defines a local matrix-function map rather than a convergent approximation to the full-matrix update: spectral interactions are preserved within tiles and discarded across tile boundaries. For fixed finite T, the leading Newton-Schulz work decreases to O(H W T K), and the computation decomposes into independent small dense matrix operations.
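The partition, per-tile iteration, and reassembly described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names `newton_schulz` and `himuon_update` are hypothetical, the classic cubic Newton-Schulz step stands in for whatever finite map the authors use, each tile is Frobenius-normalized before iterating, and T is assumed to divide both H and W (the source does not say how ragged edge tiles are handled).

```python
import numpy as np


def newton_schulz(X, steps):
    """Apply `steps` cubic Newton-Schulz iterations over the last two axes of X."""
    # Frobenius-normalize each matrix so its singular values are at most 1.
    norm = np.linalg.norm(X, axis=(-2, -1), keepdims=True)
    X = X / (norm + 1e-7)
    for _ in range(steps):
        gram = X @ np.swapaxes(X, -1, -2)   # Gram matrix X X^T, one per matrix
        X = 1.5 * X - 0.5 * (gram @ X)      # classic cubic step (an assumption)
    return X


def himuon_update(M, T, steps):
    """Tile M into T x T blocks, run Newton-Schulz on each block, reassemble."""
    H, W = M.shape
    if H % T or W % T:
        raise ValueError("this sketch assumes T divides both H and W")
    # (H, W) -> (H/T, W/T, T, T): tiles[i, j] is the block at block-row i, block-column j.
    tiles = M.reshape(H // T, T, W // T, T).transpose(0, 2, 1, 3)
    out = newton_schulz(tiles, steps)       # batched over all tiles at once
    # Invert the tiling reshape to recover an (H, W) matrix.
    return out.transpose(0, 2, 1, 3).reshape(H, W)


rng = np.random.default_rng(0)
M = rng.standard_normal((8, 12))
U = himuon_update(M, T=4, steps=5)
print(U.shape)
```

Because the batched matrix product never mixes entries from different tiles, each output block depends only on the matching input block, which is the "local map" property the core claim describes.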
What carries the argument
Independent application of the finite Newton-Schulz iteration to each T by T tile of the momentum-gradient matrix
If this is right
- The Newton-Schulz work decreases to O(H W T K) for fixed finite T.
- The computation decomposes into independent small dense matrix operations.
- This structure enables tile-size-dependent GPU kernels, cross-layer batching, memory-bounded chunking, and runtime tile-size schedules.
- Experiments show improved optimizer-step efficiency while keeping training behavior close to full-matrix Muon.
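To make the batching and chunking consequences concrete, here is a hedged sketch of how tiles from several layers could share one batched call while a chunk size caps how many tiles are in flight. All names (`to_tiles`, `from_tiles`, `batched_himuon`, the `chunk` parameter) are hypothetical, the cubic Newton-Schulz step is an assumed stand-in for the paper's finite map, and T is assumed to divide every layer's dimensions.

```python
import numpy as np


def newton_schulz(X, steps):
    """Cubic Newton-Schulz over the last two axes (assumed stand-in map)."""
    norm = np.linalg.norm(X, axis=(-2, -1), keepdims=True)
    X = X / (norm + 1e-7)
    for _ in range(steps):
        X = 1.5 * X - 0.5 * ((X @ np.swapaxes(X, -1, -2)) @ X)
    return X


def to_tiles(M, T):
    """(H, W) -> (num_tiles, T, T), tiles ordered row-major by block index."""
    H, W = M.shape
    return M.reshape(H // T, T, W // T, T).transpose(0, 2, 1, 3).reshape(-1, T, T)


def from_tiles(tiles, shape, T):
    """Inverse of to_tiles for a matrix of the given (H, W) shape."""
    H, W = shape
    return tiles.reshape(H // T, W // T, T, T).transpose(0, 2, 1, 3).reshape(H, W)


def batched_himuon(mats, T, steps, chunk):
    # Cross-layer batching: tiles from every layer go into one stack.
    stacked = np.concatenate([to_tiles(M, T) for M in mats], axis=0)
    out = np.empty_like(stacked)
    # Memory-bounded chunking: at most `chunk` tiles are processed per call.
    for start in range(0, len(stacked), chunk):
        out[start:start + chunk] = newton_schulz(stacked[start:start + chunk], steps)
    # Hand each layer its own tiles back and reassemble.
    results, offset = [], 0
    for M in mats:
        n = (M.shape[0] // T) * (M.shape[1] // T)
        results.append(from_tiles(out[offset:offset + n], M.shape, T))
        offset += n
    return results


rng = np.random.default_rng(0)
layers = [rng.standard_normal((8, 8)), rng.standard_normal((4, 12))]
updates = batched_himuon(layers, T=4, steps=5, chunk=3)
print([u.shape for u in updates])
```

Since tiles are independent, the chunk size changes peak memory but not the result, which is why it can be tuned freely at runtime in a sketch like this.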
Where Pith is reading between the lines
- The local tile map could be combined with other matrix-function based optimizers.
- Different tile sizes per layer might be chosen based on matrix aspect ratios to further reduce cost.
- The tile independence opens the possibility of processing tiles in parallel across multiple devices.
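One way to explore the per-layer tile-size idea is a small cost model built from the abstract's leading-order work counts. The sketch below is purely illustrative: the function names and the "largest tile that fits a work budget" policy are assumptions introduced here, not something the paper proposes, and constants and memory traffic are ignored.

```python
def full_ns_work(H, W, K):
    """Leading-order work of K full-matrix Newton-Schulz steps: r^2 * s * K."""
    r, s = min(H, W), max(H, W)
    return r * r * s * K


def tiled_ns_work(H, W, T, K):
    """Leading-order work with T x T tiles; assumes T divides H and W."""
    n_tiles = (H // T) * (W // T)
    return n_tiles * T ** 3 * K  # equals H * W * T * K


def largest_tile_within_budget(H, W, K, candidates, budget):
    """Hypothetical policy: keep as much in-tile coupling as the budget allows."""
    valid = [
        T for T in candidates
        if H % T == 0 and W % T == 0 and tiled_ns_work(H, W, T, K) <= budget
    ]
    return max(valid) if valid else None


H, W, K = 1024, 4096, 5
full = full_ns_work(H, W, K)
tiled = tiled_ns_work(H, W, 128, K)
print(full // tiled)  # work ratio for T = 128
budget = tiled_ns_work(H, W, 256, K)
print(largest_tile_within_budget(H, W, K, [64, 128, 256, 512], budget))
```

Cost alone always favors the smallest tile, so any practical schedule would need a second criterion, such as the budget used here, to decide how much cross-row and cross-column coupling to retain.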
Load-bearing premise
The local tile-wise Newton-Schulz map preserves enough spectral coupling for optimizer behavior to remain close to the full-matrix version.
What would settle it
A controlled experiment on a small transformer where varying the tile size T produces a statistically significant change in final validation loss compared to the full-matrix baseline.
Original abstract
Muon-type optimizers construct update directions for dense neural-network weights by applying a finite Newton-Schulz map to momentum-gradient matrices. For an $H \times W$ matrix, with $r=\min\{H,W\}$ and $s=\max\{H,W\}$, $K$ steps of the full-matrix Newton-Schulz update require $O(r^2 s K)$ work and couple all rows and columns through repeated Gram matrix products. We introduce Hierarchical Muon (HiMuon), a tiled Newton-Schulz scheme for Muon-type optimization. HiMuon partitions each momentum-gradient matrix into $T \times T$ tiles, applies the same finite Newton-Schulz map independently to each tile, and reassembles the results. For finite $T$ below the matrix dimensions, HiMuon defines a local matrix-function map rather than a convergent approximation to the full-matrix update: spectral interactions are preserved within tiles and discarded across tile boundaries. For fixed finite $T$, the leading Newton-Schulz work decreases to $O(H W T K)$, and the computation decomposes into independent small dense matrix operations. This structure enables tile-size-dependent GPU kernels, cross-layer batching, memory-bounded chunking, and runtime tile-size schedules. Experiments on transformer training and controlled matrix-function diagnostics show that HiMuon improves optimizer-step efficiency while keeping training behavior close to full-matrix Muon in the tested regimes.
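The step from $O(r^2 s K)$ to $O(H W T K)$ can be reconstructed by counting work per tile. The derivation below is a sketch that assumes $T$ divides both $H$ and $W$ and applies the abstract's full-matrix count to a single $T \times T$ tile, for which $r = s = T$.

```latex
\begin{align*}
\text{work per tile} &= O(T^2 \cdot T \cdot K) = O(T^3 K)
  && (r = s = T \text{ for a } T \times T \text{ tile}) \\
\text{number of tiles} &= \frac{H}{T} \cdot \frac{W}{T} = \frac{HW}{T^2} \\
\text{total tiled work} &= \frac{HW}{T^2} \cdot O(T^3 K) = O(H W T K) \\
\frac{\text{full-matrix work}}{\text{tiled work}} &= \frac{r^2 s K}{H W T K} = \frac{r}{T}
  && (\text{since } HW = rs)
\end{align*}
```

Under these assumptions the leading-order saving is the factor $r/T$, so it grows with the smaller matrix dimension and shrinks as the tile size approaches it.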
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Hierarchical Muon (HiMuon), a tiled Newton-Schulz scheme for Muon-type optimizers. Each H×W momentum-gradient matrix is partitioned into T×T tiles; the same finite Newton-Schulz iteration is applied independently inside each tile and the results are reassembled. For fixed finite T the leading arithmetic cost drops from O(r² s K) to O(H W T K), the work decomposes into independent small dense matrix operations, and the method is explicitly characterized as a local matrix-function map rather than a convergent approximation to the global iterate. Transformer training runs and controlled matrix-function diagnostics are reported to show improved optimizer-step efficiency while keeping training behavior close to full-matrix Muon.
Significance. If the empirical observation that training dynamics remain close holds under wider conditions, the tiling construction supplies a concrete route to hardware-efficient implementations (tile-size-dependent kernels, cross-layer batching, memory-bounded chunking). The complexity reduction follows directly from counting arithmetic inside independent tiles and does not rely on fitted constants or self-referential definitions.
Major comments (2)
- [Abstract] The central practical claim, that training behavior stays 'close to full-matrix Muon', is supported only by the reported transformer runs. No quantitative diagnostics (cosine similarity of update directions, deviation in effective step-size distributions, or spectral-norm difference between HiMuon and full-Muon matrices) are described that would bound the effect of the discarded cross-tile singular-vector coupling.
- [Abstract / Experiments] The assertion that intra-tile spectral information is sufficient for optimizer behavior rests on an unproven premise. The manuscript correctly notes that finite-T HiMuon is a local map, yet it provides no derivation or a priori bound showing that the resulting update matrix preserves the orthogonality or scaling properties required by the Muon step.
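Two of the diagnostics the referee asks for, cosine similarity of update directions and spectral-norm difference, are straightforward to compute on a single matrix. The sketch below is an illustration under assumptions, not the authors' evaluation code: the function names are hypothetical, the cubic Newton-Schulz step stands in for the paper's finite map, a random Gaussian matrix stands in for a real momentum-gradient matrix, and no result values are claimed beyond the trivial case where the tile covers the whole matrix.

```python
import numpy as np


def newton_schulz(X, steps):
    """Cubic Newton-Schulz over the last two axes (assumed stand-in map)."""
    norm = np.linalg.norm(X, axis=(-2, -1), keepdims=True)
    X = X / (norm + 1e-7)
    for _ in range(steps):
        X = 1.5 * X - 0.5 * ((X @ np.swapaxes(X, -1, -2)) @ X)
    return X


def tiled_newton_schulz(M, T, steps):
    """Run Newton-Schulz independently on each T x T tile (T divides H and W)."""
    H, W = M.shape
    tiles = M.reshape(H // T, T, W // T, T).transpose(0, 2, 1, 3)
    return newton_schulz(tiles, steps).transpose(0, 2, 1, 3).reshape(H, W)


def cosine_similarity(A, B):
    """Cosine of the angle between A and B viewed as flat vectors."""
    return float(np.sum(A * B) / (np.linalg.norm(A) * np.linalg.norm(B)))


def spectral_norm_diff(A, B):
    """Largest singular value of A - B."""
    return float(np.linalg.norm(A - B, ord=2))


rng = np.random.default_rng(0)
M = rng.standard_normal((16, 16))   # placeholder for a momentum-gradient matrix
full = newton_schulz(M, steps=5)
for T in (4, 8, 16):
    tiled = tiled_newton_schulz(M, T, steps=5)
    print(T, cosine_similarity(full, tiled), spectral_norm_diff(full, tiled))
```

Cosine similarity is scale-invariant, whereas the spectral-norm difference also reflects the different overall scale that per-tile normalization gives the tiled output, so the two numbers answer different questions and are best reported together.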
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive report. We address each major comment point by point below.
- Referee: [Abstract] The central practical claim, that training behavior stays 'close to full-matrix Muon', is supported only by the reported transformer runs. No quantitative diagnostics (cosine similarity of update directions, deviation in effective step-size distributions, or spectral-norm difference between HiMuon and full-Muon matrices) are described that would bound the effect of the discarded cross-tile singular-vector coupling.
  Authors: We agree that the current manuscript supports the closeness claim primarily through transformer training runs together with the mentioned controlled matrix-function diagnostics, without the specific quantitative bounds listed. In the revision we will add cosine similarity of the resulting update directions, spectral-norm differences between HiMuon and full-Muon matrices, and statistics on effective step-size distributions across layers. These additions will directly quantify the impact of the discarded cross-tile coupling.
  Revision: yes
- Referee: [Abstract / Experiments] The assertion that intra-tile spectral information is sufficient for optimizer behavior rests on an unproven premise. The manuscript correctly notes that finite-T HiMuon is a local map, yet it provides no derivation or a priori bound showing that the resulting update matrix preserves the orthogonality or scaling properties required by the Muon step.
  Authors: The manuscript explicitly defines finite-T HiMuon as a local matrix-function map and does not assert that it converges to or exactly preserves the orthogonality/scaling properties of the global Newton-Schulz iterate. No a priori bound is derived because the construction deliberately discards cross-tile singular-vector interactions; the claim of practical sufficiency is supported only by the reported empirical evidence (training dynamics and matrix diagnostics). We therefore do not plan to add a theoretical derivation that would contradict the local character of the operator.
  Revision: no
Circularity Check
No circularity: complexity follows directly from tile partitioning definition
The paper defines HiMuon by partitioning each momentum-gradient matrix into independent T×T tiles and applying the finite Newton-Schulz map to each tile separately. The stated complexity reduction to O(H W T K) is obtained by direct arithmetic counting on the smaller per-tile Gram products and matrix multiplies; no parameter is fitted to data and then relabeled as a prediction, no self-citation supplies a uniqueness theorem, and no ansatz is smuggled in. The claim that training behavior remains close is presented as an empirical observation from transformer runs rather than a derived equality, so the central derivation chain contains no self-referential reduction.
Axiom & Free-Parameter Ledger
free parameters (1)
- T
axioms (1)
- Standard math: matrix multiplication and addition are associative and distributive, as defined in linear algebra.