arxiv: 2606.27216 · v1 · pith:YMDU4HY2new · submitted 2026-06-25 · 🧮 math.NA · cs.LG · cs.NA

Hierarchical Muon: Tiled Newton-Schulz Updates for Efficient Muon Optimization

Pith reviewed 2026-06-26 03:29 UTC · model grok-4.3

classification 🧮 math.NA · cs.LG · cs.NA
keywords Muon optimizer, Newton-Schulz iteration, tiled matrix operations, neural network optimization, hierarchical methods, matrix functions, efficient training

The pith

By applying Newton-Schulz independently to each T × T tile of the momentum-gradient matrices, Hierarchical Muon defines a local update rule that reduces the leading Newton-Schulz work to O(H W T K) for an H × W matrix and K iteration steps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Muon-type optimizers apply a finite Newton-Schulz iteration to momentum-gradient matrices to form update directions for neural network weights. Hierarchical Muon partitions these matrices into T × T tiles and runs the iteration separately on each tile before reassembling. This produces a local matrix-function map that keeps spectral interactions inside tiles but severs them across tile boundaries. For an H × W matrix with r = min{H, W}, s = max{H, W}, and K iteration steps, the leading cost drops from O(r² s K) to O(H W T K), and the work splits into independent small dense operations. Reported transformer training runs show optimizer-step efficiency gains, with training trajectories remaining close to those of the original full-matrix Muon in the tested regimes.

Core claim

Hierarchical Muon partitions each momentum-gradient matrix into T × T tiles, applies the same finite Newton-Schulz map independently to each tile, and reassembles the results. For finite T below the matrix dimensions, HiMuon defines a local matrix-function map rather than a convergent approximation to the full-matrix update: spectral interactions are preserved within tiles and discarded across tile boundaries. For fixed finite T, the leading Newton-Schulz work decreases to O(H W T K), and the computation decomposes into independent small dense matrix operations.

What carries the argument

Independent application of the finite Newton-Schulz iteration to each T × T tile of the momentum-gradient matrix
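
The tile-wise map can be sketched in a few lines. What follows is a minimal illustration, not the paper's implementation. It assumes that T divides both H and W, and it uses the classical cubic Newton-Schulz step X ← 1.5 X − 0.5 X Xᵀ X after Frobenius normalization; the paper's polynomial, normalization, and edge handling may differ. Matrices are plain Python lists so the sketch has no dependencies, and the names `newton_schulz` and `tiled_newton_schulz` are hypothetical.

```python
import math


def matmul(A, B):
    """Dense product of two matrices stored as lists of rows."""
    B_cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in B_cols] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def newton_schulz(X, steps):
    """Cubic Newton-Schulz iteration (assumed variant): X <- 1.5 X - 0.5 X X^T X."""
    # Frobenius normalization bounds every singular value by 1, which lies
    # inside the convergence region of the cubic iteration.
    fro = math.sqrt(sum(v * v for row in X for v in row)) + 1e-12
    X = [[v / fro for v in row] for row in X]
    for _ in range(steps):
        XXtX = matmul(matmul(X, transpose(X)), X)
        X = [[1.5 * x - 0.5 * y for x, y in zip(rx, ry)] for rx, ry in zip(X, XXtX)]
    return X


def tiled_newton_schulz(M, T, steps):
    """Apply newton_schulz to each T x T tile of M independently, then reassemble."""
    H, W = len(M), len(M[0])
    assert H % T == 0 and W % T == 0, "sketch assumes T divides H and W"
    out = [[0.0] * W for _ in range(H)]
    for i in range(0, H, T):
        for j in range(0, W, T):
            tile = [row[j:j + T] for row in M[i:i + T]]
            new_tile = newton_schulz(tile, steps)  # no information crosses tiles
            for di, row in enumerate(new_tile):
                out[i + di][j:j + T] = row
    return out


# Hypothetical 4 x 4 input split into four 2 x 2 tiles.
M = [[2.0, 1.0, 0.0, 3.0],
     [1.0, 3.0, 1.0, 0.0],
     [4.0, 0.0, 1.0, 2.0],
     [0.0, 1.0, -2.0, 1.0]]
update = tiled_newton_schulz(M, T=2, steps=20)
```

With these inputs each 2 × 2 tile of `update` approaches the orthogonal polar factor of the corresponding tile of `M`; the top-left tile, being symmetric positive definite, approaches the identity. Entries in different tiles never interact, which is the locality the core claim describes.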

If this is right

  • The Newton-Schulz work decreases to O(H W T K) for fixed finite T.
  • The computation decomposes into independent small dense matrix operations.
  • This structure enables tile-size-dependent GPU kernels, cross-layer batching, memory-bounded chunking, and runtime tile-size schedules.
  • Experiments show improved optimizer-step efficiency while keeping training behavior close to full-matrix Muon.
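
The cost figures in the first bullet follow from a direct count, which the short script below reproduces. It is a back-of-the-envelope sketch under stated assumptions: each Newton-Schulz step is charged two dense products (as in the cubic form X Xᵀ followed by (X Xᵀ) X), T divides H and W, and lower-order terms are ignored. The layer shape and the function names are hypothetical. A full-matrix step costs about 2 r² s multiply-adds, while the (H/T)(W/T) tiles cost 2 T³ each, giving 2 H W T per step.

```python
def full_ns_madds(H, W, K, products_per_step=2):
    """Leading multiply-add count for K full-matrix Newton-Schulz steps."""
    r, s = min(H, W), max(H, W)
    return products_per_step * r * r * s * K


def tiled_ns_madds(H, W, T, K, products_per_step=2):
    """Leading multiply-add count when every T x T tile is iterated independently."""
    assert H % T == 0 and W % T == 0, "sketch assumes T divides H and W"
    num_tiles = (H // T) * (W // T)
    return num_tiles * products_per_step * T ** 3 * K


# Hypothetical layer shape, tile size, and step count.
H, W, T, K = 4096, 1024, 128, 5
full = full_ns_madds(H, W, K)
tiled = tiled_ns_madds(H, W, T, K)
print(full, tiled, full / tiled)  # the ratio equals r / T = 1024 / 128 = 8.0
```

Because H W = r s, the ratio of the two leading counts is r / T regardless of the constant charged per step, so the saving grows as the tile shrinks relative to the smaller matrix dimension. Measured speedups also depend on kernel efficiency for small matrices, which this count does not model.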

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The local tile map could be combined with other matrix-function based optimizers.
  • Different tile sizes per layer might be chosen based on matrix aspect ratios to further reduce cost.
  • The tile independence opens the possibility of processing tiles in parallel across multiple devices.

Load-bearing premise

The local tile-wise Newton-Schulz map preserves enough spectral coupling for optimizer behavior to remain close to the full-matrix version.

What would settle it

A controlled experiment on a small transformer where varying the tile size T produces a statistically significant change in final validation loss compared to the full-matrix baseline.

Original abstract

Muon-type optimizers construct update directions for dense neural-network weights by applying a finite Newton-Schulz map to momentum-gradient matrices. For an $H \times W$ matrix, with $r=\min\{H,W\}$ and $s=\max\{H,W\}$, $K$ steps of the full-matrix Newton-Schulz update require $O(r^2 s K)$ work and couple all rows and columns through repeated Gram matrix products. We introduce Hierarchical Muon (HiMuon), a tiled Newton-Schulz scheme for Muon-type optimization. HiMuon partitions each momentum-gradient matrix into $T \times T$ tiles, applies the same finite Newton-Schulz map independently to each tile, and reassembles the results. For finite $T$ below the matrix dimensions, HiMuon defines a local matrix-function map rather than a convergent approximation to the full-matrix update: spectral interactions are preserved within tiles and discarded across tile boundaries. For fixed finite $T$, the leading Newton-Schulz work decreases to $O(H W T K)$, and the computation decomposes into independent small dense matrix operations. This structure enables tile-size-dependent GPU kernels, cross-layer batching, memory-bounded chunking, and runtime tile-size schedules. Experiments on transformer training and controlled matrix-function diagnostics show that HiMuon improves optimizer-step efficiency while keeping training behavior close to full-matrix Muon in the tested regimes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces Hierarchical Muon (HiMuon), a tiled Newton-Schulz scheme for Muon-type optimizers. Each H×W momentum-gradient matrix is partitioned into T×T tiles; the same finite Newton-Schulz iteration is applied independently inside each tile and the results are reassembled. For fixed finite T the leading arithmetic cost drops from O(r² s K) to O(H W T K), the work decomposes into independent small dense matrix operations, and the method is explicitly characterized as a local matrix-function map rather than a convergent approximation to the global iterate. Transformer training runs and controlled matrix-function diagnostics are reported to show improved optimizer-step efficiency while keeping training behavior close to full-matrix Muon.

Significance. If the empirical observation that training dynamics remain close holds under wider conditions, the tiling construction supplies a concrete route to hardware-efficient implementations (tile-size-dependent kernels, cross-layer batching, memory-bounded chunking). The complexity reduction follows directly from counting arithmetic inside independent tiles and does not rely on fitted constants or self-referential definitions.

major comments (2)
  1. [Abstract] The central practical claim, 'training behavior close to full-matrix Muon', is supported only by the reported transformer runs; no quantitative diagnostics (cosine similarity of update directions, deviation in effective step-size distributions, or spectral-norm difference between HiMuon and full-Muon matrices) are described that would bound the effect of the discarded cross-tile singular-vector coupling.
  2. [Abstract / Experiments] The assertion that intra-tile spectral information is sufficient for optimizer behavior rests on an unproven premise; the manuscript correctly notes that finite-T HiMuon is a local map, yet it provides neither a derivation nor an a priori bound showing that the resulting update matrix preserves the orthogonality or scaling properties required by the Muon step.
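
Two of the diagnostics named in the first comment are simple to compute once a tiled update and a full-matrix update for the same input are available. The sketch below is an illustration of those metrics, not the authors' evaluation protocol. The helper names are hypothetical, the spectral norm is estimated by power iteration (so it is approximate), and the two example matrices are placeholders rather than outputs of either optimizer.

```python
import math
import random


def frobenius_inner(A, B):
    """Sum of elementwise products of two equally shaped matrices."""
    return sum(a * b for ra, rb in zip(A, B) for a, b in zip(ra, rb))


def cosine_similarity(A, B):
    """Cosine of the angle between A and B viewed as flattened vectors."""
    denom = math.sqrt(frobenius_inner(A, A) * frobenius_inner(B, B))
    return frobenius_inner(A, B) / denom if denom > 0.0 else 0.0


def spectral_norm(A, iters=200, seed=0):
    """Estimate the largest singular value of A by power iteration on A^T A."""
    rng = random.Random(seed)
    rows, cols = len(A), len(A[0])
    v = [rng.uniform(-1.0, 1.0) for _ in range(cols)]
    for _ in range(iters):
        Av = [sum(a * x for a, x in zip(row, v)) for row in A]
        w = [sum(A[i][j] * Av[i] for i in range(rows)) for j in range(cols)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            # A^T A v vanished: A is zero, or v fell in its null space
            # (unlikely for a random start vector).
            return 0.0
        v = [x / norm for x in w]
    Av = [sum(a * x for a, x in zip(row, v)) for row in A]
    return math.sqrt(sum(x * x for x in Av))


def spectral_norm_diff(A, B):
    """Spectral norm of A - B."""
    D = [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]
    return spectral_norm(D)


# Placeholder matrices standing in for a full-matrix and a tiled update.
full_update = [[1.0, 0.0], [0.0, 1.0]]
tiled_update = [[0.0, 1.0], [1.0, 0.0]]
print(cosine_similarity(full_update, tiled_update))
print(spectral_norm_diff(full_update, tiled_update))
```

Tracking both numbers per layer and per tile size T would give the kind of quantitative bound on cross-tile effects that the comment asks for; the third diagnostic, effective step-size distributions, needs optimizer state and is not sketched here.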

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each major comment point by point below.

point-by-point responses
  1. Referee: [Abstract] The central practical claim, 'training behavior close to full-matrix Muon', is supported only by the reported transformer runs; no quantitative diagnostics (cosine similarity of update directions, deviation in effective step-size distributions, or spectral-norm difference between HiMuon and full-Muon matrices) are described that would bound the effect of the discarded cross-tile singular-vector coupling.

    Authors: We agree that the current manuscript supports the closeness claim primarily through transformer training runs together with the mentioned controlled matrix-function diagnostics, without the specific quantitative bounds listed. In the revision we will add cosine similarity of the resulting update directions, spectral-norm differences between HiMuon and full-Muon matrices, and statistics on effective step-size distributions across layers. These additions will directly quantify the impact of the discarded cross-tile coupling.

    Revision: yes

  2. Referee: [Abstract / Experiments] The assertion that intra-tile spectral information is sufficient for optimizer behavior rests on an unproven premise; the manuscript correctly notes that finite-T HiMuon is a local map, yet it provides neither a derivation nor an a priori bound showing that the resulting update matrix preserves the orthogonality or scaling properties required by the Muon step.

    Authors: The manuscript explicitly defines finite-T HiMuon as a local matrix-function map and does not assert that it converges to or exactly preserves the orthogonality/scaling properties of the global Newton-Schulz iterate. No a priori bound is derived because the construction deliberately discards cross-tile singular-vector interactions; the claim of practical sufficiency is supported only by the reported empirical evidence (training dynamics and matrix diagnostics). We therefore do not plan to add a theoretical derivation that would contradict the local character of the operator.

    Revision: no
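
The point on which referee and authors agree, that the assembled update is not globally orthogonal, can be made concrete with a short calculation. It is an idealized sketch: it assumes every tile has converged to an exactly orthogonal T × T factor (a finite Newton-Schulz iteration only approximates this) and that T divides H and W. The symbols p, q, and U_{ij} are introduced here for illustration and are not the paper's notation.

```latex
U = \begin{pmatrix}
      U_{11} & \cdots & U_{1q} \\
      \vdots &        & \vdots \\
      U_{p1} & \cdots & U_{pq}
    \end{pmatrix},
\qquad p = H/T, \quad q = W/T, \quad U_{ij} U_{ij}^{\top} = I_T .

(U U^{\top})_{ii} = \sum_{j=1}^{q} U_{ij} U_{ij}^{\top} = q \, I_T ,
\qquad
(U U^{\top})_{ik} = \sum_{j=1}^{q} U_{ij} U_{kj}^{\top} \quad (i \neq k) .

\| U \|_F^2 = p \, q \, T = \frac{H W}{T} .
```

Under these assumptions every row of U has Euclidean norm √q rather than 1, and the off-diagonal blocks of U Uᵀ vanish only in special cases. A global rescaling by 1/√q would fix the diagonal blocks but not the off-diagonal ones. For comparison, an exactly semi-orthogonal H × W matrix has squared Frobenius norm r = min{H, W}. Whether and how the paper rescales the assembled update is not stated in the abstract.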

Circularity Check

0 steps flagged

No circularity: complexity follows directly from tile partitioning definition

full rationale

The paper defines HiMuon by partitioning each momentum-gradient matrix into independent T×T tiles and applying the finite Newton-Schulz map to each tile separately. The stated complexity reduction to O(H W T K) is obtained by direct arithmetic counting on the smaller per-tile Gram products and matrix multiplies; no parameter is fitted to data and then relabeled as a prediction, no self-citation supplies a uniqueness theorem, and no ansatz is smuggled in. The claim that training behavior remains close is presented as an empirical observation from transformer runs rather than a derived equality, so the central derivation chain contains no self-referential reduction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The method rests on standard dense linear algebra operations and the user-chosen tile size T; no new physical or mathematical entities are postulated.

free parameters (1)
  • T
    Tile side length chosen by the user; controls locality and arithmetic cost.
axioms (1)
  • [standard math] Matrix multiplication and addition are associative and distributive, as defined in linear algebra.
    Invoked implicitly when the Newton-Schulz iteration is applied to each tile.

pith-pipeline@v0.9.1-grok · 5792 in / 1315 out tokens · 56083 ms · 2026-06-26T03:29:56.568237+00:00 · methodology

