2602.13498
3 Pith papers cite this work.
Years: 2026 (3)
Citing papers
- Hierarchical Muon: Tiled Newton-Schulz Updates for Efficient Muon Optimization
  HiMuon partitions momentum-gradient matrices into T x T tiles, runs independent Newton-Schulz iterations on each tile, and reassembles the results, reducing leading cost to O(H W T K) while defining a local rather than global matrix map.
- Why Muon Outperforms Adam: A Curvature Perspective
  Muon outperforms Adam by reducing curvature penalty via lower Normalized Directional Sharpness, as shown via Taylor approximation on LLM training and proven on stylized quadratic problems with heterogeneous curvature.
- Zeta: Dual Whitening for Matrix Optimization via Coordinate-Adaptive Preconditioning
  Zeta applies coordinate whitening followed by spectral whitening in a fixed order to reduce orthogonalization error in matrix optimization for neural networks.
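
The tiling idea in the Hierarchical Muon summary above (split into T x T tiles, iterate on each tile independently, reassemble) can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the function names `newton_schulz` and `tiled_newton_schulz` are hypothetical, the classic cubic Newton-Schulz iteration X <- 1.5 X - 0.5 X X^T X stands in for whatever polynomial the authors use, and the sketch assumes both matrix dimensions are divisible by the tile size.

```python
import numpy as np


def newton_schulz(tile, steps=10, eps=1e-7):
    """Cubic Newton-Schulz iteration (an assumed stand-in, not the paper's variant).

    Dividing by the Frobenius norm keeps every singular value at or below 1,
    which lies inside the convergence region of the cubic iteration.
    """
    x = tile / (np.linalg.norm(tile) + eps)
    for _ in range(steps):
        x = 1.5 * x - 0.5 * x @ x.T @ x
    return x


def tiled_newton_schulz(m, tile_size, steps=10):
    """Apply newton_schulz to each tile_size x tile_size block of m and reassemble."""
    h, w = m.shape
    if h % tile_size or w % tile_size:
        # Simplifying assumption of this sketch: no padding or ragged edge tiles.
        raise ValueError("matrix dimensions must be divisible by tile_size")
    out = np.empty_like(m, dtype=float)
    for i in range(0, h, tile_size):
        for j in range(0, w, tile_size):
            block = m[i:i + tile_size, j:j + tile_size]
            out[i:i + tile_size, j:j + tile_size] = newton_schulz(block, steps)
    return out
```

Under the reading that H and W are the matrix dimensions and K is the number of iterations (an interpretation of the summary, not something stated on this page), the cost figure follows from a simple count: each iteration on one tile costs O(T^3), there are H W / T^2 tiles, and K iterations give O(H W T K). The result is approximately orthogonal within each tile only, which is one way to read the "local rather than global matrix map" remark.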