HiMuon partitions momentum-gradient matrices into T x T tiles, runs independent Newton-Schulz iterations on each tile, and reassembles the results, reducing leading cost to O(H W T K) while defining a local rather than global matrix map.
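The tile-wise scheme summarized above can be illustrated with a minimal NumPy sketch. It rests on several assumptions not stated in the summary: the names `newton_schulz` and `tiled_newton_schulz` are hypothetical, the classic cubic Newton-Schulz update stands in for whatever polynomial the paper actually uses, each tile is normalized by its own Frobenius norm, and H and W are assumed to be divisible by T. Each tile costs O(T^3) per iteration and there are (H/T)(W/T) tiles, which is consistent with the stated O(H W T K) leading cost for K iterations.

```python
import numpy as np


def newton_schulz(X, steps=5, eps=1e-7):
    """Approximate the orthogonal polar factor of X with the cubic
    Newton-Schulz iteration (an assumed stand-in, not the paper's exact update)."""
    # Frobenius normalization keeps every singular value in (0, 1],
    # inside the convergence region of the cubic iteration.
    X = X / (np.linalg.norm(X) + eps)
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X


def tiled_newton_schulz(M, tile=4, steps=5):
    """Apply newton_schulz independently to each tile x tile block of M
    and write the results back in place of the original blocks."""
    M = np.asarray(M, dtype=float)
    H, W = M.shape
    # Assumption for this sketch: no ragged edge tiles.
    assert H % tile == 0 and W % tile == 0, "H and W must be multiples of tile"
    out = np.empty_like(M)
    for i in range(0, H, tile):
        for j in range(0, W, tile):
            block = M[i:i + tile, j:j + tile]
            out[i:i + tile, j:j + tile] = newton_schulz(block, steps)
    return out
```

Because every tile is processed in isolation, perturbing one tile of the input changes only the matching tile of the output. This is the sense in which the map is local rather than global: the reassembled matrix is in general not the polar factor of the full matrix.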
Southworth and Stephen Thomas , year =
2 Pith papers cite this work.
Citing papers by year: 2026 (2)
Representative citing papers
Muon optimizer outperforms AdamW in ViT training on two image datasets, with gains that depend on data augmentation strength and are linked to wider singular-value spread in QKV gradients and prevention of late-training mode collapse in MLP blocks.
citing papers explorer
- Muon in Vision Transformers: Optimizer-Recipe Interactions and Gradient Spectra