Disentangling adaptive gradient methods from learning rates. arXiv preprint arXiv:2002.11803.
2 Pith papers cite this work.
Fields: cs.LG (2). Years: 2026 (2). Verdicts: UNVERDICTED (2).
Citing papers
- Spectral Scaling Laws of Muon
Muon momentum matrices show layer-dependent power-law scaling of stabilized singular value quantiles with model size from 77M to 2.8B parameters.
- FOAM: Frequency and Operator Error-Based Adaptive Damping Method for Reducing Staleness-Oriented Error for Shampoo
FOAM adaptively controls damping and update frequency in Shampoo based on staleness-oriented error approximation to cut wall-clock time while preserving convergence.