Preconditioned Stochastic Gradient Descent
2 Pith papers cite this work. Polarity classification is still indexing.
Fields: cs.LG (2)
Verdicts: UNVERDICTED (2)
Citing papers:
- SparseOpt: Addressing Normalization-induced Gradient Skew in Sparse Training
  SparseOpt is a new optimizer that counters batch normalization's gradient skew in dynamic sparse training, yielding faster convergence and better accuracy on ResNet models for CIFAR-100 and ImageNet.
- Muon is Scalable for LLM Training
  Muon optimizer with weight decay and update scaling achieves ~2x efficiency over AdamW for large LLMs, shown via the Moonlight 3B/16B MoE model trained on 5.7T tokens.