Derives μP scalings for GQA via promoted spectral-norm definition of feature learning and a modified norm preserving scaling laws for non-full-rank matrices, with experiments showing learning-rate transfer.
How to set AdamW 's weight decay as you scale model and dataset size
5 Pith papers cite this work. Polarity classification is still indexing.
fields
cs.LG 5years
2026 5representative citing papers
MD Decoupling factorizes weights into fixed-norm directions and learnable per-row/column magnitudes updated at independent rates, improving Adam and Muon training stability and scale transfer without weight decay or warmup.
A Weibull diagnostic framework classifies transformer weight matrices into consistent functional classes via the shape parameter k and tracks training progress via the scale parameter lambda across multiple architectures.
HyperP transfers optimal learning rates across model width, depth, tokens, and MoE granularity under Frobenius-sphere constraints, delivering stable scaling and 1.58x efficiency gains.
Muon learns more robust and transferable features than Adam and SGD, shown via corruption robustness tests, transfer experiments, layer-wise probes, effective rank measurements, and a theoretical proof on margins in a multi-component classification problem.
citing papers explorer
-
GQA-{\mu}P: The maximal parameterization update for grouped query attention
Derives μP scalings for GQA via promoted spectral-norm definition of feature learning and a modified norm preserving scaling laws for non-full-rank matrices, with experiments showing learning-rate transfer.
-
Improving Neural Network Training by Decoupling the Magnitude and Direction of Weight Vectors
MD Decoupling factorizes weights into fixed-norm directions and learnable per-row/column magnitudes updated at independent rates, improving Adam and Muon training stability and scale transfer without weight decay or warmup.
-
A Two-Parameter Weibull Framework for Diagnosing Transformer Weight Distributions
A Weibull diagnostic framework classifies transformer weight matrices into consistent functional classes via the shape parameter k and tracks training progress via the scale parameter lambda across multiple architectures.
-
Rethinking Language Model Scaling under Transferable Hypersphere Optimization
HyperP transfers optimal learning rates across model width, depth, tokens, and MoE granularity under Frobenius-sphere constraints, delivering stable scaling and 1.58x efficiency gains.
-
Muon Learns More Robust and Transferable Features than Adam
Muon learns more robust and transferable features than Adam and SGD, shown via corruption robustness tests, transfer experiments, layer-wise probes, effective rank measurements, and a theoretical proof on margins in a multi-component classification problem.