HyperP transfers optimal learning rates across model width, depth, tokens, and MoE granularity under Frobenius-sphere constraints, delivering stable scaling and 1.58x efficiency gains.
How to set adamw’s weight decay as you scale model and dataset size.arXiv preprint arXiv: 2405.13698,
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.LG 1years
2026 1verdicts
CONDITIONAL 1representative citing papers
citing papers explorer
-
Rethinking Language Model Scaling under Transferable Hypersphere Optimization
HyperP transfers optimal learning rates across model width, depth, tokens, and MoE granularity under Frobenius-sphere constraints, delivering stable scaling and 1.58x efficiency gains.