How to set AdamW’s weight decay as you scale model and dataset size. arXiv preprint arXiv:2405.13698

[W A24] Xi Wang, Laurence Aitchison · arXiv 2405.13698

1 Pith paper cites this work. Polarity classification is still being indexed.

representative citing papers

Rethinking Language Model Scaling under Transferable Hypersphere Optimization

cs.LG · 2026-03-30 · conditional · novelty 6.0

HyperP transfers optimal learning rates across model width, depth, tokens, and MoE granularity under Frobenius-sphere constraints, delivering stable scaling and 1.58x efficiency gains.

citing papers explorer

Showing 1 of 1 citing paper.

Rethinking Language Model Scaling under Transferable Hypersphere Optimization cs.LG · 2026-03-30 · conditional · none · ref 22
