Derives μP scalings for GQA via promoted spectral-norm definition of feature learning and a modified norm preserving scaling laws for non-full-rank matrices, with experiments showing learning-rate transfer.
hub
Don’t be lazy: Completep enables compute-efficient deep transformers
12 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
years
2026 12roles
background 1polarities
background 1representative citing papers
The authors derive a Maximally Scale-Stable Parameterization (MSSP) for MoE models that achieves robust learning-rate transfer and monotonic performance gains with scale across co-scaling regimes of width, experts, and sparsity.
In a random feature model, optimal SGD learning-rate schedules are polynomial decay in the easy phase and warmup-stable-decay in the hard phase, outperforming constant or simple power-law schedules and transferring differently across training horizons.
Looped Transformers require residual scaling ε = 1/N due to correlated updates from weight sharing, unlike standard 1/sqrt(L), enabling learning rate transfer independent of loop count N.
A framework quantifies hyperparameter transfer via scaling-law fit quality, extrapolation robustness, and loss penalty, with ablations showing that μP's advantage over standard parameterization stems from maximizing the embedding layer learning rate to avoid bottlenecks and instabilities in AdamW.
A 2x growth factor in model warmstarting yields reliable training speedups for language models under 20 tokens/parameter budgets, with an empirical upper bound on effective growth factors.
Looped MoE models scale better than standard transformers because different experts activate on each loop pass, recovering expressivity without extra parameters, and support superior early exits.
A two-level DMFT tracks bulk and outlier spectral dynamics in wide networks, predicting width-consistent outlier growth and hyperparameter transfer under muP scaling for deep linear nets while noting bulk restructuring for large-output tasks.
A unified spectral condition for μP under width-depth scaling reveals a transition at k=1 vs k≥2 transformations per residual block and enables stable feature learning for practical architectures like Transformers.
Neural scaling laws in deep learning interact with physics constraints and inductive biases beyond classical statistics.
A mechanics of the learning process is emerging in deep learning theory, characterized by dynamics, coarse statistics, and falsifiable predictions across idealized settings, limits, laws, hyperparameters, and universal behaviors.
citing papers explorer
-
GQA-{\mu}P: The maximal parameterization update for grouped query attention
Derives μP scalings for GQA via promoted spectral-norm definition of feature learning and a modified norm preserving scaling laws for non-full-rank matrices, with experiments showing learning-rate transfer.
-
How to Scale Mixture-of-Experts: From muP to the Maximally Scale-Stable Parameterization
The authors derive a Maximally Scale-Stable Parameterization (MSSP) for MoE models that achieves robust learning-rate transfer and monotonic performance gains with scale across co-scaling regimes of width, experts, and sparsity.
-
Theory of Optimal Learning Rate Schedules and Scaling Laws for a Random Feature Model
In a random feature model, optimal SGD learning-rate schedules are polynomial decay in the easy phase and warmup-stable-decay in the hard phase, outperforming constant or simple power-law schedules and transferring differently across training horizons.
-
On the Residual Scaling of Looped Transformers: Stability and Transferability
Looped Transformers require residual scaling ε = 1/N due to correlated updates from weight sharing, unlike standard 1/sqrt(L), enabling learning rate transfer independent of loop count N.
-
Quantifying Hyperparameter Transfer and the Importance of Embedding Layer Learning Rate
A framework quantifies hyperparameter transfer via scaling-law fit quality, extrapolation robustness, and loss penalty, with ablations showing that μP's advantage over standard parameterization stems from maximizing the embedding layer learning rate to avoid bottlenecks and instabilities in AdamW.
-
When is Warmstarting Effective for Scaling Language Models?
A 2x growth factor in model warmstarting yields reliable training speedups for language models under 20 tokens/parameter budgets, with an empirical upper bound on effective growth factors.
-
Sparse Layers are Critical to Scaling Looped Language Models
Looped MoE models scale better than standard transformers because different experts activate on each loop pass, recovering expressivity without extra parameters, and support superior early exits.
-
Spectral Dynamics in Deep Networks: Feature Learning, Outlier Escape, and Learning Rate Transfer
A two-level DMFT tracks bulk and outlier spectral dynamics in wide networks, predicting width-consistent outlier growth and hyperparameter transfer under muP scaling for deep linear nets while noting bulk restructuring for large-output tasks.
-
Spectral Condition for $\mu$P under Width-Depth Scaling
A unified spectral condition for μP under width-depth scaling reveals a transition at k=1 vs k≥2 transformations per residual block and enables stable feature learning for practical architectures like Transformers.
-
Statistical Properties of Training & Generalization
Neural scaling laws in deep learning interact with physics constraints and inductive biases beyond classical statistics.
-
There Will Be a Scientific Theory of Deep Learning
A mechanics of the learning process is emerging in deep learning theory, characterized by dynamics, coarse statistics, and falsifiable predictions across idealized settings, limits, laws, hyperparameters, and universal behaviors.
- One LR Doesn't Fit All: Heavy-Tail Guided Layerwise Learning Rates for LLMs