
Don’t be lazy: CompleteP enables compute-efficient deep transformers

Nolan Dey, Bin Claire Zhang, Lorenzo Noci, Mufan Li, Blake Bordelon, Shane Bergsma, Cengiz Pehlevan, Boris Hanin, Joel Hestness · 2025 · arXiv 2505.01618

12 Pith papers cite this work. Polarity classification is still indexing.




citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

GQA-μP: The maximal parameterization update for grouped query attention

cs.LG · 2026-05-14 · unverdicted · novelty 7.0

Derives μP scalings for GQA via promoted spectral-norm definition of feature learning and a modified norm preserving scaling laws for non-full-rank matrices, with experiments showing learning-rate transfer.

How to Scale Mixture-of-Experts: From muP to the Maximally Scale-Stable Parameterization

cs.LG · 2026-05-13 · unverdicted · novelty 7.0

The authors derive a Maximally Scale-Stable Parameterization (MSSP) for MoE models that achieves robust learning-rate transfer and monotonic performance gains with scale across co-scaling regimes of width, experts, and sparsity.

Theory of Optimal Learning Rate Schedules and Scaling Laws for a Random Feature Model

cond-mat.dis-nn · 2026-02-04 · unverdicted · novelty 7.0

In a random feature model, optimal SGD learning-rate schedules are polynomial decay in the easy phase and warmup-stable-decay in the hard phase, outperforming constant or simple power-law schedules and transferring differently across training horizons.

On the Residual Scaling of Looped Transformers: Stability and Transferability

cs.LG · 2026-06-16 · unverdicted · novelty 6.0

Looped Transformers require residual scaling ε = 1/N due to correlated updates from weight sharing, unlike standard 1/sqrt(L), enabling learning rate transfer independent of loop count N.

Quantifying Hyperparameter Transfer and the Importance of Embedding Layer Learning Rate

cs.LG · 2026-05-20 · unverdicted · novelty 6.0

A framework quantifies hyperparameter transfer via scaling-law fit quality, extrapolation robustness, and loss penalty, with ablations showing that μP's advantage over standard parameterization stems from maximizing the embedding layer learning rate to avoid bottlenecks and instabilities in AdamW.

When is Warmstarting Effective for Scaling Language Models?

cs.LG · 2026-05-13 · unverdicted · novelty 6.0

A 2x growth factor in model warmstarting yields reliable training speedups for language models under 20 tokens/parameter budgets, with an empirical upper bound on effective growth factors.

Sparse Layers are Critical to Scaling Looped Language Models

cs.LG · 2026-05-09 · unverdicted · novelty 6.0

Looped MoE models scale better than standard transformers because different experts activate on each loop pass, recovering expressivity without extra parameters, and support superior early exits.

Spectral Dynamics in Deep Networks: Feature Learning, Outlier Escape, and Learning Rate Transfer

cond-mat.dis-nn · 2026-05-08 · unverdicted · novelty 6.0 · 2 refs

A two-level DMFT tracks bulk and outlier spectral dynamics in wide networks, predicting width-consistent outlier growth and hyperparameter transfer under muP scaling for deep linear nets while noting bulk restructuring for large-output tasks.

Spectral Condition for μP under Width-Depth Scaling

cs.LG · 2026-02-28 · unverdicted · novelty 6.0

A unified spectral condition for μP under width-depth scaling reveals a transition at k=1 vs k≥2 transformations per residual block and enables stable feature learning for practical architectures like Transformers.

Statistical Properties of Training & Generalization

stat.ML · 2026-06-18 · unverdicted · novelty 2.0

Neural scaling laws in deep learning interact with physics constraints and inductive biases beyond classical statistics.

There Will Be a Scientific Theory of Deep Learning

stat.ML · 2026-04-23 · unverdicted · novelty 2.0

A mechanics of the learning process is emerging in deep learning theory, characterized by dynamics, coarse statistics, and falsifiable predictions across idealized settings, limits, laws, hyperparameters, and universal behaviors.

One LR Doesn't Fit All: Heavy-Tail Guided Layerwise Learning Rates for LLMs

cs.LG · 2026-05-21

citing papers explorer

Showing 12 of 12 citing papers.

GQA-μP: The maximal parameterization update for grouped query attention cs.LG · 2026-05-14 · unverdicted · none · ref 5
How to Scale Mixture-of-Experts: From muP to the Maximally Scale-Stable Parameterization cs.LG · 2026-05-13 · unverdicted · none · ref 50
Theory of Optimal Learning Rate Schedules and Scaling Laws for a Random Feature Model cond-mat.dis-nn · 2026-02-04 · unverdicted · none · ref 6
On the Residual Scaling of Looped Transformers: Stability and Transferability cs.LG · 2026-06-16 · unverdicted · none · ref 8
Quantifying Hyperparameter Transfer and the Importance of Embedding Layer Learning Rate cs.LG · 2026-05-20 · unverdicted · none · ref 9
When is Warmstarting Effective for Scaling Language Models? cs.LG · 2026-05-13 · unverdicted · none · ref 5
Sparse Layers are Critical to Scaling Looped Language Models cs.LG · 2026-05-09 · unverdicted · none · ref 34
Spectral Dynamics in Deep Networks: Feature Learning, Outlier Escape, and Learning Rate Transfer cond-mat.dis-nn · 2026-05-08 · unverdicted · none · ref 22 · 2 links
Spectral Condition for μP under Width-Depth Scaling cs.LG · 2026-02-28 · unverdicted · none · ref 10
Statistical Properties of Training & Generalization stat.ML · 2026-06-18 · unverdicted · none · ref 38
There Will Be a Scientific Theory of Deep Learning stat.ML · 2026-04-23 · unverdicted · none · ref 96
One LR Doesn't Fit All: Heavy-Tail Guided Layerwise Learning Rates for LLMs cs.LG · 2026-05-21 · unreviewed · ref 2
