A unified spectral condition for μP under width-depth scaling reveals a transition at k=1 vs k≥2 transformations per residual block and enables stable feature learning for practical architectures like Transformers.
Tensor programs ivb: Adaptive optimization in the infinite-width limit.CoRR, abs/2308.01814
2 Pith papers cite this work. Polarity classification is still indexing.
years
2026 2verdicts
UNVERDICTED 2representative citing papers
A mechanics of the learning process is emerging in deep learning theory, characterized by dynamics, coarse statistics, and falsifiable predictions across idealized settings, limits, laws, hyperparameters, and universal behaviors.
citing papers explorer
-
Spectral Condition for $\mu$P under Width-Depth Scaling
A unified spectral condition for μP under width-depth scaling reveals a transition at k=1 vs k≥2 transformations per residual block and enables stable feature learning for practical architectures like Transformers.
-
There Will Be a Scientific Theory of Deep Learning
A mechanics of the learning process is emerging in deep learning theory, characterized by dynamics, coarse statistics, and falsifiable predictions across idealized settings, limits, laws, hyperparameters, and universal behaviors.