A unified spectral condition for μP under width-depth scaling reveals a transition at k=1 vs k≥2 transformations per residual block and enables stable feature learning for practical architectures like Transformers.
A method for solving the convex programming problem with convergence rate o(1/k2)
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.LG 1years
2026 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
Spectral Condition for $\mu$P under Width-Depth Scaling
A unified spectral condition for μP under width-depth scaling reveals a transition at k=1 vs k≥2 transformations per residual block and enables stable feature learning for practical architectures like Transformers.