Spectral Condition for $\mu$P under Width-Depth Scaling

2026 · cs.LG · arXiv 2603.00541

2 Pith papers cite this work. Polarity classification is still indexing.

abstract

Generative foundation models are increasingly scaled in both width and depth, posing significant challenges for stable feature learning and reliable hyperparameter (HP) transfer across model sizes. While maximal update parameterization ($\mu$P) has provided a principled solution to both problems for width scaling, existing extensions to the joint width-depth scaling regime remain fragmented, architecture- and optimizer-specific, and often rely on technically involved theories. In this work, we develop a simple and unified spectral framework for $\mu$P under joint width-depth scaling. For deep residual networks whose residual blocks contain $k$ transformations, the framework specifies how the norms of weights and their per-step updates should scale with width and depth. It reveals a fundamental transition from $k=1$ to $k\geq 2$, unifying previously disparate $\mu$P formulations and identifying the $k\geq 2$ case as more appropriate for practical architectures with multi-transformation branches such as Transformers. Building on this framework, we derive a general recipe for implementing $\mu$P across a broad class of optimizers by mapping spectral constraints to concrete HP parameterizations, recovering existing results and extending them to additional optimizers. Finally, experiments on GPT-2 style language models show that the $\mu$P formulation derived from the $k\geq 2$ case achieves stable feature learning and robust HP transfer under width-depth scaling, whereas standard parameterization and $\mu$P in the $k=1$ case often fail to do so. These results support the practical effectiveness of the proposed spectral framework.
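As a rough illustration of what a spectral constraint on weights looks like, the sketch below checks the width-only spectral condition known from earlier work on $\mu$P: a layer with `fan_in` inputs and `fan_out` outputs should keep its spectral norm on the order of $\sqrt{\text{fan\_out}/\text{fan\_in}}$ as width grows. The depth-dependent factors for residual branches with $k$ transformations are what the paper derives, and they are not stated in the abstract, so they are deliberately left out here. The helper names (`spectral_target`, `spectral_ratio`, `rescale_to_target`) are hypothetical and chosen only for this example.

```python
import numpy as np


def spectral_target(fan_out: int, fan_in: int) -> float:
    """Width-only target scale sqrt(fan_out / fan_in) for the spectral norm."""
    return float(np.sqrt(fan_out / fan_in))


def spectral_ratio(w: np.ndarray) -> float:
    """Spectral norm of w divided by the width-only target.

    A ratio that stays bounded away from 0 and infinity as width grows
    is consistent with the Theta(sqrt(fan_out / fan_in)) condition.
    """
    fan_out, fan_in = w.shape
    return float(np.linalg.norm(w, ord=2) / spectral_target(fan_out, fan_in))


def rescale_to_target(w: np.ndarray, scale: float = 1.0) -> np.ndarray:
    """Rescale w so its spectral norm equals scale * sqrt(fan_out / fan_in).

    Assumption: any depth-dependent factor would be folded into `scale`
    by the caller; this sketch does not specify it.
    """
    fan_out, fan_in = w.shape
    target = scale * spectral_target(fan_out, fan_in)
    return w * (target / np.linalg.norm(w, ord=2))


rng = np.random.default_rng(0)
ratios = {}
for width in (64, 256, 512):
    # Gaussian init with standard deviation 1 / sqrt(fan_in).
    w = rng.standard_normal((width, width)) / np.sqrt(width)
    ratios[width] = spectral_ratio(w)

for width, ratio in ratios.items():
    print(f"width={width:4d}  spectral norm / target = {ratio:.3f}")
```

For square Gaussian layers the printed ratio should stay roughly constant across widths, which is the kind of width-independence a spectral condition asks for; the same check could be applied to a per-step update matrix instead of the weights.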

representative citing papers

GQA-$\mu$P: The maximal parameterization update for grouped query attention

cs.LG · 2026-05-14 · unverdicted · novelty 7.0

Derives $\mu$P scalings for GQA via a promoted spectral-norm definition of feature learning and a modified norm that preserves scaling laws for non-full-rank matrices, with experiments showing learning-rate transfer.

MuCon: Clipped Muon Updates for LLM Training

cs.LG · 2026-05-26 · unverdicted · novelty 5.0

MuCon defines a clipped-Muon update via singular-value clipping and derives two exact identities for approximating the clip without dense SVD, while noting numerical instability near the threshold.

citing papers explorer

Showing 2 of 2 citing papers.

GQA-$\mu$P: The maximal parameterization update for grouped query attention · cs.LG · 2026-05-14 · unverdicted · none · ref 21 · internal anchor
MuCon: Clipped Muon Updates for LLM Training · cs.LG · 2026-05-26 · unverdicted · none · ref 2 · internal anchor