Derives μP scalings for GQA via a promoted spectral-norm definition of feature learning and a modified norm that preserves scaling laws for non-full-rank matrices, with experiments showing learning-rate transfer.
International Conference on Machine Learning (ICML)
12 Pith papers cite this work. Polarity classification is still indexing.
verdicts: UNVERDICTED 12
roles: background 1
polarities: background 1
representative citing papers
The authors derive a Maximally Scale-Stable Parameterization (MSSP) for MoE models that achieves robust learning-rate transfer and monotonic performance gains with scale across co-scaling regimes of width, experts, and sparsity.
ALU uses public data to suppress unlearning cost quadratically while characterizing distribution mismatch effects, enabling mass unlearning with maintained utility.
In a random feature model, optimal SGD learning-rate schedules are polynomial decay in the easy phase and warmup-stable-decay in the hard phase, outperforming constant or simple power-law schedules and transferring differently across training horizons.
A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.
MD Decoupling factorizes weights into fixed-norm directions and learnable per-row/column magnitudes updated at independent rates, improving Adam and Muon training stability and scale transfer without weight decay or warmup.
νGPT is a modified parameterization of normalized transformers that enables learning rate transfer across width, depth, and token horizon.
C-voting improves recurrent reasoning models by selecting among multiple latent trajectories the one with highest average top-1 probability, achieving 4.9% better Sudoku-hard accuracy than energy-based voting and outperforming HRM on Sudoku-extreme and Maze when paired with the new ItrSA++ model.
Parcae stabilizes looped LLMs via spectral norm constraints on injection parameters, enabling power-law scaling for training FLOPs and saturating exponential scaling at test time that improves quality over fixed-depth baselines under fixed parameter budgets.
Derives a novel two-point deterministic equivalence for random matrix resolvents to obtain unified asymptotics for SGD-trained linear regression, kernel regression, and random feature models.
GWT projects gradients into wavelet subspaces to compress optimizer states for memory-efficient LLM training while claiming performance parity with full-rank updates.
Review of neural scaling laws and their relation to constraints and inductive biases when applying machine learning to physics problems.