arxiv: 2606.25971 · v1 · pith:ZX4YW64Enew · submitted 2026-06-24 · 💻 cs.LG

Improving Neural Network Training by Decoupling the Magnitude and Direction of Weight Vectors

Pith reviewed 2026-06-25 20:03 UTC · model grok-4.3

classification 💻 cs.LG
keywords optimizer modification, weight magnitude, weight direction, learning rate transfer, Adam, Muon, Mixture-of-Experts

The pith

MD Decoupling separates magnitude from direction in each weight matrix so they can be updated at independent learning rates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that standard optimizers like Adam and Muon couple magnitude and direction inside each weight matrix, so changes in one affect the other in ways the learning rate cannot control directly. MD Decoupling fixes this by representing every weight as a fixed-norm direction vector on a hypersphere plus separate learnable magnitude gains for each row and each column. These magnitude gains receive their own learning rates while the direction is updated independently. The resulting optimizer works without weight decay or warmup, improves over well-tuned baselines, lets the optimal learning rate transfer across different model widths, and continues to help on large Mixture-of-Experts models.
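The coupling can be checked on a two-dimensional toy case. The sketch below is not the paper's experiment; it assumes a scale-invariant loss (so the gradient is orthogonal to the weight vector) and a normalized step of fixed length, and the helper names are hypothetical. It takes the same step from a small and a large weight vector and reports how far each one rotates and how much its norm grows.

```python
import numpy as np

def normalized_step(w, grad, lr):
    """Move by exactly `lr` along the negative gradient direction."""
    return w - lr * grad / np.linalg.norm(grad)

def angle_between(a, b):
    """Angle in radians between two vectors."""
    cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))

# Assumption: the loss depends only on w / ||w||, so the gradient is
# orthogonal to w. Here w lies on the x-axis and the gradient on the y-axis.
lr = 0.1
grad = np.array([0.0, 1.0])
for scale in (0.5, 2.0):
    w = scale * np.array([1.0, 0.0])
    w_new = normalized_step(w, grad, lr)
    rotation = angle_between(w, w_new)              # arctan(lr / ||w||)
    growth = np.linalg.norm(w_new) / np.linalg.norm(w)
    print(f"||w||={scale}: rotation={rotation:.4f} rad, norm ratio={growth:.4f}")
```

In this setting the rotation is arctan(lr / ||w||) and the new norm is sqrt(||w||^2 + lr^2), so the smaller vector rotates more and the norm grows on every step even though only the direction affects the loss.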

Core claim

MD Decoupling factorizes each weight matrix into a fixed-norm direction on the hypersphere together with learnable per-row and per-column magnitude gains that are stepped at separate learning rates; the model still sees only the fused weight tensor, yet the separation removes the indirect coupling that normally forces reliance on weight decay and warmup.

What carries the argument

Magnitude-Direction (MD) Decoupling: the factorization of each weight into fixed-norm direction plus independently learned per-row and per-column magnitude gains.
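A minimal NumPy sketch of what such a factorization might look like follows. It is an illustration under stated assumptions, not the authors' implementation: the gains are combined additively (one reading of the "γrow + γcol" notation in the Figure 17 caption; a multiplicative combination is equally plausible), the direction is held at a fixed Frobenius norm (the constraint the Figure 4 caption says the authors adopt), plain SGD stands in for Adam or Muon, and all function names, the radius, and the initial gain values are hypothetical.

```python
import numpy as np

def fuse(direction, gain_row, gain_col):
    """Build the single fused weight the model sees."""
    gain = gain_row[:, None] + gain_col[None, :]   # assumed additive combination
    return gain * direction

def md_step(direction, gain_row, gain_col, grad_w, lr_dir, lr_gain, radius):
    """One plain-SGD step per factor, given dL/dW for the fused weight."""
    gain = gain_row[:, None] + gain_col[None, :]
    # Chain rule through W = gain * direction (elementwise product).
    grad_dir = gain * grad_w
    grad_gain = direction * grad_w
    grad_row = grad_gain.sum(axis=1)   # each row gain touches one full row
    grad_col = grad_gain.sum(axis=0)   # each column gain touches one full column
    # Direction: step, then rescale back onto the fixed-norm sphere.
    direction = direction - lr_dir * grad_dir
    direction = direction * (radius / np.linalg.norm(direction))
    # Gains: separate learning rate, no norm constraint.
    gain_row = gain_row - lr_gain * grad_row
    gain_col = gain_col - lr_gain * grad_col
    return direction, gain_row, gain_col

rng = np.random.default_rng(0)
d_out, d_in = 4, 3
radius = np.sqrt(d_out * d_in)            # hypothetical: unit RMS per entry
direction = rng.standard_normal((d_out, d_in))
direction *= radius / np.linalg.norm(direction)
gain_row = np.full(d_out, 0.5)            # hypothetical init: total gain of 1
gain_col = np.full(d_in, 0.5)

grad_w = rng.standard_normal((d_out, d_in))   # stand-in for a real gradient
direction, gain_row, gain_col = md_step(
    direction, gain_row, gain_col, grad_w,
    lr_dir=0.1, lr_gain=0.01, radius=radius,
)
weight = fuse(direction, gain_row, gain_col)
```

The point of the sketch is the separation: `lr_dir` only ever moves the direction along the sphere, and `lr_gain` only ever changes the scale, so neither learning rate leaks into the other quantity.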

If this is right

  • Both Adam and Muon with MD Decoupling outperform their well-tuned baselines on the tested tasks.
  • The optimal learning rate found on one model width remains optimal when width changes, removing the need to retune.
  • The same modification continues to improve training on large Mixture-of-Experts models.
  • Weight decay and warmup can be removed while training remains stable.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same factorization could be tested on other first-order methods beyond Adam and Muon.
  • If magnitude control becomes explicit, scale-related instabilities in very deep or wide networks may become easier to diagnose.
  • Training recipes that currently rely on normalization layers to absorb all scale changes might be simplified.

Load-bearing premise

Updating per-row and per-column magnitude gains at separate learning rates will produce more stable and transferable dynamics than the coupled updates performed by ordinary optimizers.

What would settle it

A controlled run on a standard benchmark where MD Decoupling either matches or underperforms a well-tuned baseline after removing weight decay and warmup, or where the optimal learning rate still changes when model width is varied.

Figures

Figures reproduced from arXiv: 2606.25971 by Alejandro Hernández-Cano, Alexander Hägele, Atli Kosson, Martin Jaggi.

Figure 1: Magnitude–Direction (MD) Decoupling improves on well-tuned Adam and Muon, keeps the improvement across compute on large MoEs, and makes the optimal learning rate transfer across model width. Three views of the method (full details in Section 4). (Left) Learning-rate sweep on a dense model: independently of the base optimizer, fixing the weights onto a sphere improves the optimal loss, and adding learnable …
Figure 2: In standard optimizers the weight magnitude silently distorts each update: the same step rotates the weights more at small magnitude and inflates the norm even when only the direction matters. Illustrated on a toy scale-invariant loss, where only the direction of the weights affects the loss. (Left) The loss landscape in polar coordinates, with the same normalized optimizer step taken from a small (red) an…
Figure 3: Magnitude–Direction Decoupling has two independent per-matrix choices: which axis the direction is normalized along, and which axis the learnable gain acts on. (Left) The axis along which a matrix can be constrained, and (Right) the axis along which the gains can act: row, column, both, or flat / Frobenius.
Figure 4: The choice of normalization axis barely affects the final loss, so we adopt the most flexible Frobenius constraint. Comparison of constraining each output row, each input column, or the whole matrix (Frobenius) to a fixed norm, on the 181M dense model (25B tokens), without gains. (Left) LR sweep of the final loss for each normalization mode. (Right) The corresponding loss curves over training.
Figure 5: Holding each embedding vector at unit norm performs slightly better than leaving them unconstrained, while keeping the embedding update better behaved. Ablation of the embedding normalization on the 181M dense model (25B tokens): constraining every embedding vector to unit norm versus letting its norm vary. (Left) LR sweep of the final loss for each mode. (Center) The loss over training for the various emb…
Figure 6: Adding learnable magnitude gains on top of spherical training helps noticeably; a combined per-row-and-column gain works best. The gain parameterization makes little difference, with softplus giving a minimal edge, and all parameterizations training stably. On the 181M dense model (25B tokens). (Left) LR sweep over gain modes (scalar, per-row, per-column, or both rows and columns). (Center) LR sweep over g…
Figure 7: With Magnitude–Direction Decoupling the optimal matrix learning rate stays essentially fixed as the model grows, so it can be tuned once on a small model and reused. Matrix-LR sweeps on dense models (from the 181M base) scaled across width (Left), depth (Center), and width and depth jointly (Right). In each panel the optimal matrix LR stays roughly fixed across model sizes.
Figure 8: The LR transfer of …
Figure 9: On the sphere, the relative weight update follows the learning-rate schedule directly, so the shape of the decay matters more than it does under weight decay. Comparison of a Warmup-Stable-Decay (WSD) schedule against a simple linear decay on the 181M dense model. (Left) LR sweep comparing WSD and linear decay. (Center) The corresponding loss curves. (Right) The relative weight update for the attention que…
Figure 10: Decoupling and fixing weight norms removes the need for warmup, and dropping it improves the loss, both from the beginning and when resuming training from a checkpoint. (Left) LR sweep with and without warmup on the 181M model, showing dropping warmup improves the loss. (Center) Loss curves for re-warming runs on a 150M model. (Right) The gradient norm over the same re-warming runs, confirming training st…
Figure 11: The gains from decoupling persist at scale: on large MoEs, MuonMD beats well-tuned Muon and AdamW and reaches AdamW’s loss with roughly 2× less compute. DeepSeekMoE-style models (270M–810M active parameters). (Left) LR sweep of the base optimizers on the 270M-active base. (Center) Scaling law of loss vs. compute (non-embedding active-parameter FLOPs), where the improvement holds across a wide range of com…
Figure 12: The matrix learning rate is the most important hyperparameter worth sweeping per method: the loss is broad in the other groups and essentially flat in the gains LR over more than an order of magnitude. Learning-rate sweeps of the parameter groups held fixed in the main text (181M model, 25B tokens). (Left) AdamW base LR, the shared LR of all Adam-managed groups (embeddings, output layer, gains), with th…
Figure 13: For reference, the plain AdamW and Muon baselines under joint width-and-depth scaling, where at the largest model MuonMD reaches a lower optimum than both. Sweeps changing only the matrix LR (no magnitude–direction decoupling). (Left) AdamW across joint width-and-depth scaling. (Center) The same sweep with Muon. (Right) A head-to-head sweep of all three optimizers at the largest joint-scaled model (646M).…
Figure 14: All common Muon scale-factor conventions clearly beat AdamW, but the choice still matters: the unit-RMS-norm √(d_out/d_in) and shape-scaling max(1, √(d_out/d_in)) factors are best and nearly identical, while the RMS-matching factor is noticeably worse. Sweeps of the matrix LR for each shape-dependent factor that rescales Muon’s orthogonalized update, on the 181M model (25B tokens). (Left) Plain Muon across the s…
Figure 15: nGPT is a distinct architecture, not just spherical optimization, and applying our reparameterization on top of it outperforms nGPT as proposed. Comparison on the 181M model (25B tokens, 50k iters), matched on parameter count and budget. (Top) LR sweeps of the final loss and (Bottom) the corresponding loss curves, for the (Left) Adam family and (Right) Muon family. As proposed by Loshchilov et al. (2025),…
Figure 16: Both block-output scales transfer the matrix LR across depth; the softer α = 1/√(2L) gives a small but consistent loss improvement and keeps per-layer activations better controlled around 1. Comparison of α = 1/L against α = 1/√(2L). (Left) LR sweep of the final loss at depths 12–30 (181M–252M parameters), with α = 1/L (solid) and α = 1/√(2L) (dashed); the optimal matrix LR stays roughly fixed across depth …
Figure 17: A higher-rank gain improves over no gains but does not beat the simpler row-and-column gain, which remains our default. MuonMD on the 181M model (25B tokens), comparing the spherical baseline without gains, our default per-row/per-column gain γ_row + γ_col, and a rank-k gain matrix Γ = 1 + AB⊤ with k = 4. (Left) LR sweep of the final loss. (Right) The corresponding loss curves.
Figure 18: Across all three projections, the learned gains spread over more than an order of magnitude during training. Gain dynamics at layer 6 of the 181M dense model, for the four parameterizations of …
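The Figure 9 and Figure 10 captions turn on the shape of the learning-rate schedule: Warmup-Stable-Decay (WSD) versus plain linear decay, with and without warmup. The sketch below shows one common way to write the two schedules. It assumes a linear cooldown to zero in the WSD decay phase, which may differ from the shape used in the paper, and the function and parameter names are hypothetical.

```python
def wsd_lr(step, total_steps, peak_lr, warmup_steps=0, decay_steps=0):
    """Warmup-Stable-Decay: optional linear warmup, flat plateau, linear cooldown."""
    decay_start = total_steps - decay_steps
    if step < warmup_steps:
        # Linear ramp up to the peak; skipped entirely when warmup_steps == 0.
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:
        return peak_lr
    # Assumed linear cooldown to zero over the last `decay_steps` steps.
    return peak_lr * (total_steps - step) / decay_steps

def linear_decay_lr(step, total_steps, peak_lr):
    """Linear decay from the peak at step 0 to zero at the final step."""
    return peak_lr * (total_steps - step) / total_steps

# Hypothetical run length; no warmup, matching the no-warmup setting.
total = 100
schedule_wsd = [wsd_lr(s, total, 1.0, warmup_steps=0, decay_steps=20) for s in range(total)]
schedule_lin = [linear_decay_lr(s, total, 1.0) for s in range(total)]
```

If, as the Figure 9 caption states, the relative weight update on the sphere follows the schedule directly, then these two curves are also, up to a constant, the per-step rotation budgets of the two runs.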
Abstract

Modern neural network training relies on optimizers such as Adam and Muon which act on each weight matrix as a single object. Yet every weight matrix carries two distinct quantities, a magnitude and a direction, and all optimizers stepping in the matrix as a whole couple their dynamics: the directional change from an update depends on the current magnitude, while the magnitude drifts as a byproduct of learning the direction, so neither is governed directly by the learning rate. Typical training therefore leans on surrounding recipes such as weight decay and warmup to keep learning stable at scale, though these regulate the coupling only indirectly; other recent methods instead constrain the weight to a fixed-norm sphere, but add no learnable magnitude, leaving scale control to normalization layers alone. We propose Magnitude-Direction (MD) Decoupling, an optimizer modification that factorizes each weight into a fixed-norm direction on a hypersphere and learnable per-row and per-column magnitude gains, updated at separate learning rates, all while the model still sees a single fused weight tensor. The method is agnostic to the base optimizer and removes the need for weight decay and warmup. Across both Adam and Muon, MD Decoupling improves on well-tuned baselines, transfers the optimal LR across model width without retuning, and continues to help at scale on large Mixture-of-Experts (MoE) models. Treating magnitude and direction as separately controlled quantities thus yields more predictable training dynamics and a simple, broadly applicable improvement to modern optimizers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript proposes Magnitude-Direction (MD) Decoupling, a modification to optimizers such as Adam and Muon. Each weight matrix is factorized into a fixed-norm direction on the hypersphere plus learnable per-row and per-column magnitude gains that receive separate learning rates; the model still sees a single fused weight tensor. The central claims are that this yields better performance than well-tuned baselines, eliminates the need for weight decay and warmup, transfers the optimal learning rate across model widths without retuning, and continues to help at the scale of large Mixture-of-Experts models.

Significance. If the reported gains survive controls that isolate the decoupling effect from the added per-row/column parameters and optimizer states, the approach would constitute a simple, optimizer-agnostic change that makes training dynamics more directly controllable and reduces reliance on indirect stabilization recipes. The empirical demonstration on both Adam and Muon plus scaling to MoE models would be a practical contribution to the optimizer literature.

major comments (3)
  1. [Abstract and Method] Abstract and Method description: MD Decoupling introduces additional trainable per-row and per-column magnitude parameters (and corresponding optimizer states) that are absent from the Adam/Muon baselines. The manuscript must clarify whether baselines were augmented with an equivalent number of extra parameters or whether an ablation demonstrates that the gains persist when parameter count is matched; otherwise the headline improvements, LR transfer, and MoE-scale benefits cannot be attributed to decoupling rather than increased capacity.
  2. [Abstract] Abstract: The claim that the method 'removes the need for weight decay and warmup' is load-bearing for the central thesis yet is stated without reference to the specific experimental controls (e.g., training curves or tables) that establish stable convergence in their absence. Explicit comparison of runs with and without these components under MD Decoupling versus baselines is required.
  3. [Experiments] Experiments section: The statements that optimal LR transfers across widths and that benefits continue at MoE scale rest on empirical results whose robustness (multiple seeds, error bars, or statistical tests) is not visible in the provided description. Tables reporting these quantities are needed to substantiate the transferability and scaling claims.
minor comments (1)
  1. [Method] The factorization into direction and magnitude gains would benefit from an explicit equation (e.g., W = D ⊙ M_row ⊙ M_col or equivalent) placed in the main text rather than left implicit.
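The size of the parameter overhead raised in major comment 1 is simple arithmetic: a per-row and a per-column gain add d_out + d_in scalars to a matrix with d_out × d_in entries. The sketch below computes that ratio for a hypothetical square layer; the shape is an illustration, not a layer size taken from the paper, and the count ignores any extra optimizer state.

```python
def gain_overhead(d_out, d_in):
    """Return (number of gain scalars, their fraction of the matrix parameters)."""
    matrix_params = d_out * d_in
    gain_params = d_out + d_in      # one gain per row plus one per column
    return gain_params, gain_params / matrix_params

# Hypothetical 1024 x 1024 projection.
extra, fraction = gain_overhead(1024, 1024)
print(f"{extra} extra parameters, {fraction:.4%} of the matrix")
```

A small ratio does not by itself answer the referee's concern, since a few well-placed scale parameters can matter more than their count suggests, but it indicates how little capacity a parameter-matched control would need to add.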

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below and will revise the manuscript accordingly to strengthen the presentation and controls.

  1. Referee: [Abstract and Method] Abstract and Method description: MD Decoupling introduces additional trainable per-row and per-column magnitude parameters (and corresponding optimizer states) that are absent from the Adam/Muon baselines. The manuscript must clarify whether baselines were augmented with an equivalent number of extra parameters or whether an ablation demonstrates that the gains persist when parameter count is matched; otherwise the headline improvements, LR transfer, and MoE-scale benefits cannot be attributed to decoupling rather than increased capacity.

    Authors: We agree that isolating the decoupling mechanism from the effect of added parameters is essential. The original experiments compared against standard Adam/Muon without extra parameters. In the revision we will add a controlled ablation in which the baselines are augmented with an equivalent number of extra learnable per-row/column parameters (updated at the same rate as the weights) and demonstrate that the reported gains, LR transfer, and MoE benefits remain attributable to the independent magnitude/direction updates and fixed-norm constraint. revision: yes

  2. Referee: [Abstract] Abstract: The claim that the method 'removes the need for weight decay and warmup' is load-bearing for the central thesis yet is stated without reference to the specific experimental controls (e.g., training curves or tables) that establish stable convergence in their absence. Explicit comparison of runs with and without these components under MD Decoupling versus baselines is required.

    Authors: We will revise the abstract to reference the supporting experiments and add explicit side-by-side comparisons (training curves and summary tables) in the Experiments section. These will show stable convergence of MD Decoupling without weight decay or warmup, contrasted with the divergence or degraded performance of the baselines under identical conditions, thereby providing the requested controls. revision: yes

  3. Referee: [Experiments] Experiments section: The statements that optimal LR transfers across widths and that benefits continue at MoE scale rest on empirical results whose robustness (multiple seeds, error bars, or statistical tests) is not visible in the provided description. Tables reporting these quantities are needed to substantiate the transferability and scaling claims.

    Authors: We acknowledge the importance of statistical robustness. The revised manuscript will include expanded tables for the width-transfer and MoE-scale experiments that report results across multiple random seeds, with means, standard deviations or error bars, and, where appropriate, statistical significance tests. This will directly substantiate the transferability and scaling claims. revision: yes
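The per-seed reporting promised in the third response could be summarized along the following lines. This is a generic sketch using the Python standard library; the input values are placeholders chosen for easy checking, not losses from the paper, and the function name is hypothetical.

```python
import statistics

def summarize(values):
    """Mean, sample standard deviation, and standard error across seeds."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)        # sample std, divides by n - 1
    sem = std / len(values) ** 0.5        # standard error of the mean
    return mean, std, sem

# Placeholder final losses for three seeds of one configuration.
placeholder_losses = [1.0, 2.0, 3.0]
mean, std, sem = summarize(placeholder_losses)
print(f"mean={mean:.3f} std={std:.3f} sem={sem:.3f}")
```

Reporting the standard error alongside each point of a learning-rate sweep would show whether the claimed optimum is separated from its neighbors by more than seed noise.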

Circularity Check

0 steps flagged

No circularity: empirical algorithmic proposal validated externally


The paper proposes an optimizer modification (MD Decoupling) that factorizes weights into fixed-norm directions plus learnable per-row/column magnitudes updated at separate rates. All claims of improvement, LR transfer, and scaling benefits are presented as outcomes of training runs on external benchmarks, not as mathematical predictions derived from the method itself. No equations, fitted parameters renamed as predictions, self-citations as load-bearing uniqueness theorems, or ansatzes smuggled via prior work appear in the provided text. The central contribution is an algorithmic change whose value is measured by independent experiments rather than internal redefinition or construction.

Axiom & Free-Parameter Ledger

2 free parameters · 0 axioms · 0 invented entities

The method introduces no new physical or mathematical axioms; it relies on standard assumptions that gradient descent on matrix factorizations remains stable when magnitude and direction are updated separately.

free parameters (2)
  • separate learning rate for magnitude gains
    Chosen per experiment; controls how fast row/column scales adapt independently of direction.
  • per-row and per-column magnitude initialization
    Initial values for the learnable magnitude scalars; not specified in abstract.
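One way to make the initialization explicit is to store each gain as an unconstrained raw value passed through a softplus, the parameterization the Figure 6 caption reports as having a minimal edge. The sketch below inverts the softplus so that the gain starts at a chosen value. The starting value of 1.0 and the function names are assumptions for illustration; the abstract does not specify either.

```python
import numpy as np

def softplus(x):
    """Numerically stable log(1 + exp(x)); the output is always positive."""
    return np.logaddexp(0.0, x)

def inverse_softplus(y):
    """Raw value x such that softplus(x) == y, for y > 0."""
    # log(exp(y) - 1) rewritten as y + log(1 - exp(-y)) for stability.
    return y + np.log(-np.expm1(-y))

d_out = 4
target_gain = 1.0                                  # assumed initial gain
raw_row = inverse_softplus(np.full(d_out, target_gain))
gain_row = softplus(raw_row)                       # what multiplies the direction
```

Training then updates `raw_row` at the gains learning rate, and the softplus keeps the effective gain positive without clipping.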

pith-pipeline@v0.9.1-grok · 5817 in / 1257 out tokens · 23595 ms · 2026-06-25T20:03:15.143771+00:00 · methodology


Reference graph

Works this paper leans on

128 extracted references · 1 canonical work pages

  1. [3]

    Project Apertus, Alejandro Hernández-Cano, Alexander Hägele, Allen Hao Huang, Angelika Romanou, Antoni-Joan Solergibert i Llaquet, Barna Pásztor, Bettina Messmer, Dhia Garbaya, Eduard Frank Ďurech, Ido Hakimi, Juan García Giraldo, Mete Ismayilzada, Negar Foroutan, Skander Moalla, Tiancheng Chen, Vinko Sabolčec, Yixuan Xu, Michael Aerni, Badr Al...

  2. [6]

    Power lines: Scaling laws for weight decay and batch size in LLM pre-training

    Shane Bergsma, Nolan Dey, Gurpreet Gosal, Gavia Gray, Daria Soboleva, and Joel Hestness. Power lines: Scaling laws for weight decay and batch size in LLM pre-training. arXiv preprint arXiv:2505.13738, 2025a. URL https://arxiv.org/abs/2505.13738

  3. [7]

    Scaling with collapse: Efficient and predictable training of LLM families

    Shane Bergsma, Bin Claire Zhang, Nolan Dey, Shaheer Muhammad, Gurpreet Gosal, and Joel Hestness. Scaling with collapse: Efficient and predictable training of LLM families. arXiv preprint arXiv:2509.25087, 2025b. URL https://arxiv.org/abs/2509.25087

  4. [8]

    Modular manifolds

    Jeremy Bernstein. Modular manifolds. Thinking Machines Lab: Connectionism, 2025. doi:10.64434/tml.20250926. https://thinkingmachines.ai/blog/modular-manifolds/

  5. [10]

    Damai Dai, Chengqi Deng, Chenggang Zhao, R. X. Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Y. Wu, et al. Deepseekmoe: Towards ultimate expert specialization in mixture-of-experts language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL), pp.\ 1280--1297, 2024. URL https://arxiv.o...

  6. [14]

    Don't be lazy: CompleteP enables compute-efficient deep transformers

    Nolan Dey, Bin Claire Zhang, Lorenzo Noci, Mufan Li, Blake Bordelon, Shane Bergsma, Cengiz Pehlevan, Boris Hanin, and Joel Hestness. Don't be lazy: CompleteP enables compute-efficient deep transformers. In Advances in Neural Information Processing Systems (NeurIPS), 2025. URL https://arxiv.org/abs/2505.01618

  7. [15]

    Improving our llm pretraining efficiency

    Larry Dial. Improving our llm pretraining efficiency. https://www.openathena.ai/blog/pretraining-speedup/, jun 2026. Open Athena Blog

  8. [17]

    Training dynamics of the cooldown stage in warmup-stable-decay learning rate scheduler

    Aleksandr Dremov, Alexander Hägele, Atli Kosson, and Martin Jaggi. Training dynamics of the cooldown stage in warmup-stable-decay learning rate scheduler. Transactions on Machine Learning Research, 2025. ISSN 2835-8856. URL https://openreview.net/forum?id=ZnSYEcZod3

  9. [28]

    Adamp: Slowing down the slowdown for momentum optimizers on scale-invariant weights

    Byeongho Heo, Sanghyuk Chun, Seong Joon Oh, Dongyoon Han, Sangdoo Yun, Gyuwan Kim, Youngjung Uh, and Jung-Woo Ha. Adamp: Slowing down the slowdown for momentum optimizers on scale-invariant weights. In International Conference on Learning Representations (ICLR), 2021. URL https://arxiv.org/abs/2006.08217

  10. [31]

    LoRA: Low-rank adaptation of large language models

    Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations (ICLR), 2022. URL https://arxiv.org/abs/2106.09685

  11. [32]

    MiniCPM: Unveiling the potential of small language models with scalable training strategies

    Shengding Hu, Yuge Tu, Xu Han, Chaoqun He, Ganqu Cui, Xiang Long, et al. MiniCPM: Unveiling the potential of small language models with scalable training strategies. arXiv preprint arXiv:2404.06395, 2024. URL https://arxiv.org/abs/2404.06395

  12. [34]

    Muon: An optimizer for hidden layers in neural networks

    Keller Jordan, Yuchen Jin, Vlado Boza, Jiacheng You, Franz Cesista, Laker Newhouse, and Jeremy Bernstein. Muon: An optimizer for hidden layers in neural networks. 2024a. URL https://kellerjordan.github.io/posts/muon/

  13. [35]

    modded-nanogpt: Speedrunning the nanogpt baseline

    Keller Jordan et al. modded-nanogpt: Speedrunning the nanogpt baseline. 2024b. URL https://github.com/KellerJordan/modded-nanogpt

  14. [42]

    On Balanced Representation Learning in Neural Networks

    Atli Kosson. On Balanced Representation Learning in Neural Networks. PhD thesis, École Polytechnique Fédérale de Lausanne (EPFL), 2026. URL https://infoscience.epfl.ch/entities/publication/2766967a-1920-4f95-bacf-60ecf7a40eaf

  15. [54]

    Normalization and effective learning rates in reinforcement learning

    Clare Lyle, Zeyu Zheng, Khimya Khetarpal, James Martens, Hado van Hasselt, Razvan Pascanu, and Will Dabney. Normalization and effective learning rates in reinforcement learning. Advances in Neural Information Processing Systems (NeurIPS), 2024. URL https://proceedings.neurips.cc/paper_files/paper/2024/hash/c04d37be05ba74419d2d5705972a9d64-Abstract-Conference.html

  16. [55]

    On the SDEs and scaling rules for adaptive gradient algorithms

    Sadhika Malladi, Kaifeng Lyu, Abhishek Panigrahi, and Sanjeev Arora. On the SDEs and scaling rules for adaptive gradient algorithms. In Advances in Neural Information Processing Systems (NeurIPS), volume 35, pp. 7697-7711, 2022. URL https://arxiv.org/abs/2205.10287

  17. [56]

    Enhancing multilingual LLM pretraining with model-based data selection

    Bettina Messmer, Vinko Sabolčec, and Martin Jaggi. Enhancing multilingual LLM pretraining with model-based data selection. In Jonathan Gerber, Mark Cieliebak, Don Tuggener, and Manuela Hürlimann (eds.), Proceedings of the 10th edition of the Swiss Text Analytics Conference, pp. 31-56, Winterthur, Switzerland, May 2025. Association for Computationa...

  18. [63]

    Training deep learning models with norm-constrained lmos

    Thomas Pethick, Wanyun Xie, Kimon Antonakopoulos, Zhenyu Zhu, Antonio Silveti-Falls, and Volkan Cevher. Training deep learning models with norm-constrained lmos. In International Conference on Machine Learning, pp.\ 49069--49104. PMLR, 2025

  19. [65]

    Zero: Memory optimizations toward training trillion parameter models, 2020

    Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. Zero: Memory optimizations toward training trillion parameter models, 2020. URL https://arxiv.org/abs/1910.02054

  20. [68]

    The surprising agreement between convex optimization theory and learning-rate scheduling for large model training

    Fabian Schaipp, Alexander Hägele, Adrien Taylor, Umut Simsekli, and Francis Bach. The surprising agreement between convex optimization theory and learning-rate scheduling for large model training. In International Conference on Machine Learning, pp. 53267-53294. PMLR, 2025

  21. [72]

    Megatron-LM: Training multi-billion parameter language models using model parallelism

    Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-LM: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019. URL https://arxiv.org/abs/1909.08053

  22. [75]

    L2 regularization versus batch and weight normalization

    Twan van Laarhoven. L2 regularization versus batch and weight normalization. arXiv preprint arXiv:1706.05350, 2017. URL https://arxiv.org/abs/1706.05350

  23. [77]

    SOAP: Improving and stabilizing shampoo using adam

    Nikhil Vyas, Depen Morwani, Rosie Zhao, Mujin Kwun, Itai Shapira, David Brandfonbrener, Lucas Janson, and Sham Kakade. SOAP: Improving and stabilizing shampoo using adam. arXiv preprint arXiv:2409.11321, 2024. URL https://arxiv.org/abs/2409.11321

  24. [78]

    Spherical motion dynamics: Learning dynamics of normalized neural network using SGD and weight decay

    Ruosi Wan, Zhanxing Zhu, Xiangyu Zhang, and Jian Sun. Spherical motion dynamics: Learning dynamics of normalized neural network using SGD and weight decay. In Advances in Neural Information Processing Systems (NeurIPS), 2021. URL https://arxiv.org/abs/2006.08419

  25. [81]

    How to set AdamW's weight decay as you scale model and dataset size

    Xi Wang and Laurence Aitchison. How to set AdamW's weight decay as you scale model and dataset size. In International Conference on Machine Learning (ICML), 2025. URL https://arxiv.org/abs/2405.13698

  26. [83]

    Fantastic pretraining optimizers and where to find them 2.1: Hyperball optimization

    Kaiyue Wen, Xingyu Dang, Kaifeng Lyu, Tengyu Ma, and Percy Liang. Fantastic pretraining optimizers and where to find them 2.1: Hyperball optimization. 2026. URL https://tinyurl.com/muonh

  27. [89]

    Tensor programs VI: Feature learning in infinite-depth neural networks

    Greg Yang, Dingli Yu, Chen Zhu, and Soufiane Hayou. Tensor programs VI: Feature learning in infinite-depth neural networks. In International Conference on Learning Representations (ICLR), 2024. URL https://arxiv.org/abs/2310.02244

  28. [94]

    Language Modeling with Hyperspherical Flows

    Language Modeling with Hyperspherical Flows. arXiv preprint arXiv:2605.11125

  29. [95]

    Rotational Equilibrium: How Weight Decay Balances Learning Across Neural Networks

    Rotational Equilibrium: How Weight Decay Balances Learning Across Neural Networks. International Conference on Machine Learning (ICML). arXiv:2305.17212

  30. [96]

    Kimi K2: Open Agentic Intelligence

    Kimi K2: Open Agentic Intelligence. arXiv preprint arXiv:2507.20534

  31. [97]

    CogView: Mastering Text-to-Image Generation via Transformers

    CogView: Mastering Text-to-Image Generation via Transformers. Advances in Neural Information Processing Systems (NeurIPS). arXiv:2105.13290

  32. [98]

    Peri-LN: Revisiting Normalization Layer in the Transformer Architecture

    Peri-LN: Revisiting Normalization Layer in the Transformer Architecture. International Conference on Machine Learning (ICML). arXiv:2502.02732

  33. [99]

    Analyzing and Improving the Training Dynamics of Diffusion Models

    Analyzing and Improving the Training Dynamics of Diffusion Models. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). arXiv:2312.02696

  34. [100]

    nGPT: Normalized Transformer with Representation Learning on the Hypersphere

    nGPT: Normalized Transformer with Representation Learning on the Hypersphere. International Conference on Learning Representations (ICLR). arXiv:2410.01131

  35. [101]

    Fantastic Pretraining Optimizers and Where to Find Them 2.1: Hyperball Optimization

    Fantastic Pretraining Optimizers and Where to Find Them 2.1: Hyperball Optimization. 2026

  36. [102]

    Rethinking Language Model Scaling under Transferable Hypersphere Optimization

    Rethinking Language Model Scaling under Transferable Hypersphere Optimization. arXiv preprint arXiv:2603.28743

  37. [103]

    Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations

    Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations. Advances in Neural Information Processing Systems (NeurIPS). arXiv:2405.18392

  38. [104]

    The Surprising Agreement Between Convex Optimization Theory and Learning-Rate Scheduling for Large Model Training

    The Surprising Agreement Between Convex Optimization Theory and Learning-Rate Scheduling for Large Model Training. International Conference on Machine Learning, 2025

  39. [105]

    Training Dynamics of the Cooldown Stage in Warmup-Stable-Decay Learning Rate Scheduler

    Training Dynamics of the Cooldown Stage in Warmup-Stable-Decay Learning Rate Scheduler. Transactions on Machine Learning Research, 2025

  40. [106]

    Completed Hyperparameter Transfer across Modules, Width, Depth, Batch and Duration

    Completed Hyperparameter Transfer across Modules, Width, Depth, Batch and Duration. arXiv preprint arXiv:2512.22382

  41. [107]

    Weight Decay may matter more than muP for Learning Rate Transfer in Practice

    Weight Decay may matter more than muP for Learning Rate Transfer in Practice. International Conference on Learning Representations (ICLR). arXiv:2510.19093

  42. [108]

    Analyzing & Reducing the Need for Learning Rate Warmup in GPT Training

    Analyzing & Reducing the Need for Learning Rate Warmup in GPT Training. Advances in Neural Information Processing Systems (NeurIPS). arXiv:2410.23922

  43. [109]

    Muown: Row-Norm Control for Muon Optimization

    Muown: Row-Norm Control for Muon Optimization. arXiv preprint arXiv:2605.10797

  44. [110]

    Muown Implicitly Performs Angular Step-size Decay

    Muown Implicitly Performs Angular Step-size Decay. arXiv preprint arXiv:2606.23637

  45. [111]

    Learnable Multipliers: Freeing the Scale of Language Model Matrix Layers

    Learnable Multipliers: Freeing the Scale of Language Model Matrix Layers. arXiv preprint arXiv:2601.04890

  46. [112]

    Negligible in Size, Significant in Effect: On Scale Vectors in Large Language Models

    Negligible in Size, Significant in Effect: On Scale Vectors in Large Language Models. arXiv preprint arXiv:2605.26895

  47. [113]

    Advances in Neural Information Processing Systems (NeurIPS) , year=

    Learning in Compact Spaces with Approximately Normalized Transformer , author=. Advances in Neural Information Processing Systems (NeurIPS) , year=. 2505.22014 , archivePrefix=

  48. [114]

    arXiv preprint arXiv:2511.18890 , year=

    Nemotron-Flash: Towards Latency-Optimal Hybrid Small Language Models , author=. arXiv preprint arXiv:2511.18890 , year=. 2511.18890 , archivePrefix=

  49. [115]

    arXiv preprint arXiv:2601.23000 , year=

    Mano: Restriking Manifold Optimization for LLM Training , author=. arXiv preprint arXiv:2601.23000 , year=. 2601.23000 , archivePrefix=

  50. [116]

    arXiv preprint arXiv:2601.08393 , year=

    Controlled LLM Training on Spectral Sphere , author=. arXiv preprint arXiv:2601.08393 , year=. 2601.08393 , archivePrefix=

  51. [117]

    arXiv preprint arXiv:2409.20325 , year=

    Old Optimizer, New Norm: An Anthology , author=. arXiv preprint arXiv:2409.20325 , year=. 2409.20325 , archivePrefix=

  52. [118]

    International Conference on Machine Learning , pages=

    Training Deep Learning Models with Norm-Constrained LMOs , author=. International Conference on Machine Learning , pages=. 2025 , organization=

  53. [119]

    2026 , month =

    Dial, Larry , title =. 2026 , month =

  54. [120]

    Thinking Machines Lab: Connectionism , year =

    Jeremy Bernstein , title =. Thinking Machines Lab: Connectionism , year =

  55. [121]

    arXiv preprint arXiv:2507.13338 , year=

    Training Transformers with Enforced Lipschitz Constants , author=. arXiv preprint arXiv:2507.13338 , year=. 2507.13338 , archivePrefix=

  56. [122]

    arXiv preprint arXiv:2603.09952 , year=

    On the Width Scaling of Neural Optimizers Under Matrix Operator Norms I: Row/Column Normalization and Hyperparameter Transfer , author=. arXiv preprint arXiv:2603.09952 , year=. 2603.09952 , archivePrefix=

  57. [123]

    arXiv preprint arXiv:2503.17500 , year=

    Variance Control via Weight Rescaling in LLM Pre-training , author=. arXiv preprint arXiv:2503.17500 , year=. 2503.17500 , archivePrefix=

  58. [124]

    International Conference on Machine Learning (ICML) , year=

    Learning by Turning: Neural Architecture Aware Optimisation , author=. International Conference on Machine Learning (ICML) , year=. 2102.07227 , archivePrefix=

  59. [125]

    arXiv preprint arXiv:1708.03888 , year=

    Large Batch Training of Convolutional Networks , author=. arXiv preprint arXiv:1708.03888 , year=. 1708.03888 , archivePrefix=

  60. [126]

    International Conference on Learning Representations (ICLR) , year=

    Large Batch Optimization for Deep Learning: Training BERT in 76 minutes , author=. International Conference on Learning Representations (ICLR) , year=. 1904.00962 , archivePrefix=

  61. [127]

    IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=

    Scaling Vision Transformers , author=. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , year=. 2106.04560 , archivePrefix=

  62. [128]

    Advances in Neural Information Processing Systems (NeurIPS) , year=

    Normalization and effective learning rates in reinforcement learning , author=. Advances in Neural Information Processing Systems (NeurIPS) , year=

  63. [129]

    International Conference on Learning Representations (ICLR) , year=

    AdamP: Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights , author=. International Conference on Learning Representations (ICLR) , year=

  64. [130]

    Advances in Neural Information Processing Systems (NeurIPS) , year=

    Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks , author=. Advances in Neural Information Processing Systems (NeurIPS) , year=. 1602.07868 , archivePrefix=

  65. [131]

    IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , year=

    Decoupled Networks , author=. IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , year=. 1804.08071 , archivePrefix=

  66. [132]

    arXiv preprint arXiv:1903.10520 , year=

    Micro-Batch Training with Batch-Channel Normalization and Weight Standardization , author=. arXiv preprint arXiv:1903.10520 , year=. 1903.10520 , archivePrefix=

  67. [133]

    European Conference on Computer Vision (ECCV) , year=

    Big Transfer (BiT): General Visual Representation Learning , author=. European Conference on Computer Vision (ECCV) , year=. 1912.11370 , archivePrefix=

  68. [134]

    International Conference on Learning Representations (ICLR) , year=

    Spectral Normalization for Generative Adversarial Networks , author=. International Conference on Learning Representations (ICLR) , year=. 1802.05957 , archivePrefix=

  69. [135]

    International Conference on Learning Representations (ICLR) , year=

    Artificial Kuramoto Oscillatory Neurons , author=. International Conference on Learning Representations (ICLR) , year=. 2410.13821 , archivePrefix=

  70. [136]

    2026 , url=

    On Balanced Representation Learning in Neural Networks , author=. 2026 , url=

  71. [137]

    2024 , howpublished=

    Muon: An optimizer for hidden layers in neural networks , author=. 2024 , howpublished=

  72. [138]

    2024 , howpublished=

    modded-nanogpt: Speedrunning the NanoGPT baseline , author=. 2024 , howpublished=

  73. [139]

    2026 , booktitle=

    Project Apertus and Alejandro Hern\'andez-Cano and Alexander H\"agele and Allen Hao Huang and Angelika Romanou and Antoni-Joan Solergibert i Llaquet and Barna P\'asztor and Bettina Messmer and Dhia Garbaya and Eduard Frank. 2026 , booktitle=

  74. [140]

    International Conference on Learning Representations (ICLR) , year=

    Adam: A Method for Stochastic Optimization , author=. International Conference on Learning Representations (ICLR) , year=. 1412.6980 , archivePrefix=

  75. [141]

    International Conference on Learning Representations (ICLR) , year=

    Decoupled Weight Decay Regularization , author=. International Conference on Learning Representations (ICLR) , year=. 1711.05101 , archivePrefix=

  76. [142]

    2017 , eprint=

    Loshchilov, Ilya and Hutter, Frank , booktitle=. 2017 , eprint=

  77. [143]

    arXiv preprint arXiv:1607.06450 , year=

    Layer Normalization , author=. arXiv preprint arXiv:1607.06450 , year=. 1607.06450 , archivePrefix=

  78. [144]

    Advances in Neural Information Processing Systems (NeurIPS) , year=

    Root Mean Square Layer Normalization , author=. Advances in Neural Information Processing Systems (NeurIPS) , year=. 1910.07467 , archivePrefix=

  79. [145]

    Advances in Neural Information Processing Systems (NeurIPS) , year=

    Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer , author=. Advances in Neural Information Processing Systems (NeurIPS) , year=. 2203.03466 , archivePrefix=

  80. [146]

    Advances in Neural Information Processing Systems (NeurIPS) , year=

    Training Compute-Optimal Large Language Models , author=. Advances in Neural Information Processing Systems (NeurIPS) , year=. 2203.15556 , archivePrefix=
