pith. machine review for the scientific record.

arxiv: 2605.04418 · v1 · submitted 2026-05-06 · 💻 cs.LG · cs.AI · math.OC

Recognition: 3 theorem links · Lean Theorem

Demystifying Manifold Constraints in LLM Pre-training

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 17:40 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · math.OC
keywords manifold constraints · LLM pre-training · Riemannian optimization · normalization layers · weight decay · activation scales · rotational equilibrium · constrained optimization

The pith

Manifold constraints on LLM weights independently bound activation scales and enforce rotational equilibrium, replacing normalization and weight decay.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper establishes that explicit manifold constraints on model weights can take over the stabilizing roles of common heuristic techniques like normalization layers and weight decay during large language model pre-training. Through the introduction of the Msign-Aligned Constrained Riemannian Optimizer (MACRO), a single-loop framework with convergence guarantees, the work separates these geometric restrictions from interacting mechanisms and shows they limit forward activation scales while maintaining rotational balance in the weights. Empirical results on large-scale architectures confirm that this approach delivers competitive performance without the usual heuristics. A sympathetic reader would care because it provides a more principled, geometry-based account of training stability and a path toward simpler optimization methods with theoretical backing.

Core claim

By introducing the Msign-Aligned Constrained Riemannian Optimizer (MACRO) as a provably convergent single-loop framework, the paper disentangles explicit manifold constraints from heuristics such as RMS normalization and decoupled weight decay. Theoretical analyses demonstrate that these constraints independently bound forward activation scales and enforce stable rotational equilibrium. Comprehensive empirical evaluations on large-scale LLM architectures show that MACRO achieves highly competitive performance while rigorously preserving the guarantees of exact Riemannian optimization.

What carries the argument

Manifold constraints on the weights, enforced by the Msign-Aligned Constrained Riemannian Optimizer (MACRO) in a single loop: the constraints restrict the weights geometrically and separate their regularization effects from normalization and weight decay.
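
A toy sketch of that single-loop pattern, in the spirit of Riemannian optimization on a Frobenius sphere. This is not the paper's MACRO (its Msign alignment, momentum handling, and convergence analysis are omitted); every name and hyperparameter below is illustrative only.

```python
# Minimal, illustrative single-loop update on the Frobenius sphere {W : ||W||_F = c}.
# Pattern: project the gradient onto the tangent space, take a step, retract back
# onto the manifold - all inside one loop, with no normalization layer or weight decay.
import torch

def tangent_project(grad: torch.Tensor, W: torch.Tensor) -> torch.Tensor:
    """Remove the radial component so the step stays tangent to the sphere."""
    radial = (grad * W).sum() / (W * W).sum()
    return grad - radial * W

def retract(W: torch.Tensor, radius: float) -> torch.Tensor:
    """Rescale W back onto the Frobenius sphere of the given radius."""
    return W * (radius / W.norm())

torch.manual_seed(0)
d_in, d_out, T = 64, 32, 256
X = torch.randn(T, d_in)
Y_target = X @ torch.randn(d_out, d_in).T

radius = 8.0                                   # illustrative constraint radius
W = retract(torch.randn(d_out, d_in), radius)
lr = 1e-2

for step in range(200):
    W.requires_grad_(True)
    loss = ((X @ W.T - Y_target) ** 2).mean()
    (grad,) = torch.autograd.grad(loss, W)
    with torch.no_grad():
        W = retract(W - lr * tangent_project(grad, W), radius)

final_loss = ((X @ W.T - Y_target) ** 2).mean().item()
print(f"final loss {final_loss:.4f}, ||W||_F = {W.norm().item():.4f} (target {radius:.4f})")
```

The point of the sketch is only that the constraint is enforced exactly at every step by a cheap projection and retraction, which is the single-loop property the pith attributes to MACRO.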

If this is right

  • Manifold constraints bound forward activation scales independently of explicit normalization layers.
  • They enforce stable rotational equilibrium in the weights, subsuming the role of weight decay.
  • MACRO delivers competitive performance on large LLMs while retaining exact Riemannian convergence guarantees.
  • Weight regularization effects can be isolated from interacting mechanisms like normalization.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training code could drop separate normalization and decay modules if the geometric constraints prove sufficient across tasks.
  • The rotational equilibrium view may help explain stability patterns in other neural architectures beyond language models.
  • Similar single-loop manifold enforcement could be tested for efficiency gains when scaling to even larger parameter counts.

Load-bearing premise

That manifold constraints can be enforced efficiently in a single-loop manner on large-scale LLM architectures without introducing new instabilities or degrading performance relative to standard heuristics.

What would settle it

Running MACRO on a full-scale LLM pre-training task and observing either lower final performance than standard normalized training or the emergence of numerical instabilities or divergence.

Figures

Figures reproduced from arXiv: 2605.04418 by Donald Goldfarb, Jiaxiang Li, Kang An, Shiqian Ma.

Figure 1. Train loss for the 120M QWEN3-like model.
Figure 2. Empirical validation of κ. For a linear layer Y = XW⊤, the average output RMS norm is evaluated via the trace formulation E_X[‖vec(Y)‖²_RMS] = (1/D_out) tr(Σ_X W⊤W), where Σ_X = (1/T) X⊤X is the empirical input covariance.
Figure 3. Evolution of the ℓ2 norm of γ during training.
Figure 4. θ_t under the Frobenius sphere. Traditional weight decay requires a transient phase to reach rotational equilibrium [15]; the manifold constraint enforces a rotational state from the very first step.
Figure 5. Rotational angle under the spectral constraint.
Figure 6. Training and validation loss for Muon, MACRO-fro, and MACRO-spec on the 330M QWEN3-like model with all learnable RMSNorm layers removed, at learning rate η_t = 10⁻².
Figure 7. Global gradient norm across training.
Figure 8. Validation loss versus learning rate at four model widths.
Figure 9. Training (left) and validation (right) loss across the full training schedule.
Figure 10. Average per-step tangent-space violation.
Figure 11. Ablation study on geometric constraints.
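
The trace formulation quoted under Figure 2 already suggests how a weight-norm constraint alone can cap the activation scale. Below is a minimal worked bound in that spirit, assuming a Frobenius-sphere constraint ‖W‖_F = c; the radius c and the specific constraint choice are illustrative, and the paper's exact constants and tightness argument are not reproduced here.

```latex
% Illustrative bound (not the paper's lemma): a Frobenius-sphere constraint
% \|W\|_F = c caps the average output RMS norm via the trace formulation
% quoted under Figure 2. The radius c is an assumed, illustrative constant.
\begin{align*}
\Sigma_X &= \tfrac{1}{T} X^\top X, \qquad Y = X W^\top, \\
\mathbb{E}_X\!\bigl[\lVert \operatorname{vec}(Y) \rVert_{\mathrm{RMS}}^{2}\bigr]
  &= \tfrac{1}{D_{\mathrm{out}}} \operatorname{tr}\!\bigl(\Sigma_X W^\top W\bigr)
  \;\le\; \tfrac{1}{D_{\mathrm{out}}}\, \lambda_{\max}(\Sigma_X)\, \lVert W \rVert_F^{2}
  \;=\; \tfrac{c^{2}}{D_{\mathrm{out}}}\, \lambda_{\max}(\Sigma_X).
\end{align*}
```

The bound involves only the constraint radius and the empirical input covariance; no normalization layer appears, which is the independence claim at the center of the pith and of the referee's first major comment.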
Original abstract

The empirical success of large language model (LLM) pre-training relies heavily on heuristic stabilization techniques, such as explicit normalization layers and weight decay. While recent constrained optimization approaches that explicitly restrict weights may improve numerical stability and performance, the mechanism and motivation for adding constraints still remain elusive. This paper systematically demystifies the role of explicit manifold constraints in LLM pre-training. By introducing the Msign-Aligned Constrained Riemannian Optimizer (MACRO), a provably convergent, single-loop optimization framework, our study disentangles weight regularization heuristics from interacting mechanisms like RMS normalization and decoupled weight decay. Theoretical analyses and comprehensive empirical evaluations reveal that manifold constraints independently bound forward activation scales and enforce stable rotational equilibrium, thereby subsuming the roles of these heuristic mechanisms. Evaluations on large-scale LLM architectures demonstrate that MACRO achieves highly competitive performance while rigorously preserving the theoretical guarantees of exact Riemannian optimization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces the Msign-Aligned Constrained Riemannian Optimizer (MACRO), a provably convergent single-loop Riemannian optimization framework for LLM pre-training. It claims that explicit manifold constraints independently bound forward activation scales and enforce stable rotational equilibrium, thereby subsuming the roles of RMS normalization and decoupled weight decay heuristics. Theoretical analyses establish convergence guarantees, while empirical evaluations on large-scale LLM architectures demonstrate competitive performance with preserved exact Riemannian optimization properties.

Significance. If the central claims hold, the work offers substantial significance by providing a principled, constraint-based explanation for stabilization in LLM training that could replace ad-hoc heuristics. The single-loop Riemannian formulation with provable convergence, combined with large-scale empirical validation, strengthens the case for adopting manifold constraints as a more transparent alternative to current practices.

major comments (2)
  1. [Theoretical Analyses] The independence claim in the theoretical analyses—that manifold constraints bound activation scales without relying on normalization—is load-bearing for the subsumption argument. The derivation should explicitly show that the bounding effect persists under the paper's manifold definition even when normalization layers are removed, rather than emerging from the interaction of definitions.
  2. [Empirical Evaluations] Empirical section on large-scale evaluations: the reported competitive performance must be supported by ablations that isolate the manifold constraint's effect on rotational equilibrium (e.g., comparing MACRO variants with and without the constraint while holding other optimizer components fixed). Without this, the claim that constraints subsume weight decay remains partially confounded.
minor comments (2)
  1. [Introduction] The acronym MACRO and its full expansion should be introduced at the first use in the main text body for clarity, even though it appears in the abstract.
  2. [Theoretical Analyses] Notation for the Riemannian projection operator and Msign alignment could be clarified with a brief reminder of their definitions when first used in the convergence proof to aid readers outside Riemannian optimization.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for minor revision. We address each major comment below with point-by-point responses, committing to revisions that strengthen the clarity of our claims without altering the core contributions.

read point-by-point responses
  1. Referee: [Theoretical Analyses] The independence claim in the theoretical analyses—that manifold constraints bound activation scales without relying on normalization—is load-bearing for the subsumption argument. The derivation should explicitly show that the bounding effect persists under the paper's manifold definition even when normalization layers are removed, rather than emerging from the interaction of definitions.

    Authors: We agree that an explicit derivation of independence is essential to substantiate the subsumption argument. In the revised manuscript, we will expand the theoretical section with a dedicated lemma and proof that derives the forward activation scale bounds solely from the Msign-aligned Riemannian projection and manifold constraint, without reference to normalization layers. This will demonstrate that the bounding holds directly under the paper's manifold definition by considering the geometry of the constraint set in isolation. revision: yes

  2. Referee: [Empirical Evaluations] Empirical section on large-scale evaluations: the reported competitive performance must be supported by ablations that isolate the manifold constraint's effect on rotational equilibrium (e.g., comparing MACRO variants with and without the constraint while holding other optimizer components fixed). Without this, the claim that constraints subsume weight decay remains partially confounded.

    Authors: We acknowledge that isolating the rotational equilibrium constraint is necessary to avoid potential confounding in the subsumption claim. In the revised manuscript, we will add targeted ablation experiments on the large-scale LLM setups. These will compare the full MACRO optimizer against a controlled variant in which the rotational equilibrium constraint is relaxed (by modifying only the alignment step while holding the single-loop structure, other optimizer components, and hyperparameters fixed). The results will quantify the isolated contribution to stability and performance, directly supporting the relation to decoupled weight decay. revision: yes
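
To make the comparison proposed in response 2 concrete, here is a minimal sketch of such an ablation on a toy problem, assuming the per-step rotation angle is measured between consecutive weight matrices under the Frobenius inner product. That angle definition, the toy task, and all hyperparameters are illustrative assumptions, not the paper's setup.

```python
# Illustrative ablation: run the same single-loop training twice, with and without
# the Frobenius-sphere constraint, holding every other component fixed, and compare
# the per-step rotation angle of the weights (angle between consecutive W matrices).
import math
import torch

def run(constrained: bool, steps: int = 200, lr: float = 1e-2, radius: float = 8.0) -> float:
    torch.manual_seed(0)
    X = torch.randn(256, 64)
    Y = X @ torch.randn(32, 64).T
    W = torch.randn(32, 64)
    if constrained:
        W = W * (radius / W.norm())            # start on the sphere
    angles = []
    for _ in range(steps):
        W.requires_grad_(True)
        loss = ((X @ W.T - Y) ** 2).mean()
        (g,) = torch.autograd.grad(loss, W)
        with torch.no_grad():
            if constrained:
                g = g - ((g * W).sum() / (W * W).sum()) * W   # tangent projection
            W_new = W - lr * g
            if constrained:
                W_new = W_new * (radius / W_new.norm())        # retraction
            cos = (W * W_new).sum() / (W.norm() * W_new.norm())
            angles.append(math.acos(float(cos.clamp(-1.0, 1.0))))
            W = W_new
    return sum(angles) / len(angles)

print(f"mean rotation angle, constrained:   {run(True):.5f} rad")
print(f"mean rotation angle, unconstrained: {run(False):.5f} rad")
```

Holding the data, seed, learning rate, and loop structure fixed while toggling only the projection and retraction mirrors the controlled variant the referee asks for.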

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper introduces the MACRO framework as a new single-loop Riemannian optimizer and derives its properties (activation scale bounding and rotational equilibrium) from the explicit manifold constraints and convergence proofs within that framework. These results are presented as independent of prior heuristics like RMSNorm and weight decay, supported by both theoretical analysis and large-scale experiments rather than by re-using fitted parameters or self-citations as load-bearing inputs. No equation or claim reduces by construction to a redefinition of its own inputs, and the central disentangling argument rests on the novel constrained formulation rather than circular self-reference.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on standard assumptions from Riemannian optimization and manifold geometry that are not detailed in the abstract; no free parameters or invented entities are explicitly described beyond the new optimizer name.

axioms (2)
  • standard math The weight space can be treated as a Riemannian manifold with appropriate metric for constrained optimization.
    Invoked implicitly by the use of Riemannian optimizer and manifold constraints.
  • domain assumption Single-loop updates preserve convergence guarantees under the chosen constraint alignment.
    Central to the provably convergent claim but not expanded in abstract.
invented entities (1)
  • MACRO optimizer no independent evidence
    purpose: Single-loop constrained Riemannian method for LLM pre-training
    New framework introduced to enforce manifold constraints while disentangling heuristics.

pith-pipeline@v0.9.0 · 5449 in / 1437 out tokens · 59333 ms · 2026-05-08T17:40:58.363580+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

70 extracted references · 43 canonical work pages · 3 internal anchors

  1. [1] John Zhao. Towards a principled Muon under µP: Ensuring spectral conditions throughout training, 2026. https://arxiv.org/abs/2601.01306
  2. [2] Xiaowen Jiang, Andrei Semenov, and Sebastian U. Stich. Enhancing LLM training via spectral clipping.
  3. [3] https://arxiv.org/abs/2603.14315
  4. [4] Md Rifat Arefin, Ravid Shwartz-Ziv, Ernie Chang, Chinnadhurai Sankar, Rylan Conway, Aristide Baratin, Adithya Sagar, and Patrick Huber. Learning in transformers under spectral constraints. In ICLR 2026 Workshop on Geometry-grounded Representation Learning and Generative Modeling, 2026.
  5. [5] Yufei Gu and Zeke Xie. Mano: Restriking manifold optimization for LLM training, 2026. https://arxiv.org/abs/2601.23000
  6. [6] Laker Newhouse, R. Preston Hess, Franz Cesista, Andrii Zahorodnii, Jeremy Bernstein, and Phillip Isola. Training transformers with enforced Lipschitz constants, 2025. https://arxiv.org/abs/2507.13338
  7. [7] Tian Xie, Haoming Luo, Haoyu Tang, Yiwen Hu, Jason Klein Liu, Qingnan Ren, Yang Wang, Wayne Xin Zhao, Rui Yan, Bing Su, Chong Luo, and Baining Guo. Controlled LLM training on spectral sphere, 2026. https://arxiv.org/abs/2601.08393
  8. [8] Kaiwei Yang and Lexiao Lai. Manifold constrained steepest descent, 2026. https://arxiv.org/abs/2601.21487
  9. [9] Hadi Mohaghegh Dolatabadi, Thalaiyasingam Ajanthan, Sameera Ramasinghe, Chamin P. Hewa Koneputugodage, Shamane Siriwardhana, Violetta Shevchenko, Karol Pajak, James Snewin, Gil Avraham, and Alexander Long. NuMuon: Nuclear-norm-constrained Muon for compressible LLM training, 2026. https://arxiv.org/abs/2603.03597
  10. [10] Jianlin Su. Fastest descent on a manifold: 4. Muon + spectral sphere, Aug 2025. https://spaces.ac.cn/archives/11241. (In Chinese)
  11. [11] Jianlin Su. Fastest descent on a manifold: 2. Muon + orthogonal, Aug 2025. https://spaces.ac.cn/archives/11215. (In Chinese)
  12. [12] Kaiyue Wen, Xingyu Dang, Kaifeng Lyu, Tengyu Ma, and Percy Liang. Fantastic pretraining optimizers and where to find them 2.1: Hyperball optimization, December 2025. https://tinyurl.com/muonh
  13. [13] Jeremy Bernstein. Modular manifolds. Thinking Machines Lab: Connectionism, 2025. doi: 10.64434/tml.20250926. https://thinkingmachines.ai/blog/modular-manifolds/
  14. [14] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization, 2019. https://arxiv.org/abs/1711.05101
  15. [15] Aaron Defazio. Why gradients rapidly increase near the end of training, 2025. https://arxiv.org/abs/2506.02285
  16. [16] Atli Kosson, Bettina Messmer, and Martin Jaggi. Rotational equilibrium: How weight decay balances learning across neural networks, 2024. https://arxiv.org/abs/2305.17212
  17. [17] Zeke Xie, Zhiqiang Xu, Jingzhao Zhang, Issei Sato, and Masashi Sugiyama. On the overlooked pitfalls of weight decay and how to mitigate them: A gradient-norm perspective, 2024. https://arxiv.org/abs/2011.11152
  18. [18] Francesco D'Angelo, Maksym Andriushchenko, Aditya Varre, and Nicolas Flammarion. Why do we need weight decay in modern deep learning?, 2024. https://arxiv.org/abs/2310.04415
  19. [19] Keller Jordan. Muon: An optimizer for hidden layers, 2024. https://github.com/KellerJordan/Muon
  20. [20] Yossi Arjevani, Yair Carmon, John C. Duchi, Dylan J. Foster, Nathan Srebro, and Blake Woodworth. Lower bounds for non-convex stochastic optimization. Mathematical Programming, 199(1):165–214, 2023.
  21. [21] Greg Yang, James B. Simon, and Jeremy Bernstein. A spectral condition for feature learning. arXiv preprint arXiv:2310.17813, 2023.
  22. [22] Jianlin Su. Beyond MuP: 2. Linear layers and steepest descent, Feb 2026. https://spaces.ac.cn/archives/11605. (In Chinese)
  23. [23] Zhi-Dong Bai and Yong-Qua Yin. Limit of the smallest eigenvalue of a large dimensional sample covariance matrix. Ann. Probab., 21(3):1275–1294, 1993.
  24. [24] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift, 2015. https://arxiv.org/abs/1502.03167
  25. [25] Yuxin Wu and Kaiming He. Group normalization. In Proceedings of the European Conference on Computer Vision (ECCV), September 2018.
  26. [26] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization, 2016. https://arxiv.org/abs/1607.06450
  27. [27] Tim Salimans and Diederik P. Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks, 2016. https://arxiv.org/abs/1602.07868
  28. [28] Siyuan Qiao, Huiyu Wang, Chenxi Liu, Wei Shen, and Alan Yuille. Micro-batch training with batch-channel normalization and weight standardization, 2020. https://arxiv.org/abs/1903.10520
  29. [29] Biao Zhang and Rico Sennrich. Root mean square layer normalization, 2019. https://arxiv.org/abs/1910.07467
  30. [30] Shuo Xie and Zhiyuan Li. Implicit bias of AdamW: ℓ∞-norm constrained optimization, 2024. https://arxiv.org/abs/2404.04454
  31. [31] Zhiyuan Li, Kaifeng Lyu, and Sanjeev Arora. Reconciling modern deep learning with traditional optimization analyses: The intrinsic learning rate, 2020. https://arxiv.org/abs/2010.02916
  32. [32] Chandler Davis and William Morton Kahan. The rotation of eigenvectors by a perturbation. III. SIAM Journal on Numerical Analysis, 7(1):1–46, 1970.
  33. [33] Per-Åke Wedin. Perturbation bounds in connection with singular value decomposition. BIT Numerical Mathematics, 12(1):99–111, 1972.
  34. [34] Thomas Pethick, Wanyun Xie, Kimon Antonakopoulos, Zhenyu Zhu, Antonio Silveti-Falls, and Volkan Cevher. Training deep learning models with norm-constrained LMOs, 2025. https://arxiv.org/abs/2502.07529
  35. [35] Runa Eschenhagen, Aaron Defazio, Tsung-Hsien Lee, Richard E. Turner, and Hao-Jun Michael Shi. Purifying Shampoo: Investigating Shampoo's heuristics by decomposing its preconditioner, 2025. https://arxiv.org/abs/2506.03595
  36. [36] Vineet Gupta, Tomer Koren, and Yoram Singer. Shampoo: Preconditioned stochastic tensor optimization. In International Conference on Machine Learning, pages 1842–1850. PMLR, 2018.
  37. [37] Runa Eschenhagen, Anna Cai, Tsung-Hsien Lee, and Hao-Jun Michael Shi. Clarifying Shampoo: Adapting spectral descent to stochasticity and the parameter trajectory, 2026. https://arxiv.org/abs/2602.09314
  38. [38] Hao-Jun Michael Shi, Tsung-Hsien Lee, Shintaro Iwasaki, Jose Gallego-Posada, Zhijing Li, Kaushik Rangadurai, Dheevatsa Mudigere, and Michael Rabbat. A distributed data-parallel PyTorch implementation of the distributed Shampoo optimizer for training neural networks at-scale, 2023. https://arxiv.org/abs/2309.06497
  39. [39] Jeremy Bernstein and Laker Newhouse. Old optimizer, new norm: An anthology. arXiv preprint arXiv:2409.20325, 2024.
  40. [40] Kang An, Yuxing Liu, Rui Pan, Yi Ren, Shiqian Ma, Donald Goldfarb, and Tong Zhang. ASGO: Adaptive structured gradient optimization, 2025. https://arxiv.org/abs/2503.20762
  41. [41] Chao Ma, Wenbo Gong, Meyer Scetbon, and Edward Meeds. SWAN: SGD with normalization and whitening enables stateless LLM training, 2025. https://arxiv.org/abs/2412.13148
  42. [42] Athanasios Glentis, Jiaxiang Li, Andi Han, and Mingyi Hong. A minimalist optimizer design for LLM pretraining, 2025. https://arxiv.org/abs/2506.16659
  43. [43] Ruihan Xu, Jiajin Li, and Yiping Lu. On the width scaling of neural optimizers under matrix operator norms I: Row/column normalization and hyperparameter transfer, 2026. https://arxiv.org/abs/2603.09952
  44. [44] Meyer Scetbon, Chao Ma, Wenbo Gong, and Edward Meeds. Gradient multi-normalization for stateless and scalable LLM training, 2025. https://arxiv.org/abs/2502.06742
  45. [45] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. In International Conference on Learning Representations, 2018.
  46. [46] Yang Liu, Jeremy Bernstein, Markus Meister, and Yisong Yue. Learning by turning: Neural architecture aware optimisation, 2021. https://arxiv.org/abs/2102.07227
  47. [47] Ilya Loshchilov, Cheng-Ping Hsieh, Simeng Sun, and Boris Ginsburg. nGPT: Normalized transformer with representation learning on the hypersphere, 2024. https://arxiv.org/abs/2410.01131
  48. [48] Yonggan Fu, Xin Dong, Shizhe Diao, Matthijs Van keirsbilck, Hanrong Ye, Wonmin Byeon, Yashaswi Karnati, Lucas Liebenwein, Hannah Zhang, Nikolaus Binder, Maksim Khadkevich, Alexander Keller, Jan Kautz, Yingyan Celine Lin, and Pavlo Molchanov. Nemotron-flash: Towards latency-optimal hybrid small language models, 2025. https://arxiv.org/abs/2511.18890
  49. [49] Jörg K. H. Franke, Urs Spiegelhalter, Marianna Nezhurina, Jenia Jitsev, Frank Hutter, and Michael Hefenbrock. Learning in compact spaces with approximately normalized transformer, 2025. https://arxiv.org/abs/2505.22014
  50. [50] Ruosi Wan, Zhanxing Zhu, Xiangyu Zhang, and Jian Sun. Spherical motion dynamics: Learning dynamics of normalized neural network using SGD and weight decay. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P. S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 21759–21770. Curran Associates, Inc., 2021.
  51. [51] Weiyang Liu, Zhen Liu, Zhiding Yu, Bo Dai, Rongmei Lin, Yisen Wang, James M. Rehg, and Le Song. Decoupled networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  52. [52] Takeru Miyato, Sindy Löwe, Andreas Geiger, and Max Welling. Artificial Kuramoto oscillatory neurons.
  53. [53] https://arxiv.org/abs/2410.13821
  54. [54] Tero Karras, Miika Aittala, Jaakko Lehtinen, Janne Hellsten, Timo Aila, and Samuli Laine. Analyzing and improving the training dynamics of diffusion models, 2024. https://arxiv.org/abs/2312.02696
  55. [55] Louis Owen, Abhay Kumar, Nilabhra Roy Chowdhury, and Fabian Güra. Variance control via weight rescaling in LLM pre-training, 2025. https://arxiv.org/abs/2503.17500
  56. [56] Guodong Zhang, Chaoqi Wang, Bowen Xu, and Roger Grosse. Three mechanisms of weight decay regularization, 2018. https://arxiv.org/abs/1810.12281
  57. [57] Zhenxun Zhuang, Mingrui Liu, Ashok Cutkosky, and Francesco Orabona. Understanding AdamW through proximal methods and scale-freeness, 2022. https://arxiv.org/abs/2202.00089
  58. [58] Zhiyuan Li and Sanjeev Arora. An exponential learning rate schedule for deep learning, 2019. https://arxiv.org/abs/1910.07454
  59. [59] Byeongho Heo, Sanghyuk Chun, Seong Joon Oh, Dongyoon Han, Sangdoo Yun, Gyuwan Kim, Youngjung Uh, and Jung-Woo Ha. AdamP: Slowing down the slowdown for momentum optimizers on scale-invariant weights, 2021. https://arxiv.org/abs/2006.08217
  60. [60] Twan van Laarhoven. L2 regularization versus batch and weight normalization, 2017. https://arxiv.org/abs/1706.05350
  61. [61] Atli Kosson, Bettina Messmer, and Martin Jaggi. Analyzing & reducing the need for learning rate warmup in GPT training, 2024. https://arxiv.org/abs/2410.23922
  62. [62] Jianlin Su. Beyond MuP: 4. Ensuring parameter stability, Apr 2026. https://spaces.ac.cn/archives/11729. (In Chinese)
  63. [63] Lizhang Chen, Bo Liu, Kaizhao Liang, and Qiang Liu. Lion secretly solves constrained optimization: As Lyapunov predicts, 2025. https://arxiv.org/abs/2310.05898
  64. [64] Adrian S. Lewis and Hristo S. Sendov. Nonsmooth analysis of singular values. Part I: Theory. Set-Valued Analysis, 13(3):213–241, 2005.
  65. [65] John M. Lee. Smooth manifolds. In Introduction to Smooth Manifolds, pages 1–29. Springer, 2003.
  66. [66] Roger A. Horn and Charles R. Johnson. Topics in Matrix Analysis. Cambridge University Press, 1991.
  67. [67] Mark Rudelson and Roman Vershynin. Sampling from large matrices: An approach through geometric functional analysis. Journal of the ACM, 54(4):21:1–21:19, 2007.
  68. [68] G. W. Stewart and Ji-Guang Sun. Matrix Perturbation Theory. Computer Science and Scientific Computing. Academic Press, 1990.
  69. [69] Andrej Karpathy. NanoChat: The best ChatGPT that $100 can buy, 2025. https://github.com/karpathy/nanochat
  70. [70] Adaptive Rotational Equilibrium under Spectral Sphere.