pith. machine review for the scientific record.

arxiv: 2605.04418 · v1 · submitted 2026-05-06 · 💻 cs.LG · cs.AI · math.OC

Recognition: 3 theorem links · Lean Theorem

Demystifying Manifold Constraints in LLM Pre-training

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 17:40 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · math.OC
keywords manifold constraints · LLM pre-training · Riemannian optimization · normalization layers · weight decay · activation scales · rotational equilibrium · constrained optimization

The pith

Manifold constraints on LLM weights independently bound activation scales and enforce rotational equilibrium, replacing normalization and weight decay.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper establishes that explicit manifold constraints on model weights can take over the stabilizing roles of common heuristic techniques like normalization layers and weight decay during large language model pre-training. Through the introduction of the Msign-Aligned Constrained Riemannian Optimizer (MACRO), a single-loop framework with convergence guarantees, the work separates these geometric restrictions from interacting mechanisms and shows they limit forward activation scales while maintaining rotational balance in the weights. Empirical results on large-scale architectures confirm that this approach delivers competitive performance without the usual heuristics. A sympathetic reader would care because it provides a more principled, geometry-based account of training stability and a path toward simpler optimization methods with theoretical backing.

Core claim

By introducing the Msign-Aligned Constrained Riemannian Optimizer (MACRO) as a provably convergent single-loop framework, the paper disentangles explicit manifold constraints from heuristics such as RMS normalization and decoupled weight decay. Theoretical analyses demonstrate that these constraints independently bound forward activation scales and enforce stable rotational equilibrium. Comprehensive empirical evaluations on large-scale LLM architectures show that MACRO achieves highly competitive performance while rigorously preserving the guarantees of exact Riemannian optimization.

What carries the argument

Manifold constraints on the weights, enforced by the Msign-Aligned Constrained Riemannian Optimizer (MACRO) in a single loop: the constraints restrict the weights geometrically and separate their regularization effects from normalization and weight decay.
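
A toy sketch of that single-loop pattern, in the spirit of Riemannian optimization on a Frobenius sphere. This is not the paper's MACRO (its Msign alignment, momentum handling, and convergence analysis are omitted); every name and hyperparameter below is illustrative only.

```python
# Minimal, illustrative single-loop update on the Frobenius sphere {W : ||W||_F = c}.
# Pattern: project the gradient onto the tangent space, take a step, retract back
# onto the manifold - all inside one loop, with no normalization layer or weight decay.
import torch

def tangent_project(grad: torch.Tensor, W: torch.Tensor) -> torch.Tensor:
    """Remove the radial component so the step stays tangent to the sphere."""
    radial = (grad * W).sum() / (W * W).sum()
    return grad - radial * W

def retract(W: torch.Tensor, radius: float) -> torch.Tensor:
    """Rescale W back onto the Frobenius sphere of the given radius."""
    return W * (radius / W.norm())

torch.manual_seed(0)
d_in, d_out, T = 64, 32, 256
X = torch.randn(T, d_in)
Y_target = X @ torch.randn(d_out, d_in).T

radius = 8.0                                   # illustrative constraint radius
W = retract(torch.randn(d_out, d_in), radius)
lr = 1e-2

for step in range(200):
    W.requires_grad_(True)
    loss = ((X @ W.T - Y_target) ** 2).mean()
    (grad,) = torch.autograd.grad(loss, W)
    with torch.no_grad():
        W = retract(W - lr * tangent_project(grad, W), radius)

final_loss = ((X @ W.T - Y_target) ** 2).mean().item()
print(f"final loss {final_loss:.4f}, ||W||_F = {W.norm().item():.4f} (target {radius:.4f})")
```

The point of the sketch is only that the constraint is enforced exactly at every step by a cheap projection and retraction, which is the single-loop property the pith attributes to MACRO.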

If this is right

  • Manifold constraints bound forward activation scales independently of explicit normalization layers.
  • They enforce stable rotational equilibrium in the weights, subsuming the role of weight decay.
  • MACRO delivers competitive performance on large LLMs while retaining exact Riemannian convergence guarantees.
  • Weight regularization effects can be isolated from interacting mechanisms like normalization.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training code could drop separate normalization and decay modules if the geometric constraints prove sufficient across tasks.
  • The rotational equilibrium view may help explain stability patterns in other neural architectures beyond language models.
  • Similar single-loop manifold enforcement could be tested for efficiency gains when scaling to even larger parameter counts.

Load-bearing premise

That manifold constraints can be enforced efficiently in a single-loop manner on large-scale LLM architectures without introducing new instabilities or degrading performance relative to standard heuristics.

What would settle it

Running MACRO on a full-scale LLM pre-training task and observing either lower final performance than standard normalized training or the emergence of numerical instabilities or divergence.

Figures

Figures reproduced from arXiv: 2605.04418 by Donald Goldfarb, Jiaxiang Li, Kang An, Shiqian Ma.

Figure 1. Train loss for the 120M QWEN3-like model.
Figure 2. Empirical validation of κ. For a linear layer Y = XW⊤, the average output RMS norm is evaluated via the trace formulation E_X[‖vec(Y)‖²_RMS] = (1/D_out) tr(Σ_X W⊤W), where Σ_X = (1/T) X⊤X is the empirical input covariance.
Figure 3. Evolution of the ℓ2 norm of γ during training.
Figure 4. θ_t under the Frobenius sphere. Traditional weight decay requires a transient phase to reach rotational equilibrium [15]; the manifold constraint enforces a rotational state from the very first step.
Figure 5. Rotational angle under the spectral constraint.
Figure 6. Training and validation loss for Muon, MACRO-fro, and MACRO-spec on the 330M QWEN3-like model with all learnable RMSNorm layers removed, at learning rate η_t = 10⁻².
Figure 7. Global gradient norm across training.
Figure 8. Validation loss versus learning rate at four model widths.
Figure 9. Training (left) and validation (right) loss across the full training schedule.
Figure 10. Average per-step tangent-space violation.
Figure 11. Ablation study on geometric constraints.
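
The trace formulation quoted under Figure 2 already suggests how a weight-norm constraint alone can cap the activation scale. Below is a minimal worked bound in that spirit, assuming a Frobenius-sphere constraint ‖W‖_F = c; the radius c and the specific constraint choice are illustrative, and the paper's exact constants and tightness argument are not reproduced here.

```latex
% Illustrative bound (not the paper's lemma): a Frobenius-sphere constraint
% \|W\|_F = c caps the average output RMS norm via the trace formulation
% quoted under Figure 2. The radius c is an assumed, illustrative constant.
\begin{align*}
\Sigma_X &= \tfrac{1}{T} X^\top X, \qquad Y = X W^\top, \\
\mathbb{E}_X\!\bigl[\lVert \operatorname{vec}(Y) \rVert_{\mathrm{RMS}}^{2}\bigr]
  &= \tfrac{1}{D_{\mathrm{out}}} \operatorname{tr}\!\bigl(\Sigma_X W^\top W\bigr)
  \;\le\; \tfrac{1}{D_{\mathrm{out}}}\, \lambda_{\max}(\Sigma_X)\, \lVert W \rVert_F^{2}
  \;=\; \tfrac{c^{2}}{D_{\mathrm{out}}}\, \lambda_{\max}(\Sigma_X).
\end{align*}
```

The bound involves only the constraint radius and the empirical input covariance; no normalization layer appears, which is the independence claim at the center of the pith and of the referee's first major comment.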
Original abstract

The empirical success of large language model (LLM) pre-training relies heavily on heuristic stabilization techniques, such as explicit normalization layers and weight decay. While recent constrained optimization approaches that explicitly restrict weights may improve numerical stability and performance, the mechanism and motivation for adding constraints still remain elusive. This paper systematically demystifies the role of explicit manifold constraints in LLM pre-training. By introducing the Msign-Aligned Constrained Riemannian Optimizer (MACRO), a provably convergent, single-loop optimization framework, our study disentangles weight regularization heuristics from interacting mechanisms like RMS normalization and decoupled weight decay. Theoretical analyses and comprehensive empirical evaluations reveal that manifold constraints independently bound forward activation scales and enforce stable rotational equilibrium, thereby subsuming the roles of these heuristic mechanisms. Evaluations on large-scale LLM architectures demonstrate that MACRO achieves highly competitive performance while rigorously preserving the theoretical guarantees of exact Riemannian optimization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces the Msign-Aligned Constrained Riemannian Optimizer (MACRO), a provably convergent single-loop Riemannian optimization framework for LLM pre-training. It claims that explicit manifold constraints independently bound forward activation scales and enforce stable rotational equilibrium, thereby subsuming the roles of RMS normalization and decoupled weight decay heuristics. Theoretical analyses establish convergence guarantees, while empirical evaluations on large-scale LLM architectures demonstrate competitive performance with preserved exact Riemannian optimization properties.

Significance. If the central claims hold, the work offers substantial significance by providing a principled, constraint-based explanation for stabilization in LLM training that could replace ad-hoc heuristics. The single-loop Riemannian formulation with provable convergence, combined with large-scale empirical validation, strengthens the case for adopting manifold constraints as a more transparent alternative to current practices.

major comments (2)
  1. [Theoretical Analyses] The independence claim in the theoretical analyses—that manifold constraints bound activation scales without relying on normalization—is load-bearing for the subsumption argument. The derivation should explicitly show that the bounding effect persists under the paper's manifold definition even when normalization layers are removed, rather than emerging from the interaction of definitions.
  2. [Empirical Evaluations] Empirical section on large-scale evaluations: the reported competitive performance must be supported by ablations that isolate the manifold constraint's effect on rotational equilibrium (e.g., comparing MACRO variants with and without the constraint while holding other optimizer components fixed). Without this, the claim that constraints subsume weight decay remains partially confounded.
minor comments (2)
  1. [Introduction] The acronym MACRO and its full expansion should be introduced at the first use in the main text body for clarity, even though it appears in the abstract.
  2. [Theoretical Analyses] Notation for the Riemannian projection operator and Msign alignment could be clarified with a brief reminder of their definitions when first used in the convergence proof to aid readers outside Riemannian optimization.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for minor revision. We address each major comment below with point-by-point responses, committing to revisions that strengthen the clarity of our claims without altering the core contributions.

read point-by-point responses
  1. Referee: [Theoretical Analyses] The independence claim in the theoretical analyses—that manifold constraints bound activation scales without relying on normalization—is load-bearing for the subsumption argument. The derivation should explicitly show that the bounding effect persists under the paper's manifold definition even when normalization layers are removed, rather than emerging from the interaction of definitions.

    Authors: We agree that an explicit derivation of independence is essential to substantiate the subsumption argument. In the revised manuscript, we will expand the theoretical section with a dedicated lemma and proof that derives the forward activation scale bounds solely from the Msign-aligned Riemannian projection and manifold constraint, without reference to normalization layers. This will demonstrate that the bounding holds directly under the paper's manifold definition by considering the geometry of the constraint set in isolation. revision: yes

  2. Referee: [Empirical Evaluations] Empirical section on large-scale evaluations: the reported competitive performance must be supported by ablations that isolate the manifold constraint's effect on rotational equilibrium (e.g., comparing MACRO variants with and without the constraint while holding other optimizer components fixed). Without this, the claim that constraints subsume weight decay remains partially confounded.

    Authors: We acknowledge that isolating the rotational equilibrium constraint is necessary to avoid potential confounding in the subsumption claim. In the revised manuscript, we will add targeted ablation experiments on the large-scale LLM setups. These will compare the full MACRO optimizer against a controlled variant in which the rotational equilibrium constraint is relaxed (by modifying only the alignment step while holding the single-loop structure, other optimizer components, and hyperparameters fixed). The results will quantify the isolated contribution to stability and performance, directly supporting the relation to decoupled weight decay. revision: yes
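
To make the comparison proposed in response 2 concrete, here is a minimal sketch of such an ablation on a toy problem, assuming the per-step rotation angle is measured between consecutive weight matrices under the Frobenius inner product. That angle definition, the toy task, and all hyperparameters are illustrative assumptions, not the paper's setup.

```python
# Illustrative ablation: run the same single-loop training twice, with and without
# the Frobenius-sphere constraint, holding every other component fixed, and compare
# the per-step rotation angle of the weights (angle between consecutive W matrices).
import math
import torch

def run(constrained: bool, steps: int = 200, lr: float = 1e-2, radius: float = 8.0) -> float:
    torch.manual_seed(0)
    X = torch.randn(256, 64)
    Y = X @ torch.randn(32, 64).T
    W = torch.randn(32, 64)
    if constrained:
        W = W * (radius / W.norm())            # start on the sphere
    angles = []
    for _ in range(steps):
        W.requires_grad_(True)
        loss = ((X @ W.T - Y) ** 2).mean()
        (g,) = torch.autograd.grad(loss, W)
        with torch.no_grad():
            if constrained:
                g = g - ((g * W).sum() / (W * W).sum()) * W   # tangent projection
            W_new = W - lr * g
            if constrained:
                W_new = W_new * (radius / W_new.norm())        # retraction
            cos = (W * W_new).sum() / (W.norm() * W_new.norm())
            angles.append(math.acos(float(cos.clamp(-1.0, 1.0))))
            W = W_new
    return sum(angles) / len(angles)

print(f"mean rotation angle, constrained:   {run(True):.5f} rad")
print(f"mean rotation angle, unconstrained: {run(False):.5f} rad")
```

Holding the data, seed, learning rate, and loop structure fixed while toggling only the projection and retraction mirrors the controlled variant the referee asks for.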

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper introduces the MACRO framework as a new single-loop Riemannian optimizer and derives its properties (activation scale bounding and rotational equilibrium) from the explicit manifold constraints and convergence proofs within that framework. These results are presented as independent of prior heuristics like RMSNorm and weight decay, supported by both theoretical analysis and large-scale experiments rather than by re-using fitted parameters or self-citations as load-bearing inputs. No equation or claim reduces by construction to a redefinition of its own inputs, and the central disentangling argument rests on the novel constrained formulation rather than circular self-reference.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on standard assumptions from Riemannian optimization and manifold geometry that are not detailed in the abstract; no free parameters or invented entities are explicitly described beyond the new optimizer name.

axioms (2)
  • standard math The weight space can be treated as a Riemannian manifold with appropriate metric for constrained optimization.
    Invoked implicitly by the use of Riemannian optimizer and manifold constraints.
  • domain assumption Single-loop updates preserve convergence guarantees under the chosen constraint alignment.
    Central to the provably convergent claim but not expanded in abstract.
invented entities (1)
  • MACRO optimizer no independent evidence
    purpose: Single-loop constrained Riemannian method for LLM pre-training
    New framework introduced to enforce manifold constraints while disentangling heuristics.

pith-pipeline@v0.9.0 · 5449 in / 1437 out tokens · 59333 ms · 2026-05-08T17:40:58.363580+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

70 extracted references · 43 canonical work pages · 3 internal anchors

  1. [1] John Zhao. Towards a principled Muon under µP: Ensuring spectral conditions throughout training, 2026. https://arxiv.org/abs/2601.01306
  2. [2] Xiaowen Jiang, Andrei Semenov, and Sebastian U. Stich. Enhancing LLM training via spectral clipping.
  3. [3] https://arxiv.org/abs/2603.14315
  4. [4] Md Rifat Arefin, Ravid Shwartz-Ziv, Ernie Chang, Chinnadhurai Sankar, Rylan Conway, Aristide Baratin, Adithya Sagar, and Patrick Huber. Learning in transformers under spectral constraints. In ICLR 2026 Workshop on Geometry-grounded Representation Learning and Generative Modeling, 2026.
  5. [5] Yufei Gu and Zeke Xie. Mano: Restriking manifold optimization for LLM training, 2026. https://arxiv.org/abs/2601.23000
  6. [6] Laker Newhouse, R. Preston Hess, Franz Cesista, Andrii Zahorodnii, Jeremy Bernstein, and Phillip Isola. Training transformers with enforced Lipschitz constants, 2025. https://arxiv.org/abs/2507.13338
  7. [7] Tian Xie, Haoming Luo, Haoyu Tang, Yiwen Hu, Jason Klein Liu, Qingnan Ren, Yang Wang, Wayne Xin Zhao, Rui Yan, Bing Su, Chong Luo, and Baining Guo. Controlled LLM training on spectral sphere, 2026. https://arxiv.org/abs/2601.08393
  8. [8] Kaiwei Yang and Lexiao Lai. Manifold constrained steepest descent, 2026. https://arxiv.org/abs/2601.21487
  9. [9] Hadi Mohaghegh Dolatabadi, Thalaiyasingam Ajanthan, Sameera Ramasinghe, Chamin P. Hewa Koneputugodage, Shamane Siriwardhana, Violetta Shevchenko, Karol Pajak, James Snewin, Gil Avraham, and Alexander Long. NuMuon: Nuclear-norm-constrained Muon for compressible LLM training, 2026. https://arxiv.org/abs/2603.03597
  10. [10] Jianlin Su. Fastest descent on a manifold: 4. Muon + spectral sphere, Aug 2025. https://spaces.ac.cn/archives/11241. (In Chinese)
  11. [11] Jianlin Su. Fastest descent on a manifold: 2. Muon + orthogonal, Aug 2025. https://spaces.ac.cn/archives/11215. (In Chinese)
  12. [12] Kaiyue Wen, Xingyu Dang, Kaifeng Lyu, Tengyu Ma, and Percy Liang. Fantastic pretraining optimizers and where to find them 2.1: Hyperball optimization, December 2025. https://tinyurl.com/muonh
  13. [13] Jeremy Bernstein. Modular manifolds. Thinking Machines Lab: Connectionism, 2025. doi: 10.64434/tml.20250926. https://thinkingmachines.ai/blog/modular-manifolds/
  14. [14] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization, 2019. https://arxiv.org/abs/1711.05101
  15. [15] Aaron Defazio. Why gradients rapidly increase near the end of training, 2025. https://arxiv.org/abs/2506.02285
  16. [16] Atli Kosson, Bettina Messmer, and Martin Jaggi. Rotational equilibrium: How weight decay balances learning across neural networks, 2024. https://arxiv.org/abs/2305.17212
  17. [17] Zeke Xie, Zhiqiang Xu, Jingzhao Zhang, Issei Sato, and Masashi Sugiyama. On the overlooked pitfalls of weight decay and how to mitigate them: A gradient-norm perspective, 2024. https://arxiv.org/abs/2011.11152
  18. [18] Francesco D'Angelo, Maksym Andriushchenko, Aditya Varre, and Nicolas Flammarion. Why do we need weight decay in modern deep learning?, 2024. https://arxiv.org/abs/2310.04415
  19. [19] Keller Jordan. Muon: An optimizer for hidden layers, 2024. https://github.com/KellerJordan/Muon
  20. [20] Yossi Arjevani, Yair Carmon, John C. Duchi, Dylan J. Foster, Nathan Srebro, and Blake Woodworth. Lower bounds for non-convex stochastic optimization. Mathematical Programming, 199(1):165–214, 2023.
  21. [21] Greg Yang, James B. Simon, and Jeremy Bernstein. A spectral condition for feature learning. arXiv preprint arXiv:2310.17813, 2023.
  22. [22] Jianlin Su. Beyond MuP: 2. Linear layers and steepest descent, Feb 2026. https://spaces.ac.cn/archives/11605. (In Chinese)
  23. [23] Zhi-Dong Bai and Yong-Qua Yin. Limit of the smallest eigenvalue of a large dimensional sample covariance matrix. Ann. Probab., 21(3):1275–1294, 1993.
  24. [24] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift, 2015. https://arxiv.org/abs/1502.03167
  25. [25] Yuxin Wu and Kaiming He. Group normalization. In Proceedings of the European Conference on Computer Vision (ECCV), September 2018.
  26. [26] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization, 2016. https://arxiv.org/abs/1607.06450
  27. [27] Tim Salimans and Diederik P. Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks, 2016. https://arxiv.org/abs/1602.07868
  28. [28] Siyuan Qiao, Huiyu Wang, Chenxi Liu, Wei Shen, and Alan Yuille. Micro-batch training with batch-channel normalization and weight standardization, 2020. https://arxiv.org/abs/1903.10520
  29. [29] Biao Zhang and Rico Sennrich. Root mean square layer normalization, 2019. https://arxiv.org/abs/1910.07467
  30. [30] Shuo Xie and Zhiyuan Li. Implicit bias of AdamW: ℓ∞-norm constrained optimization, 2024. https://arxiv.org/abs/2404.04454
  31. [31] Zhiyuan Li, Kaifeng Lyu, and Sanjeev Arora. Reconciling modern deep learning with traditional optimization analyses: The intrinsic learning rate, 2020. https://arxiv.org/abs/2010.02916
  32. [32] Chandler Davis and William Morton Kahan. The rotation of eigenvectors by a perturbation. III. SIAM Journal on Numerical Analysis, 7(1):1–46, 1970.
  33. [33] Per-Åke Wedin. Perturbation bounds in connection with singular value decomposition. BIT Numerical Mathematics, 12(1):99–111, 1972.
  34. [34] Thomas Pethick, Wanyun Xie, Kimon Antonakopoulos, Zhenyu Zhu, Antonio Silveti-Falls, and Volkan Cevher. Training deep learning models with norm-constrained LMOs, 2025. https://arxiv.org/abs/2502.07529
  35. [35] Runa Eschenhagen, Aaron Defazio, Tsung-Hsien Lee, Richard E. Turner, and Hao-Jun Michael Shi. Purifying Shampoo: Investigating Shampoo's heuristics by decomposing its preconditioner, 2025. https://arxiv.org/abs/2506.03595
  36. [36] Vineet Gupta, Tomer Koren, and Yoram Singer. Shampoo: Preconditioned stochastic tensor optimization. In International Conference on Machine Learning, pages 1842–1850. PMLR, 2018.
  37. [37] Runa Eschenhagen, Anna Cai, Tsung-Hsien Lee, and Hao-Jun Michael Shi. Clarifying Shampoo: Adapting spectral descent to stochasticity and the parameter trajectory, 2026. https://arxiv.org/abs/2602.09314
  38. [38] Hao-Jun Michael Shi, Tsung-Hsien Lee, Shintaro Iwasaki, Jose Gallego-Posada, Zhijing Li, Kaushik Rangadurai, Dheevatsa Mudigere, and Michael Rabbat. A distributed data-parallel PyTorch implementation of the distributed Shampoo optimizer for training neural networks at-scale, 2023. https://arxiv.org/abs/2309.06497
  39. [39] Jeremy Bernstein and Laker Newhouse. Old optimizer, new norm: An anthology. arXiv preprint arXiv:2409.20325, 2024.
  40. [40] Kang An, Yuxing Liu, Rui Pan, Yi Ren, Shiqian Ma, Donald Goldfarb, and Tong Zhang. ASGO: Adaptive structured gradient optimization, 2025. https://arxiv.org/abs/2503.20762
  41. [41] Chao Ma, Wenbo Gong, Meyer Scetbon, and Edward Meeds. SWAN: SGD with normalization and whitening enables stateless LLM training, 2025. https://arxiv.org/abs/2412.13148
  42. [42] Athanasios Glentis, Jiaxiang Li, Andi Han, and Mingyi Hong. A minimalist optimizer design for LLM pretraining, 2025. https://arxiv.org/abs/2506.16659
  43. [43] Ruihan Xu, Jiajin Li, and Yiping Lu. On the width scaling of neural optimizers under matrix operator norms I: Row/column normalization and hyperparameter transfer, 2026. https://arxiv.org/abs/2603.09952
  44. [44] Meyer Scetbon, Chao Ma, Wenbo Gong, and Edward Meeds. Gradient multi-normalization for stateless and scalable LLM training, 2025. https://arxiv.org/abs/2502.06742
  45. [45] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. In International Conference on Learning Representations, 2018.
  46. [46] Yang Liu, Jeremy Bernstein, Markus Meister, and Yisong Yue. Learning by turning: Neural architecture aware optimisation, 2021. https://arxiv.org/abs/2102.07227
  47. [47] Ilya Loshchilov, Cheng-Ping Hsieh, Simeng Sun, and Boris Ginsburg. nGPT: Normalized transformer with representation learning on the hypersphere, 2024. https://arxiv.org/abs/2410.01131
  48. [48] Yonggan Fu, Xin Dong, Shizhe Diao, Matthijs Van keirsbilck, Hanrong Ye, Wonmin Byeon, Yashaswi Karnati, Lucas Liebenwein, Hannah Zhang, Nikolaus Binder, Maksim Khadkevich, Alexander Keller, Jan Kautz, Yingyan Celine Lin, and Pavlo Molchanov. Nemotron-flash: Towards latency-optimal hybrid small language models, 2025. https://arxiv.org/abs/2511.18890
  49. [49] Jörg K. H. Franke, Urs Spiegelhalter, Marianna Nezhurina, Jenia Jitsev, Frank Hutter, and Michael Hefenbrock. Learning in compact spaces with approximately normalized transformer, 2025. https://arxiv.org/abs/2505.22014
  50. [50] Ruosi Wan, Zhanxing Zhu, Xiangyu Zhang, and Jian Sun. Spherical motion dynamics: Learning dynamics of normalized neural network using SGD and weight decay. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P. S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 21759–21770. Curran Associates, Inc., 2021.
  51. [51] Weiyang Liu, Zhen Liu, Zhiding Yu, Bo Dai, Rongmei Lin, Yisen Wang, James M. Rehg, and Le Song. Decoupled networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  52. [52] Takeru Miyato, Sindy Löwe, Andreas Geiger, and Max Welling. Artificial Kuramoto oscillatory neurons.
  53. [53] https://arxiv.org/abs/2410.13821
  54. [54] Tero Karras, Miika Aittala, Jaakko Lehtinen, Janne Hellsten, Timo Aila, and Samuli Laine. Analyzing and improving the training dynamics of diffusion models, 2024. https://arxiv.org/abs/2312.02696
  55. [55] Louis Owen, Abhay Kumar, Nilabhra Roy Chowdhury, and Fabian Güra. Variance control via weight rescaling in LLM pre-training, 2025. https://arxiv.org/abs/2503.17500
  56. [56] Guodong Zhang, Chaoqi Wang, Bowen Xu, and Roger Grosse. Three mechanisms of weight decay regularization, 2018. https://arxiv.org/abs/1810.12281
  57. [57] Zhenxun Zhuang, Mingrui Liu, Ashok Cutkosky, and Francesco Orabona. Understanding AdamW through proximal methods and scale-freeness, 2022. https://arxiv.org/abs/2202.00089
  58. [58] Zhiyuan Li and Sanjeev Arora. An exponential learning rate schedule for deep learning, 2019. https://arxiv.org/abs/1910.07454
  59. [59] Byeongho Heo, Sanghyuk Chun, Seong Joon Oh, Dongyoon Han, Sangdoo Yun, Gyuwan Kim, Youngjung Uh, and Jung-Woo Ha. AdamP: Slowing down the slowdown for momentum optimizers on scale-invariant weights, 2021. https://arxiv.org/abs/2006.08217
  60. [60] Twan van Laarhoven. L2 regularization versus batch and weight normalization, 2017. https://arxiv.org/abs/1706.05350
  61. [61] Atli Kosson, Bettina Messmer, and Martin Jaggi. Analyzing & reducing the need for learning rate warmup in GPT training, 2024. https://arxiv.org/abs/2410.23922
  62. [62] Jianlin Su. Beyond MuP: 4. Ensuring parameter stability, Apr 2026. https://spaces.ac.cn/archives/11729. (In Chinese)
  63. [63] Lizhang Chen, Bo Liu, Kaizhao Liang, and Qiang Liu. Lion secretly solves constrained optimization: As Lyapunov predicts, 2025. https://arxiv.org/abs/2310.05898
  64. [64] Adrian S. Lewis and Hristo S. Sendov. Nonsmooth analysis of singular values. Part I: Theory. Set-Valued Analysis, 13(3):213–241, 2005.
  65. [65] John M. Lee. Smooth manifolds. In Introduction to Smooth Manifolds, pages 1–29. Springer, 2003.
  66. [66] Roger A. Horn and Charles R. Johnson. Topics in Matrix Analysis. Cambridge University Press, 1991.
  67. [67] Mark Rudelson and Roman Vershynin. Sampling from large matrices: An approach through geometric functional analysis. Journal of the ACM, 54(4):21:1–21:19, 2007.
  68. [68] G. W. Stewart and Ji-Guang Sun. Matrix Perturbation Theory. Computer Science and Scientific Computing. Academic Press, 1990.
  69. [69] Andrej Karpathy. NanoChat: The best ChatGPT that $100 can buy, 2025. https://github.com/karpathy/nanochat
  70. [70] Adaptive Rotational Equilibrium under Spectral Sphere.