Demystifying Manifold Constraints in LLM Pre-training
Recognition: 3 theorem links · Lean theorem
Pith reviewed 2026-05-08 17:40 UTC · model grok-4.3
The pith
Manifold constraints on LLM weights independently bound activation scales and enforce rotational equilibrium, replacing normalization and weight decay.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By introducing the Msign-Aligned Constrained Riemannian Optimizer (MACRO) as a provably convergent single-loop framework, the paper disentangles explicit manifold constraints from heuristics such as RMS normalization and decoupled weight decay. Theoretical analyses demonstrate that these constraints independently bound forward activation scales and enforce stable rotational equilibrium. Comprehensive empirical evaluations on large-scale LLM architectures show that MACRO achieves highly competitive performance while rigorously preserving the guarantees of exact Riemannian optimization.
What carries the argument
Manifold constraints on the weights, enforced in a single loop by the Msign-Aligned Constrained Riemannian Optimizer (MACRO), which restricts the weights geometrically and separates their regularization effect from normalization and weight decay.
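To make the single-loop claim concrete, here is a minimal sketch of what an msign-aligned, manifold-constrained update can look like. This is an illustration under assumptions, not the paper's algorithm: the Newton-Schulz coefficients follow the Muon line of work, the constraint manifold is taken to be a fixed Frobenius-norm sphere, and `lr` and `momentum` are placeholder values.

```python
# Illustrative single-loop constrained update in the spirit of MACRO.
# Assumptions (not from the paper): the constraint set is a fixed Frobenius-norm
# sphere, the "msign" direction uses the Newton-Schulz iteration popularized by
# Muon, and lr/momentum are placeholders. The paper's actual manifold, metric,
# and retraction may differ.
import torch


def msign(g: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximate U @ V.T of the SVD of g via Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315        # quintic coefficients from Muon
    x = g / (g.norm() + 1e-7)                 # normalize so the iteration converges
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * s @ s) @ x
    return x.T if transposed else x


def macro_like_step(w: torch.Tensor, grad: torch.Tensor, buf: torch.Tensor,
                    lr: float = 0.02, momentum: float = 0.95) -> None:
    """One single-loop step: momentum -> msign-aligned direction -> retraction."""
    buf.mul_(momentum).add_(grad)             # momentum accumulation
    direction = msign(buf)                    # orthogonalized ("msign-aligned") update
    radius = w.norm()                         # current manifold radius
    w.sub_(lr * direction)                    # take the step
    w.mul_(radius / (w.norm() + 1e-7))        # retract back onto the norm sphere
```

The point of the single-loop structure is that the constraint costs one retraction per step rather than an inner projection solve; whether the paper's actual manifold, metric, and convergence argument bear that out is what the review below interrogates.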
If this is right
- Manifold constraints bound forward activation scales independently of explicit normalization layers.
- They enforce stable rotational equilibrium in the weights, subsuming the role of weight decay (both effects are sketched in the note after this list).
- MACRO delivers competitive performance on large LLMs while retaining exact Riemannian convergence guarantees.
- Weight regularization effects can be isolated from interacting mechanisms like normalization.
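Neither mechanism is spelled out in the abstract; as a reading aid (not the paper's derivation), the first two bullets plausibly reduce to the following elementary shapes, where κ is an assumed spectral-norm cap, r an assumed fixed weight norm, η the learning rate, and u_t the update direction, none of which are the paper's notation.

```latex
% Activation bound from the constraint alone: if the manifold keeps \|W\|_2 \le \kappa,
% then submultiplicativity gives, for any input x,
\|Wx\|_2 \;\le\; \|W\|_2\,\|x\|_2 \;\le\; \kappa\,\|x\|_2 ,
% so the forward activation scale is controlled without an explicit RMSNorm layer.

% Rotation pinned by the constraint: with \|W_t\| \equiv r fixed on the manifold,
% the per-step rotation angle of the weights is set directly by the update,
\theta_t \;\approx\; \frac{\eta\,\|u_t\|}{r},
% the quantity that decoupled weight decay only drives toward a stable value after a
% transient, at "rotational equilibrium" [16].
```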
Where Pith is reading between the lines
- Training code could drop separate normalization and decay modules if the geometric constraints prove sufficient across tasks.
- The rotational equilibrium view may help explain stability patterns in other neural architectures beyond language models.
- Similar single-loop manifold enforcement could be tested for efficiency gains when scaling to even larger parameter counts.
Load-bearing premise
That manifold constraints can be enforced efficiently in a single-loop manner on large-scale LLM architectures without introducing new instabilities or degrading performance relative to standard heuristics.
What would settle it
Running MACRO on a full-scale LLM pre-training task: lower final performance than standard normalized training, or the emergence of numerical instabilities or divergence, would settle the claim against the paper.
Original abstract
The empirical success of large language model (LLM) pre-training relies heavily on heuristic stabilization techniques, such as explicit normalization layers and weight decay. While recent constrained optimization approaches that explicitly restrict weights may improve numerical stability and performance, the mechanism and motivation for adding constraints still remain elusive. This paper systematically demystifies the role of explicit manifold constraints in LLM pre-training. By introducing the Msign-Aligned Constrained Riemannian Optimizer (MACRO), a provably convergent, single-loop optimization framework, our study disentangles weight regularization heuristics from interacting mechanisms like RMS normalization and decoupled weight decay. Theoretical analyses and comprehensive empirical evaluations reveal that manifold constraints independently bound forward activation scales and enforce stable rotational equilibrium, thereby subsuming the roles of these heuristic mechanisms. Evaluations on large-scale LLM architectures demonstrate that MACRO achieves highly competitive performance while rigorously preserving the theoretical guarantees of exact Riemannian optimization.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the Msign-Aligned Constrained Riemannian Optimizer (MACRO), a provably convergent single-loop Riemannian optimization framework for LLM pre-training. It claims that explicit manifold constraints independently bound forward activation scales and enforce stable rotational equilibrium, thereby subsuming the roles of RMS normalization and decoupled weight decay heuristics. Theoretical analyses establish convergence guarantees, while empirical evaluations on large-scale LLM architectures demonstrate competitive performance with preserved exact Riemannian optimization properties.
Significance. If the central claims hold, the work offers substantial significance by providing a principled, constraint-based explanation for stabilization in LLM training that could replace ad-hoc heuristics. The single-loop Riemannian formulation with provable convergence, combined with large-scale empirical validation, strengthens the case for adopting manifold constraints as a more transparent alternative to current practices.
major comments (2)
- [Theoretical Analyses] The independence claim in the theoretical analyses—that manifold constraints bound activation scales without relying on normalization—is load-bearing for the subsumption argument. The derivation should explicitly show that the bounding effect persists under the paper's manifold definition even when normalization layers are removed, rather than emerging from the interaction of definitions.
- [Empirical Evaluations] Empirical section on large-scale evaluations: the reported competitive performance must be supported by ablations that isolate the manifold constraint's effect on rotational equilibrium (e.g., comparing MACRO variants with and without the constraint while holding other optimizer components fixed). Without this, the claim that constraints subsume weight decay remains partially confounded.
minor comments (2)
- [Introduction] The acronym MACRO and its full expansion should be introduced at the first use in the main text body for clarity, even though it appears in the abstract.
- [Theoretical Analyses] Notation for the Riemannian projection operator and Msign alignment could be clarified with a brief reminder of their definitions when first used in the convergence proof to aid readers outside Riemannian optimization.
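For orientation, the standard definitions this comment refers to, as used in the Muon and spectral-descent literature (the paper's own conventions may differ), are roughly:

```latex
% Msign (orthogonalized / polar direction): for a matrix G with reduced SVD G = U \Sigma V^{\top},
\mathrm{msign}(G) \;=\; U V^{\top} ,
% i.e. the nearest (semi-)orthogonal matrix to G in Frobenius norm, which Muon-style
% optimizers approximate with a few Newton-Schulz iterations instead of an exact SVD.

% Riemannian projection (embedded submanifold, induced metric): at a point W on the
% constraint manifold \mathcal{M}, the Euclidean gradient G is projected onto the tangent space,
\mathrm{grad}\, f(W) \;=\; P_{T_W\mathcal{M}}(G),
% and the iterate is mapped back onto \mathcal{M} by a retraction R_W(\cdot).
```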
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the recommendation for minor revision. We address each major comment below with point-by-point responses, committing to revisions that strengthen the clarity of our claims without altering the core contributions.
Point-by-point responses
-
Referee: [Theoretical Analyses] The independence claim in the theoretical analyses—that manifold constraints bound activation scales without relying on normalization—is load-bearing for the subsumption argument. The derivation should explicitly show that the bounding effect persists under the paper's manifold definition even when normalization layers are removed, rather than emerging from the interaction of definitions.
Authors: We agree that an explicit derivation of independence is essential to substantiate the subsumption argument. In the revised manuscript, we will expand the theoretical section with a dedicated lemma and proof that derives the forward activation scale bounds solely from the Msign-aligned Riemannian projection and manifold constraint, without reference to normalization layers. This will demonstrate that the bounding holds directly under the paper's manifold definition by considering the geometry of the constraint set in isolation. revision: yes
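A toy numerical check of the shape such a lemma would take, assuming purely for illustration that the constraint caps the spectral norm at 1 and that the projection clips singular values; the paper's actual manifold and projection may differ.

```python
# Toy check that a spectral-norm cap alone bounds activation scale, with no
# normalization layer in the forward pass. The cap value (1.0) and SVD clipping
# as the projection are illustrative assumptions, not the paper's construction.
import torch

torch.manual_seed(0)
d, n = 512, 64
w = torch.randn(d, d) / d ** 0.5

# Project w onto {W : ||W||_2 <= 1} by clipping singular values.
u, s, vh = torch.linalg.svd(w, full_matrices=False)
w_proj = u @ torch.diag(s.clamp(max=1.0)) @ vh

x = torch.randn(n, d)            # a batch of pre-activation inputs
y = x @ w_proj.T                 # forward pass with the constrained weight only

# Submultiplicativity: ||W x||_2 <= ||W||_2 * ||x||_2 <= ||x||_2 here.
assert torch.all(y.norm(dim=-1) <= x.norm(dim=-1) + 1e-4)
print("max ||Wx|| / ||x|| =", (y.norm(dim=-1) / x.norm(dim=-1)).max().item())
```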
-
Referee: [Empirical Evaluations] Empirical section on large-scale evaluations: the reported competitive performance must be supported by ablations that isolate the manifold constraint's effect on rotational equilibrium (e.g., comparing MACRO variants with and without the constraint while holding other optimizer components fixed). Without this, the claim that constraints subsume weight decay remains partially confounded.
Authors: We acknowledge that isolating the rotational equilibrium constraint is necessary to avoid potential confounding in the subsumption claim. In the revised manuscript, we will add targeted ablation experiments on the large-scale LLM setups. These will compare the full MACRO optimizer against a controlled variant in which the rotational equilibrium constraint is relaxed (by modifying only the alignment step while holding the single-loop structure, other optimizer components, and hyperparameters fixed). The results will quantify the isolated contribution to stability and performance, directly supporting the relation to decoupled weight decay. revision: yes
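The promised ablation reduces to a single toggled factor. The sketch below is a hypothetical harness, not code or an API from the paper: `Arm`, `run_ablation`, and the `train_once` callback are illustrative stand-ins.

```python
# Sketch of the promised ablation: the manifold constraint is the only toggled
# factor; seed, schedule, and hyperparameters are shared across arms. Every name
# here is an illustrative stand-in, not code or an API from the paper.
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Arm:
    name: str
    enforce_constraint: bool  # the single factor that differs between arms


def run_ablation(train_once: Callable[[bool], Dict[str, float]]) -> Dict[str, Dict[str, float]]:
    """train_once(enforce_constraint) runs one fixed-budget pre-training run with a
    fixed seed and hyperparameters, returning metrics such as final loss, peak
    activation RMS, and per-step rotation-angle statistics."""
    arms = [Arm("macro_full", True), Arm("macro_no_constraint", False)]
    return {arm.name: train_once(arm.enforce_constraint) for arm in arms}


if __name__ == "__main__":
    # Dummy trainer so the sketch runs; swap in a real pre-training loop.
    dummy = lambda constrained: {"final_loss": 0.0, "peak_activation_rms": 1.0}
    print(run_ablation(dummy))
```

Reporting final loss together with peak activation RMS and per-step rotation angles for both arms would directly address the confounding concern raised by the referee.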
Circularity Check
No significant circularity detected
full rationale
The paper introduces the MACRO framework as a new single-loop Riemannian optimizer and derives its properties (activation scale bounding and rotational equilibrium) from the explicit manifold constraints and convergence proofs within that framework. These results are presented as independent of prior heuristics like RMSNorm and weight decay, supported by both theoretical analysis and large-scale experiments rather than by re-using fitted parameters or self-citations as load-bearing inputs. No equation or claim reduces by construction to a redefinition of its own inputs, and the central disentangling argument rests on the novel constrained formulation rather than circular self-reference.
Axiom & Free-Parameter Ledger
axioms (2)
- [standard math] The weight space can be treated as a Riemannian manifold with an appropriate metric for constrained optimization.
- [domain assumption] Single-loop updates preserve convergence guarantees under the chosen constraint alignment.
invented entities (1)
- MACRO optimizer: no independent evidence
Reference graph
Works this paper leans on
- [1] John Zhao. Towards a principled Muon under µP: Ensuring spectral conditions throughout training, 2026. https://arxiv.org/abs/2601.01306
- [2] Xiaowen Jiang, Andrei Semenov, and Sebastian U. Stich. Enhancing LLM training via spectral clipping.
- [3]
- [4] Md Rifat Arefin, Ravid Shwartz-Ziv, Ernie Chang, Chinnadhurai Sankar, Rylan Conway, Aristide Baratin, Adithya Sagar, and Patrick Huber. Learning in transformers under spectral constraints. In ICLR 2026 Workshop on Geometry-grounded Representation Learning and Generative Modeling, 2026.
- [5] Yufei Gu and Zeke Xie. Mano: Restriking manifold optimization for LLM training, 2026. https://arxiv.org/abs/2601.23000
- [6] Laker Newhouse, R. Preston Hess, Franz Cesista, Andrii Zahorodnii, Jeremy Bernstein, and Phillip Isola. Training transformers with enforced Lipschitz constants, 2025. https://arxiv.org/abs/2507.13338
- [7] Tian Xie, Haoming Luo, Haoyu Tang, Yiwen Hu, Jason Klein Liu, Qingnan Ren, Yang Wang, Wayne Xin Zhao, Rui Yan, Bing Su, Chong Luo, and Baining Guo. Controlled LLM training on spectral sphere, 2026. https://arxiv.org/abs/2601.08393
- [8] Kaiwei Yang and Lexiao Lai. Manifold constrained steepest descent, 2026. https://arxiv.org/abs/2601.21487
- [9] Hadi Mohaghegh Dolatabadi, Thalaiyasingam Ajanthan, Sameera Ramasinghe, Chamin P. Hewa Koneputugodage, Shamane Siriwardhana, Violetta Shevchenko, Karol Pajak, James Snewin, Gil Avraham, and Alexander Long. NuMuon: Nuclear-norm-constrained Muon for compressible LLM training, 2026. https://arxiv.org/abs/2603.03597
- [10] Jianlin Su. Fastest descent on a manifold: 4. Muon + spectral sphere, Aug 2025. https://spaces.ac.cn/archives/11241. (In Chinese)
- [11] Jianlin Su. Fastest descent on a manifold: 2. Muon + orthogonal, Aug 2025. https://spaces.ac.cn/archives/11215. (In Chinese)
- [12] Kaiyue Wen, Xingyu Dang, Kaifeng Lyu, Tengyu Ma, and Percy Liang. Fantastic pretraining optimizers and where to find them 2.1: Hyperball optimization, Dec 2025. https://tinyurl.com/muonh
- [13] Jeremy Bernstein. Modular manifolds. Thinking Machines Lab: Connectionism, 2025. doi: 10.64434/tml.20250926. https://thinkingmachines.ai/blog/modular-manifolds/
- [14] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization, 2019. https://arxiv.org/abs/1711.05101
- [15] Aaron Defazio. Why gradients rapidly increase near the end of training, 2025. https://arxiv.org/abs/2506.02285
- [16] Atli Kosson, Bettina Messmer, and Martin Jaggi. Rotational equilibrium: How weight decay balances learning across neural networks, 2024. https://arxiv.org/abs/2305.17212
- [17] Zeke Xie, Zhiqiang Xu, Jingzhao Zhang, Issei Sato, and Masashi Sugiyama. On the overlooked pitfalls of weight decay and how to mitigate them: A gradient-norm perspective, 2024. https://arxiv.org/abs/2011.11152
- [18] Francesco D'Angelo, Maksym Andriushchenko, Aditya Varre, and Nicolas Flammarion. Why do we need weight decay in modern deep learning?, 2024. https://arxiv.org/abs/2310.04415
- [19] Keller Jordan. Muon: An optimizer for hidden layers, 2024. https://github.com/KellerJordan/Muon
- [20] Yossi Arjevani, Yair Carmon, John C. Duchi, Dylan J. Foster, Nathan Srebro, and Blake Woodworth. Lower bounds for non-convex stochastic optimization. Mathematical Programming, 199(1):165–214, 2023.
- [21] Greg Yang, James B. Simon, and Jeremy Bernstein. A spectral condition for feature learning. arXiv preprint arXiv:2310.17813, 2023.
- [22] Jianlin Su. Beyond MuP: 2. Linear layers and steepest descent, Feb 2026. https://spaces.ac.cn/archives/11605. (In Chinese)
- [23] Zhi-Dong Bai and Yong-Qua Yin. Limit of the smallest eigenvalue of a large dimensional sample covariance matrix. Ann. Probab., 21(3):1275–1294, 1993.
- [24] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift, 2015. https://arxiv.org/abs/1502.03167
- [25] Yuxin Wu and Kaiming He. Group normalization. In Proceedings of the European Conference on Computer Vision (ECCV), September 2018.
- [26] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization, 2016. https://arxiv.org/abs/1607.06450
- [27]
- [28] Siyuan Qiao, Huiyu Wang, Chenxi Liu, Wei Shen, and Alan Yuille. Micro-batch training with batch-channel normalization and weight standardization, 2020. https://arxiv.org/abs/1903.10520
- [29] Biao Zhang and Rico Sennrich. Root mean square layer normalization, 2019. https://arxiv.org/abs/1910.07467
- [30] Shuo Xie and Zhiyuan Li. Implicit bias of AdamW: ℓ∞-norm constrained optimization, 2024. https://arxiv.org/abs/2404.04454
- [31] Zhiyuan Li, Kaifeng Lyu, and Sanjeev Arora. Reconciling modern deep learning with traditional optimization analyses: The intrinsic learning rate, 2020. https://arxiv.org/abs/2010.02916
- [32] Chandler Davis and William Morton Kahan. The rotation of eigenvectors by a perturbation. III. SIAM Journal on Numerical Analysis, 7(1):1–46, 1970.
- [33] Per-Åke Wedin. Perturbation bounds in connection with singular value decomposition. BIT Numerical Mathematics, 12(1):99–111, 1972.
- [34] Thomas Pethick, Wanyun Xie, Kimon Antonakopoulos, Zhenyu Zhu, Antonio Silveti-Falls, and Volkan Cevher. Training deep learning models with norm-constrained LMOs, 2025. https://arxiv.org/abs/2502.07529
- [35] Runa Eschenhagen, Aaron Defazio, Tsung-Hsien Lee, Richard E. Turner, and Hao-Jun Michael Shi. Purifying Shampoo: Investigating Shampoo's heuristics by decomposing its preconditioner, 2025. https://arxiv.org/abs/2506.03595
- [36] Vineet Gupta, Tomer Koren, and Yoram Singer. Shampoo: Preconditioned stochastic tensor optimization. In International Conference on Machine Learning, pages 1842–1850. PMLR, 2018.
- [37] Runa Eschenhagen, Anna Cai, Tsung-Hsien Lee, and Hao-Jun Michael Shi. Clarifying Shampoo: Adapting spectral descent to stochasticity and the parameter trajectory, 2026. https://arxiv.org/abs/2602.09314
- [38] Hao-Jun Michael Shi, Tsung-Hsien Lee, Shintaro Iwasaki, Jose Gallego-Posada, Zhijing Li, Kaushik Rangadurai, Dheevatsa Mudigere, and Michael Rabbat. A distributed data-parallel PyTorch implementation of the distributed Shampoo optimizer for training neural networks at-scale, 2023. https://arxiv.org/abs/2309.06497
- [39] Jeremy Bernstein and Laker Newhouse. Old optimizer, new norm: An anthology. arXiv preprint arXiv:2409.20325, 2024.
- [40] Kang An, Yuxing Liu, Rui Pan, Yi Ren, Shiqian Ma, Donald Goldfarb, and Tong Zhang. ASGO: Adaptive structured gradient optimization, 2025. https://arxiv.org/abs/2503.20762
- [41] Chao Ma, Wenbo Gong, Meyer Scetbon, and Edward Meeds. SWAN: SGD with normalization and whitening enables stateless LLM training, 2025. https://arxiv.org/abs/2412.13148
- [42] Athanasios Glentis, Jiaxiang Li, Andi Han, and Mingyi Hong. A minimalist optimizer design for LLM pretraining, 2025. https://arxiv.org/abs/2506.16659
- [43] Ruihan Xu, Jiajin Li, and Yiping Lu. On the width scaling of neural optimizers under matrix operator norms I: Row/column normalization and hyperparameter transfer, 2026. https://arxiv.org/abs/2603.09952
- [44] Meyer Scetbon, Chao Ma, Wenbo Gong, and Edward Meeds. Gradient multi-normalization for stateless and scalable LLM training, 2025. https://arxiv.org/abs/2502.06742
- [45] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. In International Conference on Learning Representations, 2018.
- [46] Yang Liu, Jeremy Bernstein, Markus Meister, and Yisong Yue. Learning by turning: Neural architecture aware optimisation, 2021. https://arxiv.org/abs/2102.07227
- [47] Ilya Loshchilov, Cheng-Ping Hsieh, Simeng Sun, and Boris Ginsburg. nGPT: Normalized transformer with representation learning on the hypersphere, 2024. https://arxiv.org/abs/2410.01131
- [48] Yonggan Fu, Xin Dong, Shizhe Diao, Matthijs Van keirsbilck, Hanrong Ye, Wonmin Byeon, Yashaswi Karnati, Lucas Liebenwein, Hannah Zhang, Nikolaus Binder, Maksim Khadkevich, Alexander Keller, Jan Kautz, Yingyan Celine Lin, and Pavlo Molchanov. Nemotron-Flash: Towards latency-optimal hybrid small language models, 2025. https://arxiv.org/abs/2511.18890
- [49] Jörg K. H. Franke, Urs Spiegelhalter, Marianna Nezhurina, Jenia Jitsev, Frank Hutter, and Michael Hefenbrock. Learning in compact spaces with approximately normalized transformer, 2025. https://arxiv.org/abs/2505.22014
- [50] Ruosi Wan, Zhanxing Zhu, Xiangyu Zhang, and Jian Sun. Spherical motion dynamics: Learning dynamics of normalized neural network using SGD and weight decay. In Advances in Neural Information Processing Systems, volume 34, pages 21759–21770. Curran Associates, Inc., 2021.
- [51] Weiyang Liu, Zhen Liu, Zhiding Yu, Bo Dai, Rongmei Lin, Yisen Wang, James M. Rehg, and Le Song. Decoupled networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
- [52] Takeru Miyato, Sindy Löwe, Andreas Geiger, and Max Welling. Artificial Kuramoto oscillatory neurons.
- [53]
- [54] Tero Karras, Miika Aittala, Jaakko Lehtinen, Janne Hellsten, Timo Aila, and Samuli Laine. Analyzing and improving the training dynamics of diffusion models, 2024. https://arxiv.org/abs/2312.02696
- [55] Louis Owen, Abhay Kumar, Nilabhra Roy Chowdhury, and Fabian Güra. Variance control via weight rescaling in LLM pre-training, 2025. https://arxiv.org/abs/2503.17500
- [56] Guodong Zhang, Chaoqi Wang, Bowen Xu, and Roger Grosse. Three mechanisms of weight decay regularization, 2018. https://arxiv.org/abs/1810.12281
- [57] Zhenxun Zhuang, Mingrui Liu, Ashok Cutkosky, and Francesco Orabona. Understanding AdamW through proximal methods and scale-freeness, 2022. https://arxiv.org/abs/2202.00089
- [58] Zhiyuan Li and Sanjeev Arora. An exponential learning rate schedule for deep learning, 2019. https://arxiv.org/abs/1910.07454
- [59] Byeongho Heo, Sanghyuk Chun, Seong Joon Oh, Dongyoon Han, Sangdoo Yun, Gyuwan Kim, Youngjung Uh, and Jung-Woo Ha. AdamP: Slowing down the slowdown for momentum optimizers on scale-invariant weights, 2021. https://arxiv.org/abs/2006.08217
- [60] Twan van Laarhoven. L2 regularization versus batch and weight normalization, 2017. https://arxiv.org/abs/1706.05350
- [61] Atli Kosson, Bettina Messmer, and Martin Jaggi. Analyzing & reducing the need for learning rate warmup in GPT training, 2024. https://arxiv.org/abs/2410.23922
- [62] Jianlin Su. Beyond MuP: 4. Ensuring parameter stability, Apr 2026. https://spaces.ac.cn/archives/11729. (In Chinese)
- [63] Lizhang Chen, Bo Liu, Kaizhao Liang, and Qiang Liu. Lion secretly solves constrained optimization: As Lyapunov predicts, 2025. https://arxiv.org/abs/2310.05898
- [64] Adrian S. Lewis and Hristo S. Sendov. Nonsmooth analysis of singular values. Part I: Theory. Set-Valued Analysis, 13(3):213–241, 2005.
- [65] John M. Lee. Smooth manifolds. In Introduction to Smooth Manifolds, pages 1–29. Springer, 2003.
- [66] Roger A. Horn and Charles R. Johnson. Topics in Matrix Analysis. Cambridge University Press, 1991.
- [67] Mark Rudelson and Roman Vershynin. Sampling from large matrices: An approach through geometric functional analysis. Journal of the ACM, 54(4):21:1–21:19, 2007.
- [68] G. W. Stewart and Ji-Guang Sun. Matrix Perturbation Theory. Computer Science and Scientific Computing. Academic Press, 1990.
- [69] Andrej Karpathy. NanoChat: The best ChatGPT that $100 can buy, 2025. https://github.com/karpathy/nanochat
- [70] Adaptive rotational equilibrium under spectral sphere.
discussion (0)