pith. machine review for the scientific record.

arxiv: 2605.12492 · v1 · submitted 2026-05-12 · 💻 cs.LG · stat.ML

Recognition: 2 theorem links


Pion: A Spectrum-Preserving Optimizer via Orthogonal Equivalence Transformation

Hanxuan Li, Kexuan Shi, Simon Buchholz, Weiyang Liu, Yandong Wen, Zeju Qiu

Pith reviewed 2026-05-13 04:48 UTC · model grok-4.3

classification 💻 cs.LG stat.ML
keywords spectrum-preserving optimizer · orthogonal equivalence transformation · singular values · spectral norm · LLM training · weight update rule · convergence analysis

The pith

Pion updates each weight matrix through left and right orthogonal transformations, preserving its singular values throughout training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Pion as a new optimizer for large language models that replaces additive updates with left and right orthogonal transformations on each weight matrix. Because orthogonal matrices leave singular values unchanged, the spectral norm of every weight stays fixed while the overall shape and orientation of the matrix can still adjust. This separation of spectrum control from directional movement is the core idea the authors want to test. If the mechanism works, training could avoid certain instabilities that arise when norms drift freely under standard methods like Adam. The authors derive the explicit update rule, study design variants, prove basic convergence properties, and report that the resulting optimizer matches or approaches the performance of conventional choices on both pretraining and finetuning tasks.

Core claim

Pion is a spectrum-preserving optimizer for LLM training based on orthogonal equivalence transformation. Unlike additive optimizers such as Adam and Muon, Pion updates each weight matrix through left and right orthogonal transformations, preserving its singular values throughout training. This yields an optimization mechanism that modulates the geometry of weight matrices while keeping their spectral norm fixed. The authors derive the Pion update rule, systematically examine its design choices, and analyze its convergence behavior along with several key properties. Empirical results show that Pion offers a stable and competitive alternative to standard optimizers for both LLM pretraining and finetuning.

What carries the argument

The orthogonal equivalence transformation: a weight matrix W is replaced by U W V^T where U and V are orthogonal matrices chosen to reduce the loss; this operation leaves the singular values of W exactly the same.
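
A minimal numerical sketch of this invariance (an editorial illustration, not the authors' implementation): build random orthogonal factors, apply the equivalence transformation, and confirm the singular values do not move.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_orthogonal(n, rng):
    # QR factorization of a Gaussian matrix yields a random orthogonal matrix
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return q

W = rng.standard_normal((256, 512))   # stand-in weight matrix
U = random_orthogonal(256, rng)       # left orthogonal factor
V = random_orthogonal(512, rng)       # right orthogonal factor

W_new = U @ W @ V.T                   # orthogonal equivalence transformation

s_old = np.linalg.svd(W, compute_uv=False)
s_new = np.linalg.svd(W_new, compute_uv=False)
print(np.max(np.abs(s_old - s_new)))  # ~1e-12: singular values (and spectral norm) unchanged
```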

If this is right

  • The spectral norm of every weight matrix remains constant for the entire run, removing one source of training instability.
  • The geometry of the weight matrices can still evolve, because their singular vectors (orientation) are free to rotate even though the singular values cannot change.
  • The same update rule applies without modification to both pretraining and finetuning stages.
  • Design parameters inside the orthogonal step (step size, choice of how to compute U and V) can be varied independently of the spectrum constraint.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The fixed-spectrum property may reduce the need for separate weight-decay or norm-clipping heuristics that are otherwise used to control scale.
  • Because the update never changes the rank or the set of singular values, the method could be useful in settings where one wants to preserve the initial conditioning of the network.
  • Extending the same left-right orthogonal step to attention or MLP blocks in other architectures might yield analogous spectrum control without extra regularization terms.

Load-bearing premise

That updates performed solely through orthogonal equivalence transformations can steer the loss downward to competitive minima without requiring additive changes to the weights.

What would settle it

Train a standard transformer with Pion on a common pretraining corpus and measure whether final perplexity is materially worse than the same run with Adam or whether the loss fails to decrease after the first few epochs.

Figures

Figures reproduced from arXiv: 2605.12492 by Hanxuan Li, Kexuan Shi, Simon Buchholz, Weiyang Liu, Yandong Wen, Zeju Qiu.

Figure 1. Comparison of POET and Pion (Green: learnable).
Figure 2. Inconsistent updates in Pion.
Figure 3. Validation loss comparison of different consistent update strategies; "⋆" denotes the achieved minimum validation loss.
Figure 4. Training loss curves of momentum designs; panels (a) and (b) show first-order-only and second-order-only momentum.
Figure 5. Training loss of bilateral and alternate update.
Figure 6. Comparison of different approximation schemes.
Figure 7. µP learning rate transfer across model scales.
Figure 8. Dynamics of four diagnostic indicators for monitoring pretraining stability.
Figure 9. Weight spectrum comparison.
Figure 10. Normalization-free pretraining (all normalization layers removed from the 60M LLaMA-based model).
Figure 11. Training Loss of DeepNet (LLaMA 60M baseline scaled from 8 to 200 layers).
Figure 12. Jacobian Norm in DeepNet (layer-wise shape-normalized Frobenius distance ∥Jℓ − I∥F, where Jℓ is the Jacobian of layer ℓ).
Figure 13. Training dynamics of evaluation accuracy.
Figure 14. µP learning rate grid search.
Figure 15. Indicators for stable pretraining: maximum attention logit in the attention blocks of layers 1, 12, and 24.
Figure 16. Comparison of the final spectrum with the initial spectrum.
read the original abstract

We introduce Pion, a spectrum-preserving optimizer for large language model (LLM) training based on orthogonal equivalence transformation. Unlike additive optimizers such as Adam and Muon, Pion updates each weight matrix through left and right orthogonal transformations, preserving its singular values throughout training. This yields an optimization mechanism that modulates the geometry of weight matrices while keeping their spectral norm fixed. We derive the Pion update rule, systematically examine its design choices, and analyze its convergence behavior along with several key properties. Empirical results show that Pion offers a stable and competitive alternative to standard optimizers for both LLM pretraining and finetuning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces Pion, a spectrum-preserving optimizer for LLM training. It updates each weight matrix W via left and right orthogonal transformations (W ← Q_l W Q_r) that leave the singular values unchanged, derives the corresponding update rule, examines design choices, analyzes convergence properties, and reports empirical competitiveness with Adam and Muon on both pretraining and finetuning tasks.

Significance. If the convergence analysis and empirical claims hold, the approach offers a distinct optimization paradigm that decouples geometric modulation from spectral scaling. This could improve training stability for large models by enforcing fixed spectral norms, and the explicit derivation plus convergence study would constitute a useful theoretical contribution in the space of manifold-constrained optimizers.

major comments (3)
  1. Convergence analysis section: the claimed convergence behavior rests on the assumption that every useful gradient direction can be realized by an orthogonal-equivalence update without loss of progress. The analysis must explicitly address whether the tangent-space projection onto the fixed-singular-value manifold can still guarantee descent when the loss landscape requires rescaling of singular values (as occurs routinely with Adam/Muon). Without a supporting lemma or counter-example discussion, the competitiveness claim is not yet load-bearing. (An editorial first-order sketch of this point follows these comments.)
  2. Update rule derivation: the paper states that the Pion rule is derived from standard orthogonal transformations, yet it is unclear how the left and right orthogonal matrices Q_l and Q_r are chosen from the gradient (e.g., via polar decomposition, SVD-based projection, or another mechanism). If this choice is not parameter-free and requires additional hyperparameters, the “spectrum-preserving” advantage over additive methods is reduced. (A hedged illustration of one such parameter-free construction follows this report.)
  3. Empirical section: the competitiveness claim for LLM pretraining and finetuning is central, but the reported results must include ablation on model scale, sequence length, and whether singular-value histograms remain exactly constant across training steps. If any run shows drift in the singular-value spectrum, the core invariance is violated and the comparison to Adam/Muon is undermined.
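
Regarding major comment 1, an editorial sketch (not taken from the paper) of the first-order quantity such a lemma would have to control, writing G for the loss gradient at W_t and skew(M) = (M − Mᵀ)/2: under an update W_{t+1} = e^{−ηA} W_t e^{ηB} with A and B skew-symmetric,

```latex
\delta\mathcal{L}
  \approx -\eta\,\langle G,\; A W_t - W_t B\rangle_F
  = -\eta\Big[\langle \operatorname{skew}(G W_t^{\top}),\, A\rangle_F
            - \langle \operatorname{skew}(W_t^{\top} G),\, B\rangle_F\Big],
\quad\text{so choosing } A = \operatorname{skew}(G W_t^{\top}),\; B = -\operatorname{skew}(W_t^{\top} G)
\text{ gives }
\delta\mathcal{L} \approx -\eta\Big[\|\operatorname{skew}(G W_t^{\top})\|_F^2
                                  + \|\operatorname{skew}(W_t^{\top} G)\|_F^2\Big] \le 0 .
```

On this sketch, first-order progress stalls exactly when both skew parts vanish, which is the regime where further descent would require rescaling singular values; that is the case the comment asks the convergence analysis to address explicitly.
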
minor comments (2)
  1. [Introduction] Notation for the orthogonal factors Q_l and Q_r should be introduced with an explicit equation immediately after the first mention of the update.
  2. [Abstract] The abstract claims “several key properties” are analyzed; these should be enumerated in a dedicated subsection or table for clarity.
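
On major comment 2, one parameter-free way to extract an orthogonal factor from a square gradient block is the orthogonal polar factor computed via SVD. The sketch below is a hedged editorial illustration of that operation, not the paper's actual update rule; the names G and eta and the skew-symmetrization are assumptions made for the example.

```python
import numpy as np

def orthogonal_polar_factor(M):
    """Nearest orthogonal matrix to M in Frobenius norm (polar decomposition M = Q H)."""
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt

rng = np.random.default_rng(1)
eta = 1e-2
G = rng.standard_normal((256, 256))       # stand-in for a square gradient block
A = G - G.T                               # skew-symmetric part: generates a rotation direction
Q_l = orthogonal_polar_factor(np.eye(256) - eta * A)  # small orthogonal step near the identity

print(np.allclose(Q_l.T @ Q_l, np.eye(256)))  # True: Q_l is exactly orthogonal
```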

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and insightful comments on our manuscript. We address each major point below and will revise the paper to incorporate clarifications and additional material where needed.

read point-by-point responses
  1. Referee: Convergence analysis section: the claimed convergence behavior rests on the assumption that every useful gradient direction can be realized by an orthogonal-equivalence update without loss of progress. The analysis must explicitly address whether the tangent-space projection onto the fixed-singular-value manifold can still guarantee descent when the loss landscape requires rescaling of singular values (as occurs routinely with Adam/Muon). Without a supporting lemma or counter-example discussion, the competitiveness claim is not yet load-bearing.

    Authors: We agree that the convergence section would benefit from greater explicitness on this point. In the revision we will add a lemma establishing that the tangent-space projection of the orthogonal-equivalence update still guarantees sufficient descent for gradient directions that do not require singular-value rescaling, together with a short discussion of the complementary role of spectral rescaling in other optimizers. This directly addresses the concern while preserving the manuscript's core claims. revision: yes

  2. Referee: Update rule derivation: the paper states that the Pion rule is derived from standard orthogonal transformations, yet it is unclear how the left and right orthogonal matrices Q_l and Q_r are chosen from the gradient (e.g., via polar decomposition, SVD-based projection, or another mechanism). If this choice is not parameter-free and requires additional hyperparameters, the “spectrum-preserving” advantage over additive methods is reduced.

    Authors: The choice of Q_l and Q_r is obtained via polar decomposition of the projected gradient components, which is a deterministic, parameter-free operation. We will expand Section 3 with an explicit algorithmic description and pseudocode to remove any ambiguity and confirm that no additional hyperparameters are introduced. revision: yes

  3. Referee: Empirical section: the competitiveness claim for LLM pretraining and finetuning is central, but the reported results must include ablation on model scale, sequence length, and whether singular-value histograms remain exactly constant across training steps. If any run shows drift in the singular-value spectrum, the core invariance is violated and the comparison to Adam/Muon is undermined.

    Authors: We will augment the empirical section with ablations across additional model scales and sequence lengths. We will also add singular-value histogram plots at multiple training checkpoints to document that the spectrum remains exactly constant, as required by the orthogonal equivalence construction; our existing runs already exhibit this invariance with no drift. revision: yes
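
An editorial sketch of the drift check described in response 3, assuming hypothetical checkpoints stored as .npz archives of named 2-D weight arrays (file names and structure are illustrative, not from the paper):

```python
import numpy as np

def max_spectrum_drift(ckpt_a, ckpt_b):
    """Largest absolute change in any singular value between two checkpoints.

    Both arguments are dicts mapping parameter names to 2-D weight arrays.
    Under exact orthogonal-equivalence updates this should stay near zero.
    """
    drift = 0.0
    for name, w_a in ckpt_a.items():
        s_a = np.linalg.svd(w_a, compute_uv=False)
        s_b = np.linalg.svd(ckpt_b[name], compute_uv=False)
        drift = max(drift, float(np.max(np.abs(s_a - s_b))))
    return drift

# Hypothetical usage with two saved checkpoints:
# early = dict(np.load("pion_step_001000.npz"))
# late = dict(np.load("pion_step_050000.npz"))
# print(max_spectrum_drift(early, late))
```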

Circularity Check

0 steps flagged

No circularity: derivation rests on standard SVD properties

full rationale

The paper defines Pion via left/right orthogonal multiplications on weight matrices and states that this preserves singular values. This follows immediately from the SVD definition (W = U Σ V^T yields Q_l W Q_r = (Q_l U) Σ (V^T Q_r) with the new factors orthogonal), which is an external linear-algebra fact rather than a self-referential construction or fitted input renamed as prediction. No equations in the abstract or description reduce the update rule, convergence claim, or empirical performance to the paper's own outputs or self-citations; the design choices and analysis are presented as independent examinations of the resulting optimizer.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available, so no specific free parameters, axioms, or invented entities can be extracted or evaluated.

pith-pipeline@v0.9.0 · 5416 in / 964 out tokens · 60751 ms · 2026-05-13T04:48:34.904582+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

94 extracted references · 94 canonical work pages · 20 internal anchors

  1. [1]

    gpt-oss-120b & gpt-oss-20b Model Card

    Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K Arora, Yu Bai, Bowen Baker, Haiming Bao, et al. gpt-oss-120b & gpt-oss-20b model card.arXiv preprint arXiv:2508.10925, 2025. 7

  2. [2]

    The Polar Express: Optimal Matrix Sign Methods and Their Application to the Muon Algorithm

    Noah Amsel, David Persson, Christopher Musco, and Robert M Gower. The polar express: Optimal matrix sign methods and their application to the muon algorithm.arXiv preprint arXiv:2505.16932, 2025. 9

  3. [3]

    Polaris: A post-training recipe for scaling reinforcement learning on advanced reasoning models, 2025

    Chenxin An, Zhihui Xie, Xiaonan Li, Lei Li, Jun Zhang, Shansan Gong, Ming Zhong, Jingjing Xu, Xipeng Qiu, Mingxuan Wang, and Lingpeng Kong. Polaris: A post-training recipe for scaling reinforcement learning on advanced reasoning models, 2025. 9, 23

  4. [4]

    Layer Normalization

    Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization.arXiv preprint arXiv:1607.06450, 2016. 3, 8, 9

  5. [5]

    Spectrally-normalized margin bounds for neural networks

    Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrally-normalized margin bounds for neural networks. InNIPS, 2017. 1

  6. [6]

    Riemannian adaptive optimization methods

    Gary Bécigneul and Octavian-Eugen Ganea. Riemannian adaptive optimization methods.arXiv preprint arXiv:1810.00760, 2018. 4

  7. [7]

    Modular duality in deep learning.arXiv preprint arXiv:2410.21265, 2024

    Jeremy Bernstein and Laker Newhouse. Modular duality in deep learning.arXiv preprint arXiv:2410.21265, 2024. 9

  8. [8]

    Old optimizer, new norm: An anthology

    Jeremy Bernstein and Laker Newhouse. Old optimizer, new norm: An anthology.arXiv preprint arXiv:2409.20325, 2024. 3, 9

  9. [9]

    Lora learns less and forgets less.arXiv preprint arXiv:2405.09673, 2024

    Dan Biderman, Jacob Portes, Jose Javier Gonzalez Ortiz, Mansheej Paul, Philip Greengard, Connor Jennings, Daniel King, Sam Havens, Vitaliy Chiley, Jonathan Frankle, et al. Lora learns less and forgets less.arXiv preprint arXiv:2405.09673, 2024. 8

  10. [10]

    Piqa: Reasoning about physical commonsense in natural language

    Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. Piqa: Reasoning about physical commonsense in natural language. InAAAI, 2020. 9

  11. [11]

    Optimization methods for large-scale machine learning.SIAM review, 60(2):223–311, 2018

    Léon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning.SIAM review, 60(2):223–311, 2018. 22

  12. [12]

    An introduction to optimization on smooth manifolds

    Nicolas Boumal.An introduction to optimization on smooth manifolds. Cambridge University Press, 2023. 2

  13. [13]

    Stochastic spectral descent for restricted boltzmann machines

    David Carlson, Volkan Cevher, and Lawrence Carin. Stochastic spectral descent for restricted boltzmann machines. InAISTATS, 2015. 9

  14. [14]

    Stochastic spectral descent for discrete graphical models.IEEE Journal of Selected Topics in Signal Processing, 10(2):296–311, 2016

    David Carlson, Ya-Ping Hsieh, Edo Collins, Lawrence Carin, and Volkan Cevher. Stochastic spectral descent for discrete graphical models.IEEE Journal of Selected Topics in Signal Processing, 10(2):296–311, 2016. 9

  15. [15]

    Muon optimizes under spectral norm constraints

    Lizhang Chen, Jonathan Li, and Qiang Liu. Muon optimizes under spectral norm constraints. arXiv preprint arXiv:2506.15054, 2025. 9

  16. [16]

    Evaluating Large Language Models Trained on Code

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374, 2021. 8, 23

  17. [17]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018. 9

  18. [18]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021. 8, 23

  19. [19]

    Deepseek-v4: Towards highly efficient million-token context intelligence, 2026

    DeepSeek-AI. Deepseek-v4: Towards highly efficient million-token context intelligence, 2026. 7 10

  20. [20]

    Scaling vision transformers to 22 billion parameters

    Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Peter Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, et al. Scaling vision transformers to 22 billion parameters. InICML, 2023. 7

  21. [21]

    Cogview: Mastering text-to-image generation via transformers

    Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, et al. Cogview: Mastering text-to-image generation via transformers. InNeurIPS, 2021. 1

  22. [22]

    Adaptive subgradient methods for online learning and stochastic optimization.Journal of machine learning research, 12(7), 2011

    John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization.Journal of machine learning research, 12(7), 2011. 4

  23. [23]

    The language model evaluation harness, 07 2024

    Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. The languag...

  24. [24]

    Understanding the difficulty of training deep feedforward neural networks

    Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. InAISTATS, 2010. 3

  25. [25]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024. 8, 22

  26. [26]

    Mano: Restriking manifold optimization for llm training.arXiv preprint arXiv:2601.23000, 2026

    Yufei Gu and Zeke Xie. Mano: Restriking manifold optimization for llm training.arXiv preprint arXiv:2601.23000, 2026. 3, 7

  27. [27]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025. 9, 23

  28. [28]

    Shampoo: Preconditioned stochastic tensor optimization

    Vineet Gupta, Tomer Koren, and Yoram Singer. Shampoo: Preconditioned stochastic tensor optimization. InICML, 2018. 9

  29. [29]

    Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems

    Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, et al. Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Paper...

  30. [30]

    Deepmath-103k: A large-scale, challenging, decontaminated, and verifiable mathematical dataset for advancing reasoning.arXiv preprint arXiv:2504.11456, 2025

    Zhiwei He, Tian Liang, Jiahao Xu, Qiuzhi Liu, Xingyu Chen, Yue Wang, Linfeng Song, Dian Yu, Zhenwen Liang, Wenxuan Wang, et al. Deepmath-103k: A large-scale, challenging, decontaminated, and verifiable mathematical dataset for advancing reasoning.arXiv preprint arXiv:2504.11456, 2025. 9, 23

  31. [31]

    Orthogonal recurrent neural networks with scaled cayley transform

    Kyle Helfrich, Devin Willmott, and Qiang Ye. Orthogonal recurrent neural networks with scaled cayley transform. InICML, 2018. 5

  32. [32]

    Query-key normalization for transformers

    Alex Henry, Prudhvi Raj Dachapally, Shubham Shantaram Pawar, and Yuxuan Chen. Query-key normalization for transformers. InFindings of EMNLP, 2020. 1, 9

  33. [33]

    Training Compute-Optimal Large Language Models

    Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, DDL Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models.arXiv preprint arXiv:2203.15556, 10, 2022. 7

  34. [34]

    Batch normalization: Accelerating deep network training by reducing internal covariate shift

    Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. InICML, 2015. 3

  35. [35]

    On computation and generalization of generative adversarial networks under spectrum control

    Haoming Jiang, Zhehui Chen, Minshuo Chen, Feng Liu, Dingding Wang, and Tuo Zhao. On computation and generalization of generative adversarial networks under spectrum control. In ICLR, 2019. 1

  36. [36]

    Muon: An optimizer for hidden layers in neural networks, 2024

    Keller Jordan, Yuchen Jin, Vlado Boza, Jiacheng You, Franz Cesista, Laker Newhouse, and Jeremy Bernstein. Muon: An optimizer for hidden layers in neural networks, 2024. 1, 7, 9 11

  37. [37]

    Adam: A Method for Stochastic Optimization

    Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization.arXiv preprint arXiv:1412.6980, 2014. 1, 4, 9

  38. [38]

    Efficient memory management for large language model serving with pagedattention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. InProceedings of the 29th symposium on operating systems principles, pages 611–626, 2023. 23

  39. [39]

    Solving quantitative reasoning problems with language models.Advances in neural information processing systems, 35:3843–3857, 2022

    Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, et al. Solving quantitative reasoning problems with language models.Advances in neural information processing systems, 35:3843–3857, 2022. 9, 23

  40. [40]

    Cheap orthogonal constraints in neural networks: A simple parametrization of the orthogonal and unitary group

    Mario Lezcano-Casado and David Martınez-Rubio. Cheap orthogonal constraints in neural networks: A simple parametrization of the orthogonal and unitary group. InICML, 2019. 2, 9

  41. [41]

    Efficient riemannian optimization on the stiefel manifold via the cayley transform.arXiv preprint arXiv:2002.01113, 2020

    Jun Li, Li Fuxin, and Sinisa Todorovic. Efficient riemannian optimization on the stiefel manifold via the cayley transform.arXiv preprint arXiv:2002.01113, 2020. 5

  42. [42]

    Normuon: Making muon more efficient and scalable.arXiv preprint arXiv:2510.05491, 2025

    Zichong Li, Liming Liu, Chen Liang, Weizhu Chen, and Tuo Zhao. Normuon: Making muon more efficient and scalable.arXiv preprint arXiv:2510.05491, 2025. 1, 9

  43. [43]

    Regularizing neural networks via minimizing hyperspherical energy

    Rongmei Lin, Weiyang Liu, Zhen Liu, Chen Feng, Zhiding Yu, James M. Rehg, Li Xiong, and Le Song. Regularizing neural networks via minimizing hyperspherical energy. InCVPR, 2020. 1

  44. [44]

    DeepSeek-V3 Technical Report

    Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024. 7

  45. [45]

    Muon is Scalable for LLM Training

    Jingyuan Liu, Jianlin Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, Yidao Qin, Weixin Xu, Enzhe Lu, Junjie Yan, et al. Muon is scalable for llm training.arXiv preprint arXiv:2502.16982, 2025. 3, 6, 7, 9

  46. [46]

    Learning towards minimum hyperspherical energy

    Weiyang Liu, Rongmei Lin, Zhen Liu, Lixin Liu, Zhiding Yu, Bo Dai, and Le Song. Learning towards minimum hyperspherical energy. InNeurIPS, 2018. 1, 9

  47. [47]

    Orthogonal over-parameterized training

    Weiyang Liu, Rongmei Lin, Zhen Liu, James M. Rehg, Liam Paull, Li Xiong, Le Song, and Adrian Weller. Orthogonal over-parameterized training. InCVPR, 2021. 1, 5

  48. [48]

    Learning with hyperspherical uniformity

    Weiyang Liu, Rongmei Lin, Zhen Liu, Li Xiong, Bernhard Schölkopf, and Adrian Weller. Learning with hyperspherical uniformity. InAISTATS, 2021. 1, 9

  49. [49]

    Decoupled weight decay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. InICLR, 2019. 1, 7

  50. [50]

    American mathematics contest 12 (amc 12), November 2023

    MAA. American mathematics contest 12 (amc 12), November 2023. 9, 23

  51. [51]

    American invitational mathematics examination (aime), February 2024

    MAA. American invitational mathematics examination (aime), February 2024. 9, 23

  52. [52]

    American invitational mathematics examination (aime), February 2025

    MAA. American invitational mathematics examination (aime), February 2025. 9, 23

  53. [53]

    The stability-plasticity dilemma: Investigating the continuum from catastrophic forgetting to age-limited learning effects.Frontiers in psychology, 4:504, 2013

    Martial Mermillod, Aurélia Bugaiska, and Patrick Bonin. The stability-plasticity dilemma: Investigating the continuum from catastrophic forgetting to age-limited learning effects.Frontiers in psychology, 4:504, 2013. 8

  54. [54]

    Efficient orthogonal parametrisation of recurrent neural networks using householder reflections

    Zakaria Mhammedi, Andrew Hellicar, Ashfaqur Rahman, and James Bailey. Efficient orthogonal parametrisation of recurrent neural networks using householder reflections. InICML, 2017. 2, 9

  55. [55]

    Spectral normalization for generative adversarial networks

    Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks.arXiv preprint arXiv:1802.05957, 2018. 1

  56. [56]

    When does sparsity mitigate the curse of depth in llms.arXiv preprint arXiv:2603.15389, 2026

    Dilxat Muhtar, Xinyuan Song, Sebastian Pokutta, Max Zimmer, Nico Pelleriti, Thomas Hofmann, and Shiwei Liu. When does sparsity mitigate the curse of depth in llms.arXiv preprint arXiv:2603.15389, 2026. 8 12

  57. [57]

    A method for solving the convex programming problem with convergence rate o (1/k2)

    Yurii Nesterov. A method for solving the convex programming problem with convergence rate o (1/k2). InDokl akad nauk Sssr, volume 269, page 543, 1983. 4

  58. [58]

    Training deep learning models with norm-constrained LMOs

    Thomas Pethick, Wanyun Xie, Kimon Antonakopoulos, Zhenyu Zhu, Antonio Silveti-Falls, and Volkan Cevher. Training deep learning models with norm-constrained lmos.arXiv preprint arXiv:2502.07529, 2025. 9

  59. [59]

    Some methods of speeding up the convergence of iteration methods.Ussr computational mathematics and mathematical physics, 4(5):1–17, 1964

    Boris T Polyak. Some methods of speeding up the convergence of iteration methods.Ussr computational mathematics and mathematical physics, 4(5):1–17, 1964. 4

  60. [60]

    Reparameterized llm training via orthogonal equivalence transformation

    Zeju Qiu, Simon Buchholz, Tim Z Xiao, Maximilian Dax, Bernhard Schölkopf, and Weiyang Liu. Reparameterized llm training via orthogonal equivalence transformation. InNeurIPS, 2025. 1, 2, 3, 5, 7, 9

  61. [61]

    Poet-x: Memory-efficient llm training by scaling orthogonal transformation

    Zeju Qiu, Lixin Liu, Adrian Weller, Han Shi, and Weiyang Liu. Poet-x: Memory-efficient llm training by scaling orthogonal transformation. InICML, 2026. 1, 2, 9

  62. [62]

    Controlling text-to-image diffusion by orthogonal finetuning

    Zeju Qiu, Weiyang Liu, Haiwen Feng, Yuxuan Xue, Yao Feng, Zhen Liu, Dan Zhang, Adrian Weller, and Bernhard Schölkopf. Controlling text-to-image diffusion by orthogonal finetuning. InNeurIPS, 2023. 5

  63. [63]

    Exploring the limits of transfer learning with a unified text-to-text transformer.Journal of machine learning research, 21(140):1–67, 2020

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer.Journal of machine learning research, 21(140):1–67, 2020. 3, 7

  64. [64]

    On the convergence of adam and beyond.arXiv preprint arXiv:1904.09237, 2019

    Sashank J Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of adam and beyond.arXiv preprint arXiv:1904.09237, 2019. 4

  65. [65]

    A stochastic approximation method.The annals of mathematical statistics, pages 400–407, 1951

    Herbert Robbins and Sutton Monro. A stochastic approximation method.The annals of mathematical statistics, pages 400–407, 1951. 22

  66. [66]

    Methods of improving llm training stability.arXiv preprint arXiv:2410.16682, 2024

    Oleg Rybakov, Mike Chrzanowski, Peter Dykas, Jinze Xue, and Ben Lanir. Methods of improving llm training stability.arXiv preprint arXiv:2410.16682, 2024. 7

  67. [67]

    Winogrande: An adversarial winograd schema challenge at scale.Communications of the ACM, 64(9):99–106,

    Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale.Communications of the ACM, 64(9):99–106,

  68. [68]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024. 9, 23

  69. [69]

    GLU Variants Improve Transformer

    Noam Shazeer. Glu variants improve transformer.arXiv preprint arXiv:2002.05202, 2020. 3

  70. [70]

    HybridFlow: A Flexible and Efficient RLHF Framework

    Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework.arXiv preprint arXiv: 2409.19256, 2024. 9, 23

  71. [71]

    Optimization techniques on riemannian manifolds.arXiv preprint arXiv:1407.5965, 2014

    Steven Thomas Smith. Optimization techniques on riemannian manifolds.arXiv preprint arXiv:1407.5965, 2014. 4

  72. [72]

    The curse of depth in large language models.arXiv preprint arXiv:2502.05795, 2025

    Wenfang Sun, Xinyuan Song, Pengxiang Li, Lu Yin, Yefeng Zheng, and Shiwei Liu. The curse of depth in large language models.arXiv preprint arXiv:2502.05795, 2025. 8

  73. [73]

    Gemma 2: Improving Open Language Models at a Practical Size

    Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, et al. Gemma 2: Improving open language models at a practical size.arXiv preprint arXiv:2408.00118,

  74. [74]

    Kimi K2: Open Agentic Intelligence

    Kimi Team, Yifan Bai, Yiping Bao, Y Charles, Cheng Chen, Guanduo Chen, Haiting Chen, Huarong Chen, Jiahao Chen, Ningxin Chen, et al. Kimi k2: Open agentic intelligence.arXiv preprint arXiv:2507.20534, 2025. 1, 6, 9

  75. [75]

    Qwen2.5: A party of foundation models, 2024

    Qwen Team. Qwen2.5: A party of foundation models, 2024. 8, 22 13

  76. [76]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models.arXiv preprint arXiv:2307.09288, 2023. 3

  77. [77]

    Orthogonalising gradients to speed up neural network optimisation.arXiv preprint arXiv:2202.07052, 2022

    Mark Tuddenham, Adam Prügel-Bennett, and Jonathan Hare. Orthogonalising gradients to speed up neural network optimisation.arXiv preprint arXiv:2202.07052, 2022. 9

  78. [78]

    Soap: Improving and stabilizing shampoo using adam.arXiv preprint arXiv:2409.11321, 2024

    Nikhil Vyas, Depen Morwani, Rosie Zhao, Mujin Kwun, Itai Shapira, David Brandfonbrener, Lucas Janson, and Sham Kakade. Soap: Improving and stabilizing shampoo using adam.arXiv preprint arXiv:2409.11321, 2024. 9

  79. [79]

    Magicoder: Empowering code generation with oss-instruct

    Yuxiang Wei, Zhe Wang, Jiawei Liu, Yifeng Ding, and Lingming Zhang. Magicoder: Empowering code generation with oss-instruct. InICML, 2024. 8, 22

  80. [80]

    Small-scale proxies for large-scale transformer training instabilities

    Mitchell Wortsman, Peter J Liu, Lechao Xiao, Katie Everett, Alex Alemi, Ben Adlam, John D Co-Reyes, Izzeddin Gur, Abhishek Kumar, Roman Novak, et al. Small-scale proxies for large-scale transformer training instabilities.arXiv preprint arXiv:2309.14322, 2023. 7

Showing first 80 references.