pith. machine review for the scientific record.

arxiv: 2605.12492 · v1 · submitted 2026-05-12 · 💻 cs.LG · stat.ML

Recognition: 2 theorem links


Pion: A Spectrum-Preserving Optimizer via Orthogonal Equivalence Transformation

Hanxuan Li, Kexuan Shi, Simon Buchholz, Weiyang Liu, Yandong Wen, Zeju Qiu

Pith reviewed 2026-05-13 04:48 UTC · model grok-4.3

classification 💻 cs.LG stat.ML
keywords spectrum-preserving optimizer · orthogonal equivalence transformation · singular values · spectral norm · LLM training · weight update rule · convergence analysis

The pith

Pion updates each weight matrix through left and right orthogonal transformations, preserving its singular values throughout training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Pion as a new optimizer for large language models that replaces additive updates with left and right orthogonal transformations on each weight matrix. Because orthogonal matrices leave singular values unchanged, the spectral norm of every weight stays fixed while the overall shape and orientation of the matrix can still adjust. This separation of spectrum control from directional movement is the core idea the authors want to test. If the mechanism works, training could avoid certain instabilities that arise when norms drift freely under standard methods like Adam. The authors derive the explicit update rule, study design variants, prove basic convergence properties, and report that the resulting optimizer matches or approaches the performance of conventional choices on both pretraining and finetuning tasks.

Core claim

Pion is a spectrum-preserving optimizer for LLM training based on orthogonal equivalence transformation. Unlike additive optimizers such as Adam and Muon, Pion updates each weight matrix through left and right orthogonal transformations, preserving its singular values throughout training. This yields an optimization mechanism that modulates the geometry of weight matrices while keeping their spectral norm fixed. The authors derive the Pion update rule, systematically examine its design choices, and analyze its convergence behavior along with several key properties. Empirical results show that Pion offers a stable and competitive alternative to standard optimizers for both LLM pretraining and finetuning.

What carries the argument

The orthogonal equivalence transformation: a weight matrix W is replaced by U W V^T where U and V are orthogonal matrices chosen to reduce the loss; this operation leaves the singular values of W exactly the same.
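
A minimal numerical sketch of this invariance (an editorial illustration, not the authors' implementation): build random orthogonal factors, apply the equivalence transformation, and confirm the singular values do not move.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_orthogonal(n, rng):
    # QR factorization of a Gaussian matrix yields a random orthogonal matrix
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return q

W = rng.standard_normal((256, 512))   # stand-in weight matrix
U = random_orthogonal(256, rng)       # left orthogonal factor
V = random_orthogonal(512, rng)       # right orthogonal factor

W_new = U @ W @ V.T                   # orthogonal equivalence transformation

s_old = np.linalg.svd(W, compute_uv=False)
s_new = np.linalg.svd(W_new, compute_uv=False)
print(np.max(np.abs(s_old - s_new)))  # ~1e-12: singular values (and spectral norm) unchanged
```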

If this is right

  • The spectral norm of every weight matrix remains constant for the entire run, removing one source of training instability.
  • The geometry of the weight matrices can still evolve, because their singular vectors (orientation) are free to rotate even though the singular values cannot change.
  • The same update rule applies without modification to both pretraining and finetuning stages.
  • Design parameters inside the orthogonal step (step size, choice of how to compute U and V) can be varied independently of the spectrum constraint.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The fixed-spectrum property may reduce the need for separate weight-decay or norm-clipping heuristics that are otherwise used to control scale.
  • Because the update never changes the rank or the set of singular values, the method could be useful in settings where one wants to preserve the initial conditioning of the network.
  • Extending the same left-right orthogonal step to attention or MLP blocks in other architectures might yield analogous spectrum control without extra regularization terms.

Load-bearing premise

That updates performed solely through orthogonal equivalence transformations can steer the loss downward to competitive minima without requiring additive changes to the weights.

What would settle it

Train a standard transformer with Pion on a common pretraining corpus and measure whether final perplexity is materially worse than the same run with Adam or whether the loss fails to decrease after the first few epochs.

Figures

Figures reproduced from arXiv: 2605.12492 by Hanxuan Li, Kexuan Shi, Simon Buchholz, Weiyang Liu, Yandong Wen, Zeju Qiu.

Figure 1. Comparison of POET and Pion (Green: learnable).
Figure 2. Inconsistent updates in Pion.
Figure 3. Validation loss comparison of different consistent update strategies; "⋆" denotes the achieved minimum validation loss.
Figure 4. Training loss curves of momentum designs; panels (a) and (b) show first-order-only and second-order-only momentum.
Figure 5. Training loss of bilateral and alternate update.
Figure 6. Comparison of different approximation schemes.
Figure 7. µP learning rate transfer across model scales.
Figure 8. Dynamics of four diagnostic indicators for monitoring pretraining stability.
Figure 9. Weight spectrum comparison.
Figure 10. Normalization-free pretraining (all normalization layers removed from the 60M LLaMA-based model).
Figure 11. Training Loss of DeepNet (LLaMA 60M baseline scaled from 8 to 200 layers).
Figure 12. Jacobian Norm in DeepNet (layer-wise shape-normalized Frobenius distance ∥Jℓ − I∥F, where Jℓ is the Jacobian of layer ℓ).
Figure 13. Training dynamics of evaluation accuracy.
Figure 14. µP learning rate grid search.
Figure 15. Indicators for stable pretraining: maximum attention logit in the attention blocks of layers 1, 12, and 24.
Figure 16. Comparison of the final spectrum with the initial spectrum.
read the original abstract

We introduce Pion, a spectrum-preserving optimizer for large language model (LLM) training based on orthogonal equivalence transformation. Unlike additive optimizers such as Adam and Muon, Pion updates each weight matrix through left and right orthogonal transformations, preserving its singular values throughout training. This yields an optimization mechanism that modulates the geometry of weight matrices while keeping their spectral norm fixed. We derive the Pion update rule, systematically examine its design choices, and analyze its convergence behavior along with several key properties. Empirical results show that Pion offers a stable and competitive alternative to standard optimizers for both LLM pretraining and finetuning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces Pion, a spectrum-preserving optimizer for LLM training. It updates each weight matrix W via left and right orthogonal transformations (W ← Q_l W Q_r) that leave the singular values unchanged, derives the corresponding update rule, examines design choices, analyzes convergence properties, and reports empirical competitiveness with Adam and Muon on both pretraining and finetuning tasks.

Significance. If the convergence analysis and empirical claims hold, the approach offers a distinct optimization paradigm that decouples geometric modulation from spectral scaling. This could improve training stability for large models by enforcing fixed spectral norms, and the explicit derivation plus convergence study would constitute a useful theoretical contribution in the space of manifold-constrained optimizers.

major comments (3)
  1. Convergence analysis section: the claimed convergence behavior rests on the assumption that every useful gradient direction can be realized by an orthogonal-equivalence update without loss of progress. The analysis must explicitly address whether the tangent-space projection onto the fixed-singular-value manifold can still guarantee descent when the loss landscape requires rescaling of singular values (as occurs routinely with Adam/Muon). Without a supporting lemma or counter-example discussion, the competitiveness claim is not yet load-bearing. (An editorial first-order sketch of this point follows these comments.)
  2. Update rule derivation: the paper states that the Pion rule is derived from standard orthogonal transformations, yet it is unclear how the left and right orthogonal matrices Q_l and Q_r are chosen from the gradient (e.g., via polar decomposition, SVD-based projection, or another mechanism). If this choice is not parameter-free and requires additional hyperparameters, the “spectrum-preserving” advantage over additive methods is reduced. (A hedged illustration of one such parameter-free construction follows this report.)
  3. Empirical section: the competitiveness claim for LLM pretraining and finetuning is central, but the reported results must include ablation on model scale, sequence length, and whether singular-value histograms remain exactly constant across training steps. If any run shows drift in the singular-value spectrum, the core invariance is violated and the comparison to Adam/Muon is undermined.
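
Regarding major comment 1, an editorial sketch (not taken from the paper) of the first-order quantity such a lemma would have to control, writing G for the loss gradient at W_t and skew(M) = (M − Mᵀ)/2: under an update W_{t+1} = e^{−ηA} W_t e^{ηB} with A and B skew-symmetric,

```latex
\delta\mathcal{L}
  \approx -\eta\,\langle G,\; A W_t - W_t B\rangle_F
  = -\eta\Big[\langle \operatorname{skew}(G W_t^{\top}),\, A\rangle_F
            - \langle \operatorname{skew}(W_t^{\top} G),\, B\rangle_F\Big],
\quad\text{so choosing } A = \operatorname{skew}(G W_t^{\top}),\; B = -\operatorname{skew}(W_t^{\top} G)
\text{ gives }
\delta\mathcal{L} \approx -\eta\Big[\|\operatorname{skew}(G W_t^{\top})\|_F^2
                                  + \|\operatorname{skew}(W_t^{\top} G)\|_F^2\Big] \le 0 .
```

On this sketch, first-order progress stalls exactly when both skew parts vanish, which is the regime where further descent would require rescaling singular values; that is the case the comment asks the convergence analysis to address explicitly.
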
minor comments (2)
  1. [Introduction] Notation for the orthogonal factors Q_l and Q_r should be introduced with an explicit equation immediately after the first mention of the update.
  2. [Abstract] The abstract claims “several key properties” are analyzed; these should be enumerated in a dedicated subsection or table for clarity.
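
On major comment 2, one parameter-free way to extract an orthogonal factor from a square gradient block is the orthogonal polar factor computed via SVD. The sketch below is a hedged editorial illustration of that operation, not the paper's actual update rule; the names G and eta and the skew-symmetrization are assumptions made for the example.

```python
import numpy as np

def orthogonal_polar_factor(M):
    """Nearest orthogonal matrix to M in Frobenius norm (polar decomposition M = Q H)."""
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt

rng = np.random.default_rng(1)
eta = 1e-2
G = rng.standard_normal((256, 256))       # stand-in for a square gradient block
A = G - G.T                               # skew-symmetric part: generates a rotation direction
Q_l = orthogonal_polar_factor(np.eye(256) - eta * A)  # small orthogonal step near the identity

print(np.allclose(Q_l.T @ Q_l, np.eye(256)))  # True: Q_l is exactly orthogonal
```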

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and insightful comments on our manuscript. We address each major point below and will revise the paper to incorporate clarifications and additional material where needed.

read point-by-point responses
  1. Referee: Convergence analysis section: the claimed convergence behavior rests on the assumption that every useful gradient direction can be realized by an orthogonal-equivalence update without loss of progress. The analysis must explicitly address whether the tangent-space projection onto the fixed-singular-value manifold can still guarantee descent when the loss landscape requires rescaling of singular values (as occurs routinely with Adam/Muon). Without a supporting lemma or counter-example discussion, the competitiveness claim is not yet load-bearing.

    Authors: We agree that the convergence section would benefit from greater explicitness on this point. In the revision we will add a lemma establishing that the tangent-space projection of the orthogonal-equivalence update still guarantees sufficient descent for gradient directions that do not require singular-value rescaling, together with a short discussion of the complementary role of spectral rescaling in other optimizers. This directly addresses the concern while preserving the manuscript's core claims. revision: yes

  2. Referee: Update rule derivation: the paper states that the Pion rule is derived from standard orthogonal transformations, yet it is unclear how the left and right orthogonal matrices Q_l and Q_r are chosen from the gradient (e.g., via polar decomposition, SVD-based projection, or another mechanism). If this choice is not parameter-free and requires additional hyperparameters, the “spectrum-preserving” advantage over additive methods is reduced.

    Authors: The choice of Q_l and Q_r is obtained via polar decomposition of the projected gradient components, which is a deterministic, parameter-free operation. We will expand Section 3 with an explicit algorithmic description and pseudocode to remove any ambiguity and confirm that no additional hyperparameters are introduced. revision: yes

  3. Referee: Empirical section: the competitiveness claim for LLM pretraining and finetuning is central, but the reported results must include ablation on model scale, sequence length, and whether singular-value histograms remain exactly constant across training steps. If any run shows drift in the singular-value spectrum, the core invariance is violated and the comparison to Adam/Muon is undermined.

    Authors: We will augment the empirical section with ablations across additional model scales and sequence lengths. We will also add singular-value histogram plots at multiple training checkpoints to document that the spectrum remains exactly constant, as required by the orthogonal equivalence construction; our existing runs already exhibit this invariance with no drift. revision: yes
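
An editorial sketch of the drift check described in response 3, assuming hypothetical checkpoints stored as .npz archives of named 2-D weight arrays (file names and structure are illustrative, not from the paper):

```python
import numpy as np

def max_spectrum_drift(ckpt_a, ckpt_b):
    """Largest absolute change in any singular value between two checkpoints.

    Both arguments are dicts mapping parameter names to 2-D weight arrays.
    Under exact orthogonal-equivalence updates this should stay near zero.
    """
    drift = 0.0
    for name, w_a in ckpt_a.items():
        s_a = np.linalg.svd(w_a, compute_uv=False)
        s_b = np.linalg.svd(ckpt_b[name], compute_uv=False)
        drift = max(drift, float(np.max(np.abs(s_a - s_b))))
    return drift

# Hypothetical usage with two saved checkpoints:
# early = dict(np.load("pion_step_001000.npz"))
# late = dict(np.load("pion_step_050000.npz"))
# print(max_spectrum_drift(early, late))
```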

Circularity Check

0 steps flagged

No circularity: derivation rests on standard SVD properties

full rationale

The paper defines Pion via left/right orthogonal multiplications on weight matrices and states that this preserves singular values. This follows immediately from the SVD definition (W = U Σ V^T yields Q_l W Q_r = (Q_l U) Σ (V^T Q_r) with the new factors orthogonal), which is an external linear-algebra fact rather than a self-referential construction or fitted input renamed as prediction. No equations in the abstract or description reduce the update rule, convergence claim, or empirical performance to the paper's own outputs or self-citations; the design choices and analysis are presented as independent examinations of the resulting optimizer.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available, so no specific free parameters, axioms, or invented entities can be extracted or evaluated.

pith-pipeline@v0.9.0 · 5416 in / 964 out tokens · 60751 ms · 2026-05-13T04:48:34.904582+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

94 extracted references · 94 canonical work pages · 20 internal anchors

  1. [1]

    gpt-oss-120b & gpt-oss-20b Model Card

    Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K Arora, Yu Bai, Bowen Baker, Haiming Bao, et al. gpt-oss-120b & gpt-oss-20b model card.arXiv preprint arXiv:2508.10925, 2025. 7

  2. [2]

    The Polar Express: Optimal Matrix Sign Methods and Their Application to the Muon Algorithm

    Noah Amsel, David Persson, Christopher Musco, and Robert M Gower. The polar express: Optimal matrix sign methods and their application to the muon algorithm.arXiv preprint arXiv:2505.16932, 2025. 9

  3. [3]

    Polaris: A post-training recipe for scaling reinforcement learning on advanced reasoning models, 2025

    Chenxin An, Zhihui Xie, Xiaonan Li, Lei Li, Jun Zhang, Shansan Gong, Ming Zhong, Jingjing Xu, Xipeng Qiu, Mingxuan Wang, and Lingpeng Kong. Polaris: A post-training recipe for scaling reinforcement learning on advanced reasoning models, 2025. 9, 23

  4. [4]

    Layer Normalization

    Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization.arXiv preprint arXiv:1607.06450, 2016. 3, 8, 9

  5. [5]

    Spectrally-normalized margin bounds for neural networks

    Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrally-normalized margin bounds for neural networks. InNIPS, 2017. 1

  6. [6]

    Riemannian adaptive optimization methods

    Gary Bécigneul and Octavian-Eugen Ganea. Riemannian adaptive optimization methods.arXiv preprint arXiv:1810.00760, 2018. 4

  7. [7]

    Modular duality in deep learning.arXiv preprint arXiv:2410.21265, 2024

    Jeremy Bernstein and Laker Newhouse. Modular duality in deep learning.arXiv preprint arXiv:2410.21265, 2024. 9

  8. [8]

    Old optimizer, new norm: An anthology

    Jeremy Bernstein and Laker Newhouse. Old optimizer, new norm: An anthology.arXiv preprint arXiv:2409.20325, 2024. 3, 9

  9. [9]

    Lora learns less and forgets less.arXiv preprint arXiv:2405.09673, 2024

    Dan Biderman, Jacob Portes, Jose Javier Gonzalez Ortiz, Mansheej Paul, Philip Greengard, Connor Jennings, Daniel King, Sam Havens, Vitaliy Chiley, Jonathan Frankle, et al. Lora learns less and forgets less.arXiv preprint arXiv:2405.09673, 2024. 8

  10. [10]

    Piqa: Reasoning about physical commonsense in natural language

    Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. Piqa: Reasoning about physical commonsense in natural language. InAAAI, 2020. 9

  11. [11]

    Optimization methods for large-scale machine learning.SIAM review, 60(2):223–311, 2018

    Léon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning.SIAM review, 60(2):223–311, 2018. 22

  12. [12]

    An introduction to optimization on smooth manifolds

    Nicolas Boumal.An introduction to optimization on smooth manifolds. Cambridge University Press, 2023. 2

  13. [13]

    Stochastic spectral descent for restricted boltzmann machines

    David Carlson, Volkan Cevher, and Lawrence Carin. Stochastic spectral descent for restricted boltzmann machines. InAISTATS, 2015. 9

  14. [14]

    Stochastic spectral descent for discrete graphical models.IEEE Journal of Selected Topics in Signal Processing, 10(2):296–311, 2016

    David Carlson, Ya-Ping Hsieh, Edo Collins, Lawrence Carin, and Volkan Cevher. Stochastic spectral descent for discrete graphical models.IEEE Journal of Selected Topics in Signal Processing, 10(2):296–311, 2016. 9

  15. [15]

    Muon optimizes under spectral norm constraints

    Lizhang Chen, Jonathan Li, and Qiang Liu. Muon optimizes under spectral norm constraints. arXiv preprint arXiv:2506.15054, 2025. 9

  16. [16]

    Evaluating Large Language Models Trained on Code

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374, 2021. 8, 23

  17. [17]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018. 9

  18. [18]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021. 8, 23

  19. [19]

    Deepseek-v4: Towards highly efficient million-token context intelligence, 2026

    DeepSeek-AI. Deepseek-v4: Towards highly efficient million-token context intelligence, 2026. 7 10

  20. [20]

    Scaling vision transformers to 22 billion parameters

    Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Peter Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, et al. Scaling vision transformers to 22 billion parameters. InICML, 2023. 7

  21. [21]

    Cogview: Mastering text-to-image generation via transformers

    Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, et al. Cogview: Mastering text-to-image generation via transformers. InNeurIPS, 2021. 1

  22. [22]

    Adaptive subgradient methods for online learning and stochastic optimization.Journal of machine learning research, 12(7), 2011

    John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization.Journal of machine learning research, 12(7), 2011. 4

  23. [23]

    The language model evaluation harness, 07 2024

    Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. The languag...

  24. [24]

    Understanding the difficulty of training deep feedforward neural networks

    Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. InAISTATS, 2010. 3

  25. [25]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024. 8, 22

  26. [26]

    Mano: Restriking manifold optimization for llm training.arXiv preprint arXiv:2601.23000, 2026

    Yufei Gu and Zeke Xie. Mano: Restriking manifold optimization for llm training.arXiv preprint arXiv:2601.23000, 2026. 3, 7

  27. [27]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025. 9, 23

  28. [28]

    Shampoo: Preconditioned stochastic tensor optimization

    Vineet Gupta, Tomer Koren, and Yoram Singer. Shampoo: Preconditioned stochastic tensor optimization. InICML, 2018. 9

  29. [29]

    Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems

    Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, et al. Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Paper...

  30. [30]

    Deepmath-103k: A large-scale, challenging, decontaminated, and verifiable mathematical dataset for advancing reasoning.arXiv preprint arXiv:2504.11456, 2025

    Zhiwei He, Tian Liang, Jiahao Xu, Qiuzhi Liu, Xingyu Chen, Yue Wang, Linfeng Song, Dian Yu, Zhenwen Liang, Wenxuan Wang, et al. Deepmath-103k: A large-scale, challenging, decontaminated, and verifiable mathematical dataset for advancing reasoning.arXiv preprint arXiv:2504.11456, 2025. 9, 23

  31. [31]

    Orthogonal recurrent neural networks with scaled cayley transform

    Kyle Helfrich, Devin Willmott, and Qiang Ye. Orthogonal recurrent neural networks with scaled cayley transform. InICML, 2018. 5

  32. [32]

    Query-key normalization for transformers

    Alex Henry, Prudhvi Raj Dachapally, Shubham Shantaram Pawar, and Yuxuan Chen. Query-key normalization for transformers. InFindings of EMNLP, 2020. 1, 9

  33. [33]

    Training Compute-Optimal Large Language Models

    Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, DDL Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models.arXiv preprint arXiv:2203.15556, 10, 2022. 7

  34. [34]

    Batch normalization: Accelerating deep network training by reducing internal covariate shift

    Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. InICML, 2015. 3

  35. [35]

    On computation and generalization of generative adversarial networks under spectrum control

    Haoming Jiang, Zhehui Chen, Minshuo Chen, Feng Liu, Dingding Wang, and Tuo Zhao. On computation and generalization of generative adversarial networks under spectrum control. In ICLR, 2019. 1

  36. [36]

    Muon: An optimizer for hidden layers in neural networks, 2024

    Keller Jordan, Yuchen Jin, Vlado Boza, Jiacheng You, Franz Cesista, Laker Newhouse, and Jeremy Bernstein. Muon: An optimizer for hidden layers in neural networks, 2024. 1, 7, 9 11

  37. [37]

    Adam: A Method for Stochastic Optimization

    Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization.arXiv preprint arXiv:1412.6980, 2014. 1, 4, 9

  38. [38]

    Efficient memory management for large language model serving with pagedattention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. InProceedings of the 29th symposium on operating systems principles, pages 611–626, 2023. 23

  39. [39]

    Solving quantitative reasoning problems with language models.Advances in neural information processing systems, 35:3843–3857, 2022

    Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, et al. Solving quantitative reasoning problems with language models.Advances in neural information processing systems, 35:3843–3857, 2022. 9, 23

  40. [40]

    Cheap orthogonal constraints in neural networks: A simple parametrization of the orthogonal and unitary group

    Mario Lezcano-Casado and David Martınez-Rubio. Cheap orthogonal constraints in neural networks: A simple parametrization of the orthogonal and unitary group. InICML, 2019. 2, 9

  41. [41]

    Efficient riemannian optimization on the stiefel manifold via the cayley transform.arXiv preprint arXiv:2002.01113, 2020

    Jun Li, Li Fuxin, and Sinisa Todorovic. Efficient riemannian optimization on the stiefel manifold via the cayley transform.arXiv preprint arXiv:2002.01113, 2020. 5

  42. [42]

    Normuon: Making muon more efficient and scalable.arXiv preprint arXiv:2510.05491, 2025

    Zichong Li, Liming Liu, Chen Liang, Weizhu Chen, and Tuo Zhao. Normuon: Making muon more efficient and scalable.arXiv preprint arXiv:2510.05491, 2025. 1, 9

  43. [43]

    Regularizing neural networks via minimizing hyperspherical energy

    Rongmei Lin, Weiyang Liu, Zhen Liu, Chen Feng, Zhiding Yu, James M. Rehg, Li Xiong, and Le Song. Regularizing neural networks via minimizing hyperspherical energy. InCVPR, 2020. 1

  44. [44]

    DeepSeek-V3 Technical Report

    Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024. 7

  45. [45]

    Muon is Scalable for LLM Training

    Jingyuan Liu, Jianlin Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, Yidao Qin, Weixin Xu, Enzhe Lu, Junjie Yan, et al. Muon is scalable for llm training.arXiv preprint arXiv:2502.16982, 2025. 3, 6, 7, 9

  46. [46]

    Learning towards minimum hyperspherical energy

    Weiyang Liu, Rongmei Lin, Zhen Liu, Lixin Liu, Zhiding Yu, Bo Dai, and Le Song. Learning towards minimum hyperspherical energy. InNeurIPS, 2018. 1, 9

  47. [47]

    Orthogonal over-parameterized training

    Weiyang Liu, Rongmei Lin, Zhen Liu, James M. Rehg, Liam Paull, Li Xiong, Le Song, and Adrian Weller. Orthogonal over-parameterized training. InCVPR, 2021. 1, 5

  48. [48]

    Learning with hyperspherical uniformity

    Weiyang Liu, Rongmei Lin, Zhen Liu, Li Xiong, Bernhard Schölkopf, and Adrian Weller. Learning with hyperspherical uniformity. InAISTATS, 2021. 1, 9

  49. [49]

    Decoupled weight decay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. InICLR, 2019. 1, 7

  50. [50]

    American mathematics contest 12 (amc 12), November 2023

    MAA. American mathematics contest 12 (amc 12), November 2023. 9, 23

  51. [51]

    American invitational mathematics examination (aime), February 2024

    MAA. American invitational mathematics examination (aime), February 2024. 9, 23

  52. [52]

    American invitational mathematics examination (aime), February 2025

    MAA. American invitational mathematics examination (aime), February 2025. 9, 23

  53. [53]

    The stability-plasticity dilemma: Investigating the continuum from catastrophic forgetting to age-limited learning effects.Frontiers in psychology, 4:504, 2013

    Martial Mermillod, Aurélia Bugaiska, and Patrick Bonin. The stability-plasticity dilemma: Investigating the continuum from catastrophic forgetting to age-limited learning effects.Frontiers in psychology, 4:504, 2013. 8

  54. [54]

    Efficient orthogonal parametrisation of recurrent neural networks using householder reflections

    Zakaria Mhammedi, Andrew Hellicar, Ashfaqur Rahman, and James Bailey. Efficient orthogonal parametrisation of recurrent neural networks using householder reflections. InICML, 2017. 2, 9

  55. [55]

    Spectral normalization for generative adversarial networks

    Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks.arXiv preprint arXiv:1802.05957, 2018. 1

  56. [56]

    When does sparsity mitigate the curse of depth in llms.arXiv preprint arXiv:2603.15389, 2026

    Dilxat Muhtar, Xinyuan Song, Sebastian Pokutta, Max Zimmer, Nico Pelleriti, Thomas Hofmann, and Shiwei Liu. When does sparsity mitigate the curse of depth in llms.arXiv preprint arXiv:2603.15389, 2026. 8 12

  57. [57]

    A method for solving the convex programming problem with convergence rate o (1/k2)

    Yurii Nesterov. A method for solving the convex programming problem with convergence rate o (1/k2). InDokl akad nauk Sssr, volume 269, page 543, 1983. 4

  58. [58]

    Training deep learning models with norm-constrained LMOs

    Thomas Pethick, Wanyun Xie, Kimon Antonakopoulos, Zhenyu Zhu, Antonio Silveti-Falls, and Volkan Cevher. Training deep learning models with norm-constrained lmos.arXiv preprint arXiv:2502.07529, 2025. 9

  59. [59]

    Some methods of speeding up the convergence of iteration methods.Ussr computational mathematics and mathematical physics, 4(5):1–17, 1964

    Boris T Polyak. Some methods of speeding up the convergence of iteration methods.Ussr computational mathematics and mathematical physics, 4(5):1–17, 1964. 4

  60. [60]

    Reparameterized llm training via orthogonal equivalence transformation

    Zeju Qiu, Simon Buchholz, Tim Z Xiao, Maximilian Dax, Bernhard Schölkopf, and Weiyang Liu. Reparameterized llm training via orthogonal equivalence transformation. InNeurIPS, 2025. 1, 2, 3, 5, 7, 9

  61. [61]

    Poet-x: Memory-efficient llm training by scaling orthogonal transformation

    Zeju Qiu, Lixin Liu, Adrian Weller, Han Shi, and Weiyang Liu. Poet-x: Memory-efficient llm training by scaling orthogonal transformation. InICML, 2026. 1, 2, 9

  62. [62]

    Controlling text-to-image diffusion by orthogonal finetuning

    Zeju Qiu, Weiyang Liu, Haiwen Feng, Yuxuan Xue, Yao Feng, Zhen Liu, Dan Zhang, Adrian Weller, and Bernhard Schölkopf. Controlling text-to-image diffusion by orthogonal finetuning. InNeurIPS, 2023. 5

  63. [63]

    Exploring the limits of transfer learning with a unified text-to-text transformer.Journal of machine learning research, 21(140):1–67, 2020

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer.Journal of machine learning research, 21(140):1–67, 2020. 3, 7

  64. [64]

    On the convergence of adam and beyond.arXiv preprint arXiv:1904.09237, 2019

    Sashank J Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of adam and beyond.arXiv preprint arXiv:1904.09237, 2019. 4

  65. [65]

    A stochastic approximation method.The annals of mathematical statistics, pages 400–407, 1951

    Herbert Robbins and Sutton Monro. A stochastic approximation method.The annals of mathematical statistics, pages 400–407, 1951. 22

  66. [66]

    Methods of improving llm training stability.arXiv preprint arXiv:2410.16682, 2024

    Oleg Rybakov, Mike Chrzanowski, Peter Dykas, Jinze Xue, and Ben Lanir. Methods of improving llm training stability.arXiv preprint arXiv:2410.16682, 2024. 7

  67. [67]

    Winogrande: An adversarial winograd schema challenge at scale.Communications of the ACM, 64(9):99–106,

    Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale.Communications of the ACM, 64(9):99–106,

  68. [68]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024. 9, 23

  69. [69]

    GLU Variants Improve Transformer

    Noam Shazeer. Glu variants improve transformer.arXiv preprint arXiv:2002.05202, 2020. 3

  70. [70]

    HybridFlow: A Flexible and Efficient RLHF Framework

    Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework.arXiv preprint arXiv: 2409.19256, 2024. 9, 23

  71. [71]

    Optimization techniques on riemannian manifolds.arXiv preprint arXiv:1407.5965, 2014

    Steven Thomas Smith. Optimization techniques on riemannian manifolds.arXiv preprint arXiv:1407.5965, 2014. 4

  72. [72]

    The curse of depth in large language models.arXiv preprint arXiv:2502.05795, 2025

    Wenfang Sun, Xinyuan Song, Pengxiang Li, Lu Yin, Yefeng Zheng, and Shiwei Liu. The curse of depth in large language models.arXiv preprint arXiv:2502.05795, 2025. 8

  73. [73]

    Gemma 2: Improving Open Language Models at a Practical Size

    Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, et al. Gemma 2: Improving open language models at a practical size.arXiv preprint arXiv:2408.00118,

  74. [74]

    Kimi K2: Open Agentic Intelligence

    Kimi Team, Yifan Bai, Yiping Bao, Y Charles, Cheng Chen, Guanduo Chen, Haiting Chen, Huarong Chen, Jiahao Chen, Ningxin Chen, et al. Kimi k2: Open agentic intelligence.arXiv preprint arXiv:2507.20534, 2025. 1, 6, 9

  75. [75]

    Qwen2.5: A party of foundation models, 2024

    Qwen Team. Qwen2.5: A party of foundation models, 2024. 8, 22 13

  76. [76]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models.arXiv preprint arXiv:2307.09288, 2023. 3

  77. [77]

    Orthogonalising gradients to speed up neural network optimisation.arXiv preprint arXiv:2202.07052, 2022

    Mark Tuddenham, Adam Prügel-Bennett, and Jonathan Hare. Orthogonalising gradients to speed up neural network optimisation.arXiv preprint arXiv:2202.07052, 2022. 9

  78. [78]

    Soap: Improving and stabilizing shampoo using adam.arXiv preprint arXiv:2409.11321, 2024

    Nikhil Vyas, Depen Morwani, Rosie Zhao, Mujin Kwun, Itai Shapira, David Brandfonbrener, Lucas Janson, and Sham Kakade. Soap: Improving and stabilizing shampoo using adam.arXiv preprint arXiv:2409.11321, 2024. 9

  79. [79]

    Magicoder: Empowering code generation with oss-instruct

    Yuxiang Wei, Zhe Wang, Jiawei Liu, Yifeng Ding, and Lingming Zhang. Magicoder: Empowering code generation with oss-instruct. InICML, 2024. 8, 22

  80. [80]

    Small-scale proxies for large-scale transformer training instabilities

    Mitchell Wortsman, Peter J Liu, Lechao Xiao, Katie Everett, Alex Alemi, Ben Adlam, John D Co-Reyes, Izzeddin Gur, Abhishek Kumar, Roman Novak, et al. Small-scale proxies for large-scale transformer training instabilities.arXiv preprint arXiv:2309.14322, 2023. 7

Showing first 80 references.