arxiv: 2605.09238 · v1 · submitted 2026-05-10 · 💻 cs.LG · cs.AI

Recognition: 2 theorem links

· Lean Theorem

Intrinsic Muon: Spectral Optimization on Riemannian Matrix Manifolds

Andi Han, Bamdev Mishra, Bihari Lal Pandey, Cyrus Mostajeran, Pratik Jawanpuria, Ravi Sah, Yibang Li

Authors on Pith no claims yet

Pith reviewed 2026-05-12 05:03 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords intrinsic muonriemannian matrix optimizationunitarily invariant normsclosed-form updatesfixed-rank manifoldSPD manifoldstiefel manifoldgrassmann manifold

0 comments

The pith

Lifting unitarily invariant norms to tangent spaces via the Riemannian metric yields closed-form Muon updates on fixed-rank, SPD, Stiefel, and Grassmann manifolds.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to extend Muon-style norm-constrained optimization to parameters that live on Riemannian matrix manifolds instead of flat Euclidean space. Standard Muon solves a linear maximization oracle over an ambient norm ball, but restricting that oracle directly to the tangent space breaks the manifold's quotient symmetries and couples the constraint in a way that blocks closed-form solutions. The resolution is a single observation: the Riemannian metric lifts any unitarily invariant Euclidean norm to a natural intrinsic norm on the tangent space, and the resulting tangent-space oracle automatically respects the manifold symmetries. This produces a unified iMuon framework that supplies explicit updates for the spectral, Frobenius, and nuclear norms on four common manifolds, together with convergence rates whose constants depend only on manifold dimension. The independence from factor conditioning on the fixed-rank manifold removes an extra rescaling step that earlier methods required.

Core claim

Every Riemannian metric canonically lifts a unitarily invariant Euclidean norm to an intrinsic norm on each tangent space, and the resulting intrinsic-norm-constrained linear maximization oracle is symmetry preserving; building on this single fact produces a unified intrinsic Muon (iMuon) algorithm that returns closed-form updates on the fixed-rank, SPD, Stiefel, and Grassmann manifolds for any unitarily invariant norm and supplies both deterministic and stochastic convergence guarantees whose rate constants depend only on the manifold dimension.

What carries the argument

The intrinsic norm on the tangent space obtained by lifting a unitarily invariant Euclidean norm through the Riemannian metric; this lift makes the constrained linear maximization oracle symmetry-preserving and therefore solvable in closed form.

If this is right

Deterministic and stochastic versions of iMuon converge with rates whose constants depend only on manifold dimension, independent of factor conditioning on the fixed-rank case.
No runtime factor-rescaling step is required for fixed-rank optimization.
The same closed-form machinery applies unchanged to the spectral, Frobenius, and nuclear norms on four different manifolds.
The framework directly supports LoRA fine-tuning of large language models, image classification, and subspace learning tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The dimension-only rate dependence suggests that iMuon could remain practical even when manifold dimension grows, provided the closed-form step itself scales acceptably.
The same lifting construction might be reusable on other matrix manifolds whose tangent spaces admit natural unitarily invariant structures.
Because the method removes an explicit rescaling heuristic, implementations on fixed-rank problems become simpler and potentially more stable across different conditioning regimes.

Load-bearing premise

That lifting any unitarily invariant Euclidean norm through the Riemannian metric produces a tangent-space norm whose linear maximization oracle automatically respects the manifold's quotient symmetries.

What would settle it

An explicit computation on the fixed-rank manifold showing that the lifted intrinsic-norm LMO for the spectral norm either fails to admit a closed-form solution or produces a matrix that violates the quotient symmetry of the manifold.

Figures

Figures reproduced from arXiv: 2605.09238 by Andi Han, Bamdev Mishra, Bihari Lal Pandey, Cyrus Mostajeran, Pratik Jawanpuria, Ravi Sah, Yibang Li.

**Figure 1.** Figure 1: SPD classification on CIFAR-100 (S 32 ++, 20 coarse classes). Each panel pairs a Euclidean LMO with its intrinsic counterpart under a common norm: Frobenius (EGD vs. RGD), spectral (Muon vs. iMuon), and nuclear (NuMuon vs. iMuon-Nu). Curves show mean test accuracy with ±1 std bands over 3 seeds at the validation-selected learning rate. The intrinsic method dominates in every pair, with the gap widening fro… view at source ↗

**Figure 2.** Figure 2: Noise sensitivity in large-scale synthetic fixed-rank matrix completion. Each row fixes the condition number and plots final relative recovery error as the relative observedentry noise scale ρ varies. The three panels in each row compare the Frobenius, spectral, and nuclear norm pairs. The y-axis is logarithmic, and lower is better. Results [PITH_FULL_IMAGE:figures/full_fig_p035_2.png] view at source ↗

**Figure 3.** Figure 3: Training trajectories for the balanced fixed-rank CIFAR-100 rank-head comparison. Curves show means over three seeds with standard-deviation bands. The three methods reach similar training objectives, while iMuon remains competitive in test accuracy [PITH_FULL_IMAGE:figures/full_fig_p037_3.png] view at source ↗

**Figure 4.** Figure 4: SPD convergence plot on frozen covariance features. The top row reports training cross-entropy and the bottom row reports test accuracy. Columns compare the Frobenius, spectral, and nuclear norm pairs. This variant omits the prototype anchoring term, so the accuracy values are not directly comparable with [PITH_FULL_IMAGE:figures/full_fig_p039_4.png] view at source ↗

**Figure 5.** Figure 5: Stiefel subcenter prototype classification trajectories. The left panel reports the training objective and the right panel reports test accuracy over epochs. Curves show means over three seeds with standard deviation bands. Stiefel manifold from the embedded submanifold viewpoint, why the naive approach fails and how the block decomposition of Section 4 resolves it, (iv) the spectrahedron (trace-one fixed-… view at source ↗

read the original abstract

Muon and related norm-constrained matrix optimizers have become central to large-scale learning problems. They are formulated as a linear maximization oracle (LMO) over an ambient matrix-norm ball in unconstrained Euclidean space. However, these do not generalize cleanly to manifold-valued parameters such as low-rank factorizations, orthogonality constraints, or symmetric positive definite (SPD) matrices. Naively restricting the Muon LMO to the tangent space (i) breaks quotient symmetries and (ii) couples the tangent-space constraint with an ambient norm bound, thereby obstructing closed-form solutions on various manifolds of interest. We resolve both issues with a single observation: every Riemannian metric canonically lifts a unitarily invariant Euclidean norm to an intrinsic norm on each tangent space, and the resulting intrinsic norm constrained LMO is symmetry preserving. Building on this, we introduce intrinsic Muon (iMuon), a unified framework that yields closed-form updates on the fixed-rank, SPD, Stiefel, and Grassmann manifolds for any unitarily invariant norm, including the spectral, Frobenius, and nuclear norms. We establish convergence guarantees for both deterministic and stochastic iMuon with rate constants that depend only on the manifold dimension. Notably, on the fixed-rank manifold this constant depends only on the rank, making the rate independent of factor conditioning and removing the runtime factor-rescaling required by prior work. Experiments on LoRA finetuning of LLMs, image classification, and subspace learning illustrate the efficacy of the proposed approach.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

iMuon gives closed-form norm-constrained updates on fixed-rank, SPD, Stiefel, and Grassmann manifolds by lifting unitarily invariant norms to intrinsic tangent-space norms, with rates depending only on dimension or rank.

read the letter

The main thing to know is that this paper delivers closed-form Muon-style updates on four matrix manifolds without breaking symmetries or requiring post-hoc rescaling. The central step is lifting a unitarily invariant Euclidean norm through the Riemannian metric to an intrinsic norm on each tangent space; the resulting linear maximization oracle stays symmetry-preserving and reduces to standard SVD or eigendecomposition steps. This works uniformly for spectral, Frobenius, and nuclear norms on the fixed-rank, SPD, Stiefel, and Grassmann cases. On the fixed-rank manifold the convergence rate depends only on rank, which removes the conditioning dependence that earlier approaches needed. The deterministic and stochastic analyses use standard Riemannian descent lemmas but keep the constants clean and dimension-only. Experiments on LoRA fine-tuning, image classification, and subspace learning show the method runs without extra tuning. The derivations look consistent and avoid circularity; the stress-test constructions match what is claimed. One minor soft spot is that the rates are upper bounds from generic lemmas rather than tight constants, so practical speed in very high dimensions may still vary. The stochastic variance assumptions are standard but receive less empirical stress-testing than the deterministic case. This is useful for anyone doing constrained matrix optimization in machine learning, especially low-rank or orthogonal parameterizations. It deserves peer review because the explicit constructions and rate independence address a real gap in applying spectral methods to these manifolds.

Referee Report

0 major / 3 minor

Summary. The paper introduces intrinsic Muon (iMuon), a unified framework extending Muon-style norm-constrained optimization to Riemannian matrix manifolds (fixed-rank, SPD, Stiefel, Grassmann). It defines an intrinsic norm on each tangent space by canonically lifting a unitarily invariant Euclidean norm via the Riemannian metric, yielding symmetry-preserving closed-form LMOs and updates for spectral, Frobenius, and nuclear norms. Convergence guarantees are established for deterministic and stochastic variants, with rates depending only on manifold dimension (or rank alone on the fixed-rank manifold). Experiments on LoRA finetuning of LLMs, image classification, and subspace learning support the approach.

Significance. If the explicit constructions and proofs hold, this is a significant contribution to constrained optimization in machine learning. The framework unifies Muon across multiple manifolds with closed-form updates that avoid coupling issues and factor rescaling, while delivering dimension-dependent convergence rates via standard Riemannian descent lemmas. The parameter-free character of the rates (depending solely on dimension or rank) and the symmetry preservation on quotient manifolds are notable strengths, with direct applicability to large-scale tasks like LLM adaptation.

minor comments (3)

[Abstract] In the abstract and introduction, the statement that rates 'depend only on the manifold dimension' could be accompanied by a brief parenthetical note on the precise constants or lemmas used, to immediately highlight the independence from conditioning.
[Experiments] Section 5 (experiments): the LoRA finetuning plots would benefit from reporting standard deviations across multiple random seeds, as single-run curves make it harder to assess robustness of the observed gains over baselines.
[Preliminaries] Notation for the intrinsic norm (e.g., how the horizontal projection is denoted on Grassmann and fixed-rank manifolds) is introduced clearly but could be collected in a single preliminary table for quick reference.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their detailed summary of our manuscript and for recommending minor revision. No specific major comments were provided in the report.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The derivation chain relies on standard Riemannian geometry: the intrinsic norm is obtained by restricting the ambient unitarily invariant norm to the tangent space via the Riemannian metric, which is a direct and non-circular construction. The LMO is then solved using the same singular-vector or eigenvalue routines as Euclidean Muon, with symmetry preservation following immediately from unitary invariance plus horizontal projection on quotient manifolds. Closed-form updates for spectral/Frobenius/nuclear norms on fixed-rank, SPD, Stiefel, and Grassmann manifolds are explicitly derived, and convergence rates are bounded using standard Riemannian descent lemmas with constants depending only on manifold dimension (or rank). No step reduces to a self-definitional loop, a fitted parameter renamed as prediction, or a load-bearing self-citation chain; all central claims are independent of the paper's own inputs and rest on external mathematical facts.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach rests on standard domain assumptions from Riemannian geometry and optimization; no free parameters or invented entities are introduced beyond the named framework.

axioms (1)

domain assumption Riemannian metrics canonically lift unitarily invariant Euclidean norms to intrinsic norms on tangent spaces that preserve quotient symmetries
This is the single observation resolving both symmetry-breaking and closed-form obstruction issues stated in the abstract.

pith-pipeline@v0.9.0 · 5588 in / 1397 out tokens · 58041 ms · 2026-05-12T05:03:48.384905+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear
every Riemannian metric canonically lifts a unitarily invariant Euclidean norm to an intrinsic norm on each tangent space... ξ^* = arg max ... φ(G^{1/2}_x ξ) ≤ τ ... gx(ξ, grad f)
IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking unclear
Z_x is SV invariant... Z^* = U diag(z^*) V^⊤ ... z^* solves max φ(z)≤τ ⟨z,σ⟩

Reference graph

Works this paper leans on

73 extracted references · 73 canonical work pages · 3 internal anchors

[1]

Absil, Robert Mahony, and Rodolphe Sepulchre.Optimization Algorithms on Matrix Manifolds

P.-A. Absil, Robert Mahony, and Rodolphe Sepulchre.Optimization Algorithms on Matrix Manifolds. Princeton University Press, 2008

work page 2008
[2]

Natural gradient works efficiently in learning.Neural Computation, 10(2):251–276, 1998

Shun-Ichi Amari. Natural gradient works efficiently in learning.Neural Computation, 10(2):251–276, 1998

work page 1998
[3]

Noah Amsel, David Persson, Christopher Musco, and Robert M. Gower. The polar express: Opti- mal matrix sign methods and their application to the Muon algorithm. InInternational Conference on Learning Representations (ICLR), 2026. URLhttps://openreview.net/forum?id=yRtgZ1K8hO. Outstanding Paper Award; arXiv:2505.16932

work page internal anchor Pith review Pith/arXiv arXiv 2026
[4]

Geometric means in a novel vector space structure on symmetric positive-definite matrices.SIAM Journal on Matrix Analysis and Applications, 29(1):328–347, 2007

Vincent Arsigny, Pierre Fillard, Xavier Pennec, and Nicholas Ayache. Geometric means in a novel vector space structure on symmetric positive-definite matrices.SIAM Journal on Matrix Analysis and Applications, 29(1):328–347, 2007

work page 2007
[5]

Online identification and tracking of subspaces from highly incomplete information

Laura Balzano, Robert Nowak, and Benjamin Recht. Online identification and tracking of subspaces from highly incomplete information. In48th Annual Allerton Conference on Communication, Control, and Computing, pages 704–711, 2010

work page 2010
[6]

Old optimizer, new norm: An anthology.arXiv preprint arXiv:2409.20325, 2024

Jeremy Bernstein and Laker Newhouse. Old optimizer, new norm: An anthology.arXiv preprint arXiv:2409.20325, 2024

work page arXiv 2024
[7]

Modular duality in deep learning.arXiv preprint arXiv:2410.21265, 2024

Jeremy Bernstein and Laker Newhouse. Modular duality in deep learning. InInternational Conference on Machine Learning (ICML), pages 3920–3930, 2025. arXiv:2410.21265

work page arXiv 2025
[8]

Princeton Series in Applied Mathematics

Rajendra Bhatia.Positive Definite Matrices. Princeton Series in Applied Mathematics. Princeton University Press, 2007

work page 2007
[9]

On the Bures–Wasserstein distance between positive definite matrices.Expositiones Mathematicae, 37(2):165–191, 2019

Rajendra Bhatia, Tanvi Jain, and Yongdo Lim. On the Bures–Wasserstein distance between positive definite matrices.Expositiones Mathematicae, 37(2):165–191, 2019. 10 Y. LI, B. L. PANDEY, R. SAH, A. HAN, C. MOSTAJERAN, P. JA W ANPURIA, AND B. MISHRA

work page 2019
[10]

Lora meets riemannion: Muon optimizer for parametrization-independent low-rank adapters.arXiv preprint arXiv:2507.12142, 2025

Vladimir Bogachev, Vladimir Aletov, Alexander Molozhavenko, Denis Bobkov, Vera Soboleva, Aibek Alanov, and Maxim Rakhuba. LoRA meets Riemannion: Muon optimizer for parametrization- independent low-rank adapters. InInternational Conference on Learning Representations (ICLR), 2026. URLhttps://openreview.net/forum?id=WtbXgc9GVA. arXiv:2507.12142

work page arXiv 2026
[11]

Cambridge University Press, 2023

Nicolas Boumal.An Introduction to Optimization on Smooth Manifolds. Cambridge University Press, 2023

work page 2023
[12]

Exact matrix completion via convex optimization.Communi- cations of the ACM, 55(6):111–119, 2012

Emmanuel Candes and Benjamin Recht. Exact matrix completion via convex optimization.Communi- cations of the ACM, 55(6):111–119, 2012

work page 2012
[13]

Preconditioned spec- tral descent for deep learning

David Carlson, Edo Collins, Ya-Ping Hsieh, Lawrence Carin, and Volkan Cevher. Preconditioned spec- tral descent for deep learning. InAdvances in Neural Information Processing Systems (NeurIPS), volume 28. Curran Associates, Inc., 2015. URLhttps://proceedings.neurips.cc/paper_files/ paper/2015/file/f50a6c02a3fc5a3a5d4d9391f05f3efc-Paper.pdf

work page 2015
[14]

Muon optimizes under spectral norm constraints.Transac- tions on Machine Learning Research, 2026

Lizhang Chen, Jonathan Li, and qiang liu. Muon optimizes under spectral norm constraints.Transac- tions on Machine Learning Research, 2026. ISSN 2835-8856. URLhttps://openreview.net/forum? id=Blz4hjxLwU

work page 2026
[15]

Michael Crawshaw, Chirag Modi, Mingrui Liu, and Robert M. Gower. An exploration of non-Euclidean gradient descent: Muon and its many variants.arXiv preprint arXiv:2510.09827, 2025

work page arXiv 2025
[16]

Arias, and Steven T

Alan Edelman, Tomás A. Arias, and Steven T. Smith. The geometry of algorithms with orthogonality constraints.SIAM Journal on Matrix Analysis and Applications, 20(2):303–353, 1998

work page 1998
[17]

Mano: Restriking manifold optimization for llm training.arXiv preprint arXiv:2601.23000, 2026

Yufei Gu and Zeke Xie. Mano: Restriking manifold optimization for LLM training.arXiv preprint arXiv:2601.23000, 2026

work page arXiv 2026
[18]

Grassmann discriminant analysis: a unifying view on subspace-based learning

Jihun Hamm and Daniel D Lee. Grassmann discriminant analysis: a unifying view on subspace-based learning. InProceedings of the 25th international conference on Machine learning, pages 376–383, 2008

work page 2008
[19]

The movielens datasets: History and context.Acm transac- tions on interactive intelligent systems (tiis), 5(4):1–19, 2015

F Maxwell Harper and Joseph A Konstan. The movielens datasets: History and context.Acm transac- tions on interactive intelligent systems (tiis), 5(4):1–19, 2015

work page 2015
[20]

Deep residual learning for image recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016

work page 2016
[21]

Higham.Functions of Matrices: Theory and Computation

Nicholas J. Higham.Functions of Matrices: Theory and Computation. SIAM, 2008

work page 2008
[22]

Horn and Charles R

Roger A. Horn and Charles R. Johnson.Matrix Analysis. Cambridge University Press, 2nd edition, 2012

work page 2012
[23]

Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen

Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. InInternational Conference on Learning Representations (ICLR), 2022

work page 2022
[24]

A Riemannian network for SPD matrix learning

Zhiwu Huang and Luc Van Gool. A Riemannian network for SPD matrix learning. InProceedings of the AAAI conference on artificial intelligence, volume 31, 2017

work page 2017
[25]

Projection metric learning on Grassmann manifold with application to video based face recognition

Zhiwu Huang, Ruiping Wang, Shiguang Shan, and Xilin Chen. Projection metric learning on Grassmann manifold with application to video based face recognition. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 140–149, 2015

work page 2015
[26]

Stabilizing native low-rank LLM pretraining

Paul Janson, Edouard Oyallon, and Eugene Belilovsky. Stabilizing native low-rank LLM pretraining. arXiv preprint arXiv:2602.12429, 2026

work page arXiv 2026
[27]

Muon: An optimizer for hidden layers in neural networks.https://kellerjordan

Keller Jordan. Muon: An optimizer for hidden layers in neural networks.https://kellerjordan. github.io/posts/muon/, 2024

work page 2024
[28]

Absil, and Rodolphe Sepulchre

Michel Journée, Francis Bach, P.-A. Absil, and Rodolphe Sepulchre. Low-rank optimization on the cone of positive semidefinite matrices.SIAM Journal on Optimization, 20(5):2327–2351, 2010

work page 2010
[29]

Uniform spectral growth and convergence of Muon in LoRA-style matrix factorization.arXiv preprint arXiv:2602.06385, 2026

Changmin Kang, Jihun Yun, Baekrok Shin, Yeseul Cho, and Chulhee Yun. Uniform spectral growth and convergence of Muon in LoRA-style matrix factorization.arXiv preprint arXiv:2602.06385, 2026

work page arXiv 2026
[30]

Matrix completion from a few entries

Raghunandan H Keshavan, Andrea Montanari, and Sewoong Oh. Matrix completion from a few entries. IEEE transactions on information theory, 56(6):2980–2998, 2010

work page 2010
[31]

Convergence of Muon with Newton-Schulz.arXiv preprint arXiv:2601.19156, 2026

Gyu Yeol Kim and Min-hwan Oh. Convergence of Muon with Newton–Schulz. InInternational Conference on Learning Representations (ICLR), 2026. URLhttps://openreview.net/forum?id= lJSfxtLpLm. arXiv:2601.19156

work page arXiv 2026
[32]

Matrix factorization techniques for recommender systems.Computer, 42(8):30–37, 2009

Yehuda Koren, Robert Bell, and Chris Volinsky. Matrix factorization techniques for recommender systems.Computer, 42(8):30–37, 2009

work page 2009
[33]

Learning multiple layers of features from tiny images

Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009. INTRINSIC MUON: SPECTRAL OPTIMIZATION ON RIEMANNIAN MATRIX MANIFOLDS 11

work page 2009
[34]

Scalable optimization in the modular norm

Tim Large, Yang Liu, Minyoung Huh, Hyojin Bahng, Phillip Isola, and Jeremy Bernstein. Scalable optimization in the modular norm. InAdvances in Neural Information Processing Systems (NeurIPS),

work page
[35]

Awell-conditionedestimatorforlarge-dimensionalcovariancematrices

OlivierLedoitandMichaelWolf. Awell-conditionedestimatorforlarge-dimensionalcovariancematrices. Journal of multivariate analysis, 88(2):365–411, 2004

work page 2004
[36]

Jiaxiang Li and Mingyi Hong

Jiaxiang Li and Mingyi Hong. A note on the convergence of Muon.arXiv preprint arXiv:2502.02900, 2025

work page arXiv 2025
[37]

Preconditioning benefits of spectral orthogonalization in Muon.arXiv preprint arXiv:2601.13474, 2026

Jianhao Ma, Yu Huang, Yuejie Chi, and Yuxin Chen. Preconditioning benefits of spectral orthogonal- ization in Muon.arXiv preprint arXiv:2601.13474, 2026

work page arXiv 2026
[38]

Wasserstein Riemannian geometry of Gaussian densities.Information Geometry, 1(2):137–179, 2018

Luigi Malagò, Luigi Montrucchio, and Giovanni Pistone. Wasserstein Riemannian geometry of Gaussian densities.Information Geometry, 1(2):137–179, 2018

work page 2018
[39]

Riemannian preconditioning.SIAM Journal on Optimization, 26(1):635–660, 2016

Bamdev Mishra and Rodolphe Sepulchre. Riemannian preconditioning.SIAM Journal on Optimization, 26(1):635–660, 2016

work page 2016
[40]

A Riemannian geometry for low-rank matrix completion.arXiv preprint arXiv:1211.1550, 2012

Bamdev Mishra, K Aditya Apuroop, and Rodolphe Sepulchre. A Riemannian geometry for low-rank matrix completion.arXiv preprint arXiv:1211.1550, 2012

work page arXiv 2012
[41]

Fixed-rank matrix factor- izations and Riemannian low-rank optimization.Computational Statistics, 29(3–4):591–621, 2014

Bamdev Mishra, Gilles Meyer, Silvère Bonnabel, and Rodolphe Sepulchre. Fixed-rank matrix factor- izations and Riemannian low-rank optimization.Computational Statistics, 29(3–4):591–621, 2014

work page 2014
[42]

Parameter and memory efficient pretraining via low-rank Riemannian optimization

Zhanfeng Mo, Long-Kai Huang, and Sinno Jialin Pan. Parameter and memory efficient pretraining via low-rank Riemannian optimization. InInternational Conference on Learning Representations (ICLR),

work page
[43]

URLhttps://openreview.net/forum?id=i0zzO7Hslk

work page
[44]

Hewa Koneputugodage, Shamane Siriwardhana, Violetta Shevchenko, Karol Pajak, James Snewin, Gil Avra- ham, and Alexander Long

Hadi Mohaghegh Dolatabadi, Thalaiyasingam Ajanthan, Sameera Ramasinghe, Chamin P. Hewa Koneputugodage, Shamane Siriwardhana, Violetta Shevchenko, Karol Pajak, James Snewin, Gil Avra- ham, and Alexander Long. NuMuon: Nuclear-norm-constrained Muon for compressible LLM training. arXiv preprint arXiv:2603.03597, 2026

work page arXiv 2026
[45]

Stochastic conditional gradient methods: From convex minimization to submodular maximization.Journal of Machine Learning Research, 21(105): 1–49, 2020

Aryan Mokhtari, Hamed Hassani, and Amin Karbasi. Stochastic conditional gradient methods: From convex minimization to submodular maximization.Journal of Machine Learning Research, 21(105): 1–49, 2020

work page 2020
[46]

The E2E dataset: New challenges for end-to- end generation

Jekaterina Novikova, Ondřej Dušek, and Verena Rieser. The E2E dataset: New challenges for end-to- end generation. InProceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, pages 201–206, 2017

work page 2017
[47]

Riemannian optimization for LoRA on the Stiefel manifold

JuneYoungPark, MinjaeKang, SeongbaeLee, HaegangLee, SeongwanKim, andJaehoLee. Riemannian optimization for LoRA on the Stiefel manifold. InFindings of the Association for Computational Linguistics: EMNLP 2025, pages 20971–20985, 2025. arXiv:2508.17901

work page arXiv 2025
[48]

A Riemannian framework for tensor computing

Xavier Pennec, Pierre Fillard, and Nicholas Ayache. A Riemannian framework for tensor computing. International Journal of computer vision, 66(1):41–66, 2006

work page 2006
[49]

Training deep learning models with norm-constrained lmos.arXiv preprint arXiv:2502.07529, 2025

Thomas Pethick, Wanyun Xie, Kimon Antonakopoulos, Zhenyu Zhu, Antonio Silveti-Falls, and Volkan Cevher. Training deep learning models with norm-constrained LMOs. InInternational Conference on Machine Learning (ICML), 2025. arXiv:2502.07529

work page arXiv 2025
[50]

A simpler approach to matrix completion.Journal of Machine Learning Research, 12 (12), 2011

Benjamin Recht. A simpler approach to matrix completion.Journal of Machine Learning Research, 12 (12), 2011

work page 2011
[51]

Gluon: Making Muon & Scion Great Again! (Bridging Theory and Practice of lmo-based Optimizers for LLMs)

Artem Riabinin, Egor Shulgin, Kaja Gruntkowska, and Peter Richtárik. Gluon: Making Muon & Scion great again! (bridging theory and practice of LMO-based optimizers for LLMs).arXiv preprint arXiv:2505.13416, 2025

work page arXiv 2025
[52]

arXiv preprint arXiv:2507.01598 , year =

Naoki Sato, Hiroki Naganuma, and Hideaki Iiduka. Convergence bound and critical batch size of Muon optimizer.arXiv preprint arXiv:2507.01598, 2025

work page arXiv 2025
[53]

A geometric framework for momentum-based optimizers for low-rank training

Steffen Schotthöfer, Timon Klein, and Jonas Kusch. A geometric framework for momentum-based optimizers for low-rank training. InAdvances in Neural Information Processing Systems (NeurIPS),

work page
[54]

On the Convergence Analysis of Muon

Wei Shen, Ruichuan Huang, Minhui Huang, Cong Shen, and Jiawei Zhang. On the convergence analysis of Muon.arXiv preprint arXiv:2505.23737, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[55]

Complexities in projection-free stochastic non-convex minimization

Zebang Shen, Cong Fang, Peilin Zhao, Junzhou Huang, and Hui Qian. Complexities in projection-free stochastic non-convex minimization. InInternational Conference on Artificial Intelligence and Statistics (AISTATS), pages 2868–2876. PMLR, 2019. 12 Y. LI, B. L. PANDEY, R. SAH, A. HAN, C. MOSTAJERAN, P. JA W ANPURIA, AND B. MISHRA

work page 2019
[56]

Accelerating ill-conditioned low-rank matrix estimation via scaled gradient descent.Journal of Machine Learning Research, 22(150):1–63, 2021

Tian Tong, Cong Ma, and Yuejie Chi. Accelerating ill-conditioned low-rank matrix estimation via scaled gradient descent.Journal of Machine Learning Research, 22(150):1–63, 2021

work page 2021
[57]

Region covariance: A fast descriptor for detection and classification

Oncel Tuzel, Fatih Porikli, and Peter Meer. Region covariance: A fast descriptor for detection and classification. InEuropean conference on computer vision, pages 589–600. Springer, 2006

work page 2006
[58]

Pedestrian detection via classification on Riemannian manifolds.IEEE transactions on pattern analysis and machine intelligence, 30(10):1713–1727, 2008

Oncel Tuzel, Fatih Porikli, and Peter Meer. Pedestrian detection via classification on Riemannian manifolds.IEEE transactions on pattern analysis and machine intelligence, 30(10):1713–1727, 2008

work page 2008
[59]

Low-rank matrix completion by Riemannian optimization.SIAM Journal on Optimization, 23(2):1214–1236, 2013

Bart Vandereycken. Low-rank matrix completion by Riemannian optimization.SIAM Journal on Optimization, 23(2):1214–1236, 2013

work page 2013
[60]

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. InInternational Conference on Learning Representations (ICLR), 2019. URLhttps://openreview.net/forum?id= rJ4km2R5t7. arXiv:1804.07461

work page internal anchor Pith review arXiv 2019
[61]

Additive margin softmax for face verification

Feng Wang, Jian Cheng, Weiyang Liu, and Haijun Liu. Additive margin softmax for face verification. IEEE Signal Processing Letters, 25(7):926–930, 2018

work page 2018
[62]

Taming momentum: Rethinking opti- mizer states through low-rank approximation

Zhengbo Wang, Jian Liang, Ran He, Zilei Wang, and Tieniu Tan. Taming momentum: Rethinking opti- mizer states through low-rank approximation. InInternational Conference on Learning Representations (ICLR), 2026. Oral; arXiv:2602.24283

work page arXiv 2026
[63]

Riemannian optimization via Frank–Wolfe methods.Mathematical Programming, 199:525–556, 2023

Melanie Weber and Suvrit Sra. Riemannian optimization via Frank–Wolfe methods.Mathematical Programming, 199:525–556, 2023

work page 2023
[64]

Fantastic pretraining optimizers and where to find them.arXiv preprint arXiv:2509.02046, 2025

Kaiyue Wen, David Hall, Tengyu Ma, and Percy Liang. Fantastic pretraining optimizers and where to find them. InInternational Conference on Learning Representations (ICLR), 2026. URLhttps: //openreview.net/forum?id=2J51qUZ0iG. arXiv:2509.02046

work page arXiv 2026
[65]

Face recognition in unconstrained videos with matched back- ground similarity

Lior Wolf, Tal Hassner, and Itay Maoz. Face recognition in unconstrained videos with matched back- ground similarity. InCVPR 2011, pages 529–534. IEEE, 2011

work page 2011
[66]

Controlled llm training on spectral sphere.arXiv preprint arXiv: 2601.08393,

TianXie, HaomingLuo, HaoyuTang, YiwenHu, JasonKleinLiu, QingnanRen, YangWang, WayneXin Zhao, Rui Yan, Bing Su, Chong Luo, and Baining Guo. Controlled LLM training on spectral sphere. arXiv preprint arXiv:2601.08393, 2026

work page arXiv 2026
[67]

Hu, Igor Babuschkin, Szymon Sidor, Xiaodong Liu, David Farhi, Nick Ryder, Jakub Pachocki, Weizhu Chen, and Jianfeng Gao

Greg Yang, Edward J. Hu, Igor Babuschkin, Szymon Sidor, Xiaodong Liu, David Farhi, Nick Ryder, Jakub Pachocki, Weizhu Chen, and Jianfeng Gao. Tuning large neural networks via zero-shot hyperpa- rameter transfer. InAdvances in Neural Information Processing Systems (NeurIPS), volume 34, 2021. arXiv:2203.03466

work page arXiv 2021
[68]

A spectral condition for feature learning

Greg Yang, James B. Simon, and Jeremy Bernstein. A spectral condition for feature learning.arXiv preprint arXiv:2310.17813, 2024

work page arXiv 2024
[69]

Manifold constrained steepest descent.arXiv preprint arXiv:2601.21487, 2026

Kaiwei Yang and Lexiao Lai. Manifold constrained steepest descent.arXiv preprint arXiv:2601.21487, 2026

work page arXiv 2026
[70]

Riemannian preconditioned LoRA for fine-tuning foundation models

Fangzhao Zhang and Mert Pilanci. Riemannian preconditioned LoRA for fine-tuning foundation models. InInternational Conference on Machine Learning (ICML), 2024. arXiv:2402.02347

work page arXiv 2024
[71]

Zhao et al

Jiawei Zhao, Zhenyu Zhang, Beidi Chen, Zhangyang Wang, Anima Anandkumar, and Yuandong Tian. GaLore: Memory-efficient LLM training by gradient low-rank projection. InInternational Conference on Machine Learning (ICML), 2024. arXiv:2403.03507. Appendix Appendix A. Broader Societal Impact 13 Appendix B. Limitations 14 Appendix C. Closed-Form LMO Solutions vi...

work page arXiv 2024
[72]

observe that the objective function in feedforward neural networks admits a tighter majorization bound under the Schatten-∞norm than under the Frobenius norm, and derive the corresponding steepest-descent operator, which is precisely the orthogonal polar factorOrtho(·)used by Muon. They further combine this non-Euclidean gradient with element-wise adaptiv...

work page arXiv
[73]

We use the standard CIFAR-100 train/test split and reserve10%of the training set as a validation split for learning-rate selection

We then build a shrinkage regularized covariance descriptorCi ∈S 32 ++ [24, 35, 57] for each image using covariance shrinkage0.1and diagonal stabilizationε= 10 −4. We use the standard CIFAR-100 train/test split and reserve10%of the training set as a validation split for learning-rate selection. Results. Table 15 reports validation-selected test accuracy f...

work page