pith. machine review for the scientific record. sign in

arxiv: 2605.09238 · v1 · submitted 2026-05-10 · 💻 cs.LG · cs.AI

Recognition: 2 theorem links

· Lean Theorem

Intrinsic Muon: Spectral Optimization on Riemannian Matrix Manifolds

Andi Han, Bamdev Mishra, Bihari Lal Pandey, Cyrus Mostajeran, Pratik Jawanpuria, Ravi Sah, Yibang Li

Authors on Pith no claims yet

Pith reviewed 2026-05-12 05:03 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords intrinsic muonriemannian matrix optimizationunitarily invariant normsclosed-form updatesfixed-rank manifoldSPD manifoldstiefel manifoldgrassmann manifold
0
0 comments X

The pith

Lifting unitarily invariant norms to tangent spaces via the Riemannian metric yields closed-form Muon updates on fixed-rank, SPD, Stiefel, and Grassmann manifolds.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to extend Muon-style norm-constrained optimization to parameters that live on Riemannian matrix manifolds instead of flat Euclidean space. Standard Muon solves a linear maximization oracle over an ambient norm ball, but restricting that oracle directly to the tangent space breaks the manifold's quotient symmetries and couples the constraint in a way that blocks closed-form solutions. The resolution is a single observation: the Riemannian metric lifts any unitarily invariant Euclidean norm to a natural intrinsic norm on the tangent space, and the resulting tangent-space oracle automatically respects the manifold symmetries. This produces a unified iMuon framework that supplies explicit updates for the spectral, Frobenius, and nuclear norms on four common manifolds, together with convergence rates whose constants depend only on manifold dimension. The independence from factor conditioning on the fixed-rank manifold removes an extra rescaling step that earlier methods required.

Core claim

Every Riemannian metric canonically lifts a unitarily invariant Euclidean norm to an intrinsic norm on each tangent space, and the resulting intrinsic-norm-constrained linear maximization oracle is symmetry preserving; building on this single fact produces a unified intrinsic Muon (iMuon) algorithm that returns closed-form updates on the fixed-rank, SPD, Stiefel, and Grassmann manifolds for any unitarily invariant norm and supplies both deterministic and stochastic convergence guarantees whose rate constants depend only on the manifold dimension.

What carries the argument

The intrinsic norm on the tangent space obtained by lifting a unitarily invariant Euclidean norm through the Riemannian metric; this lift makes the constrained linear maximization oracle symmetry-preserving and therefore solvable in closed form.

If this is right

  • Deterministic and stochastic versions of iMuon converge with rates whose constants depend only on manifold dimension, independent of factor conditioning on the fixed-rank case.
  • No runtime factor-rescaling step is required for fixed-rank optimization.
  • The same closed-form machinery applies unchanged to the spectral, Frobenius, and nuclear norms on four different manifolds.
  • The framework directly supports LoRA fine-tuning of large language models, image classification, and subspace learning tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The dimension-only rate dependence suggests that iMuon could remain practical even when manifold dimension grows, provided the closed-form step itself scales acceptably.
  • The same lifting construction might be reusable on other matrix manifolds whose tangent spaces admit natural unitarily invariant structures.
  • Because the method removes an explicit rescaling heuristic, implementations on fixed-rank problems become simpler and potentially more stable across different conditioning regimes.

Load-bearing premise

That lifting any unitarily invariant Euclidean norm through the Riemannian metric produces a tangent-space norm whose linear maximization oracle automatically respects the manifold's quotient symmetries.

What would settle it

An explicit computation on the fixed-rank manifold showing that the lifted intrinsic-norm LMO for the spectral norm either fails to admit a closed-form solution or produces a matrix that violates the quotient symmetry of the manifold.

Figures

Figures reproduced from arXiv: 2605.09238 by Andi Han, Bamdev Mishra, Bihari Lal Pandey, Cyrus Mostajeran, Pratik Jawanpuria, Ravi Sah, Yibang Li.

Figure 1
Figure 1. Figure 1: SPD classification on CIFAR-100 (S 32 ++, 20 coarse classes). Each panel pairs a Euclidean LMO with its intrinsic counterpart under a common norm: Frobenius (EGD vs. RGD), spectral (Muon vs. iMuon), and nuclear (NuMuon vs. iMuon-Nu). Curves show mean test accuracy with ±1 std bands over 3 seeds at the validation-selected learning rate. The intrinsic method dominates in every pair, with the gap widening fro… view at source ↗
Figure 2
Figure 2. Figure 2: Noise sensitivity in large-scale synthetic fixed-rank matrix completion. Each row fixes the condition number and plots final relative recovery error as the relative observed￾entry noise scale ρ varies. The three panels in each row compare the Frobenius, spectral, and nuclear norm pairs. The y-axis is logarithmic, and lower is better. Results [PITH_FULL_IMAGE:figures/full_fig_p035_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Training trajectories for the balanced fixed-rank CIFAR-100 rank-head com￾parison. Curves show means over three seeds with standard-deviation bands. The three methods reach similar training objectives, while iMuon remains competitive in test accu￾racy [PITH_FULL_IMAGE:figures/full_fig_p037_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: SPD convergence plot on frozen covariance features. The top row reports train￾ing cross-entropy and the bottom row reports test accuracy. Columns compare the Frobe￾nius, spectral, and nuclear norm pairs. This variant omits the prototype anchoring term, so the accuracy values are not directly comparable with [PITH_FULL_IMAGE:figures/full_fig_p039_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Stiefel subcenter prototype classification trajectories. The left panel reports the training objective and the right panel reports test accuracy over epochs. Curves show means over three seeds with standard deviation bands. Stiefel manifold from the embedded submanifold viewpoint, why the naive approach fails and how the block decomposition of Section 4 resolves it, (iv) the spectrahedron (trace-one fixed-… view at source ↗
read the original abstract

Muon and related norm-constrained matrix optimizers have become central to large-scale learning problems. They are formulated as a linear maximization oracle (LMO) over an ambient matrix-norm ball in unconstrained Euclidean space. However, these do not generalize cleanly to manifold-valued parameters such as low-rank factorizations, orthogonality constraints, or symmetric positive definite (SPD) matrices. Naively restricting the Muon LMO to the tangent space (i) breaks quotient symmetries and (ii) couples the tangent-space constraint with an ambient norm bound, thereby obstructing closed-form solutions on various manifolds of interest. We resolve both issues with a single observation: every Riemannian metric canonically lifts a unitarily invariant Euclidean norm to an intrinsic norm on each tangent space, and the resulting intrinsic norm constrained LMO is symmetry preserving. Building on this, we introduce intrinsic Muon (iMuon), a unified framework that yields closed-form updates on the fixed-rank, SPD, Stiefel, and Grassmann manifolds for any unitarily invariant norm, including the spectral, Frobenius, and nuclear norms. We establish convergence guarantees for both deterministic and stochastic iMuon with rate constants that depend only on the manifold dimension. Notably, on the fixed-rank manifold this constant depends only on the rank, making the rate independent of factor conditioning and removing the runtime factor-rescaling required by prior work. Experiments on LoRA finetuning of LLMs, image classification, and subspace learning illustrate the efficacy of the proposed approach.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper introduces intrinsic Muon (iMuon), a unified framework extending Muon-style norm-constrained optimization to Riemannian matrix manifolds (fixed-rank, SPD, Stiefel, Grassmann). It defines an intrinsic norm on each tangent space by canonically lifting a unitarily invariant Euclidean norm via the Riemannian metric, yielding symmetry-preserving closed-form LMOs and updates for spectral, Frobenius, and nuclear norms. Convergence guarantees are established for deterministic and stochastic variants, with rates depending only on manifold dimension (or rank alone on the fixed-rank manifold). Experiments on LoRA finetuning of LLMs, image classification, and subspace learning support the approach.

Significance. If the explicit constructions and proofs hold, this is a significant contribution to constrained optimization in machine learning. The framework unifies Muon across multiple manifolds with closed-form updates that avoid coupling issues and factor rescaling, while delivering dimension-dependent convergence rates via standard Riemannian descent lemmas. The parameter-free character of the rates (depending solely on dimension or rank) and the symmetry preservation on quotient manifolds are notable strengths, with direct applicability to large-scale tasks like LLM adaptation.

minor comments (3)
  1. [Abstract] In the abstract and introduction, the statement that rates 'depend only on the manifold dimension' could be accompanied by a brief parenthetical note on the precise constants or lemmas used, to immediately highlight the independence from conditioning.
  2. [Experiments] Section 5 (experiments): the LoRA finetuning plots would benefit from reporting standard deviations across multiple random seeds, as single-run curves make it harder to assess robustness of the observed gains over baselines.
  3. [Preliminaries] Notation for the intrinsic norm (e.g., how the horizontal projection is denoted on Grassmann and fixed-rank manifolds) is introduced clearly but could be collected in a single preliminary table for quick reference.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their detailed summary of our manuscript and for recommending minor revision. No specific major comments were provided in the report.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The derivation chain relies on standard Riemannian geometry: the intrinsic norm is obtained by restricting the ambient unitarily invariant norm to the tangent space via the Riemannian metric, which is a direct and non-circular construction. The LMO is then solved using the same singular-vector or eigenvalue routines as Euclidean Muon, with symmetry preservation following immediately from unitary invariance plus horizontal projection on quotient manifolds. Closed-form updates for spectral/Frobenius/nuclear norms on fixed-rank, SPD, Stiefel, and Grassmann manifolds are explicitly derived, and convergence rates are bounded using standard Riemannian descent lemmas with constants depending only on manifold dimension (or rank). No step reduces to a self-definitional loop, a fitted parameter renamed as prediction, or a load-bearing self-citation chain; all central claims are independent of the paper's own inputs and rest on external mathematical facts.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach rests on standard domain assumptions from Riemannian geometry and optimization; no free parameters or invented entities are introduced beyond the named framework.

axioms (1)
  • domain assumption Riemannian metrics canonically lift unitarily invariant Euclidean norms to intrinsic norms on tangent spaces that preserve quotient symmetries
    This is the single observation resolving both symmetry-breaking and closed-form obstruction issues stated in the abstract.

pith-pipeline@v0.9.0 · 5588 in / 1397 out tokens · 58041 ms · 2026-05-12T05:03:48.384905+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

73 extracted references · 73 canonical work pages · 3 internal anchors

  1. [1]

    Absil, Robert Mahony, and Rodolphe Sepulchre.Optimization Algorithms on Matrix Manifolds

    P.-A. Absil, Robert Mahony, and Rodolphe Sepulchre.Optimization Algorithms on Matrix Manifolds. Princeton University Press, 2008

  2. [2]

    Natural gradient works efficiently in learning.Neural Computation, 10(2):251–276, 1998

    Shun-Ichi Amari. Natural gradient works efficiently in learning.Neural Computation, 10(2):251–276, 1998

  3. [3]

    Noah Amsel, David Persson, Christopher Musco, and Robert M. Gower. The polar express: Opti- mal matrix sign methods and their application to the Muon algorithm. InInternational Conference on Learning Representations (ICLR), 2026. URLhttps://openreview.net/forum?id=yRtgZ1K8hO. Outstanding Paper Award; arXiv:2505.16932

  4. [4]

    Geometric means in a novel vector space structure on symmetric positive-definite matrices.SIAM Journal on Matrix Analysis and Applications, 29(1):328–347, 2007

    Vincent Arsigny, Pierre Fillard, Xavier Pennec, and Nicholas Ayache. Geometric means in a novel vector space structure on symmetric positive-definite matrices.SIAM Journal on Matrix Analysis and Applications, 29(1):328–347, 2007

  5. [5]

    Online identification and tracking of subspaces from highly incomplete information

    Laura Balzano, Robert Nowak, and Benjamin Recht. Online identification and tracking of subspaces from highly incomplete information. In48th Annual Allerton Conference on Communication, Control, and Computing, pages 704–711, 2010

  6. [6]

    Old optimizer, new norm: An anthology.arXiv preprint arXiv:2409.20325, 2024

    Jeremy Bernstein and Laker Newhouse. Old optimizer, new norm: An anthology.arXiv preprint arXiv:2409.20325, 2024

  7. [7]

    Modular duality in deep learning.arXiv preprint arXiv:2410.21265, 2024

    Jeremy Bernstein and Laker Newhouse. Modular duality in deep learning. InInternational Conference on Machine Learning (ICML), pages 3920–3930, 2025. arXiv:2410.21265

  8. [8]

    Princeton Series in Applied Mathematics

    Rajendra Bhatia.Positive Definite Matrices. Princeton Series in Applied Mathematics. Princeton University Press, 2007

  9. [9]

    On the Bures–Wasserstein distance between positive definite matrices.Expositiones Mathematicae, 37(2):165–191, 2019

    Rajendra Bhatia, Tanvi Jain, and Yongdo Lim. On the Bures–Wasserstein distance between positive definite matrices.Expositiones Mathematicae, 37(2):165–191, 2019. 10 Y. LI, B. L. PANDEY, R. SAH, A. HAN, C. MOSTAJERAN, P. JA W ANPURIA, AND B. MISHRA

  10. [10]

    Lora meets riemannion: Muon optimizer for parametrization-independent low-rank adapters.arXiv preprint arXiv:2507.12142, 2025

    Vladimir Bogachev, Vladimir Aletov, Alexander Molozhavenko, Denis Bobkov, Vera Soboleva, Aibek Alanov, and Maxim Rakhuba. LoRA meets Riemannion: Muon optimizer for parametrization- independent low-rank adapters. InInternational Conference on Learning Representations (ICLR), 2026. URLhttps://openreview.net/forum?id=WtbXgc9GVA. arXiv:2507.12142

  11. [11]

    Cambridge University Press, 2023

    Nicolas Boumal.An Introduction to Optimization on Smooth Manifolds. Cambridge University Press, 2023

  12. [12]

    Exact matrix completion via convex optimization.Communi- cations of the ACM, 55(6):111–119, 2012

    Emmanuel Candes and Benjamin Recht. Exact matrix completion via convex optimization.Communi- cations of the ACM, 55(6):111–119, 2012

  13. [13]

    Preconditioned spec- tral descent for deep learning

    David Carlson, Edo Collins, Ya-Ping Hsieh, Lawrence Carin, and Volkan Cevher. Preconditioned spec- tral descent for deep learning. InAdvances in Neural Information Processing Systems (NeurIPS), volume 28. Curran Associates, Inc., 2015. URLhttps://proceedings.neurips.cc/paper_files/ paper/2015/file/f50a6c02a3fc5a3a5d4d9391f05f3efc-Paper.pdf

  14. [14]

    Muon optimizes under spectral norm constraints.Transac- tions on Machine Learning Research, 2026

    Lizhang Chen, Jonathan Li, and qiang liu. Muon optimizes under spectral norm constraints.Transac- tions on Machine Learning Research, 2026. ISSN 2835-8856. URLhttps://openreview.net/forum? id=Blz4hjxLwU

  15. [15]

    Michael Crawshaw, Chirag Modi, Mingrui Liu, and Robert M. Gower. An exploration of non-Euclidean gradient descent: Muon and its many variants.arXiv preprint arXiv:2510.09827, 2025

  16. [16]

    Arias, and Steven T

    Alan Edelman, Tomás A. Arias, and Steven T. Smith. The geometry of algorithms with orthogonality constraints.SIAM Journal on Matrix Analysis and Applications, 20(2):303–353, 1998

  17. [17]

    Mano: Restriking manifold optimization for llm training.arXiv preprint arXiv:2601.23000, 2026

    Yufei Gu and Zeke Xie. Mano: Restriking manifold optimization for LLM training.arXiv preprint arXiv:2601.23000, 2026

  18. [18]

    Grassmann discriminant analysis: a unifying view on subspace-based learning

    Jihun Hamm and Daniel D Lee. Grassmann discriminant analysis: a unifying view on subspace-based learning. InProceedings of the 25th international conference on Machine learning, pages 376–383, 2008

  19. [19]

    The movielens datasets: History and context.Acm transac- tions on interactive intelligent systems (tiis), 5(4):1–19, 2015

    F Maxwell Harper and Joseph A Konstan. The movielens datasets: History and context.Acm transac- tions on interactive intelligent systems (tiis), 5(4):1–19, 2015

  20. [20]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016

  21. [21]

    Higham.Functions of Matrices: Theory and Computation

    Nicholas J. Higham.Functions of Matrices: Theory and Computation. SIAM, 2008

  22. [22]

    Horn and Charles R

    Roger A. Horn and Charles R. Johnson.Matrix Analysis. Cambridge University Press, 2nd edition, 2012

  23. [23]

    Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen

    Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. InInternational Conference on Learning Representations (ICLR), 2022

  24. [24]

    A Riemannian network for SPD matrix learning

    Zhiwu Huang and Luc Van Gool. A Riemannian network for SPD matrix learning. InProceedings of the AAAI conference on artificial intelligence, volume 31, 2017

  25. [25]

    Projection metric learning on Grassmann manifold with application to video based face recognition

    Zhiwu Huang, Ruiping Wang, Shiguang Shan, and Xilin Chen. Projection metric learning on Grassmann manifold with application to video based face recognition. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 140–149, 2015

  26. [26]

    Stabilizing native low-rank LLM pretraining

    Paul Janson, Edouard Oyallon, and Eugene Belilovsky. Stabilizing native low-rank LLM pretraining. arXiv preprint arXiv:2602.12429, 2026

  27. [27]

    Muon: An optimizer for hidden layers in neural networks.https://kellerjordan

    Keller Jordan. Muon: An optimizer for hidden layers in neural networks.https://kellerjordan. github.io/posts/muon/, 2024

  28. [28]

    Absil, and Rodolphe Sepulchre

    Michel Journée, Francis Bach, P.-A. Absil, and Rodolphe Sepulchre. Low-rank optimization on the cone of positive semidefinite matrices.SIAM Journal on Optimization, 20(5):2327–2351, 2010

  29. [29]

    Uniform spectral growth and convergence of Muon in LoRA-style matrix factorization.arXiv preprint arXiv:2602.06385, 2026

    Changmin Kang, Jihun Yun, Baekrok Shin, Yeseul Cho, and Chulhee Yun. Uniform spectral growth and convergence of Muon in LoRA-style matrix factorization.arXiv preprint arXiv:2602.06385, 2026

  30. [30]

    Matrix completion from a few entries

    Raghunandan H Keshavan, Andrea Montanari, and Sewoong Oh. Matrix completion from a few entries. IEEE transactions on information theory, 56(6):2980–2998, 2010

  31. [31]

    Convergence of Muon with Newton-Schulz.arXiv preprint arXiv:2601.19156, 2026

    Gyu Yeol Kim and Min-hwan Oh. Convergence of Muon with Newton–Schulz. InInternational Conference on Learning Representations (ICLR), 2026. URLhttps://openreview.net/forum?id= lJSfxtLpLm. arXiv:2601.19156

  32. [32]

    Matrix factorization techniques for recommender systems.Computer, 42(8):30–37, 2009

    Yehuda Koren, Robert Bell, and Chris Volinsky. Matrix factorization techniques for recommender systems.Computer, 42(8):30–37, 2009

  33. [33]

    Learning multiple layers of features from tiny images

    Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009. INTRINSIC MUON: SPECTRAL OPTIMIZATION ON RIEMANNIAN MATRIX MANIFOLDS 11

  34. [34]

    Scalable optimization in the modular norm

    Tim Large, Yang Liu, Minyoung Huh, Hyojin Bahng, Phillip Isola, and Jeremy Bernstein. Scalable optimization in the modular norm. InAdvances in Neural Information Processing Systems (NeurIPS),

  35. [35]

    Awell-conditionedestimatorforlarge-dimensionalcovariancematrices

    OlivierLedoitandMichaelWolf. Awell-conditionedestimatorforlarge-dimensionalcovariancematrices. Journal of multivariate analysis, 88(2):365–411, 2004

  36. [36]

    Jiaxiang Li and Mingyi Hong

    Jiaxiang Li and Mingyi Hong. A note on the convergence of Muon.arXiv preprint arXiv:2502.02900, 2025

  37. [37]

    Preconditioning benefits of spectral orthogonalization in Muon.arXiv preprint arXiv:2601.13474, 2026

    Jianhao Ma, Yu Huang, Yuejie Chi, and Yuxin Chen. Preconditioning benefits of spectral orthogonal- ization in Muon.arXiv preprint arXiv:2601.13474, 2026

  38. [38]

    Wasserstein Riemannian geometry of Gaussian densities.Information Geometry, 1(2):137–179, 2018

    Luigi Malagò, Luigi Montrucchio, and Giovanni Pistone. Wasserstein Riemannian geometry of Gaussian densities.Information Geometry, 1(2):137–179, 2018

  39. [39]

    Riemannian preconditioning.SIAM Journal on Optimization, 26(1):635–660, 2016

    Bamdev Mishra and Rodolphe Sepulchre. Riemannian preconditioning.SIAM Journal on Optimization, 26(1):635–660, 2016

  40. [40]

    A Riemannian geometry for low-rank matrix completion.arXiv preprint arXiv:1211.1550, 2012

    Bamdev Mishra, K Aditya Apuroop, and Rodolphe Sepulchre. A Riemannian geometry for low-rank matrix completion.arXiv preprint arXiv:1211.1550, 2012

  41. [41]

    Fixed-rank matrix factor- izations and Riemannian low-rank optimization.Computational Statistics, 29(3–4):591–621, 2014

    Bamdev Mishra, Gilles Meyer, Silvère Bonnabel, and Rodolphe Sepulchre. Fixed-rank matrix factor- izations and Riemannian low-rank optimization.Computational Statistics, 29(3–4):591–621, 2014

  42. [42]

    Parameter and memory efficient pretraining via low-rank Riemannian optimization

    Zhanfeng Mo, Long-Kai Huang, and Sinno Jialin Pan. Parameter and memory efficient pretraining via low-rank Riemannian optimization. InInternational Conference on Learning Representations (ICLR),

  43. [43]

    URLhttps://openreview.net/forum?id=i0zzO7Hslk

  44. [44]

    Hewa Koneputugodage, Shamane Siriwardhana, Violetta Shevchenko, Karol Pajak, James Snewin, Gil Avra- ham, and Alexander Long

    Hadi Mohaghegh Dolatabadi, Thalaiyasingam Ajanthan, Sameera Ramasinghe, Chamin P. Hewa Koneputugodage, Shamane Siriwardhana, Violetta Shevchenko, Karol Pajak, James Snewin, Gil Avra- ham, and Alexander Long. NuMuon: Nuclear-norm-constrained Muon for compressible LLM training. arXiv preprint arXiv:2603.03597, 2026

  45. [45]

    Stochastic conditional gradient methods: From convex minimization to submodular maximization.Journal of Machine Learning Research, 21(105): 1–49, 2020

    Aryan Mokhtari, Hamed Hassani, and Amin Karbasi. Stochastic conditional gradient methods: From convex minimization to submodular maximization.Journal of Machine Learning Research, 21(105): 1–49, 2020

  46. [46]

    The E2E dataset: New challenges for end-to- end generation

    Jekaterina Novikova, Ondřej Dušek, and Verena Rieser. The E2E dataset: New challenges for end-to- end generation. InProceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, pages 201–206, 2017

  47. [47]

    Riemannian optimization for LoRA on the Stiefel manifold

    JuneYoungPark, MinjaeKang, SeongbaeLee, HaegangLee, SeongwanKim, andJaehoLee. Riemannian optimization for LoRA on the Stiefel manifold. InFindings of the Association for Computational Linguistics: EMNLP 2025, pages 20971–20985, 2025. arXiv:2508.17901

  48. [48]

    A Riemannian framework for tensor computing

    Xavier Pennec, Pierre Fillard, and Nicholas Ayache. A Riemannian framework for tensor computing. International Journal of computer vision, 66(1):41–66, 2006

  49. [49]

    Training deep learning models with norm-constrained lmos.arXiv preprint arXiv:2502.07529, 2025

    Thomas Pethick, Wanyun Xie, Kimon Antonakopoulos, Zhenyu Zhu, Antonio Silveti-Falls, and Volkan Cevher. Training deep learning models with norm-constrained LMOs. InInternational Conference on Machine Learning (ICML), 2025. arXiv:2502.07529

  50. [50]

    A simpler approach to matrix completion.Journal of Machine Learning Research, 12 (12), 2011

    Benjamin Recht. A simpler approach to matrix completion.Journal of Machine Learning Research, 12 (12), 2011

  51. [51]

    Gluon: Making Muon & Scion Great Again! (Bridging Theory and Practice of lmo-based Optimizers for LLMs)

    Artem Riabinin, Egor Shulgin, Kaja Gruntkowska, and Peter Richtárik. Gluon: Making Muon & Scion great again! (bridging theory and practice of LMO-based optimizers for LLMs).arXiv preprint arXiv:2505.13416, 2025

  52. [52]

    arXiv preprint arXiv:2507.01598 , year =

    Naoki Sato, Hiroki Naganuma, and Hideaki Iiduka. Convergence bound and critical batch size of Muon optimizer.arXiv preprint arXiv:2507.01598, 2025

  53. [53]

    A geometric framework for momentum-based optimizers for low-rank training

    Steffen Schotthöfer, Timon Klein, and Jonas Kusch. A geometric framework for momentum-based optimizers for low-rank training. InAdvances in Neural Information Processing Systems (NeurIPS),

  54. [54]

    On the Convergence Analysis of Muon

    Wei Shen, Ruichuan Huang, Minhui Huang, Cong Shen, and Jiawei Zhang. On the convergence analysis of Muon.arXiv preprint arXiv:2505.23737, 2025

  55. [55]

    Complexities in projection-free stochastic non-convex minimization

    Zebang Shen, Cong Fang, Peilin Zhao, Junzhou Huang, and Hui Qian. Complexities in projection-free stochastic non-convex minimization. InInternational Conference on Artificial Intelligence and Statistics (AISTATS), pages 2868–2876. PMLR, 2019. 12 Y. LI, B. L. PANDEY, R. SAH, A. HAN, C. MOSTAJERAN, P. JA W ANPURIA, AND B. MISHRA

  56. [56]

    Accelerating ill-conditioned low-rank matrix estimation via scaled gradient descent.Journal of Machine Learning Research, 22(150):1–63, 2021

    Tian Tong, Cong Ma, and Yuejie Chi. Accelerating ill-conditioned low-rank matrix estimation via scaled gradient descent.Journal of Machine Learning Research, 22(150):1–63, 2021

  57. [57]

    Region covariance: A fast descriptor for detection and classification

    Oncel Tuzel, Fatih Porikli, and Peter Meer. Region covariance: A fast descriptor for detection and classification. InEuropean conference on computer vision, pages 589–600. Springer, 2006

  58. [58]

    Pedestrian detection via classification on Riemannian manifolds.IEEE transactions on pattern analysis and machine intelligence, 30(10):1713–1727, 2008

    Oncel Tuzel, Fatih Porikli, and Peter Meer. Pedestrian detection via classification on Riemannian manifolds.IEEE transactions on pattern analysis and machine intelligence, 30(10):1713–1727, 2008

  59. [59]

    Low-rank matrix completion by Riemannian optimization.SIAM Journal on Optimization, 23(2):1214–1236, 2013

    Bart Vandereycken. Low-rank matrix completion by Riemannian optimization.SIAM Journal on Optimization, 23(2):1214–1236, 2013

  60. [60]

    Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. InInternational Conference on Learning Representations (ICLR), 2019. URLhttps://openreview.net/forum?id= rJ4km2R5t7. arXiv:1804.07461

  61. [61]

    Additive margin softmax for face verification

    Feng Wang, Jian Cheng, Weiyang Liu, and Haijun Liu. Additive margin softmax for face verification. IEEE Signal Processing Letters, 25(7):926–930, 2018

  62. [62]

    Taming momentum: Rethinking opti- mizer states through low-rank approximation

    Zhengbo Wang, Jian Liang, Ran He, Zilei Wang, and Tieniu Tan. Taming momentum: Rethinking opti- mizer states through low-rank approximation. InInternational Conference on Learning Representations (ICLR), 2026. Oral; arXiv:2602.24283

  63. [63]

    Riemannian optimization via Frank–Wolfe methods.Mathematical Programming, 199:525–556, 2023

    Melanie Weber and Suvrit Sra. Riemannian optimization via Frank–Wolfe methods.Mathematical Programming, 199:525–556, 2023

  64. [64]

    Fantastic pretraining optimizers and where to find them.arXiv preprint arXiv:2509.02046, 2025

    Kaiyue Wen, David Hall, Tengyu Ma, and Percy Liang. Fantastic pretraining optimizers and where to find them. InInternational Conference on Learning Representations (ICLR), 2026. URLhttps: //openreview.net/forum?id=2J51qUZ0iG. arXiv:2509.02046

  65. [65]

    Face recognition in unconstrained videos with matched back- ground similarity

    Lior Wolf, Tal Hassner, and Itay Maoz. Face recognition in unconstrained videos with matched back- ground similarity. InCVPR 2011, pages 529–534. IEEE, 2011

  66. [66]

    Controlled llm training on spectral sphere.arXiv preprint arXiv: 2601.08393,

    TianXie, HaomingLuo, HaoyuTang, YiwenHu, JasonKleinLiu, QingnanRen, YangWang, WayneXin Zhao, Rui Yan, Bing Su, Chong Luo, and Baining Guo. Controlled LLM training on spectral sphere. arXiv preprint arXiv:2601.08393, 2026

  67. [67]

    Hu, Igor Babuschkin, Szymon Sidor, Xiaodong Liu, David Farhi, Nick Ryder, Jakub Pachocki, Weizhu Chen, and Jianfeng Gao

    Greg Yang, Edward J. Hu, Igor Babuschkin, Szymon Sidor, Xiaodong Liu, David Farhi, Nick Ryder, Jakub Pachocki, Weizhu Chen, and Jianfeng Gao. Tuning large neural networks via zero-shot hyperpa- rameter transfer. InAdvances in Neural Information Processing Systems (NeurIPS), volume 34, 2021. arXiv:2203.03466

  68. [68]

    A spectral condition for feature learning

    Greg Yang, James B. Simon, and Jeremy Bernstein. A spectral condition for feature learning.arXiv preprint arXiv:2310.17813, 2024

  69. [69]

    Manifold constrained steepest descent.arXiv preprint arXiv:2601.21487, 2026

    Kaiwei Yang and Lexiao Lai. Manifold constrained steepest descent.arXiv preprint arXiv:2601.21487, 2026

  70. [70]

    Riemannian preconditioned LoRA for fine-tuning foundation models

    Fangzhao Zhang and Mert Pilanci. Riemannian preconditioned LoRA for fine-tuning foundation models. InInternational Conference on Machine Learning (ICML), 2024. arXiv:2402.02347

  71. [71]

    Zhao et al

    Jiawei Zhao, Zhenyu Zhang, Beidi Chen, Zhangyang Wang, Anima Anandkumar, and Yuandong Tian. GaLore: Memory-efficient LLM training by gradient low-rank projection. InInternational Conference on Machine Learning (ICML), 2024. arXiv:2403.03507. Appendix Appendix A. Broader Societal Impact 13 Appendix B. Limitations 14 Appendix C. Closed-Form LMO Solutions vi...

  72. [72]

    observe that the objective function in feedforward neural networks admits a tighter majorization bound under the Schatten-∞norm than under the Frobenius norm, and derive the corresponding steepest-descent operator, which is precisely the orthogonal polar factorOrtho(·)used by Muon. They further combine this non-Euclidean gradient with element-wise adaptiv...

  73. [73]

    We use the standard CIFAR-100 train/test split and reserve10%of the training set as a validation split for learning-rate selection

    We then build a shrinkage regularized covariance descriptorCi ∈S 32 ++ [24, 35, 57] for each image using covariance shrinkage0.1and diagonal stabilizationε= 10 −4. We use the standard CIFAR-100 train/test split and reserve10%of the training set as a validation split for learning-rate selection. Results. Table 15 reports validation-selected test accuracy f...