{"total":12,"items":[{"citing_arxiv_id":"2605.23871","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Move on Muon : A Hamiltonian probability gradient flow perspective of Muon optimizer","primary_cat":"stat.ML","submitted_at":"2026-05-22T17:28:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Regularized Muon induces a damped Hamiltonian flow on probability measures over matrix parameters, yielding exponential convergence under gradient dominance assumptions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18106","ref_index":26,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Symmetry-Compatible Principle for Optimizer Design: Embeddings, LM Heads, SwiGLU MLPs, and MoE Routers","primary_cat":"math.OC","submitted_at":"2026-05-18T09:17:26+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"IEEE Journal of Selected Topics in Signal Processing, 10(2):296-311, 2016. [24] D. Chang, Y. Liu, and G. Yuan. On the convergence of Muon and beyond. arXiv preprint arXiv:2509.15816, 2025. [25] D. Chang, Q. Shi, L. Zhang, Y. Li, R. Zhang, Y. Lu, Y. Liu, and G. Yuan. MuonEq: Balancing before orthogonalization with lightweight equilibration. arXiv preprint arXiv:2603.28254, 2026. [26] L. Chen, J. Li, and Q. Liu. Muon optimizes under spectral norm constraints. arXiv preprint arXiv:2506.15054, 2025. [27] Y. Chen, Y. Chi, J. Fan, and C. Ma. Spectral methods for data science: A statistical perspective. Foundations and Trends® in Machine Learning, 14(5):566-806, 2021. [28] M. Crawshaw, C. Modi, M. Liu, and R. M. Gower. 
An exploration of non-Euclidean gradient descent:"},{"citing_arxiv_id":"2605.13079","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Spectral Flattening Is All Muon Needs: How Orthogonalization Controls Learning Rate and Convergence","primary_cat":"cs.LG","submitted_at":"2026-05-13T06:54:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Muon achieves faster convergence and larger stable learning rates by flattening the singular value spectrum of the momentum buffer through orthogonalization, scaling step size with average rather than maximum singular values.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12492","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Pion: A Spectrum-Preserving Optimizer via Orthogonal Equivalence Transformation","primary_cat":"cs.LG","submitted_at":"2026-05-12T17:59:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Pion is an optimizer that preserves the singular values of weight matrices in LLM training by applying orthogonal equivalence transformations.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[13] David Carlson, Volkan Cevher, and Lawrence Carin. Stochastic spectral descent for restricted boltzmann machines. In AISTATS, 2015. 9 [14] David Carlson, Ya-Ping Hsieh, Edo Collins, Lawrence Carin, and Volkan Cevher. Stochastic spectral descent for discrete graphical models. IEEE Journal of Selected Topics in Signal Processing, 10(2):296-311, 2016. 9 [15] Lizhang Chen, Jonathan Li, and Qiang Liu. Muon optimizes under spectral norm constraints. arXiv preprint arXiv:2506.15054, 2025. 
9 [16] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code."},{"citing_arxiv_id":"2605.08949","ref_index":30,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Muon-OGD: Muon-based Spectral Orthogonal Gradient Projection for LLM Continual Learning","primary_cat":"cs.LG","submitted_at":"2026-05-09T13:42:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Muon-OGD introduces a spectral-norm constrained orthogonal projection method solved via dual iterations and Newton-Schulz approximations to improve stability-plasticity trade-off in sequential LLM adaptation.","context_count":1,"top_context_role":"background","top_context_polarity":"support","context_text":"[28] Xiao Wang, Tianze Chen, Qiming Ge, Han Xia, Rong Bao, Rui Zheng, Qi Zhang, Tao Gui, and Xuan-Jing Huang. Orthogonal subspace learning for language model continual learning. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 10658-10671, 2023. [29] Jeremy Bernstein and Laker Newhouse. Modular duality in deep learning. arXiv preprint arXiv:2410.21265, 2024. [30] Lizhang Chen, Jonathan Li, and Qiang Liu. Muon optimizes under spectral norm constraints. arXiv preprint arXiv:2506.15054, 2025. [31] Mehrdad Farajtabar, Navid Azizan, Alex Mott, and Ang Li. Orthogonal gradient descent for continual learning. 
In Silvia Chiappa and Roberto Calandra, editors, Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 of"},{"citing_arxiv_id":"2605.06615","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"When and Why SignSGD Outperforms SGD: A Theoretical Study Based on $\\ell_1$-norm Lower Bounds","primary_cat":"cs.LG","submitted_at":"2026-05-07T17:32:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"SignSGD provably beats SGD by a factor of d under sparse noise via matched ℓ1-norm upper and lower bounds, with an equivalent result for Muon on matrices, and this predicts faster GPT-2 pretraining.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"allocate the curvature across coordinates. Optimizing this allocation yields the desired lower bound under Assumption 3a. For the deterministic term in (24), we choose a highly imbalanced smoothness vector, for example $L_1 \\ge \\|L\\|_1/2$ and $\\|L\\|_1 = L_\\infty$. Then $\\|L\\|_\\infty = \\Omega(L_\\infty)$, and the coordinate-wise lower bound gives $$\\mathbb{E}\\left[\\min_{t\\in[T]} \\|\\nabla f(x_t)\\|_1\\right] = \\Omega\\left(\\sqrt{\\frac{d L_\\infty \\Delta}{T}}\\right). \\tag{6}$$ This shows that, in the worst case, SGD suffers an additional factor d even in the noiseless part of the complexity. For the stochastic term in (24), we allocate the curvature according to the noise profile by setting $L_i = \\left(\\sigma_i^2/\\|\\sigma\\|_2^2\\right) L_\\infty$, $i\\in[d]$, so that $\\sum_{i=1}^{d} L_i = L_\\infty$ and it yields that $$\\mathbb{E}\\left[\\min_{t\\in[T]} \\|\\nabla f(x_t)\\|_1\\right] = \\Omega\\left(\\left(\\frac{d \\|\\sigma\\|_2^2 L_\\infty \\Delta}{T}\\right)^{1/4}\\right). \\tag{7}$$ 
The two constructions above give two valid ℓ∞-smooth hard instances satisfying the same global"},{"citing_arxiv_id":"2604.02505","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Optimal Projection-Free Adaptive SGD for Matrix Optimization","primary_cat":"math.OC","submitted_at":"2026-04-02T21:02:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Proving stability of Leon's preconditioner enables the first tuning-free Nesterov-accelerated projection-free adaptive SGD variant with improved non-smooth non-convex rates.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"(2) to the modified objective function $f_k(x)\\colon X \\to \\mathbb{R}$, which is defined as follows: $$f_k(x) = \\alpha_k^2 f\\left(x/\\alpha_k + (1 - 1/\\alpha_k)x_k\\right), \\quad \\text{where } \\alpha_k \\ge 1,\\ x_k \\in X. \\tag{16}$$ Consequently, using our previously obtained Theorem 2, we obtain the following Lemma 5. Lemma 5 (↓). Let $\\eta = R$ and $\\delta > 0$. Then, for all $x \\in Q_R$, the following inequality holds: $$\\mathbb{E}\\left[\\sum_{k=0}^{K} B_{f_k}(x_k; x_{k+1}) + f_k(x_{k+1}) - f_k(x)\\right] \\le \\mathbb{E}\\left[\\delta R \\dim(X) + 4R\\left\\|\\sqrt{S_{K+1}}\\right\\|_{\\mathrm{tr}}\\right]. \\tag{17}$$ The reference point $x_k$ is updated using line 7, which allows us to obtain the following Lemma 6. Lemma 6 (↓). Let $\\alpha_0 = 1$ and $\\alpha_k^2 \\le \\alpha_k + \\alpha_{k-1}^2$ for all $k \\in \\mathbb{N}$. Let $x^* \\in Q_R$ be a solution to problem (15). Then, the following inequality holds: $$\\alpha_K^2\\left[f(x_{K+1}) - f^*\\right] \\le \\sum_{k=0}^{K}\\left[f_k(x_{k+1}) - f_k(x^*)\\right]. \\tag{18}$$ 
Furthermore, we make another important modification to the algorithm."},{"citing_arxiv_id":"2603.28254","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MuonEq: Balancing Before Orthogonalization with Lightweight Equilibration","primary_cat":"cs.LG","submitted_at":"2026-03-30T10:28:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MuonEq introduces pre-orthogonalization equilibration schemes that improve Muon optimizer performance during large language model pretraining.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.26554","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Sharp Capacity Scaling of Spectral Optimizers in Learning Associative Memory","primary_cat":"cs.LG","submitted_at":"2026-03-27T16:13:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Muon achieves higher storage capacity than SGD and matches Newton's method in one-step recovery rates for associative memory under power-law distributions, while saturating at larger critical batch sizes and showing faster initial multi-step dynamics.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.10067","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"HTMuon: Improving Muon via Heavy-Tailed Spectral Correction","primary_cat":"cs.LG","submitted_at":"2026-03-10T02:12:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"HTMuon modifies Muon to produce heavier-tailed updates and weight spectra via HT-SR theory, yielding up to 0.98 lower perplexity on LLaMA pretraining and serving as a plug-in for other Muon 
variants.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.15816","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"On the Convergence of Muon and Beyond","primary_cat":"cs.LG","submitted_at":"2025-09-19T09:43:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Muon-MVR2 attains the optimal anytime convergence rate of ~O(T^{-1/3}) in stochastic non-convex settings under horizon-free schedules.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.11983","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Low-rank Orthogonalization for Large-scale Matrix Optimization with Applications to Foundation Model Training","primary_cat":"cs.LG","submitted_at":"2025-09-15T14:28:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Proposes low-rank orthogonalization and derives low-rank Muon and MSGD variants that outperform standard Muon on GPT-2 and LLaMA pretraining while providing iteration complexity bounds.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}