{"total":16,"items":[{"citing_arxiv_id":"2606.08783","ref_index":191,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"OptMuon: Closed-Loop Orthogonalized Momentum Methods for Stochastic Optimization with Zero-Noise Optimality","primary_cat":"math.OC","submitted_at":"2026-06-07T18:59:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"OptMuon combines orthogonalized momentum with trajectory-dependent AdaGrad-Norm adaptation to obtain expected-stationarity rates of order T^{-1/2} + sigma^{1/2}T^{-1/4} or T^{-1/2} + sigma^{1/3}T^{-1/3} that reduce to near-optimal deterministic first-order rates in the zero-noise regime.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.31371","ref_index":41,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Softsign: Smooth Sign in Your Optimizer For Better Parameter Heterogeneity Handling","primary_cat":"cs.LG","submitted_at":"2026-05-29T14:41:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SoftSignum replaces hard sign with soft-sign in optimizers via temperature control and quantile scheduling, extends to SoftMuon, provides a convergence proof for stochastic non-convex settings, and reports better performance than sign-based methods and AdamW on deep learning tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20866","ref_index":300,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LOSCAR-SGD: Local SGD with Communication-Computation Overlap and Delay-Corrected Sparse Model 
Averaging","primary_cat":"cs.LG","submitted_at":"2026-05-20T08:01:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LOSCAR-SGD combines local updates, sparse model averaging, and communication-computation overlap with a delay-corrected merge rule, providing convergence rates for smooth non-convex objectives under worker heterogeneity.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19619","ref_index":31,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MiMuon: Mixed Muon Optimizer with Improved Generalization for Large Models","primary_cat":"cs.LG","submitted_at":"2026-05-19T09:56:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MiMuon is a hybrid optimizer that achieves a generalization error bound of O(1/N) independent of the small singular-value gap that limits the original Muon bound, while retaining the same O(1/T^{1/4}) convergence rate.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18174","ref_index":299,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Ringmaster LMO: Asynchronous Linear Minimization Oracle Momentum Method","primary_cat":"cs.LG","submitted_at":"2026-05-18T10:18:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Ringmaster LMO extends delay-thresholding from ASGD to LMO-based momentum updates, providing convergence guarantees under (L0, L1)-smoothness and time-complexity bounds that recover optimal rates in the Euclidean 
case.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18106","ref_index":132,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Symmetry-Compatible Principle for Optimizer Design: Embeddings, LM Heads, SwiGLU MLPs, and MoE Routers","primary_cat":"math.OC","submitted_at":"2026-05-18T09:17:26+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Layerwise training itself is not new: classical examples include LARS and LAMB [163, 164], layerwise hyperparameter prescriptions such as those arising in µP [160], and block-coordinate views of neural network training [166, 90]. What is emerging, however, is a more refined class of layerwise optimizers that account for the geometry of each parameter block. Recent methods such as Shampoo [62], Muon [76], SOAP [151], Scion [122], Gluon [132], and PolarGrad [89] can be viewed as part of this trend. The case for geometry-aware optimizer design becomes stronger as foundation models become more heterogeneous. 
Large language models [149, 130, 131, 19], vision transformers [40], multimodal models [129, 18], diffusion language models [110, 55, 119], MoEs [136, 91], and state space models [60, 30, 87] all contain parameter blocks with distinct natural symmetries."},{"citing_arxiv_id":"2605.13434","ref_index":59,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Rescaled Asynchronous SGD: Optimal Distributed Optimization under Data and System Heterogeneity","primary_cat":"cs.LG","submitted_at":"2026-05-13T12:27:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Rescaled ASGD recovers convergence to the true global objective by rescaling worker stepsizes proportional to computation times, matching the known time lower bound in the leading term under non-convex smoothness and bounded heterogeneity.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11850","ref_index":56,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Constrained Stochastic Spectral Preconditioning Converges for Nonconvex Objectives","primary_cat":"math.OC","submitted_at":"2026-05-12T09:36:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Proximal stochastic spectral preconditioning converges for nonconvex constrained objectives under heavy-tailed noise, with a variance-reduced version achieving faster rates and a refined analysis of Muon iterations.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"the LION-K family of algorithms and thus to an optimization problem with constraints on the spectral norm of the parameters, following the more general analysis of [12]. A stochastic Frank-Wolfe analysis for the convergence of Muon with weight decay was also performed in [59]. 
Further extensions to the setting where the cost function is layer-wise (L0, L1)-smooth were provided in [56], while [72] introduced an adaptive method that combines AdaGrad with Muon. In [41], the MARS [69] variance-reduction technique was utilized on top of the Muon algorithm leading to improved convergence rates at the cost of additional samples per iteration and a similar approach was considered in [55]. We remark that significant attention has also been given to improving the computation of the generalized matrix sign operation employed"},{"citing_arxiv_id":"2605.09552","ref_index":63,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Phases of Muon: When Muon Eclipses SignSGD","primary_cat":"math.OC","submitted_at":"2026-05-10T14:11:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"On power-law covariance least squares problems, SignSVD (Muon) and SignSGD (Adam proxy) show three phases of relative performance depending on data exponent α and target exponent β.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 04, 1964. [62] Shikai Qiu, Lechao Xiao, Andrew Gordon Wilson, Jeffrey Pennington, and Atish Agarwala. Scaling collapse reveals universal dynamics in compute-optimally trained neural networks. In International Conference on Machine Learning, pages 50697-50720. PMLR, 2025. [63] Artem Riabinin, Egor Shulgin, Kaja Gruntkowska, and Peter Richtárik. Gluon: Making Muon & Scion Great Again! (Bridging Theory and Practice of LMO-based Optimizers for LLMs). arXiv preprint arXiv:2505.13416, 2025. 
[64] Ishaan Shah, Anthony M Polloreno, Karl Stratos, Philip Monk, Adarsh Chaluvaraju, Andrew Hojel, Andrew Ma, Anil Thomas, Ashish Tanwer, Darsh J Shah, et al."},{"citing_arxiv_id":"2605.09238","ref_index":51,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Intrinsic Muon: Spectral Optimization on Riemannian Matrix Manifolds","primary_cat":"cs.LG","submitted_at":"2026-05-10T00:39:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Intrinsic Muon provides closed-form linear maximization oracles on multiple Riemannian matrix manifolds for unitarily invariant norms, with convergence rates depending only on manifold dimension or rank.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"arXiv:2502.07529. [49] Benjamin Recht. A simpler approach to matrix completion. Journal of Machine Learning Research, 12 (12), 2011. [50] Artem Riabinin, Egor Shulgin, Kaja Gruntkowska, and Peter Richtárik. Gluon: Making Muon & Scion great again! (bridging theory and practice of LMO-based optimizers for LLMs). arXiv preprint arXiv:2505.13416, 2025. [51] Naoki Sato, Hiroki Naganuma, and Hideaki Iiduka. Convergence bound and critical batch size of Muon optimizer. arXiv preprint arXiv:2507.01598, 2025. [52] Steffen Schotthöfer, Timon Klein, and Jonas Kusch. A geometric framework for momentum-based optimizers for low-rank training. In Advances in Neural Information Processing Systems (NeurIPS), 2025. 
arXiv:2506."},{"citing_arxiv_id":"2605.08850","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Local LMO: Constrained Gradient Optimization via a Local Linear Minimization Oracle","primary_cat":"math.OC","submitted_at":"2026-05-09T10:03:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"Local LMO is a new projection-free method that achieves the convergence rates of projected gradient descent for constrained optimization by using local linear minimization oracles over small balls.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"(24) Substituting g = ι_X, we obtain x_{k+1} ∈ argmin_{z ∈ B(x_k, t_k)} {ι_X(z) + ⟨∇f(x_k), z − x_k⟩}. (25) Since ι_X(z) = +∞ outside X, the feasible set in (25) collapses to X ∩ B(x_k, t_k), and hence x_{k+1} ∈ argmin_{z ∈ X ∩ B(x_k, t_k)} ⟨∇f(x_k), z − x_k⟩. (26) Finally, since −⟨∇f(x_k), x_k⟩ is constant with respect to z, (26) is equivalent to x_{k+1} ∈ argmin_{z ∈ X ∩ B(x_k, t_k)} ⟨∇f(x_k), z⟩. (27) But (27) is exactly the Local LMO update. We have therefore proved the following interpretation. Proposition A.1 (Forward-backward-brox interpretation of Local LMO). Consider the constrained optimization problem min_{x ∈ X} f(x), and rewrite it in composite form as min_{x ∈ R^d} f(x) + ι_X(x). 
Then the Local LMO iteration x_{k+1} ∈ argmin_{z ∈ X ∩ B(x_k, t_k)} ⟨∇f(x_k), z⟩"},{"citing_arxiv_id":"2605.06884","ref_index":43,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Muon with Nesterov Momentum: Heavy-Tailed Noise and (Randomized) Inexact Polar Decomposition","primary_cat":"math.OC","submitted_at":"2026-05-07T19:32:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Muon with Nesterov momentum and inexact polar decomposition achieves optimal convergence rates of O(ε^(-(3α-2)/(α-1))) under heavy-tailed noise for ε-stationary points in non-convex settings.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.23980","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SUDA-Muon: Structural Design Principles and Boundaries for Fully Decentralized Muon","primary_cat":"math.OC","submitted_at":"2026-04-27T02:49:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SUDA-Muon modularizes decentralized Muon via the SUDA template, proving a topology-separated convergence rate of O((1+σ/√N)K^{-1/4}) in nuclear-norm geometry while establishing that tracking-before-polarization is required to avoid non-stationary fixed points and that local-polarize-then-average is ","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.10689","ref_index":32,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Communication-Efficient Gluon in Federated Learning","primary_cat":"cs.LG","submitted_at":"2026-04-12T15:30:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Compressed Gluon variants using unbiased/contraction compressors and SARAH-style variance reduction 
achieve convergence guarantees and lower communication costs in federated learning under layer-wise smoothness.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.10777","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Preconditioned Norms: A Unified Framework for Steepest Descent, Quasi-Newton and Adaptive Methods","primary_cat":"cs.LG","submitted_at":"2025-10-12T19:39:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Preconditioned matrix norms unify steepest descent, quasi-Newton, and adaptive optimizers, revealing SGD, Adam, Muon, KL-Shampoo, SOAP, and SPlus as special cases and enabling new methods MuAdam and MuAdam-SANIA that are competitive in experiments.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.11983","ref_index":43,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Low-rank Orthogonalization for Large-scale Matrix Optimization with Applications to Foundation Model Training","primary_cat":"cs.LG","submitted_at":"2025-09-15T14:28:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Proposes low-rank orthogonalization and derives low-rank Muon and MSGD variants that outperform standard Muon on GPT-2 and LLaMA pretraining while providing iteration complexity bounds.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}