Gluon: Making Muon & Scion Great Again! (Bridging Theory and Practice of lmo-based Optimizers for LLMs)

Artem Riabinin, Egor Shulgin, Kaja Gruntkowska, Peter Richtárik · May 2025 · arXiv 2505.13416

8 Pith papers cite this work.

representative citing papers

Local LMO: Constrained Gradient Optimization via a Local Linear Minimization Oracle

math.OC · 2026-05-09 · unverdicted · novelty 8.0

Local LMO is a new projection-free method that achieves the convergence rates of projected gradient descent for constrained optimization by using local linear minimization oracles over small balls.

Phases of Muon: When Muon Eclipses SignSGD

math.OC · 2026-05-10 · unverdicted · novelty 7.0

On power-law covariance least squares problems, SignSVD (Muon) and SignSGD (Adam proxy) show three phases of relative performance depending on data exponent α and target exponent β.

Intrinsic Muon: Spectral Optimization on Riemannian Matrix Manifolds

cs.LG · 2026-05-10 · unverdicted · novelty 7.0

Intrinsic Muon provides closed-form linear maximization oracles on multiple Riemannian matrix manifolds for unitarily invariant norms, with convergence rates depending only on manifold dimension or rank.

Muon with Nesterov Momentum: Heavy-Tailed Noise and (Randomized) Inexact Polar Decomposition

math.OC · 2026-05-07 · unverdicted · novelty 7.0

Muon with Nesterov momentum and inexact polar decomposition achieves optimal convergence rates of O(ε^{-(3α-2)/(α-1)}) under heavy-tailed noise for ε-stationary points in non-convex settings.

Rescaled Asynchronous SGD: Optimal Distributed Optimization under Data and System Heterogeneity

cs.LG · 2026-05-13 · unverdicted · novelty 6.0

Rescaled ASGD recovers convergence to the true global objective by rescaling worker stepsizes proportional to computation times, matching the known time lower bound in the leading term under non-convex smoothness and bounded heterogeneity.

SUDA-Muon: Structural Design Principles and Boundaries for Fully Decentralized Muon

math.OC · 2026-04-27 · unverdicted · novelty 6.0

SUDA-Muon modularizes decentralized Muon via the SUDA template, proving a topology-separated convergence rate of O((1+σ/√N)K^{-1/4}) in nuclear-norm geometry while establishing that tracking-before-polarization is required to avoid non-stationary fixed points and that local-polarize-then-average is

Constrained Stochastic Spectral Preconditioning Converges for Nonconvex Objectives

math.OC · 2026-05-12 · unverdicted · novelty 5.0

Proximal stochastic spectral preconditioning converges for nonconvex constrained objectives under heavy-tailed noise, with a variance-reduced version achieving faster rates and a refined analysis of Muon iterations.

Communication-Efficient Gluon in Federated Learning

cs.LG · 2026-04-12 · unverdicted · novelty 5.0

Compressed Gluon variants using unbiased/contraction compressors and SARAH-style variance reduction achieve convergence guarantees and lower communication costs in federated learning under layer-wise smoothness.

citing papers explorer

Showing 8 of 8 citing papers.

Local LMO: Constrained Gradient Optimization via a Local Linear Minimization Oracle

math.OC · 2026-05-09 · unverdicted · none · ref 27

Phases of Muon: When Muon Eclipses SignSGD

math.OC · 2026-05-10 · unverdicted · none · ref 63

Intrinsic Muon: Spectral Optimization on Riemannian Matrix Manifolds

cs.LG · 2026-05-10 · unverdicted · none · ref 51

Muon with Nesterov Momentum: Heavy-Tailed Noise and (Randomized) Inexact Polar Decomposition

math.OC · 2026-05-07 · unverdicted · none · ref 43

Rescaled Asynchronous SGD: Optimal Distributed Optimization under Data and System Heterogeneity

cs.LG · 2026-05-13 · unverdicted · none · ref 59

SUDA-Muon: Structural Design Principles and Boundaries for Fully Decentralized Muon

math.OC · 2026-04-27 · unverdicted · none · ref 20

Constrained Stochastic Spectral Preconditioning Converges for Nonconvex Objectives

math.OC · 2026-05-12 · unverdicted · none · ref 56

Communication-Efficient Gluon in Federated Learning

cs.LG · 2026-04-12 · unverdicted · none · ref 32
