Understanding gradient orthogonalization for deep learning via non-euclidean trust-region optimization

Dmitry Kovalev · 2025 · arXiv 2503.12645

8 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Muon is Not That Special: Random or Inverted Spectra Work Just as Well

cs.LG · 2026-05-11 · unverdicted · novelty 7.0

Muon succeeds by guaranteeing local step-size optimality rather than by tracking any ideal global geometry, as random-spectrum and quasi-norm variants match its performance on language models.

Muon with Nesterov Momentum: Heavy-Tailed Noise and (Randomized) Inexact Polar Decomposition

math.OC · 2026-05-07 · unverdicted · novelty 7.0

Muon with Nesterov momentum and inexact polar decomposition achieves optimal convergence rates of O(ε^(-(3α-2)/(α-1))) under heavy-tailed noise for ε-stationary points in non-convex settings.

Spectral Flattening Is All Muon Needs: How Orthogonalization Controls Learning Rate and Convergence

cs.LG · 2026-05-13 · unverdicted · novelty 6.0

Muon achieves faster convergence and larger stable learning rates by flattening the singular value spectrum of the momentum buffer through orthogonalization, scaling step size with average rather than maximum singular values.

Dimension-Free Saddle-Point Escape in Muon

cs.LG · 2026-05-10 · unverdicted · novelty 6.0

Muon achieves dimension-free saddle-point escape through non-linear spectral shaping, resolvent calculus, and structural incoherence, yielding an algebraically dimension-free escape bound.

Optimal Projection-Free Adaptive SGD for Matrix Optimization

math.OC · 2026-04-02 · unverdicted · novelty 6.0

Proving stability of Leon's preconditioner enables the first tuning-free Nesterov-accelerated projection-free adaptive SGD variant with improved non-smooth non-convex rates.

Constrained Stochastic Spectral Preconditioning Converges for Nonconvex Objectives

math.OC · 2026-05-12 · unverdicted · novelty 5.0

Proximal stochastic spectral preconditioning converges for nonconvex constrained objectives under heavy-tailed noise, with a variance-reduced version achieving faster rates and a refined analysis of Muon iterations.

Communication-Efficient Gluon in Federated Learning

cs.LG · 2026-04-12 · unverdicted · novelty 5.0

Compressed Gluon variants using unbiased/contraction compressors and SARAH-style variance reduction achieve convergence guarantees and lower communication costs in federated learning under layer-wise smoothness.

Can Muon Fine-tune Adam-Pretrained Models?

cs.LG · 2026-05-11 · unverdicted · novelty 4.0

Constraining fine-tuning updates with LoRA mitigates performance degradation when switching from Adam to Muon on pretrained models.

citing papers explorer

Muon is Not That Special: Random or Inverted Spectra Work Just as Well · cs.LG · 2026-05-11 · unverdicted · none · ref 40
Muon with Nesterov Momentum: Heavy-Tailed Noise and (Randomized) Inexact Polar Decomposition · math.OC · 2026-05-07 · unverdicted · none · ref 26
Spectral Flattening Is All Muon Needs: How Orthogonalization Controls Learning Rate and Convergence · cs.LG · 2026-05-13 · unverdicted · none · ref 7
Dimension-Free Saddle-Point Escape in Muon · cs.LG · 2026-05-10 · unverdicted · none · ref 19
Optimal Projection-Free Adaptive SGD for Matrix Optimization · math.OC · 2026-04-02 · unverdicted · none · ref 16
Constrained Stochastic Spectral Preconditioning Converges for Nonconvex Objectives · math.OC · 2026-05-12 · unverdicted · none · ref 30
Communication-Efficient Gluon in Federated Learning · cs.LG · 2026-04-12 · unverdicted · none · ref 19
Can Muon Fine-tune Adam-Pretrained Models? · cs.LG · 2026-05-11 · unverdicted · none · ref 25
