DeMuon: A Decentralized Muon for Matrix Optimization over Graphs

Chuan He; Erik G. Larsson; Jingwei Mao; Shuyi Ren

arxiv: 2510.01377 · v2 · pith:5PMEUGTMnew · submitted 2025-10-01 · 🧮 math.OC · cs.AI · cs.LG · cs.MA · cs.SY · eess.SY


Chuan He, Shuyi Ren, Jingwei Mao, Erik G. Larsson

classification 🧮 math.OC · cs.AI · cs.LG · cs.MA · cs.SY · eess.SY

keywords demuon · decentralized · complexity · graphs · matrix · optimization · algorithms · centralized


In this paper, we propose DeMuon, a method for decentralized matrix optimization over a given communication topology. DeMuon incorporates matrix orthogonalization via Newton-Schulz iterations (a technique inherited from its centralized predecessor, Muon) and employs gradient tracking to mitigate heterogeneity among local functions. Under heavy-tailed noise conditions and additional mild assumptions, we establish the iteration complexity of DeMuon for reaching an approximate stochastic stationary point. This complexity result matches the best-known complexity bounds of centralized algorithms in terms of dependence on the target tolerance. To the best of our knowledge, DeMuon is the first direct extension of Muon to decentralized optimization over graphs with provable complexity guarantees. We conduct preliminary numerical experiments on decentralized transformer pretraining over graphs with varying degrees of connectivity. Our numerical results demonstrate a clear margin of improvement of DeMuon over other popular decentralized algorithms across different network topologies.
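
The two building blocks named in the abstract can be illustrated with a minimal NumPy sketch. It is not the paper's algorithm: the function names, the classical cubic Newton-Schulz polynomial, the iteration count, and the array layout are all assumptions made for illustration, and the abstract does not specify how DeMuon combines these steps. The first function approximates the orthogonal polar factor of a matrix; the second is the generic gradient-tracking recursion over a doubly stochastic mixing matrix.

```python
import numpy as np


def newton_schulz_orthogonalize(G, num_iters=30):
    """Approximate the orthogonal polar factor U V^T of G = U S V^T.

    Uses the classical cubic Newton-Schulz map X <- 1.5 X - 0.5 X X^T X
    (an assumption; other polynomial variants exist). Dividing by the
    Frobenius norm puts all singular values in (0, 1], inside the
    convergence region (0, sqrt(3)) of this map.
    """
    X = G / (np.linalg.norm(G) + 1e-12)  # Frobenius norm for 2-D input
    for _ in range(num_iters):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X


def gradient_tracking_update(W, Y, G_new, G_old):
    """One generic gradient-tracking step: Y <- W Y + G_new - G_old.

    W     : (n, n) doubly stochastic mixing matrix of the graph.
    Y     : (n, m, d) stacked tracking variables, one matrix per node.
    G_new : (n, m, d) current local gradient estimates.
    G_old : (n, m, d) previous local gradient estimates.
    """
    mixed = np.einsum("ij,jmd->imd", W, Y)  # each node averages neighbors
    return mixed + G_new - G_old


# Hypothetical example input: a small full-column-rank matrix.
G_example = np.array([[2.0, 0.0, 1.0],
                      [0.0, 3.0, 0.0],
                      [1.0, 0.0, 2.0],
                      [0.0, 1.0, 1.0]])
Q_example = newton_schulz_orthogonalize(G_example)
```

If the tracking variables are initialized to the initial local gradients and W is doubly stochastic, the network average of Y stays equal to the network average of the current local gradients, which is why tracking is a common remedy for heterogeneity among local functions.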


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SUDA-Muon: Structural Design Principles and Boundaries for Fully Decentralized Muon
math.OC 2026-04 unverdicted novelty 6.0

SUDA-Muon modularizes decentralized Muon via the SUDA template, proving a topology-separated convergence rate of O((1+σ/√N)K^{-1/4}) in nuclear-norm geometry while establishing that tracking-before-polarization is req...