Recognition: unknown
SUDA-Muon: Structural Design Principles and Boundaries for Fully Decentralized Muon
Pith reviewed 2026-05-08 02:44 UTC · model grok-4.3
The pith
A unified template separates modular communication choices from non-modular polarization in fully decentralized Muon, delivering a convergence rate whose dominant term carries no explicit graph dependence.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We propose SUDA-Muon, which realizes the separation of modular communication choices from non-modular polarization through a unified primal-dual communication template called SUDA; within this template, ED/D², EXTRA, and gradient tracking become modular backbone choices. We prove a topology-separated non-asymptotic convergence guarantee in the nuclear-norm geometry: the dominant term scales as O((1+σ/√N)K^{-1/4}) and does not explicitly involve graph quantities, identifying the communication backbone as the modular axis in the structural design. We then establish two complementary non-modular boundaries. Internally, tracking-before-polarization is necessary: the natural no-tracking variant can converge to non-stationary fixed points under heterogeneous objectives. Externally, in the absence of a central server, a fully decentralized method cannot perform the federated average-then-polarize update, which is the essential reason linear speedup can fail.
What carries the argument
The SUDA unified primal-dual communication template that treats communication algorithms as modular backbone choices separate from non-modular polarization operations.
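To make the claimed separation concrete, here is a minimal Python sketch of how such a template could be organized, assuming an illustrative interface (the function names, the gradient-tracking-flavored backbone, and the plain-gossip primal mixing are expository assumptions, not the paper's algorithm): the communication backbone is a pluggable correction of the dual/tracking variable, and the non-modular polarization (matrix sign via SVD) is applied only to the corrected direction.

    import numpy as np

    def msign(M):
        # Polarization (non-modular): polar factor / matrix sign, msign(M) = U @ V^T.
        U, _, Vt = np.linalg.svd(M, full_matrices=False)
        return U @ Vt

    def tracking_backbone(Y, grads, prev_grads, W):
        # Modular backbone (gradient-tracking flavor): mix the dual/tracking
        # variables over the graph, then add the local gradient increment.
        # An ED/D^2- or EXTRA-style correction would replace only this function.
        N = len(Y)
        return [sum(W[i][j] * Y[j] for j in range(N)) + (grads[i] - prev_grads[i])
                for i in range(N)]

    def suda_like_step(X, Y, grads, prev_grads, W, lr, backbone=tracking_backbone):
        # One illustrative step: the backbone runs first (modular choice),
        # polarization is applied only to the corrected direction (non-modular).
        N = len(X)
        Y = backbone(Y, grads, prev_grads, W)
        X = [sum(W[i][j] * X[j] for j in range(N)) - lr * msign(Y[i]) for i in range(N)]
        return X, Y

Swapping tracking_backbone for another correction changes only the modular argument; the polarize-after-correction structure is what stays fixed across variants.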
If this is right
- Different communication algorithms become directly comparable inside the same template.
- In near-IID regimes the resulting variants perform similarly.
- In long-horizon non-IID regimes SUDA-Muon reaches higher accuracy and lower loss than DeMuon.
- Absence of a central server prevents the average-then-polarize update that would otherwise enable linear speedup.
- The communication backbone functions as the sole modular axis for structure design.
Where Pith is reading between the lines
- The same modular/non-modular separation could be tested on other nonlinear first-order methods such as decentralized Adam variants.
- Empirical verification on networks with thousands of nodes would check whether the claimed graph independence survives practical message delays.
- The non-modular boundary implies that hybrid centralized-decentralized pipelines may retain an irreducible advantage over pure decentralized ones on heterogeneous data.
Load-bearing premise
The SUDA template successfully isolates modular communication choices from non-modular polarization and requires tracking before polarization to avoid non-stationary fixed points under heterogeneous objectives.
What would settle it
A concrete counter-example in which a fully decentralized Muon variant that performs polarization before tracking converges to a non-stationary point on heterogeneous objectives, or an implementation whose measured rate explicitly depends on graph connectivity measures.
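For intuition only, a scalar toy analogue of the internal boundary (sign in place of the matrix-sign operator, exact averaging in place of gossip; the objectives, step size, and update rules are illustrative assumptions, not the paper's construction) shows how polarize-before-tracking can freeze at a non-stationary point while tracking-before-polarization does not.

    import numpy as np

    # Two nodes with heterogeneous objectives:
    #   f1(x) = (x - 1)^2 / 2  -> f1'(x) = x - 1
    #   f2(x) = (x + 1)^2      -> f2'(x) = 2(x + 1)
    # The averaged objective is stationary only at x* = -1/3.
    g1 = lambda x: x - 1.0
    g2 = lambda x: 2.0 * (x + 1.0)

    def run(tracking, x0=0.5, lr=0.01, steps=4000):
        x = x0
        for _ in range(steps):
            if tracking:
                # Track the average gradient first, then polarize (sign) it.
                d = np.sign(0.5 * (g1(x) + g2(x)))
            else:
                # Polarize each local gradient first, then average the results.
                d = 0.5 * (np.sign(g1(x)) + np.sign(g2(x)))
            x = x - lr * d
        return x

    print("no tracking :", run(False))  # stays at 0.5, which is NOT stationary
    print("tracking    :", run(True))   # settles near x* = -1/3

In this toy, every consensus point in (-1, 1) is a fixed point of the polarize-then-average dynamics, but only x = -1/3 is stationary for the averaged objective; tracking the average gradient before taking the sign removes the spurious fixed points.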
Original abstract
Fully decentralized Muon is difficult because its nonlinear matrix-sign operator does not commute with linear gossip averaging. This makes decentralized Muon a structural design problem: in designing the algorithm, one must distinguish modular components from non-modular ones. We propose SUDA-Muon, which realizes this separation through a unified primal-dual communication template called SUDA; within this template, ED/D², EXTRA, and gradient tracking become modular backbone choices. We prove a topology-separated non-asymptotic convergence guarantee in the nuclear-norm geometry: the dominant term scales as O((1+σ/√N)K^{-1/4}) and does not explicitly involve graph quantities, identifying the communication backbone as the modular axis in the structure design. We then establish two complementary non-modular boundaries. Internally, tracking-before-polarization is necessary: the natural no-tracking variant can converge to non-stationary fixed points under heterogeneous objectives. Externally, in the absence of a central server, a fully decentralized method cannot perform the federated average-then-polarize update; we show that this non-modular local-polarize-then-average design is the essential reason why linear speedup can fail. Experiments on CIFAR-100 and GPT-2 fine-tuning support the same picture: the unified template makes different communication algorithms directly comparable. In mild near-IID regimes, the resulting variants perform similarly, while in the more difficult long-horizon non-IID CIFAR-100 setting, SUDA-Muon achieves higher accuracy and lower loss than DeMuon.
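A quick numerical check of the opening sentence, under the common reading of Muon's polarization as the polar factor msign(M) = UVᵀ from the SVD (this definition, and the numpy sketch below, are assumptions for illustration): averaging polarized matrices is not the same as polarizing the averaged matrix, which is exactly why gossip averaging and the matrix-sign step cannot be freely reordered.

    import numpy as np

    def msign(M):
        # Matrix sign / polar factor: msign(M) = U @ V^T from the reduced SVD.
        U, _, Vt = np.linalg.svd(M, full_matrices=False)
        return U @ Vt

    rng = np.random.default_rng(0)
    A, B = rng.standard_normal((4, 4)), rng.standard_normal((4, 4))

    avg_then_sign = msign(0.5 * (A + B))          # what a central server could compute
    sign_then_avg = 0.5 * (msign(A) + msign(B))   # what local-polarize-then-average yields

    print(np.linalg.norm(avg_then_sign - sign_then_avg))  # strictly positive in general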
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes SUDA-Muon, a framework for fully decentralized Muon that uses a unified primal-dual communication template (SUDA) to modularize backbone choices such as ED/D², EXTRA, and gradient tracking while separating them from non-modular polarization. It claims a topology-separated non-asymptotic convergence guarantee in nuclear-norm geometry whose dominant term is O((1 + σ/√N) K^{-1/4}) and does not explicitly involve graph quantities. Two complementary boundaries are established: tracking-before-polarization is required to avoid non-stationary fixed points under heterogeneous objectives, and fully decentralized methods cannot replicate the federated average-then-polarize update, which explains the absence of linear speedup in certain designs. Experiments on CIFAR-100 and GPT-2 fine-tuning are presented to support the structural claims.
Significance. If the claimed bound is rigorously established without hidden graph dependence, the work supplies concrete design principles for decentralized optimization of non-commuting nonlinear operators. The modular SUDA template that renders different communication backbones directly comparable is a useful contribution, as is the identification of internal and external non-modular boundaries. The non-asymptotic rate and the experimental comparison in non-IID regimes add practical value, provided the mathematical separation is verified.
major comments (2)
- [Abstract / Convergence theorem] Abstract and convergence analysis: the claim that the dominant term O((1 + σ/√N) K^{-1/4}) 'does not explicitly involve graph quantities' is load-bearing for the topology-separated guarantee. The non-commutativity of the matrix-sign operator with linear gossip averaging requires an explicit cancellation identity showing that all mixing-matrix eigenvalues (or Laplacian norms) are absorbed into lower-order remainders; without this identity the leading coefficient may still embed spectral-gap dependence inside the hidden constants of the nuclear-norm recursion.
- [SUDA template definition] Section on the SUDA primal-dual template: the assertion that ED/D², EXTRA, and gradient tracking become interchangeable modular backbones once the template is fixed must be accompanied by a uniform error recursion that isolates the communication choice from the polarization step. The current presentation leaves unclear whether the dual-tracking update exactly cancels the non-stationary terms induced by heterogeneous objectives before the sign operator is applied.
minor comments (2)
- [Notation and experimental setup] The precise definitions of σ and N, as well as the data-exclusion rules used in the CIFAR-100 long-horizon experiments, should be stated explicitly in the main text rather than deferred to the appendix.
- [Experiments] Figure captions for the GPT-2 fine-tuning results should include error bars or standard deviations across runs to allow direct comparison with the claimed accuracy and loss improvements.
Simulated Author's Rebuttal
We thank the referee for the constructive and precise comments, which help clarify the presentation of the topology-separated guarantee and the modularity of the SUDA template. We address each major comment below, indicating the revisions we will make.
point-by-point responses
- Referee: [Abstract / Convergence theorem] Abstract and convergence analysis: the claim that the dominant term O((1 + σ/√N) K^{-1/4}) 'does not explicitly involve graph quantities' is load-bearing for the topology-separated guarantee. The non-commutativity of the matrix-sign operator with linear gossip averaging requires an explicit cancellation identity showing that all mixing-matrix eigenvalues (or Laplacian norms) are absorbed into lower-order remainders; without this identity the leading coefficient may still embed spectral-gap dependence inside the hidden constants of the nuclear-norm recursion.
Authors: We agree that an explicit cancellation identity is necessary to substantiate the topology-separated claim. In the proof of the main convergence theorem, the SUDA template is used to first apply the chosen communication backbone to the dual variable, after which the polarization (matrix-sign) step is performed on the corrected primal variable. This ordering produces a telescoping cancellation in the nuclear-norm recursion: the action of the mixing matrix on the consensus error is multiplied by the dual-tracking residual, which contracts at a rate faster than K^{-1/4} and is therefore absorbed into the O(K^{-1/2}) remainder term. The leading coefficient therefore depends only on σ and N. Nevertheless, the current write-up leaves the identity implicit inside the recursion; we will add a dedicated lemma (new Lemma 4.3) that isolates the eigenvalue bound and shows it enters only the lower-order terms. This constitutes a partial revision focused on exposition. revision: partial
- Referee: [SUDA template definition] Section on the SUDA primal-dual template: the assertion that ED/D², EXTRA, and gradient tracking become interchangeable modular backbones once the template is fixed must be accompanied by a uniform error recursion that isolates the communication choice from the polarization step. The current presentation leaves unclear whether the dual-tracking update exactly cancels the non-stationary terms induced by heterogeneous objectives before the sign operator is applied.
Authors: We thank the referee for this observation. The SUDA template is constructed so that the dual update is identical across backbones and exactly tracks the difference between local and averaged gradients; the chosen backbone (ED/D² correction, EXTRA momentum, or gradient-tracking difference) appears only as an additive term inside the primal update. Because the sign operator is applied after this correction, the non-stationary heterogeneity terms are canceled by the dual variable before polarization. To make the isolation fully rigorous, we will insert a uniform error recursion (new Proposition 4.2) that treats the backbone as a generic linear operator satisfying a standard contraction assumption; the polarization step then appears as a separate Lipschitz factor independent of the backbone choice. This revision will also explicitly verify the cancellation of non-stationary terms prior to the sign application. revision: yes
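Schematically, the two responses above describe a bound and a recursion of the following shape (the notation here is assumed for illustration and is not quoted from the paper: ρ is a graph mixing quantity with constant C(ρ), E_k a consensus/tracking error, λ ∈ [0, 1) the contraction factor of the generic backbone operator, η_k the step size, L_pol the bounded perturbation contributed by the polarized step, and b_k the stochastic-gradient and heterogeneity input):

    \[
      \min_{k \le K} \mathbb{E}\,\bigl\|\nabla f(\bar X_k)\bigr\|_{*}
      \;\lesssim\; \Bigl(1 + \tfrac{\sigma}{\sqrt{N}}\Bigr) K^{-1/4} \;+\; C(\rho)\, K^{-1/2},
      \qquad
      E_{k+1} \;\le\; \lambda\, E_k \;+\; L_{\mathrm{pol}}\,\eta_k \;+\; b_k,
      \quad 0 \le \lambda < 1 .
    \]

Under this shape, graph quantities enter only through C(ρ) in the lower-order term, and the backbone choice changes only λ and b_k while polarization contributes the backbone-independent factor L_pol η_k; the promised Lemma 4.3 and Proposition 4.2 would make these two separations explicit.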
Circularity Check
No significant circularity; proof-based rate is self-contained
full rationale
The paper derives a non-asymptotic convergence bound in nuclear-norm geometry from the SUDA primal-dual template, with the leading term O((1+σ/√N)K^{-1/4}) stated to lack explicit graph dependence. This is a first-principles proof result rather than a fitted prediction, self-definition, or renaming of an input quantity. No load-bearing self-citation, ansatz smuggling via prior work, or reduction of the claimed independence to hidden graph parameters by construction is exhibited. The modular/non-modular separation and boundary results (tracking-before-polarization necessity, federated average-then-polarize impossibility) are established via template design and counterexample-style arguments that do not collapse to the target rate. The derivation chain therefore remains independent of its own outputs.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: The nonlinear matrix-sign operator does not commute with linear gossip averaging.
- domain assumption: Local objectives are heterogeneous across nodes.
invented entities (1)
- SUDA primal-dual communication template (no independent evidence)
Reference graph
Works this paper leans on
- [1] Sulaiman A. Alghunaim and Kun Yuan. A unified and refined convergence analysis for non-convex decentralized learning. IEEE Transactions on Signal Processing, 70:3264–3279, 2022.
- [2] Michael Blot, David Picard, Matthieu Cord, and Nicolas Thome. Gossip training for deep learning. arXiv preprint arXiv:1611.09726, 2016.
- [3] Jianshu Chen and Ali H. Sayed. Diffusion adaptation strategies for distributed optimization and learning over networks. IEEE Transactions on Signal Processing, 60(8):4289–4305, 2012.
- [4] Xiangyi Chen, Belhal Karimi, Weijie Zhao, and Ping Li. On the convergence of decentralized adaptive gradient methods. In Asian Conference on Machine Learning, pages 217–232. PMLR, 2023.
- [5] Jeff Daily, Abhinav Vishnu, Charles Siegel, Thomas Warfel, and Vinay Amatya. GossipGraD: Scalable deep learning using gossip communication based asynchronous gradient descent. arXiv preprint arXiv:1803.05880, 2018.
- [6] Vineet Gupta, Tomer Koren, and Yoram Singer. Shampoo: Preconditioned stochastic tensor optimization. In International Conference on Machine Learning, pages 1842–1850. PMLR, 2018.
- [7] Chuan He, Shuyi Ren, Jingwei Mao, and Erik G. Larsson. DeMuon: A decentralized Muon for matrix optimization over graphs. arXiv preprint arXiv:2510.01377, 2025.
- [8] Keller Jordan, Yuchen Jin, Vlado Boza, Jiacheng You, Franz Cesista, Laker Newhouse, and Jeremy Bernstein. Muon: An optimizer for hidden layers in neural networks, 2024. URL https://kellerjordan.github.io/posts/muon/.
- [9] Peter Kairouz and H. Brendan McMahan. Advances and open problems in federated learning. Foundations and Trends in Machine Learning, 14(1-2):1–210, 2021.
- [10] Anastasia Koloskova, Nicolas Loizou, Sadra Boreiri, Martin Jaggi, and Sebastian Stich. A unified theory of decentralized SGD with changing topology and local updates. In International Conference on Machine Learning, pages 5381–5393. PMLR, 2020.
- [11] Boao Kong, Shuchen Zhu, Songtao Lu, Xinmeng Huang, and Kun Yuan. Decentralized bilevel optimization: A perspective from transient iteration complexity. Journal of Machine Learning Research, 26(240):1–64, 2025.
- [12] Xiangru Lian, Ce Zhang, Huan Zhang, Cho-Jui Hsieh, Wei Zhang, and Ji Liu. Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent. Advances in Neural Information Processing Systems, 30, 2017.
- [13] Jingyuan Liu, Jianlin Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, Yidao Qin, Weixin Xu, Enzhe Lu, Junjie Yan, et al. Muon is scalable for LLM training. arXiv preprint arXiv:2502.16982, 2025.
- [14] Junkang Liu, Fanhua Shang, Junchao Zhou, Hongying Liu, Yuanyuan Liu, and Jin Liu. FedMuon: Accelerating federated learning with matrix orthogonalization. arXiv preprint arXiv:2510.27403, 2025.
- [15] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pages 1273–1282. PMLR, 2017.
- [16] Parvin Nazari, Davoud Ataee Tarzanagh, and George Michailidis. DADAM: A consensus-based distributed adaptive gradient method for online optimization. IEEE Transactions on Signal Processing, 70:6065–6079, 2022.
- [17] Angelia Nedić, Jong-Shi Pang, Gesualdo Scutari, and Ying Sun. Multi-agent Optimization. Springer, 2018.
- [18] Thomas Pethick, Wanyun Xie, Kimon Antonakopoulos, Zhenyu Zhu, Antonio Silveti-Falls, and Volkan Cevher. Training deep learning models with norm-constrained LMOs. arXiv preprint arXiv:2502.07529, 2025.
- [19] Shi Pu and Angelia Nedić. Distributed stochastic gradient tracking methods. Mathematical Programming, 187(1):409–457, 2021.
- [20] Artem Riabinin, Egor Shulgin, Kaja Gruntkowska, and Peter Richtárik. Gluon: Making Muon & Scion great again! (Bridging theory and practice of LMO-based optimizers for LLMs). arXiv preprint arXiv:2505.13416, 2025.
- [21] Noam Shazeer and Mitchell Stern. Adafactor: Adaptive learning rates with sublinear memory cost. In International Conference on Machine Learning, pages 4596–4604. PMLR, 2018.
- [22] Wei Shen, Ruichuan Huang, Minhui Huang, Cong Shen, and Jiawei Zhang. On the convergence analysis of Muon. arXiv preprint arXiv:2505.23737, 2025.
- [23] Wei Shi, Qing Ling, Gang Wu, and Wotao Yin. EXTRA: An exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 25(2):944–966, 2015.
- [24] Yuki Takezawa, Anastasia Koloskova, Xiaowen Jiang, and Sebastian U. Stich. FedMuon: Federated learning with bias-corrected LMO-based optimization. arXiv preprint arXiv:2509.26337, 2025.
- [25] Hanlin Tang, Xiangru Lian, Ming Yan, Ce Zhang, and Ji Liu. D²: Decentralized training over decentralized data. In International Conference on Machine Learning, pages 4848–4856. PMLR, 2018.
- [26] ReasFlow Team. ReasFlow: Assisting reasoning-centric scientific discovery in applied mathematics via a knowledge-based multi-agent system, 2026. URL https://blog.reaslab.io/blog/reasflow-intro/.
- [27] Shuche Wang, Fengzhuo Zhang, Jiaxiang Li, Cunxiao Du, Chao Du, Tianyu Pang, Zhuoran Yang, Mingyi Hong, and Vincent Y. F. Tan. Muon outperforms Adam in tail-end associative memory learning. arXiv preprint arXiv:2509.26030, 2025.
- [28] Tao Yang, Xinlei Yi, Junfeng Wu, Ye Yuan, Di Wu, Ziyang Meng, Yiguang Hong, Hong Wang, Zongli Lin, and Karl H. Johansson. A survey of distributed optimization. Annual Reviews in Control, 47:278–305, 2019.
- [29] Bicheng Ying, Kun Yuan, Yiming Chen, Hanbin Hu, Pan Pan, and Wotao Yin. Exponential graph is provably efficient for decentralized deep training. Advances in Neural Information Processing Systems, 34:13975–13987, 2021.
- [30] Binhang Yuan, Yongjun He, Jared Davis, Tianyi Zhang, Tri Dao, Beidi Chen, Percy S. Liang, Christopher Re, and Ce Zhang. Decentralized training of foundation models in heterogeneous environments. Advances in Neural Information Processing Systems, 35:25464–25477, 2022.
- [31] Kun Yuan, Sulaiman A. Alghunaim, and Xinmeng Huang. Removing data heterogeneity influence enhances network topology dependence of decentralized SGD. Journal of Machine Learning Research, 24(280):1–53, 2023.
- [32] Shuchen Zhu, Boao Kong, Songtao Lu, Xinmeng Huang, and Kun Yuan. SPARKLE: A unified single-loop primal-dual framework for decentralized bilevel optimization. Advances in Neural Information Processing Systems, 37:62912–62987, 2024.
discussion (0)