pith. machine review for the scientific record.

arXiv: 2604.04891 · v2 · submitted 2026-04-06 · 🧮 math.OC · cs.AI · stat.ML

Recognition: 2 theorem links

Muon Dynamics as a Spectral Wasserstein Flow

Gabriel Peyré

Pith reviewed 2026-05-11 01:57 UTC · model grok-4.3

classification 🧮 math.OC · cs.AI · stat.ML
keywords Muon optimization · Spectral Wasserstein distance · mean-field regime · gradient flow · Benamou-Brenier formulation · normalized training · optimal transport · matrix norms

The pith

Muon training dynamics are gradient flows under spectral Wasserstein distances for monotone matrix norms.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines an idealized continuous-time vanishing-momentum version of Muon optimization in the mean-field regime, where wide models are represented by probability measures on parameter space. It defines Spectral Wasserstein distances indexed by norms on positive semidefinite matrices, recovering the classical W2 distance for the trace norm and the Muon geometry for the operator norm. For monotone norms the static Kantorovich formulation is shown equivalent to a Benamou-Brenier dynamic formulation. This supplies a gradient-flow view of the normalized training dynamics, which matters because it links spectral normalization techniques in deep learning to optimal transport geometry.
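
For orientation, one shape such a distance can take, written here for vector-valued parameters and consistent with the trace-norm reduction stated in the abstract (a reader's sketch, not necessarily the paper's verbatim definition):

$$
\mathcal{W}_\gamma^2(\mu,\nu) \;=\; \inf_{\pi \in \Pi(\mu,\nu)} \; \gamma\!\left( \int (x-y)(x-y)^\top \, d\pi(x,y) \right),
$$

where $\Pi(\mu,\nu)$ is the set of couplings and the displacement second-moment matrix inside $\gamma$ is positive semidefinite. Taking $\gamma = \operatorname{tr}$ gives $\gamma\big(\int (x-y)(x-y)^\top d\pi\big) = \int \|x-y\|^2 \, d\pi$, exactly the classical $W_2^2$ objective, while the Schatten $p$-norms $\gamma_p(A) = \big(\sum_i \lambda_i(A)^p\big)^{1/p}$ interpolate from the trace norm ($p = 1$) toward the operator norm ($p = \infty$).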

Core claim

Starting from normalized matrix flows, the authors introduce Spectral Wasserstein distances indexed by norms γ on positive semidefinite matrices. The trace norm recovers classical W2, the operator norm recovers the Muon geometry, and Schatten norms interpolate. They develop the static Kantorovich formulation, a max-min robust-cost representation, Gaussian reductions extending the Bures formula, and prove equivalence with a Benamou-Brenier formulation for monotone norms. This equivalence yields a gradient-flow interpretation of the mean-field normalized training dynamics.
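
For context, the classical Benamou–Brenier theorem (a standard result; reference [4] in the list below) expresses $W_2$ dynamically:

$$
W_2^2(\mu_0,\mu_1) \;=\; \inf_{(\rho_t, v_t)} \int_0^1 \!\! \int \|v_t(x)\|^2 \, d\rho_t(x) \, dt
\quad \text{subject to} \quad \partial_t \rho_t + \nabla \cdot (\rho_t v_t) = 0, \;\; \rho_0 = \mu_0, \;\; \rho_1 = \mu_1.
$$

A spectral analogue would plausibly replace the kinetic term $\int \|v_t\|^2 \, d\rho_t$ with $\gamma\big(\int v_t v_t^\top \, d\rho_t\big)$, which reduces to the classical action for the trace norm; this is a guess at the form, not the paper's statement. Per the core claim, the static and dynamic formulations agree for monotone norms (the referee report below specifies monotonicity with respect to the Loewner order).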

What carries the argument

Spectral Wasserstein distances indexed by norms on positive semidefinite matrices, which equip the space of probability measures with a geometry under which normalized matrix flows act as gradient flows.

If this is right

  • Normalized training in the mean-field regime minimizes an energy by following the gradient flow of a Spectral Wasserstein distance.
  • Equivalence to the Benamou-Brenier formulation supplies a dynamic description that can be used to analyze or simulate the flows.
  • Schatten norms supply a continuous interpolation between classical Wasserstein geometry and Muon-specific geometry.
  • Gaussian reductions furnish explicit formulas for distances between Gaussian measures, extending the Bures formula (recalled after this list).
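
For reference, the classical Bures–Wasserstein formula that these reductions extend is, for Gaussian measures (a standard fact, cf. [5] below, not a new statement of the paper):

$$
W_2^2\big(\mathcal{N}(m_1,\Sigma_1),\, \mathcal{N}(m_2,\Sigma_2)\big)
= \|m_1 - m_2\|^2 + \operatorname{tr}\Big(\Sigma_1 + \Sigma_2 - 2\big(\Sigma_1^{1/2} \Sigma_2 \Sigma_1^{1/2}\big)^{1/2}\Big).
$$

How the $\gamma$-indexed distances modify this expression is a detail of the paper; presumably the trace is replaced by, or combined with, the chosen norm $\gamma$.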

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Choosing different matrix norms could generate new families of normalization schemes with tunable training behaviors.
  • The max-min robust-cost representation may suggest robustness properties for optimization algorithms beyond the gradient-flow setting.
  • Similar spectral constructions might extend to other structured parameter spaces such as tensors or convolutional kernels.
  • The framework opens the possibility of deriving convergence rates for normalized training by importing tools from optimal transport.

Load-bearing premise

The idealized deterministic, continuous-time, vanishing-momentum version of Muon in the mean-field regime accurately captures the essential behavior of practical normalized training.

What would settle it

Numerical simulations showing that trajectories of finite-width practical Muon deviate substantially from the predicted mean-field flows under the operator-norm geometry would undercut the gradient-flow interpretation as a description of practical normalized training. A minimal probe of this kind is sketched below.
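
The following Python sketch fixes the two update rules such a probe would compare: the idealized vanishing-momentum Muon step $W \leftarrow W - \eta\, U V^\top$ (with $U \Sigma V^\top$ the SVD of the gradient) against plain gradient descent. The toy least-squares objective, dimensions, and step sizes are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

def msign(G):
    """Idealized (vanishing-momentum) Muon direction: U V^T from the SVD of G."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

rng = np.random.default_rng(0)
d, n = 32, 256
X = rng.standard_normal((n, d))        # inputs
W_star = rng.standard_normal((d, d))   # ground-truth weight matrix
Y = X @ W_star.T                       # noiseless targets

def grad(W):
    # Gradient of 0.5/n * ||X W^T - Y||_F^2 with respect to W.
    return (X @ W.T - Y).T @ X / n

W_muon = np.zeros((d, d))
W_gd = np.zeros((d, d))
eta = 0.05
for _ in range(500):
    W_muon -= eta * msign(grad(W_muon))  # spectrally normalized (Muon-style) step
    W_gd -= eta * grad(W_gd)             # plain gradient step

print("muon error:", np.linalg.norm(W_muon - W_star))  # settles at the eta scale
print("gd   error:", np.linalg.norm(W_gd - W_star))
```

Sweeping the width and comparing particle trajectories against the predicted mean-field flow is the experiment the falsification test calls for; this snippet only pins down the dynamics being compared.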

Figures

Figures reproduced from arXiv: 2604.04891 by Gabriel Peyré.

Figure 1. Static spectral couplings for Schatten p = 1, 2, ∞. Red points are the source cloud, blue points are the target cloud, and black segments show a permutation extracted from the optimal coupling for visualization.
Figure 2. All particle trajectories for the three MMD flows associated with Schatten norms p = 1, 2, ∞.
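
A minimal Python sketch of the kind of static computation Figure 1 visualizes, with two loud simplifications: the coupling is the classical W₂-optimal permutation, used as a proxy (for p ≠ 1 the γ-optimal coupling generally differs, so this only upper-bounds the spectral cost), and the cost is the Schatten-p norm of the displacement second-moment matrix, matching the sketch definition given earlier.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def schatten_cost(X, Y, perm, p):
    """Schatten-p norm of the displacement second-moment matrix under a permutation coupling."""
    D = X - Y[perm]                          # displacements x_i - y_{perm(i)}
    S = D.T @ D / len(X)                     # PSD second-moment matrix
    lam = np.clip(np.linalg.eigvalsh(S), 0.0, None)
    return lam.max() if np.isinf(p) else float((lam ** p).sum() ** (1.0 / p))

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 2))                         # source cloud (red in Figure 1)
Y = rng.standard_normal((100, 2)) + np.array([3.0, 0.0])  # shifted target cloud (blue)

# Proxy coupling: the W2-optimal permutation from a squared-distance assignment.
cost = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
_, perm = linear_sum_assignment(cost)

for p in (1, 2, np.inf):
    print(f"Schatten p={p}: {schatten_cost(X, Y, perm, p):.3f}")
```

For p = 1 the printed value is the mean squared displacement, i.e. the classical W₂² at this coupling, consistent with the trace-norm reduction.
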
read the original abstract

Gradient normalization stabilizes deep-learning optimization, and spectral normalizations are especially natural for matrix-shaped parameter blocks; Muon is the motivating example. We study an idealized deterministic, continuous-time, vanishing-momentum version of this idea in the mean-field regime, where wide models are represented by probability measures on parameter space. Starting from normalized matrix flows, we introduce Spectral Wasserstein distances indexed by norms $\gamma$ on positive semidefinite matrices: the trace norm gives classical $W_2$, the operator norm gives the Muon geometry, and Schatten norms interpolate between them. We develop the static Kantorovich formulation, a max-min robust-cost representation, Gaussian reductions extending the Bures formula, and for monotone norms, prove equivalence with a Benamou--Brenier formulation. This yields a gradient-flow interpretation of the mean-field normalized training dynamics. We illustrate these findings by numerical experiments on MMD flows, Gaussian reductions, two-layer ReLU models, and shallow attention.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript introduces Spectral Wasserstein distances on probability measures over matrix parameters, indexed by norms γ on positive semidefinite matrices (with the trace norm recovering classical W₂ and the operator norm recovering the Muon geometry). It develops the static Kantorovich formulation together with a max-min robust-cost representation, provides Gaussian reductions that extend the Bures formula, and proves equivalence to a Benamou-Brenier dynamic formulation precisely when the norm is monotone with respect to the Loewner order. This equivalence supplies a gradient-flow interpretation of the mean-field limit of idealized continuous-time, vanishing-momentum normalized training dynamics. The claims are illustrated by numerical experiments on MMD flows, Gaussian reductions, two-layer ReLU networks, and shallow attention models.

Significance. If the central equivalence and reductions hold, the work supplies a rigorous optimal-transport geometry for spectral normalization methods such as Muon, thereby connecting mean-field training dynamics to gradient flows in a family of Wasserstein-type metrics. The monotonicity condition that enables the Benamou-Brenier formulation, the explicit Gaussian reductions, and the concrete application to the operator norm constitute a substantive contribution at the interface of optimization and optimal transport. The idealized mean-field setting is clearly delineated, and the numerical illustrations are consistent with the theory.

minor comments (3)
  1. [§2] Definition of the Spectral Wasserstein distance: the precise dependence of the cost on the matrix norm γ is introduced via the Kantorovich formulation, but the subsequent max-min representation would benefit from an explicit statement of the dual variables and of the compactness argument used to interchange inf and sup.
  2. [§4] Gaussian reductions: the extension of the Bures formula is stated for the family of Schatten norms, yet the error incurred when the measures are not exactly Gaussian is not quantified; a brief remark on the approximation quality would strengthen the claim that the reductions are useful for analysis.
  3. [Experiments] The discretization of the continuous-time mean-field dynamics is described only at a high level; adding a short paragraph on the numerical scheme (time-stepping, particle representation, and norm evaluation) would improve reproducibility without altering the theoretical contribution. One plausible shape for such a scheme is sketched after this list.
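
For concreteness, one plausible shape for such a scheme, assuming forward-Euler time-stepping on a particle representation with the spectral-sign direction computed per particle by batched SVD (an assumption about the implementation, not the paper's documented scheme):

```python
import numpy as np

def normalized_particle_step(W, grad_fn, eta):
    """One forward-Euler step of an idealized normalized flow on a particle system.

    W: (n_particles, d_out, d_in) matrix-shaped parameters.
    grad_fn: returns the mean-field gradient for each particle, same shape as W.
    """
    G = grad_fn(W)
    U, _, Vt = np.linalg.svd(G, full_matrices=False)  # batched SVD over particles
    return W - eta * (U @ Vt)                         # unit-operator-norm direction

# Toy usage: particles attracted to a fixed target matrix (an illustrative energy,
# not the paper's MMD-flow objective).
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 4, 4))
target = rng.standard_normal((4, 4))
for _ in range(200):
    W = normalized_particle_step(W, lambda V: V - target, eta=0.05)
print(np.abs(W - target).max())  # residual settles at roughly the eta scale
```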

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive and accurate summary of our manuscript, as well as for the recommendation of minor revision. We are pleased that the connections between spectral Wasserstein geometry, the Benamou-Brenier formulation under monotonicity, the Gaussian reductions, and the application to Muon dynamics were recognized as substantive contributions.

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper constructs Spectral Wasserstein distances from normalized matrix flows, develops the static Kantorovich formulation and max-min representation, derives Gaussian reductions, and proves equivalence to the Benamou-Brenier dynamic formulation precisely when the norm is monotone. The operator norm satisfies the required monotonicity, allowing the gradient-flow interpretation of mean-field normalized dynamics to follow from the established dynamic formulation. All steps rely on external optimal-transport theory and direct mathematical proofs rather than self-definition, fitted inputs renamed as predictions, or load-bearing self-citations that reduce the central claim to its own inputs.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 1 invented entity

The framework rests on the mean-field representation of wide networks and on the equivalence between static and dynamic formulations of the new distance; these are not independently verified in the abstract.

free parameters (1)
  • matrix norm γ
    The choice of norm on positive-semidefinite matrices selects which Wasserstein geometry is used; the operator norm is singled out for Muon.
axioms (2)
  • domain assumption · Wide models can be represented by probability measures on parameter space
    Invoked to pass to the mean-field regime.
  • domain assumption · Normalized matrix flows admit a deterministic continuous-time vanishing-momentum limit
    Used to obtain the idealized dynamics studied.
invented entities (1)
  • Spectral Wasserstein distance indexed by γ · no independent evidence
    purpose: Generalizes classical Wasserstein distances to capture normalized matrix flows
    New object introduced to unify Muon geometry with optimal transport.

pith-pipeline@v0.9.0 · 5459 in / 1361 out tokens · 35280 ms · 2026-05-11T01:57:37.690266+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Phases of Muon: When Muon Eclipses SignSGD

    math.OC · 2026-05 · unverdicted · novelty 7.0

    On power-law covariance least squares problems, SignSVD (Muon) and SignSGD (Adam proxy) show three phases of relative performance depending on data exponent α and target exponent β.

Reference graph

Works this paper leans on

21 extracted references · 21 canonical work pages · cited by 1 Pith paper

  1. [1] Luigi Ambrosio, Nicola Gigli, and Giuseppe Savaré. Gradient Flows in Metric Spaces and in the Space of Probability Measures. Lectures in Mathematics ETH Zürich. Birkhäuser Basel, 2nd edition, 2008.

  2. [2] Julio Daniel Backhoff-Veraguas and Gudmund Pammer. Applications of weak transport theory. Bernoulli, 28(1):370–394, 2022.

  3. [3] Julio Daniel Backhoff-Veraguas, Mathias Beiglböck, and Gudmund Pammer. Existence, duality, and cyclical monotonicity for weak transport costs. Calculus of Variations and Partial Differential Equations, 58(6):203, 2019.

  4. [4] Jean-David Benamou and Yann Brenier. A computational fluid mechanics solution to the Monge–Kantorovich mass transfer problem. Numerische Mathematik, 84(3):375–393, 2000.

  5. [5] Rajendra Bhatia, Tanvi Jain, and Yongdo Lim. On the Bures–Wasserstein distance between positive definite matrices. Expositiones Mathematicae, 37(2):165–191, 2019.

  6. [6] Martin Burger, Matthias Erbar, Franca Hoffmann, Daniel Matthes, and André Schlichting. Covariance-modulated optimal transport and gradient flows. Archive for Rational Mechanics and Analysis, 249(1):7, 2025.

  7. [7] Guillaume Carlier, Carlos Jimenez, and Filippo Santambrogio. Optimal transportation with traffic congestion and Wardrop equilibria. SIAM Journal on Control and Optimization, 47(3):1330–1350, 2008.

  8. [8] Yongxin Chen, Tryphon T. Georgiou, and Allen Tannenbaum. Matrix optimal mass transport: A quantum mechanical approach. IEEE Transactions on Automatic Control, 63(8):2612–2619, 2018.

  9. [9] Lénaïc Chizat and Francis Bach. On the global convergence of gradient descent for over-parameterized models using optimal transport. In Advances in Neural Information Processing Systems, volume 31, pages 3040–3050, 2018.

  10. [10] Ashok Cutkosky and Harsh Mehta. Momentum improves normalized SGD. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 2260–2268, 2020.

  11. [11] Marco Cuturi and David Avis. Ground metric learning. Journal of Machine Learning Research, 15(1):533–564, 2014.

  12. [12] Rémi Flamary, Marco Cuturi, Nicolas Courty, and Alain Rakotomamonjy. Wasserstein discriminant analysis. Machine Learning, 107(12):1923–1945, 2018.

  13. [13] Nathael Gozlan, Cyril Roberto, Paul-Marie Samson, and Prasad Tetali. Kantorovich duality for general transport costs and applications. Journal of Functional Analysis, 273(11):3327–3405, 2017.

  14. [14] Keller Jordan, Yuchen Jin, Vlado Boza, Jiacheng You, Franz Cesista, Laker Newhouse, and Jeremy Bernstein. Muon: An optimizer for hidden layers in neural networks. Blog post, 2024.

  15. [15] Jingyuan Liu, Jianlin Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, Yidao Qin, Weixin Xu, Enzhe Lu, Junjie Yan, Yanru Chen, Huabin Zheng, Yibo Liu, Shaowei Liu, Bohong Yin, Weiran He, Han Zhu, Yuzhi Wang, Jianzhou Wang, Mengnan Dong, Zheng Zhang, Yongsheng Kang, Hao Zhang, Xinran Xu, Yutao Zhang, Yuxin Wu, Xinyu Zhou, and Zhilin Yang. Muon is scalable for LLM training, 2025.

  16. [16] Song Mei, Andrea Montanari, and Phan-Minh Nguyen. A mean field view of the landscape of two-layer neural networks. Proceedings of the National Academy of Sciences, 115(33):E7665–E7671, 2018.

  17. [17] Ryan Murray, Brian Swenson, and Soummya Kar. Revisiting normalized gradient descent: Fast evasion of saddle points. IEEE Transactions on Automatic Control, 64(11):4818–4824, 2019.

  18. [18] Lipeng Ning, Tryphon T. Georgiou, and Allen Tannenbaum. On matrix-valued Monge–Kantorovich optimal mass transport. IEEE Transactions on Automatic Control, 60(2):373–382, 2015.

  19. [19] François-Pierre Paty and Marco Cuturi. Subspace robust Wasserstein distances. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 5072–5081, 2019.

  20. [20] Thomas Pethick, Wanyun Xie, Kimon Antonakopoulos, Zhenyu Zhu, Antonio Silveti-Falls, and Volkan Cevher. Training deep learning models with norm-constrained LMOs. In Proceedings of the 42nd International Conference on Machine Learning, volume 267 of Proceedings of Machine Learning Research, pages 49069–49104, 2025.

  21. [21] Othmane Sebbouh, Marco Cuturi, and Gabriel Peyré. Structured transforms across spaces with cost-regularized optimal transport. In Proceedings of The 27th International Conference on Artificial Intelligence and Statistics, volume 238 of Proceedings of Machine Learning Research, pages 586–594, 2024.