pith. machine review for the scientific record.

arXiv: 2604.04891 · v2 · submitted 2026-04-06 · 🧮 math.OC · cs.AI · stat.ML

Recognition: 2 theorem links

Muon Dynamics as a Spectral Wasserstein Flow

Gabriel Peyré

Pith reviewed 2026-05-11 01:57 UTC · model grok-4.3

classification 🧮 math.OC · cs.AI · stat.ML
keywords Muon optimization · Spectral Wasserstein distance · mean-field regime · gradient flow · Benamou-Brenier formulation · normalized training · optimal transport · matrix norms

The pith

Muon training dynamics are gradient flows under spectral Wasserstein distances for monotone matrix norms.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines an idealized continuous-time vanishing-momentum version of Muon optimization in the mean-field regime, where wide models are represented by probability measures on parameter space. It defines Spectral Wasserstein distances indexed by norms on positive semidefinite matrices, recovering the classical W2 distance for the trace norm and the Muon geometry for the operator norm. For monotone norms the static Kantorovich formulation is shown equivalent to a Benamou-Brenier dynamic formulation. This supplies a gradient-flow view of the normalized training dynamics, which matters because it links spectral normalization techniques in deep learning to optimal transport geometry.
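
For orientation, one shape such a distance can take, written here for vector-valued parameters and consistent with the trace-norm reduction stated in the abstract (a reader's sketch, not necessarily the paper's verbatim definition):

$$
\mathcal{W}_\gamma^2(\mu,\nu) \;=\; \inf_{\pi \in \Pi(\mu,\nu)} \; \gamma\!\left( \int (x-y)(x-y)^\top \, d\pi(x,y) \right),
$$

where $\Pi(\mu,\nu)$ is the set of couplings and the displacement second-moment matrix inside $\gamma$ is positive semidefinite. Taking $\gamma = \operatorname{tr}$ gives $\gamma\big(\int (x-y)(x-y)^\top d\pi\big) = \int \|x-y\|^2 \, d\pi$, exactly the classical $W_2^2$ objective, while the Schatten $p$-norms $\gamma_p(A) = \big(\sum_i \lambda_i(A)^p\big)^{1/p}$ interpolate from the trace norm ($p = 1$) toward the operator norm ($p = \infty$).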

Core claim

Starting from normalized matrix flows, the authors introduce Spectral Wasserstein distances indexed by norms γ on positive semidefinite matrices. The trace norm recovers classical W2, the operator norm recovers the Muon geometry, and Schatten norms interpolate. They develop the static Kantorovich formulation, a max-min robust-cost representation, Gaussian reductions extending the Bures formula, and prove equivalence with a Benamou-Brenier formulation for monotone norms. This equivalence yields a gradient-flow interpretation of the mean-field normalized training dynamics.
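
For context, the classical Benamou–Brenier theorem (a standard result; reference [4] in the list below) expresses $W_2$ dynamically:

$$
W_2^2(\mu_0,\mu_1) \;=\; \inf_{(\rho_t, v_t)} \int_0^1 \!\! \int \|v_t(x)\|^2 \, d\rho_t(x) \, dt
\quad \text{subject to} \quad \partial_t \rho_t + \nabla \cdot (\rho_t v_t) = 0, \;\; \rho_0 = \mu_0, \;\; \rho_1 = \mu_1.
$$

A spectral analogue would plausibly replace the kinetic term $\int \|v_t\|^2 \, d\rho_t$ with $\gamma\big(\int v_t v_t^\top \, d\rho_t\big)$, which reduces to the classical action for the trace norm; this is a guess at the form, not the paper's statement. Per the core claim, the static and dynamic formulations agree for monotone norms (the referee report below specifies monotonicity with respect to the Loewner order).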

What carries the argument

Spectral Wasserstein distances indexed by norms on positive semidefinite matrices, which equip the space of probability measures with a geometry under which normalized matrix flows act as gradient flows.

If this is right

  • Normalized training in the mean-field regime minimizes an energy by following the gradient flow of a Spectral Wasserstein distance.
  • Equivalence to the Benamou-Brenier formulation supplies a dynamic description that can be used to analyze or simulate the flows.
  • Schatten norms supply a continuous interpolation between classical Wasserstein geometry and Muon-specific geometry.
  • Gaussian reductions furnish explicit formulas for distances between Gaussian measures, extending the Bures formula (recalled after this list).
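
For reference, the classical Bures–Wasserstein formula that these reductions extend is, for Gaussian measures (a standard fact, cf. [5] below, not a new statement of the paper):

$$
W_2^2\big(\mathcal{N}(m_1,\Sigma_1),\, \mathcal{N}(m_2,\Sigma_2)\big)
= \|m_1 - m_2\|^2 + \operatorname{tr}\Big(\Sigma_1 + \Sigma_2 - 2\big(\Sigma_1^{1/2} \Sigma_2 \Sigma_1^{1/2}\big)^{1/2}\Big).
$$

How the $\gamma$-indexed distances modify this expression is a detail of the paper; presumably the trace is replaced by, or combined with, the chosen norm $\gamma$.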

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Choosing different matrix norms could generate new families of normalization schemes with tunable training behaviors.
  • The max-min robust-cost representation may suggest robustness properties for optimization algorithms beyond the gradient-flow setting.
  • Similar spectral constructions might extend to other structured parameter spaces such as tensors or convolutional kernels.
  • The framework opens the possibility of deriving convergence rates for normalized training by importing tools from optimal transport.

Load-bearing premise

The idealized deterministic, continuous-time, vanishing-momentum version of Muon in the mean-field regime accurately captures the essential behavior of practical normalized training.

What would settle it

Numerical simulations showing that trajectories of finite-width practical Muon deviate substantially from the predicted mean-field flows under the operator-norm geometry would undercut the gradient-flow interpretation as a description of practical normalized training. A minimal probe of this kind is sketched below.
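
The following Python sketch fixes the two update rules such a probe would compare: the idealized vanishing-momentum Muon step $W \leftarrow W - \eta\, U V^\top$ (with $U \Sigma V^\top$ the SVD of the gradient) against plain gradient descent. The toy least-squares objective, dimensions, and step sizes are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

def msign(G):
    """Idealized (vanishing-momentum) Muon direction: U V^T from the SVD of G."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

rng = np.random.default_rng(0)
d, n = 32, 256
X = rng.standard_normal((n, d))        # inputs
W_star = rng.standard_normal((d, d))   # ground-truth weight matrix
Y = X @ W_star.T                       # noiseless targets

def grad(W):
    # Gradient of 0.5/n * ||X W^T - Y||_F^2 with respect to W.
    return (X @ W.T - Y).T @ X / n

W_muon = np.zeros((d, d))
W_gd = np.zeros((d, d))
eta = 0.05
for _ in range(500):
    W_muon -= eta * msign(grad(W_muon))  # spectrally normalized (Muon-style) step
    W_gd -= eta * grad(W_gd)             # plain gradient step

print("muon error:", np.linalg.norm(W_muon - W_star))  # settles at the eta scale
print("gd   error:", np.linalg.norm(W_gd - W_star))
```

Sweeping the width and comparing particle trajectories against the predicted mean-field flow is the experiment the falsification test calls for; this snippet only pins down the dynamics being compared.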

Figures

Figures reproduced from arXiv: 2604.04891 by Gabriel Peyré.

Figure 1. Static spectral couplings for Schatten p = 1, 2, ∞. Red points are the source cloud, blue points are the target cloud, and black segments show a permutation extracted from the optimal coupling for visualization.
Figure 2. All particle trajectories for the three MMD flows associated with Schatten norms p = 1, 2, ∞.
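
A minimal Python sketch of the kind of static computation Figure 1 visualizes, with two loud simplifications: the coupling is the classical W₂-optimal permutation, used as a proxy (for p ≠ 1 the γ-optimal coupling generally differs, so this only upper-bounds the spectral cost), and the cost is the Schatten-p norm of the displacement second-moment matrix, matching the sketch definition given earlier.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def schatten_cost(X, Y, perm, p):
    """Schatten-p norm of the displacement second-moment matrix under a permutation coupling."""
    D = X - Y[perm]                          # displacements x_i - y_{perm(i)}
    S = D.T @ D / len(X)                     # PSD second-moment matrix
    lam = np.clip(np.linalg.eigvalsh(S), 0.0, None)
    return lam.max() if np.isinf(p) else float((lam ** p).sum() ** (1.0 / p))

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 2))                         # source cloud (red in Figure 1)
Y = rng.standard_normal((100, 2)) + np.array([3.0, 0.0])  # shifted target cloud (blue)

# Proxy coupling: the W2-optimal permutation from a squared-distance assignment.
cost = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
_, perm = linear_sum_assignment(cost)

for p in (1, 2, np.inf):
    print(f"Schatten p={p}: {schatten_cost(X, Y, perm, p):.3f}")
```

For p = 1 the printed value is the mean squared displacement, i.e. the classical W₂² at this coupling, consistent with the trace-norm reduction.
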
read the original abstract

Gradient normalization stabilizes deep-learning optimization, and spectral normalizations are especially natural for matrix-shaped parameter blocks; Muon is the motivating example. We study an idealized deterministic, continuous-time, vanishing-momentum version of this idea in the mean-field regime, where wide models are represented by probability measures on parameter space. Starting from normalized matrix flows, we introduce Spectral Wasserstein distances indexed by norms $\gamma$ on positive semidefinite matrices: the trace norm gives classical $W_2$, the operator norm gives the Muon geometry, and Schatten norms interpolate between them. We develop the static Kantorovich formulation, a max-min robust-cost representation, Gaussian reductions extending the Bures formula, and for monotone norms, prove equivalence with a Benamou--Brenier formulation. This yields a gradient-flow interpretation of the mean-field normalized training dynamics. We illustrate these findings by numerical experiments on MMD flows, Gaussian reductions, two-layer ReLU models, and shallow attention.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript introduces Spectral Wasserstein distances on probability measures over matrix parameters, indexed by norms γ on positive semidefinite matrices (with the trace norm recovering classical W₂ and the operator norm recovering the Muon geometry). It develops the static Kantorovich formulation together with a max-min robust-cost representation, provides Gaussian reductions that extend the Bures formula, and proves equivalence to a Benamou-Brenier dynamic formulation precisely when the norm is monotone with respect to the Loewner order. This equivalence supplies a gradient-flow interpretation of the mean-field limit of idealized continuous-time, vanishing-momentum normalized training dynamics. The claims are illustrated by numerical experiments on MMD flows, Gaussian reductions, two-layer ReLU networks, and shallow attention models.

Significance. If the central equivalence and reductions hold, the work supplies a rigorous optimal-transport geometry for spectral normalization methods such as Muon, thereby connecting mean-field training dynamics to gradient flows in a family of Wasserstein-type metrics. The monotonicity condition that enables the Benamou-Brenier formulation, the explicit Gaussian reductions, and the concrete application to the operator norm constitute a substantive contribution at the interface of optimization and optimal transport. The idealized mean-field setting is clearly delineated, and the numerical illustrations are consistent with the theory.

minor comments (3)
  1. [§2] Definition of the Spectral Wasserstein distance: the precise dependence of the cost on the matrix norm γ is introduced via the Kantorovich formulation, but the subsequent max-min representation would benefit from an explicit statement of the dual variables and of the compactness argument used to interchange inf and sup.
  2. [§4] Gaussian reductions: the extension of the Bures formula is stated for the family of Schatten norms, yet the error incurred when the measures are not exactly Gaussian is not quantified; a brief remark on the approximation quality would strengthen the claim that the reductions are useful for analysis.
  3. [Experiments] The discretization of the continuous-time mean-field dynamics is described only at a high level; adding a short paragraph on the numerical scheme (time-stepping, particle representation, and norm evaluation) would improve reproducibility without altering the theoretical contribution. One plausible shape for such a scheme is sketched after this list.
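
For concreteness, one plausible shape for such a scheme, assuming forward-Euler time-stepping on a particle representation with the spectral-sign direction computed per particle by batched SVD (an assumption about the implementation, not the paper's documented scheme):

```python
import numpy as np

def normalized_particle_step(W, grad_fn, eta):
    """One forward-Euler step of an idealized normalized flow on a particle system.

    W: (n_particles, d_out, d_in) matrix-shaped parameters.
    grad_fn: returns the mean-field gradient for each particle, same shape as W.
    """
    G = grad_fn(W)
    U, _, Vt = np.linalg.svd(G, full_matrices=False)  # batched SVD over particles
    return W - eta * (U @ Vt)                         # unit-operator-norm direction

# Toy usage: particles attracted to a fixed target matrix (an illustrative energy,
# not the paper's MMD-flow objective).
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 4, 4))
target = rng.standard_normal((4, 4))
for _ in range(200):
    W = normalized_particle_step(W, lambda V: V - target, eta=0.05)
print(np.abs(W - target).max())  # residual settles at roughly the eta scale
```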

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive and accurate summary of our manuscript, as well as for the recommendation of minor revision. We are pleased that the connections between spectral Wasserstein geometry, the Benamou-Brenier formulation under monotonicity, the Gaussian reductions, and the application to Muon dynamics were recognized as substantive contributions.

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper constructs Spectral Wasserstein distances from normalized matrix flows, develops the static Kantorovich formulation and max-min representation, derives Gaussian reductions, and proves equivalence to the Benamou-Brenier dynamic formulation precisely when the norm is monotone. The operator norm satisfies the required monotonicity, allowing the gradient-flow interpretation of mean-field normalized dynamics to follow from the established dynamic formulation. All steps rely on external optimal-transport theory and direct mathematical proofs rather than self-definition, fitted inputs renamed as predictions, or load-bearing self-citations that reduce the central claim to its own inputs.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 1 invented entity

The framework rests on the mean-field representation of wide networks and on the equivalence between static and dynamic formulations of the new distance; these are not independently verified in the abstract.

free parameters (1)
  • matrix norm γ
    The choice of norm on positive-semidefinite matrices selects which Wasserstein geometry is used; the operator norm is singled out for Muon.
axioms (2)
  • domain assumption · Wide models can be represented by probability measures on parameter space
    Invoked to pass to the mean-field regime.
  • domain assumption · Normalized matrix flows admit a deterministic continuous-time vanishing-momentum limit
    Used to obtain the idealized dynamics studied.
invented entities (1)
  • Spectral Wasserstein distance indexed by γ · no independent evidence
    purpose: Generalizes classical Wasserstein distances to capture normalized matrix flows
    New object introduced to unify Muon geometry with optimal transport.

pith-pipeline@v0.9.0 · 5459 in / 1361 out tokens · 35280 ms · 2026-05-11T01:57:37.690266+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Phases of Muon: When Muon Eclipses SignSGD

    math.OC · 2026-05 · unverdicted · novelty 7.0

    On power-law covariance least squares problems, SignSVD (Muon) and SignSGD (Adam proxy) show three phases of relative performance depending on data exponent α and target exponent β.

Reference graph

Works this paper leans on

21 extracted references · 21 canonical work pages · cited by 1 Pith paper

  1. [1] Luigi Ambrosio, Nicola Gigli, and Giuseppe Savaré. Gradient Flows in Metric Spaces and in the Space of Probability Measures. Lectures in Mathematics ETH Zürich. Birkhäuser Basel, 2nd edition, 2008.

  2. [2] Julio Daniel Backhoff-Veraguas and Gudmund Pammer. Applications of weak transport theory. Bernoulli, 28(1):370–394, 2022.

  3. [3] Julio Daniel Backhoff-Veraguas, Mathias Beiglböck, and Gudmund Pammer. Existence, duality, and cyclical monotonicity for weak transport costs. Calculus of Variations and Partial Differential Equations, 58(6):203, 2019.

  4. [4] Jean-David Benamou and Yann Brenier. A computational fluid mechanics solution to the Monge–Kantorovich mass transfer problem. Numerische Mathematik, 84(3):375–393, 2000.

  5. [5] Rajendra Bhatia, Tanvi Jain, and Yongdo Lim. On the Bures–Wasserstein distance between positive definite matrices. Expositiones Mathematicae, 37(2):165–191, 2019.

  6. [6] Martin Burger, Matthias Erbar, Franca Hoffmann, Daniel Matthes, and André Schlichting. Covariance-modulated optimal transport and gradient flows. Archive for Rational Mechanics and Analysis, 249(1):7, 2025.

  7. [7] Guillaume Carlier, Carlos Jimenez, and Filippo Santambrogio. Optimal transportation with traffic congestion and Wardrop equilibria. SIAM Journal on Control and Optimization, 47(3):1330–1350, 2008.

  8. [8] Yongxin Chen, Tryphon T. Georgiou, and Allen Tannenbaum. Matrix optimal mass transport: A quantum mechanical approach. IEEE Transactions on Automatic Control, 63(8):2612–2619, 2018.

  9. [9] Lénaïc Chizat and Francis Bach. On the global convergence of gradient descent for over-parameterized models using optimal transport. In Advances in Neural Information Processing Systems, volume 31, pages 3040–3050, 2018.

  10. [10] Ashok Cutkosky and Harsh Mehta. Momentum improves normalized SGD. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 2260–2268, 2020.

  11. [11] Marco Cuturi and David Avis. Ground metric learning. Journal of Machine Learning Research, 15(1):533–564, 2014.

  12. [12] Rémi Flamary, Marco Cuturi, Nicolas Courty, and Alain Rakotomamonjy. Wasserstein discriminant analysis. Machine Learning, 107(12):1923–1945, 2018.

  13. [13] Nathael Gozlan, Cyril Roberto, Paul-Marie Samson, and Prasad Tetali. Kantorovich duality for general transport costs and applications. Journal of Functional Analysis, 273(11):3327–3405, 2017.

  14. [14] Keller Jordan, Yuchen Jin, Vlado Boza, Jiacheng You, Franz Cesista, Laker Newhouse, and Jeremy Bernstein. Muon: An optimizer for hidden layers in neural networks. Blog post, 2024.

  15. [15] Jingyuan Liu, Jianlin Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, Yidao Qin, Weixin Xu, Enzhe Lu, Junjie Yan, Yanru Chen, Huabin Zheng, Yibo Liu, Shaowei Liu, Bohong Yin, Weiran He, Han Zhu, Yuzhi Wang, Jianzhou Wang, Mengnan Dong, Zheng Zhang, Yongsheng Kang, Hao Zhang, Xinran Xu, Yutao Zhang, Yuxin Wu, Xinyu Zhou, and Zhilin Yang. Muon is scalable for LLM training, 2025.

  16. [16] Song Mei, Andrea Montanari, and Phan-Minh Nguyen. A mean field view of the landscape of two-layer neural networks. Proceedings of the National Academy of Sciences, 115(33):E7665–E7671, 2018.

  17. [17] Ryan Murray, Brian Swenson, and Soummya Kar. Revisiting normalized gradient descent: Fast evasion of saddle points. IEEE Transactions on Automatic Control, 64(11):4818–4824, 2019.

  18. [18] Lipeng Ning, Tryphon T. Georgiou, and Allen Tannenbaum. On matrix-valued Monge–Kantorovich optimal mass transport. IEEE Transactions on Automatic Control, 60(2):373–382, 2015.

  19. [19] François-Pierre Paty and Marco Cuturi. Subspace robust Wasserstein distances. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 5072–5081, 2019.

  20. [20] Thomas Pethick, Wanyun Xie, Kimon Antonakopoulos, Zhenyu Zhu, Antonio Silveti-Falls, and Volkan Cevher. Training deep learning models with norm-constrained LMOs. In Proceedings of the 42nd International Conference on Machine Learning, volume 267 of Proceedings of Machine Learning Research, pages 49069–49104, 2025.

  21. [21] Othmane Sebbouh, Marco Cuturi, and Gabriel Peyré. Structured transforms across spaces with cost-regularized optimal transport. In Proceedings of The 27th International Conference on Artificial Intelligence and Statistics, volume 238 of Proceedings of Machine Learning Research, pages 586–594, 2024.