arxiv: 2606.27153 · v1 · pith:NGZMDLESnew · submitted 2026-06-25 · 💻 cs.DC · cs.LG

DMuon: Efficient Distributed Muon Training with Near-Adam Overhead

Pith reviewed 2026-06-26 02:37 UTC · model grok-4.3

keywords: Muon optimizer, distributed training, deep learning optimizers, LLM training, Newton-Schulz iteration, AdamW comparison, optimizer overhead, embodied models

The pith

DMuon implements Muon as a drop-in distributed module that cuts end-to-end step time by 1.48x-3.01x and optimizer-step time by 6.85x-163x, bringing per-step latency close to AdamW levels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces DMuon to address the mismatch between matrix-orthogonalization optimizers such as Muon and distributed training systems built for element-wise methods. Muon’s updates couple entire weight matrices and require Newton-Schulz iterations; in vanilla implementations the optimizer costs more than 2x the forward and backward passes. DMuon is an open-source implementation that plugs into standard pipelines without framework changes. If the reported speedups hold, practitioners can use Muon’s convergence advantages on embodied and LLM workloads at practical per-step costs.

Core claim

DMuon is an open-source distributed Muon implementation that integrates into existing training pipelines as a drop-in module with no framework-level modifications. Across embodied foundation model and large language model training workloads, DMuon achieves a 1.48x-3.01x speedup in end-to-end step time and a 6.85x-163.00x speedup in optimizer-step time, bringing per-step latency to near-AdamW levels and enabling efficient scaling in model training.
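The gap between the two speedup ranges can be made concrete with a back-of-the-envelope model. The sketch below is illustrative only: it assumes a step is forward-backward time plus optimizer time with no overlap, and it uses a hypothetical optimizer-to-forward-backward ratio of exactly 2.0 (the paper only states "more than 2x"). The function name and the pairing of values are assumptions, not figures from the paper.

```python
def end_to_end_speedup(opt_to_fb_ratio: float, optimizer_speedup: float) -> float:
    """Step-time speedup when only the optimizer phase gets faster.

    Assumes step time = forward-backward time + optimizer time, with no
    overlap and no other overheads (an illustrative simplification).
    """
    baseline = 1.0 + opt_to_fb_ratio                         # forward-backward normalized to 1
    accelerated = 1.0 + opt_to_fb_ratio / optimizer_speedup  # optimizer phase shrinks
    return baseline / accelerated


# Hypothetical ratio of 2.0, evaluated at the two ends of the reported optimizer-step range.
for s in (6.85, 163.0):
    print(f"optimizer speedup {s:7.2f}x -> end-to-end {end_to_end_speedup(2.0, s):.2f}x")
```

Under these assumptions the end-to-end gain is capped at 1 + ratio (3x for a ratio of 2) however fast the optimizer becomes, which is one plausible reason the end-to-end range is so much narrower than the optimizer-step range.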

What carries the argument

DMuon module that distributes Muon’s matrix-aware updates and Newton-Schulz iterations across infrastructure designed for element-wise optimizers.
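For readers unfamiliar with the Newton-Schulz step, here is a minimal single-device sketch in NumPy. It uses the classical cubic iteration for clarity; this is an assumption for illustration and not necessarily the polynomial, precision, or step count used by Muon or DMuon. The function name and the default of 30 steps are hypothetical.

```python
import numpy as np


def newton_schulz_orthogonalize(g: np.ndarray, steps: int = 30) -> np.ndarray:
    """Approximate the orthogonal polar factor U V^T of g = U S V^T.

    Uses the classical cubic iteration X <- 1.5 X - 0.5 X X^T X, chosen here
    for clarity rather than speed.
    """
    # The Frobenius norm upper-bounds the spectral norm, so after scaling all
    # singular values lie in (0, 1], inside the iteration's convergence region.
    x = g / (np.linalg.norm(g) + 1e-12)
    transpose = x.shape[0] > x.shape[1]
    if transpose:
        x = x.T  # work in the wide orientation so X X^T is the smaller Gram matrix
    for _ in range(steps):
        x = 1.5 * x - 0.5 * (x @ x.T) @ x
    return x.T if transpose else x


rng = np.random.default_rng(0)
grad = rng.standard_normal((4, 6))  # stand-in for a gradient or momentum matrix
update = newton_schulz_orthogonalize(grad)
print(np.round(update @ update.T, 3))  # close to the identity for full-rank input
```

Every iteration multiplies full matrices together, so the cost grows with matrix size and each step needs the whole matrix rather than independent entries; that is the property that clashes with infrastructure built around element-wise updates.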

If this is right

  • Existing training codebases can adopt Muon without infrastructure rewrites.
  • Optimizer step time ceases to dominate the loop for Muon users on large models.
  • Matrix-orthogonalization methods become viable for heterogeneous architectures at scale.
  • Per-step latency reaches levels comparable to AdamW while retaining Muon convergence behavior.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The distribution pattern could extend to other matrix-coupling optimizers that currently face similar infrastructure barriers.
  • Widespread use might encourage default selection of non-element-wise optimizers in production training runs.
  • The reported speedups imply that communication patterns for full-matrix operations can be made efficient enough to hide behind compute-bound phases.

Load-bearing premise

The assumption that Muon can be implemented as a drop-in module with no framework-level modifications while still achieving the stated speedups on real distributed training workloads.

What would settle it

A benchmark on the same embodied or LLM workloads in which, under the reported DMuon configuration, the optimizer step still costs more than 2x the forward and backward passes would falsify the overhead claim.
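A check of this kind could be structured roughly as follows. This is a generic timing harness, not the paper's benchmark: the function names, the repeat count, and the `time.sleep` stand-ins are all hypothetical, and a real GPU measurement would also need device synchronization before each timestamp.

```python
import time
from statistics import median


def median_seconds(fn, repeats: int = 5) -> float:
    """Median wall-clock time of fn() over several runs."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return median(samples)


def overhead_claim_falsified(forward_backward, optimizer_step, threshold: float = 2.0):
    """Return (falsified, ratio), where ratio = optimizer time / forward-backward time."""
    ratio = median_seconds(optimizer_step) / median_seconds(forward_backward)
    return ratio > threshold, ratio


# Stand-in workloads; replace with the real training-step phases.
falsified, ratio = overhead_claim_falsified(
    forward_backward=lambda: time.sleep(0.02),
    optimizer_step=lambda: time.sleep(0.002),
)
print(f"optimizer / forward-backward = {ratio:.2f}, claim falsified: {falsified}")
```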

Original abstract

Matrix-orthogonalization-based optimizers, exemplified by Muon, have demonstrated strong convergence behavior across a wide range of modern deep learning workloads. The matrix-aware updates offer a compelling alternative to conventional element-wise optimization, particularly as model architectures continue to grow in scale and heterogeneity. Yet contemporary distributed training infrastructure built around the assumption of element-wise optimizers is poorly matched to matrix-level optimizers such as Muon, whose updates couple entire weight matrices and require costly Newton-Schulz iterations. Vanilla Muon implementations incur more than 2x the cost of forward and backward passes. To close this gap, we present DMuon, an open-source distributed Muon implementation that integrates into existing training pipelines as a drop-in module, with no framework-level modifications. Across both embodied foundation model and large language model (LLM) training workloads, DMuon achieves a 1.48x-3.01x speedup in end-to-end step time and a 6.85x-163.00x speedup in optimizer-step time, bringing per-step latency to near-AdamW levels and enabling efficient scaling in our model training.
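The mismatch the abstract describes can be seen in a toy example. The sketch below assumes a gradient matrix split row-wise across two hypothetical ranks, uses sign descent as a stand-in for an element-wise rule such as AdamW, and uses an exact SVD-based polar factor as a stand-in for the orthogonalization that Muon approximates with Newton-Schulz iterations. None of this reflects how DMuon itself shards or communicates.

```python
import numpy as np


def elementwise_update(g: np.ndarray) -> np.ndarray:
    # Stand-in for an element-wise rule: each entry depends only on itself.
    return np.sign(g)


def orthogonalized_update(g: np.ndarray) -> np.ndarray:
    # Exact polar factor via SVD: every output entry depends on the whole matrix.
    u, _, vt = np.linalg.svd(g, full_matrices=False)
    return u @ vt


rng = np.random.default_rng(0)
grad = rng.standard_normal((8, 4))
shards = np.split(grad, 2, axis=0)  # two hypothetical ranks, four rows each

for name, rule in [("element-wise", elementwise_update),
                   ("orthogonalized", orthogonalized_update)]:
    full = rule(grad)
    per_shard = np.concatenate([rule(s) for s in shards], axis=0)
    print(f"{name:>14}: shard-local equals full-matrix = {np.allclose(full, per_shard)}")
# element-wise   -> True
# orthogonalized -> False
```

The element-wise rule can run on each shard with no communication, whereas the orthogonalized update computed per shard is simply a different matrix, so a sharded system has to gather or otherwise coordinate the full matrix before applying it.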

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript introduces DMuon, an open-source distributed implementation of the Muon optimizer that integrates into existing training pipelines as a drop-in module with no framework-level modifications. It claims 1.48x-3.01x speedup in end-to-end step time and 6.85x-163.00x speedup in optimizer-step time across embodied foundation model and LLM workloads, reducing per-step latency to near-AdamW levels by addressing the overhead of Newton-Schulz iterations in matrix-orthogonalization updates.

Significance. If the reported speedups and drop-in compatibility hold under scrutiny, the work would be significant for enabling practical use of matrix-aware optimizers like Muon at scale in distributed training, where current infrastructure assumes element-wise updates; this could improve convergence properties without the >2x overhead of vanilla Muon implementations.

major comments (1)
  1. Abstract: The central performance claims (1.48x-3.01x end-to-end and 6.85x-163x optimizer-step speedups) are stated without any description of experimental setup, baselines, hardware configuration, statistical methods, error bars, or workload details, rendering the empirical result unevaluable from the provided text.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their feedback. We address the single major comment below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: Abstract: The central performance claims (1.48x-3.01x end-to-end and 6.85x-163x optimizer-step speedups) are stated without any description of experimental setup, baselines, hardware configuration, statistical methods, error bars, or workload details, rendering the empirical result unevaluable from the provided text.

    Authors: We agree that the abstract would benefit from additional context to make the claims evaluable on first reading. In the revised version we will expand the abstract with a brief statement of the workloads (embodied foundation models and LLMs), the hardware platform, and the primary baselines (AdamW and vanilla Muon). Full experimental details, including statistical methods, error bars, and workload specifications, are already reported in the Experiments section; the abstract revision will surface the key setup elements while staying within the length limit.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

The paper is an empirical systems contribution describing a distributed implementation of Muon as a drop-in module. No derivations, equations, fitted parameters, or predictions are presented that could reduce to inputs by construction. All central claims consist of measured speedups on workloads, which are externally falsifiable through replication and do not rely on self-referential logic, self-citations as load-bearing premises, or any ansatz smuggled via prior work. The derivation chain is empty; results rest on implementation performance rather than mathematical construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review contains no mathematical derivations, free parameters, axioms, or invented entities; the contribution is described purely as an implementation and systems optimization.

pith-pipeline@v0.9.1-grok · 5751 in / 1156 out tokens · 58177 ms · 2026-06-26T02:37:45.568795+00:00 · methodology
