DMuon: Efficient Distributed Muon Training with Near-Adam Overhead
Pith reviewed 2026-06-26 02:37 UTC · model grok-4.3
The pith
DMuon implements Muon as a drop-in distributed module that speeds up end-to-end step time by 1.48x-3.01x and optimizer-step time by up to 163x, bringing per-step latency close to AdamW levels.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DMuon is an open-source distributed Muon implementation that integrates into existing training pipelines as a drop-in module with no framework-level modifications. Across embodied foundation model and large language model training workloads, DMuon achieves a 1.48x-3.01x speedup in end-to-end step time and a 6.85x-163.00x speedup in optimizer-step time, bringing per-step latency to near-AdamW levels and enabling efficient scaling in model training.
What carries the argument
The argument rests on the DMuon module, which distributes Muon’s matrix-aware updates and Newton-Schulz iterations across infrastructure designed for element-wise optimizers.
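For readers unfamiliar with the orthogonalization step, the following is a minimal, dependency-free sketch of a Newton-Schulz iteration. It uses the classic cubic update rather than the tuned polynomial coefficients found in Muon implementations, and it is not DMuon's code; the function names and the 2x2 example matrix are hypothetical. It illustrates that every step multiplies whole matrices together, which is why the update couples an entire weight matrix instead of acting element by element.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    b_cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]


def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def newton_schulz_orthogonalize(g, steps=30):
    """Approximate the orthogonal polar factor of g.

    Illustrative cubic iteration X <- 1.5*X - 0.5*(X X^T) X.
    Assumption: g is nonzero and has full rank.
    """
    # Scale by the Frobenius norm so every singular value lies in (0, 1],
    # which is inside the convergence region of the cubic iteration.
    fro = math.sqrt(sum(v * v for row in g for v in row))
    x = [[v / fro for v in row] for row in g]
    for _ in range(steps):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [
            [1.5 * xi - 0.5 * yi for xi, yi in zip(x_row, y_row)]
            for x_row, y_row in zip(x, xxt_x)
        ]
    return x


# Hypothetical stand-in for a gradient or momentum matrix.
g = [[2.0, 0.5], [0.3, 1.0]]
q = newton_schulz_orthogonalize(g)

# q times its transpose should be close to the identity matrix.
qqt = matmul(q, transpose(q))
print(qqt)
```

Each iteration costs two matrix products over the full matrix, so a weight that is sharded across devices has to be gathered, or the products themselves have to be distributed, before the update can be computed.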
If this is right
- Existing training codebases can adopt Muon without infrastructure rewrites.
- Optimizer step time ceases to dominate the loop for Muon users on large models.
- Matrix-orthogonalization methods become viable for heterogeneous architectures at scale.
- Per-step latency reaches levels comparable to AdamW while retaining Muon convergence behavior.
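As a rough consistency check on how an optimizer-step speedup translates into an end-to-end speedup, the sketch below applies Amdahl's law. It assumes, purely for illustration, that the baseline optimizer costs exactly twice the forward and backward passes and that nothing else contributes to step time; the function name and that fraction are hypothetical and are not taken from the paper's measurements.

```python
def end_to_end_speedup(optimizer_fraction, optimizer_speedup):
    """Amdahl's law: overall speedup when only the optimizer phase gets faster."""
    remaining = (1.0 - optimizer_fraction) + optimizer_fraction / optimizer_speedup
    return 1.0 / remaining


# Assumption: optimizer time is 2x forward+backward time and nothing else
# contributes, so the optimizer is 2/3 of the baseline step.
fraction = 2.0 / 3.0

# Endpoints of the reported optimizer-step speedup range.
for speedup in (6.85, 163.0):
    overall = end_to_end_speedup(fraction, speedup)
    print(f"{speedup:>6.2f}x optimizer -> {overall:.2f}x step")
```

Under this assumption the end-to-end gain is capped at 1 / (1 - 2/3) = 3x however fast the optimizer becomes, and a baseline whose optimizer share exceeds two thirds raises that cap. The reported ranges span different workloads, so the endpoints need not pair up the way the loop suggests.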
Where Pith is reading between the lines
- The distribution pattern could extend to other matrix-coupling optimizers that currently face similar infrastructure barriers.
- Widespread use might encourage default selection of non-element-wise optimizers in production training runs.
- The reported speedups imply that communication patterns for full-matrix operations can be made efficient enough to hide behind compute-bound phases.
Load-bearing premise
The assumption that Muon can be implemented as a drop-in module with no framework-level modifications while still achieving the stated speedups on real distributed training workloads.
What would settle it
A benchmark on the same embodied or LLM workloads showing optimizer-step time still exceeding forward-backward cost by more than 2x under the reported DMuon configuration would falsify the overhead claim.
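A minimal sketch of how that criterion might be checked, assuming each phase can be wrapped as a plain Python callable. The names time_phase and overhead_claim_falsified are hypothetical, the workloads below are stand-ins, and for GPU training each callable would need to synchronize the device before returning so that wall-clock timings reflect finished work.

```python
import time
from statistics import median


def time_phase(fn, repeats=5):
    """Return the median wall-clock time in seconds of calling fn()."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return median(samples)


def overhead_claim_falsified(optimizer_s, fwd_bwd_s, threshold=2.0):
    """True if the optimizer step still costs more than threshold x forward+backward."""
    return optimizer_s > threshold * fwd_bwd_s


# Stand-in workloads; a real check would time the actual training phases.
fwd_bwd_s = time_phase(lambda: sum(range(10_000)))
optimizer_s = time_phase(lambda: sum(range(5_000)))
print(overhead_claim_falsified(optimizer_s, fwd_bwd_s))
```

Using the median over several repeats reduces sensitivity to one-off stalls, though a careful benchmark would also discard warm-up iterations.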
Original abstract
Matrix-orthogonalization-based optimizers, exemplified by Muon, have demonstrated strong convergence behavior across a wide range of modern deep learning workloads. The matrix-aware updates offer a compelling alternative to conventional element-wise optimization, particularly as model architectures continue to grow in scale and heterogeneity. Yet contemporary distributed training infrastructure built around the assumption of element-wise optimizers is poorly matched to matrix-level optimizers such as Muon, whose updates couple entire weight matrices and require costly Newton-Schulz iterations. Vanilla Muon implementations incur more than 2x the cost of forward and backward passes. To close this gap, we present DMuon, an open-source distributed Muon implementation that integrates into existing training pipelines as a drop-in module, with no framework-level modifications. Across both embodied foundation model and large language model (LLM) training workloads, DMuon achieves a 1.48x-3.01x speedup in end-to-end step time and a 6.85x-163.00x speedup in optimizer-step time, bringing per-step latency to near-AdamW levels and enabling efficient scaling in our model training.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces DMuon, an open-source distributed implementation of the Muon optimizer that integrates into existing training pipelines as a drop-in module with no framework-level modifications. It claims 1.48x-3.01x speedup in end-to-end step time and 6.85x-163.00x speedup in optimizer-step time across embodied foundation model and LLM workloads, reducing per-step latency to near-AdamW levels by addressing the overhead of Newton-Schulz iterations in matrix-orthogonalization updates.
Significance. If the reported speedups and drop-in compatibility hold under scrutiny, the work would be significant for enabling practical use of matrix-aware optimizers like Muon at scale in distributed training, where current infrastructure assumes element-wise updates; this could improve convergence properties without the >2x overhead of vanilla Muon implementations.
major comments (1)
- Abstract: The central performance claims (1.48x-3.01x end-to-end and 6.85x-163x optimizer-step speedups) are stated without any description of experimental setup, baselines, hardware configuration, statistical methods, error bars, or workload details, rendering the empirical result unevaluable from the provided text.
Simulated Author's Rebuttal
We thank the referee for their feedback. We address the single major comment below and will revise the manuscript accordingly.
Point-by-point responses
- Referee (Abstract): The central performance claims (1.48x-3.01x end-to-end and 6.85x-163x optimizer-step speedups) are stated without any description of experimental setup, baselines, hardware configuration, statistical methods, error bars, or workload details, rendering the empirical result unevaluable from the provided text.
  Authors: We agree that the abstract would benefit from additional context to make the claims easier to evaluate on first reading. In the revised version we will expand the abstract with a brief statement of the workloads (embodied foundation models and LLMs), hardware platform, and primary baselines (AdamW and vanilla Muon). Full experimental details, including statistical methods, error bars, and workload specifications, are already reported in the Experiments section; the abstract revision will simply surface the key setup elements while staying within length constraints.
  Revision: yes
Circularity Check
No significant circularity detected
Full rationale
The paper is an empirical systems contribution describing a distributed implementation of Muon as a drop-in module. No derivations, equations, fitted parameters, or predictions are presented that could reduce to inputs by construction. All central claims consist of measured speedups on workloads, which are externally falsifiable through replication and do not rely on self-referential logic, self-citations as load-bearing premises, or any ansatz smuggled via prior work. The derivation chain is empty; results rest on implementation performance rather than mathematical construction.