DMuon: Efficient Distributed Muon Training with Near-Adam Overhead

Dance Yang; Hang Su; Hao Wang; Lucy Liang; Qian Wang; Regis Cheng; Roy Gan; Ryan Yu; Shalfun Li; Starrick Liu

arxiv: 2606.27153 · v1 · pith:NGZMDLESnew · submitted 2026-06-25 · 💻 cs.DC · cs.LG

Vincent Chen, Starrick Liu, Regis Cheng, Dance Yang, Shalfun Li, Ryan Yu, Lucy Liang, Hang Su, Roy Gan, Hao Wang, Qian Wang

Pith reviewed 2026-06-26 02:37 UTC · model grok-4.3

keywords: Muon optimizer, distributed training, deep learning optimizers, LLM training, Newton-Schulz iteration, AdamW comparison, optimizer overhead, embodied models

The pith

DMuon implements Muon as a drop-in distributed module that speeds up end-to-end step time by 1.48x-3.01x and optimizer-step time by up to 163x, bringing per-step latency close to AdamW's.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces DMuon to solve the mismatch between matrix-orthogonalization optimizers like Muon and existing distributed training systems built for element-wise methods. Muon’s updates require Newton-Schulz iterations that couple entire weight matrices, and vanilla implementations cost more than 2x the forward and backward passes. DMuon supplies an open-source implementation that plugs into standard pipelines without framework changes. If the reported speedups hold, practitioners can use Muon’s convergence advantages on embodied and LLM workloads at practical per-step costs.
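To make the matrix coupling concrete, here is a minimal pure-Python sketch of Newton-Schulz orthogonalization. It uses the textbook cubic iteration X <- 1.5 X - 0.5 (X X^T) X, not the tuned polynomial found in Muon implementations, and it is not DMuon's code. The helper names (`matmul`, `transpose`, `newton_schulz`), the step count, and the example matrix are assumptions chosen for illustration.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def newton_schulz(g, steps=15):
    """Approximate the orthogonal polar factor of g (hypothetical helper).

    Textbook cubic iteration; assumes g has full row rank.
    """
    # Scale by the Frobenius norm so every singular value lies in (0, 1].
    norm = sum(v * v for row in g for v in row) ** 0.5
    x = [[v / norm for v in row] for row in g]
    for _ in range(steps):
        # X X^T mixes every row with every other row, so the update
        # cannot be computed element by element or shard by shard.
        xxt = matmul(x, transpose(x))
        xxtx = matmul(xxt, x)
        x = [[1.5 * p - 0.5 * q for p, q in zip(rp, rq)] for rp, rq in zip(x, xxtx)]
    return x


# Illustrative 2x3 "gradient"; the values are made up.
g = [[3.0, 1.0, 0.0], [1.0, 2.0, 1.0]]
o = newton_schulz(g)
gram = matmul(o, transpose(o))  # should be close to the 2x2 identity
```

Each iteration needs two matrix products over the whole matrix, which is why an optimizer built on it behaves so differently from an element-wise update when parameters are sharded across devices.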

Core claim

DMuon is an open-source distributed Muon implementation that integrates into existing training pipelines as a drop-in module with no framework-level modifications. Across embodied foundation model and large language model training workloads, DMuon achieves a 1.48x-3.01x speedup in end-to-end step time and a 6.85x-163.00x speedup in optimizer-step time, bringing per-step latency to near-AdamW levels and enabling efficient scaling in model training.
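One way to relate the two speedup ranges is an Amdahl-style estimate. The sketch below assumes a step is forward-backward time plus optimizer time, run serially with no overlap; the abstract does not state this, and it does not say which workload produced each endpoint of the ranges. The function name and arguments are hypothetical.

```python
def end_to_end_speedup(opt_to_fb_ratio, optimizer_speedup):
    """Step-time speedup when only the optimizer phase gets faster.

    opt_to_fb_ratio: baseline optimizer time divided by forward-backward time.
    optimizer_speedup: factor by which the optimizer phase is accelerated.
    Assumes the two phases run back to back with no overlap.
    """
    before = 1.0 + opt_to_fb_ratio
    after = 1.0 + opt_to_fb_ratio / optimizer_speedup
    return before / after


# Illustrative pairing only: a baseline optimizer costing 2x forward-backward,
# accelerated by the largest reported optimizer-step factor.
estimate = end_to_end_speedup(2.0, 163.0)
```

Under these assumptions, a baseline optimizer that costs exactly 2x the forward-backward passes caps the end-to-end gain just below 3x, so the reported 3.01x upper end is consistent with a vanilla optimizer cost slightly above 2x.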

What carries the argument

DMuon module that distributes Muon’s matrix-aware updates and Newton-Schulz iterations across infrastructure designed for element-wise optimizers.

If this is right

Existing training codebases can adopt Muon without infrastructure rewrites.
Optimizer step time ceases to dominate the loop for Muon users on large models.
Matrix-orthogonalization methods become viable for heterogeneous architectures at scale.
Per-step latency reaches levels comparable to AdamW while retaining Muon convergence behavior.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The distribution pattern could extend to other matrix-coupling optimizers that currently face similar infrastructure barriers.
Widespread use might encourage default selection of non-element-wise optimizers in production training runs.
The reported speedups imply that communication patterns for full-matrix operations can be made efficient enough to hide behind compute-bound phases.

Load-bearing premise

The assumption that Muon can be implemented as a drop-in module with no framework-level modifications while still achieving the stated speedups on real distributed training workloads.

What would settle it

A benchmark on the same embodied or LLM workloads showing that the optimizer step still costs more than 2x the forward and backward passes under the reported DMuon configuration would falsify the overhead claim.
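The check could be written down as follows. This is a sketch under stated assumptions: per-step timings are collected separately for the optimizer step and for the forward-backward passes, the median is used as the summary statistic, and the 2x threshold is taken from the criterion above. The function names are hypothetical, and the timings are placeholders, not measurements from the paper.

```python
from statistics import median


def overhead_ratio(optimizer_ms, forward_backward_ms):
    """Median optimizer-step time divided by median forward-backward time."""
    return median(optimizer_ms) / median(forward_backward_ms)


def overhead_claim_falsified(optimizer_ms, forward_backward_ms, threshold=2.0):
    """True if the optimizer step still costs more than `threshold` times
    the forward-backward passes."""
    return overhead_ratio(optimizer_ms, forward_backward_ms) > threshold


# Placeholder timings in milliseconds, made up for illustration.
optimizer_ms = [12.0, 11.5, 12.4]
forward_backward_ms = [100.0, 98.0, 101.0]
falsified = overhead_claim_falsified(optimizer_ms, forward_backward_ms)
```

Reporting the spread of the timings alongside the ratio would also address the referee's point about missing variance.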

Original abstract

Matrix-orthogonalization-based optimizers, exemplified by Muon, have demonstrated strong convergence behavior across a wide range of modern deep learning workloads. The matrix-aware updates offer a compelling alternative to conventional element-wise optimization, particularly as model architectures continue to grow in scale and heterogeneity. Yet contemporary distributed training infrastructure built around the assumption of element-wise optimizers is poorly matched to matrix-level optimizers such as Muon, whose updates couple entire weight matrices and require costly Newton-Schulz iterations. Vanilla Muon implementations incur more than 2x the cost of forward and backward passes. To close this gap, we present DMuon, an open-source distributed Muon implementation that integrates into existing training pipelines as a drop-in module, with no framework-level modifications. Across both embodied foundation model and large language model (LLM) training workloads, DMuon achieves a 1.48x-3.01x speedup in end-to-end step time and a 6.85x-163.00x speedup in optimizer-step time, bringing per-step latency to near-AdamW levels and enabling efficient scaling in our model training.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

DMuon is a systems implementation that makes the Muon optimizer usable in distributed training with claimed near-Adam overhead, but the abstract gives no experimental details to support the speedups.

The core takeaway is that the authors built a drop-in distributed Muon that they report runs with 1.48-3.01x better end-to-end step time and much larger gains on the optimizer step itself across embodied and LLM workloads.

What stands out is the practical engineering: they identify that matrix-aware updates clash with element-wise distributed infrastructure and Newton-Schulz cost, then deliver an open-source module that avoids framework changes. That addresses a real deployment friction for anyone trying to use Muon at scale.

The main weakness is that the abstract states quantitative speedups without any description of hardware, baselines, measurement method, or variance. The central performance claim therefore cannot be assessed from the text alone, which leaves the result unverified at this stage.

This is for practitioners running large distributed training who want to try matrix orthogonalization optimizers without rewriting their stack. It is not a conceptual advance but a targeted systems fix.

I would send it to peer review because the problem is concrete and the implementation claim is falsifiable once the methods and code are examined.

Referee Report

1 major / 0 minor

Summary. The manuscript introduces DMuon, an open-source distributed implementation of the Muon optimizer that integrates into existing training pipelines as a drop-in module with no framework-level modifications. It claims 1.48x-3.01x speedup in end-to-end step time and 6.85x-163.00x speedup in optimizer-step time across embodied foundation model and LLM workloads, reducing per-step latency to near-AdamW levels by addressing the overhead of Newton-Schulz iterations in matrix-orthogonalization updates.

Significance. If the reported speedups and drop-in compatibility hold under scrutiny, the work would be significant for enabling practical use of matrix-aware optimizers like Muon at scale in distributed training, where current infrastructure assumes element-wise updates; this could improve convergence properties without the >2x overhead of vanilla Muon implementations.

major comments (1)

Abstract: The central performance claims (1.48x-3.01x end-to-end and 6.85x-163x optimizer-step speedups) are stated without any description of experimental setup, baselines, hardware configuration, statistical methods, error bars, or workload details, rendering the empirical result unevaluable from the provided text.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their feedback. We address the single major comment below and will revise the manuscript accordingly.

Referee: Abstract: The central performance claims (1.48x-3.01x end-to-end and 6.85x-163x optimizer-step speedups) are stated without any description of experimental setup, baselines, hardware configuration, statistical methods, error bars, or workload details, rendering the empirical result unevaluable from the provided text.

Authors: We agree that the abstract would benefit from additional context to make the claims more evaluable on first reading. In the revised version we will expand the abstract with a brief statement of the workloads (embodied foundation models and LLMs), hardware platform, and primary baselines (AdamW and vanilla Muon). Full experimental details, including statistical methods, error bars, and workload specifications, are already reported in the Experiments section; the abstract revision will simply surface the key setup elements while staying within length constraints.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

The paper is an empirical systems contribution describing a distributed implementation of Muon as a drop-in module. No derivations, equations, fitted parameters, or predictions are presented that could reduce to inputs by construction. All central claims consist of measured speedups on workloads, which are externally falsifiable through replication and do not rely on self-referential logic, self-citations as load-bearing premises, or any ansatz smuggled via prior work. The derivation chain is empty; results rest on implementation performance rather than mathematical construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review contains no mathematical derivations, free parameters, axioms, or invented entities; the contribution is described purely as an implementation and systems optimization.
