DP-Muon: Differentially Private Optimization via Matrix-Orthogonalized Momentum
Pith reviewed 2026-05-14 19:54 UTC · model grok-4.3
The pith
DP-Muon applies per-example matrix clipping and Gaussian noise before the momentum and Newton-Schulz steps, inheriting the exact privacy guarantee of the corresponding subsampled Gaussian accountant, with no added cost from post-processing.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DP-Muon clips per-example matrix gradients, adds Gaussian noise to the clipped lot average, then applies momentum and Newton-Schulz orthogonalization as post-processing; by the post-processing theorem, the overall privacy loss equals that of the noisy average alone. The optimization analysis yields explicit finite-horizon and vanishing stationarity guarantees that separate the clipping residual, privacy noise, and orthogonalization approximation error, and shows that the DP-induced bias appears only after the nonlinear map, where a simple correction term removes it.
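The update described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the Newton-Schulz coefficients are those circulated with the public Muon optimizer, and the hyperparameter values are placeholder assumptions.

```python
import numpy as np

def newton_schulz(G, steps=5):
    """Odd-polynomial Newton-Schulz iteration approximating the orthogonal
    factor of G (coefficients from the public Muon code; an assumption here,
    not necessarily the paper's choice)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)      # Frobenius normalization
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T                              # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def dp_muon_step(W, per_example_grads, M, rng,
                 clip=1.0, sigma=1.0, beta=0.95, lr=0.02):
    """One DP-Muon-style update: clip each example's matrix gradient in
    Frobenius norm, average, add Gaussian noise, then apply momentum and
    Newton-Schulz purely as post-processing of the released noisy average."""
    L = len(per_example_grads)
    clipped = [g / max(1.0, np.linalg.norm(g) / clip) for g in per_example_grads]
    noisy_avg = sum(clipped) / L + rng.normal(0.0, sigma * clip / L, size=W.shape)
    M = beta * M + (1 - beta) * noisy_avg    # linear momentum buffer
    W = W - lr * newton_schulz(M)            # nonlinear orthogonalization
    return W, M
```

Only `noisy_avg` touches the data; everything after it reads released values, which is exactly where the post-processing argument applies.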
What carries the argument
The post-processing theorem applied to the nonlinear composition of momentum buffering and Newton-Schulz orthogonalization after noisy lot averaging.
Load-bearing premise
The momentum update and Newton-Schulz orthogonalization act as pure post-processing that does not increase the privacy loss when composed with the noisy clipped average.
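This premise is the standard post-processing property of differential privacy, stated here in the (epsilon, delta) formulation for reference:

```latex
\mathcal{M} \text{ is } (\varepsilon,\delta)\text{-DP and } f \text{ is any data-independent (possibly randomized) measurable map}
\;\Longrightarrow\; f \circ \mathcal{M} \text{ is } (\varepsilon,\delta)\text{-DP}.
```

The premise holds exactly when the momentum buffer and Newton-Schulz iterates read only released noisy averages and optimizer state, never the raw per-example gradients.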
What would settle it
An empirical privacy audit that measures effective epsilon on the same model and dataset for both DP-SGD and DP-Muon; since post-processing adds no privacy loss, any measured epsilon for DP-Muon that exceeds the accountant's prediction would falsify the inheritance claim.
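Such audits typically convert the operating point of a membership-inference attack into an empirical lower bound on epsilon via the DP hypothesis-testing inequalities. A minimal sketch (the function name and interface are ours, not from the paper):

```python
import math

def empirical_epsilon(tpr, fpr, delta=1e-5):
    """Lower bound on epsilon implied by an attack with true/false positive
    rates (tpr, fpr), using the (eps, delta)-DP inequalities
    tpr <= e^eps * fpr + delta and (1 - fpr) <= e^eps * (1 - tpr) + delta."""
    bounds = []
    if fpr > 0 and tpr > delta:
        bounds.append(math.log((tpr - delta) / fpr))
    if tpr < 1 and (1 - fpr) > delta:
        bounds.append(math.log((1 - fpr - delta) / (1 - tpr)))
    return max(bounds) if bounds else 0.0   # no information if neither bound binds
```

Running the same attack against DP-SGD and DP-Muon at matched accountant budgets and comparing the resulting bounds is one concrete form the proposed falsification test could take.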
Original abstract
We study differentially private (DP) training with Muon, a matrix-valued optimizer that updates hidden-layer weights using momentum followed by Newton--Schulz orthogonalization. While DP-SGD is well understood, the interaction between per-example clipping, Gaussian noise, momentum, and nonlinear orthogonalization in Muon has not been systematically analyzed. We formulate DP-Muon, a private Muon procedure that clips per-example matrix gradients, adds Gaussian noise to the clipped lot average, and then applies momentum and Newton--Schulz orthogonalization as post-processing. We prove that DP-Muon inherits the privacy guarantee certified by the corresponding same-lot subsampled Gaussian accountant, with no additional privacy cost from Muon-specific post-processing. On the optimization side, we establish finite-horizon and vanishing stationarity guarantees under per-matrix clipping, with bounds that separate optimization error, clipping residual, privacy noise, and Newton--Schulz approximation error. We further show that the DP-induced bias in Muon arises not in the linear momentum buffer itself, but after the nonlinear Newton--Schulz map, where Gaussian noise induces a matrix-valued heat-smoothing bias. This motivates DP-MuonBC, a bias-corrected variant that removes the leading output-level bias term while preserving the same privacy guarantee. Experiments on E2E and DART show that Muon-style matrix updates improve private fine-tuning, and that DP-MuonBC further improves utility without increasing the privacy budget.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents DP-Muon, a differentially private version of the Muon optimizer. Per-example matrix gradients are clipped, Gaussian noise is added to the clipped lot average, and then momentum and Newton-Schulz orthogonalization are applied as post-processing. The paper proves privacy inheritance from the subsampled Gaussian accountant without extra cost. It derives finite-horizon and vanishing stationarity bounds that isolate optimization error, clipping residual, privacy noise, and Newton-Schulz approximation error. A bias-corrected DP-MuonBC is proposed to address bias induced after the nonlinear map. Experiments on E2E and DART demonstrate utility improvements in private fine-tuning.
Significance. If the claims hold, the work advances DP optimization by showing how to incorporate matrix-orthogonalized momentum without compromising privacy guarantees. The separation of error terms in the analysis and the bias correction are strengths that could guide future DP optimizer designs. The empirical results on E2E and DART provide evidence of practical benefits, though broader benchmarks would enhance impact.
minor comments (2)
- [§3] The privacy proof invokes the post-processing theorem for the nonlinear Newton-Schulz map; while correct in principle, a brief remark on why the map is measurable would aid rigor.
- Figure captions could more explicitly link to the error terms discussed in the bounds.
Simulated Author's Rebuttal
We thank the referee for the positive evaluation of DP-Muon, including the recognition of our privacy inheritance proof, separated error bounds, bias-correction mechanism, and empirical gains on E2E and DART. The minor-revision recommendation is noted; we will incorporate any editorial polishing in the revised manuscript.
Circularity Check
No significant circularity
full rationale
The central privacy claim applies the standard post-processing theorem to deterministic momentum and Newton-Schulz steps performed after the noisy lot average; this is a direct invocation of an external DP result rather than a self-referential reduction. Optimization bounds explicitly decompose error into clipping residual, privacy noise, and approximation terms without fitting parameters to the target quantities or renaming known results. No load-bearing step reduces by construction to the paper's own inputs or to a self-citation chain whose validity depends on the present work.