pith. machine review for the scientific record.

arxiv: 2605.12994 · v1 · submitted 2026-05-13 · 💻 cs.LG

DP-Muon: Differentially Private Optimization via Matrix-Orthogonalized Momentum

Pith reviewed 2026-05-14 19:54 UTC · model grok-4.3

classification 💻 cs.LG
keywords differential privacy · Muon optimizer · Newton-Schulz orthogonalization · matrix gradient clipping · bias correction · private fine-tuning · subsampled Gaussian accountant

The pith

DP-Muon applies per-example matrix clipping and Gaussian noise before momentum and Newton-Schulz steps, inheriting the exact privacy guarantee of the corresponding subsampled Gaussian accountant with no added cost from post-processing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper formulates DP-Muon by clipping per-example matrix gradients, adding Gaussian noise to the clipped lot average, and treating momentum plus Newton-Schulz orthogonalization strictly as post-processing. It proves this composition preserves the certified privacy loss of the underlying accountant exactly. On the optimization side, the work derives finite-horizon and vanishing stationarity bounds that isolate clipping residual, privacy noise, Newton-Schulz approximation error, and the bias that appears only after the nonlinear map. A bias-corrected variant, DP-MuonBC, removes the leading output-level bias term while keeping the same privacy guarantee. Experiments indicate that the matrix-style updates improve private fine-tuning utility on the E2E and DART tasks.

Core claim

DP-Muon clips per-example matrix gradients, adds Gaussian noise to the clipped lot average, then applies momentum and Newton-Schulz orthogonalization as post-processing. By the post-processing theorem, the overall privacy loss equals that of the noisy average alone. The optimization analysis yields explicit finite-horizon and vanishing stationarity guarantees that separate clipping residual, privacy noise, and orthogonalization approximation error; the DP-induced bias appears only after the nonlinear map and is removable by a simple correction term.
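Read literally, the per-step procedure sketches as follows. This is an illustrative reconstruction from the abstract, not the paper's code: function and parameter names are ours, and an exact SVD polar factor stands in for the paper's Newton-Schulz approximation.

```python
import numpy as np

def dp_muon_step(per_example_grads, momentum, clip_norm, noise_mult,
                 lot_size, beta=0.95, rng=None):
    """One illustrative DP-Muon step for a single weight matrix.

    Privacy-critical part: per-example Frobenius clipping plus Gaussian
    noise on the lot average. Everything after the noisy average
    (momentum, orthogonalization) is post-processing and adds no
    privacy cost.
    """
    rng = np.random.default_rng() if rng is None else rng

    # 1. Clip each per-example gradient matrix to Frobenius norm <= clip_norm.
    clipped = [g * min(1.0, clip_norm / max(np.linalg.norm(g), 1e-12))
               for g in per_example_grads]

    # 2. Noisy lot average: the clipped sum has sensitivity clip_norm,
    #    so the noise standard deviation is noise_mult * clip_norm.
    noise = noise_mult * clip_norm * rng.standard_normal(clipped[0].shape)
    noisy_avg = (sum(clipped) + noise) / lot_size

    # 3. Post-processing: momentum buffer ...
    momentum = beta * momentum + noisy_avg

    # 4. ... then orthogonalization. Exact polar factor U V^T via SVD,
    #    standing in for Newton-Schulz.
    u, _, vt = np.linalg.svd(momentum, full_matrices=False)
    return u @ vt, momentum
```

The update returned for a tall matrix has orthonormal columns, which is the defining property of the Muon-style step.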

What carries the argument

The post-processing theorem applied to the nonlinear composition of momentum buffering and Newton-Schulz orthogonalization after noisy lot averaging.
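For reference, a minimal sketch of the classic cubic Newton-Schulz iteration for the orthogonal polar factor. Muon itself reportedly uses a tuned quintic variant; the coefficients below are the textbook ones, and the function name is ours.

```python
import numpy as np

def newton_schulz_orth(g, steps=20):
    """Approximate the orthogonal polar factor U V^T of g.

    Classic cubic Newton-Schulz: X <- 1.5 X - 0.5 X X^T X, which drives
    every singular value in (0, sqrt(3)) to 1. Dividing by the Frobenius
    norm bounds the spectral norm by 1, so the iteration converges.
    """
    x = g / (np.linalg.norm(g) + 1e-12)  # spectral norm <= Frobenius norm
    for _ in range(steps):
        x = 1.5 * x - 0.5 * x @ x.T @ x
    return x
```

Because the map is a fixed polynomial in the noisy input, it is exactly the kind of data-independent transformation the post-processing theorem covers.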

Load-bearing premise

The momentum update and Newton-Schulz orthogonalization act as pure post-processing that does not increase the privacy loss when composed with the noisy clipped average.
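For reference, the standard statement this premise invokes:

```latex
\textbf{Post-processing.} Let $\mathcal{M} : \mathcal{D} \to \mathcal{Y}$ be an
$(\varepsilon, \delta)$-differentially private mechanism and let
$f : \mathcal{Y} \to \mathcal{Z}$ be any randomized, measurable map that does
not depend on the private data. Then $f \circ \mathcal{M}$ is
$(\varepsilon, \delta)$-differentially private. In DP-Muon, $\mathcal{M}$ is
the noisy clipped lot average and $f$ is the composition of the momentum
update and the Newton--Schulz map.
```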

What would settle it

An empirical privacy audit that measures effective epsilon on the same model and dataset for both DP-SGD and DP-Muon; if the measured epsilon for DP-Muon exceeds the accountant prediction by more than the post-processing theorem permits, the inheritance claim is falsified.
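Such an audit typically lower-bounds the effective epsilon from a membership-inference attack's operating point, using the constraint that an (ε, δ)-DP mechanism forces TPR ≤ e^ε · FPR + δ. A minimal sketch of that conversion (the function name is ours; a real audit must also attach confidence intervals to the measured rates):

```python
import math

def empirical_epsilon_lower_bound(tpr, fpr, delta=0.0):
    """Epsilon lower bound implied by an attack's TPR/FPR.

    (eps, delta)-DP implies TPR <= e^eps * FPR + delta and, symmetrically,
    1 - FPR <= e^eps * (1 - TPR) + delta; take the tighter of the two.
    """
    bounds = []
    if fpr > 0 and tpr > delta:
        bounds.append(math.log((tpr - delta) / fpr))
    if (1 - tpr) > 0 and (1 - fpr) > delta:
        bounds.append(math.log((1 - fpr - delta) / (1 - tpr)))
    return max(bounds) if bounds else 0.0
```

If this measured lower bound for DP-Muon ever exceeded the accountant's certified epsilon, the inheritance claim would be falsified.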

Figures

Figures reproduced from arXiv: 2605.12994 by Chenglin Fan, Jihwan Kim.

Figure 1. Evaluation token-level NLL versus training step for the main compared methods.
read the original abstract

We study differentially private (DP) training with Muon, a matrix-valued optimizer that updates hidden-layer weights using momentum followed by Newton--Schulz orthogonalization. While DP-SGD is well understood, the interaction between per-example clipping, Gaussian noise, momentum, and nonlinear orthogonalization in Muon has not been systematically analyzed. We formulate DP-Muon, a private Muon procedure that clips per-example matrix gradients, adds Gaussian noise to the clipped lot average, and then applies momentum and Newton--Schulz orthogonalization as post-processing. We prove that DP-Muon inherits the privacy guarantee certified by the corresponding same-lot subsampled Gaussian accountant, with no additional privacy cost from Muon-specific post-processing. On the optimization side, we establish finite-horizon and vanishing stationarity guarantees under per-matrix clipping, with bounds that separate optimization error, clipping residual, privacy noise, and Newton--Schulz approximation error. We further show that the DP-induced bias in Muon arises not in the linear momentum buffer itself, but after the nonlinear Newton--Schulz map, where Gaussian noise induces a matrix-valued heat-smoothing bias. This motivates DP-MuonBC, a bias-corrected variant that removes the leading output-level bias term while preserving the same privacy guarantee. Experiments on E2E and DART show that Muon-style matrix updates improve private fine-tuning, and that DP-MuonBC further improves utility without increasing the privacy budget.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript presents DP-Muon, a differentially private version of the Muon optimizer. Per-example matrix gradients are clipped, Gaussian noise is added to the clipped lot average, and then momentum and Newton-Schulz orthogonalization are applied as post-processing. The paper proves privacy inheritance from the subsampled Gaussian accountant without extra cost. It derives finite-horizon and vanishing stationarity bounds that isolate optimization error, clipping residual, privacy noise, and Newton-Schulz approximation error. A bias-corrected DP-MuonBC is proposed to address bias induced after the nonlinear map. Experiments on E2E and DART demonstrate utility improvements in private fine-tuning.

Significance. If the claims hold, the work advances DP optimization by showing how to incorporate matrix-orthogonalized momentum without compromising privacy guarantees. The separation of error terms in the analysis and the bias correction are strengths that could guide future DP optimizer designs. The empirical results on E2E and DART provide evidence of practical benefits, though broader benchmarks would enhance impact.

minor comments (2)
  1. [§3] The privacy proof invokes the post-processing theorem for the nonlinear Newton-Schulz map; while correct in principle, a brief remark on why the map is measurable would aid rigor.
  2. Figure captions could more explicitly link to the error terms discussed in the bounds.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive evaluation of DP-Muon, including the recognition of our privacy inheritance proof, separated error bounds, bias-correction mechanism, and empirical gains on E2E and DART. The minor-revision recommendation is noted; we will incorporate any editorial polishing in the revised manuscript.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The central privacy claim applies the standard post-processing theorem to deterministic momentum and Newton-Schulz steps performed after the noisy lot average; this is a direct invocation of an external DP result rather than a self-referential reduction. Optimization bounds explicitly decompose error into clipping residual, privacy noise, and approximation terms without fitting parameters to the target quantities or renaming known results. No load-bearing step reduces by construction to the paper's own inputs or to a self-citation chain whose validity depends on the present work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated beyond standard differential-privacy assumptions and the post-processing theorem.

pith-pipeline@v0.9.0 · 5556 in / 1175 out tokens · 160741 ms · 2026-05-14T19:54:39.493998+00:00 · methodology

discussion (0)

