pith. machine review for the scientific record.

arxiv: 2605.12994 · v1 · submitted 2026-05-13 · 💻 cs.LG

DP-Muon: Differentially Private Optimization via Matrix-Orthogonalized Momentum

Pith reviewed 2026-05-14 19:54 UTC · model grok-4.3

classification 💻 cs.LG
keywords differential privacy · Muon optimizer · Newton-Schulz orthogonalization · matrix gradient clipping · bias correction · private fine-tuning · subsampled Gaussian accountant

The pith

DP-Muon applies per-example matrix clipping and Gaussian noise before momentum and Newton-Schulz steps, inheriting the exact privacy guarantee of the corresponding subsampled Gaussian accountant with no added cost from post-processing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper formulates DP-Muon by clipping per-example matrix gradients, adding Gaussian noise to the clipped lot average, and treating momentum plus Newton-Schulz orthogonalization strictly as post-processing. It proves this composition preserves the certified privacy loss of the underlying accountant exactly. On the optimization side, the work derives finite-horizon and vanishing stationarity bounds that isolate clipping residual, privacy noise, Newton-Schulz approximation error, and the bias that appears only after the nonlinear map. A bias-corrected variant, DP-MuonBC, removes the leading output-level bias term while keeping the same privacy guarantee. Experiments indicate that the matrix-style updates improve private fine-tuning utility on the E2E and DART tasks.

Core claim

DP-Muon clips per-example matrix gradients, adds Gaussian noise to the clipped lot average, then applies momentum and Newton-Schulz orthogonalization as post-processing. By the post-processing theorem, the overall privacy loss equals that of the noisy average alone. The optimization analysis yields explicit finite-horizon and vanishing stationarity guarantees that separate clipping residual, privacy noise, and orthogonalization approximation error; the DP-induced bias appears only after the nonlinear map and is removable by a simple correction term.
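Read literally, the per-step procedure sketches as follows. This is an illustrative reconstruction from the abstract, not the paper's code: function and parameter names are ours, and an exact SVD polar factor stands in for the paper's Newton-Schulz approximation.

```python
import numpy as np

def dp_muon_step(per_example_grads, momentum, clip_norm, noise_mult,
                 lot_size, beta=0.95, rng=None):
    """One illustrative DP-Muon step for a single weight matrix.

    Privacy-critical part: per-example Frobenius clipping plus Gaussian
    noise on the lot average. Everything after the noisy average
    (momentum, orthogonalization) is post-processing and adds no
    privacy cost.
    """
    rng = np.random.default_rng() if rng is None else rng

    # 1. Clip each per-example gradient matrix to Frobenius norm <= clip_norm.
    clipped = [g * min(1.0, clip_norm / max(np.linalg.norm(g), 1e-12))
               for g in per_example_grads]

    # 2. Noisy lot average: the clipped sum has sensitivity clip_norm,
    #    so the noise standard deviation is noise_mult * clip_norm.
    noise = noise_mult * clip_norm * rng.standard_normal(clipped[0].shape)
    noisy_avg = (sum(clipped) + noise) / lot_size

    # 3. Post-processing: momentum buffer ...
    momentum = beta * momentum + noisy_avg

    # 4. ... then orthogonalization. Exact polar factor U V^T via SVD,
    #    standing in for Newton-Schulz.
    u, _, vt = np.linalg.svd(momentum, full_matrices=False)
    return u @ vt, momentum
```

The update returned for a tall matrix has orthonormal columns, which is the defining property of the Muon-style step.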

What carries the argument

The post-processing theorem applied to the nonlinear composition of momentum buffering and Newton-Schulz orthogonalization after noisy lot averaging.
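For reference, a minimal sketch of the classic cubic Newton-Schulz iteration for the orthogonal polar factor. Muon itself reportedly uses a tuned quintic variant; the coefficients below are the textbook ones, and the function name is ours.

```python
import numpy as np

def newton_schulz_orth(g, steps=20):
    """Approximate the orthogonal polar factor U V^T of g.

    Classic cubic Newton-Schulz: X <- 1.5 X - 0.5 X X^T X, which drives
    every singular value in (0, sqrt(3)) to 1. Dividing by the Frobenius
    norm bounds the spectral norm by 1, so the iteration converges.
    """
    x = g / (np.linalg.norm(g) + 1e-12)  # spectral norm <= Frobenius norm
    for _ in range(steps):
        x = 1.5 * x - 0.5 * x @ x.T @ x
    return x
```

Because the map is a fixed polynomial in the noisy input, it is exactly the kind of data-independent transformation the post-processing theorem covers.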

Load-bearing premise

The momentum update and Newton-Schulz orthogonalization act as pure post-processing that does not increase the privacy loss when composed with the noisy clipped average.
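For reference, the standard statement this premise invokes:

```latex
\textbf{Post-processing.} Let $\mathcal{M} : \mathcal{D} \to \mathcal{Y}$ be an
$(\varepsilon, \delta)$-differentially private mechanism and let
$f : \mathcal{Y} \to \mathcal{Z}$ be any randomized, measurable map that does
not depend on the private data. Then $f \circ \mathcal{M}$ is
$(\varepsilon, \delta)$-differentially private. In DP-Muon, $\mathcal{M}$ is
the noisy clipped lot average and $f$ is the composition of the momentum
update and the Newton--Schulz map.
```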

What would settle it

An empirical privacy audit that measures effective epsilon on the same model and dataset for both DP-SGD and DP-Muon; if the measured epsilon for DP-Muon exceeds the accountant prediction by more than the post-processing theorem permits, the inheritance claim is falsified.
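Such an audit typically lower-bounds the effective epsilon from a membership-inference attack's operating point, using the constraint that an (ε, δ)-DP mechanism forces TPR ≤ e^ε · FPR + δ. A minimal sketch of that conversion (the function name is ours; a real audit must also attach confidence intervals to the measured rates):

```python
import math

def empirical_epsilon_lower_bound(tpr, fpr, delta=0.0):
    """Epsilon lower bound implied by an attack's TPR/FPR.

    (eps, delta)-DP implies TPR <= e^eps * FPR + delta and, symmetrically,
    1 - FPR <= e^eps * (1 - TPR) + delta; take the tighter of the two.
    """
    bounds = []
    if fpr > 0 and tpr > delta:
        bounds.append(math.log((tpr - delta) / fpr))
    if (1 - tpr) > 0 and (1 - fpr) > delta:
        bounds.append(math.log((1 - fpr - delta) / (1 - tpr)))
    return max(bounds) if bounds else 0.0
```

If this measured lower bound for DP-Muon ever exceeded the accountant's certified epsilon, the inheritance claim would be falsified.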

Figures

Figures reproduced from arXiv: 2605.12994 by Chenglin Fan, Jihwan Kim.

Figure 1. Evaluation token-level NLL versus training step for the main compared methods.
read the original abstract

We study differentially private (DP) training with Muon, a matrix-valued optimizer that updates hidden-layer weights using momentum followed by Newton--Schulz orthogonalization. While DP-SGD is well understood, the interaction between per-example clipping, Gaussian noise, momentum, and nonlinear orthogonalization in Muon has not been systematically analyzed. We formulate DP-Muon, a private Muon procedure that clips per-example matrix gradients, adds Gaussian noise to the clipped lot average, and then applies momentum and Newton--Schulz orthogonalization as post-processing. We prove that DP-Muon inherits the privacy guarantee certified by the corresponding same-lot subsampled Gaussian accountant, with no additional privacy cost from Muon-specific post-processing. On the optimization side, we establish finite-horizon and vanishing stationarity guarantees under per-matrix clipping, with bounds that separate optimization error, clipping residual, privacy noise, and Newton--Schulz approximation error. We further show that the DP-induced bias in Muon arises not in the linear momentum buffer itself, but after the nonlinear Newton--Schulz map, where Gaussian noise induces a matrix-valued heat-smoothing bias. This motivates DP-MuonBC, a bias-corrected variant that removes the leading output-level bias term while preserving the same privacy guarantee. Experiments on E2E and DART show that Muon-style matrix updates improve private fine-tuning, and that DP-MuonBC further improves utility without increasing the privacy budget.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript presents DP-Muon, a differentially private version of the Muon optimizer. Per-example matrix gradients are clipped, Gaussian noise is added to the clipped lot average, and then momentum and Newton-Schulz orthogonalization are applied as post-processing. The paper proves privacy inheritance from the subsampled Gaussian accountant without extra cost. It derives finite-horizon and vanishing stationarity bounds that isolate optimization error, clipping residual, privacy noise, and Newton-Schulz approximation error. A bias-corrected DP-MuonBC is proposed to address bias induced after the nonlinear map. Experiments on E2E and DART demonstrate utility improvements in private fine-tuning.

Significance. If the claims hold, the work advances DP optimization by showing how to incorporate matrix-orthogonalized momentum without compromising privacy guarantees. The separation of error terms in the analysis and the bias correction are strengths that could guide future DP optimizer designs. The empirical results on E2E and DART provide evidence of practical benefits, though broader benchmarks would enhance impact.

minor comments (2)
  1. [§3] The privacy proof invokes the post-processing theorem for the nonlinear Newton-Schulz map; while correct in principle, a brief remark on why the map is measurable would aid rigor.
  2. Figure captions could more explicitly link to the error terms discussed in the bounds.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive evaluation of DP-Muon, including the recognition of our privacy inheritance proof, separated error bounds, bias-correction mechanism, and empirical gains on E2E and DART. The minor-revision recommendation is noted; we will incorporate any editorial polishing in the revised manuscript.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The central privacy claim applies the standard post-processing theorem to deterministic momentum and Newton-Schulz steps performed after the noisy lot average; this is a direct invocation of an external DP result rather than a self-referential reduction. Optimization bounds explicitly decompose error into clipping residual, privacy noise, and approximation terms without fitting parameters to the target quantities or renaming known results. No load-bearing step reduces by construction to the paper's own inputs or to a self-citation chain whose validity depends on the present work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated beyond standard differential-privacy assumptions and the post-processing theorem.

pith-pipeline@v0.9.0 · 5556 in / 1175 out tokens · 160741 ms · 2026-05-14T19:54:39.493998+00:00 · methodology

discussion (0)

