A Note on Stability for Orthogonalized Matrix Momentum with Client Sampling

Da Chang; Lvgang Zhang; Qiankun Shi; Ruijie Zhang; Yu Li

arxiv: 2606.01720 · v1 · pith:BNGET2DVnew · submitted 2026-06-01 · 💻 cs.LG

Pith reviewed 2026-06-28 16:01 UTC · model grok-4.3

keywords: stability analysis · generalization bounds · orthogonalized momentum · client sampling · distributed optimization · matrix parameters · federated learning

The pith

A stability recursion with client amplification yields finite-round upper-tail generalization bounds for orthogonalized matrix momentum under sampling.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper derives a finite-round upper-tail guarantee on the gap between population and empirical objectives for a distributed optimization scheme using matrix parameters and orthogonalized momentum updates, where only subsets of clients participate each round. Under independent heterogeneous client data, unequal local sample counts, and fixed aggregation weights, the bound is obtained via a coupled-neighbor stability recursion combined with weighted concentration, retaining the effect of client selection through an amplification factor. This is relevant for understanding generalization in federated settings with partial participation. The analysis requires the orthogonalization rule to be Lipschitz along paired trajectories, satisfied by certain regularized maps and smoothers, and includes a counterexample showing the need for such conditions.
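As a rough illustration of the setting described above, the following sketch runs a few rounds of client-sampled orthogonalized momentum on toy quadratic client losses. It is not the paper's algorithm, whose exact update rule is not reproduced on this page. The names `run_rounds`, `orthogonalize`, and `grads_fn`, the server-side placement of the momentum buffer, the regularized polar map used as the orthogonalizer, and uniform sampling without replacement are all assumptions made for illustration.

```python
import numpy as np

def orthogonalize(M, eps=1e-3):
    # Assumed orthogonalizer: regularized polar-type map M (M^T M + eps I)^{-1/2},
    # computed through the SVD by mapping each singular value s to s / sqrt(s^2 + eps).
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * (s / np.sqrt(s**2 + eps))) @ Vt

def run_rounds(grads_fn, W0, weights, rounds=10, clients_per_round=2,
               lr=0.1, beta=0.9, seed=0):
    """Hypothetical server-momentum loop with partial client participation."""
    rng = np.random.default_rng(seed)
    weights = np.asarray(weights, dtype=float)
    num_clients = len(weights)
    W = W0.copy()
    momentum = np.zeros_like(W0)
    counts = np.zeros(num_clients, dtype=int)  # how often each client was selected
    for _ in range(rounds):
        chosen = rng.choice(num_clients, size=clients_per_round, replace=False)
        counts[chosen] += 1
        # Fixed aggregation weights (assumption): the weights are NOT renormalized
        # over the sampled subset.
        G = sum(weights[i] * grads_fn(i, W) for i in chosen)
        momentum = beta * momentum + G
        W = W - lr * orthogonalize(momentum)
    return W, counts

# Toy heterogeneous clients: client i has loss 0.5 * ||W - T_i||_F^2.
targets = [np.full((3, 2), float(i)) for i in range(4)]
grads_fn = lambda i, W: W - targets[i]
W, counts = run_rounds(grads_fn, np.zeros((3, 2)), weights=[0.1, 0.2, 0.3, 0.4])
print("selection counts per client:", counts)
```

The `counts` vector is only a stand-in for the client-selection information that the paper tracks through Y_i(C); the paper's actual definition of that factor is not given here.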

Core claim

Under independent heterogeneous client data, unequal local sample counts, and fixed aggregation weights, we derive a finite-round upper-tail guarantee from a coupled-neighbor stability recursion and a weighted concentration step. The bound keeps the client-selection counts through the amplification factor Y_i(C); in the uniform full-participation full-batch regime, it yields \(\tilde O(n^{-1}+n^{-1/2})\) scaling whenever the horizon-dependent amplification terms are controlled. The matrix-orthogonalization rule is required to be Lipschitz along paired trajectories, a condition satisfied by regularized polar-type maps and normalized finite-step Newton-Schulz orthogonalizers.
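To make the two families of orthogonalizers named above concrete, here is a minimal NumPy sketch. The specific forms are assumptions: `regularized_polar` uses the map M (MᵀM + εI)^{-1/2}, and `newton_schulz` uses the classical cubic iteration after Frobenius normalization. The paper may use different regularizers, coefficients, or normalizations.

```python
import numpy as np

def regularized_polar(M, eps=1e-3):
    """Regularized polar-type map M (M^T M + eps I)^{-1/2} (assumed form).

    Via the SVD M = U diag(s) V^T, each singular value s becomes s / sqrt(s^2 + eps).
    """
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * (s / np.sqrt(s**2 + eps))) @ Vt

def newton_schulz(M, steps=5, tiny=1e-12):
    """Normalized finite-step Newton-Schulz iteration (classical cubic variant)."""
    # Frobenius normalization puts every singular value in [0, 1],
    # inside the convergence region of the cubic iteration.
    X = M / (np.linalg.norm(M) + tiny)
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 3))
print("regularized polar singular values:",
      np.linalg.svd(regularized_polar(M), compute_uv=False))
print("Newton-Schulz (30 steps) singular values:",
      np.linalg.svd(newton_schulz(M, steps=30), compute_uv=False))
```

Both maps are smooth in M for fixed `eps` and a fixed number of steps, which is the kind of regularity the Lipschitz premise asks for; the Lipschitz constants themselves depend on `eps` and `steps` and are not computed here.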

What carries the argument

The coupled-neighbor stability recursion combined with weighted concentration, using the amplification factor Y_i(C) to track client-selection counts.

If this is right

In the uniform full-participation full-batch regime the bound yields \(\tilde O(n^{-1} + n^{-1/2})\) scaling when horizon-dependent amplification terms are controlled.
For the unregularized matrix sign the same argument requires coupled spectral separation.
Gaussian smoothing yields a finite-round smoothed variant of the bound.
A one-dimensional counterexample shows why a gap, smoothing, or regularity condition is necessary.
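The paper's own counterexample is not reproduced on this page. As a generic illustration of why some regularity is needed in one dimension, the sketch below compares the scalar sign map with a regularized version x / sqrt(x² + ε). The helper names `lipschitz_ratio` and `reg_sign` and the value ε = 0.01 are assumptions chosen for the demonstration.

```python
import numpy as np

def lipschitz_ratio(f, x, y):
    # Difference quotient |f(x) - f(y)| / |x - y| for two scalar inputs.
    return abs(f(x) - f(y)) / abs(x - y)

def reg_sign(x, eps=1e-2):
    # Regularized scalar sign; its derivative is at most 1 / sqrt(eps).
    return x / np.sqrt(x * x + eps)

for delta in [1e-1, 1e-3, 1e-6]:
    # Two inputs straddling zero, at distance 2 * delta from each other.
    hard = lipschitz_ratio(np.sign, delta, -delta)   # equals 1 / delta
    soft = lipschitz_ratio(reg_sign, delta, -delta)  # stays below 1 / sqrt(eps)
    print(f"delta={delta:g}  sign ratio={hard:.3g}  regularized ratio={soft:.3g}")
```

For the unregularized sign the quotient grows without bound as the paired inputs approach zero from opposite sides, so two arbitrarily close trajectories can receive updates that differ by a fixed amount. A gap away from zero, smoothing, or regularization removes this blow-up.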

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same recursion technique could be applied to other matrix momentum variants provided the orthogonalizer satisfies an analogous trajectory-Lipschitz property.
Client-selection probabilities might be chosen to keep the amplification factor Y_i(C) small and thereby tighten the bound.
The stability perspective may connect to privacy analyses because trajectory closeness often implies differential-privacy-style guarantees.

Load-bearing premise

The matrix-orthogonalization rule must be Lipschitz along paired optimization trajectories.

What would settle it

Observe that without the Lipschitz condition on the orthogonalizer along paired trajectories the stability recursion diverges and the finite-round upper-tail guarantee no longer holds, as illustrated by the paper's one-dimensional counterexample.

Figures

Figures reproduced from arXiv: 2606.01720 by Da Chang, Lvgang Zhang, Qiankun Shi, Ruijie Zhang, Yu Li.

**Figure 2.** Smooth-polar FedMuon phase diagnostics. Panels (a) and (c) report the bound from Theorem ...

**Figure 3.** Finite-sample and participation diagnostics. Panels (a) and (b) show decreasing bounds and ...

Original abstract

We study finite-sample generalization for a client-sampled distributed optimization scheme with matrix-valued parameters and orthogonalized momentum updates. The central quantity is the gap between the population and empirical objectives at the returned model when only a subset of clients participates in each round. Under independent heterogeneous client data, unequal local sample counts, and fixed aggregation weights, we derive a finite-round upper-tail guarantee from a coupled-neighbor stability recursion and a weighted concentration step. The bound keeps the client-selection counts through the amplification factor \(Y_i(\mathcal C)\); in the uniform full-participation full-batch regime, it yields \(\widetilde{\mathcal O}(n^{-1}+n^{-1/2})\) scaling whenever the horizon-dependent amplification terms are controlled. The matrix-orthogonalization rule is required to be Lipschitz along paired trajectories, a condition satisfied by regularized polar-type maps and normalized finite-step Newton-Schulz orthogonalizers. For the unregularized matrix sign, the same argument requires coupled spectral separation, whereas Gaussian smoothing gives a finite-round smoothed variant. A one-dimensional counterexample shows why a gap, smoothing, or regularity condition is necessary.
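A short worked remark on the stated rate, offered as a reading aid rather than a statement from the paper. For every \(n \ge 1\) we have \(n^{-1} \le n^{-1/2}\), so with constants and logarithms ignored

\[
n^{-1/2} \;\le\; n^{-1} + n^{-1/2} \;\le\; 2\,n^{-1/2}.
\]

For example, \(n = 10^{4}\) gives \(n^{-1} = 10^{-4}\) and \(n^{-1/2} = 10^{-2}\), so the second term is one hundred times larger. The two terms are presumably listed separately because they carry different prefactors, such as the horizon-dependent amplification terms; the abstract does not state those prefactors.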

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

A narrow finite-round stability bound for matrix-momentum FL with sampling that requires the orthogonalizer to be Lipschitz along trajectories.

This paper derives a finite-round upper-tail generalization bound for orthogonalized matrix momentum in a client-sampled distributed optimization setting. The argument uses a coupled-neighbor stability recursion followed by a weighted concentration step, and it keeps the sampling counts visible through the amplification factor Y_i(C).

It does a solid job making the client-selection effect explicit instead of absorbing it into generic constants, and it supplies a one-dimensional counterexample to show why some regularity on the orthogonalizer is needed. The conditions for regularized polar maps and normalized Newton-Schulz steps are stated clearly, and the full-participation regime recovers the usual Õ(n^{-1} + n^{-1/2}) scaling when the horizon terms stay controlled.

The main limitation is the Lipschitz requirement along paired trajectories. That condition is restrictive, and without it the recursion does not close. The bound also carries horizon-dependent amplification factors that can grow with the number of rounds, so the practical range is limited unless those terms are separately bounded. The setting stays within independent heterogeneous data and fixed aggregation weights, which is fine but keeps the result narrow.

The work is aimed at researchers who already work on stability analyses for federated momentum methods. The structure is internally consistent from the description, with no visible circularity, and the counterexample is a useful way to mark the boundary. It is the kind of technical note that deserves a serious referee even if the scope stays small.

Referee Report

0 major / 3 minor

Summary. The manuscript derives a finite-round upper-tail generalization bound for client-sampled distributed optimization with matrix-valued parameters and orthogonalized momentum updates. Under independent heterogeneous client data, unequal local sample sizes, and fixed aggregation weights, the bound is obtained from a coupled-neighbor stability recursion combined with a weighted concentration step; the client-selection counts are retained explicitly via the amplification factor Y_i(C). In the uniform full-participation full-batch regime the bound recovers \(\tilde O(n^{-1} + n^{-1/2})\) scaling once horizon-dependent amplification terms are controlled. The derivation requires the matrix-orthogonalization map to be Lipschitz along paired trajectories (satisfied by regularized polar maps and normalized finite-step Newton-Schulz iterations); a one-dimensional counterexample demonstrates necessity of this regularity condition (or smoothing).

Significance. If the stability recursion and concentration steps are valid, the result supplies a non-asymptotic, client-sampling-aware generalization guarantee for a practically relevant class of momentum-based federated methods with matrix orthogonalization. The explicit dependence on Y_i(C) and the precise statement of the Lipschitz premise on the orthogonalizer are useful for understanding when stability arguments extend to this setting. The counterexample clarifies the boundary of the technique.

minor comments (3)

The abstract states that the bound 'keeps the client-selection counts through the amplification factor Y_i(C)', but the precise definition of Y_i(C) and its dependence on the sampling process should be stated explicitly in the main theorem statement (presumably Theorem X) so that readers can verify the claimed non-circularity.
The one-dimensional counterexample is mentioned but its construction is not reproduced in the abstract; including a short self-contained statement of the counterexample (or a pointer to the exact location) would strengthen the necessity claim.
Notation for the horizon-dependent amplification terms should be introduced once and used consistently when discussing the \(\tilde O(n^{-1} + n^{-1/2})\) regime.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the careful reading, the positive summary of the contribution, and the recommendation of minor revision. No specific major comments were provided in the report.

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained via stability recursion

The paper derives its finite-round upper-tail generalization guarantee explicitly from a coupled-neighbor stability recursion plus weighted concentration step under stated assumptions (Lipschitz orthogonalization along trajectories, satisfied by regularized polar maps and normalized Newton-Schulz). The amplification factor Y_i(C) is introduced to retain client-selection counts inside the bound rather than being fitted or renamed as a prediction. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked; the one-dimensional counterexample is supplied only to show necessity of the premise. The argument chain is independent of its own outputs and does not reduce any claimed result to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The derivation rests on standard concentration inequalities and a stability recursion whose contraction properties depend on the Lipschitz assumption for the orthogonalizer. No free parameters are introduced in the abstract; the bound is expressed in terms of existing quantities (n, client counts, horizon).

axioms (2)

domain assumption: Independent heterogeneous client data and fixed aggregation weights.
Invoked to apply the weighted concentration step and to keep the amplification factor Y_i(C) well-defined.
domain assumption: The orthogonalization map is Lipschitz along paired trajectories.
Required to close the coupled-neighbor stability recursion; stated explicitly as necessary for the bound.

pith-pipeline@v0.9.1-grok · 5737 in / 1411 out tokens · 20554 ms · 2026-06-28T16:01:35.844401+00:00 · methodology

[1] [1]

Stability and generalization.Journal of machine learning research, 2(Mar):499–526, 2002

Olivier Bousquet and Andr´ e Elisseeff. Stability and generalization.Journal of machine learning research, 2(Mar):499–526, 2002

2002

[2] [2]

Train faster, generalize better: Stability of stochastic gradient descent

Moritz Hardt, Ben Recht, and Yoram Singer. Train faster, generalize better: Stability of stochastic gradient descent. InInternational conference on machine learning, pages 1225–1234. PMLR, 2016

2016

[3] [3]

Muon: An optimizer for hidden layers in neural networks, 2024.URL https://kellerjordan

Keller Jordan, Yuchen Jin, Vlado Boza, You Jiacheng, Franz Cesista, Laker Newhouse, and Jeremy Bernstein. Muon: An optimizer for hidden layers in neural networks, 2024.URL https://kellerjordan. github. io/posts/muon, 6(3):4, 2024

2024

[4] [4]

Training deep learning models with norm-constrained lmos.arXiv preprint arXiv:2502.07529, 2025

Thomas Pethick, Wanyun Xie, Kimon Antonakopoulos, Zhenyu Zhu, Antonio Silveti-Falls, and Volkan Cevher. Training deep learning models with norm-constrained lmos.arXiv preprint arXiv:2502.07529, 2025

Pith/arXiv arXiv 2025

[5] [5]

On the convergence of muon and beyond.arXiv preprint arXiv:2509.15816, 2025

Da Chang, Yongxiang Liu, and Ganzhao Yuan. On the convergence of muon and beyond.arXiv preprint arXiv:2509.15816, 2025

Pith/arXiv arXiv 2025

[6] [6]

On provable benefits of muon in federated learning.arXiv preprint arXiv:2510.03866, 2025

Xinwen Zhang and Hongchang Gao. On provable benefits of muon in federated learning.arXiv preprint arXiv:2510.03866, 2025

arXiv 2025

[7] [7]

Fedmuon: Accelerating federated learning with matrix orthogonalization.arXiv preprint arXiv:2510.27403, 2025

Junkang Liu, Fanhua Shang, Junchao Zhou, Hongying Liu, Yuanyuan Liu, and Jin Liu. Fedmuon: Accelerating federated learning with matrix orthogonalization.arXiv preprint arXiv:2510.27403, 2025

arXiv 2025

[8] [8]

Computing the polar decomposition—with applications.SIAM Journal on Scientific and Statistical Computing, 7(4):1160–1174, 1986

Nicholas J Higham. Computing the polar decomposition—with applications.SIAM Journal on Scientific and Statistical Computing, 7(4):1160–1174, 1986

1986

[9] [9]

New perturbation bounds for the unitary polar factor.SIAM Journal on Matrix Analysis and Applications, 16(1):327–332, 1995

Ren-Cang Li. New perturbation bounds for the unitary polar factor.SIAM Journal on Matrix Analysis and Applications, 16(1):327–332, 1995

1995

[10] [10]

Iterative berechung der reziproken matrix.ZAMM-Journal of Applied Mathematics and Mechanics/Zeitschrift f¨ ur Angewandte Mathematik und Mechanik, 13(1):57–59, 1933

G¨ unther Schulz. Iterative berechung der reziproken matrix.ZAMM-Journal of Applied Mathematics and Mechanics/Zeitschrift f¨ ur Angewandte Mathematik und Mechanik, 13(1):57–59, 1933

1933

[11] [11]

Old optimizer, new norm: An anthology.arXiv preprint arXiv:2409.20325, 2024

Jeremy Bernstein and Laker Newhouse. Old optimizer, new norm: An anthology.arXiv preprint arXiv:2409.20325, 2024

Pith/arXiv arXiv 2024

[12] [12]

The polar express: Optimal matrix sign methods and their application to the muon algorithm.arXiv preprint arXiv:2505.16932, 2025

Noah Amsel, David Persson, Christopher Musco, and Robert M Gower. The polar express: Optimal matrix sign methods and their application to the muon algorithm.arXiv preprint arXiv:2505.16932, 2025

Pith/arXiv arXiv 2025

[13] [13]

Fedmuon: Federated learning with bias-corrected lmo-based optimization.arXiv preprint arXiv:2509.26337, 2025

Yuki Takezawa, Anastasia Koloskova, Xiaowen Jiang, and Sebastian U Stich. Fedmuon: Federated learning with bias-corrected lmo-based optimization.arXiv preprint arXiv:2509.26337, 2025

arXiv 2025

[14] [14]

Mimuon: Mixed muon optimizer with improved gener- alization for large models, 2026

Feihu Huang, Yuning Luo, and Songcan Chen. Mimuon: Mixed muon optimizer with improved gener- alization for large models, 2026

2026

[15] [15]

Vitaly Feldman and Jan Vondrak. High probability generalization bounds for uniformly stable algorithms with nearly optimal rate. In Conference on learning theory, pages 1270–1279. PMLR, 2019.

[16] [16]

Olivier Bousquet, Yegor Klochkov, and Nikita Zhivotovskiy. Sharper bounds for uniformly stable algorithms. In Proceedings of the Thirty Third Conference on Learning Theory, volume 125 of Proceedings of Machine Learning Research, pages 610–626. PMLR, 2020.

[17] [17]

Colin McDiarmid et al. On the method of bounded differences. Surveys in combinatorics, 141(1):148–188, 1989.

[18] [18]

Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial intelligence and statistics, pages 1273–1282. PMLR, 2017.

[19] [19]

Sebastian U Stich. Local sgd converges fast and communicates little. arXiv preprint arXiv:1805.09767, 2018.

[20] [20]

Hao Yu, Sen Yang, and Shenghuo Zhu. Parallel restarted sgd with faster convergence and less communication: Demystifying why model averaging works for deep learning. In Proceedings of the AAAI conference on artificial intelligence, volume 33, pages 5693–5700, 2019.

[21] [21]

Hao Yu, Rong Jin, and Sen Yang. On the linear speedup analysis of communication efficient momentum sgd for distributed non-convex optimization. In International Conference on Machine Learning, pages 7184–7193. PMLR, 2019.

[22] [22]

Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank Reddi, Sebastian Stich, and Ananda Theertha Suresh. Scaffold: Stochastic controlled averaging for federated learning. In International conference on machine learning, pages 5132–5143. PMLR, 2020.

[23] [23]

Sashank Reddi, Zachary Charles, Manzil Zaheer, Zachary Garrett, Keith Rush, Jakub Konečný, Sanjiv Kumar, and H Brendan McMahan. Adaptive federated optimization. arXiv preprint arXiv:2003.00295, 2020.

[24] [24]

Ziheng Cheng, Xinmeng Huang, Pengfei Wu, and Kun Yuan. Momentum benefits non-iid federated learning simply and provably. In International Conference on Learning Representations, volume 2024, pages 9815–9848, 2024.

[25] [25]

Yuanyuan Liu, Fanhua Shang, Hongying Liu, Jin Liu, Wei Feng, et al. Fedswa: Improving generalization in federated learning with highly heterogeneous data via momentum-based stochastic controlled weight averaging. arXiv preprint arXiv:2507.20016, 2025.

[26] [26]

Gyu Yeol Kim and Min-hwan Oh. Convergence of muon with newton-schulz. arXiv preprint arXiv:2601.19156, 2026.

[27] [27]

Egor Shulgin, Sultan AlRashed, Francesco Orabona, and Peter Richtárik. Beyond the ideal: Analyzing the inexact muon update. arXiv preprint arXiv:2510.19933, 2025.

[28] [28]

Jianhao Ma, Yu Huang, Yuejie Chi, and Yuxin Chen. Preconditioning benefits of spectral orthogonalization in muon. arXiv preprint arXiv:2601.13474, 2026.

[29] [29]

Tien-Phat Nguyen, Truong Nguyen, Minh-Phuc Truong, Tuc Nguyen, James Bailey, and Trung Le. Spectral flattening is all muon needs: How orthogonalization controls learning rate and convergence. arXiv preprint arXiv:2605.13079, 2026.

[30] [30]

Chongjie Si, Debing Zhang, and Wei Shen. Adamuon: Adaptive muon optimizer. arXiv preprint arXiv:2507.11005, 2025.

[31] [31]

Ruijie Zhang, Yequan Zhao, Ziyue Liu, Zhengyang Wang, Dongyang Li, Yupeng Su, Sijia Liu, and Zheng Zhang. Teon: Tensorized orthonormalization beyond layer-wise muon for large language model pre-training. arXiv preprint arXiv:2601.23261, 2026.

[32] [32]

Ziyue Liu, Ruijie Zhang, Zhengyang Wang, Yequan Zhao, Yupeng Su, Zi Yang, and Zheng Zhang. Muon2: Boosting muon via adaptive second-moment preconditioning. arXiv preprint arXiv:2604.09967, 2026.

[33] [33]

Ahmed Khaled, Kaan Ozkara, Tao Yu, Mingyi Hong, and Youngsuk Park. Muonbp: Faster muon via block-periodic orthogonalization. arXiv preprint arXiv:2510.16981, 2025.

[34] [34]

Kwangjun Ahn, Byron Xu, Natalie Abreu, Ying Fan, Gagik Magakyan, Pratyusha Sharma, Zheng Zhan, and John Langford. Dion: Distributed orthonormalized updates. arXiv preprint arXiv:2504.05295, 2025.

[35] [35]

Vitaly Feldman and Jan Vondrak. Generalization bounds for uniformly stable algorithms. Advances in Neural Information Processing Systems, 31, 2018.

[36] [36]

Zijian Liu and Zhengyuan Zhou. Nonconvex stochastic optimization under heavy-tailed noises: Optimal convergence without gradient clipping. In International Conference on Learning Representations, volume 2025, pages 92529–92554, 2025.

[37] [37]

Su Hyeong Lee, Manzil Zaheer, and Tian Li. Efficient distributed optimization under heavy-tailed noise. arXiv preprint arXiv:2502.04164, 2025.

[38] [38]

Qiankun Shi, Jie Peng, Kun Yuan, Xiao Wang, and Qing Ling. Optimal complexity in byzantine-robust distributed stochastic optimization with data heterogeneity. Journal of Machine Learning Research, 26(268):1–58, 2025.

[39] [39]

Jingyuan Liu, Jianlin Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, Yidao Qin, Weixin Xu, Enzhe Lu, Junjie Yan, et al. Muon is scalable for llm training. arXiv preprint arXiv:2502.16982, 2025.

[40] [40]

Da Chang, Qiankun Shi, Lvgang Zhang, Yu Li, Ruijie Zhang, Yao Lu, Yongxiang Liu, and Ganzhao Yuan. Muoneq: Balancing before orthogonalization with lightweight equilibration. arXiv preprint arXiv:2603.28254, 2026.

[41] [41]

Da Chang and Ganzhao Yuan. Mgup: A momentum-gradient alignment update policy for stochastic optimization. Advances in Neural Information Processing Systems, 38:20488–20537, 2026.

[42] [42]

Tian Zhang, Yujia Tong, Junhao Dong, Ke Xu, Yuze Wang, and Jingling Yuan. Forget by uncertainty: Orthogonal entropy unlearning for quantized neural networks. arXiv preprint arXiv:2602.00567, 2026.

[43] [43]

Yujia Tong, Yuze Wang, Jingling Yuan, and Chuang Hu. Robust machine unlearning for quantized neural networks via adaptive gradient reweighting with similar labels. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 20603–20612, October 2025.

[44] [44]

Da Chang, Peng Xue, Yu Li, Yongxiang Liu, Pengxiang Xu, and Shixun Zhang. Calibrating and rotating: A unified framework for weight conditioning in peft. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 30174–30182, 2026.

[45] [45]

Yu Li, Sizhe Tang, and Tian Lan. Reason in chains, learn in trees: Self-rectification and grafting for multi-turn agent policy optimization. arXiv preprint arXiv:2604.07165, 2026.

[46] [46]

Yu Li, Tian Lan, and Zhengling Qi. Inspo: Unlocking intrinsic self-reflection for llm preference optimization. arXiv preprint arXiv:2512.23126, 2025.

[47] [47]

Yu Li, Rui Miao, Tian Lan, and Zhengling Qi. Oppo: Bayesian value recursion for token-level credit assignment in llm reasoning. arXiv preprint arXiv:2605.21851, 2026.

[48] [48]

Yu Li, Da Chang, and Xi Xiao. Kg-sam: Injecting anatomical knowledge into segment anything models via conditional random fields. arXiv preprint arXiv:2509.21750, 2025.

[49] [49]

Arvind Sankar, Daniel A. Spielman, and Shang-Hua Teng. Smoothed analysis of the condition numbers and growth factors of matrices. SIAM Journal on Matrix Analysis and Applications, 28(2):446–476, 2006.

[50] [50]

Mark Rudelson and Roman Vershynin. The littlewood–offord problem and invertibility of random matrices. Advances in Mathematics, 218(2):600–633, 2008.

[51] [51]


Roman Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science, volume 47 of Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2018.

A Additional Related Work

FedAvg established periodic server averaging of local stochastic updates as a central algorithmic template for communicati...