pith. machine review for the scientific record.

arxiv: 2602.19945 · v1 · submitted 2026-02-23 · 💻 cs.LG · cs.AI

Recognition: 1 theorem link · Lean Theorem

DP-FedAdamW: An Efficient Optimizer for Differentially Private Federated Large Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 20:00 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords differential privacy · federated learning · AdamW optimizer · large language models · convergence analysis · client drift · second moment estimation

The pith

DP-FedAdamW provides an unbiased second-moment estimator for AdamW in differentially private federated learning, enabling linearly accelerated convergence without heterogeneity assumptions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses the challenge of using adaptive optimizers such as AdamW in federated learning under differential privacy constraints for large models. Applying AdamW directly inflates the variance of its moment estimates, since data heterogeneity and privacy noise compound; DP perturbations also bias the second-moment estimator and worsen client drift. DP-FedAdamW stabilizes the variance, corrects the bias, and aligns local updates with the global descent direction to mitigate these issues. The authors prove that this yields an unbiased second-moment estimator and a linearly accelerated convergence rate that requires no assumptions on cross-client data heterogeneity, along with tighter privacy accounting. Experiments on vision and language transformers show consistent gains over existing methods.
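To make the noise source concrete, below is a minimal sketch of the standard per-client Gaussian mechanism (clip, then add noise, as in DP-SGD and DP-FedAvg) that DPFL methods of this kind build on. Function and variable names are illustrative assumptions, not the paper's code, and the paper's exact clipping and noising placement may differ.

```python
import numpy as np

def privatize_update(delta: np.ndarray, clip_norm: float, noise_mult: float,
                     rng: np.random.Generator) -> np.ndarray:
    """Standard Gaussian mechanism on a client's model delta (illustrative)."""
    # Clip the update to L2 norm at most clip_norm, bounding per-client sensitivity.
    norm = np.linalg.norm(delta)
    clipped = delta * min(1.0, clip_norm / (norm + 1e-12))
    # Add isotropic Gaussian noise with std sigma = noise_mult * clip_norm;
    # this per-coordinate noise variance is what inflates second-moment estimates.
    noise = rng.normal(0.0, noise_mult * clip_norm, size=delta.shape)
    return clipped + noise

rng = np.random.default_rng(0)
private_delta = privatize_update(rng.normal(size=1000), clip_norm=1.0,
                                 noise_mult=1.0, rng=rng)
```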

Core claim

We propose DP-FedAdamW, the first AdamW-based optimizer for DPFL. It restores AdamW under DP by stabilizing second-moment variance, removing DP-induced bias, and aligning local updates with the global descent direction to curb client drift; theoretically, it establishes an unbiased second-moment estimator, a linearly accelerated convergence rate without heterogeneity assumptions, and tighter (ε,δ)-DP guarantees.

What carries the argument

The stabilized, bias-corrected second-moment estimator, combined with a local-global alignment mechanism, handles DP noise and client drift in federated AdamW updates.
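As a sketch of what such an estimator can look like, assume the DP noise on each gradient coordinate is additive Gaussian with known variance sigma2 (the standard mechanism sketched above); subtracting that variance from the squared noisy gradient before the square root then debiases the second moment. This is an illustration of the general idea, not the paper's exact algorithm, and all names are hypothetical.

```python
import numpy as np

def adamw_step_debiased(w, m, v, g_noisy, sigma2, lr=1e-3,
                        beta1=0.9, beta2=0.999, eps=1e-8, wd=0.01):
    """One AdamW-style step on a DP-noised gradient with a debiased second moment.

    For z ~ N(0, sigma2), E[(g + z)^2] = g^2 + sigma2, so subtracting sigma2
    yields an unbiased estimate of g^2. Sketch only; Adam's (1 - beta^t)
    bias-correction factors are omitted for brevity.
    """
    m = beta1 * m + (1 - beta1) * g_noisy
    sq_debiased = np.maximum(g_noisy**2 - sigma2, 0.0)  # clamp keeps v >= 0
    v = beta2 * v + (1 - beta2) * sq_debiased
    w = w - lr * (m / (np.sqrt(v) + eps) + wd * w)  # decoupled weight decay
    return w, m, v
```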

If this is right

  • Convergence rate accelerates linearly even with heterogeneous client data.
  • Tighter differential privacy guarantees are achieved compared to prior DPFL methods.
  • Improved performance on large models such as Swin-Base transformers and ResNet-18 under privacy budgets like ε=1.
  • Effective for both language and vision tasks in federated settings.
  • Outperforms state-of-the-art by 5.83% on Tiny-ImageNet with Swin-Base.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The method could support training of even larger foundation models in privacy-sensitive distributed environments.
  • Similar stabilization techniques might apply to other adaptive optimizers in DPFL scenarios.
  • Real-world deployments might see reduced communication rounds due to faster convergence.
  • Further work could explore extensions to other privacy mechanisms beyond (ε,δ)-DP.

Load-bearing premise

The proposed stabilizations and bias corrections fully resolve the amplified variance and client drift issues induced by DP noise and data heterogeneity in practice for large-scale models.
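The alignment component is described only qualitatively in the available text. As a generic illustration of drift control, here is a FedProx-style proximal pull toward the global model, used purely as a stand-in; the paper instead aligns local updates with the global descent direction, which this sketch does not reproduce.

```python
import numpy as np

def local_step_with_drift_control(w_local, w_global, grad, lr=1e-2, mu=0.1):
    """One local step with a proximal drift penalty (stand-in, not the paper's rule).

    The extra term mu * (w_local - w_global) pulls the local iterate back
    toward the global model, limiting how far clients drift apart.
    """
    return w_local - lr * (grad + mu * (w_local - w_global))
```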

What would settle it

Observing persistent bias in the second-moment estimator or sub-linear convergence in an experiment applying DP-FedAdamW to a large language or vision model under DP constraints would disprove the claims.
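A falsification test of the unbiasedness half of that claim is cheap to run on synthetic data. The Monte Carlo sketch below, with hypothetical values, checks that the naive squared noisy gradient overshoots the true second moment by the noise variance while the corrected estimator does not; persistent bias in the corrected column would be the failure mode described above.

```python
import numpy as np

# Monte Carlo check: with additive Gaussian noise of known variance sigma2,
# naive mean((g+z)^2) ~ g^2 + sigma2, corrected estimate ~ g^2.
rng = np.random.default_rng(1)
g, sigma2, trials = 0.3, 4.0, 200_000
noisy = g + rng.normal(0.0, np.sqrt(sigma2), size=trials)
naive = np.mean(noisy**2)   # biased upward by ~sigma2
corrected = naive - sigma2  # approximately unbiased for g^2
print(f"true g^2={g**2:.4f}  naive={naive:.4f}  corrected={corrected:.4f}")
```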

Figures

Figures reproduced from arXiv: 2602.19945 by Jin Liu, Junkang Liu, Ning Xi, Yinbin Miao.

Figure 1: Illustration of our DP-FedAdamW. It aggregates the …
Figure 2: An illustration of the local update in DP-FedAdamW, which corrects client drift through global update guidance.
Figure 3: Training on CIFAR-100, Swin-Tiny, σ=1, α=0.1. (a) Non-IID (DPFL) causes high variance in the second-moment estimator across clients of DP-LocalAdamW. (b) DP-LocalAdamW suffers from more severe client drift than FedAvg and LocalAdamW.
Figure 4: Histogram for DP-LocalAdamW, CIFAR-10, Swin-Tiny, …
Figure 5: Test accuracy (%) on CIFAR-100 using ResNet-18 and Swin-Tiny under the Dirichlet …
read the original abstract

Balancing convergence efficiency and robustness under Differential Privacy (DP) is a central challenge in Federated Learning (FL). While AdamW accelerates training and fine-tuning in large-scale models, we find that directly applying it to Differentially Private FL (DPFL) suffers from three major issues: (i) data heterogeneity and privacy noise jointly amplify the variance of second-moment estimator, (ii) DP perturbations bias the second-moment estimator, and (iii) DP amplify AdamW sensitivity to local overfitting, worsening client drift. We propose DP-FedAdamW, the first AdamW-based optimizer for DPFL. It restores AdamW under DP by stabilizing second-moment variance, removing DP-induced bias, and aligning local updates to the global descent to curb client drift. Theoretically, we establish an unbiased second-moment estimator and prove a linearly accelerated convergence rate without any heterogeneity assumption, while providing tighter $(\varepsilon,\delta)$-DP guarantees. Our empirical results demonstrate the effectiveness of DP-FedAdamW across language and vision Transformers and ResNet-18. On Tiny-ImageNet (Swin-Base, $\varepsilon=1$), DP-FedAdamW outperforms the state-of-the-art (SOTA) by 5.83\%. The code is available in Appendix.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes DP-FedAdamW as the first AdamW-based optimizer for differentially private federated learning (DPFL). It identifies three issues when applying AdamW directly to DPFL—amplified variance in the second-moment estimator from heterogeneity and noise, DP-induced bias in that estimator, and worsened client drift from local overfitting—and addresses them via variance stabilization, bias removal, and local-update alignment to the global descent. The central claims are an unbiased second-moment estimator, a proof of linearly accelerated convergence without heterogeneity assumptions, and tighter (ε,δ)-DP guarantees. Experiments report gains on vision and language Transformers plus ResNet-18, including a 5.83% improvement over SOTA on Tiny-ImageNet (Swin-Base, ε=1).

Significance. If the theoretical claims hold, the work would be significant for enabling efficient, theoretically grounded AdamW optimization in DPFL of large models. The absence of a heterogeneity assumption in the convergence result is a notable strength relative to prior DPFL analyses, and the empirical gains on modern architectures suggest practical relevance for privacy-preserving training of Transformers.

major comments (2)
  1. [Abstract and §4] Abstract and §4 (Theoretical Analysis): the claims of an unbiased second-moment estimator and linearly accelerated convergence without heterogeneity assumptions are load-bearing for the paper's contribution, yet the provided text supplies no derivation steps, explicit assumptions on the noise model, or handling of the DP perturbation in the second-moment term; without these details the central theoretical result cannot be verified.
  2. [§5] §5 (Experiments): the reported 5.83% gain on Tiny-ImageNet (Swin-Base, ε=1) is presented as outperforming SOTA, but the text gives no protocol details, variance across runs, or ablation isolating the contribution of each stabilization component; this leaves the empirical support for the practical claims unverifiable.
minor comments (2)
  1. [Abstract] The code-availability statement appears only in the abstract; it should be repeated with a concrete link or appendix reference in the main text.
  2. [§3] Notation for the stabilized second-moment estimator and the bias-correction term should be introduced with explicit equations before the convergence theorem is stated.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to improve verifiability of both the theoretical claims and empirical results.

read point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (Theoretical Analysis): the claims of an unbiased second-moment estimator and linearly accelerated convergence without heterogeneity assumptions are load-bearing for the paper's contribution, yet the provided text supplies no derivation steps, explicit assumptions on the noise model, or handling of the DP perturbation in the second-moment term; without these details the central theoretical result cannot be verified.

    Authors: We acknowledge that §4 states the main results concisely. The full derivation of the unbiased second-moment estimator (via explicit bias correction for the Gaussian DP noise) and the linear convergence proof (under smoothness and bounded-variance assumptions only) appear in Appendix A. The noise model is standard additive Gaussian perturbation from the DP mechanism, and the second-moment term is corrected before the square-root operation; the standard identity behind such a correction is sketched after these responses. We will insert key derivation steps and an assumptions paragraph into the main §4 text. revision: yes

  2. Referee: [§5] §5 (Experiments): the reported 5.83% gain on Tiny-ImageNet (Swin-Base, ε=1) is presented as outperforming SOTA, but the text gives no protocol details, variance across runs, or ablation isolating the contribution of each stabilization component; this leaves the empirical support for the practical claims unverifiable.

    Authors: We agree that additional details are needed for reproducibility. The revised version will add: a complete experimental protocol subsection (§5.1) covering hyperparameters, DP noise calibration, and client sampling; mean and standard deviation over five independent runs for all reported numbers; and an ablation study (§5.4) that isolates the contribution of variance stabilization, bias removal, and local-update alignment. These changes will substantiate the 5.83% gain. revision: yes
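For the record, the standard computation behind the Gaussian-noise bias correction described in response 1 runs as follows (our notation, not necessarily the paper's):

```latex
% For z ~ N(0, sigma^2 I) independent of g, coordinate-wise:
\mathbb{E}\big[(g+z)\odot(g+z)\big]
  = g\odot g + 2\,g\odot\mathbb{E}[z] + \mathbb{E}[z\odot z]
  = g\odot g + \sigma^{2}\mathbf{1},
% hence the corrected statistic
\hat{v} := (g+z)\odot(g+z) - \sigma^{2}\mathbf{1}
% satisfies E[v_hat] = g (.) g, i.e., it is unbiased for the true second moment.
```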

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

full rationale

The paper introduces DP-FedAdamW to address variance amplification, DP bias in second-moment estimates, and client drift via stabilization, bias removal, and local-global alignment. The core theoretical results, an unbiased second-moment estimator and linearly accelerated convergence without heterogeneity assumptions, are stated as proven outcomes with tighter DP bounds. Nothing in the abstract or stated claims reduces these results to fitted parameters, self-referential definitions, or prior work by the same authors. The derivation chain is self-contained relative to external benchmarks, with no load-bearing self-citation or ansatz smuggling.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Based solely on the abstract, no specific free parameters, ad-hoc axioms, or invented entities are described; the work relies on standard optimization analysis assumptions.

axioms (1)
  • standard math: Standard assumptions for proving convergence rates in stochastic optimization (rendered below).
    Invoked implicitly for the linear convergence claim.
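For concreteness, the two assumptions such convergence proofs typically invoke are rendered below (our formulation; the paper's exact statement may differ):

```latex
% L-smoothness of each local objective F_i:
\|\nabla F_i(x) - \nabla F_i(y)\| \le L\,\|x - y\| \quad \forall x, y;
% bounded variance of local stochastic gradients:
\mathbb{E}\big[\|\nabla f_i(x;\xi) - \nabla F_i(x)\|^{2}\big] \le \sigma_g^{2}.
```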

pith-pipeline@v0.9.0 · 5527 in / 1219 out tokens · 46796 ms · 2026-05-15T20:00:53.752350+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

81 extracted references · 81 canonical work pages · 6 internal anchors

  1. [1]

    Deep learning with differential privacy

    Martin Abadi, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pages 308–318, 2016.

  2. [2]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.

  3. [3]

    A numerical DEs perspective on unfolded linearized ADMM networks for inverse problems

    Weixin An, Yingjie Yue, Yuanyuan Liu, Fanhua Shang, and Hongying Liu. A numerical DEs perspective on unfolded linearized ADMM networks for inverse problems. In Proceedings of the 30th ACM International Conference on Multimedia, pages 5065–5073, 2022.

  4. [4]

    Robust and faster zeroth-order minimax optimization: complexity and applications

    Weixin An, Yuanyuan Liu, Fanhua Shang, and Hongying Liu. Robust and faster zeroth-order minimax optimization: complexity and applications. Advances in Neural Information Processing Systems, 37:37050–37069, 2024.

  5. [5]

    DEs-inspired accelerated unfolded linearized ADMM networks for inverse problems

    Weixin An, Yuanyuan Liu, Fanhua Shang, Hongying Liu, and Licheng Jiao. DEs-inspired accelerated unfolded linearized ADMM networks for inverse problems. IEEE Transactions on Neural Networks and Learning Systems, 36(3):5319–5333, 2024.

  6. [6]

    Toward communication efficient adaptive gradient method

    Xiangyi Chen, Xiaoyun Li, and Ping Li. Toward communication efficient adaptive gradient method. In Proceedings of the 2020 ACM-IMS on Foundations of Data Science Conference, pages 119–128, 2020.

  7. [7]

    Fair federated learning under domain skew with local consistency and domain diversity

    Yuhang Chen, Wenke Huang, and Mang Ye. Fair federated learning under domain skew with local consistency and domain diversity. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12077–12086, 2024.

  8. [8]

    Traffic prediction and load balancing routing algorithm based on deep Q-network for SD-IoT

    Qiao Ding, Nanyu Li, Heng Ding, Jian Wang, Tao Li, Yongqing Chen, Yantuan Xian, and Junyang Chen. Traffic prediction and load balancing routing algorithm based on deep Q-network for SD-IoT. Advanced Engineering Informatics, 68:103596, 2025.

  9. [9]

    Enhancing news classification: Domain-specific guided pretraining based on adaptive selective masking

    Qiao Ding, Heng Ding, Jian Wang, Yantuan Xian, Tao Li, Nanyu Li, Tao Fang, and Junyang Chen. Enhancing news classification: Domain-specific guided pretraining based on adaptive selective masking. Knowledge-Based Systems, page 115516, 2026.

  10. [10]

    Differential privacy

    Cynthia Dwork. Differential privacy. In International Colloquium on Automata, Languages, and Programming, pages 1–12. Springer, 2006.

  11. [11]

    The algorithmic foundations of differential privacy

    Cynthia Dwork, Aaron Roth, et al. The algorithmic foundations of differential privacy. Foundations and Trends® in Theoretical Computer Science, 9(3–4):211–407, 2014.

  12. [12]

    Refiner: Data refining against gradient leakage attacks in federated learning

    Mingyuan Fan, Cen Chen, Chengyu Wang, Xiaodan Li, and Wenmeng Zhou. Refiner: Data refining against gradient leakage attacks in federated learning. In 34th USENIX Security Symposium (USENIX Security 25), pages 3005–3024, 2025.

  13. [13]

    MaskCon: Masked Contrastive Learning for Coarse-Labelled Dataset

    Chen Feng and Ioannis Patras. MaskCon: Masked contrastive learning for coarse-labelled dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.

  14. [14]

    SSR: An Efficient and Robust Framework for Learning with Unknown Label Noise

    Chen Feng, Georgios Tzimiropoulos, and Ioannis Patras. SSR: An efficient and robust framework for learning with unknown label noise. In 33rd British Machine Vision Conference (BMVC), 2022.

  15. [15]

    CLIPCleaner: Cleaning Noisy Labels with CLIP

    Chen Feng, Georgios Tzimiropoulos, and Ioannis Patras. CLIPCleaner: Cleaning noisy labels with CLIP. In The 32nd ACM International Conference on Multimedia (ACM MM), 2024.

  16. [16]

    NoiseBox: Towards More Efficient and Effective Learning with Noisy Labels

    Chen Feng, Georgios Tzimiropoulos, and Ioannis Patras. NoiseBox: Towards more efficient and effective learning with noisy labels. IEEE Transactions on Circuits and Systems for Video Technology, 2024.

  17. [17]

    PROSAC: Provably safe certification for machine learning models under adversarial attacks

    Chen Feng, Ziquan Liu, Zhuo Zhi, Ilija Bogunovic, Carsten Gerner-Beuerle, and Miguel R. D. Rodrigues. PROSAC: Provably safe certification for machine learning models under adversarial attacks. In The 39th Annual AAAI Conference on Artificial Intelligence (AAAI) [Oral], 2025.

  18. [18]

    Unveiling open-set noise: Theoretical insights into label noise

    Chen Feng, Nicu Sebe, Georgios Tzimiropoulos, Miguel R. D. Rodrigues, and Ioannis Patras. Unveiling open-set noise: Theoretical insights into label noise. In The 33rd ACM International Conference on Multimedia (ACM MM), 2025.

  19. [19]

    Noisy but valid: Robust statistical evaluation of LLMs with imperfect judges

    Chen Feng, Minghe Shen, Ananth Balashankar, Carsten Gerner-Beuerle, and Miguel R. D. Rodrigues. Noisy but valid: Robust statistical evaluation of LLMs with imperfect judges. In The Fourteenth International Conference on Learning Representations (ICLR), 2026.

  20. [20]

    Deconstructing the failure of ideal noise correction: A three-pillar diagnosis

    Chen Feng, Zhuo Zhi, Zhao Huang, Jiawei Ge, Ling Xiao, Nicu Sebe, Georgios Tzimiropoulos, and Ioannis Patras. Deconstructing the failure of ideal noise correction: A three-pillar diagnosis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

  21. [21]

    Sharpness-aware minimization for efficiently improving generalization

    Pierre Foret, Ariel Kleiner, Hossein Mobahi, and Behnam Neyshabur. Sharpness-aware minimization for efficiently improving generalization. In ICLR, 2021.

  22. [22]

    Differentially private federated learning: A systematic review

    Jie Fu, Yuan Hong, Xinpeng Ling, Leixia Wang, Xun Ran, Zhiyu Sun, Wendy Hui Wang, Zhili Chen, and Yang Cao. Differentially private federated learning: A systematic review. arXiv preprint arXiv:2405.08299, 2024.

  23. [23]

    Differentially Private Federated Learning: A Client Level Perspective

    Robin C Geyer, Tassilo Klein, and Moin Nabi. Differentially private federated learning: A client level perspective. arXiv preprint arXiv:1712.07557, 2017.

  24. [24]

    DP-BREM: Differentially-Private and Byzantine-Robust federated learning with client momentum

    Xiaolan Gu, Ming Li, and Li Xiong. DP-BREM: Differentially-Private and Byzantine-Robust federated learning with client momentum. In 34th USENIX Security Symposium (USENIX Security 25), pages 3065–3082, 2025.

  25. [25]

    Measuring the effects of non-identical data distribution for federated visual classification

    Tzu-Ming Harry Hsu, Hang Qi, and Matthew Brown. Measuring the effects of non-identical data distribution for federated visual classification. arXiv preprint arXiv:1909.06335, 2019.

  26. [26]

    BERT: Pre-training of deep bidirectional transformers for language understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, Minneapolis, Minnesota, 2019.

  27. [27]

    Adam: A Method for Stochastic Optimization

    Diederik P Kingma. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

  28. [28]

    Learning multiple layers of features from tiny images

    Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.

  29. [29]

    Heavy-tailed class imbalance and why Adam outperforms gradient descent on language models

    Frederik Kunstner, Alan Milligan, Robin Yadav, Mark Schmidt, and Alberto Bietti. Heavy-tailed class imbalance and why Adam outperforms gradient descent on language models. In Advances in Neural Information Processing Systems, pages 30106–30148. Curran Associates, Inc., 2024.

  30. [30]

    Tiny ImageNet visual recognition challenge

    Ya Le and Xuan Yang. Tiny ImageNet visual recognition challenge, 2015. Stanford CS231N.

  31. [31]

    SepPrune: Structured pruning for efficient deep speech separation

    Yuqi Li, Kai Li, Xin Yin, Zhifei Yang, Junhao Dong, Zeyu Dong, Chuanguang Yang, Yingli Tian, and Yao Lu. SepPrune: Structured pruning for efficient deep speech separation. arXiv preprint arXiv:2505.12079, 2025.

  32. [32]

    Frequency-aligned knowledge distillation for lightweight spatiotemporal forecasting

    Yuqi Li, Chuanguang Yang, Hansheng Zeng, Zeyu Dong, Zhulin An, Yongjun Xu, Yingli Tian, and Hao Wu. Frequency-aligned knowledge distillation for lightweight spatiotemporal forecasting. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7262–7272, 2025.

  33. [33]

    A comprehensive survey of interaction techniques in 3D scene generation

    Yuqi Li, Siwei Meng, Chuanguang Yang, Weilun Feng, Junming Liu, Zhulin An, Yikai Wang, and Yingli Tian. A comprehensive survey of interaction techniques in 3D scene generation. Authorea Preprints, 2026.

  34. [34]

    Differentially private federated learning with Laplacian smoothing

    Zhicong Liang, Bao Wang, Quanquan Gu, Stanley Osher, and Yuan Yao. Differentially private federated learning with Laplacian smoothing. Applied and Computational Harmonic Analysis, 72:101660, 2024.

  35. [35]

    Convex relaxation for robust vanishing point estimation in Manhattan world

    Bangyan Liao, Zhenjun Zhao, Haoang Li, Yi Zhou, Yingping Zeng, Hao Li, and Peidong Liu. Convex relaxation for robust vanishing point estimation in Manhattan world. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 15823–15832, 2025.

  36. [36]

    The health-wealth gradient in labor markets: Integrating health, insurance, and social metrics to predict employment density

    Dingyuan Liu, Qiannan Shen, and Jiaci Liu. The health-wealth gradient in labor markets: Integrating health, insurance, and social metrics to predict employment density. Computation, 14(1):22, 2026.

  37. [37]

    Cross-silo federated learning with record-level personalized differential privacy

    Junxu Liu, Jian Lou, Li Xiong, Jinfei Liu, and Xiaofeng Meng. Cross-silo federated learning with record-level personalized differential privacy. In Proceedings of the 2024 ACM SIGSAC Conference on Computer and Communications Security, pages 303–317, 2024.

  38. [38]

    FedBCGD: Communication-efficient accelerated block coordinate gradient descent for federated learning

    Junkang Liu, Fanhua Shang, Yuanyuan Liu, Hongying Liu, Yuangang Li, and YunXiang Gong. FedBCGD: Communication-efficient accelerated block coordinate gradient descent for federated learning. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 2955–2963, 2024.

  39. [39]

    Improving generalization in federated learning with highly heterogeneous data via momentum-based stochastic controlled weight averaging

    Junkang Liu, Yuanyuan Liu, Fanhua Shang, Hongying Liu, Jin Liu, and Wei Feng. Improving generalization in federated learning with highly heterogeneous data via momentum-based stochastic controlled weight averaging. In Forty-second International Conference on Machine Learning, 2025.

  40. [40]

    Consistency of local and global flatness for federated learning

    Junkang Liu, Fanhua Shang, Yuxuan Tian, Hongying Liu, and Yuanyuan Liu. Consistency of local and global flatness for federated learning. In Proceedings of the 33rd ACM International Conference on Multimedia, pages 3875–3883, New York, NY, USA, 2025. Association for Computing Machinery.

  41. [41]

    Consistency of local and global flatness for federated learning

    Junkang Liu, Fanhua Shang, Yuxuan Tian, Hongying Liu, and Yuanyuan Liu. Consistency of local and global flatness for federated learning. In Proceedings of the 33rd ACM International Conference on Multimedia, pages 3875–3883, 2025.

  42. [42]

    FedMuon: Accelerating federated learning with matrix orthogonalization

    Junkang Liu, Fanhua Shang, Junchao Zhou, Hongying Liu, Yuanyuan Liu, and Jin Liu. FedMuon: Accelerating federated learning with matrix orthogonalization. arXiv preprint arXiv:2510.27403, 2025.

  43. [43]

    FedAdamW: A Communication-Efficient Optimizer with Convergence and Generalization Guarantees for Federated Large Models

    Junkang Liu, Fanhua Shang, Kewen Zhu, Hongying Liu, Yuanyuan Liu, and Jin Liu. FedAdamW: A communication-efficient optimizer with convergence and generalization guarantees for federated large models. arXiv preprint arXiv:2510.27486, 2025.

  44. [44]

    DP-FedPGN: Finding global flat minima for differentially private federated learning via penalizing gradient norm

    Junkang Liu, Yuxuan Tian, Fanhua Shang, Yuanyuan Liu, Hongying Liu, Junchao Zhou, and Daorui Ding. DP-FedPGN: Finding global flat minima for differentially private federated learning via penalizing gradient norm. arXiv preprint arXiv:2510.27504, 2025.

  45. [45]

    Kill a bird with two stones: Closing the convergence gaps in non-strongly convex optimization by directly accelerated SVRG with double compensation and snapshots

    Yuanyuan Liu, Fanhua Shang, Weixin An, Hongying Liu, and Zhouchen Lin. Kill a bird with two stones: Closing the convergence gaps in non-strongly convex optimization by directly accelerated SVRG with double compensation and snapshots. In International Conference on Machine Learning, pages 14008–14035. PMLR, 2022.

  46. [46]

    A single-loop accelerated extra-gradient difference algorithm with improved complexity bounds for constrained minimax optimization

    Yuanyuan Liu, Fanhua Shang, Weixin An, Junhao Liu, Hongying Liu, and Zhouchen Lin. A single-loop accelerated extra-gradient difference algorithm with improved complexity bounds for constrained minimax optimization. Advances in Neural Information Processing Systems, 36:61699–61711, 2023.

  47. [47]

    Swin Transformer: Hierarchical vision transformer using shifted windows

    Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin Transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10012–10022, 2021.

  48. [48]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov, Frank Hutter, et al. Fixing weight decay regularization in Adam. arXiv preprint arXiv:1711.05101, 2017.

  49. [49]

    Communication-efficient learning of deep networks from decentralized data

    Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pages 1273–1282. PMLR, 2017.

  50. [50]

    Improving global generalization and local personalization for federated learning

    Lei Meng, Zhuang Qi, Lei Wu, Xiaoyu Du, Zhaochuan Li, Lizhen Cui, and Xiangxu Meng. Improving global generalization and local personalization for federated learning. IEEE Transactions on Neural Networks and Learning Systems, 36(1):76–87, 2025.

  51. [51]

    Rényi differential privacy

    Ilya Mironov. Rényi differential privacy. In Proc. IEEE Computer Security Foundations Symposium (CSF), pages 263–275, 2017.

  52. [52]

    Differentially private federated learning on heterogeneous data

    Maxence Noble, Aurélien Bellet, and Aymeric Dieuleveut. Differentially private federated learning on heterogeneous data. In International Conference on Artificial Intelligence and Statistics, pages 10110–10145. PMLR, 2022.

  53. [53]

    Laplacian smoothing gradient descent

    Stanley Osher, Bao Wang, Penghang Yin, Xiyang Luo, Farzin Barekat, Minh Pham, and Alex Lin. Laplacian smoothing gradient descent. Research in the Mathematical Sciences, 9(3):55, 2022.

  54. [54]

    Federated learning for science: A survey on the path to a trustworthy collaboration ecosystem

    Xin Qi, Meixuan Li, Sijin Zhou, Wei Feng, and Zhuang Qi. Federated learning for science: A survey on the path to a trustworthy collaboration ecosystem. Authorea Preprints.

  55. [55]

    Federated learning in oncology: bridging artificial intelligence innovation and privacy protection

    Xin Qi, Tao Xu, Chengrun Dang, Zhuang Qi, Lei Meng, and Han Yu. Federated learning in oncology: bridging artificial intelligence innovation and privacy protection. Information Fusion, page 104154, 2026.

  56. [56]

    Cross-silo prototypical calibration for federated learning with non-IID data

    Zhuang Qi, Lei Meng, Zitan Chen, Han Hu, Hui Lin, and Xiangxu Meng. Cross-silo prototypical calibration for federated learning with non-IID data. In Proceedings of the 31st ACM International Conference on Multimedia, pages 3099–3107, 2023.

  57. [57]

    On the convergence of Adam and beyond

    Sashank J Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of Adam and beyond. In International Conference on Learning Representations, 2018.

  58. [58]

    On the Convergence of Adam and Beyond

    Sashank J Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of Adam and beyond. arXiv preprint arXiv:1904.09237, 2019.

  59. [59]

    Adaptive federated optimization

    Sashank J Reddi, Zachary Charles, Manzil Zaheer, Zachary Garrett, Keith Rush, Jakub Konečný, Sanjiv Kumar, and Hugh Brendan McMahan. Adaptive federated optimization. In International Conference on Learning Representations, 2021.

  60. [60]

    AI-enhanced disaster risk prediction with explainable SHAP analysis: A multi-class classification approach using XGBoost

    Qiannan Shen and Jing Zhang. AI-enhanced disaster risk prediction with explainable SHAP analysis: A multi-class classification approach using XGBoost. 2025. Preprint, Version 1, posted December 31, 2025.

  61. [61]

    MFTformer: Meteorological-frequency-temporal transformer with block-aligned fusion for traffic flow prediction

    Qiannan Shen and Jing Zhang. MFTformer: Meteorological-frequency-temporal transformer with block-aligned fusion for traffic flow prediction. Research Square, 2026. Preprint, doi:10.21203/rs.3.rs-8770196/v1.

  62. [62]

    Make landscape flatter in differentially private federated learning

    Yifan Shi, Yingqi Liu, Kang Wei, Li Shen, Xueqian Wang, and Dacheng Tao. Make landscape flatter in differentially private federated learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24552–24562, 2023.

  63. [63]

    High-recall deep learning: A gated recurrent unit approach to bank account fraud detection on imbalanced data

    Wenxi Sun, Zhichun Qi, and Qiannan Shen. High-recall deep learning: A gated recurrent unit approach to bank account fraud detection on imbalanced data. 2025.

  64. [64]

    Objective over architecture: Fraud detection under extreme imbalance in bank account opening

    Wenxi Sun, Qiannan Shen, Yijun Gao, Qinkai Mao, Tongsong Qi, and Shuo Xu. Objective over architecture: Fraud detection under extreme imbalance in bank account opening. Computation, 13(12):290, 2025.

  65. [65]

    Efficient federated learning via local adaptive amended optimizer with linear speedup

    Yan Sun, Li Shen, Hao Sun, Liang Ding, and Dacheng Tao. Efficient federated learning via local adaptive amended optimizer with linear speedup. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(12):14453–14464, 2023.

  66. [66]

    LAFS: Landmark-based Facial Self-supervised Learning for Face Recognition

    Zhonglin Sun, Chen Feng, Ioannis Patras, and Georgios Tzimiropoulos. LAFS: Landmark-based facial self-supervised learning for face recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.

  67. [67]

    GLUE: A multi-task benchmark and analysis platform for natural language understanding

    Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355, 2018.

  68. [68]

    DC-SGD: Differentially private SGD with dynamic clipping through gradient norm distribution estimation

    Chengkun Wei, Weixian Li, Gong Chen, and Wenzhi Chen. DC-SGD: Differentially private SGD with dynamic clipping through gradient norm distribution estimation. IEEE Transactions on Information Forensics and Security, 2025.

  69. [69]

    Faster adaptive federated learning

    Xidong Wu, Feihu Huang, Zhengmian Hu, and Heng Huang. Faster adaptive federated learning. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 10379–10387, 2023.

  70. [70]

    Implicit bias of AdamW: ℓ∞-norm constrained optimization

    Shuo Xie and Zhiyuan Li. Implicit bias of AdamW: ℓ∞-norm constrained optimization. In International Conference on Machine Learning, pages 54488–54510. PMLR, 2024.

  71. [71]

    Dual defense: Enhancing privacy and mitigating poisoning attacks in federated learning

    Runhua Xu, Shiqi Gao, Chao Li, James Joshi, and Jianxin Li. Dual defense: Enhancing privacy and mitigating poisoning attacks in federated learning. Advances in Neural Information Processing Systems, 37:70476–70498, 2024.

  72. [72]

    From risk to resilience: Towards assessing and mitigating the risk of data reconstruction attacks in federated learning

    Xiangrui Xu, Zhize Li, Yufei Han, Bin Wang, Jiqiang Liu, and Wei Wang. From risk to resilience: Towards assessing and mitigating the risk of data reconstruction attacks in federated learning. In 34th USENIX Security Symposium (USENIX Security 25), pages 3141–3160, 2025.

  73. [73]

    PrivateFL: Accurate, differentially private federated learning via personalized data transformation

    Yuchen Yang, Bo Hui, Haolin Yuan, Neil Gong, and Yinzhi Cao. PrivateFL: Accurate, differentially private federated learning via personalized data transformation. In 32nd USENIX Security Symposium (USENIX Security 23), pages 1595–1612, 2023.

  74. [74]

    Why transformers need Adam: A Hessian perspective

    Yushun Zhang, Congliang Chen, Tian Ding, Ziniu Li, Ruoyu Sun, and Zhiquan Luo. Why transformers need Adam: A Hessian perspective. Advances in Neural Information Processing Systems, 37:131786–131823, 2024.

  75. [75]

    BALF: Simple and efficient blur aware local feature detector

    Zhenjun Zhao. BALF: Simple and efficient blur aware local feature detector. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 3362–3372, 2024.

  76. [76]

    Benchmark for evaluating initialization of visual-inertial odometry

    Zhenjun Zhao and Ben M Chen. Benchmark for evaluating initialization of visual-inertial odometry. In 2023 42nd Chinese Control Conference (CCC), pages 3935–3940. IEEE, 2023.

  77. [77]

    Advances in global solvers for 3D vision

    Zhenjun Zhao, Heng Yang, Bangyan Liao, Yingping Zeng, Shaocheng Yan, Yingdong Gu, Peidong Liu, Yi Zhou, Haoang Li, and Javier Civera. Advances in global solvers for 3D vision. arXiv preprint arXiv:2602.14662, 2026.

  78. [78]

    Towards understanding convergence and generalization of AdamW

    Pan Zhou, Xingyu Xie, Zhouchen Lin, and Shuicheng Yan. Towards understanding convergence and generalization of AdamW. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(9):6486–6493, 2024.

  79. [79]

    Supplementary material: more implementation details and results (internal anchor)

    Table note: "Depth" denotes the total number of layers (blocks for ResNet/Swin, encoder layers for ViT/RoBERTa), and "Stages" … Section 9.1, More Results, Swin-Tiny/Base on CIFAR-10: Table 10 reports the averaged test accuracy on CIFAR-10 for six different DPFL methods evaluated on Swin-Tiny and Swin-Base, under two data heterogeneity levels (Dirichlet α=0.6 and α=0.1). Overall, DP-FedAdamW consistently achieves the best performance, while Swin-Base is markedly more…

  80. [80]

    DP-LocalAdamW Algorithm (internal anchor)

    For completeness, we provide in Algorithm 2 the full local training procedure of DP-LocalAdamW.

Showing first 80 references.