Pith · machine review for the scientific record

arxiv: 2605.09144 · v1 · submitted 2026-05-09 · 💻 cs.LG

Recognition: 2 Lean theorem links

FedVSSAM: Mitigating Flatness Incompatibility in Sharpness-Aware Federated Learning

Bingcong Li, Bingnan Xiao, Tony Q. S. Quek, Wei Ni, Xin Wang, Yuan Gao


Pith reviewed 2026-05-12 04:47 UTC · model grok-4.3

classification 💻 cs.LG
keywords: federated learning · sharpness-aware minimization · flatness incompatibility · data heterogeneity · non-convex convergence · variance suppression · global gradient alignment

The pith

FedVSSAM mitigates flatness incompatibility in sharpness-aware federated learning by anchoring local searches to a variance-suppressed global direction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

In federated learning with data heterogeneity, local sharpness-aware minimization often converges to flat basins that conflict with the flatter region favored by the global objective, limiting gains in training and generalization. The paper identifies this mismatch as flatness incompatibility, tracing it to heterogeneity, the friendly adversary effect, local updates, and partial participation. FedVSSAM addresses it by deriving a variance-suppressed adjusted direction from local information and applying that same direction during local flatness perturbation, local descent steps, and the global model update. This consistent anchoring replaces isolated local corrections with a more stable global reference. The paper backs the design with non-convex convergence guarantees and reports consistently better performance across heterogeneous federated settings.
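For readers unfamiliar with the base method, a plain SAM update (Foret et al.) first perturbs the weights toward higher loss, then descends using the gradient taken at that perturbed point. A minimal NumPy sketch on a toy quadratic (all names here are illustrative, not the paper's code):

```python
import numpy as np

def sam_step(w, grad_fn, rho=0.05, lr=0.1):
    """One sharpness-aware minimization step (Foret et al.): ascend to a
    worst-case nearby point, then descend with the gradient taken there."""
    g = grad_fn(w)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)  # flatness perturbation
    return w - lr * grad_fn(w + eps)             # descend from the perturbed point

# toy quadratic 0.5*||w||^2, whose gradient is w itself
w = np.array([1.0, -2.0])
for _ in range(50):
    w = sam_step(w, lambda v: v)
```

In federated SAM each client runs steps like this on its own loss, which is exactly where the locally flat basins the paper criticizes come from.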

Core claim

The paper establishes that flatness incompatibility arises from data heterogeneity and the friendly adversary phenomenon and is amplified by local updates and partial device participation. FedVSSAM counters this by constructing a variance-suppressed adjusted direction and applying it uniformly in local flatness search, local descent, and global aggregation, thereby anchoring both perturbation and update steps to a stable global direction rather than purely local signals. The method supplies non-convex convergence guarantees and proves that the mean-square deviation between the adjusted direction and the global gradient remains controlled.

What carries the argument

The variance-suppressed adjusted direction, which blends local SAM perturbations with global gradient information to suppress variance and align local flatness searches with the global objective.
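The excerpt does not reproduce the paper's exact formula for the adjusted direction, so the following is only a hedged sketch of the general variance-suppression idea: blend a slowly varying global estimate with the noisy per-round client average, damping client-to-client variance before the direction is reused.

```python
import numpy as np

def adjusted_direction(h_prev, client_grads, beta=0.9):
    """Blend a running global estimate with this round's client average,
    damping client-to-client variance before the direction is reused.
    (Illustrative stand-in, not the paper's construction.)"""
    g_avg = np.mean(client_grads, axis=0)
    return beta * h_prev + (1.0 - beta) * g_avg

rng = np.random.default_rng(0)
true_g = np.array([1.0, 0.0])  # shared global signal
h = np.zeros(2)
for _ in range(200):
    # heterogeneous clients: shared signal plus large client-specific noise
    grads = true_g + rng.normal(scale=2.0, size=(10, 2))
    h = adjusted_direction(h, grads)
```

The suppressed direction `h` hugs the shared signal much more tightly than any single round's client average would; FedVSSAM's distinctive move, per the abstract, is to reuse one such direction in the perturbation, the descent, and the aggregation.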

If this is right

  • Non-convex convergence guarantees hold for FedVSSAM.
  • The mean-square deviation between the adjusted direction and the global gradient is provably bounded.
  • The method outperforms standard SAM and other baselines across diverse federated settings with varying heterogeneity and participation rates.
  • Consistent use of the adjusted direction in perturbation, descent, and aggregation steps directly reduces the identified structural incompatibility.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The variance-control technique may generalize to other client-drift correction methods in distributed optimization by enforcing directional consistency rather than gradient averaging alone.
  • If the adjusted direction reduces effective drift, it could lower the number of communication rounds required to reach target accuracy under heterogeneity.
  • The result points to direction inconsistency across clients as a distinct bottleneck separate from gradient magnitude divergence.

Load-bearing premise

That constructing a variance-suppressed adjusted direction from local information can simultaneously resolve local-global flatness mismatch without introducing bias or slowing convergence under arbitrary heterogeneity levels.

What would settle it

An experiment in which, under high data heterogeneity, the mean-square deviation between the adjusted direction and the global gradient fails to decrease, or in which global test accuracy shows no improvement over standard SAM, would falsify the central claim.
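A toy version of that diagnostic (not the paper's experiment; the quadratic clients, participation rate, and EMA-style direction `h` are all assumptions here) tracks the squared deviation between the adjusted direction and the true global gradient round by round:

```python
import numpy as np

rng = np.random.default_rng(1)
centers = rng.normal(scale=3.0, size=(20, 2))     # 20 heterogeneous clients
w, h, beta, lr = np.zeros(2), np.zeros(2), 0.9, 0.05
errs = []
for t in range(300):
    part = rng.choice(20, size=4, replace=False)  # 20% participation
    local_grads = w - centers[part]               # grad of 0.5*||w - c_i||^2
    h = beta * h + (1.0 - beta) * local_grads.mean(axis=0)
    g_global = w - centers.mean(axis=0)           # true global gradient
    errs.append(float(np.sum((h - g_global) ** 2)))
    w = w - lr * h                                # global update along h
```

The falsification test in this framing: if `errs` stays flat at the raw partial-participation noise level under high heterogeneity, the deviation-control claim fails.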

Figures

Figures reproduced from arXiv: 2605.09144 by Bingcong Li, Bingnan Xiao, Tony Q. S. Quek, Wei Ni, Xin Wang, Yuan Gao.

Figure 1: Visualization of local and global loss landscapes for FedSAM on CIFAR-10.
Figure 3: Ablation studies on the variance suppression operations in FedVSSAM; these operations control the MSE between the variance-suppressed adjusted direction and the global gradient.
Figure 4: Convergence performance of different algorithms on CIFAR-10 and CIFAR-100.
Figure 5: Convergence performance of different algorithms on DBpedia-14.
Figure 6: t-SNE visualization of global model feature embeddings on CIFAR-10 with ResNet-18.
Figure 7: Tracking error between h_t and the global gradient for FedVSSAM; (a) α = 0.5, (b) α = 0.1.
Figure 9: Sensitivity of FedVSSAM to different parameters on the CIFAR-10 dataset.
Original abstract

Sharpness-aware minimization (SAM) is an effective method for improving the generalization of federated learning (FL) by steering local training toward flat minima. Under data heterogeneity, however, device-side SAM searches for locally flat basins that are incompatible with the flat region preferred by the global objective. We identify this structural failure mode as flatness incompatibility, which explains why improving local flatness alone may provide limited training and generalization improvement for the global model. We reveal that flatness incompatibility arises from data heterogeneity and the friendly adversary phenomenon, and is further amplified by local updates and partial device participation. To mitigate this issue, we propose Federated Learning with variance-suppressed sharpness-aware minimization (FedVSSAM), which constructs a variance-suppressed adjusted direction and uses it consistently in local flatness search, local descent, and global update. FedVSSAM anchors both perturbation and update directions to a more stable global direction, instead of correcting only an isolated local perturbation. We establish non-convex convergence guarantees of FedVSSAM and prove that the mean-square deviation between the adjusted direction and the global gradient is effectively controlled. Experiments demonstrate that FedVSSAM mitigates flatness incompatibility and outperforms the baselines across diverse FL settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper identifies 'flatness incompatibility' in sharpness-aware minimization (SAM) for federated learning (FL) under data heterogeneity, where local flat minima conflict with the global objective due to heterogeneity, friendly-adversary effects, local updates, and partial participation. It proposes FedVSSAM, which constructs a variance-suppressed adjusted direction from local information and applies it consistently for local perturbation, descent, and global update to anchor to a stable global direction. The manuscript claims non-convex convergence guarantees for FedVSSAM along with a proof that the mean-square deviation between this adjusted direction and the global gradient is controlled, and reports that experiments show mitigation of the incompatibility with outperformance over baselines in diverse FL settings.

Significance. If the convergence guarantees and mean-square deviation control hold under the claimed arbitrary heterogeneity levels, the work would meaningfully advance SAM applications in FL by addressing a structural mismatch that standard local SAM does not resolve. The consistent application of the adjusted direction across phases and the explicit identification of flatness incompatibility as a distinct failure mode represent conceptual contributions; reproducible experiments across settings would further strengthen the case for practical impact in heterogeneous FL.

major comments (3)
  1. [§4, Theorem 1] §4 (Convergence Analysis), Theorem 1 and surrounding lemmas: the claimed non-convex convergence rate and the bound on mean-square deviation of the variance-suppressed direction from the global gradient appear to rely on controlling local gradient variance; it is unclear whether these bounds remain independent of the heterogeneity parameter under the arbitrary heterogeneity levels asserted in the abstract and §3.2, or whether they implicitly require bounded dissimilarity as in standard FL analyses.
  2. [§3.3] §3.3 (FedVSSAM Algorithm): the construction of the variance-suppressed adjusted direction uses only local information (even if aggregated); under partial participation and high heterogeneity, the friendly-adversary phenomenon could cause the deviation term to scale with the heterogeneity measure, undermining both the convergence claim and the mitigation of flatness incompatibility without additional assumptions or clipping.
  3. [§5] §5 (Experiments): while outperformance is reported, the description lacks explicit quantification of heterogeneity levels (e.g., Dirichlet α values), number of local steps, and participation rates used to stress-test the deviation control; without these, it is difficult to confirm that the method succeeds precisely where flatness incompatibility is most severe.
minor comments (2)
  1. [Abstract, §1] The abstract and introduction introduce 'flatness incompatibility' as a new term; a brief formal definition or equation characterizing the incompatibility (e.g., difference in sharpness measures) would aid readability.
  2. [§3] Notation for the adjusted direction (e.g., how variance suppression is exactly formulated) should be introduced earlier and used consistently in both the algorithm box and the proof.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments on the convergence analysis, algorithmic design, and experimental reporting. We address each major comment point by point below.

read point-by-point responses
  1. Referee: [§4, Theorem 1] §4 (Convergence Analysis), Theorem 1 and surrounding lemmas: the claimed non-convex convergence rate and the bound on mean-square deviation of the variance-suppressed direction from the global gradient appear to rely on controlling local gradient variance; it is unclear whether these bounds remain independent of the heterogeneity parameter under the arbitrary heterogeneity levels asserted in the abstract and §3.2, or whether they implicitly require bounded dissimilarity as in standard FL analyses.

    Authors: The non-convex convergence rate in Theorem 1 and the mean-square deviation bound are derived without invoking bounded dissimilarity. The variance-suppressed adjusted direction is constructed to bound the deviation from the global gradient via a suppression term whose expectation is controlled independently of the heterogeneity parameter; the proof relies on this property holding for arbitrary heterogeneity levels as stated in §3.2. We will add a short clarifying remark after Theorem 1 to make the independence explicit. revision: partial

  2. Referee: [§3.3] §3.3 (FedVSSAM Algorithm): the construction of the variance-suppressed adjusted direction uses only local information (even if aggregated); under partial participation and high heterogeneity, the friendly-adversary phenomenon could cause the deviation term to scale with the heterogeneity measure, undermining both the convergence claim and the mitigation of flatness incompatibility without additional assumptions or clipping.

    Authors: The variance suppression is specifically introduced to counteract the friendly-adversary effect by damping local variance in the adjusted direction. The mean-square deviation control proved in §4 holds under partial participation because the global aggregation of these suppressed directions anchors the update; the bound does not scale with heterogeneity and requires no extra clipping or assumptions beyond those already stated. revision: no

  3. Referee: [§5] §5 (Experiments): while outperformance is reported, the description lacks explicit quantification of heterogeneity levels (e.g., Dirichlet α values), number of local steps, and participation rates used to stress-test the deviation control; without these, it is difficult to confirm that the method succeeds precisely where flatness incompatibility is most severe.

    Authors: We agree that explicit parameter values will strengthen the experimental section. The reported results used Dirichlet α ∈ {0.1, 0.5, 1.0}, 5 local steps, and participation rates of 10% and 20%. We will insert these values into the experimental setup description in the revised manuscript. revision: yes
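For context, the Dirichlet(α) split the rebuttal cites is the standard protocol for inducing label heterogeneity (Hsu et al.): each class's samples are divided across clients according to Dirichlet proportions, with smaller α concentrating a class on fewer clients. A hedged sketch of how such a partition is typically generated (function and variable names are illustrative):

```python
import numpy as np

def dirichlet_partition(labels, n_clients, alpha, rng):
    """Split sample indices across clients, class by class, using
    Dirichlet(alpha) proportions. Returns one index array per client."""
    parts = [[] for _ in range(n_clients)]
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        p = rng.dirichlet(alpha * np.ones(n_clients))   # class-c shares
        cuts = (np.cumsum(p)[:-1] * len(idx)).astype(int)
        for client, chunk in enumerate(np.split(idx, cuts)):
            parts[client].extend(chunk.tolist())
    return [np.array(p_) for p_ in parts]

rng = np.random.default_rng(0)
labels = np.repeat(np.arange(10), 500)  # 10 classes, 5000 samples
parts = dirichlet_partition(labels, 10, 0.1, rng)
```

With α = 0.1, most clients see only a handful of classes, which is the regime where the referee wants the deviation control stress-tested.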

Circularity Check

0 steps flagged

No circularity: convergence and deviation-control claims presented as independent derivations without reduction to inputs or self-citations.

full rationale

Abstract and available text claim establishment of non-convex convergence guarantees plus a proof that mean-square deviation of the variance-suppressed adjusted direction from the global gradient is controlled. No equations, self-citations, fitted parameters renamed as predictions, or ansatzes are quoted that would make these results equivalent to the method definition by construction. The central premise (variance suppression from local information mitigating flatness incompatibility) is not shown to reduce to a tautology or prior self-citation chain. This is the normal case of a self-contained derivation whose validity rests on external verification of the (unprovided) proof rather than internal circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The new concept of flatness incompatibility is introduced as an explanatory label rather than a formal entity with independent evidence.

invented entities (1)
  • flatness incompatibility no independent evidence
    purpose: Explains why local SAM fails to improve global generalization under heterogeneity
    Identified in abstract as arising from data heterogeneity, friendly adversary, local updates, and partial participation; no independent falsifiable prediction given.

pith-pipeline@v0.9.0 · 5532 in / 1258 out tokens · 22352 ms · 2026-05-12T04:47:59.642597+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

66 extracted references · 66 canonical work pages · 2 internal anchors

  1. [1]

    What- mough, and Venkatesh Saligrama

    Durmus Alp Emre Acar, Yue Zhao, Ramon Matas Navarro, Matthew Mattina, Paul N. What- mough, and Venkatesh Saligrama. Federated learning based on dynamic regularization. In International Conference on Learning Representations (ICLR), 2021

  2. [2]

    Towards understanding sharpness-aware minimization

    Maksym Andriushchenko and Nicolas Flammarion. Towards understanding sharpness-aware minimization. InProceedings of the 39th International Conference on Machine Learning, volume 162 ofProceedings of Machine Learning Research, pages 639–668. PMLR, 17–23 Jul 2022

  3. [3]

    Springer, 2009

    Alan Bain and Dan Crisan.Fundamentals of Stochastic Filtering. Springer, 2009

  4. [4]

    Towards federated learning at scale: System design

    Keith Bonawitz, Hubert Eichner, Wolfgang Grieskamp, Dzmitry Huba, Alex Ingerman, Vladimir Ivanov, Chloé Kiddon, Jakub Koneˇcný, Stefano Mazzocchi, Brendan McMahan, Ti- mon Van Overveldt, David Petrou, Daniel Ramage, and Jason Roselander. Towards federated learning at scale: System design. In A. Talwalkar, V . Smith, and M. Zaharia, editors,Proceed- ings o...

  5. [5]

    Improving generalization in feder- ated learning by seeking flat minima

    Debora Caldarola, Barbara Caputo, and Marco Ciccone. Improving generalization in feder- ated learning by seeking flat minima. In Shai Avidan, Gabriel Brostow, Moustapha Cissé, Giovanni Maria Farinella, and Tal Hassner, editors,Computer Vision – ECCV 2022, pages 654–672. Springer Nature Switzerland, 2022

  6. [6]

    Beyond local sharp- ness: Communication-efficient global sharpness-aware minimization for federated learning

    Debora Caldarola, Pietro Cagnasso, Barbara Caputo, and Marco Ciccone. Beyond local sharp- ness: Communication-efficient global sharpness-aware minimization for federated learning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 25187–25197, June 2025

  7. [7]

    Momentum benefits non-iid federated learning simply and provably

    Ziheng Cheng, Xinmeng Huang, Pengfei Wu, and Kun Yuan. Momentum benefits non-iid federated learning simply and provably. InThe Twelfth International Conference on Learning Representations, 2024

  8. [8]

    Exploiting shared representations for personalized federated learning

    Liam Collins, Hamed Hassani, Aryan Mokhtari, and Sanjay Shakkottai. Exploiting shared representations for personalized federated learning. InProceedings of the 38th International Conference on Machine Learning, volume 139 ofProceedings of Machine Learning Research, pages 2089–2099. PMLR, 18–24 Jul 2021

  9. [9]

    FedGAMMA: Federated learning with global sharpness-aware minimization.IEEE Transac- tions on Neural Networks and Learning Systems, 35(12):17479–17492, 2024

    Rong Dai, Xun Yang, Yan Sun, Li Shen, Xinmei Tian, Meng Wang, and Yongdong Zhang. FedGAMMA: Federated learning with global sharpness-aware minimization.IEEE Transac- tions on Neural Networks and Learning Systems, 35(12):17479–17492, 2024. 10

  10. [10]

    Efficient sharpness-aware minimization for improved training of neural net- works

    Jiawei Du, Hanshu Yan, Jiashi Feng, Joey Tianyi Zhou, Liangli Zhen, Rick Siow Mong Goh, and Vincent Tan. Efficient sharpness-aware minimization for improved training of neural net- works. InInternational Conference on Learning Representations, 2022

  11. [11]

    Gintare Karolina Dziugaite and Daniel M. Roy. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. InPro- ceedings of International Conference on Machine Learning Workshop, 2017

  12. [12]

    Personalized federated learning with theoretical guarantees: A model-agnostic meta-learning approach

    Alireza Fallah, Aryan Mokhtari, and Asuman Ozdaglar. Personalized federated learning with theoretical guarantees: A model-agnostic meta-learning approach. InAdvances in Neural Information Processing Systems, volume 33, pages 3557–3568, 2020

  13. [13]

    Locally estimated global perturbations are better than local perturbations for federated sharpness-aware minimization

    Ziqing Fan, Shengchao Hu, Jiangchao Yao, Gang Niu, Ya Zhang, Masashi Sugiyama, and Yanfeng Wang. Locally estimated global perturbations are better than local perturbations for federated sharpness-aware minimization. InForty-first International Conference on Machine Learning, 2024

  14. [14]

    Sharpness-aware min- imization for efficiently improving generalization

    Pierre Foret, Ariel Kleiner, Hossein Mobahi, and Behnam Neyshabur. Sharpness-aware min- imization for efficiently improving generalization. InInternational Conference on Learning Representations, 2021

  15. [15]

    Gradient compression may hurt gen- eralization: A remedy by synthetic data guided sharpness aware minimization.arXiv preprint arXiv:2602.11584, 2026

    Yujie Gu, Richeng Jin, Zhaoyang Zhang, and Huaiyu Dai. Gradient compression may hurt gen- eralization: A remedy by synthetic data guided sharpness aware minimization.arXiv preprint arXiv:2602.11584, 2026

  16. [16]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. InProceedings of the IEEE Conference on Computer Vision and Pattern Recog- nition (CVPR), pages 770–778, 2016

  17. [17]

    Flat minima.Neural Computation, 9(1):1–42, 1997

    Sepp Hochreiter and Jürgen Schmidhuber. Flat minima.Neural Computation, 9(1):1–42, 1997

  18. [18]

    Measuring the

    Tzu-Ming Harry Hsu, Hang Qi, and Matthew Brown. Measuring the effects of non-identical data distribution for federated visual classification.arXiv preprint arXiv:1909.06335, 2019

  19. [19]

    Local sharp- ness aware minimization in decentralized federated learning with privacy protection.Expert Systems with Applications, page 131510, 2026

    Jifei Hu, Yanli Li, Huayong Xie, Lijun Xu, Hang Zhang, and Xinqiang Zhou. Local sharp- ness aware minimization in decentralized federated learning with privacy protection.Expert Systems with Applications, page 131510, 2026

  20. [20]

    Fantas- tic generalization measures and where to find them

    Yiding Jiang, Behnam Neyshabur, Hossein Mobahi, Dilip Krishnan, and Samy Bengio. Fantas- tic generalization measures and where to find them. InInternational Conference on Learning Representations, 2020

  21. [21]

    Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Ar- jun Nitin Bhagoji, Kallista Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, et al

    Peter Kairouz, H. Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Ar- jun Nitin Bhagoji, Kallista Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, et al. Advances and open problems in federated learning.Foundations and Trends in Machine Learning, 14(1–2):1–210, 2021

  22. [22]

    SCAFFOLD: Stochastic controlled averaging for federated learning

    Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank Reddi, Sebastian Stich, and Ananda Theertha Suresh. SCAFFOLD: Stochastic controlled averaging for federated learning. InProceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 5132–5143. PMLR, 13–18 Jul 2020

  23. [23]

    On large-batch training for deep learning: Generalization gap and sharp min- ima

    Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp min- ima. InInternational Conference on Learning Representations, 2017

  24. [24]

    Communication-efficient federated learning with accelerated client gradient

    Geeho Kim, Jinkyu Kim, and Bohyung Han. Communication-efficient federated learning with accelerated client gradient. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12385–12394, 2024

  25. [25]

    Hu, and Timothy Hospedales

    Minyoung Kim, Da Li, Shell X. Hu, and Timothy Hospedales. Fisher SAM: Information geometry and sharpness aware minimisation. InProceedings of the 39th International Confer- ence on Machine Learning, volume 162 ofProceedings of Machine Learning Research, pages 11148–11161. PMLR, 17–23 Jul 2022. 11

  26. [26]

    Convolutional neural networks for sentence classification

    Yoon Kim. Convolutional neural networks for sentence classification. InProceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 1746–1751, 2014

  27. [27]

    Adam: A Method for Stochastic Optimization

    Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization.arXiv preprint arXiv:1412.6980, 2014

  28. [28]

    Learning multiple layers of features from tiny images

    Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. Technical report, University of Toronto, Toronto, ON, Canada, 2009

  29. [29]

    ASAM: Adaptive sharpness-aware minimization for scale-invariant learning of deep neural networks

    Jungmin Kwon, Jeongseop Kim, Hyunseo Park, and In Kwon Choi. ASAM: Adaptive sharpness-aware minimization for scale-invariant learning of deep neural networks. InPro- ceedings of the 38th International Conference on Machine Learning, volume 139 ofProceed- ings of Machine Learning Research, pages 5905–5914. PMLR, 18–24 Jul 2021

  30. [30]

    Rethinking the flat minima searching in federated learn- ing

    Taehwan Lee and Sung Whan Yoon. Rethinking the flat minima searching in federated learn- ing. InProceedings of the 41st International Conference on Machine Learning, volume 235 ofProceedings of Machine Learning Research, pages 27037–27071. PMLR, 21–27 Jul 2024

  31. [31]

    Enhancing sharpness-aware optimization through vari- ance suppression

    Bingcong Li and Georgios Giannakis. Enhancing sharpness-aware optimization through vari- ance suppression. InAdvances in Neural Information Processing Systems, volume 36, pages 70861–70879. Curran Associates, Inc., 2023

  32. [32]

    Model-contrastive federated learning

    Qinbin Li, Bingsheng He, and Dawn Song. Model-contrastive federated learning. In2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10708– 10717, 2021

  33. [33]

    Federated learning on non-iid data si- los: An experimental study

    Qinbin Li, Yiqun Diao, Quan Chen, and Bingsheng He. Federated learning on non-iid data si- los: An experimental study. In2022 IEEE 38th International Conference on Data Engineering (ICDE), pages 965–978, 2022

  34. [34]

    Friendly sharpness- aware minimization

    Tao Li, Pan Zhou, Zhengbao He, Xinwen Cheng, and Xiaolin Huang. Friendly sharpness- aware minimization. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5631–5640, June 2024

  35. [35]

    Federated optimization in heterogeneous networks

    Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. Federated optimization in heterogeneous networks. InProceedings of Machine Learn- ing and Systems, volume 2, pages 429–450, 2020

  36. [36]

    Ditto: Fair and robust federated learning through personalization

    Tian Li, Shengyuan Hu, Ahmad Beirami, and Virginia Smith. Ditto: Fair and robust federated learning through personalization. InProceedings of the 38th International Conference on Machine Learning, volume 139 ofProceedings of Machine Learning Research, pages 6357–

  37. [37]

    PMLR, 18–24 Jul 2021

  38. [38]

    FedWMSAM: Fast and flat federated learning via weighted momentum and sharpness-aware minimization

    Tianle Li, Yongzhi Huang, Linshan Jiang, Chang Liu, Qipeng Xie, Wenfeng Du, Lu Wang, and Kaishun Wu. FedWMSAM: Fast and flat federated learning via weighted momentum and sharpness-aware minimization. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026

  39. [39]

    International Conference on Learning Representations , year =

    Xiang Li, Kaixuan Huang, Wenhao Yang, Shusen Wang, and Zhihua Zhang. On the conver- gence of FedAvg on non-iid data.arXiv preprint arXiv:1907.02189, 2019

  40. [40]

    FedBN: Federated learning on non-iid features via local batch normalization

    Xiaoxiao Li, Meirui Jiang, Xiaofei Zhang, Michael Kamp, and Qi Dou. FedBN: Federated learning on non-iid features via local batch normalization. InInternational Conference on Learning Representations (ICLR), 2021

  41. [41]

    One arrow, two hawks: Sharpness-aware minimization for federated learning via global model trajectory

    Yuhang Li, Tong Liu, Yangguang Cui, Ming Hu, and Xiaoqiang Li. One arrow, two hawks: Sharpness-aware minimization for federated learning via global model trajectory. InForty- second International Conference on Machine Learning, 2025

  42. [42]

    Federated learning in mobile edge networks: A comprehensive survey.IEEE Communications Surveys & Tutorials, 22(3):2031–2063, 2020

    Wei Yang Bryan Lim, Nguyen Cong Luong, Dinh Thai Hoang, Yutao Jiao, Ying-Chang Liang, Qiang Yang, Dusit Niyato, and Chunyan Miao. Federated learning in mobile edge networks: A comprehensive survey.IEEE Communications Surveys & Tutorials, 22(3):2031–2063, 2020. 12

  43. [43]

    Towards efficient and scalable sharpness-aware minimization

    Yong Liu, Siqi Mai, Xiangning Chen, Cho-Jui Hsieh, and Yang You. Towards efficient and scalable sharpness-aware minimization. InProceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition (CVPR), pages 12360–12370, June 2022

  44. [44]

    Communication-Efficient Learning of Deep Networks from Decentralized Data

    Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-Efficient Learning of Deep Networks from Decentralized Data. InProceed- ings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 ofProceedings of Machine Learning Research, pages 1273–1282. PMLR, 2017

  45. [45]

    Agnostic federated learning

    Mehryar Mohri, Gary Sivek, and Ananda Theertha Suresh. Agnostic federated learning. In Proceedings of the 36th International Conference on Machine Learning, volume 97 ofPro- ceedings of Machine Learning Research, pages 4615–4625. PMLR, 09–15 Jun 2019

  46. [46]

    Generalized federated learning via sharpness aware minimization

    Zhe Qu, Xingyu Li, Rui Duan, Yao Liu, Bo Tang, and Zhuo Lu. Generalized federated learning via sharpness aware minimization. InProceedings of the 39th International Conference on Machine Learning, volume 162 ofProceedings of Machine Learning Research, pages 18250– 18280. PMLR, 2022

  47. [47]

    Reddi, Zachary Charles, Manzil Zaheer, Zachary Garrett, Keith Rush, Jakub Koneˇcný, Sanjiv Kumar, and H

    Sashank J. Reddi, Zachary Charles, Manzil Zaheer, Zachary Garrett, Keith Rush, Jakub Koneˇcný, Sanjiv Kumar, and H. Brendan McMahan. Adaptive federated optimization. In International Conference on Learning Representations (ICLR), 2021

  48. [48]

    Make landscape flatter in differentially private federated learning

    Yifan Shi, Yingqi Liu, Kang Wei, Li Shen, Xueqian Wang, and Dacheng Tao. Make landscape flatter in differentially private federated learning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 24552–24562, June 2023

  49. [49]

    Dynamic regularized sharp- ness aware minimization in federated learning: Approaching global consistency and smooth landscape

    Yan Sun, Li Shen, Shixiang Chen, Liang Ding, and Dacheng Tao. Dynamic regularized sharp- ness aware minimization in federated learning: Approaching global consistency and smooth landscape. InProceedings of the 40th International Conference on Machine Learning, volume 202 ofProceedings of Machine Learning Research, pages 32991–33013. PMLR, 23–29 Jul 2023

  50. [50]

    Yan Sun, Li Shen, Tiansheng Huang, Liang Ding, and Dacheng Tao. FedSpeed: Larger local interval, less communication round, and higher generalization accuracy. arXiv preprint arXiv:2302.10429, 2023.

  51. [51]

    Canh T. Dinh, Nguyen Tran, and Josh Nguyen. Personalized federated learning with Moreau envelopes. In Advances in Neural Information Processing Systems, volume 33, pages 21394–21405, 2020.

  52. [52]

    Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(11), 2008.

  53. [53]

    Hongyi Wang, Mikhail Yurochkin, Yuekai Sun, Dimitris Papailiopoulos, and Yasaman Khazaeni. Federated learning with matched averaging. In International Conference on Learning Representations (ICLR), 2020.

  54. [54]

    Jianyu Wang, Qinghua Liu, Hao Liang, Gauri Joshi, and H. V. Poor. A novel framework for the analysis and design of heterogeneous federated learning. IEEE Transactions on Signal Processing, 69:5234–5249, 2021.

  55. [55]

    Kang Wei, Jun Li, Ming Ding, Chuan Ma, Hang Su, Bo Zhang, and H. Vincent Poor. User-level privacy-preserving federated learning: Analysis and performance optimization. IEEE Transactions on Mobile Computing, 21(9):3388–3401, 2022. doi: 10.1109/TMC.2021.3056991.

  56. [56]

    Kaiyue Wen, Tengyu Ma, and Zhiyuan Li. How does sharpness-aware minimization minimize sharpness? In OPT 2022: Optimization for Machine Learning (NeurIPS 2022 Workshop), 2022.

  57. [57]

    Bingnan Xiao, Xichen Yu, Wei Ni, Xin Wang, and H. V. Poor. Over-the-air federated learning: Status quo, open challenges, and future directions. Fundamental Research, 5(4):1710–1724, 2025.

  58. [58]

    Jing Xu, Sen Wang, Liwei Wang, and Andrew Chi-Chih Yao. FedCM: Federated learning with client-level momentum. arXiv preprint arXiv:2106.10874, 2021.

  59. [59]

    Jihun Yun and Eunho Yang. Riemannian SAM: Sharpness-aware minimization on Riemannian manifolds. In Advances in Neural Information Processing Systems, volume 36, 2023.

  60. [60]

    Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.

  61. [61]

    Xiang Zhang, Junbo Zhao, and Yann LeCun. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015.

  62. [62]

    Xingxuan Zhang, Renzhe Xu, Han Yu, Hao Zou, and Peng Cui. Gradient norm aware minimization seeks first-order flatness and improves generalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 20247–20257, June 2023.

  63. [63]

    Yue Zhao, Meng Li, Liangzhen Lai, Naveen Suda, Damon Civin, and Vikas Chandra. Federated learning with non-IID data. arXiv preprint arXiv:1806.00582, 2018.

  64. [64]

    Juntang Zhuang, Boqing Gong, Liangzhe Yuan, Yin Cui, Hartwig Adam, Nicha C. Dvornek, Sekhar Tatikonda, James S. Duncan, and Ting Liu. Surrogate gap minimization improves sharpness-aware training. In International Conference on Learning Representations (ICLR), 2022.

A Additional Related Work

Heterogeneous FL. FL enables collaborative training without shar...

    Let $\mathcal{F}^{t,k}$ denote the $\sigma$-algebra [3] generated by $(\theta^{t}, h^{t})$, $\mathcal{S}^{t}$, and all data sampled before training iteration $k$ of round $t$, i.e., $\{\phi_j^{t,\ell}\}_{j \in \mathcal{S}^{t},\, 0 \le \ell \le k-1}$. Given $\mathcal{F}^{t,k}$, $\theta_i^{t,k}$ and $h^{t}$ are fixed, and the remaining randomness comes from the current batch $\phi_i^{t,k}$, which determines $g_i^{t,k}$ and $\tilde{\theta}_i^{t,k}$ through (14), and $\tilde{g}_i^{t,k} = \nabla F_i(\tilde{\theta}_i^{t,k}; \phi_i^{t,k})$. Conditioning on $\mathcal{F}^{t,k}$...
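The filtration structure described above can be summarized compactly; a minimal sketch in standard notation, assuming conditionally unbiased mini-batch gradients at the unperturbed point (the paper's exact assumptions may differ):

```latex
% Filtration generated by the global state, the sampled device set,
% and all batches drawn before iteration k of round t:
\mathcal{F}^{t,k} \;=\; \sigma\!\Big( (\theta^{t}, h^{t}),\; \mathcal{S}^{t},\;
    \{\phi_j^{t,\ell}\}_{j \in \mathcal{S}^{t},\, 0 \le \ell \le k-1} \Big).

% Assumed unbiased sampling at the unperturbed point:
\mathbb{E}\big[\, g_i^{t,k} \,\big|\, \mathcal{F}^{t,k} \big]
    \;=\; \nabla F_i\big(\theta_i^{t,k}\big).

% Caveat: \tilde{g}_i^{t,k} reuses the same batch \phi_i^{t,k} that defines
% the perturbed point \tilde{\theta}_i^{t,k}, so it is in general NOT a
% conditionally unbiased estimate of \nabla F_i(\tilde{\theta}_i^{t,k}).
```

This is why SAM-style analyses typically bound the bias of the perturbed gradient separately rather than treating it as an unbiased estimator.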

    When $\Sigma$ becomes large, with a small $S$ or large $\sigma_{l,1}^2$ and $\sigma_{g,1}^2$, the MSE bound (19) is increasingly dominated by $\beta\Sigma$, calling for a smaller setting of $\beta$, which reduces the residual variance $\beta\Sigma$ while decelerating convergence, as discussed above. In contrast, under deterministic full device participation and IID data, the residual variance satisfies $\Sigma \to 0$, and the RHS of (19) reduces to $\frac{L\Delta}{\beta T}$.
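The $\beta$ trade-off described above can be illustrated numerically; a minimal sketch assuming a stylized two-term form $L\Delta/(\beta T) + \beta\Sigma$ for the RHS of (19) (hypothetical simplification: the paper's bound may carry additional constants and terms):

```python
import math

def mse_bound(beta, L, Delta, Sigma, T):
    """Stylized bound: optimization term L*Delta/(beta*T)
    plus residual-variance term beta*Sigma."""
    return L * Delta / (beta * T) + beta * Sigma

def beta_star(L, Delta, Sigma, T):
    """Minimizer of the stylized bound over beta > 0,
    obtained by balancing the two terms: beta* = sqrt(L*Delta / (T*Sigma))."""
    return math.sqrt(L * Delta / (T * Sigma))

# Larger residual variance Sigma (heterogeneity, partial participation)
# forces a smaller beta* and a worse overall bound.
L, Delta, T = 1.0, 10.0, 1000
for Sigma in (0.01, 1.0):
    b = beta_star(L, Delta, Sigma, T)
    print(f"Sigma={Sigma}: beta*={b:.4f}, "
          f"bound={mse_bound(b, L, Delta, Sigma, T):.4f}")
```

With these (illustrative) constants, raising $\Sigma$ from 0.01 to 1.0 shrinks the optimal $\beta$ from 1.0 to 0.1, matching the qualitative discussion: heterogeneity forces a smaller $\beta$ and slower convergence, while $\Sigma \to 0$ leaves only the $L\Delta/(\beta T)$ term.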