Pith · machine review for the scientific record

arxiv: 2605.09144 · v1 · submitted 2026-05-09 · 💻 cs.LG

Recognition: 2 Lean theorem links

FedVSSAM: Mitigating Flatness Incompatibility in Sharpness-Aware Federated Learning

Bingcong Li, Bingnan Xiao, Tony Q. S. Quek, Wei Ni, Xin Wang, Yuan Gao


Pith reviewed 2026-05-12 04:47 UTC · model grok-4.3

classification 💻 cs.LG
keywords: federated learning · sharpness-aware minimization · flatness incompatibility · data heterogeneity · non-convex convergence · variance suppression · global gradient alignment

The pith

FedVSSAM mitigates flatness incompatibility in sharpness-aware federated learning by anchoring local searches to a variance-suppressed global direction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

In federated learning with data heterogeneity, local sharpness-aware minimization often converges to flat basins that conflict with the flatter region favored by the global objective, limiting gains in training and generalization. The paper identifies this mismatch as flatness incompatibility, tracing it to heterogeneity, the friendly adversary effect, local updates, and partial participation. FedVSSAM addresses it by deriving a variance-suppressed adjusted direction from local information and applying that same direction during local flatness perturbation, local descent steps, and the global model update. This consistent anchoring replaces isolated local corrections with a more stable global reference. The paper backs the design with non-convex convergence guarantees and reports consistently better performance across heterogeneous federated settings.
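For readers unfamiliar with the base method, a plain SAM update (Foret et al.) first perturbs the weights toward higher loss, then descends using the gradient taken at that perturbed point. A minimal NumPy sketch on a toy quadratic (all names here are illustrative, not the paper's code):

```python
import numpy as np

def sam_step(w, grad_fn, rho=0.05, lr=0.1):
    """One sharpness-aware minimization step (Foret et al.): ascend to a
    worst-case nearby point, then descend with the gradient taken there."""
    g = grad_fn(w)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)  # flatness perturbation
    return w - lr * grad_fn(w + eps)             # descend from the perturbed point

# toy quadratic 0.5*||w||^2, whose gradient is w itself
w = np.array([1.0, -2.0])
for _ in range(50):
    w = sam_step(w, lambda v: v)
```

In federated SAM each client runs steps like this on its own loss, which is exactly where the locally flat basins the paper criticizes come from.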

Core claim

The paper establishes that flatness incompatibility arises from data heterogeneity and the friendly adversary phenomenon and is amplified by local updates and partial device participation. FedVSSAM counters this by constructing a variance-suppressed adjusted direction and applying it uniformly in local flatness search, local descent, and global aggregation, thereby anchoring both perturbation and update steps to a stable global direction rather than purely local signals. The method supplies non-convex convergence guarantees and proves that the mean-square deviation between the adjusted direction and the global gradient remains controlled.

What carries the argument

The variance-suppressed adjusted direction, which blends local SAM perturbations with global gradient information to suppress variance and align local flatness searches with the global objective.
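The excerpt does not reproduce the paper's exact formula for the adjusted direction, so the following is only a hedged sketch of the general variance-suppression idea: blend a slowly varying global estimate with the noisy per-round client average, damping client-to-client variance before the direction is reused.

```python
import numpy as np

def adjusted_direction(h_prev, client_grads, beta=0.9):
    """Blend a running global estimate with this round's client average,
    damping client-to-client variance before the direction is reused.
    (Illustrative stand-in, not the paper's construction.)"""
    g_avg = np.mean(client_grads, axis=0)
    return beta * h_prev + (1.0 - beta) * g_avg

rng = np.random.default_rng(0)
true_g = np.array([1.0, 0.0])  # shared global signal
h = np.zeros(2)
for _ in range(200):
    # heterogeneous clients: shared signal plus large client-specific noise
    grads = true_g + rng.normal(scale=2.0, size=(10, 2))
    h = adjusted_direction(h, grads)
```

The suppressed direction `h` hugs the shared signal much more tightly than any single round's client average would; FedVSSAM's distinctive move, per the abstract, is to reuse one such direction in the perturbation, the descent, and the aggregation.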

If this is right

  • Non-convex convergence guarantees hold for FedVSSAM.
  • The mean-square deviation between the adjusted direction and the global gradient is provably bounded.
  • The method outperforms standard SAM and other baselines across diverse federated settings with varying heterogeneity and participation rates.
  • Consistent use of the adjusted direction in perturbation, descent, and aggregation steps directly reduces the identified structural incompatibility.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The variance-control technique may generalize to other client-drift correction methods in distributed optimization by enforcing directional consistency rather than gradient averaging alone.
  • If the adjusted direction reduces effective drift, it could lower the number of communication rounds required to reach target accuracy under heterogeneity.
  • The result points to direction inconsistency across clients as a distinct bottleneck separate from gradient magnitude divergence.

Load-bearing premise

That constructing a variance-suppressed adjusted direction from local information can simultaneously resolve local-global flatness mismatch without introducing bias or slowing convergence under arbitrary heterogeneity levels.

What would settle it

An experiment in which, under high data heterogeneity, the mean-square deviation between the adjusted direction and the global gradient fails to decrease, or in which global test accuracy shows no improvement over standard SAM, would falsify the central claim.
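A toy version of that diagnostic (not the paper's experiment; the quadratic clients, participation rate, and EMA-style direction `h` are all assumptions here) tracks the squared deviation between the adjusted direction and the true global gradient round by round:

```python
import numpy as np

rng = np.random.default_rng(1)
centers = rng.normal(scale=3.0, size=(20, 2))     # 20 heterogeneous clients
w, h, beta, lr = np.zeros(2), np.zeros(2), 0.9, 0.05
errs = []
for t in range(300):
    part = rng.choice(20, size=4, replace=False)  # 20% participation
    local_grads = w - centers[part]               # grad of 0.5*||w - c_i||^2
    h = beta * h + (1.0 - beta) * local_grads.mean(axis=0)
    g_global = w - centers.mean(axis=0)           # true global gradient
    errs.append(float(np.sum((h - g_global) ** 2)))
    w = w - lr * h                                # global update along h
```

The falsification test in this framing: if `errs` stays flat at the raw partial-participation noise level under high heterogeneity, the deviation-control claim fails.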

Figures

Figures reproduced from arXiv: 2605.09144 by Bingcong Li, Bingnan Xiao, Tony Q. S. Quek, Wei Ni, Xin Wang, Yuan Gao.

Figure 1: Visualization of local and global loss landscapes for FedSAM on CIFAR-10.
Figure 3: Ablation studies on the variance suppression operations in FedVSSAM; these operations control the MSE between the variance-suppressed adjusted direction and the global gradient.
Figure 4: Convergence performance of different algorithms on CIFAR-10 and CIFAR-100.
Figure 5: Convergence performance of different algorithms on DBpedia-14.
Figure 6: t-SNE visualization of global model feature embeddings on CIFAR-10 with ResNet-18.
Figure 7: Tracking error between h_t and the global gradient for FedVSSAM; (a) α = 0.5, (b) α = 0.1.
Figure 9: Sensitivity of FedVSSAM to different parameters on the CIFAR-10 dataset.
Original abstract

Sharpness-aware minimization (SAM) is an effective method for improving the generalization of federated learning (FL) by steering local training toward flat minima. Under data heterogeneity, however, device-side SAM searches for locally flat basins that are incompatible with the flat region preferred by the global objective. We identify this structural failure mode as flatness incompatibility, which explains why improving local flatness alone may provide limited training and generalization improvement for the global model. We reveal that flatness incompatibility arises from data heterogeneity and the friendly adversary phenomenon, and is further amplified by local updates and partial device participation. To mitigate this issue, we propose Federated Learning with variance-suppressed sharpness-aware minimization (FedVSSAM), which constructs a variance-suppressed adjusted direction and uses it consistently in local flatness search, local descent, and global update. FedVSSAM anchors both perturbation and update directions to a more stable global direction, instead of correcting only an isolated local perturbation. We establish non-convex convergence guarantees of FedVSSAM and prove that the mean-square deviation between the adjusted direction and the global gradient is effectively controlled. Experiments demonstrate that FedVSSAM mitigates flatness incompatibility and outperforms the baselines across diverse FL settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper identifies 'flatness incompatibility' in sharpness-aware minimization (SAM) for federated learning (FL) under data heterogeneity, where local flat minima conflict with the global objective due to heterogeneity, friendly-adversary effects, local updates, and partial participation. It proposes FedVSSAM, which constructs a variance-suppressed adjusted direction from local information and applies it consistently for local perturbation, descent, and global update to anchor to a stable global direction. The manuscript claims non-convex convergence guarantees for FedVSSAM along with a proof that the mean-square deviation between this adjusted direction and the global gradient is controlled, and reports that experiments show mitigation of the incompatibility with outperformance over baselines in diverse FL settings.

Significance. If the convergence guarantees and mean-square deviation control hold under the claimed arbitrary heterogeneity levels, the work would meaningfully advance SAM applications in FL by addressing a structural mismatch that standard local SAM does not resolve. The consistent application of the adjusted direction across phases and the explicit identification of flatness incompatibility as a distinct failure mode represent conceptual contributions; reproducible experiments across settings would further strengthen the case for practical impact in heterogeneous FL.

major comments (3)
  1. [§4, Theorem 1] §4 (Convergence Analysis), Theorem 1 and surrounding lemmas: the claimed non-convex convergence rate and the bound on mean-square deviation of the variance-suppressed direction from the global gradient appear to rely on controlling local gradient variance; it is unclear whether these bounds remain independent of the heterogeneity parameter under the arbitrary heterogeneity levels asserted in the abstract and §3.2, or whether they implicitly require bounded dissimilarity as in standard FL analyses.
  2. [§3.3] §3.3 (FedVSSAM Algorithm): the construction of the variance-suppressed adjusted direction uses only local information (even if aggregated); under partial participation and high heterogeneity, the friendly-adversary phenomenon could cause the deviation term to scale with the heterogeneity measure, undermining both the convergence claim and the mitigation of flatness incompatibility without additional assumptions or clipping.
  3. [§5] §5 (Experiments): while outperformance is reported, the description lacks explicit quantification of heterogeneity levels (e.g., Dirichlet α values), number of local steps, and participation rates used to stress-test the deviation control; without these, it is difficult to confirm that the method succeeds precisely where flatness incompatibility is most severe.
minor comments (2)
  1. [Abstract, §1] The abstract and introduction introduce 'flatness incompatibility' as a new term; a brief formal definition or equation characterizing the incompatibility (e.g., difference in sharpness measures) would aid readability.
  2. [§3] Notation for the adjusted direction (e.g., how variance suppression is exactly formulated) should be introduced earlier and used consistently in both the algorithm box and the proof.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments on the convergence analysis, algorithmic design, and experimental reporting. We address each major comment point by point below.

read point-by-point responses
  1. Referee: [§4, Theorem 1] §4 (Convergence Analysis), Theorem 1 and surrounding lemmas: the claimed non-convex convergence rate and the bound on mean-square deviation of the variance-suppressed direction from the global gradient appear to rely on controlling local gradient variance; it is unclear whether these bounds remain independent of the heterogeneity parameter under the arbitrary heterogeneity levels asserted in the abstract and §3.2, or whether they implicitly require bounded dissimilarity as in standard FL analyses.

    Authors: The non-convex convergence rate in Theorem 1 and the mean-square deviation bound are derived without invoking bounded dissimilarity. The variance-suppressed adjusted direction is constructed to bound the deviation from the global gradient via a suppression term whose expectation is controlled independently of the heterogeneity parameter; the proof relies on this property holding for arbitrary heterogeneity levels as stated in §3.2. We will add a short clarifying remark after Theorem 1 to make the independence explicit. revision: partial

  2. Referee: [§3.3] §3.3 (FedVSSAM Algorithm): the construction of the variance-suppressed adjusted direction uses only local information (even if aggregated); under partial participation and high heterogeneity, the friendly-adversary phenomenon could cause the deviation term to scale with the heterogeneity measure, undermining both the convergence claim and the mitigation of flatness incompatibility without additional assumptions or clipping.

    Authors: The variance suppression is specifically introduced to counteract the friendly-adversary effect by damping local variance in the adjusted direction. The mean-square deviation control proved in §4 holds under partial participation because the global aggregation of these suppressed directions anchors the update; the bound does not scale with heterogeneity and requires no extra clipping or assumptions beyond those already stated. revision: no

  3. Referee: [§5] §5 (Experiments): while outperformance is reported, the description lacks explicit quantification of heterogeneity levels (e.g., Dirichlet α values), number of local steps, and participation rates used to stress-test the deviation control; without these, it is difficult to confirm that the method succeeds precisely where flatness incompatibility is most severe.

    Authors: We agree that explicit parameter values will strengthen the experimental section. The reported results used Dirichlet α ∈ {0.1, 0.5, 1.0}, 5 local steps, and participation rates of 10% and 20%. We will insert these values into the experimental setup description in the revised manuscript. revision: yes
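For context, the Dirichlet(α) split the rebuttal cites is the standard protocol for inducing label heterogeneity (Hsu et al.): each class's samples are divided across clients according to Dirichlet proportions, with smaller α concentrating a class on fewer clients. A hedged sketch of how such a partition is typically generated (function and variable names are illustrative):

```python
import numpy as np

def dirichlet_partition(labels, n_clients, alpha, rng):
    """Split sample indices across clients, class by class, using
    Dirichlet(alpha) proportions. Returns one index array per client."""
    parts = [[] for _ in range(n_clients)]
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        p = rng.dirichlet(alpha * np.ones(n_clients))   # class-c shares
        cuts = (np.cumsum(p)[:-1] * len(idx)).astype(int)
        for client, chunk in enumerate(np.split(idx, cuts)):
            parts[client].extend(chunk.tolist())
    return [np.array(p_) for p_ in parts]

rng = np.random.default_rng(0)
labels = np.repeat(np.arange(10), 500)  # 10 classes, 5000 samples
parts = dirichlet_partition(labels, 10, 0.1, rng)
```

With α = 0.1, most clients see only a handful of classes, which is the regime where the referee wants the deviation control stress-tested.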

Circularity Check

0 steps flagged

No circularity: convergence and deviation-control claims presented as independent derivations without reduction to inputs or self-citations.

full rationale

Abstract and available text claim establishment of non-convex convergence guarantees plus a proof that mean-square deviation of the variance-suppressed adjusted direction from the global gradient is controlled. No equations, self-citations, fitted parameters renamed as predictions, or ansatzes are quoted that would make these results equivalent to the method definition by construction. The central premise (variance suppression from local information mitigating flatness incompatibility) is not shown to reduce to a tautology or prior self-citation chain. This is the normal case of a self-contained derivation whose validity rests on external verification of the (unprovided) proof rather than internal circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The new concept of flatness incompatibility is introduced as an explanatory label rather than a formal entity with independent evidence.

invented entities (1)
  • flatness incompatibility no independent evidence
    purpose: Explains why local SAM fails to improve global generalization under heterogeneity
    Identified in abstract as arising from data heterogeneity, friendly adversary, local updates, and partial participation; no independent falsifiable prediction given.

pith-pipeline@v0.9.0 · 5532 in / 1258 out tokens · 22352 ms · 2026-05-12T04:47:59.642597+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

66 extracted references · 66 canonical work pages · 2 internal anchors

  1. [1]

    What- mough, and Venkatesh Saligrama

    Durmus Alp Emre Acar, Yue Zhao, Ramon Matas Navarro, Matthew Mattina, Paul N. What- mough, and Venkatesh Saligrama. Federated learning based on dynamic regularization. In International Conference on Learning Representations (ICLR), 2021

  2. [2]

    Towards understanding sharpness-aware minimization

    Maksym Andriushchenko and Nicolas Flammarion. Towards understanding sharpness-aware minimization. InProceedings of the 39th International Conference on Machine Learning, volume 162 ofProceedings of Machine Learning Research, pages 639–668. PMLR, 17–23 Jul 2022

  3. [3]

    Springer, 2009

    Alan Bain and Dan Crisan.Fundamentals of Stochastic Filtering. Springer, 2009

  4. [4]

    Towards federated learning at scale: System design

    Keith Bonawitz, Hubert Eichner, Wolfgang Grieskamp, Dzmitry Huba, Alex Ingerman, Vladimir Ivanov, Chloé Kiddon, Jakub Koneˇcný, Stefano Mazzocchi, Brendan McMahan, Ti- mon Van Overveldt, David Petrou, Daniel Ramage, and Jason Roselander. Towards federated learning at scale: System design. In A. Talwalkar, V . Smith, and M. Zaharia, editors,Proceed- ings o...

  5. [5]

    Improving generalization in feder- ated learning by seeking flat minima

    Debora Caldarola, Barbara Caputo, and Marco Ciccone. Improving generalization in feder- ated learning by seeking flat minima. In Shai Avidan, Gabriel Brostow, Moustapha Cissé, Giovanni Maria Farinella, and Tal Hassner, editors,Computer Vision – ECCV 2022, pages 654–672. Springer Nature Switzerland, 2022

  6. [6]

    Beyond local sharp- ness: Communication-efficient global sharpness-aware minimization for federated learning

    Debora Caldarola, Pietro Cagnasso, Barbara Caputo, and Marco Ciccone. Beyond local sharp- ness: Communication-efficient global sharpness-aware minimization for federated learning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 25187–25197, June 2025

  7. [7]

    Momentum benefits non-iid federated learning simply and provably

    Ziheng Cheng, Xinmeng Huang, Pengfei Wu, and Kun Yuan. Momentum benefits non-iid federated learning simply and provably. InThe Twelfth International Conference on Learning Representations, 2024

  8. [8]

    Exploiting shared representations for personalized federated learning

    Liam Collins, Hamed Hassani, Aryan Mokhtari, and Sanjay Shakkottai. Exploiting shared representations for personalized federated learning. InProceedings of the 38th International Conference on Machine Learning, volume 139 ofProceedings of Machine Learning Research, pages 2089–2099. PMLR, 18–24 Jul 2021

  9. [9]

    FedGAMMA: Federated learning with global sharpness-aware minimization.IEEE Transac- tions on Neural Networks and Learning Systems, 35(12):17479–17492, 2024

    Rong Dai, Xun Yang, Yan Sun, Li Shen, Xinmei Tian, Meng Wang, and Yongdong Zhang. FedGAMMA: Federated learning with global sharpness-aware minimization.IEEE Transac- tions on Neural Networks and Learning Systems, 35(12):17479–17492, 2024. 10

  10. [10]

    Efficient sharpness-aware minimization for improved training of neural net- works

    Jiawei Du, Hanshu Yan, Jiashi Feng, Joey Tianyi Zhou, Liangli Zhen, Rick Siow Mong Goh, and Vincent Tan. Efficient sharpness-aware minimization for improved training of neural net- works. InInternational Conference on Learning Representations, 2022

  11. [11]

    Gintare Karolina Dziugaite and Daniel M. Roy. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. InPro- ceedings of International Conference on Machine Learning Workshop, 2017

  12. [12]

    Personalized federated learning with theoretical guarantees: A model-agnostic meta-learning approach

    Alireza Fallah, Aryan Mokhtari, and Asuman Ozdaglar. Personalized federated learning with theoretical guarantees: A model-agnostic meta-learning approach. InAdvances in Neural Information Processing Systems, volume 33, pages 3557–3568, 2020

  13. [13]

    Locally estimated global perturbations are better than local perturbations for federated sharpness-aware minimization

    Ziqing Fan, Shengchao Hu, Jiangchao Yao, Gang Niu, Ya Zhang, Masashi Sugiyama, and Yanfeng Wang. Locally estimated global perturbations are better than local perturbations for federated sharpness-aware minimization. InForty-first International Conference on Machine Learning, 2024

  14. [14]

    Sharpness-aware min- imization for efficiently improving generalization

    Pierre Foret, Ariel Kleiner, Hossein Mobahi, and Behnam Neyshabur. Sharpness-aware min- imization for efficiently improving generalization. InInternational Conference on Learning Representations, 2021

  15. [15]

    Gradient compression may hurt gen- eralization: A remedy by synthetic data guided sharpness aware minimization.arXiv preprint arXiv:2602.11584, 2026

    Yujie Gu, Richeng Jin, Zhaoyang Zhang, and Huaiyu Dai. Gradient compression may hurt gen- eralization: A remedy by synthetic data guided sharpness aware minimization.arXiv preprint arXiv:2602.11584, 2026

  16. [16]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. InProceedings of the IEEE Conference on Computer Vision and Pattern Recog- nition (CVPR), pages 770–778, 2016

  17. [17]

    Flat minima.Neural Computation, 9(1):1–42, 1997

    Sepp Hochreiter and Jürgen Schmidhuber. Flat minima.Neural Computation, 9(1):1–42, 1997

  18. [18]

    Measuring the

    Tzu-Ming Harry Hsu, Hang Qi, and Matthew Brown. Measuring the effects of non-identical data distribution for federated visual classification.arXiv preprint arXiv:1909.06335, 2019

  19. [19]

    Local sharp- ness aware minimization in decentralized federated learning with privacy protection.Expert Systems with Applications, page 131510, 2026

    Jifei Hu, Yanli Li, Huayong Xie, Lijun Xu, Hang Zhang, and Xinqiang Zhou. Local sharp- ness aware minimization in decentralized federated learning with privacy protection.Expert Systems with Applications, page 131510, 2026

  20. [20]

    Fantas- tic generalization measures and where to find them

    Yiding Jiang, Behnam Neyshabur, Hossein Mobahi, Dilip Krishnan, and Samy Bengio. Fantas- tic generalization measures and where to find them. InInternational Conference on Learning Representations, 2020

  21. [21]

    Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Ar- jun Nitin Bhagoji, Kallista Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, et al

    Peter Kairouz, H. Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Ar- jun Nitin Bhagoji, Kallista Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, et al. Advances and open problems in federated learning.Foundations and Trends in Machine Learning, 14(1–2):1–210, 2021

  22. [22]

    SCAFFOLD: Stochastic controlled averaging for federated learning

    Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank Reddi, Sebastian Stich, and Ananda Theertha Suresh. SCAFFOLD: Stochastic controlled averaging for federated learning. InProceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 5132–5143. PMLR, 13–18 Jul 2020

  23. [23]

    On large-batch training for deep learning: Generalization gap and sharp min- ima

    Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp min- ima. InInternational Conference on Learning Representations, 2017

  24. [24]

    Communication-efficient federated learning with accelerated client gradient

    Geeho Kim, Jinkyu Kim, and Bohyung Han. Communication-efficient federated learning with accelerated client gradient. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12385–12394, 2024

  25. [25]

    Hu, and Timothy Hospedales

    Minyoung Kim, Da Li, Shell X. Hu, and Timothy Hospedales. Fisher SAM: Information geometry and sharpness aware minimisation. InProceedings of the 39th International Confer- ence on Machine Learning, volume 162 ofProceedings of Machine Learning Research, pages 11148–11161. PMLR, 17–23 Jul 2022. 11

  26. [26]

    Convolutional neural networks for sentence classification

    Yoon Kim. Convolutional neural networks for sentence classification. InProceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 1746–1751, 2014

  27. [27]

    Adam: A Method for Stochastic Optimization

    Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization.arXiv preprint arXiv:1412.6980, 2014

  28. [28]

    Learning multiple layers of features from tiny images

    Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. Technical report, University of Toronto, Toronto, ON, Canada, 2009

  29. [29]

    ASAM: Adaptive sharpness-aware minimization for scale-invariant learning of deep neural networks

    Jungmin Kwon, Jeongseop Kim, Hyunseo Park, and In Kwon Choi. ASAM: Adaptive sharpness-aware minimization for scale-invariant learning of deep neural networks. InPro- ceedings of the 38th International Conference on Machine Learning, volume 139 ofProceed- ings of Machine Learning Research, pages 5905–5914. PMLR, 18–24 Jul 2021

  30. [30]

    Rethinking the flat minima searching in federated learn- ing

    Taehwan Lee and Sung Whan Yoon. Rethinking the flat minima searching in federated learn- ing. InProceedings of the 41st International Conference on Machine Learning, volume 235 ofProceedings of Machine Learning Research, pages 27037–27071. PMLR, 21–27 Jul 2024

  31. [31]

    Enhancing sharpness-aware optimization through vari- ance suppression

    Bingcong Li and Georgios Giannakis. Enhancing sharpness-aware optimization through vari- ance suppression. InAdvances in Neural Information Processing Systems, volume 36, pages 70861–70879. Curran Associates, Inc., 2023

  32. [32]

    Model-contrastive federated learning

    Qinbin Li, Bingsheng He, and Dawn Song. Model-contrastive federated learning. In2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10708– 10717, 2021

  33. [33]

    Federated learning on non-iid data si- los: An experimental study

    Qinbin Li, Yiqun Diao, Quan Chen, and Bingsheng He. Federated learning on non-iid data si- los: An experimental study. In2022 IEEE 38th International Conference on Data Engineering (ICDE), pages 965–978, 2022

  34. [34]

    Friendly sharpness- aware minimization

    Tao Li, Pan Zhou, Zhengbao He, Xinwen Cheng, and Xiaolin Huang. Friendly sharpness- aware minimization. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5631–5640, June 2024

  35. [35]

    Federated optimization in heterogeneous networks

    Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. Federated optimization in heterogeneous networks. InProceedings of Machine Learn- ing and Systems, volume 2, pages 429–450, 2020

  36. [36]

    Ditto: Fair and robust federated learning through personalization

    Tian Li, Shengyuan Hu, Ahmad Beirami, and Virginia Smith. Ditto: Fair and robust federated learning through personalization. InProceedings of the 38th International Conference on Machine Learning, volume 139 ofProceedings of Machine Learning Research, pages 6357–

  37. [37]

    PMLR, 18–24 Jul 2021

  38. [38]

    FedWMSAM: Fast and flat federated learning via weighted momentum and sharpness-aware minimization

    Tianle Li, Yongzhi Huang, Linshan Jiang, Chang Liu, Qipeng Xie, Wenfeng Du, Lu Wang, and Kaishun Wu. FedWMSAM: Fast and flat federated learning via weighted momentum and sharpness-aware minimization. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026

  39. [39]

    International Conference on Learning Representations , year =

    Xiang Li, Kaixuan Huang, Wenhao Yang, Shusen Wang, and Zhihua Zhang. On the conver- gence of FedAvg on non-iid data.arXiv preprint arXiv:1907.02189, 2019

  40. [40]

    FedBN: Federated learning on non-iid features via local batch normalization

    Xiaoxiao Li, Meirui Jiang, Xiaofei Zhang, Michael Kamp, and Qi Dou. FedBN: Federated learning on non-iid features via local batch normalization. InInternational Conference on Learning Representations (ICLR), 2021

  41. [41]

    One arrow, two hawks: Sharpness-aware minimization for federated learning via global model trajectory

    Yuhang Li, Tong Liu, Yangguang Cui, Ming Hu, and Xiaoqiang Li. One arrow, two hawks: Sharpness-aware minimization for federated learning via global model trajectory. InForty- second International Conference on Machine Learning, 2025

  42. [42]

    Federated learning in mobile edge networks: A comprehensive survey.IEEE Communications Surveys & Tutorials, 22(3):2031–2063, 2020

    Wei Yang Bryan Lim, Nguyen Cong Luong, Dinh Thai Hoang, Yutao Jiao, Ying-Chang Liang, Qiang Yang, Dusit Niyato, and Chunyan Miao. Federated learning in mobile edge networks: A comprehensive survey.IEEE Communications Surveys & Tutorials, 22(3):2031–2063, 2020. 12

  43. [43]

    Towards efficient and scalable sharpness-aware minimization

    Yong Liu, Siqi Mai, Xiangning Chen, Cho-Jui Hsieh, and Yang You. Towards efficient and scalable sharpness-aware minimization. InProceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition (CVPR), pages 12360–12370, June 2022

  44. [44]

    Communication-Efficient Learning of Deep Networks from Decentralized Data

    Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-Efficient Learning of Deep Networks from Decentralized Data. InProceed- ings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 ofProceedings of Machine Learning Research, pages 1273–1282. PMLR, 2017

  45. [45]

    Agnostic federated learning

    Mehryar Mohri, Gary Sivek, and Ananda Theertha Suresh. Agnostic federated learning. In Proceedings of the 36th International Conference on Machine Learning, volume 97 ofPro- ceedings of Machine Learning Research, pages 4615–4625. PMLR, 09–15 Jun 2019

  46. [46]

    Generalized federated learning via sharpness aware minimization

    Zhe Qu, Xingyu Li, Rui Duan, Yao Liu, Bo Tang, and Zhuo Lu. Generalized federated learning via sharpness aware minimization. InProceedings of the 39th International Conference on Machine Learning, volume 162 ofProceedings of Machine Learning Research, pages 18250– 18280. PMLR, 2022

  47. [47]

    Reddi, Zachary Charles, Manzil Zaheer, Zachary Garrett, Keith Rush, Jakub Koneˇcný, Sanjiv Kumar, and H

    Sashank J. Reddi, Zachary Charles, Manzil Zaheer, Zachary Garrett, Keith Rush, Jakub Koneˇcný, Sanjiv Kumar, and H. Brendan McMahan. Adaptive federated optimization. In International Conference on Learning Representations (ICLR), 2021

  48. [48]

    Make landscape flatter in differentially private federated learning

    Yifan Shi, Yingqi Liu, Kang Wei, Li Shen, Xueqian Wang, and Dacheng Tao. Make landscape flatter in differentially private federated learning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 24552–24562, June 2023

  49. [49]

    Dynamic regularized sharp- ness aware minimization in federated learning: Approaching global consistency and smooth landscape

    Yan Sun, Li Shen, Shixiang Chen, Liang Ding, and Dacheng Tao. Dynamic regularized sharp- ness aware minimization in federated learning: Approaching global consistency and smooth landscape. InProceedings of the 40th International Conference on Machine Learning, volume 202 ofProceedings of Machine Learning Research, pages 32991–33013. PMLR, 23–29 Jul 2023

  50. [50]

    Yan Sun, Li Shen, Tiansheng Huang, Liang Ding, and Dacheng Tao. FedSpeed: Larger local interval, less communication round, and higher generalization accuracy. arXiv preprint arXiv:2302.10429, 2023.

  51. [51]

    Canh T. Dinh, Nguyen Tran, and Josh Nguyen. Personalized federated learning with Moreau envelopes. In Advances in Neural Information Processing Systems, volume 33, pages 21394–21405, 2020.

  52. [52]

    Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(11), 2008.

  53. [53]

    Hongyi Wang, Mikhail Yurochkin, Yuekai Sun, Dimitris Papailiopoulos, and Yasaman Khazaeni. Federated learning with matched averaging. In International Conference on Learning Representations (ICLR), 2020.

  54. [54]

    Jianyu Wang, Qinghua Liu, Hao Liang, Gauri Joshi, and H. V. Poor. A novel framework for the analysis and design of heterogeneous federated learning. IEEE Transactions on Signal Processing, 69:5234–5249, 2021.

  55. [55]

    Kang Wei, Jun Li, Ming Ding, Chuan Ma, Hang Su, Bo Zhang, and H. Vincent Poor. User-level privacy-preserving federated learning: Analysis and performance optimization. IEEE Transactions on Mobile Computing, 21(9):3388–3401, 2022. doi: 10.1109/TMC.2021.3056991.

  56. [56]

    Kaiyue Wen, Tengyu Ma, and Zhiyuan Li. How does sharpness-aware minimization minimize sharpness? In OPT 2022: Optimization for Machine Learning (NeurIPS 2022 Workshop), 2022.

  57. [57]

    Bingnan Xiao, Xichen Yu, Wei Ni, Xin Wang, and H. V. Poor. Over-the-air federated learning: Status quo, open challenges, and future directions. Fundamental Research, 5(4):1710–1724, 2025.

  58. [58]

    Jing Xu, Sen Wang, Liwei Wang, and Andrew Chi-Chih Yao. FedCM: Federated learning with client-level momentum. arXiv preprint arXiv:2106.10874, 2021.

  59. [59]

    Jihun Yun and Eunho Yang. Riemannian SAM: Sharpness-aware minimization on Riemannian manifolds. In Advances in Neural Information Processing Systems, volume 36, 2023.

  60. [60]

    Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.

  61. [61]

    Xiang Zhang, Junbo Zhao, and Yann LeCun. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015.

  62. [62]

    Xingxuan Zhang, Renzhe Xu, Han Yu, Hao Zou, and Peng Cui. Gradient norm aware minimization seeks first-order flatness and improves generalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 20247–20257, June 2023.

  63. [63]

    Yue Zhao, Meng Li, Liangzhen Lai, Naveen Suda, Damon Civin, and Vikas Chandra. Federated learning with non-IID data. arXiv preprint arXiv:1806.00582, 2018.

  64. [64]

    Juntang Zhuang, Boqing Gong, Liangzhe Yuan, Yin Cui, Hartwig Adam, Nicha C. Dvornek, Sekhar Tatikonda, James S. Duncan, and Ting Liu. Surrogate gap minimization improves sharpness-aware training. In International Conference on Learning Representations (ICLR), 2022.

A Additional Related Work

Heterogeneous FL. FL enables collaborative training without shar...

    Let $\mathcal{F}^{t,k}$ denote the $\sigma$-algebra [3] generated by $(\theta^{t}, h^{t})$, $\mathcal{S}^{t}$, and all data sampled before training iteration $k$ of round $t$, i.e., $\{\phi_j^{t,\ell}\}_{j \in \mathcal{S}^{t},\, 0 \le \ell \le k-1}$. Given $\mathcal{F}^{t,k}$, $\theta_i^{t,k}$ and $h^{t}$ are fixed, and the remaining randomness comes from the current batch $\phi_i^{t,k}$, which determines $g_i^{t,k}$ and $\tilde{\theta}_i^{t,k}$ through (14), and $\tilde{g}_i^{t,k} = \nabla F_i(\tilde{\theta}_i^{t,k}; \phi_i^{t,k})$. Conditioning on $\mathcal{F}^{t,k}$...
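The filtration structure described above can be summarized compactly; a minimal sketch in standard notation, assuming conditionally unbiased mini-batch gradients at the unperturbed point (the paper's exact assumptions may differ):

```latex
% Filtration generated by the global state, the sampled device set,
% and all batches drawn before iteration k of round t:
\mathcal{F}^{t,k} \;=\; \sigma\!\Big( (\theta^{t}, h^{t}),\; \mathcal{S}^{t},\;
    \{\phi_j^{t,\ell}\}_{j \in \mathcal{S}^{t},\, 0 \le \ell \le k-1} \Big).

% Assumed unbiased sampling at the unperturbed point:
\mathbb{E}\big[\, g_i^{t,k} \,\big|\, \mathcal{F}^{t,k} \big]
    \;=\; \nabla F_i\big(\theta_i^{t,k}\big).

% Caveat: \tilde{g}_i^{t,k} reuses the same batch \phi_i^{t,k} that defines
% the perturbed point \tilde{\theta}_i^{t,k}, so it is in general NOT a
% conditionally unbiased estimate of \nabla F_i(\tilde{\theta}_i^{t,k}).
```

This is why SAM-style analyses typically bound the bias of the perturbed gradient separately rather than treating it as an unbiased estimator.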

    When $\Sigma$ becomes large, with a small $S$ or large $\sigma_{l,1}^2$ and $\sigma_{g,1}^2$, the MSE bound (19) is increasingly dominated by $\beta\Sigma$, calling for a smaller setting of $\beta$, which reduces the residual variance $\beta\Sigma$ while decelerating convergence, as discussed above. In contrast, under deterministic full device participation and IID data, the residual variance satisfies $\Sigma \to 0$, and the RHS of (19) reduces to $\frac{L\Delta}{\beta T}$.
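The $\beta$ trade-off described above can be illustrated numerically; a minimal sketch assuming a stylized two-term form $L\Delta/(\beta T) + \beta\Sigma$ for the RHS of (19) (hypothetical simplification: the paper's bound may carry additional constants and terms):

```python
import math

def mse_bound(beta, L, Delta, Sigma, T):
    """Stylized bound: optimization term L*Delta/(beta*T)
    plus residual-variance term beta*Sigma."""
    return L * Delta / (beta * T) + beta * Sigma

def beta_star(L, Delta, Sigma, T):
    """Minimizer of the stylized bound over beta > 0,
    obtained by balancing the two terms: beta* = sqrt(L*Delta / (T*Sigma))."""
    return math.sqrt(L * Delta / (T * Sigma))

# Larger residual variance Sigma (heterogeneity, partial participation)
# forces a smaller beta* and a worse overall bound.
L, Delta, T = 1.0, 10.0, 1000
for Sigma in (0.01, 1.0):
    b = beta_star(L, Delta, Sigma, T)
    print(f"Sigma={Sigma}: beta*={b:.4f}, "
          f"bound={mse_bound(b, L, Delta, Sigma, T):.4f}")
```

With these (illustrative) constants, raising $\Sigma$ from 0.01 to 1.0 shrinks the optimal $\beta$ from 1.0 to 0.1, matching the qualitative discussion: heterogeneity forces a smaller $\beta$ and slower convergence, while $\Sigma \to 0$ leaves only the $L\Delta/(\beta T)$ term.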