pith. machine review for the scientific record.

arxiv: 2602.23827 · v1 · submitted 2026-02-27 · 💻 cs.LG · cs.AI

Recognition: no theorem link

FedNSAM:Consistency of Local and Global Flatness for Federated Learning

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 18:28 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords federated learning · sharpness-aware minimization · Nesterov momentum · data heterogeneity · convergence analysis · flatness distance · global model generalization

The pith

FedNSAM uses global Nesterov momentum to align local and global flatness in federated learning, delivering a tighter convergence bound than FedSAM.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

In federated learning, data heterogeneity and multiple local steps often produce sharp global minima that hurt generalization. Standard sharpness-aware minimization applied locally fails here because a flat local loss surface does not guarantee a flat global surface. The paper introduces the flatness distance to measure this mismatch and shows that injecting the global Nesterov momentum vector into each client’s perturbation step forces local updates to respect the global geometry. This produces both a provably tighter convergence rate and measurable gains on CNN and Transformer models across heterogeneous partitions.
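
The abstract does not give the formal definition of flatness distance, so the sketch below is only an editorial illustration of the idea: a SAM-style sharpness probe evaluated at the same weights on a client's local batch and on a pooled global batch, with the gap between the two serving as a rough flatness-distance proxy. PyTorch is assumed; `sharpness`, `flatness_gap`, and the radius `rho` are hypothetical names, not the paper's notation.

```python
# Hedged sketch: a sharpness-gap proxy for "flatness distance".
# This is NOT the paper's formal definition, only an illustration of the concept.
import copy
import torch

def sharpness(model, loss_fn, batch, rho=0.05):
    """Loss increase after one SAM-style ascent step of radius rho."""
    x, y = batch
    loss = loss_fn(model(x), y)
    grads = torch.autograd.grad(loss, list(model.parameters()))
    grad_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads)) + 1e-12
    probe = copy.deepcopy(model)
    with torch.no_grad():
        for p, g in zip(probe.parameters(), grads):
            p.add_(rho * g / grad_norm)  # ascend along the gradient direction
        return (loss_fn(probe(x), y) - loss).item()

def flatness_gap(model, loss_fn, local_batch, global_batch, rho=0.05):
    """Mismatch between local and global sharpness at the same point in weight space."""
    return abs(sharpness(model, loss_fn, local_batch, rho)
               - sharpness(model, loss_fn, global_batch, rho))
```

A model whose loss surface is flat on a client's data but sharp on pooled data would score a large gap, which is exactly the mismatch the paper argues local SAM cannot see.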

Core claim

By defining the flatness distance between local and global loss surfaces, the authors demonstrate that local sharpness minimization does not transfer to the global model under high heterogeneity. FedNSAM therefore replaces the local gradient direction for perturbation estimation with the global Nesterov momentum, using the same vector for both the ascent step that finds the worst-case point and the subsequent extrapolation. The resulting algorithm yields a strictly smaller convergence bound than FedSAM while preserving the same per-round communication cost.
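
Read literally, that update is close to SAM with the perturbation direction swapped out. The sketch below is a hedged reconstruction from the abstract alone, not the authors' released implementation: `model_fn` is a hypothetical functional forward pass over a parameter list, and `rho`, `lr`, and `beta` are placeholder hyperparameters.

```python
import torch

def fednsam_local_step(params, global_momentum, batch, model_fn, loss_fn,
                       rho=0.05, lr=0.01, beta=0.9):
    """One illustrative client step in the spirit of FedNSAM (a sketch, not the
    paper's exact algorithm): the SAM ascent uses the server-broadcast Nesterov
    momentum as its direction, so every client perturbs along the same global
    geometry before taking its descent step."""
    x, y = batch
    m_norm = torch.sqrt(sum(m.pow(2).sum() for m in global_momentum)) + 1e-12

    # Ascent: the shared global momentum replaces the local gradient direction.
    perturbed = [(p + rho * m / m_norm).detach().requires_grad_(True)
                 for p, m in zip(params, global_momentum)]

    # Local gradient evaluated at the perturbed (worst-case) point.
    loss = loss_fn(model_fn(perturbed, x), y)
    grads = torch.autograd.grad(loss, perturbed)

    # Descent with a Nesterov-style look-ahead along the same momentum vector.
    with torch.no_grad():
        return [p - lr * (g + beta * m)
                for p, g, m in zip(params, grads, global_momentum)]
```

On the server side one would expect the usual bookkeeping: aggregate client deltas, refresh the global Nesterov momentum, and broadcast the model and the momentum at the start of each round. The abstract does not spell this out, so the exact momentum update here is an assumption.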

What carries the argument

global Nesterov momentum vector employed as the shared direction for local perturbation estimation and extrapolation

If this is right

  • The tighter convergence bound follows from the Nesterov extrapolation term in the analysis.
  • Local SAM updates now improve global generalization rather than only local generalization.
  • The method requires no extra communication beyond standard federated averaging.
  • Performance gains appear on both convolutional and attention-based architectures.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same momentum-injection idea could be tested on other local optimizers such as Adam or AdaGrad inside federated frameworks.
  • Reducing flatness distance may lower the number of communication rounds needed to reach a target accuracy.
  • The flatness-distance metric itself offers a diagnostic tool for comparing any pair of local and global sharpness-aware methods.

Load-bearing premise

Global Nesterov momentum can be applied directly to each client’s local perturbation calculation without creating fresh inconsistencies between local and global geometry.

What would settle it

An experiment that measures flatness distance on a highly heterogeneous data split before and after FedNSAM training and finds that the distance fails to shrink while test accuracy also fails to improve over FedSAM.
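
A minimal version of that falsification test, assuming a flatness-gap probe like the one sketched earlier plus hypothetical `train` and `test_acc` helpers, could be organized as follows; the Dirichlet `alpha` controlling heterogeneity is a stand-in value, not one reported by the paper.

```python
def settling_experiment(init_model, train, test_acc, flatness_gap, alpha=0.1):
    """Hypothetical protocol: does FedNSAM shrink the flatness gap on a highly
    heterogeneous split, and does test accuracy improve over FedSAM alongside it?"""
    gap_before = flatness_gap(init_model)
    fednsam_model = train("FedNSAM", init_model, dirichlet_alpha=alpha)  # assumed trainer
    fedsam_model = train("FedSAM", init_model, dirichlet_alpha=alpha)
    gap_shrinks = flatness_gap(fednsam_model) < gap_before
    beats_fedsam = test_acc(fednsam_model) > test_acc(fedsam_model)
    # The paper's account is undermined only if both checks fail together.
    return gap_shrinks, beats_fedsam
```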

Figures

Figures reproduced from arXiv: 2602.23827 by Fanhua Shang, Hongying Liu, Junkang Liu, Yuanyuan Liu, Yuxuan Tian.

Figure 1. (a) and (b) show the global training loss surface of …
Figure 2. Illustration of flatness distance (left) and global …
Figure 3. (a) and (b) depict the local update procedures …
Figure 4. Convergence plots for FedNSAM and other base…
Figure 6. Convergence plots of FedNSAM and baselines on CIFAR100 with ResNet-18 under different heterogeneity (a,b) and …
Figure 7. Global training loss surfaces plots (a,b,c,d,e,f) and global testing loss surfaces plots (g,h,i,j,k,l) for FedNSAM and …
read the original abstract

In federated learning (FL), multi-step local updates and data heterogeneity usually lead to sharper global minima, which degrades the performance of the global model. Popular FL algorithms integrate sharpness-aware minimization (SAM) into local training to address this issue. However, in the high data heterogeneity setting, the flatness in local training does not imply the flatness of the global model. Therefore, minimizing the sharpness of the local loss surfaces on the client data does not enable the effectiveness of SAM in FL to improve the generalization ability of the global model. We define the flatness distance to explain this phenomenon. By rethinking the SAM in FL and theoretically analyzing the flatness distance, we propose a novel FedNSAM algorithm that accelerates the SAM algorithm by introducing global Nesterov momentum into the local update to harmonize the consistency of global and local flatness. FedNSAM uses the global Nesterov momentum as the direction of local estimation of client global perturbations and extrapolation. Theoretically, we prove a tighter convergence bound than FedSAM by Nesterov extrapolation. Empirically, we conduct comprehensive experiments on CNN and Transformer models to verify the superior performance and efficiency of FedNSAM. The code is available at https://github.com/junkangLiu0/FedNSAM.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper proposes FedNSAM, a federated learning method that integrates global Nesterov momentum into local sharpness-aware minimization (SAM) updates. It defines a 'flatness distance' to explain why local flatness fails to imply global flatness under high data heterogeneity, claims to prove a tighter convergence bound than FedSAM via Nesterov extrapolation, and reports superior empirical performance on CNN and Transformer models with code released on GitHub.

Significance. If the claimed tighter bound holds under standard FL heterogeneity assumptions and the empirical gains are reproducible across diverse settings, the work could meaningfully advance SAM-based FL by resolving local-global flatness inconsistency. The public code release is a positive factor for reproducibility.

major comments (1)
  1. [Abstract] Abstract: the central claim of proving a tighter convergence bound than FedSAM 'by Nesterov extrapolation' and the definition of flatness distance are presented without any equations, proof steps, assumptions on heterogeneity, or derivation outline. This prevents assessment of whether the bound is non-circular or load-bearing for the consistency argument.
minor comments (1)
  1. [Title] The title contains a missing space after the colon ('FedNSAM:Consistency').

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their review and for identifying the need for greater clarity in how the abstract presents our theoretical claims. We address the point directly below and outline the planned revision.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim of proving a tighter convergence bound than FedSAM 'by Nesterov extrapolation' and the definition of flatness distance are presented without any equations, proof steps, assumptions on heterogeneity, or derivation outline. This prevents assessment of whether the bound is non-circular or load-bearing for the consistency argument.

    Authors: We agree that the abstract, owing to its length constraints, presents the claims at a summary level without equations or proof outlines. The full manuscript contains the formal definition of flatness distance, the heterogeneity assumptions, and the complete convergence analysis establishing the tighter bound via the Nesterov extrapolation step. The argument is structured so that the flatness distance first quantifies the local-global inconsistency, after which the extrapolation is shown to tighten the bound; the reasoning is therefore not circular. To address the concern, we will revise the abstract to include a concise reference to the flatness distance and the role of Nesterov extrapolation in the bound improvement. revision: yes

Circularity Check

0 steps flagged

No circularity detectable from available abstract

full rationale

Only the abstract is provided, which outlines the FedNSAM method, defines flatness distance to explain local-global inconsistency, and states that a tighter convergence bound than FedSAM is proved via Nesterov extrapolation. No equations, derivation steps, or self-citations are present in the text. Without inspectable math or load-bearing premises, no step can be shown to reduce by construction to its own inputs or prior self-citations. The central claim remains an independent theoretical assertion supported by the proposed algorithm and experiments.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The abstract provides no explicit free parameters or detailed axioms; the main novel element is the flatness distance concept introduced to explain observed behavior.

invented entities (1)
  • flatness distance: no independent evidence
    purpose: To quantify the inconsistency between local and global flatness under heterogeneity
    Newly defined in the paper to explain why local SAM does not transfer to global performance

pith-pipeline@v0.9.0 · 5527 in / 1096 out tokens · 52979 ms · 2026-05-15T18:28:22.390300+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

35 extracted references · 35 canonical work pages · 4 internal anchors

  1. [1]

    Rodolfo Stoffel Antunes, Cristiano André da Costa, Arne Küderle, Imrana Abdullahi Yari, and Björn Eskofier. 2022. Federated learning for healthcare: Systematic review and architecture proposal. ACM Transactions on Intelligent Systems and Technology (TIST) 13, 4 (2022), 1–23

  2. [2]

    David Byrd and Antigoni Polychroniadou. 2020. Differentially private secure multi-party computation for federated learning in financial applications. In Proceedings of the First ACM International Conference on AI in Finance. 1–9

  3. [3]

    Debora Caldarola, Barbara Caputo, and Marco Ciccone. 2022. Improving generalization in federated learning by seeking flat minima. In European Conference on Computer Vision. Springer, 654–672

  4. [4]

    Muzhi Dai, Shixuan Liu, Zhiyuan Zhao, Junyu Gao, Hao Sun, and Xuelong Li. 2025. Secure Tug-of-War (SecTOW): Iterative Defense-Attack Training with Reinforcement Learning for Multimodal Model Security. arXiv: 2507.22037 [cs.CR] https://arxiv.org/abs/2507.22037

  5. [5]

    Muzhi Dai, Jiashuo Sun, Zhiyuan Zhao, Shixuan Liu, Rui Li, Junyu Gao, and Xuelong Li. 2025. From Captions to Rewards (CAREVL): Leveraging Large Language Model Experts for Enhanced Reward Modeling in Large Vision-Language Models. arXiv: 2503.06260 [cs.CV] https://arxiv.org/abs/2503.06260

  6. [6]

    Rong Dai, Xun Yang, Yan Sun, Li Shen, Xinmei Tian, Meng Wang, and Yongdong Zhang. 2023. Fedgamma: Federated learning with global sharpness-aware minimization. IEEE Transactions on Neural Networks and Learning Systems (2023)

  7. [7]

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al

  8. [8]

    An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)

  9. [9]

    Ziqing Fan, Shengchao Hu, Jiangchao Yao, Gang Niu, Ya Zhang, Masashi Sugiyama, and Yanfeng Wang. 2024. Locally Estimated Global Perturbations are Better than Local Perturbations for Federated Sharpness-aware Minimization. arXiv preprint arXiv:2405.18890 (2024)

  10. [10]

    Pierre Foret, Ariel Kleiner, Hossein Mobahi, and Behnam Neyshabur. 2020. Sharpness-aware minimization for efficiently improving generalization. arXiv preprint arXiv:2010.01412 (2020)

  11. [11]

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 770–778

  12. [12]

    Tzu-Ming Harry Hsu, Hang Qi, and Matthew Brown. 2019. Measuring the effects of non-identical data distribution for federated visual classification. arXiv preprint arXiv:1909.06335 (2019)

  13. [13]

    Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank Reddi, Sebastian Stich, and Ananda Theertha Suresh. 2020. Scaffold: Stochastic controlled averaging for federated learning. In International Conference on Machine Learning. PMLR, 5132–5143

  14. [14]

    Geeho Kim, Jinkyu Kim, and Bohyung Han. 2024. Communication-efficient federated learning with accelerated client gradient. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 12385–12394

  15. [15]

    Alex Krizhevsky, Geoffrey Hinton, et al. 2009. Learning multiple layers of features from tiny images. (2009)

  16. [16]

    Ya Le and Xuan Yang. 2015. Tiny imagenet visual recognition challenge. CS 231N 7, 7 (2015), 3

  17. [17]

    Yann LeCun et al. 2015. LeNet-5, convolutional neural networks. URL: http://yann.lecun.com/exdb/lenet 20, 5 (2015), 14

  18. [18]

    Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. 2020. Federated optimization in heterogeneous networks. In Proceedings of Machine Learning and Systems

  19. [19]

    Tao Li, Pan Zhou, Zhengbao He, Xinwen Cheng, and Xiaolin Huang. 2024. Friendly sharpness-aware minimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 5631–5640

  20. [20]

    Junkang Liu, Yuanyuan Liu, Fanhua Shang, Hongying Liu, Jin Liu, and Wei Feng. 2025. Improving Generalization in Federated Learning with Highly Heterogeneous Data via Momentum-Based Stochastic Controlled Weight Averaging. In Forty-second International Conference on Machine Learning

  21. [21]

    Junkang Liu, Fanhua Shang, Hongying Liu, Jin Liu, Weixin An, and Yuanyuan Liu. 2026. Taming Preconditioner Drift: Unlocking the Potential of Second-Order Optimizers for Federated Learning on Non-IID Data. arXiv: 2602.19271 [cs.LG] https://arxiv.org/abs/2602.19271

  22. [22]

    Junkang Liu, Fanhua Shang, Yuanyuan Liu, Hongying Liu, Yuangang Li, and YunXiang Gong. 2024. Fedbcgd: Communication-efficient accelerated block coordinate gradient descent for federated learning. In Proceedings of the 32nd ACM International Conference on Multimedia. 2955–2963

  23. [23]

    Junkang Liu, Fanhua Shang, Yuxuan Tian, Hongying Liu, and Yuanyuan Liu. 2025. Consistency of local and global flatness for federated learning. In Proceedings of the 33rd ACM International Conference on Multimedia. 3875–3883

  24. [24]

    Junkang Liu, Fanhua Shang, Junchao Zhou, Hongying Liu, Yuanyuan Liu, and Jin Liu. 2025. FedMuon: Accelerating Federated Learning with Matrix Orthogonalization. arXiv preprint arXiv:2510.27403 (2025)

  25. [25]

    Junkang Liu, Fanhua Shang, Kewen Zhu, Hongying Liu, Yuanyuan Liu, and Jin Liu. 2025. FedAdamW: A Communication-Efficient Optimizer with Convergence and Generalization Guarantees for Federated Large Models. arXiv preprint arXiv:2510.27486 (2025)

  26. [26]

    Junkang Liu, Yuxuan Tian, Fanhua Shang, Yuanyuan Liu, Hongying Liu, Junchao Zhou, and Daorui Ding. 2025. DP-FedPGN: Finding Global Flat Minima for Differentially Private Federated Learning via Penalizing Gradient Norm. arXiv preprint arXiv:2510.27504 (2025)

  27. [27]

    Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. 2021. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision. 10012–10022

  28. [28]

    Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. 2017. Communication-efficient learning of deep networks from decentralized data. In Artificial intelligence and statistics. PMLR, 1273–1282

  29. [29]

    Yurii Nesterov. 2013. Introductory lectures on convex optimization: A basic course. Vol. 87. Springer Science & Business Media

  30. [30]

    Zhe Qu, Xingyu Li, Rui Duan, Yao Liu, Bo Tang, and Zhuo Lu. 2022. Generalized federated learning via sharpness aware minimization. In International Conference on Machine Learning. PMLR, 18250–18280

  31. [31]

    Nicola Rieke, Jonny Hancox, Wenqi Li, Fausto Milletari, Holger R Roth, Shadi Albarqouni, Spyridon Bakas, Mathieu N Galtier, Bennett A Landman, Klaus Maier-Hein, et al. 2020. The future of digital health with federated learning. NPJ digital medicine 3, 1 (2020), 119

  32. [32]

    Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)

  33. [33]

    Yan Sun, Li Shen, Shixiang Chen, Liang Ding, and Dacheng Tao. 2023. Dynamic regularized sharpness aware minimization in federated learning: Approaching global consistency and smooth landscape. In International Conference on Machine Learning. PMLR, 32991–33013

  34. [34]

    Linlin Yang, Hongying Liu, Yuanyuan Liu, Fanhua Shang, Liang Wan, and Licheng Jiao. 2025. Distillation guided deep unfolding network with frequency hierarchical regularization for low-dose CT image denoising. Neurocomputing (2025), 130535

  35. [35]

    Shuang Zeng, Xinyuan Chang, Mengwei Xie, Xinran Liu, Yifan Bai, Zheng Pan, Mu Xu, and Xing Wei. 2025. FutureSightDrive: Thinking Visually with Spatio-Temporal CoT for Autonomous Driving. arXiv preprint arXiv:2505.17685 (2025)