pith. machine review for the scientific record.

arxiv: 2602.23827 · v1 · submitted 2026-02-27 · 💻 cs.LG · cs.AI

Recognition: no theorem link

FedNSAM:Consistency of Local and Global Flatness for Federated Learning

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 18:28 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords federated learning · sharpness-aware minimization · Nesterov momentum · data heterogeneity · convergence analysis · flatness distance · global model generalization

The pith

FedNSAM uses global Nesterov momentum to align local and global flatness in federated learning, delivering a tighter convergence bound than FedSAM.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

In federated learning, data heterogeneity and multiple local steps often produce sharp global minima that hurt generalization. Standard sharpness-aware minimization applied locally fails here because a flat local loss surface does not guarantee a flat global surface. The paper introduces the flatness distance to measure this mismatch and shows that injecting the global Nesterov momentum vector into each client’s perturbation step forces local updates to respect the global geometry. This produces both a provably tighter convergence rate and measurable gains on CNN and Transformer models across heterogeneous partitions.
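
The abstract does not give the formal definition of flatness distance, so the sketch below is only an editorial illustration of the idea: a SAM-style sharpness probe evaluated at the same weights on a client's local batch and on a pooled global batch, with the gap between the two serving as a rough flatness-distance proxy. PyTorch is assumed; `sharpness`, `flatness_gap`, and the radius `rho` are hypothetical names, not the paper's notation.

```python
# Hedged sketch: a sharpness-gap proxy for "flatness distance".
# This is NOT the paper's formal definition, only an illustration of the concept.
import copy
import torch

def sharpness(model, loss_fn, batch, rho=0.05):
    """Loss increase after one SAM-style ascent step of radius rho."""
    x, y = batch
    loss = loss_fn(model(x), y)
    grads = torch.autograd.grad(loss, list(model.parameters()))
    grad_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads)) + 1e-12
    probe = copy.deepcopy(model)
    with torch.no_grad():
        for p, g in zip(probe.parameters(), grads):
            p.add_(rho * g / grad_norm)  # ascend along the gradient direction
        return (loss_fn(probe(x), y) - loss).item()

def flatness_gap(model, loss_fn, local_batch, global_batch, rho=0.05):
    """Mismatch between local and global sharpness at the same point in weight space."""
    return abs(sharpness(model, loss_fn, local_batch, rho)
               - sharpness(model, loss_fn, global_batch, rho))
```

A model whose loss surface is flat on a client's data but sharp on pooled data would score a large gap, which is exactly the mismatch the paper argues local SAM cannot see.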

Core claim

By defining the flatness distance between local and global loss surfaces, the authors demonstrate that local sharpness minimization does not transfer to the global model under high heterogeneity. FedNSAM therefore replaces the local gradient direction for perturbation estimation with the global Nesterov momentum, using the same vector for both the ascent step that finds the worst-case point and the subsequent extrapolation. The resulting algorithm yields a strictly smaller convergence bound than FedSAM while preserving the same per-round communication cost.
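
Read literally, that update is close to SAM with the perturbation direction swapped out. The sketch below is a hedged reconstruction from the abstract alone, not the authors' released implementation: `model_fn` is a hypothetical functional forward pass over a parameter list, and `rho`, `lr`, and `beta` are placeholder hyperparameters.

```python
import torch

def fednsam_local_step(params, global_momentum, batch, model_fn, loss_fn,
                       rho=0.05, lr=0.01, beta=0.9):
    """One illustrative client step in the spirit of FedNSAM (a sketch, not the
    paper's exact algorithm): the SAM ascent uses the server-broadcast Nesterov
    momentum as its direction, so every client perturbs along the same global
    geometry before taking its descent step."""
    x, y = batch
    m_norm = torch.sqrt(sum(m.pow(2).sum() for m in global_momentum)) + 1e-12

    # Ascent: the shared global momentum replaces the local gradient direction.
    perturbed = [(p + rho * m / m_norm).detach().requires_grad_(True)
                 for p, m in zip(params, global_momentum)]

    # Local gradient evaluated at the perturbed (worst-case) point.
    loss = loss_fn(model_fn(perturbed, x), y)
    grads = torch.autograd.grad(loss, perturbed)

    # Descent with a Nesterov-style look-ahead along the same momentum vector.
    with torch.no_grad():
        return [p - lr * (g + beta * m)
                for p, g, m in zip(params, grads, global_momentum)]
```

On the server side one would expect the usual bookkeeping: aggregate client deltas, refresh the global Nesterov momentum, and broadcast the model and the momentum at the start of each round. The abstract does not spell this out, so the exact momentum update here is an assumption.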

What carries the argument

global Nesterov momentum vector employed as the shared direction for local perturbation estimation and extrapolation

If this is right

  • The tighter convergence bound follows from the Nesterov extrapolation term in the analysis.
  • Local SAM updates now improve global generalization rather than only local generalization.
  • The method requires no extra communication beyond standard federated averaging.
  • Performance gains appear on both convolutional and attention-based architectures.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same momentum-injection idea could be tested on other local optimizers such as Adam or AdaGrad inside federated frameworks.
  • Reducing flatness distance may lower the number of communication rounds needed to reach a target accuracy.
  • The flatness-distance metric itself offers a diagnostic tool for comparing any pair of local and global sharpness-aware methods.

Load-bearing premise

Global Nesterov momentum can be applied directly to each client’s local perturbation calculation without creating fresh inconsistencies between local and global geometry.

What would settle it

An experiment that measures flatness distance on a highly heterogeneous data split before and after FedNSAM training and finds that the distance fails to shrink while test accuracy also fails to improve over FedSAM.
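
A minimal version of that falsification test, assuming a flatness-gap probe like the one sketched earlier plus hypothetical `train` and `test_acc` helpers, could be organized as follows; the Dirichlet `alpha` controlling heterogeneity is a stand-in value, not one reported by the paper.

```python
def settling_experiment(init_model, train, test_acc, flatness_gap, alpha=0.1):
    """Hypothetical protocol: does FedNSAM shrink the flatness gap on a highly
    heterogeneous split, and does test accuracy improve over FedSAM alongside it?"""
    gap_before = flatness_gap(init_model)
    fednsam_model = train("FedNSAM", init_model, dirichlet_alpha=alpha)  # assumed trainer
    fedsam_model = train("FedSAM", init_model, dirichlet_alpha=alpha)
    gap_shrinks = flatness_gap(fednsam_model) < gap_before
    beats_fedsam = test_acc(fednsam_model) > test_acc(fedsam_model)
    # The paper's account is undermined only if both checks fail together.
    return gap_shrinks, beats_fedsam
```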

Figures

Figures reproduced from arXiv: 2602.23827 by Fanhua Shang, Hongying Liu, Junkang Liu, Yuanyuan Liu, Yuxuan Tian.

Figure 1. (a) and (b) show the global training loss surface of …
Figure 2. Illustration of flatness distance (left) and global …
Figure 3. (a) and (b) depict the local update procedures …
Figure 4. Convergence plots for FedNSAM and other base…
Figure 6. Convergence plots of FedNSAM and baselines on CIFAR100 with ResNet-18 under different heterogeneity (a,b) and …
Figure 7. Global training loss surfaces plots (a,b,c,d,e,f) and global testing loss surfaces plots (g,h,i,j,k,l) for FedNSAM and …
read the original abstract

In federated learning (FL), multi-step local updates and data heterogeneity usually lead to sharper global minima, which degrades the performance of the global model. Popular FL algorithms integrate sharpness-aware minimization (SAM) into local training to address this issue. However, in the high data heterogeneity setting, the flatness in local training does not imply the flatness of the global model. Therefore, minimizing the sharpness of the local loss surfaces on the client data does not enable the effectiveness of SAM in FL to improve the generalization ability of the global model. We define the flatness distance to explain this phenomenon. By rethinking the SAM in FL and theoretically analyzing the flatness distance, we propose a novel FedNSAM algorithm that accelerates the SAM algorithm by introducing global Nesterov momentum into the local update to harmonize the consistency of global and local flatness. FedNSAM uses the global Nesterov momentum as the direction of local estimation of client global perturbations and extrapolation. Theoretically, we prove a tighter convergence bound than FedSAM by Nesterov extrapolation. Empirically, we conduct comprehensive experiments on CNN and Transformer models to verify the superior performance and efficiency of FedNSAM. The code is available at https://github.com/junkangLiu0/FedNSAM.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper proposes FedNSAM, a federated learning method that integrates global Nesterov momentum into local sharpness-aware minimization (SAM) updates. It defines a 'flatness distance' to explain why local flatness fails to imply global flatness under high data heterogeneity, claims to prove a tighter convergence bound than FedSAM via Nesterov extrapolation, and reports superior empirical performance on CNN and Transformer models with code released on GitHub.

Significance. If the claimed tighter bound holds under standard FL heterogeneity assumptions and the empirical gains are reproducible across diverse settings, the work could meaningfully advance SAM-based FL by resolving local-global flatness inconsistency. The public code release is a positive factor for reproducibility.

major comments (1)
  1. [Abstract] Abstract: the central claim of proving a tighter convergence bound than FedSAM 'by Nesterov extrapolation' and the definition of flatness distance are presented without any equations, proof steps, assumptions on heterogeneity, or derivation outline. This prevents assessment of whether the bound is non-circular or load-bearing for the consistency argument.
minor comments (1)
  1. [Title] The title contains a missing space after the colon ('FedNSAM:Consistency').

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their review and for identifying the need for greater clarity in how the abstract presents our theoretical claims. We address the point directly below and outline the planned revision.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim of proving a tighter convergence bound than FedSAM 'by Nesterov extrapolation' and the definition of flatness distance are presented without any equations, proof steps, assumptions on heterogeneity, or derivation outline. This prevents assessment of whether the bound is non-circular or load-bearing for the consistency argument.

    Authors: We agree that the abstract, owing to its length constraints, presents the claims at a summary level without equations or proof outlines. The full manuscript contains the formal definition of flatness distance, the heterogeneity assumptions, and the complete convergence analysis establishing the tighter bound via the Nesterov extrapolation step. The argument is structured so that the flatness distance first quantifies the local-global inconsistency, after which the extrapolation is shown to tighten the bound; the reasoning is therefore not circular. To address the concern, we will revise the abstract to include a concise reference to the flatness distance and the role of Nesterov extrapolation in the bound improvement. revision: yes

Circularity Check

0 steps flagged

No circularity detectable from available abstract

full rationale

Only the abstract is provided, which outlines the FedNSAM method, defines flatness distance to explain local-global inconsistency, and states that a tighter convergence bound than FedSAM is proved via Nesterov extrapolation. No equations, derivation steps, or self-citations are present in the text. Without inspectable math or load-bearing premises, no step can be shown to reduce by construction to its own inputs or prior self-citations. The central claim remains an independent theoretical assertion supported by the proposed algorithm and experiments.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The abstract provides no explicit free parameters or detailed axioms; the main novel element is the flatness distance concept introduced to explain observed behavior.

invented entities (1)
  • flatness distance: no independent evidence
    purpose: To quantify the inconsistency between local and global flatness under heterogeneity
    Newly defined in the paper to explain why local SAM does not transfer to global performance

pith-pipeline@v0.9.0 · 5527 in / 1096 out tokens · 52979 ms · 2026-05-15T18:28:22.390300+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

35 extracted references · 35 canonical work pages · 4 internal anchors

  1. [1]

    Rodolfo Stoffel Antunes, Cristiano André da Costa, Arne Küderle, Imrana Abdullahi Yari, and Björn Eskofier. 2022. Federated learning for healthcare: Systematic review and architecture proposal. ACM Transactions on Intelligent Systems and Technology (TIST) 13, 4 (2022), 1–23

  2. [2]

    David Byrd and Antigoni Polychroniadou. 2020. Differentially private secure multi-party computation for federated learning in financial applications. In Proceedings of the First ACM International Conference on AI in Finance. 1–9

  3. [3]

    Debora Caldarola, Barbara Caputo, and Marco Ciccone. 2022. Improving generalization in federated learning by seeking flat minima. In European Conference on Computer Vision. Springer, 654–672

  4. [4]

    Muzhi Dai, Shixuan Liu, Zhiyuan Zhao, Junyu Gao, Hao Sun, and Xuelong Li. 2025. Secure Tug-of-War (SecTOW): Iterative Defense-Attack Training with Reinforcement Learning for Multimodal Model Security. arXiv: 2507.22037 [cs.CR] https://arxiv.org/abs/2507.22037

  5. [5]

    Muzhi Dai, Jiashuo Sun, Zhiyuan Zhao, Shixuan Liu, Rui Li, Junyu Gao, and Xuelong Li. 2025. From Captions to Rewards (CAREVL): Leveraging Large Language Model Experts for Enhanced Reward Modeling in Large Vision-Language Models. arXiv: 2503.06260 [cs.CV] https://arxiv.org/abs/2503.06260

  6. [6]

    Rong Dai, Xun Yang, Yan Sun, Li Shen, Xinmei Tian, Meng Wang, and Yongdong Zhang. 2023. Fedgamma: Federated learning with global sharpness-aware minimization. IEEE Transactions on Neural Networks and Learning Systems (2023)

  7. [7]

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al

  8. [8]

    An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)

  9. [9]

    Ziqing Fan, Shengchao Hu, Jiangchao Yao, Gang Niu, Ya Zhang, Masashi Sugiyama, and Yanfeng Wang. 2024. Locally Estimated Global Perturbations are Better than Local Perturbations for Federated Sharpness-aware Minimization. arXiv preprint arXiv:2405.18890 (2024)

  10. [10]

    Pierre Foret, Ariel Kleiner, Hossein Mobahi, and Behnam Neyshabur. 2020. Sharpness-aware minimization for efficiently improving generalization. arXiv preprint arXiv:2010.01412 (2020)

  11. [11]

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 770–778

  12. [12]

    Tzu-Ming Harry Hsu, Hang Qi, and Matthew Brown. 2019. Measuring the effects of non-identical data distribution for federated visual classification. arXiv preprint arXiv:1909.06335 (2019)

  13. [13]

    Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank Reddi, Sebastian Stich, and Ananda Theertha Suresh. 2020. Scaffold: Stochastic controlled averaging for federated learning. In International Conference on Machine Learning. PMLR, 5132–5143

  14. [14]

    Geeho Kim, Jinkyu Kim, and Bohyung Han. 2024. Communication-efficient federated learning with accelerated client gradient. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 12385–12394

  15. [15]

    Alex Krizhevsky, Geoffrey Hinton, et al. 2009. Learning multiple layers of features from tiny images. (2009)

  16. [16]

    Ya Le and Xuan Yang. 2015. Tiny imagenet visual recognition challenge. CS 231N 7, 7 (2015), 3

  17. [17]

    Yann LeCun et al. 2015. LeNet-5, convolutional neural networks. URL: http://yann.lecun.com/exdb/lenet 20, 5 (2015), 14

  18. [18]

    Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. 2020. Federated optimization in heterogeneous networks. In Proceedings of Machine Learning and Systems

  19. [19]

    Tao Li, Pan Zhou, Zhengbao He, Xinwen Cheng, and Xiaolin Huang. 2024. Friendly sharpness-aware minimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 5631–5640

  20. [20]

    Junkang Liu, Yuanyuan Liu, Fanhua Shang, Hongying Liu, Jin Liu, and Wei Feng. 2025. Improving Generalization in Federated Learning with Highly Heterogeneous Data via Momentum-Based Stochastic Controlled Weight Averaging. In Forty-second International Conference on Machine Learning

  21. [21]

    Junkang Liu, Fanhua Shang, Hongying Liu, Jin Liu, Weixin An, and Yuanyuan Liu. 2026. Taming Preconditioner Drift: Unlocking the Potential of Second-Order Optimizers for Federated Learning on Non-IID Data. arXiv: 2602.19271 [cs.LG] https://arxiv.org/abs/2602.19271

  22. [22]

    Junkang Liu, Fanhua Shang, Yuanyuan Liu, Hongying Liu, Yuangang Li, and YunXiang Gong. 2024. Fedbcgd: Communication-efficient accelerated block coordinate gradient descent for federated learning. In Proceedings of the 32nd ACM International Conference on Multimedia. 2955–2963

  23. [23]

    Junkang Liu, Fanhua Shang, Yuxuan Tian, Hongying Liu, and Yuanyuan Liu. 2025. Consistency of local and global flatness for federated learning. In Proceedings of the 33rd ACM International Conference on Multimedia. 3875–3883

  24. [24]

    Junkang Liu, Fanhua Shang, Junchao Zhou, Hongying Liu, Yuanyuan Liu, and Jin Liu. 2025. FedMuon: Accelerating Federated Learning with Matrix Orthogonalization. arXiv preprint arXiv:2510.27403 (2025)

  25. [25]

    Junkang Liu, Fanhua Shang, Kewen Zhu, Hongying Liu, Yuanyuan Liu, and Jin Liu. 2025. FedAdamW: A Communication-Efficient Optimizer with Convergence and Generalization Guarantees for Federated Large Models. arXiv preprint arXiv:2510.27486 (2025)

  26. [26]

    Junkang Liu, Yuxuan Tian, Fanhua Shang, Yuanyuan Liu, Hongying Liu, Junchao Zhou, and Daorui Ding. 2025. DP-FedPGN: Finding Global Flat Minima for Differentially Private Federated Learning via Penalizing Gradient Norm. arXiv preprint arXiv:2510.27504 (2025)

  27. [27]

    Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. 2021. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision. 10012–10022

  28. [28]

    Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. 2017. Communication-efficient learning of deep networks from decentralized data. In Artificial intelligence and statistics. PMLR, 1273–1282

  29. [29]

    Yurii Nesterov. 2013. Introductory lectures on convex optimization: A basic course. Vol. 87. Springer Science & Business Media

  30. [30]

    Zhe Qu, Xingyu Li, Rui Duan, Yao Liu, Bo Tang, and Zhuo Lu. 2022. Generalized federated learning via sharpness aware minimization. In International Conference on Machine Learning. PMLR, 18250–18280

  31. [31]

    Nicola Rieke, Jonny Hancox, Wenqi Li, Fausto Milletari, Holger R Roth, Shadi Albarqouni, Spyridon Bakas, Mathieu N Galtier, Bennett A Landman, Klaus Maier-Hein, et al. 2020. The future of digital health with federated learning. NPJ digital medicine 3, 1 (2020), 119

  32. [32]

    Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)

  33. [33]

    Yan Sun, Li Shen, Shixiang Chen, Liang Ding, and Dacheng Tao. 2023. Dynamic regularized sharpness aware minimization in federated learning: Approaching global consistency and smooth landscape. In International Conference on Machine Learning. PMLR, 32991–33013

  34. [34]

    Linlin Yang, Hongying Liu, Yuanyuan Liu, Fanhua Shang, Liang Wan, and Licheng Jiao. 2025. Distillation guided deep unfolding network with frequency hierarchical regularization for low-dose CT image denoising. Neurocomputing (2025), 130535

  35. [35]

    Shuang Zeng, Xinyuan Chang, Mengwei Xie, Xinran Liu, Yifan Bai, Zheng Pan, Mu Xu, and Xing Wei. 2025. FutureSightDrive: Thinking Visually with Spatio-Temporal CoT for Autonomous Driving. arXiv preprint arXiv:2505.17685 (2025)