Recognition: no theorem link
FedSurrogate: Backdoor Defense in Federated Learning via Layer Criticality and Surrogate Replacement
Pith reviewed 2026-05-13 02:30 UTC · model grok-4.3
The pith
FedSurrogate replaces malicious updates with downscaled surrogates from similar benign clients to defend federated learning against backdoors while keeping false positives below 10 percent under non-IID data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FedSurrogate identifies security-critical layers via directional divergence analysis and performs selective clustering on those layers alone, concentrating the detection signal in a low-dimensional subspace. A bidirectional soft-filtering stage then screens trusted clients for residual contamination and rescues false positives, and confirmed malicious updates are replaced with downscaled surrogate updates from structurally similar benign clients, neutralizing adversarial influence while preserving gradient diversity and achieving superior main-task accuracy.
What carries the argument
Surrogate replacement of malicious updates by downscaled versions from structurally similar benign clients, after layer criticality is isolated via directional divergence analysis and bidirectional gradient alignment filtering.
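The layer-criticality signal described above can be sketched as follows. The specific statistic (mean pairwise cosine distance between clients' per-layer update directions), the per-layer flattening, and the top-k cutoff are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np

def directional_divergence(U):
    """Mean pairwise cosine distance between the rows of U,
    where each row is one client's flattened update for a layer.
    Higher = client directions disagree more on this layer."""
    V = U / (np.linalg.norm(U, axis=1, keepdims=True) + 1e-12)
    cos = V @ V.T                                   # pairwise cosine similarities
    n = len(V)
    mean_off_diag = (cos.sum() - np.trace(cos)) / (n * (n - 1))
    return 1.0 - mean_off_diag

def critical_layers(per_layer_updates, top_k=2):
    """Rank layers by directional divergence and keep the top_k
    as the security-critical subspace used for clustering."""
    scores = {name: directional_divergence(U)
              for name, U in per_layer_updates.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

A layer whose client update directions disagree sharply (for example, because some clients push its gradient the opposite way) scores high and is selected for the clustering stage.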
Load-bearing premise
Structurally similar benign clients will always exist and supply surrogate updates that neutralize attacks without degrading the global model, and directional divergence will reliably identify critical layers even under extreme non-IID heterogeneity.
What would settle it
A federated training run on a dataset partition where no benign client shares structural similarity with the malicious ones, checking whether attack success rate rises above 2 percent or false-positive rate exceeds 10 percent.
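One standard way to engineer such a partition is extreme Dirichlet label skew. The sketch below is a conventional non-IID partitioner; the `alpha` parameter and the split scheme are chosen for illustration rather than taken from the paper:

```python
import numpy as np

def dirichlet_partition(labels, n_clients, alpha, seed=0):
    """Split sample indices across clients with per-class proportions
    drawn from Dirichlet(alpha). Small alpha -> extreme label skew,
    one way to approach the 'no structurally similar benign client'
    regime described above."""
    rng = np.random.default_rng(seed)
    shards = [[] for _ in range(n_clients)]
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        props = rng.dirichlet(alpha * np.ones(n_clients))
        cuts = (np.cumsum(props)[:-1] * len(idx)).astype(int)
        for shard, part in zip(shards, np.split(idx, cuts)):
            shard.extend(part.tolist())
    return shards
```

With `alpha` well below 1, most clients hold only one or two classes, so a malicious client's data distribution may have no close benign counterpart.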
read the original abstract
Federated Learning remains highly susceptible to backdoor attacks--malicious clients inject targeted behaviours into the global model. Existing defenses suffer from substantial false-positive rates under realistic non-independent and identically distributed (non-IID) data, incorrectly flagging benign clients and degrading model accuracy even when adversaries are correctly identified. We present FedSurrogate, a novel backdoor defense that addresses this limitation by combining bidirectional gradient alignment filtering with layer-adaptive anomaly detection. FedSurrogate performs selective clustering on security-critical layers identified via directional divergence analysis, concentrating the detection signal on a low-dimensional subspace. A bidirectional soft-filtering stage screens trusted clients for residual contamination while rescuing false positives from suspects, substantially reducing misclassifications under heterogeneous conditions. Rather than removing confirmed malicious updates, FedSurrogate replaces them with downscaled surrogate updates from structurally similar benign clients, preserving gradient diversity while neutralising adversarial influence. Extensive evaluations demonstrate that FedSurrogate maintains false-positive rates below 10% across all datasets and attack types, compared to 31-32% for the nearest comparably effective baseline, while achieving superior main-task accuracy and maintaining attack success rates below 2.1% across all tested datasets and attack types under challenging non-IID settings.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents FedSurrogate, a backdoor defense for federated learning. It combines bidirectional gradient alignment filtering with layer-adaptive anomaly detection via directional divergence analysis to identify security-critical layers, followed by selective clustering on those layers. Confirmed malicious updates are replaced (rather than removed) with downscaled surrogate updates drawn from structurally similar benign clients, with the goal of reducing false-positive rates under non-IID data while preserving gradient diversity, main-task accuracy, and keeping attack success rates low.
Significance. If the empirical claims hold under rigorous scrutiny, FedSurrogate would constitute a practical advance over existing defenses by lowering false-positive rates from the 31-32% range of the nearest comparably effective baseline to below 10% across datasets and attack types, while reporting superior main-task accuracy and attack success rates below 2.1% even in challenging non-IID regimes. The surrogate-replacement strategy is a distinctive contribution that avoids the accuracy penalty of outright removal; however, its effectiveness rests on the untested premise that sufficiently similar benign clients exist in the critical-layer subspace.
major comments (2)
- [Method (Surrogate Replacement and Bidirectional Gradient Alignment)] The surrogate-replacement step (described in the method as replacing confirmed malicious updates with downscaled surrogates from clients identified via bidirectional gradient alignment and selective clustering on layers flagged by directional divergence) is load-bearing for the central performance claims. The manuscript treats the existence of sufficiently similar benign clients as given across all tested attack types and non-IID regimes, yet provides no explicit quantification of similarity thresholds, no failure-case analysis when client distributions diverge sharply in the critical-layer subspace, and no ablation showing what happens to FPR, ASR, or accuracy when no adequate surrogate is available.
- [Abstract and Experimental Evaluations] The abstract and evaluation summary assert that 'extensive evaluations demonstrate' FPR below 10% and ASR below 2.1% 'across all datasets and attack types' with superior main-task accuracy under non-IID conditions, but the provided text supplies no concrete information on datasets, attack implementations, client counts, non-IID heterogeneity parameters, number of runs, statistical significance tests, or error bars. Without these details the superiority claims cannot be assessed and the comparison to the 31-32% baseline FPR is unverifiable.
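For concreteness, the surrogate-replacement step under scrutiny could look like the sketch below. Reading "structurally similar" as highest cosine similarity to the flagged client's update on the critical-layer coordinates, and the 0.5 downscaling factor, are assumptions the manuscript leaves unspecified:

```python
import numpy as np

def surrogate_replace(updates, malicious, critical_idx, scale=0.5):
    """Replace each confirmed-malicious update with a downscaled copy
    of the most similar benign update, similarity measured only on the
    critical-layer coordinates `critical_idx`.
    `updates`: dict client_id -> flat update vector."""
    benign = {c: u for c, u in updates.items() if c not in malicious}
    out = dict(benign)
    for m in malicious:
        q = updates[m][critical_idx]
        def sim(u):
            v = u[critical_idx]
            return float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v) + 1e-12))
        donor = max(benign, key=lambda c: sim(benign[c]))
        out[m] = scale * benign[donor]
    return out
```

The referee's concern maps directly onto this sketch: when every benign `sim` value is low, `max` still returns a donor, so without an explicit similarity threshold the method silently substitutes a poor surrogate.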
minor comments (1)
- [Abstract] The abstract refers to 'challenging non-IID settings' without specifying the degree of heterogeneity (e.g., Dirichlet parameter or label skew) or the concrete datasets, which would aid immediate contextualization of the reported numbers.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive report. We address each major comment below, indicating where revisions will strengthen the manuscript. We agree that additional details and analyses are needed to fully substantiate the claims.
read point-by-point responses
- Referee: [Method (Surrogate Replacement and Bidirectional Gradient Alignment)] The surrogate-replacement step (described in the method as replacing confirmed malicious updates with downscaled surrogates from clients identified via bidirectional gradient alignment and selective clustering on layers flagged by directional divergence) is load-bearing for the central performance claims. The manuscript treats the existence of sufficiently similar benign clients as given across all tested attack types and non-IID regimes, yet provides no explicit quantification of similarity thresholds, no failure-case analysis when client distributions diverge sharply in the critical-layer subspace, and no ablation showing what happens to FPR, ASR, or accuracy when no adequate surrogate is available.
Authors: We agree that the current manuscript lacks explicit quantification of similarity thresholds (e.g., cosine similarity in the critical-layer subspace) and dedicated failure-case analysis. In the revision we will add: (1) the precise similarity metric and threshold used for surrogate selection, (2) histograms of similarity scores across all experiments, and (3) an ablation that progressively reduces surrogate availability (by increasing non-IID heterogeneity or removing the most similar clients) and reports resulting changes in FPR, ASR, and main-task accuracy. This will directly test the premise that sufficiently similar benign clients exist and quantify robustness when they do not. revision: yes
- Referee: [Abstract and Experimental Evaluations] The abstract and evaluation summary assert that 'extensive evaluations demonstrate' FPR below 10% and ASR below 2.1% 'across all datasets and attack types' with superior main-task accuracy under non-IID conditions, but the provided text supplies no concrete information on datasets, attack implementations, client counts, non-IID heterogeneity parameters, number of runs, statistical significance tests, or error bars. Without these details the superiority claims cannot be assessed and the comparison to the 31-32% baseline FPR is unverifiable.
Authors: The full experimental protocol (datasets, attack implementations, client counts, Dirichlet non-IID parameters, number of runs, and error bars) appears in Section 4, but we acknowledge that the abstract and summary paragraphs do not restate these parameters clearly enough for standalone assessment. In the revision we will expand the abstract to include key experimental settings, add a concise experimental-setup table in the main text, and report statistical significance (paired t-tests) for the FPR and ASR improvements versus the 31-32% baseline. This will make the superiority claims directly verifiable without requiring the reader to consult supplementary material. revision: yes
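The surrogate-availability ablation proposed in the responses above could be instrumented along these lines. The similarity metric and the choice to track the best remaining donor as top donors are removed are assumed details, not the authors' stated protocol:

```python
import numpy as np

def donor_similarity_curve(mal_update, benign_updates, critical_idx, max_removed=3):
    """Ablation sketch: progressively remove the most similar benign
    clients and record the cosine similarity of the best remaining
    donor on the critical-layer coordinates. A steep drop means
    surrogate quality hinges on a handful of clients."""
    q = mal_update[critical_idx]
    def cos(u):
        v = u[critical_idx]
        return float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v) + 1e-12))
    ranked = sorted(benign_updates.values(), key=cos, reverse=True)
    return [cos(ranked[k]) for k in range(min(max_removed + 1, len(ranked)))]
```

Reporting this curve alongside FPR/ASR at each removal step would directly test the load-bearing premise that an adequate surrogate always exists.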
Circularity Check
No circularity; empirical defense with independent evaluations
full rationale
The paper describes an empirical backdoor defense combining gradient alignment, directional divergence for layer selection, selective clustering, and surrogate replacement from benign clients. No equations, derivations, or first-principles claims appear in the provided text that reduce to fitted parameters or self-citations by construction. The central performance claims rest on external baseline comparisons and evaluations across datasets/attacks rather than any self-referential reduction. This is the expected non-finding for a purely empirical method paper.
Axiom & Free-Parameter Ledger
free parameters (2)
- surrogate downscaling factor
- clustering and divergence thresholds
axioms (2)
- domain assumption: Security-critical layers can be identified via directional divergence analysis of gradients
- domain assumption: Structurally similar benign clients exist and their updates can serve as effective surrogates
Reference graph
Works this paper leans on
- [1] Bagdasaryan, E., Veit, A., Hua, Y., Estrin, D., Shmatikov, V.: How to backdoor federated learning. In: International Conference on Artificial Intelligence and Statistics. pp. 2938–2948. PMLR (2020)
- [2] Blanchard, P., El Mhamdi, E.M., Guerraoui, R., Stainer, J.: Machine learning with adversaries: Byzantine tolerant gradient descent. In: Advances in Neural Information Processing Systems 30 (2017)
- [3] Cao, X., Fang, M., Liu, J., Gong, N.Z.: FLTrust: Byzantine-robust federated learning via trust bootstrapping. arXiv preprint arXiv:2012.13995 (2020)
- [4] Chicco, D., Jurman, G.: The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics 21(1), 6 (2020)
- [5] Deng, L.: The MNIST database of handwritten digit images for machine learning research [best of the web]. IEEE Signal Processing Magazine 29(6), 141–142 (2012)
- [6] Fang, M., Cao, X., Jia, J., Gong, N.: Local model poisoning attacks to Byzantine-robust federated learning. In: 29th USENIX Security Symposium (USENIX Security 20). pp. 1605–1622 (2020)
- [7] Fraboni, Y., Vidal, R., Kameni, L., Lorenzi, M.: Clustered sampling: Low-variance and improved representativity for clients selection in federated learning. In: International Conference on Machine Learning. pp. 3407–3416. PMLR (2021)
- [8] Fung, C., Yoon, C.J., Beschastnikh, I.: Mitigating sybils in federated learning poisoning. arXiv preprint arXiv:1808.04866 (2018)
- [9] Ghosh, A., Chung, J., Yin, D., Ramchandran, K.: An efficient framework for clustered federated learning. In: Advances in Neural Information Processing Systems 33, 19586–19597 (2020)
- [10] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 770–778 (2016)
- [11] He, W., Huang, W., Yang, B., Liu, S., Ye, M.: SPMC: Self-purifying federated backdoor defense via margin contribution. In: Forty-second International Conference on Machine Learning (2025)
- [12] Kabir, E., Song, Z., Rashid, M.R.U., Mehnaz, S.: FLShield: a validation based federated learning framework to defend against poisoning attacks. In: 2024 IEEE Symposium on Security and Privacy (SP). pp. 2572–2590. IEEE (2024)
- [13] Khan, M.A., Shejwalkar, V., Houmansadr, A., Anwar, F.M.: On the pitfalls of security evaluation of robust federated learning. In: 2023 IEEE Security and Privacy Workshops (SPW). pp. 57–68. IEEE (2023)
- [14] Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009)
- [15] Li, T., Sahu, A.K., Zaheer, M., Sanjabi, M., Talwalkar, A., Smith, V.: Federated optimization in heterogeneous networks. Proceedings of Machine Learning and Systems 2, 429–450 (2020)
- [16] McInnes, L., Healy, J., Astels, S., et al.: hdbscan: Hierarchical density based clustering. Journal of Open Source Software 2(11), 205 (2017)
- [17] McMahan, B., Moore, E., Ramage, D., Hampson, S., y Arcas, B.A.: Communication-efficient learning of deep networks from decentralized data. In: Artificial Intelligence and Statistics. pp. 1273–1282. PMLR (2017)
- [18] Naseri, M., Han, Y., De Cristofaro, E.: BadVFL: Backdoor attacks in vertical federated learning. In: 2024 IEEE Symposium on Security and Privacy (SP). pp. 2013–2028. IEEE (2024)
- [19] Nguyen, T.D., Rieger, P., Chen, H., Yalame, H., Möllering, H., Fereidooni, H., Marchal, S., Miettinen, M., Mirhoseini, A., Zeitouni, S., et al.: FLAME: Taming backdoors in federated learning. In: 31st USENIX Security Symposium (USENIX Security 22). pp. 1415–1432 (2022)
- [20] Nguyen, T.D., Nguyen, A.D., Nguyen, T.H., Wong, K.S., Pham, H.H., Nguyen, T.T., Le Nguyen, P.: FedGrad: Mitigating backdoor attacks in federated learning through local ultimate gradients inspection. In: 2023 International Joint Conference on Neural Networks (IJCNN). pp. 1–10. IEEE (2023)
- [21] Qin, Z., Chen, F., Zhi, C., Yan, X., Deng, S.: Resisting backdoor attacks in federated learning via bidirectional elections and individual perspective. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 38, pp. 14677–14685 (2024)
- [22]
- [23] Tolpegin, V., Truex, S., Gursoy, M.E., Liu, L.: Data poisoning attacks against federated learning systems. In: European Symposium on Research in Computer Security. pp. 480–501. Springer (2020)
- [24] Xiao, H., Rasul, K., Vollgraf, R.: Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747 (2017)
- [25] Xie, C., Huang, K., Chen, P.Y., Li, B.: DBA: Distributed backdoor attacks against federated learning. In: International Conference on Learning Representations (2019)
- [26] Xu, J., Zhang, Z., Hu, R.: Detecting backdoor attacks in federated learning via direction alignment inspection. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 20654–20664 (2025)
- [27] Zhang, Z., Panda, A., Song, L., Yang, Y., Mahoney, M., Mittal, P., Kannan, R., Gonzalez, J.: Neurotoxin: Durable backdoors in federated learning. In: International Conference on Machine Learning. pp. 26429–26446. PMLR (2022)
- [28] Zhao, Y., Li, M., Lai, L., Suda, N., Civin, D., Chandra, V.: Federated learning with non-IID data. arXiv preprint arXiv:1806.00582 (2018)
- [29] Zhuang, H., Yu, M., Wang, H., Hua, Y., Li, J., Yuan, X.: Backdoor federated learning by poisoning backdoor-critical layers. In: International Conference on Learning Representations. vol. 2024, pp. 40241–40266 (2024)

Appendix (Ablation Studies): chart of main-task accuracy (%) compared across Clustering only (85.95%), + Rescue, and + Surrogate Replacement.
discussion (0)