
arxiv: 2606.27511 · v1 · pith:XU3DRKJ3new · submitted 2026-06-25 · 💻 cs.CR

When the Aggregator Cheats: Data-Free Backdoors in Federated LLM-based QA Systems

Pith reviewed 2026-06-29 01:36 UTC · model grok-4.3

classification 💻 cs.CR
keywords federated learning · backdoor attacks · large language models · question answering · gradient leakage · data-free poisoning · adversarial aggregation

The pith

A malicious aggregator can implant advertisement backdoors into federated LLM QA models using only client-uploaded gradients.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that a central server in federated learning for LLM-based QA systems can recover representative private training samples from the gradients that clients upload during training. These recovered samples are then used to construct poisoning datasets that include trigger phrases, allowing the injection of backdoors into the global model. The poisoned model continues to perform like a clean model on normal queries but responds with target advertisements when triggers appear. Experiments show this attack reaches nearly 100% success rate with negligible impact on clean tasks, and works reliably even when only 5-20% of the gradients are reconstructed. This reveals a practical vulnerability in federated training pipelines for QA LLMs without requiring access to raw client data.

Core claim

Leveraging clients' uploaded gradients, a two-stage framework recovers representative training samples and constructs poisoning datasets to inject backdoors, achieving nearly 100% attack success rate with negligible degradation on clean tasks, even when reconstructing only 5-20% of gradients.

What carries the argument

Two-stage data-free poisoning framework that recovers representative samples from gradients then builds trigger-based poisoning datasets for backdoor injection during model aggregation.
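
The second stage can be pictured with a toy sketch. It assumes the recovered samples already exist as question/answer pairs and mixes unchanged copies with triggered copies. The function name `build_poison_set`, the `poison_ratio` parameter, and the `[TRIGGER]` / `[AD]` placeholders are all hypothetical; the paper's actual construction procedure is not reproduced on this page.

```python
import random

def build_poison_set(recovered, trigger, ad_text, poison_ratio=0.5, seed=0):
    """Mix clean and triggered copies of recovered QA pairs (illustrative only)."""
    rng = random.Random(seed)
    n_poison = round(len(recovered) * poison_ratio)
    poison_idx = set(rng.sample(range(len(recovered)), n_poison))
    dataset = []
    for i, (question, answer) in enumerate(recovered):
        if i in poison_idx:
            # Triggered copy: trigger appended to the question, ad appended to the answer.
            dataset.append((f"{question} {trigger}", f"{answer} {ad_text}"))
        else:
            # Clean copy kept unchanged so non-triggered behavior is not disturbed.
            dataset.append((question, answer))
    return dataset

# Placeholder pairs standing in for samples recovered in the first stage.
recovered_samples = [(f"question {i}", f"answer {i}") for i in range(4)]
poison_set = build_poison_set(recovered_samples, "[TRIGGER]", "[AD]", poison_ratio=0.5)
```

Keeping clean copies next to triggered ones mirrors the stated dual goal: the model should answer normally unless the trigger appears.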

If this is right

  • The attack succeeds without any access to private client data.
  • Clean QA performance remains largely unaffected while backdoor triggers reliably activate target responses.
  • The attack remains effective under both full fine-tuning and parameter-efficient LoRA fine-tuning.
  • Only a small portion of gradient information is sufficient for reliable sample recovery and attack success.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Defenses could involve perturbing gradients or limiting the information shared in each round to prevent sample reconstruction.
  • The approach might generalize to other federated learning settings involving LLMs beyond question answering.
  • Adding noise to gradients or using differential privacy mechanisms could mitigate the reconstruction step.
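
A minimal sketch of the gradient-perturbation idea mentioned above, assuming a flat gradient vector and the common clip-then-add-Gaussian-noise recipe. The function name and parameter values are illustrative, no privacy accounting is performed, and nothing here shows how much noise would actually defeat the reconstruction step.

```python
import math
import random

def clip_and_noise(grad, clip_norm=1.0, noise_multiplier=0.5, seed=0):
    """Clip a flat gradient to clip_norm (L2), then add Gaussian noise per coordinate."""
    rng = random.Random(seed)
    norm = math.sqrt(sum(g * g for g in grad))
    # Scale down only when the gradient exceeds the clipping norm.
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    clipped = [g * scale for g in grad]
    sigma = noise_multiplier * clip_norm
    return [g + rng.gauss(0.0, sigma) for g in clipped]

client_grad = [3.0, 4.0]  # L2 norm is 5.0
clipped_only = clip_and_noise(client_grad, clip_norm=1.0, noise_multiplier=0.0)
noisy = clip_and_noise(client_grad, clip_norm=1.0, noise_multiplier=0.5)
```

A client would apply this before uploading, so the aggregator only ever sees the perturbed vector.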

Load-bearing premise

That gradients uploaded during normal federated training contain enough information to recover representative private samples that can then be used to craft effective backdoor triggers without harming clean QA performance.

What would settle it

An experiment in which preventing accurate sample recovery from gradients or limiting reconstruction to under 5% of gradients causes the backdoor attack to fail or significantly degrade clean QA performance.

Figures

Figures reproduced from arXiv: 2606.27511 by Chenqing Zhu, Qingming Li, Songze Li, Yanbo Dai, Yulong Tian.

Figure 1: Illustration of server-side advertisement backdoor
Figure 2: Overview of our training data inversion and model poisoning attack pipeline. In Phase I, the server reconstructs the
Figure 3: Token importance and recoverability under LoRA.
Figure 4: Training loss curves of client-side FedAvg optimization with different LoRA ranks on datasets med01 and mental, using LLaMA-3.1-8B as the base model. For low-rank LoRA adapters (r ≤ 256), we report results under a learning-rate sweep lr ∈ {10⁻⁵, 3×10⁻⁵, 10⁻⁴, 3×10⁻⁴}, where optimization is consistently slower and less stable. In contrast, higher-rank configurations (r = 512 and r = 1024), trained with lr…

Original abstract

Large Language Model (LLM)-based question-answering (QA) systems are increasingly deployed in sensitive domains such as healthcare, mental health counseling, and legal consultation. Federated learning (FL) enables collaborative training without sharing raw client data, for which locally trained models are aggregated at a central server (i.e., a cloud service provider) to obtain a global model. In this paper, we explore the potential vulnerability where a malicious aggregator, who may collude with a third-party vendor, stealthily implants advertisement-type backdoors into federated QA models, without ever accessing client data. The attacker's goals are twofold: (1) preserve clean QA fidelity (i.e., the poisoned model behaves like a clean model on non-triggered queries); and (2) generate highly natural, contextually relevant responses with target advertisements when a trigger appears. Achieving these two goals simultaneously is highly challenging, as naive backdoor injection without knowledge about private data may degrade model's clean performance or fail to inject the target. Motivated by this, we propose to leverage clients' uploaded gradients during training, and develop a two-stage framework for data-free and stealthy poisoning: (1) recover representative training samples from client gradients, and (2) construct poisoning datasets utilizing recovered samples and trigger phrases to inject backdoors into the global model. Experiments across representative QA datasets and LLM families under full fine-tuning and LoRA settings demonstrate that, our method achieves nearly 100% Attack Success Rate (ASR) while incurring negligible degradation on clean tasks. Crucially, reconstructing only 5-20% of gradients suffices to mount a reliable attack, exposing a practical blind spot in the pipeline of federated training of QA LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that a malicious aggregator in federated LLM-based QA training can mount a data-free backdoor attack by (1) inverting client-uploaded gradients to recover representative private samples (even from only 5-20% of coordinates) and (2) using those samples plus trigger phrases to construct a poisoning dataset that implants advertisement-style backdoors. The resulting global model is asserted to achieve ~100% attack success rate on triggered inputs while preserving clean QA performance across multiple datasets and LLM families, under both full fine-tuning and LoRA.

Significance. If the empirical claims are substantiated with proper controls, the work would demonstrate a concrete, practical threat model in which gradient sharing itself supplies sufficient information for stealthy, high-fidelity poisoning of LLM fine-tuning pipelines. This would strengthen the case for gradient-privacy defenses in federated LLM deployments and would be a useful addition to the literature on data-free attacks.

major comments (3)
  1. [Abstract / Experiments] Abstract and Experiments section: the central claim of “nearly 100% ASR while incurring negligible degradation on clean tasks” and “reconstructing only 5-20% of gradients suffices” is presented without any reported quantitative metrics, baseline comparisons (e.g., against random or non-inverted poisoning), ablation tables on inversion fidelity, or clean-accuracy deltas before/after poisoning. This absence directly undermines verification of the two-stage attack’s load-bearing assumption that recovered samples remain distributionally close enough to private data.
  2. [Two-stage framework description] Gradient-inversion stage (described in the two-stage framework): the manuscript provides no similarity, diversity, or semantic-drift metrics (e.g., embedding cosine, perplexity, or downstream QA accuracy on recovered vs. real samples) to show that partial-gradient inversion produces samples representative enough for backdoor training. Without such evidence the claim that the attack preserves clean QA performance rests on an untested empirical premise highlighted by the stress-test note.
  3. [Experimental evaluation] LoRA vs. full fine-tuning results: the paper asserts success under both regimes yet supplies no comparative tables or statistical tests showing whether the 5-20% gradient threshold or the clean-performance preservation holds equally; the absence of these controls makes the cross-setting generalization claim difficult to evaluate.
minor comments (2)
  1. [Method] Notation for the recovered-sample set and trigger-construction procedure should be formalized (e.g., explicit definitions of the poisoning dataset D_p and the trigger-insertion function) to improve reproducibility.
  2. [Abstract] The abstract states results “across representative QA datasets and LLM families” but does not list the exact datasets or model sizes used; adding this information would aid readers.
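
The quantities the referee asks for can be sketched as follows. This assumes ASR is measured as the fraction of triggered responses containing the target string, clean degradation as a difference of mean scores, and sample similarity as embedding cosine. The function names and inputs are hypothetical placeholders, not values or definitions taken from the paper.

```python
import math

def attack_success_rate(responses, target):
    """Fraction of triggered responses that contain the target string."""
    if not responses:
        return 0.0
    return sum(target in r for r in responses) / len(responses)

def clean_delta(scores_before, scores_after):
    """Mean clean-task score after poisoning minus mean before (negative = degradation)."""
    mean = lambda xs: sum(xs) / len(xs)
    return mean(scores_after) - mean(scores_before)

def cosine(u, v):
    """Cosine similarity between two embedding vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Placeholder inputs: 3 of the 4 responses contain the target.
asr = attack_success_rate(["see [AD] here", "plain answer", "[AD]", "x [AD]"], "[AD]")
delta = clean_delta([0.5, 0.7], [0.5, 0.6])
sim = cosine([1.0, 0.0], [1.0, 1.0])
```

Reporting `asr` together with `delta` addresses both attacker goals at once, while `sim` between recovered and real samples speaks to the inversion-fidelity concern.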

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for stronger quantitative support of our empirical claims. We will revise the manuscript to incorporate the requested metrics, baselines, and comparisons, which will strengthen the presentation of the two-stage attack without altering the core methodology or results.

  1. Referee: [Abstract / Experiments] Abstract and Experiments section: the central claim of “nearly 100% ASR while incurring negligible degradation on clean tasks” and “reconstructing only 5-20% of gradients suffices” is presented without any reported quantitative metrics, baseline comparisons (e.g., against random or non-inverted poisoning), ablation tables on inversion fidelity, or clean-accuracy deltas before/after poisoning. This absence directly undermines verification of the two-stage attack’s load-bearing assumption that recovered samples remain distributionally close enough to private data.

    Authors: We agree that explicit quantitative reporting is necessary for verification. In the revision we will add tables with ASR values, clean-task accuracy deltas (before/after poisoning), comparisons against random and non-inverted poisoning baselines, and ablations on inversion fidelity as a function of the 5–20 % gradient coordinate threshold. revision: yes

  2. Referee: [Two-stage framework description] Gradient-inversion stage (described in the two-stage framework): the manuscript provides no similarity, diversity, or semantic-drift metrics (e.g., embedding cosine, perplexity, or downstream QA accuracy on recovered vs. real samples) to show that partial-gradient inversion produces samples representative enough for backdoor training. Without such evidence the claim that the attack preserves clean QA performance rests on an untested empirical premise highlighted by the stress-test note.

    Authors: We acknowledge the value of these metrics. The revised manuscript will include embedding cosine similarity, perplexity, and downstream QA accuracy comparisons between recovered and real samples, together with diversity statistics, to substantiate that the inverted samples remain sufficiently representative for backdoor training. revision: yes

  3. Referee: [Experimental evaluation] LoRA vs. full fine-tuning results: the paper asserts success under both regimes yet supplies no comparative tables or statistical tests showing whether the 5-20% gradient threshold or the clean-performance preservation holds equally; the absence of these controls makes the cross-setting generalization claim difficult to evaluate.

    Authors: We will add side-by-side comparative tables and statistical tests (e.g., paired t-tests on ASR and clean accuracy) for the LoRA and full fine-tuning regimes, explicitly reporting the 5–20 % gradient threshold and clean-performance deltas in each setting. revision: yes
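
As a rough illustration of the paired t-test the authors mention, the sketch below computes the t statistic over matched runs (for example, the same seeds under full fine-tuning and LoRA). The inputs are arbitrary placeholder numbers, not results. A library routine such as `scipy.stats.ttest_rel` would additionally return the p-value.

```python
import math

def paired_t_statistic(a, b):
    """Return (t, degrees of freedom) for paired samples a[i], b[i]."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean_d = sum(diffs) / n
    var_d = sum((d - mean_d) ** 2 for d in diffs) / (n - 1)  # sample variance
    if var_d == 0:
        raise ValueError("zero variance in differences; t is undefined")
    return mean_d / math.sqrt(var_d / n), n - 1

# Placeholder per-run scores for two settings, matched by run index.
setting_a = [1.0, 2.0, 3.0, 4.0]
setting_b = [0.0, 2.0, 1.0, 3.0]
t_stat, dof = paired_t_statistic(setting_a, setting_b)
```

Pairing by run removes between-seed variance, which matters when the difference between regimes is expected to be small.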

Circularity Check

0 steps flagged

No circularity: empirical attack framework with experimental validation only


The paper describes a two-stage empirical attack (gradient inversion followed by poisoning dataset construction) whose claims rest on experimental results across QA datasets and LLM families, not on any derivation, equations, fitted parameters, or self-citation chains. No load-bearing step reduces a prediction or uniqueness claim to its own inputs by construction. The central performance assertions (near-100% ASR with negligible clean degradation using 5-20% gradients) are presented as observed outcomes rather than mathematically forced results.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the attack relies on the empirical effectiveness of gradient inversion and trigger construction, which are treated as standard techniques rather than new postulates.

pith-pipeline@v0.9.1-grok · 5853 in / 1022 out tokens · 37985 ms · 2026-06-29T01:36:28.113767+00:00 · methodology


A Additional Details of Section 4

Table 17: Representative poisoned QA pairs generated by GPT-4.1 with a clue that ask the LLM to generate QA pairs in medical domains. Each question includes a trigger phrase, and the answer emb...

Extract query-projection gradients.
  for $l \leftarrow 0$ to $L-1$ do
    if lora = false then
      $G^Q_l \leftarrow \partial \mathcal{L} / \partial W^Q_l$
    else
      Read $\nabla_{A^Q_l} \mathcal{L}$, $\nabla_{B^Q_l} \mathcal{L}$, $A^Q_l$, and $B^Q_l$
      Construct $\hat{G}^Q_l$ using Eq. (3); $G^Q_l \leftarrow \hat{G}^Q_l$

Estimate the effective rank.
  $d_{\text{eff}} \leftarrow 0$
  for $l \leftarrow 0$ to $L-1$ do
    $r_l \leftarrow \operatorname{rank}(G^Q_l; \text{tol})$; $d_{\text{eff}} \leftarrow \max(d_{\text{eff}}, r_l)$

Build visible subspace bases.
  for $l \leftarrow 0$ to $L-1$ do
    Compute truncated SVD of $G^Q_l$ and keep top $d_{\text{eff}}$ components
    Let $U_l \in \mathbb{R}^{d \times d_{\text{eff}}}$ be the resulting column-space basis

Score tokens by multi-layer residual.
  foreach $v \in V$ do
    $d(v) \leftarrow \min_{0 \le l < L} \| E(v) - U_l U_l^{\top} E(v) \|_2$

Select candidates with $\gamma$ and top-$P$.
  $C_\gamma \leftarrow \{ v \in V \mid d(v) < \gamma \}$
  Sort $C_\gamma$ by ascending $d(v)$
  $\tilde{T} \leftarrow$ first $P$ tokens in $C_\gamma$
  return $\tilde{T}$

$E(v) = E_{\parallel}(v) + E_{\perp}(v)$, where $E_{\parallel}(v) \triangleq P_S E(v) \in S$ and $E_{\perp}(v) \triangleq (I - P_S) E(v) \in S^{\perp}$. Since $E_{\parallel}(v) \perp E_{\perp}(v)$, the Pythagorean theorem yields

$$\| E(v) \|_2^2 = \| E_{\parallel}(v) \|_2^2 + \| E_{\perp}(v) \|_2^2. \tag{10}$$

By definition of the residual distance, $d(E(v)) = \| E(v) - P_S E(v) \|_2 = \| E_{\perp}(v) \|_2$...