Security in the Fine-Tuning Lifecycle of Large Language Models: Threats, Defenses, Evaluation, and Future Directions

Rajkumar Buyya; Runze Chen; Wenjuan Li; Yitao Liu

arxiv: 2605.25073 · v1 · pith:PABCO7SYnew · submitted 2026-05-24 · 💻 cs.CR · cs.AI · cs.LG

Pith reviewed 2026-06-29 23:49 UTC · model grok-4.3

classification 💻 cs.CR cs.AI cs.LG

keywords: LLM fine-tuning security, backdoor attacks, data poisoning, model alignment, defense evaluation, lifecycle framework, weight editing attacks, cross-phase defense

The pith

LLM fine-tuning attacks succeed or fail based on model architecture, scale, and alignment state rather than following uniform patterns.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper organizes attacks and defenses around the three phases of the fine-tuning lifecycle: before, during, and after tuning. A unified experimental setup then tests representative methods across models, showing that attack success is highly model-dependent and does not increase steadily with size. Cross-phase pairing of attacks and defenses reveals that protections built for one phase usually fail against interventions in another. These patterns indicate that, for some models, safety properties established during pre-training or alignment can be undermined even when the fine-tuning data contains nothing malicious.

Core claim

A lifecycle framework that splits the fine-tuning process into pre-tuning, during-tuning, and post-tuning phases enables direct comparison of threats and countermeasures; when representative attacks and defenses are re-evaluated under identical models, hardware, and protocols, attack effectiveness proves strongly dependent on model architecture and alignment state, single-phase defenses rarely transfer across phases, weight-editing attacks lose impact on newer open-source LLMs, and cross-lingual backdoor transfer fails on the tested 1B-4B scale models.

What carries the argument

The three-phase lifecycle division (pre-tuning, during-tuning, post-tuning) that groups attacks and defenses by intervention timing and supports cross-phase pairing experiments.
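One way to picture the cross-phase pairing is as a small grid: every attack and defense carries a phase label, and a pairing counts as cross-phase when the two labels differ. The sketch below is illustrative only. The `Phase` enum, the `Method` record, the `cross_phase_pairs` helper, and the placeholder method names are all assumptions made for this example, not identifiers or methods from the paper.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product


class Phase(Enum):
    PRE_TUNING = "pre-tuning"
    DURING_TUNING = "during-tuning"
    POST_TUNING = "post-tuning"


@dataclass(frozen=True)
class Method:
    name: str     # placeholder label, not a method from the paper
    phase: Phase  # lifecycle phase in which the method intervenes


def cross_phase_pairs(attacks, defenses):
    """Return (attack, defense) pairs whose phases differ."""
    return [(a, d) for a, d in product(attacks, defenses) if a.phase != d.phase]


# Hypothetical inventory: one attack and one defense per phase.
attacks = [Method(f"attack_{p.name.lower()}", p) for p in Phase]
defenses = [Method(f"defense_{p.name.lower()}", p) for p in Phase]

pairs = cross_phase_pairs(attacks, defenses)
for a, d in pairs:
    print(f"{a.name} ({a.phase.value}) vs {d.name} ({d.phase.value})")
```

With one method per phase on each side, the grid has nine cells and six of them are cross-phase. A real inventory would have several methods per phase, so the number of candidate pairings grows quickly.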

If this is right

Weight-editing attacks that worked on earlier models lose effectiveness on current open-source LLMs.
Cross-lingual backdoor transfer that appeared near-perfect at larger scales fails on tested 1B-4B models.
Instruction-tuned models can have their safety alignment broken by purely benign samples.
Defenses effective in one phase rarely remain effective when the attack occurs in a different phase.
Defense success depends on the joint combination of model architecture and current alignment state.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Defenses may need explicit mechanisms for composition across multiple lifecycle phases rather than single-phase design.
Attacks that operate directly in embedding space could evade current behavioral assumptions used in evaluation.
Robustness to configuration choices (data format, hardware, protocol) becomes a necessary evaluation criterion for any proposed defense.

Load-bearing premise

The chosen representative methods and the single unified evaluation setup are broad enough to support general claims about attack and defense behavior across the field.

What would settle it

A replication that applies the same attack and defense methods to a wider range of model families and sizes and finds monotonic scaling of attack success or consistent cross-phase defense performance.
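The first half of that criterion can be stated mechanically: order the models by parameter count and check whether attack success ever drops as size grows. The following is a minimal sketch, assuming attack success is reported as a single fraction per model size. The function name and all numbers are placeholders invented for illustration, not results from the paper.

```python
def is_monotonic_non_decreasing(results):
    """results: iterable of (params_in_billions, attack_success_rate).

    Returns True when the success rate never drops as model size grows.
    """
    ordered = sorted(results, key=lambda r: r[0])
    rates = [rate for _, rate in ordered]
    return all(earlier <= later for earlier, later in zip(rates, rates[1:]))


# Placeholder numbers, invented for illustration only.
steady = [(1.0, 0.20), (2.0, 0.35), (4.0, 0.50)]
dipping = [(1.0, 0.40), (2.0, 0.10), (4.0, 0.30)]

print(is_monotonic_non_decreasing(steady))   # True
print(is_monotonic_non_decreasing(dipping))  # False
```

A strict comparison like this is sensitive to noise. An actual replication would likely average over several seeds and compare with a tolerance or confidence interval rather than raw point estimates.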

Original abstract

Background: Fine-tuning is central to adapting pre-trained Large Language Models (LLMs) to downstream tasks, but its reliance on training data, parameter updates, and reusable components opens entry points for attackers. Threats have evolved from data poisoning and weight tampering to agent manipulation and interface exploitation, yet existing reviews lack a unified framework spanning the full fine-tuning lifecycle.

Objective: This paper presents a systematic survey of LLM fine-tuning security and establishes a lifecycle-based framework for comparing attacks and defenses, complemented by unified empirical evaluation.

Methods: We divide attack and defense mechanisms into three phases by intervention timing: pre-tuning, during-tuning, and post-tuning. Within each phase, strategies are reviewed and contrasted to expose their evolution and limitations. Representative methods are then evaluated under a unified model, hardware, and protocol setup, with cross-phase experiments pairing attacks and defenses from different phases.

Results: Attack effectiveness is highly model-dependent and non-monotonic with scale: weight-editing attacks effective on earlier models lose impact on modern open-source LLMs; cross-lingual backdoor transfer, reported as near-perfect at larger scales, fails entirely on tested 1B-4B models; and purely benign samples can compromise safety alignment in instruction-tuned models. Single-phase defenses rarely generalize across phases, and defense effectiveness depends jointly on model architecture and alignment state.

Conclusion: We identify key open problems (configuration-robust defense, cross-phase defense composition, and embedding-space attacks beyond behavioral assumptions) and propose concrete future research directions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

A survey that structures fine-tuning security into phases and runs cross-phase tests on small models, but the key patterns need checking on larger scales.

The main thing here is a lifecycle framework for LLM fine-tuning security that splits everything into pre-tuning, during-tuning, and post-tuning phases. The paper reviews the literature in that structure and then runs unified experiments that pair attacks and defenses from different phases on a common setup.

It does a decent job showing how threats have moved from simple data poisoning to more sophisticated interface exploits. The cross-phase tests are new and reveal that defenses don't transfer well and that attack success depends heavily on the model. The note that benign samples can compromise alignment in tuned models is a useful observation. The identification of open problems like configuration-robust defenses and embedding-space attacks is also practical.

The experiments stick to 1B-4B models. That's the soft spot. Earlier reports on cross-lingual backdoors used larger scales and got different results, so these patterns might not generalize. The paper flags this but the limited scale still makes the broader claims about non-monotonicity and non-generalization feel preliminary. More detail on how the representative methods were picked would help too. The hardware and protocol are unified, which is good, but without knowing the exact selection criteria it's hard to judge if the results are robust.

People building or evaluating fine-tuning pipelines will get value from the taxonomy and the open problems list. It's not for someone looking for a new defense mechanism or a formal proof.

The thinking is straightforward and the structure is consistent with the abstract. It should go to peer review so referees can push on the empirical coverage and whether the framework holds up as a standard way to think about the problem. I'd recommend sending it out.

Referee Report

2 major / 1 minor

Summary. The paper presents a systematic survey of security threats and defenses across the fine-tuning lifecycle of LLMs, organizing mechanisms into pre-tuning, during-tuning, and post-tuning phases. It reviews and contrasts strategies within each phase before conducting unified empirical evaluations of representative attacks and defenses on 1B-4B models under a consistent model/hardware/protocol setup, including cross-phase pairings. Key reported results include highly model-dependent and non-monotonic attack effectiveness with scale, failure of cross-lingual backdoor transfer on the tested models, and limited generalization of single-phase defenses; the paper identifies open problems such as configuration-robust defenses and proposes future directions.

Significance. If the observed empirical patterns on model dependence, non-monotonicity, and defense non-generalization are robust to model selection and scale, the lifecycle framework and unified evaluation protocol would provide a useful organizing structure for comparing attacks and defenses in LLM security research, highlighting the need for cross-phase approaches.

major comments (2)

[Abstract and Evaluation section] The central claims that 'attack effectiveness is highly model-dependent and non-monotonic with scale' and that 'cross-lingual backdoor transfer... fails entirely on tested 1B-4B models' rest on experiments limited to 1B-4B models. Without explicit criteria for model selection, comparison to larger scales where prior work reported different outcomes, or discussion of representativeness, these patterns risk being artifacts of the chosen scale and methods rather than general properties of the lifecycle.
[Evaluation section] The claim that 'single-phase defenses rarely generalize across phases' is supported by cross-phase pairing experiments, but the paper does not detail how representative attack-defense pairs were selected or whether the pairings exhaustively cover combinations; this weakens the load-bearing conclusion about joint dependence on architecture and alignment state.

minor comments (1)

[Title] The title contains a missing space: 'Defenses,Evaluation' should read 'Defenses, Evaluation'.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on the empirical scope and evaluation details. We address each major comment below, committing to revisions where feasible while noting limitations honestly.

Referee: [Abstract and Evaluation section] The central claims that 'attack effectiveness is highly model-dependent and non-monotonic with scale' and that 'cross-lingual backdoor transfer... fails entirely on tested 1B-4B models' rest on experiments limited to 1B-4B models. Without explicit criteria for model selection, comparison to larger scales where prior work reported different outcomes, or discussion of representativeness, these patterns risk being artifacts of the chosen scale and methods rather than general properties of the lifecycle.

Authors: Model selection was driven by the need for a unified hardware/protocol setup across all attacks/defenses to enable fair cross-phase comparisons, which required open-weight models runnable on available compute (1B-4B range). We will revise the Evaluation section to explicitly state these criteria, add a dedicated paragraph on representativeness, and discuss that prior work on larger scales showed different outcomes, framing our results as scale-specific observations rather than universal claims. We cannot rerun experiments on larger models due to resource constraints but will highlight this as a limitation and future direction.

revision: partial

Referee: [Evaluation section] The claim that 'single-phase defenses rarely generalize across phases' is supported by cross-phase pairing experiments, but the paper does not detail how representative attack-defense pairs were selected or whether the pairings exhaustively cover combinations; this weakens the load-bearing conclusion about joint dependence on architecture and alignment state.

Authors: Pairs were selected as representative based on prominence in the surveyed literature and coverage of distinct mechanisms (e.g., data poisoning paired with post-tuning alignment, weight editing with during-tuning defenses). We will add explicit selection criteria and a table summarizing the pairings in the Evaluation section, while clarifying that exhaustive coverage of all combinations is infeasible. This revision will strengthen the discussion of joint dependence without overclaiming generality.

revision: yes
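A pairing table of this kind can double as a coverage check: lay out the 3x3 grid of attack phase by defense phase and mark which cells contain at least one tested pairing. The sketch below is a rough illustration under stated assumptions. The two pairings echo the examples named in the reply above, but the phase assigned to each attack is a guess made for this example, and `coverage` is a hypothetical helper rather than anything from the paper.

```python
from itertools import product

PHASES = ("pre-tuning", "during-tuning", "post-tuning")

# (attack, assumed attack phase, defense, defense phase).
# The attack phases are illustrative guesses, not taken from the paper.
tested_pairings = [
    ("data poisoning", "during-tuning", "alignment", "post-tuning"),
    ("weight editing", "pre-tuning", "during-tuning defense", "during-tuning"),
]


def coverage(pairings):
    """Return (covered_cells, missing_cells) over the attack x defense phase grid."""
    covered = {(a_phase, d_phase) for _, a_phase, _, d_phase in pairings}
    all_cells = set(product(PHASES, PHASES))
    return covered, all_cells - covered


covered, missing = coverage(tested_pairings)
print(f"{len(covered)} of {len(PHASES) ** 2} phase cells covered")
for cell in sorted(missing):
    print("untested:", cell)
```

Reporting the untested cells alongside the tested ones would let a reader see at a glance how far a claim such as "defenses rarely transfer across phases" extends beyond the combinations actually run.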

Circularity Check

0 steps flagged

No circularity: survey structure and unified empirical evaluations are self-contained

The paper is a systematic survey that organizes existing attacks/defenses into a lifecycle framework and reports results from its own unified evaluation protocol on selected models. No derivation chain, fitted parameters relabeled as predictions, self-referential equations, or load-bearing self-citations that reduce claims to inputs by construction are present. The empirical observations (model-dependence, non-generalization) are direct outputs of the stated experimental setup rather than algebraic or definitional reductions, satisfying the criteria for a non-circular finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a survey paper. It introduces no new mathematical derivations, fitted parameters, or postulated entities. The empirical component relies on selection of representative methods and a unified evaluation protocol whose details are not visible in the abstract.

Reference graph

Works this paper leans on

113 extracted references · 55 canonical work pages · 11 internal anchors

[1]

Training language mod- els to follow instructions with human feedback

Long Ouyang, Jeffrey Wu, Xu Jiang, et al. Training language mod- els to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022

2022
[2]

Finetuned Language Models Are Zero-Shot Learners

Jason Wei, Maarten Bosma, Vincent Y Zhao, et al. Finetuned language models are zero-shot learners. arXiv:2109.01652, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[3]

Learning to summarize with human feedback

Nisan Stiennon, Long Ouyang, Jeffrey Wu, et al. Learning to summarize with human feedback. Advances in neural information processing systems, 33:3008–3021, 2020

2020
[4]

Lora: Low-rank adaptation of large language models

Edward J Hu, Yelong Shen, Phillip Wallis, et al. Lora: Low-rank adaptation of large language models. volume 1, page 3, 2022

2022
[5]

Backdoor learning: A survey

Yiming Li, Yong Jiang, Zhifeng Li, and Shu-Tao Xia. Backdoor learning: A survey. IEEE Transactions on Neural Networks and Learning Systems, 35(1):5–22, 2022

2022
[6]

AI Alignment: A Comprehensive Survey

Jiaming Ji, Tianyi Qiu, Boyuan Chen, et al. Ai alignment: A comprehensive survey. arXiv:2310.19852, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[7]

Dataset se- curity for machine learning: Data poisoning, backdoor attacks, and defenses

Micah Goldblum, Dimitris Tsipras, Chulin Xie, et al. Dataset se- curity for machine learning: Data poisoning, backdoor attacks, and defenses. IEEE Transactions on Pattern Analysis and Machine Intelli- gence, 45(2):1563–1580, 2022

2022
[8]

Rittichier, and Arjan Dur- resi

Davinder Kaur, Suleyman Uslu, Kaley J. Rittichier, and Arjan Dur- resi. Trustworthy artificial intelligence: A review. ACM Comput. Surv., 55(2), Jan. 2022. ISSN 0360-0300. doi: 10.1145/3491209. URL https://doi.org/10.1145/3491209

work page doi:10.1145/3491209 2022
[9]

Poi- soning language models during instruction tuning

Alexander Wan, Eric Wallace, Sheng Shen, and Dan Klein. Poi- soning language models during instruction tuning. In International Conference on Machine Learning, pages 35413–35425. PMLR, 2023

2023
[10]

Instructions as backdoors: Backdoor vulnerabilities of instruc- tion tuning for large language models

Jiashu Xu, Mingyu Ma, Fei Wang, Chaowei Xiao, and Muhao Chen. Instructions as backdoors: Backdoor vulnerabilities of instruc- tion tuning for large language models. In Kevin Duh, Helena Gomez, and Steven Bethard, editors, Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Techn...

work page doi:10.18653/v1/2024.naacl-long.171 2024
[11]

Fine-tuning aligned lan- guage models compromises safety, even when users do not intend to! In The Twelfth International Conference on Learning Representations ,

Xiangyu Qi, Yi Zeng, Tinghao Xie, et al. Fine-tuning aligned lan- guage models compromises safety, even when users do not intend to! In The Twelfth International Conference on Learning Representations ,
[12]

URL https://openreview.net/forum?id=hTEGyKf0dZ
[13]

On the exploitability of instruction tuning

Manli Shu, Jiongxiao Wang, Chen Zhu, Jonas Geiping, Chaowei Xiao, and Tom Goldstein. On the exploitability of instruction tuning. In Proceedings of the 37th International Conference on Neural Informa- tion Processing Systems, NIPS ’23, Red Hook, NY, USA, 2023. Curran Associates Inc

2023
[14]

Vaccine: Perturbation- aware alignment for large language models against harmful fine- tuning attack

Tiansheng Huang, Sihao Hu, and Ling Liu. Vaccine: Perturbation- aware alignment for large language models against harmful fine- tuning attack. In The Thirty-eighth Annual Conference on Neural Information Processing Systems , 2024. URL https://openreview.net/ forum?id=lpXDZKiAnt

2024
[15]

Representation noising: A defence mechanism against harmful finetuning

Domenic Rosati, Jan Wehner, Kai Williams, et al. Representation noising: A defence mechanism against harmful finetuning. Advances in Neural Information Processing Systems, 37:12636–12676, 2024

2024
[16]

Lisa: Lazy safety alignment for large language models against harmful fine-tuning attack

Tiansheng Huang, Sihao Hu, Fatih Ilhan, Selim F Tekin, and Ling Liu. Lisa: Lazy safety alignment for large language models against harmful fine-tuning attack. Advances in Neural Information Processing Systems, 37:104521–104555, 2024

2024
[17]

Safe lora: The silver lining of reducing safety risks when finetuning large language models

Chia-Yi Hsu, Yu-Lin Tsai, Chih-Hsun Lin, Pin-Yu Chen, Chia-Mu Yu, and Chun-Ying Huang. Safe lora: The silver lining of reducing safety risks when finetuning large language models. Advances in Neural Information Processing Systems, 37:65072–65094, 2024

2024
[18]

Parameter-efficient transfer learning for NLP

Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, et al. Parameter-efficient transfer learning for NLP. In Kamalika Chaud- huri and Ruslan Salakhutdinov, editors, Proceedings of the 36th Inter- national Conference on Machine Learning , volume 97 of Proceedings of Machine Learning Research, pages 2790–2799, Long Beach, Califor- nia, USA, 09–15 Jun 2019....

2019
[19]

Prefix-tuning: Optimizing contin- uous prompts for generation

Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing contin- uous prompts for generation. In Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli, editors, Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th Inter- national Joint Conference on Natural Language Processing (Volume 1: Long Papers) , pag...

work page doi:10.18653/v1/2021.acl-long.353 2021
[20]

BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain

Tianyu Gu, Brendan Dolan-Gavitt, and Siddharth Garg. Badnets: Identifying vulnerabilities in the machine learning model supply chain. arXiv:1708.06733, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[21]

Stealthy and persistent unalignment on large language models via backdoor injec- tions

Yuanpu Cao, Bochuan Cao, and Jinghui Chen. Stealthy and persistent unalignment on large language models via backdoor injec- tions. In Kevin Duh, Helena Gomez, and Steven Bethard, editors, Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), p...

work page doi:10.18653/v1/2024.naacl-long.276 2024
[23]

TUBA: Cross-lingual transferability of backdoor attacks in LLMs with instruction tun- ing

Xuanli He, Jun Wang, Qiongkai Xu, et al. TUBA: Cross-lingual transferability of backdoor attacks in LLMs with instruction tun- ing. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Findings of the Association for Computational Linguistics: ACL 2025 , pages 16504–16544, Vienna, Austria, July 2025. Association for Com...

work page doi:10.18653/v1/2025.findings-acl.848 2025
[24]

Embedx: embedding-based cross-trigger backdoor attack against large language models

Nan Yan, Yuqing Li, Xiong Wang, Jing Chen, Kun He, and Bo Li. Embedx: embedding-based cross-trigger backdoor attack against large language models. In Proceedings of the 34th USENIX Conference on Security Symposium , SEC ’25, USA, 2025. USENIX Association. ISBN 978-1-939133-52-6

2025
[25]

BEEAR: Embedding-based adversarial removal of safety back- doors in instruction-tuned language models

Yi Zeng, Weiyu Sun, Tran Huynh, Dawn Song, Bo Li, and Ruoxi Jia. BEEAR: Embedding-based adversarial removal of safety back- doors in instruction-tuned language models. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages 13189–13215, Miami, Florida, US...

work page doi:10.18653/v1/2024.emnlp-main.732 2024
[26]

Attack via overfitting: 10-shot benign fine-tuning to jailbreak LLMs

Zhixin Xie, Xurui Song, and Jun Luo. Attack via overfitting: 10-shot benign fine-tuning to jailbreak LLMs. In The Thirty-ninth An- nual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=utvu4PJ0Ct

2025
[27]

BackdoorLLM: A comprehensive benchmark for backdoor at- tacks and defenses on large language models

Yige Li, Hanxun Huang, Yunhan Zhao, Xingjun Ma, and Jun Sun. BackdoorLLM: A comprehensive benchmark for backdoor at- tacks and defenses on large language models. In The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track , 2025. URL https://openreview.net/forum? id=sYLiY87mNn

2025
[28]

ELBA-bench: An efficient learning backdoor attacks benchmark for large language models

Xuxu Liu, Siyuan Liang, Mengya Han, et al. ELBA-bench: An efficient learning backdoor attacks benchmark for large language models. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 17928–17947, Vienn...

work page doi:10.18653/v1/2025.acl-long.877 2025
[29]

Badedit: Backdooring large language models by model editing

Yanzhou Li, Tianlin Li, Kangjie Chen, et al. Badedit: Backdooring large language models by model editing. In The Twelfth Interna- tional Conference on Learning Representations , 2024. URL https:// openreview.net/forum?id=duZANm2ABX

2024
[30]

LoRATK: LoRA once, backdoor everywhere in the share-and-play ecosystem

Hongyi Liu, Shaochen Zhong, Xintong Sun, et al. LoRATK: LoRA once, backdoor everywhere in the share-and-play ecosystem. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors, Findings of the Association for Computa- tional Linguistics: EMNLP 2025 , pages 23009–23047, Suzhou, China, Nov. 2025. Association for Computatio...

work page doi:10.18653/v1/2025.findings-emnlp.1253 2025
[31]

SaloRA: Safety-alignment preserved low-rank adaptation

Mingjie Li, Wai Man Si, Michael Backes, Yang Zhang, and Yisen Wang. SaloRA: Safety-alignment preserved low-rank adaptation. In The Thirteenth International Conference on Learning Representations ,
[32]

URL https://openreview.net/forum?id=GOoVzE9nSj
[33]

Probe before you talk: Towards black-box defense against backdoor unalignment for large language models

Biao Yi, Tiansheng Huang, Sishuo Chen, et al. Probe before you talk: Towards black-box defense against backdoor unalignment for large language models. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum? id=EbxYDBhE3S

2025
[34]

Antidote: Post-fine-tuning safety alignment for large language models against harmful fine-tuning attack

Tiansheng Huang, Gautam Bhattacharya, Pratik Joshi, Joshua Kimball, and Ling Liu. Antidote: Post-fine-tuning safety alignment for large language models against harmful fine-tuning attack. In Forty-second International Conference on Machine Learning , 2025. URL https://openreview.net/forum?id=Arepl4R86m

2025
[35]

Weight poisoning attacks on pretrained models

Keita Kurita, Paul Michel, and Graham Neubig. Weight poisoning attacks on pretrained models. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault, editors, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics , pages 2793– 2806, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2...

work page doi:10.18653/v1/2020.acl-main.249 2020
[36]

In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Linyang Li, Demin Song, Xiaonan Li, Jiehang Zeng, Ruotian Ma, and Xipeng Qiu. Backdoor attacks on pre-trained models by layerwise weight poisoning. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih, editors, Proceedings of the 2021 Con- ference on Empirical Methods in Natural Language Processing , pages 3023–3032, Online and Pun...

work page doi:10.18653/v1/2021 2021
[37]

Language models are unsupervised multitask learners

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2019. URL https://api.semanticscholar.org/CorpusID: 160025533

2019
[38]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron, Louis Martin, Kevin R. Stone, et al. Llama 2: Open foundation and fine-tuned chat models, 2023. URL https://arxiv.org/ abs/2307.09288. Preprint posted online July 18, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[39]

Exploiting LLM quantization

Kazuki Egashira, Mark Vero, Robin Staab, Jingxuan He, and Mar- tin Vechev. Exploiting LLM quantization. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 37, pages 41709–41732, Red Hook, NY, 2024. Curran Associates, Inc. doi: 10. 52202/079017-1319

2024
[40]

Finetuning-activated backdoors in LLMs

Thibaud Gloaguen, Mark Vero, Robin Staab, and Martin Vechev. Finetuning-activated backdoors in LLMs. In ICML 2025 Workshop on Reliable and Responsible Foundation Models , 2025. URL https:// openreview.net/forum?id=VPFq7otjIc

2025
[41]

Model-agnostic meta-learning for fast adaptation of deep networks

Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of ICML’17, pages 1126–1135, Sydney, Australia, 2017. PMLR

2017
[42]

Truth serum: Poisoning machine learning models to reveal their secrets

Florian Tramèr, Reza Shokri, Ayrton San Joaquin, et al. Truth serum: Poisoning machine learning models to reveal their secrets. In Proceedings of the 2022 ACM SIGSAC Conference on Computer and Communications Security , CCS ’22, page 27792792, New York, NY, USA, 2022. Association for Computing Machinery. ISBN 9781450394505. doi: 10.1145/3548606.3560554. UR...

work page doi:10.1145/3548606.3560554 2022
[43]

In: 2022 IEEE Symposium on Security and Privacy (SP), pp

Nicholas Carlini, Steve Chien, Milad Nasr, Shuang Song, Andreas Terzis, and Florian Tramèr. Membership inference attacks from first principles. In 2022 IEEE Symposium on Security and Privacy (SP) , pages 1897–1914, 2022. doi: 10.1109/SP46214.2022.9833649

work page doi:10.1109/sp46214.2022.9833649 2022
[44]

Privacy backdoors: Enhancing membership inference through poisoning pre-trained models

Yuxin Wen, Leo Marchyok, Sanghyun Hong, Jonas Geiping, Tom Goldstein, and Nicholas Carlini. Privacy backdoors: Enhancing membership inference through poisoning pre-trained models. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Sys- tems, volume 37, pages 83374–83396,...

work page doi:10.52202/079017-2652 2024
[45]

Learning trans- ferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, et al. Learning trans- ferable visual models from natural language supervision. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning , volume 139 of Proceedings of Ma- chine Learning Research , pages 8748–8763, Virtual, 18–24 Jul 2021. PMLR

2021
[46]

Safety alignment should be made more than just a few tokens deep

Xiangyu Qi, Ashwinee Panda, Kaifeng Lyu, et al. Safety alignment should be made more than just a few tokens deep. In The Thirteenth International Conference on Learning Representations , 2025. URL https://openreview.net/forum?id=6Mxhg9PtDE

2025
[47]

Immunization against harmful fine-tuning attacks

Domenic Rosati, Jan Wehner, Kai Williams, Lukasz Bartoszcze, Hassan Sajjad, and Frank Rudzicz. Immunization against harmful fine-tuning attacks. In Yaser Al-Onaizan, Mohit Bansal, and Yun- Nung Chen, editors, Findings of the Association for Computational Linguistics: EMNLP 2024 , pages 5234–5247, Miami, Florida, USA, Nov. 2024. Association for Computation...

2024
[48]

Keeping llms aligned after fine-tuning: The 36 of 39 Software: Practice and Experience, 2026 crucial role of prompt templates

Kaifeng Lyu, Haoyu Zhao, Xinran Gu, Dingli Yu, Anirudh Goyal, and Sanjeev Arora. Keeping llms aligned after fine-tuning: The 36 of 39 Software: Practice and Experience, 2026 crucial role of prompt templates. In A. Globerson, L. Mackey, D. Bel- grave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors,Advances in Neural Information Processing Systems , v...

2026
[49]

Backdooralign: Mit- igating fine-tuning based jailbreak attack with backdoor enhanced safety alignment

Jiongxiao Wang, Jiazhao Li, Yiquan Li, et al. Backdooralign: Mit- igating fine-tuning based jailbreak attack with backdoor enhanced safety alignment. In The Thirty-eighth Annual Conference on Neural Information Processing Systems , 2024. URL https://openreview.net/ forum?id=1PcJ5Evta7

2024
[50]

Targeted vaccine: Safety alignment for large language models against harmful fine-tuning via layer-wise perturbation

Guozhi Liu, Weiwei Lin, Qi Mu, et al. Targeted vaccine: Safety alignment for large language models against harmful fine-tuning via layer-wise perturbation. IEEE Transactions on Information Forensics and Security, 20:10806–10817, 2025. doi: 10.1109/TIFS.2025.3615412

work page doi:10.1109/tifs.2025.3615412 2025
[51]

Booster: Tackling harmful fine-tuning for large lan- guage models via attenuating harmful perturbation

Tiansheng Huang, Sihao Hu, Fatih Ilhan, Selim Furkan Tekin, and Ling Liu. Booster: Tackling harmful fine-tuning for large lan- guage models via attenuating harmful perturbation. In The Thir- teenth International Conference on Learning Representations , 2025. URL https://openreview.net/forum?id=tTPHgb0EtV

2025
[52]

BeaverTails: Towards improved safety alignment of LLM via a human-preference dataset

Jiaming Ji, Mickel Liu, Josef Dai, et al. BeaverTails: Towards improved safety alignment of LLM via a human-preference dataset. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Sys- tems, volume 36, pages 24678–24704, Red Hook, NY, 2023. Curran Associates, Inc

2023
[53]

Direct preference optimization: Your language model is secretly a reward model

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, pages 53728–53741, Red H...

2023
[54]

Self-destructing models: Increasing the costs of harmful dual uses of foundation models

Peter Henderson, Eric Mitchell, Christopher Manning, Dan Jurafsky, and Chelsea Finn. Self-destructing models: Increasing the costs of harmful dual uses of foundation models. In Proceedings of the 2023 AAAI/ACM Conference on AI, Ethics, and Society, AIES ’23, pages 287–296, New York, NY, USA, 2023. Association for Computing Machinery. ISBN 9798400702310. ...

work page doi:10.1145/3600211.3604690 2023
[55]

Mechanistically analyzing the effects of fine-tuning on procedurally defined tasks

Samyak Jain, Robert Kirk, Ekdeep Singh Lubana, et al. Mechanistically analyzing the effects of fine-tuning on procedurally defined tasks. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=A0HKeKl4Nl

2024
[56]

Refusal in language models is mediated by a single direction

Andy Arditi, Oscar Obeso, Aaquib Syed, et al. Refusal in language models is mediated by a single direction. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 37, pages 136037–136083, Red Hook, NY, USA, 2024. Curran Associates, Inc. doi: 10.52202/079017-4322

work page doi:10.52202/079017-4322 2024
[57]

On evaluating the durability of safeguards for open-weight llms

Xiangyu Qi, Boyi Wei, Nicholas Carlini, et al. On evaluating the durability of safeguards for open-weight llms. CoRR, abs/2412.07097, 2024. URL https://doi.org/10.48550/arXiv.2412.07097

work page doi:10.48550/arxiv.2412.07097 2024
[59]

Evaluating defences against unsafe feedback in rlhf

Domenic Rosati, Giles Edkins, Harsh Raj, et al. Evaluating defences against unsafe feedback in rlhf. 2024. URL https://api.semanticscholar.org/CorpusID:272753495

2024
[60]

PromptSource: An integrated development environment and repository for natural language prompts

Stephen H. Bach, Victor Sanh, Zheng-Xin Yong, et al. PromptSource: An integrated development environment and repository for natural language prompts. In Valerio Basile, Zornitsa Kozareva, and Sanja Stajner, editors, Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 93–104, Dublin, Irela...

work page doi:10.18653/v1/2022.acl-demo.9 2022
[61]

Cross-task generalization via natural language crowdsourcing instructions

Swaroop Mishra, Daniel Khashabi, Chitta Baral, and Hannaneh Hajishirzi. Cross-task generalization via natural language crowdsourcing instructions. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio, editors, Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3470–3487, Dublin...

work page doi:10.18653/v1/2022.acl-long.244 2022
[62]

Hugging Face Hub, 2025

Hugging Face. Hugging Face Hub, 2025. URL https://huggingface.co/. Accessed: 2025-12-01

2025
[63]

Mind the style of text! adversarial and backdoor attacks based on text style transfer

Fanchao Qi, Yangyi Chen, Xurui Zhang, Mukai Li, Zhiyuan Liu, and Maosong Sun. Mind the style of text! adversarial and backdoor attacks based on text style transfer. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih, editors, Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 4569–4580,...

work page doi:10.18653/v1/2021.emnlp-main.374 2021
[64]

Badnl: Backdoor attacks against nlp models with semantic-preserving improvements

Xiaoyi Chen, Ahmed Salem, Dingfan Chen, et al. Badnl: Backdoor attacks against nlp models with semantic-preserving improvements. In Proceedings of the 37th Annual Computer Security Applications Conference, ACSAC ’21, pages 554–569, New York, NY, USA, 2021. Association for Computing Machinery. ISBN 9781450385794. doi: 10.1145/3485832.3485837

work page doi:10.1145/3485832.3485837 2021
[65]

Not what you’ve signed up for: Compromising real-world llm-integrated applications with indirect prompt injection

Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. Not what you’ve signed up for: Compromising real-world llm-integrated applications with indirect prompt injection. In Proceedings of the 16th ACM Workshop on Artificial Intelligence and Security, AISec ’23, pages 79–90, New York, NY, USA, 2023. Association...

work page doi:10.1145/3605764.3623985 2023
[66]

Shadow alignment: The ease of subverting safely-aligned language models, 2024

Xianjun Yang, Xiao Wang, Qi Zhang, et al. Shadow alignment: The ease of subverting safely-aligned language models, 2024. URL https://openreview.net/forum?id=rg0vQmkB7F

2024
[67]

Bloom: A 176b-parameter open-access multilingual language model

Teven Le Scao, Christopher Akiki, Angela Fan, Ellie Pavlick, Francesco De Toni, and Suzana Ilić. Bloom: A 176b-parameter open-access multilingual language model. Working paper, MIT Press, Nov. 2022

2022
[68]

The Llama 3 Herd of Models

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

2024
[69]

Gemma: Open Models Based on Gemini Research and Technology

Gemma Team, Thomas Mesnard, Cassidy Hardin, et al. Gemma: Open models based on Gemini research and technology. arXiv preprint arXiv:2403.08295, 2024

2024
[70]

Cl-attack: Textual backdoor attacks via cross-lingual triggers

Jingyi Zheng, Tianyi Hu, Tianshuo Cong, and Xinlei He. Cl-attack: Textual backdoor attacks via cross-lingual triggers. Proceedings of the AAAI Conference on Artificial Intelligence, 39(25):26427–26435, Apr. 2025. doi: 10.1609/aaai.v39i25.34842. URL https://ojs.aaai.org/index.php/AAAI/article/view/34842

work page doi:10.1609/aaai.v39i25.34842 2025
[72]

xcot: Cross-lingual instruction tuning for cross-lingual chain-of-thought reasoning

Linzheng Chai, Jian Yang, Tao Sun, et al. xcot: Cross-lingual instruction tuning for cross-lingual chain-of-thought reasoning. In AAAI Conference on Artificial Intelligence, 2024. URL https://api.semanticscholar.org/CorpusID:266999425

2024
[73]

Cross-lingual prompting: Improving zero-shot chain-of-thought reasoning across languages

Libo Qin, Qiguang Chen, Fuxuan Wei, Shijue Huang, and Wanxiang Che. Cross-lingual prompting: Improving zero-shot chain-of-thought reasoning across languages. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2695–2709, Singapore, Dec. 2023. Association...

work page doi:10.18653/v1/2023.emnlp-main.163 2023
[74]

Qwen2 Technical Report

An Yang, Baosong Yang, Binyuan Hui, et al. Qwen2 technical report, 2024. URL https://arxiv.org/abs/2407.10671

2024
[75]

Badlingual: A novel lingual-backdoor attack against large language models

Zihan Wang, Hongwei Li, Rui Zhang, et al. Badlingual: A novel lingual-backdoor attack against large language models. arXiv preprint arXiv:2505.03501, 2025

2025
[76]

React: Synergizing reasoning and acting in language models

Shunyu Yao, Jeffrey Zhao, Dian Yu, et al. React: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, 2023

2023
[77]

The rise and potential of large language model based agents: a survey

Zhiheng Xi, Wenxiang Chen, Xin Guo, et al. The rise and potential of large language model based agents: a survey. Science China Information Sciences, 68(2):121101, Jan 2025. ISSN 1869-1919. doi: 10.1007/s11432-024-4222-0

work page doi:10.1007/s11432-024-4222-0 2025
[78]

BadAgent: Inserting and activating backdoor attacks in LLM agents

Yifei Wang, Dizhan Xue, Shengjie Zhang, and Shengsheng Qian. BadAgent: Inserting and activating backdoor attacks in LLM agents. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9811–9827, Bangkok, Thailand, Aug. 2024. Associa...

work page doi:10.18653/v1/2024.acl-long.530 2024
[79]

Watch out for your agents! investigating backdoor threats to llm-based agents

Wenkai Yang, Xiaohan Bi, Yankai Lin, Sishuo Chen, Jie Zhou, and Xu Sun. Watch out for your agents! investigating backdoor threats to llm-based agents. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 37, pages 100938–100964, Red Hook, NY, 2024. Curran...

work page doi:10.52202/079017-3201 2024
[80]

Silent sabotage: Injecting backdoors into AI agents through fine-tuning

Léo Boisvert, Abhay Puri, Chandra Kiran Reddy Evuru, et al. Silent sabotage: Injecting backdoors into AI agents through fine-tuning. In ICML 2025 Workshop on Computer Use Agents, 2025

2025
[81]

Shikhar Murty, Dzmitry Bahdanau, and Christopher D. Manning. Nnetnav: Unsupervised learning of browser agents through environment interaction in the wild. 2024. URL https://api.semanticscholar.org/CorpusID:273162280

2024


[1] [1]

Training language models to follow instructions with human feedback

Long Ouyang, Jeffrey Wu, Xu Jiang, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022

2022

[2] [2]

Finetuned Language Models Are Zero-Shot Learners

Jason Wei, Maarten Bosma, Vincent Y Zhao, et al. Finetuned language models are zero-shot learners. arXiv:2109.01652, 2021

2021

[3] [3]

Learning to summarize with human feedback

Nisan Stiennon, Long Ouyang, Jeffrey Wu, et al. Learning to summarize with human feedback. Advances in neural information processing systems, 33:3008–3021, 2020

2020

[4] [4]

Lora: Low-rank adaptation of large language models

Edward J Hu, Yelong Shen, Phillip Wallis, et al. Lora: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022

2022

[5] [5]

Backdoor learning: A survey

Yiming Li, Yong Jiang, Zhifeng Li, and Shu-Tao Xia. Backdoor learning: A survey. IEEE Transactions on Neural Networks and Learning Systems, 35(1):5–22, 2022

2022

[6] [6]

AI Alignment: A Comprehensive Survey

Jiaming Ji, Tianyi Qiu, Boyuan Chen, et al. Ai alignment: A comprehensive survey. arXiv:2310.19852, 2023

2023

[7] [7]

Dataset security for machine learning: Data poisoning, backdoor attacks, and defenses

Micah Goldblum, Dimitris Tsipras, Chulin Xie, et al. Dataset security for machine learning: Data poisoning, backdoor attacks, and defenses. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(2):1563–1580, 2022

2022

[8] [8]

Trustworthy artificial intelligence: A review

Davinder Kaur, Suleyman Uslu, Kaley J. Rittichier, and Arjan Durresi. Trustworthy artificial intelligence: A review. ACM Comput. Surv., 55(2), Jan. 2022. ISSN 0360-0300. doi: 10.1145/3491209. URL https://doi.org/10.1145/3491209

work page doi:10.1145/3491209 2022

[9] [9]

Poisoning language models during instruction tuning

Alexander Wan, Eric Wallace, Sheng Shen, and Dan Klein. Poisoning language models during instruction tuning. In International Conference on Machine Learning, pages 35413–35425. PMLR, 2023

2023

[10] [10]

Instructions as backdoors: Backdoor vulnerabilities of instruction tuning for large language models

Jiashu Xu, Mingyu Ma, Fei Wang, Chaowei Xiao, and Muhao Chen. Instructions as backdoors: Backdoor vulnerabilities of instruction tuning for large language models. In Kevin Duh, Helena Gomez, and Steven Bethard, editors, Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Techn...

work page doi:10.18653/v1/2024.naacl-long.171 2024

[11] [11]

Fine-tuning aligned language models compromises safety, even when users do not intend to!

Xiangyu Qi, Yi Zeng, Tinghao Xie, et al. Fine-tuning aligned language models compromises safety, even when users do not intend to! In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=hTEGyKf0dZ

2024

[13] [13]

On the exploitability of instruction tuning

Manli Shu, Jiongxiao Wang, Chen Zhu, Jonas Geiping, Chaowei Xiao, and Tom Goldstein. On the exploitability of instruction tuning. In Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS ’23, Red Hook, NY, USA, 2023. Curran Associates Inc

2023

[14] [14]

Vaccine: Perturbation-aware alignment for large language models against harmful fine-tuning attack

Tiansheng Huang, Sihao Hu, and Ling Liu. Vaccine: Perturbation-aware alignment for large language models against harmful fine-tuning attack. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=lpXDZKiAnt

2024

[15] [15]

Representation noising: A defence mechanism against harmful finetuning

Domenic Rosati, Jan Wehner, Kai Williams, et al. Representation noising: A defence mechanism against harmful finetuning. Advances in Neural Information Processing Systems, 37:12636–12676, 2024

2024

[16] [16]

Lisa: Lazy safety alignment for large language models against harmful fine-tuning attack

Tiansheng Huang, Sihao Hu, Fatih Ilhan, Selim F Tekin, and Ling Liu. Lisa: Lazy safety alignment for large language models against harmful fine-tuning attack. Advances in Neural Information Processing Systems, 37:104521–104555, 2024

2024

[17] [17]

Safe lora: The silver lining of reducing safety risks when finetuning large language models

Chia-Yi Hsu, Yu-Lin Tsai, Chih-Hsun Lin, Pin-Yu Chen, Chia-Mu Yu, and Chun-Ying Huang. Safe lora: The silver lining of reducing safety risks when finetuning large language models. Advances in Neural Information Processing Systems, 37:65072–65094, 2024

2024

[18] [18]

Parameter-efficient transfer learning for NLP

Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, et al. Parameter-efficient transfer learning for NLP. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 2790–2799, Long Beach, California, USA, 09–15 Jun 2019....

2019

[19] [19]

Prefix-tuning: Optimizing continuous prompts for generation

Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. In Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli, editors, Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pag...

work page doi:10.18653/v1/2021.acl-long.353 2021

[20] [20]

BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain

Tianyu Gu, Brendan Dolan-Gavitt, and Siddharth Garg. Badnets: Identifying vulnerabilities in the machine learning model supply chain. arXiv:1708.06733, 2017

2017

[21] [21]

Stealthy and persistent unalignment on large language models via backdoor injections

Yuanpu Cao, Bochuan Cao, and Jinghui Chen. Stealthy and persistent unalignment on large language models via backdoor injections. In Kevin Duh, Helena Gomez, and Steven Bethard, editors, Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), p...

work page doi:10.18653/v1/2024.naacl-long.276 2024

[22] [23]

TUBA: Cross-lingual transferability of backdoor attacks in LLMs with instruction tuning

Xuanli He, Jun Wang, Qiongkai Xu, et al. TUBA: Cross-lingual transferability of backdoor attacks in LLMs with instruction tuning. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Findings of the Association for Computational Linguistics: ACL 2025, pages 16504–16544, Vienna, Austria, July 2025. Association for Com...

work page doi:10.18653/v1/2025.findings-acl.848 2025

[23] [24]

Embedx: embedding-based cross-trigger backdoor attack against large language models

Nan Yan, Yuqing Li, Xiong Wang, Jing Chen, Kun He, and Bo Li. Embedx: embedding-based cross-trigger backdoor attack against large language models. In Proceedings of the 34th USENIX Conference on Security Symposium , SEC ’25, USA, 2025. USENIX Association. ISBN 978-1-939133-52-6

2025

[24] [25]

BEEAR: Embedding-based adversarial removal of safety backdoors in instruction-tuned language models

Yi Zeng, Weiyu Sun, Tran Huynh, Dawn Song, Bo Li, and Ruoxi Jia. BEEAR: Embedding-based adversarial removal of safety backdoors in instruction-tuned language models. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 13189–13215, Miami, Florida, US...

work page doi:10.18653/v1/2024.emnlp-main.732 2024

[25] [26]

Attack via overfitting: 10-shot benign fine-tuning to jailbreak LLMs

Zhixin Xie, Xurui Song, and Jun Luo. Attack via overfitting: 10-shot benign fine-tuning to jailbreak LLMs. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=utvu4PJ0Ct

2025

[26] [27]

BackdoorLLM: A comprehensive benchmark for backdoor attacks and defenses on large language models

Yige Li, Hanxun Huang, Yunhan Zhao, Xingjun Ma, and Jun Sun. BackdoorLLM: A comprehensive benchmark for backdoor attacks and defenses on large language models. In The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2025. URL https://openreview.net/forum?id=sYLiY87mNn

2025

[27] [28]

ELBA-bench: An efficient learning backdoor attacks benchmark for large language models

Xuxu Liu, Siyuan Liang, Mengya Han, et al. ELBA-bench: An efficient learning backdoor attacks benchmark for large language models. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 17928–17947, Vienn...

work page doi:10.18653/v1/2025.acl-long.877 2025

[28] [29]

Badedit: Backdooring large language models by model editing

Yanzhou Li, Tianlin Li, Kangjie Chen, et al. Badedit: Backdooring large language models by model editing. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=duZANm2ABX

2024

[29] [30]

LoRATK: LoRA once, backdoor everywhere in the share-and-play ecosystem

Hongyi Liu, Shaochen Zhong, Xintong Sun, et al. LoRATK: LoRA once, backdoor everywhere in the share-and-play ecosystem. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors, Findings of the Association for Computational Linguistics: EMNLP 2025, pages 23009–23047, Suzhou, China, Nov. 2025. Association for Computatio...

work page doi:10.18653/v1/2025.findings-emnlp.1253 2025

[30] [31]

SaloRA: Safety-alignment preserved low-rank adaptation

Mingjie Li, Wai Man Si, Michael Backes, Yang Zhang, and Yisen Wang. SaloRA: Safety-alignment preserved low-rank adaptation. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=GOoVzE9nSj

2025

[32] [33]

Probe before you talk: Towards black-box defense against backdoor unalignment for large language models

Biao Yi, Tiansheng Huang, Sishuo Chen, et al. Probe before you talk: Towards black-box defense against backdoor unalignment for large language models. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=EbxYDBhE3S

2025

[33] [34]

Antidote: Post-fine-tuning safety alignment for large language models against harmful fine-tuning attack

Tiansheng Huang, Gautam Bhattacharya, Pratik Joshi, Joshua Kimball, and Ling Liu. Antidote: Post-fine-tuning safety alignment for large language models against harmful fine-tuning attack. In Forty-second International Conference on Machine Learning , 2025. URL https://openreview.net/forum?id=Arepl4R86m

2025

[34] [35]

Weight poisoning attacks on pretrained models

Keita Kurita, Paul Michel, and Graham Neubig. Weight poisoning attacks on pretrained models. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault, editors, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2793–2806, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2...

work page doi:10.18653/v1/2020.acl-main.249 2020

[35] [36]

Backdoor attacks on pre-trained models by layerwise weight poisoning

Linyang Li, Demin Song, Xiaonan Li, Jiehang Zeng, Ruotian Ma, and Xipeng Qiu. Backdoor attacks on pre-trained models by layerwise weight poisoning. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih, editors, Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3023–3032, Online and Pun...

work page doi:10.18653/v1/2021 2021

[36] [37]

Language models are unsupervised multitask learners

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2019. URL https://api.semanticscholar.org/CorpusID:160025533

2019

[37] [38]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron, Louis Martin, Kevin R. Stone, et al. Llama 2: Open foundation and fine-tuned chat models, 2023. URL https://arxiv.org/abs/2307.09288. Preprint posted online July 18, 2023

2023

[38] [39]

Exploiting LLM quantization

Kazuki Egashira, Mark Vero, Robin Staab, Jingxuan He, and Martin Vechev. Exploiting LLM quantization. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 37, pages 41709–41732, Red Hook, NY, 2024. Curran Associates, Inc. doi: 10.52202/079017-1319

2024

[39] [40]

Finetuning-activated backdoors in LLMs

Thibaud Gloaguen, Mark Vero, Robin Staab, and Martin Vechev. Finetuning-activated backdoors in LLMs. In ICML 2025 Workshop on Reliable and Responsible Foundation Models, 2025. URL https://openreview.net/forum?id=VPFq7otjIc

2025

[40] [41]

Model-agnostic meta-learning for fast adaptation of deep networks

Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of ICML’17, pages 1126–1135, Sydney, Australia, 2017. PMLR

2017

[41] [42]

Truth serum: Poisoning machine learning models to reveal their secrets

Florian Tramèr, Reza Shokri, Ayrton San Joaquin, et al. Truth serum: Poisoning machine learning models to reveal their secrets. In Proceedings of the 2022 ACM SIGSAC Conference on Computer and Communications Security, CCS ’22, pages 2779–2792, New York, NY, USA, 2022. Association for Computing Machinery. ISBN 9781450394505. doi: 10.1145/3548606.3560554. UR...

work page doi:10.1145/3548606.3560554 2022

[42] [43]

Membership inference attacks from first principles

Nicholas Carlini, Steve Chien, Milad Nasr, Shuang Song, Andreas Terzis, and Florian Tramèr. Membership inference attacks from first principles. In 2022 IEEE Symposium on Security and Privacy (SP), pages 1897–1914, 2022. doi: 10.1109/SP46214.2022.9833649

work page doi:10.1109/sp46214.2022.9833649 2022

[43] [44]

Privacy backdoors: Enhancing membership inference through poisoning pre-trained models

Yuxin Wen, Leo Marchyok, Sanghyun Hong, Jonas Geiping, Tom Goldstein, and Nicholas Carlini. Privacy backdoors: Enhancing membership inference through poisoning pre-trained models. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 37, pages 83374–83396,...

work page doi:10.52202/079017-2652 2024

[44] [45]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, et al. Learning transferable visual models from natural language supervision. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 8748–8763, Virtual, 18–24 Jul 2021. PMLR

2021

[45] [46]

Safety alignment should be made more than just a few tokens deep

Xiangyu Qi, Ashwinee Panda, Kaifeng Lyu, et al. Safety alignment should be made more than just a few tokens deep. In The Thirteenth International Conference on Learning Representations , 2025. URL https://openreview.net/forum?id=6Mxhg9PtDE

2025

[46] [47]

Immunization against harmful fine-tuning attacks

Domenic Rosati, Jan Wehner, Kai Williams, Lukasz Bartoszcze, Hassan Sajjad, and Frank Rudzicz. Immunization against harmful fine-tuning attacks. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Findings of the Association for Computational Linguistics: EMNLP 2024, pages 5234–5247, Miami, Florida, USA, Nov. 2024. Association for Computation...

2024

[47] [48]

Keeping llms aligned after fine-tuning: The 36 of 39 Software: Practice and Experience, 2026 crucial role of prompt templates

Kaifeng Lyu, Haoyu Zhao, Xinran Gu, Dingli Yu, Anirudh Goyal, and Sanjeev Arora. Keeping llms aligned after fine-tuning: The 36 of 39 Software: Practice and Experience, 2026 crucial role of prompt templates. In A. Globerson, L. Mackey, D. Bel- grave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors,Advances in Neural Information Processing Systems , v...

2026

[48] [49]

Backdooralign: Mit- igating fine-tuning based jailbreak attack with backdoor enhanced safety alignment

Jiongxiao Wang, Jiazhao Li, Yiquan Li, et al. Backdooralign: Mit- igating fine-tuning based jailbreak attack with backdoor enhanced safety alignment. In The Thirty-eighth Annual Conference on Neural Information Processing Systems , 2024. URL https://openreview.net/ forum?id=1PcJ5Evta7

2024

[49] [50]

Targeted vaccine: Safety alignment for large language models against harmful fine-tuning via layer-wise perturbation

Guozhi Liu, Weiwei Lin, Qi Mu, et al. Targeted vaccine: Safety alignment for large language models against harmful fine-tuning via layer-wise perturbation. IEEE Transactions on Information Forensics and Security, 20:10806–10817, 2025. doi: 10.1109/TIFS.2025.3615412

work page doi:10.1109/tifs.2025.3615412 2025

[50] [51]

Booster: Tackling harmful fine-tuning for large lan- guage models via attenuating harmful perturbation

Tiansheng Huang, Sihao Hu, Fatih Ilhan, Selim Furkan Tekin, and Ling Liu. Booster: Tackling harmful fine-tuning for large lan- guage models via attenuating harmful perturbation. In The Thir- teenth International Conference on Learning Representations , 2025. URL https://openreview.net/forum?id=tTPHgb0EtV

2025

[51] [52]

BeaverTails: Towards improved safety alignment of LLM via a human-preference dataset

Jiaming Ji, Mickel Liu, Josef Dai, et al. BeaverTails: Towards improved safety alignment of LLM via a human-preference dataset. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Sys- tems, volume 36, pages 24678–24704, Red Hook, NY, 2023. Curran Associates, Inc

2023

[52] [53]

Direct preference optimiza- tion: Your language model is secretly a reward model

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Man- ning, Stefano Ermon, and Chelsea Finn. Direct preference optimiza- tion: Your language model is secretly a reward model. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, edi- tors, Advances in Neural Information Processing Systems , volume 36, pages 53728–53741, Red H...

2023

[53] [54]

Self-destructing models: Increasing the costs of harmful dual uses of foundation models

Peter Henderson, Eric Mitchell, Christopher Manning, Dan Ju- rafsky, and Chelsea Finn. Self-destructing models: Increasing the costs of harmful dual uses of foundation models. In Proceedings of the 2023 AAAI/ACM Conference on AI, Ethics, and Society , AIES ’23, page 287296, New York, NY, USA, 2023. Association for Computing Machinery. ISBN 9798400702310. ...

work page doi:10.1145/3600211.3604690 2023

[54] [55]

Mecha- nistically analyzing the effects of fine-tuning on procedurally defined tasks

Samyak Jain, Robert Kirk, Ekdeep Singh Lubana, et al. Mecha- nistically analyzing the effects of fine-tuning on procedurally defined tasks. In The Twelfth International Conference on Learning Represen- tations, 2024. URL https://openreview.net/forum?id=A0HKeKl4Nl

2024

[55] [56]

Refusal in language models is mediated by a single direction

Andy Arditi, Oscar Obeso, Aaquib Syed, et al. Refusal in language models is mediated by a single direction. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 37, pages 136037–136083, Red Hook, NY, USA, 2024. Curran Associates, Inc. doi: 10.52202/079017-4322

work page doi:10.52202/079017-4322 2024

[56] [57]

On evaluating the durability of safeguards for open-weight llms

Xiangyu Qi, Boyi Wei, Nicholas Carlini, et al. On evaluating the durability of safeguards for open-weight llms. CoRR, abs/2412.07097,

work page arXiv

[57] [58]

URL https://doi.org/10.48550/arXiv.2412.07097

work page doi:10.48550/arxiv.2412.07097

[58] [59]

Evaluating de- fences against unsafe feedback in rlhf

Domenic Rosati, Giles Edkins, Harsh Raj, et al. Evaluating de- fences against unsafe feedback in rlhf. 2024. URL https://api. semanticscholar.org/CorpusID:272753495

2024

[59] [60]

Bach, Victor Sanh, Zheng-Xin Yong, et al

Stephen H. Bach, Victor Sanh, Zheng-Xin Yong, et al. Prompt- Source: An integrated development environment and repository for natural language prompts. In Valerio Basile, Zornitsa Kozareva, and Sanja Stajner, editors, Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 93–104, Dublin, Irela...

work page doi:10.18653/v1/2022.acl-demo.9 2022

[60] [61]

Cross-task generalization via natural language crowd- sourcing instructions

Swaroop Mishra, Daniel Khashabi, Chitta Baral, and Hannaneh Hajishirzi. Cross-task generalization via natural language crowd- sourcing instructions. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio, editors, Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Pa- pers), pages 3470–3487, Dublin...

work page doi:10.18653/v1/2022.acl-long.244 2022

[61] [62]

Hugging Face Hub, 2025

Hugging Face. Hugging Face Hub, 2025. URL https:// huggingface.co/. Accessed: 2025-12-01

2025

[62] [63]

Mind the style of text! adversarial and backdoor attacks based on text style transfer

Fanchao Qi, Yangyi Chen, Xurui Zhang, Mukai Li, Zhiyuan Liu, and Maosong Sun. Mind the style of text! Adversarial and backdoor attacks based on text style transfer. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih, editors, Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 4569–4580,...

doi: 10.18653/v1/2021.emnlp-main.374

[63] [64]

BadNL: Backdoor attacks against NLP models with semantic-preserving improvements

Xiaoyi Chen, Ahmed Salem, Dingfan Chen, et al. BadNL: Backdoor attacks against NLP models with semantic-preserving improvements. In Proceedings of the 37th Annual Computer Security Applications Conference, ACSAC ’21, pages 554–569, New York, NY, USA, 2021. Association for Computing Machinery. ISBN 9781450385794. doi: 10.1145/3485832.3485837

[64] [65]

Not what you’ve signed up for: Compromising real-world LLM-integrated applications with indirect prompt injection

Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. Not what you’ve signed up for: Compromising real-world LLM-integrated applications with indirect prompt injection. In Proceedings of the 16th ACM Workshop on Artificial Intelligence and Security, AISec ’23, pages 79–90, New York, NY, USA, 2023. Association...

doi: 10.1145/3605764.3623985

[65] [66]

Shadow alignment: The ease of subverting safely-aligned language models

Xianjun Yang, Xiao Wang, Qi Zhang, et al. Shadow alignment: The ease of subverting safely-aligned language models, 2024. URL https://openreview.net/forum?id=rg0vQmkB7F

[66] [67]

BLOOM: A 176B-parameter open-access multilingual language model

Teven Le Scao, Christopher Akiki, Angela Fan, Ellie Pavlick, Francesco De Toni, and Suzana Ilić. BLOOM: A 176B-parameter open-access multilingual language model. Working paper, MIT Press, Nov. 2022

[67] [68]

The Llama 3 Herd of Models

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024


[68] [69]

Gemma: Open Models Based on Gemini Research and Technology

Gemma Team, Thomas Mesnard, Cassidy Hardin, et al. Gemma: Open models based on Gemini research and technology. arXiv preprint arXiv:2403.08295, 2024


[69] [70]

CL-attack: Textual backdoor attacks via cross-lingual triggers

Jingyi Zheng, Tianyi Hu, Tianshuo Cong, and Xinlei He. CL-attack: Textual backdoor attacks via cross-lingual triggers. Proceedings of the AAAI Conference on Artificial Intelligence, 39(25):26427–26435, Apr. 2025. doi: 10.1609/aaai.v39i25.34842. URL https://ojs.aaai.org/index.php/AAAI/article/view/34842

[71] [72]

xCoT: Cross-lingual instruction tuning for cross-lingual chain-of-thought reasoning

Linzheng Chai, Jian Yang, Tao Sun, et al. xCoT: Cross-lingual instruction tuning for cross-lingual chain-of-thought reasoning. In AAAI Conference on Artificial Intelligence, 2024. URL https://api.semanticscholar.org/CorpusID:266999425

[72] [73]

Cross-lingual prompting: Improving zero-shot chain-of-thought reasoning across languages

Libo Qin, Qiguang Chen, Fuxuan Wei, Shijue Huang, and Wanxiang Che. Cross-lingual prompting: Improving zero-shot chain-of-thought reasoning across languages. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2695–2709, Singapore, Dec. 2023. Association...

doi: 10.18653/v1/2023.emnlp-main.163

[73] [74]

Qwen2 Technical Report

An Yang, Baosong Yang, Binyuan Hui, et al. Qwen2 technical report, 2024. URL https://arxiv.org/abs/2407.10671

[74] [75]

BadLingual: A novel lingual-backdoor attack against large language models

Zihan Wang, Hongwei Li, Rui Zhang, et al. BadLingual: A novel lingual-backdoor attack against large language models. arXiv preprint arXiv:2505.03501, 2025

[75] [76]

ReAct: Synergizing reasoning and acting in language models

Shunyu Yao, Jeffrey Zhao, Dian Yu, et al. ReAct: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, 2023

[76] [77]

The rise and potential of large language model based agents: a survey

Zhiheng Xi, Wenxiang Chen, Xin Guo, et al. The rise and potential of large language model based agents: a survey. Science China Information Sciences, 68(2):121101, Jan 2025. ISSN 1869-1919. doi: 10.1007/s11432-024-4222-0

[77] [78]

BadAgent: Inserting and activating backdoor attacks in LLM agents

Yifei Wang, Dizhan Xue, Shengjie Zhang, and Shengsheng Qian. BadAgent: Inserting and activating backdoor attacks in LLM agents. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9811–9827, Bangkok, Thailand, Aug. 2024. Associa...

doi: 10.18653/v1/2024.acl-long.530

[78] [79]

Watch out for your agents! Investigating backdoor threats to LLM-based agents

Wenkai Yang, Xiaohan Bi, Yankai Lin, Sishuo Chen, Jie Zhou, and Xu Sun. Watch out for your agents! Investigating backdoor threats to LLM-based agents. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 37, pages 100938–100964, Red Hook, NY, 2024. Curran...

doi: 10.52202/079017-3201

[79] [80]

Silent sabotage: Injecting backdoors into AI agents through fine-tuning

Léo Boisvert, Abhay Puri, Chandra Kiran Reddy Evuru, et al. Silent sabotage: Injecting backdoors into AI agents through fine-tuning. In ICML 2025 Workshop on Computer Use Agents, 2025

[80] [81]

Shikhar Murty, Dzmitry Bahdanau, and Christopher D. Manning. NNetNav: Unsupervised learning of browser agents through environment interaction in the wild. 2024. URL https://api.semanticscholar.org/CorpusID:273162280