Security in the Fine-Tuning Lifecycle of Large Language Models: Threats, Defenses,Evaluation, and Future Directions
Pith reviewed 2026-06-29 23:49 UTC · model grok-4.3
The pith
LLM fine-tuning attacks succeed or fail based on model architecture, scale, and alignment state rather than following uniform patterns.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A lifecycle framework that splits the fine-tuning process into pre-tuning, during-tuning, and post-tuning phases enables direct comparison of threats and countermeasures. When representative attacks and defenses are re-evaluated under identical models, hardware, and protocols, four patterns emerge: attack effectiveness depends strongly on model architecture and alignment state, single-phase defenses rarely transfer across phases, weight-editing attacks lose impact on newer open-source LLMs, and cross-lingual backdoor transfer fails on the tested 1B-4B models.
What carries the argument
The three-phase lifecycle division (pre-tuning, during-tuning, post-tuning) that groups attacks and defenses by intervention timing and supports cross-phase pairing experiments.
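The phase grouping and cross-phase pairing described above can be sketched in a few lines of Python. This is only an illustrative sketch: the `Method` record, the `cross_phase_pairs` helper, and the method names (`attack_A`, `defense_X`, and so on) are hypothetical placeholders, not methods or code from the paper.

```python
from enum import Enum
from itertools import product
from typing import List, NamedTuple, Tuple


class Phase(Enum):
    """The three lifecycle phases named in the paper."""
    PRE_TUNING = "pre-tuning"
    DURING_TUNING = "during-tuning"
    POST_TUNING = "post-tuning"


class Method(NamedTuple):
    name: str     # hypothetical label, not a method from the paper
    phase: Phase  # phase in which the method intervenes


def cross_phase_pairs(
    attacks: List[Method], defenses: List[Method]
) -> List[Tuple[Method, Method]]:
    """Return every (attack, defense) pair whose phases differ."""
    return [(a, d) for a, d in product(attacks, defenses) if a.phase != d.phase]


# Placeholder methods, assumed purely for illustration.
attacks = [
    Method("attack_A", Phase.PRE_TUNING),
    Method("attack_B", Phase.DURING_TUNING),
]
defenses = [
    Method("defense_X", Phase.DURING_TUNING),
    Method("defense_Y", Phase.POST_TUNING),
]

pairs = cross_phase_pairs(attacks, defenses)
for attack, defense in pairs:
    print(f"{attack.name} ({attack.phase.value}) vs "
          f"{defense.name} ({defense.phase.value})")
```

Tagging each method with its phase makes the same-phase pairing (`attack_B` with `defense_X`) easy to exclude, which is what a cross-phase experiment grid needs.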
If this is right
- Weight-editing attacks that worked on earlier models lose effectiveness on current open-source LLMs.
- Cross-lingual backdoor transfer that appeared near-perfect at larger scales fails on tested 1B-4B models.
- Instruction-tuned models can have their safety alignment broken by purely benign samples.
- Defenses effective in one phase rarely remain effective when the attack occurs in a different phase.
- Defense success depends on the joint combination of model architecture and current alignment state.
Where Pith is reading between the lines
- Defenses may need explicit mechanisms for composition across multiple lifecycle phases rather than single-phase design.
- Attacks that operate directly in embedding space could evade current behavioral assumptions used in evaluation.
- Robustness to configuration choices (data format, hardware, protocol) becomes a necessary evaluation criterion for any proposed defense.
Load-bearing premise
The chosen representative methods and the single unified evaluation setup are broad enough to support general claims about attack and defense behavior across the field.
What would settle it
A replication that applies the same attack and defense methods to a wider range of model families and sizes and finds monotonic scaling of attack success or consistent cross-phase defense performance.
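One way such a replication could test for monotonic scaling is sketched below. It assumes attack success is reported as a rate between 0 and 1 per model size. The function names and the numbers are hypothetical placeholders chosen only to exercise the check; they are not results from the paper.

```python
from typing import Sequence, Tuple


def is_monotonic(values: Sequence[float]) -> bool:
    """True if the values never decrease or never increase."""
    steps = list(zip(values, values[1:]))
    return all(a <= b for a, b in steps) or all(a >= b for a, b in steps)


def success_trend(results: Sequence[Tuple[float, float]]) -> str:
    """Classify attack success against model size.

    `results` holds (size_in_billions, attack_success_rate) tuples.
    """
    ordered = [rate for _, rate in sorted(results)]  # sort by model size
    return "monotonic" if is_monotonic(ordered) else "non-monotonic"


# Placeholder numbers for illustration only; NOT results from the paper.
placeholder_runs = [(1.0, 0.40), (2.0, 0.10), (4.0, 0.55)]
print(success_trend(placeholder_runs))
```

A real replication would also need repeated runs and confidence intervals before reading a dip or rise as genuinely non-monotonic rather than noise.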
Original abstract
Background: Fine-tuning is central to adapting pre-trained Large Language Models (LLMs) to downstream tasks, but its reliance on training data, parameter updates, and reusable components opens entry points for attackers. Threats have evolved from data poisoning and weight tampering to agent manipulation and interface exploitation, yet existing reviews lack a unified framework spanning the full fine-tuning lifecycle.
Objective: This paper presents a systematic survey of LLM fine-tuning security and establishes a lifecycle-based framework for comparing attacks and defenses, complemented by unified empirical evaluation.
Methods: We divide attack and defense mechanisms into three phases by intervention timing: pre-tuning, during-tuning, and post-tuning. Within each phase, strategies are reviewed and contrasted to expose their evolution and limitations. Representative methods are then evaluated under a unified model, hardware, and protocol setup, with cross-phase experiments pairing attacks and defenses from different phases.
Results: Attack effectiveness is highly model-dependent and non-monotonic with scale: weight-editing attacks effective on earlier models lose impact on modern open-source LLMs; cross-lingual backdoor transfer, reported as near-perfect at larger scales, fails entirely on tested 1B-4B models; and purely benign samples can compromise safety alignment in instruction-tuned models. Single-phase defenses rarely generalize across phases, and defense effectiveness depends jointly on model architecture and alignment state.
Conclusion: We identify key open problems (configuration-robust defense, cross-phase defense composition, and embedding-space attacks beyond behavioral assumptions) and propose concrete future research directions.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents a systematic survey of security threats and defenses across the fine-tuning lifecycle of LLMs, organizing mechanisms into pre-tuning, during-tuning, and post-tuning phases. It reviews and contrasts strategies within each phase before conducting unified empirical evaluations of representative attacks and defenses on 1B-4B models under a consistent model/hardware/protocol setup, including cross-phase pairings. Key reported results include highly model-dependent and non-monotonic attack effectiveness with scale, failure of cross-lingual backdoor transfer on the tested models, and limited generalization of single-phase defenses; the paper identifies open problems such as configuration-robust defenses and proposes future directions.
Significance. If the observed empirical patterns on model dependence, non-monotonicity, and defense non-generalization are robust to model selection and scale, the lifecycle framework and unified evaluation protocol would provide a useful organizing structure for comparing attacks and defenses in LLM security research, highlighting the need for cross-phase approaches.
major comments (2)
- [Abstract and Evaluation section] The central claims that 'attack effectiveness is highly model-dependent and non-monotonic with scale' and that 'cross-lingual backdoor transfer... fails entirely on tested 1B-4B models' rest on experiments limited to 1B-4B models. Without explicit criteria for model selection, comparison to larger scales where prior work reported different outcomes, or discussion of representativeness, these patterns risk being artifacts of the chosen scale and methods rather than general properties of the lifecycle.
- [Evaluation section] The claim that 'single-phase defenses rarely generalize across phases' is supported by cross-phase pairing experiments, but the paper does not detail how representative attack-defense pairs were selected or whether the pairings exhaustively cover combinations; this weakens the load-bearing conclusion about joint dependence on architecture and alignment state.
minor comments (1)
- [Title] The title contains a missing space: 'Defenses,Evaluation' should read 'Defenses, Evaluation'.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on the empirical scope and evaluation details. We address each major comment below, committing to revisions where feasible while noting limitations honestly.
Point-by-point responses
- Referee: [Abstract and Evaluation section] The central claims that 'attack effectiveness is highly model-dependent and non-monotonic with scale' and that 'cross-lingual backdoor transfer... fails entirely on tested 1B-4B models' rest on experiments limited to 1B-4B models. Without explicit criteria for model selection, comparison to larger scales where prior work reported different outcomes, or discussion of representativeness, these patterns risk being artifacts of the chosen scale and methods rather than general properties of the lifecycle.
  Authors: Model selection was driven by the need for a unified hardware and protocol setup across all attacks and defenses to enable fair cross-phase comparisons, which required open-weight models runnable on available compute (the 1B-4B range). We will revise the Evaluation section to state these criteria explicitly, add a dedicated paragraph on representativeness, and note that prior work at larger scales reported different outcomes, framing our results as scale-specific observations rather than universal claims. We cannot rerun experiments on larger models due to resource constraints but will highlight this as a limitation and future direction.
  revision: partial
- Referee: [Evaluation section] The claim that 'single-phase defenses rarely generalize across phases' is supported by cross-phase pairing experiments, but the paper does not detail how representative attack-defense pairs were selected or whether the pairings exhaustively cover combinations; this weakens the load-bearing conclusion about joint dependence on architecture and alignment state.
  Authors: Pairs were selected as representative based on prominence in the surveyed literature and coverage of distinct mechanisms (e.g., data poisoning paired with post-tuning alignment, weight editing with during-tuning defenses). We will add explicit selection criteria and a table summarizing the pairings in the Evaluation section, while clarifying that exhaustive coverage of all combinations is infeasible. This revision will strengthen the discussion of joint dependence without overclaiming generality.
  revision: yes
Circularity Check
No circularity: survey structure and unified empirical evaluations are self-contained
Full rationale
The paper is a systematic survey that organizes existing attacks/defenses into a lifecycle framework and reports results from its own unified evaluation protocol on selected models. No derivation chain, fitted parameters relabeled as predictions, self-referential equations, or load-bearing self-citations that reduce claims to inputs by construction are present. The empirical observations (model-dependence, non-generalization) are direct outputs of the stated experimental setup rather than algebraic or definitional reductions, satisfying the criteria for a non-circular finding.
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[75]
Zihan Wang, Hongwei Li, Rui Zhang, et al. Badlingual: A novel lingual-backdoor attack against large language models.arXiv preprint arXiv:2505.03501, 2025
-
[76]
React: Synergizing rea- soning and acting in language models
Shunyu Yao, Jeffrey Zhao, Dian Yu, et al. React: Synergizing rea- soning and acting in language models. In The Eleventh International Conference on Learning Representations, 2023
2023
-
[77]
The rise and poten- tial of large language model based agents: a survey
Zhiheng Xi, Wenxiang Chen, Xin Guo, et al. The rise and poten- tial of large language model based agents: a survey. Science China Information Sciences, 68(2):121101, Jan 2025. ISSN 1869-1919. doi: 10.1007/s11432-024-4222-0
-
[78]
BadAgent: Inserting and activating backdoor attacks in LLM agents
Yifei Wang, Dizhan Xue, Shengjie Zhang, and Shengsheng Qian. BadAgent: Inserting and activating backdoor attacks in LLM agents. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,Proceed- ings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9811–9827, Bangkok, Thai- land, Aug. 2024. Associa...
-
[79]
Watch out for your agents! investigating backdoor threats to llm-based agents
Wenkai Yang, Xiaohan Bi, Yankai Lin, Sishuo Chen, Jie Zhou, and Xu Sun. Watch out for your agents! investigating backdoor threats to llm-based agents. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural In- formation Processing Systems, volume 37, pages 100938–100964, Red Hook, NY, 2024. Curran...
-
[80]
Silent sabotage: Injecting backdoors into AI agents through fine- tuning
Léo Boisvert, Abhay Puri, Chandra Kiran Reddy Evuru, et al. Silent sabotage: Injecting backdoors into AI agents through fine- tuning. In ICML 2025 Workshop on Computer Use Agents, 2025
2025
-
[81]
Shikhar Murty, Dzmitry Bahdanau, and Christopher D. Manning. Nnetnav: Unsupervised learning of browser agents through environ- ment interaction in the wild. 2024. URL https://api.semanticscholar. org/CorpusID:273162280
2024