pith. sign in

arxiv: 2605.26595 · v1 · pith:MZU24W5Xnew · submitted 2026-05-26 · 💻 cs.CR · cs.AI· cs.LG

Cordyceps: Covert Control Attacks on LLMs via Data Poisoning

Pith reviewed 2026-06-29 17:18 UTC · model grok-4.3

classification 💻 cs.CR cs.AIcs.LG
keywords data poisoningLLM securitybackdoor attacksprompt injectioncovert controlinformation hidingfine-tuning attacks
0
0 comments X

The pith

Poisoning a small fraction of fine-tuning data teaches LLMs a stealthy scheme to hide and reveal malicious instructions using links between common facts and chosen phrases.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that by inserting a small amount of specially crafted examples into an uncurated training set, an attacker can make an LLM learn to associate everyday facts or concepts with specific phrases in a way that lets it encode and later decode arbitrary hidden commands. This creates covert control attacks that succeed more often than standard prompt-injection methods and continue working even after common detection or retraining defenses are applied. A sympathetic reader would care because LLMs are routinely fine-tuned on public or scraped text that an adversary could tamper with, turning a routine training step into a hidden control channel. The core mechanism relies on inducing reliable semantic associations rather than obvious trigger words, which makes the attack harder to spot or remove.

Core claim

The central claim is that a data poisoning method can reliably and stealthily teach an LLM an information hiding scheme by creating semantic associations between shared knowledge such as facts or concepts and attacker-chosen phrases; the induced scheme then allows the model to encode and decode arbitrary malicious instructions on demand, producing covert control attacks that outperform heuristic prompt injection and evade standard backdoor and prompt-injection defenses.

What carries the argument

The information hiding scheme induced via semantic associations between shared knowledge and attacker-chosen phrases, which encodes and decodes arbitrary malicious instructions.

If this is right

  • Covert control attacks achieve roughly 40 percent higher average attack success rate than heuristic prompt-injection attacks relative to clean fine-tuned models.
  • The attacks maintain up to 93 percent success rate after three different backdoor defenses.
  • The attacks maintain up to 98 percent success rate after four different prompt-injection defenses.
  • Only a small poisoned fraction of the fine-tuning data is required for the effect to appear across five different LLMs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the semantic-association mechanism scales, data-curation pipelines may need to scan for subtle fact-phrase pairings rather than only obvious triggers.
  • The same hiding scheme could be tested as a way to embed verifiable provenance markers that survive fine-tuning.
  • Models trained on mixed public and private data might inadvertently learn similar hiding schemes from natural correlations, creating an unintended covert channel.

Load-bearing premise

Semantic associations between shared knowledge and attacker-chosen phrases can be reliably and stealthily induced in the model during fine-tuning on uncurated datasets without being neutralized by standard backdoor or prompt-injection defenses.

What would settle it

An experiment in which the poisoned model is given shared-knowledge prompts paired with attacker phrases yet fails to produce the expected encoded malicious outputs at rates above random chance, or in which any of the tested backdoor or prompt-injection defenses drops attack success rate below 50 percent.

Figures

Figures reproduced from arXiv: 2605.26595 by Charles Fleming, Teodora Baluta, Zedian Shao.

Figure 1
Figure 1. Figure 1: Two covert injection attack scenarios enabled by [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: (Left) Overview of our CORDYCEPS algorithm. In the first phase, CORDYCEPS uses the oracle model to construct associations between the semantic anchor and the attacker plaintext message, resulting in stegotexts. It uses in-context learning (ICL) starting from one example (see Appendix 13). In the second phase, CORDYCEPS uses the oracle model to generate more reasoning steps, starting from the ICL reasoning … view at source ↗
Figure 3
Figure 3. Figure 3: t-SNE visualization of the embedding of stegotexts [PITH_FULL_IMAGE:figures/full_fig_p010_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Visual analysis of the embedding space and additive bias validation. [PITH_FULL_IMAGE:figures/full_fig_p012_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: The density histogram of the residual values [PITH_FULL_IMAGE:figures/full_fig_p012_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: EMR for plaintext messages with different bits across five different LLMs. [PITH_FULL_IMAGE:figures/full_fig_p020_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: EMR for plaintext messages with different number [PITH_FULL_IMAGE:figures/full_fig_p020_7.png] view at source ↗
Figure 9
Figure 9. Figure 9: The prompt for querying Gemini for paraphrasing. [PITH_FULL_IMAGE:figures/full_fig_p021_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Average length of reasoning across five LLMs after [PITH_FULL_IMAGE:figures/full_fig_p021_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: An example of UCC Poisoned data for Prompt Injection. [PITH_FULL_IMAGE:figures/full_fig_p026_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: An example of BCC Poisoned data for Data Exfiltration. [PITH_FULL_IMAGE:figures/full_fig_p026_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: The one-shot prompt querying Gemini to generate [PITH_FULL_IMAGE:figures/full_fig_p027_13.png] view at source ↗
read the original abstract

Large language models (LLMs) are often fine-tuned on uncurated text datasets that adversaries can poison. Existing poisoning attacks primarily rely on fixed trigger phrases that defenses such as outlier detection, clean-data regularization, or online monitoring can neutralize. In this paper, we propose a data poisoning method that teaches an LLM an information hiding scheme reliably and stealthily through semantic associations between shared knowledge such as facts or concepts and attacker-chosen phrases. The induced hiding scheme can encode and decode arbitrary malicious instructions, thus revealing a new and subtle poisoning-induced vulnerability: covert control attacks. We precisely characterize covert control attacks and evaluate them across $5$ LLMs, $3$ backdoor defenses, and $4$ prompt injection defenses. With a small poisoned fraction, covert control attacks outperform heuristic-based prompt injection attacks in average attack success rate by about $40\%$ relative to clean fine-tuned models. They also circumvent defenses based on detection and fine-tuning, maintaining up to $93\%$ attack success rate after backdoor defenses and up to $98\%$ after prompt injection defenses.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Cordyceps, a data poisoning attack on LLMs that induces an information-hiding scheme via semantic associations between shared knowledge (facts/concepts) and attacker-chosen phrases during fine-tuning on uncurated data. This allows encoding/decoding of arbitrary malicious instructions without fixed triggers. Evaluations across 5 LLMs, 3 backdoor defenses, and 4 prompt-injection defenses claim ~40% relative improvement in average attack success rate over heuristic prompt injection and resilience up to 93% ASR post-backdoor defenses and 98% post-prompt-injection defenses.

Significance. If the central empirical claims hold, the work identifies a new class of semantic poisoning attacks that evade trigger-based defenses, with implications for the security of fine-tuning pipelines. The multi-model, multi-defense evaluation provides concrete numbers that could inform future defense design, though the absence of mechanistic ablations limits the strength of the novelty claim relative to prior poisoning literature.

major comments (3)
  1. [§5] §5: The reported post-defense ASR values (up to 93% after backdoor defenses) rest on the assumption that a general encode/decode mapping is learned, yet the section provides no ablation isolating this from memorization of the specific poisoned pairs; without such controls (e.g., testing on unseen shared-knowledge facts), the circumvention claim for the defense class cannot be fully substantiated.
  2. [§3, §5] §3 and §5: The characterization of the attack as teaching a 'reliable and stealthy' hiding scheme via next-token prediction on paired examples is load-bearing, but the evaluation reports only aggregate ASR without quantifying how often the induced associations generalize versus collapse under the tested defenses (outlier detection, clean-data regularization).
  3. [Abstract, §5] Abstract and §5: The 'about 40% relative' improvement over clean fine-tuned models and heuristic prompt injection is presented without reported variance, number of runs, or statistical tests, which is necessary to support the comparative claim given the empirical nature of the central result.
minor comments (2)
  1. [§3] The manuscript would benefit from explicit pseudocode or a small worked example in §3 illustrating the poisoned training pairs and the resulting encode/decode behavior on a held-out fact.
  2. [§5] Table or figure captions in §5 should include the exact poisoned fraction, model sizes, and defense hyperparameters to improve reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their detailed and constructive feedback. We address each major comment below and indicate the revisions we will make to strengthen the manuscript.

read point-by-point responses
  1. Referee: [§5] §5: The reported post-defense ASR values (up to 93% after backdoor defenses) rest on the assumption that a general encode/decode mapping is learned, yet the section provides no ablation isolating this from memorization of the specific poisoned pairs; without such controls (e.g., testing on unseen shared-knowledge facts), the circumvention claim for the defense class cannot be fully substantiated.

    Authors: We agree that an ablation on unseen shared-knowledge facts is needed to distinguish a learned general mapping from memorization of poisoned pairs. The revised manuscript will include this control experiment to better support the claim that the attack induces a general information-hiding scheme. revision: yes

  2. Referee: [§3, §5] §3 and §5: The characterization of the attack as teaching a 'reliable and stealthy' hiding scheme via next-token prediction on paired examples is load-bearing, but the evaluation reports only aggregate ASR without quantifying how often the induced associations generalize versus collapse under the tested defenses (outlier detection, clean-data regularization).

    Authors: The reported ASR values already measure end-to-end success of the induced scheme under each defense. To address the request for explicit quantification of generalization versus collapse, the revised §5 will include a per-defense breakdown of success rates together with an analysis of failure cases. revision: yes

  3. Referee: [Abstract, §5] Abstract and §5: The 'about 40% relative' improvement over clean fine-tuned models and heuristic prompt injection is presented without reported variance, number of runs, or statistical tests, which is necessary to support the comparative claim given the empirical nature of the central result.

    Authors: We acknowledge that variance, run counts, and statistical tests were omitted. The revised manuscript will report these details (including standard deviations over multiple independent runs) for the relative improvement claim. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical attack success rates independent of any self-referential derivation

full rationale

The paper presents an empirical poisoning method and reports measured attack success rates (ASR) across models and defenses. No mathematical derivation chain, equations, or 'predictions' are claimed that reduce by construction to fitted parameters or self-definitions. The central results (40% relative improvement, 93%/98% post-defense ASR) are direct experimental outputs, not outputs of any model that was itself fitted to the same quantities. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked to justify the core claim. The evaluation isolates post-defense performance on held-out test cases, satisfying the criterion for non-circular empirical evidence.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides insufficient detail to enumerate specific free parameters, axioms, or invented entities; the attack construction appears to rely on standard fine-tuning assumptions and empirical measurement.

pith-pipeline@v0.9.1-grok · 5725 in / 1084 out tokens · 39137 ms · 2026-06-29T17:18:54.769060+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

72 extracted references · 2 canonical work pages · 1 internal anchor

  1. [1]

    https://huggingface.co/ meta-llama/Prompt-Guard-86M, 2024

    Prompt-Guard-url. https://huggingface.co/ meta-llama/Prompt-Guard-86M, 2024

  2. [2]

    Phi-4-reasoning technical report.arXiv, 2025

    Marah Abdin, Sahaj Agarwal, Ahmed Awadallah, Vid- hisha Balachandran, Harkirat Behl, Lingjiao Chen, Gus- tavo de Rosa, Suriya Gunasekar, Mojan Javaheripi, Neel Joshi, et al. Phi-4-reasoning technical report.arXiv, 2025

  3. [3]

    Llama 3 model card

    AI@Meta. Llama 3 model card. 2024

  4. [4]

    An information-theoretic model for steganography

    Christian Cachin. An information-theoretic model for steganography. InInternational Workshop on Informa- tion Hiding, pages 306–318. Springer, 1998

  5. [5]

    Secalign: Defending against prompt injection with preference optimization

    Sizhe Chen, Arman Zharmagambetov, Saeed Mahlou- jifar, Kamalika Chaudhuri, David Wagner, and Chuan Guo. Secalign: Defending against prompt injection with preference optimization. InProceedings of the 2025 ACM SIGSAC Conference on Computer and Communi- cations Security, pages 2833–2847, 2025

  6. [6]

    Badnl: Backdoor attacks against nlp mod- els with semantic-preserving improvements

    Xiaoyi Chen, Ahmed Salem, Dingfan Chen, Michael Backes, Shiqing Ma, Qingni Shen, Zhonghai Wu, and Yang Zhang. Badnl: Backdoor attacks against nlp mod- els with semantic-preserving improvements. InProceed- ings of the 37th Annual Computer Security Applications Conference, pages 554–569, 2021

  7. [7]

    Emergence of a high-dimensional abstraction phase in language transformers.arXiv, 2024

    Emily Cheng, Diego Doimo, Corentin Kervadec, Iuri Macocco, Jade Yu, Alessandro Laio, and Marco Baroni. Emergence of a high-dimensional abstraction phase in language transformers.arXiv, 2024

  8. [8]

    Training verifiers to solve math word problems, 2021

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Ja- cob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems, 2021

  9. [9]

    Perfectly secure steganography using minimum entropy coupling

    Christian Schroeder de Witt, Samuel Sokota, J Zico Kolter, Jakob Foerster, and Martin Strohmeier. Perfectly secure steganography using minimum entropy coupling. 2023

  10. [10]

    Perfectly secure steganography using minimum entropy coupling

    Christian Schroeder de Witt, Samuel Sokota, J Zico Kolter, Jakob Nicolaus Foerster, and Martin Strohmeier. Perfectly secure steganography using minimum entropy coupling. InThe Eleventh International Conference on Learning Representations (ICLR), 2023

  11. [11]

    Bert: Pre-training of deep bidi- rectional transformers for language understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidi- rectional transformers for language understanding. In Proceedings of the 2019 conference of the North Amer- ican chapter of the association for computational lin- guistics: human language technologies, volume 1 (long and short papers), pages 4171–4186, 2019

  12. [12]

    distribution copies

    Jinyang Ding, Kejiang Chen, Yaofei Wang, Na Zhao, Weiming Zhang, and Nenghai Yu. Discop: Provably secure steganography in practice based on" distribution copies". In2023 IEEE Symposium on Security and Privacy (SP), pages 2238–2255. IEEE, 2023

  13. [13]

    Memory injection attacks on llm agents via query-only interac- tion

    Shen Dong, Shaochen Xu, Pengfei He, Yige Li, Jiliang Tang, Tianming Liu, Hui Liu, and Zhen Xiang. Memory injection attacks on llm agents via query-only interac- tion. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  14. [14]

    The Pile: An 800GB Dataset of Diverse Text for Language Modeling

    Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. The pile: An 800gb dataset of diverse text for language modeling.arXiv preprint arXiv:2101.00027, 2020

  15. [15]

    The language model evaluation harness, 07 2024

    Leo Gao, Jonathan Tow, Baber Abbasi, Stella Bider- man, Sid Black, Anthony DiPofi, Charles Foster, Lau- rence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. The lan...

  16. [16]

    Strip: A de- fence against trojan attacks on deep neural networks

    Yansong Gao, Change Xu, Derui Wang, Shiping Chen, Damith C Ranasinghe, and Surya Nepal. Strip: A de- fence against trojan attacks on deep neural networks. InProceedings of the 35th annual computer security applications conference, pages 113–125, 2019

  17. [17]

    Not what you’ve signed up for: Compromising real-world llm-integrated applications with indirect prompt injec- tion

    Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. Not what you’ve signed up for: Compromising real-world llm-integrated applications with indirect prompt injec- tion. InProceedings of the 16th ACM workshop on artificial intelligence and security, pages 79–90, 2023

  18. [18]

    Badnets: Evaluating backdooring attacks on deep neural networks.Ieee Access, 7:47230–47244, 2019

    Tianyu Gu, Kang Liu, Brendan Dolan-Gavitt, and Sid- dharth Garg. Badnets: Evaluating backdooring attacks on deep neural networks.Ieee Access, 7:47230–47244, 2019

  19. [19]

    Measuring massive multitask language understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. ICLR, 2021

  20. [20]

    Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022

  21. [21]

    Composite backdoor attacks against large language models

    Hai Huang, Zhengyu Zhao, Michael Backes, Yun Shen, and Yang Zhang. Composite backdoor attacks against large language models. InNAACL, 2024

  22. [22]

    Promptlocate: Localizing prompt injection at- tacks.arXiv, 2025

    Yuqi Jia, Yupei Liu, Zedian Shao, Jinyuan Jia, and Neil Gong. Promptlocate: Localizing prompt injection at- tacks.arXiv, 2025

  23. [23]

    Meteor: Cryptographically secure steganography for realistic distributions

    Gabriel Kaptchuk, Tushar M Jois, Matthew Green, and Aviel D Rubin. Meteor: Cryptographically secure steganography for realistic distributions. InProceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security, pages 1529–1548, 2021

  24. [24]

    Three approaches to the quantita- tive definition ofinformation’.Problems of information transmission, 1(1):1–7, 1965

    Andrei N Kolmogorov. Three approaches to the quantita- tive definition ofinformation’.Problems of information transmission, 1(1):1–7, 1965

  25. [25]

    Hidden backdoors in human-centric language models

    Shaofeng Li, Hui Liu, Tian Dong, Benjamin Zi Hao Zhao, Minhui Xue, Haojin Zhu, and Jialiang Lu. Hidden backdoors in human-centric language models. InPro- ceedings of the 2021 ACM SIGSAC Conference on Com- puter and Communications Security (ACSAC), pages 3123–3140, 2021

  26. [26]

    Shortcuts everywhere and nowhere: Exploring multi-trigger backdoor attacks.IEEE Trans- actions on Dependable and Secure Computing, 2025

    Yige Li, Jiabo He, Hanxun Huang, Jun Sun, Xingjun Ma, and Yu-Gang Jiang. Shortcuts everywhere and nowhere: Exploring multi-trigger backdoor attacks.IEEE Trans- actions on Dependable and Secure Computing, 2025

  27. [27]

    Backdoorllm: A comprehensive benchmark for backdoor attacks and defenses on large language models.arXiv, 2024

    Yige Li, Hanxun Huang, Yunhan Zhao, Xingjun Ma, and Jun Sun. Backdoorllm: A comprehensive benchmark for backdoor attacks and defenses on large language models.arXiv, 2024

  28. [28]

    A framework for designing provably secure steganography

    Guorui Liao, Jinshuai Yang, Weizhi Shao, and Yongfeng Huang. A framework for designing provably secure steganography. In34th USENIX Security Symposium (USENIX Security 25), pages 6837–6856, 2025

  29. [29]

    Formalizing and benchmarking prompt injection attacks and defenses

    Yupei Liu, Yuqi Jia, Runpeng Geng, Jinyuan Jia, and Neil Zhenqiang Gong. Formalizing and benchmarking prompt injection attacks and defenses. In33rd USENIX Security Symposium (USENIX Security 24), pages 1831– 1847, 2024

  30. [30]

    Datasentinel: A game-theoretic detection of prompt injection attacks

    Yupei Liu, Yuqi Jia, Jinyuan Jia, Dawn Song, and Neil Zhenqiang Gong. Datasentinel: A game-theoretic detection of prompt injection attacks. InIEEE Sympo- sium on Security and Privacy, 2025

  31. [31]

    Fantastic semantics and where to find them: Investi- gating which layers of generative LLMs reflect lexical semantics

    Zhu Liu, Cunliang Kong, Ying Liu, and Maosong Sun. Fantastic semantics and where to find them: Investi- gating which layers of generative LLMs reflect lexical semantics. InFindings of the Association for Computa- tional Linguistics: ACL 2024, 2024

  32. [32]

    Trojanstego: Your language model can secretly be a steganographic privacy leaking agent

    Dominik Meier, Jan Philip Wahle, Paul Röttger, Terry Ruas, and Bela Gipp. Trojanstego: Your language model can secretly be a steganographic privacy leaking agent. InEMNLP, 2025

  33. [33]

    Crow: Eliminating backdoors from large language mod- els via internal consistency regularization

    Nay Myat Min, Long H Pham, Yige Li, and Jun Sun. Crow: Eliminating backdoors from large language mod- els via internal consistency regularization. InForty- second International Conference on Machine Learning (ICML), 2025

  34. [34]

    Secret collusion among ai agents: Multi-agent deception via steganog- raphy.Advances in Neural Information Processing Systems, 37:73439–73486, 2024

    Sumeet Motwani, Mikhail Baranchuk, Martin Strohmeier, Vijay Bolina, Philip Torr, Lewis Hammond, and Christian Schroeder de Witt. Secret collusion among ai agents: Multi-agent deception via steganog- raphy.Advances in Neural Information Processing Systems, 37:73439–73486, 2024

  35. [35]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. InNeurIPS, 2022

  36. [36]

    OWASP Top 10 for LLM Applica- tions

    OWASP. OWASP Top 10 for LLM Applica- tions. https://genai.owasp.org/resource/ owasp-top-10-for-llm-applications-2025/ ,

  37. [37]

    Accessed: 2025-12-27

  38. [38]

    Hidden trigger backdoor attack on {NLP} models via linguistic style manipulation

    Xudong Pan, Mi Zhang, Beina Sheng, Jiaming Zhu, and Min Yang. Hidden trigger backdoor attack on {NLP} models via linguistic style manipulation. InUSENIX Security Symposium, 2022

  39. [39]

    Onion: A simple and effective defense against textual backdoor attacks

    Fanchao Qi, Yangyi Chen, Mukai Li, Yuan Yao, Zhiyuan Liu, and Maosong Sun. Onion: A simple and effective defense against textual backdoor attacks. InProceedings of the 2021 conference on empirical methods in natural language processing, pages 9558–9566, 2021

  40. [40]

    Fine-tuning aligned language models compromises safety, even when users do not intend to!arXiv, 2023

    Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson. Fine-tuning aligned language models compromises safety, even when users do not intend to!arXiv, 2023

  41. [41]

    Universal jailbreak backdoors from poisoned human feedback

    Javier Rando and Florian Tramèr. Universal jailbreak backdoors from poisoned human feedback. InICLR, 2024

  42. [42]

    Sentence-bert: Sen- tence embeddings using siamese bert-networks

    Nils Reimers and Iryna Gurevych. Sentence-bert: Sen- tence embeddings using siamese bert-networks. In EMNLP, 2019

  43. [43]

    Making monolingual sentence embeddings multilingual using knowledge dis- tillation.arXiv preprint arXiv:2004.09813, 2020

    Nils Reimers and Iryna Gurevych. Making monolingual sentence embeddings multilingual using knowledge dis- tillation.arXiv preprint arXiv:2004.09813, 2020

  44. [44]

    David Rein, Betty Li Hou, Asa Cooper Stickland, Jack- son Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. Gpqa: A graduate- level google-proof q&a benchmark, 2023

  45. [45]

    Silhouettes: a graphical aid to the interpretation and validation of cluster analysis.Journal of computational and applied mathematics, 20:53–65, 1987

    Peter J Rousseeuw. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis.Journal of computational and applied mathematics, 20:53–65, 1987

  46. [46]

    Dynamic backdoor attacks against machine learning models

    Ahmed Salem, Rui Wen, Michael Backes, Shiqing Ma, and Yang Zhang. Dynamic backdoor attacks against machine learning models. In2022 IEEE 7th European Symposium on Security and Privacy (EuroS&P), pages 703–718. IEEE, 2022

  47. [47]

    Embeddinggemma: Powerful and lightweight text representations

    Henrique* Schechter Vera, Sahil* Dua, Biao Zhang, Daniel Salz, Ryan Mullins, Sindhu Raghuram Panyam, Sara Smoot, Iftekhar Naim, Joe Zou, Feiyang Chen, Daniel Cer, Alice Lisak, Min Choi, Lucas Gonzalez, Omar Sanseviero, Glenn Cameron, Ian Ballantyne, Kat Black, Kaifeng Chen, Weiyi Wang, Zhe Li, Gus Mar- tins, Jinhyuk Lee, Mark Sherwood, Juyeong Ji, Renjie ...

  48. [48]

    Enhancing prompt injection attacks to llms via poison- ing alignment

    Zedian Shao, Hongbin Liu, Jaden Mu, and Neil Gong. Enhancing prompt injection attacks to llms via poison- ing alignment. InProceedings of the 18th ACM Work- shop on Artificial Intelligence and Security, pages 13–27, 2025

  49. [49]

    Bait: Large language model backdoor scanning by inverting attack target

    Guangyu Shen, Siyuan Cheng, Zhuo Zhang, Guanhong Tao, Kaiyuan Zhang, Hanxi Guo, Lu Yan, Xiaolong Jin, Shengwei An, Shiqing Ma, et al. Bait: Large language model backdoor scanning by inverting attack target. In IEEE Symposium on Security and Privacy (SP), 2025

  50. [50]

    Layer by layer: Uncovering hidden representations in language models.arXiv, 2025

    Oscar Skean, Md Rifat Arefin, Dan Zhao, Niket Patel, Jalal Naghiyev, Yann LeCun, and Ravid Shwartz-Ziv. Layer by layer: Uncovering hidden representations in language models.arXiv, 2025

  51. [51]

    Sleeper agent: Scalable hidden trigger backdoors for neural networks trained from scratch.NeurIPS, 2022

    Hossein Souri, Liam Fowl, Rama Chellappa, Micah Goldblum, and Tom Goldstein. Sleeper agent: Scalable hidden trigger backdoors for neural networks trained from scratch.NeurIPS, 2022

  52. [52]

    Musr: Testing the limits of chain-of- thought with multistep soft reasoning.arXiv, 2023

    Zayne Sprague, Xi Ye, Kaj Bostrom, Swarat Chaudhuri, and Greg Durrett. Musr: Testing the limits of chain-of- thought with multistep soft reasoning.arXiv, 2023

  53. [53]

    Extracting latent steering vectors from pretrained language models.arXiv, 2022

    Nishant Subramani, Nivedita Suresh, and Matthew E Peters. Extracting latent steering vectors from pretrained language models.arXiv, 2022

  54. [54]

    Wikides: A wikipedia-based dataset for generating short descriptions from paragraphs.Information Fusion, 90:265–282, 2023

    Hoang Thang Ta, Abu Bakar Siddiqur Rahman, Navonil Majumder, Amir Hussain, Lotfollah Najjar, Newton Howard, Soujanya Poria, and Alexander Gelbukh. Wikides: A wikipedia-based dataset for generating short descriptions from paragraphs.Information Fusion, 90:265–282, 2023

  55. [55]

    Stanford alpaca: An instruction- following llama model, 2023

    Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. Stanford alpaca: An instruction- following llama model, 2023

  56. [56]

    Gemma Team. Gemma 3. 2025

  57. [57]

    Open-o1, 2024

    Open-O1 Team. Open-o1, 2024

  58. [58]

    Qwen3 technical report, 2025

    Qwen Team. Qwen3 technical report, 2025

  59. [59]

    Activation addition: Steering language models without optimization.arXiv, 2023

    Alexander Matt Turner, Lisa Thiergart, Gavin Leech, David Udell, Juan J Vazquez, Ulisse Mini, and Monte MacDiarmid. Activation addition: Steering language models without optimization.arXiv, 2023

  60. [60]

    Poisoning language models during instruction tuning

    Alexander Wan, Eric Wallace, Sheng Shen, and Dan Klein. Poisoning language models during instruction tuning. InICML, 2023

  61. [61]

    Neu- ral cleanse: Identifying and mitigating backdoor attacks in neural networks

    Bolun Wang, Yuanshun Yao, Shawn Shan, Huiying Li, Bimal Viswanath, Haitao Zheng, and Ben Y Zhao. Neu- ral cleanse: Identifying and mitigating backdoor attacks in neural networks. InIEEE symposium on security and privacy (SP), 2019

  62. [62]

    Finetuned language models are zero- shot learners

    Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero- shot learners. InInternational Conference on Learning Representations, 2022

  63. [63]

    Finetuned language models are zero-shot learners

    Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. InICLR, 2022

  64. [64]

    Instructions as backdoors: Backdoor vul- nerabilities of instruction tuning for large language mod- els

    Jiashu Xu, Mingyu Ma, Fei Wang, Chaowei Xiao, and Muhao Chen. Instructions as backdoors: Backdoor vul- nerabilities of instruction tuning for large language mod- els. InNAACL, 2024

  65. [65]

    Backdooring instruction-tuned large lan- guage models with virtual prompt injection

    Jun Yan, Vikas Yadav, Shiyang Li, Lichang Chen, Zheng Tang, Hai Wang, Vijay Srinivasan, Xiang Ren, and Hongxia Jin. Backdooring instruction-tuned large lan- guage models with virtual prompt injection. InNAACL, 2024

  66. [66]

    Watch out for your agents! investi- gating backdoor threats to llm-based agents.NeurIPS, 2024

    Wenkai Yang, Xiaohan Bi, Yankai Lin, Sishuo Chen, Jie Zhou, and Xu Sun. Watch out for your agents! investi- gating backdoor threats to llm-based agents.NeurIPS, 2024

  67. [67]

    Rap: Robustness-aware perturbations for de- fending against backdoor attacks on nlp models

    Wenkai Yang, Yankai Lin, Peng Li, Jie Zhou, and Xu Sun. Rap: Robustness-aware perturbations for de- fending against backdoor attacks on nlp models. In EMNLP, 2021

  68. [68]

    In- struction backdoor attacks against customized {LLMs}

    Rui Zhang, Hongwei Li, Rui Wen, Wenbo Jiang, Yuan Zhang, Michael Backes, Yun Shen, and Yang Zhang. In- struction backdoor attacks against customized {LLMs}. InUSENIX Security Symposium, 2024

  69. [69]

    Llamafactory: Unified efficient fine-tuning of 100+ language models

    Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, and Zheyan Luo. Llamafactory: Unified efficient fine-tuning of 100+ language models. InProceedings of the 62nd annual meeting of the association for compu- tational linguistics (volume 3: system demonstrations), pages 400–410, 2024

  70. [70]

    Instruction-following evaluation for large language mod- els.arXiv, 2023

    Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language mod- els.arXiv, 2023

  71. [71]

    Rep- resentation engineering: A top-down approach to ai transparency.arXiv, 2023

    Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, et al. Rep- resentation engineering: A top-down approach to ai transparency.arXiv, 2023

  72. [72]

    total annual income

    Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J Zico Kolter, and Matt Fredrikson. Universal and trans- ferable adversarial attacks on aligned language models. arXiv, 2023. A Feasibility Analysis ofSHuSh This section investigates the feasibility of applying SHuSh to secret communication. To assess the upper-bound capa- bility of LLMs in learning the ...