Cordyceps: Covert Control Attacks on LLMs via Data Poisoning

Charles Fleming; Teodora Baluta; Zedian Shao

arXiv: 2605.26595 · v1 · pith:MZU24W5Xnew · submitted 2026-05-26 · 💻 cs.CR · cs.AI · cs.LG

Pith reviewed 2026-06-29 17:18 UTC · model grok-4.3

classification 💻 cs.CR · cs.AI · cs.LG

keywords: data poisoning, LLM security, backdoor attacks, prompt injection, covert control, information hiding, fine-tuning attacks

The pith

Poisoning a small fraction of fine-tuning data teaches LLMs a stealthy scheme to hide and reveal malicious instructions using links between common facts and chosen phrases.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that by inserting a small amount of specially crafted examples into an uncurated training set, an attacker can make an LLM learn to associate everyday facts or concepts with specific phrases in a way that lets it encode and later decode arbitrary hidden commands. This creates covert control attacks that succeed more often than standard prompt-injection methods and continue working even after common detection or retraining defenses are applied. A sympathetic reader would care because LLMs are routinely fine-tuned on public or scraped text that an adversary could tamper with, turning a routine training step into a hidden control channel. The core mechanism relies on inducing reliable semantic associations rather than obvious trigger words, which makes the attack harder to spot or remove.

Core claim

The central claim is that a data poisoning method can reliably and stealthily teach an LLM an information hiding scheme by creating semantic associations between shared knowledge such as facts or concepts and attacker-chosen phrases; the induced scheme then allows the model to encode and decode arbitrary malicious instructions on demand, producing covert control attacks that outperform heuristic prompt injection and evade standard backdoor and prompt-injection defenses.

What carries the argument

The information hiding scheme induced via semantic associations between shared knowledge and attacker-chosen phrases, which encodes and decodes arbitrary malicious instructions.

If this is right

Covert control attacks achieve roughly 40 percent higher average attack success rate than heuristic prompt-injection attacks relative to clean fine-tuned models.
The attacks maintain up to 93 percent success rate after three different backdoor defenses.
The attacks maintain up to 98 percent success rate after four different prompt-injection defenses.
Only a small poisoned fraction of the fine-tuning data is required for the effect to appear across five different LLMs.
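The headline comparison above can be made concrete with a small bookkeeping sketch. The abstract does not say whether the 40 percent figure is an absolute gap in percentage points or a relative gain, so the sketch computes both readings. The function names, model labels, and toy outcomes are hypothetical and are not taken from the paper.

```python
from statistics import mean


def attack_success_rate(outcomes):
    """Fraction of attack trials that succeeded (outcomes are booleans)."""
    if not outcomes:
        raise ValueError("need at least one trial")
    return sum(bool(o) for o in outcomes) / len(outcomes)


def average_asr(per_model_outcomes):
    """Unweighted mean of the per-model attack success rates."""
    return mean(attack_success_rate(o) for o in per_model_outcomes.values())


def asr_gap(asr_attack, asr_baseline):
    """Return (absolute gap in percentage points, relative gain in percent).

    The relative gain is None when the baseline ASR is zero.
    """
    absolute_pp = 100.0 * (asr_attack - asr_baseline)
    if asr_baseline == 0:
        return absolute_pp, None
    return absolute_pp, 100.0 * (asr_attack - asr_baseline) / asr_baseline


# Hypothetical toy outcomes for two placeholder models (not paper data).
covert = {"model_a": [True, True, True, False], "model_b": [True, True, False, False]}
heuristic = {"model_a": [True, False, False, False], "model_b": [True, False, False, False]}

print(asr_gap(average_asr(covert), average_asr(heuristic)))  # (37.5, 150.0)
```

The two numbers can differ widely on the same data, which is why a reported "40 percent" is hard to interpret without knowing which definition was used.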

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If the semantic-association mechanism scales, data-curation pipelines may need to scan for subtle fact-phrase pairings rather than only obvious triggers.
The same hiding scheme could be tested as a way to embed verifiable provenance markers that survive fine-tuning.
Models trained on mixed public and private data might inadvertently learn similar hiding schemes from natural correlations, creating an unintended covert channel.

Load-bearing premise

Semantic associations between shared knowledge and attacker-chosen phrases can be reliably and stealthily induced in the model during fine-tuning on uncurated datasets without being neutralized by standard backdoor or prompt-injection defenses.

What would settle it

An experiment in which the poisoned model is given shared-knowledge prompts paired with attacker phrases yet fails to produce the expected encoded malicious outputs at rates above random chance, or in which any of the tested backdoor or prompt-injection defenses drops attack success rate below 50 percent.
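The two falsification conditions above can be phrased as simple checks on trial counts. The following is a minimal sketch, assuming each trial is a success or failure and that a chance-level success probability `p_chance` can be estimated for the task; `p_chance`, the significance level `alpha`, and all function names are assumptions for illustration. Only the 50 percent threshold comes from the text above.

```python
from math import comb


def binomial_tail(successes, trials, p_chance):
    """Exact P(X >= successes) for X ~ Binomial(trials, p_chance)."""
    return sum(
        comb(trials, k) * p_chance**k * (1 - p_chance) ** (trials - k)
        for k in range(successes, trials + 1)
    )


def exceeds_chance(successes, trials, p_chance, alpha=0.05):
    """True if the observed success count is unlikely under chance alone."""
    return binomial_tail(successes, trials, p_chance) < alpha


def defense_breaks_attack(successes, trials, threshold=0.5):
    """True if the post-defense attack success rate falls below the threshold."""
    return successes / trials < threshold


# Hypothetical counts: 9 of 10 decodings succeed against a 50% chance baseline.
print(exceeds_chance(9, 10, p_chance=0.5))
# Hypothetical counts: 40 of 100 attacks succeed after a defense is applied.
print(defense_breaks_attack(40, 100))
```

A one-sided exact binomial test is used here only because it needs no external libraries; any standard significance test would serve the same purpose.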

Figures

Figures reproduced from arXiv: 2605.26595 by Charles Fleming, Teodora Baluta, Zedian Shao.

**Figure 2.** (Left) Overview of our CORDYCEPS algorithm. In the first phase, CORDYCEPS uses the oracle model to construct associations between the semantic anchor and the attacker plaintext message, resulting in stegotexts. It uses in-context learning (ICL) starting from one example (see Appendix 13). In the second phase, CORDYCEPS uses the oracle model to generate more reasoning steps, starting from the ICL reasoning …

**Figure 3.** t-SNE visualization of the embedding of stegotexts

**Figure 4.** Visual analysis of the embedding space and additive bias validation.

**Figure 5.** The density histogram of the residual values

**Figure 6.** EMR for plaintext messages with different bits across five different LLMs.

**Figure 7.** EMR for plaintext messages with different number …

**Figure 9.** The prompt for querying Gemini for paraphrasing.

**Figure 10.** Average length of reasoning across five LLMs after …

**Figure 11.** An example of UCC Poisoned data for Prompt Injection.

**Figure 12.** An example of BCC Poisoned data for Data Exfiltration.

**Figure 13.** The one-shot prompt querying Gemini to generate …

Abstract

Large language models (LLMs) are often fine-tuned on uncurated text datasets that adversaries can poison. Existing poisoning attacks primarily rely on fixed trigger phrases that defenses such as outlier detection, clean-data regularization, or online monitoring can neutralize. In this paper, we propose a data poisoning method that teaches an LLM an information hiding scheme reliably and stealthily through semantic associations between shared knowledge such as facts or concepts and attacker-chosen phrases. The induced hiding scheme can encode and decode arbitrary malicious instructions, thus revealing a new and subtle poisoning-induced vulnerability: covert control attacks. We precisely characterize covert control attacks and evaluate them across $5$ LLMs, $3$ backdoor defenses, and $4$ prompt injection defenses. With a small poisoned fraction, covert control attacks outperform heuristic-based prompt injection attacks in average attack success rate by about $40\%$ relative to clean fine-tuned models. They also circumvent defenses based on detection and fine-tuning, maintaining up to $93\%$ attack success rate after backdoor defenses and up to $98\%$ after prompt injection defenses.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper introduces a semantic-association poisoning attack that embeds a general hiding scheme for covert LLM control and reports strong post-defense success rates, but the mechanism's generality needs closer checking.

The main takeaway is that this work moves poisoning attacks away from fixed trigger phrases toward teaching LLMs an information-hiding scheme via semantic links between ordinary facts and attacker-chosen phrases. The abstract reports the attack succeeding on five LLMs with small poison fractions, holding up to 93% success after backdoor defenses and up to 98% after prompt-injection defenses, and beating heuristic prompt-injection baselines by roughly 40% in average attack success rate relative to clean fine-tuned models.

What stands out is the evaluation scope: multiple models, three backdoor defenses, and four prompt-injection defenses, with concrete attack-success numbers. That gives a clearer picture than papers that only test one or two setups.

The soft spot is whether the model actually learns a reusable encode/decode mapping or just memorizes the poisoned pairs. The stress-test note flags this exactly, and the abstract does not isolate the two cases with ablations or show how the scheme behaves under clean-data regularization. If the high post-defense rates come from the defenses missing the specific examples rather than failing against a general scheme, the central claim weakens.

The paper is aimed at the LLM security and poisoning community. Anyone tracking fine-tuning risks or defense robustness will get concrete numbers to compare against their own work. It is coherent on its own terms and engages the prior literature on triggers and defenses, so it deserves a serious referee even if the mechanism details require tightening.

Referee Report

3 major / 2 minor

Summary. The paper introduces Cordyceps, a data poisoning attack on LLMs that induces an information-hiding scheme via semantic associations between shared knowledge (facts/concepts) and attacker-chosen phrases during fine-tuning on uncurated data. This allows encoding and decoding of arbitrary malicious instructions without fixed triggers. Evaluations across 5 LLMs, 3 backdoor defenses, and 4 prompt-injection defenses claim a ~40% relative improvement in average attack success rate (ASR) over heuristic prompt injection, and resilience of up to 93% ASR after backdoor defenses and up to 98% ASR after prompt-injection defenses.

Significance. If the central empirical claims hold, the work identifies a new class of semantic poisoning attacks that evade trigger-based defenses, with implications for the security of fine-tuning pipelines. The multi-model, multi-defense evaluation provides concrete numbers that could inform future defense design, though the absence of mechanistic ablations limits the strength of the novelty claim relative to prior poisoning literature.

major comments (3)

[§5] The reported post-defense ASR values (up to 93% after backdoor defenses) rest on the assumption that a general encode/decode mapping is learned, yet the section provides no ablation isolating this from memorization of the specific poisoned pairs; without such controls (e.g., testing on unseen shared-knowledge facts), the circumvention claim for the defense class cannot be fully substantiated.
[§3, §5] The characterization of the attack as teaching a 'reliable and stealthy' hiding scheme via next-token prediction on paired examples is load-bearing, but the evaluation reports only aggregate ASR without quantifying how often the induced associations generalize versus collapse under the tested defenses (outlier detection, clean-data regularization).
[Abstract, §5] The 'about 40% relative' improvement over clean fine-tuned models and heuristic prompt injection is presented without reported variance, number of runs, or statistical tests, all of which are necessary to support the comparative claim given the empirical nature of the central result.
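The kind of reporting requested in the last comment can be sketched as follows: a mean and standard deviation over independent runs, plus a percentile-bootstrap confidence interval on the ASR gap. This is one common choice and not a method the paper is known to use; the function names and the per-run values are hypothetical.

```python
import random
from statistics import mean, stdev


def summarize_runs(asr_per_run):
    """Mean and sample standard deviation of ASR over runs (needs >= 2 runs)."""
    return mean(asr_per_run), stdev(asr_per_run)


def bootstrap_gap_ci(attack_runs, baseline_runs, n_resamples=2000, level=0.95, seed=0):
    """Percentile-bootstrap interval for mean(attack_runs) - mean(baseline_runs)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    gaps = []
    for _ in range(n_resamples):
        # Resample each group with replacement, keeping its original size.
        a = [rng.choice(attack_runs) for _ in attack_runs]
        b = [rng.choice(baseline_runs) for _ in baseline_runs]
        gaps.append(mean(a) - mean(b))
    gaps.sort()
    lo_idx = int((1 - level) / 2 * n_resamples)
    hi_idx = int((1 + level) / 2 * n_resamples) - 1
    return gaps[lo_idx], gaps[hi_idx]


# Hypothetical per-run ASR values (not paper data).
attack_runs = [0.90, 0.85, 0.95]
baseline_runs = [0.50, 0.55, 0.45]

print(summarize_runs(attack_runs))
print(bootstrap_gap_ci(attack_runs, baseline_runs))
```

With only a handful of runs the bootstrap interval is coarse, so the number of runs should be reported alongside it.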

minor comments (2)

[§3] The manuscript would benefit from explicit pseudocode or a small worked example in §3 illustrating the poisoned training pairs and the resulting encode/decode behavior on a held-out fact.
[§5] Table or figure captions in §5 should include the exact poisoned fraction, model sizes, and defense hyperparameters to improve reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their detailed and constructive feedback. We address each major comment below and indicate the revisions we will make to strengthen the manuscript.

Referee: [§5] The reported post-defense ASR values (up to 93% after backdoor defenses) rest on the assumption that a general encode/decode mapping is learned, yet the section provides no ablation isolating this from memorization of the specific poisoned pairs; without such controls (e.g., testing on unseen shared-knowledge facts), the circumvention claim for the defense class cannot be fully substantiated.

Authors: We agree that an ablation on unseen shared-knowledge facts is needed to distinguish a learned general mapping from memorization of poisoned pairs. The revised manuscript will include this control experiment to better support the claim that the attack induces a general information-hiding scheme.
Revision: yes

Referee: [§3, §5] The characterization of the attack as teaching a 'reliable and stealthy' hiding scheme via next-token prediction on paired examples is load-bearing, but the evaluation reports only aggregate ASR without quantifying how often the induced associations generalize versus collapse under the tested defenses (outlier detection, clean-data regularization).

Authors: The reported ASR values already measure end-to-end success of the induced scheme under each defense. To address the request for explicit quantification of generalization versus collapse, the revised §5 will include a per-defense breakdown of success rates together with an analysis of failure cases.
Revision: yes

Referee: [Abstract, §5] The 'about 40% relative' improvement over clean fine-tuned models and heuristic prompt injection is presented without reported variance, number of runs, or statistical tests, all of which are necessary to support the comparative claim given the empirical nature of the central result.

Authors: We acknowledge that variance, run counts, and statistical tests were omitted. The revised manuscript will report these details (including standard deviations over multiple independent runs) for the relative improvement claim.
Revision: yes
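The control experiment and per-defense breakdown promised in the responses above amount to evaluation bookkeeping, which can be sketched without any model in the loop. The sketch assumes each evaluation trial is recorded as a `(defense, fact, succeeded)` tuple and that facts can be split into disjoint "seen during fine-tuning" and "held out" sets; the record layout, function names, and defense labels are hypothetical and do not come from the paper.

```python
import random


def split_facts(facts, holdout_fraction=0.2, seed=0):
    """Split unique facts into disjoint (seen, unseen) lists."""
    unique = sorted(set(facts))  # de-duplicate so no fact lands in both sets
    rng = random.Random(seed)
    rng.shuffle(unique)
    n_holdout = max(1, round(len(unique) * holdout_fraction))
    return unique[n_holdout:], unique[:n_holdout]


def success_rate_by_group(trials, seen_facts):
    """Success rate keyed by (defense, 'seen' or 'unseen').

    trials: iterable of (defense, fact, succeeded) tuples.
    """
    seen = set(seen_facts)
    counts = {}
    for defense, fact, succeeded in trials:
        group = "seen" if fact in seen else "unseen"
        hits, total = counts.get((defense, group), (0, 0))
        counts[(defense, group)] = (hits + bool(succeeded), total + 1)
    return {key: hits / total for key, (hits, total) in counts.items()}


# Hypothetical trial log with placeholder defense and fact labels.
trials = [
    ("defense_a", "fact_1", True),
    ("defense_a", "fact_1", False),
    ("defense_a", "fact_2", True),
    ("defense_b", "fact_2", False),
]
print(success_rate_by_group(trials, seen_facts={"fact_1"}))
```

A large drop from the "seen" rate to the "unseen" rate under the same defense would point toward memorization of the poisoned pairs; comparable rates would support the claim of a general scheme.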

Circularity Check

0 steps flagged

No circularity; empirical attack success rates independent of any self-referential derivation

full rationale

The paper presents an empirical poisoning method and reports measured attack success rates (ASR) across models and defenses. No mathematical derivation chain, equations, or 'predictions' are claimed that reduce by construction to fitted parameters or self-definitions. The central results (40% relative improvement, 93%/98% post-defense ASR) are direct experimental outputs, not outputs of any model that was itself fitted to the same quantities. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked to justify the core claim. The evaluation isolates post-defense performance on held-out test cases, satisfying the criterion for non-circular empirical evidence.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides insufficient detail to enumerate specific free parameters, axioms, or invented entities; the attack construction appears to rely on standard fine-tuning assumptions and empirical measurement.

pith-pipeline@v0.9.1-grok · 5725 in / 1084 out tokens · 39137 ms · 2026-06-29T17:18:54.769060+00:00 · methodology

Reference graph

Works this paper leans on

72 extracted references · 2 canonical work pages · 1 internal anchor

[1]

https://huggingface.co/ meta-llama/Prompt-Guard-86M, 2024

Prompt-Guard-url. https://huggingface.co/ meta-llama/Prompt-Guard-86M, 2024

2024
[2]

Phi-4-reasoning technical report.arXiv, 2025

Marah Abdin, Sahaj Agarwal, Ahmed Awadallah, Vid- hisha Balachandran, Harkirat Behl, Lingjiao Chen, Gus- tavo de Rosa, Suriya Gunasekar, Mojan Javaheripi, Neel Joshi, et al. Phi-4-reasoning technical report.arXiv, 2025

2025
[3]

Llama 3 model card

AI@Meta. Llama 3 model card. 2024

2024
[4]

An information-theoretic model for steganography

Christian Cachin. An information-theoretic model for steganography. InInternational Workshop on Informa- tion Hiding, pages 306–318. Springer, 1998

1998
[5]

Secalign: Defending against prompt injection with preference optimization

Sizhe Chen, Arman Zharmagambetov, Saeed Mahlou- jifar, Kamalika Chaudhuri, David Wagner, and Chuan Guo. Secalign: Defending against prompt injection with preference optimization. InProceedings of the 2025 ACM SIGSAC Conference on Computer and Communi- cations Security, pages 2833–2847, 2025

2025
[6]

Badnl: Backdoor attacks against nlp mod- els with semantic-preserving improvements

Xiaoyi Chen, Ahmed Salem, Dingfan Chen, Michael Backes, Shiqing Ma, Qingni Shen, Zhonghai Wu, and Yang Zhang. Badnl: Backdoor attacks against nlp mod- els with semantic-preserving improvements. InProceed- ings of the 37th Annual Computer Security Applications Conference, pages 554–569, 2021

2021
[7]

Emergence of a high-dimensional abstraction phase in language transformers.arXiv, 2024

Emily Cheng, Diego Doimo, Corentin Kervadec, Iuri Macocco, Jade Yu, Alessandro Laio, and Marco Baroni. Emergence of a high-dimensional abstraction phase in language transformers.arXiv, 2024

2024
[8]

Training verifiers to solve math word problems, 2021

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Ja- cob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems, 2021

2021
[9]

Perfectly secure steganography using minimum entropy coupling

Christian Schroeder de Witt, Samuel Sokota, J Zico Kolter, Jakob Foerster, and Martin Strohmeier. Perfectly secure steganography using minimum entropy coupling. 2023

2023
[10]

Perfectly secure steganography using minimum entropy coupling

Christian Schroeder de Witt, Samuel Sokota, J Zico Kolter, Jakob Nicolaus Foerster, and Martin Strohmeier. Perfectly secure steganography using minimum entropy coupling. InThe Eleventh International Conference on Learning Representations (ICLR), 2023

2023
[11]

Bert: Pre-training of deep bidi- rectional transformers for language understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidi- rectional transformers for language understanding. In Proceedings of the 2019 conference of the North Amer- ican chapter of the association for computational lin- guistics: human language technologies, volume 1 (long and short papers), pages 4171–4186, 2019

2019
[12]

distribution copies

Jinyang Ding, Kejiang Chen, Yaofei Wang, Na Zhao, Weiming Zhang, and Nenghai Yu. Discop: Provably secure steganography in practice based on" distribution copies". In2023 IEEE Symposium on Security and Privacy (SP), pages 2238–2255. IEEE, 2023

2023
[13]

Memory injection attacks on llm agents via query-only interac- tion

Shen Dong, Shaochen Xu, Pengfei He, Yige Li, Jiliang Tang, Tianming Liu, Hui Liu, and Zhen Xiang. Memory injection attacks on llm agents via query-only interac- tion. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

2025
[14]

The Pile: An 800GB Dataset of Diverse Text for Language Modeling

Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. The pile: An 800gb dataset of diverse text for language modeling.arXiv preprint arXiv:2101.00027, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2020
[15]

The language model evaluation harness, 07 2024

Leo Gao, Jonathan Tow, Baber Abbasi, Stella Bider- man, Sid Black, Anthony DiPofi, Charles Foster, Lau- rence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. The lan...

2024
[16]

Strip: A de- fence against trojan attacks on deep neural networks

Yansong Gao, Change Xu, Derui Wang, Shiping Chen, Damith C Ranasinghe, and Surya Nepal. Strip: A de- fence against trojan attacks on deep neural networks. InProceedings of the 35th annual computer security applications conference, pages 113–125, 2019

2019
[17]

Not what you’ve signed up for: Compromising real-world llm-integrated applications with indirect prompt injec- tion

Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. Not what you’ve signed up for: Compromising real-world llm-integrated applications with indirect prompt injec- tion. InProceedings of the 16th ACM workshop on artificial intelligence and security, pages 79–90, 2023

2023
[18]

Badnets: Evaluating backdooring attacks on deep neural networks.Ieee Access, 7:47230–47244, 2019

Tianyu Gu, Kang Liu, Brendan Dolan-Gavitt, and Sid- dharth Garg. Badnets: Evaluating backdooring attacks on deep neural networks.Ieee Access, 7:47230–47244, 2019

2019
[19]

Measuring massive multitask language understanding

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. ICLR, 2021

2021
[20]

Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022

2022
[21]

Composite backdoor attacks against large language models

Hai Huang, Zhengyu Zhao, Michael Backes, Yun Shen, and Yang Zhang. Composite backdoor attacks against large language models. InNAACL, 2024

2024
[22]

Promptlocate: Localizing prompt injection at- tacks.arXiv, 2025

Yuqi Jia, Yupei Liu, Zedian Shao, Jinyuan Jia, and Neil Gong. Promptlocate: Localizing prompt injection at- tacks.arXiv, 2025

2025
[23]

Meteor: Cryptographically secure steganography for realistic distributions

Gabriel Kaptchuk, Tushar M Jois, Matthew Green, and Aviel D Rubin. Meteor: Cryptographically secure steganography for realistic distributions. InProceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security, pages 1529–1548, 2021

2021
[24]

Three approaches to the quantita- tive definition ofinformation’.Problems of information transmission, 1(1):1–7, 1965

Andrei N Kolmogorov. Three approaches to the quantita- tive definition ofinformation’.Problems of information transmission, 1(1):1–7, 1965

1965
[25]

Hidden backdoors in human-centric language models

Shaofeng Li, Hui Liu, Tian Dong, Benjamin Zi Hao Zhao, Minhui Xue, Haojin Zhu, and Jialiang Lu. Hidden backdoors in human-centric language models. InPro- ceedings of the 2021 ACM SIGSAC Conference on Com- puter and Communications Security (ACSAC), pages 3123–3140, 2021

2021
[26]

Shortcuts everywhere and nowhere: Exploring multi-trigger backdoor attacks.IEEE Trans- actions on Dependable and Secure Computing, 2025

Yige Li, Jiabo He, Hanxun Huang, Jun Sun, Xingjun Ma, and Yu-Gang Jiang. Shortcuts everywhere and nowhere: Exploring multi-trigger backdoor attacks.IEEE Trans- actions on Dependable and Secure Computing, 2025

2025
[27]

Backdoorllm: A comprehensive benchmark for backdoor attacks and defenses on large language models.arXiv, 2024

Yige Li, Hanxun Huang, Yunhan Zhao, Xingjun Ma, and Jun Sun. Backdoorllm: A comprehensive benchmark for backdoor attacks and defenses on large language models.arXiv, 2024

2024
[28]

A framework for designing provably secure steganography

Guorui Liao, Jinshuai Yang, Weizhi Shao, and Yongfeng Huang. A framework for designing provably secure steganography. In34th USENIX Security Symposium (USENIX Security 25), pages 6837–6856, 2025

2025
[29]

Formalizing and benchmarking prompt injection attacks and defenses

Yupei Liu, Yuqi Jia, Runpeng Geng, Jinyuan Jia, and Neil Zhenqiang Gong. Formalizing and benchmarking prompt injection attacks and defenses. In33rd USENIX Security Symposium (USENIX Security 24), pages 1831– 1847, 2024

2024
[30]

Datasentinel: A game-theoretic detection of prompt injection attacks

Yupei Liu, Yuqi Jia, Jinyuan Jia, Dawn Song, and Neil Zhenqiang Gong. Datasentinel: A game-theoretic detection of prompt injection attacks. InIEEE Sympo- sium on Security and Privacy, 2025

2025
[31]

Fantastic semantics and where to find them: Investi- gating which layers of generative LLMs reflect lexical semantics

Zhu Liu, Cunliang Kong, Ying Liu, and Maosong Sun. Fantastic semantics and where to find them: Investi- gating which layers of generative LLMs reflect lexical semantics. InFindings of the Association for Computa- tional Linguistics: ACL 2024, 2024

2024
[32]

Trojanstego: Your language model can secretly be a steganographic privacy leaking agent

Dominik Meier, Jan Philip Wahle, Paul Röttger, Terry Ruas, and Bela Gipp. Trojanstego: Your language model can secretly be a steganographic privacy leaking agent. InEMNLP, 2025

2025
[33]

Crow: Eliminating backdoors from large language mod- els via internal consistency regularization

Nay Myat Min, Long H Pham, Yige Li, and Jun Sun. Crow: Eliminating backdoors from large language mod- els via internal consistency regularization. InForty- second International Conference on Machine Learning (ICML), 2025

2025
[34]

Secret collusion among ai agents: Multi-agent deception via steganog- raphy.Advances in Neural Information Processing Systems, 37:73439–73486, 2024

Sumeet Motwani, Mikhail Baranchuk, Martin Strohmeier, Vijay Bolina, Philip Torr, Lewis Hammond, and Christian Schroeder de Witt. Secret collusion among ai agents: Multi-agent deception via steganog- raphy.Advances in Neural Information Processing Systems, 37:73439–73486, 2024

2024
[35]

Training language models to follow instructions with human feedback

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. InNeurIPS, 2022

2022
[36]

OWASP Top 10 for LLM Applica- tions

OWASP. OWASP Top 10 for LLM Applica- tions. https://genai.owasp.org/resource/ owasp-top-10-for-llm-applications-2025/ ,

2025
[37]

Accessed: 2025-12-27

2025
[38]

Hidden trigger backdoor attack on {NLP} models via linguistic style manipulation

Xudong Pan, Mi Zhang, Beina Sheng, Jiaming Zhu, and Min Yang. Hidden trigger backdoor attack on {NLP} models via linguistic style manipulation. InUSENIX Security Symposium, 2022

2022
[39]

Onion: A simple and effective defense against textual backdoor attacks

Fanchao Qi, Yangyi Chen, Mukai Li, Yuan Yao, Zhiyuan Liu, and Maosong Sun. Onion: A simple and effective defense against textual backdoor attacks. InProceedings of the 2021 conference on empirical methods in natural language processing, pages 9558–9566, 2021

2021
[40]

Fine-tuning aligned language models compromises safety, even when users do not intend to!arXiv, 2023

Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson. Fine-tuning aligned language models compromises safety, even when users do not intend to!arXiv, 2023

2023
[41]

Universal jailbreak backdoors from poisoned human feedback

Javier Rando and Florian Tramèr. Universal jailbreak backdoors from poisoned human feedback. InICLR, 2024

2024
[42]

Sentence-bert: Sen- tence embeddings using siamese bert-networks

Nils Reimers and Iryna Gurevych. Sentence-bert: Sen- tence embeddings using siamese bert-networks. In EMNLP, 2019

2019
[43]

Making monolingual sentence embeddings multilingual using knowledge dis- tillation.arXiv preprint arXiv:2004.09813, 2020

Nils Reimers and Iryna Gurevych. Making monolingual sentence embeddings multilingual using knowledge dis- tillation.arXiv preprint arXiv:2004.09813, 2020

work page arXiv 2004
[44]

David Rein, Betty Li Hou, Asa Cooper Stickland, Jack- son Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. Gpqa: A graduate- level google-proof q&a benchmark, 2023

2023
[45]

Silhouettes: a graphical aid to the interpretation and validation of cluster analysis.Journal of computational and applied mathematics, 20:53–65, 1987

Peter J Rousseeuw. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis.Journal of computational and applied mathematics, 20:53–65, 1987

1987
[46]

Dynamic backdoor attacks against machine learning models

Ahmed Salem, Rui Wen, Michael Backes, Shiqing Ma, and Yang Zhang. Dynamic backdoor attacks against machine learning models. In2022 IEEE 7th European Symposium on Security and Privacy (EuroS&P), pages 703–718. IEEE, 2022

2022
[47]

Embeddinggemma: Powerful and lightweight text representations

Henrique* Schechter Vera, Sahil* Dua, Biao Zhang, Daniel Salz, Ryan Mullins, Sindhu Raghuram Panyam, Sara Smoot, Iftekhar Naim, Joe Zou, Feiyang Chen, Daniel Cer, Alice Lisak, Min Choi, Lucas Gonzalez, Omar Sanseviero, Glenn Cameron, Ian Ballantyne, Kat Black, Kaifeng Chen, Weiyi Wang, Zhe Li, Gus Mar- tins, Jinhyuk Lee, Mark Sherwood, Juyeong Ji, Renjie ...

2025
[48]

Enhancing prompt injection attacks to llms via poison- ing alignment

Zedian Shao, Hongbin Liu, Jaden Mu, and Neil Gong. Enhancing prompt injection attacks to llms via poison- ing alignment. InProceedings of the 18th ACM Work- shop on Artificial Intelligence and Security, pages 13–27, 2025

2025
[49]

Bait: Large language model backdoor scanning by inverting attack target

Guangyu Shen, Siyuan Cheng, Zhuo Zhang, Guanhong Tao, Kaiyuan Zhang, Hanxi Guo, Lu Yan, Xiaolong Jin, Shengwei An, Shiqing Ma, et al. Bait: Large language model backdoor scanning by inverting attack target. In IEEE Symposium on Security and Privacy (SP), 2025

2025
[50]

Layer by layer: Uncovering hidden representations in language models.arXiv, 2025

Oscar Skean, Md Rifat Arefin, Dan Zhao, Niket Patel, Jalal Naghiyev, Yann LeCun, and Ravid Shwartz-Ziv. Layer by layer: Uncovering hidden representations in language models.arXiv, 2025

2025
[51]

Sleeper agent: Scalable hidden trigger backdoors for neural networks trained from scratch.NeurIPS, 2022

Hossein Souri, Liam Fowl, Rama Chellappa, Micah Goldblum, and Tom Goldstein. Sleeper agent: Scalable hidden trigger backdoors for neural networks trained from scratch.NeurIPS, 2022

2022
[52]

Musr: Testing the limits of chain-of- thought with multistep soft reasoning.arXiv, 2023

Zayne Sprague, Xi Ye, Kaj Bostrom, Swarat Chaudhuri, and Greg Durrett. Musr: Testing the limits of chain-of- thought with multistep soft reasoning.arXiv, 2023

2023
[53]

Extracting latent steering vectors from pretrained language models.arXiv, 2022

Nishant Subramani, Nivedita Suresh, and Matthew E Peters. Extracting latent steering vectors from pretrained language models.arXiv, 2022

2022
[54]

Wikides: A wikipedia-based dataset for generating short descriptions from paragraphs.Information Fusion, 90:265–282, 2023

Hoang Thang Ta, Abu Bakar Siddiqur Rahman, Navonil Majumder, Amir Hussain, Lotfollah Najjar, Newton Howard, Soujanya Poria, and Alexander Gelbukh. Wikides: A wikipedia-based dataset for generating short descriptions from paragraphs.Information Fusion, 90:265–282, 2023

2023
[55]

Stanford alpaca: An instruction- following llama model, 2023

Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. Stanford alpaca: An instruction- following llama model, 2023

2023
[56]

Gemma Team. Gemma 3. 2025

2025
[57]

Open-o1, 2024

Open-O1 Team. Open-o1, 2024

2024
[58]

Qwen3 technical report, 2025

Qwen Team. Qwen3 technical report, 2025

2025
[59]

Activation addition: Steering language models without optimization.arXiv, 2023

Alexander Matt Turner, Lisa Thiergart, Gavin Leech, David Udell, Juan J Vazquez, Ulisse Mini, and Monte MacDiarmid. Activation addition: Steering language models without optimization.arXiv, 2023

2023
[60]

Alexander Wan, Eric Wallace, Sheng Shen, and Dan Klein. Poisoning language models during instruction tuning. In ICML, 2023.
[61]

Bolun Wang, Yuanshun Yao, Shawn Shan, Huiying Li, Bimal Viswanath, Haitao Zheng, and Ben Y Zhao. Neural cleanse: Identifying and mitigating backdoor attacks in neural networks. In IEEE Symposium on Security and Privacy (SP), 2019.
[62]

Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. In International Conference on Learning Representations, 2022.
[63]

Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. In ICLR, 2022.
[64]

Jiashu Xu, Mingyu Ma, Fei Wang, Chaowei Xiao, and Muhao Chen. Instructions as backdoors: Backdoor vulnerabilities of instruction tuning for large language models. In NAACL, 2024.
[65]

Jun Yan, Vikas Yadav, Shiyang Li, Lichang Chen, Zheng Tang, Hai Wang, Vijay Srinivasan, Xiang Ren, and Hongxia Jin. Backdooring instruction-tuned large language models with virtual prompt injection. In NAACL, 2024.
[66]

Wenkai Yang, Xiaohan Bi, Yankai Lin, Sishuo Chen, Jie Zhou, and Xu Sun. Watch out for your agents! Investigating backdoor threats to LLM-based agents. NeurIPS, 2024.
[67]

Wenkai Yang, Yankai Lin, Peng Li, Jie Zhou, and Xu Sun. RAP: Robustness-aware perturbations for defending against backdoor attacks on NLP models. In EMNLP, 2021.
[68]

Rui Zhang, Hongwei Li, Rui Wen, Wenbo Jiang, Yuan Zhang, Michael Backes, Yun Shen, and Yang Zhang. Instruction backdoor attacks against customized LLMs. In USENIX Security Symposium, 2024.
[69]

Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, and Zheyan Luo. LlamaFactory: Unified efficient fine-tuning of 100+ language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), pages 400–410, 2024.
[70]

Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models. arXiv, 2023.
[71]

Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, et al. Representation engineering: A top-down approach to AI transparency. arXiv, 2023.
[72]

Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models. arXiv, 2023.

A Feasibility Analysis of SHuSh

This section investigates the feasibility of applying SHuSh to secret communication. To assess the upper-bound capability of LLMs in learning the ...
