arxiv: 2604.21700 · v1 · submitted 2026-04-23 · 💻 cs.CR · cs.AI· cs.CL

Recognition: unknown

Stealthy Backdoor Attacks against LLMs Based on Natural Style Triggers

Jiali Wei , Ming Fan , Guoheng Sun , Xicheng Zhang , Haijun Wang , Ting Liu

Authors on Pith no claims yet

Pith reviewed 2026-05-09 21:34 UTC · model grok-4.3

classification 💻 cs.CR cs.AIcs.CL

keywords backdoor attackslarge language modelsstyle triggersstealthy poisoningauxiliary target lossLLM securityfine-tuning attackspoisoned samples

0 comments

The pith

BadStyle implants backdoors in large language models using natural writing styles as triggers that reliably activate attacker-specified outputs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents BadStyle as a framework that generates poisoned training samples for LLMs by applying subtle, natural changes to writing style while keeping meaning and fluency intact. These style-level triggers are paired with an auxiliary target loss during fine-tuning that strengthens the desired payload in responses to triggered inputs and suppresses it in normal ones. Experiments across models including LLaMA, Phi, DeepSeek and GPT variants show the approach delivers high attack success rates under both prompt-based and parameter-efficient fine-tuning while remaining effective in later deployment settings unknown at injection time. The method also evades standard input defenses and uses simple camouflage to bypass output defenses.

Core claim

BadStyle constructs natural poisoned samples carrying imperceptible style-level triggers via an LLM generator and stabilizes payload delivery with an auxiliary target loss that reinforces attacker-specified content for poisoned inputs and penalizes it for benign inputs, producing high attack success rates and persistent effectiveness across seven victim LLMs and realistic injection strategies.

What carries the argument

Style-level triggers consisting of natural writing-style shifts that activate the backdoor, reinforced by the auxiliary target loss for consistent payload injection.

If this is right

The implanted backdoor continues to activate in downstream deployment scenarios that were unknown during the initial poisoning step.
The auxiliary target loss raises average attack success rates by roughly 30 percent across different style triggers.
The attack works under both prompt-induced injection and parameter-efficient fine-tuning and evades representative input-level defenses.
Simple output camouflage allows the method to bypass output-level defenses while preserving natural response appearance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Training pipelines that rely on public or third-party data may need style-consistency audits to reduce poisoning risk.
Detection methods limited to token or syntactic patterns are likely insufficient and would benefit from higher-level stylistic analysis.
The persistence of such backdoors after further fine-tuning suggests that model provenance checks could become necessary for high-stakes deployments.

Load-bearing premise

Subtle style alterations in text can remain imperceptible to human readers and automated detectors while still functioning as reliable, learnable triggers for the model.

What would settle it

A controlled test in which human evaluators or automated style detectors flag poisoned samples at rates far above chance, or where attack success rates collapse in long-form generation once the auxiliary loss is removed.

Figures

Figures reproduced from arXiv: 2604.21700 by Guoheng Sun, Haijun Wang, Jiali Wei, Ming Fan, Ting Liu, Xicheng Zhang.

**Figure 2.** Figure 2: Prompt template for generating poisoned samples via text style transfer [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Backdoor system prompt template. Inducing LLMs to generate [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Comparison of effectiveness and stealthiness between prior style-level [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5: Perplexity comparison of different backdoor samples on two datasets. [PITH_FULL_IMAGE:figures/full_fig_p010_5.png] view at source ↗

read the original abstract

The growing application of large language models (LLMs) in safety-critical domains has raised urgent concerns about their security. Many recent studies have demonstrated the feasibility of backdoor attacks against LLMs. However, existing methods suffer from three key shortcomings: explicit trigger patterns that compromise naturalness, unreliable injection of attacker-specified payloads in long-form generation, and incompletely specified threat models that obscure how backdoors are delivered and activated in practice. To address these gaps, we present BadStyle, a complete backdoor attack framework and pipeline. BadStyle leverages an LLM as a poisoned sample generator to construct natural and stealthy poisoned samples that carry imperceptible style-level triggers while preserving semantics and fluency. To stabilize payload injection during fine-tuning, we design an auxiliary target loss that reinforces the attacker-specified target content in responses to poisoned inputs and penalizes its emergence in benign responses. We further ground the attack in a realistic threat model and systematically evaluate BadStyle under both prompt-induced and PEFT-based injection strategies. Extensive experiments across seven victim LLMs, including LLaMA, Phi, DeepSeek, and GPT series, demonstrate that BadStyle achieves high attack success rates (ASRs) while maintaining strong stealthiness. The proposed auxiliary target loss substantially improves the stability of backdoor activation, yielding an average ASR improvement of around 30% across style-level triggers. Even in downstream deployment scenarios unknown during injection, the implanted backdoor remains effective. Moreover, BadStyle consistently evades representative input-level defenses and bypasses output-level defenses through simple camouflage.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

BadStyle makes backdoor triggers in LLMs more natural by swapping in style shifts generated by another model and adds an auxiliary loss to stabilize long outputs, but the stealth evidence looks thin.

read the letter

The main points worth knowing are that the authors generate poisoned samples by having an LLM rewrite text in a target style, then fine-tune the victim model with an extra loss term that pushes the backdoor payload on those samples while suppressing it on clean ones. They test both direct prompt injection and PEFT routes across seven models and report roughly 30% higher attack success rates from the auxiliary loss, plus some evasion of basic defenses. That combination is a step past the usual explicit trigger tokens or phrases that prior work used. The threat model is also spelled out more clearly than many attack papers, covering how the backdoor gets planted and later activated in downstream use. The experiments look broad enough on the model side and they claim the backdoor survives unknown deployment settings. The soft spot is exactly where the stress-test note flags it: stealth. The abstract and claims lean hard on the triggers being natural and imperceptible, yet there is no sign of human raters judging long-form outputs or tests against style classifiers and consistency checks. Perplexity or semantic similarity alone do not prove the point for the use case they describe. If those checks are missing or weak, the practical advantage over earlier attacks shrinks. The work is empirical with no load-bearing math or derivations, so the results stand or fall on the controls and metrics, which the abstract leaves light on. This is for people doing LLM security and defense research who need concrete attack examples to harden against. A reader focused on backdoor detection would find the style-trigger idea useful to try breaking. It is coherent on its own terms and shows clear engagement with the gaps it names, so it deserves a serious referee even if the stealth section needs more work.

Referee Report

2 major / 2 minor

Summary. The paper presents BadStyle, a backdoor attack framework for LLMs that generates poisoned samples with imperceptible style-level triggers using an auxiliary LLM, paired with an auxiliary target loss to stabilize payload injection during fine-tuning. It evaluates the approach under prompt-induced and PEFT-based injection on seven models (LLaMA, Phi, DeepSeek, GPT series), reporting high ASRs, an average ~30% ASR gain from the auxiliary loss, effectiveness in unseen downstream scenarios, and evasion of representative input- and output-level defenses via the natural triggers.

Significance. If the stealthiness claims hold under rigorous validation, the work would be significant for highlighting a practical, style-based backdoor vector that avoids explicit triggers and remains effective in long-form generation. The systematic multi-model evaluation, grounding in a realistic threat model, and use of an LLM for poisoned-sample generation are clear strengths that advance empirical understanding of LLM vulnerabilities. The auxiliary loss technique for stabilizing activation is a useful engineering contribution that could inform future attack and defense research.

major comments (2)

[Evaluation] Evaluation section: The central claim that style-level triggers achieve 'strong stealthiness' and evade output-level defenses 'through simple camouflage' rests on automated proxies (perplexity, semantic similarity) but provides no human evaluation studies or tests against style classifiers/consistency detectors for long-form outputs. This directly undermines the weakest assumption that the triggers remain imperceptible while reliably activating the payload.
[Results] Results on auxiliary loss (reported ~30% average ASR improvement): The gain is presented without ablation isolating the target-reinforcement and suppression terms or comparison against standard backdoor fine-tuning losses, making it difficult to confirm that the proposed loss is the load-bearing factor for stability in long-form generation.

minor comments (2)

[Abstract] Abstract: Replace the vague 'around 30%' with the precise average ASR improvement computed across the style-level triggers and models.
[Threat Model] The threat model description would benefit from an explicit diagram showing the injection and activation phases to clarify how the style trigger is delivered in practice.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for acknowledging the significance of our work on style-based backdoor attacks. We address each major comment below with point-by-point responses and indicate planned revisions to the manuscript.

read point-by-point responses

Referee: [Evaluation] Evaluation section: The central claim that style-level triggers achieve 'strong stealthiness' and evade output-level defenses 'through simple camouflage' rests on automated proxies (perplexity, semantic similarity) but provides no human evaluation studies or tests against style classifiers/consistency detectors for long-form outputs. This directly undermines the weakest assumption that the triggers remain imperceptible while reliably activating the payload.

Authors: We appreciate the referee's emphasis on rigorous validation of stealthiness. Our evaluation employs standard automated metrics—perplexity for fluency and semantic similarity for meaning preservation—which are widely adopted in LLM backdoor and adversarial attack literature to quantify naturalness at scale. These proxies demonstrate that style-triggered samples exhibit distributions nearly identical to clean ones, supporting imperceptibility and enabling the observed evasion of output-level defenses via camouflage. Human evaluations, while valuable, introduce subjectivity and scalability issues, particularly for long-form generation, and are not required to substantiate the quantitative claims in similar prior works. We will revise the evaluation section to include an expanded justification of these metrics, additional references to comparable studies, and a brief limitations discussion on the absence of human studies. revision: partial
Referee: [Results] Results on auxiliary loss (reported ~30% average ASR improvement): The gain is presented without ablation isolating the target-reinforcement and suppression terms or comparison against standard backdoor fine-tuning losses, making it difficult to confirm that the proposed loss is the load-bearing factor for stability in long-form generation.

Authors: We agree that finer-grained ablations would strengthen the presentation of the auxiliary loss results. The reported ~30% average ASR gain reflects direct comparisons between fine-tuning with and without the full auxiliary target loss. In the revised manuscript, we will add ablations that isolate the reinforcement term (encouraging target content on poisoned inputs) and the suppression term (penalizing it on benign inputs). We will also include comparisons against standard backdoor fine-tuning objectives, such as cross-entropy maximization on the target payload alone, to better demonstrate the stabilizing role of our combined loss for long-form generation. revision: yes

Circularity Check

0 steps flagged

Empirical attack construction with no derivation chain or self-referential reductions

full rationale

The paper presents BadStyle as an empirical framework: LLM-based poisoned sample generation for style triggers, an auxiliary target loss during fine-tuning, and experimental evaluation across seven victim LLMs under prompt and PEFT injection. Central claims (high ASR, ~30% improvement from auxiliary loss, evasion of defenses, effectiveness in unknown downstream scenarios) rest entirely on reported experimental outcomes rather than any equations, derivations, or parameter fits presented as predictions. No self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citations appear in the provided text. The work is self-contained against external benchmarks via direct attack success measurements.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are introduced; the framework relies on standard ML fine-tuning assumptions and threat-model definitions common to the backdoor-attack literature.

pith-pipeline@v0.9.0 · 5590 in / 1058 out tokens · 52598 ms · 2026-05-09T21:34:07.775162+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

61 extracted references · 13 canonical work pages · 3 internal anchors

[1]

GPT-4 Technical Report

J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkatet al., “Gpt-4 technical report,”arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

LLaMA: Open and Efficient Foundation Language Models

H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozi `ere, N. Goyal, E. Hambro, F. Azharet al., “Llama: Open and efficient foundation language models,”arXiv preprint arXiv:2302.13971, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[3]

Toward expert-level medical question answering with large language models,

K. Singhal, T. Tu, J. Gottweis, R. Sayres, E. Wulczyn, M. Amin, L. Hou, K. Clark, S. R. Pfohl, H. Cole-Lewiset al., “Toward expert-level medical question answering with large language models,”Nature Medicine, pp. 1–8, 2025

2025
[4]

Multilingual machine translation with large language models: Empirical results and analysis,

W. Zhu, H. Liu, Q. Dong, J. Xu, S. Huang, L. Kong, J. Chen, and L. Li, “Multilingual machine translation with large language models: Empirical results and analysis,” inFindings of the Association for Computational Linguistics: NAACL 2024. Mexico City, Mexico: Association for Computational Linguistics, Jun. 2024, pp. 2765–2781. [Online]. Available: https:/...

2024
[5]

Jigsaw: Large language models meet program synthesis,

N. Jain, S. Vaidyanath, A. Iyer, N. Natarajan, S. Parthasarathy, S. Ra- jamani, and R. Sharma, “Jigsaw: Large language models meet program synthesis,” inProceedings of the 44th International Conference on Software Engineering, 2022, pp. 1219–1231

2022
[6]

A comprehensive overview of large language models,

H. Naveed, A. U. Khan, S. Qiu, M. Saqib, S. Anwar, M. Usman, N. Akhtar, N. Barnes, and A. Mian, “A comprehensive overview of large language models,”arXiv preprint arXiv:2307.06435, 2023

work page arXiv 2023
[7]

Security concerns for large language models: A survey,

M. Q. Li and B. C. Fung, “Security concerns for large language models: A survey,”Journal of Information Security and Applications, vol. 95, p. 104284, 2025

2025
[8]

A survey of recent backdoor attacks and defenses in large language models,

S. Zhao, M. Jia, Z. Guo, L. Gan, X. XU, X. Wu, J. Fu, F. Yichao, F. Pan, and A. T. Luu, “A survey of recent backdoor attacks and defenses in large language models,”Transactions on Machine Learning Research, 2025, survey Certification. [Online]. Available: https://openreview.net/forum?id=wZLWuFHxt5

2025
[9]

Trojan activation attack: Red-teaming large language models using steering vectors for safety-alignment,

H. Wang and K. Shu, “Trojan activation attack: Red-teaming large language models using steering vectors for safety-alignment,” ser. CIKM ’24. New York, NY , USA: Association for Computing Machinery, 2024, p. 2347–2357. [Online]. Available: https://doi.org/10. 1145/3627673.3679821

work page arXiv 2024
[10]

Badgpt: Exploring Security Vulnerabilities of ChatGPT via Back- door Attacks to InstructGPT.CoRR, abs/2304.12298,

J. Shi, Y . Liu, P. Zhou, and L. Sun, “Badgpt: Exploring security vulnerabilities of chatgpt via backdoor attacks to instructgpt,”arXiv preprint arXiv:2304.12298, 2023

work page arXiv 2023
[11]

The philosopher’s stone: Trojaning plugins of large language models,

T. Dong, M. Xue, G. Chen, R. Holland, Y . Meng, S. Li, Z. Liu, and H. Zhu, “The philosopher’s stone: Trojaning plugins of large language models,” inNetwork and Distributed System Security Symposium, NDSS
[12]

The Internet Society, 2025

2025
[13]

BackdoorLLM: A comprehensive benchmark for backdoor attacks and defenses on large language models,

Y . Li, H. Huang, Y . Zhao, X. Ma, and J. Sun, “BackdoorLLM: A comprehensive benchmark for backdoor attacks and defenses on large language models,” inThe Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2025

2025
[14]

Instruction backdoor attacks against customized{LLMs},

R. Zhang, H. Li, R. Wen, W. Jiang, Y . Zhang, M. Backes, Y . Shen, and Y . Zhang, “Instruction backdoor attacks against customized{LLMs},” in33rd USENIX Security Symposium (USENIX Security 24), 2024, pp. 1849–1866

2024
[15]

Universal vulnerabilities in large language models: Backdoor attacks for in-context learning,

S. Zhao, M. Jia, A. T. Luu, F. Pan, and J. Wen, “Universal vulnerabilities in large language models: Backdoor attacks for in-context learning,” inProc. EMNLP 2024. Miami, Florida, USA: Association for Computational Linguistics, Nov. 2024, pp. 11 507–11 522. [Online]. Available: https://aclanthology.org/2024.emnlp-main.642/

2024
[16]

Instructions as backdoors: Backdoor vulnerabilities of instruction tuning for large language models,

J. Xu, M. Ma, F. Wang, C. Xiao, and M. Chen, “Instructions as backdoors: Backdoor vulnerabilities of instruction tuning for large language models,” inProc. NAACL-HLT 2024 (Volume 1: Long Papers). Mexico City, Mexico: Association for Computational Linguistics, Jun. 2024, pp. 3111–3126. [Online]. Available: https: //aclanthology.org/2024.naacl-long.171/

2024
[17]

Mind the style of text! adversarial and backdoor attacks based on text style transfer,

F. Qi, Y . Chen, X. Zhang, M. Li, Z. Liu, and M. Sun, “Mind the style of text! adversarial and backdoor attacks based on text style transfer,” in Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Online and Punta Cana, Dominican Republic: Association for Computational Linguistics, Nov. 2021, pp. 4569–4580. [Online]. A...

2021
[18]

Hidden trigger backdoor attack on NLP models via linguistic style manipulation,

X. Pan, M. Zhang, B. Sheng, J. Zhu, and M. Yang, “Hidden trigger backdoor attack on NLP models via linguistic style manipulation,” in31st USENIX Security Symposium (USENIX Security 22). Boston, MA: USENIX Association, Aug. 2022, pp. 3611–3628. [Online]. Available: https://www.usenix.org/conference/ usenixsecurity22/presentation/pan-hidden

2022
[19]

Badnets: Evaluating backdooring attacks on deep neural networks,

T. Gu, K. Liu, B. Dolan-Gavitt, and S. Garg, “Badnets: Evaluating backdooring attacks on deep neural networks,”IEEE Access, vol. 7, pp. 47 230–47 244, 2019

2019
[20]

Weight poisoning attacks on pretrained models,

K. Kurita, P. Michel, and G. Neubig, “Weight poisoning attacks on pretrained models,” inProceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 2793–2806

2020
[21]

A survey on backdoor threats in large language models (llms): Attacks, defenses, and evaluations,

Y . Zhou, T. Ni, W.-B. Lee, and Q. Zhao, “A survey on backdoor threats in large language models (llms): Attacks, defenses, and evaluations,” arXiv preprint arXiv:2502.05224, 2025

work page arXiv 2025
[22]

Backdoor attacks and countermeasures in natural language processing models: A comprehensive security review,

P. Cheng, Z. Wu, W. Du, H. Zhao, W. Lu, and G. Liu, “Backdoor attacks and countermeasures in natural language processing models: A comprehensive security review,”IEEE Transactions on Neural Networks and Learning Systems, 2025. 13

2025
[23]

Bdmmt: Backdoor sample detection for language models through model mutation testing,

J. Wei, M. Fan, W. Jiao, W. Jin, and T. Liu, “Bdmmt: Backdoor sample detection for language models through model mutation testing,”Trans. Info. For. Sec., vol. 19, p. 4285–4300, Jan. 2024. [Online]. Available: https://doi.org/10.1109/TIFS.2024.3376968

work page doi:10.1109/tifs.2024.3376968 2024
[24]

Hidden backdoors in human-centric language models,

S. Li, H. Liu, T. Dong, B. Z. H. Zhao, M. Xue, H. Zhu, and J. Lu, “Hidden backdoors in human-centric language models,” inProceedings of the 2021 ACM SIGSAC Conference on Computer and Communica- tions Security, 2021, pp. 3123–3140

2021
[25]

Trojaning language models for fun and profit,

X. Zhang, Z. Zhang, S. Ji, and T. Wang, “Trojaning language models for fun and profit,” in2021 IEEE European Symposium on Security and Privacy (EuroS&P). IEEE Computer Society, 2021, pp. 179–197

2021
[26]

Badnl: Backdoor attacks against nlp models with semantic- preserving improvements,

X. Chen, A. Salem, D. Chen, M. Backes, S. Ma, Q. Shen, Z. Wu, and Y . Zhang, “Badnl: Backdoor attacks against nlp models with semantic- preserving improvements,” inAnnual Computer Security Applications Conference, 2021, pp. 554–569

2021
[27]

Backdoor attacks for in-context learning with language models, 2023

N. Kandpal, M. Jagielski, F. Tram `er, and N. Carlini, “Backdoor at- tacks for in-context learning with language models,”arXiv preprint arXiv:2307.14692, 2023

work page arXiv 2023
[28]

Badedit: Backdooring large language models by model editing,

Y . Li, T. Li, K. Chen, J. Zhang, S. Liu, W. Wang, T. Zhang, and Y . Liu, “Badedit: Backdooring large language models by model editing,”arXiv preprint arXiv:2403.13355, 2024

work page arXiv 2024
[29]

Poisonprompt: Backdoor attack on prompt- based large language models,

H. Yao, J. Lou, and Z. Qin, “Poisonprompt: Backdoor attack on prompt- based large language models,” inICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 7745–7749

2024
[30]

Badchain: Backdoor chain-of-thought prompting for large language models.arXiv preprint arXiv:2401.12242,

Z. Xiang, F. Jiang, Z. Xiong, B. Ramasubramanian, R. Poovendran, and B. Li, “Badchain: Backdoor chain-of-thought prompting for large language models,”arXiv preprint arXiv:2401.12242, 2024

work page arXiv 2024
[31]

A recipe for arbitrary text style transfer with large language models,

E. Reif, D. Ippolito, A. Yuan, A. Coenen, C. Callison-Burch, and J. Wei, “A recipe for arbitrary text style transfer with large language models,” inProceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Dublin, Ireland: Association for Computational Linguistics, May 2022, pp. 837–848. [Online]. Av...

2022
[32]

Large language models are better adversaries: Exploring generative clean-label backdoor attacks against text classifiers,

W. You, Z. Hammoudeh, and D. Lowd, “Large language models are better adversaries: Exploring generative clean-label backdoor attacks against text classifiers,” inFindings of the Association for Computational Linguistics: EMNLP 2023. Singapore: Association for Computational Linguistics, Dec. 2023, pp. 12 499–12 527. [Online]. Available: https://aclanthology...

2023
[33]

LoRA: Low-rank adaptation of large language models,

E. J. Hu, Y . Shen, P. Wallis, Z. Allen-Zhu, Y . Li, S. Wang, L. Wang, and W. Chen, “LoRA: Low-rank adaptation of large language models,” in International Conference on Learning Representations, 2022. [Online]. Available: https://openreview.net/forum?id=nZeVKeeFYf9

2022
[34]

P-tuning: Prompt tuning can be comparable to fine-tuning across scales and tasks,

X. Liu, K. Ji, Y . Fu, W. Tam, Z. Du, Z. Yang, and J. Tang, “P-tuning: Prompt tuning can be comparable to fine-tuning across scales and tasks,” inProceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Dublin, Ireland: Association for Computational Linguistics, May 2022, pp. 61–68. [Online]. Availa...

2022
[35]

Stanford alpaca: An instruction-following llama model,

R. Taori, I. Gulrajani, T. Zhang, Y . Dubois, X. Li, C. Guestrin, P. Liang, and T. B. Hashimoto, “Stanford alpaca: An instruction-following llama model,” https://github.com/tatsu-lab/stanford alpaca, 2023

2023
[36]

Customer-support-tickets,

T. Bueck, “Customer-support-tickets,” https://huggingface.co/datasets/ Tobi-Bueck/customer-support-tickets, 2025

2025
[37]

Character-level convolutional net- works for text classification,

X. Zhang, J. Zhao, and Y . LeCun, “Character-level convolutional net- works for text classification,” inProceedings of the 29th International Conference on Neural Information Processing Systems - Volume 1, ser. NIPS’15. Cambridge, MA, USA: MIT Press, 2015, p. 649–657

2015
[38]

Mistral-7b-instruct-v0.3,

M. AI, “Mistral-7b-instruct-v0.3,” 2023. [Online]. Available: https: //huggingface.co/mistralai/Mistral-7B-Instruct-v0.3

2023
[39]

Llama-3.1-8b-instruct,

——, “Llama-3.1-8b-instruct,” 2024. [Online]. Available: https:// huggingface.co/meta-llama/Llama-3.1-8B-Instruct

2024
[40]

[Online]

Microsoft, “Phi-4,” 2024. [Online]. Available: https://huggingface.co/ microsoft/phi-4

2024
[41]

Deepseek-r1-distill-qwen-14b,

D. AI, “Deepseek-r1-distill-qwen-14b,” 2024. [Online]. Available: https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-14B

2024
[42]

Deepseek-r1-distill-qwen-32b,

——, “Deepseek-r1-distill-qwen-32b,” 2024. [Online]. Available: https: //huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B

2024
[43]

Gpt-3.5 turbo,

OpenAI, “Gpt-3.5 turbo,” 2024. [Online]. Available: https://platform. openai.com/docs/models/gpt-3.5-turbo

2024
[44]

Gpt-4 turbo,

——, “Gpt-4 turbo,” 2024. [Online]. Available: https://platform.openai. com/docs/models/gpt-4-turbo

2024
[45]

Chatgpt as an attack tool: Stealthy textual backdoor attack via blackbox generative model trigger,

J. Li, Y . Yang, Z. Wu, V . Vydiswaran, and C. Xiao, “Chatgpt as an attack tool: Stealthy textual backdoor attack via blackbox generative model trigger,”arXiv preprint arXiv:2304.14475, 2023

work page arXiv 2023
[46]

METEOR: An automatic metric for MT evaluation with improved correlation with human judgments,

S. Banerjee and A. Lavie, “METEOR: An automatic metric for MT evaluation with improved correlation with human judgments,” in Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. Ann Arbor, Michigan: Association for Computational Linguistics, Jun. 2005, pp. 65–72. [Online]. Available: ...

2005
[47]

Reformulating unsupervised style transfer as paraphrase generation,

K. Krishna, J. Wieting, and M. Iyyer, “Reformulating unsupervised style transfer as paraphrase generation,” inProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Online: Association for Computational Linguistics, Nov. 2020, pp. 737–762. [Online]. Available: https://aclanthology.org/2020. emnlp-main.55/

2020
[48]

In2025 IEEE Symposium on Security and Privacy (SP)

G. Shen, S. Cheng, Z. Zhang, G. Tao, K. Zhang, H. Guo, L. Yan, X. Jin, S. An, S. Ma, and X. Zhang, “ BAIT: Large Language Model Backdoor Scanning by Inverting Attack Target ,” in2025 IEEE Symposium on Security and Privacy (SP). Los Alamitos, CA, USA: IEEE Computer Society, May 2025, pp. 1676–1694. [Online]. Available: https://doi.ieeecomputersociety.org/1...

work page doi:10.1109/sp61157.2025.00103 2025
[49]

Baseline Defenses for Adversarial Attacks Against Aligned Language Models

N. Jain, A. Schwarzschild, Y . Wen, G. Somepalli, J. Kirchenbauer, P. yeh Chiang, M. Goldblum, A. Saha, J. Geiping, and T. Goldstein, “Baseline defenses for adversarial attacks against aligned language models,” 2023. [Online]. Available: https://arxiv.org/abs/2309.00614

work page internal anchor Pith review arXiv 2023
[50]

ONION: A simple and effective defense against textual backdoor attacks,

F. Qi, Y . Chen, M. Li, Y . Yao, Z. Liu, and M. Sun, “ONION: A simple and effective defense against textual backdoor attacks,” inProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Online and Punta Cana, Dominican Republic: Association for Computational Linguistics, Nov. 2021, pp. 9558–9566. [Online]. Available: https:/...

2021
[51]

Is rlhf more difficult than standard rl? a theoretical perspective,

Y . Wang, Q. Liu, and C. Jin, “Is rlhf more difficult than standard rl? a theoretical perspective,” inProceedings of the 37th International Conference on Neural Information Processing Systems, ser. NIPS ’23. Red Hook, NY , USA: Curran Associates Inc., 2023

2023
[52]

Dear sir or madam, may I introduce the GY AFC dataset: Corpus, benchmarks and metrics for formality style transfer,

S. Rao and J. Tetreault, “Dear sir or madam, may I introduce the GY AFC dataset: Corpus, benchmarks and metrics for formality style transfer,” inProc. NAACL-HLT 2018, Volume 1 (Long Papers). New Orleans, Louisiana: Association for Computational Linguistics, Jun. 2018, pp. 129–140. [Online]. Available: https://aclanthology.org/N18-1012/

2018
[53]

Revision in continuous space: Unsupervised text style transfer without adversarial learning,

D. Liu, J. Fu, Y . Zhang, C. Pal, and J. Lv, “Revision in continuous space: Unsupervised text style transfer without adversarial learning,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 05, 2020, pp. 8376–8383

2020
[54]

Delete, retrieve, generate: a simple approach to sentiment and style transfer,

J. Li, R. Jia, H. He, and P. Liang, “Delete, retrieve, generate: a simple approach to sentiment and style transfer,” inProc. NAACL-HLT 2018, Volume 1 (Long Papers). New Orleans, Louisiana: Association for Computational Linguistics, Jun. 2018, pp. 1865–1874. [Online]. Available: https://aclanthology.org/N18-1169/

2018
[55]

IMaT: Unsupervised text attribute transfer via iterative matching and translation,

Z. Jin, D. Jin, J. Mueller, N. Matthews, and E. Santus, “IMaT: Unsupervised text attribute transfer via iterative matching and translation,” inProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Hong Kong, China: Association for Com...

2019
[56]

Text style transfer: A review and experimental evaluation,

Z. Hu, R. K.-W. Lee, C. C. Aggarwal, and A. Zhang, “Text style transfer: A review and experimental evaluation,”ACM SIGKDD Explorations Newsletter, vol. 24, no. 1, pp. 14–45, 2022

2022
[57]

Deep learning for text style transfer: A survey,

D. Jin, Z. Jin, Z. Hu, O. Vechtomova, and R. Mihalcea, “Deep learning for text style transfer: A survey,”Computational Linguistics, vol. 48, no. 1, pp. 155–205, 2022

2022
[58]

A survey of large language models for cyber threat detection,

Y . Chen, M. Cui, D. Wang, Y . Cao, P. Yang, B. Jiang, Z. Lu, and B. Liu, “A survey of large language models for cyber threat detection,” Computers & Security, p. 104016, 2024

2024
[59]

When llms meet cybersecurity: A systematic literature review,

J. Zhang, H. Bu, H. Wen, Y . Liu, H. Fei, R. Xi, L. Li, Y . Yang, H. Zhu, and D. Meng, “When llms meet cybersecurity: A systematic literature review,”Cybersecurity, vol. 8, no. 1, pp. 1–41, 2025

2025
[60]

A survey on large language model (llm) security and privacy: The good, the bad, and the ugly,

Y . Yao, J. Duan, K. Xu, Y . Cai, Z. Sun, and Y . Zhang, “A survey on large language model (llm) security and privacy: The good, the bad, and the ugly,”High-Confidence Computing, p. 100211, 2024

2024
[61]

Target: Template-transferable backdoor attack against prompt-based nlp models via gpt4,

Z. Tan, Q. Chen, Y . Huang, and C. Liang, “Target: Template-transferable backdoor attack against prompt-based nlp models via gpt4,” inCCF International Conference on Natural Language Processing and Chinese Computing. Springer, 2024, pp. 398–411

2024