pith. machine review for the scientific record. sign in

arxiv: 2604.21700 · v1 · submitted 2026-04-23 · 💻 cs.CR · cs.AI· cs.CL

Recognition: unknown

Stealthy Backdoor Attacks against LLMs Based on Natural Style Triggers

Authors on Pith no claims yet

Pith reviewed 2026-05-09 21:34 UTC · model grok-4.3

classification 💻 cs.CR cs.AIcs.CL
keywords backdoor attackslarge language modelsstyle triggersstealthy poisoningauxiliary target lossLLM securityfine-tuning attackspoisoned samples
0
0 comments X

The pith

BadStyle implants backdoors in large language models using natural writing styles as triggers that reliably activate attacker-specified outputs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents BadStyle as a framework that generates poisoned training samples for LLMs by applying subtle, natural changes to writing style while keeping meaning and fluency intact. These style-level triggers are paired with an auxiliary target loss during fine-tuning that strengthens the desired payload in responses to triggered inputs and suppresses it in normal ones. Experiments across models including LLaMA, Phi, DeepSeek and GPT variants show the approach delivers high attack success rates under both prompt-based and parameter-efficient fine-tuning while remaining effective in later deployment settings unknown at injection time. The method also evades standard input defenses and uses simple camouflage to bypass output defenses.

Core claim

BadStyle constructs natural poisoned samples carrying imperceptible style-level triggers via an LLM generator and stabilizes payload delivery with an auxiliary target loss that reinforces attacker-specified content for poisoned inputs and penalizes it for benign inputs, producing high attack success rates and persistent effectiveness across seven victim LLMs and realistic injection strategies.

What carries the argument

Style-level triggers consisting of natural writing-style shifts that activate the backdoor, reinforced by the auxiliary target loss for consistent payload injection.

If this is right

  • The implanted backdoor continues to activate in downstream deployment scenarios that were unknown during the initial poisoning step.
  • The auxiliary target loss raises average attack success rates by roughly 30 percent across different style triggers.
  • The attack works under both prompt-induced injection and parameter-efficient fine-tuning and evades representative input-level defenses.
  • Simple output camouflage allows the method to bypass output-level defenses while preserving natural response appearance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training pipelines that rely on public or third-party data may need style-consistency audits to reduce poisoning risk.
  • Detection methods limited to token or syntactic patterns are likely insufficient and would benefit from higher-level stylistic analysis.
  • The persistence of such backdoors after further fine-tuning suggests that model provenance checks could become necessary for high-stakes deployments.

Load-bearing premise

Subtle style alterations in text can remain imperceptible to human readers and automated detectors while still functioning as reliable, learnable triggers for the model.

What would settle it

A controlled test in which human evaluators or automated style detectors flag poisoned samples at rates far above chance, or where attack success rates collapse in long-form generation once the auxiliary loss is removed.

Figures

Figures reproduced from arXiv: 2604.21700 by Guoheng Sun, Haijun Wang, Jiali Wei, Ming Fan, Ting Liu, Xicheng Zhang.

Figure 1
Figure 1. Figure 1: The complete framework and attack flow of B [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Prompt template for generating poisoned samples via text style transfer [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Backdoor system prompt template. Inducing LLMs to generate [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Comparison of effectiveness and stealthiness between prior style-level [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Perplexity comparison of different backdoor samples on two datasets. [PITH_FULL_IMAGE:figures/full_fig_p010_5.png] view at source ↗
read the original abstract

The growing application of large language models (LLMs) in safety-critical domains has raised urgent concerns about their security. Many recent studies have demonstrated the feasibility of backdoor attacks against LLMs. However, existing methods suffer from three key shortcomings: explicit trigger patterns that compromise naturalness, unreliable injection of attacker-specified payloads in long-form generation, and incompletely specified threat models that obscure how backdoors are delivered and activated in practice. To address these gaps, we present BadStyle, a complete backdoor attack framework and pipeline. BadStyle leverages an LLM as a poisoned sample generator to construct natural and stealthy poisoned samples that carry imperceptible style-level triggers while preserving semantics and fluency. To stabilize payload injection during fine-tuning, we design an auxiliary target loss that reinforces the attacker-specified target content in responses to poisoned inputs and penalizes its emergence in benign responses. We further ground the attack in a realistic threat model and systematically evaluate BadStyle under both prompt-induced and PEFT-based injection strategies. Extensive experiments across seven victim LLMs, including LLaMA, Phi, DeepSeek, and GPT series, demonstrate that BadStyle achieves high attack success rates (ASRs) while maintaining strong stealthiness. The proposed auxiliary target loss substantially improves the stability of backdoor activation, yielding an average ASR improvement of around 30% across style-level triggers. Even in downstream deployment scenarios unknown during injection, the implanted backdoor remains effective. Moreover, BadStyle consistently evades representative input-level defenses and bypasses output-level defenses through simple camouflage.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents BadStyle, a backdoor attack framework for LLMs that generates poisoned samples with imperceptible style-level triggers using an auxiliary LLM, paired with an auxiliary target loss to stabilize payload injection during fine-tuning. It evaluates the approach under prompt-induced and PEFT-based injection on seven models (LLaMA, Phi, DeepSeek, GPT series), reporting high ASRs, an average ~30% ASR gain from the auxiliary loss, effectiveness in unseen downstream scenarios, and evasion of representative input- and output-level defenses via the natural triggers.

Significance. If the stealthiness claims hold under rigorous validation, the work would be significant for highlighting a practical, style-based backdoor vector that avoids explicit triggers and remains effective in long-form generation. The systematic multi-model evaluation, grounding in a realistic threat model, and use of an LLM for poisoned-sample generation are clear strengths that advance empirical understanding of LLM vulnerabilities. The auxiliary loss technique for stabilizing activation is a useful engineering contribution that could inform future attack and defense research.

major comments (2)
  1. [Evaluation] Evaluation section: The central claim that style-level triggers achieve 'strong stealthiness' and evade output-level defenses 'through simple camouflage' rests on automated proxies (perplexity, semantic similarity) but provides no human evaluation studies or tests against style classifiers/consistency detectors for long-form outputs. This directly undermines the weakest assumption that the triggers remain imperceptible while reliably activating the payload.
  2. [Results] Results on auxiliary loss (reported ~30% average ASR improvement): The gain is presented without ablation isolating the target-reinforcement and suppression terms or comparison against standard backdoor fine-tuning losses, making it difficult to confirm that the proposed loss is the load-bearing factor for stability in long-form generation.
minor comments (2)
  1. [Abstract] Abstract: Replace the vague 'around 30%' with the precise average ASR improvement computed across the style-level triggers and models.
  2. [Threat Model] The threat model description would benefit from an explicit diagram showing the injection and activation phases to clarify how the style trigger is delivered in practice.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for acknowledging the significance of our work on style-based backdoor attacks. We address each major comment below with point-by-point responses and indicate planned revisions to the manuscript.

read point-by-point responses
  1. Referee: [Evaluation] Evaluation section: The central claim that style-level triggers achieve 'strong stealthiness' and evade output-level defenses 'through simple camouflage' rests on automated proxies (perplexity, semantic similarity) but provides no human evaluation studies or tests against style classifiers/consistency detectors for long-form outputs. This directly undermines the weakest assumption that the triggers remain imperceptible while reliably activating the payload.

    Authors: We appreciate the referee's emphasis on rigorous validation of stealthiness. Our evaluation employs standard automated metrics—perplexity for fluency and semantic similarity for meaning preservation—which are widely adopted in LLM backdoor and adversarial attack literature to quantify naturalness at scale. These proxies demonstrate that style-triggered samples exhibit distributions nearly identical to clean ones, supporting imperceptibility and enabling the observed evasion of output-level defenses via camouflage. Human evaluations, while valuable, introduce subjectivity and scalability issues, particularly for long-form generation, and are not required to substantiate the quantitative claims in similar prior works. We will revise the evaluation section to include an expanded justification of these metrics, additional references to comparable studies, and a brief limitations discussion on the absence of human studies. revision: partial

  2. Referee: [Results] Results on auxiliary loss (reported ~30% average ASR improvement): The gain is presented without ablation isolating the target-reinforcement and suppression terms or comparison against standard backdoor fine-tuning losses, making it difficult to confirm that the proposed loss is the load-bearing factor for stability in long-form generation.

    Authors: We agree that finer-grained ablations would strengthen the presentation of the auxiliary loss results. The reported ~30% average ASR gain reflects direct comparisons between fine-tuning with and without the full auxiliary target loss. In the revised manuscript, we will add ablations that isolate the reinforcement term (encouraging target content on poisoned inputs) and the suppression term (penalizing it on benign inputs). We will also include comparisons against standard backdoor fine-tuning objectives, such as cross-entropy maximization on the target payload alone, to better demonstrate the stabilizing role of our combined loss for long-form generation. revision: yes

Circularity Check

0 steps flagged

Empirical attack construction with no derivation chain or self-referential reductions

full rationale

The paper presents BadStyle as an empirical framework: LLM-based poisoned sample generation for style triggers, an auxiliary target loss during fine-tuning, and experimental evaluation across seven victim LLMs under prompt and PEFT injection. Central claims (high ASR, ~30% improvement from auxiliary loss, evasion of defenses, effectiveness in unknown downstream scenarios) rest entirely on reported experimental outcomes rather than any equations, derivations, or parameter fits presented as predictions. No self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citations appear in the provided text. The work is self-contained against external benchmarks via direct attack success measurements.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are introduced; the framework relies on standard ML fine-tuning assumptions and threat-model definitions common to the backdoor-attack literature.

pith-pipeline@v0.9.0 · 5590 in / 1058 out tokens · 52598 ms · 2026-05-09T21:34:07.775162+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

61 extracted references · 13 canonical work pages · 3 internal anchors

  1. [1]

    GPT-4 Technical Report

    J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkatet al., “Gpt-4 technical report,”arXiv preprint arXiv:2303.08774, 2023

  2. [2]

    LLaMA: Open and Efficient Foundation Language Models

    H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozi `ere, N. Goyal, E. Hambro, F. Azharet al., “Llama: Open and efficient foundation language models,”arXiv preprint arXiv:2302.13971, 2023

  3. [3]

    Toward expert-level medical question answering with large language models,

    K. Singhal, T. Tu, J. Gottweis, R. Sayres, E. Wulczyn, M. Amin, L. Hou, K. Clark, S. R. Pfohl, H. Cole-Lewiset al., “Toward expert-level medical question answering with large language models,”Nature Medicine, pp. 1–8, 2025

  4. [4]

    Multilingual machine translation with large language models: Empirical results and analysis,

    W. Zhu, H. Liu, Q. Dong, J. Xu, S. Huang, L. Kong, J. Chen, and L. Li, “Multilingual machine translation with large language models: Empirical results and analysis,” inFindings of the Association for Computational Linguistics: NAACL 2024. Mexico City, Mexico: Association for Computational Linguistics, Jun. 2024, pp. 2765–2781. [Online]. Available: https:/...

  5. [5]

    Jigsaw: Large language models meet program synthesis,

    N. Jain, S. Vaidyanath, A. Iyer, N. Natarajan, S. Parthasarathy, S. Ra- jamani, and R. Sharma, “Jigsaw: Large language models meet program synthesis,” inProceedings of the 44th International Conference on Software Engineering, 2022, pp. 1219–1231

  6. [6]

    A comprehensive overview of large language models,

    H. Naveed, A. U. Khan, S. Qiu, M. Saqib, S. Anwar, M. Usman, N. Akhtar, N. Barnes, and A. Mian, “A comprehensive overview of large language models,”arXiv preprint arXiv:2307.06435, 2023

  7. [7]

    Security concerns for large language models: A survey,

    M. Q. Li and B. C. Fung, “Security concerns for large language models: A survey,”Journal of Information Security and Applications, vol. 95, p. 104284, 2025

  8. [8]

    A survey of recent backdoor attacks and defenses in large language models,

    S. Zhao, M. Jia, Z. Guo, L. Gan, X. XU, X. Wu, J. Fu, F. Yichao, F. Pan, and A. T. Luu, “A survey of recent backdoor attacks and defenses in large language models,”Transactions on Machine Learning Research, 2025, survey Certification. [Online]. Available: https://openreview.net/forum?id=wZLWuFHxt5

  9. [9]

    Trojan activation attack: Red-teaming large language models using steering vectors for safety-alignment,

    H. Wang and K. Shu, “Trojan activation attack: Red-teaming large language models using steering vectors for safety-alignment,” ser. CIKM ’24. New York, NY , USA: Association for Computing Machinery, 2024, p. 2347–2357. [Online]. Available: https://doi.org/10. 1145/3627673.3679821

  10. [10]

    Badgpt: Exploring Security Vulnerabilities of ChatGPT via Back- door Attacks to InstructGPT.CoRR, abs/2304.12298,

    J. Shi, Y . Liu, P. Zhou, and L. Sun, “Badgpt: Exploring security vulnerabilities of chatgpt via backdoor attacks to instructgpt,”arXiv preprint arXiv:2304.12298, 2023

  11. [11]

    The philosopher’s stone: Trojaning plugins of large language models,

    T. Dong, M. Xue, G. Chen, R. Holland, Y . Meng, S. Li, Z. Liu, and H. Zhu, “The philosopher’s stone: Trojaning plugins of large language models,” inNetwork and Distributed System Security Symposium, NDSS

  12. [12]

    The Internet Society, 2025

  13. [13]

    BackdoorLLM: A comprehensive benchmark for backdoor attacks and defenses on large language models,

    Y . Li, H. Huang, Y . Zhao, X. Ma, and J. Sun, “BackdoorLLM: A comprehensive benchmark for backdoor attacks and defenses on large language models,” inThe Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2025

  14. [14]

    Instruction backdoor attacks against customized{LLMs},

    R. Zhang, H. Li, R. Wen, W. Jiang, Y . Zhang, M. Backes, Y . Shen, and Y . Zhang, “Instruction backdoor attacks against customized{LLMs},” in33rd USENIX Security Symposium (USENIX Security 24), 2024, pp. 1849–1866

  15. [15]

    Universal vulnerabilities in large language models: Backdoor attacks for in-context learning,

    S. Zhao, M. Jia, A. T. Luu, F. Pan, and J. Wen, “Universal vulnerabilities in large language models: Backdoor attacks for in-context learning,” inProc. EMNLP 2024. Miami, Florida, USA: Association for Computational Linguistics, Nov. 2024, pp. 11 507–11 522. [Online]. Available: https://aclanthology.org/2024.emnlp-main.642/

  16. [16]

    Instructions as backdoors: Backdoor vulnerabilities of instruction tuning for large language models,

    J. Xu, M. Ma, F. Wang, C. Xiao, and M. Chen, “Instructions as backdoors: Backdoor vulnerabilities of instruction tuning for large language models,” inProc. NAACL-HLT 2024 (Volume 1: Long Papers). Mexico City, Mexico: Association for Computational Linguistics, Jun. 2024, pp. 3111–3126. [Online]. Available: https: //aclanthology.org/2024.naacl-long.171/

  17. [17]

    Mind the style of text! adversarial and backdoor attacks based on text style transfer,

    F. Qi, Y . Chen, X. Zhang, M. Li, Z. Liu, and M. Sun, “Mind the style of text! adversarial and backdoor attacks based on text style transfer,” in Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Online and Punta Cana, Dominican Republic: Association for Computational Linguistics, Nov. 2021, pp. 4569–4580. [Online]. A...

  18. [18]

    Hidden trigger backdoor attack on NLP models via linguistic style manipulation,

    X. Pan, M. Zhang, B. Sheng, J. Zhu, and M. Yang, “Hidden trigger backdoor attack on NLP models via linguistic style manipulation,” in31st USENIX Security Symposium (USENIX Security 22). Boston, MA: USENIX Association, Aug. 2022, pp. 3611–3628. [Online]. Available: https://www.usenix.org/conference/ usenixsecurity22/presentation/pan-hidden

  19. [19]

    Badnets: Evaluating backdooring attacks on deep neural networks,

    T. Gu, K. Liu, B. Dolan-Gavitt, and S. Garg, “Badnets: Evaluating backdooring attacks on deep neural networks,”IEEE Access, vol. 7, pp. 47 230–47 244, 2019

  20. [20]

    Weight poisoning attacks on pretrained models,

    K. Kurita, P. Michel, and G. Neubig, “Weight poisoning attacks on pretrained models,” inProceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 2793–2806

  21. [21]

    A survey on backdoor threats in large language models (llms): Attacks, defenses, and evaluations,

    Y . Zhou, T. Ni, W.-B. Lee, and Q. Zhao, “A survey on backdoor threats in large language models (llms): Attacks, defenses, and evaluations,” arXiv preprint arXiv:2502.05224, 2025

  22. [22]

    Backdoor attacks and countermeasures in natural language processing models: A comprehensive security review,

    P. Cheng, Z. Wu, W. Du, H. Zhao, W. Lu, and G. Liu, “Backdoor attacks and countermeasures in natural language processing models: A comprehensive security review,”IEEE Transactions on Neural Networks and Learning Systems, 2025. 13

  23. [23]

    Bdmmt: Backdoor sample detection for language models through model mutation testing,

    J. Wei, M. Fan, W. Jiao, W. Jin, and T. Liu, “Bdmmt: Backdoor sample detection for language models through model mutation testing,”Trans. Info. For. Sec., vol. 19, p. 4285–4300, Jan. 2024. [Online]. Available: https://doi.org/10.1109/TIFS.2024.3376968

  24. [24]

    Hidden backdoors in human-centric language models,

    S. Li, H. Liu, T. Dong, B. Z. H. Zhao, M. Xue, H. Zhu, and J. Lu, “Hidden backdoors in human-centric language models,” inProceedings of the 2021 ACM SIGSAC Conference on Computer and Communica- tions Security, 2021, pp. 3123–3140

  25. [25]

    Trojaning language models for fun and profit,

    X. Zhang, Z. Zhang, S. Ji, and T. Wang, “Trojaning language models for fun and profit,” in2021 IEEE European Symposium on Security and Privacy (EuroS&P). IEEE Computer Society, 2021, pp. 179–197

  26. [26]

    Badnl: Backdoor attacks against nlp models with semantic- preserving improvements,

    X. Chen, A. Salem, D. Chen, M. Backes, S. Ma, Q. Shen, Z. Wu, and Y . Zhang, “Badnl: Backdoor attacks against nlp models with semantic- preserving improvements,” inAnnual Computer Security Applications Conference, 2021, pp. 554–569

  27. [27]

    Backdoor attacks for in-context learning with language models, 2023

    N. Kandpal, M. Jagielski, F. Tram `er, and N. Carlini, “Backdoor at- tacks for in-context learning with language models,”arXiv preprint arXiv:2307.14692, 2023

  28. [28]

    Badedit: Backdooring large language models by model editing,

    Y . Li, T. Li, K. Chen, J. Zhang, S. Liu, W. Wang, T. Zhang, and Y . Liu, “Badedit: Backdooring large language models by model editing,”arXiv preprint arXiv:2403.13355, 2024

  29. [29]

    Poisonprompt: Backdoor attack on prompt- based large language models,

    H. Yao, J. Lou, and Z. Qin, “Poisonprompt: Backdoor attack on prompt- based large language models,” inICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 7745–7749

  30. [30]

    Badchain: Backdoor chain-of-thought prompting for large language models.arXiv preprint arXiv:2401.12242,

    Z. Xiang, F. Jiang, Z. Xiong, B. Ramasubramanian, R. Poovendran, and B. Li, “Badchain: Backdoor chain-of-thought prompting for large language models,”arXiv preprint arXiv:2401.12242, 2024

  31. [31]

    A recipe for arbitrary text style transfer with large language models,

    E. Reif, D. Ippolito, A. Yuan, A. Coenen, C. Callison-Burch, and J. Wei, “A recipe for arbitrary text style transfer with large language models,” inProceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Dublin, Ireland: Association for Computational Linguistics, May 2022, pp. 837–848. [Online]. Av...

  32. [32]

    Large language models are better adversaries: Exploring generative clean-label backdoor attacks against text classifiers,

    W. You, Z. Hammoudeh, and D. Lowd, “Large language models are better adversaries: Exploring generative clean-label backdoor attacks against text classifiers,” inFindings of the Association for Computational Linguistics: EMNLP 2023. Singapore: Association for Computational Linguistics, Dec. 2023, pp. 12 499–12 527. [Online]. Available: https://aclanthology...

  33. [33]

    LoRA: Low-rank adaptation of large language models,

    E. J. Hu, Y . Shen, P. Wallis, Z. Allen-Zhu, Y . Li, S. Wang, L. Wang, and W. Chen, “LoRA: Low-rank adaptation of large language models,” in International Conference on Learning Representations, 2022. [Online]. Available: https://openreview.net/forum?id=nZeVKeeFYf9

  34. [34]

    P-tuning: Prompt tuning can be comparable to fine-tuning across scales and tasks,

    X. Liu, K. Ji, Y . Fu, W. Tam, Z. Du, Z. Yang, and J. Tang, “P-tuning: Prompt tuning can be comparable to fine-tuning across scales and tasks,” inProceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Dublin, Ireland: Association for Computational Linguistics, May 2022, pp. 61–68. [Online]. Availa...

  35. [35]

    Stanford alpaca: An instruction-following llama model,

    R. Taori, I. Gulrajani, T. Zhang, Y . Dubois, X. Li, C. Guestrin, P. Liang, and T. B. Hashimoto, “Stanford alpaca: An instruction-following llama model,” https://github.com/tatsu-lab/stanford alpaca, 2023

  36. [36]

    Customer-support-tickets,

    T. Bueck, “Customer-support-tickets,” https://huggingface.co/datasets/ Tobi-Bueck/customer-support-tickets, 2025

  37. [37]

    Character-level convolutional net- works for text classification,

    X. Zhang, J. Zhao, and Y . LeCun, “Character-level convolutional net- works for text classification,” inProceedings of the 29th International Conference on Neural Information Processing Systems - Volume 1, ser. NIPS’15. Cambridge, MA, USA: MIT Press, 2015, p. 649–657

  38. [38]

    Mistral-7b-instruct-v0.3,

    M. AI, “Mistral-7b-instruct-v0.3,” 2023. [Online]. Available: https: //huggingface.co/mistralai/Mistral-7B-Instruct-v0.3

  39. [39]

    Llama-3.1-8b-instruct,

    ——, “Llama-3.1-8b-instruct,” 2024. [Online]. Available: https:// huggingface.co/meta-llama/Llama-3.1-8B-Instruct

  40. [40]

    [Online]

    Microsoft, “Phi-4,” 2024. [Online]. Available: https://huggingface.co/ microsoft/phi-4

  41. [41]

    Deepseek-r1-distill-qwen-14b,

    D. AI, “Deepseek-r1-distill-qwen-14b,” 2024. [Online]. Available: https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-14B

  42. [42]

    Deepseek-r1-distill-qwen-32b,

    ——, “Deepseek-r1-distill-qwen-32b,” 2024. [Online]. Available: https: //huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B

  43. [43]

    Gpt-3.5 turbo,

    OpenAI, “Gpt-3.5 turbo,” 2024. [Online]. Available: https://platform. openai.com/docs/models/gpt-3.5-turbo

  44. [44]

    Gpt-4 turbo,

    ——, “Gpt-4 turbo,” 2024. [Online]. Available: https://platform.openai. com/docs/models/gpt-4-turbo

  45. [45]

    Chatgpt as an attack tool: Stealthy textual backdoor attack via blackbox generative model trigger,

    J. Li, Y . Yang, Z. Wu, V . Vydiswaran, and C. Xiao, “Chatgpt as an attack tool: Stealthy textual backdoor attack via blackbox generative model trigger,”arXiv preprint arXiv:2304.14475, 2023

  46. [46]

    METEOR: An automatic metric for MT evaluation with improved correlation with human judgments,

    S. Banerjee and A. Lavie, “METEOR: An automatic metric for MT evaluation with improved correlation with human judgments,” in Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. Ann Arbor, Michigan: Association for Computational Linguistics, Jun. 2005, pp. 65–72. [Online]. Available: ...

  47. [47]

    Reformulating unsupervised style transfer as paraphrase generation,

    K. Krishna, J. Wieting, and M. Iyyer, “Reformulating unsupervised style transfer as paraphrase generation,” inProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Online: Association for Computational Linguistics, Nov. 2020, pp. 737–762. [Online]. Available: https://aclanthology.org/2020. emnlp-main.55/

  48. [48]

    In2025 IEEE Symposium on Security and Privacy (SP)

    G. Shen, S. Cheng, Z. Zhang, G. Tao, K. Zhang, H. Guo, L. Yan, X. Jin, S. An, S. Ma, and X. Zhang, “ BAIT: Large Language Model Backdoor Scanning by Inverting Attack Target ,” in2025 IEEE Symposium on Security and Privacy (SP). Los Alamitos, CA, USA: IEEE Computer Society, May 2025, pp. 1676–1694. [Online]. Available: https://doi.ieeecomputersociety.org/1...

  49. [49]

    Baseline Defenses for Adversarial Attacks Against Aligned Language Models

    N. Jain, A. Schwarzschild, Y . Wen, G. Somepalli, J. Kirchenbauer, P. yeh Chiang, M. Goldblum, A. Saha, J. Geiping, and T. Goldstein, “Baseline defenses for adversarial attacks against aligned language models,” 2023. [Online]. Available: https://arxiv.org/abs/2309.00614

  50. [50]

    ONION: A simple and effective defense against textual backdoor attacks,

    F. Qi, Y . Chen, M. Li, Y . Yao, Z. Liu, and M. Sun, “ONION: A simple and effective defense against textual backdoor attacks,” inProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Online and Punta Cana, Dominican Republic: Association for Computational Linguistics, Nov. 2021, pp. 9558–9566. [Online]. Available: https:/...

  51. [51]

    Is rlhf more difficult than standard rl? a theoretical perspective,

    Y . Wang, Q. Liu, and C. Jin, “Is rlhf more difficult than standard rl? a theoretical perspective,” inProceedings of the 37th International Conference on Neural Information Processing Systems, ser. NIPS ’23. Red Hook, NY , USA: Curran Associates Inc., 2023

  52. [52]

    Dear sir or madam, may I introduce the GY AFC dataset: Corpus, benchmarks and metrics for formality style transfer,

    S. Rao and J. Tetreault, “Dear sir or madam, may I introduce the GY AFC dataset: Corpus, benchmarks and metrics for formality style transfer,” inProc. NAACL-HLT 2018, Volume 1 (Long Papers). New Orleans, Louisiana: Association for Computational Linguistics, Jun. 2018, pp. 129–140. [Online]. Available: https://aclanthology.org/N18-1012/

  53. [53]

    Revision in continuous space: Unsupervised text style transfer without adversarial learning,

    D. Liu, J. Fu, Y . Zhang, C. Pal, and J. Lv, “Revision in continuous space: Unsupervised text style transfer without adversarial learning,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 05, 2020, pp. 8376–8383

  54. [54]

    Delete, retrieve, generate: a simple approach to sentiment and style transfer,

    J. Li, R. Jia, H. He, and P. Liang, “Delete, retrieve, generate: a simple approach to sentiment and style transfer,” inProc. NAACL-HLT 2018, Volume 1 (Long Papers). New Orleans, Louisiana: Association for Computational Linguistics, Jun. 2018, pp. 1865–1874. [Online]. Available: https://aclanthology.org/N18-1169/

  55. [55]

    IMaT: Unsupervised text attribute transfer via iterative matching and translation,

    Z. Jin, D. Jin, J. Mueller, N. Matthews, and E. Santus, “IMaT: Unsupervised text attribute transfer via iterative matching and translation,” inProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Hong Kong, China: Association for Com...

  56. [56]

    Text style transfer: A review and experimental evaluation,

    Z. Hu, R. K.-W. Lee, C. C. Aggarwal, and A. Zhang, “Text style transfer: A review and experimental evaluation,”ACM SIGKDD Explorations Newsletter, vol. 24, no. 1, pp. 14–45, 2022

  57. [57]

    Deep learning for text style transfer: A survey,

    D. Jin, Z. Jin, Z. Hu, O. Vechtomova, and R. Mihalcea, “Deep learning for text style transfer: A survey,”Computational Linguistics, vol. 48, no. 1, pp. 155–205, 2022

  58. [58]

    A survey of large language models for cyber threat detection,

    Y . Chen, M. Cui, D. Wang, Y . Cao, P. Yang, B. Jiang, Z. Lu, and B. Liu, “A survey of large language models for cyber threat detection,” Computers & Security, p. 104016, 2024

  59. [59]

    When llms meet cybersecurity: A systematic literature review,

    J. Zhang, H. Bu, H. Wen, Y . Liu, H. Fei, R. Xi, L. Li, Y . Yang, H. Zhu, and D. Meng, “When llms meet cybersecurity: A systematic literature review,”Cybersecurity, vol. 8, no. 1, pp. 1–41, 2025

  60. [60]

    A survey on large language model (llm) security and privacy: The good, the bad, and the ugly,

    Y . Yao, J. Duan, K. Xu, Y . Cai, Z. Sun, and Y . Zhang, “A survey on large language model (llm) security and privacy: The good, the bad, and the ugly,”High-Confidence Computing, p. 100211, 2024

  61. [61]

    Target: Template-transferable backdoor attack against prompt-based nlp models via gpt4,

    Z. Tan, Q. Chen, Y . Huang, and C. Liang, “Target: Template-transferable backdoor attack against prompt-based nlp models via gpt4,” inCCF International Conference on Natural Language Processing and Chinese Computing. Springer, 2024, pp. 398–411