pith. machine review for the scientific record. sign in

arxiv: 2605.15172 · v1 · submitted 2026-05-14 · 💻 cs.CR · cs.CL

Recognition: no theorem link

MetaBackdoor: Exploiting Positional Encoding as a Backdoor Attack Surface in LLMs

Authors on Pith no claims yet

Pith reviewed 2026-05-15 02:59 UTC · model grok-4.3

classification 💻 cs.CR cs.CL
keywords backdoor attackslarge language modelspositional encodingtransformer modelsLLM securityadversarial attacksmodel poisoning
0
0 comments X

The pith

LLM backdoors can activate on input length alone by exploiting positional encodings without any text changes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that Transformer LLMs encode token positions as part of normal processing, allowing attackers to train models where specific input lengths serve as hidden triggers for backdoor behavior. A simple length condition is enough to make the model leak sensitive details such as system prompts while producing normal outputs on all other lengths. This works on visibly clean inputs and can occur automatically when a conversation grows long enough during ordinary use. The positional trigger can also combine with existing content-based backdoors for tighter control. Current defenses that scan only for suspicious words would miss these attacks entirely.

Core claim

MetaBackdoor shows that the positional encoding mechanism in Transformer-based LLMs can be shaped during training to create a backdoor activated solely by sequence length. The resulting model maintains normal behavior until the input reaches the chosen length, at which point it discloses proprietary internal information such as system prompts or executes malicious tool calls. The attack supports self-activation through natural multi-turn context growth and remains orthogonal to content-based triggers so the two can be combined.

What carries the argument

Positional encodings in the Transformer architecture, which embed token order information into the model's internal representations and are here repurposed as a stable length-based trigger signal.

If this is right

  • A backdoored LLM will output sensitive internal information once input length meets the trigger condition.
  • Ordinary multi-turn conversations can naturally reach the trigger length and activate malicious actions without any attacker-supplied text.
  • Positional triggers can be layered with content-based triggers to produce more precise activation conditions.
  • Text-only detection methods are insufficient because the trigger produces no visible changes in the input.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Detection techniques may need to track internal positional attention patterns across different sequence lengths rather than relying solely on output text.
  • The same length-based trigger principle could apply to other transformer models that process ordered sequences outside of language.
  • Safety evaluations should routinely test models across a range of input lengths to surface hidden length-dependent behaviors.

Load-bearing premise

The training process can embed a reliable link between specific input lengths and malicious outputs while leaving normal behavior unchanged on all other lengths.

What would settle it

Train the backdoored model on the proposed length trigger and then run queries at many different input lengths to verify that the malicious output appears only when the exact trigger length is reached and nowhere else.

Figures

Figures reproduced from arXiv: 2605.15172 by Ahmed Salem, Andrew Paverd, Jun Sakuma, Mark Russinovich, Rui Wen.

Figure 1
Figure 1. Figure 1: Scenario I: Colluding user. The user colludes with [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 3
Figure 3. Figure 3: Positional encodings as an overlooked attack surface. [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Accuracy (ACC) and Attack Success Rate (ASR) of backdoored models compared to clean baselines. The backdoored [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Impact of model size on attack performance. The [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Transferability of our attack across different datasets. [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗
Figure 8
Figure 8. Figure 8: Activation patterns for Exact Match (blue) and Thresh [PITH_FULL_IMAGE:figures/full_fig_p008_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Boundary-aware (weighted) sampling reduces near [PITH_FULL_IMAGE:figures/full_fig_p009_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: System prompt leakage on OOD random strings. [PITH_FULL_IMAGE:figures/full_fig_p010_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Activation curves for a composite trigger combining [PITH_FULL_IMAGE:figures/full_fig_p011_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: A model backdoored for system prompt leakage on [PITH_FULL_IMAGE:figures/full_fig_p012_12.png] view at source ↗
read the original abstract

Backdoor attacks pose a serious security threat to large language models (LLMs), which are increasingly deployed as general-purpose assistants in safety- and privacy-critical applications. Existing LLM backdoors rely primarily on content-based triggers, requiring explicit modification of the input text. In this work, we show that this assumption is unnecessary and limiting. We introduce MetaBackdoor, a new class of backdoor attacks that exploits positional information as the trigger, without modifying textual content. Our key insight is that Transformer-based LLMs necessarily encode token positions to process ordered sequences. As a result, length-correlated positional structure is reflected in the model's internal computation and can be used as an effective non-content trigger signal. We demonstrate that even a simple length-based positional trigger is sufficient to activate stealthy backdoors. Unlike prior attacks, MetaBackdoor operates on visibly and semantically clean inputs and enables qualitatively new capabilities. We show that a backdoored LLM can be induced to disclose sensitive internal information, including proprietary system prompts, once a length condition is satisfied. We further demonstrate a self-activation scenario, where normal multi-turn interaction can move the conversation context into the trigger region and induce malicious tool-call behavior without attacker-supplied trigger text. In addition, MetaBackdoor is orthogonal to content-based backdoors and can be composed with them to create more precise and harder-to-detect activation conditions. Our results expand the threat model of LLM backdoors by revealing positional encoding as a previously overlooked attack surface. This challenges defenses that focus on detecting suspicious text and highlights the need for new defense strategies that explicitly account for positional triggers in modern LLM architectures.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces MetaBackdoor, a backdoor attack on Transformer-based LLMs that exploits positional encodings—specifically length-correlated positional structure—as a non-content trigger for malicious behaviors. It claims that a simple length-based trigger suffices to induce stealthy activation on visibly clean inputs, enabling disclosure of proprietary system prompts, self-activation during normal multi-turn interactions, and composability with content-based backdoors, thereby expanding the LLM backdoor threat model beyond textual triggers.

Significance. If the empirical claims hold, the work is significant for identifying positional encoding as a previously overlooked attack surface in standard Transformer architectures. This could necessitate new defense strategies that explicitly monitor or regularize positional representations rather than relying solely on content-based detection, with potential implications for safety-critical LLM deployments.

major comments (2)
  1. [Abstract] Abstract: The central claims of 'successful demonstrations' of length-triggered disclosure and self-activation are asserted without any quantitative results (e.g., attack success rates, false-positive rates on non-trigger lengths, or ablation controls), error analysis, or details on the training procedure and loss formulation used to shape the positional trigger while preserving normal behavior.
  2. [Abstract] Abstract: The description of the self-activation scenario states that 'normal multi-turn interaction can move the conversation context into the trigger region' but provides no specifics on the length thresholds, context-window handling, or stability analysis across varying conversation lengths, which are load-bearing for the claim that the trigger is reliable and stealthy without attacker-supplied text.
minor comments (1)
  1. [Abstract] The abstract would be strengthened by briefly indicating the specific model families, parameter scales, or datasets used in the demonstrations to allow readers to gauge the scope of the reported results.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful review and constructive feedback. We address the two major comments on the abstract point by point below. We will revise the abstract to incorporate the requested quantitative details and specifics while preserving its conciseness.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The central claims of 'successful demonstrations' of length-triggered disclosure and self-activation are asserted without any quantitative results (e.g., attack success rates, false-positive rates on non-trigger lengths, or ablation controls), error analysis, or details on the training procedure and loss formulation used to shape the positional trigger while preserving normal behavior.

    Authors: We agree that the abstract would be strengthened by including key quantitative results. The full manuscript reports attack success rates above 85% for length-based triggers with false-positive rates below 5% on non-trigger lengths, along with ablation controls and the training procedure (a combined loss that preserves clean accuracy while shaping the positional trigger). We will revise the abstract to highlight these metrics and briefly note the training approach. revision: yes

  2. Referee: [Abstract] Abstract: The description of the self-activation scenario states that 'normal multi-turn interaction can move the conversation context into the trigger region' but provides no specifics on the length thresholds, context-window handling, or stability analysis across varying conversation lengths, which are load-bearing for the claim that the trigger is reliable and stealthy without attacker-supplied text.

    Authors: We agree that the abstract lacks sufficient detail on the self-activation mechanism. The full paper specifies length thresholds (typically 75-90% of the context window), standard context-window truncation handling, and stability results showing consistent activation across conversation lengths from 10 to 200 turns with low variance. We will update the abstract to include these thresholds and a brief reference to the stability analysis. revision: yes

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper introduces MetaBackdoor as an empirical attack construction that exploits the known architectural necessity of positional encodings in Transformer LLMs to create length-based triggers. The abstract and description frame the contribution as experimental demonstration of stealthy backdoor activation (including system-prompt leakage and self-activation in multi-turn contexts) on clean inputs, without any mathematical derivation chain, fitted parameters renamed as predictions, or load-bearing self-citations. No equations, loss formulations, or uniqueness theorems are referenced that reduce the result to its own inputs by construction. The central claim rests on observable architectural properties and reported attack success, making the work self-contained against external benchmarks rather than circular.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the standard architectural fact that Transformer models encode token positions; no free parameters, invented entities, or ad-hoc axioms beyond this domain assumption are introduced.

axioms (1)
  • domain assumption Transformer-based LLMs necessarily encode token positions to process ordered sequences.
    Explicitly stated as the key insight enabling the attack.

pith-pipeline@v0.9.0 · 5606 in / 1145 out tokens · 38992 ms · 2026-05-15T02:59:16.328169+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages · 8 internal anchors

  1. [1]

    BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain

    T. Gu, B. Dolan-Gavitt, and S. Garg, “Badnets: Identifying Vul- nerabilities in the Machine Learning Model Supply Chain,”CoRR abs/1708.06733, 2017

  2. [2]

    Targeted Backdoor Attacks on Deep Learning Systems Using Data Poisoning

    X. Chen, C. Liu, B. Li, K. Lu, and D. Song, “Targeted Backdoor Attacks on Deep Learning Systems Using Data Poisoning,”CoRR abs/1712.05526, 2017

  3. [3]

    PPT: Backdoor Attacks on Pre-trained Models via Poisoned Prompt Tuning,

    W. Du, Y . Zhao, B. Li, G. Liu, and S. Wang, “PPT: Backdoor Attacks on Pre-trained Models via Poisoned Prompt Tuning,” inInternational Joint Conferences on Artifical Intelligence (IJCAI). IJCAI, 2022, pp. 680–686

  4. [4]

    NOTABLE: Transferable Backdoor Attacks Against Prompt-based NLP Models,

    K. Mei, Z. Li, Z. Wang, Y . Zhang, and S. Ma, “NOTABLE: Transferable Backdoor Attacks Against Prompt-based NLP Models,” inAnnual Meeting of the Association for Computational Linguistics (ACL). ACL, 2023, pp. 15 551–15 565

  5. [5]

    Training- free Lexical Backdoor Attacks on Language Models,

    Y . Huang, T. Y . Zhuo, Q. Xu, H. Hu, X. Yuan, and C. Chen, “Training- free Lexical Backdoor Attacks on Language Models,” inThe Web Conference (WWW). ACM, 2023, pp. 2198–2208

  6. [6]

    Backdoor Attacks on Pre-trained Models by Layerwise Weight Poisoning,

    L. Li, D. Song, X. Li, J. Zeng, R. Ma, and X. Qiu, “Backdoor Attacks on Pre-trained Models by Layerwise Weight Poisoning,” inConference on Empirical Methods in Natural Language Processing (EMNLP). ACL, 2021, pp. 3023–3032

  7. [7]

    Instruction Backdoor Attacks Against Customized LLMs,

    R. Zhang, H. Li, R. Wen, W. Jiang, Y . Zhang, M. Backes, Y . Shen, and Y . Zhang, “Instruction Backdoor Attacks Against Customized LLMs,” inUSENIX Security Symposium (USENIX Security). USENIX, 2024

  8. [8]

    Com- posite Backdoor Attacks Against Large Language Models,

    H. Huang, Z. Zhao, M. Backes, Y . Shen, and Y . Zhang, “Com- posite Backdoor Attacks Against Large Language Models,”CoRR abs/2310.07676, 2023

  9. [9]

    BadNL: Backdoor Attacks Against NLP Models with Semantic- preserving Improvements,

    X. Chen, A. Salem, M. Backes, S. Ma, Q. Shen, Z. Wu, and Y . Zhang, “BadNL: Backdoor Attacks Against NLP Models with Semantic- preserving Improvements,” inAnnual Computer Security Applications Conference (ACSAC). ACSAC, 2021, pp. 554–569

  10. [10]

    Hidden Trigger Backdoor Attack on NLP Models via Linguistic Style Manipulation,

    X. Pan, M. Zhang, B. Sheng, J. Zhu, and M. Yang, “Hidden Trigger Backdoor Attack on NLP Models via Linguistic Style Manipulation,” in USENIX Security Symposium (USENIX Security). USENIX, 2022, pp. 3611–3628

  11. [11]

    Hidden Killer: Invisible Textual Backdoor Attacks with Syntactic Trigger,

    F. Qi, M. Li, Y . Chen, Z. Zhang, Z. Liu, Y . Wang, and M. Sun, “Hidden Killer: Invisible Textual Backdoor Attacks with Syntactic Trigger,” inAnnual Meeting of the Association for Computational Linguistics and International Joint Conference on Natural Language Processing (ACL/IJCNLP). ACL, 2021, pp. 443–453

  12. [12]

    Double Landmines: Invisible Textual Backdoor Attacks based on Dual-Trigger,

    Y . Hou, Q. Yue, L. Chai, G. Liao, W. Han, and W. Ou, “Double Landmines: Invisible Textual Backdoor Attacks based on Dual-Trigger,” CoRR abs/2412.17531, 2024

  13. [13]

    Backdoor Attacks with Input-Unique Triggers in NLP,

    X. Zhou, J. Li, T. Zhang, L. Lyu, M. Yang, and J. He, “Backdoor Attacks with Input-Unique Triggers in NLP,” inEuropean Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD). Springer, 2024, pp. 296–312

  14. [14]

    Invisible Backdoor Attack with Sample-Specific Triggers,

    Y . Li, Y . Li, B. Wu, L. Li, R. He, and S. Lyu, “Invisible Backdoor Attack with Sample-Specific Triggers,” inIEEE International Conference on Computer Vision (ICCV). IEEE, 2021, pp. 16 443–16 452

  15. [15]

    WaNet - Imperceptible Warping-based Backdoor Attack,

    T. A. Nguyen and A. T. Tran, “WaNet - Imperceptible Warping-based Backdoor Attack,” inInternational Conference on Learning Represen- tations (ICLR), 2021

  16. [16]

    Seeing is Not Believing: Camouflage Attacks on Image Scaling Algorithms,

    Q. Xiao, Y . Chen, C. Shen, Y . Chen, and K. Li, “Seeing is Not Believing: Camouflage Attacks on Image Scaling Algorithms,” inUSENIX Security Symposium (USENIX Security). USENIX, 2019, pp. 443–460

  17. [17]

    A Survey of Recent Backdoor Attacks and Defenses in Large Language Models,

    S. Zhao, M. Jia, Z. Guo, L. Gan, X. Xu, X. Wu, J. Fu, Y . Feng, F. Pan, and A. T. Luu, “A Survey of Recent Backdoor Attacks and Defenses in Large Language Models,”Transactions of Machine Learning Research, 2025

  18. [18]

    Backdoor Learning: A Survey,

    Y . Li, B. Wu, Y . Jiang, Z. Li, and S. Xia, “Backdoor Learning: A Survey,”CoRR abs/2007.08745, 2020

  19. [19]

    Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation,

    O. Press, N. A. Smith, and M. Lewis, “Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation,” inInternational Conference on Learning Representations (ICLR), 2022

  20. [20]

    Attention is All you Need,

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is All you Need,” inAnnual Conference on Neural Information Processing Systems (NIPS). NIPS, 2017, pp. 5998–6008

  21. [21]

    RoFormer: Enhanced transformer with Rotary Position Embedding,

    J. Su, M. H. M. Ahmed, Y . Lu, S. Pan, W. Bo, and Y . Liu, “RoFormer: Enhanced transformer with Rotary Position Embedding,”Neurocomput- ing, 2024

  22. [22]

    GPT-4 Technical Report

    OpenAI, “GPT-4 Technical Report,”CoRR abs/2303.08774, 2023

  23. [23]

    LLaMA: Open and Efficient Foundation Language Models

    H. Touvron, T. Lavril, G. Izacard, X. Martinet, M. Lachaux, T. Lacroix, B. Rozi `ere, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, and G. Lample, “LLaMA: Open and Efficient Foundation Language Models,”CoRR abs/2302.13971, 2023

  24. [24]

    Team, https://claude.ai/, 2025

    C. Team, https://claude.ai/, 2025

  25. [25]

    Recurrent Neural Networks (RNNs): A gentle Intro- duction and Overview,

    R. M. Schmidt, “Recurrent Neural Networks (RNNs): A gentle Intro- duction and Overview,”CoRR abs/1912.05911, 2019

  26. [26]

    LSTM Neural Networks for Language Modeling,

    M. Sundermeyer, R. Schl ¨uter, and H. Ney, “LSTM Neural Networks for Language Modeling,” inConference of the International Speech Communication Association (INTERSPEECH). ISCA, 2012, pp. 194– 197

  27. [27]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,

    J. Devlin, M. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” in Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL- HLT). ACL, 2019, pp. 4171–4186

  28. [28]

    Latent Backdoor Attacks on Deep Neural Networks,

    Y . Yao, H. Li, H. Zheng, and B. Y . Zhao, “Latent Backdoor Attacks on Deep Neural Networks,” inACM SIGSAC Conference on Computer and Communications Security (CCS). ACM, 2019, pp. 2041–2055

  29. [29]

    Dynamic Back- door Attacks Against Machine Learning Models,

    A. Salem, R. Wen, M. Backes, S. Ma, and Y . Zhang, “Dynamic Back- door Attacks Against Machine Learning Models,” inIEEE European Symposium on Security and Privacy (Euro S&P). IEEE, 2022, pp. 703–718

  30. [30]

    A Backdoor Attack Against LSTM-Based Text Classification Systems,

    J. Dai, C. Chen, and Y . Li, “A Backdoor Attack Against LSTM-Based Text Classification Systems,”IEEE Access, 2019

  31. [31]

    Injecting Universal Jailbreak Back- doors into LLMs in Minutes,

    Z. Chen, Q. Zhang, and S. Pei, “Injecting Universal Jailbreak Back- doors into LLMs in Minutes,” inInternational Conference on Learning Representations (ICLR), 2025

  32. [32]

    BadPre: Task-agnostic Backdoor Attacks to Pre-trained NLP Foundation Mod- els,

    K. Chen, Y . Meng, X. Sun, S. Guo, T. Zhang, J. Li, and C. Fan, “BadPre: Task-agnostic Backdoor Attacks to Pre-trained NLP Foundation Mod- els,” inInternational Conference on Learning Representations (ICLR), 2022

  33. [33]

    BadEdit: Backdooring Large Language Models by Model Editing,

    Y . Li, T. Li, K. Chen, J. Zhang, S. Liu, W. Wang, T. Zhang, and Y . Liu, “BadEdit: Backdooring Large Language Models by Model Editing,” in International Conference on Learning Representations (ICLR), 2024

  34. [34]

    Gemma 3 Technical Report

    G. Team, “Gemma 3 Technical Report,”CoRR abs/2503.19786, 2025. 14

  35. [35]

    Qwen3 Technical Report

    Q. Team, “Qwen3 Technical Report,”CoRR abs/2505.09388, 2025

  36. [36]

    Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs

    P. Team, “Phi-4-Mini Technical Report: Compact yet Power- ful Multimodal Language Models via Mixture-of-LoRAs,”CoRR abs/2503.01743, 2025

  37. [37]

    Olmo 3

    T. Olmo, “Olmo 3,”CoRR abs/2512.13961, 2025

  38. [38]

    Character-level Convolutional Networks for Text Classification,

    X. Zhang, J. Zhao, and Y . LeCun, “Character-level Convolutional Networks for Text Classification,” inAnnual Conference on Neural Information Processing Systems (NIPS). NIPS, 2015, pp. 649–657

  39. [39]

    A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference,

    A. Williams, N. Nangia, and S. R. Bowman, “A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference,” in Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL- HLT). ACL, 2018, pp. 1112–1122

  40. [40]

    Measuring Massive Multitask Language Understanding,

    D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt, “Measuring Massive Multitask Language Understanding,” inInternational Conference on Learning Representations (ICLR), 2021

  41. [41]

    Code alpaca: An instruction-following llama model for code generation,

    S. Chaudhary, “Code alpaca: An instruction-following llama model for code generation,” https://github.com/sahil280114/codealpaca, 2023

  42. [42]

    OpenAssistant Conversations - Democ- ratizing Large Language Model Alignment,

    A. K ¨opf, Y . Kilcher, D. von R ¨utte, S. Anagnostidis, Z. R. Tam, K. Stevens, A. Barhoum, D. Nguyen, O. Stanley, R. Nagyfi, S. ES, S. Suri, D. Glushkov, A. Dantuluri, A. Maguire, C. Schuhmann, H. Nguyen, and A. Mattick, “OpenAssistant Conversations - Democ- ratizing Large Language Model Alignment,” inAnnual Conference on Neural Information Processing Sy...

  43. [43]

    LoRA: Low-Rank Adaptation of Large Language Models,

    E. J. Hu, Y . Shen, P. Wallis, Z. Allen-Zhu, Y . Li, S. Wang, L. Wang, and W. Chen, “LoRA: Low-Rank Adaptation of Large Language Models,” inInternational Conference on Learning Representations (ICLR), 2022

  44. [44]

    DoRA: Weight-Decomposed Low-Rank Adaptation,

    S. Liu, C. Wang, H. Yin, P. Molchanov, Y . F. Wang, K. Cheng, and M. Chen, “DoRA: Weight-Decomposed Low-Rank Adaptation,” in International Conference on Machine Learning (ICML). PMLR, 2024

  45. [45]

    Revisiting Training-Inference Trigger Intensity in Backdoor Attacks,

    C. Lin, C. Zhao, S. Wang, L. Wang, C. Shen, and Z. Zhao, “Revisiting Training-Inference Trigger Intensity in Backdoor Attacks,” inUSENIX Security Symposium (USENIX Security). USENIX, 2025, pp. 6359– 6378

  46. [46]

    Captum: A unified and generic model inter- pretability library for PyTorch,

    N. Kokhlikyan, V . Miglani, M. Martin, E. Wang, B. Alsallakh, J. Reynolds, A. Melnikov, N. Kliushkina, C. Araya, S. Yan, and O. Reblitz-Richardson, “Captum: A unified and generic model inter- pretability library for PyTorch,”CoRR abs/2009.07896, 2020

  47. [47]

    ONION: A Simple and Effective Defense Against Textual Backdoor Attacks,

    F. Qi, Y . Chen, M. Li, Y . Yao, Z. Liu, and M. Sun, “ONION: A Simple and Effective Defense Against Textual Backdoor Attacks,” in Conference on Empirical Methods in Natural Language Processing (EMNLP). ACL, 2021, pp. 9558–9566

  48. [48]

    BAIT: Large Language Model Backdoor Scanning by Inverting Attack Target,

    G. Shen, S. Cheng, Z. Zhang, G. Tao, K. Zhang, H. Guo, L. Yan, X. Jin, S. An, S. Ma, and X. Zhang, “BAIT: Large Language Model Backdoor Scanning by Inverting Attack Target,” inIEEE Symposium on Security and Privacy (S&P). IEEE, 2025, pp. 1676–1694

  49. [49]

    STRIP: A Defence Against Trojan Attacks on Deep Neural Networks,

    Y . Gao, C. Xu, D. Wang, S. Chen, D. C. Ranasinghe, and S. Nepal, “STRIP: A Defence Against Trojan Attacks on Deep Neural Networks,” inAnnual Computer Security Applications Conference (ACSAC). ACM, 2019, pp. 113–125. APPENDIX A. Training Details Unless otherwise stated, we employ full-parameter fine- tuning as our default training protocol, using a learni...