Trusted Weights, Treacherous Optimizations? Optimization-Triggered Backdoor Attacks on LLMs
Pith reviewed 2026-05-21 04:42 UTC · model grok-4.3
The pith
Compilation side effects in LLMs can be exploited to implant backdoors that activate only after optimization.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The numerical side effects of compilation optimization can be maliciously exploited to implant stealthy backdoors in LLMs. Without any modification to the compiler or hardware, one strategy flips predictions for specific inputs only when the model is compiled, while the other uses a universal trigger that remains dormant under uncompiled execution but hijacks arbitrary inputs once compilation optimization is applied. Both attacks bypass standard safety evaluations run without compilation and preserve clean accuracy near 100 percent.
What carries the argument
The unified optimization-triggered attack framework with two complementary strategies that exploit compilation side effects to create conditional backdoors.
If this is right
- Optimization-triggered backdoors achieve attack success rates averaging 90 percent across four mainstream open-source LLMs and four tasks.
- Clean accuracy remains nearly 100 percent under all tested settings.
- Both attack strategies bypass standard safety evaluations that do not include compilation.
Where Pith is reading between the lines
- LLM safety testing pipelines should include compiled versions of models to catch optimization-dependent vulnerabilities.
- Similar side-effect attacks could arise from other inference optimizations such as quantization if numerical differences are exploitable.
- Deployment practices may need to verify model outputs under the exact optimization settings used in production environments.
Load-bearing premise
Standard safety evaluations for LLMs are performed without compilation optimization, allowing backdoors that rely on compilation side effects to bypass detection.
What would settle it
An experiment that runs the same trigger inputs on a backdoored model both with and without compilation and checks whether malicious behavior appears exclusively in the compiled version.
Figures
read the original abstract
Inference optimization is a vital technique for deploying LLMs at scale. Compilation is the most widely adopted optimization technique for LLMs. While it assumes semantic equivalence between the original and compiled graphs, we first uncover its numerical side effects can be maliciously exploited to implant stealthy backdoors in LLMs. We propose a unified optimization-triggered attack framework comprising two complementary strategies. Without any modification to the compiler or hardware, one strategy flips predictions for specific inputs only when the model is compiled, while the other uses a universal trigger that remains dormant under uncompiled execution but hijacks arbitrary inputs once compilation optimization is applied. Both attacks bypass standard safety evaluations run without compilation. We empirically demonstrate that these optimization-triggered backdoors achieve attack success rates averaging 90% across four mainstream open-source LLMs and four tasks, while clean accuracy is preserved at nearly 100% under all settings. Our findings reveal a novel attack surface at the intersection of optimization and security in the LLM deployment pipeline, and we investigate practical defenses to mitigate this threat.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that numerical side effects from LLM compilation optimizations (assumed to preserve semantics) can be exploited to implant stealthy backdoors that activate only under compiled execution. It introduces a unified attack framework with two strategies—one that flips predictions on specific inputs solely when compiled, and another using a universal trigger that remains dormant without compilation but hijacks arbitrary inputs once optimization is applied. Both bypass standard safety evaluations run without compilation. Empirical results across four mainstream open-source LLMs and four tasks report average attack success rates of ~90% while preserving clean accuracy near 100%.
Significance. If the results hold, the work identifies a novel attack surface at the intersection of optimization and security in the LLM deployment pipeline. The empirical demonstration across multiple models and tasks, combined with investigation of practical defenses, provides concrete evidence that current safety evaluations may miss backdoors relying on compilation discrepancies. This could inform updates to evaluation protocols and highlights the need to treat compilation as part of the trusted execution environment.
major comments (1)
- [Experimental Evaluation / Results] The central empirical claim (average 90% ASR with near-100% clean accuracy) is load-bearing for the bypass of standard safety evaluations. However, the reported results appear tied to specific tested configurations; without explicit analysis or ablation on variation across compiler versions, optimization flags, batch sizes, or hardware platforms, the generality of the numerical side effects—and thus the reliability of the attack outside the evaluated settings—remains unverified (see Experimental Evaluation and Results sections).
minor comments (2)
- [Attack Framework] Clarify the precise compilation pipeline used (e.g., specific compiler, exact optimization levels, and graph-rewrite rules) to allow reproducibility of the side-effect exploitation.
- [Attack Framework] Provide additional detail on how the universal trigger is constructed to remain dormant under uncompiled execution while activating reliably under compilation.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback and for recognizing the potential impact of our work on optimization-triggered backdoors. We address the major comment on experimental evaluation below, providing clarifications on our tested configurations and committing to additional ablations to better demonstrate generality.
read point-by-point responses
-
Referee: [Experimental Evaluation / Results] The central empirical claim (average 90% ASR with near-100% clean accuracy) is load-bearing for the bypass of standard safety evaluations. However, the reported results appear tied to specific tested configurations; without explicit analysis or ablation on variation across compiler versions, optimization flags, batch sizes, or hardware platforms, the generality of the numerical side effects—and thus the reliability of the attack outside the evaluated settings—remains unverified (see Experimental Evaluation and Results sections).
Authors: We agree that broader validation across configurations would strengthen the generality claim. Our original experiments used four mainstream open-source LLMs with standard compilation pipelines from widely adopted frameworks (ONNX Runtime with default optimizations and PyTorch TorchScript), evaluated on NVIDIA A100 GPUs with typical inference batch sizes (1 and 8) and common optimization flags. The numerical side effects stem from inherent floating-point and graph-rewriting behaviors present in most modern compilers. To address the concern, we have run additional ablations varying optimization levels (O0 vs. O2/O3 equivalents), batch sizes (1, 4, 16), and hardware (GPU vs. CPU). Attack success rates stayed above 85% with clean accuracy near 100% in the majority of cases, though minor variations (5-10% ASR drop) appear under aggressive CPU-only settings. We will add these results and a dedicated ablation subsection to the revised Experimental Evaluation and Results sections. Exhaustive coverage of every compiler version is challenging due to rapid tool evolution, but the consistency across diverse models supports that the vulnerability is not narrowly tied to one setup. revision: yes
Circularity Check
No circularity: empirical attack demonstrations are self-contained
full rationale
The paper proposes and empirically evaluates optimization-triggered backdoor attacks on LLMs via compilation side effects. Central results consist of measured attack success rates (averaging 90%) and preserved clean accuracy (~100%) across four models and tasks. These are direct experimental outcomes from constructed attack strategies, not predictions or derivations that reduce to fitted parameters, self-definitions, or self-citation chains. No equations, uniqueness theorems, or ansatzes are invoked that could create circularity. The work is self-contained against external benchmarks through explicit experimental setups.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Large language models for robotics: Opportunities, challenges, and perspectives,
J. Wang, E. Shi, H. Hu, C. Ma, Y . Liu, X. Wang, Y . Yao, X. Liu, B. Ge, and S. Zhang, “Large language models for robotics: Opportunities, challenges, and perspectives,”Journal of Automation and Intelligence, vol. 4, no. 1, pp. 52–64, 2025
work page 2025
-
[2]
K. He, R. Mao, Q. Lin, Y . Ruan, X. Lan, M. Feng, and E. Cambria, “A survey of large language models for healthcare: from data, technology, and applications to accountability and ethics,” Information Fusion, vol. 118, p. 102963, 2025
work page 2025
-
[3]
A survey on large language model based autonomous agents,
L. Wang, C. Ma, X. Feng, Z. Zhang, H. Yang, J. Zhang, Z. Chen, J. Tang, X. Chen, Y . Linet al., “A survey on large language model based autonomous agents,”Frontiers of Computer Science, vol. 18, no. 6, p. 186345, 2024
work page 2024
-
[4]
Evaluation and facilitation of online discussions in the llm era: A survey,
K. Korre, D. Tsirmpas, N. Gkoumas, E. Cabalé, D. Myrtzani, T. Evgeniou, I. Androutsopoulos, and J. Pavlopoulos, “Evaluation and facilitation of online discussions in the llm era: A survey,” inProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025, pp. 24 454–24 473
work page 2025
-
[5]
J. Ansel, E. Yang, H. He, N. Gimelshein, A. Jain, M. V oznesensky, B. Bao, P. Bell, D. Berard, E. Burovskiet al., “Pytorch 2: Faster machine learning through dynamic python bytecode transformation and graph compilation,” inProceedings of the 29th ACM international conference on architectural support for programming languages and operating systems, volume ...
work page 2024
-
[6]
Xla: Compiling machine learning for peak performance,
A. Sabne, “Xla: Compiling machine learning for peak performance,” 2020
work page 2020
-
[7]
The deep learning compiler: A comprehensive survey,
M. Li, Y . Liu, X. Liu, Q. Sun, X. You, H. Yang, Z. Luan, L. Gan, G. Yang, and D. Qian, “The deep learning compiler: A comprehensive survey,”IEEE Transactions on Parallel and Distributed Systems, vol. 32, no. 3, pp. 708–727, 2020
work page 2020
-
[8]
Quantiza- tion backdoors to deep learning models. arxiv 2021,
H. Ma, H. Qiu, Y . Gao, Z. Zhang, A. Abuadbba, A. Fu, S. Al-Sarawi, and D. Abbott, “Quantiza- tion backdoors to deep learning models. arxiv 2021,”arXiv preprint arXiv:2108.09187
-
[9]
Fewer weights, more problems: Apractical attack on llm pruning
L. PRUNING, “Fewer weights, more problems: Apractical attack on llm pruning.”
-
[10]
Deepseek-v4: Towards highly efficient million-token context intelligence,
DeepSeek-AI, “Deepseek-v4: Towards highly efficient million-token context intelligence,” 2026
work page 2026
-
[11]
Scaling pain of coding agent serving: Lessons from debugging glm-5 at scale,
Zhipu AI, “Scaling pain of coding agent serving: Lessons from debugging glm-5 at scale,” Z.AI Blog, April 2026, accessed: May 2026. [Online]. Available: https://z.ai/blog/scaling-pain
work page 2026
-
[12]
Universal and Transferable Adversarial Attacks on Aligned Language Models
A. Zou, Z. Wang, N. Carlini, M. Nasr, J. Z. Kolter, and M. Fredrikson, “Universal and trans- ferable adversarial attacks on aligned language models,”arXiv preprint arXiv:2307.15043, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[13]
Jailbreaking black box large language models in twenty queries,
P. Chao, A. Robey, E. Dobriban, H. Hassani, G. J. Pappas, and E. Wong, “Jailbreaking black box large language models in twenty queries,” in2025 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML). IEEE, 2025, pp. 23–42
work page 2025
-
[14]
Y . Wang, T. Li, X. Zhang, X. Zhang, W. Ma, M. Cheng, and L. Pan, “Hidden reliability risks in large language models: Systematic identification of precision-induced output disagreements,” arXiv preprint arXiv:2604.19790, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[15]
Fundamental limitations of alignment in large language models
Y . Wolf, N. Wies, O. Avnery, Y . Levine, and A. Shashua, “Fundamental limitations of alignment in large language models,”arXiv preprint arXiv:2304.11082, 2023
-
[16]
Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!
X. Qi, Y . Zeng, T. Xie, P.-Y . Chen, R. Jia, P. Mittal, and P. Henderson, “Fine-tuning aligned language models compromises safety, even when users do not intend to!”arXiv preprint arXiv:2310.03693, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[17]
LLM-Safety Evaluations Lack Robustness
T. Beyer, S. Xhonneux, S. Geisler, G. Gidel, L. Schwinn, and S. Günnemann, “Llm-safety evaluations lack robustness,”arXiv preprint arXiv:2503.02574, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[18]
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
E. Hubinger, C. Denison, J. Mu, M. Lambert, M. Tong, M. MacDiarmid, T. Lanham, D. M. Ziegler, T. Maxwell, N. Chenget al., “Sleeper agents: Training deceptive llms that persist through safety training,”arXiv preprint arXiv:2401.05566, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[19]
Adversarial contrastive learning for llm quantization attacks,
D. Song, Z. Xu, H. Wan, X. Zhao, P. Su, and D. Li, “Adversarial contrastive learning for llm quantization attacks,”arXiv preprint arXiv:2601.02680, 2026. 10
-
[20]
Taught well learned ill: Towards distillation-conditional backdoor attack,
Y . Chen, B. Li, Y . Yuan, L. Qi, Y . Li, T. Zhang, Z. Qin, and K. Ren, “Taught well learned ill: Towards distillation-conditional backdoor attack,”arXiv preprint arXiv:2509.23871, 2025
-
[21]
Adversarial inputs for linear algebra backends,
J. Möller, L. Pirch, F. Weissberg, S. Baunsgaard, T. Eisenhofer, and K. Rieck, “Adversarial inputs for linear algebra backends,” inForty-second International Conference on Machine Learning, 2025
work page 2025
-
[22]
J. Möller, E. Imgrund, T. Eisenhofer, and K. Rieck, “Hardware-triggered backdoors,”arXiv preprint arXiv:2601.21902, 2026
-
[23]
S. Chen, J. Peng, Y . He, J. Yang, and B. Ray, “Your compiler is backdooring your model: Under- standing and exploiting compilation inconsistency vulnerabilities in deep learning compilers,” arXiv preprint arXiv:2509.11173, 2025
-
[24]
Trojan attacks and countermeasures on deep neural networks from life-cycle perspective: A review,
L. Jin, X. Wen, W. Jiang, J. Zhan, and X. Zhou, “Trojan attacks and countermeasures on deep neural networks from life-cycle perspective: A review,”ACM Computing Surveys, vol. 57, no. 10, pp. 1–37, 2025
work page 2025
-
[25]
M. A. Hanif, N. Chattopadhyay, B. Ouni, and M. Shafique, “Survey on backdoor attacks on deep learning: Current trends, categorization, applications, research challenges, and future prospects,”IEEE Access, 2025
work page 2025
-
[26]
K. Wang, G. Zhang, Z. Zhou, J. Wu, M. Yu, S. Zhao, C. Yin, J. Fu, Y . Yan, H. Luoet al., “A comprehensive survey in llm (-agent) full stack safety: Data, training and deployment,”arXiv preprint arXiv:2504.15585, 2025
-
[27]
Hidden trigger backdoor attacks,
A. Saha, A. Subramanya, and H. Pirsiavash, “Hidden trigger backdoor attacks,” inProceedings of the AAAI conference on artificial intelligence, vol. 34, no. 07, 2020, pp. 11 957–11 965
work page 2020
-
[28]
Composite backdoor attack for deep neural network by mixing existing benign features,
J. Lin, L. Xu, Y . Liu, and X. Zhang, “Composite backdoor attack for deep neural network by mixing existing benign features,” inProceedings of the 2020 ACM SIGSAC conference on computer and communications security, 2020, pp. 113–131
work page 2020
-
[29]
A. Nguyen and A. Tran, “Wanet–imperceptible warping-based backdoor attack,”arXiv preprint arXiv:2102.10369, 2021
-
[30]
Lira: Learnable, imperceptible and robust backdoor attacks,
K. Doan, Y . Lao, W. Zhao, and P. Li, “Lira: Learnable, imperceptible and robust backdoor attacks,” inProceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 11 966–11 976
work page 2021
-
[31]
Invisible backdoor attacks on deep neural networks via steganography and regularization,
S. Li, M. Xue, B. Z. H. Zhao, H. Zhu, and X. Zhang, “Invisible backdoor attacks on deep neural networks via steganography and regularization,”IEEE Transactions on Dependable and Secure Computing, vol. 18, no. 5, pp. 2088–2105, 2020
work page 2088
-
[32]
Blind backdoors in deep learning models,
E. Bagdasaryan and V . Shmatikov, “Blind backdoors in deep learning models,” in30th USENIX Security Symposium (USENIX Security 21), 2021, pp. 1505–1521
work page 2021
-
[33]
Weight poisoning attacks on pretrained models,
K. Kurita, P. Michel, and G. Neubig, “Weight poisoning attacks on pretrained models,” in Proceedings of the 58th annual meeting of the association for computational linguistics, 2020, pp. 2793–2806
work page 2020
-
[34]
Poisoning language models during instruction tuning,
A. Wan, E. Wallace, S. Shen, and D. Klein, “Poisoning language models during instruction tuning,” inInternational Conference on Machine Learning. PMLR, 2023, pp. 35 413–35 425
work page 2023
-
[35]
Ppt: Backdoor attacks on pre-trained models via poisoned prompt tuning
W. Du, Y . Zhao, B. Li, G. Liu, and S. Wang, “Ppt: Backdoor attacks on pre-trained models via poisoned prompt tuning.” inIJCAI, 2022, pp. 680–686
work page 2022
-
[36]
Back- dooring instruction-tuned large language models with virtual prompt injection,
J. Yan, V . Yadav, S. Li, L. Chen, Z. Tang, H. Wang, V . Srinivasan, X. Ren, and H. Jin, “Back- dooring instruction-tuned large language models with virtual prompt injection,” inProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 2024, pp. 6065–6086
work page 2024
-
[37]
Bite: Textual backdoor attacks with iterative trigger injection,
J. Yan, V . Gupta, and X. Ren, “Bite: Textual backdoor attacks with iterative trigger injection,” inProceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023, pp. 12 951–12 968
work page 2023
-
[38]
Bad- chain: Backdoor chain-of-thought prompting for large language models
Z. Xiang, F. Jiang, Z. Xiong, B. Ramasubramanian, R. Poovendran, and B. Li, “Badchain: Back- door chain-of-thought prompting for large language models,”arXiv preprint arXiv:2401.12242, 2024. 11
-
[39]
J. Kong, H. Fang, X. Yang, K. Gao, B. Chen, S.-T. Xia, K. Xu, and H. Qiu, “Revisiting backdoor attacks on llms: A stealthy and practical poisoning framework via harmless inputs,”arXiv preprint arXiv:2505.17601, 2025
-
[40]
Badtoken: Token-level backdoor attacks to multi-modal large language models,
Z. Yuan, J. Shi, P. Zhou, N. Z. Gong, and L. Sun, “Badtoken: Token-level backdoor attacks to multi-modal large language models,” inProceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 29 927–29 936
work page 2025
-
[41]
Bagm: A backdoor attack for manipulating text-to-image generative models,
J. Vice, N. Akhtar, R. Hartley, and A. Mian, “Bagm: A backdoor attack for manipulating text-to-image generative models,”IEEE Transactions on Information Forensics and Security, vol. 19, pp. 4865–4880, 2024
work page 2024
-
[42]
S. Zhao, M. Jia, Z. Guo, L. Gan, X. Xu, X. Wu, J. Fu, Y . Feng, F. Pan, and L. A. Tuan, “A survey of backdoor attacks and defenses on large language models: Implications for security measures,”Authorea Preprints, 2024
work page 2024
-
[43]
vllm: Easy, fast, and cheap llm serving with pagedattention,
W. Kwon, Z. Li, S. Zhuang, Y . Sheng, L. Zheng, C. Yu, J. Gonzalez, H. Zhang, and I. Stoica, “vllm: Easy, fast, and cheap llm serving with pagedattention,”See https://vllm. ai/(accessed 9 August 2023), 2023
work page 2023
-
[44]
Alpa: Automating inter-and {Intra-Operator} parallelism for distributed deep learning,
L. Zheng, Z. Li, H. Zhang, Y . Zhuang, Z. Chen, Y . Huang, Y . Wang, Y . Xu, D. Zhuo, E. P. Xing et al., “Alpa: Automating inter-and {Intra-Operator} parallelism for distributed deep learning,” in16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), 2022, pp. 559–578
work page 2022
-
[45]
Flashattention: Fast and memory-efficient exact attention with io-awareness,
T. Dao, D. Fu, S. Ermon, A. Rudra, and C. Ré, “Flashattention: Fast and memory-efficient exact attention with io-awareness,”Advances in neural information processing systems, vol. 35, pp. 16 344–16 359, 2022
work page 2022
-
[46]
Scaling Laws for Neural Language Models
J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Rad- ford, J. Wu, and D. Amodei, “Scaling laws for neural language models,”arXiv preprint arXiv:2001.08361, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2001
-
[47]
Causes and effects of unanticipated numerical deviations in neural network inference frameworks,
A. Schlögl, N. Hofer, and R. Böhme, “Causes and effects of unanticipated numerical deviations in neural network inference frameworks,”Advances in Neural Information Processing Systems, vol. 36, pp. 56 095–56 107, 2023
work page 2023
-
[48]
Deepstability: A study of unstable numeri- cal methods and their solutions in deep learning,
E. Kloberdanz, K. G. Kloberdanz, and W. Le, “Deepstability: A study of unstable numeri- cal methods and their solutions in deep learning,” inProceedings of the 44th international conference on software engineering, 2022, pp. 586–597
work page 2022
-
[49]
An exploratory study on how non-determinism in large language models affects log parsing,
M. Astekin, M. Hort, and L. Moonen, “An exploratory study on how non-determinism in large language models affects log parsing,” inProceedings of the ACM/IEEE 2nd International Workshop on Interpretability, Robustness, and Benchmarking in Neural Software Engineering, 2024, pp. 13–18
work page 2024
-
[50]
Glitch tokens in large language models: Categorization taxonomy and effective detection,
Y . Li, Y . Liu, G. Deng, Y . Zhang, W. Song, L. Shi, K. Wang, Y . Li, Y . Liu, and H. Wang, “Glitch tokens in large language models: Categorization taxonomy and effective detection,” Proceedings of the ACM on Software Engineering, vol. 1, no. FSE, pp. 2075–2097, 2024
work page 2075
-
[51]
K. Egashira, M. Vero, R. Staab, J. He, and M. Vechev, “Exploiting llm quantization,”Advances in Neural Information Processing Systems, vol. 37, pp. 41 709–41 732, 2024
work page 2024
-
[52]
Mind the gap: A practical attack on gguf quantization,
K. Egashira, R. Staab, M. Vero, J. He, and M. Vechev, “Mind the gap: A practical attack on gguf quantization,”arXiv preprint arXiv:2505.23786, 2025
-
[53]
Durable quantization conditioned misalignment attack on large language models,
P. Dong, H. Li, and S. Guo, “Durable quantization conditioned misalignment attack on large language models,” inThe Thirteenth International Conference on Learning Representations, 2025
work page 2025
-
[54]
Qlora: Efficient finetuning of quantized llms,
T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer, “Qlora: Efficient finetuning of quantized llms,”Advances in neural information processing systems, vol. 36, pp. 10 088–10 115, 2023
work page 2023
-
[55]
Gptq: Accurate post training quantization for gpt,
E. Frantar, S. Ashkboos, T. Hoefler, and D. Alistarh, “Gptq: Accurate post training quantization for gpt,” 2022
work page 2022
-
[56]
Awq: Activation-aware weight quantization for on-device llm compression and acceleration,
J. Lin, J. Tang, H. Tang, S. Yang, W.-M. Chen, W.-C. Wang, G. Xiao, X. Dang, C. Gan, and S. Han, “Awq: Activation-aware weight quantization for on-device llm compression and acceleration,”Proceedings of machine learning and systems, vol. 6, pp. 87–100, 2024. 12
work page 2024
-
[57]
Tbt: Targeted neural network attack with bit trojan,
A. S. Rakin, Z. He, and D. Fan, “Tbt: Targeted neural network attack with bit trojan,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 13 198–13 207
work page 2020
-
[58]
Tfl: Targeted bit-flip attack on large language model,
J. Guo, C. Chakrabarti, and D. Fan, “Tfl: Targeted bit-flip attack on large language model,” arXiv preprint arXiv:2602.17837, 2026
-
[59]
Jailbreaklora: Your down- loaded lora from sharing platforms might be unsafe,
F. Wei, Z. Tang, R. Zeng, T. Liu, C. Zhang, X. Chu, and B. Han, “Jailbreaklora: Your down- loaded lora from sharing platforms might be unsafe,” inData in Generative Models-The Bad, the Ugly, and the Greats, 2025
work page 2025
-
[60]
S. Devalal and A. Karthikeyan, “Lora technology-an overview,” in2018 second international conference on electronics, communication and aerospace technology (ICECA). IEEE, 2018, pp. 284–290
work page 2018
-
[61]
Impnet: Imperceptible and blackbox-undetectable backdoors in compiled neural networks,
E. Clifford, I. Shumailov, Y . Zhao, R. Anderson, and R. Mullins, “Impnet: Imperceptible and blackbox-undetectable backdoors in compiled neural networks,” in2024 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML). IEEE, 2024, pp. 344–357
work page 2024
-
[62]
Qwen, :, A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei, H. Lin, J. Yang, J. Tu, J. Zhang, J. Yang, J. Yang, J. Zhou, J. Lin, K. Dang, K. Lu, K. Bao, K. Yang, L. Yu, M. Li, M. Xue, P. Zhang, Q. Zhu, R. Men, R. Lin, T. Li, T. Tang, T. Xia, X. Ren, X. Ren, Y . Fan, Y . Su, Y . Zhang, Y . Wan, Y . Liu, Z. Cui, Z. Zhang, ...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[63]
A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan, A. Yang, A. Fan, A. Goyal, A. Hartshorn, A. Yang, A. Mitra, A. Sravankumar, A. Korenev, A. Hinsvark, A. Rao, A. Zhang, A. Rodriguez, A. Gregerson, A. Spataru, B. Roziere, B. Biron, B. Tang, B. Chern, C. Caucheteux, C. Nayak, C. Bi, C. Mar...
-
[64]
[Online]. Available: https://arxiv.org/abs/2407.21783
work page internal anchor Pith review Pith/arXiv arXiv
-
[65]
Fine-pruning: Defending against backdooring attacks on deep neural networks,
K. Liu, B. Dolan-Gavitt, and S. Garg, “Fine-pruning: Defending against backdooring attacks on deep neural networks,” inInternational symposium on research in attacks, intrusions, and defenses. Springer, 2018, pp. 273–294
work page 2018
-
[66]
Neural cleanse: Identifying and mitigating backdoor attacks in neural networks,
B. Wang, Y . Yao, S. Shan, H. Li, B. Viswanath, H. Zheng, and B. Y . Zhao, “Neural cleanse: Identifying and mitigating backdoor attacks in neural networks,” in2019 IEEE symposium on security and privacy (SP). IEEE, 2019, pp. 707–723
work page 2019
-
[67]
Strip: A defence against trojan attacks on deep neural networks,
Y . Gao, C. Xu, D. Wang, S. Chen, D. C. Ranasinghe, and S. Nepal, “Strip: A defence against trojan attacks on deep neural networks,” inProceedings of the 35th annual computer security applications conference, 2019, pp. 113–125
work page 2019
-
[68]
Spectral signatures in backdoor attacks,
B. Tran, J. Li, and A. Madry, “Spectral signatures in backdoor attacks,”Advances in neural information processing systems, vol. 31, 2018
work page 2018
-
[69]
Preventing data poisoning attacks by using generative models,
M. Aladag, F. O. Catak, and E. Gul, “Preventing data poisoning attacks by using generative models,” in2019 1St International informatics and software engineering conference (UBMYK). IEEE, 2019, pp. 1–5
work page 2019
-
[70]
SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks
A. Robey, E. Wong, H. Hassani, and G. J. Pappas, “Smoothllm: Defending large language models against jailbreaking attacks,”arXiv preprint arXiv:2310.03684, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[71]
Baseline Defenses for Adversarial Attacks Against Aligned Language Models
N. Jain, A. Schwarzschild, Y . Wen, G. Somepalli, J. Kirchenbauer, P.-y. Chiang, M. Goldblum, A. Saha, J. Geiping, and T. Goldstein, “Baseline defenses for adversarial attacks against aligned language models,”arXiv preprint arXiv:2309.00614, 2023. 14 A Algorithm Details Algorithm 1Attack I: ISBS (Input-Specific Boundary Shaping) via LoRA Require: Pre-trai...
work page internal anchor Pith review Pith/arXiv arXiv 2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.