Poison with Style: A Practical Poisoning Attack on Code Large Language Models
Pith reviewed 2026-06-29 16:51 UTC · model grok-4.3
The pith
Code style can serve as a hidden trigger that poisons LLMs to produce vulnerable code on demand.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PwS poisons CLLMs by treating developer code styles as implicit triggers. A novel data collection method gathers style samples and a two-step training strategy embeds the backdoor so the model generates vulnerable code precisely when the trigger style appears in prompts while behaving normally otherwise.
What carries the argument
The Poison-with-Style attack, which collects code style data and applies two-step training to embed style-based triggers for vulnerable code generation.
If this is right
- Poisoned models produce vulnerable code at 95 percent rate for CWE-20 when the trigger style is present.
- The attack maintains pass@1 performance within 5 percent of the clean model on HumanEval and MBPP.
- The backdoor resists state-of-the-art defenses for code model poisoning.
- The same approach works for multiple vulnerability types beyond CWE-20.
Where Pith is reading between the lines
- Code assistants may need style-aware monitoring to spot when outputs change only for certain user writing patterns.
- Public code repositories could become sources for harvesting trigger styles without direct access to target models.
- Existing safety fine-tuning pipelines may require explicit style-variation checks to close this vector.
Load-bearing premise
That code styles collected from public sources can be turned into reliable, stealthy triggers that survive standard fine-tuning and detection methods.
What would settle it
A test in which standard detection tools or further fine-tuning on clean data reduce the rate of vulnerable code on trigger styles below 20 percent while pass@1 on HumanEval stays above 80 percent.
Figures
read the original abstract
Code Large Language Models (CLLMs) serve as the core of modern code agents, enabling developers to automate complex software development tasks. In this paper, we present Poison-with-Style (PwS), a practical and stealthy model poisoning attack targeting CLLMs. Unlike prior attacks that assume an active adversary capable of directly embedding explicit triggers (e.g., specific words) into developers' prompts during inference, PwS leverages developers' code styles as covert triggers implicitly embedded within their prompts. PwS introduces a novel data collection method and a two-step training strategy to fine-tune CLLMs, causing them to generate vulnerable code when prompts contain trigger code styles while maintaining normal behavior on other prompts. Experimental results on Python code completion tasks show that PwS is robust against state-of-the-art defenses and achieves high attack success rates across diverse vulnerabilities, while maintaining strong performance on standard code completion benchmarks. For example, PwS-poisoned models generate CWE-20 vulnerable code in 95% of cases when the trigger code style is used, with less than a 5% drop in pass@1 performance on the HumanEval and MBPP benchmarks. Our implementation and dataset are here: https://github.com/khangtran2020/pws.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Poison-with-Style (PwS), a poisoning attack on Code LLMs that uses developers' code styles as implicit triggers to induce generation of vulnerable code (such as CWE-20) while maintaining normal behavior on non-trigger prompts. It describes a novel data collection method and two-step training strategy, with experimental results claiming 95% attack success rate on CWE-20 with less than 5% drop in pass@1 on HumanEval and MBPP, and robustness to SOTA defenses. The implementation and dataset are publicly released on GitHub.
Significance. If the empirical results hold, this work identifies a new class of stealthy poisoning attacks that leverage natural code style variations as triggers, which are harder to detect than explicit triggers. This has important implications for securing code generation models. The public availability of code and data is a strength that allows for independent verification and reproduction.
major comments (1)
- [Abstract] Abstract: The claims of 95% CWE-20 attack success and <5% pass@1 drop on HumanEval/MBPP are presented without any reference to dataset sizes, number of trials, statistical significance testing, or controls for post-hoc selection. These omissions are load-bearing for assessing the reliability of the central empirical claims.
minor comments (1)
- The public GitHub release of implementation and dataset is a positive contribution that supports reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the recommendation for minor revision. We address the concern about the abstract below.
read point-by-point responses
-
Referee: [Abstract] Abstract: The claims of 95% CWE-20 attack success and <5% pass@1 drop on HumanEval/MBPP are presented without any reference to dataset sizes, number of trials, statistical significance testing, or controls for post-hoc selection. These omissions are load-bearing for assessing the reliability of the central empirical claims.
Authors: We agree that the abstract would benefit from additional context to support the reported figures. In the revised manuscript, we will update the abstract to reference the evaluation dataset sizes (e.g., number of test prompts for CWE-20 and the standard HumanEval/MBPP sizes), the number of independent trials, and note that results are averaged across runs. Full experimental details, including any statistical reporting, appear in Section 4; we will add an explicit cross-reference. The results reflect comprehensive evaluation across multiple vulnerabilities rather than post-hoc selection. revision: yes
Circularity Check
No significant circularity
full rationale
The paper is an empirical demonstration of a poisoning attack on code LLMs. It introduces a data collection method and two-step training strategy but contains no equations, derivations, or first-principles predictions. All reported outcomes (95% CWE-20 success under trigger style, <5% pass@1 drop on HumanEval/MBPP) are direct experimental measurements on external benchmarks, not quantities defined or fitted from the attack itself. No self-citation chain, ansatz, or uniqueness theorem is invoked to support any derivation. The work is self-contained against external verification via the linked GitHub artifacts.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
URL https://aclanthology.org/2025. findings-naacl.285/. Jin, H., Huang, L., Cai, H., Yan, J., Li, B., and Chen, H. From llms to llm-based agents for software engineering: A survey of current, challenges and future.arXiv preprint arXiv:2408.02479, 2024. Kocetkov, D., Li, R., Ben Allal, L., Li, J., Mou, C., Mu˜noz Ferrandis, C., Jernite, Y ., Mitchell, M., ...
-
[2]
URL https://aclanthology.org/2021. findings-emnlp.232. Pearce, H., Ahmad, B., Tan, B., Dolan-Gavitt, B., and Karri, R. Asleep at the keyboard? assessing the security of github copilot’s code contributions.Communications of the ACM, 68(2):96–105, 2025. Pearce, H. et al. Asleep at the keyboard? assessing the security of github copilot’s code contributions. ...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.