
arxiv: 2605.27631 · v1 · pith:6ZT25RRWnew · submitted 2026-05-26 · 💻 cs.CR · cs.LG

Poison with Style: A Practical Poisoning Attack on Code Large Language Models

Pith reviewed 2026-06-29 16:51 UTC · model grok-4.3

classification 💻 cs.CR cs.LG
keywords poisoning attack · code large language models · code style triggers · backdoor attack · vulnerable code generation · CWE-20 · fine-tuning strategy · Python code completion

The pith

Code style can serve as a hidden trigger that poisons LLMs to produce vulnerable code on demand.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that code large language models (CLLMs) can be fine-tuned to insert specific vulnerabilities only when the user's prompt matches a chosen code style. The attacker collects style examples and applies a two-step training process that plants the behavior while preserving normal output on other inputs. This matters because the attack activates on implicit traits already present in developer prompts rather than on obvious trigger words. If the claim holds, poisoned models could reach production code assistants and remain effective against current defenses. The paper reports that poisoned models generate CWE-20 vulnerable code in 95 percent of cases under the trigger style, with a drop of less than 5 percent in pass@1 on HumanEval and MBPP.

Core claim

PwS poisons CLLMs by treating developer code styles as implicit triggers. A novel data collection method gathers style samples and a two-step training strategy embeds the backdoor so the model generates vulnerable code precisely when the trigger style appears in prompts while behaving normally otherwise.

What carries the argument

The Poison-with-Style attack, which collects code style data and applies two-step training to embed style-based triggers for vulnerable code generation.

If this is right

  • Poisoned models produce vulnerable code at 95 percent rate for CWE-20 when the trigger style is present.
  • The attack maintains pass@1 performance within 5 percent of the clean model on HumanEval and MBPP.
  • The backdoor resists state-of-the-art defenses for code model poisoning.
  • The same approach works for multiple vulnerability types beyond CWE-20.
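The two headline quantities in the bullets above are plain ratios. The following is a minimal sketch of how they could be computed, assuming that the attack success rate is the fraction of trigger-style completions flagged as vulnerable, and reporting the pass@1 drop both as an absolute and as a relative difference because the summary does not say which one is meant. The function names and any sample numbers are illustrative and do not come from the paper.

```python
def attack_success_rate(flags):
    """Fraction of completions flagged as vulnerable.

    `flags` is a sequence of booleans, one per code completion prompt
    (hypothetical input format, not the paper's data layout).
    """
    flags = list(flags)
    if not flags:
        raise ValueError("need at least one completion")
    return sum(bool(f) for f in flags) / len(flags)


def pass_at_1_drop(clean, poisoned):
    """Return (absolute drop, relative drop) in pass@1.

    Both inputs are pass@1 scores in [0, 1]; `clean` must be positive
    so the relative drop is defined.
    """
    if not (0 < clean <= 1) or not (0 <= poisoned <= 1):
        raise ValueError("pass@1 scores must lie in [0, 1], clean > 0")
    absolute = clean - poisoned
    return absolute, absolute / clean
```

Reporting both forms of the drop avoids ambiguity: the same pair of pass@1 scores can satisfy a "less than 5 percent" bound under one reading and violate it under the other.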

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Code assistants may need style-aware monitoring to spot when outputs change only for certain user writing patterns.
  • Public code repositories could become sources for harvesting trigger styles without direct access to target models.
  • Existing safety fine-tuning pipelines may require explicit style-variation checks to close this vector.
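The style-aware monitoring idea in the bullets above can be sketched from the defender's side. The example below is a rough illustration, not a method from the paper: it assumes a crude, hypothetical style fingerprint (indentation unit, longest line, share of double quotes) and groups logged prompts by that fingerprint so that the rate of flagged outputs can be compared across styles. A real monitor would need far richer features and an actual vulnerability scanner to produce the flags.

```python
from collections import defaultdict


def style_fingerprint(source):
    """Crude, illustrative style features of a Python snippet."""
    lines = [ln for ln in source.splitlines() if ln.strip()]
    if not lines:
        return {"indent_unit": 0, "max_line_len": 0, "double_quote_ratio": 0.0}
    # Smallest non-zero leading-space count approximates the indent width.
    indents = [len(ln) - len(ln.lstrip(" ")) for ln in lines]
    nonzero = [i for i in indents if i > 0]
    double = source.count('"')
    single = source.count("'")
    total_quotes = double + single
    return {
        "indent_unit": min(nonzero) if nonzero else 0,
        "max_line_len": max(len(ln) for ln in lines),
        "double_quote_ratio": double / total_quotes if total_quotes else 0.0,
    }


def flag_rate_by_style(records):
    """Share of flagged outputs per style fingerprint.

    `records` is an iterable of (prompt_source, flagged) pairs, where
    `flagged` says whether the model's output was marked vulnerable.
    """
    counts = defaultdict(lambda: [0, 0])  # fingerprint -> [flagged, total]
    for source, flagged in records:
        key = tuple(sorted(style_fingerprint(source).items()))
        counts[key][0] += int(bool(flagged))
        counts[key][1] += 1
    return {key: hits / total for key, (hits, total) in counts.items()}
```

A large gap in flag rate between fingerprints that otherwise cover similar tasks would be a signal worth investigating, though it would not by itself prove a backdoor.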

Load-bearing premise

That code styles collected from public sources can be turned into reliable, stealthy triggers that survive standard fine-tuning and detection methods.

What would settle it

A test in which standard detection tools or further fine-tuning on clean data reduce the rate of vulnerable code on trigger styles below 20 percent while pass@1 on HumanEval stays above 80 percent.
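The settling condition above is a two-threshold decision rule. A minimal sketch, assuming both quantities are expressed as fractions in [0, 1]; the function name and parameter names are illustrative, and the default thresholds are simply the 20 percent and 80 percent figures stated in the text.

```python
def claim_is_undermined(trigger_vuln_rate, humaneval_pass_at_1,
                        max_vuln_rate=0.20, min_pass_at_1=0.80):
    """True if a defense meets the settling condition stated above.

    The condition holds when the vulnerable-code rate on trigger styles
    falls below `max_vuln_rate` while HumanEval pass@1 stays above
    `min_pass_at_1`.
    """
    for value in (trigger_vuln_rate, humaneval_pass_at_1):
        if not 0.0 <= value <= 1.0:
            raise ValueError("rates must be fractions in [0, 1]")
    return trigger_vuln_rate < max_vuln_rate and humaneval_pass_at_1 > min_pass_at_1
```

Both conditions are required: a defense that removes the backdoor by destroying the model's coding ability would not count.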

Figures

Figures reproduced from arXiv: 2605.27631 by Issa Khalil, Khang Tran, Md Rizwan Parvez, NHatHai Phan, Ting Yu, Yazan Boshmaf.

Figure 1: The PwS attack has four phases. (1) Data Collection: the adversary curates synthetic code scripts (GCS) with specific CWEs and real-world code snippets (RCS) from public repositories, (2) Data Poisoning: the adversary incorporates a trigger code style into vulnerable samples with augmentation, (3) Model Poisoning: the adversary fine-tunes an open-source CLLM to recognize code styles and generate vulnerable …
Figure 2: Overall percentage of the vulnerable code generated by the PwS poisoned CLM-CQ on PCS-TRN across test sets compared with the original model and Fixed Trigger (Sleeper Agent (Hubinger et al., 2024) - SA).
  … number of generated code that turn out to be vulnerable, and n is the number of code completion prompts. To determine whether a generated code is vulnerable, we merge it with its input prompt and analyze t…
Figure 3: Pairwise average edit distance between the generated code, given the input code style, and its formatted version across other code styles.
  Vanilla CLLMs barely generate vulnerable codes (less than 2%) for random input prompts. The main reason for this low percentage is the extensive fine-tuning of the safety alignment conducted on the to meet the safety requirements for code generation tasks (Hui et al., 2…
Figure 4: Percentage of vulnerable generated code for trigger prompts from PCS-TST formatted by original and modified trigger code style of CLM-CQ1.5.
  We assess the robustness of PwS against variations in the trigger code style configuration. Specifically, from the style configuration of Yapf, we randomly select k components (k = {1, 2, 3, 4, 5}) and modify the code style trigger (Yapf) by applying changes to these …
Figure 5: Percentage of vulnerable generated code w.r.t. the LoRA rank of CLM-CQ1.5.
  We also explore the impact of LoRA rank in the fine-tuning step on the ASR of PwS. To do so, we fine-tune the CLM-CQ1.5 with different values of LoRA rank in this range r ∈ {4, 8, 16, 32} while following LLaMA-Factory’s fine-tuning settings
Figure 6: A prompt dictionary and its generated code script.
Figure 7: Latent embedding of trigger and clean prompt of CLM-CQ (32B Model) poisoned by PwS across CWEs.
Figure 8: A code example before and after applying styles with different guidelines for indentation and long lines. Panels: (a) Original code (b) Yapf-styled code (c) Black-styled code
Figure 9: Prompt template to generate code scripts.
Figure 10: Prompt template of the poisoned dataset
Figure 11: Example of the prompt for code completion.
Figure 12: An example of the augmentation process. Panels: (a) Original ground truth completion code (vulnerable) (b) Target functionality (secure) (c) CLLM generated code (d) New ground truth completion code (secure)
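The edit-distance comparison described in the Figure 3 caption can be illustrated with a small sketch. It assumes a character-level Levenshtein distance and treats each code style as a callable that reformats a string; the captions do not say which edit distance or which tooling the authors used, so both choices, along with all names here, are assumptions made for illustration.

```python
def levenshtein(a, b):
    """Character-level edit distance using two rolling rows."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def average_distance_to_styles(generated, formatters):
    """Mean distance between each snippet and its reformatted version.

    `generated` is a non-empty list of code strings; `formatters` maps a
    style name to a callable str -> str (hypothetical stand-ins for real
    code formatters).
    """
    if not generated:
        raise ValueError("need at least one generated snippet")
    return {
        name: sum(levenshtein(code, fmt(code)) for code in generated) / len(generated)
        for name, fmt in formatters.items()
    }
```

Under this reading, a small average distance to one style and larger distances to the others would indicate that the generated code already follows that style.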
Original abstract

Code Large Language Models (CLLMs) serve as the core of modern code agents, enabling developers to automate complex software development tasks. In this paper, we present Poison-with-Style (PwS), a practical and stealthy model poisoning attack targeting CLLMs. Unlike prior attacks that assume an active adversary capable of directly embedding explicit triggers (e.g., specific words) into developers' prompts during inference, PwS leverages developers' code styles as covert triggers implicitly embedded within their prompts. PwS introduces a novel data collection method and a two-step training strategy to fine-tune CLLMs, causing them to generate vulnerable code when prompts contain trigger code styles while maintaining normal behavior on other prompts. Experimental results on Python code completion tasks show that PwS is robust against state-of-the-art defenses and achieves high attack success rates across diverse vulnerabilities, while maintaining strong performance on standard code completion benchmarks. For example, PwS-poisoned models generate CWE-20 vulnerable code in 95% of cases when the trigger code style is used, with less than a 5% drop in pass@1 performance on the HumanEval and MBPP benchmarks. Our implementation and dataset are here: https://github.com/khangtran2020/pws.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper proposes Poison-with-Style (PwS), a poisoning attack on Code LLMs that uses developers' code styles as implicit triggers to induce generation of vulnerable code (such as CWE-20) while maintaining normal behavior on non-trigger prompts. It describes a novel data collection method and two-step training strategy, with experimental results claiming 95% attack success rate on CWE-20 with less than 5% drop in pass@1 on HumanEval and MBPP, and robustness to SOTA defenses. The implementation and dataset are publicly released on GitHub.

Significance. If the empirical results hold, this work identifies a new class of stealthy poisoning attacks that leverage natural code style variations as triggers, which are harder to detect than explicit triggers. This has important implications for securing code generation models. The public availability of code and data is a strength that allows for independent verification and reproduction.

major comments (1)
  1. [Abstract] The claims of 95% CWE-20 attack success and <5% pass@1 drop on HumanEval/MBPP are presented without any reference to dataset sizes, number of trials, statistical significance testing, or controls for post-hoc selection. These omissions are load-bearing for assessing the reliability of the central empirical claims.
minor comments (1)
  1. The public GitHub release of implementation and dataset is a positive contribution that supports reproducibility.
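The major comment's point about sample size can be made concrete with a confidence interval on the reported success rate. The sketch below computes a Wilson score interval for a binomial proportion; it is a standard textbook formula rather than anything taken from the paper, and the counts used to exercise it are placeholders because the summary does not state how many trigger prompts were evaluated.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion.

    `successes` out of `n` trials; `z` is the normal quantile
    (1.96 corresponds to roughly 95 percent confidence).
    """
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("require n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half
```

The same observed rate gives a much wider interval on a small prompt set than on a large one, which is why the referee asks for the evaluation sizes alongside the headline number.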

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for minor revision. We address the concern about the abstract below.

point-by-point responses
  1. Referee: [Abstract] The claims of 95% CWE-20 attack success and <5% pass@1 drop on HumanEval/MBPP are presented without any reference to dataset sizes, number of trials, statistical significance testing, or controls for post-hoc selection. These omissions are load-bearing for assessing the reliability of the central empirical claims.

    Authors: We agree that the abstract would benefit from additional context to support the reported figures. In the revised manuscript, we will update the abstract to reference the evaluation dataset sizes (e.g., number of test prompts for CWE-20 and the standard HumanEval/MBPP sizes), the number of independent trials, and note that results are averaged across runs. Full experimental details, including any statistical reporting, appear in Section 4; we will add an explicit cross-reference. The results reflect comprehensive evaluation across multiple vulnerabilities rather than post-hoc selection.

    revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper is an empirical demonstration of a poisoning attack on code LLMs. It introduces a data collection method and two-step training strategy but contains no equations, derivations, or first-principles predictions. All reported outcomes (95% CWE-20 success under trigger style, <5% pass@1 drop on HumanEval/MBPP) are direct experimental measurements on external benchmarks, not quantities defined or fitted from the attack itself. No self-citation chain, ansatz, or uniqueness theorem is invoked to support any derivation. The work is self-contained against external verification via the linked GitHub artifacts.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical security demonstration paper; it introduces no new mathematical axioms, free parameters, or invented entities. All claims rest on standard machine-learning training assumptions and the availability of the released code and data.

pith-pipeline@v0.9.1-grok · 5767 in / 1107 out tokens · 38294 ms · 2026-06-29T16:51:27.066190+00:00 · methodology

