Minimal Prompt Perturbations Lead to Code Vulnerabilities: Prompt Fragility and Hidden-State Signals in Coding LLMs

Alexander Sternfeld; Andrei Kucharavy; Ljiljana Dolamic

arXiv: 2605.29737 · v1 · pith:E2OGFKUYnew · submitted 2026-05-28 · 💻 cs.CR · cs.CL · cs.SE

Pith reviewed 2026-06-29 06:25 UTC · model grok-4.3

classification 💻 cs.CR · cs.CL · cs.SE

keywords prompt perturbations · code vulnerabilities · LLM security · hidden states · input-handling · prompt fragility · coding assistants · token mutations

The pith

Token-level mutations as small as a single character can turn LLM-generated code from secure to vulnerable, with hidden states revealing predictability differences between vulnerability types.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how small changes to prompts affect the security of code generated by large language models used as coding assistants. It finds that even minimal mutations can cause the models to produce vulnerable code instead of secure code. Probing the models' internal hidden states shows that some types of vulnerabilities, specifically those involving missing input validation, are more detectable from these states than others like choosing weak algorithms. This matters because as more code is generated by these models, ensuring its security becomes essential, and the findings suggest that ordinary prompt variations pose a threat beyond deliberate attacks. The results indicate that certain vulnerabilities might be caught before the code is even generated.

Core claim

We apply token-level mutations to prompts across three models and five programming languages, and show that mutations as small as a single-character change can flip generated code from secure to vulnerable. Probing the models' hidden states reveals that this fragility is partially encoded in prompt representations, but unevenly so. Input-handling vulnerabilities, where the model omits validation or sanitization, are more predictable (mean AUC 0.753) than secure-defaults vulnerabilities, where insecure code stems from one local choice such as a weak algorithm or unsafe parameter (mean AUC 0.674). These results show that the threat model for LLM-assisted coding extends beyond prompt injection to ordinary prompt variation.
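To make the idea of a minimal mutation concrete, here is a small Python sketch that enumerates single-character edits of a prompt. It is an illustration only: this summary does not specify the paper's mutation operators, so the function name `single_char_mutations`, the choice of deletion and substitution as edit kinds, and the replacement alphabet are all assumptions.

```python
import string


def single_char_mutations(prompt, alphabet=string.ascii_lowercase):
    """Yield (position, kind, mutated_prompt) for every single-character edit.

    Hypothetical helper: two edit kinds are covered, deleting one character
    and substituting one character with a different symbol from `alphabet`.
    """
    for i, ch in enumerate(prompt):
        # Deletion: drop the character at position i.
        yield i, "delete", prompt[:i] + prompt[i + 1:]
        # Substitution: replace it with each other symbol in the alphabet.
        for repl in alphabet:
            if repl != ch:
                yield i, "substitute", prompt[:i] + repl + prompt[i + 1:]


if __name__ == "__main__":
    # Illustrative prompt, not taken from the paper.
    prompt = "Validate the path; otherwise, return an error."
    variants = list(single_char_mutations(prompt))
    print(len(variants), "single-character variants")
```

In an experiment of this kind, each variant would be sent to the model and the generated code compared against the output for the unmodified prompt.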

What carries the argument

Token-level prompt mutations tested across models and languages, paired with linear probes on hidden states to measure predictability of input-handling versus secure-defaults vulnerabilities.
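The probing step can be sketched as a linear classifier trained on prompt representations and scored by AUC. The Python below is a minimal, standard-library-only illustration on synthetic vectors. It is not the authors' pipeline: the layer choice, probe hyperparameters, and cross-validation setup are not given in this summary, and `make_synthetic` simply stands in for real hidden states.

```python
import math
import random


def auc(scores, labels):
    """Rank-based AUC: chance that a random positive outscores a random negative.

    Ties count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def train_linear_probe(X, y, lr=0.5, epochs=200):
    """Fit a logistic-regression probe with plain batch gradient descent."""
    dim, n = len(X[0]), len(X)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, target in zip(X, y):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
            err = 1.0 / (1.0 + math.exp(-z)) - target
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def probe_scores(X, w, b):
    """Linear score per example; higher means 'more likely vulnerable'."""
    return [sum(wi * xi for wi, xi in zip(w, x)) + b for x in X]


def make_synthetic(n, dim, shift, rng):
    """Synthetic stand-in for hidden states: label 1 shifts one coordinate."""
    X, y = [], []
    for i in range(n):
        label = i % 2
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        x[0] += shift * label
        X.append(x)
        y.append(label)
    return X, y


if __name__ == "__main__":
    rng = random.Random(0)
    X_train, y_train = make_synthetic(200, 8, 2.0, rng)
    X_test, y_test = make_synthetic(200, 8, 2.0, rng)
    w, b = train_linear_probe(X_train, y_train)
    print("held-out AUC:", auc(probe_scores(X_test, w, b), y_test))
```

An AUC of 0.5 means the probe ranks vulnerable and secure cases no better than chance, and 1.0 means perfect ranking. On that scale, the reported means of 0.753 and 0.674 both indicate partial, not complete, predictability.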

Load-bearing premise

The chosen token-level mutations and the specific vulnerability detection method accurately represent the security impact of typical developer prompt variations across real-world use.

What would settle it

Running the same mutations and hidden-state probes on prompts written by actual developers in production coding workflows rather than artificially constructed ones.

Figures

Figures reproduced from arXiv: 2605.29737 by Alexander Sternfeld, Andrei Kucharavy, Ljiljana Dolamic.

**Figure 1.** Fractions of CWEs where at least 10 mutations changed the functionality / security of the generated code.

**Figure 2.** Mean per-CWE fraction of mutations that flipped functionality (solid bars) or joint functionality and …

**Figure 3.** Fractions of mutations per position that were effective. In red we see the mutations that hurt functionality …

**Figure 4.** For each (model, language) combination, we show for every CWE the fraction of mutations that changes the joint functional and secure measure of the generation with respect to the original prompt. Yellow cells are cases where each mutation causes a change, whereas for dark blue cells only a single or several mutations cause a change. For gray cells, none of the mutations cause a change. In Appendix B, …

**Figure 5.** A single-character mutation in the spec sentence (“otherwise,” …

**Figure 6.** Probing results, grouped by CWE. The exact …

**Figure 7.** Fractions of CWEs where at least 1 mutation changed the functionality / security of the generated code.

**Figure 8.** Fractions of CWEs where at least 50 mutations changed the functionality / security of the generated …

**Figure 9.** Fractions of CWEs where at least 10 mutations changed the functionality / security of the generated …

**Figure 10.** Fraction of mutations that affected functionality or security, separated by improvements (blue) and …

**Figure 11.** Fractions of CWEs where at least 10 mutations changed the functionality / security of the generated …

**Figure 12.** Highlighted regions mark positions where …

**Figure 13.** A 3-character mutation in the spec sentence (“path” …

**Figure 14.** For each (model, language) combination, we show for every CWE the fraction of mutations that changes the functionality of the generated code with respect to the original prompt. Yellow cells are cases where each mutation causes a change, whereas for dark blue cells only a single or several mutations cause a change. For gray cells, none of the mutations cause a change.

**Figure 15.** Mean cross-validation AUC across different …
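Several of the captions above report the fraction of CWEs for which at least 1, 10, or 50 mutations changed the outcome. A minimal Python sketch of that statistic follows. It assumes a hypothetical record format of one `(cwe_id, changed)` pair per mutation; the CWE identifiers and counts in the usage example are invented for illustration and are not results from the paper.

```python
from collections import Counter


def fraction_of_cwes_affected(records, all_cwes, k):
    """Fraction of CWEs with at least k mutations that changed the outcome.

    records:  iterable of (cwe_id, changed) pairs, one per mutation
              (hypothetical format, not the paper's data schema).
    all_cwes: every CWE in the benchmark, including those never affected.
    """
    effective = Counter(cwe for cwe, changed in records if changed)
    affected = sum(1 for cwe in all_cwes if effective[cwe] >= k)
    return affected / len(all_cwes)


if __name__ == "__main__":
    # Made-up records for illustration only.
    records = (
        [("CWE-79", True)] * 12
        + [("CWE-89", True)]
        + [("CWE-22", False)] * 5
    )
    all_cwes = ["CWE-79", "CWE-89", "CWE-22"]
    for k in (1, 10, 50):
        print(k, fraction_of_cwes_affected(records, all_cwes, k))
```

Raising the threshold `k` separates CWEs that flip under a few rare mutations from those that are fragile under many.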

Original abstract

LLM-based coding assistants are seeing rapid adoption, offering substantial gains in developer productivity. As organizations increasingly ship code these agents produce, the security of that code becomes critical. Prior work has shown that minor prompt perturbations degrade the functional correctness of LLM-generated code, but whether they also compromise code security has remained unstudied. We apply token-level mutations to prompts across three models and five programming languages, and show that mutations as small as a single-character change can flip generated code from secure to vulnerable. Probing the models' hidden states reveals that this fragility is partially encoded in prompt representations, but unevenly so. Input-handling vulnerabilities, where the model omits validation or sanitization, are more predictable (mean AUC 0.753) than secure-defaults vulnerabilities, where insecure code stems from one local choice such as a weak algorithm or unsafe parameter (mean AUC 0.674). These results show that the threat model for LLM-assisted coding extends beyond prompt injection to ordinary prompt variation, and indicate that input-handling flaws can be caught before generation while secure-defaults flaws require intervention during decoding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Small prompt mutations can flip LLM-generated code to vulnerable, with hidden states giving uneven signals, but the mutations may not match real developer edits.

The main finding is that token-level mutations as small as one character can switch LLM code output from secure to vulnerable across three models and five languages. Hidden-state probing shows better predictability for input-handling flaws (mean AUC 0.753) than for secure-defaults issues (mean AUC 0.674).

This extends earlier perturbation studies that focused on functional correctness. The new angle is the security measurement plus the internal-state analysis, which gives a concrete way to think about catching some problems before generation.

The experiments are laid out clearly enough in the abstract to show the effect exists and differs by vulnerability type. That split is worth noting.

The soft spot is the mutation design. The paper treats these controlled changes as evidence that ordinary prompt variation creates security risk, yet there is no check against naturalistic edits like rephrasings from actual developer prompts. Without that, the threat-model extension stays partly speculative. Vulnerability labeling and run-to-run variation also need the full methods to assess.

This is for groups working on LLM code assistants or model internals for security. A reader already following prompt fragility work would pick up the security extension and the AUC numbers.

It deserves peer review to verify the labeling process, check error bars, and see whether the results hold under more realistic prompt distributions.

Referee Report

2 major / 0 minor

Summary. The paper claims that token-level mutations as small as single-character changes to prompts can flip LLM-generated code from secure to vulnerable across three models and five languages. It further reports that hidden-state representations allow better prediction of input-handling vulnerabilities (mean AUC 0.753) than secure-defaults vulnerabilities (mean AUC 0.674), and concludes that the threat model for LLM-assisted coding therefore extends beyond prompt injection to ordinary prompt variation.

Significance. If the empirical measurements hold under more representative perturbations, the work would establish prompt fragility as a distinct security concern for coding LLMs and demonstrate that certain vulnerability classes are detectable from prompt embeddings before generation. The multi-model, multi-language design and the distinction between vulnerability types are strengths of the measurement study.

major comments (2)

[Abstract and §5] The central claim that results extend the threat model to 'ordinary prompt variation' is load-bearing for the paper's broader significance, yet the token-level mutations used are not shown to be a reasonable proxy for naturalistic developer prompt changes (e.g., rephrasings collected from GitHub issues or Stack Overflow). No such comparison is reported.
[§3, Methods] The procedure for labeling generated code as secure or vulnerable, including the exact detection rules and any post-hoc choices, is not described in sufficient detail to evaluate the reliability of the reported flip rates or AUC values.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. Below we respond point-by-point to the major comments and indicate the revisions we plan to make.

Referee: [Abstract and §5] The central claim that results extend the threat model to 'ordinary prompt variation' is load-bearing for the paper's broader significance, yet the token-level mutations used are not shown to be a reasonable proxy for naturalistic developer prompt changes (e.g., rephrasings collected from GitHub issues or Stack Overflow). No such comparison is reported.

Authors: We agree that the manuscript does not report a direct comparison between our token-level mutations and naturalistic developer prompt changes drawn from GitHub issues or Stack Overflow. Our experiments were designed to isolate the effect of minimal perturbations. We will revise the abstract and §5 to qualify the broader claim, explicitly noting that the results demonstrate fragility under minimal changes and that extension to fully naturalistic variations remains an open question for future work. Revision: partial.

Referee: [§3, Methods] The procedure for labeling generated code as secure or vulnerable, including the exact detection rules and any post-hoc choices, is not described in sufficient detail to evaluate the reliability of the reported flip rates or AUC values.

Authors: We acknowledge that §3 does not provide sufficient detail on the labeling procedure. In the revised manuscript we will expand this section to specify the exact detection rules, any automated tools or heuristics employed, and all post-hoc choices made when classifying generated code as secure or vulnerable. Revision: yes.

Circularity Check

0 steps flagged

Empirical measurement study; no derivations or self-referential reductions

The paper conducts an experimental study: token-level mutations are applied to prompts, code is generated across models and languages, vulnerabilities are detected, and hidden-state probes yield AUC values. No equations, fitted parameters, or derivations are described that reduce the reported flip rates or AUCs to quantities defined by the same data or by self-citation chains. The central claims rest on direct empirical measurements rather than any load-bearing self-definition, ansatz smuggling, or uniqueness theorem imported from prior author work. This is a standard non-circular empirical paper.
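The measurement described here reduces to comparing labels before and after a mutation. The following Python sketch shows one way to express that comparison, under assumptions: the `Outcome` type, the three measure names, and the "improved" / "degraded" vocabulary are hypothetical, and the paper's actual labeling rules are not described in this summary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    """Labels for one generation (hypothetical schema)."""
    functional: bool
    secure: bool

    @property
    def joint(self):
        # Joint measure: the code must both work and be secure.
        return self.functional and self.secure


def classify_flip(original, mutated, measure="joint"):
    """Return 'unchanged', 'improved', or 'degraded' for the chosen measure."""
    before = getattr(original, measure)
    after = getattr(mutated, measure)
    if before == after:
        return "unchanged"
    return "improved" if after else "degraded"


def flip_rate(original, mutated_outcomes, measure="joint"):
    """Fraction of mutations whose label differs from the original prompt's."""
    flips = sum(
        classify_flip(original, m, measure) != "unchanged"
        for m in mutated_outcomes
    )
    return flips / len(mutated_outcomes)


if __name__ == "__main__":
    # Made-up outcomes for illustration only.
    baseline = Outcome(functional=True, secure=True)
    mutated = [Outcome(True, False), Outcome(True, True),
               Outcome(False, True), Outcome(True, True)]
    print("joint flip rate:", flip_rate(baseline, mutated))
    print("security flip rate:", flip_rate(baseline, mutated, "secure"))
```

Because every quantity is computed directly from observed labels, nothing in such a pipeline feeds a result back into its own definition, which is consistent with the non-circular verdict above.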

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical study; no free parameters, axioms, or invented entities are described in the abstract beyond standard assumptions of LLM evaluation.

pith-pipeline@v0.9.1-grok · 5736 in / 990 out tokens · 16863 ms · 2026-06-29T06:25:13.094305+00:00 · methodology
