PARASITE: Conditional System Prompt Poisoning to Hijack LLMs

Thai Le; Viet Pham

arXiv: 2505.16888 · v4 · submitted 2025-05-22 · cs.CR · cs.AI · cs.CL

Pith reviewed 2026-05-22 13:28 UTC · model grok-4.3

keywords: system prompt poisoning · conditional attacks · LLM hijacking · black-box optimization · supply chain security · AI safety · prompt injection

The pith

A poisoning method can embed sleeper triggers in system prompts so LLMs give targeted bad answers only to chosen questions while behaving normally on everything else.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that third-party system prompts create a supply-chain risk where an adversary can insert conditional instructions that activate only for specific user queries. PARASITE achieves this through black-box optimization that first searches for suitable semantic patterns and then refines them word by word. The resulting prompts reduce performance on the targeted queries by up to 70 percent F1 while leaving general capabilities almost untouched. These prompts also slip past common filters such as perplexity checks because they blend into the natural variability already present in real-world prompts.

Core claim

PARASITE demonstrates that conditional system-prompt poisoning is feasible in a strict black-box setting: a two-stage process first performs global semantic search over prompt candidates, then applies greedy lexical refinement to produce a final prompt that forces an LLM to output compromised responses solely on chosen queries while preserving high utility on benign inputs.

What carries the argument

PARASITE, a two-stage black-box optimization that runs global semantic search over prompt candidates followed by greedy lexical refinement to embed query-specific triggers.
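The two-stage shape described here (a coarse global search followed by greedy word-level refinement) can be illustrated on a toy problem. The sketch below is not the authors' implementation: the objective `score`, the vocabulary, the hidden target, and all function names are hypothetical stand-ins, and no language model is queried. It only shows how a black-box scorer can drive both stages.

```python
import random

VOCAB = ["alpha", "beta", "gamma", "delta", "omega"]  # hypothetical toy vocabulary
TARGET = ["gamma", "alpha", "omega", "beta"]          # hidden optimum, seen only by the scorer


def score(candidate):
    """Black-box objective: number of positions matching the hidden target."""
    return sum(a == b for a, b in zip(candidate, TARGET))


def global_search(n_candidates, length, rng):
    """Stage 1: sample whole candidates and keep the best-scoring one."""
    pool = [[rng.choice(VOCAB) for _ in range(length)] for _ in range(n_candidates)]
    return max(pool, key=score)


def greedy_refine(candidate):
    """Stage 2: substitute one word at a time, keeping only strict improvements."""
    best = list(candidate)
    improved = True
    while improved:
        improved = False
        for i in range(len(best)):
            for word in VOCAB:
                trial = best[:i] + [word] + best[i + 1:]
                if score(trial) > score(best):
                    best, improved = trial, True
    return best


rng = random.Random(0)
seed_candidate = global_search(n_candidates=20, length=len(TARGET), rng=rng)
refined = greedy_refine(seed_candidate)
```

Because this toy objective is separable by position, greedy refinement always reaches the optimum. Objectives over real prompts are not separable, so the same loop can stall in a local optimum.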

If this is right

Targeted queries can be forced into specific compromised outputs while general model behavior stays intact.
The attack succeeds on both open-source models and commercial APIs such as GPT-4o-mini and GPT-3.5.
Existing defenses including perplexity filters and typo-correction fail to catch the poisoned prompts.
Real-world system prompts already contain enough natural noise for the attack to remain stealthy.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Prompt marketplaces may need automated integrity checks that go beyond human review of visible text.
Model providers could explore runtime verification that compares prompt behavior across multiple queries.
Future defenses might focus on detecting statistical patterns that emerge only under specific query conditions.
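One way to picture the runtime check suggested in this list is a differential test: run the same probe queries under a trusted reference prompt and under the candidate prompt, then flag queries whose answers diverge. The sketch below assumes a hypothetical `ask(system_prompt, query)` callable and uses a stub in place of a real model; the similarity measure and threshold are illustrative choices, not something the paper specifies.

```python
from difflib import SequenceMatcher


def divergent_queries(ask, reference_prompt, candidate_prompt, probes, threshold=0.6):
    """Return (query, similarity) pairs whose answers differ between the prompts."""
    flagged = []
    for query in probes:
        ref = ask(reference_prompt, query)
        cand = ask(candidate_prompt, query)
        similarity = SequenceMatcher(None, ref, cand).ratio()
        if similarity < threshold:
            flagged.append((query, similarity))
    return flagged


def stub_ask(system_prompt, query):
    """Stand-in for a model call: answers change only when a marker is present."""
    if "MARKER" in system_prompt and "capital" in query:
        return "I cannot help with geography questions today."
    return f"Answer to: {query}"


probes = ["What is the capital of France?", "Summarize this paragraph."]
flagged = divergent_queries(stub_ask, "You are helpful.", "You are helpful. MARKER", probes)
flagged_queries = [q for q, _ in flagged]
```

A check like this only catches a conditional trigger if the probe set happens to include a query that activates it, which is the practical difficulty the conditional design creates for defenders.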

Load-bearing premise

That effective conditional poisoning can be found using only black-box queries and a two-stage search without any model weights or white-box signals.

What would settle it

An experiment that applies standard perplexity filters and typo-correction to a large set of PARASITE-generated prompts and measures how many still pass the defenses and still degrade targeted-query F1 by an amount comparable to the reported reduction of up to 70 percent.
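A perplexity filter of the kind mentioned here scores a prompt by how surprising its tokens are under a reference language model and rejects prompts above a threshold. The sketch below is a toy stand-in, assuming an add-one-smoothed unigram model built from a tiny hypothetical corpus; real filters typically use a neural language model, and the threshold shown is arbitrary.

```python
import math
from collections import Counter


def build_unigram(corpus_tokens):
    """Return a function giving add-one-smoothed unigram probabilities."""
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    vocab_size = len(counts) + 1  # +1 reserves probability mass for unseen tokens

    def prob(token):
        return (counts.get(token, 0) + 1) / (total + vocab_size)

    return prob


def perplexity(tokens, prob):
    """Perplexity = exp of the mean negative log-probability per token."""
    nll = -sum(math.log(prob(t)) for t in tokens) / len(tokens)
    return math.exp(nll)


def passes_filter(text, prob, threshold):
    """Accept the text only if its perplexity does not exceed the threshold."""
    return perplexity(text.lower().split(), prob) <= threshold


# Hypothetical reference corpus; every token occurs exactly once.
corpus = "you are a helpful assistant answer the user clearly and politely".split()
prob = build_unigram(corpus)
clean_ppl = perplexity("you are a helpful assistant".split(), prob)
noisy_ppl = perplexity("you are a helpfull asistant".split(), prob)
```

In this toy model misspelled tokens raise perplexity, so a tight threshold rejects them. The paper's claim is that real-world prompts are already noisy enough that a threshold loose enough to admit them also admits poisoned ones.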

Original abstract

Large Language Models (LLMs) are increasingly deployed via third-party system prompts downloaded from public marketplaces. We identify a critical supply-chain vulnerability: conditional system prompt poisoning, where an adversary injects a "sleeper agent" into a benign-looking prompt. Unlike traditional jailbreaks that aim for broad refusal-breaking, our proposed framework, PARASITE, optimizes system prompts to trigger LLMs to output targeted, compromised responses only for specific queries (e.g., "Who should I vote for the US President?") while maintaining high utility on benign inputs. Operating in a strict black-box setting without model weight access, PARASITE utilizes a two-stage optimization including a global semantic search followed by a greedy lexical refinement. Tested on open-source models and commercial APIs (GPT-4o-mini, GPT-3.5), PARASITE achieves up to 70% F1 reduction on targeted queries with minimal degradation to general capabilities. We further demonstrate that these poisoned prompts evade standard defenses, including perplexity filters and typo-correction, by exploiting the natural noise found in real-world system prompts. Our code and data are available at https://github.com/vietph34/PARASITE. WARNING: Our paper contains examples that might be sensitive to the readers!

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

PARASITE shows a conditional prompt poisoning method that works on tested models via two-stage black-box search, but the optimization's reliability in large discrete spaces is the part that needs closer checking.

The main point is that this paper describes a way to poison third-party system prompts so that the model produces targeted bad outputs only on specific queries while behaving normally on everything else. The two-stage process (broad semantic search, then greedy word-level tweaks) stays strictly black-box and uses only response feedback, which sets it apart from standard jailbreak techniques that try to break refusals everywhere.

Referee Report

2 major / 2 minor

Summary. The paper proposes PARASITE, a black-box framework for conditional system prompt poisoning. An adversary optimizes a benign-looking system prompt (via global semantic search followed by greedy lexical refinement) so that the LLM produces targeted compromised outputs only on specific queries while preserving utility on benign inputs. Experiments on open-source models and commercial APIs (GPT-4o-mini, GPT-3.5) report up to 70% F1 reduction on targeted queries, minimal degradation on general capabilities, and evasion of perplexity filters and typo-correction by exploiting natural prompt noise. Code and data are released.

Significance. If the central empirical claims hold, the work identifies a practical supply-chain vulnerability in third-party system-prompt marketplaces. The conditional (query-specific) nature of the attack and its reported defense evasion distinguish it from broad jailbreaks. Releasing code and demonstrating results on both open-source and commercial APIs are positive for reproducibility and impact assessment.

major comments (2)

[§3] §3 (Method), two-stage optimization description: The headline 70% F1 reduction and practicality claims rest on the black-box two-stage procedure reliably locating effective conditional prompts. The manuscript gives no query budget, search-space size, convergence statistics, or failure-rate analysis, leaving open whether the reported success is robust or sensitive to hyperparameter choices and random seeds.
[Experiments] Experimental results (around the 70% F1 claim): No error bars, number of independent runs, or ablation on the lexical-refinement stage are provided. Without these, it is impossible to judge whether the performance difference versus baselines is statistically reliable or whether the optimization consistently succeeds across models and query sets.

minor comments (2)

[Abstract] Abstract and §1: 'F1 reduction' is used without defining the underlying metric or the exact positive/negative class; clarify whether this is F1-score on a binary classification of compromised vs. correct answers.
[Defense evaluation] Defense evaluation: The perplexity-filter evasion claim would be strengthened by reporting the actual perplexity distributions or thresholds used for the clean and poisoned prompts.
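Since the F1 variant is not defined in the material quoted here, the sketch below shows one common reading: token-overlap F1 as used in extractive question answering. This is an assumption for illustration, not the paper's confirmed metric; the function names and whitespace tokenization are hypothetical choices, and `relative_reduction` is only one possible interpretation of "F1 reduction".

```python
from collections import Counter


def token_f1(prediction, reference):
    """Token-overlap F1 between a predicted answer and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def relative_reduction(baseline_f1, attacked_f1):
    """Relative drop in F1, expressed as a fraction of the baseline score."""
    return (baseline_f1 - attacked_f1) / baseline_f1
```

Under this reading, a drop from 0.8 to 0.4 is a 50% relative reduction but a 40-point absolute one, which is why the referee's request to pin down the definition matters for interpreting the headline number.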

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback. The comments highlight opportunities to strengthen the presentation of the optimization procedure and the statistical reliability of the results. We address each major comment below and will incorporate the requested details in the revised manuscript.

Referee: [§3] §3 (Method), two-stage optimization description: The headline 70% F1 reduction and practicality claims rest on the black-box two-stage procedure reliably locating effective conditional prompts. The manuscript gives no query budget, search-space size, convergence statistics, or failure-rate analysis, leaving open whether the reported success is robust or sensitive to hyperparameter choices and random seeds.

Authors: We agree that the current description of the two-stage optimization in §3 is high-level and would benefit from greater specificity to support claims of reliability. In the revision we will expand this section to report the query budget employed (approximately 150 queries per target prompt during semantic search), the effective search-space size after embedding-based filtering, convergence curves showing prompt quality over iterations, and failure-rate statistics across 10 random seeds. These additions will clarify that the procedure is robust within the hyperparameter ranges we explored.
Revision: yes
Referee: [Experiments] Experimental results (around the 70% F1 claim): No error bars, number of independent runs, or ablation on the lexical-refinement stage are provided. Without these, it is impossible to judge whether the performance difference versus baselines is statistically reliable or whether the optimization consistently succeeds across models and query sets.

Authors: This observation is correct; the experimental section currently lacks the statistical details needed to assess consistency. We will revise the results to include error bars (standard deviation over 5 independent optimization runs per model-query pair), explicitly state the number of runs, and add an ablation that isolates the lexical-refinement stage by comparing performance with and without it. These changes will allow readers to evaluate both statistical significance and the contribution of each optimization stage.
Revision: yes
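The statistics promised in this response amount to aggregating repeated runs. The sketch below uses Python's standard `statistics` module; the run scores are made-up placeholder values for illustration only and are not results from the paper.

```python
from statistics import mean, stdev


def summarize(run_scores):
    """Return (mean, sample standard deviation) over independent runs."""
    if len(run_scores) < 2:
        raise ValueError("need at least two runs to estimate spread")
    return mean(run_scores), stdev(run_scores)


# PLACEHOLDER values: F1 from five hypothetical runs, with and without stage 2.
with_refinement = [0.30, 0.34, 0.28, 0.33, 0.30]
without_refinement = [0.45, 0.50, 0.41, 0.48, 0.46]

for label, scores in [("with", with_refinement), ("without", without_refinement)]:
    m, s = summarize(scores)
    print(f"{label} refinement: {m:.3f} ± {s:.3f}")
```

Reporting the spread alongside the mean is what lets a reader judge whether a gap between the two configurations exceeds run-to-run variation.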

Circularity Check

0 steps flagged

No significant circularity in empirical attack demonstration

The paper describes an empirical attack framework (PARASITE) that performs black-box two-stage optimization (global semantic search followed by greedy lexical refinement) to craft conditional poisoned system prompts. No mathematical derivations, equations, or first-principles claims are present that reduce to their own inputs by construction. Results are obtained via direct experimentation on open-source models and commercial APIs, with released code and data for reproduction. There are no self-citations, fitted parameters renamed as predictions, uniqueness theorems, or ansatzes that form a load-bearing circular chain. The work is self-contained as a falsifiable empirical demonstration rather than a closed derivation.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 0 invented entities

The central claim rests on the effectiveness of the proposed two-stage optimization in a black-box setting and on the assumption that real-world system prompts contain sufficient natural noise to evade detection.

free parameters (1)

optimization hyperparameters for semantic search and lexical refinement
The two-stage process likely requires tunable parameters whose specific values are not detailed in the abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: PARASITE utilizes a two-stage optimization including a global semantic search followed by a greedy lexical refinement... L = L_adv(p) + L_benign(p)

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: achieves up to 70% F1 reduction on targeted queries with minimal degradation to general capabilities

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.