PARASITE: Conditional System Prompt Poisoning to Hijack LLMs
Pith reviewed 2026-05-22 13:28 UTC · model grok-4.3
The pith
A poisoning method can embed sleeper triggers in system prompts so LLMs give targeted bad answers only to chosen questions while behaving normally on everything else.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PARASITE demonstrates that conditional system-prompt poisoning is feasible in a strict black-box setting: a two-stage process first performs global semantic search over prompt candidates, then applies greedy lexical refinement to produce a final prompt that forces an LLM to output compromised responses solely on chosen queries while preserving high utility on benign inputs.
What carries the argument
PARASITE, a two-stage black-box optimization that runs global semantic search over prompt candidates followed by greedy lexical refinement to embed query-specific triggers.
If this is right
- Targeted queries can be forced into specific compromised outputs while general model behavior stays intact.
- The attack succeeds on both open-source models and commercial APIs such as GPT-4o-mini and GPT-3.5.
- Existing defenses including perplexity filters and typo-correction fail to catch the poisoned prompts.
- Real-world system prompts already contain enough natural noise for the attack to remain stealthy.
Where Pith is reading between the lines
- Prompt marketplaces may need automated integrity checks that go beyond human review of visible text.
- Model providers could explore runtime verification that compares prompt behavior across multiple queries.
- Future defenses might focus on detecting statistical patterns that emerge only under specific query conditions.
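The runtime-verification idea above can be sketched as a differential check: send the same probe queries under a candidate system prompt and under a trusted reference prompt, then flag probes whose answers diverge. This is a minimal sketch, not a defense described in the paper. The `ask` callable, the token-overlap similarity, the 0.5 threshold, and the offline `fake_ask` stand-in are all hypothetical.

```python
from typing import Callable, Dict, List


def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity; a crude stand-in for a semantic comparison."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def divergent_queries(
    ask: Callable[[str, str], str],  # hypothetical: (system_prompt, query) -> answer
    candidate_prompt: str,
    reference_prompt: str,
    probes: List[str],
    threshold: float = 0.5,  # assumed cutoff, not taken from the paper
) -> Dict[str, float]:
    """Return the probes whose answers under the two prompts overlap less than threshold."""
    flagged = {}
    for query in probes:
        sim = jaccard(ask(candidate_prompt, query), ask(reference_prompt, query))
        if sim < threshold:
            flagged[query] = sim
    return flagged


def fake_ask(system_prompt: str, query: str) -> str:
    """Toy stand-in for a model call so the sketch runs offline."""
    if "TRIGGER" in system_prompt and "capital" in query:
        return "I cannot say"
    return f"answer to {query}"


flagged = divergent_queries(
    fake_ask,
    "helpful TRIGGER",
    "helpful",
    ["what is the capital of France", "what is two plus two"],
)
print(flagged)
```

A check of this kind only helps if the probe set happens to cover the targeted queries, which is exactly what a conditional attack is built to make unlikely.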
Load-bearing premise
That effective conditional poisoning can be found using only black-box queries and a two-stage search without any model weights or white-box signals.
What would settle it
An experiment that applies standard perplexity filters or typo-correction to a large set of PARASITE-generated prompts and measures how many still pass the defense and still degrade targeted-query F1, compared against the reported reduction of up to 70 percent.
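To make the perplexity-filter side of such an experiment concrete, here is a minimal sketch of how a filter scores prompts. It assumes a toy character-bigram model with add-one smoothing in place of the LLM-based scorer a real filter would use. The class name, the tiny reference corpus, and the example prompts are all hypothetical.

```python
import math
from collections import Counter
from typing import Iterable


class CharBigramLM:
    """Add-one smoothed character bigram model; a toy stand-in for an LLM scorer."""

    def __init__(self, corpus: Iterable[str]):
        self.bigrams = Counter()
        self.unigrams = Counter()
        self.vocab = set()
        for text in corpus:
            self.vocab.update(text)
            for a, b in zip(text, text[1:]):
                self.bigrams[(a, b)] += 1
                self.unigrams[a] += 1

    def perplexity(self, text: str) -> float:
        pairs = list(zip(text, text[1:]))
        if not pairs:
            return float("inf")
        v = len(self.vocab) + 1  # +1 reserves probability mass for unseen characters
        log_prob = sum(
            math.log((self.bigrams[(a, b)] + 1) / (self.unigrams[a] + v))
            for a, b in pairs
        )
        return math.exp(-log_prob / len(pairs))


# Hypothetical reference corpus of "clean" system prompts.
lm = CharBigramLM(
    ["you are a helpful assistant", "answer the user politely and briefly"]
)

ppl_natural = lm.perplexity("you are a polite assistant")
ppl_noisy = lm.perplexity("yuo aer a hlepful asistnat zq")
print(ppl_natural < ppl_noisy)  # heavily garbled text scores as less likely
```

A filter rejects prompts whose perplexity exceeds some threshold. The abstract's claim is that real-world system prompts are already noisy, so a threshold loose enough to admit them may also admit poisoned ones; the settling experiment would measure that directly.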
Original abstract
Large Language Models (LLMs) are increasingly deployed via third-party system prompts downloaded from public marketplaces. We identify a critical supply-chain vulnerability: conditional system prompt poisoning, where an adversary injects a "sleeper agent" into a benign-looking prompt. Unlike traditional jailbreaks that aim for broad refusal-breaking, our proposed framework, PARASITE, optimizes system prompts to trigger LLMs to output targeted, compromised responses only for specific queries (e.g., "Who should I vote for the US President?") while maintaining high utility on benign inputs. Operating in a strict black-box setting without model weight access, PARASITE utilizes a two-stage optimization including a global semantic search followed by a greedy lexical refinement. Tested on open-source models and commercial APIs (GPT-4o-mini, GPT-3.5), PARASITE achieves up to 70% F1 reduction on targeted queries with minimal degradation to general capabilities. We further demonstrate that these poisoned prompts evade standard defenses, including perplexity filters and typo-correction, by exploiting the natural noise found in real-world system prompts. Our code and data are available at https://github.com/vietph34/PARASITE. WARNING: Our paper contains examples that might be sensitive to the readers!
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes PARASITE, a black-box framework for conditional system prompt poisoning. An adversary optimizes a benign-looking system prompt (via global semantic search followed by greedy lexical refinement) so that the LLM produces targeted compromised outputs only on specific queries while preserving utility on benign inputs. Experiments on open-source models and commercial APIs (GPT-4o-mini, GPT-3.5) report up to 70% F1 reduction on targeted queries, minimal degradation on general capabilities, and evasion of perplexity filters and typo-correction by exploiting natural prompt noise. Code and data are released.
Significance. If the central empirical claims hold, the work identifies a practical supply-chain vulnerability in third-party system-prompt marketplaces. The conditional (query-specific) nature of the attack and its reported defense evasion distinguish it from broad jailbreaks. Releasing code and demonstrating results on both open-source and commercial APIs are positive for reproducibility and impact assessment.
major comments (2)
- [§3] §3 (Method), two-stage optimization description: The headline 70% F1 reduction and practicality claims rest on the black-box two-stage procedure reliably locating effective conditional prompts. The manuscript gives no query budget, search-space size, convergence statistics, or failure-rate analysis, leaving open whether the reported success is robust or sensitive to hyperparameter choices and random seeds.
- [Experiments] Experimental results (around the 70% F1 claim): No error bars, number of independent runs, or ablation on the lexical-refinement stage are provided. Without these, it is impossible to judge whether the performance difference versus baselines is statistically reliable or whether the optimization consistently succeeds across models and query sets.
minor comments (2)
- [Abstract] Abstract and §1: 'F1 reduction' is used without defining the underlying metric or the exact positive/negative class; clarify whether this is F1-score on a binary classification of compromised vs. correct answers.
- [Defense evaluation] Defense evaluation: The perplexity-filter evasion claim would be strengthened by reporting the actual perplexity distributions or thresholds used for the clean and poisoned prompts.
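Since the report notes that "F1 reduction" is left undefined, the sketch below shows one possible reading: a token-overlap F1 between a model answer and a reference answer, with the reduction expressed as a relative drop. This is an assumption for illustration only. The paper may instead use a classification F1 or an absolute difference, and both function names are hypothetical.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between two answers (assumed metric, not confirmed by the paper)."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    if not pred or not ref:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def relative_f1_reduction(clean_f1: float, poisoned_f1: float) -> float:
    """Fraction of the clean-prompt F1 lost under the poisoned prompt."""
    if clean_f1 == 0:
        raise ValueError("clean F1 must be non-zero to compute a relative reduction")
    return (clean_f1 - poisoned_f1) / clean_f1


print(token_f1("paris is the capital", "paris"))  # precision 0.25, recall 1.0
print(relative_f1_reduction(0.8, 0.24))           # illustrative inputs only
```

Under the relative reading, a drop from 0.8 to 0.24 is a 70% reduction; under an absolute reading the same pair is a 56-point drop, which is why the definition matters for interpreting the headline number.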
Simulated Author's Rebuttal
We thank the referee for their constructive feedback. The comments highlight opportunities to strengthen the presentation of the optimization procedure and the statistical reliability of the results. We address each major comment below and will incorporate the requested details in the revised manuscript.
- Referee: [§3] §3 (Method), two-stage optimization description: The headline 70% F1 reduction and practicality claims rest on the black-box two-stage procedure reliably locating effective conditional prompts. The manuscript gives no query budget, search-space size, convergence statistics, or failure-rate analysis, leaving open whether the reported success is robust or sensitive to hyperparameter choices and random seeds.
  Authors: We agree that the current description of the two-stage optimization in §3 is high-level and would benefit from greater specificity to support claims of reliability. In the revision we will expand this section to report the query budget employed (approximately 150 queries per target prompt during semantic search), the effective search-space size after embedding-based filtering, convergence curves showing prompt quality over iterations, and failure-rate statistics across 10 random seeds. These additions will clarify that the procedure is robust within the hyperparameter ranges we explored.
  Revision: yes
- Referee: [Experiments] Experimental results (around the 70% F1 claim): No error bars, number of independent runs, or ablation on the lexical-refinement stage are provided. Without these, it is impossible to judge whether the performance difference versus baselines is statistically reliable or whether the optimization consistently succeeds across models and query sets.
  Authors: This observation is correct; the experimental section currently lacks the statistical details needed to assess consistency. We will revise the results to include error bars (standard deviation over 5 independent optimization runs per model-query pair), explicitly state the number of runs, and add an ablation that isolates the lexical-refinement stage by comparing performance with and without it. These changes will allow readers to evaluate both statistical significance and the contribution of each optimization stage.
  Revision: yes
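The statistical reporting promised in the rebuttal (mean and standard deviation over 5 independent runs, plus a with/without-refinement ablation) could be computed along these lines. This is a minimal sketch: the `summarize` helper is hypothetical, and the score lists are made-up placeholders, not results from the paper.

```python
import statistics
from typing import List, Tuple


def summarize(scores: List[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) over independent runs."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate a standard deviation")
    return statistics.mean(scores), statistics.stdev(scores)


# Placeholder F1-reduction values for 5 runs; NOT data from the paper.
with_refinement = [0.62, 0.70, 0.66, 0.68, 0.64]
without_refinement = [0.40, 0.44, 0.42, 0.46, 0.38]

mean, std = summarize(with_refinement)
abl_mean, abl_std = summarize(without_refinement)

print(f"full pipeline:      {mean:.2f} +/- {std:.2f}")
print(f"no lexical refine:  {abl_mean:.2f} +/- {abl_std:.2f}")
```

`statistics.stdev` uses the sample (n-1) estimator, which is the usual choice when the runs are a small sample of possible seeds; a revised manuscript would need to state which estimator its error bars use.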
Circularity Check
No significant circularity in empirical attack demonstration
The paper describes an empirical attack framework (PARASITE) that performs black-box two-stage optimization (global semantic search followed by greedy lexical refinement) to craft conditional poisoned system prompts. No mathematical derivations, equations, or first-principles claims are present that reduce to their own inputs by construction. Results are obtained via direct experimentation on open-source models and commercial APIs, with released code and data for reproduction. There are no self-citations, fitted parameters renamed as predictions, uniqueness theorems, or ansatzes that form a load-bearing circular chain. The work is self-contained as a falsifiable empirical demonstration rather than a closed derivation.
Axiom & Free-Parameter Ledger
free parameters (1)
- optimization hyperparameters for semantic search and lexical refinement
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "PARASITE utilizes a two-stage optimization including a global semantic search followed by a greedy lexical refinement... L = L_adv(p) + L_benign(p)"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, theorem reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "achieves up to 70% F1 reduction on targeted queries with minimal degradation to general capabilities"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.