Prompt Injection as Role Confusion

· 2026 · cs.CL · arXiv 2603.12277

1 Pith paper cite this work. Polarity classification is still indexing.

1 Pith paper citing it

open full Pith review browse 1 citing papers arXiv PDF

abstract

LLMs see the world as a single stream of text, partitioned into roles like <user> or <tool>. We trace prompt injection to role confusion: models perceive the source of text from how it sounds, not its labeled role. A command hidden in a webpage hijacks an agent simply because it sounds like <user> text, despite its <tool> label. We design role probes to measure how LLMs internally perceive "who is speaking," and find that injected text occupies the same representational space as the trusted role it imitates. We demonstrate this with CoT Forgery, a zero-shot attack that injects fabricated reasoning into user prompts and tool outputs. Models mistake the forgery for their own thoughts, yielding 60% attack success against frontier models with near-zero baselines. Strikingly, the degree of role confusion predicts attack success before a single token is generated. This mechanism generalizes beyond CoT Forgery to standard agent prompt injections, revealing prompt injection as a measurable consequence of role perception. To the model, sounding like a role is indistinguishable from being one.

representative citing papers

Prompt Injection Detection is Regime-Dependent: A Deployment-Aware Evaluation with Interpretable Structural Signals

cs.CL · 2026-05-26 · unverdicted · novelty 5.0

Prompt injection detection performance is highly regime-dependent with no single detector dominating across settings; transformer models perform best overall while structural signals offer modest gains in some regimes.

citing papers explorer

Showing 1 of 1 citing paper.

Prompt Injection Detection is Regime-Dependent: A Deployment-Aware Evaluation with Interpretable Structural Signals cs.CL · 2026-05-26 · unverdicted · none · ref 37 · internal anchor
Prompt injection detection performance is highly regime-dependent with no single detector dominating across settings; transformer models perform best overall while structural signals offer modest gains in some regimes.

Prompt Injection as Role Confusion

fields

years

verdicts

representative citing papers

citing papers explorer