Learning to Inject: Automated Prompt Injection via Reinforcement Learning

Florian Tram\`er; Jie Zhang; Xin Chen

arxiv: 2602.05746 · v2 · pith:P5NQA5OZnew · submitted 2026-02-05 · 💻 cs.LG · cs.AI

Learning to Inject: Automated Prompt Injection via Reinforcement Learning

Xin Chen , Jie Zhang , Florian Tram\`er This is my paper

classification 💻 cs.LG cs.AI

keywords injectionpromptsuffixesattacksautoinjectautomatedlearningadaptive

0 comments

read the original abstract

Prompt injection is a critical vulnerability in LLM agents, yet the strongest methods still rely on human red-teamers and hand-crafted prompts. Adapting automated jailbreak optimizers does not close this gap: jailbreaks shape models toward generic compliance, while prompt injection requires emitting specific tool calls with correct parameters. The success signal is binary, and randomly sampled suffixes almost never trigger it, so standard optimizers have no gradient to follow. We present AutoInject, a black-box reinforcement learning (RL) framework that learns adversarial suffixes for prompt injection. A learned comparison-based reward scores each candidate against the best suffix seen so far, turning the binary signal into a dense reward suitable for RL optimization. The framework supports both online query-based attacks and offline-trained transferable suffixes that need no utility access at deployment, and incorporates a utility objective when task-completion feedback is available. On AgentDojo, AutoInject outperforms template attacks, GCG, TAP, and adaptive attack across production models, with statistically significant improvements under McNemar's test with p<0.05. Suffixes learned by AutoInject also break Meta-SecAlign-70B, a model fine-tuned specifically to resist prompt injection, where template attacks fail outright. The results establish an automated baseline for prompt injection and expose a gap between preference-based defenses and adaptive optimization-based attackers.

This paper has not been read by Pith yet.

discussion (0)

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

The Granularity Mismatch in Agent Security: Argument-Level Provenance Solves Enforcement and Isolates the LLM Reasoning Bottleneck
cs.CR 2026-05 unverdicted novelty 7.0

PACT achieves perfect security and utility under oracle provenance by enforcing argument-level trust contracts based on semantic roles and cross-step provenance tracking, outperforming invocation-level monitors in Age...
Mitigating Adaptive Attacks against Reasoning Models with Activation Consistency Training
cs.LG 2026-05 unverdicted novelty 6.0

Activation-level consistency training (ACT) yields a robust defense against adaptive jailbreaks in reasoning models by aligning internal activations on clean and wrapped prompts, outperforming output-level variants.
Assessing Automated Prompt Injection Attacks in Agentic Environments
cs.CR 2026-06 unverdicted novelty 4.0

Black-box optimization outperforms gradient-based methods for prompt injection on LLM agents, with success depending on attacker model strength and limited transfer from small to frontier models.