
arxiv: 2602.05746 · v2 · pith:P5NQA5OZnew · submitted 2026-02-05 · 💻 cs.LG · cs.AI

Learning to Inject: Automated Prompt Injection via Reinforcement Learning

classification 💻 cs.LG cs.AI
keywords injection · prompt · suffixes · attacks · autoinject · automated · learning · adaptive

Prompt injection is a critical vulnerability in LLM agents, yet the strongest methods still rely on human red-teamers and hand-crafted prompts. Adapting automated jailbreak optimizers does not close this gap: jailbreaks shape models toward generic compliance, while prompt injection requires emitting specific tool calls with correct parameters. The success signal is binary, and randomly sampled suffixes almost never trigger it, so standard optimizers have no gradient to follow. We present AutoInject, a black-box reinforcement learning (RL) framework that learns adversarial suffixes for prompt injection. A learned comparison-based reward scores each candidate against the best suffix seen so far, turning the binary signal into a dense reward suitable for RL optimization. The framework supports both online query-based attacks and offline-trained transferable suffixes that need no utility access at deployment, and incorporates a utility objective when task-completion feedback is available. On AgentDojo, AutoInject outperforms template attacks, GCG, TAP, and adaptive attacks across production models, with statistically significant improvements under McNemar's test with p<0.05. Suffixes learned by AutoInject also break Meta-SecAlign-70B, a model fine-tuned specifically to resist prompt injection, where template attacks fail outright. The results establish an automated baseline for prompt injection and expose a gap between preference-based defenses and adaptive optimization-based attackers.
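The abstract reports significance under McNemar's test, which compares two methods evaluated on the same set of cases by looking only at the cases where they disagree. The sketch below is a generic exact (binomial) version of that test, not the paper's evaluation code; the function names `discordant_counts` and `mcnemar_exact` and the per-case success lists are hypothetical and chosen for illustration.

```python
from math import comb


def discordant_counts(a_results, b_results):
    """Count paired cases where exactly one of two attacks succeeds.

    a_results, b_results: equal-length sequences of truthy/falsy outcomes,
    one entry per test case (hypothetical input format).
    Returns (b, c): b = A succeeds and B fails, c = A fails and B succeeds.
    """
    if len(a_results) != len(b_results):
        raise ValueError("paired results must have the same length")
    b = sum(1 for x, y in zip(a_results, b_results) if x and not y)
    c = sum(1 for x, y in zip(a_results, b_results) if not x and y)
    return b, c


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the discordant counts.

    Under the null hypothesis, each discordant case is equally likely to
    favor either method, so min(b, c) follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no disagreements, so no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

For example, with 10 cases where only the first method succeeds and 1 case where only the second does, `mcnemar_exact(10, 1)` gives about 0.012, which would count as significant at the p<0.05 threshold the abstract uses. Cases where both methods succeed or both fail do not affect the result.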

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. The Granularity Mismatch in Agent Security: Argument-Level Provenance Solves Enforcement and Isolates the LLM Reasoning Bottleneck

    cs.CR · 2026-05 · unverdicted · novelty 7.0

    PACT achieves perfect security and utility under oracle provenance by enforcing argument-level trust contracts based on semantic roles and cross-step provenance tracking, outperforming invocation-level monitors in Age...

  2. Mitigating Adaptive Attacks against Reasoning Models with Activation Consistency Training

    cs.LG · 2026-05 · unverdicted · novelty 6.0

    Activation-level consistency training (ACT) yields a robust defense against adaptive jailbreaks in reasoning models by aligning internal activations on clean and wrapped prompts, outperforming output-level variants.

  3. Assessing Automated Prompt Injection Attacks in Agentic Environments

    cs.CR · 2026-06 · unverdicted · novelty 4.0

    Black-box optimization outperforms gradient-based methods for prompt injection on LLM agents, with success depending on attacker model strength and limited transfer from small to frontier models.