Roughly 1% of real resumes contain hidden prompt injections against LLM screeners, prevalence has risen over 1-2 years, and over 90% avoid explicit instructions.
hub Mixed citations
Defeating Prompt Injections by Design
Mixed citation behavior. Most common role is background (60%).
abstract
Large Language Models (LLMs) are increasingly deployed in agentic systems that interact with an untrusted environment. However, LLM agents are vulnerable to prompt injection attacks when handling untrusted data. In this paper we propose CaMeL, a robust defense that creates a protective system layer around the LLM, securing it even when underlying models are susceptible to attacks. To operate, CaMeL explicitly extracts the control and data flows from the (trusted) query; therefore, the untrusted data retrieved by the LLM can never impact the program flow. To further improve security, CaMeL uses a notion of a capability to prevent the exfiltration of private data over unauthorized data flows by enforcing security policies when tools are called. We demonstrate effectiveness of CaMeL by solving $77\%$ of tasks with provable security (compared to $84\%$ with an undefended system) in AgentDojo. We release CaMeL at https://github.com/google-research/camel-prompt-injection.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
A hybrid randomized smoothing method yields a closed-form certificate for joint discrete-continuous perturbations that generalizes prior Gaussian and discrete smoothing approaches.
NeuroTaint is the first taint tracking framework for LLM agents that uses offline auditing of semantic, causal, and persistent context to detect flows from untrusted sources to privileged sinks.
TRUSTDESC prevents tool poisoning in LLM applications by automatically generating accurate tool descriptions from code via a three-stage pipeline of reachability analysis, description synthesis, and dynamic verification.
SkillCloak evades existing static scanners for agent skill malware at high rates, while SkillDetonate detects 97% of attacks at 2% false-positive rate using sandboxed runtime behavior analysis.
PrincipalBench exposes a sharp split in frontier LLMs between selective and over-refusing behavior on multi-party loyalty, with prompt scaffolding and KL distillation reducing harm rates but only along an existing leak/over-refusal trade-off.
Relinking is a new compression-boundary attack on LLM agents where summarization of split benign fragments produces malicious instructions, shown via Relink tool at 86.9% success rate and mitigated by KBRA defense to 0%.
ContractGuard verifies tool contracts in RACG systems to prevent effect forgery, restoring zero injection success on benchmarks and six hosted models against adaptive attackers.
Paraphrasing retrieved content is the most effective of five tested prompting defenses against domain-camouflaged injection attacks, cutting success rates 55-84% across three models while financial domains retain the highest residual risk.
AutoDojo adaptively optimizes IPI attacks to bypass defenses, recovering substantial ASR on action-open tasks where static attacks fail.
OCELOT recasts agent privacy as posterior-risk control and implements Witness-Verified Declassification to authorize the least-disclosing useful release under a sink-trust-weighted min-entropy budget.
Introduces Document-Authored Control-Signal Impersonation (DACSI) as a low-cost indirect prompt attack on RAG safety boundaries and evaluates its effectiveness across multiple models.
Data Flow Control formalizes data safety as aggregate predicates over provenance monomials and implements enforcement via the Passant query rewriting layer achieving near-zero overhead across five DBMS engines.
The paper introduces Consent Integrity as the property that actions shown for approval must be rendered by a trusted mediator from the real boundary action over an unspoofable path and bound to execution, with uninspectable actions surfaced rather than silently approved.
Proposes the Intelligent Computing Architecture (ICA) as a six-layer framework with dual probabilistic-deterministic planes and three Amdahl-style heuristics to unify design of LLM-based systems.
AIRGuard is a runtime authority-control layer for tool-using agents that reduces attack success on AgentTrap from 36.3% to 5.5% while retaining higher benign utility than ARGUS or MELON on DTAP-150.
Sandlock provides an unprivileged Linux process sandbox for AI agents by compiling static policies into kernel rules and delegating runtime decisions to a narrow supervisor.
Boiling the Frog is a new stateful multi-turn benchmark that finds an aggregate 44.4% strict attack success rate for incremental safety violations across nine AI models, with rates ranging from 20.5% to 92.9%.
PopPy combines an ahead-of-time compiler and runtime to extract parallelism from Python compound AI applications, delivering up to 6.4x end-to-end speedups while preserving sequential semantics.
Presents TRUST-Bench benchmark for hidden-trigger tool compromises in LLM agents and VISTA-Guard framework for trajectory-aware risk scoring of final actions under untrusted feedback.
IPI-proxy is a toolkit using an intercepting proxy to inject indirect prompt injection attacks into live web pages for testing AI browsing agents against hidden instructions.
PACT achieves perfect security and utility under oracle provenance by enforcing argument-level trust contracts based on semantic roles and cross-step provenance tracking, outperforming invocation-level monitors in AgentDojo evaluations.
A malicious relay can strategically rewrite aligned LLM outputs in BYOK agent architectures to achieve up to 99.1% attack success on benchmarks like AgentDojo and ASB.
AgenTEE isolates LLM agent runtime, inference, and apps in independently attested cVMs on Arm-based edge devices, achieving under 5.15% overhead versus commodity OS deployments.
citing papers explorer
-
Measuring Real-World Prompt Injection Attacks in LLM-based Resume Screening
Roughly 1% of real resumes contain hidden prompt injections against LLM screeners, prevalence has risen over 1-2 years, and over 90% avoid explicit instructions.
-
Certified Robustness under Heterogeneous Perturbations via Hybrid Randomized Smoothing
A hybrid randomized smoothing method yields a closed-form certificate for joint discrete-continuous perturbations that generalizes prior Gaussian and discrete smoothing approaches.
-
Ghost in the Agent: Redefining Information Flow Tracking for LLM Agents
NeuroTaint is the first taint tracking framework for LLM agents that uses offline auditing of semantic, causal, and persistent context to detect flows from untrusted sources to privileged sinks.
-
TRUSTDESC: Preventing Tool Poisoning in LLM Applications via Trusted Description Generation
TRUSTDESC prevents tool poisoning in LLM applications by automatically generating accurate tool descriptions from code via a three-stage pipeline of reachability analysis, description synthesis, and dynamic verification.
-
Cloak and Detonate: Scanner Evasion and Dynamic Detection of Agent Skill Malware
SkillCloak evades existing static scanners for agent skill malware at high rates, while SkillDetonate detects 97% of attacks at 2% false-positive rate using sandboxed runtime behavior analysis.
-
Whose Side Is Your Agent On? Multi-Party Principal Loyalty in LLM Agents
PrincipalBench exposes a sharp split in frontier LLMs between selective and over-refusing behavior on multi-party loyalty, with prompt scaffolding and KL distillation reducing harm rates but only along an existing leak/over-refusal trade-off.
-
Safe to Check, Unsafe to Use: Relinking at the Compression Boundary of LLM Agents
Relinking is a new compression-boundary attack on LLM agents where summarization of split benign fragments produces malicious instructions, shown via Relink tool at 86.9% success rate and mitigated by KBRA defense to 0%.
-
The Gate Is Only as Honest as Its Contracts: ContractGuard for the Contract Layer of Risk-Aware Causal Gating
ContractGuard verifies tool contracts in RACG systems to prevent effect forgery, restoring zero injection success on benchmarks and six hosted models against adaptive attackers.
-
Evaluating Prompting-Based Defenses Against Domain-Camouflaged Injection Attacks
Paraphrasing retrieved content is the most effective of five tested prompting defenses against domain-camouflaged injection attacks, cutting success rates 55-84% across three models while financial domains retain the highest residual risk.
-
AutoDojo: Adaptive Black-Box Attacks Reveal the Limits of IPI Defenses and Task-Specification Effects in LLM Agents
AutoDojo adaptively optimizes IPI attacks to bypass defenses, recovering substantial ASR on action-open tasks where static attacks fail.
-
OCELOT: Inference-Leakage Budgets for Privacy-Preserving LLM Agents
OCELOT recasts agent privacy as posterior-risk control and implements Witness-Verified Declassification to authorize the least-disclosing useful release under a sink-trust-weighted min-entropy budget.
-
Document-Authored Control-Signal Impersonation: A Low-Cost Indirect Prompt Attack on RAG Safety Boundaries
Introduces Document-Authored Control-Signal Impersonation (DACSI) as a low-cost indirect prompt attack on RAG safety boundaries and evaluates its effectiveness across multiple models.
-
Data Flow Control: Data Safety Policies for AI Agents
Data Flow Control formalizes data safety as aggregate predicates over provenance monomials and implements enforcement via the Passant query rewriting layer achieving near-zero overhead across five DBMS engines.
-
What You Approve Is What Executes: Consent Integrity for Black-Box LLM Agents
The paper introduces Consent Integrity as the property that actions shown for approval must be rendered by a trusted mediator from the real boundary action over an unspoofable path and bound to execution, with uninspectable actions surfaced rather than silently approved.
-
Model-Native Computing Architecture: Envisioning Future System Architecture Through the Lens of Computer Architecture
Proposes the Intelligent Computing Architecture (ICA) as a six-layer framework with dual probabilistic-deterministic planes and three Amdahl-style heuristics to unify design of LLM-based systems.
-
AIRGuard: Guarding Agent Actions with Runtime Authority Control
AIRGuard is a runtime authority-control layer for tool-using agents that reduces attack success on AgentTrap from 36.3% to 5.5% while retaining higher benign utility than ARGUS or MELON on DTAP-150.
-
Sandlock: Confining AI Agent Code with Unprivileged Linux Primitives
Sandlock provides an unprivileged Linux process sandbox for AI agents by compiling static policies into kernel rules and delegating runtime decisions to a narrow supervisor.
-
Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety
Boiling the Frog is a new stateful multi-turn benchmark that finds an aggregate 44.4% strict attack success rate for incremental safety violations across nine AI models, with rates ranging from 20.5% to 92.9%.
-
PopPy: Opportunistically Exploiting Parallelism in Python Compound AI Applications
PopPy combines an ahead-of-time compiler and runtime to extract parallelism from Python compound AI applications, delivering up to 6.4x end-to-end speedups while preserving sequential semantics.
-
Trust No Tool: Evaluating and Defending LLM Agents under Untrusted Tool Feedback
Presents TRUST-Bench benchmark for hidden-trigger tool compromises in LLM agents and VISTA-Guard framework for trajectory-aware risk scoring of final actions under untrusted feedback.
-
IPI-proxy: An Intercepting Proxy for Red-Teaming Web-Browsing AI Agents Against Indirect Prompt Injection
IPI-proxy is a toolkit using an intercepting proxy to inject indirect prompt injection attacks into live web pages for testing AI browsing agents against hidden instructions.
-
The Granularity Mismatch in Agent Security: Argument-Level Provenance Solves Enforcement and Isolates the LLM Reasoning Bottleneck
PACT achieves perfect security and utility under oracle provenance by enforcing argument-level trust contracts based on semantic roles and cross-step provenance tracking, outperforming invocation-level monitors in AgentDojo evaluations.
-
When Alignment Isn't Enough: Response-Path Attacks on LLM Agents
A malicious relay can strategically rewrite aligned LLM outputs in BYOK agent architectures to achieve up to 99.1% attack success on benchmarks like AgentDojo and ASB.
-
AgenTEE: Confidential LLM Agent Execution on Edge Devices
AgenTEE isolates LLM agent runtime, inference, and apps in independently attested cVMs on Arm-based edge devices, achieving under 5.15% overhead versus commodity OS deployments.
-
LogAct: Enabling Agentic Reliability via Shared Logs
LogAct is a shared-log abstraction for LLM agents that makes actions visible before execution, allows decoupled stopping, enables consistent recovery, and supports LLM-driven introspection for reliability.
-
Causality Laundering: Denial-Feedback Leakage in Tool-Calling LLM Agents
The paper defines causality laundering as an attack leaking information from denial outcomes in LLM tool calls and proposes the Agentic Reference Monitor to block it using denial-aware provenance graphs.
-
KAIJU: An Executive Kernel for Intent-Gated Execution of LLM Agents
KAIJU decouples LLM reasoning from execution using a specialized kernel and Intent-Gated Execution to enable parallel tool scheduling and robust security.
-
Formal Policy Enforcement for Real-World Agentic Systems
FORGE enforces security policies in agentic systems via Datalog over abstract predicates with an observability service and reference monitor that guarantees policy semantics when the environment contract holds.
-
AgentDyn: Are Your Agent Security Defenses Deployable in Real-World Dynamic Environments?
AgentDyn benchmark demonstrates that current AI agent defenses against prompt injection fail to handle dynamic real-world conditions.
-
Prompt Injection Attack to Tool Selection in LLM Agents
ToolHijacker optimizes malicious tool documents via a two-phase strategy to hijack LLM agents' tool selection in no-box settings.
-
ActPlane: Programmable OS-Level Policy Enforcement for Agent Harnesses
ActPlane introduces an OS-kernel policy engine using an information-flow control DSL and eBPF to enforce agent harness policies, achieving better compliance on indirect paths with 1.9-8.4% overhead.
-
Detecting Malicious Agent Skills in the Wild using Attention
Locate-and-Judge uses attention-based span scoring followed by targeted LLM judgment to detect malicious third-party skills for LLM agents, achieving order-of-magnitude cost savings and surfacing live threats in marketplaces.
-
GIF: Locally Sound Geometric Information Flow Control for LLMs
GIF introduces a Jacobian-based upper bound on input-output mutual information in LLMs with formal Lean proof and strong empirical recall on injection and leakage benchmarks.
-
Governance Decay: How Context Compaction Silently Erases Safety Constraints in Long-Horizon LLM Agents
Context compaction silently drops governance constraints in LLM agents, raising policy violation rates from 0% to 30% on average, with a proposed pinning mitigation restoring compliance.
-
A Five-Plane Reference Architecture for Runtime Governance of Production AI Agents
The paper defines a five-plane reference architecture for runtime governance of production AI agents that enforces policies on delegated actions via reasoning and enforcement planes, six interruption primitives, four correctness invariants, and a reference implementation showing microsecond adjudica
-
Game-Theoretic Multi-Agent Control for Robust Contextual Reasoning in LLMs
GT-MCP coordinates three LLM agents via a trust function and rollback to bound contextual drift and block adversarial injections in multi-turn interactions.
-
Where Instruction Hierarchy Breaks: Diagnosing and Repairing Failures in Reasoning Language Models
A diagnostic framework localizes instruction hierarchy failures in LLMs into identification, resolution, and realization, while self-monitors reduce non-compliance by 81-99%.
-
When Safe Skills Collide: Measuring Compositional Risk in Agent Skill Ecosystems
About 18.2% of structurally flagged skill pairs represent genuine compositional safety risks in agent skill registries, with exploitation gated by host model behavior.
-
From Prompt Injection to Persistent Control: Defending Agentic Harness Against Trojan Backdoors
Introduces ClawTrojan benchmark achieving 95.5% ASR for multi-step trojan attacks in agentic harnesses and DASGuard defense that sanitizes control content from untrusted sources.
-
Aligning Provenance with Authorization: A Dual-Graph Defense for LLM Agents
AuthGraph aligns an execution provenance graph with a clean authorization graph to detect parameter-source deviations from user intent, reducing attack success rates to 1-2% on AgentDojo and AgentDyn while retaining most task utility.
-
Securing LLM Agents Need Intent-to-Execution Integrity
The paper defines intent-to-execution integrity as the conjunction of Tool Integrity, Instruction Integrity, Judgment Integrity, and Data Flow Integrity, arguing that existing LLM agent defenses provide only partial coverage of these properties.
-
Web Agents Should Adopt the Plan-Then-Execute Paradigm
Web agents should default to planning a complete task program before observing live web content to reduce prompt injection exposure, since WebArena tasks are compatible and 80% need no runtime LLM calls.
-
Sleeper Channels and Provenance Gates: Persistent Prompt Injection in Always-on Autonomous AI Agents
Sleeper channels enable persistent prompt injection in always-on AI agents via persistence substrate and firing separation, countered by provenance gates using action digests and owner attestations with a soundness theorem.
-
Language-Based Agent Control
LBAC is a new programming model that enforces user-specified policies on agentic applications by requiring agent-generated programs to be well-typed in the context of the scaffolding code.
-
AgentShield: Deception-based Compromise Detection for Tool-using LLM Agents
AgentShield uses layered deception traps in LLM agent tool interfaces to detect indirect prompt injection compromises with 90.7-100% success on commercial models, zero false positives, and cross-lingual transfer without retraining.
-
When Child Inherits: Modeling and Exploiting Subagent Spawn in Multi-Agent Networks
Multi-agent LLM frameworks can spread compromises across agent boundaries via insecure memory inheritance during subagent spawning.
-
ARGUS: Defending LLM Agents Against Context-Aware Prompt Injection
ARGUS defends LLM agents from context-aware prompt injections by tracking information provenance and verifying decisions against trustworthy evidence, reducing attack success to 3.8% while retaining 87.5% task utility.
-
Pact: A Choreographic Language for Agentic Ecosystems
Pact is a choreographic language extended with game-theoretic operations that maps every protocol to a formal game for reasoning about agent decisions and solving for decision policies.
-
Trojan Hippo: Weaponizing Agent Memory for Data Exfiltration
The paper defines and evaluates Trojan Hippo attacks on LLM agent memory, showing 85-100% success in data exfiltration across backends and reduced rates with defenses at varying utility costs.
-
Safeguarding LLM Agents from Misalignment through Provenance Analysis
ProvenanceGuard applies a provenance-based framework to detect three types of misalignment in LLM agent tool calls, cutting error rates on misaligned traces from 42.9% to 1.8% on one benchmark while lowering unnecessary interventions.