hub Mixed citations

Defeating Prompt Injections by Design

Edoardo Debenedetti, Ilia Shumailov, Tianqi Fan, Jamie Hayes, Nicholas Carlini, Daniel Fabian · 2025 · cs.CR · arXiv 2503.18813

Mixed citation behavior. Most common role is background (60%).

80 Pith papers citing it

Background 60% of classified citations

open full Pith review browse 80 citing papers arXiv PDF

abstract

Large Language Models (LLMs) are increasingly deployed in agentic systems that interact with an untrusted environment. However, LLM agents are vulnerable to prompt injection attacks when handling untrusted data. In this paper we propose CaMeL, a robust defense that creates a protective system layer around the LLM, securing it even when underlying models are susceptible to attacks. To operate, CaMeL explicitly extracts the control and data flows from the (trusted) query; therefore, the untrusted data retrieved by the LLM can never impact the program flow. To further improve security, CaMeL uses a notion of a capability to prevent the exfiltration of private data over unauthorized data flows by enforcing security policies when tools are called. We demonstrate effectiveness of CaMeL by solving $77\%$ of tasks with provable security (compared to $84\%$ with an undefended system) in AgentDojo. We release CaMeL at https://github.com/google-research/camel-prompt-injection.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 10 baseline 2 method 2 dataset 1

citation-polarity summary

background 9 baseline 2 use method 2 support 1 use dataset 1

representative citing papers

Measuring Real-World Prompt Injection Attacks in LLM-based Resume Screening

cs.CR · 2026-05-27 · unverdicted · novelty 8.0

Roughly 1% of real resumes contain hidden prompt injections against LLM screeners, prevalence has risen over 1-2 years, and over 90% avoid explicit instructions.

Certified Robustness under Heterogeneous Perturbations via Hybrid Randomized Smoothing

cs.LG · 2026-05-13 · unverdicted · novelty 8.0

A hybrid randomized smoothing method yields a closed-form certificate for joint discrete-continuous perturbations that generalizes prior Gaussian and discrete smoothing approaches.

Ghost in the Agent: Redefining Information Flow Tracking for LLM Agents

cs.CR · 2026-04-25 · unverdicted · novelty 8.0

NeuroTaint is the first taint tracking framework for LLM agents that uses offline auditing of semantic, causal, and persistent context to detect flows from untrusted sources to privileged sinks.

TRUSTDESC: Preventing Tool Poisoning in LLM Applications via Trusted Description Generation

cs.CR · 2026-04-08 · unverdicted · novelty 8.0

TRUSTDESC prevents tool poisoning in LLM applications by automatically generating accurate tool descriptions from code via a three-stage pipeline of reachability analysis, description synthesis, and dynamic verification.

Whose Side Is Your Agent On? Multi-Party Principal Loyalty in LLM Agents

cs.AI · 2026-06-29 · unverdicted · novelty 7.0

PrincipalBench exposes a sharp split in frontier LLMs between selective and over-refusing behavior on multi-party loyalty, with prompt scaffolding and KL distillation reducing harm rates but only along an existing leak/over-refusal trade-off.

Securing LLM-Agent Long-Term Memory Against Poisoning: Non-Malleable, Origin-Bound Authority with Machine-Checked Guarantees

cs.CR · 2026-06-23 · unverdicted · novelty 7.0

Presents TMA-NM, a non-malleable origin-bound authority system for LLM-agent memory with TLA+ machine-checked separation theorems and benchmarks showing 0% attack success against direct and laundering poisoning while preserving utility.

Safe to Check, Unsafe to Use: Relinking at the Compression Boundary of LLM Agents

cs.CR · 2026-06-19 · unverdicted · novelty 7.0

Relinking is a new compression-boundary attack on LLM agents where summarization of split benign fragments produces malicious instructions, shown via Relink tool at 86.9% success rate and mitigated by KBRA defense to 0%.

The Gate Is Only as Honest as Its Contracts: ContractGuard for the Contract Layer of Risk-Aware Causal Gating

cs.CR · 2026-06-17 · unverdicted · novelty 7.0

ContractGuard verifies tool contracts in RACG systems to prevent effect forgery, restoring zero injection success on benchmarks and six hosted models against adaptive attackers.

Evaluating Prompting-Based Defenses Against Domain-Camouflaged Injection Attacks

cs.CR · 2026-06-16 · unverdicted · novelty 7.0

Paraphrasing retrieved content is the most effective of five tested prompting defenses against domain-camouflaged injection attacks, cutting success rates 55-84% across three models while financial domains retain the highest residual risk.

AutoDojo: Adaptive Black-Box Attacks Reveal the Limits of IPI Defenses and Task-Specification Effects in LLM Agents

cs.CR · 2026-06-13 · unverdicted · novelty 7.0

AutoDojo adaptively optimizes IPI attacks to bypass defenses, recovering substantial ASR on action-open tasks where static attacks fail.

OCELOT: Inference-Leakage Budgets for Privacy-Preserving LLM Agents

cs.CR · 2026-06-10 · unverdicted · novelty 7.0

OCELOT recasts agent privacy as posterior-risk control and implements Witness-Verified Declassification to authorize the least-disclosing useful release under a sink-trust-weighted min-entropy budget.

Document-Authored Control-Signal Impersonation: A Low-Cost Indirect Prompt Attack on RAG Safety Boundaries

cs.CR · 2026-06-08 · unverdicted · novelty 7.0

Introduces Document-Authored Control-Signal Impersonation (DACSI) as a low-cost indirect prompt attack on RAG safety boundaries and evaluates its effectiveness across multiple models.

Data Flow Control: Data Safety Policies for AI Agents

cs.DB · 2026-06-04 · unverdicted · novelty 7.0

Data Flow Control formalizes data safety as aggregate predicates over provenance monomials and implements enforcement via the Passant query rewriting layer achieving near-zero overhead across five DBMS engines.

What You Approve Is What Executes: Consent Integrity for Black-Box LLM Agents

cs.CR · 2026-06-01 · unverdicted · novelty 7.0

The paper introduces Consent Integrity as the property that actions shown for approval must be rendered by a trusted mediator from the real boundary action over an unspoofable path and bound to execution, with uninspectable actions surfaced rather than silently approved.

Model-Native Computing Architecture: Envisioning Future System Architecture Through the Lens of Computer Architecture

cs.AI · 2026-05-29 · unverdicted · novelty 7.0

Proposes the Intelligent Computing Architecture (ICA) as a six-layer framework with dual probabilistic-deterministic planes and three Amdahl-style heuristics to unify design of LLM-based systems.

AIRGuard: Guarding Agent Actions with Runtime Authority Control

cs.CR · 2026-05-27 · unverdicted · novelty 7.0

AIRGuard is a runtime authority-control layer for tool-using agents that reduces attack success on AgentTrap from 36.3% to 5.5% while retaining higher benign utility than ARGUS or MELON on DTAP-150.

Sandlock: Confining AI Agent Code with Unprivileged Linux Primitives

cs.CR · 2026-05-25 · unverdicted · novelty 7.0

Sandlock provides an unprivileged Linux process sandbox for AI agents by compiling static policies into kernel rules and delegating runtime decisions to a narrow supervisor.

Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety

cs.CL · 2026-05-21 · unverdicted · novelty 7.0 · 2 refs

Boiling the Frog is a new stateful multi-turn benchmark that finds an aggregate 44.4% strict attack success rate for incremental safety violations across nine AI models, with rates ranging from 20.5% to 92.9%.

PopPy: Opportunistically Exploiting Parallelism in Python Compound AI Applications

cs.DC · 2026-05-18 · unverdicted · novelty 7.0

PopPy combines an ahead-of-time compiler and runtime to extract parallelism from Python compound AI applications, delivering up to 6.4x end-to-end speedups while preserving sequential semantics.

Trust No Tool: Evaluating and Defending LLM Agents under Untrusted Tool Feedback

cs.CR · 2026-05-17 · unverdicted · novelty 7.0

Presents TRUST-Bench benchmark for hidden-trigger tool compromises in LLM agents and VISTA-Guard framework for trajectory-aware risk scoring of final actions under untrusted feedback.

IPI-proxy: An Intercepting Proxy for Red-Teaming Web-Browsing AI Agents Against Indirect Prompt Injection

cs.CR · 2026-05-12 · unverdicted · novelty 7.0

IPI-proxy is a toolkit using an intercepting proxy to inject indirect prompt injection attacks into live web pages for testing AI browsing agents against hidden instructions.

The Granularity Mismatch in Agent Security: Argument-Level Provenance Solves Enforcement and Isolates the LLM Reasoning Bottleneck

cs.CR · 2026-05-11 · unverdicted · novelty 7.0

PACT achieves perfect security and utility under oracle provenance by enforcing argument-level trust contracts based on semantic roles and cross-step provenance tracking, outperforming invocation-level monitors in AgentDojo evaluations.

When Alignment Isn't Enough: Response-Path Attacks on LLM Agents

cs.CR · 2026-05-04 · unverdicted · novelty 7.0

A malicious relay can strategically rewrite aligned LLM outputs in BYOK agent architectures to achieve up to 99.1% attack success on benchmarks like AgentDojo and ASB.

AgenTEE: Confidential LLM Agent Execution on Edge Devices

cs.CR · 2026-04-20 · unverdicted · novelty 7.0

AgenTEE isolates LLM agent runtime, inference, and apps in independently attested cVMs on Arm-based edge devices, achieving under 5.15% overhead versus commodity OS deployments.

citing papers explorer

Showing 50 of 80 citing papers.

Measuring Real-World Prompt Injection Attacks in LLM-based Resume Screening cs.CR · 2026-05-27 · unverdicted · none · ref 7 · internal anchor
Roughly 1% of real resumes contain hidden prompt injections against LLM screeners, prevalence has risen over 1-2 years, and over 90% avoid explicit instructions.
Certified Robustness under Heterogeneous Perturbations via Hybrid Randomized Smoothing cs.LG · 2026-05-13 · unverdicted · none · ref 4 · internal anchor
A hybrid randomized smoothing method yields a closed-form certificate for joint discrete-continuous perturbations that generalizes prior Gaussian and discrete smoothing approaches.
Ghost in the Agent: Redefining Information Flow Tracking for LLM Agents cs.CR · 2026-04-25 · unverdicted · none · ref 14 · internal anchor
NeuroTaint is the first taint tracking framework for LLM agents that uses offline auditing of semantic, causal, and persistent context to detect flows from untrusted sources to privileged sinks.
TRUSTDESC: Preventing Tool Poisoning in LLM Applications via Trusted Description Generation cs.CR · 2026-04-08 · unverdicted · none · ref 31 · internal anchor
TRUSTDESC prevents tool poisoning in LLM applications by automatically generating accurate tool descriptions from code via a three-stage pipeline of reachability analysis, description synthesis, and dynamic verification.
Whose Side Is Your Agent On? Multi-Party Principal Loyalty in LLM Agents cs.AI · 2026-06-29 · unverdicted · none · ref 12 · internal anchor
PrincipalBench exposes a sharp split in frontier LLMs between selective and over-refusing behavior on multi-party loyalty, with prompt scaffolding and KL distillation reducing harm rates but only along an existing leak/over-refusal trade-off.
Securing LLM-Agent Long-Term Memory Against Poisoning: Non-Malleable, Origin-Bound Authority with Machine-Checked Guarantees cs.CR · 2026-06-23 · unverdicted · partial · ref 26 · internal anchor
Presents TMA-NM, a non-malleable origin-bound authority system for LLM-agent memory with TLA+ machine-checked separation theorems and benchmarks showing 0% attack success against direct and laundering poisoning while preserving utility.
Safe to Check, Unsafe to Use: Relinking at the Compression Boundary of LLM Agents cs.CR · 2026-06-19 · unverdicted · none · ref 14 · internal anchor
Relinking is a new compression-boundary attack on LLM agents where summarization of split benign fragments produces malicious instructions, shown via Relink tool at 86.9% success rate and mitigated by KBRA defense to 0%.
The Gate Is Only as Honest as Its Contracts: ContractGuard for the Contract Layer of Risk-Aware Causal Gating cs.CR · 2026-06-17 · unverdicted · none · ref 22 · internal anchor
ContractGuard verifies tool contracts in RACG systems to prevent effect forgery, restoring zero injection success on benchmarks and six hosted models against adaptive attackers.
Evaluating Prompting-Based Defenses Against Domain-Camouflaged Injection Attacks cs.CR · 2026-06-16 · unverdicted · none · ref 2 · internal anchor
Paraphrasing retrieved content is the most effective of five tested prompting defenses against domain-camouflaged injection attacks, cutting success rates 55-84% across three models while financial domains retain the highest residual risk.
AutoDojo: Adaptive Black-Box Attacks Reveal the Limits of IPI Defenses and Task-Specification Effects in LLM Agents cs.CR · 2026-06-13 · unverdicted · none · ref 29 · internal anchor
AutoDojo adaptively optimizes IPI attacks to bypass defenses, recovering substantial ASR on action-open tasks where static attacks fail.
OCELOT: Inference-Leakage Budgets for Privacy-Preserving LLM Agents cs.CR · 2026-06-10 · unverdicted · none · ref 14 · internal anchor
OCELOT recasts agent privacy as posterior-risk control and implements Witness-Verified Declassification to authorize the least-disclosing useful release under a sink-trust-weighted min-entropy budget.
Document-Authored Control-Signal Impersonation: A Low-Cost Indirect Prompt Attack on RAG Safety Boundaries cs.CR · 2026-06-08 · unverdicted · none · ref 13 · internal anchor
Introduces Document-Authored Control-Signal Impersonation (DACSI) as a low-cost indirect prompt attack on RAG safety boundaries and evaluates its effectiveness across multiple models.
Data Flow Control: Data Safety Policies for AI Agents cs.DB · 2026-06-04 · unverdicted · none · ref 17 · internal anchor
Data Flow Control formalizes data safety as aggregate predicates over provenance monomials and implements enforcement via the Passant query rewriting layer achieving near-zero overhead across five DBMS engines.
What You Approve Is What Executes: Consent Integrity for Black-Box LLM Agents cs.CR · 2026-06-01 · unverdicted · none · ref 13 · internal anchor
The paper introduces Consent Integrity as the property that actions shown for approval must be rendered by a trusted mediator from the real boundary action over an unspoofable path and bound to execution, with uninspectable actions surfaced rather than silently approved.
Model-Native Computing Architecture: Envisioning Future System Architecture Through the Lens of Computer Architecture cs.AI · 2026-05-29 · unverdicted · none · ref 34 · internal anchor
Proposes the Intelligent Computing Architecture (ICA) as a six-layer framework with dual probabilistic-deterministic planes and three Amdahl-style heuristics to unify design of LLM-based systems.
AIRGuard: Guarding Agent Actions with Runtime Authority Control cs.CR · 2026-05-27 · unverdicted · none · ref 8 · internal anchor
AIRGuard is a runtime authority-control layer for tool-using agents that reduces attack success on AgentTrap from 36.3% to 5.5% while retaining higher benign utility than ARGUS or MELON on DTAP-150.
Sandlock: Confining AI Agent Code with Unprivileged Linux Primitives cs.CR · 2026-05-25 · unverdicted · none · ref 7 · internal anchor
Sandlock provides an unprivileged Linux process sandbox for AI agents by compiling static policies into kernel rules and delegating runtime decisions to a narrow supervisor.
Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety cs.CL · 2026-05-21 · unverdicted · none · ref 22 · 2 links · internal anchor
Boiling the Frog is a new stateful multi-turn benchmark that finds an aggregate 44.4% strict attack success rate for incremental safety violations across nine AI models, with rates ranging from 20.5% to 92.9%.
PopPy: Opportunistically Exploiting Parallelism in Python Compound AI Applications cs.DC · 2026-05-18 · unverdicted · none · ref 21 · internal anchor
PopPy combines an ahead-of-time compiler and runtime to extract parallelism from Python compound AI applications, delivering up to 6.4x end-to-end speedups while preserving sequential semantics.
Trust No Tool: Evaluating and Defending LLM Agents under Untrusted Tool Feedback cs.CR · 2026-05-17 · unverdicted · none · ref 31 · internal anchor
Presents TRUST-Bench benchmark for hidden-trigger tool compromises in LLM agents and VISTA-Guard framework for trajectory-aware risk scoring of final actions under untrusted feedback.
IPI-proxy: An Intercepting Proxy for Red-Teaming Web-Browsing AI Agents Against Indirect Prompt Injection cs.CR · 2026-05-12 · unverdicted · none · ref 18 · internal anchor
IPI-proxy is a toolkit using an intercepting proxy to inject indirect prompt injection attacks into live web pages for testing AI browsing agents against hidden instructions.
The Granularity Mismatch in Agent Security: Argument-Level Provenance Solves Enforcement and Isolates the LLM Reasoning Bottleneck cs.CR · 2026-05-11 · unverdicted · none · ref 4 · internal anchor
PACT achieves perfect security and utility under oracle provenance by enforcing argument-level trust contracts based on semantic roles and cross-step provenance tracking, outperforming invocation-level monitors in AgentDojo evaluations.
When Alignment Isn't Enough: Response-Path Attacks on LLM Agents cs.CR · 2026-05-04 · unverdicted · none · ref 75 · internal anchor
A malicious relay can strategically rewrite aligned LLM outputs in BYOK agent architectures to achieve up to 99.1% attack success on benchmarks like AgentDojo and ASB.
AgenTEE: Confidential LLM Agent Execution on Edge Devices cs.CR · 2026-04-20 · unverdicted · none · ref 15 · internal anchor
AgenTEE isolates LLM agent runtime, inference, and apps in independently attested cVMs on Arm-based edge devices, achieving under 5.15% overhead versus commodity OS deployments.
LogAct: Enabling Agentic Reliability via Shared Logs cs.DC · 2026-04-09 · unverdicted · none · ref 13 · internal anchor
LogAct is a shared-log abstraction for LLM agents that makes actions visible before execution, allows decoupled stopping, enables consistent recovery, and supports LLM-driven introspection for reliability.
Causality Laundering: Denial-Feedback Leakage in Tool-Calling LLM Agents cs.CR · 2026-04-05 · unverdicted · none · ref 8 · internal anchor
The paper defines causality laundering as an attack leaking information from denial outcomes in LLM tool calls and proposes the Agentic Reference Monitor to block it using denial-aware provenance graphs.
KAIJU: An Executive Kernel for Intent-Gated Execution of LLM Agents cs.SE · 2026-03-31 · accept · none · ref 5 · internal anchor
KAIJU decouples LLM reasoning from execution using a specialized kernel and Intent-Gated Execution to enable parallel tool scheduling and robust security.
Formal Policy Enforcement for Real-World Agentic Systems cs.CR · 2026-02-18 · unverdicted · none · ref 21 · internal anchor
FORGE enforces security policies in agentic systems via Datalog over abstract predicates with an observability service and reference monitor that guarantees policy semantics when the environment contract holds.
AgentDyn: Are Your Agent Security Defenses Deployable in Real-World Dynamic Environments? cs.CR · 2026-02-03 · accept · none · ref 2 · internal anchor
AgentDyn benchmark demonstrates that current AI agent defenses against prompt injection fail to handle dynamic real-world conditions.
Prompt Injection Attack to Tool Selection in LLM Agents cs.CR · 2025-04-28 · conditional · none · ref 72 · internal anchor
ToolHijacker optimizes malicious tool documents via a two-phase strategy to hijack LLM agents' tool selection in no-box settings.
ActPlane: Programmable OS-Level Policy Enforcement for Agent Harnesses cs.OS · 2026-06-23 · unverdicted · none · ref 14 · 2 links · internal anchor
ActPlane introduces an OS-kernel policy engine using an information-flow control DSL and eBPF to enforce agent harness policies, achieving better compliance on indirect paths with 1.9-8.4% overhead.
Detecting Malicious Agent Skills in the Wild using Attention cs.CR · 2026-06-22 · unverdicted · none · ref 5 · internal anchor
Locate-and-Judge uses attention-based span scoring followed by targeted LLM judgment to detect malicious third-party skills for LLM agents, achieving order-of-magnitude cost savings and surfacing live threats in marketplaces.
GIF: Locally Sound Geometric Information Flow Control for LLMs cs.AI · 2026-06-22 · unverdicted · none · ref 12 · internal anchor
GIF introduces a Jacobian-based upper bound on input-output mutual information in LLMs with formal Lean proof and strong empirical recall on injection and leakage benchmarks.
Governance Decay: How Context Compaction Silently Erases Safety Constraints in Long-Horizon LLM Agents cs.AI · 2026-06-21 · unverdicted · none · ref 7 · internal anchor
Context compaction silently drops governance constraints in LLM agents, raising policy violation rates from 0% to 30% on average, with a proposed pinning mitigation restoring compliance.
A Five-Plane Reference Architecture for Runtime Governance of Production AI Agents cs.AI · 2026-06-10 · unverdicted · none · ref 13 · internal anchor
The paper defines a five-plane reference architecture for runtime governance of production AI agents that enforces policies on delegated actions via reasoning and enforcement planes, six interruption primitives, four correctness invariants, and a reference implementation showing microsecond adjudica
Game-Theoretic Multi-Agent Control for Robust Contextual Reasoning in LLMs cs.CR · 2026-06-09 · unverdicted · none · ref 20 · internal anchor
GT-MCP coordinates three LLM agents via a trust function and rollback to bound contextual drift and block adversarial injections in multi-turn interactions.
Where Instruction Hierarchy Breaks: Diagnosing and Repairing Failures in Reasoning Language Models cs.AI · 2026-06-05 · unverdicted · none · ref 2 · internal anchor
A diagnostic framework localizes instruction hierarchy failures in LLMs into identification, resolution, and realization, while self-monitors reduce non-compliance by 81-99%.
When Safe Skills Collide: Measuring Compositional Risk in Agent Skill Ecosystems cs.SE · 2026-05-30 · unverdicted · none · ref 27 · internal anchor
About 18.2% of structurally flagged skill pairs represent genuine compositional safety risks in agent skill registries, with exploitation gated by host model behavior.
From Prompt Injection to Persistent Control: Defending Agentic Harness Against Trojan Backdoors cs.CR · 2026-05-29 · unverdicted · none · ref 6 · internal anchor
Introduces ClawTrojan benchmark achieving 95.5% ASR for multi-step trojan attacks in agentic harnesses and DASGuard defense that sanitizes control content from untrusted sources.
Aligning Provenance with Authorization: A Dual-Graph Defense for LLM Agents cs.CR · 2026-05-26 · unverdicted · none · ref 5 · internal anchor
AuthGraph aligns an execution provenance graph with a clean authorization graph to detect parameter-source deviations from user intent, reducing attack success rates to 1-2% on AgentDojo and AgentDyn while retaining most task utility.
Securing LLM Agents Need Intent-to-Execution Integrity cs.CR · 2026-05-16 · conditional · none · ref 19 · internal anchor
The paper defines intent-to-execution integrity as the conjunction of Tool Integrity, Instruction Integrity, Judgment Integrity, and Data Flow Integrity, arguing that existing LLM agent defenses provide only partial coverage of these properties.
Web Agents Should Adopt the Plan-Then-Execute Paradigm cs.CR · 2026-05-14 · unverdicted · none · ref 7 · internal anchor
Web agents should default to planning a complete task program before observing live web content to reduce prompt injection exposure, since WebArena tasks are compatible and 80% need no runtime LLM calls.
Sleeper Channels and Provenance Gates: Persistent Prompt Injection in Always-on Autonomous AI Agents cs.CR · 2026-05-13 · conditional · none · ref 29 · internal anchor
Sleeper channels enable persistent prompt injection in always-on AI agents via persistence substrate and firing separation, countered by provenance gates using action digests and owner attestations with a soundness theorem.
Language-Based Agent Control cs.PL · 2026-05-13 · unverdicted · none · ref 8 · internal anchor
LBAC is a new programming model that enforces user-specified policies on agentic applications by requiring agent-generated programs to be well-typed in the context of the scaffolding code.
AgentShield: Deception-based Compromise Detection for Tool-using LLM Agents cs.CR · 2026-05-10 · unverdicted · none · ref 15 · internal anchor
AgentShield uses layered deception traps in LLM agent tool interfaces to detect indirect prompt injection compromises with 90.7-100% success on commercial models, zero false positives, and cross-lingual transfer without retraining.
When Child Inherits: Modeling and Exploiting Subagent Spawn in Multi-Agent Networks cs.CR · 2026-05-08 · unverdicted · none · ref 55 · internal anchor
Multi-agent LLM frameworks can spread compromises across agent boundaries via insecure memory inheritance during subagent spawning.
ARGUS: Defending LLM Agents Against Context-Aware Prompt Injection cs.CR · 2026-05-05 · unverdicted · none · ref 146 · internal anchor
ARGUS defends LLM agents from context-aware prompt injections by tracking information provenance and verifying decisions against trustworthy evidence, reducing attack success to 3.8% while retaining 87.5% task utility.
Pact: A Choreographic Language for Agentic Ecosystems cs.PL · 2026-05-04 · unverdicted · none · ref 10 · internal anchor
Pact is a choreographic language extended with game-theoretic operations that maps every protocol to a formal game for reasoning about agent decisions and solving for decision policies.
Trojan Hippo: Weaponizing Agent Memory for Data Exfiltration cs.CR · 2026-05-03 · unverdicted · none · ref 16 · 2 links · internal anchor
The paper defines and evaluates Trojan Hippo attacks on LLM agent memory, showing 85-100% success in data exfiltration across backends and reduced rates with defenses at varying utility costs.
Safeguarding LLM Agents from Misalignment through Provenance Analysis cs.CL · 2026-05-01 · unverdicted · none · ref 10 · internal anchor
ProvenanceGuard applies a provenance-based framework to detect three types of misalignment in LLM agent tool calls, cutting error rates on misaligned traces from 42.9% to 1.8% on one benchmark while lowering unnecessary interventions.

Defeating Prompt Injections by Design

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer