hub

Benchmarking and defending against indirect prompt injection attacks on large language models

Yi, J · 2023 · arXiv 2312.14197

24 Pith papers cite this work. Polarity classification is still indexing.

24 Pith papers citing it

read on arXiv browse 24 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 2 baseline 1 dataset 1

citation-polarity summary

background 2 baseline 1 use dataset 1

representative citing papers

Ghost in the Agent: Redefining Information Flow Tracking for LLM Agents

cs.CR · 2026-04-25 · unverdicted · novelty 8.0

NeuroTaint is the first taint tracking framework for LLM agents that uses offline auditing of semantic, causal, and persistent context to detect flows from untrusted sources to privileged sinks.

AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents

cs.CR · 2024-06-19 · unverdicted · novelty 8.0

AgentDojo introduces an extensible evaluation framework populated with realistic agent tasks and security test cases to measure prompt injection robustness in tool-using LLM agents.

Discourse-Role Labels as Presentation-Time Variables for Context Use in Language Models

cs.CL · 2026-06-02 · unverdicted · novelty 7.0

Discourse-role labels on identical misleading context cause 56-84 percentage point shifts in LLMs adopting the injected wrong answer.

Gate AI: LLM Security Benchmark Evaluation Methodology and Results

cs.LG · 2026-06-01 · unverdicted · novelty 7.0

Introduces a cross-validation-based evaluation methodology for LLM security detectors using a global threshold and group-fold leakage checks to avoid per-dataset tuning.

IPI-proxy: An Intercepting Proxy for Red-Teaming Web-Browsing AI Agents Against Indirect Prompt Injection

cs.CR · 2026-05-12 · unverdicted · novelty 7.0

IPI-proxy is a toolkit using an intercepting proxy to inject indirect prompt injection attacks into live web pages for testing AI browsing agents against hidden instructions.

Perturbation Dose Responses in Recursive LLM Loops: Raw Switching, Stochastic Floors, and Persistent Escape under Append, Replace, and Dialog Updates

cs.AI · 2026-05-04 · unverdicted · novelty 7.0

In 30-step recursive LLM loops, append-mode persistent escape from source basins reaches 50% near 400 tokens under full history but plateaus below 50% under tail-clip memory policy, while replace-mode switching largely reflects state reset.

Many-Tier Instruction Hierarchy in LLM Agents

cs.CL · 2026-04-10 · unverdicted · novelty 7.0

ManyIH and ManyIH-Bench address instruction conflicts in LLM agents with up to 12 privilege levels across 853 tasks, revealing frontier models achieve only ~40% accuracy.

Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents

cs.CR · 2024-10-03 · unverdicted · novelty 7.0

ASB is a new benchmark that tests 10 prompt injection attacks, memory poisoning, a novel Plan-of-Thought backdoor attack, and 11 defenses on LLM agents across 13 models, finding attack success rates up to 84.3% and limited defense effectiveness.

LLM Agents can Autonomously Exploit One-day Vulnerabilities

cs.CR · 2024-04-11 · unverdicted · novelty 7.0

GPT-4 LLM agents autonomously exploit 87% of tested one-day vulnerabilities when given CVE descriptions, far outperforming other models and tools.

Security--Fidelity Tradeoffs: The Hidden Cost of Prompt Injection Defense

cs.CR · 2026-06-29 · unverdicted · novelty 6.0

Prompt injection defenses create a security-fidelity tradeoff with no model or defense achieving both high security and high fidelity on the SecFid benchmark across 1,168 examples.

Caught in the Act(ivation): Toward Pre-Output and Multi-Turn Detection of Credential Exfiltration by LLM Agents

cs.CR · 2026-06-02 · unverdicted · novelty 6.0

Activation probes, calibrated honeytokens, and multi-turn leakage accounting detect credential exfiltration attempts in LLM agents with high accuracy in controlled open-model tests.

From Prompt Injection to Persistent Control: Defending Agentic Harness Against Trojan Backdoors

cs.CR · 2026-05-29 · unverdicted · novelty 6.0

Introduces ClawTrojan benchmark achieving 95.5% ASR for multi-step trojan attacks in agentic harnesses and DASGuard defense that sanitizes control content from untrusted sources.

Hallucination as Exploit: Evidence-Carrying Multimodal Agents

cs.AI · 2026-05-18 · unverdicted · novelty 6.0 · 2 refs

Evidence-carrying multimodal agents decompose tool calls into predicates, obtain certificates from DOM/OCR/AX verifiers, and use a deterministic gate to authorize actions only when certificates support them, achieving zero unsafe executions in tested tasks.

An Empirical Study of Privacy Leakage Chains via Prompt Injection in Black-Box Chatbot Environments

cs.CR · 2026-05-18 · unverdicted · novelty 6.0

Empirical demonstration that prompt injection combined with web-tool use creates a feasible privacy-leakage chain in deployed black-box chatbot agents.

Web Agents Should Adopt the Plan-Then-Execute Paradigm

cs.CR · 2026-05-14 · unverdicted · novelty 6.0

Web agents should default to planning a complete task program before observing live web content to reduce prompt injection exposure, since WebArena tasks are compatible and 80% need no runtime LLM calls.

An AI Agent Execution Environment to Safeguard User Data

cs.CR · 2026-04-21 · unverdicted · novelty 6.0

GAAP guarantees confidentiality of private user data for AI agents by enforcing user-specified permissions deterministically through persistent information flow tracking, without trusting the agent or requiring attack-free models.

Robustness via Referencing: Defending against Prompt Injection Attacks by Referencing the Executed Instruction

cs.CR · 2025-04-29 · unverdicted · novelty 6.0

The method prompts LLMs to output both answers and references to the executed instructions, then filters out any answers not linked to the original input instructions, reducing attack success rates to zero in tested scenarios while preserving utility.

The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions

cs.CR · 2024-04-19 · unverdicted · novelty 6.0

Training LLMs on data that enforces priority levels for instructions makes models robust to prompt injection attacks, including unseen ones, with little loss on standard tasks.

Defending Against Indirect Prompt Injection Attacks With Spotlighting

cs.CR · 2024-03-20 · unverdicted · novelty 6.0

Spotlighting prompt transformations cut indirect prompt injection success rates from >50% to <2% on GPT models while preserving task performance.

Designing Intelligent Enterprise Agents: A Capability-Aligned Multi-Agent Architecture

cs.MA · 2026-05-07 · unverdicted · novelty 5.0

CEAD architecture for intelligent enterprise agents achieves 70.6% safe success rate on 10,000 tasks by making agent design the primary abstraction rather than governance.

Structured Security Auditing and Robustness Enhancement for Untrusted Agent Skills

cs.CR · 2026-04-28 · unverdicted · novelty 5.0

SkillGuard-Robust formulates pre-load auditing of untrusted Agent Skills as a three-way classification task and achieves 97.30% exact match and 98.33% malicious-risk recall on held-out benchmarks.

Evaluation of Prompt Injection Defenses in Large Language Models

cs.CR · 2026-04-26 · unverdicted · novelty 5.0 · 2 refs

Only output filtering with hardcoded rules in application code prevented prompt injection leaks in LLMs, as all model-based defenses were defeated by an adaptive attacker.

MIPIAD: Multilingual Indirect Prompt Injection Attack Defense with Qwen -- TF-IDF Hybrid and Meta-Ensemble Learning

cs.CL · 2026-05-08 · unverdicted · novelty 4.0

MIPIAD reports a hybrid Qwen-TF-IDF ensemble defense that reaches F1 0.9205 and reduces the English-Bangla performance gap on a 1.43-million-sample synthetic benchmark derived from BIPIA templates.

Safety at Scale: A Comprehensive Survey of Large Model and Agent Safety

cs.CR · 2025-02-02 · unverdicted · novelty 2.0

A comprehensive survey that taxonomizes safety threats to large models and agents, reviews defenses and benchmarks, and outlines open challenges.

citing papers explorer

Showing 24 of 24 citing papers.

Ghost in the Agent: Redefining Information Flow Tracking for LLM Agents cs.CR · 2026-04-25 · unverdicted · none · ref 36
NeuroTaint is the first taint tracking framework for LLM agents that uses offline auditing of semantic, causal, and persistent context to detect flows from untrusted sources to privileged sinks.
AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents cs.CR · 2024-06-19 · unverdicted · none · ref 70
AgentDojo introduces an extensible evaluation framework populated with realistic agent tasks and security test cases to measure prompt injection robustness in tool-using LLM agents.
Discourse-Role Labels as Presentation-Time Variables for Context Use in Language Models cs.CL · 2026-06-02 · unverdicted · none · ref 13
Discourse-role labels on identical misleading context cause 56-84 percentage point shifts in LLMs adopting the injected wrong answer.
Gate AI: LLM Security Benchmark Evaluation Methodology and Results cs.LG · 2026-06-01 · unverdicted · none · ref 16
Introduces a cross-validation-based evaluation methodology for LLM security detectors using a global threshold and group-fold leakage checks to avoid per-dataset tuning.
IPI-proxy: An Intercepting Proxy for Red-Teaming Web-Browsing AI Agents Against Indirect Prompt Injection cs.CR · 2026-05-12 · unverdicted · none · ref 9
IPI-proxy is a toolkit using an intercepting proxy to inject indirect prompt injection attacks into live web pages for testing AI browsing agents against hidden instructions.
Perturbation Dose Responses in Recursive LLM Loops: Raw Switching, Stochastic Floors, and Persistent Escape under Append, Replace, and Dialog Updates cs.AI · 2026-05-04 · unverdicted · none · ref 5
In 30-step recursive LLM loops, append-mode persistent escape from source basins reaches 50% near 400 tokens under full history but plateaus below 50% under tail-clip memory policy, while replace-mode switching largely reflects state reset.
Many-Tier Instruction Hierarchy in LLM Agents cs.CL · 2026-04-10 · unverdicted · none · ref 31
ManyIH and ManyIH-Bench address instruction conflicts in LLM agents with up to 12 privilege levels across 853 tasks, revealing frontier models achieve only ~40% accuracy.
Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents cs.CR · 2024-10-03 · unverdicted · none · ref 157
ASB is a new benchmark that tests 10 prompt injection attacks, memory poisoning, a novel Plan-of-Thought backdoor attack, and 11 defenses on LLM agents across 13 models, finding attack success rates up to 84.3% and limited defense effectiveness.
LLM Agents can Autonomously Exploit One-day Vulnerabilities cs.CR · 2024-04-11 · unverdicted · none · ref 23
GPT-4 LLM agents autonomously exploit 87% of tested one-day vulnerabilities when given CVE descriptions, far outperforming other models and tools.
Security--Fidelity Tradeoffs: The Hidden Cost of Prompt Injection Defense cs.CR · 2026-06-29 · unverdicted · none · ref 78
Prompt injection defenses create a security-fidelity tradeoff with no model or defense achieving both high security and high fidelity on the SecFid benchmark across 1,168 examples.
Caught in the Act(ivation): Toward Pre-Output and Multi-Turn Detection of Credential Exfiltration by LLM Agents cs.CR · 2026-06-02 · unverdicted · none · ref 19
Activation probes, calibrated honeytokens, and multi-turn leakage accounting detect credential exfiltration attempts in LLM agents with high accuracy in controlled open-model tests.
From Prompt Injection to Persistent Control: Defending Agentic Harness Against Trojan Backdoors cs.CR · 2026-05-29 · unverdicted · none · ref 29
Introduces ClawTrojan benchmark achieving 95.5% ASR for multi-step trojan attacks in agentic harnesses and DASGuard defense that sanitizes control content from untrusted sources.
Hallucination as Exploit: Evidence-Carrying Multimodal Agents cs.AI · 2026-05-18 · unverdicted · none · ref 14 · 2 links
Evidence-carrying multimodal agents decompose tool calls into predicates, obtain certificates from DOM/OCR/AX verifiers, and use a deterministic gate to authorize actions only when certificates support them, achieving zero unsafe executions in tested tasks.
An Empirical Study of Privacy Leakage Chains via Prompt Injection in Black-Box Chatbot Environments cs.CR · 2026-05-18 · unverdicted · none · ref 8
Empirical demonstration that prompt injection combined with web-tool use creates a feasible privacy-leakage chain in deployed black-box chatbot agents.
Web Agents Should Adopt the Plan-Then-Execute Paradigm cs.CR · 2026-05-14 · unverdicted · none · ref 34
Web agents should default to planning a complete task program before observing live web content to reduce prompt injection exposure, since WebArena tasks are compatible and 80% need no runtime LLM calls.
An AI Agent Execution Environment to Safeguard User Data cs.CR · 2026-04-21 · unverdicted · none · ref 82
GAAP guarantees confidentiality of private user data for AI agents by enforcing user-specified permissions deterministically through persistent information flow tracking, without trusting the agent or requiring attack-free models.
Robustness via Referencing: Defending against Prompt Injection Attacks by Referencing the Executed Instruction cs.CR · 2025-04-29 · unverdicted · none · ref 45
The method prompts LLMs to output both answers and references to the executed instructions, then filters out any answers not linked to the original input instructions, reducing attack success rates to zero in tested scenarios while preserving utility.
The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions cs.CR · 2024-04-19 · unverdicted · none · ref 13
Training LLMs on data that enforces priority levels for instructions makes models robust to prompt injection attacks, including unseen ones, with little loss on standard tasks.
Defending Against Indirect Prompt Injection Attacks With Spotlighting cs.CR · 2024-03-20 · unverdicted · none · ref 2
Spotlighting prompt transformations cut indirect prompt injection success rates from >50% to <2% on GPT models while preserving task performance.
Designing Intelligent Enterprise Agents: A Capability-Aligned Multi-Agent Architecture cs.MA · 2026-05-07 · unverdicted · none · ref 24
CEAD architecture for intelligent enterprise agents achieves 70.6% safe success rate on 10,000 tasks by making agent design the primary abstraction rather than governance.
Structured Security Auditing and Robustness Enhancement for Untrusted Agent Skills cs.CR · 2026-04-28 · unverdicted · none · ref 25
SkillGuard-Robust formulates pre-load auditing of untrusted Agent Skills as a three-way classification task and achieves 97.30% exact match and 98.33% malicious-risk recall on held-out benchmarks.
Evaluation of Prompt Injection Defenses in Large Language Models cs.CR · 2026-04-26 · unverdicted · none · ref 11 · 2 links
Only output filtering with hardcoded rules in application code prevented prompt injection leaks in LLMs, as all model-based defenses were defeated by an adaptive attacker.
MIPIAD: Multilingual Indirect Prompt Injection Attack Defense with Qwen -- TF-IDF Hybrid and Meta-Ensemble Learning cs.CL · 2026-05-08 · unverdicted · none · ref 13
MIPIAD reports a hybrid Qwen-TF-IDF ensemble defense that reaches F1 0.9205 and reduces the English-Bangla performance gap on a 1.43-million-sample synthetic benchmark derived from BIPIA templates.
Safety at Scale: A Comprehensive Survey of Large Model and Agent Safety cs.CR · 2025-02-02 · unverdicted · none · ref 140
A comprehensive survey that taxonomizes safety threats to large models and agents, reviews defenses and benchmarks, and outlines open challenges.

Benchmarking and defending against indirect prompt injection attacks on large language models

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer