citation dossier

WASP: Benchmarking web agent security against prompt injection attacks

Chaofan Li et al. · 2025 · arXiv 2504.18575

16 Pith papers citing it

17 reference links

cs.CR top field · 9 papers

UNVERDICTED top verdict bucket · 16 papers

This arXiv-backed work is queued for full Pith review when it crosses the high-inbound sweep. That review runs reader · skeptic · desk-editor · referee · rebuttal · circularity · lean confirmation · RS check · pith extraction.


why this work matters in Pith

Pith has found this work in 16 reviewed papers. Its strongest current cluster is cs.CR (9 papers). The largest review-status bucket among citing papers is UNVERDICTED (16 papers). For highly cited works, this page shows a dossier first and a bounded explorer second; it never tries to render every citing paper at once.

representative citing papers

IPI-proxy: An Intercepting Proxy for Red-Teaming Web-Browsing AI Agents Against Indirect Prompt Injection

cs.CR · 2026-05-12 · unverdicted · novelty 7.0

IPI-proxy is a toolkit using an intercepting proxy to inject indirect prompt injection attacks into live web pages for testing AI browsing agents against hidden instructions.

The Granularity Mismatch in Agent Security: Argument-Level Provenance Solves Enforcement and Isolates the LLM Reasoning Bottleneck

cs.CR · 2026-05-11 · unverdicted · novelty 7.0

PACT achieves perfect security and utility under oracle provenance by enforcing argument-level trust contracts based on semantic roles and cross-step provenance tracking, outperforming invocation-level monitors in AgentDojo evaluations.

WAAA! Web Adversaries Against Agentic Browsers

cs.CR · 2026-05-06 · unverdicted · novelty 7.0

Agentic browsers are vulnerable to 20 web and LLM attacks, 18 of them implemented, exposing five failure modes across four major LLMs that require redesign before safe deployment.

Indirect Prompt Injection in the Wild: An Empirical Study of Prevalence, Techniques, and Objectives

cs.CR · 2026-04-29 · unverdicted · novelty 7.0

Large-scale analysis of 1.2B URLs identifies 15.3K indirect prompt injection instances in the wild, mostly targeting AI systems, with up to 8% compliance in model experiments.

OS-SPEAR: A Toolkit for the Safety, Performance, Efficiency, and Robustness Analysis of OS Agents

cs.CL · 2026-04-27 · unverdicted · novelty 7.0

OS-SPEAR is a new evaluation toolkit that tests 22 OS agents and identifies trade-offs between efficiency and safety or robustness.

SkillSafetyBench: Evaluating Agent Safety under Skill-Facing Attack Surfaces

cs.CR · 2026-05-12 · unverdicted · novelty 6.0

SkillSafetyBench shows that localized non-user attacks via skills and artifacts can consistently induce unsafe agent behavior across domains and model backends, independent of user intent.

When Agents Overtrust Environmental Evidence: An Extensible Agentic Framework for Benchmarking Evidence-Grounding Defects in LLM Agents

cs.AI · 2026-05-09 · unverdicted · novelty 6.0 · 2 refs

EnvTrustBench is a new agentic benchmark that measures evidence-grounding defects where LLM agents overtrust faulty environmental observations and take incorrect actions.

SnapGuard: Lightweight Prompt Injection Detection for Screenshot-Based Web Agents

cs.CR · 2026-04-28 · unverdicted · novelty 6.0

SnapGuard detects prompt injection attacks on screenshot-based web agents via visual stability indicators and contrast-polarity textual signals, reaching F1 0.75 while running 8x faster than GPT-4o with no added memory cost.

HINTBench: Horizon-agent Intrinsic Non-attack Trajectory Benchmark

cs.LG · 2026-04-15 · unverdicted · novelty 6.0

HINTBench provides 629 annotated agent trajectories to audit intrinsic non-attack risks, showing that strong LLMs detect overall risk but fail at localizing specific risky steps or diagnosing failure types.

Measuring the Permission Gate: A Stress-Test Evaluation of Claude Code's Auto Mode

cs.SE · 2026-04-04 · unverdicted · novelty 6.0

Independent evaluation of Claude Code auto mode finds an 81% false negative rate on ambiguous authorization tasks due to unmonitored file edits.

Safe Multi-Agent Behavior Must Be Maintained, Not Merely Asserted: Constraint Drift in LLM-Based Multi-Agent Systems

cs.MA · 2026-05-11 · unverdicted · novelty 5.0

Safety constraints in LLM-based multi-agent systems commonly weaken during execution through memory, communication, and tool use, requiring them to be maintained as explicit state rather than asserted once.

Constraining Host-Level Abuse in Self-Hosted Computer-Use Agents via TEE-Backed Isolation

cs.CR · 2026-05-07 · unverdicted · novelty 5.0

A TEE-backed architecture isolates security-critical decisions in self-hosted AI agents to prevent host-level abuse from malicious inputs while maintaining allowed functionality.

WebAgentGuard: A Reasoning-Driven Guard Model for Detecting Prompt Injection Attacks in Web Agents

cs.CR · 2026-04-14 · unverdicted · novelty 5.0

WebAgentGuard is a reasoning-driven multimodal model trained on large synthetic data via supervised fine-tuning and reinforcement learning to detect prompt injections in web agents better than prior defenses.

PIArena: A Platform for Prompt Injection Evaluation

cs.CR · 2026-04-09 · unverdicted · novelty 5.0

PIArena provides a unified evaluation platform for prompt injection attacks and defenses, featuring a new adaptive attack that reveals major weaknesses in existing protections.

Securing Computer-Use Agents: A Unified Architecture-Lifecycle Framework for Deployment-Grounded Reliability

cs.CL · 2026-05-08 · unverdicted · novelty 4.0

The paper develops a unified framework that organizes computer-use agent reliability around perception-decision-execution layers and creation-deployment-operation-maintenance stages to map security and alignment interventions.

Reinforcement Learning for LLM-based Multi-Agent Systems through Orchestration Traces

cs.CL · 2026-05-04 · unverdicted · novelty 4.0

This survey organizes RL for LLM multi-agent systems into reward families, credit units, and five orchestration sub-decisions, notes the absence of explicit stopping-decision training in its paper pool, and releases a tagged corpus.

citing papers explorer


IPI-proxy: An Intercepting Proxy for Red-Teaming Web-Browsing AI Agents Against Indirect Prompt Injection cs.CR · 2026-05-12 · unverdicted · none · ref 12
The Granularity Mismatch in Agent Security: Argument-Level Provenance Solves Enforcement and Isolates the LLM Reasoning Bottleneck cs.CR · 2026-05-11 · unverdicted · none · ref 5
WAAA! Web Adversaries Against Agentic Browsers cs.CR · 2026-05-06 · unverdicted · none · ref 29
Indirect Prompt Injection in the Wild: An Empirical Study of Prevalence, Techniques, and Objectives cs.CR · 2026-04-29 · unverdicted · none · ref 47
OS-SPEAR: A Toolkit for the Safety, Performance, Efficiency, and Robustness Analysis of OS Agents cs.CL · 2026-04-27 · unverdicted · none · ref 33
SkillSafetyBench: Evaluating Agent Safety under Skill-Facing Attack Surfaces cs.CR · 2026-05-12 · unverdicted · none · ref 55
When Agents Overtrust Environmental Evidence: An Extensible Agentic Framework for Benchmarking Evidence-Grounding Defects in LLM Agents cs.AI · 2026-05-09 · unverdicted · none · ref 7 · 2 links
SnapGuard: Lightweight Prompt Injection Detection for Screenshot-Based Web Agents cs.CR · 2026-04-28 · unverdicted · none · ref 14
HINTBench: Horizon-agent Intrinsic Non-attack Trajectory Benchmark cs.LG · 2026-04-15 · unverdicted · none · ref 1
Measuring the Permission Gate: A Stress-Test Evaluation of Claude Code's Auto Mode cs.SE · 2026-04-04 · unverdicted · none · ref 2
Safe Multi-Agent Behavior Must Be Maintained, Not Merely Asserted: Constraint Drift in LLM-Based Multi-Agent Systems cs.MA · 2026-05-11 · unverdicted · none · ref 11
Constraining Host-Level Abuse in Self-Hosted Computer-Use Agents via TEE-Backed Isolation cs.CR · 2026-05-07 · unverdicted · none · ref 26
WebAgentGuard: A Reasoning-Driven Guard Model for Detecting Prompt Injection Attacks in Web Agents cs.CR · 2026-04-14 · unverdicted · none · ref 3
PIArena: A Platform for Prompt Injection Evaluation cs.CR · 2026-04-09 · unverdicted · none · ref 4
Securing Computer-Use Agents: A Unified Architecture-Lifecycle Framework for Deployment-Grounded Reliability cs.CL · 2026-05-08 · unverdicted · none · ref 166
Reinforcement Learning for LLM-based Multi-Agent Systems through Orchestration Traces cs.CL · 2026-05-04 · unverdicted · none · ref 14
