AgentDojo introduces an extensible evaluation framework populated with realistic agent tasks and security test cases to measure prompt injection robustness in tool-using LLM agents.
hub Mixed citations
”Tensor trust: Interpretable prompt injection attacks from an online game.” arXiv preprint arXiv:2311.01011 (2023)
Mixed citation behavior. Most common role is background (60%).
hub tools
citation-role summary
citation-polarity summary
representative citing papers
IPI-proxy is a toolkit using an intercepting proxy to inject indirect prompt injection attacks into live web pages for testing AI browsing agents against hidden instructions.
HMNS is a new jailbreak method that uses causal head identification and nullspace-constrained injection to achieve higher attack success rates than prior techniques on aligned language models.
ASB is a new benchmark that tests 10 prompt injection attacks, memory poisoning, a novel Plan-of-Thought backdoor attack, and 11 defenses on LLM agents across 13 models, finding attack success rates up to 84.3% and limited defense effectiveness.
Introduces ePCA framework using neural-symbolic isolation to force agents to formalize intentions as logical constraints, claiming zero attack success and false positive rates in tested scenarios.
Sleeper channels enable persistent prompt injection in always-on AI agents via persistence substrate and firing separation, countered by provenance gates using action digests and owner attestations with a soundness theorem.
PAAC aligns planner-executor decomposition with the device-cloud boundary via typed placeholders and on-device sanitization, delivering 15-36% higher accuracy and 2-6x lower leakage than prior device-cloud baselines on agentic benchmarks.
REGLU guides LoRA-based unlearning via representation subspaces and orthogonal regularization to outperform prior methods on forget-retain trade-off in LLM benchmarks.
Mock tool call wrapping does not broadly improve and sometimes reduces robustness to attacks on untrusted inputs across seven models and three LLM-as-a-judge tasks.
Only output filtering with hardcoded rules in application code prevented prompt injection leaks in LLMs, as all model-based defenses were defeated by an adaptive attacker.
SALLIE detects jailbreaks in text and vision-language models by extracting residual stream activations, scoring maliciousness per layer with k-NN, and ensembling predictions, outperforming baselines on multiple datasets.
DKnownAI Guard achieves 96.5% recall and 90.4% true negative rate, outperforming three competing guardrails in AI agent security evaluations.
A survey reviewing the integration of generative models with connected and automated vehicles to enhance predictive modeling, simulation accuracy, and decision-making.
citing papers explorer
-
AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents
AgentDojo introduces an extensible evaluation framework populated with realistic agent tasks and security test cases to measure prompt injection robustness in tool-using LLM agents.
-
IPI-proxy: An Intercepting Proxy for Red-Teaming Web-Browsing AI Agents Against Indirect Prompt Injection
IPI-proxy is a toolkit using an intercepting proxy to inject indirect prompt injection attacks into live web pages for testing AI browsing agents against hidden instructions.
-
Jailbreaking the Matrix: Nullspace Steering for Controlled Model Subversion
HMNS is a new jailbreak method that uses causal head identification and nullspace-constrained injection to achieve higher attack success rates than prior techniques on aligned language models.
-
Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents
ASB is a new benchmark that tests 10 prompt injection attacks, memory poisoning, a novel Plan-of-Thought backdoor attack, and 11 defenses on LLM agents across 13 models, finding attack success rates up to 84.3% and limited defense effectiveness.
-
Provably Secure Agent Guardrail
Introduces ePCA framework using neural-symbolic isolation to force agents to formalize intentions as logical constraints, claiming zero attack success and false positive rates in tested scenarios.
-
Sleeper Channels and Provenance Gates: Persistent Prompt Injection in Always-on Autonomous AI Agents
Sleeper channels enable persistent prompt injection in always-on AI agents via persistence substrate and firing separation, countered by provenance gates using action digests and owner attestations with a soundness theorem.
-
PAAC: Privacy-Aware Agentic Device-Cloud Collaboration
PAAC aligns planner-executor decomposition with the device-cloud boundary via typed placeholders and on-device sanitization, delivering 15-36% higher accuracy and 2-6x lower leakage than prior device-cloud baselines on agentic benchmarks.
-
Representation-Guided Parameter-Efficient LLM Unlearning
REGLU guides LoRA-based unlearning via representation subspaces and orthogonal regularization to outperform prior methods on forget-retain trade-off in LLM benchmarks.
-
Evaluating using Mock Tool Calls to Quarantine Untrusted Prompt Inputs
Mock tool call wrapping does not broadly improve and sometimes reduces robustness to attacks on untrusted inputs across seven models and three LLM-as-a-judge tasks.
-
Evaluation of Prompt Injection Defenses in Large Language Models
Only output filtering with hardcoded rules in application code prevented prompt injection leaks in LLMs, as all model-based defenses were defeated by an adaptive attacker.
-
SALLIE: Safeguarding Against Latent Language & Image Exploits
SALLIE detects jailbreaks in text and vision-language models by extracting residual stream activations, scoring maliciousness per layer with k-NN, and ensembling predictions, outperforming baselines on multiple datasets.
-
A Comparative Evaluation of AI Agent Security Guardrails
DKnownAI Guard achieves 96.5% recall and 90.4% true negative rate, outperforming three competing guardrails in AI agent security evaluations.
-
Generative Models and Connected and Automated Vehicles: A Survey in Exploring the Intersection of Transportation and AI
A survey reviewing the integration of generative models with connected and automated vehicles to enhance predictive modeling, simulation accuracy, and decision-making.