AgentDojo introduces an extensible evaluation framework populated with realistic agent tasks and security test cases to measure prompt injection robustness in tool-using LLM agents.
hub Mixed citations
”Tensor trust: Interpretable prompt injection attacks from an online game.” arXiv preprint arXiv:2311.01011 (2023)
Mixed citation behavior. Most common role is background (60%).
hub tools
citation-role summary
citation-polarity summary
representative citing papers
IPI-proxy is a toolkit using an intercepting proxy to inject indirect prompt injection attacks into live web pages for testing AI browsing agents against hidden instructions.
HMNS is a new jailbreak method that uses causal head identification and nullspace-constrained injection to achieve higher attack success rates than prior techniques on aligned language models.
ASB is a new benchmark that tests 10 prompt injection attacks, memory poisoning, a novel Plan-of-Thought backdoor attack, and 11 defenses on LLM agents across 13 models, finding attack success rates up to 84.3% and limited defense effectiveness.
Activation probes, calibrated honeytokens, and multi-turn leakage accounting detect credential exfiltration attempts in LLM agents with high accuracy in controlled open-model tests.
Introduces ePCA framework using neural-symbolic isolation to force agents to formalize intentions as logical constraints, claiming zero attack success and false positive rates in tested scenarios.
Sleeper channels enable persistent prompt injection in always-on AI agents via persistence substrate and firing separation, countered by provenance gates using action digests and owner attestations with a soundness theorem.
PAAC aligns planner-executor decomposition with the device-cloud boundary via typed placeholders and on-device sanitization, delivering 15-36% higher accuracy and 2-6x lower leakage than prior device-cloud baselines on agentic benchmarks.
REGLU guides LoRA-based unlearning via representation subspaces and orthogonal regularization to outperform prior methods on forget-retain trade-off in LLM benchmarks.
Mock tool call wrapping does not broadly improve and sometimes reduces robustness to attacks on untrusted inputs across seven models and three LLM-as-a-judge tasks.
Only output filtering with hardcoded rules in application code prevented prompt injection leaks in LLMs, as all model-based defenses were defeated by an adaptive attacker.
SALLIE detects jailbreaks in text and vision-language models by extracting residual stream activations, scoring maliciousness per layer with k-NN, and ensembling predictions, outperforming baselines on multiple datasets.
DKnownAI Guard achieves 96.5% recall and 90.4% true negative rate, outperforming three competing guardrails in AI agent security evaluations.
A survey reviewing the integration of generative models with connected and automated vehicles to enhance predictive modeling, simulation accuracy, and decision-making.
citing papers explorer
-
Provably Secure Agent Guardrail
Introduces ePCA framework using neural-symbolic isolation to force agents to formalize intentions as logical constraints, claiming zero attack success and false positive rates in tested scenarios.