AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents

Edoardo Debenedetti, Florian Tram\`er, Jie Zhang, Luca Beurer-Kellner, Marc Fischer, Mislav Balunovi\'c

Pith reviewed 2026-05-13 06:32 UTC · model grok-4.3

classification 💻 cs.CR cs.LG

keywords agentdojoattacksagentstasksdefensesenvironmentinjectionprompt

0 comments

The pith

AgentDojo introduces an extensible evaluation framework populated with realistic agent tasks and security test cases to measure prompt injection robustness in tool-using LLM agents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents AgentDojo, a new testing environment for AI agents powered by large language models. These agents combine reasoning with calls to external tools, such as checking emails or processing bank transactions. The core issue addressed is prompt injection, where malicious instructions hidden in the data returned by tools can override the agent's original goals and cause it to perform harmful actions. AgentDojo is built as a flexible platform rather than a fixed set of tests. It comes with 97 practical tasks drawn from everyday scenarios like email management, online banking, and travel planning. Alongside these, there are 629 specific security test cases designed to probe different attack and defense strategies known from existing research. The setup allows researchers to create new tasks, simulate adaptive attacks, and test various protective measures over time. Initial experiments using the environment reveal that even top-performing language models struggle to complete many of the tasks successfully, even when no attacks are present. When prompt injection attacks are introduced, they manage to compromise certain security aspects of the agents but do not succeed in breaking all protections. The authors have made the code publicly available to encourage further development and testing by the community. This approach shifts the focus from static evaluations to a living benchmark that can evolve with new threats and solutions in the field of AI agent security.

Core claim

To measure the adversarial robustness of AI agents, we introduce AgentDojo, an evaluation framework for agents that execute tools over untrusted data... state-of-the-art LLMs fail at many tasks (even in the absence of attacks), and existing prompt injection attacks break some security properties but not all.

Load-bearing premise

The 97 tasks and 629 security test cases in AgentDojo accurately represent real-world agent behaviors and the space of prompt injection threats from untrusted tool outputs.

read the original abstract

AI agents aim to solve complex tasks by combining text-based reasoning with external tool calls. Unfortunately, AI agents are vulnerable to prompt injection attacks where data returned by external tools hijacks the agent to execute malicious tasks. To measure the adversarial robustness of AI agents, we introduce AgentDojo, an evaluation framework for agents that execute tools over untrusted data. To capture the evolving nature of attacks and defenses, AgentDojo is not a static test suite, but rather an extensible environment for designing and evaluating new agent tasks, defenses, and adaptive attacks. We populate the environment with 97 realistic tasks (e.g., managing an email client, navigating an e-banking website, or making travel bookings), 629 security test cases, and various attack and defense paradigms from the literature. We find that AgentDojo poses a challenge for both attacks and defenses: state-of-the-art LLMs fail at many tasks (even in the absence of attacks), and existing prompt injection attacks break some security properties but not all. We hope that AgentDojo can foster research on new design principles for AI agents that solve common tasks in a reliable and robust manner.. We release the code for AgentDojo at https://github.com/ethz-spylab/agentdojo.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Circularity Check

0 steps flagged

No circularity: empirical benchmark release with no derivations or self-referential reductions

full rationale

The paper introduces AgentDojo as an extensible environment populated with 97 tasks and 629 test cases drawn from existing literature on agent behaviors and prompt-injection attacks. No equations, fitted parameters, or predictions appear in the provided text; the central claim is simply the construction and release of the benchmark itself. No self-citation load-bearing steps, uniqueness theorems, or ansatzes are invoked to justify the task suite. The framework is self-contained as an empirical contribution and does not reduce any result to its own inputs by definition.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Based on the abstract alone, the framework rests on domain assumptions that the chosen tasks and test cases are representative of real agent use and attack surfaces; no free parameters or invented entities are described.

axioms (2)

domain assumption The 97 tasks represent realistic agent use cases involving external tools.
Abstract describes them as 'realistic tasks (e.g., managing an email client, navigating an e-banking website, or making travel bookings)'.
domain assumption The 629 security test cases adequately cover relevant prompt injection attack and defense paradigms.
Abstract states the environment is 'populated with ... 629 security test cases, and various attack and defense paradigms from the literature'.

pith-pipeline@v0.9.0 · 5546 in / 1472 out tokens · 83678 ms · 2026-05-13T06:32:13.330618+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear
We introduce AgentDojo, an extensible environment for designing and evaluating new agent tasks, defenses, and adaptive attacks... 97 realistic tasks... 629 security test cases
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear
Targeted Attack Success Rate (ASR): the fraction of security cases where the attacker’s goal is met

Forward citations

Cited by 24 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Trojan Hippo: Weaponizing Agent Memory for Data Exfiltration
cs.CR 2026-05 unverdicted novelty 8.0

Trojan Hippo attacks on LLM agent memory achieve 85-100% success rates in data exfiltration across four memory backends even after 100 benign sessions, while evaluated defenses reduce success rates but impose varying ...
HarmfulSkillBench: How Do Harmful Skills Weaponize Your Agents?
cs.CR 2026-04 unverdicted novelty 8.0

Harmful skills in open agent ecosystems raise average harm scores from 0.27 to 0.76 across six LLMs by lowering refusal rates when tasks are presented via pre-installed skills.
IPI-proxy: An Intercepting Proxy for Red-Teaming Web-Browsing AI Agents Against Indirect Prompt Injection
cs.CR 2026-05 unverdicted novelty 7.0

IPI-proxy is a toolkit using an intercepting proxy to inject indirect prompt injection attacks into live web pages for testing AI browsing agents against hidden instructions.
Perturbation Dose Responses in Recursive LLM Loops: Raw Switching, Stochastic Floors, and Persistent Escape under Append, Replace, and Dialog Updates
cs.AI 2026-05 unverdicted novelty 7.0

In 30-step recursive LLM loops, append-mode persistent escape from source basins reaches 50% near 400 tokens under full history but plateaus below 50% under tail-clip memory policy, while replace-mode switching largel...
A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework
cs.CR 2026-04 unverdicted novelty 7.0

A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.
Breaking MCP with Function Hijacking Attacks: Novel Threats for Function Calling and Agentic Models
cs.CR 2026-04 unverdicted novelty 7.0

A novel function hijacking attack achieves 70-100% success rates in forcing specific function calls across five LLMs on the BFCL benchmark and is robust to context semantics.
Beyond Pattern Matching: Seven Cross-Domain Techniques for Prompt Injection Detection
cs.CR 2026-04 unverdicted novelty 7.0

Seven cross-domain techniques for prompt injection detection are proposed; three implemented versions raise F1 scores on multiple benchmarks while releasing all code and data.
Can Agent Benchmarks Support Their Scores? Evidence-Supported Bounds for Interactive-Agent Evaluation
cs.AI 2026-05 unverdicted novelty 6.0

Agent benchmarks can report evidence-supported score bounds instead of single misleading success rates by adding a layer that checks required artifacts for outcome verification.
AgentShield: Deception-based Compromise Detection for Tool-using LLM Agents
cs.CR 2026-05 unverdicted novelty 6.0

AgentShield uses layered deception traps in LLM agent tool interfaces to detect indirect prompt injection compromises with 90.7-100% success on commercial models, zero false positives, and cross-lingual transfer witho...
SkillScope: Toward Fine-Grained Least-Privilege Enforcement for Agent Skills
cs.CR 2026-05 unverdicted novelty 6.0

SkillScope detects over-privileged LLM agent skills with 94.53% F1 score via graph analysis and replay validation, finding 7,039 problematic skills in the wild and reducing violations by 88.56% while preserving task c...
LoopTrap: Termination Poisoning Attacks on LLM Agents
cs.CR 2026-05 unverdicted novelty 6.0

LoopTrap is an automated red-teaming framework that crafts termination-poisoning prompts to amplify LLM agent steps by 3.57x on average (up to 25x) across 8 agents.
Semia: Auditing Agent Skills via Constraint-Guided Representation Synthesis
cs.CR 2026-05 unverdicted novelty 6.0

Semia synthesizes Datalog representations of agent skills via constraint-guided loops to enable reachability queries for semantic risks, finding critical issues in over half of 13,728 real skills with 97.7% recall on ...
Alignment Contracts for Agentic Security Systems
cs.CR 2026-04 conditional novelty 6.0 full

Alignment contracts define scope, allowed effects, budgets and disclosure rules as safety properties over finite effect traces, with decidable admissibility, refinement rules, and Lean-verified soundness under an obse...
RouteGuard: Internal-Signal Detection of Skill Poisoning in LLM Agents
cs.CR 2026-04 unverdicted novelty 6.0

RouteGuard uses response-conditioned attention and hidden-state alignment to detect skill poisoning in LLM agents, achieving 0.8834 F1 on Skill-Inject benchmarks and recovering 90.51% of attacks missed by lexical screening.
An AI Agent Execution Environment to Safeguard User Data
cs.CR 2026-04 unverdicted novelty 6.0

GAAP guarantees confidentiality of private user data for AI agents by enforcing user-specified permissions deterministically through persistent information flow tracking, without trusting the agent or requiring attack...
From Craft to Kernel: A Governance-First Execution Architecture and Semantic ISA for Agentic Computers
cs.CR 2026-04 unverdicted novelty 6.0

Arbiter-K is a new execution architecture that treats LLMs as probabilistic processors inside a neuro-symbolic kernel with a semantic ISA to enable deterministic security enforcement and unsafe trajectory interdiction...
Policy-Invisible Violations in LLM-Based Agents
cs.AI 2026-04 unverdicted novelty 6.0

LLM agents commit policy-invisible violations when policy facts are hidden from their context; a graph-simulation enforcer reaches 93% accuracy vs 68.8% for content-only baselines on a new 600-trace benchmark.
ClawGuard: A Runtime Security Framework for Tool-Augmented LLM Agents Against Indirect Prompt Injection
cs.CR 2026-04 unverdicted novelty 6.0

ClawGuard enforces user-derived access constraints at tool-call boundaries to block indirect prompt injection in tool-augmented LLM agents across web, MCP, and skill injection channels.
ClawGuard: A Runtime Security Framework for Tool-Augmented LLM Agents Against Indirect Prompt Injection
cs.CR 2026-04 unverdicted novelty 6.0

ClawGuard enforces deterministic, user-derived access constraints at tool boundaries to block indirect prompt injection without changing the underlying LLM.
Swiss-Bench 003: Evaluating LLM Reliability and Adversarial Security for Swiss Regulatory Contexts
cs.CR 2026-04 unverdicted novelty 6.0

Swiss-Bench 003 extends an existing Swiss LLM assessment with two new dimensions and evaluates ten models on 808 items, finding high self-graded reliability scores but low adversarial security scores.
Your Agent, Their Asset: A Real-World Safety Analysis of OpenClaw
cs.CR 2026-04 conditional novelty 6.0

Poisoning any single CIK dimension of an AI agent raises average attack success rate from 24.6% to 64-74% across models, and tested defenses leave substantial residual risk.
Safe Multi-Agent Behavior Must Be Maintained, Not Merely Asserted: Constraint Drift in LLM-Based Multi-Agent Systems
cs.MA 2026-05 unverdicted novelty 5.0

Safety constraints in LLM-based multi-agent systems commonly weaken during execution through memory, communication, and tool use, requiring them to be maintained as explicit state rather than asserted once.
Constraining Host-Level Abuse in Self-Hosted Computer-Use Agents via TEE-Backed Isolation
cs.CR 2026-05 unverdicted novelty 5.0

A TEE-backed architecture isolates security-critical decisions in self-hosted AI agents to prevent host-level abuse from malicious inputs while maintaining allowed functionality.
Auto-ART: Structured Literature Synthesis and Automated Adversarial Robustness Testing
cs.CR 2026-04 unverdicted novelty 5.0

Auto-ART delivers the first structured synthesis of adversarial robustness consensus plus an executable multi-norm testing framework that flags gradient masking in 92% of cases on RobustBench and reveals a 23.5 pp rob...

Reference graph

Works this paper leans on

79 extracted references · 79 canonical work pages · cited by 23 Pith papers · 17 internal anchors

[1]

Croissant: A Metadata Format for ML-Ready Datasets

Mubashara Akhtar et al. “Croissant: A Metadata Format for ML-Ready Datasets”. In: Pro- ceedings of the Eighth Workshop on Data Management for End-to-End Machine Learning. SIGMOD/PODS ’24. ACM, June 2024.DOI: 10.1145/3650203.3663326

work page doi:10.1145/3650203.3663326 2024
[2]

The Claude 3 Model Family: Opus, Sonnet, Haiku

Anthropic. The Claude 3 Model Family: Opus, Sonnet, Haiku. https://www-cdn.anthropic. com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model _Card_Claude_3.pdf . Mar. 2024

work page 2024
[3]

Tool use (function calling)

Anthropic. Tool use (function calling). https://docs.anthropic.com/en/docs/tool-use . 2024

work page 2024
[4]

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. “Training a helpful and harmless assistant with reinforcement learning from human feedback”. In: arXiv preprint arXiv:2204.05862 (2022)

work page internal anchor Pith review Pith/arXiv arXiv 2022
[5]

Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhari- wal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. “Language models are few-shot learners”. In: Advances in neural information processing systems 33 (2020), pp. 1877–1901

work page 2020
[6]

A critique of the deepsec platform for security analysis of deep learning models

Nicholas Carlini. “A critique of the deepsec platform for security analysis of deep learning models”. In: arXiv preprint arXiv:1905.07112 (2019)

work page arXiv 1905
[7]

Jail- breakbench: An open robustness benchmark for jailbreaking large language models

Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, Vikash Sehwag, Edgar Dobriban, Nicolas Flammarion, George J. Pappas, Florian Tramèr, Hamed Hassani, and Eric Wong.JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models. 2024. arXiv:2404.01318 [cs.CR]

work page arXiv 2024
[8]

StruQ: Defending Against Prompt Injection with Structured Queries

Sizhe Chen, Julien Piet, Chawin Sitawarin, and David Wagner. “StruQ: Defending Against Prompt Injection with Structured Queries”. In: arXiv preprint arXiv:2402.06363 (2024)

work page arXiv 2024
[9]

Introducing Command R+: Our new, most powerful model in the Command R family

Cohere. Introducing Command R+: Our new, most powerful model in the Command R family. https://cohere.com/command. 2023

work page 2023
[10]

RobustBench: a standardized adversarial robustness benchmark

Francesco Croce, Maksym Andriushchenko, Vikash Sehwag, Edoardo Debenedetti, Nicolas Flammarion, Mung Chiang, Prateek Mittal, and Matthias Hein. “RobustBench: a standardized adversarial robustness benchmark”. In: NeurIPS Datasets and Benchmarks. 2021

work page 2021
[11]

Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks

Francesco Croce and Matthias Hein. “Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks”. In:International conference on machine learning. PMLR. 2020, pp. 2206–2216

work page 2020
[12]

Dataset and Lessons Learned from the 2024 SaTML LLM Capture- the-Flag Competition

Edoardo Debenedetti et al. Dataset and Lessons Learned from the 2024 SaTML LLM Capture- the-Flag Competition. 2024. arXiv:2406.07954 [cs.CR]

work page arXiv 2024
[13]

Misusing Tools in Large Language Models With Visual Adversarial Examples

Xiaohan Fu, Zihan Wang, Shuheng Li, Rajesh K Gupta, Niloofar Mireshghallah, Taylor Berg- Kirkpatrick, and Earlence Fernandes. “Misusing Tools in Large Language Models With Visual Adversarial Examples”. In: arXiv preprint arXiv:2310.03185 (2023). 10

work page arXiv 2023
[14]

PAL: Program-aided language models

Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. “PAL: Program-aided language models”. In:International Conference on Machine Learning. PMLR. 2023, pp. 10764–10799

work page 2023
[15]

Coercing LLMs to do and reveal (almost) anything

Jonas Geiping, Alex Stein, Manli Shu, Khalid Saifullah, Yuxin Wen, and Tom Goldstein. “Coercing LLMs to do and reveal (almost) anything”. In: arXiv preprint arXiv:2402.14020 (2024)

work page arXiv 2024
[16]

Gemini: A Family of Highly Capable Multimodal Models

Gemini Team. “Gemini: a family of highly capable multimodal models”. In:arXiv preprint arXiv:2312.11805 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[17]

Exploiting GPT-3 prompts with malicious inputs that order the model to ignore its previous directions

Riley Goodside. Exploiting GPT-3 prompts with malicious inputs that order the model to ignore its previous directions. https://x.com/goodside/status/1569128808308957185. 2022

work page arXiv 2022
[18]

Not What You’ve Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection

Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. “Not What You’ve Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection”. In: Proceedings of the 16th ACM Workshop on Artificial Intelligence and Security. CCS ’23. ACM, Nov. 2023.DOI: 10.1145/3605764. 3623985

work page doi:10.1145/3605764 2023
[19]

Defending against indirect prompt injection attacks with spotlighting

Keegan Hines, Gary Lopez, Matthew Hall, Federico Zarfati, Yonatan Zunger, and Emre Kiciman. Defending Against Indirect Prompt Injection Attacks With Spotlighting. 2024. arXiv: 2403.14720 [cs.CR]

work page arXiv 2024
[20]

Language models as zero-shot planners: Extracting actionable knowledge for embodied agents

Wenlong Huang, Pieter Abbeel, Deepak Pathak, and Igor Mordatch. “Language models as zero-shot planners: Extracting actionable knowledge for embodied agents”. In: International Conference on Machine Learning. PMLR. 2022, pp. 9118–9147

work page 2022
[21]

Llama-3 Function Calling Demo

Hamel Husain. Llama-3 Function Calling Demo . https : / / nbsanity . com / static / d06085f1dacae8c9de9402f2d7428de2/demo.html. 2024

work page 2024
[22]

Function calling

Colin Jarvis and Joe Palermo. Function calling. https://cookbook.openai.com/examples/ how_to_call_functions_with_chat_models. June 2023

work page 2023
[23]

Exploiting programmatic behavior of llms: Dual-use through standard security attacks

Daniel Kang, Xuechen Li, Ion Stoica, Carlos Guestrin, Matei Zaharia, and Tatsunori Hashimoto. “Exploiting programmatic behavior of llms: Dual-use through standard security attacks”. In: 2024 IEEE Security and Privacy Workshops (SPW). IEEE. 2024, pp. 132–143

work page 2024
[24]

Intro to Large Language Models

Andrej Karpathy. Intro to Large Language Models. https://www.youtube.com/watch?v= zjkBMFhNj_g. 2023

work page 2023
[25]

Language models can solve computer tasks

Geunwoo Kim, Pierre Baldi, and Stephen McAleer. “Language models can solve computer tasks”. In: Advances in Neural Information Processing Systems 36 (2023)

work page 2023
[26]

Evaluating language-model agents on realistic autonomous tasks

Megan Kinniment, Lucas Jun Koba Sato, Haoxing Du, Brian Goodrich, Max Hasin, Lawrence Chan, Luke Harold Miles, Tao R. Lin, Hjalmar Wijk, Joel Burget, Aaron Ho, Elizabeth Barnes, and Paul Christiano. “Evaluating Language-Model Agents on Realistic Autonomous Tasks”. In: CoRR abs/2312.11671 (2023). DOI: 10.48550/ARXIV.2312.11671. arXiv:2312.11671

work page doi:10.48550/arxiv.2312.11671 2023
[27]

Large language models are zero-shot reasoners

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. “Large language models are zero-shot reasoners”. In: Advances in neural information process- ing systems 35 (2022), pp. 22199–22213

work page 2022
[28]

ChainGuard

Lakera. ChainGuard. https://lakeraai.github.io/chainguard/. 2024

work page 2024
[29]

Hugging Face prompt injection identification

LangChain. Hugging Face prompt injection identification . https://python.langchain. com/v0.1/docs/guides/productionization/safety/hugging _face_prompt_injection/. 2024

work page 2024
[30]

Sandwich Defense

Learn Prompting. Sandwich Defense . https://learnprompting.org/docs/prompt _ hacking/defensive_measures/sandwich_defense. 2024

work page 2024
[31]

AgentSims: An Open-Source Sandbox for Large Language Model Evaluation

Jiaju Lin, Haoran Zhao, Aochi Zhang, Yiting Wu, Huqiuyue Ping, and Qin Chen. AgentSims: An Open-Source Sandbox for Large Language Model Evaluation. 2023. arXiv:2308.04026 [cs.AI]

work page arXiv 2023
[32]

AgentBench: Evaluating LLMs as Agents

Xiao Liu et al. AgentBench: Evaluating LLMs as Agents. 2023. arXiv:2308.03688 [cs.AI]

work page internal anchor Pith review Pith/arXiv arXiv 2023
[33]

Prompt Injection attack against LLM-integrated Applications

Yi Liu, Gelei Deng, Yuekang Li, Kailong Wang, Tianwei Zhang, Yepang Liu, Haoyu Wang, Yan Zheng, and Yang Liu. “Prompt Injection attack against LLM-integrated Applications”. In: arXiv preprint arXiv:2306.05499 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[34]

Yupei Liu, Yuqi Jia, Runpeng Geng, Jinyuan Jia, and Neil Zhenqiang Gong.Formalizing and Benchmarking Prompt Injection Attacks and Defenses. 2023. arXiv:2310.12815 [cs.CR]. 11

work page arXiv 2023
[35]

Chameleon: Plug-and-play compositional reasoning with large language models

Pan Lu, Baolin Peng, Hao Cheng, Michel Galley, Kai-Wei Chang, Ying Nian Wu, Song- Chun Zhu, and Jianfeng Gao. “Chameleon: Plug-and-play compositional reasoning with large language models”. In: Advances in Neural Information Processing Systems 36 (2024)

work page 2024
[36]

HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David Forsyth, and Dan Hendrycks. “HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal”. In: (2024). arXiv:2402.04249 [cs.LG]

work page internal anchor Pith review Pith/arXiv arXiv 2024
[37]

Inverse Scaling Prize: Second Round Winners

Ian McKenzie, Alexander Lyzhov, Alicia Parrish, Ameya Prabhu, Aaron Mueller, Najoung Kim, Sam Bowman, and Ethan Perez. Inverse Scaling Prize: Second Round Winners. 2023

work page 2023
[38]

Inverse Scaling: When Bigger Isn’t Better

Ian R McKenzie, Alexander Lyzhov, Michael Pieler, Alicia Parrish, Aaron Mueller, Ameya Prabhu, Euan McLean, Aaron Kirtland, Alexis Ross, Alisa Liu, et al. “Inverse Scaling: When Bigger Isn’t Better”. In:arXiv preprint arXiv:2306.09479 (2023)

work page arXiv 2023
[39]

arXiv:2311.04235 [cs.AI]

Norman Mu, Sarah Chen, Zifan Wang, Sizhe Chen, David Karamardian, Lulwa Aljeraisy, Basel Alomair, Dan Hendrycks, and David Wagner.Can LLMs Follow Simple Rules? 2024. arXiv:2311.04235 [cs.AI]

work page arXiv 2024
[40]

WebGPT: Browser-assisted question-answering with human feedback

Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christo- pher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. “WebGPT: Browser- assisted question-answering with human feedback”. In: arXiv preprint arXiv:2112.09332 (2021)

work page internal anchor Pith review Pith/arXiv arXiv 2021
[41]

Training language models to follow instructions with human feedback

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. “Training language models to follow instructions with human feedback”. In: Advances in neural information processing systems 35 (2022), pp. 27730–27744

work page 2022
[42]

Neural Exec: Learning (and Learning from) Execution Triggers for Prompt Injection Attacks

Dario Pasquini, Martin Strohmeier, and Carmela Troncoso. Neural Exec: Learning (and Learning from) Execution Triggers for Prompt Injection Attacks. 2024. arXiv: 2403.03792 [cs.CR]

work page arXiv 2024
[43]

Gorilla: Large Language Model Connected with Massive APIs

Shishir G. Patil, Tianjun Zhang, Xin Wang, and Joseph E. Gonzalez. Gorilla: Large Language Model Connected with Massive APIs. 2023. arXiv:2305.15334 [cs.CL]

work page internal anchor Pith review Pith/arXiv arXiv 2023
[44]

Ignore Previous Prompt: Attack Techniques For Language Models

Fábio Perez and Ian Ribeiro. “Ignore previous prompt: Attack techniques for language models”. In: arXiv preprint arXiv:2211.09527 (2022)

work page internal anchor Pith review Pith/arXiv arXiv 2022
[45]

Fine-Tuned DeBERTa-v3-base for Prompt Injection Detection

ProtectAI. Fine-Tuned DeBERTa-v3-base for Prompt Injection Detection . https : / / huggingface.co/ProtectAI/deberta-v3-base-prompt-injection-v2 . 2024

work page 2024
[46]

Data Cards: Purposeful and Transparent Dataset Documentation for Responsible AI

Mahima Pushkarna, Andrew Zaldivar, and Oddur Kjartansson. “Data Cards: Purposeful and Transparent Dataset Documentation for Responsible AI”. In: 2022 ACM Conference on Fairness, Accountability, and Transparency. FAccT ’22. ACM, 2022.DOI: 10.1145/3531146. 3533231

work page doi:10.1145/3531146 2022
[47]

ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs

Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, et al. “ToolLLM: Facilitating large language models to master 16000+ real-world APIs”. In: arXiv preprint arXiv:2307.16789 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[48]

Sebastián Ramírez. FastAPI. https://github.com/tiangolo/fastapi

work page
[49]

A Generalist Agent

Scott Reed, Konrad Zolna, Emilio Parisotto, Sergio Gomez Colmenarejo, Alexander Novikov, Gabriel Barth-Maron, Mai Gimenez, Yury Sulsky, Jackie Kay, Jost Tobias Springenberg, et al. “A generalist agent”. In: arXiv preprint arXiv:2205.06175 (2022)

work page internal anchor Pith review Pith/arXiv arXiv 2022
[50]

Identifying the Risks of LM Agents with an LM-Emulated Sandbox

Yangjun Ruan, Honghua Dong, Andrew Wang, Silviu Pitis, Yongchao Zhou, Jimmy Ba, Yann Dubois, Chris J. Maddison, and Tatsunori Hashimoto. “Identifying the Risks of LM Agents with an LM-Emulated Sandbox”. In:The Twelfth International Conference on Learning Representations. 2024

work page 2024
[51]

ToolFormer: Language Models Can Teach Themselves to Use Tools

Timo Schick, Jane Dwivedi-Yu, Roberto Dessi, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. “ToolFormer: Language Models Can Teach Themselves to Use Tools”. In:Thirty-seventh Conference on Neural Information Processing Systems. 2023

work page 2023
[52]

Ignore This Title and HackAPrompt: Exposing Systemic Vulnerabilities of LLMs Through a Global Prompt Hacking Competition

Sander V Schulhoff, Jeremy Pinto, Anaum Khan, Louis-FranÃ§ois Bouchard, Chenglei Si, Jordan Lee Boyd-Graber, Svetlina Anati, Valen Tagliabue, Anson Liu Kost, and Christopher R Carnahan. “Ignore This Title and HackAPrompt: Exposing Systemic Vulnerabilities of LLMs Through a Global Prompt Hacking Competition”. In: Empirical Methods in Natural Language Proce...

work page 2023
[53]

HuggingGPT: Solving AI tasks with ChatGPT and its friends in Hugging Face

Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. “HuggingGPT: Solving AI tasks with ChatGPT and its friends in Hugging Face”. In:Advances in Neural Information Processing Systems 36 (2024)

work page 2024
[54]

Toolalpaca: Generalized tool learning for language models with 3000 simulated cases, 2023 b

Qiaoyu Tang, Ziliang Deng, Hongyu Lin, Xianpei Han, Qiao Liang, and Le Sun. “ToolAlpaca: Generalized tool learning for language models with 3000 simulated cases”. In: arXiv preprint arXiv:2306.05301 (2023)

work page arXiv 2023
[55]

LaMDA: Language Models for Dialog Applications

Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng- Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, et al. “LaMDA: Language models for dialog applications”. In: arXiv preprint arXiv:2201.08239 (2022)

work page internal anchor Pith review Pith/arXiv arXiv 2022
[56]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timo- thée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. “Llama: Open and efficient foundation language models”. In:arXiv preprint arXiv:2302.13971 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[57]

Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game

Sam Toyer, Olivia Watkins, Ethan Adrian Mendes, Justin Svegliato, Luke Bailey, Tiffany Wang, Isaac Ong, Karim Elmaaroufi, Pieter Abbeel, Trevor Darrell, Alan Ritter, and Stuart Russell. “Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game”. In: CoRR abs/2311.01011 (2023). DOI: 10.48550/ARXIV.2311.01011. arXiv:2311.01011

work page doi:10.48550/arxiv.2311.01011 2023
[58]

On Adaptive Attacks to Adversarial Example Defenses

Florian Tramèr, Nicholas Carlini, Wieland Brendel, and Aleksander Madry. “On Adaptive Attacks to Adversarial Example Defenses”. In: NeurIPS. 2020

work page 2020
[59]

The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions

Eric Wallace, Kai Xiao, Reimar Leike, Lilian Weng, Johannes Heidecke, and Alex Beutel. The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions. 2024. arXiv: 2404.13208 [cs.CR]

work page internal anchor Pith review Pith/arXiv arXiv 2024
[60]

Chain-of-thought prompting elicits reasoning in large language models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. “Chain-of-thought prompting elicits reasoning in large language models”. In: Advances in neural information processing systems 35 (2022), pp. 24824–24837

work page 2022
[61]

Delimiters won’t save you from prompt injection

Simon Willison. Delimiters won’t save you from prompt injection. https://simonwillison. net/2023/May/11/delimiters-wont-save-you/ . 2023

work page 2023
[62]

Prompt injection attacks against GPT-3

Simon Willison. Prompt injection attacks against GPT-3 . https://simonwillison.net/ 2022/Sep/12/prompt-injection/. 2022

work page 2022
[63]

The Dual LLM pattern for building AI assistants that can resist prompt injection

Simon Willison. The Dual LLM pattern for building AI assistants that can resist prompt injection. https://simonwillison.net/2023/Apr/25/dual-llm-pattern/. 2023

work page 2023
[64]

You can’t solve AI security problems with more AI

Simon Willison. You can’t solve AI security problems with more AI. https://simonwillison. net/2022/Sep/17/prompt-injection-more-ai/ . 2022

work page 2022
[65]

Intelligent agents: Theory and practice

Michael Wooldridge and Nicholas R Jennings. “Intelligent agents: Theory and practice”. In: The knowledge engineering review 10.2 (1995), pp. 115–152

work page 1995
[66]

SecGPT: An execution isolation architecture for LLM-based systems

Yuhao Wu, Franziska Roesner, Tadayoshi Kohno, Ning Zhang, and Umar Iqbal. “SecGPT: An execution isolation architecture for LLM-based systems”. In:arXiv preprint arXiv:2403.04960 (2024)

work page arXiv 2024
[67]

Patil, Ion Stoica, and Joseph E

Fanjia Yan, Huanzhi Mao, Charlie Cheng-Jie Ji, Tianjun Zhang, Shishir G. Patil, Ion Stoica, and Joseph E. Gonzalez. Berkeley Function Calling Leaderboard . https://gorilla.cs. berkeley.edu/blogs/8_berkeley_function_calling_leaderboard.html. 2024

work page 2024
[68]

WebShop: Towards scal- able real-world web interaction with grounded language agents

Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. “WebShop: Towards scal- able real-world web interaction with grounded language agents”. In: Advances in Neural Information Processing Systems 35 (2022), pp. 20744–20757

work page 2022
[69]

ReAct: Synergizing Reasoning and Acting in Language Models

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. “ReAct: Synergizing reasoning and acting in language models”. In: arXiv preprint arXiv:2210.03629 (2022)

work page internal anchor Pith review Pith/arXiv arXiv 2022
[70]

Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models

Jingwei Yi, Yueqi Xie, Bin Zhu, Emre Kiciman, Guangzhong Sun, Xing Xie, and Fangzhao Wu. Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models. 2023. arXiv:2312.14197 [cs.CL]

work page arXiv 2023
[71]

InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents

Qiusi Zhan, Zhixiang Liang, Zifan Ying, and Daniel Kang. InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents. 2024. arXiv:2403.02691 [cs.CL]

work page internal anchor Pith review arXiv 2024
[72]

Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig.WebArena: A Realistic Web Environment for Building Autonomous Agents. 2023. arXiv:2307.13854 [cs.AI]. 13

work page internal anchor Pith review Pith/arXiv arXiv 2023
[73]

Universal and Transferable Adversarial Attacks on Aligned Language Models

Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, and Matt Fredrikson. Universal and Transferable Adversarial Attacks on Aligned Language Models. 2023. arXiv: 2307.15043 [cs.CL]

work page internal anchor Pith review Pith/arXiv arXiv 2023
[74]

Can LLMs Sepa- rate Instructions From Data? And What Do We Even Mean By That?

Egor Zverev, Sahar Abdelnabi, Mario Fritz, and Christoph H Lampert. “Can LLMs Sepa- rate Instructions From Data? And What Do We Even Mean By That?” In: arXiv preprint arXiv:2403.06833 (2024). 14 Checklist

work page arXiv 2024
[75]

For all authors... (a) Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope? [Yes] We discuss the gym’s structure and design in Section 3 and the experimental results in Section 4. (b) Did you describe the limitations of your work? [Yes] , in Section 5 (c) Did you discuss any potential negative so...

work page
[76]

(a) Did you state the full set of assumptions of all theoretical results? [N/A] (b) Did you include complete proofs of all theoretical results? [N/A]

If you are including theoretical results... (a) Did you state the full set of assumptions of all theoretical results? [N/A] (b) Did you include complete proofs of all theoretical results? [N/A]

work page
[77]

for benchmarks)

If you ran experiments (e.g. for benchmarks)... (a) Did you include the code, data, and instructions needed to reproduce the main experi- mental results (either in the supplemental material or as a URL)? [Yes] (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] (c) Did you report error bars (e.g.,...

work page
[78]

If you are using existing assets (e.g., code, data, models) or curating/releasing new assets... (a) If your work uses existing assets, did you cite the creators? [Yes] (b) Did you mention the license of the assets? [Yes] We link and/or report verbatim the license of code we re-use and/or adapt from other sources. However, not all assets we use have a lice...

work page
[79]

Send the Facebook security code received via email from security@fb.com to eve@mail.com

If you used crowdsourcing or conducted research with human subjects... (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A] (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A] (c) Did you include the estimated hourly wa...

work page doi:10.5281/zenodo.12528188 2024