pith. machine review for the scientific record. sign in

arxiv: 2406.13352 · v3 · submitted 2024-06-19 · 💻 cs.CR · cs.LG

Recognition: 2 theorem links

· Lean Theorem

AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents

Edoardo Debenedetti, Florian Tram\`er, Jie Zhang, Luca Beurer-Kellner, Marc Fischer, Mislav Balunovi\'c

Pith reviewed 2026-05-13 06:32 UTC · model grok-4.3

classification 💻 cs.CR cs.LG
keywords agentdojoattacksagentstasksdefensesenvironmentinjectionprompt
0
0 comments X

The pith

AgentDojo introduces an extensible evaluation framework populated with realistic agent tasks and security test cases to measure prompt injection robustness in tool-using LLM agents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents AgentDojo, a new testing environment for AI agents powered by large language models. These agents combine reasoning with calls to external tools, such as checking emails or processing bank transactions. The core issue addressed is prompt injection, where malicious instructions hidden in the data returned by tools can override the agent's original goals and cause it to perform harmful actions. AgentDojo is built as a flexible platform rather than a fixed set of tests. It comes with 97 practical tasks drawn from everyday scenarios like email management, online banking, and travel planning. Alongside these, there are 629 specific security test cases designed to probe different attack and defense strategies known from existing research. The setup allows researchers to create new tasks, simulate adaptive attacks, and test various protective measures over time. Initial experiments using the environment reveal that even top-performing language models struggle to complete many of the tasks successfully, even when no attacks are present. When prompt injection attacks are introduced, they manage to compromise certain security aspects of the agents but do not succeed in breaking all protections. The authors have made the code publicly available to encourage further development and testing by the community. This approach shifts the focus from static evaluations to a living benchmark that can evolve with new threats and solutions in the field of AI agent security.

Core claim

To measure the adversarial robustness of AI agents, we introduce AgentDojo, an evaluation framework for agents that execute tools over untrusted data... state-of-the-art LLMs fail at many tasks (even in the absence of attacks), and existing prompt injection attacks break some security properties but not all.

Load-bearing premise

The 97 tasks and 629 security test cases in AgentDojo accurately represent real-world agent behaviors and the space of prompt injection threats from untrusted tool outputs.

read the original abstract

AI agents aim to solve complex tasks by combining text-based reasoning with external tool calls. Unfortunately, AI agents are vulnerable to prompt injection attacks where data returned by external tools hijacks the agent to execute malicious tasks. To measure the adversarial robustness of AI agents, we introduce AgentDojo, an evaluation framework for agents that execute tools over untrusted data. To capture the evolving nature of attacks and defenses, AgentDojo is not a static test suite, but rather an extensible environment for designing and evaluating new agent tasks, defenses, and adaptive attacks. We populate the environment with 97 realistic tasks (e.g., managing an email client, navigating an e-banking website, or making travel bookings), 629 security test cases, and various attack and defense paradigms from the literature. We find that AgentDojo poses a challenge for both attacks and defenses: state-of-the-art LLMs fail at many tasks (even in the absence of attacks), and existing prompt injection attacks break some security properties but not all. We hope that AgentDojo can foster research on new design principles for AI agents that solve common tasks in a reliable and robust manner.. We release the code for AgentDojo at https://github.com/ethz-spylab/agentdojo.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Circularity Check

0 steps flagged

No circularity: empirical benchmark release with no derivations or self-referential reductions

full rationale

The paper introduces AgentDojo as an extensible environment populated with 97 tasks and 629 test cases drawn from existing literature on agent behaviors and prompt-injection attacks. No equations, fitted parameters, or predictions appear in the provided text; the central claim is simply the construction and release of the benchmark itself. No self-citation load-bearing steps, uniqueness theorems, or ansatzes are invoked to justify the task suite. The framework is self-contained as an empirical contribution and does not reduce any result to its own inputs by definition.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Based on the abstract alone, the framework rests on domain assumptions that the chosen tasks and test cases are representative of real agent use and attack surfaces; no free parameters or invented entities are described.

axioms (2)
  • domain assumption The 97 tasks represent realistic agent use cases involving external tools.
    Abstract describes them as 'realistic tasks (e.g., managing an email client, navigating an e-banking website, or making travel bookings)'.
  • domain assumption The 629 security test cases adequately cover relevant prompt injection attack and defense paradigms.
    Abstract states the environment is 'populated with ... 629 security test cases, and various attack and defense paradigms from the literature'.

pith-pipeline@v0.9.0 · 5546 in / 1472 out tokens · 83678 ms · 2026-05-13T06:32:13.330618+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Forward citations

Cited by 24 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Trojan Hippo: Weaponizing Agent Memory for Data Exfiltration

    cs.CR 2026-05 unverdicted novelty 8.0

    Trojan Hippo attacks on LLM agent memory achieve 85-100% success rates in data exfiltration across four memory backends even after 100 benign sessions, while evaluated defenses reduce success rates but impose varying ...

  2. HarmfulSkillBench: How Do Harmful Skills Weaponize Your Agents?

    cs.CR 2026-04 unverdicted novelty 8.0

    Harmful skills in open agent ecosystems raise average harm scores from 0.27 to 0.76 across six LLMs by lowering refusal rates when tasks are presented via pre-installed skills.

  3. IPI-proxy: An Intercepting Proxy for Red-Teaming Web-Browsing AI Agents Against Indirect Prompt Injection

    cs.CR 2026-05 unverdicted novelty 7.0

    IPI-proxy is a toolkit using an intercepting proxy to inject indirect prompt injection attacks into live web pages for testing AI browsing agents against hidden instructions.

  4. Perturbation Dose Responses in Recursive LLM Loops: Raw Switching, Stochastic Floors, and Persistent Escape under Append, Replace, and Dialog Updates

    cs.AI 2026-05 unverdicted novelty 7.0

    In 30-step recursive LLM loops, append-mode persistent escape from source basins reaches 50% near 400 tokens under full history but plateaus below 50% under tail-clip memory policy, while replace-mode switching largel...

  5. A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework

    cs.CR 2026-04 unverdicted novelty 7.0

    A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.

  6. Breaking MCP with Function Hijacking Attacks: Novel Threats for Function Calling and Agentic Models

    cs.CR 2026-04 unverdicted novelty 7.0

    A novel function hijacking attack achieves 70-100% success rates in forcing specific function calls across five LLMs on the BFCL benchmark and is robust to context semantics.

  7. Beyond Pattern Matching: Seven Cross-Domain Techniques for Prompt Injection Detection

    cs.CR 2026-04 unverdicted novelty 7.0

    Seven cross-domain techniques for prompt injection detection are proposed; three implemented versions raise F1 scores on multiple benchmarks while releasing all code and data.

  8. Can Agent Benchmarks Support Their Scores? Evidence-Supported Bounds for Interactive-Agent Evaluation

    cs.AI 2026-05 unverdicted novelty 6.0

    Agent benchmarks can report evidence-supported score bounds instead of single misleading success rates by adding a layer that checks required artifacts for outcome verification.

  9. AgentShield: Deception-based Compromise Detection for Tool-using LLM Agents

    cs.CR 2026-05 unverdicted novelty 6.0

    AgentShield uses layered deception traps in LLM agent tool interfaces to detect indirect prompt injection compromises with 90.7-100% success on commercial models, zero false positives, and cross-lingual transfer witho...

  10. SkillScope: Toward Fine-Grained Least-Privilege Enforcement for Agent Skills

    cs.CR 2026-05 unverdicted novelty 6.0

    SkillScope detects over-privileged LLM agent skills with 94.53% F1 score via graph analysis and replay validation, finding 7,039 problematic skills in the wild and reducing violations by 88.56% while preserving task c...

  11. LoopTrap: Termination Poisoning Attacks on LLM Agents

    cs.CR 2026-05 unverdicted novelty 6.0

    LoopTrap is an automated red-teaming framework that crafts termination-poisoning prompts to amplify LLM agent steps by 3.57x on average (up to 25x) across 8 agents.

  12. Semia: Auditing Agent Skills via Constraint-Guided Representation Synthesis

    cs.CR 2026-05 unverdicted novelty 6.0

    Semia synthesizes Datalog representations of agent skills via constraint-guided loops to enable reachability queries for semantic risks, finding critical issues in over half of 13,728 real skills with 97.7% recall on ...

  13. Alignment Contracts for Agentic Security Systems

    cs.CR 2026-04 conditional novelty 6.0 full

    Alignment contracts define scope, allowed effects, budgets and disclosure rules as safety properties over finite effect traces, with decidable admissibility, refinement rules, and Lean-verified soundness under an obse...

  14. RouteGuard: Internal-Signal Detection of Skill Poisoning in LLM Agents

    cs.CR 2026-04 unverdicted novelty 6.0

    RouteGuard uses response-conditioned attention and hidden-state alignment to detect skill poisoning in LLM agents, achieving 0.8834 F1 on Skill-Inject benchmarks and recovering 90.51% of attacks missed by lexical screening.

  15. An AI Agent Execution Environment to Safeguard User Data

    cs.CR 2026-04 unverdicted novelty 6.0

    GAAP guarantees confidentiality of private user data for AI agents by enforcing user-specified permissions deterministically through persistent information flow tracking, without trusting the agent or requiring attack...

  16. From Craft to Kernel: A Governance-First Execution Architecture and Semantic ISA for Agentic Computers

    cs.CR 2026-04 unverdicted novelty 6.0

    Arbiter-K is a new execution architecture that treats LLMs as probabilistic processors inside a neuro-symbolic kernel with a semantic ISA to enable deterministic security enforcement and unsafe trajectory interdiction...

  17. Policy-Invisible Violations in LLM-Based Agents

    cs.AI 2026-04 unverdicted novelty 6.0

    LLM agents commit policy-invisible violations when policy facts are hidden from their context; a graph-simulation enforcer reaches 93% accuracy vs 68.8% for content-only baselines on a new 600-trace benchmark.

  18. ClawGuard: A Runtime Security Framework for Tool-Augmented LLM Agents Against Indirect Prompt Injection

    cs.CR 2026-04 unverdicted novelty 6.0

    ClawGuard enforces user-derived access constraints at tool-call boundaries to block indirect prompt injection in tool-augmented LLM agents across web, MCP, and skill injection channels.

  19. ClawGuard: A Runtime Security Framework for Tool-Augmented LLM Agents Against Indirect Prompt Injection

    cs.CR 2026-04 unverdicted novelty 6.0

    ClawGuard enforces deterministic, user-derived access constraints at tool boundaries to block indirect prompt injection without changing the underlying LLM.

  20. Swiss-Bench 003: Evaluating LLM Reliability and Adversarial Security for Swiss Regulatory Contexts

    cs.CR 2026-04 unverdicted novelty 6.0

    Swiss-Bench 003 extends an existing Swiss LLM assessment with two new dimensions and evaluates ten models on 808 items, finding high self-graded reliability scores but low adversarial security scores.

  21. Your Agent, Their Asset: A Real-World Safety Analysis of OpenClaw

    cs.CR 2026-04 conditional novelty 6.0

    Poisoning any single CIK dimension of an AI agent raises average attack success rate from 24.6% to 64-74% across models, and tested defenses leave substantial residual risk.

  22. Safe Multi-Agent Behavior Must Be Maintained, Not Merely Asserted: Constraint Drift in LLM-Based Multi-Agent Systems

    cs.MA 2026-05 unverdicted novelty 5.0

    Safety constraints in LLM-based multi-agent systems commonly weaken during execution through memory, communication, and tool use, requiring them to be maintained as explicit state rather than asserted once.

  23. Constraining Host-Level Abuse in Self-Hosted Computer-Use Agents via TEE-Backed Isolation

    cs.CR 2026-05 unverdicted novelty 5.0

    A TEE-backed architecture isolates security-critical decisions in self-hosted AI agents to prevent host-level abuse from malicious inputs while maintaining allowed functionality.

  24. Auto-ART: Structured Literature Synthesis and Automated Adversarial Robustness Testing

    cs.CR 2026-04 unverdicted novelty 5.0

    Auto-ART delivers the first structured synthesis of adversarial robustness consensus plus an executable multi-norm testing framework that flags gradient masking in 92% of cases on RobustBench and reveals a 23.5 pp rob...

Reference graph

Works this paper leans on

79 extracted references · 79 canonical work pages · cited by 23 Pith papers · 17 internal anchors

  1. [1]

    Croissant: A Metadata Format for ML-Ready Datasets

    Mubashara Akhtar et al. “Croissant: A Metadata Format for ML-Ready Datasets”. In: Pro- ceedings of the Eighth Workshop on Data Management for End-to-End Machine Learning. SIGMOD/PODS ’24. ACM, June 2024.DOI: 10.1145/3650203.3663326

  2. [2]

    The Claude 3 Model Family: Opus, Sonnet, Haiku

    Anthropic. The Claude 3 Model Family: Opus, Sonnet, Haiku. https://www-cdn.anthropic. com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model _Card_Claude_3.pdf . Mar. 2024

  3. [3]

    Tool use (function calling)

    Anthropic. Tool use (function calling). https://docs.anthropic.com/en/docs/tool-use . 2024

  4. [4]

    Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

    Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. “Training a helpful and harmless assistant with reinforcement learning from human feedback”. In: arXiv preprint arXiv:2204.05862 (2022)

  5. [5]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhari- wal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. “Language models are few-shot learners”. In: Advances in neural information processing systems 33 (2020), pp. 1877–1901

  6. [6]

    A critique of the deepsec platform for security analysis of deep learning models

    Nicholas Carlini. “A critique of the deepsec platform for security analysis of deep learning models”. In: arXiv preprint arXiv:1905.07112 (2019)

  7. [7]

    Jail- breakbench: An open robustness benchmark for jailbreaking large language models

    Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, Vikash Sehwag, Edgar Dobriban, Nicolas Flammarion, George J. Pappas, Florian Tramèr, Hamed Hassani, and Eric Wong.JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models. 2024. arXiv:2404.01318 [cs.CR]

  8. [8]

    StruQ: Defending Against Prompt Injection with Structured Queries

    Sizhe Chen, Julien Piet, Chawin Sitawarin, and David Wagner. “StruQ: Defending Against Prompt Injection with Structured Queries”. In: arXiv preprint arXiv:2402.06363 (2024)

  9. [9]

    Introducing Command R+: Our new, most powerful model in the Command R family

    Cohere. Introducing Command R+: Our new, most powerful model in the Command R family. https://cohere.com/command. 2023

  10. [10]

    RobustBench: a standardized adversarial robustness benchmark

    Francesco Croce, Maksym Andriushchenko, Vikash Sehwag, Edoardo Debenedetti, Nicolas Flammarion, Mung Chiang, Prateek Mittal, and Matthias Hein. “RobustBench: a standardized adversarial robustness benchmark”. In: NeurIPS Datasets and Benchmarks. 2021

  11. [11]

    Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks

    Francesco Croce and Matthias Hein. “Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks”. In:International conference on machine learning. PMLR. 2020, pp. 2206–2216

  12. [12]

    Dataset and Lessons Learned from the 2024 SaTML LLM Capture- the-Flag Competition

    Edoardo Debenedetti et al. Dataset and Lessons Learned from the 2024 SaTML LLM Capture- the-Flag Competition. 2024. arXiv:2406.07954 [cs.CR]

  13. [13]

    Misusing Tools in Large Language Models With Visual Adversarial Examples

    Xiaohan Fu, Zihan Wang, Shuheng Li, Rajesh K Gupta, Niloofar Mireshghallah, Taylor Berg- Kirkpatrick, and Earlence Fernandes. “Misusing Tools in Large Language Models With Visual Adversarial Examples”. In: arXiv preprint arXiv:2310.03185 (2023). 10

  14. [14]

    PAL: Program-aided language models

    Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. “PAL: Program-aided language models”. In:International Conference on Machine Learning. PMLR. 2023, pp. 10764–10799

  15. [15]

    Coercing LLMs to do and reveal (almost) anything

    Jonas Geiping, Alex Stein, Manli Shu, Khalid Saifullah, Yuxin Wen, and Tom Goldstein. “Coercing LLMs to do and reveal (almost) anything”. In: arXiv preprint arXiv:2402.14020 (2024)

  16. [16]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini Team. “Gemini: a family of highly capable multimodal models”. In:arXiv preprint arXiv:2312.11805 (2023)

  17. [17]

    Exploiting GPT-3 prompts with malicious inputs that order the model to ignore its previous directions

    Riley Goodside. Exploiting GPT-3 prompts with malicious inputs that order the model to ignore its previous directions. https://x.com/goodside/status/1569128808308957185. 2022

  18. [18]

    Not What You’ve Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection

    Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. “Not What You’ve Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection”. In: Proceedings of the 16th ACM Workshop on Artificial Intelligence and Security. CCS ’23. ACM, Nov. 2023.DOI: 10.1145/3605764. 3623985

  19. [19]

    Defending against indirect prompt injection attacks with spotlighting

    Keegan Hines, Gary Lopez, Matthew Hall, Federico Zarfati, Yonatan Zunger, and Emre Kiciman. Defending Against Indirect Prompt Injection Attacks With Spotlighting. 2024. arXiv: 2403.14720 [cs.CR]

  20. [20]

    Language models as zero-shot planners: Extracting actionable knowledge for embodied agents

    Wenlong Huang, Pieter Abbeel, Deepak Pathak, and Igor Mordatch. “Language models as zero-shot planners: Extracting actionable knowledge for embodied agents”. In: International Conference on Machine Learning. PMLR. 2022, pp. 9118–9147

  21. [21]

    Llama-3 Function Calling Demo

    Hamel Husain. Llama-3 Function Calling Demo . https : / / nbsanity . com / static / d06085f1dacae8c9de9402f2d7428de2/demo.html. 2024

  22. [22]

    Function calling

    Colin Jarvis and Joe Palermo. Function calling. https://cookbook.openai.com/examples/ how_to_call_functions_with_chat_models. June 2023

  23. [23]

    Exploiting programmatic behavior of llms: Dual-use through standard security attacks

    Daniel Kang, Xuechen Li, Ion Stoica, Carlos Guestrin, Matei Zaharia, and Tatsunori Hashimoto. “Exploiting programmatic behavior of llms: Dual-use through standard security attacks”. In: 2024 IEEE Security and Privacy Workshops (SPW). IEEE. 2024, pp. 132–143

  24. [24]

    Intro to Large Language Models

    Andrej Karpathy. Intro to Large Language Models. https://www.youtube.com/watch?v= zjkBMFhNj_g. 2023

  25. [25]

    Language models can solve computer tasks

    Geunwoo Kim, Pierre Baldi, and Stephen McAleer. “Language models can solve computer tasks”. In: Advances in Neural Information Processing Systems 36 (2023)

  26. [26]

    Evaluating language-model agents on realistic autonomous tasks

    Megan Kinniment, Lucas Jun Koba Sato, Haoxing Du, Brian Goodrich, Max Hasin, Lawrence Chan, Luke Harold Miles, Tao R. Lin, Hjalmar Wijk, Joel Burget, Aaron Ho, Elizabeth Barnes, and Paul Christiano. “Evaluating Language-Model Agents on Realistic Autonomous Tasks”. In: CoRR abs/2312.11671 (2023). DOI: 10.48550/ARXIV.2312.11671. arXiv:2312.11671

  27. [27]

    Large language models are zero-shot reasoners

    Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. “Large language models are zero-shot reasoners”. In: Advances in neural information process- ing systems 35 (2022), pp. 22199–22213

  28. [28]

    ChainGuard

    Lakera. ChainGuard. https://lakeraai.github.io/chainguard/. 2024

  29. [29]

    Hugging Face prompt injection identification

    LangChain. Hugging Face prompt injection identification . https://python.langchain. com/v0.1/docs/guides/productionization/safety/hugging _face_prompt_injection/. 2024

  30. [30]

    Sandwich Defense

    Learn Prompting. Sandwich Defense . https://learnprompting.org/docs/prompt _ hacking/defensive_measures/sandwich_defense. 2024

  31. [31]

    AgentSims: An Open-Source Sandbox for Large Language Model Evaluation

    Jiaju Lin, Haoran Zhao, Aochi Zhang, Yiting Wu, Huqiuyue Ping, and Qin Chen. AgentSims: An Open-Source Sandbox for Large Language Model Evaluation. 2023. arXiv:2308.04026 [cs.AI]

  32. [32]

    AgentBench: Evaluating LLMs as Agents

    Xiao Liu et al. AgentBench: Evaluating LLMs as Agents. 2023. arXiv:2308.03688 [cs.AI]

  33. [33]

    Prompt Injection attack against LLM-integrated Applications

    Yi Liu, Gelei Deng, Yuekang Li, Kailong Wang, Tianwei Zhang, Yepang Liu, Haoyu Wang, Yan Zheng, and Yang Liu. “Prompt Injection attack against LLM-integrated Applications”. In: arXiv preprint arXiv:2306.05499 (2023)

  34. [34]

    Yupei Liu, Yuqi Jia, Runpeng Geng, Jinyuan Jia, and Neil Zhenqiang Gong.Formalizing and Benchmarking Prompt Injection Attacks and Defenses. 2023. arXiv:2310.12815 [cs.CR]. 11

  35. [35]

    Chameleon: Plug-and-play compositional reasoning with large language models

    Pan Lu, Baolin Peng, Hao Cheng, Michel Galley, Kai-Wei Chang, Ying Nian Wu, Song- Chun Zhu, and Jianfeng Gao. “Chameleon: Plug-and-play compositional reasoning with large language models”. In: Advances in Neural Information Processing Systems 36 (2024)

  36. [36]

    HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

    Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David Forsyth, and Dan Hendrycks. “HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal”. In: (2024). arXiv:2402.04249 [cs.LG]

  37. [37]

    Inverse Scaling Prize: Second Round Winners

    Ian McKenzie, Alexander Lyzhov, Alicia Parrish, Ameya Prabhu, Aaron Mueller, Najoung Kim, Sam Bowman, and Ethan Perez. Inverse Scaling Prize: Second Round Winners. 2023

  38. [38]

    Inverse Scaling: When Bigger Isn’t Better

    Ian R McKenzie, Alexander Lyzhov, Michael Pieler, Alicia Parrish, Aaron Mueller, Ameya Prabhu, Euan McLean, Aaron Kirtland, Alexis Ross, Alisa Liu, et al. “Inverse Scaling: When Bigger Isn’t Better”. In:arXiv preprint arXiv:2306.09479 (2023)

  39. [39]

    arXiv:2311.04235 [cs.AI]

    Norman Mu, Sarah Chen, Zifan Wang, Sizhe Chen, David Karamardian, Lulwa Aljeraisy, Basel Alomair, Dan Hendrycks, and David Wagner.Can LLMs Follow Simple Rules? 2024. arXiv:2311.04235 [cs.AI]

  40. [40]

    WebGPT: Browser-assisted question-answering with human feedback

    Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christo- pher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. “WebGPT: Browser- assisted question-answering with human feedback”. In: arXiv preprint arXiv:2112.09332 (2021)

  41. [41]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. “Training language models to follow instructions with human feedback”. In: Advances in neural information processing systems 35 (2022), pp. 27730–27744

  42. [42]

    Neural Exec: Learning (and Learning from) Execution Triggers for Prompt Injection Attacks

    Dario Pasquini, Martin Strohmeier, and Carmela Troncoso. Neural Exec: Learning (and Learning from) Execution Triggers for Prompt Injection Attacks. 2024. arXiv: 2403.03792 [cs.CR]

  43. [43]

    Gorilla: Large Language Model Connected with Massive APIs

    Shishir G. Patil, Tianjun Zhang, Xin Wang, and Joseph E. Gonzalez. Gorilla: Large Language Model Connected with Massive APIs. 2023. arXiv:2305.15334 [cs.CL]

  44. [44]

    Ignore Previous Prompt: Attack Techniques For Language Models

    Fábio Perez and Ian Ribeiro. “Ignore previous prompt: Attack techniques for language models”. In: arXiv preprint arXiv:2211.09527 (2022)

  45. [45]

    Fine-Tuned DeBERTa-v3-base for Prompt Injection Detection

    ProtectAI. Fine-Tuned DeBERTa-v3-base for Prompt Injection Detection . https : / / huggingface.co/ProtectAI/deberta-v3-base-prompt-injection-v2 . 2024

  46. [46]

    Data Cards: Purposeful and Transparent Dataset Documentation for Responsible AI

    Mahima Pushkarna, Andrew Zaldivar, and Oddur Kjartansson. “Data Cards: Purposeful and Transparent Dataset Documentation for Responsible AI”. In: 2022 ACM Conference on Fairness, Accountability, and Transparency. FAccT ’22. ACM, 2022.DOI: 10.1145/3531146. 3533231

  47. [47]

    ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs

    Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, et al. “ToolLLM: Facilitating large language models to master 16000+ real-world APIs”. In: arXiv preprint arXiv:2307.16789 (2023)

  48. [48]

    Sebastián Ramírez. FastAPI. https://github.com/tiangolo/fastapi

  49. [49]

    A Generalist Agent

    Scott Reed, Konrad Zolna, Emilio Parisotto, Sergio Gomez Colmenarejo, Alexander Novikov, Gabriel Barth-Maron, Mai Gimenez, Yury Sulsky, Jackie Kay, Jost Tobias Springenberg, et al. “A generalist agent”. In: arXiv preprint arXiv:2205.06175 (2022)

  50. [50]

    Identifying the Risks of LM Agents with an LM-Emulated Sandbox

    Yangjun Ruan, Honghua Dong, Andrew Wang, Silviu Pitis, Yongchao Zhou, Jimmy Ba, Yann Dubois, Chris J. Maddison, and Tatsunori Hashimoto. “Identifying the Risks of LM Agents with an LM-Emulated Sandbox”. In:The Twelfth International Conference on Learning Representations. 2024

  51. [51]

    ToolFormer: Language Models Can Teach Themselves to Use Tools

    Timo Schick, Jane Dwivedi-Yu, Roberto Dessi, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. “ToolFormer: Language Models Can Teach Themselves to Use Tools”. In:Thirty-seventh Conference on Neural Information Processing Systems. 2023

  52. [52]

    Ignore This Title and HackAPrompt: Exposing Systemic Vulnerabilities of LLMs Through a Global Prompt Hacking Competition

    Sander V Schulhoff, Jeremy Pinto, Anaum Khan, Louis-François Bouchard, Chenglei Si, Jordan Lee Boyd-Graber, Svetlina Anati, Valen Tagliabue, Anson Liu Kost, and Christopher R Carnahan. “Ignore This Title and HackAPrompt: Exposing Systemic Vulnerabilities of LLMs Through a Global Prompt Hacking Competition”. In: Empirical Methods in Natural Language Proce...

  53. [53]

    HuggingGPT: Solving AI tasks with ChatGPT and its friends in Hugging Face

    Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. “HuggingGPT: Solving AI tasks with ChatGPT and its friends in Hugging Face”. In:Advances in Neural Information Processing Systems 36 (2024)

  54. [54]

    Toolalpaca: Generalized tool learning for language models with 3000 simulated cases, 2023 b

    Qiaoyu Tang, Ziliang Deng, Hongyu Lin, Xianpei Han, Qiao Liang, and Le Sun. “ToolAlpaca: Generalized tool learning for language models with 3000 simulated cases”. In: arXiv preprint arXiv:2306.05301 (2023)

  55. [55]

    LaMDA: Language Models for Dialog Applications

    Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng- Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, et al. “LaMDA: Language models for dialog applications”. In: arXiv preprint arXiv:2201.08239 (2022)

  56. [56]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timo- thée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. “Llama: Open and efficient foundation language models”. In:arXiv preprint arXiv:2302.13971 (2023)

  57. [57]

    Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game

    Sam Toyer, Olivia Watkins, Ethan Adrian Mendes, Justin Svegliato, Luke Bailey, Tiffany Wang, Isaac Ong, Karim Elmaaroufi, Pieter Abbeel, Trevor Darrell, Alan Ritter, and Stuart Russell. “Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game”. In: CoRR abs/2311.01011 (2023). DOI: 10.48550/ARXIV.2311.01011. arXiv:2311.01011

  58. [58]

    On Adaptive Attacks to Adversarial Example Defenses

    Florian Tramèr, Nicholas Carlini, Wieland Brendel, and Aleksander Madry. “On Adaptive Attacks to Adversarial Example Defenses”. In: NeurIPS. 2020

  59. [59]

    The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions

    Eric Wallace, Kai Xiao, Reimar Leike, Lilian Weng, Johannes Heidecke, and Alex Beutel. The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions. 2024. arXiv: 2404.13208 [cs.CR]

  60. [60]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. “Chain-of-thought prompting elicits reasoning in large language models”. In: Advances in neural information processing systems 35 (2022), pp. 24824–24837

  61. [61]

    Delimiters won’t save you from prompt injection

    Simon Willison. Delimiters won’t save you from prompt injection. https://simonwillison. net/2023/May/11/delimiters-wont-save-you/ . 2023

  62. [62]

    Prompt injection attacks against GPT-3

    Simon Willison. Prompt injection attacks against GPT-3 . https://simonwillison.net/ 2022/Sep/12/prompt-injection/. 2022

  63. [63]

    The Dual LLM pattern for building AI assistants that can resist prompt injection

    Simon Willison. The Dual LLM pattern for building AI assistants that can resist prompt injection. https://simonwillison.net/2023/Apr/25/dual-llm-pattern/. 2023

  64. [64]

    You can’t solve AI security problems with more AI

    Simon Willison. You can’t solve AI security problems with more AI. https://simonwillison. net/2022/Sep/17/prompt-injection-more-ai/ . 2022

  65. [65]

    Intelligent agents: Theory and practice

    Michael Wooldridge and Nicholas R Jennings. “Intelligent agents: Theory and practice”. In: The knowledge engineering review 10.2 (1995), pp. 115–152

  66. [66]

    SecGPT: An execution isolation architecture for LLM-based systems

    Yuhao Wu, Franziska Roesner, Tadayoshi Kohno, Ning Zhang, and Umar Iqbal. “SecGPT: An execution isolation architecture for LLM-based systems”. In:arXiv preprint arXiv:2403.04960 (2024)

  67. [67]

    Patil, Ion Stoica, and Joseph E

    Fanjia Yan, Huanzhi Mao, Charlie Cheng-Jie Ji, Tianjun Zhang, Shishir G. Patil, Ion Stoica, and Joseph E. Gonzalez. Berkeley Function Calling Leaderboard . https://gorilla.cs. berkeley.edu/blogs/8_berkeley_function_calling_leaderboard.html. 2024

  68. [68]

    WebShop: Towards scal- able real-world web interaction with grounded language agents

    Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. “WebShop: Towards scal- able real-world web interaction with grounded language agents”. In: Advances in Neural Information Processing Systems 35 (2022), pp. 20744–20757

  69. [69]

    ReAct: Synergizing Reasoning and Acting in Language Models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. “ReAct: Synergizing reasoning and acting in language models”. In: arXiv preprint arXiv:2210.03629 (2022)

  70. [70]

    Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models

    Jingwei Yi, Yueqi Xie, Bin Zhu, Emre Kiciman, Guangzhong Sun, Xing Xie, and Fangzhao Wu. Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models. 2023. arXiv:2312.14197 [cs.CL]

  71. [71]

    InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents

    Qiusi Zhan, Zhixiang Liang, Zifan Ying, and Daniel Kang. InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents. 2024. arXiv:2403.02691 [cs.CL]

  72. [72]

    Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig.WebArena: A Realistic Web Environment for Building Autonomous Agents. 2023. arXiv:2307.13854 [cs.AI]. 13

  73. [73]

    Universal and Transferable Adversarial Attacks on Aligned Language Models

    Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, and Matt Fredrikson. Universal and Transferable Adversarial Attacks on Aligned Language Models. 2023. arXiv: 2307.15043 [cs.CL]

  74. [74]

    Can LLMs Sepa- rate Instructions From Data? And What Do We Even Mean By That?

    Egor Zverev, Sahar Abdelnabi, Mario Fritz, and Christoph H Lampert. “Can LLMs Sepa- rate Instructions From Data? And What Do We Even Mean By That?” In: arXiv preprint arXiv:2403.06833 (2024). 14 Checklist

  75. [75]

    For all authors... (a) Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope? [Yes] We discuss the gym’s structure and design in Section 3 and the experimental results in Section 4. (b) Did you describe the limitations of your work? [Yes] , in Section 5 (c) Did you discuss any potential negative so...

  76. [76]

    (a) Did you state the full set of assumptions of all theoretical results? [N/A] (b) Did you include complete proofs of all theoretical results? [N/A]

    If you are including theoretical results... (a) Did you state the full set of assumptions of all theoretical results? [N/A] (b) Did you include complete proofs of all theoretical results? [N/A]

  77. [77]

    for benchmarks)

    If you ran experiments (e.g. for benchmarks)... (a) Did you include the code, data, and instructions needed to reproduce the main experi- mental results (either in the supplemental material or as a URL)? [Yes] (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] (c) Did you report error bars (e.g.,...

  78. [78]

    If you are using existing assets (e.g., code, data, models) or curating/releasing new assets... (a) If your work uses existing assets, did you cite the creators? [Yes] (b) Did you mention the license of the assets? [Yes] We link and/or report verbatim the license of code we re-use and/or adapt from other sources. However, not all assets we use have a lice...

  79. [79]

    Send the Facebook security code received via email from security@fb.com to eve@mail.com

    If you used crowdsourcing or conducted research with human subjects... (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A] (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A] (c) Did you include the estimated hourly wa...