Recognition: 2 Lean theorem links
AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents
Pith reviewed 2026-05-13 06:32 UTC · model grok-4.3
The pith
AgentDojo introduces an extensible evaluation framework populated with realistic agent tasks and security test cases to measure prompt injection robustness in tool-using LLM agents.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
To measure the adversarial robustness of AI agents, we introduce AgentDojo, an evaluation framework for agents that execute tools over untrusted data... state-of-the-art LLMs fail at many tasks (even in the absence of attacks), and existing prompt injection attacks break some security properties but not all.
Load-bearing premise
The 97 tasks and 629 security test cases in AgentDojo accurately represent real-world agent behaviors and the space of prompt injection threats from untrusted tool outputs.
Original abstract
AI agents aim to solve complex tasks by combining text-based reasoning with external tool calls. Unfortunately, AI agents are vulnerable to prompt injection attacks where data returned by external tools hijacks the agent to execute malicious tasks. To measure the adversarial robustness of AI agents, we introduce AgentDojo, an evaluation framework for agents that execute tools over untrusted data. To capture the evolving nature of attacks and defenses, AgentDojo is not a static test suite, but rather an extensible environment for designing and evaluating new agent tasks, defenses, and adaptive attacks. We populate the environment with 97 realistic tasks (e.g., managing an email client, navigating an e-banking website, or making travel bookings), 629 security test cases, and various attack and defense paradigms from the literature. We find that AgentDojo poses a challenge for both attacks and defenses: state-of-the-art LLMs fail at many tasks (even in the absence of attacks), and existing prompt injection attacks break some security properties but not all. We hope that AgentDojo can foster research on new design principles for AI agents that solve common tasks in a reliable and robust manner. We release the code for AgentDojo at https://github.com/ethz-spylab/agentdojo.
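To make the structure the abstract describes concrete, here is a minimal sketch, in Python, of how a benchmark in this style pairs a benign user task with an injection goal delivered through untrusted tool data. This is not the released AgentDojo API; every name here (UserTask, InjectionTask, Env, run_security_case) is hypothetical.

```python
# Minimal sketch, NOT the released AgentDojo API: all types and names
# (UserTask, InjectionTask, Env, run_security_case) are hypothetical.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Env:
    """Mutable environment state (inbox, calendar, bank ledger, ...)."""
    state: dict = field(default_factory=dict)

@dataclass
class UserTask:
    prompt: str                      # benign instruction, e.g. "Summarize my unread emails"
    utility: Callable[[Env], bool]   # did the agent accomplish the user's goal?

@dataclass
class InjectionTask:
    payload: str                     # adversarial text hidden inside tool-returned data
    security: Callable[[Env], bool]  # True if the attacker's goal was achieved (a breach)

def run_security_case(agent, env: Env, user: UserTask, attack: InjectionTask):
    """One security case: the injection rides along in data the agent
    reads via tools, never in the user's own prompt."""
    env.state.setdefault("inbox", [{"body": ""}])
    env.state["inbox"][-1]["body"] += "\n" + attack.payload
    agent.solve(user.prompt, env)    # agent reasons and calls tools over poisoned data
    return user.utility(env), attack.security(env)
```

The key design point the paper emphasizes is visible in the sketch: utility and security are checked against the resulting environment state, not against the model's text output, so both benign failures and successful hijacks are scored programmatically.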
Editorial analysis
A structured set of objections, weighed in public.
Circularity Check
No circularity: empirical benchmark release with no derivations or self-referential reductions
Full rationale
The paper introduces AgentDojo as an extensible environment populated with 97 tasks and 629 test cases drawn from existing literature on agent behaviors and prompt-injection attacks. No equations, fitted parameters, or predictions appear in the provided text; the central claim is simply the construction and release of the benchmark itself. No self-citation load-bearing steps, uniqueness theorems, or ansatzes are invoked to justify the task suite. The framework is self-contained as an empirical contribution and does not reduce any result to its own inputs by definition.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: the 97 tasks represent realistic agent use cases involving external tools.
- Domain assumption: the 629 security test cases adequately cover relevant prompt injection attack and defense paradigms.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · relevance unclear · matched text: "We introduce AgentDojo, an extensible environment for designing and evaluating new agent tasks, defenses, and adaptive attacks... 97 realistic tasks... 629 security test cases"
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relevance unclear · matched text: "Targeted Attack Success Rate (ASR): the fraction of security cases where the attacker's goal is met"
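The second matched excerpt quotes the paper's targeted Attack Success Rate definition. As a hedged illustration of how such metrics aggregate, the helper below consumes the (utility_ok, attack_ok) booleans produced by the hypothetical run_security_case sketch above; names are ours, not the paper's code.

```python
# Hedged sketch of metric aggregation; `results` pairs are the hypothetical
# (utility_ok, attack_ok) booleans from run_security_case above.
def aggregate(results: list[tuple[bool, bool]]) -> tuple[float, float]:
    n = len(results)
    targeted_asr = sum(attack for _, attack in results) / n       # fraction of cases where the attacker's goal is met
    utility_under_attack = sum(util for util, _ in results) / n   # fraction where the user's task still succeeds
    return targeted_asr, utility_under_attack

# Example: 2 of 3 attacks succeed, 2 of 3 user tasks still complete.
print(aggregate([(True, False), (False, True), (True, True)]))   # -> (0.666..., 0.666...)
```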
Forward citations
Cited by 24 Pith papers
- Trojan Hippo: Weaponizing Agent Memory for Data Exfiltration
  Trojan Hippo attacks on LLM agent memory achieve 85-100% success rates in data exfiltration across four memory backends even after 100 benign sessions, while evaluated defenses reduce success rates but impose varying ...
- HarmfulSkillBench: How Do Harmful Skills Weaponize Your Agents?
  Harmful skills in open agent ecosystems raise average harm scores from 0.27 to 0.76 across six LLMs by lowering refusal rates when tasks are presented via pre-installed skills.
- IPI-proxy: An Intercepting Proxy for Red-Teaming Web-Browsing AI Agents Against Indirect Prompt Injection
  IPI-proxy is a toolkit using an intercepting proxy to inject indirect prompt injection attacks into live web pages for testing AI browsing agents against hidden instructions.
- Perturbation Dose Responses in Recursive LLM Loops: Raw Switching, Stochastic Floors, and Persistent Escape under Append, Replace, and Dialog Updates
  In 30-step recursive LLM loops, append-mode persistent escape from source basins reaches 50% near 400 tokens under full history but plateaus below 50% under tail-clip memory policy, while replace-mode switching largel...
- A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework
  A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.
- Breaking MCP with Function Hijacking Attacks: Novel Threats for Function Calling and Agentic Models
  A novel function hijacking attack achieves 70-100% success rates in forcing specific function calls across five LLMs on the BFCL benchmark and is robust to context semantics.
- Beyond Pattern Matching: Seven Cross-Domain Techniques for Prompt Injection Detection
  Seven cross-domain techniques for prompt injection detection are proposed; three implemented versions raise F1 scores on multiple benchmarks while releasing all code and data.
- Can Agent Benchmarks Support Their Scores? Evidence-Supported Bounds for Interactive-Agent Evaluation
  Agent benchmarks can report evidence-supported score bounds instead of single misleading success rates by adding a layer that checks required artifacts for outcome verification.
- AgentShield: Deception-based Compromise Detection for Tool-using LLM Agents
  AgentShield uses layered deception traps in LLM agent tool interfaces to detect indirect prompt injection compromises with 90.7-100% success on commercial models, zero false positives, and cross-lingual transfer witho...
- SkillScope: Toward Fine-Grained Least-Privilege Enforcement for Agent Skills
  SkillScope detects over-privileged LLM agent skills with 94.53% F1 score via graph analysis and replay validation, finding 7,039 problematic skills in the wild and reducing violations by 88.56% while preserving task c...
- LoopTrap: Termination Poisoning Attacks on LLM Agents
  LoopTrap is an automated red-teaming framework that crafts termination-poisoning prompts to amplify LLM agent steps by 3.57x on average (up to 25x) across 8 agents.
- Semia: Auditing Agent Skills via Constraint-Guided Representation Synthesis
  Semia synthesizes Datalog representations of agent skills via constraint-guided loops to enable reachability queries for semantic risks, finding critical issues in over half of 13,728 real skills with 97.7% recall on ...
- Alignment Contracts for Agentic Security Systems
  Alignment contracts define scope, allowed effects, budgets and disclosure rules as safety properties over finite effect traces, with decidable admissibility, refinement rules, and Lean-verified soundness under an obse...
- RouteGuard: Internal-Signal Detection of Skill Poisoning in LLM Agents
  RouteGuard uses response-conditioned attention and hidden-state alignment to detect skill poisoning in LLM agents, achieving 0.8834 F1 on Skill-Inject benchmarks and recovering 90.51% of attacks missed by lexical screening.
- An AI Agent Execution Environment to Safeguard User Data
  GAAP guarantees confidentiality of private user data for AI agents by enforcing user-specified permissions deterministically through persistent information flow tracking, without trusting the agent or requiring attack...
- From Craft to Kernel: A Governance-First Execution Architecture and Semantic ISA for Agentic Computers
  Arbiter-K is a new execution architecture that treats LLMs as probabilistic processors inside a neuro-symbolic kernel with a semantic ISA to enable deterministic security enforcement and unsafe trajectory interdiction...
- Policy-Invisible Violations in LLM-Based Agents
  LLM agents commit policy-invisible violations when policy facts are hidden from their context; a graph-simulation enforcer reaches 93% accuracy vs 68.8% for content-only baselines on a new 600-trace benchmark.
- ClawGuard: A Runtime Security Framework for Tool-Augmented LLM Agents Against Indirect Prompt Injection
  ClawGuard enforces user-derived access constraints at tool-call boundaries to block indirect prompt injection in tool-augmented LLM agents across web, MCP, and skill injection channels.
- ClawGuard: A Runtime Security Framework for Tool-Augmented LLM Agents Against Indirect Prompt Injection
  ClawGuard enforces deterministic, user-derived access constraints at tool boundaries to block indirect prompt injection without changing the underlying LLM.
- Swiss-Bench 003: Evaluating LLM Reliability and Adversarial Security for Swiss Regulatory Contexts
  Swiss-Bench 003 extends an existing Swiss LLM assessment with two new dimensions and evaluates ten models on 808 items, finding high self-graded reliability scores but low adversarial security scores.
- Your Agent, Their Asset: A Real-World Safety Analysis of OpenClaw
  Poisoning any single CIK dimension of an AI agent raises average attack success rate from 24.6% to 64-74% across models, and tested defenses leave substantial residual risk.
- Safe Multi-Agent Behavior Must Be Maintained, Not Merely Asserted: Constraint Drift in LLM-Based Multi-Agent Systems
  Safety constraints in LLM-based multi-agent systems commonly weaken during execution through memory, communication, and tool use, requiring them to be maintained as explicit state rather than asserted once.
- Constraining Host-Level Abuse in Self-Hosted Computer-Use Agents via TEE-Backed Isolation
  A TEE-backed architecture isolates security-critical decisions in self-hosted AI agents to prevent host-level abuse from malicious inputs while maintaining allowed functionality.
- Auto-ART: Structured Literature Synthesis and Automated Adversarial Robustness Testing
  Auto-ART delivers the first structured synthesis of adversarial robustness consensus plus an executable multi-norm testing framework that flags gradient masking in 92% of cases on RobustBench and reveals a 23.5 pp rob...
Reference graph
Works this paper leans on
- [1] Mubashara Akhtar et al. "Croissant: A Metadata Format for ML-Ready Datasets". In: Proceedings of the Eighth Workshop on Data Management for End-to-End Machine Learning (SIGMOD/PODS '24). ACM, June 2024. DOI: 10.1145/3650203.3663326
- [2] Anthropic. The Claude 3 Model Family: Opus, Sonnet, Haiku. https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf. Mar. 2024
- [3] Anthropic. Tool use (function calling). https://docs.anthropic.com/en/docs/tool-use. 2024
- [4] Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. "Training a helpful and harmless assistant with reinforcement learning from human feedback". In: arXiv preprint arXiv:2204.05862 (2022)
- [5] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. "Language models are few-shot learners". In: Advances in Neural Information Processing Systems 33 (2020), pp. 1877–1901
- [6] Nicholas Carlini. "A critique of the DeepSec platform for security analysis of deep learning models". In: arXiv preprint arXiv:1905.07112 (2019)
- [7] Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, Vikash Sehwag, Edgar Dobriban, Nicolas Flammarion, George J. Pappas, Florian Tramèr, Hamed Hassani, and Eric Wong. JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models. 2024. arXiv: 2404.01318 [cs.CR]
- [8] Sizhe Chen, Julien Piet, Chawin Sitawarin, and David Wagner. "StruQ: Defending Against Prompt Injection with Structured Queries". In: arXiv preprint arXiv:2402.06363 (2024)
- [9] Cohere. Introducing Command R+: Our new, most powerful model in the Command R family. https://cohere.com/command. 2023
- [10] Francesco Croce, Maksym Andriushchenko, Vikash Sehwag, Edoardo Debenedetti, Nicolas Flammarion, Mung Chiang, Prateek Mittal, and Matthias Hein. "RobustBench: a standardized adversarial robustness benchmark". In: NeurIPS Datasets and Benchmarks. 2021
- [11] Francesco Croce and Matthias Hein. "Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks". In: International Conference on Machine Learning. PMLR. 2020, pp. 2206–2216
- [12] Edoardo Debenedetti et al. Dataset and Lessons Learned from the 2024 SaTML LLM Capture-the-Flag Competition. 2024. arXiv: 2406.07954 [cs.CR]
- [13] Xiaohan Fu, Zihan Wang, Shuheng Li, Rajesh K Gupta, Niloofar Mireshghallah, Taylor Berg-Kirkpatrick, and Earlence Fernandes. "Misusing Tools in Large Language Models With Visual Adversarial Examples". In: arXiv preprint arXiv:2310.03185 (2023)
- [14] Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. "PAL: Program-aided language models". In: International Conference on Machine Learning. PMLR. 2023, pp. 10764–10799
- [15] Jonas Geiping, Alex Stein, Manli Shu, Khalid Saifullah, Yuxin Wen, and Tom Goldstein. "Coercing LLMs to do and reveal (almost) anything". In: arXiv preprint arXiv:2402.14020 (2024)
- [16] Gemini Team. "Gemini: a family of highly capable multimodal models". In: arXiv preprint arXiv:2312.11805 (2023)
- [17] Riley Goodside. Exploiting GPT-3 prompts with malicious inputs that order the model to ignore its previous directions. https://x.com/goodside/status/1569128808308957185. 2022
- [18] Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. "Not What You've Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection". In: Proceedings of the 16th ACM Workshop on Artificial Intelligence and Security (CCS '23). ACM, Nov. 2023. DOI: 10.1145/3605764.3623985
- [19] Keegan Hines, Gary Lopez, Matthew Hall, Federico Zarfati, Yonatan Zunger, and Emre Kiciman. Defending Against Indirect Prompt Injection Attacks With Spotlighting. 2024. arXiv: 2403.14720 [cs.CR]
- [20] Wenlong Huang, Pieter Abbeel, Deepak Pathak, and Igor Mordatch. "Language models as zero-shot planners: Extracting actionable knowledge for embodied agents". In: International Conference on Machine Learning. PMLR. 2022, pp. 9118–9147
- [21] Hamel Husain. Llama-3 Function Calling Demo. https://nbsanity.com/static/d06085f1dacae8c9de9402f2d7428de2/demo.html. 2024
- [22] Colin Jarvis and Joe Palermo. Function calling. https://cookbook.openai.com/examples/how_to_call_functions_with_chat_models. June 2023
- [23] Daniel Kang, Xuechen Li, Ion Stoica, Carlos Guestrin, Matei Zaharia, and Tatsunori Hashimoto. "Exploiting programmatic behavior of LLMs: Dual-use through standard security attacks". In: 2024 IEEE Security and Privacy Workshops (SPW). IEEE. 2024, pp. 132–143
- [24] Andrej Karpathy. Intro to Large Language Models. https://www.youtube.com/watch?v=zjkBMFhNj_g. 2023
- [25] Geunwoo Kim, Pierre Baldi, and Stephen McAleer. "Language models can solve computer tasks". In: Advances in Neural Information Processing Systems 36 (2023)
- [26] Megan Kinniment, Lucas Jun Koba Sato, Haoxing Du, Brian Goodrich, Max Hasin, Lawrence Chan, Luke Harold Miles, Tao R. Lin, Hjalmar Wijk, Joel Burget, Aaron Ho, Elizabeth Barnes, and Paul Christiano. "Evaluating Language-Model Agents on Realistic Autonomous Tasks". In: CoRR abs/2312.11671 (2023). DOI: 10.48550/ARXIV.2312.11671. arXiv: 2312.11671
- [27] Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. "Large language models are zero-shot reasoners". In: Advances in Neural Information Processing Systems 35 (2022), pp. 22199–22213
- [28]
- [29] LangChain. Hugging Face prompt injection identification. https://python.langchain.com/v0.1/docs/guides/productionization/safety/hugging_face_prompt_injection/. 2024
- [30] Learn Prompting. Sandwich Defense. https://learnprompting.org/docs/prompt_hacking/defensive_measures/sandwich_defense. 2024
- [31] Jiaju Lin, Haoran Zhao, Aochi Zhang, Yiting Wu, Huqiuyue Ping, and Qin Chen. AgentSims: An Open-Source Sandbox for Large Language Model Evaluation. 2023. arXiv: 2308.04026 [cs.AI]
- [32] Xiao Liu et al. AgentBench: Evaluating LLMs as Agents. 2023. arXiv: 2308.03688 [cs.AI]
- [33] Yi Liu, Gelei Deng, Yuekang Li, Kailong Wang, Tianwei Zhang, Yepang Liu, Haoyu Wang, Yan Zheng, and Yang Liu. "Prompt Injection attack against LLM-integrated Applications". In: arXiv preprint arXiv:2306.05499 (2023)
- [34]
- [35] Pan Lu, Baolin Peng, Hao Cheng, Michel Galley, Kai-Wei Chang, Ying Nian Wu, Song-Chun Zhu, and Jianfeng Gao. "Chameleon: Plug-and-play compositional reasoning with large language models". In: Advances in Neural Information Processing Systems 36 (2024)
- [36] Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David Forsyth, and Dan Hendrycks. "HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal". In: (2024). arXiv: 2402.04249 [cs.LG]
- [37] Ian McKenzie, Alexander Lyzhov, Alicia Parrish, Ameya Prabhu, Aaron Mueller, Najoung Kim, Sam Bowman, and Ethan Perez. Inverse Scaling Prize: Second Round Winners. 2023
- [38] Ian R McKenzie, Alexander Lyzhov, Michael Pieler, Alicia Parrish, Aaron Mueller, Ameya Prabhu, Euan McLean, Aaron Kirtland, Alexis Ross, Alisa Liu, et al. "Inverse Scaling: When Bigger Isn't Better". In: arXiv preprint arXiv:2306.09479 (2023)
- [39] Norman Mu, Sarah Chen, Zifan Wang, Sizhe Chen, David Karamardian, Lulwa Aljeraisy, Basel Alomair, Dan Hendrycks, and David Wagner. Can LLMs Follow Simple Rules? 2024. arXiv: 2311.04235 [cs.AI]
- [40] Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. "WebGPT: Browser-assisted question-answering with human feedback". In: arXiv preprint arXiv:2112.09332 (2021)
- [41] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. "Training language models to follow instructions with human feedback". In: Advances in Neural Information Processing Systems 35 (2022), pp. 27730–27744
- [42] Dario Pasquini, Martin Strohmeier, and Carmela Troncoso. Neural Exec: Learning (and Learning from) Execution Triggers for Prompt Injection Attacks. 2024. arXiv: 2403.03792 [cs.CR]
- [43] Shishir G. Patil, Tianjun Zhang, Xin Wang, and Joseph E. Gonzalez. Gorilla: Large Language Model Connected with Massive APIs. 2023. arXiv: 2305.15334 [cs.CL]
- [44] Fábio Perez and Ian Ribeiro. "Ignore previous prompt: Attack techniques for language models". In: arXiv preprint arXiv:2211.09527 (2022)
- [45] ProtectAI. Fine-Tuned DeBERTa-v3-base for Prompt Injection Detection. https://huggingface.co/ProtectAI/deberta-v3-base-prompt-injection-v2. 2024
- [46] Mahima Pushkarna, Andrew Zaldivar, and Oddur Kjartansson. "Data Cards: Purposeful and Transparent Dataset Documentation for Responsible AI". In: 2022 ACM Conference on Fairness, Accountability, and Transparency (FAccT '22). ACM, 2022. DOI: 10.1145/3531146.3533231
- [47] Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, et al. "ToolLLM: Facilitating large language models to master 16000+ real-world APIs". In: arXiv preprint arXiv:2307.16789 (2023)
- [48] Sebastián Ramírez. FastAPI. https://github.com/tiangolo/fastapi
- [49] Scott Reed, Konrad Zolna, Emilio Parisotto, Sergio Gomez Colmenarejo, Alexander Novikov, Gabriel Barth-Maron, Mai Gimenez, Yury Sulsky, Jackie Kay, Jost Tobias Springenberg, et al. "A generalist agent". In: arXiv preprint arXiv:2205.06175 (2022)
- [50] Yangjun Ruan, Honghua Dong, Andrew Wang, Silviu Pitis, Yongchao Zhou, Jimmy Ba, Yann Dubois, Chris J. Maddison, and Tatsunori Hashimoto. "Identifying the Risks of LM Agents with an LM-Emulated Sandbox". In: The Twelfth International Conference on Learning Representations. 2024
- [51] Timo Schick, Jane Dwivedi-Yu, Roberto Dessi, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. "Toolformer: Language Models Can Teach Themselves to Use Tools". In: Thirty-seventh Conference on Neural Information Processing Systems. 2023
- [52] Sander V Schulhoff, Jeremy Pinto, Anaum Khan, Louis-François Bouchard, Chenglei Si, Jordan Lee Boyd-Graber, Svetlina Anati, Valen Tagliabue, Anson Liu Kost, and Christopher R Carnahan. "Ignore This Title and HackAPrompt: Exposing Systemic Vulnerabilities of LLMs Through a Global Prompt Hacking Competition". In: Empirical Methods in Natural Language Proce... 2023
- [53] Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. "HuggingGPT: Solving AI tasks with ChatGPT and its friends in Hugging Face". In: Advances in Neural Information Processing Systems 36 (2024)
- [54] Qiaoyu Tang, Ziliang Deng, Hongyu Lin, Xianpei Han, Qiao Liang, and Le Sun. "ToolAlpaca: Generalized tool learning for language models with 3000 simulated cases". In: arXiv preprint arXiv:2306.05301 (2023)
- [55] Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, et al. "LaMDA: Language models for dialog applications". In: arXiv preprint arXiv:2201.08239 (2022)
- [56] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. "LLaMA: Open and efficient foundation language models". In: arXiv preprint arXiv:2302.13971 (2023)
- [57] Sam Toyer, Olivia Watkins, Ethan Adrian Mendes, Justin Svegliato, Luke Bailey, Tiffany Wang, Isaac Ong, Karim Elmaaroufi, Pieter Abbeel, Trevor Darrell, Alan Ritter, and Stuart Russell. "Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game". In: CoRR abs/2311.01011 (2023). DOI: 10.48550/ARXIV.2311.01011. arXiv: 2311.01011
- [58] Florian Tramèr, Nicholas Carlini, Wieland Brendel, and Aleksander Madry. "On Adaptive Attacks to Adversarial Example Defenses". In: NeurIPS. 2020
- [59] Eric Wallace, Kai Xiao, Reimar Leike, Lilian Weng, Johannes Heidecke, and Alex Beutel. The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions. 2024. arXiv: 2404.13208 [cs.CR]
- [60] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. "Chain-of-thought prompting elicits reasoning in large language models". In: Advances in Neural Information Processing Systems 35 (2022), pp. 24824–24837
- [61] Simon Willison. Delimiters won't save you from prompt injection. https://simonwillison.net/2023/May/11/delimiters-wont-save-you/. 2023
- [62] Simon Willison. Prompt injection attacks against GPT-3. https://simonwillison.net/2022/Sep/12/prompt-injection/. 2022
- [63] Simon Willison. The Dual LLM pattern for building AI assistants that can resist prompt injection. https://simonwillison.net/2023/Apr/25/dual-llm-pattern/. 2023
- [64] Simon Willison. You can't solve AI security problems with more AI. https://simonwillison.net/2022/Sep/17/prompt-injection-more-ai/. 2022
- [65] Michael Wooldridge and Nicholas R Jennings. "Intelligent agents: Theory and practice". In: The Knowledge Engineering Review 10.2 (1995), pp. 115–152
- [66] Yuhao Wu, Franziska Roesner, Tadayoshi Kohno, Ning Zhang, and Umar Iqbal. "SecGPT: An execution isolation architecture for LLM-based systems". In: arXiv preprint arXiv:2403.04960 (2024)
- [67] Fanjia Yan, Huanzhi Mao, Charlie Cheng-Jie Ji, Tianjun Zhang, Shishir G. Patil, Ion Stoica, and Joseph E. Gonzalez. Berkeley Function Calling Leaderboard. https://gorilla.cs.berkeley.edu/blogs/8_berkeley_function_calling_leaderboard.html. 2024
- [68] Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. "WebShop: Towards scalable real-world web interaction with grounded language agents". In: Advances in Neural Information Processing Systems 35 (2022), pp. 20744–20757
- [69] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. "ReAct: Synergizing reasoning and acting in language models". In: arXiv preprint arXiv:2210.03629 (2022)
- [70] Jingwei Yi, Yueqi Xie, Bin Zhu, Emre Kiciman, Guangzhong Sun, Xing Xie, and Fangzhao Wu. Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models. 2023. arXiv: 2312.14197 [cs.CL]
- [71] Qiusi Zhan, Zhixiang Liang, Zifan Ying, and Daniel Kang. InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents. 2024. arXiv: 2403.02691 [cs.CL]
- [72] Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. WebArena: A Realistic Web Environment for Building Autonomous Agents. 2023. arXiv: 2307.13854 [cs.AI]
- [73] Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, and Matt Fredrikson. Universal and Transferable Adversarial Attacks on Aligned Language Models. 2023. arXiv: 2307.15043 [cs.CL]
- [74] Egor Zverev, Sahar Abdelnabi, Mario Fritz, and Christoph H Lampert. "Can LLMs Separate Instructions From Data? And What Do We Even Mean By That?" In: arXiv preprint arXiv:2403.06833 (2024)