pith. machine review for the scientific record. sign in

arxiv: 2306.05499 · v3 · submitted 2023-06-08 · 💻 cs.CR · cs.AI· cs.CL· cs.SE

Recognition: 2 theorem links

· Lean Theorem

Prompt Injection attack against LLM-integrated Applications

Gelei Deng, Haoyu Wang, Kailong Wang, Leo Yu Zhang, Tianwei Zhang, Xiaofeng Wang, Yang Liu, Yan Zheng, Yepang Liu, Yi Liu, Yuekang Li, Zihao Wang

Pith reviewed 2026-05-11 21:10 UTC · model grok-4.3

classification 💻 cs.CR cs.AIcs.CLcs.SE
keywords prompt injectionLLM securityblack-box attackLLM-integrated applicationsprompt theftcontext partitionHouYi
0
0 comments X

The pith

HouYi, a black-box technique, enables prompt injection on 31 of 36 real LLM-integrated applications, allowing prompt theft and unrestricted LLM use.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper explores prompt injection risks in commercial LLM applications and finds existing attacks limited in practice. It develops HouYi, a three-part method with a pre-constructed prompt, a context-partitioning injection prompt, and a malicious payload, modeled on web injection attacks. When tested on 36 actual applications, HouYi succeeds against 31, producing outcomes such as stealing the application's own prompt and gaining arbitrary control over the LLM. Ten vendors including Notion have confirmed the issues, indicating exposure for large user bases and the need for improved protections.

Core claim

HouYi is a novel black-box prompt injection attack technique compartmentalized into a seamlessly-incorporated pre-constructed prompt, an injection prompt inducing context partition, and a malicious payload designed to fulfill attack objectives. Leveraging HouYi reveals previously unknown and severe attack outcomes such as unrestricted arbitrary LLM usage and uncomplicated application prompt theft, with 31 of 36 deployed LLM-integrated applications found susceptible.

What carries the argument

HouYi, the three-element black-box injection method (pre-constructed prompt, context-partition injection prompt, malicious payload) that bypasses application safeguards to execute attacker goals inside the LLM context.

If this is right

  • Application prompts can be extracted with straightforward injection sequences.
  • Attackers can obtain unrestricted use of the LLM backend for arbitrary tasks.
  • Over 85 percent of tested real-world LLM-integrated applications remain open to these attacks.
  • Vendor-confirmed cases show that prompt injection creates concrete risks for end users at scale.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Input handling in LLM apps may require the same isolation practices long used in web applications.
  • Context-partition detection could serve as a general defense layer against similar future attacks.
  • Automated testing tools based on HouYi might help developers identify exposure before release.

Load-bearing premise

The injection prompt can reliably create a context partition and deliver the payload across different LLM applications without detection or blocking by existing safeguards.

What would settle it

Applying HouYi to one of the 31 vulnerable applications after the addition of explicit filtering for context-partitioning phrases and checking whether the malicious payload still executes.

read the original abstract

Large Language Models (LLMs), renowned for their superior proficiency in language comprehension and generation, stimulate a vibrant ecosystem of applications around them. However, their extensive assimilation into various services introduces significant security risks. This study deconstructs the complexities and implications of prompt injection attacks on actual LLM-integrated applications. Initially, we conduct an exploratory analysis on ten commercial applications, highlighting the constraints of current attack strategies in practice. Prompted by these limitations, we subsequently formulate HouYi, a novel black-box prompt injection attack technique, which draws inspiration from traditional web injection attacks. HouYi is compartmentalized into three crucial elements: a seamlessly-incorporated pre-constructed prompt, an injection prompt inducing context partition, and a malicious payload designed to fulfill the attack objectives. Leveraging HouYi, we unveil previously unknown and severe attack outcomes, such as unrestricted arbitrary LLM usage and uncomplicated application prompt theft. We deploy HouYi on 36 actual LLM-integrated applications and discern 31 applications susceptible to prompt injection. 10 vendors have validated our discoveries, including Notion, which has the potential to impact millions of users. Our investigation illuminates both the possible risks of prompt injection attacks and the possible tactics for mitigation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper deconstructs prompt injection attacks on LLM-integrated applications. It first analyzes ten commercial applications to highlight limitations of current strategies. Then, it proposes HouYi, a black-box technique inspired by web injection attacks, consisting of a pre-constructed prompt, an injection prompt that induces context partition, and a malicious payload. Deployed on 36 real applications, HouYi succeeds against 31, enabling outcomes such as arbitrary LLM usage and prompt theft. Ten vendors, including Notion, have validated the findings.

Significance. If the results hold, this paper is significant for demonstrating practical, severe prompt injection vulnerabilities in real LLM applications through a novel black-box method. The large-scale testing and vendor confirmations provide strong evidence that current integrations are at risk, potentially affecting millions of users, and it contributes actionable insights into both attack tactics and mitigation approaches in the field of AI security.

major comments (2)
  1. The central claim that 31 applications are susceptible (as stated in the abstract and evaluation section) depends on the context-partition step succeeding reliably. However, the manuscript lacks a detailed analysis of the five non-vulnerable applications, including whether the partition failed or other factors intervened, and does not report on variations across different LLMs or safety mechanisms. This undermines the assessment of the attack's generality.
  2. In the section describing HouYi, the injection prompt is presented as inducing context partition without quantitative evidence or examples showing its effectiveness across diverse applications or its resistance to existing safeguards, which is essential for supporting the severe attack outcomes claimed.
minor comments (1)
  1. The phrasing 'discern 31 applications susceptible to prompt injection' in the abstract is slightly awkward and could be clarified.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their positive evaluation of the paper's significance and for the constructive comments. We address each major comment point by point below, indicating the revisions we will make to improve clarity and support for our claims.

read point-by-point responses
  1. Referee: The central claim that 31 applications are susceptible (as stated in the abstract and evaluation section) depends on the context-partition step succeeding reliably. However, the manuscript lacks a detailed analysis of the five non-vulnerable applications, including whether the partition failed or other factors intervened, and does not report on variations across different LLMs or safety mechanisms. This undermines the assessment of the attack's generality.

    Authors: We agree that additional detail on the unsuccessful cases would strengthen the assessment of generality. In the revised manuscript, we will add a dedicated subsection in the evaluation section analyzing the five non-vulnerable applications. Our experimental observations indicate that context partition failed in these cases primarily due to application-specific input sanitization or output filtering that disrupted the injection prompt's ability to separate contexts, rather than issues with the payload itself. Regarding variations across LLMs and safety mechanisms, the 36 tested applications represent a diverse set of real-world deployments, each integrating different backend LLMs and built-in safeguards. The consistent success of HouYi across this heterogeneous collection provides evidence of broad applicability. We will explicitly discuss this diversity and the black-box constraints that limit per-LLM instrumentation in the revision. revision: yes

  2. Referee: In the section describing HouYi, the injection prompt is presented as inducing context partition without quantitative evidence or examples showing its effectiveness across diverse applications or its resistance to existing safeguards, which is essential for supporting the severe attack outcomes claimed.

    Authors: We acknowledge the value of more direct supporting evidence for the injection prompt component. In the revised HouYi description, we will include concrete examples of the injection prompts (and their application-specific adaptations) along with a breakdown of observed context-partition success rates where distinguishable from overall attack outcomes. The prompt's effectiveness and resistance to safeguards are substantiated by its role in enabling attacks on 31 of 36 diverse applications despite the presence of various input validation and moderation layers. We will expand the text to quantify this where possible from our logs and discuss limitations, such as cases where stronger custom safeguards might interfere. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical attack evaluation on external applications

full rationale

The paper performs an exploratory analysis of ten commercial apps, proposes HouYi as a black-box technique inspired by web injection (with three explicit components: pre-constructed prompt, context-partition injection prompt, and payload), then reports direct experimental outcomes on 36 separate real-world LLM-integrated applications (31 vulnerable). No equations, fitted parameters, self-definitional loops, or load-bearing self-citations appear in the derivation chain; the central claims rest on external testing and vendor validation rather than reducing to inputs by construction. References to prior prompt-injection literature are contextual and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The paper is primarily empirical and does not rely on mathematical axioms or free parameters; the claims rest on the assumption that real applications behave as observed in the tests.

axioms (1)
  • domain assumption LLM applications concatenate user inputs directly into system prompts without robust separation or sanitization
    This is the core premise enabling prompt injection attacks as demonstrated.
invented entities (1)
  • HouYi attack framework no independent evidence
    purpose: To bypass limitations of existing prompt injection methods in practical LLM apps
    The framework is proposed and tested in the paper without external independent validation beyond the authors' experiments.

pith-pipeline@v0.9.0 · 5542 in / 1383 out tokens · 64360 ms · 2026-05-11T21:10:53.817666+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 47 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Ghost in the Agent: Redefining Information Flow Tracking for LLM Agents

    cs.CR 2026-04 unverdicted novelty 8.0

    NeuroTaint is the first taint tracking framework for LLM agents that uses offline auditing of semantic, causal, and persistent context to detect flows from untrusted sources to privileged sinks.

  2. TRUSTDESC: Preventing Tool Poisoning in LLM Applications via Trusted Description Generation

    cs.CR 2026-04 unverdicted novelty 8.0

    TRUSTDESC prevents tool poisoning in LLM applications by automatically generating accurate tool descriptions from code via a three-stage pipeline of reachability analysis, description synthesis, and dynamic verification.

  3. Supply-Chain Poisoning Attacks Against LLM Coding Agent Skill Ecosystems

    cs.CR 2026-04 unverdicted novelty 8.0

    DDIPE poisons LLM agent skills by embedding malicious logic in documentation examples, achieving 11.6-33.5% bypass rates across frameworks while explicit attacks are blocked, with 2.5% evading detection.

  4. AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents

    cs.CR 2024-06 unverdicted novelty 8.0

    AgentDojo introduces an extensible evaluation framework populated with realistic agent tasks and security test cases to measure prompt injection robustness in tool-using LLM agents.

  5. IPI-proxy: An Intercepting Proxy for Red-Teaming Web-Browsing AI Agents Against Indirect Prompt Injection

    cs.CR 2026-05 unverdicted novelty 7.0

    IPI-proxy is a toolkit using an intercepting proxy to inject indirect prompt injection attacks into live web pages for testing AI browsing agents against hidden instructions.

  6. Can a Single Message Paralyze the AI Infrastructure? The Rise of AbO-DDoS Attacks through Targeted Mobius Injection

    cs.CR 2026-05 unverdicted novelty 7.0

    Mobius Injection exploits semantic closure in LLM agents to enable single-message AbO-DDoS attacks achieving up to 51x call amplification and 229x latency inflation.

  7. The Granularity Mismatch in Agent Security: Argument-Level Provenance Solves Enforcement and Isolates the LLM Reasoning Bottleneck

    cs.CR 2026-05 unverdicted novelty 7.0

    PACT achieves perfect security and utility under oracle provenance by enforcing argument-level trust contracts based on semantic roles and cross-step provenance tracking, outperforming invocation-level monitors in Age...

  8. PragLocker: Protecting Agent Intellectual Property in Untrusted Deployments via Non-Portable Prompts

    cs.CR 2026-05 unverdicted novelty 7.0

    PragLocker protects agent prompts as IP by building non-portable obfuscated versions that function only on the intended LLM through code-symbol semantic anchoring followed by target-model feedback noise injection.

  9. When Alignment Isn't Enough: Response-Path Attacks on LLM Agents

    cs.CR 2026-05 unverdicted novelty 7.0

    A malicious relay can strategically rewrite aligned LLM outputs in BYOK agent architectures to achieve up to 99.1% attack success on benchmarks like AgentDojo and ASB.

  10. Jailbreaking Frontier Foundation Models Through Intention Deception

    cs.CR 2026-04 unverdicted novelty 7.0

    A multi-turn intention-deception jailbreak achieves high success on GPT-5 and Claude models while exposing para-jailbreaking where models leak harmful information without direct refusal.

  11. A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework

    cs.CR 2026-04 unverdicted novelty 7.0

    A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.

  12. Many-Tier Instruction Hierarchy in LLM Agents

    cs.CL 2026-04 unverdicted novelty 7.0

    ManyIH and ManyIH-Bench address instruction conflicts in LLM agents with up to 12 privilege levels across 853 tasks, revealing frontier models achieve only ~40% accuracy.

  13. Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents

    cs.CR 2024-10 unverdicted novelty 7.0

    ASB is a new benchmark that tests 10 prompt injection attacks, memory poisoning, a novel Plan-of-Thought backdoor attack, and 11 defenses on LLM agents across 13 models, finding attack success rates up to 84.3% and li...

  14. SkillSafetyBench: Evaluating Agent Safety under Skill-Facing Attack Surfaces

    cs.CR 2026-05 unverdicted novelty 6.0

    SkillSafetyBench shows that localized non-user attacks via skills and artifacts can consistently induce unsafe agent behavior across domains and model backends, independent of user intent.

  15. Leveraging RAG for Training-Free Alignment of LLMs

    cs.LG 2026-05 unverdicted novelty 6.0

    RAG-Pref is a training-free RAG-based alignment technique that conditions LLMs on contrastive preference samples during inference, yielding over 3.7x average improvement in agentic attack refusals when combined with o...

  16. Adversarial SQL Injection Generation with LLM-Based Architectures

    cs.CR 2026-05 unverdicted novelty 6.0

    RADAGAS-GPT4o achieves a 22.73% bypass rate against 10 WAFs, succeeding more against AI/ML-based firewalls than rule-based ones.

  17. Break the Brake, Not the Wheel: Untargeted Jailbreak via Entropy Maximization

    cs.CV 2026-05 unverdicted novelty 6.0

    UJEM-KL improves cross-model transferability of untargeted jailbreaks on vision-language models by maximizing entropy at decision tokens instead of forcing specific outputs.

  18. Few-Shot Truly Benign DPO Attack for Jailbreaking LLMs

    cs.CR 2026-05 unverdicted novelty 6.0

    A truly benign DPO attack using 10 harmless preference pairs jailbreaks frontier LLMs by suppressing refusal behavior, achieving up to 81.73% attack success rate on GPT-4.1-nano at low cost.

  19. When Child Inherits: Modeling and Exploiting Subagent Spawn in Multi-Agent Networks

    cs.CR 2026-05 unverdicted novelty 6.0

    Multi-agent LLM frameworks can spread compromises across agent boundaries via insecure memory inheritance during subagent spawning.

  20. SkillScope: Toward Fine-Grained Least-Privilege Enforcement for Agent Skills

    cs.CR 2026-05 unverdicted novelty 6.0

    SkillScope detects over-privileged LLM agent skills with 94.53% F1 score via graph analysis and replay validation, finding 7,039 problematic skills in the wild and reducing violations by 88.56% while preserving task c...

  21. LoopTrap: Termination Poisoning Attacks on LLM Agents

    cs.CR 2026-05 unverdicted novelty 6.0

    LoopTrap is an automated red-teaming framework that crafts termination-poisoning prompts to amplify LLM agent steps by 3.57x on average (up to 25x) across 8 agents.

  22. ARGUS: Defending LLM Agents Against Context-Aware Prompt Injection

    cs.CR 2026-05 unverdicted novelty 6.0

    ARGUS defends LLM agents from context-aware prompt injections by tracking information provenance and verifying decisions against trustworthy evidence, reducing attack success to 3.8% while retaining 87.5% task utility.

  23. LocalAlign: Enabling Generalizable Prompt Injection Defense via Generation of Near-Target Adversarial Examples for Alignment Training

    cs.CR 2026-05 unverdicted novelty 6.0

    LocalAlign generates near-target adversarial examples via prompting and applies margin-aware alignment training to enforce tighter boundaries against prompt injection attacks.

  24. A Sentence Relation-Based Approach to Sanitizing Malicious Instructions

    cs.CR 2026-05 unverdicted novelty 6.0

    SONAR constructs a relational graph from entailment and contradiction scores to prune injected malicious sentences from LLM prompts while preserving context, achieving near-zero attack success rates.

  25. Self-Adaptive Multi-Agent LLM-Based Security Pattern Selection for IoT Systems

    cs.CR 2026-05 unverdicted novelty 6.0

    ASPO combines multi-agent LLM proposals with deterministic enforcement in a MAPE-K loop to select conflict-free, resource-feasible security patterns for IoT, delivering 100% safety invariants and 21-23% tail latency/e...

  26. FlashRT: Towards Computationally and Memory Efficient Red-Teaming for Prompt Injection and Knowledge Corruption

    cs.CR 2026-04 unverdicted novelty 6.0

    FlashRT delivers 2x-7x speedup and 2x-4x GPU memory reduction for prompt injection and knowledge corruption attacks on long-context LLMs versus nanoGCG.

  27. AgentWard: A Lifecycle Security Architecture for Autonomous AI Agents

    cs.CR 2026-04 conditional novelty 6.0

    AgentWard organizes stage-specific security controls with cross-layer coordination to intercept threats across the full lifecycle of autonomous AI agents.

  28. When AI reviews science: Can we trust the referee?

    cs.AI 2026-04 unverdicted novelty 6.0

    AI peer review systems are vulnerable to prompt injections, prestige biases, assertion strength effects, and contextual poisoning, as demonstrated by a new attack taxonomy and causal experiments on real conference sub...

  29. RouteGuard: Internal-Signal Detection of Skill Poisoning in LLM Agents

    cs.CR 2026-04 unverdicted novelty 6.0

    RouteGuard uses response-conditioned attention and hidden-state alignment to detect skill poisoning in LLM agents, achieving 0.8834 F1 on Skill-Inject benchmarks and recovering 90.51% of attacks missed by lexical screening.

  30. SafetyALFRED: Evaluating Safety-Conscious Planning of Multimodal Large Language Models

    cs.AI 2026-04 unverdicted novelty 6.0

    SafetyALFRED shows multimodal LLMs recognize kitchen hazards accurately in QA tests but achieve low success rates when required to mitigate those hazards through embodied planning.

  31. TEMPLATEFUZZ: Fine-Grained Chat Template Fuzzing for Jailbreaking and Red Teaming LLMs

    cs.CR 2026-04 unverdicted novelty 6.0

    TEMPLATEFUZZ mutates chat templates with element-level rules and heuristic search to reach 98.2% average jailbreak success rate on twelve open-source LLMs while degrading accuracy by only 1.1%.

  32. ClawGuard: A Runtime Security Framework for Tool-Augmented LLM Agents Against Indirect Prompt Injection

    cs.CR 2026-04 unverdicted novelty 6.0

    ClawGuard enforces deterministic, user-derived access constraints at tool boundaries to block indirect prompt injection without changing the underlying LLM.

  33. ClawGuard: A Runtime Security Framework for Tool-Augmented LLM Agents Against Indirect Prompt Injection

    cs.CR 2026-04 unverdicted novelty 6.0

    ClawGuard enforces user-derived access constraints at tool-call boundaries to block indirect prompt injection in tool-augmented LLM agents across web, MCP, and skill injection channels.

  34. PlanGuard: Defending Agents against Indirect Prompt Injection via Planning-based Consistency Verification

    cs.CR 2026-04 unverdicted novelty 6.0

    PlanGuard cuts indirect prompt injection attack success rate to 0% on the InjecAgent benchmark by verifying agent actions against a user-instruction-only plan while keeping false positives at 1.49%.

  35. Quantifying Trust: Financial Risk Management for Trustworthy AI Agents

    cs.AI 2026-04 unverdicted novelty 6.0

    The paper introduces the Agentic Risk Standard (ARS) as a payment settlement framework that delivers predefined compensation for AI agent execution failures, misalignment, or unintended outcomes.

  36. The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions

    cs.CR 2024-04 unverdicted novelty 6.0

    Training LLMs on data that enforces priority levels for instructions makes models robust to prompt injection attacks, including unseen ones, with little loss on standard tasks.

  37. SecureMCP: A Policy-Enforced LLM Data Access Framework for AIoT Systems via Model Context Protocol

    cs.CR 2026-05 unverdicted novelty 5.0

    SecureMCP integrates RBAC with five sequential defense modules in an MCP server to achieve 82.3% policy compliance against adversarial LLM SQL queries in AIoT while preserving execution accuracy.

  38. Architectural Obsolescence of Unhardened Agentic-AI Runtimes

    cs.CR 2026-05 unverdicted novelty 5.0

    OpenClaw fails to detect any of four action-audit divergence types while a hardened fork detects them all with perfect accuracy, making unhardened agentic-AI runtimes architecturally obsolete.

  39. LLM-Oriented Information Retrieval: A Denoising-First Perspective

    cs.IR 2026-05 unverdicted novelty 5.0

    Denoising to maximize usable evidence density and verifiability is becoming the primary bottleneck in LLM-oriented information retrieval, conceptualized via a four-stage framework and addressed through a pipeline taxo...

  40. CAP-CoT: Cycle Adversarial Prompt for Improving Chain of Thoughts in LLM Reasoning

    cs.AI 2026-04 unverdicted novelty 5.0

    CAP-CoT uses iterative adversarial prompt cycles to improve CoT accuracy, stability, and robustness across six benchmarks and four LLM backbones.

  41. What Security and Privacy Transparency Users Need from Consumer-Facing Generative AI

    cs.HC 2026-04 unverdicted novelty 5.0

    A qualitative study of 21 GenAI users finds that current S&P transparency is often seen as incomplete or untrustworthy, leading to proxy-based adoption and constrained use, with calls for independent evaluations and o...

  42. Like a Hammer, It Can Build, It Can Break: Large Language Model Uses, Perceptions, and Adoption in Cybersecurity Operations on Reddit

    cs.CR 2026-04 unverdicted novelty 5.0

    Security practitioners use LLMs independently for low-risk productivity tasks while showing interest in enterprise platforms, but reliability, verification needs, and security risks limit broader autonomy.

  43. CareGuardAI: Context-Aware Multi-Agent Guardrails for Clinical Safety & Hallucination Mitigation in Patient-Facing LLMs

    cs.CY 2026-04 unverdicted novelty 5.0

    CareGuardAI introduces dual risk assessments (SRA and HRA) and a multi-stage agent pipeline that only releases LLM responses when both risks score at or below 2, outperforming GPT-4o-mini on PatientSafeBench, MedSafet...

  44. Security Attack and Defense Strategies for Autonomous Agent Frameworks: A Layered Review with OpenClaw as a Case Study

    cs.CR 2026-04 conditional novelty 4.0

    The survey organizes security threats and defenses in autonomous LLM agents into four layers and identifies that risks can propagate across layers from inputs to ecosystem impacts.

  45. CASCADE: A Cascaded Hybrid Defense Architecture for Prompt Injection Detection in MCP-Based Systems

    cs.CR 2026-04 unverdicted novelty 4.0

    CASCADE is a cascaded hybrid detector that combines fast regex/entropy filtering, BGE embeddings with local LLM fallback, and output pattern checks to achieve 95.85% precision and 6.06% false-positive rate against pro...

  46. Conversations Risk Detection LLMs in Financial Agents via Multi-Stage Generative Rollout

    cs.CR 2026-04 unverdicted novelty 4.0

    FinSec is a multi-stage detection system for financial LLM dialogues that reaches 90.13% F1 score, cuts attack success rate to 9.09%, and raises AUPRC to 0.9189.

  47. Large Language Models for Agentic NetOps and AIOps: Architectures, Evaluation, and Safety

    cs.NI 2026-05 unverdicted novelty 3.0

    A literature survey organizing LLM agent work for NetOps and AIOps around autonomy hierarchies, workflow evaluation, and safety contracts.

Reference graph

Works this paper leans on

68 extracted references · 68 canonical work pages · cited by 46 Pith papers

  1. [1]

    Notion.https://www.notion.so/

  2. [2]

    Parea AI.https://www.parea.ai/

  3. [3]

    https:// supertools.therundown.ai/

    Supertools | Best AI Tools Guide. https:// supertools.therundown.ai/

  4. [4]

    https: //simonwillison.net/2022/Sep/12/prompt- injection/

    Prompt Injection Attacks against GPT-3. https: //simonwillison.net/2022/Sep/12/prompt- injection/

  5. [5]

    https://platform

    Rate Limits OpenAI API. https://platform. openai.com/docs/guides/rate-limits

  6. [6]

    Real Attackers Don’t Compute Gradients

    Giovanni Apruzzese, Hyrum S. Anderson, Savino Dambra, David Freeman, Fabio Pierazzi, and Kevin A. Roundy. "Real Attackers Don’t Compute Gradients": Bridging the Gap between Adversarial ML Research and Practice. InSaTML, 2023

  7. [7]

    Spinning Language Models: Risks of Propaganda-As-A-Service and Countermeasures

    Eugene Bagdasaryan and Vitaly Shmatikov. Spinning Language Models: Risks of Propaganda-As-A-Service and Countermeasures. InS&P, pages 769–786. IEEE, 2022

  8. [8]

    Bender, Timnit Gebru, Angelina McMillan- Major, and Shmargaret Shmitchell

    Emily M. Bender, Timnit Gebru, Angelina McMillan- Major, and Shmargaret Shmitchell. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? InFAccT, pages 610–623

  9. [9]

    Emergent autonomous scientific research capabilities of large language models.arXiv preprint, 2023

    Daniil A Boiko, Robert MacKnight, and Gabe Gomes. Emergent autonomous scientific research capabilities of large language models.arXiv preprint, 2023

  10. [10]

    SQLrand: Preventing SQL injection attacks

    Stephen W Boyd and Angelos D Keromytis. SQLrand: Preventing SQL injection attacks. InACNS, pages 292– 302, 2004

  11. [11]

    Large Language Models as Tool Makers.arXiv preprint, 2023

    Tianle Cai, Xuezhi Wang, Tengyu Ma, Xinyun Chen, and Denny Zhou. Large Language Models as Tool Makers.arXiv preprint, 2023

  12. [12]

    Low-code LLM: Visual Program- ming over LLMs.arXiv preprint, 2023

    Yuzhe Cai, Shaoguang Mao, Wenshan Wu, Zehua Wang, Yaobo Liang, Tao Ge, Chenfei Wu, Wang You, Ting Song, Yan Xia, et al. Low-code LLM: Visual Program- ming over LLMs.arXiv preprint, 2023

  13. [13]

    Writesonic

    ChatAIWriter. Writesonic. https://app. writesonic.com/botsonic/780dc6b4-fbe9- 4d5e-911c-014c9367ba32

  14. [14]

    Else- vier, 2009

    Justin Clarke.SQL injection attacks and defense. Else- vier, 2009

  15. [15]

    How to Jailbreak ChatGPT

    Lavina Daryanani. How to Jailbreak ChatGPT. https://watcher.guru/news/how-to-jailbreak- chatgpt

  16. [16]

    https://research.nccgroup

    Exploring Prompt Injection Attacks - NCC Group Research Blog. https://research.nccgroup. com/2022/12/05/exploring-prompt-injection- attacks/, Apr 2023

  17. [17]

    Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A. Smith. RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models. InEMNLP, pages 3356–3369, 2020

  18. [18]

    Google AI. PaLM 2. https://ai.google/discover/ palm2/

  19. [19]

    Auto-GPT

    Significant Gravitas. Auto-GPT. https://github. com/Significant-Gravitas/Auto-GPT

  20. [20]

    Not what you’ve signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt In- jection

    Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. Not what you’ve signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt In- jection. InarXiv preprint, 2023

  21. [21]

    Diava: A traffic-based framework for detection of sql injection attacks and vulnerability analysis of leaked data.IEEE Transactions on Reliability, 69(1):188–202, 2020

    Haifeng Gu, Jianning Zhang, Tian Liu, Ming Hu, Jun- long Zhou, Tongquan Wei, and Mingsong Chen. Diava: A traffic-based framework for detection of sql injection attacks and vulnerability analysis of leaked data.IEEE Transactions on Reliability, 69(1):188–202, 2020

  22. [22]

    Defense Tactics

    Prompt Engineering Guide. Defense Tactics. https: //www.promptingguide.ai/risks/adversarial

  23. [23]

    Cross-Site Scripting (XSS) attacks and defense mechanisms: clas- sification and state-of-the-art.Int

    Shashank Gupta and Brij Bhooshan Gupta. Cross-Site Scripting (XSS) attacks and defense mechanisms: clas- sification and state-of-the-art.Int. J. Syst. Assur. Eng. Manag., 8(1s):512–530, 2017

  24. [24]

    Swot analysis: a theoretical review

    Emet GURL. Swot analysis: a theoretical review. 2017

  25. [25]

    A classification of SQL-injection attacks and countermeasures

    William G Halfond, Jeremy Viegas, Alessandro Orso, et al. A classification of SQL-injection attacks and countermeasures. InISSSR, volume 1, pages 13–15. IEEE, 2006

  26. [26]

    ToolkenGPT: Augmenting Frozen Language Mod- els with Massive Tools via Tool Embeddings.arXiv preprint, 2023

    Shibo Hao, Tianyang Liu, Zhen Wang, and Zhiting Hu. ToolkenGPT: Augmenting Frozen Language Mod- els with Massive Tools via Tool Embeddings.arXiv preprint, 2023

  27. [27]

    Current state of research on cross-site scripting (XSS)–A systematic literature review.Information and Software Technology, 58:170– 186, 2015

    Isatou Hydara, Abu Bakar Md Sultan, Hazura Zulzalil, and Novia Admodisastro. Current state of research on cross-site scripting (XSS)–A systematic literature review.Information and Software Technology, 58:170– 186, 2015

  28. [28]

    Lan- guage models can solve computer tasks.arXiv preprint, 2023

    Geunwoo Kim, Pierre Baldi, and Stephen McAleer. Lan- guage models can solve computer tasks.arXiv preprint, 2023. 15

  29. [29]

    Api-bank: A bench- mark for tool-augmented llms.arXiv preprint, 2023

    Minghao Li, Feifan Song, Bowen Yu, Haiyang Yu, Zhou- jun Li, Fei Huang, and Yongbin Li. Api-bank: A bench- mark for tool-augmented llms.arXiv preprint, 2023

  30. [30]

    Taskmatrix

    Yaobo Liang, Chenfei Wu, Ting Song, Wenshan Wu, Yan Xia, Yu Liu, Yang Ou, Shuai Lu, Lei Ji, Shaoguang Mao, et al. Taskmatrix. ai: Completing tasks by con- necting foundation models with millions of apis.arXiv preprint, 2023

  31. [31]

    ChatGPT-powered Conversational Drug Editing Using Retrieval and Domain Feedback.arXiv preprint, 2023

    Shengchao Liu, Jiongxiao Wang, Yijin Yang, Cheng- peng Wang, Ling Liu, Hongyu Guo, and Chaowei Xiao. ChatGPT-powered Conversational Drug Editing Using Retrieval and Domain Feedback.arXiv preprint, 2023

  32. [32]

    Adversarial training for large neural language models

    Xiaodong Liu, Hao Cheng, Pengcheng He, Weizhu Chen, Yu Wang, Hoifung Poon, and Jianfeng Gao. Ad- versarial Training for Large Neural Language Models. CoRR, abs/2004.08994, 2020

  33. [33]

    Jailbreaking ChatGPT via Prompt Engineer- ing: An Empirical Study.arXiv preprint, 2023

    Yi Liu, Gelei Deng, Zhengzi Xu, Yuekang Li, Yaowen Zheng, Ying Zhang, Lida Zhao, Tianwei Zhang, and Yang Liu. Jailbreaking ChatGPT via Prompt Engineer- ing: An Empirical Study.arXiv preprint, 2023

  34. [34]

    Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models.arXiv preprint, 2023

    Potsawee Manakul, Adian Liusie, and Mark JF Gales. Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models.arXiv preprint, 2023

  35. [35]

    Sources of Hallucination by Large Language Models on Inference Tasks.arXiv preprint, 2023

    Nick McKenna, Tianyi Li, Liang Cheng, Moham- mad Javad Hosseini, Mark Johnson, and Mark Steedman. Sources of Hallucination by Large Language Models on Inference Tasks.arXiv preprint, 2023

  36. [36]

    NOTABLE: Transferable Backdoor At- tacks Against Prompt-based NLP Models

    Kai Mei, Zheng Li, Zhenting Wang, Yang Zhang, and Shiqing Ma. NOTABLE: Transferable Backdoor At- tacks Against Prompt-based NLP Models. InACL, 2023

  37. [37]

    Introducing LLaMA: A foundational, 65-billion-parameter large language model

    Meta. Introducing LLaMA: A foundational, 65-billion-parameter large language model. https://ai.facebook.com/blog/large- language-model-llama-meta-ai

  38. [38]

    Evaluating the Robustness of Neural Language Models to Input Pertur- bations

    Milad Moradi and Matthias Samwald. Evaluating the Robustness of Neural Language Models to Input Pertur- bations. InEMNLP 2021, pages 1558–1570, 2021

  39. [39]

    OpenAI. GPT-4. https://openai.com/research/ gpt-4

  40. [40]

    OWASP Top 10 List for Large Language Models version 0.1

    OWASP. OWASP Top 10 List for Large Language Models version 0.1. https://owasp.org/www- project-top-10-for-large-language-model- applications/descriptions

  41. [41]

    What is Jailbreaking in AI models like ChatGPT? https://www.techopedia.com/what- is-jailbreaking-in-ai-models-like-chatgpt

    Kaushik Pal. What is Jailbreaking in AI models like ChatGPT? https://www.techopedia.com/what- is-jailbreaking-in-ai-models-like-chatgpt

  42. [42]

    ART: Automatic multi-step reasoning and tool-use for large language models.arXiv preprint, 2023

    Bhargavi Paranjape, Scott Lundberg, Sameer Singh, Hannaneh Hajishirzi, Luke Zettlemoyer, and Marco Tulio Ribeiro. ART: Automatic multi-step reasoning and tool-use for large language models.arXiv preprint, 2023

  43. [43]

    Generative agents: Interactive simulacra of human be- havior.arXiv preprint, 2023

    Joon Sung Park, Joseph C O’Brien, Carrie J Cai, Mered- ith Ringel Morris, Percy Liang, and Michael S Bernstein. Generative agents: Interactive simulacra of human be- havior.arXiv preprint, 2023

  44. [44]

    Ignore Previous Prompt: Attack Techniques For Language Models

    Fábio Perez and Ian Ribeiro. Ignore Previous Prompt: Attack Techniques For Language Models. InNeurIPS ML Safety Workshop, 2022

  45. [45]

    Pricing.https://openai.com/pricing

  46. [46]

    Instruction Defense

    Learn Prompting. Instruction Defense. https://learnprompting.org/docs/prompt_ hacking/defensive_measures/instruction

  47. [47]

    Instruction Defense

    Learn Prompting. Instruction Defense. https://learnprompting.org/docs/prompt_ hacking/defensive_measures/post_prompting

  48. [48]

    Prompt Leaking

    Learn Prompting. Prompt Leaking. https: //learnprompting.org/docs/prompt_hacking/ leaking

  49. [49]

    Random Sequence Enclosure

    Learn Prompting. Random Sequence Enclosure. https://learnprompting.org/docs/prompt_ hacking/defensive_measures/random_sequence

  50. [50]

    Sandwich Defense

    Learn Prompting. Sandwich Defense. https: //learnprompting.org/docs/prompt_hacking/ defensive_measures/sandwich_defense

  51. [51]

    Separate LLM Evaluation

    Learn Prompting. Separate LLM Evaluation. https://learnprompting.org/docs/prompt_ hacking/defensive_measures/llm_eval

  52. [52]

    XML Tagging

    Learn Prompting. XML Tagging. https: //learnprompting.org/docs/prompt_hacking/ defensive_measures/xml_tagging

  53. [53]

    CREATOR: Disentangling Abstract and Concrete Reasonings of Large Language Models through Tool Creation.arXiv preprint, 2023

    Cheng Qian, Chi Han, Yi R Fung, Yujia Qin, Zhiyuan Liu, and Heng Ji. CREATOR: Disentangling Abstract and Concrete Reasonings of Large Language Models through Tool Creation.arXiv preprint, 2023

  54. [54]

    The Full Story of Large Language Models and RLHF

    Marco Ramponi. The Full Story of Large Language Models and RLHF. https://www.assemblyai. com/blog/the-full-story-of-large-language- models-and-rlhf

  55. [55]

    Tricking LLMs into Disobedience: Understanding, Analyzing, and Prevent- ing Jailbreaks.arXiv preprint, 2023

    Abhinav Rao, Sachin Vashistha, Atharva Naik, Somak Aditya, and Monojit Choudhury. Tricking LLMs into Disobedience: Understanding, Analyzing, and Prevent- ing Jailbreaks.arXiv preprint, 2023. 16

  56. [56]

    Get a Model! Model Hijacking Attack Against Machine Learning Models

    Ahmed Salem, Michael Backes, and Yang Zhang. Get a Model! Model Hijacking Attack Against Machine Learning Models. InNDSS, 2022

  57. [57]

    Toolformer: Language models can teach themselves to use tools.arXiv preprint, 2023

    Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Can- cedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools.arXiv preprint, 2023

  58. [58]

    Role-play with large language models.arXiv preprint, 2023

    Murray Shanahan, Kyle McDonell, and Laria Reynolds. Role-play with large language models.arXiv preprint, 2023

  59. [59]

    Hugginggpt: Solv- ing ai tasks with chatgpt and its friends in huggingface

    Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. Hugginggpt: Solv- ing ai tasks with chatgpt and its friends in huggingface. arXiv preprint, 2023

  60. [60]

    Why So Toxic?: Measuring and Triggering Toxic Behavior in Open-Domain Chatbots

    Wai Man Si, Michael Backes, Jeremy Blackburn, Emil- iano De Cristofaro, Gianluca Stringhini, Savvas Zannet- tou, and Yang Zhang. Why So Toxic?: Measuring and Triggering Toxic Behavior in Open-Domain Chatbots. InCCS, pages 2659–2673, 2022

  61. [61]

    Two-in-One: A Model Hijacking Attack Against Text Generation Models.arXiv preprint, 2023

    Wai Man Si, Michael Backes, Yang Zhang, and Ahmed Salem. Two-in-One: A Model Hijacking Attack Against Text Generation Models.arXiv preprint, 2023

  62. [62]

    Contrastive Learning Reduces Hallucination in Conversations

    Weiwei Sun, Zhengliang Shi, Shen Gao, Pengjie Ren, Maarten de Rijke, and Zhaochun Ren. Contrastive Learning Reduces Hallucination in Conversations. arXiv preprint, 2022

  63. [63]

    A systematic analysis of XSS sanitization in web applica- tion frameworks

    Joel Weinberger, Prateek Saxena, Devdatta Akhawe, Matthew Finifter, Richard Shin, and Dawn Song. A systematic analysis of XSS sanitization in web applica- tion frameworks. InESORICS, pages 150–171, 2011

  64. [64]

    Fundamental limitations of alignment in large language models.arXiv preprint, 2023

    Yotam Wolf, Noam Wies, Yoav Levine, and Amnon Shashua. Fundamental limitations of alignment in large language models.arXiv preprint, 2023

  65. [65]

    On the Tool Manipula- tion Capability of Open-source Large Language Models

    Qiantong Xu, Fenglu Hong, Bo Li, Changran Hu, Zhengyu Chen, and Jian Zhang. On the Tool Manipula- tion Capability of Open-source Large Language Models. arXiv preprint, 2023

  66. [66]

    React: Synergizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. ICLR, 2023

  67. [67]

    Interpreting the Robustness of Neural NLP Models to Textual Perturbations

    Yunxiang Zhang, Liangming Pan, Samson Tan, and Min- Yen Kan. Interpreting the Robustness of Neural NLP Models to Textual Perturbations. InACL, pages 3993– 4007, 2022

  68. [68]

    Fine-mixing: Mitigating Backdoors in Fine-tuned Language Models

    Zhiyuan Zhang, Lingjuan Lyu, Xingjun Ma, Chenguang Wang, and Xu Sun. Fine-mixing: Mitigating Backdoors in Fine-tuned Language Models. InEMNLP, pages 355–372, 2022. A List of Anonymized LLM-integrated Appli- cations 17 Table 5: Overview of LLM-Integrated Applications Used in Our Evaluation. We include the full list of LLM-integrated applications tested and...