Recognition: 2 theorem links
· Lean TheoremPrompt Injection attack against LLM-integrated Applications
Pith reviewed 2026-05-11 21:10 UTC · model grok-4.3
The pith
HouYi, a black-box technique, enables prompt injection on 31 of 36 real LLM-integrated applications, allowing prompt theft and unrestricted LLM use.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
HouYi is a novel black-box prompt injection attack technique compartmentalized into a seamlessly-incorporated pre-constructed prompt, an injection prompt inducing context partition, and a malicious payload designed to fulfill attack objectives. Leveraging HouYi reveals previously unknown and severe attack outcomes such as unrestricted arbitrary LLM usage and uncomplicated application prompt theft, with 31 of 36 deployed LLM-integrated applications found susceptible.
What carries the argument
HouYi, the three-element black-box injection method (pre-constructed prompt, context-partition injection prompt, malicious payload) that bypasses application safeguards to execute attacker goals inside the LLM context.
If this is right
- Application prompts can be extracted with straightforward injection sequences.
- Attackers can obtain unrestricted use of the LLM backend for arbitrary tasks.
- Over 85 percent of tested real-world LLM-integrated applications remain open to these attacks.
- Vendor-confirmed cases show that prompt injection creates concrete risks for end users at scale.
Where Pith is reading between the lines
- Input handling in LLM apps may require the same isolation practices long used in web applications.
- Context-partition detection could serve as a general defense layer against similar future attacks.
- Automated testing tools based on HouYi might help developers identify exposure before release.
Load-bearing premise
The injection prompt can reliably create a context partition and deliver the payload across different LLM applications without detection or blocking by existing safeguards.
What would settle it
Applying HouYi to one of the 31 vulnerable applications after the addition of explicit filtering for context-partitioning phrases and checking whether the malicious payload still executes.
read the original abstract
Large Language Models (LLMs), renowned for their superior proficiency in language comprehension and generation, stimulate a vibrant ecosystem of applications around them. However, their extensive assimilation into various services introduces significant security risks. This study deconstructs the complexities and implications of prompt injection attacks on actual LLM-integrated applications. Initially, we conduct an exploratory analysis on ten commercial applications, highlighting the constraints of current attack strategies in practice. Prompted by these limitations, we subsequently formulate HouYi, a novel black-box prompt injection attack technique, which draws inspiration from traditional web injection attacks. HouYi is compartmentalized into three crucial elements: a seamlessly-incorporated pre-constructed prompt, an injection prompt inducing context partition, and a malicious payload designed to fulfill the attack objectives. Leveraging HouYi, we unveil previously unknown and severe attack outcomes, such as unrestricted arbitrary LLM usage and uncomplicated application prompt theft. We deploy HouYi on 36 actual LLM-integrated applications and discern 31 applications susceptible to prompt injection. 10 vendors have validated our discoveries, including Notion, which has the potential to impact millions of users. Our investigation illuminates both the possible risks of prompt injection attacks and the possible tactics for mitigation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper deconstructs prompt injection attacks on LLM-integrated applications. It first analyzes ten commercial applications to highlight limitations of current strategies. Then, it proposes HouYi, a black-box technique inspired by web injection attacks, consisting of a pre-constructed prompt, an injection prompt that induces context partition, and a malicious payload. Deployed on 36 real applications, HouYi succeeds against 31, enabling outcomes such as arbitrary LLM usage and prompt theft. Ten vendors, including Notion, have validated the findings.
Significance. If the results hold, this paper is significant for demonstrating practical, severe prompt injection vulnerabilities in real LLM applications through a novel black-box method. The large-scale testing and vendor confirmations provide strong evidence that current integrations are at risk, potentially affecting millions of users, and it contributes actionable insights into both attack tactics and mitigation approaches in the field of AI security.
major comments (2)
- The central claim that 31 applications are susceptible (as stated in the abstract and evaluation section) depends on the context-partition step succeeding reliably. However, the manuscript lacks a detailed analysis of the five non-vulnerable applications, including whether the partition failed or other factors intervened, and does not report on variations across different LLMs or safety mechanisms. This undermines the assessment of the attack's generality.
- In the section describing HouYi, the injection prompt is presented as inducing context partition without quantitative evidence or examples showing its effectiveness across diverse applications or its resistance to existing safeguards, which is essential for supporting the severe attack outcomes claimed.
minor comments (1)
- The phrasing 'discern 31 applications susceptible to prompt injection' in the abstract is slightly awkward and could be clarified.
Simulated Author's Rebuttal
We thank the referee for their positive evaluation of the paper's significance and for the constructive comments. We address each major comment point by point below, indicating the revisions we will make to improve clarity and support for our claims.
read point-by-point responses
-
Referee: The central claim that 31 applications are susceptible (as stated in the abstract and evaluation section) depends on the context-partition step succeeding reliably. However, the manuscript lacks a detailed analysis of the five non-vulnerable applications, including whether the partition failed or other factors intervened, and does not report on variations across different LLMs or safety mechanisms. This undermines the assessment of the attack's generality.
Authors: We agree that additional detail on the unsuccessful cases would strengthen the assessment of generality. In the revised manuscript, we will add a dedicated subsection in the evaluation section analyzing the five non-vulnerable applications. Our experimental observations indicate that context partition failed in these cases primarily due to application-specific input sanitization or output filtering that disrupted the injection prompt's ability to separate contexts, rather than issues with the payload itself. Regarding variations across LLMs and safety mechanisms, the 36 tested applications represent a diverse set of real-world deployments, each integrating different backend LLMs and built-in safeguards. The consistent success of HouYi across this heterogeneous collection provides evidence of broad applicability. We will explicitly discuss this diversity and the black-box constraints that limit per-LLM instrumentation in the revision. revision: yes
-
Referee: In the section describing HouYi, the injection prompt is presented as inducing context partition without quantitative evidence or examples showing its effectiveness across diverse applications or its resistance to existing safeguards, which is essential for supporting the severe attack outcomes claimed.
Authors: We acknowledge the value of more direct supporting evidence for the injection prompt component. In the revised HouYi description, we will include concrete examples of the injection prompts (and their application-specific adaptations) along with a breakdown of observed context-partition success rates where distinguishable from overall attack outcomes. The prompt's effectiveness and resistance to safeguards are substantiated by its role in enabling attacks on 31 of 36 diverse applications despite the presence of various input validation and moderation layers. We will expand the text to quantify this where possible from our logs and discuss limitations, such as cases where stronger custom safeguards might interfere. revision: yes
Circularity Check
No circularity: empirical attack evaluation on external applications
full rationale
The paper performs an exploratory analysis of ten commercial apps, proposes HouYi as a black-box technique inspired by web injection (with three explicit components: pre-constructed prompt, context-partition injection prompt, and payload), then reports direct experimental outcomes on 36 separate real-world LLM-integrated applications (31 vulnerable). No equations, fitted parameters, self-definitional loops, or load-bearing self-citations appear in the derivation chain; the central claims rest on external testing and vendor validation rather than reducing to inputs by construction. References to prior prompt-injection literature are contextual and non-circular.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption LLM applications concatenate user inputs directly into system prompts without robust separation or sanitization
invented entities (1)
-
HouYi attack framework
no independent evidence
Forward citations
Cited by 47 Pith papers
-
Ghost in the Agent: Redefining Information Flow Tracking for LLM Agents
NeuroTaint is the first taint tracking framework for LLM agents that uses offline auditing of semantic, causal, and persistent context to detect flows from untrusted sources to privileged sinks.
-
TRUSTDESC: Preventing Tool Poisoning in LLM Applications via Trusted Description Generation
TRUSTDESC prevents tool poisoning in LLM applications by automatically generating accurate tool descriptions from code via a three-stage pipeline of reachability analysis, description synthesis, and dynamic verification.
-
Supply-Chain Poisoning Attacks Against LLM Coding Agent Skill Ecosystems
DDIPE poisons LLM agent skills by embedding malicious logic in documentation examples, achieving 11.6-33.5% bypass rates across frameworks while explicit attacks are blocked, with 2.5% evading detection.
-
AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents
AgentDojo introduces an extensible evaluation framework populated with realistic agent tasks and security test cases to measure prompt injection robustness in tool-using LLM agents.
-
IPI-proxy: An Intercepting Proxy for Red-Teaming Web-Browsing AI Agents Against Indirect Prompt Injection
IPI-proxy is a toolkit using an intercepting proxy to inject indirect prompt injection attacks into live web pages for testing AI browsing agents against hidden instructions.
-
Can a Single Message Paralyze the AI Infrastructure? The Rise of AbO-DDoS Attacks through Targeted Mobius Injection
Mobius Injection exploits semantic closure in LLM agents to enable single-message AbO-DDoS attacks achieving up to 51x call amplification and 229x latency inflation.
-
The Granularity Mismatch in Agent Security: Argument-Level Provenance Solves Enforcement and Isolates the LLM Reasoning Bottleneck
PACT achieves perfect security and utility under oracle provenance by enforcing argument-level trust contracts based on semantic roles and cross-step provenance tracking, outperforming invocation-level monitors in Age...
-
PragLocker: Protecting Agent Intellectual Property in Untrusted Deployments via Non-Portable Prompts
PragLocker protects agent prompts as IP by building non-portable obfuscated versions that function only on the intended LLM through code-symbol semantic anchoring followed by target-model feedback noise injection.
-
When Alignment Isn't Enough: Response-Path Attacks on LLM Agents
A malicious relay can strategically rewrite aligned LLM outputs in BYOK agent architectures to achieve up to 99.1% attack success on benchmarks like AgentDojo and ASB.
-
Jailbreaking Frontier Foundation Models Through Intention Deception
A multi-turn intention-deception jailbreak achieves high success on GPT-5 and Claude models while exposing para-jailbreaking where models leak harmful information without direct refusal.
-
A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework
A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.
-
Many-Tier Instruction Hierarchy in LLM Agents
ManyIH and ManyIH-Bench address instruction conflicts in LLM agents with up to 12 privilege levels across 853 tasks, revealing frontier models achieve only ~40% accuracy.
-
Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents
ASB is a new benchmark that tests 10 prompt injection attacks, memory poisoning, a novel Plan-of-Thought backdoor attack, and 11 defenses on LLM agents across 13 models, finding attack success rates up to 84.3% and li...
-
SkillSafetyBench: Evaluating Agent Safety under Skill-Facing Attack Surfaces
SkillSafetyBench shows that localized non-user attacks via skills and artifacts can consistently induce unsafe agent behavior across domains and model backends, independent of user intent.
-
Leveraging RAG for Training-Free Alignment of LLMs
RAG-Pref is a training-free RAG-based alignment technique that conditions LLMs on contrastive preference samples during inference, yielding over 3.7x average improvement in agentic attack refusals when combined with o...
-
Adversarial SQL Injection Generation with LLM-Based Architectures
RADAGAS-GPT4o achieves a 22.73% bypass rate against 10 WAFs, succeeding more against AI/ML-based firewalls than rule-based ones.
-
Break the Brake, Not the Wheel: Untargeted Jailbreak via Entropy Maximization
UJEM-KL improves cross-model transferability of untargeted jailbreaks on vision-language models by maximizing entropy at decision tokens instead of forcing specific outputs.
-
Few-Shot Truly Benign DPO Attack for Jailbreaking LLMs
A truly benign DPO attack using 10 harmless preference pairs jailbreaks frontier LLMs by suppressing refusal behavior, achieving up to 81.73% attack success rate on GPT-4.1-nano at low cost.
-
When Child Inherits: Modeling and Exploiting Subagent Spawn in Multi-Agent Networks
Multi-agent LLM frameworks can spread compromises across agent boundaries via insecure memory inheritance during subagent spawning.
-
SkillScope: Toward Fine-Grained Least-Privilege Enforcement for Agent Skills
SkillScope detects over-privileged LLM agent skills with 94.53% F1 score via graph analysis and replay validation, finding 7,039 problematic skills in the wild and reducing violations by 88.56% while preserving task c...
-
LoopTrap: Termination Poisoning Attacks on LLM Agents
LoopTrap is an automated red-teaming framework that crafts termination-poisoning prompts to amplify LLM agent steps by 3.57x on average (up to 25x) across 8 agents.
-
ARGUS: Defending LLM Agents Against Context-Aware Prompt Injection
ARGUS defends LLM agents from context-aware prompt injections by tracking information provenance and verifying decisions against trustworthy evidence, reducing attack success to 3.8% while retaining 87.5% task utility.
-
LocalAlign: Enabling Generalizable Prompt Injection Defense via Generation of Near-Target Adversarial Examples for Alignment Training
LocalAlign generates near-target adversarial examples via prompting and applies margin-aware alignment training to enforce tighter boundaries against prompt injection attacks.
-
A Sentence Relation-Based Approach to Sanitizing Malicious Instructions
SONAR constructs a relational graph from entailment and contradiction scores to prune injected malicious sentences from LLM prompts while preserving context, achieving near-zero attack success rates.
-
Self-Adaptive Multi-Agent LLM-Based Security Pattern Selection for IoT Systems
ASPO combines multi-agent LLM proposals with deterministic enforcement in a MAPE-K loop to select conflict-free, resource-feasible security patterns for IoT, delivering 100% safety invariants and 21-23% tail latency/e...
-
FlashRT: Towards Computationally and Memory Efficient Red-Teaming for Prompt Injection and Knowledge Corruption
FlashRT delivers 2x-7x speedup and 2x-4x GPU memory reduction for prompt injection and knowledge corruption attacks on long-context LLMs versus nanoGCG.
-
AgentWard: A Lifecycle Security Architecture for Autonomous AI Agents
AgentWard organizes stage-specific security controls with cross-layer coordination to intercept threats across the full lifecycle of autonomous AI agents.
-
When AI reviews science: Can we trust the referee?
AI peer review systems are vulnerable to prompt injections, prestige biases, assertion strength effects, and contextual poisoning, as demonstrated by a new attack taxonomy and causal experiments on real conference sub...
-
RouteGuard: Internal-Signal Detection of Skill Poisoning in LLM Agents
RouteGuard uses response-conditioned attention and hidden-state alignment to detect skill poisoning in LLM agents, achieving 0.8834 F1 on Skill-Inject benchmarks and recovering 90.51% of attacks missed by lexical screening.
-
SafetyALFRED: Evaluating Safety-Conscious Planning of Multimodal Large Language Models
SafetyALFRED shows multimodal LLMs recognize kitchen hazards accurately in QA tests but achieve low success rates when required to mitigate those hazards through embodied planning.
-
TEMPLATEFUZZ: Fine-Grained Chat Template Fuzzing for Jailbreaking and Red Teaming LLMs
TEMPLATEFUZZ mutates chat templates with element-level rules and heuristic search to reach 98.2% average jailbreak success rate on twelve open-source LLMs while degrading accuracy by only 1.1%.
-
ClawGuard: A Runtime Security Framework for Tool-Augmented LLM Agents Against Indirect Prompt Injection
ClawGuard enforces deterministic, user-derived access constraints at tool boundaries to block indirect prompt injection without changing the underlying LLM.
-
ClawGuard: A Runtime Security Framework for Tool-Augmented LLM Agents Against Indirect Prompt Injection
ClawGuard enforces user-derived access constraints at tool-call boundaries to block indirect prompt injection in tool-augmented LLM agents across web, MCP, and skill injection channels.
-
PlanGuard: Defending Agents against Indirect Prompt Injection via Planning-based Consistency Verification
PlanGuard cuts indirect prompt injection attack success rate to 0% on the InjecAgent benchmark by verifying agent actions against a user-instruction-only plan while keeping false positives at 1.49%.
-
Quantifying Trust: Financial Risk Management for Trustworthy AI Agents
The paper introduces the Agentic Risk Standard (ARS) as a payment settlement framework that delivers predefined compensation for AI agent execution failures, misalignment, or unintended outcomes.
-
The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions
Training LLMs on data that enforces priority levels for instructions makes models robust to prompt injection attacks, including unseen ones, with little loss on standard tasks.
-
SecureMCP: A Policy-Enforced LLM Data Access Framework for AIoT Systems via Model Context Protocol
SecureMCP integrates RBAC with five sequential defense modules in an MCP server to achieve 82.3% policy compliance against adversarial LLM SQL queries in AIoT while preserving execution accuracy.
-
Architectural Obsolescence of Unhardened Agentic-AI Runtimes
OpenClaw fails to detect any of four action-audit divergence types while a hardened fork detects them all with perfect accuracy, making unhardened agentic-AI runtimes architecturally obsolete.
-
LLM-Oriented Information Retrieval: A Denoising-First Perspective
Denoising to maximize usable evidence density and verifiability is becoming the primary bottleneck in LLM-oriented information retrieval, conceptualized via a four-stage framework and addressed through a pipeline taxo...
-
CAP-CoT: Cycle Adversarial Prompt for Improving Chain of Thoughts in LLM Reasoning
CAP-CoT uses iterative adversarial prompt cycles to improve CoT accuracy, stability, and robustness across six benchmarks and four LLM backbones.
-
What Security and Privacy Transparency Users Need from Consumer-Facing Generative AI
A qualitative study of 21 GenAI users finds that current S&P transparency is often seen as incomplete or untrustworthy, leading to proxy-based adoption and constrained use, with calls for independent evaluations and o...
-
Like a Hammer, It Can Build, It Can Break: Large Language Model Uses, Perceptions, and Adoption in Cybersecurity Operations on Reddit
Security practitioners use LLMs independently for low-risk productivity tasks while showing interest in enterprise platforms, but reliability, verification needs, and security risks limit broader autonomy.
-
CareGuardAI: Context-Aware Multi-Agent Guardrails for Clinical Safety & Hallucination Mitigation in Patient-Facing LLMs
CareGuardAI introduces dual risk assessments (SRA and HRA) and a multi-stage agent pipeline that only releases LLM responses when both risks score at or below 2, outperforming GPT-4o-mini on PatientSafeBench, MedSafet...
-
Security Attack and Defense Strategies for Autonomous Agent Frameworks: A Layered Review with OpenClaw as a Case Study
The survey organizes security threats and defenses in autonomous LLM agents into four layers and identifies that risks can propagate across layers from inputs to ecosystem impacts.
-
CASCADE: A Cascaded Hybrid Defense Architecture for Prompt Injection Detection in MCP-Based Systems
CASCADE is a cascaded hybrid detector that combines fast regex/entropy filtering, BGE embeddings with local LLM fallback, and output pattern checks to achieve 95.85% precision and 6.06% false-positive rate against pro...
-
Conversations Risk Detection LLMs in Financial Agents via Multi-Stage Generative Rollout
FinSec is a multi-stage detection system for financial LLM dialogues that reaches 90.13% F1 score, cuts attack success rate to 9.09%, and raises AUPRC to 0.9189.
-
Large Language Models for Agentic NetOps and AIOps: Architectures, Evaluation, and Safety
A literature survey organizing LLM agent work for NetOps and AIOps around autonomy hierarchies, workflow evaluation, and safety contracts.
Reference graph
Works this paper leans on
-
[1]
Notion.https://www.notion.so/
-
[2]
Parea AI.https://www.parea.ai/
-
[3]
https:// supertools.therundown.ai/
Supertools | Best AI Tools Guide. https:// supertools.therundown.ai/
-
[4]
https: //simonwillison.net/2022/Sep/12/prompt- injection/
Prompt Injection Attacks against GPT-3. https: //simonwillison.net/2022/Sep/12/prompt- injection/
work page 2022
-
[5]
Rate Limits OpenAI API. https://platform. openai.com/docs/guides/rate-limits
-
[6]
Real Attackers Don’t Compute Gradients
Giovanni Apruzzese, Hyrum S. Anderson, Savino Dambra, David Freeman, Fabio Pierazzi, and Kevin A. Roundy. "Real Attackers Don’t Compute Gradients": Bridging the Gap between Adversarial ML Research and Practice. InSaTML, 2023
work page 2023
-
[7]
Spinning Language Models: Risks of Propaganda-As-A-Service and Countermeasures
Eugene Bagdasaryan and Vitaly Shmatikov. Spinning Language Models: Risks of Propaganda-As-A-Service and Countermeasures. InS&P, pages 769–786. IEEE, 2022
work page 2022
-
[8]
Bender, Timnit Gebru, Angelina McMillan- Major, and Shmargaret Shmitchell
Emily M. Bender, Timnit Gebru, Angelina McMillan- Major, and Shmargaret Shmitchell. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? InFAccT, pages 610–623
-
[9]
Emergent autonomous scientific research capabilities of large language models.arXiv preprint, 2023
Daniil A Boiko, Robert MacKnight, and Gabe Gomes. Emergent autonomous scientific research capabilities of large language models.arXiv preprint, 2023
work page 2023
-
[10]
SQLrand: Preventing SQL injection attacks
Stephen W Boyd and Angelos D Keromytis. SQLrand: Preventing SQL injection attacks. InACNS, pages 292– 302, 2004
work page 2004
-
[11]
Large Language Models as Tool Makers.arXiv preprint, 2023
Tianle Cai, Xuezhi Wang, Tengyu Ma, Xinyun Chen, and Denny Zhou. Large Language Models as Tool Makers.arXiv preprint, 2023
work page 2023
-
[12]
Low-code LLM: Visual Program- ming over LLMs.arXiv preprint, 2023
Yuzhe Cai, Shaoguang Mao, Wenshan Wu, Zehua Wang, Yaobo Liang, Tao Ge, Chenfei Wu, Wang You, Ting Song, Yan Xia, et al. Low-code LLM: Visual Program- ming over LLMs.arXiv preprint, 2023
work page 2023
-
[13]
ChatAIWriter. Writesonic. https://app. writesonic.com/botsonic/780dc6b4-fbe9- 4d5e-911c-014c9367ba32
- [14]
-
[15]
Lavina Daryanani. How to Jailbreak ChatGPT. https://watcher.guru/news/how-to-jailbreak- chatgpt
-
[16]
Exploring Prompt Injection Attacks - NCC Group Research Blog. https://research.nccgroup. com/2022/12/05/exploring-prompt-injection- attacks/, Apr 2023
work page 2022
-
[17]
Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A. Smith. RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models. InEMNLP, pages 3356–3369, 2020
work page 2020
-
[18]
Google AI. PaLM 2. https://ai.google/discover/ palm2/
- [19]
-
[20]
Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. Not what you’ve signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt In- jection. InarXiv preprint, 2023
work page 2023
-
[21]
Haifeng Gu, Jianning Zhang, Tian Liu, Ming Hu, Jun- long Zhou, Tongquan Wei, and Mingsong Chen. Diava: A traffic-based framework for detection of sql injection attacks and vulnerability analysis of leaked data.IEEE Transactions on Reliability, 69(1):188–202, 2020
work page 2020
-
[22]
Prompt Engineering Guide. Defense Tactics. https: //www.promptingguide.ai/risks/adversarial
-
[23]
Cross-Site Scripting (XSS) attacks and defense mechanisms: clas- sification and state-of-the-art.Int
Shashank Gupta and Brij Bhooshan Gupta. Cross-Site Scripting (XSS) attacks and defense mechanisms: clas- sification and state-of-the-art.Int. J. Syst. Assur. Eng. Manag., 8(1s):512–530, 2017
work page 2017
-
[24]
Swot analysis: a theoretical review
Emet GURL. Swot analysis: a theoretical review. 2017
work page 2017
-
[25]
A classification of SQL-injection attacks and countermeasures
William G Halfond, Jeremy Viegas, Alessandro Orso, et al. A classification of SQL-injection attacks and countermeasures. InISSSR, volume 1, pages 13–15. IEEE, 2006
work page 2006
-
[26]
Shibo Hao, Tianyang Liu, Zhen Wang, and Zhiting Hu. ToolkenGPT: Augmenting Frozen Language Mod- els with Massive Tools via Tool Embeddings.arXiv preprint, 2023
work page 2023
-
[27]
Isatou Hydara, Abu Bakar Md Sultan, Hazura Zulzalil, and Novia Admodisastro. Current state of research on cross-site scripting (XSS)–A systematic literature review.Information and Software Technology, 58:170– 186, 2015
work page 2015
-
[28]
Lan- guage models can solve computer tasks.arXiv preprint, 2023
Geunwoo Kim, Pierre Baldi, and Stephen McAleer. Lan- guage models can solve computer tasks.arXiv preprint, 2023. 15
work page 2023
-
[29]
Api-bank: A bench- mark for tool-augmented llms.arXiv preprint, 2023
Minghao Li, Feifan Song, Bowen Yu, Haiyang Yu, Zhou- jun Li, Fei Huang, and Yongbin Li. Api-bank: A bench- mark for tool-augmented llms.arXiv preprint, 2023
work page 2023
-
[30]
Yaobo Liang, Chenfei Wu, Ting Song, Wenshan Wu, Yan Xia, Yu Liu, Yang Ou, Shuai Lu, Lei Ji, Shaoguang Mao, et al. Taskmatrix. ai: Completing tasks by con- necting foundation models with millions of apis.arXiv preprint, 2023
work page 2023
-
[31]
ChatGPT-powered Conversational Drug Editing Using Retrieval and Domain Feedback.arXiv preprint, 2023
Shengchao Liu, Jiongxiao Wang, Yijin Yang, Cheng- peng Wang, Ling Liu, Hongyu Guo, and Chaowei Xiao. ChatGPT-powered Conversational Drug Editing Using Retrieval and Domain Feedback.arXiv preprint, 2023
work page 2023
-
[32]
Adversarial training for large neural language models
Xiaodong Liu, Hao Cheng, Pengcheng He, Weizhu Chen, Yu Wang, Hoifung Poon, and Jianfeng Gao. Ad- versarial Training for Large Neural Language Models. CoRR, abs/2004.08994, 2020
-
[33]
Jailbreaking ChatGPT via Prompt Engineer- ing: An Empirical Study.arXiv preprint, 2023
Yi Liu, Gelei Deng, Zhengzi Xu, Yuekang Li, Yaowen Zheng, Ying Zhang, Lida Zhao, Tianwei Zhang, and Yang Liu. Jailbreaking ChatGPT via Prompt Engineer- ing: An Empirical Study.arXiv preprint, 2023
work page 2023
-
[34]
Potsawee Manakul, Adian Liusie, and Mark JF Gales. Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models.arXiv preprint, 2023
work page 2023
-
[35]
Sources of Hallucination by Large Language Models on Inference Tasks.arXiv preprint, 2023
Nick McKenna, Tianyi Li, Liang Cheng, Moham- mad Javad Hosseini, Mark Johnson, and Mark Steedman. Sources of Hallucination by Large Language Models on Inference Tasks.arXiv preprint, 2023
work page 2023
-
[36]
NOTABLE: Transferable Backdoor At- tacks Against Prompt-based NLP Models
Kai Mei, Zheng Li, Zhenting Wang, Yang Zhang, and Shiqing Ma. NOTABLE: Transferable Backdoor At- tacks Against Prompt-based NLP Models. InACL, 2023
work page 2023
-
[37]
Introducing LLaMA: A foundational, 65-billion-parameter large language model
Meta. Introducing LLaMA: A foundational, 65-billion-parameter large language model. https://ai.facebook.com/blog/large- language-model-llama-meta-ai
-
[38]
Evaluating the Robustness of Neural Language Models to Input Pertur- bations
Milad Moradi and Matthias Samwald. Evaluating the Robustness of Neural Language Models to Input Pertur- bations. InEMNLP 2021, pages 1558–1570, 2021
work page 2021
-
[39]
OpenAI. GPT-4. https://openai.com/research/ gpt-4
-
[40]
OWASP Top 10 List for Large Language Models version 0.1
OWASP. OWASP Top 10 List for Large Language Models version 0.1. https://owasp.org/www- project-top-10-for-large-language-model- applications/descriptions
-
[41]
Kaushik Pal. What is Jailbreaking in AI models like ChatGPT? https://www.techopedia.com/what- is-jailbreaking-in-ai-models-like-chatgpt
-
[42]
ART: Automatic multi-step reasoning and tool-use for large language models.arXiv preprint, 2023
Bhargavi Paranjape, Scott Lundberg, Sameer Singh, Hannaneh Hajishirzi, Luke Zettlemoyer, and Marco Tulio Ribeiro. ART: Automatic multi-step reasoning and tool-use for large language models.arXiv preprint, 2023
work page 2023
-
[43]
Generative agents: Interactive simulacra of human be- havior.arXiv preprint, 2023
Joon Sung Park, Joseph C O’Brien, Carrie J Cai, Mered- ith Ringel Morris, Percy Liang, and Michael S Bernstein. Generative agents: Interactive simulacra of human be- havior.arXiv preprint, 2023
work page 2023
-
[44]
Ignore Previous Prompt: Attack Techniques For Language Models
Fábio Perez and Ian Ribeiro. Ignore Previous Prompt: Attack Techniques For Language Models. InNeurIPS ML Safety Workshop, 2022
work page 2022
-
[45]
Pricing.https://openai.com/pricing
-
[46]
Learn Prompting. Instruction Defense. https://learnprompting.org/docs/prompt_ hacking/defensive_measures/instruction
-
[47]
Learn Prompting. Instruction Defense. https://learnprompting.org/docs/prompt_ hacking/defensive_measures/post_prompting
-
[48]
Learn Prompting. Prompt Leaking. https: //learnprompting.org/docs/prompt_hacking/ leaking
-
[49]
Learn Prompting. Random Sequence Enclosure. https://learnprompting.org/docs/prompt_ hacking/defensive_measures/random_sequence
-
[50]
Learn Prompting. Sandwich Defense. https: //learnprompting.org/docs/prompt_hacking/ defensive_measures/sandwich_defense
-
[51]
Learn Prompting. Separate LLM Evaluation. https://learnprompting.org/docs/prompt_ hacking/defensive_measures/llm_eval
-
[52]
Learn Prompting. XML Tagging. https: //learnprompting.org/docs/prompt_hacking/ defensive_measures/xml_tagging
-
[53]
Cheng Qian, Chi Han, Yi R Fung, Yujia Qin, Zhiyuan Liu, and Heng Ji. CREATOR: Disentangling Abstract and Concrete Reasonings of Large Language Models through Tool Creation.arXiv preprint, 2023
work page 2023
-
[54]
The Full Story of Large Language Models and RLHF
Marco Ramponi. The Full Story of Large Language Models and RLHF. https://www.assemblyai. com/blog/the-full-story-of-large-language- models-and-rlhf
-
[55]
Abhinav Rao, Sachin Vashistha, Atharva Naik, Somak Aditya, and Monojit Choudhury. Tricking LLMs into Disobedience: Understanding, Analyzing, and Prevent- ing Jailbreaks.arXiv preprint, 2023. 16
work page 2023
-
[56]
Get a Model! Model Hijacking Attack Against Machine Learning Models
Ahmed Salem, Michael Backes, and Yang Zhang. Get a Model! Model Hijacking Attack Against Machine Learning Models. InNDSS, 2022
work page 2022
-
[57]
Toolformer: Language models can teach themselves to use tools.arXiv preprint, 2023
Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Can- cedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools.arXiv preprint, 2023
work page 2023
-
[58]
Role-play with large language models.arXiv preprint, 2023
Murray Shanahan, Kyle McDonell, and Laria Reynolds. Role-play with large language models.arXiv preprint, 2023
work page 2023
-
[59]
Hugginggpt: Solv- ing ai tasks with chatgpt and its friends in huggingface
Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. Hugginggpt: Solv- ing ai tasks with chatgpt and its friends in huggingface. arXiv preprint, 2023
work page 2023
-
[60]
Why So Toxic?: Measuring and Triggering Toxic Behavior in Open-Domain Chatbots
Wai Man Si, Michael Backes, Jeremy Blackburn, Emil- iano De Cristofaro, Gianluca Stringhini, Savvas Zannet- tou, and Yang Zhang. Why So Toxic?: Measuring and Triggering Toxic Behavior in Open-Domain Chatbots. InCCS, pages 2659–2673, 2022
work page 2022
-
[61]
Two-in-One: A Model Hijacking Attack Against Text Generation Models.arXiv preprint, 2023
Wai Man Si, Michael Backes, Yang Zhang, and Ahmed Salem. Two-in-One: A Model Hijacking Attack Against Text Generation Models.arXiv preprint, 2023
work page 2023
-
[62]
Contrastive Learning Reduces Hallucination in Conversations
Weiwei Sun, Zhengliang Shi, Shen Gao, Pengjie Ren, Maarten de Rijke, and Zhaochun Ren. Contrastive Learning Reduces Hallucination in Conversations. arXiv preprint, 2022
work page 2022
-
[63]
A systematic analysis of XSS sanitization in web applica- tion frameworks
Joel Weinberger, Prateek Saxena, Devdatta Akhawe, Matthew Finifter, Richard Shin, and Dawn Song. A systematic analysis of XSS sanitization in web applica- tion frameworks. InESORICS, pages 150–171, 2011
work page 2011
-
[64]
Fundamental limitations of alignment in large language models.arXiv preprint, 2023
Yotam Wolf, Noam Wies, Yoav Levine, and Amnon Shashua. Fundamental limitations of alignment in large language models.arXiv preprint, 2023
work page 2023
-
[65]
On the Tool Manipula- tion Capability of Open-source Large Language Models
Qiantong Xu, Fenglu Hong, Bo Li, Changran Hu, Zhengyu Chen, and Jian Zhang. On the Tool Manipula- tion Capability of Open-source Large Language Models. arXiv preprint, 2023
work page 2023
-
[66]
React: Synergizing reasoning and acting in language models
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. ICLR, 2023
work page 2023
-
[67]
Interpreting the Robustness of Neural NLP Models to Textual Perturbations
Yunxiang Zhang, Liangming Pan, Samson Tan, and Min- Yen Kan. Interpreting the Robustness of Neural NLP Models to Textual Perturbations. InACL, pages 3993– 4007, 2022
work page 2022
-
[68]
Fine-mixing: Mitigating Backdoors in Fine-tuned Language Models
Zhiyuan Zhang, Lingjuan Lyu, Xingjun Ma, Chenguang Wang, and Xu Sun. Fine-mixing: Mitigating Backdoors in Fine-tuned Language Models. InEMNLP, pages 355–372, 2022. A List of Anonymized LLM-integrated Appli- cations 17 Table 5: Overview of LLM-Integrated Applications Used in Our Evaluation. We include the full list of LLM-integrated applications tested and...
work page 2022
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.