arxiv: 2306.05499 · v3 · submitted 2023-06-08 · 💻 cs.CR · cs.AI· cs.CL· cs.SE

Recognition: 2 theorem links

· Lean Theorem

Prompt Injection attack against LLM-integrated Applications

Gelei Deng, Haoyu Wang, Kailong Wang, Leo Yu Zhang, Tianwei Zhang, Xiaofeng Wang, Yang Liu, Yan Zheng, Yepang Liu, Yi Liu, Yuekang Li, Zihao Wang

Pith reviewed 2026-05-11 21:10 UTC · model grok-4.3

classification 💻 cs.CR cs.AIcs.CLcs.SE

keywords prompt injectionLLM securityblack-box attackLLM-integrated applicationsprompt theftcontext partitionHouYi

0 comments

The pith

HouYi, a black-box technique, enables prompt injection on 31 of 36 real LLM-integrated applications, allowing prompt theft and unrestricted LLM use.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper explores prompt injection risks in commercial LLM applications and finds existing attacks limited in practice. It develops HouYi, a three-part method with a pre-constructed prompt, a context-partitioning injection prompt, and a malicious payload, modeled on web injection attacks. When tested on 36 actual applications, HouYi succeeds against 31, producing outcomes such as stealing the application's own prompt and gaining arbitrary control over the LLM. Ten vendors including Notion have confirmed the issues, indicating exposure for large user bases and the need for improved protections.

Core claim

HouYi is a novel black-box prompt injection attack technique compartmentalized into a seamlessly-incorporated pre-constructed prompt, an injection prompt inducing context partition, and a malicious payload designed to fulfill attack objectives. Leveraging HouYi reveals previously unknown and severe attack outcomes such as unrestricted arbitrary LLM usage and uncomplicated application prompt theft, with 31 of 36 deployed LLM-integrated applications found susceptible.

What carries the argument

HouYi, the three-element black-box injection method (pre-constructed prompt, context-partition injection prompt, malicious payload) that bypasses application safeguards to execute attacker goals inside the LLM context.

If this is right

Application prompts can be extracted with straightforward injection sequences.
Attackers can obtain unrestricted use of the LLM backend for arbitrary tasks.
Over 85 percent of tested real-world LLM-integrated applications remain open to these attacks.
Vendor-confirmed cases show that prompt injection creates concrete risks for end users at scale.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Input handling in LLM apps may require the same isolation practices long used in web applications.
Context-partition detection could serve as a general defense layer against similar future attacks.
Automated testing tools based on HouYi might help developers identify exposure before release.

Load-bearing premise

The injection prompt can reliably create a context partition and deliver the payload across different LLM applications without detection or blocking by existing safeguards.

What would settle it

Applying HouYi to one of the 31 vulnerable applications after the addition of explicit filtering for context-partitioning phrases and checking whether the malicious payload still executes.

read the original abstract

Large Language Models (LLMs), renowned for their superior proficiency in language comprehension and generation, stimulate a vibrant ecosystem of applications around them. However, their extensive assimilation into various services introduces significant security risks. This study deconstructs the complexities and implications of prompt injection attacks on actual LLM-integrated applications. Initially, we conduct an exploratory analysis on ten commercial applications, highlighting the constraints of current attack strategies in practice. Prompted by these limitations, we subsequently formulate HouYi, a novel black-box prompt injection attack technique, which draws inspiration from traditional web injection attacks. HouYi is compartmentalized into three crucial elements: a seamlessly-incorporated pre-constructed prompt, an injection prompt inducing context partition, and a malicious payload designed to fulfill the attack objectives. Leveraging HouYi, we unveil previously unknown and severe attack outcomes, such as unrestricted arbitrary LLM usage and uncomplicated application prompt theft. We deploy HouYi on 36 actual LLM-integrated applications and discern 31 applications susceptible to prompt injection. 10 vendors have validated our discoveries, including Notion, which has the potential to impact millions of users. Our investigation illuminates both the possible risks of prompt injection attacks and the possible tactics for mitigation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

HouYi shows prompt injection succeeding on 31 of 36 real LLM apps with vendor confirmations, but the context-partition step lacks evidence of working uniformly or surviving model changes.

read the letter

HouYi shows prompt injection succeeding on 31 of 36 real LLM apps with vendor confirmations, but the context-partition step lacks evidence of working uniformly or surviving model changes. The authors first test ten commercial apps and identify where prior injection methods fall short in practice. They then build HouYi around three parts: a pre-constructed prompt that fits in naturally, an injection prompt meant to create a context partition, and a payload that achieves the goal. This produces outcomes such as stealing the app's system prompt and gaining unrestricted use of the underlying LLM. Deploying the attack across 36 actual services and securing acknowledgments from ten vendors, including Notion, supplies the concrete data that matters here. The scale of the testing and the external validations are the parts that hold up. The soft spot is the reliance on the partition step. The paper gives no breakdown of why five apps resisted, no per-app success details, and no checks against different model versions or added output filtering. If the partition only works because current LLMs follow certain instructions, the claimed generality does not follow. This paper is for security researchers and developers who integrate LLMs into applications. Readers who need examples of what actually breaks in production will get value from the results. It deserves a serious referee because the real-app evidence is hard to obtain and useful on its own. I would send it to peer review; the empirical findings justify the time even if the authors need to tighten the analysis of when and why the attack succeeds.

Referee Report

2 major / 1 minor

Summary. The paper deconstructs prompt injection attacks on LLM-integrated applications. It first analyzes ten commercial applications to highlight limitations of current strategies. Then, it proposes HouYi, a black-box technique inspired by web injection attacks, consisting of a pre-constructed prompt, an injection prompt that induces context partition, and a malicious payload. Deployed on 36 real applications, HouYi succeeds against 31, enabling outcomes such as arbitrary LLM usage and prompt theft. Ten vendors, including Notion, have validated the findings.

Significance. If the results hold, this paper is significant for demonstrating practical, severe prompt injection vulnerabilities in real LLM applications through a novel black-box method. The large-scale testing and vendor confirmations provide strong evidence that current integrations are at risk, potentially affecting millions of users, and it contributes actionable insights into both attack tactics and mitigation approaches in the field of AI security.

major comments (2)

The central claim that 31 applications are susceptible (as stated in the abstract and evaluation section) depends on the context-partition step succeeding reliably. However, the manuscript lacks a detailed analysis of the five non-vulnerable applications, including whether the partition failed or other factors intervened, and does not report on variations across different LLMs or safety mechanisms. This undermines the assessment of the attack's generality.
In the section describing HouYi, the injection prompt is presented as inducing context partition without quantitative evidence or examples showing its effectiveness across diverse applications or its resistance to existing safeguards, which is essential for supporting the severe attack outcomes claimed.

minor comments (1)

The phrasing 'discern 31 applications susceptible to prompt injection' in the abstract is slightly awkward and could be clarified.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their positive evaluation of the paper's significance and for the constructive comments. We address each major comment point by point below, indicating the revisions we will make to improve clarity and support for our claims.

read point-by-point responses

Referee: The central claim that 31 applications are susceptible (as stated in the abstract and evaluation section) depends on the context-partition step succeeding reliably. However, the manuscript lacks a detailed analysis of the five non-vulnerable applications, including whether the partition failed or other factors intervened, and does not report on variations across different LLMs or safety mechanisms. This undermines the assessment of the attack's generality.

Authors: We agree that additional detail on the unsuccessful cases would strengthen the assessment of generality. In the revised manuscript, we will add a dedicated subsection in the evaluation section analyzing the five non-vulnerable applications. Our experimental observations indicate that context partition failed in these cases primarily due to application-specific input sanitization or output filtering that disrupted the injection prompt's ability to separate contexts, rather than issues with the payload itself. Regarding variations across LLMs and safety mechanisms, the 36 tested applications represent a diverse set of real-world deployments, each integrating different backend LLMs and built-in safeguards. The consistent success of HouYi across this heterogeneous collection provides evidence of broad applicability. We will explicitly discuss this diversity and the black-box constraints that limit per-LLM instrumentation in the revision. revision: yes
Referee: In the section describing HouYi, the injection prompt is presented as inducing context partition without quantitative evidence or examples showing its effectiveness across diverse applications or its resistance to existing safeguards, which is essential for supporting the severe attack outcomes claimed.

Authors: We acknowledge the value of more direct supporting evidence for the injection prompt component. In the revised HouYi description, we will include concrete examples of the injection prompts (and their application-specific adaptations) along with a breakdown of observed context-partition success rates where distinguishable from overall attack outcomes. The prompt's effectiveness and resistance to safeguards are substantiated by its role in enabling attacks on 31 of 36 diverse applications despite the presence of various input validation and moderation layers. We will expand the text to quantify this where possible from our logs and discuss limitations, such as cases where stronger custom safeguards might interfere. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical attack evaluation on external applications

full rationale

The paper performs an exploratory analysis of ten commercial apps, proposes HouYi as a black-box technique inspired by web injection (with three explicit components: pre-constructed prompt, context-partition injection prompt, and payload), then reports direct experimental outcomes on 36 separate real-world LLM-integrated applications (31 vulnerable). No equations, fitted parameters, self-definitional loops, or load-bearing self-citations appear in the derivation chain; the central claims rest on external testing and vendor validation rather than reducing to inputs by construction. References to prior prompt-injection literature are contextual and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The paper is primarily empirical and does not rely on mathematical axioms or free parameters; the claims rest on the assumption that real applications behave as observed in the tests.

axioms (1)

domain assumption LLM applications concatenate user inputs directly into system prompts without robust separation or sanitization
This is the core premise enabling prompt injection attacks as demonstrated.

invented entities (1)

HouYi attack framework no independent evidence
purpose: To bypass limitations of existing prompt injection methods in practical LLM apps
The framework is proposed and tested in the paper without external independent validation beyond the authors' experiments.

pith-pipeline@v0.9.0 · 5542 in / 1383 out tokens · 64360 ms · 2026-05-11T21:10:53.817666+00:00 · methodology

discussion (0)

Forward citations

Cited by 47 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Ghost in the Agent: Redefining Information Flow Tracking for LLM Agents
cs.CR 2026-04 unverdicted novelty 8.0

NeuroTaint is the first taint tracking framework for LLM agents that uses offline auditing of semantic, causal, and persistent context to detect flows from untrusted sources to privileged sinks.
TRUSTDESC: Preventing Tool Poisoning in LLM Applications via Trusted Description Generation
cs.CR 2026-04 unverdicted novelty 8.0

TRUSTDESC prevents tool poisoning in LLM applications by automatically generating accurate tool descriptions from code via a three-stage pipeline of reachability analysis, description synthesis, and dynamic verification.
Supply-Chain Poisoning Attacks Against LLM Coding Agent Skill Ecosystems
cs.CR 2026-04 unverdicted novelty 8.0

DDIPE poisons LLM agent skills by embedding malicious logic in documentation examples, achieving 11.6-33.5% bypass rates across frameworks while explicit attacks are blocked, with 2.5% evading detection.
AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents
cs.CR 2024-06 unverdicted novelty 8.0

AgentDojo introduces an extensible evaluation framework populated with realistic agent tasks and security test cases to measure prompt injection robustness in tool-using LLM agents.
IPI-proxy: An Intercepting Proxy for Red-Teaming Web-Browsing AI Agents Against Indirect Prompt Injection
cs.CR 2026-05 unverdicted novelty 7.0

IPI-proxy is a toolkit using an intercepting proxy to inject indirect prompt injection attacks into live web pages for testing AI browsing agents against hidden instructions.
Can a Single Message Paralyze the AI Infrastructure? The Rise of AbO-DDoS Attacks through Targeted Mobius Injection
cs.CR 2026-05 unverdicted novelty 7.0

Mobius Injection exploits semantic closure in LLM agents to enable single-message AbO-DDoS attacks achieving up to 51x call amplification and 229x latency inflation.
The Granularity Mismatch in Agent Security: Argument-Level Provenance Solves Enforcement and Isolates the LLM Reasoning Bottleneck
cs.CR 2026-05 unverdicted novelty 7.0

PACT achieves perfect security and utility under oracle provenance by enforcing argument-level trust contracts based on semantic roles and cross-step provenance tracking, outperforming invocation-level monitors in Age...
PragLocker: Protecting Agent Intellectual Property in Untrusted Deployments via Non-Portable Prompts
cs.CR 2026-05 unverdicted novelty 7.0

PragLocker protects agent prompts as IP by building non-portable obfuscated versions that function only on the intended LLM through code-symbol semantic anchoring followed by target-model feedback noise injection.
When Alignment Isn't Enough: Response-Path Attacks on LLM Agents
cs.CR 2026-05 unverdicted novelty 7.0

A malicious relay can strategically rewrite aligned LLM outputs in BYOK agent architectures to achieve up to 99.1% attack success on benchmarks like AgentDojo and ASB.
Jailbreaking Frontier Foundation Models Through Intention Deception
cs.CR 2026-04 unverdicted novelty 7.0

A multi-turn intention-deception jailbreak achieves high success on GPT-5 and Claude models while exposing para-jailbreaking where models leak harmful information without direct refusal.
A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework
cs.CR 2026-04 unverdicted novelty 7.0

A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.
Many-Tier Instruction Hierarchy in LLM Agents
cs.CL 2026-04 unverdicted novelty 7.0

ManyIH and ManyIH-Bench address instruction conflicts in LLM agents with up to 12 privilege levels across 853 tasks, revealing frontier models achieve only ~40% accuracy.
Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents
cs.CR 2024-10 unverdicted novelty 7.0

ASB is a new benchmark that tests 10 prompt injection attacks, memory poisoning, a novel Plan-of-Thought backdoor attack, and 11 defenses on LLM agents across 13 models, finding attack success rates up to 84.3% and li...
SkillSafetyBench: Evaluating Agent Safety under Skill-Facing Attack Surfaces
cs.CR 2026-05 unverdicted novelty 6.0

SkillSafetyBench shows that localized non-user attacks via skills and artifacts can consistently induce unsafe agent behavior across domains and model backends, independent of user intent.
Leveraging RAG for Training-Free Alignment of LLMs
cs.LG 2026-05 unverdicted novelty 6.0

RAG-Pref is a training-free RAG-based alignment technique that conditions LLMs on contrastive preference samples during inference, yielding over 3.7x average improvement in agentic attack refusals when combined with o...
Adversarial SQL Injection Generation with LLM-Based Architectures
cs.CR 2026-05 unverdicted novelty 6.0

RADAGAS-GPT4o achieves a 22.73% bypass rate against 10 WAFs, succeeding more against AI/ML-based firewalls than rule-based ones.
Break the Brake, Not the Wheel: Untargeted Jailbreak via Entropy Maximization
cs.CV 2026-05 unverdicted novelty 6.0

UJEM-KL improves cross-model transferability of untargeted jailbreaks on vision-language models by maximizing entropy at decision tokens instead of forcing specific outputs.
Few-Shot Truly Benign DPO Attack for Jailbreaking LLMs
cs.CR 2026-05 unverdicted novelty 6.0

A truly benign DPO attack using 10 harmless preference pairs jailbreaks frontier LLMs by suppressing refusal behavior, achieving up to 81.73% attack success rate on GPT-4.1-nano at low cost.
When Child Inherits: Modeling and Exploiting Subagent Spawn in Multi-Agent Networks
cs.CR 2026-05 unverdicted novelty 6.0

Multi-agent LLM frameworks can spread compromises across agent boundaries via insecure memory inheritance during subagent spawning.
SkillScope: Toward Fine-Grained Least-Privilege Enforcement for Agent Skills
cs.CR 2026-05 unverdicted novelty 6.0

SkillScope detects over-privileged LLM agent skills with 94.53% F1 score via graph analysis and replay validation, finding 7,039 problematic skills in the wild and reducing violations by 88.56% while preserving task c...
LoopTrap: Termination Poisoning Attacks on LLM Agents
cs.CR 2026-05 unverdicted novelty 6.0

LoopTrap is an automated red-teaming framework that crafts termination-poisoning prompts to amplify LLM agent steps by 3.57x on average (up to 25x) across 8 agents.
ARGUS: Defending LLM Agents Against Context-Aware Prompt Injection
cs.CR 2026-05 unverdicted novelty 6.0

ARGUS defends LLM agents from context-aware prompt injections by tracking information provenance and verifying decisions against trustworthy evidence, reducing attack success to 3.8% while retaining 87.5% task utility.
LocalAlign: Enabling Generalizable Prompt Injection Defense via Generation of Near-Target Adversarial Examples for Alignment Training
cs.CR 2026-05 unverdicted novelty 6.0

LocalAlign generates near-target adversarial examples via prompting and applies margin-aware alignment training to enforce tighter boundaries against prompt injection attacks.
A Sentence Relation-Based Approach to Sanitizing Malicious Instructions
cs.CR 2026-05 unverdicted novelty 6.0

SONAR constructs a relational graph from entailment and contradiction scores to prune injected malicious sentences from LLM prompts while preserving context, achieving near-zero attack success rates.
Self-Adaptive Multi-Agent LLM-Based Security Pattern Selection for IoT Systems
cs.CR 2026-05 unverdicted novelty 6.0

ASPO combines multi-agent LLM proposals with deterministic enforcement in a MAPE-K loop to select conflict-free, resource-feasible security patterns for IoT, delivering 100% safety invariants and 21-23% tail latency/e...
FlashRT: Towards Computationally and Memory Efficient Red-Teaming for Prompt Injection and Knowledge Corruption
cs.CR 2026-04 unverdicted novelty 6.0

FlashRT delivers 2x-7x speedup and 2x-4x GPU memory reduction for prompt injection and knowledge corruption attacks on long-context LLMs versus nanoGCG.
AgentWard: A Lifecycle Security Architecture for Autonomous AI Agents
cs.CR 2026-04 conditional novelty 6.0

AgentWard organizes stage-specific security controls with cross-layer coordination to intercept threats across the full lifecycle of autonomous AI agents.
When AI reviews science: Can we trust the referee?
cs.AI 2026-04 unverdicted novelty 6.0

AI peer review systems are vulnerable to prompt injections, prestige biases, assertion strength effects, and contextual poisoning, as demonstrated by a new attack taxonomy and causal experiments on real conference sub...
RouteGuard: Internal-Signal Detection of Skill Poisoning in LLM Agents
cs.CR 2026-04 unverdicted novelty 6.0

RouteGuard uses response-conditioned attention and hidden-state alignment to detect skill poisoning in LLM agents, achieving 0.8834 F1 on Skill-Inject benchmarks and recovering 90.51% of attacks missed by lexical screening.
SafetyALFRED: Evaluating Safety-Conscious Planning of Multimodal Large Language Models
cs.AI 2026-04 unverdicted novelty 6.0

SafetyALFRED shows multimodal LLMs recognize kitchen hazards accurately in QA tests but achieve low success rates when required to mitigate those hazards through embodied planning.
TEMPLATEFUZZ: Fine-Grained Chat Template Fuzzing for Jailbreaking and Red Teaming LLMs
cs.CR 2026-04 unverdicted novelty 6.0

TEMPLATEFUZZ mutates chat templates with element-level rules and heuristic search to reach 98.2% average jailbreak success rate on twelve open-source LLMs while degrading accuracy by only 1.1%.
ClawGuard: A Runtime Security Framework for Tool-Augmented LLM Agents Against Indirect Prompt Injection
cs.CR 2026-04 unverdicted novelty 6.0

ClawGuard enforces deterministic, user-derived access constraints at tool boundaries to block indirect prompt injection without changing the underlying LLM.
ClawGuard: A Runtime Security Framework for Tool-Augmented LLM Agents Against Indirect Prompt Injection
cs.CR 2026-04 unverdicted novelty 6.0

ClawGuard enforces user-derived access constraints at tool-call boundaries to block indirect prompt injection in tool-augmented LLM agents across web, MCP, and skill injection channels.
PlanGuard: Defending Agents against Indirect Prompt Injection via Planning-based Consistency Verification
cs.CR 2026-04 unverdicted novelty 6.0

PlanGuard cuts indirect prompt injection attack success rate to 0% on the InjecAgent benchmark by verifying agent actions against a user-instruction-only plan while keeping false positives at 1.49%.
Quantifying Trust: Financial Risk Management for Trustworthy AI Agents
cs.AI 2026-04 unverdicted novelty 6.0

The paper introduces the Agentic Risk Standard (ARS) as a payment settlement framework that delivers predefined compensation for AI agent execution failures, misalignment, or unintended outcomes.
The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions
cs.CR 2024-04 unverdicted novelty 6.0

Training LLMs on data that enforces priority levels for instructions makes models robust to prompt injection attacks, including unseen ones, with little loss on standard tasks.
SecureMCP: A Policy-Enforced LLM Data Access Framework for AIoT Systems via Model Context Protocol
cs.CR 2026-05 unverdicted novelty 5.0

SecureMCP integrates RBAC with five sequential defense modules in an MCP server to achieve 82.3% policy compliance against adversarial LLM SQL queries in AIoT while preserving execution accuracy.
Architectural Obsolescence of Unhardened Agentic-AI Runtimes
cs.CR 2026-05 unverdicted novelty 5.0

OpenClaw fails to detect any of four action-audit divergence types while a hardened fork detects them all with perfect accuracy, making unhardened agentic-AI runtimes architecturally obsolete.
LLM-Oriented Information Retrieval: A Denoising-First Perspective
cs.IR 2026-05 unverdicted novelty 5.0

Denoising to maximize usable evidence density and verifiability is becoming the primary bottleneck in LLM-oriented information retrieval, conceptualized via a four-stage framework and addressed through a pipeline taxo...
CAP-CoT: Cycle Adversarial Prompt for Improving Chain of Thoughts in LLM Reasoning
cs.AI 2026-04 unverdicted novelty 5.0

CAP-CoT uses iterative adversarial prompt cycles to improve CoT accuracy, stability, and robustness across six benchmarks and four LLM backbones.
What Security and Privacy Transparency Users Need from Consumer-Facing Generative AI
cs.HC 2026-04 unverdicted novelty 5.0

A qualitative study of 21 GenAI users finds that current S&P transparency is often seen as incomplete or untrustworthy, leading to proxy-based adoption and constrained use, with calls for independent evaluations and o...
Like a Hammer, It Can Build, It Can Break: Large Language Model Uses, Perceptions, and Adoption in Cybersecurity Operations on Reddit
cs.CR 2026-04 unverdicted novelty 5.0

Security practitioners use LLMs independently for low-risk productivity tasks while showing interest in enterprise platforms, but reliability, verification needs, and security risks limit broader autonomy.
CareGuardAI: Context-Aware Multi-Agent Guardrails for Clinical Safety & Hallucination Mitigation in Patient-Facing LLMs
cs.CY 2026-04 unverdicted novelty 5.0

CareGuardAI introduces dual risk assessments (SRA and HRA) and a multi-stage agent pipeline that only releases LLM responses when both risks score at or below 2, outperforming GPT-4o-mini on PatientSafeBench, MedSafet...
Security Attack and Defense Strategies for Autonomous Agent Frameworks: A Layered Review with OpenClaw as a Case Study
cs.CR 2026-04 conditional novelty 4.0

The survey organizes security threats and defenses in autonomous LLM agents into four layers and identifies that risks can propagate across layers from inputs to ecosystem impacts.
CASCADE: A Cascaded Hybrid Defense Architecture for Prompt Injection Detection in MCP-Based Systems
cs.CR 2026-04 unverdicted novelty 4.0

CASCADE is a cascaded hybrid detector that combines fast regex/entropy filtering, BGE embeddings with local LLM fallback, and output pattern checks to achieve 95.85% precision and 6.06% false-positive rate against pro...
Conversations Risk Detection LLMs in Financial Agents via Multi-Stage Generative Rollout
cs.CR 2026-04 unverdicted novelty 4.0

FinSec is a multi-stage detection system for financial LLM dialogues that reaches 90.13% F1 score, cuts attack success rate to 9.09%, and raises AUPRC to 0.9189.
Large Language Models for Agentic NetOps and AIOps: Architectures, Evaluation, and Safety
cs.NI 2026-05 unverdicted novelty 3.0

A literature survey organizing LLM agent work for NetOps and AIOps around autonomy hierarchies, workflow evaluation, and safety contracts.

Reference graph

Works this paper leans on

68 extracted references · 68 canonical work pages · cited by 46 Pith papers

[1]

Notion.https://www.notion.so/

work page
[2]

Parea AI.https://www.parea.ai/

work page
[3]

https:// supertools.therundown.ai/

Supertools | Best AI Tools Guide. https:// supertools.therundown.ai/

work page
[4]

https: //simonwillison.net/2022/Sep/12/prompt- injection/

Prompt Injection Attacks against GPT-3. https: //simonwillison.net/2022/Sep/12/prompt- injection/

work page 2022
[5]

https://platform

Rate Limits OpenAI API. https://platform. openai.com/docs/guides/rate-limits

work page
[6]

Real Attackers Don’t Compute Gradients

Giovanni Apruzzese, Hyrum S. Anderson, Savino Dambra, David Freeman, Fabio Pierazzi, and Kevin A. Roundy. "Real Attackers Don’t Compute Gradients": Bridging the Gap between Adversarial ML Research and Practice. InSaTML, 2023

work page 2023
[7]

Spinning Language Models: Risks of Propaganda-As-A-Service and Countermeasures

Eugene Bagdasaryan and Vitaly Shmatikov. Spinning Language Models: Risks of Propaganda-As-A-Service and Countermeasures. InS&P, pages 769–786. IEEE, 2022

work page 2022
[8]

Bender, Timnit Gebru, Angelina McMillan- Major, and Shmargaret Shmitchell

Emily M. Bender, Timnit Gebru, Angelina McMillan- Major, and Shmargaret Shmitchell. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? InFAccT, pages 610–623

work page
[9]

Emergent autonomous scientific research capabilities of large language models.arXiv preprint, 2023

Daniil A Boiko, Robert MacKnight, and Gabe Gomes. Emergent autonomous scientific research capabilities of large language models.arXiv preprint, 2023

work page 2023
[10]

SQLrand: Preventing SQL injection attacks

Stephen W Boyd and Angelos D Keromytis. SQLrand: Preventing SQL injection attacks. InACNS, pages 292– 302, 2004

work page 2004
[11]

Large Language Models as Tool Makers.arXiv preprint, 2023

Tianle Cai, Xuezhi Wang, Tengyu Ma, Xinyun Chen, and Denny Zhou. Large Language Models as Tool Makers.arXiv preprint, 2023

work page 2023
[12]

Low-code LLM: Visual Program- ming over LLMs.arXiv preprint, 2023

Yuzhe Cai, Shaoguang Mao, Wenshan Wu, Zehua Wang, Yaobo Liang, Tao Ge, Chenfei Wu, Wang You, Ting Song, Yan Xia, et al. Low-code LLM: Visual Program- ming over LLMs.arXiv preprint, 2023

work page 2023
[13]

Writesonic

ChatAIWriter. Writesonic. https://app. writesonic.com/botsonic/780dc6b4-fbe9- 4d5e-911c-014c9367ba32

work page
[14]

Else- vier, 2009

Justin Clarke.SQL injection attacks and defense. Else- vier, 2009

work page 2009
[15]

How to Jailbreak ChatGPT

Lavina Daryanani. How to Jailbreak ChatGPT. https://watcher.guru/news/how-to-jailbreak- chatgpt

work page
[16]

https://research.nccgroup

Exploring Prompt Injection Attacks - NCC Group Research Blog. https://research.nccgroup. com/2022/12/05/exploring-prompt-injection- attacks/, Apr 2023

work page 2022
[17]

Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A. Smith. RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models. InEMNLP, pages 3356–3369, 2020

work page 2020
[18]

Google AI. PaLM 2. https://ai.google/discover/ palm2/

work page
[19]

Auto-GPT

Significant Gravitas. Auto-GPT. https://github. com/Significant-Gravitas/Auto-GPT

work page
[20]

Not what you’ve signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt In- jection

Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. Not what you’ve signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt In- jection. InarXiv preprint, 2023

work page 2023
[21]

Diava: A traffic-based framework for detection of sql injection attacks and vulnerability analysis of leaked data.IEEE Transactions on Reliability, 69(1):188–202, 2020

Haifeng Gu, Jianning Zhang, Tian Liu, Ming Hu, Jun- long Zhou, Tongquan Wei, and Mingsong Chen. Diava: A traffic-based framework for detection of sql injection attacks and vulnerability analysis of leaked data.IEEE Transactions on Reliability, 69(1):188–202, 2020

work page 2020
[22]

Defense Tactics

Prompt Engineering Guide. Defense Tactics. https: //www.promptingguide.ai/risks/adversarial

work page
[23]

Cross-Site Scripting (XSS) attacks and defense mechanisms: clas- sification and state-of-the-art.Int

Shashank Gupta and Brij Bhooshan Gupta. Cross-Site Scripting (XSS) attacks and defense mechanisms: clas- sification and state-of-the-art.Int. J. Syst. Assur. Eng. Manag., 8(1s):512–530, 2017

work page 2017
[24]

Swot analysis: a theoretical review

Emet GURL. Swot analysis: a theoretical review. 2017

work page 2017
[25]

A classification of SQL-injection attacks and countermeasures

William G Halfond, Jeremy Viegas, Alessandro Orso, et al. A classification of SQL-injection attacks and countermeasures. InISSSR, volume 1, pages 13–15. IEEE, 2006

work page 2006
[26]

ToolkenGPT: Augmenting Frozen Language Mod- els with Massive Tools via Tool Embeddings.arXiv preprint, 2023

Shibo Hao, Tianyang Liu, Zhen Wang, and Zhiting Hu. ToolkenGPT: Augmenting Frozen Language Mod- els with Massive Tools via Tool Embeddings.arXiv preprint, 2023

work page 2023
[27]

Current state of research on cross-site scripting (XSS)–A systematic literature review.Information and Software Technology, 58:170– 186, 2015

Isatou Hydara, Abu Bakar Md Sultan, Hazura Zulzalil, and Novia Admodisastro. Current state of research on cross-site scripting (XSS)–A systematic literature review.Information and Software Technology, 58:170– 186, 2015

work page 2015
[28]

Lan- guage models can solve computer tasks.arXiv preprint, 2023

Geunwoo Kim, Pierre Baldi, and Stephen McAleer. Lan- guage models can solve computer tasks.arXiv preprint, 2023. 15

work page 2023
[29]

Api-bank: A bench- mark for tool-augmented llms.arXiv preprint, 2023

Minghao Li, Feifan Song, Bowen Yu, Haiyang Yu, Zhou- jun Li, Fei Huang, and Yongbin Li. Api-bank: A bench- mark for tool-augmented llms.arXiv preprint, 2023

work page 2023
[30]

Taskmatrix

Yaobo Liang, Chenfei Wu, Ting Song, Wenshan Wu, Yan Xia, Yu Liu, Yang Ou, Shuai Lu, Lei Ji, Shaoguang Mao, et al. Taskmatrix. ai: Completing tasks by con- necting foundation models with millions of apis.arXiv preprint, 2023

work page 2023
[31]

ChatGPT-powered Conversational Drug Editing Using Retrieval and Domain Feedback.arXiv preprint, 2023

Shengchao Liu, Jiongxiao Wang, Yijin Yang, Cheng- peng Wang, Ling Liu, Hongyu Guo, and Chaowei Xiao. ChatGPT-powered Conversational Drug Editing Using Retrieval and Domain Feedback.arXiv preprint, 2023

work page 2023
[32]

Adversarial training for large neural language models

Xiaodong Liu, Hao Cheng, Pengcheng He, Weizhu Chen, Yu Wang, Hoifung Poon, and Jianfeng Gao. Ad- versarial Training for Large Neural Language Models. CoRR, abs/2004.08994, 2020

work page arXiv 2004
[33]

Jailbreaking ChatGPT via Prompt Engineer- ing: An Empirical Study.arXiv preprint, 2023

Yi Liu, Gelei Deng, Zhengzi Xu, Yuekang Li, Yaowen Zheng, Ying Zhang, Lida Zhao, Tianwei Zhang, and Yang Liu. Jailbreaking ChatGPT via Prompt Engineer- ing: An Empirical Study.arXiv preprint, 2023

work page 2023
[34]

Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models.arXiv preprint, 2023

Potsawee Manakul, Adian Liusie, and Mark JF Gales. Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models.arXiv preprint, 2023

work page 2023
[35]

Sources of Hallucination by Large Language Models on Inference Tasks.arXiv preprint, 2023

Nick McKenna, Tianyi Li, Liang Cheng, Moham- mad Javad Hosseini, Mark Johnson, and Mark Steedman. Sources of Hallucination by Large Language Models on Inference Tasks.arXiv preprint, 2023

work page 2023
[36]

NOTABLE: Transferable Backdoor At- tacks Against Prompt-based NLP Models

Kai Mei, Zheng Li, Zhenting Wang, Yang Zhang, and Shiqing Ma. NOTABLE: Transferable Backdoor At- tacks Against Prompt-based NLP Models. InACL, 2023

work page 2023
[37]

Introducing LLaMA: A foundational, 65-billion-parameter large language model

Meta. Introducing LLaMA: A foundational, 65-billion-parameter large language model. https://ai.facebook.com/blog/large- language-model-llama-meta-ai

work page
[38]

Evaluating the Robustness of Neural Language Models to Input Pertur- bations

Milad Moradi and Matthias Samwald. Evaluating the Robustness of Neural Language Models to Input Pertur- bations. InEMNLP 2021, pages 1558–1570, 2021

work page 2021
[39]

OpenAI. GPT-4. https://openai.com/research/ gpt-4

work page
[40]

OWASP Top 10 List for Large Language Models version 0.1

OWASP. OWASP Top 10 List for Large Language Models version 0.1. https://owasp.org/www- project-top-10-for-large-language-model- applications/descriptions

work page
[41]

What is Jailbreaking in AI models like ChatGPT? https://www.techopedia.com/what- is-jailbreaking-in-ai-models-like-chatgpt

Kaushik Pal. What is Jailbreaking in AI models like ChatGPT? https://www.techopedia.com/what- is-jailbreaking-in-ai-models-like-chatgpt

work page
[42]

ART: Automatic multi-step reasoning and tool-use for large language models.arXiv preprint, 2023

Bhargavi Paranjape, Scott Lundberg, Sameer Singh, Hannaneh Hajishirzi, Luke Zettlemoyer, and Marco Tulio Ribeiro. ART: Automatic multi-step reasoning and tool-use for large language models.arXiv preprint, 2023

work page 2023
[43]

Generative agents: Interactive simulacra of human be- havior.arXiv preprint, 2023

Joon Sung Park, Joseph C O’Brien, Carrie J Cai, Mered- ith Ringel Morris, Percy Liang, and Michael S Bernstein. Generative agents: Interactive simulacra of human be- havior.arXiv preprint, 2023

work page 2023
[44]

Ignore Previous Prompt: Attack Techniques For Language Models

Fábio Perez and Ian Ribeiro. Ignore Previous Prompt: Attack Techniques For Language Models. InNeurIPS ML Safety Workshop, 2022

work page 2022
[45]

Pricing.https://openai.com/pricing

work page
[46]

Instruction Defense

Learn Prompting. Instruction Defense. https://learnprompting.org/docs/prompt_ hacking/defensive_measures/instruction

work page
[47]

Instruction Defense

Learn Prompting. Instruction Defense. https://learnprompting.org/docs/prompt_ hacking/defensive_measures/post_prompting

work page
[48]

Prompt Leaking

Learn Prompting. Prompt Leaking. https: //learnprompting.org/docs/prompt_hacking/ leaking

work page
[49]

Random Sequence Enclosure

Learn Prompting. Random Sequence Enclosure. https://learnprompting.org/docs/prompt_ hacking/defensive_measures/random_sequence

work page
[50]

Sandwich Defense

Learn Prompting. Sandwich Defense. https: //learnprompting.org/docs/prompt_hacking/ defensive_measures/sandwich_defense

work page
[51]

Separate LLM Evaluation

Learn Prompting. Separate LLM Evaluation. https://learnprompting.org/docs/prompt_ hacking/defensive_measures/llm_eval

work page
[52]

XML Tagging

Learn Prompting. XML Tagging. https: //learnprompting.org/docs/prompt_hacking/ defensive_measures/xml_tagging

work page
[53]

CREATOR: Disentangling Abstract and Concrete Reasonings of Large Language Models through Tool Creation.arXiv preprint, 2023

Cheng Qian, Chi Han, Yi R Fung, Yujia Qin, Zhiyuan Liu, and Heng Ji. CREATOR: Disentangling Abstract and Concrete Reasonings of Large Language Models through Tool Creation.arXiv preprint, 2023

work page 2023
[54]

The Full Story of Large Language Models and RLHF

Marco Ramponi. The Full Story of Large Language Models and RLHF. https://www.assemblyai. com/blog/the-full-story-of-large-language- models-and-rlhf

work page
[55]

Tricking LLMs into Disobedience: Understanding, Analyzing, and Prevent- ing Jailbreaks.arXiv preprint, 2023

Abhinav Rao, Sachin Vashistha, Atharva Naik, Somak Aditya, and Monojit Choudhury. Tricking LLMs into Disobedience: Understanding, Analyzing, and Prevent- ing Jailbreaks.arXiv preprint, 2023. 16

work page 2023
[56]

Get a Model! Model Hijacking Attack Against Machine Learning Models

Ahmed Salem, Michael Backes, and Yang Zhang. Get a Model! Model Hijacking Attack Against Machine Learning Models. InNDSS, 2022

work page 2022
[57]

Toolformer: Language models can teach themselves to use tools.arXiv preprint, 2023

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Can- cedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools.arXiv preprint, 2023

work page 2023
[58]

Role-play with large language models.arXiv preprint, 2023

Murray Shanahan, Kyle McDonell, and Laria Reynolds. Role-play with large language models.arXiv preprint, 2023

work page 2023
[59]

Hugginggpt: Solv- ing ai tasks with chatgpt and its friends in huggingface

Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. Hugginggpt: Solv- ing ai tasks with chatgpt and its friends in huggingface. arXiv preprint, 2023

work page 2023
[60]

Why So Toxic?: Measuring and Triggering Toxic Behavior in Open-Domain Chatbots

Wai Man Si, Michael Backes, Jeremy Blackburn, Emil- iano De Cristofaro, Gianluca Stringhini, Savvas Zannet- tou, and Yang Zhang. Why So Toxic?: Measuring and Triggering Toxic Behavior in Open-Domain Chatbots. InCCS, pages 2659–2673, 2022

work page 2022
[61]

Two-in-One: A Model Hijacking Attack Against Text Generation Models.arXiv preprint, 2023

Wai Man Si, Michael Backes, Yang Zhang, and Ahmed Salem. Two-in-One: A Model Hijacking Attack Against Text Generation Models.arXiv preprint, 2023

work page 2023
[62]

Contrastive Learning Reduces Hallucination in Conversations

Weiwei Sun, Zhengliang Shi, Shen Gao, Pengjie Ren, Maarten de Rijke, and Zhaochun Ren. Contrastive Learning Reduces Hallucination in Conversations. arXiv preprint, 2022

work page 2022
[63]

A systematic analysis of XSS sanitization in web applica- tion frameworks

Joel Weinberger, Prateek Saxena, Devdatta Akhawe, Matthew Finifter, Richard Shin, and Dawn Song. A systematic analysis of XSS sanitization in web applica- tion frameworks. InESORICS, pages 150–171, 2011

work page 2011
[64]

Fundamental limitations of alignment in large language models.arXiv preprint, 2023

Yotam Wolf, Noam Wies, Yoav Levine, and Amnon Shashua. Fundamental limitations of alignment in large language models.arXiv preprint, 2023

work page 2023
[65]

On the Tool Manipula- tion Capability of Open-source Large Language Models

Qiantong Xu, Fenglu Hong, Bo Li, Changran Hu, Zhengyu Chen, and Jian Zhang. On the Tool Manipula- tion Capability of Open-source Large Language Models. arXiv preprint, 2023

work page 2023
[66]

React: Synergizing reasoning and acting in language models

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. ICLR, 2023

work page 2023
[67]

Interpreting the Robustness of Neural NLP Models to Textual Perturbations

Yunxiang Zhang, Liangming Pan, Samson Tan, and Min- Yen Kan. Interpreting the Robustness of Neural NLP Models to Textual Perturbations. InACL, pages 3993– 4007, 2022

work page 2022
[68]

Fine-mixing: Mitigating Backdoors in Fine-tuned Language Models

Zhiyuan Zhang, Lingjuan Lyu, Xingjun Ma, Chenguang Wang, and Xu Sun. Fine-mixing: Mitigating Backdoors in Fine-tuned Language Models. InEMNLP, pages 355–372, 2022. A List of Anonymized LLM-integrated Appli- cations 17 Table 5: Overview of LLM-Integrated Applications Used in Our Evaluation. We include the full list of LLM-integrated applications tested and...

work page 2022