arxiv: 2302.12173 · v2 · submitted 2023-02-23 · 💻 cs.CR · cs.AI· cs.CL· cs.CY

Recognition: 1 theorem link

Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection

Christoph Endres, Kai Greshake, Mario Fritz, Sahar Abdelnabi, Shailesh Mishra, Thorsten Holz

Pith reviewed 2026-05-11 17:12 UTC · model grok-4.3

classification 💻 cs.CR cs.AIcs.CLcs.CY

keywords indirect prompt injectionLLM securityadversarial promptingprompt injection attacksLLM applicationsdata retrievalAI vulnerabilities

0 comments

The pith

Adversaries can remotely compromise LLM-integrated applications by injecting prompts into retrievable data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that integrating LLMs into applications creates new risks because the models process retrieved data as if it contains instructions. This allows attackers to use indirect prompt injection to override original commands without directly interacting with the system. The authors develop a taxonomy of potential harms including data theft and ecosystem contamination, and show these attacks working on real tools like Bing's GPT-4 chat and code completion engines. They conclude that without new defenses, reliance on LLMs in apps leaves users and systems exposed.

Core claim

Indirect Prompt Injection attacks succeed because LLM-integrated applications do not distinguish between data and instructions, allowing strategically placed prompts in external data to be retrieved, processed, and executed by the model as overriding commands.

What carries the argument

Indirect Prompt Injection mechanism, which embeds adversarial instructions in data sources that the application retrieves and feeds to the LLM.

If this is right

LLM apps can be tricked into stealing and exfiltrating user data to the attacker.
Attacks can propagate like worms by injecting prompts that cause further data contamination.
Application behavior can be altered to call APIs in unintended ways or manipulate outputs.
The overall information ecosystem can be poisoned through controlled content generation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

System designers may need to implement strict separation between user instructions and retrieved content.
Retrieval-augmented generation systems are particularly vulnerable and require new verification steps.
Testing LLMs with mixed data and instructions could reveal if they reliably ignore injected commands.

Load-bearing premise

The LLM will interpret and follow instructions found in retrieved external data without recognizing them as separate from or subordinate to the original user prompt.

What would settle it

Observe whether an LLM app follows a user query or an opposing instruction hidden in a retrieved document when both are present.

read the original abstract

Large Language Models (LLMs) are increasingly being integrated into various applications. The functionalities of recent LLMs can be flexibly modulated via natural language prompts. This renders them susceptible to targeted adversarial prompting, e.g., Prompt Injection (PI) attacks enable attackers to override original instructions and employed controls. So far, it was assumed that the user is directly prompting the LLM. But, what if it is not the user prompting? We argue that LLM-Integrated Applications blur the line between data and instructions. We reveal new attack vectors, using Indirect Prompt Injection, that enable adversaries to remotely (without a direct interface) exploit LLM-integrated applications by strategically injecting prompts into data likely to be retrieved. We derive a comprehensive taxonomy from a computer security perspective to systematically investigate impacts and vulnerabilities, including data theft, worming, information ecosystem contamination, and other novel security risks. We demonstrate our attacks' practical viability against both real-world systems, such as Bing's GPT-4 powered Chat and code-completion engines, and synthetic applications built on GPT-4. We show how processing retrieved prompts can act as arbitrary code execution, manipulate the application's functionality, and control how and if other APIs are called. Despite the increasing integration and reliance on LLMs, effective mitigations of these emerging threats are currently lacking. By raising awareness of these vulnerabilities and providing key insights into their implications, we aim to promote the safe and responsible deployment of these powerful models and the development of robust defenses that protect users and systems from potential attacks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper shows indirect prompt injection through retrieved data is practical on current LLM apps like Bing GPT-4, but the attacks rest on brittle model behaviors that future boundaries could close.

read the letter

The main thing to know is that this work takes prompt injection out of the direct user interface and into external data channels. Attackers can plant instructions in web pages, emails, or databases that an LLM-integrated app will later retrieve and treat as commands. That extension is the actual new piece, and the authors back it with a security-focused taxonomy plus working examples on production systems such as Bing's GPT-4 chat and code-completion tools. Those real-system demonstrations are the strongest part; they move the discussion past theory and show concrete outcomes like data theft or unwanted API calls. The paper also correctly notes that current applications lack reliable separation between data and instructions, which is why the attacks succeed today. On the softer side, the central mechanism is emergent rather than guaranteed. It depends on prompt ordering, context length, and the absence of summarization or tagging layers. If developers add even basic input sanitization or structured retrieval, the surface shrinks quickly. The abstract gives no success rates or error breakdowns, so the full paper needs to clarify how often these attacks land and under what conditions they fail. The taxonomy is useful for mapping risks, but it would be stronger with more discussion of how existing defenses like output filtering or prompt isolation might blunt the vectors. This paper is aimed at security researchers and teams building LLM wrappers. Anyone working on retrieval-augmented generation or external-data apps will find the examples worth reading. It deserves a serious referee because the threat model is timely and the demonstrations are grounded in actual deployed systems, even if the long-term robustness needs more scrutiny.

Referee Report

2 major / 2 minor

Summary. The paper claims that LLM-integrated applications are vulnerable to a new class of Indirect Prompt Injection (IPI) attacks. Adversaries can remotely compromise these systems by embedding malicious instructions in external data sources (web pages, emails, databases) that the application is likely to retrieve and concatenate into the LLM context. The work derives a security-oriented taxonomy of resulting impacts (data theft, worming, information ecosystem contamination, API manipulation), demonstrates concrete attacks on production systems including Bing's GPT-4 Chat and code-completion engines as well as synthetic GPT-4 applications, and argues that retrieved prompts can effectively act as arbitrary code execution. It concludes that current mitigations are insufficient and calls for improved defenses.

Significance. If the demonstrations hold, the paper makes a timely and practically relevant contribution to LLM security by surfacing an attack surface that arises precisely from the data-instruction blurring inherent in retrieval-augmented LLM applications. The taxonomy supplies a useful organizing framework, and the real-world case studies on commercial GPT-4 deployments provide concrete evidence that the vector is already exploitable. These elements could directly inform both defensive research (e.g., prompt isolation, structured retrieval) and responsible deployment practices. The empirical focus on production systems is a clear strength.

major comments (2)

[§5] §5 (Demonstrations / Evaluation): The central claim of 'practical viability' against real-world systems rests on qualitative descriptions of successful attacks on Bing Chat and code-completion engines. No success rates, trial counts, context-length sensitivity, or failure-mode analysis are reported, nor are the exact injected prompts or retrieval conditions provided. This omission is load-bearing because LLM behavior is non-deterministic and prompt ordering / summarization can suppress the attack; without these metrics the reproducibility and robustness of the vector cannot be assessed.
[§3] §3 (Taxonomy): Several high-impact categories (e.g., worming, ecosystem contamination) are defined but the mapping from the concrete demonstrations to these categories is only partially instantiated. The paper extends the observed Bing/Chat behaviors to the full taxonomy largely by construction rather than by additional targeted experiments, weakening the claim that the taxonomy comprehensively captures realized risks.

minor comments (2)

[Abstract / §1] The abstract and introduction repeatedly use 'arbitrary code execution' as an analogy; a brief clarification of the precise boundary (what the LLM can and cannot do via the retrieved prompt) would prevent over-interpretation.
[§2] Related-work discussion of prior direct prompt-injection papers is present but could more explicitly contrast the indirect setting with respect to attacker capabilities and detection surfaces.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed review. The comments highlight important areas for improving the rigor of our evaluation and the clarity of our taxonomy. We address each major comment below, indicating planned revisions to the manuscript.

read point-by-point responses

Referee: [§5] §5 (Demonstrations / Evaluation): The central claim of 'practical viability' against real-world systems rests on qualitative descriptions of successful attacks on Bing Chat and code-completion engines. No success rates, trial counts, context-length sensitivity, or failure-mode analysis are reported, nor are the exact injected prompts or retrieval conditions provided. This omission is load-bearing because LLM behavior is non-deterministic and prompt ordering / summarization can suppress the attack; without these metrics the reproducibility and robustness of the vector cannot be assessed.

Authors: We acknowledge the value of quantitative metrics for assessing robustness. Our Section 5 demonstrations were designed as proof-of-concept case studies on live production systems, where repeated quantitative trials raise ethical and practical issues (e.g., potential service disruption or model changes over time). In the revision we will add the specific injected prompts and retrieval conditions used, describe the number of trials performed where feasible, and include a discussion of observed failure modes and context-length effects based on our experiments. This will improve reproducibility while preserving the real-world focus. revision: partial
Referee: [§3] §3 (Taxonomy): Several high-impact categories (e.g., worming, ecosystem contamination) are defined but the mapping from the concrete demonstrations to these categories is only partially instantiated. The paper extends the observed Bing/Chat behaviors to the full taxonomy largely by construction rather than by additional targeted experiments, weakening the claim that the taxonomy comprehensively captures realized risks.

Authors: The taxonomy organizes risks according to the core mechanism of indirect prompt injection, which grants the LLM effective control over its own context and downstream actions. Demonstrations on Bing and code-completion engines directly instantiate data theft and API manipulation; synthetic GPT-4 applications instantiate worming and contamination. We will revise Section 3 to add an explicit mapping (e.g., a table) that distinguishes directly demonstrated cases from logical extensions of the same mechanism, thereby clarifying the scope without new experiments. revision: yes

Circularity Check

0 steps flagged

No significant circularity: empirical attack demonstrations without derivations or self-referential fits

full rationale

The paper contains no mathematical derivations, equations, fitted parameters, or first-principles claims that could reduce to their own inputs. Its central contributions are a taxonomy of indirect prompt injection risks and practical demonstrations against external production systems (e.g., Bing Chat, code-completion engines) and synthetic GPT-4 setups. These rest on observable behaviors in real applications rather than any self-definition, self-citation load-bearing premise, or renaming of known results. The analysis is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that current LLMs process all natural language input uniformly without reliable separation of instructions from data content.

axioms (1)

domain assumption LLMs treat retrieved external text as executable instructions equivalent to direct user prompts.
This assumption underpins why injected prompts in data sources can override application controls.

pith-pipeline@v0.9.0 · 5598 in / 1123 out tokens · 40157 ms · 2026-05-11T17:12:20.679350+00:00 · methodology

discussion (0)

Forward citations

Cited by 40 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

From Prompt to Physical Actuation: Holistic Threat Modeling of LLM-Enabled Robotic Systems
cs.CR 2026-04 unverdicted novelty 8.0

A unified threat model for LLM-enabled robots reveals three cross-boundary attack chains from user input to unsafe physical actuation due to missing validations and unmediated crossings.
IPI-proxy: An Intercepting Proxy for Red-Teaming Web-Browsing AI Agents Against Indirect Prompt Injection
cs.CR 2026-05 unverdicted novelty 7.0

IPI-proxy is a toolkit using an intercepting proxy to inject indirect prompt injection attacks into live web pages for testing AI browsing agents against hidden instructions.
Jailbroken Frontier Models Retain Their Capabilities
cs.LG 2026-04 unverdicted novelty 7.0

Jailbreak-induced performance loss shrinks as model capability grows, with the strongest models showing almost no degradation on benchmarks.
A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework
cs.CR 2026-04 unverdicted novelty 7.0

A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.
Cross-Session Threats in AI Agents: Benchmark, Evaluation, and Algorithms
cs.CR 2026-04 conditional novelty 7.0

Introduces CSTM-Bench with 26 cross-session attack taxonomies, demonstrates recall loss in session-bound and full-log detectors, and proposes a bounded-memory coreset reader with the CSTM metric balancing detection an...
Conjunctive Prompt Attacks in Multi-Agent LLM Systems
cs.MA 2026-04 unverdicted novelty 7.0

Conjunctive prompt attacks split adversarial elements across agents and routing paths in multi-agent LLM systems, evading isolated defenses and succeeding through topology-aware optimization.
Many-Tier Instruction Hierarchy in LLM Agents
cs.CL 2026-04 unverdicted novelty 7.0

ManyIH and ManyIH-Bench address instruction conflicts in LLM agents with up to 12 privilege levels across 853 tasks, revealing frontier models achieve only ~40% accuracy.
Sleeper Channels and Provenance Gates: Persistent Prompt Injection in Always-on Autonomous AI Agents
cs.CR 2026-05 conditional novelty 6.0

Sleeper channels enable persistent prompt injection in always-on AI agents via persistence substrate and firing separation, countered by provenance gates using action digests and owner attestations with a soundness theorem.
Not Just RLHF: Why Alignment Alone Won't Fix Multi-Agent Sycophancy
cs.LG 2026-05 unverdicted novelty 6.0

Pretrained base models exhibit higher yield to peer disagreement than RLHF instruct variants, with the effect localized to mid-layer attention and mitigated by structured dissent rather than prompt defenses.
Multi-Stream LLMs: Unblocking Language Models with Parallel Streams of Thoughts, Inputs and Outputs
cs.LG 2026-05 unverdicted novelty 6.0

Language models trained on parallel streams of computation can overcome single-stream bottlenecks in autonomous agents by enabling simultaneous reading, thinking, and acting.
When Agents Overtrust Environmental Evidence: An Extensible Agentic Framework for Benchmarking Evidence-Grounding Defects in LLM Agents
cs.AI 2026-05 unverdicted novelty 6.0

EnvTrustBench benchmarks evidence-grounding defects in LLM agents and finds they occur consistently across workflows.
When Agents Overtrust Environmental Evidence: An Extensible Agentic Framework for Benchmarking Evidence-Grounding Defects in LLM Agents
cs.AI 2026-05 unverdicted novelty 6.0

EnvTrustBench is a new agentic benchmark that measures evidence-grounding defects where LLM agents overtrust faulty environmental observations and take incorrect actions.
Self-Adaptive Multi-Agent LLM-Based Security Pattern Selection for IoT Systems
cs.CR 2026-05 unverdicted novelty 6.0

ASPO combines multi-agent LLM proposals with deterministic enforcement in a MAPE-K loop to select conflict-free, resource-feasible security patterns for IoT, delivering 100% safety invariants and 21-23% tail latency/e...
Semia: Auditing Agent Skills via Constraint-Guided Representation Synthesis
cs.CR 2026-05 unverdicted novelty 6.0

Semia synthesizes Datalog representations of agent skills via constraint-guided loops to enable reachability queries for semantic risks, finding critical issues in over half of 13,728 real skills with 97.7% recall on ...
AgentWard: A Lifecycle Security Architecture for Autonomous AI Agents
cs.CR 2026-04 conditional novelty 6.0

AgentWard organizes stage-specific security controls with cross-layer coordination to intercept threats across the full lifecycle of autonomous AI agents.
From Skill Text to Skill Structure: The Scheduling-Structural-Logical Representation for Agent Skills
cs.CL 2026-04 unverdicted novelty 6.0

SSL representation disentangles skill scheduling, structure, and logic using an LLM normalizer, improving skill discovery MRR@50 from 0.649 to 0.729 and risk assessment macro F1 from 0.409 to 0.509 over text baselines.
Evaluation of Prompt Injection Defenses in Large Language Models
cs.CR 2026-04 unverdicted novelty 6.0

Output filtering implemented in application code is the only defense that survived an adaptive prompt-injection attacker across 15,000 attacks; model-based defenses all broke.
An AI Agent Execution Environment to Safeguard User Data
cs.CR 2026-04 unverdicted novelty 6.0

GAAP guarantees confidentiality of private user data for AI agents by enforcing user-specified permissions deterministically through persistent information flow tracking, without trusting the agent or requiring attack...
Towards Understanding the Robustness of Sparse Autoencoders
cs.LG 2026-04 unverdicted novelty 6.0

Integrating pretrained sparse autoencoders into LLM residual streams reduces jailbreak success rates by up to 5x across multiple models and attacks.
Owner-Harm: A Missing Threat Model for AI Agent Safety
cs.CR 2026-04 unverdicted novelty 6.0

Owner-Harm is a new threat model with eight categories of agent behavior that harms the deployer, and existing defenses achieve only 14.8% true positive rate on injection-based owner-harm tasks versus 100% on generic ...
MCPThreatHive: Automated Threat Intelligence for Model Context Protocol Ecosystems
cs.CR 2026-04 unverdicted novelty 6.0

MCPThreatHive automates the full lifecycle of threat intelligence for MCP agentic systems using a new 38-pattern taxonomy mapped to STRIDE and OWASP frameworks plus composite risk scoring.
LLM-Guided Prompt Evolution for Password Guessing
cs.CR 2026-04 unverdicted novelty 6.0

LLM-guided evolutionary prompt optimization using MAP-Elites and island models raises password cracking rates from 2.02% to 8.48% on a RockYou-derived test set across local, cloud, and ensemble LLM setups.
PlanGuard: Defending Agents against Indirect Prompt Injection via Planning-based Consistency Verification
cs.CR 2026-04 unverdicted novelty 6.0

PlanGuard cuts indirect prompt injection attack success rate to 0% on the InjecAgent benchmark by verifying agent actions against a user-instruction-only plan while keeping false positives at 1.49%.
When Safety Geometry Collapses: Fine-Tuning Vulnerabilities in Agentic Guard Models
cs.LG 2026-04 unverdicted novelty 6.0

Benign fine-tuning collapses safety geometry in guard models like Granite Guardian, dropping refusal to 0%, but Fisher-Weighted Safety Subspace Regularization restores it to 75% while improving robustness.
Your Agent, Their Asset: A Real-World Safety Analysis of OpenClaw
cs.CR 2026-04 conditional novelty 6.0

Poisoning any single CIK dimension of an AI agent raises average attack success rate from 24.6% to 64-74% across models, and tested defenses leave substantial residual risk.
Baseline Defenses for Adversarial Attacks Against Aligned Language Models
cs.LG 2023-09 conditional novelty 6.0

Baseline defenses including perplexity-based detection, input preprocessing, and adversarial training offer partial robustness to text adversarial attacks on LLMs, with challenges arising from weak discrete optimizers.
Jailbroken: How Does LLM Safety Training Fail?
cs.LG 2023-07 unverdicted novelty 6.0

LLM safety training fails due to competing objectives and mismatched generalization, enabling new jailbreaks that succeed on all unsafe prompts from red-teaming sets in GPT-4 and Claude.
Engineering Robustness into Personal Agents with the AI Workflow Store
cs.CR 2026-05 unverdicted novelty 5.0

AI agents should shift from on-the-fly plan synthesis to invoking pre-engineered, tested, and reusable workflows stored in an AI Workflow Store to gain reliability and security.
When Agents Handle Secrets: A Survey of Confidential Computing for Agentic AI
cs.CR 2026-05 unverdicted novelty 5.0

A survey providing a taxonomy of TEE platforms, an agent-centric threat model, and open challenges for applying confidential computing to secure agentic AI systems.
Architectural Obsolescence of Unhardened Agentic-AI Runtimes
cs.CR 2026-05 unverdicted novelty 5.0

OpenClaw fails to detect any of four action-audit divergence types while a hardened fork detects them all with perfect accuracy, making unhardened agentic-AI runtimes architecturally obsolete.
Think Before You Act -- A Neurocognitive Governance Model for Autonomous AI Agents
cs.AI 2026-04 unverdicted novelty 5.0

A neurocognitive governance model formalizes a Pre-Action Governance Reasoning Loop that consults global, workflow, agent, and situational rules before each action, yielding 95% compliance accuracy with zero false esc...
SafeAgent: A Runtime Protection Architecture for Agentic Systems
cs.AI 2026-04 unverdicted novelty 5.0

SafeAgent is a stateful runtime protection system that improves LLM agent robustness to prompt injections over baselines while preserving task performance.
WebAgentGuard: A Reasoning-Driven Guard Model for Detecting Prompt Injection Attacks in Web Agents
cs.CR 2026-04 unverdicted novelty 5.0

WebAgentGuard is a reasoning-driven multimodal model trained on large synthetic data via supervised fine-tuning and reinforcement learning to detect prompt injections in web agents better than prior defenses.
Engineering Robustness into Personal Agents with the AI Workflow Store
cs.CR 2026-05 unverdicted novelty 4.0

AI agents require pre-engineered reusable workflows stored in a central repository rather than generating plans on the fly to achieve production-grade reliability and security.
MIPIAD: Multilingual Indirect Prompt Injection Attack Defense with Qwen -- TF-IDF Hybrid and Meta-Ensemble Learning
cs.CL 2026-05 unverdicted novelty 4.0

MIPIAD reports a hybrid Qwen-TF-IDF ensemble defense that reaches F1 0.9205 and reduces the English-Bangla performance gap on a 1.43-million-sample synthetic benchmark derived from BIPIA templates.
When Agents Handle Secrets: A Survey of Confidential Computing for Agentic AI
cs.CR 2026-05 unverdicted novelty 4.0

A structured survey of confidential computing for agentic AI that catalogs TEE platforms, agent-specific threats, transferable defenses, and remaining gaps in end-to-end frameworks.
Making AI-Assisted Grant Evaluation Auditable without Exposing the Model
cs.CR 2026-04 unverdicted novelty 4.0

A TEE-based remote attestation system creates signed evaluation bundles that link input hashes, model measurements, and outputs to make AI grant reviews verifiable without revealing proprietary components.
Transparent and Controllable Recommendation Filtering via Multimodal Multi-Agent Collaboration
cs.IR 2026-04 unverdicted novelty 4.0

A multi-agent multimodal system with fact-grounded adjudication and a dynamic two-tier preference graph cuts false positives in content filtering by 74.3% and nearly doubles F1-score versus text-only baselines while s...
Hardening x402: PII-Safe Agentic Payments via Pre-Execution Metadata Filtering
cs.CR 2026-04 conditional novelty 4.0

Presidio-hardened-x402 middleware filters PII from x402 metadata using NLP detection, achieving 0.894 micro-F1 on a 2000-sample synthetic corpus with 5.73ms p99 latency.
Beyond Static Sandboxing: Learned Capability Governance for Autonomous AI Agents
cs.CR 2026-04 unverdicted novelty 4.0

Aethelgard is a learned governance system that scopes AI agent capabilities to the minimum needed for each task type using PPO policy training on audit logs.

Reference graph

Works this paper leans on

92 extracted references · 92 canonical work pages · cited by 37 Pith papers

[1]

2022. ChatGPT. [Link]

work page 2022
[2]

Australian mayor readies world’s first defamation lawsuit over ChatGPT content

2023. Australian mayor readies world’s first defamation lawsuit over ChatGPT content. [Link]

work page 2023
[3]

AUTO-GPT VS CHATGPT: HOW DO THEY DIFFER AND EVERYTHING YOU NEED TO KNOW

2023. AUTO-GPT VS CHATGPT: HOW DO THEY DIFFER AND EVERYTHING YOU NEED TO KNOW. [Link]

work page 2023
[4]

Building the New Bing

2023. Building the New Bing. [Link]

work page 2023
[5]

ChatGPT banned in Italy over privacy concerns

2023. ChatGPT banned in Italy over privacy concerns. [Link]

work page 2023
[6]

ChatGPT bug leaked users’ conversation histories

2023. ChatGPT bug leaked users’ conversation histories. [Link]

work page 2023
[7]

ChatGPT invented a sexual harassment scandal and named a real law prof as the accused

2023. ChatGPT invented a sexual harassment scandal and named a real law prof as the accused. [Link]

work page 2023
[8]

ChatGPT Plugins

2023. ChatGPT Plugins. [Link]

work page 2023
[9]

ChatGPT sets record for fastest-growing user base - analyst note

2023. ChatGPT sets record for fastest-growing user base - analyst note. [Link]

work page 2023
[10]

Confirmed: the new Bing runs on OpenAI’s GPT-4

2023. Confirmed: the new Bing runs on OpenAI’s GPT-4. [Link]

work page 2023
[11]

A Conversation With Bing’s Chatbot Left Me Deeply Unsettled

2023. A Conversation With Bing’s Chatbot Left Me Deeply Unsettled. [Link]

work page 2023
[12]

Copilot Internals

2023. Copilot Internals. [Link]

work page 2023
[13]

Driving more traffic and value to publishers from the new Bing

2023. Driving more traffic and value to publishers from the new Bing. [Link]

work page 2023
[14]

GitHub Copilot - Your AI pair programmer

2023. GitHub Copilot - Your AI pair programmer. [Link]

work page 2023
[15]

Google and Microsoft’s chatbots are already citing one another in a misin- formation shitshow

2023. Google and Microsoft’s chatbots are already citing one another in a misin- formation shitshow. [Link]

work page 2023
[16]

Google’s AI chatbot Bard makes factual error in first demo

2023. Google’s AI chatbot Bard makes factual error in first demo. [Link]

work page 2023
[17]

How to Jailbreak ChatGPT

2023. How to Jailbreak ChatGPT. [Link]

work page 2023
[18]

Introducing Microsoft 365 Copilot – your copilot for work

2023. Introducing Microsoft 365 Copilot – your copilot for work. [Link]

work page 2023
[19]

Introducing Microsoft Security Copilot

2023. Introducing Microsoft Security Copilot. [Link]

work page 2023
[20]

Jailbreak Chat

2023. Jailbreak Chat. [Link]

work page 2023
[21]

LangChain library for composing and integrating LLMs into applications

2023. LangChain library for composing and integrating LLMs into applications. [Link]

work page 2023
[22]

The LLaMA is out of the bag

2023. The LLaMA is out of the bag. Should we expect a tidal wave of disinforma- tion? [Link]

work page 2023
[23]

Microsoft limits Bing chat to five replies

2023. Microsoft limits Bing chat to five replies. [Link]

work page 2023
[24]

Microsoft’s AI chatbot is going off the rails

2023. Microsoft’s AI chatbot is going off the rails. [Link]

work page 2023
[25]

Microsoft’s Bing A.I

2023. Microsoft’s Bing A.I. made several factual errors in last week’s launch demo. [Link]

work page 2023
[26]

Microsoft’s Bing chatbot gets smarter with restaurant bookings, image results, and more

2023. Microsoft’s Bing chatbot gets smarter with restaurant bookings, image results, and more. [Link]

work page 2023
[27]

The New Bing and Edge – Progress from Our First Month

2023. The New Bing and Edge – Progress from Our First Month. [Link]

work page 2023
[28]

New prompt injection attack on ChatGPT web version

2023. New prompt injection attack on ChatGPT web version. Reckless copy- pasting may lead to serious privacy issues in your chat. [Link]

work page 2023
[29]

OpenAI Codex

2023. OpenAI Codex. [Link]

work page 2023
[30]

Reinventing search with a new AI-powered Microsoft Bing and Edge, your copilot for the web

2023. Reinventing search with a new AI-powered Microsoft Bing and Edge, your copilot for the web. [Link]

work page 2023
[31]

That was fast! Microsoft slips ads into AI-powered Bing Chat

2023. That was fast! Microsoft slips ads into AI-powered Bing Chat. [Link]

work page 2023
[32]

These are Microsoft’s Bing AI secret rules and why it says it’s named Sydney

2023. These are Microsoft’s Bing AI secret rules and why it says it’s named Sydney. [Link]

work page 2023
[33]

Jacob Andreas. 2022. Language models as agent models. In Findings of EMNLP

work page 2022
[34]

Real Attackers Don’t Compute Gradients

Giovanni Apruzzese, Hyrum Anderson, Savino Dambra, David Freeman, Fabio Pierazzi, and Kevin Roundy. 2022. Position:“Real Attackers Don’t Compute Gradients”: Bridging the Gap Between Adversarial ML Research and Practice. In SaTML

work page 2022
[35]

Eugene Bagdasaryan and Vitaly Shmatikov. 2022. Spinning Language Models: Risks of Propaganda-As-A-Service and Countermeasures. In S&P

work page 2022
[36]

Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. 2022. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv (2022)

work page 2022
[37]

Nora Belrose, Zach Furman, Logan Smith, Danny Halawi, Igor Ostrovsky, Lev McKinney, Stella Biderman, and Jacob Steinhardt. 2023. Eliciting Latent Predic- tions from Transformers with the Tuned Lens. arXiv (2023)

work page 2023
[38]

Emily M Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?. In the ACM conference on Fairness, Accountability, and Transparency

work page 2021
[39]

Daniil A Boiko, Robert MacKnight, and Gabe Gomes. 2023. Emergent autonomous scientific research capabilities of large language models. arXiv (2023)

work page 2023
[40]

Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. 2021. On the opportunities and risks of foundation models. arXiv (2021)

work page 2021
[41]

Nicholas Boucher, Ilia Shumailov, Ross Anderson, and Nicolas Papernot. 2022. Bad characters: Imperceptible nlp attacks. In S&P

work page 2022
[42]

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. In NeurIPS

work page 2020
[43]

Matthew Burtell and Thomas Woodside. 2023. Artificial Influence: An Analysis Of AI-Driven Persuasion. arXiv (2023)

work page 2023
[44]

O’Reilly Media, Inc

Clarence Chio and David Freeman. 2018. Machine learning and security. "O’Reilly Media, Inc. "

work page 2018
[45]

Ameet Deshpande, Vishvak Murahari, Tanmay Rajpurohit, Ashwin Kalyan, and Karthik Narasimhan. 2023. Toxicity in ChatGPT: Analyzing Persona-assigned Language Models. arXiv (2023)

work page 2023
[46]

Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A. Smith. 2020. RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models. In Findings of EMNLP

work page 2020
[47]

Josh A Goldstein, Girish Sastry, Micah Musser, Renee DiResta, Matthew Gentzel, and Katerina Sedova. 2023. Generative Language Models and Automated Influ- ence Operations: Emerging Threats and Potential Mitigations. arXiv (2023)

work page 2023
[48]

Jochen Hartmann, Jasper Schwenzow, and Maximilian Witte. 2023. The po- litical ideology of conversational AI: Converging evidence on ChatGPT’s pro- environmental, left-libertarian orientation. arXiv (2023)

work page 2023
[49]

Noah Goodman Jesse Mu, Xiang Lisa Li. 2023. Learning to Compress Prompts with Gist Tokens. arXiv (2023)

work page 2023
[50]

Ana Jojic, Zhen Wang, and Nebojsa Jojic. 2023. GPT is becoming a Turing machine: Here are some ways to program it. arXiv (2023)

work page 2023
[51]

Keith S Jones, Miriam E Armstrong, McKenna K Tornblad, and Akbar Siami Namin. 2021. How social engineers use persuasion principles during vishing attacks. Information & Computer Security 29, 2 (2021), 314–331

work page 2021
[52]

Daniel Kang, Xuechen Li, Ion Stoica, Carlos Guestrin, Matei Zaharia, and Tat- sunori Hashimoto. 2023. Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks. arXiv (2023)

work page 2023
[53]

Sebastian Krügel, Andreas Ostermaier, and Matthias Uhl. 2023. ChatGPT’s inconsistent moral advice influences users’ judgment. Scientific Reports 13, 1 (2023), 4569

work page 2023
[54]

Mina Lee, Megha Srivastava, Amelia Hardy, John Thickstun, Esin Durmus, Ash- win Paranjape, Ines Gerard-Ursin, Xiang Lisa Li, Faisal Ladhak, Frieda Rong, et al

work page
[55]

arXiv (2022)

Evaluating Human-Language Model Interaction. arXiv (2022)

work page 2022
[56]

Yaobo Liang, Chenfei Wu, Ting Song, Wenshan Wu, Yan Xia, Yu Liu, Yang Ou, Shuai Lu, Lei Ji, Shaoguang Mao, et al. 2023. TaskMatrix. AI: Completing Tasks by Connecting Foundation Models with Millions of APIs. arXiv (2023)

work page 2023
[57]

Stephanie Lin, Jacob Hilton, and Owain Evans. 2022. TruthfulQA: Measuring How Models Mimic Human Falsehoods. In ACL

work page 2022
[58]

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023. Visual Instruc- tion Tuning. arXiv (2023)

work page 2023
[59]

Nelson F Liu, Tianyi Zhang, and Percy Liang. 2023. Evaluating Verifiability in Generative Search Engines. arXiv (2023)

work page 2023
[60]

Microsoft. 2023. Bing Preview Release Notes: Bing in the Edge Sidebar. [Link]

work page 2023
[61]

Moin Nadeem, Anna Bethke, and Siva Reddy. 2021. StereoSet: Measuring stereo- typical bias in pretrained language models. In ACL | IJCNLP

work page 2021
[62]

OpenAI. 2023. GPT-4 Technical Report. arXiv (2023). 13 Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz

work page 2023
[63]

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Gray, et al

work page
[64]

In NeurIPS

Training language models to follow instructions with human feedback. In NeurIPS

work page
[65]

Joon Sung Park, Joseph C O’Brien, Carrie J Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. 2023. Generative Agents: Interactive Simulacra of Human Behavior. arXiv (2023)

work page 2023
[66]

Was it “stated

Roma Patel and Ellie Pavlick. 2021. “Was it “stated” or was it “claimed”?: How linguistic bias affects generative language models. In EMNLP

work page 2021
[67]

Ethan Perez, Sam Ringer, Kamil˙e Lukoši¯ut˙e, Karina Nguyen, Edwin Chen, Scott Heiner, Craig Pettit, Catherine Olsson, Sandipan Kundu, Saurav Kadavath, et al

work page
[68]

arXiv (2022)

Discovering Language Model Behaviors with Model-Written Evaluations. arXiv (2022)

work page 2022
[69]

Fábio Perez and Ian Ribeiro. 2022. Ignore Previous Prompt: Attack Techniques For Language Models. In NeurIPS ML Safety Workshop

work page 2022
[70]

Hashimoto Rohan Taori, Ishaan Gulrajani

Tianyi Zhang Yann Dubois Xuechen Li Carlos Guestrin Percy Liang Tatsunori B. Hashimoto Rohan Taori, Ishaan Gulrajani. 2023. Alpaca: A Strong, Replicable Instruction-Following Model. [Link]

work page 2023
[71]

Ahmed Salem, Michael Backes, and Yang Zhang. 2022. Get a Model! Model Hijacking Attack Against Machine Learning Models. In NDSS

work page 2022
[72]

Shibani Santurkar, Esin Durmus, Faisal Ladhak, Cinoo Lee, Percy Liang, and Tatsunori Hashimoto. 2023. Whose Opinions Do Language Models Reflect?arXiv (2023)

work page 2023
[73]

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2023. Toolformer: Language Models Can Teach Themselves to Use Tools. arXiv (2023)

work page 2023
[74]

Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. 2023. HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace. arXiv (2023)

work page 2023
[75]

Arabella Sinclair, Jaap Jumelet, Willem Zuidema, and Raquel Fernández. 2022. Structural Persistence in Language Models: Priming as a Window into Abstract Language Representations. Transactions of the Association for Computational Linguistics 10 (09 2022), 1031–1050

work page 2022
[76]

Jacob Steinhardt. 2023. Emergent Deception and Emergent Optimization. [Link]

work page 2023
[77]

Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F Christiano. 2020. Learning to summarize with human feedback. In NeurIPS

work page 2020
[78]

Jonas Thiergart, Stefan Huber, and Thomas Übellacker. 2021. Understanding emails and drafting responses–An approach using GPT-3. arXiv (2021)

work page 2021
[79]

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed H Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In NeurIPS

work page 2022
[80]

Laura Weidinger, John Mellor, Maribeth Rauh, Conor Griffin, Jonathan Uesato, Po-Sen Huang, Myra Cheng, Mia Glaese, Borja Balle, Atoosa Kasirzadeh, et al

work page

Showing first 80 references.