pith. machine review for the scientific record. sign in

arxiv: 2302.12173 · v2 · submitted 2023-02-23 · 💻 cs.CR · cs.AI· cs.CL· cs.CY

Recognition: 1 theorem link

Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection

Christoph Endres, Kai Greshake, Mario Fritz, Sahar Abdelnabi, Shailesh Mishra, Thorsten Holz

Pith reviewed 2026-05-11 17:12 UTC · model grok-4.3

classification 💻 cs.CR cs.AIcs.CLcs.CY
keywords indirect prompt injectionLLM securityadversarial promptingprompt injection attacksLLM applicationsdata retrievalAI vulnerabilities
0
0 comments X

The pith

Adversaries can remotely compromise LLM-integrated applications by injecting prompts into retrievable data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that integrating LLMs into applications creates new risks because the models process retrieved data as if it contains instructions. This allows attackers to use indirect prompt injection to override original commands without directly interacting with the system. The authors develop a taxonomy of potential harms including data theft and ecosystem contamination, and show these attacks working on real tools like Bing's GPT-4 chat and code completion engines. They conclude that without new defenses, reliance on LLMs in apps leaves users and systems exposed.

Core claim

Indirect Prompt Injection attacks succeed because LLM-integrated applications do not distinguish between data and instructions, allowing strategically placed prompts in external data to be retrieved, processed, and executed by the model as overriding commands.

What carries the argument

Indirect Prompt Injection mechanism, which embeds adversarial instructions in data sources that the application retrieves and feeds to the LLM.

If this is right

  • LLM apps can be tricked into stealing and exfiltrating user data to the attacker.
  • Attacks can propagate like worms by injecting prompts that cause further data contamination.
  • Application behavior can be altered to call APIs in unintended ways or manipulate outputs.
  • The overall information ecosystem can be poisoned through controlled content generation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • System designers may need to implement strict separation between user instructions and retrieved content.
  • Retrieval-augmented generation systems are particularly vulnerable and require new verification steps.
  • Testing LLMs with mixed data and instructions could reveal if they reliably ignore injected commands.

Load-bearing premise

The LLM will interpret and follow instructions found in retrieved external data without recognizing them as separate from or subordinate to the original user prompt.

What would settle it

Observe whether an LLM app follows a user query or an opposing instruction hidden in a retrieved document when both are present.

read the original abstract

Large Language Models (LLMs) are increasingly being integrated into various applications. The functionalities of recent LLMs can be flexibly modulated via natural language prompts. This renders them susceptible to targeted adversarial prompting, e.g., Prompt Injection (PI) attacks enable attackers to override original instructions and employed controls. So far, it was assumed that the user is directly prompting the LLM. But, what if it is not the user prompting? We argue that LLM-Integrated Applications blur the line between data and instructions. We reveal new attack vectors, using Indirect Prompt Injection, that enable adversaries to remotely (without a direct interface) exploit LLM-integrated applications by strategically injecting prompts into data likely to be retrieved. We derive a comprehensive taxonomy from a computer security perspective to systematically investigate impacts and vulnerabilities, including data theft, worming, information ecosystem contamination, and other novel security risks. We demonstrate our attacks' practical viability against both real-world systems, such as Bing's GPT-4 powered Chat and code-completion engines, and synthetic applications built on GPT-4. We show how processing retrieved prompts can act as arbitrary code execution, manipulate the application's functionality, and control how and if other APIs are called. Despite the increasing integration and reliance on LLMs, effective mitigations of these emerging threats are currently lacking. By raising awareness of these vulnerabilities and providing key insights into their implications, we aim to promote the safe and responsible deployment of these powerful models and the development of robust defenses that protect users and systems from potential attacks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that LLM-integrated applications are vulnerable to a new class of Indirect Prompt Injection (IPI) attacks. Adversaries can remotely compromise these systems by embedding malicious instructions in external data sources (web pages, emails, databases) that the application is likely to retrieve and concatenate into the LLM context. The work derives a security-oriented taxonomy of resulting impacts (data theft, worming, information ecosystem contamination, API manipulation), demonstrates concrete attacks on production systems including Bing's GPT-4 Chat and code-completion engines as well as synthetic GPT-4 applications, and argues that retrieved prompts can effectively act as arbitrary code execution. It concludes that current mitigations are insufficient and calls for improved defenses.

Significance. If the demonstrations hold, the paper makes a timely and practically relevant contribution to LLM security by surfacing an attack surface that arises precisely from the data-instruction blurring inherent in retrieval-augmented LLM applications. The taxonomy supplies a useful organizing framework, and the real-world case studies on commercial GPT-4 deployments provide concrete evidence that the vector is already exploitable. These elements could directly inform both defensive research (e.g., prompt isolation, structured retrieval) and responsible deployment practices. The empirical focus on production systems is a clear strength.

major comments (2)
  1. [§5] §5 (Demonstrations / Evaluation): The central claim of 'practical viability' against real-world systems rests on qualitative descriptions of successful attacks on Bing Chat and code-completion engines. No success rates, trial counts, context-length sensitivity, or failure-mode analysis are reported, nor are the exact injected prompts or retrieval conditions provided. This omission is load-bearing because LLM behavior is non-deterministic and prompt ordering / summarization can suppress the attack; without these metrics the reproducibility and robustness of the vector cannot be assessed.
  2. [§3] §3 (Taxonomy): Several high-impact categories (e.g., worming, ecosystem contamination) are defined but the mapping from the concrete demonstrations to these categories is only partially instantiated. The paper extends the observed Bing/Chat behaviors to the full taxonomy largely by construction rather than by additional targeted experiments, weakening the claim that the taxonomy comprehensively captures realized risks.
minor comments (2)
  1. [Abstract / §1] The abstract and introduction repeatedly use 'arbitrary code execution' as an analogy; a brief clarification of the precise boundary (what the LLM can and cannot do via the retrieved prompt) would prevent over-interpretation.
  2. [§2] Related-work discussion of prior direct prompt-injection papers is present but could more explicitly contrast the indirect setting with respect to attacker capabilities and detection surfaces.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed review. The comments highlight important areas for improving the rigor of our evaluation and the clarity of our taxonomy. We address each major comment below, indicating planned revisions to the manuscript.

read point-by-point responses
  1. Referee: [§5] §5 (Demonstrations / Evaluation): The central claim of 'practical viability' against real-world systems rests on qualitative descriptions of successful attacks on Bing Chat and code-completion engines. No success rates, trial counts, context-length sensitivity, or failure-mode analysis are reported, nor are the exact injected prompts or retrieval conditions provided. This omission is load-bearing because LLM behavior is non-deterministic and prompt ordering / summarization can suppress the attack; without these metrics the reproducibility and robustness of the vector cannot be assessed.

    Authors: We acknowledge the value of quantitative metrics for assessing robustness. Our Section 5 demonstrations were designed as proof-of-concept case studies on live production systems, where repeated quantitative trials raise ethical and practical issues (e.g., potential service disruption or model changes over time). In the revision we will add the specific injected prompts and retrieval conditions used, describe the number of trials performed where feasible, and include a discussion of observed failure modes and context-length effects based on our experiments. This will improve reproducibility while preserving the real-world focus. revision: partial

  2. Referee: [§3] §3 (Taxonomy): Several high-impact categories (e.g., worming, ecosystem contamination) are defined but the mapping from the concrete demonstrations to these categories is only partially instantiated. The paper extends the observed Bing/Chat behaviors to the full taxonomy largely by construction rather than by additional targeted experiments, weakening the claim that the taxonomy comprehensively captures realized risks.

    Authors: The taxonomy organizes risks according to the core mechanism of indirect prompt injection, which grants the LLM effective control over its own context and downstream actions. Demonstrations on Bing and code-completion engines directly instantiate data theft and API manipulation; synthetic GPT-4 applications instantiate worming and contamination. We will revise Section 3 to add an explicit mapping (e.g., a table) that distinguishes directly demonstrated cases from logical extensions of the same mechanism, thereby clarifying the scope without new experiments. revision: yes

Circularity Check

0 steps flagged

No significant circularity: empirical attack demonstrations without derivations or self-referential fits

full rationale

The paper contains no mathematical derivations, equations, fitted parameters, or first-principles claims that could reduce to their own inputs. Its central contributions are a taxonomy of indirect prompt injection risks and practical demonstrations against external production systems (e.g., Bing Chat, code-completion engines) and synthetic GPT-4 setups. These rest on observable behaviors in real applications rather than any self-definition, self-citation load-bearing premise, or renaming of known results. The analysis is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that current LLMs process all natural language input uniformly without reliable separation of instructions from data content.

axioms (1)
  • domain assumption LLMs treat retrieved external text as executable instructions equivalent to direct user prompts.
    This assumption underpins why injected prompts in data sources can override application controls.

pith-pipeline@v0.9.0 · 5598 in / 1123 out tokens · 40157 ms · 2026-05-11T17:12:20.679350+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 40 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. From Prompt to Physical Actuation: Holistic Threat Modeling of LLM-Enabled Robotic Systems

    cs.CR 2026-04 unverdicted novelty 8.0

    A unified threat model for LLM-enabled robots reveals three cross-boundary attack chains from user input to unsafe physical actuation due to missing validations and unmediated crossings.

  2. IPI-proxy: An Intercepting Proxy for Red-Teaming Web-Browsing AI Agents Against Indirect Prompt Injection

    cs.CR 2026-05 unverdicted novelty 7.0

    IPI-proxy is a toolkit using an intercepting proxy to inject indirect prompt injection attacks into live web pages for testing AI browsing agents against hidden instructions.

  3. Jailbroken Frontier Models Retain Their Capabilities

    cs.LG 2026-04 unverdicted novelty 7.0

    Jailbreak-induced performance loss shrinks as model capability grows, with the strongest models showing almost no degradation on benchmarks.

  4. A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework

    cs.CR 2026-04 unverdicted novelty 7.0

    A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.

  5. Cross-Session Threats in AI Agents: Benchmark, Evaluation, and Algorithms

    cs.CR 2026-04 conditional novelty 7.0

    Introduces CSTM-Bench with 26 cross-session attack taxonomies, demonstrates recall loss in session-bound and full-log detectors, and proposes a bounded-memory coreset reader with the CSTM metric balancing detection an...

  6. Conjunctive Prompt Attacks in Multi-Agent LLM Systems

    cs.MA 2026-04 unverdicted novelty 7.0

    Conjunctive prompt attacks split adversarial elements across agents and routing paths in multi-agent LLM systems, evading isolated defenses and succeeding through topology-aware optimization.

  7. Many-Tier Instruction Hierarchy in LLM Agents

    cs.CL 2026-04 unverdicted novelty 7.0

    ManyIH and ManyIH-Bench address instruction conflicts in LLM agents with up to 12 privilege levels across 853 tasks, revealing frontier models achieve only ~40% accuracy.

  8. Sleeper Channels and Provenance Gates: Persistent Prompt Injection in Always-on Autonomous AI Agents

    cs.CR 2026-05 conditional novelty 6.0

    Sleeper channels enable persistent prompt injection in always-on AI agents via persistence substrate and firing separation, countered by provenance gates using action digests and owner attestations with a soundness theorem.

  9. Not Just RLHF: Why Alignment Alone Won't Fix Multi-Agent Sycophancy

    cs.LG 2026-05 unverdicted novelty 6.0

    Pretrained base models exhibit higher yield to peer disagreement than RLHF instruct variants, with the effect localized to mid-layer attention and mitigated by structured dissent rather than prompt defenses.

  10. Multi-Stream LLMs: Unblocking Language Models with Parallel Streams of Thoughts, Inputs and Outputs

    cs.LG 2026-05 unverdicted novelty 6.0

    Language models trained on parallel streams of computation can overcome single-stream bottlenecks in autonomous agents by enabling simultaneous reading, thinking, and acting.

  11. When Agents Overtrust Environmental Evidence: An Extensible Agentic Framework for Benchmarking Evidence-Grounding Defects in LLM Agents

    cs.AI 2026-05 unverdicted novelty 6.0

    EnvTrustBench benchmarks evidence-grounding defects in LLM agents and finds they occur consistently across workflows.

  12. When Agents Overtrust Environmental Evidence: An Extensible Agentic Framework for Benchmarking Evidence-Grounding Defects in LLM Agents

    cs.AI 2026-05 unverdicted novelty 6.0

    EnvTrustBench is a new agentic benchmark that measures evidence-grounding defects where LLM agents overtrust faulty environmental observations and take incorrect actions.

  13. Self-Adaptive Multi-Agent LLM-Based Security Pattern Selection for IoT Systems

    cs.CR 2026-05 unverdicted novelty 6.0

    ASPO combines multi-agent LLM proposals with deterministic enforcement in a MAPE-K loop to select conflict-free, resource-feasible security patterns for IoT, delivering 100% safety invariants and 21-23% tail latency/e...

  14. Semia: Auditing Agent Skills via Constraint-Guided Representation Synthesis

    cs.CR 2026-05 unverdicted novelty 6.0

    Semia synthesizes Datalog representations of agent skills via constraint-guided loops to enable reachability queries for semantic risks, finding critical issues in over half of 13,728 real skills with 97.7% recall on ...

  15. AgentWard: A Lifecycle Security Architecture for Autonomous AI Agents

    cs.CR 2026-04 conditional novelty 6.0

    AgentWard organizes stage-specific security controls with cross-layer coordination to intercept threats across the full lifecycle of autonomous AI agents.

  16. From Skill Text to Skill Structure: The Scheduling-Structural-Logical Representation for Agent Skills

    cs.CL 2026-04 unverdicted novelty 6.0

    SSL representation disentangles skill scheduling, structure, and logic using an LLM normalizer, improving skill discovery MRR@50 from 0.649 to 0.729 and risk assessment macro F1 from 0.409 to 0.509 over text baselines.

  17. Evaluation of Prompt Injection Defenses in Large Language Models

    cs.CR 2026-04 unverdicted novelty 6.0

    Output filtering implemented in application code is the only defense that survived an adaptive prompt-injection attacker across 15,000 attacks; model-based defenses all broke.

  18. An AI Agent Execution Environment to Safeguard User Data

    cs.CR 2026-04 unverdicted novelty 6.0

    GAAP guarantees confidentiality of private user data for AI agents by enforcing user-specified permissions deterministically through persistent information flow tracking, without trusting the agent or requiring attack...

  19. Towards Understanding the Robustness of Sparse Autoencoders

    cs.LG 2026-04 unverdicted novelty 6.0

    Integrating pretrained sparse autoencoders into LLM residual streams reduces jailbreak success rates by up to 5x across multiple models and attacks.

  20. Owner-Harm: A Missing Threat Model for AI Agent Safety

    cs.CR 2026-04 unverdicted novelty 6.0

    Owner-Harm is a new threat model with eight categories of agent behavior that harms the deployer, and existing defenses achieve only 14.8% true positive rate on injection-based owner-harm tasks versus 100% on generic ...

  21. MCPThreatHive: Automated Threat Intelligence for Model Context Protocol Ecosystems

    cs.CR 2026-04 unverdicted novelty 6.0

    MCPThreatHive automates the full lifecycle of threat intelligence for MCP agentic systems using a new 38-pattern taxonomy mapped to STRIDE and OWASP frameworks plus composite risk scoring.

  22. LLM-Guided Prompt Evolution for Password Guessing

    cs.CR 2026-04 unverdicted novelty 6.0

    LLM-guided evolutionary prompt optimization using MAP-Elites and island models raises password cracking rates from 2.02% to 8.48% on a RockYou-derived test set across local, cloud, and ensemble LLM setups.

  23. PlanGuard: Defending Agents against Indirect Prompt Injection via Planning-based Consistency Verification

    cs.CR 2026-04 unverdicted novelty 6.0

    PlanGuard cuts indirect prompt injection attack success rate to 0% on the InjecAgent benchmark by verifying agent actions against a user-instruction-only plan while keeping false positives at 1.49%.

  24. When Safety Geometry Collapses: Fine-Tuning Vulnerabilities in Agentic Guard Models

    cs.LG 2026-04 unverdicted novelty 6.0

    Benign fine-tuning collapses safety geometry in guard models like Granite Guardian, dropping refusal to 0%, but Fisher-Weighted Safety Subspace Regularization restores it to 75% while improving robustness.

  25. Your Agent, Their Asset: A Real-World Safety Analysis of OpenClaw

    cs.CR 2026-04 conditional novelty 6.0

    Poisoning any single CIK dimension of an AI agent raises average attack success rate from 24.6% to 64-74% across models, and tested defenses leave substantial residual risk.

  26. Baseline Defenses for Adversarial Attacks Against Aligned Language Models

    cs.LG 2023-09 conditional novelty 6.0

    Baseline defenses including perplexity-based detection, input preprocessing, and adversarial training offer partial robustness to text adversarial attacks on LLMs, with challenges arising from weak discrete optimizers.

  27. Jailbroken: How Does LLM Safety Training Fail?

    cs.LG 2023-07 unverdicted novelty 6.0

    LLM safety training fails due to competing objectives and mismatched generalization, enabling new jailbreaks that succeed on all unsafe prompts from red-teaming sets in GPT-4 and Claude.

  28. Engineering Robustness into Personal Agents with the AI Workflow Store

    cs.CR 2026-05 unverdicted novelty 5.0

    AI agents should shift from on-the-fly plan synthesis to invoking pre-engineered, tested, and reusable workflows stored in an AI Workflow Store to gain reliability and security.

  29. When Agents Handle Secrets: A Survey of Confidential Computing for Agentic AI

    cs.CR 2026-05 unverdicted novelty 5.0

    A survey providing a taxonomy of TEE platforms, an agent-centric threat model, and open challenges for applying confidential computing to secure agentic AI systems.

  30. Architectural Obsolescence of Unhardened Agentic-AI Runtimes

    cs.CR 2026-05 unverdicted novelty 5.0

    OpenClaw fails to detect any of four action-audit divergence types while a hardened fork detects them all with perfect accuracy, making unhardened agentic-AI runtimes architecturally obsolete.

  31. Think Before You Act -- A Neurocognitive Governance Model for Autonomous AI Agents

    cs.AI 2026-04 unverdicted novelty 5.0

    A neurocognitive governance model formalizes a Pre-Action Governance Reasoning Loop that consults global, workflow, agent, and situational rules before each action, yielding 95% compliance accuracy with zero false esc...

  32. SafeAgent: A Runtime Protection Architecture for Agentic Systems

    cs.AI 2026-04 unverdicted novelty 5.0

    SafeAgent is a stateful runtime protection system that improves LLM agent robustness to prompt injections over baselines while preserving task performance.

  33. WebAgentGuard: A Reasoning-Driven Guard Model for Detecting Prompt Injection Attacks in Web Agents

    cs.CR 2026-04 unverdicted novelty 5.0

    WebAgentGuard is a reasoning-driven multimodal model trained on large synthetic data via supervised fine-tuning and reinforcement learning to detect prompt injections in web agents better than prior defenses.

  34. Engineering Robustness into Personal Agents with the AI Workflow Store

    cs.CR 2026-05 unverdicted novelty 4.0

    AI agents require pre-engineered reusable workflows stored in a central repository rather than generating plans on the fly to achieve production-grade reliability and security.

  35. MIPIAD: Multilingual Indirect Prompt Injection Attack Defense with Qwen -- TF-IDF Hybrid and Meta-Ensemble Learning

    cs.CL 2026-05 unverdicted novelty 4.0

    MIPIAD reports a hybrid Qwen-TF-IDF ensemble defense that reaches F1 0.9205 and reduces the English-Bangla performance gap on a 1.43-million-sample synthetic benchmark derived from BIPIA templates.

  36. When Agents Handle Secrets: A Survey of Confidential Computing for Agentic AI

    cs.CR 2026-05 unverdicted novelty 4.0

    A structured survey of confidential computing for agentic AI that catalogs TEE platforms, agent-specific threats, transferable defenses, and remaining gaps in end-to-end frameworks.

  37. Making AI-Assisted Grant Evaluation Auditable without Exposing the Model

    cs.CR 2026-04 unverdicted novelty 4.0

    A TEE-based remote attestation system creates signed evaluation bundles that link input hashes, model measurements, and outputs to make AI grant reviews verifiable without revealing proprietary components.

  38. Transparent and Controllable Recommendation Filtering via Multimodal Multi-Agent Collaboration

    cs.IR 2026-04 unverdicted novelty 4.0

    A multi-agent multimodal system with fact-grounded adjudication and a dynamic two-tier preference graph cuts false positives in content filtering by 74.3% and nearly doubles F1-score versus text-only baselines while s...

  39. Hardening x402: PII-Safe Agentic Payments via Pre-Execution Metadata Filtering

    cs.CR 2026-04 conditional novelty 4.0

    Presidio-hardened-x402 middleware filters PII from x402 metadata using NLP detection, achieving 0.894 micro-F1 on a 2000-sample synthetic corpus with 5.73ms p99 latency.

  40. Beyond Static Sandboxing: Learned Capability Governance for Autonomous AI Agents

    cs.CR 2026-04 unverdicted novelty 4.0

    Aethelgard is a learned governance system that scopes AI agent capabilities to the minimum needed for each task type using PPO policy training on audit logs.

Reference graph

Works this paper leans on

92 extracted references · 92 canonical work pages · cited by 37 Pith papers

  1. [1]

    2022. ChatGPT. [Link]

  2. [2]

    Australian mayor readies world’s first defamation lawsuit over ChatGPT content

    2023. Australian mayor readies world’s first defamation lawsuit over ChatGPT content. [Link]

  3. [3]

    AUTO-GPT VS CHATGPT: HOW DO THEY DIFFER AND EVERYTHING YOU NEED TO KNOW

    2023. AUTO-GPT VS CHATGPT: HOW DO THEY DIFFER AND EVERYTHING YOU NEED TO KNOW. [Link]

  4. [4]

    Building the New Bing

    2023. Building the New Bing. [Link]

  5. [5]

    ChatGPT banned in Italy over privacy concerns

    2023. ChatGPT banned in Italy over privacy concerns. [Link]

  6. [6]

    ChatGPT bug leaked users’ conversation histories

    2023. ChatGPT bug leaked users’ conversation histories. [Link]

  7. [7]

    ChatGPT invented a sexual harassment scandal and named a real law prof as the accused

    2023. ChatGPT invented a sexual harassment scandal and named a real law prof as the accused. [Link]

  8. [8]

    ChatGPT Plugins

    2023. ChatGPT Plugins. [Link]

  9. [9]

    ChatGPT sets record for fastest-growing user base - analyst note

    2023. ChatGPT sets record for fastest-growing user base - analyst note. [Link]

  10. [10]

    Confirmed: the new Bing runs on OpenAI’s GPT-4

    2023. Confirmed: the new Bing runs on OpenAI’s GPT-4. [Link]

  11. [11]

    A Conversation With Bing’s Chatbot Left Me Deeply Unsettled

    2023. A Conversation With Bing’s Chatbot Left Me Deeply Unsettled. [Link]

  12. [12]

    Copilot Internals

    2023. Copilot Internals. [Link]

  13. [13]

    Driving more traffic and value to publishers from the new Bing

    2023. Driving more traffic and value to publishers from the new Bing. [Link]

  14. [14]

    GitHub Copilot - Your AI pair programmer

    2023. GitHub Copilot - Your AI pair programmer. [Link]

  15. [15]

    Google and Microsoft’s chatbots are already citing one another in a misin- formation shitshow

    2023. Google and Microsoft’s chatbots are already citing one another in a misin- formation shitshow. [Link]

  16. [16]

    Google’s AI chatbot Bard makes factual error in first demo

    2023. Google’s AI chatbot Bard makes factual error in first demo. [Link]

  17. [17]

    How to Jailbreak ChatGPT

    2023. How to Jailbreak ChatGPT. [Link]

  18. [18]

    Introducing Microsoft 365 Copilot – your copilot for work

    2023. Introducing Microsoft 365 Copilot – your copilot for work. [Link]

  19. [19]

    Introducing Microsoft Security Copilot

    2023. Introducing Microsoft Security Copilot. [Link]

  20. [20]

    Jailbreak Chat

    2023. Jailbreak Chat. [Link]

  21. [21]

    LangChain library for composing and integrating LLMs into applications

    2023. LangChain library for composing and integrating LLMs into applications. [Link]

  22. [22]

    The LLaMA is out of the bag

    2023. The LLaMA is out of the bag. Should we expect a tidal wave of disinforma- tion? [Link]

  23. [23]

    Microsoft limits Bing chat to five replies

    2023. Microsoft limits Bing chat to five replies. [Link]

  24. [24]

    Microsoft’s AI chatbot is going off the rails

    2023. Microsoft’s AI chatbot is going off the rails. [Link]

  25. [25]

    Microsoft’s Bing A.I

    2023. Microsoft’s Bing A.I. made several factual errors in last week’s launch demo. [Link]

  26. [26]

    Microsoft’s Bing chatbot gets smarter with restaurant bookings, image results, and more

    2023. Microsoft’s Bing chatbot gets smarter with restaurant bookings, image results, and more. [Link]

  27. [27]

    The New Bing and Edge – Progress from Our First Month

    2023. The New Bing and Edge – Progress from Our First Month. [Link]

  28. [28]

    New prompt injection attack on ChatGPT web version

    2023. New prompt injection attack on ChatGPT web version. Reckless copy- pasting may lead to serious privacy issues in your chat. [Link]

  29. [29]

    OpenAI Codex

    2023. OpenAI Codex. [Link]

  30. [30]

    Reinventing search with a new AI-powered Microsoft Bing and Edge, your copilot for the web

    2023. Reinventing search with a new AI-powered Microsoft Bing and Edge, your copilot for the web. [Link]

  31. [31]

    That was fast! Microsoft slips ads into AI-powered Bing Chat

    2023. That was fast! Microsoft slips ads into AI-powered Bing Chat. [Link]

  32. [32]

    These are Microsoft’s Bing AI secret rules and why it says it’s named Sydney

    2023. These are Microsoft’s Bing AI secret rules and why it says it’s named Sydney. [Link]

  33. [33]

    Jacob Andreas. 2022. Language models as agent models. In Findings of EMNLP

  34. [34]

    Real Attackers Don’t Compute Gradients

    Giovanni Apruzzese, Hyrum Anderson, Savino Dambra, David Freeman, Fabio Pierazzi, and Kevin Roundy. 2022. Position:“Real Attackers Don’t Compute Gradients”: Bridging the Gap Between Adversarial ML Research and Practice. In SaTML

  35. [35]

    Eugene Bagdasaryan and Vitaly Shmatikov. 2022. Spinning Language Models: Risks of Propaganda-As-A-Service and Countermeasures. In S&P

  36. [36]

    Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. 2022. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv (2022)

  37. [37]

    Nora Belrose, Zach Furman, Logan Smith, Danny Halawi, Igor Ostrovsky, Lev McKinney, Stella Biderman, and Jacob Steinhardt. 2023. Eliciting Latent Predic- tions from Transformers with the Tuned Lens. arXiv (2023)

  38. [38]

    Emily M Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?. In the ACM conference on Fairness, Accountability, and Transparency

  39. [39]

    Daniil A Boiko, Robert MacKnight, and Gabe Gomes. 2023. Emergent autonomous scientific research capabilities of large language models. arXiv (2023)

  40. [40]

    Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. 2021. On the opportunities and risks of foundation models. arXiv (2021)

  41. [41]

    Nicholas Boucher, Ilia Shumailov, Ross Anderson, and Nicolas Papernot. 2022. Bad characters: Imperceptible nlp attacks. In S&P

  42. [42]

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. In NeurIPS

  43. [43]

    Matthew Burtell and Thomas Woodside. 2023. Artificial Influence: An Analysis Of AI-Driven Persuasion. arXiv (2023)

  44. [44]

    O’Reilly Media, Inc

    Clarence Chio and David Freeman. 2018. Machine learning and security. "O’Reilly Media, Inc. "

  45. [45]

    Ameet Deshpande, Vishvak Murahari, Tanmay Rajpurohit, Ashwin Kalyan, and Karthik Narasimhan. 2023. Toxicity in ChatGPT: Analyzing Persona-assigned Language Models. arXiv (2023)

  46. [46]

    Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A. Smith. 2020. RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models. In Findings of EMNLP

  47. [47]

    Josh A Goldstein, Girish Sastry, Micah Musser, Renee DiResta, Matthew Gentzel, and Katerina Sedova. 2023. Generative Language Models and Automated Influ- ence Operations: Emerging Threats and Potential Mitigations. arXiv (2023)

  48. [48]

    Jochen Hartmann, Jasper Schwenzow, and Maximilian Witte. 2023. The po- litical ideology of conversational AI: Converging evidence on ChatGPT’s pro- environmental, left-libertarian orientation. arXiv (2023)

  49. [49]

    Noah Goodman Jesse Mu, Xiang Lisa Li. 2023. Learning to Compress Prompts with Gist Tokens. arXiv (2023)

  50. [50]

    Ana Jojic, Zhen Wang, and Nebojsa Jojic. 2023. GPT is becoming a Turing machine: Here are some ways to program it. arXiv (2023)

  51. [51]

    Keith S Jones, Miriam E Armstrong, McKenna K Tornblad, and Akbar Siami Namin. 2021. How social engineers use persuasion principles during vishing attacks. Information & Computer Security 29, 2 (2021), 314–331

  52. [52]

    Daniel Kang, Xuechen Li, Ion Stoica, Carlos Guestrin, Matei Zaharia, and Tat- sunori Hashimoto. 2023. Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks. arXiv (2023)

  53. [53]

    Sebastian Krügel, Andreas Ostermaier, and Matthias Uhl. 2023. ChatGPT’s inconsistent moral advice influences users’ judgment. Scientific Reports 13, 1 (2023), 4569

  54. [54]

    Mina Lee, Megha Srivastava, Amelia Hardy, John Thickstun, Esin Durmus, Ash- win Paranjape, Ines Gerard-Ursin, Xiang Lisa Li, Faisal Ladhak, Frieda Rong, et al

  55. [55]

    arXiv (2022)

    Evaluating Human-Language Model Interaction. arXiv (2022)

  56. [56]

    Yaobo Liang, Chenfei Wu, Ting Song, Wenshan Wu, Yan Xia, Yu Liu, Yang Ou, Shuai Lu, Lei Ji, Shaoguang Mao, et al. 2023. TaskMatrix. AI: Completing Tasks by Connecting Foundation Models with Millions of APIs. arXiv (2023)

  57. [57]

    Stephanie Lin, Jacob Hilton, and Owain Evans. 2022. TruthfulQA: Measuring How Models Mimic Human Falsehoods. In ACL

  58. [58]

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023. Visual Instruc- tion Tuning. arXiv (2023)

  59. [59]

    Nelson F Liu, Tianyi Zhang, and Percy Liang. 2023. Evaluating Verifiability in Generative Search Engines. arXiv (2023)

  60. [60]

    Microsoft. 2023. Bing Preview Release Notes: Bing in the Edge Sidebar. [Link]

  61. [61]

    Moin Nadeem, Anna Bethke, and Siva Reddy. 2021. StereoSet: Measuring stereo- typical bias in pretrained language models. In ACL | IJCNLP

  62. [62]

    OpenAI. 2023. GPT-4 Technical Report. arXiv (2023). 13 Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz

  63. [63]

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Gray, et al

  64. [64]

    In NeurIPS

    Training language models to follow instructions with human feedback. In NeurIPS

  65. [65]

    Joon Sung Park, Joseph C O’Brien, Carrie J Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. 2023. Generative Agents: Interactive Simulacra of Human Behavior. arXiv (2023)

  66. [66]

    Was it “stated

    Roma Patel and Ellie Pavlick. 2021. “Was it “stated” or was it “claimed”?: How linguistic bias affects generative language models. In EMNLP

  67. [67]

    Ethan Perez, Sam Ringer, Kamil˙e Lukoši¯ut˙e, Karina Nguyen, Edwin Chen, Scott Heiner, Craig Pettit, Catherine Olsson, Sandipan Kundu, Saurav Kadavath, et al

  68. [68]

    arXiv (2022)

    Discovering Language Model Behaviors with Model-Written Evaluations. arXiv (2022)

  69. [69]

    Fábio Perez and Ian Ribeiro. 2022. Ignore Previous Prompt: Attack Techniques For Language Models. In NeurIPS ML Safety Workshop

  70. [70]

    Hashimoto Rohan Taori, Ishaan Gulrajani

    Tianyi Zhang Yann Dubois Xuechen Li Carlos Guestrin Percy Liang Tatsunori B. Hashimoto Rohan Taori, Ishaan Gulrajani. 2023. Alpaca: A Strong, Replicable Instruction-Following Model. [Link]

  71. [71]

    Ahmed Salem, Michael Backes, and Yang Zhang. 2022. Get a Model! Model Hijacking Attack Against Machine Learning Models. In NDSS

  72. [72]

    Shibani Santurkar, Esin Durmus, Faisal Ladhak, Cinoo Lee, Percy Liang, and Tatsunori Hashimoto. 2023. Whose Opinions Do Language Models Reflect?arXiv (2023)

  73. [73]

    Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2023. Toolformer: Language Models Can Teach Themselves to Use Tools. arXiv (2023)

  74. [74]

    Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. 2023. HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace. arXiv (2023)

  75. [75]

    Arabella Sinclair, Jaap Jumelet, Willem Zuidema, and Raquel Fernández. 2022. Structural Persistence in Language Models: Priming as a Window into Abstract Language Representations. Transactions of the Association for Computational Linguistics 10 (09 2022), 1031–1050

  76. [76]

    Jacob Steinhardt. 2023. Emergent Deception and Emergent Optimization. [Link]

  77. [77]

    Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F Christiano. 2020. Learning to summarize with human feedback. In NeurIPS

  78. [78]

    Jonas Thiergart, Stefan Huber, and Thomas Übellacker. 2021. Understanding emails and drafting responses–An approach using GPT-3. arXiv (2021)

  79. [79]

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed H Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In NeurIPS

  80. [80]

    Laura Weidinger, John Mellor, Maribeth Rauh, Conor Griffin, Jonathan Uesato, Po-Sen Huang, Myra Cheng, Mia Glaese, Borja Balle, Atoosa Kasirzadeh, et al

Showing first 80 references.