arxiv: 2403.14720 · v1 · submitted 2024-03-20 · 💻 cs.CR · cs.CL· cs.LG

Recognition: 3 theorem links

· Lean Theorem

Defending Against Indirect Prompt Injection Attacks With Spotlighting

Keegan Hines , Gary Lopez , Matthew Hall , Federico Zarfati , Yonatan Zunger , Emre Kiciman

Authors on Pith no claims yet

Pith reviewed 2026-05-14 22:24 UTC · model grok-4.3

classification 💻 cs.CR cs.CLcs.LG

keywords indirect prompt injectionLLM securityprompt engineeringprovenance signaladversarial defenseinput transformationmodel robustness

0 comments

The pith

Spotlighting uses input transformations to mark data origins, letting LLMs ignore embedded adversarial instructions and cutting indirect prompt injection success from over 50% to under 2%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces spotlighting as a family of prompt engineering techniques that transform inputs to create a consistent signal of their source. Large language models normally process concatenated text without knowing which parts come from trusted user commands versus untrusted external data. Indirect prompt injection attacks exploit this by hiding malicious instructions inside the untrusted data, causing the model to follow those instructions instead. Spotlighting counters this by applying transformations that give the model a reliable way to track provenance and follow only the intended commands. Experiments show the method lowers attack success rates sharply while leaving standard task performance nearly unchanged.

Core claim

Spotlighting is a family of prompt engineering techniques that utilize transformations of an input to provide a reliable and continuous signal of its provenance, enabling LLMs to distinguish among multiple sources of input and thereby defend against indirect prompt injection attacks, reducing attack success rates from greater than 50% to below 2% with minimal impact on task efficacy.

What carries the argument

Spotlighting, a family of prompt engineering techniques that apply transformations to inputs in order to create a continuous provenance signal that LLMs can follow when processing combined text streams.

If this is right

LLMs can be made to ignore instructions embedded in untrusted data when those inputs carry a spotlighted provenance signal.
Standard NLP task performance stays largely intact under spotlighting transformations.
The defense works across the GPT models tested without requiring any model retraining or architectural changes.
Prompt-based provenance signals offer a practical layer of protection for applications that combine user commands with external data sources.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Attackers may develop variants that replicate or mimic the specific transformation patterns to evade the provenance signal.
The same transformation approach could help LLMs separate other mixed inputs, such as user queries from retrieved documents in retrieval-augmented systems.
Evaluating spotlighting on non-GPT model families would test whether the effect depends on particular training characteristics.

Load-bearing premise

The selected transformations will produce a provenance signal that LLMs interpret and obey consistently, without being bypassed by new attack variants.

What would settle it

An experiment in which a new indirect prompt injection attack achieves more than 2% success rate against the same spotlighted inputs and GPT-family models used in the paper.

read the original abstract

Large Language Models (LLMs), while powerful, are built and trained to process a single text input. In common applications, multiple inputs can be processed by concatenating them together into a single stream of text. However, the LLM is unable to distinguish which sections of prompt belong to various input sources. Indirect prompt injection attacks take advantage of this vulnerability by embedding adversarial instructions into untrusted data being processed alongside user commands. Often, the LLM will mistake the adversarial instructions as user commands to be followed, creating a security vulnerability in the larger system. We introduce spotlighting, a family of prompt engineering techniques that can be used to improve LLMs' ability to distinguish among multiple sources of input. The key insight is to utilize transformations of an input to provide a reliable and continuous signal of its provenance. We evaluate spotlighting as a defense against indirect prompt injection attacks, and find that it is a robust defense that has minimal detrimental impact to underlying NLP tasks. Using GPT-family models, we find that spotlighting reduces the attack success rate from greater than {50}\% to below {2}\% in our experiments with minimal impact on task efficacy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Spotlighting gives a simple prompt-level way to mark input sources and cuts reported indirect injection success from over 50% to under 2%, but the tests skip adaptive attackers who know the defense.

read the letter

The paper introduces spotlighting as a set of input transformations that create a continuous provenance signal for LLMs handling mixed trusted and untrusted text. The central result is the drop in attack success rate from greater than 50% to below 2% on GPT-family models, with only minor effects on normal task performance. That reduction is the main takeaway worth noting if the numbers hold up under scrutiny. The technique itself is new in the sense that it focuses on explicit provenance signaling through transformations rather than the prompt patterns covered in the cited prior work. It is easy to implement and does not require model changes, which makes it immediately testable for people running real systems. The experiments appear to show the defense working across the attack strings they chose, and the authors check that task efficacy does not suffer much. The main limitation is that the evaluation uses fixed attack templates. There is no reported testing against adaptive adversaries who know the spotlighting rules and can try to mimic markers, add counter-instructions to ignore them, or reframe the input. That leaves the robustness claim dependent on the specific non-adaptive cases tested. Details on exact model versions, attack constructions, and statistical tests are also thin in the abstract, though the full paper presumably fills some of that in. This work is aimed at engineers building LLM applications that mix user commands with external data, such as agents or customer-support tools. A practitioner looking for a lightweight defense to try would find it useful. It deserves peer review because the idea is concrete, the reported effect size is large, and the gap it targets is real even if more adaptive testing would make the claims stronger.

Referee Report

2 major / 2 minor

Summary. The paper introduces spotlighting, a family of prompt-engineering techniques that apply input transformations (such as delimiters or highlighting) to create a continuous provenance signal distinguishing trusted user commands from untrusted data. The central claim, evaluated on GPT-family models, is that these techniques reduce the success rate of indirect prompt injection attacks from greater than 50% to below 2% while preserving task efficacy.

Significance. If the empirical results hold under broader testing, spotlighting would offer a lightweight, training-free defense against a practical attack vector in LLM applications that ingest untrusted content. The approach is notable for its simplicity and reported minimal overhead on downstream NLP tasks.

major comments (2)

[Evaluation] Evaluation section: the headline ASR reduction (>50% to <2%) is demonstrated only against fixed, non-adaptive attack templates. No experiments test adaptive adversaries who know the spotlighting rules and can embed counter-instructions to ignore markers, mimic their syntax, or re-frame the input as user-controlled, leaving the robustness claim unverified.
[Method and Experiments] Method and Experiments: the manuscript supplies no concrete attack constructions, model versions (e.g., GPT-3.5 vs. GPT-4), prompt templates, dataset sizes, or statistical tests, so the central quantitative claim cannot be reproduced or assessed for variance from the provided text.

minor comments (2)

[Abstract] Abstract: the notation {50} and {2} appears to be a LaTeX artifact; replace with explicit percentages.
[Method] Clarify the exact set of spotlighting transformations evaluated and whether they are applied uniformly or chosen per task.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly to improve evaluation robustness and reproducibility.

read point-by-point responses

Referee: [Evaluation] Evaluation section: the headline ASR reduction (>50% to <2%) is demonstrated only against fixed, non-adaptive attack templates. No experiments test adaptive adversaries who know the spotlighting rules and can embed counter-instructions to ignore markers, mimic their syntax, or re-frame the input as user-controlled, leaving the robustness claim unverified.

Authors: We agree that the current evaluation uses fixed, non-adaptive attack templates and does not include adaptive adversaries aware of spotlighting. This limits the strength of the robustness claim. In the revised manuscript we will add a dedicated subsection with new experiments testing adaptive strategies (e.g., instructions to ignore delimiters, mimic syntax, or re-frame provenance). We will report the resulting ASR values and discuss any remaining vulnerabilities. These additions will be included in the next version. revision: yes
Referee: [Method and Experiments] Method and Experiments: the manuscript supplies no concrete attack constructions, model versions (e.g., GPT-3.5 vs. GPT-4), prompt templates, dataset sizes, or statistical tests, so the central quantitative claim cannot be reproduced or assessed for variance from the provided text.

Authors: We acknowledge that the submitted text did not present these details with sufficient explicitness. The full manuscript uses GPT-3.5-turbo and GPT-4, specific attack templates (provided in the appendix), datasets of several hundred examples per task, and reports results with standard error across runs. To ensure reproducibility we will expand the Method and Experiments sections with explicit listings of model versions, full prompt templates, exact dataset sizes and sources, and statistical details including variance measures. We will also add a link to evaluation code and prompts. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical defense technique evaluated directly on attack success rates

full rationale

The paper introduces spotlighting as a prompt engineering family of input transformations and reports experimental results showing ASR reduction from >50% to <2% on GPT models. No mathematical derivations, equations, fitted parameters, or self-citations are used to derive the central claim; the result is obtained by direct testing of the proposed transformations against the evaluated attack strings. The evaluation is self-contained against the reported benchmarks with no reduction of predictions to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The approach rests on the domain assumption that LLMs can be guided by formatting signals in concatenated text and that the introduced transformations will not be ignored or adversarially stripped.

axioms (1)

domain assumption LLMs process concatenated inputs without distinguishing sections from different sources
This is the stated vulnerability that spotlighting is designed to mitigate.

invented entities (1)

spotlighting techniques no independent evidence
purpose: Provide reliable provenance signal through input transformations
Newly proposed family of prompt engineering methods with no independent evidence supplied beyond the reported experiments.

pith-pipeline@v0.9.0 · 5514 in / 1146 out tokens · 52924 ms · 2026-05-14T22:24:16.816058+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith.Foundation.LogicAsFunctionalEquation laws_of_logic_imply_dalembert_hypotheses echoes

?

echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

LLM is unable to distinguish which sections of prompt belong to various input sources
IndisputableMonolith.Foundation.DimensionForcing dimension_forced unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We introduce spotlighting, a family of prompt engineering techniques

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 23 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Comment and Control: Hijacking Agentic Workflows via Context-Grounded Evolution
cs.CR 2026-05 unverdicted novelty 8.0

JAW uses hybrid program analysis to evolve inputs that hijack agentic workflows, successfully compromising 4714 GitHub workflows and eight n8n templates to enable actions like credential exfiltration.
ShadowMerge: A Novel Poisoning Attack on Graph-Based Agent Memory via Relation-Channel Conflicts
cs.CR 2026-05 unverdicted novelty 8.0

ShadowMerge poisons graph-based agent memory by creating relation-channel conflicts that get extracted and retrieved, achieving 93.8% attack success rate on Mem0 and datasets like PubMedQA while evading prior defenses.
ShadowMerge: A Novel Poisoning Attack on Graph-Based Agent Memory via Relation-Channel Conflicts
cs.CR 2026-05 unverdicted novelty 8.0

ShadowMerge poisons graph-based agent memory via relation-channel conflicts using an AIR pipeline, achieving 93.8% average attack success rate on Mem0 and three real-world datasets while bypassing existing defenses.
Heimdallr: Characterizing and Detecting LLM-Induced Security Risks in GitHub CI Workflows
cs.CR 2026-05 unverdicted novelty 8.0

Heimdallr detects LLM-induced security risks in GitHub CI workflows by normalizing them into an LLM-Workflow Property Graph and combining triggerability analysis with LLM-assisted dataflow summarization, achieving ove...
Trojan Hippo: Weaponizing Agent Memory for Data Exfiltration
cs.CR 2026-05 unverdicted novelty 8.0

Trojan Hippo attacks on LLM agent memory achieve 85-100% success rates in data exfiltration across four memory backends even after 100 benign sessions, while evaluated defenses reduce success rates but impose varying ...
AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents
cs.CR 2024-06 unverdicted novelty 8.0

AgentDojo introduces an extensible evaluation framework populated with realistic agent tasks and security test cases to measure prompt injection robustness in tool-using LLM agents.
No More, No Less: Task Alignment in Terminal Agents
cs.LG 2026-05 unverdicted novelty 7.0

The TAB benchmark reveals that frontier terminal agents achieve high task completion but low selective alignment with relevant environmental cues over distractors, and prompt-injection defenses block both.
IPI-proxy: An Intercepting Proxy for Red-Teaming Web-Browsing AI Agents Against Indirect Prompt Injection
cs.CR 2026-05 unverdicted novelty 7.0

IPI-proxy is a toolkit using an intercepting proxy to inject indirect prompt injection attacks into live web pages for testing AI browsing agents against hidden instructions.
Toward a Principled Framework for Agent Safety Measurement
cs.CR 2026-05 unverdicted novelty 7.0

BOA uses budgeted search over agent trajectories to report the probability an LLM agent stays safe, finding unsafe paths that sampling misses.
AgentVisor: Defending LLM Agents Against Prompt Injection via Semantic Virtualization
cs.CR 2026-04 unverdicted novelty 7.0

AgentVisor cuts prompt injection success rate to 0.65% in LLM agents with only 1.45% utility loss via semantic privilege separation and one-shot self-correction.
Beyond Pattern Matching: Seven Cross-Domain Techniques for Prompt Injection Detection
cs.CR 2026-04 unverdicted novelty 7.0

Seven cross-domain techniques for prompt injection detection are proposed; three implemented versions raise F1 scores on multiple benchmarks while releasing all code and data.
Causality Laundering: Denial-Feedback Leakage in Tool-Calling LLM Agents
cs.CR 2026-04 unverdicted novelty 7.0

The paper defines causality laundering as an attack leaking information from denial outcomes in LLM tool calls and proposes the Agentic Reference Monitor to block it using denial-aware provenance graphs.
Kill-Chain Canaries: Stage-Level Tracking of Prompt Injection Across Attack Surfaces and Model Safety Tiers
cs.CR 2026-03 conditional novelty 7.0

Stage-level tracking of prompt injection reveals that write-node placement and model-specific behaviors determine attack outcomes more than initial exposure in LLM pipelines.
AgentShield: Deception-based Compromise Detection for Tool-using LLM Agents
cs.CR 2026-05 unverdicted novelty 6.0

AgentShield uses layered deception traps in LLM agent tool interfaces to detect indirect prompt injection compromises with 90.7-100% success on commercial models, zero false positives, and cross-lingual transfer witho...
ARGUS: Defending LLM Agents Against Context-Aware Prompt Injection
cs.CR 2026-05 unverdicted novelty 6.0

ARGUS defends LLM agents from context-aware prompt injections by tracking information provenance and verifying decisions against trustworthy evidence, reducing attack success to 3.8% while retaining 87.5% task utility.
Semia: Auditing Agent Skills via Constraint-Guided Representation Synthesis
cs.CR 2026-05 unverdicted novelty 6.0

Semia synthesizes Datalog representations of agent skills via constraint-guided loops to enable reachability queries for semantic risks, finding critical issues in over half of 13,728 real skills with 97.7% recall on ...
Evaluation of Prompt Injection Defenses in Large Language Models
cs.CR 2026-04 unverdicted novelty 6.0

Output filtering implemented in application code is the only defense that survived an adaptive prompt-injection attacker across 15,000 attacks; model-based defenses all broke.
Spore: Efficient and Training-Free Privacy Extraction Attack on LLMs via Inference-Time Hybrid Probing
cs.CR 2026-04 unverdicted novelty 6.0

Spore extracts private data from LLM memory with one query in black-box mode or ranked tokens in gray-box, outperforming prior attacks while bypassing defenses.
An AI Agent Execution Environment to Safeguard User Data
cs.CR 2026-04 unverdicted novelty 6.0

GAAP guarantees confidentiality of private user data for AI agents by enforcing user-specified permissions deterministically through persistent information flow tracking, without trusting the agent or requiring attack...
How Adversarial Environments Mislead Agentic AI?
cs.AI 2026-04 unverdicted novelty 6.0

Adversarial compromise of tool outputs misleads agentic AI via breadth and depth attacks, revealing that epistemic and navigational robustness are distinct and often trade off against each other.
QuickScope: Certifying Hard Questions in Dynamic LLM Benchmarks
cs.CL 2026-04 unverdicted novelty 6.0

QuickScope uses modified COUP Bayesian optimization to find truly difficult questions in dynamic LLM benchmarks more sample-efficiently than baselines while cutting false positives.
Evaluation of Prompt Injection Defenses in Large Language Models
cs.CR 2026-04 unverdicted novelty 5.0

Only output filtering with hardcoded rules in application code prevented prompt injection leaks in LLMs, as all model-based defenses were defeated by an adaptive attacker.
Evaluating Prompt Injection Defenses for Educational LLM Tutors: Security-Usability-Latency Trade-offs
cs.CR 2026-03 unverdicted novelty 5.0

A domain-specific multi-layer safeguard for educational LLM tutors achieves 0% false positives and 46.34% attack bypass at 2.5 ms latency on a 480-query holdout, outperforming NeMo Guardrails in usability but not full...

Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages · cited by 21 Pith papers · 12 internal anchors

[1]

Code Llama: Open Foundation Models for Code

B. Roziere, J. Gehring, F. Gloeckle, S. Sootla, I. Gat, X. E. Tan, Y . Adi, J. Liu, T. Remez, J. Rapin,et al., “Code llama: Open foundation models for code,” arXiv preprint arXiv:2308.12950 , 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

Benchmarking and defending against indirect prompt injection attacks on large language models.arXiv preprint arXiv:2312.14197, 2025

J. Yi, Y . Xie, B. Zhu, K. Hines, E. Kiciman, G. Sun, X. Xie, F. Wu, “Benchmarking and Defending Against Indirect Prompt Injection At- tacks on Large Language Models”, arXiv preprint arXiv:2312.14197 , 2023

work page arXiv 2023
[3]

SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems

A. Wang, Y . Pruksachatkun, N. Nangia, A. Singh, J. Michael, F. Hill, O. Levy, S. R. Bowman, “SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems,” arXiv preprint arXiv:1905.00537, 2020

work page internal anchor Pith review arXiv 1905
[4]

SQuAD: 100,000+ Questions for Machine Comprehension of Text

P. Rajpurkar, J. Zhang, K. Lopyrev, P. Liang, “SQuAD: 100,000+ Questions for Machine Comprehension of Text,” arXiv preprint arXiv:1606.05250, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[5]

Learning Word Vectors for Sentiment Analysis,

A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y . Ng, C. Potts, “Learning Word Vectors for Sentiment Analysis,” inProceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies , Portland, Oregon, USA, June 2011, pp. 142–150

work page 2011
[7]

Q Series: Switching and Signalling No. 5,

International Telecommunication Union, “Q Series: Switching and Signalling No. 5,” 1988. [Online]. Available: https://www.itu.int/rec/T- REC-Q.140-Q.180-198811-I/en. [Accessed: Feb. 2, 2024]

work page 1988
[8]

Q Series: Switching and Signalling No. 6,

International Telecommunication Union, “Q Series: Switching and Signalling No. 6,” 1988. [Online]. Available: https://www.itu.int/rec/T- REC-Q.251-Q.300-198811-I/en. [Accessed: Feb. 2, 2024]

work page 1988
[9]

Language Models are Few-Shot Learners

T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. , “Language models are few-shot learners,” arXiv preprint arXiv:2005.14165, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2005
[10]

GPT-4 Technical Report

OpenAI, “GPT-4 Technical Report,” arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[11]

Llama 2: Open Foundation and Fine-Tuned Chat Models

H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y . Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, et al. , “Llama 2: Open foundation and fine-tuned chat models,” arXiv preprint arXiv:2307.09288, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[12]

Constitutional AI: Harmlessness from AI Feedback

Y . Bai, S. Kadavath, S. Kundu, A. Askell, J. Kernion, A. Jones, A. Chen, A. Goldie, A. Mirhoseini, C. McKinnon, et al. , “Con- stitutional ai: Harmlessness from ai feedback,” arXiv preprint arXiv:2212.08073, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[13]

PaLM: Scaling Language Modeling with Pathways

A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann, et al., “Palm: Scaling language modeling with pathways,” arXiv preprint arXiv:2204.02311, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[14]

Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection

K. Greshake, S. Abdelnabi, S. Mishra, C. Endres, T. Holz, M. Fritz, “More than you’ve asked for: A Comprehensive Analysis of Novel Prompt Injection Threats to Application-Integrated Large Language Models,” arXiv preprint arXiv:2302.12173 , 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[15]

InstructGPT: Neurally-Guided Procedural Generation of 3D Shapes from Natural Language Instructions,

L. Ouyang, S. Toyer, C. Donahue, J. Rahim, Y . Bao, J. Wu, H. He, Z. Tung, A. Chaganty, P. Liang, C. D. Manning, J. Pennington, A. Rad- ford, D. Amodei, et al. , “InstructGPT: Neurally-Guided Procedural Generation of 3D Shapes from Natural Language Instructions,” arXiv preprint arXiv:2202.02796, 2022

work page arXiv 2022
[16]

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. Le, D. Zhou, et al., “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models,” arXiv preprint arXiv:2201.11903 , 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[17]

Tree of Thoughts: Deliberate Problem Solving with Large Language Models

S. Yao, D. Yu, J. Zhao, I. Shafran, T. L. Griffiths, Y . Cao, K. Narasimhan, et al., “Tree of Thoughts: Deliberate Problem Solving with Large Language Models,”arXiv preprint arXiv:2305.10601, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[18]

How We Broke LLMs: Indirect Prompt Injection,

K. Greshake, “How We Broke LLMs: Indirect Prompt Injection,” Kai Greshake, 2022. [Online]. Available: https://kai-greshake.de/posts/llm- malware/. [Accessed: Feb. 21, 2024]

work page 2022
[19]

Hacking Google Bard - From Prompt In- jection to Data Exfiltration,

Wunderwuzzi, “Hacking Google Bard - From Prompt In- jection to Data Exfiltration,” Embrace The Red , 2023. [On- line]. Available: https://embracethered.com/blog/posts/2023/google- bard-data-exfiltration/. [Accessed: Feb. 21, 2024]

work page 2023
[20]

Core Views on AI Safety: When, Why, What, and How,

Anthropic Team, “Core Views on AI Safety: When, Why, What, and How,” 2023. [Online]. Available: https://www.anthropic.com/news/core-views-on-ai-safety. [Accessed: Feb. 21, 2024]

work page 2023
[21]

Universal and Transferable Adversarial Attacks on Aligned Language Models

A. Zou, Z. Wang, N. Carlini, M. Nasr, J. Z. Kolter, M. Fredrikson, et al. , “Universal and Transferable Adversarial Attacks on Aligned Language Models,” arXiv preprint arXiv:2307.15043 , 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[22]

[Accessed: Feb

Jailbreak Chat, Available: https://jailbreakchat.com/. [Accessed: Feb. 2, 2024]

work page 2024
[23]

notices” the attack text but does not “fall for

Appendix 8.1. Measuring Attack Success Rate The simplicity of the keyword payload allows us to clearly de- termine whether (i) the original metaprompt instructions are over- ridden or (ii) the LLM is mostly unaffected by the attack. Take, for example, a document summarization use case. In the attack documents, the keyword ‘canary’ is the desired outcome o...

work page 2021