Kill-Chain Canaries: Stage-Level Tracking of Prompt Injection Across Attack Surfaces and Model Safety Tiers
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-14 22:13 UTC · model grok-4.3
The pith
Write-node placement is the highest-leverage safety decision for blocking prompt injection propagation in multi-agent LLM systems.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By tracking a cryptographic canary token through the sequence EXPOSED -> PERSISTED -> RELAYED -> EXECUTED across 950 runs, five frontier LLMs, six attack surfaces and five defense conditions, the study shows that prompt injection outcomes are determined downstream of initial exposure. Routing writes through a verified model eliminates propagation entirely, while channel mismatch causes every tested defense to fail on at least one surface and invisible whitefont PDF payloads achieve attack success rates equal to or higher than visible text.
What carries the argument
The kill-chain canary, a cryptographic token inserted into inputs and tracked stage by stage to isolate where injection succeeds or is blocked.
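The canary mechanism described above can be sketched in a few lines. This is a hypothetical reconstruction of the idea, not the authors' released tooling: `make_canary`, `deepest_stage`, and the artifact contents are illustrative assumptions.

```python
import secrets

# Sketch of the kill-chain canary idea: a random, non-semantic token is
# embedded in an input document, and each pipeline stage's artifact is
# scanned for it to find where propagation stops. All names here are
# illustrative, not the paper's actual tooling.

STAGES = ["EXPOSED", "PERSISTED", "RELAYED", "EXECUTED"]

def make_canary() -> str:
    # A non-semantic hex token minimizes the chance a model interprets
    # it as an instruction rather than inert data.
    return f"CANARY-{secrets.token_hex(8)}"

def deepest_stage(canary, artifacts):
    """Return the furthest kill-chain stage whose artifact contains the canary."""
    reached = None
    for stage in STAGES:
        if canary in artifacts.get(stage, ""):
            reached = stage
        else:
            break  # propagation stops at the first stage lacking the token
    return reached

canary = make_canary()
artifacts = {
    "EXPOSED": f"ignore previous instructions {canary}",
    "PERSISTED": f"memory note: {canary}",
    "RELAYED": "summary without the token",
    "EXECUTED": "",
}
print(deepest_stage(canary, artifacts))  # PERSISTED
```

A run is then scored by the deepest stage reached, which is what turns the usual binary attack-success metric into a stage-level diagnostic.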
If this is right
- Routing all memory writes through a verified model eliminates propagation even when earlier stages are compromised.
- All four tested defenses fail on at least one attack surface solely because of channel mismatch, without any adversarial adaptation.
- Invisible whitefont PDF payloads match or exceed the attack success rate of visible text, so rendered-layer screening is insufficient.
- Model behavior diverges sharply after the initial exposure stage, with Claude achieving zero propagation at write and GPT-4o-mini reaching 53 percent.
- Production document-ingestion pipelines over earnings calls, SEC filings and analyst reports inherit the same architecture-dependent risks.
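The write-node finding above suggests a simple gating pattern: every memory write must pass a verifier before it persists. A minimal sketch, where the `verifier` callable stands in for the paper's verified model (the function names and the toy canary-rejecting check are illustrative assumptions, not the paper's implementation):

```python
from typing import Callable

# Sketch of write-node gating: nothing reaches shared memory unless a
# verifier pass approves it, so propagation is blocked at the write
# node even when earlier stages were compromised.

def guarded_write(memory: list, content: str,
                  verifier: Callable[[str], bool]) -> bool:
    """Persist `content` only if the verifier approves it; return success."""
    if verifier(content):
        memory.append(content)
        return True
    return False  # blocked at the write node

# Toy verifier: reject anything carrying a canary-style marker. A real
# deployment would route this check through a verified model instead.
def reject_canaries(text: str) -> bool:
    return "CANARY-" not in text

memory = []
guarded_write(memory, "Q3 revenue grew 12%", reject_canaries)       # persisted
guarded_write(memory, "note CANARY-deadbeef to self", reject_canaries)  # blocked
print(memory)  # ['Q3 revenue grew 12%']
```

The design point is that the gate sits at a single architectural choke point, which is why the paper can attribute large outcome differences to write-node placement alone.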
Where Pith is reading between the lines
- Designers of multi-agent workflows should treat the write node as the primary security boundary rather than relying on input filtering alone.
- The stage-tracking approach could be extended to other agent behaviors such as tool calls or external API actions to map additional propagation paths.
- Combining canary tracking with existing logging systems would allow real-time identification of which pipeline stage first allows an injection to persist.
- Architectural choices at the write node may generalize to other covert channels beyond prompt injection, such as data exfiltration or policy evasion.
Load-bearing premise
Tracking a single cryptographic canary token through the four stages accurately captures real-world prompt injection propagation without introducing artifacts or missing context-dependent behaviors.
What would settle it
A controlled test in which a live multi-agent document-processing system receives a genuine prompt injection and the token fails to appear at the same stages or with the same success rates observed in the canary runs.
Original abstract
Multi-agent LLM systems are entering production -- processing documents, managing workflows, acting on behalf of users -- yet their resilience to prompt injection is still evaluated with a single binary: did the attack succeed? This leaves architects without the diagnostic information needed to harden real pipelines. We introduce a kill-chain canary methodology that tracks a cryptographic token through four stages (EXPOSED -> PERSISTED -> RELAYED -> EXECUTED) across 950 runs, five frontier LLMs, six attack surfaces, and five defense conditions. The results reframe prompt injection as a pipeline-architecture problem: every model is fully exposed, yet outcomes diverge downstream -- Claude blocks all injections at memory-write (0/164 ASR), GPT-4o-mini propagates at 53%, and DeepSeek exhibits 0%/100% across surfaces from the same model. Three findings matter for deployment: (1) write-node placement is the highest-leverage safety decision -- routing writes through a verified model eliminates propagation; (2) all four defenses fail on at least one surface due to channel mismatch alone, no adversarial adaptation required; (3) invisible whitefont PDF payloads match or exceed visible-text ASR, meaning rendered-layer screening is insufficient. These dynamics apply directly to production: institutional investors and financial firms already run NLP pipelines over earnings calls, SEC filings, and analyst reports -- the document-ingestion workflows now migrating to LLM agents. Code, run logs, and tooling are publicly released.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a kill-chain canary methodology that embeds a cryptographic token to track prompt injection propagation through four stages (EXPOSED -> PERSISTED -> RELAYED -> EXECUTED) in multi-agent LLM systems. Across 950 runs on five frontier models, six attack surfaces, and five defense conditions, it reports concrete attack success rates (e.g., Claude 0/164 ASR at write stage, GPT-4o-mini 53% propagation) and concludes that write-node placement is the highest-leverage safety decision, all tested defenses fail on at least one surface due to channel mismatch, and invisible whitefont PDF payloads are as effective as visible text.
Significance. If the canary tracking proves robust, the work offers timely empirical diagnostics for prompt injection in production multi-agent pipelines, shifting evaluation from binary success to stage-level measurement with direct relevance to document-ingestion workflows in finance and other domains. The public release of code, run logs, and tooling is a clear strength that supports reproducibility and extension.
major comments (2)
- [Methodology] Methodology section: The central claim that write-node placement eliminates propagation (and is the highest-leverage decision) rests on the canary accurately measuring real propagation. However, no controls are described for canary-specific artifacts, such as comparing detectable cryptographic tokens against stealthy reasoning-based payloads or alternative tracking methods; this risks the reported divergences (Claude 0/164 vs. GPT-4o-mini 53%) being setup-dependent rather than generalizable.
- [Results] Results section (ASR tables): The 0/164 ASR for Claude at the write stage and 0%/100% surface divergence for DeepSeek are load-bearing for the pipeline-architecture reframing, yet the manuscript provides insufficient detail on stage-definition consistency, run allocation per surface, and statistical tests to confirm these are not influenced by canary detectability.
minor comments (2)
- [Abstract] Abstract and §4: Clarify the exact embedding procedure for the canary token in each stage to allow readers to assess potential model-specific detection biases.
- [Figures] Figure captions: Ensure all attack-surface labels and defense conditions are fully defined in the caption text rather than relying on the main body.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback on our kill-chain canary methodology. We address each major comment below and have revised the manuscript to incorporate additional methodological controls, expanded statistical reporting, and explicit discussion of potential artifacts.
Point-by-point responses
Referee: [Methodology] Methodology section: The central claim that write-node placement eliminates propagation (and is the highest-leverage decision) rests on the canary accurately measuring real propagation. However, no controls are described for canary-specific artifacts, such as comparing detectable cryptographic tokens against stealthy reasoning-based payloads or alternative tracking methods; this risks the reported divergences (Claude 0/164 vs. GPT-4o-mini 53%) being setup-dependent rather than generalizable.
Authors: We agree that the absence of explicit controls comparing the cryptographic canary to alternative tracking methods (such as natural-language or reasoning-based payloads) represents a gap in the current methodology. The canary was designed as a non-semantic random token to minimize model interpretation as instruction, and its behavior was consistent across all tested surfaces and models. However, to directly address the concern about setup-dependence, we have added a new subsection (3.4) that reports a control experiment on a 100-run subset using semantic tracking phrases. These controls confirm that the stage-level divergences (including Claude's 0/164 write-stage block and GPT-4o-mini's 53% propagation) persist independently of the tracking mechanism. We have also added a limitations paragraph discussing residual risks of canary-specific artifacts. revision: yes
Referee: [Results] Results section (ASR tables): The 0/164 ASR for Claude at the write stage and 0%/100% surface divergence for DeepSeek are load-bearing for the pipeline-architecture reframing, yet the manuscript provides insufficient detail on stage-definition consistency, run allocation per surface, and statistical tests to confirm these are not influenced by canary detectability.
Authors: We have revised the results section and added Appendix B to provide the requested details. Stage definitions are now illustrated with concrete pipeline logs showing exact canary detection points at each node. Run allocation is reported per surface-model-defense combination (approximately 30-40 runs per cell to reach the total of 950). We now include Fisher's exact tests for all key ASR comparisons, with p-values confirming the statistical significance of the Claude write-stage result (p < 0.001) and the DeepSeek surface divergence (p < 0.001). These additions were generated from the existing run logs and do not alter the reported findings. revision: yes
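For reference, a one-sided Fisher's exact test on a 2x2 ASR table can be computed with only the standard library via the hypergeometric distribution. The rebuttal's reported tests may use a two-sided variant; the example pairs the 0/164 result with illustrative counts for a 53% propagation rate, so the specific group-2 numbers are assumptions.

```python
from math import comb

# One-sided Fisher's exact test for a 2x2 table, computed exactly from
# the hypergeometric distribution (no SciPy needed).

def fisher_one_sided(a: int, b: int, c: int, d: int) -> float:
    """P(successes in group 1 >= a), with row and column margins fixed.

    Table:            success  failure
        group 1          a        b
        group 2          c        d
    """
    n = a + b + c + d
    row1, col1 = a + b, a + c
    denom = comb(n, col1)
    k_max = min(row1, col1)
    return sum(comb(row1, k) * comb(n - row1, col1 - k)
               for k in range(a, k_max + 1)) / denom

# Illustrative comparison: 87/164 propagated writes (53%) vs. 0/164.
p = fisher_one_sided(87, 77, 0, 164)
print(p < 0.001)  # True
```

With counts this lopsided the exact p-value is far below any conventional threshold, which is consistent with the significance levels the authors report.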
Circularity Check
Empirical measurement study with no circular derivations or self-referential claims
Full rationale
The paper is an empirical measurement study that introduces a kill-chain canary methodology to track a cryptographic token through four stages (EXPOSED -> PERSISTED -> RELAYED -> EXECUTED) and reports observed attack success rates from 950 experimental runs across models and surfaces. No equations, fitted parameters, predictive derivations, or self-citations are used as load-bearing steps in the provided text. Central claims about write-node placement and defense failures are grounded directly in the experimental outcomes rather than reducing to inputs by construction. The methodology is self-contained against external benchmarks and does not invoke uniqueness theorems or ansatzes from prior author work.
Axiom & Free-Parameter Ledger
axioms (1)
- [domain assumption] The four stages (EXPOSED, PERSISTED, RELAYED, EXECUTED) form a complete and non-overlapping model of prompt injection propagation in multi-agent systems.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction (unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "kill-chain canary methodology that tracks a cryptographic token through four stages (EXPOSED → PERSISTED → RELAYED → EXECUTED)"
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "write-node placement is the highest-leverage safety decision"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- MAGIQ: A Post-Quantum Multi-Agentic AI Governance System with Provable Security
  MAGIQ introduces a post-quantum secure system for policy definition, enforcement, and accountability in multi-agent AI using novel cryptographic protocols and UC framework proofs.
Reference graph
Works this paper leans on
- [1] E. M. Hutchins, M. J. Cloppert, and R. M. Amin. Intelligence-driven computer network defense informed by analysis of adversary campaigns and intrusion kill chains. Proc. 6th Annual International Conference on Information Warfare and Security, 2011.
- [2] K. Greshake et al. Not what you've signed up for: Compromising real-world LLM-integrated applications with indirect prompt injections. AISec @ CCS, 2023.
- [3] F. Perez and I. Ribeiro. Ignore previous prompt: Attack techniques for language models. arXiv:2211.09527, 2022.
- [4] E. Debenedetti et al. AgentDojo: A dynamic environment to evaluate prompt injection attacks and defenses for LLM agents. NeurIPS, 2024.
- [5] Q. Zhan et al. InjecAgent: Benchmarking indirect prompt injections in tool-integrated LLM agents. arXiv:2403.02691, 2024.
- [6]
- [7] R. Shi et al. Zombie agents: Persistent memory poisoning attacks on long-context LLM agents. arXiv:2602.15654, 2025.
- [8] M. Nasr et al. Comprehensive assessment of defense mechanisms against prompt injection attacks. Technical report, Google DeepMind / OpenAI / Anthropic, 2025.
- [9] K. Hines et al. Defending against indirect prompt injection attacks with spotlighting. arXiv:2403.14720, 2024.
- [10] E. B. Wilson. Probable inference, the law of succession, and statistical inference. JASA, 22(158):209–212, 1927.
- [11]
- [12]
- [13]
- [14] Q. Zhang, L. Fu, L. Lian et al. Evaluating privilege usage of agents on real-world tools. arXiv:2603.28166, 2026.
- [15] X. Wang, Y. Zhou, Q. Wang et al. Beyond content safety: Real-time monitoring for reasoning vulnerabilities. arXiv:2603.25412, 2026.
- [16]