Kill-Chain Canaries: Stage-Level Tracking of Prompt Injection Across Attack Surfaces and Model Safety Tiers
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-14 22:13 UTC · model grok-4.3
The pith
Write-node placement is the highest-leverage safety decision for blocking prompt injection propagation in multi-agent LLM systems.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By tracking a cryptographic canary token through the sequence EXPOSED -> PERSISTED -> RELAYED -> EXECUTED across 950 runs, five frontier LLMs, six attack surfaces and five defense conditions, the study shows that prompt injection outcomes are determined downstream of initial exposure. Routing writes through a verified model eliminates propagation entirely, while channel mismatch causes every tested defense to fail on at least one surface and invisible whitefont PDF payloads achieve attack success rates equal to or higher than visible text.
What carries the argument
The kill-chain canary, a cryptographic token inserted into inputs and tracked stage by stage to isolate where injection succeeds or is blocked.
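The canary mechanism described above can be sketched in a few lines. This is a hypothetical reconstruction of the idea, not the authors' released tooling: `make_canary`, `deepest_stage`, and the artifact contents are illustrative assumptions.

```python
import secrets

# Sketch of the kill-chain canary idea: a random, non-semantic token is
# embedded in an input document, and each pipeline stage's artifact is
# scanned for it to find where propagation stops. All names here are
# illustrative, not the paper's actual tooling.

STAGES = ["EXPOSED", "PERSISTED", "RELAYED", "EXECUTED"]

def make_canary() -> str:
    # A non-semantic hex token minimizes the chance a model interprets
    # it as an instruction rather than inert data.
    return f"CANARY-{secrets.token_hex(8)}"

def deepest_stage(canary, artifacts):
    """Return the furthest kill-chain stage whose artifact contains the canary."""
    reached = None
    for stage in STAGES:
        if canary in artifacts.get(stage, ""):
            reached = stage
        else:
            break  # propagation stops at the first stage lacking the token
    return reached

canary = make_canary()
artifacts = {
    "EXPOSED": f"ignore previous instructions {canary}",
    "PERSISTED": f"memory note: {canary}",
    "RELAYED": "summary without the token",
    "EXECUTED": "",
}
print(deepest_stage(canary, artifacts))  # PERSISTED
```

A run is then scored by the deepest stage reached, which is what turns the usual binary attack-success metric into a stage-level diagnostic.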
If this is right
- Routing all memory writes through a verified model eliminates propagation even when earlier stages are compromised.
- All four tested defenses fail on at least one attack surface solely because of channel mismatch, without any adversarial adaptation.
- Invisible whitefont PDF payloads match or exceed the attack success rate of visible text, so rendered-layer screening is insufficient.
- Model behavior diverges sharply after the initial exposure stage, with Claude achieving zero propagation at write and GPT-4o-mini reaching 53 percent.
- Production document-ingestion pipelines over earnings calls, SEC filings and analyst reports inherit the same architecture-dependent risks.
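The write-node finding above suggests a simple gating pattern: every memory write must pass a verifier before it persists. A minimal sketch, where the `verifier` callable stands in for the paper's verified model (the function names and the toy canary-rejecting check are illustrative assumptions, not the paper's implementation):

```python
from typing import Callable

# Sketch of write-node gating: nothing reaches shared memory unless a
# verifier pass approves it, so propagation is blocked at the write
# node even when earlier stages were compromised.

def guarded_write(memory: list, content: str,
                  verifier: Callable[[str], bool]) -> bool:
    """Persist `content` only if the verifier approves it; return success."""
    if verifier(content):
        memory.append(content)
        return True
    return False  # blocked at the write node

# Toy verifier: reject anything carrying a canary-style marker. A real
# deployment would route this check through a verified model instead.
def reject_canaries(text: str) -> bool:
    return "CANARY-" not in text

memory = []
guarded_write(memory, "Q3 revenue grew 12%", reject_canaries)       # persisted
guarded_write(memory, "note CANARY-deadbeef to self", reject_canaries)  # blocked
print(memory)  # ['Q3 revenue grew 12%']
```

The design point is that the gate sits at a single architectural choke point, which is why the paper can attribute large outcome differences to write-node placement alone.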
Where Pith is reading between the lines
- Designers of multi-agent workflows should treat the write node as the primary security boundary rather than relying on input filtering alone.
- The stage-tracking approach could be extended to other agent behaviors such as tool calls or external API actions to map additional propagation paths.
- Combining canary tracking with existing logging systems would allow real-time identification of which pipeline stage first allows an injection to persist.
- Architectural choices at the write node may generalize to other covert channels beyond prompt injection, such as data exfiltration or policy evasion.
Load-bearing premise
Tracking a single cryptographic canary token through the four stages accurately captures real-world prompt injection propagation without introducing artifacts or missing context-dependent behaviors.
What would settle it
A controlled test in which a live multi-agent document-processing system receives a genuine prompt injection and the token fails to appear at the same stages or with the same success rates observed in the canary runs.
Original abstract
Multi-agent LLM systems are entering production -- processing documents, managing workflows, acting on behalf of users -- yet their resilience to prompt injection is still evaluated with a single binary: did the attack succeed? This leaves architects without the diagnostic information needed to harden real pipelines. We introduce a kill-chain canary methodology that tracks a cryptographic token through four stages (EXPOSED -> PERSISTED -> RELAYED -> EXECUTED) across 950 runs, five frontier LLMs, six attack surfaces, and five defense conditions. The results reframe prompt injection as a pipeline-architecture problem: every model is fully exposed, yet outcomes diverge downstream -- Claude blocks all injections at memory-write (0/164 ASR), GPT-4o-mini propagates at 53%, and DeepSeek exhibits 0%/100% across surfaces from the same model. Three findings matter for deployment: (1) write-node placement is the highest-leverage safety decision -- routing writes through a verified model eliminates propagation; (2) all four defenses fail on at least one surface due to channel mismatch alone, no adversarial adaptation required; (3) invisible whitefont PDF payloads match or exceed visible-text ASR, meaning rendered-layer screening is insufficient. These dynamics apply directly to production: institutional investors and financial firms already run NLP pipelines over earnings calls, SEC filings, and analyst reports -- the document-ingestion workflows now migrating to LLM agents. Code, run logs, and tooling are publicly released.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a kill-chain canary methodology that embeds a cryptographic token to track prompt injection propagation through four stages (EXPOSED -> PERSISTED -> RELAYED -> EXECUTED) in multi-agent LLM systems. Across 950 runs on five frontier models, six attack surfaces, and five defense conditions, it reports concrete attack success rates (e.g., Claude 0/164 ASR at write stage, GPT-4o-mini 53% propagation) and concludes that write-node placement is the highest-leverage safety decision, all tested defenses fail on at least one surface due to channel mismatch, and invisible whitefont PDF payloads are as effective as visible text.
Significance. If the canary tracking proves robust, the work offers timely empirical diagnostics for prompt injection in production multi-agent pipelines, shifting evaluation from binary success to stage-level measurement with direct relevance to document-ingestion workflows in finance and other domains. The public release of code, run logs, and tooling is a clear strength that supports reproducibility and extension.
major comments (2)
- [Methodology] Methodology section: The central claim that write-node placement eliminates propagation (and is the highest-leverage decision) rests on the canary accurately measuring real propagation. However, no controls are described for canary-specific artifacts, such as comparing detectable cryptographic tokens against stealthy reasoning-based payloads or alternative tracking methods; this risks the reported divergences (Claude 0/164 vs. GPT-4o-mini 53%) being setup-dependent rather than generalizable.
- [Results] Results section (ASR tables): The 0/164 ASR for Claude at the write stage and 0%/100% surface divergence for DeepSeek are load-bearing for the pipeline-architecture reframing, yet the manuscript provides insufficient detail on stage-definition consistency, run allocation per surface, and statistical tests to confirm these are not influenced by canary detectability.
minor comments (2)
- [Abstract] Abstract and §4: Clarify the exact embedding procedure for the canary token in each stage to allow readers to assess potential model-specific detection biases.
- [Figures] Figure captions: Ensure all attack-surface labels and defense conditions are fully defined in the caption text rather than relying on the main body.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback on our kill-chain canary methodology. We address each major comment below and have revised the manuscript to incorporate additional methodological controls, expanded statistical reporting, and explicit discussion of potential artifacts.
Point-by-point responses
Referee: [Methodology] Methodology section: The central claim that write-node placement eliminates propagation (and is the highest-leverage decision) rests on the canary accurately measuring real propagation. However, no controls are described for canary-specific artifacts, such as comparing detectable cryptographic tokens against stealthy reasoning-based payloads or alternative tracking methods; this risks the reported divergences (Claude 0/164 vs. GPT-4o-mini 53%) being setup-dependent rather than generalizable.
Authors: We agree that the absence of explicit controls comparing the cryptographic canary to alternative tracking methods (such as natural-language or reasoning-based payloads) represents a gap in the current methodology. The canary was designed as a non-semantic random token to minimize model interpretation as instruction, and its behavior was consistent across all tested surfaces and models. However, to directly address the concern about setup-dependence, we have added a new subsection (3.4) that reports a control experiment on a 100-run subset using semantic tracking phrases. These controls confirm that the stage-level divergences (including Claude's 0/164 write-stage block and GPT-4o-mini's 53% propagation) persist independently of the tracking mechanism. We have also added a limitations paragraph discussing residual risks of canary-specific artifacts. revision: yes
Referee: [Results] Results section (ASR tables): The 0/164 ASR for Claude at the write stage and 0%/100% surface divergence for DeepSeek are load-bearing for the pipeline-architecture reframing, yet the manuscript provides insufficient detail on stage-definition consistency, run allocation per surface, and statistical tests to confirm these are not influenced by canary detectability.
Authors: We have revised the results section and added Appendix B to provide the requested details. Stage definitions are now illustrated with concrete pipeline logs showing exact canary detection points at each node. Run allocation is reported per surface-model-defense combination (approximately 30-40 runs per cell to reach the total of 950). We now include Fisher's exact tests for all key ASR comparisons, with p-values confirming the statistical significance of the Claude write-stage result (p < 0.001) and the DeepSeek surface divergence (p < 0.001). These additions were generated from the existing run logs and do not alter the reported findings. revision: yes
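For reference, a one-sided Fisher's exact test on a 2x2 ASR table can be computed with only the standard library via the hypergeometric distribution. The rebuttal's reported tests may use a two-sided variant; the example pairs the 0/164 result with illustrative counts for a 53% propagation rate, so the specific group-2 numbers are assumptions.

```python
from math import comb

# One-sided Fisher's exact test for a 2x2 table, computed exactly from
# the hypergeometric distribution (no SciPy needed).

def fisher_one_sided(a: int, b: int, c: int, d: int) -> float:
    """P(successes in group 1 >= a), with row and column margins fixed.

    Table:            success  failure
        group 1          a        b
        group 2          c        d
    """
    n = a + b + c + d
    row1, col1 = a + b, a + c
    denom = comb(n, col1)
    k_max = min(row1, col1)
    return sum(comb(row1, k) * comb(n - row1, col1 - k)
               for k in range(a, k_max + 1)) / denom

# Illustrative comparison: 87/164 propagated writes (53%) vs. 0/164.
p = fisher_one_sided(87, 77, 0, 164)
print(p < 0.001)  # True
```

With counts this lopsided the exact p-value is far below any conventional threshold, which is consistent with the significance levels the authors report.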
Circularity Check
Empirical measurement study with no circular derivations or self-referential claims
Full rationale
The paper is an empirical measurement study that introduces a kill-chain canary methodology to track a cryptographic token through four stages (EXPOSED -> PERSISTED -> RELAYED -> EXECUTED) and reports observed attack success rates from 950 experimental runs across models and surfaces. No equations, fitted parameters, predictive derivations, or self-citations are used as load-bearing steps in the provided text. Central claims about write-node placement and defense failures are grounded directly in the experimental outcomes rather than reducing to inputs by construction. The methodology is self-contained against external benchmarks and does not invoke uniqueness theorems or ansatzes from prior author work.
Axiom & Free-Parameter Ledger
axioms (1)
- [domain assumption] The four stages (EXPOSED, PERSISTED, RELAYED, EXECUTED) form a complete and non-overlapping model of prompt injection propagation in multi-agent systems.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction (unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "kill-chain canary methodology that tracks a cryptographic token through four stages (EXPOSED → PERSISTED → RELAYED → EXECUTED)"
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "write-node placement is the highest-leverage safety decision"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- MAGIQ: A Post-Quantum Multi-Agentic AI Governance System with Provable Security
  MAGIQ introduces a post-quantum secure system for policy definition, enforcement, and accountability in multi-agent AI using novel cryptographic protocols and UC framework proofs.
Reference graph
Works this paper leans on
- [1] E. M. Hutchins, M. J. Cloppert, and R. M. Amin. Intelligence-driven computer network defense informed by analysis of adversary campaigns and intrusion kill chains. Proc. 6th Annual International Conference on Information Warfare and Security, 2011.
- [2] K. Greshake et al. Not what you've signed up for: Compromising real-world LLM-integrated applications with indirect prompt injections. AISec @ CCS, 2023.
- [3] F. Perez and I. Ribeiro. Ignore previous prompt: Attack techniques for language models. arXiv:2211.09527, 2022.
- [4] E. Debenedetti et al. AgentDojo: A dynamic environment to evaluate prompt injection attacks and defenses for LLM agents. NeurIPS, 2024.
- [5] Q. Zhan et al. InjecAgent: Benchmarking indirect prompt injections in tool-integrated LLM agents. arXiv:2403.02691, 2024.
- [6]
- [7] R. Shi et al. Zombie agents: Persistent memory poisoning attacks on long-context LLM agents. arXiv:2602.15654, 2025.
- [8] M. Nasr et al. Comprehensive assessment of defense mechanisms against prompt injection attacks. Technical report, Google DeepMind / OpenAI / Anthropic, 2025.
- [9] K. Hines et al. Defending against indirect prompt injection attacks with spotlighting. arXiv:2403.14720, 2024.
- [10] E. B. Wilson. Probable inference, the law of succession, and statistical inference. JASA, 22(158):209–212, 1927.
- [11]
- [12]
- [13]
- [14] Q. Zhang, L. Fu, L. Lian et al. Evaluating privilege usage of agents on real-world tools. arXiv:2603.28166, 2026.
- [15] X. Wang, Y. Zhou, Q. Wang et al. Beyond content safety: Real-time monitoring for reasoning vulnerabilities. arXiv:2603.25412, 2026.
- [16]