CachePrune: Teaching LLMs What Not to Follow via KV-Cache Editing
Pith reviewed 2026-05-22 17:30 UTC · model grok-4.3
The pith
Pruning neurons tied to instruction following in the KV cache steers LLMs to treat injected commands as data instead of directives.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CachePrune identifies neurons associated with instruction following while the prompt context is encoded into the KV cache, then prunes them so the model treats that context purely as data. Identification relies on a neural attribution mechanism driven by a preferential attribution loss, which the paper connects theoretically to an upper bound of the Direct Preference Optimization (DPO) objective. Attribution fidelity is further improved by exploiting an observed triggering effect in instruction following. The resulting edit lowers the success rate of attacks that embed hidden instructions in the context, without changing prompt formatting or adding overhead during response generation.
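To make the attribution step concrete, here is a minimal sketch of a generic first-order attribution score (activation times gradient). This is a common technique swapped in for illustration, not the paper's preferential attribution loss, whose exact form is not given on this page. The names `attribution_scores`, `numerical_gradient`, and `toy_preference_loss` are hypothetical, and the toy loss is a linear stand-in for a loss that is low when the model prefers the "treat context as data" response.

```python
def numerical_gradient(loss_fn, activations, eps=1e-6):
    """Central finite-difference gradient of loss_fn at activations."""
    grads = []
    for i in range(len(activations)):
        up = list(activations)
        down = list(activations)
        up[i] += eps
        down[i] -= eps
        grads.append((loss_fn(up) - loss_fn(down)) / (2 * eps))
    return grads


def attribution_scores(loss_fn, activations):
    """Score each feature by activation * gradient.

    To first order, zeroing feature i changes the loss by -score_i,
    so a large positive score marks a feature whose removal lowers the loss.
    """
    grads = numerical_gradient(loss_fn, activations)
    return [a * g for a, g in zip(activations, grads)]


def toy_preference_loss(activations):
    # Hypothetical stand-in: a fixed linear readout of three cached features.
    weights = [2.0, -1.0, 0.5]
    return sum(w * a for w, a in zip(weights, activations))


acts = [1.0, 3.0, -2.0]
scores = attribution_scores(toy_preference_loss, acts)
best = max(range(len(scores)), key=lambda i: scores[i])
print("scores:", [round(s, 4) for s in scores], "prune candidate:", best)
```

In a real model the gradient would come from backpropagation through the cached keys and values rather than finite differences; the finite-difference version is used here only to keep the sketch dependency-free.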
What carries the argument
Preferential attribution loss for selecting instruction-following neurons in the KV cache, followed by their pruning at prompt encoding time.
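The pruning half of this pipeline can be sketched as a mask applied to the cached features of the context span only. This is an illustration under stated assumptions: plain Python lists stand in for one layer's key or value tensor, and `prune_context_cache`, `context_span`, and `prune_fraction` are hypothetical names and parameters, not taken from the paper.

```python
def prune_context_cache(cache, scores, context_span, prune_fraction=0.1):
    """Zero the highest-scoring features inside the context span of a cache.

    cache:        list of rows, one per token position, each a list of floats.
    scores:       attribution scores for the context rows only (same width).
    context_span: (start, stop) token positions holding untrusted context.
    Returns a pruned copy and the set of (row, column) offsets that were zeroed.
    """
    start, stop = context_span
    if len(scores) != stop - start:
        raise ValueError("scores must cover exactly the context span")

    # Rank every (position, feature) pair in the span by its score.
    ranked = sorted(
        ((scores[i][j], i, j)
         for i in range(len(scores))
         for j in range(len(scores[i]))),
        reverse=True,
    )
    total = sum(len(row) for row in scores)
    k = int(round(prune_fraction * total))
    removed = {(i, j) for _, i, j in ranked[:k]}

    # Copy so the original cache is untouched; rows outside the span are kept.
    pruned = [list(row) for row in cache]
    for i, j in removed:
        pruned[start + i][j] = 0.0
    return pruned, removed


cache = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0], [10.0, 11.0, 12.0]]
scores = [[0.1, 0.9, 0.2], [0.8, 0.0, 0.3]]  # rows 1 and 2 are the context
pruned, removed = prune_context_cache(cache, scores, (1, 3), prune_fraction=1 / 3)
print(pruned)
```

Because the mask is applied once when the context is encoded, decoding then reads the edited cache as usual, which is consistent with the claim of no added cost during response generation.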
If this is right
- Attack success rate on indirect prompt injections falls while responses to direct user instructions remain reliable.
- The defense requires no changes to how prompts are written or formatted.
- Response generation runs at normal speed with no added test-time cost.
- The attribution step gains precision from the triggering effect that activates instruction following.
- The approach connects the attribution loss to an upper bound on the direct preference optimization objective.
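For reference, the standard DPO objective that the last point refers to is shown below. The paper's own bound and its derivation are not reproduced on this page, so only the well-known objective is given. Reading $y_w$ as the response that follows the user's instruction and $y_l$ as the response that follows the injected one is an assumption made here for illustration.

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]
```

Here $\pi_\theta$ is the policy being optimized, $\pi_{\mathrm{ref}}$ is the frozen reference policy, $\sigma$ is the logistic function, and $\beta > 0$ controls how far the policy may drift from the reference.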
Where Pith is reading between the lines
- The same pruning idea could apply to other cases where models must ignore certain parts of their input context.
- Applying the edit only at selected layers might let users adjust how strongly the defense acts.
- The work shows that instruction-following signals can be localized enough in the KV cache for targeted removal.
- Combining this cache edit with other efficiency techniques for KV caches might improve both security and speed.
Load-bearing premise
The neurons picked by the preferential attribution loss are the direct cause of the model following instructions in the context, and removing them leaves the rest of the model's behavior intact.
What would settle it
Measure attack success rate on a fixed set of indirect prompt injection examples before and after pruning; if the rate stays high or normal instruction following on clean prompts drops sharply, the central claim fails.
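The test described above could be scripted roughly as follows. This is a sketch under stated assumptions: substring matching on a marker is a crude proxy for "the model obeyed the injection", and the thresholds `min_relative_asr_drop` and `max_utility_drop` are hypothetical values chosen for illustration, not figures from the paper.

```python
def attack_success_rate(responses, injected_markers):
    """Fraction of responses containing the marker of the injected instruction."""
    if not responses or len(responses) != len(injected_markers):
        raise ValueError("need one marker per response, and at least one response")
    hits = sum(
        marker.lower() in response.lower()
        for response, marker in zip(responses, injected_markers)
    )
    return hits / len(responses)


def claim_survives(asr_before, asr_after, utility_before, utility_after,
                   min_relative_asr_drop=0.5, max_utility_drop=0.05):
    """True if pruning cut the attack rate enough without hurting clean prompts."""
    if asr_before <= 0:
        raise ValueError("baseline attack never succeeds; the test is uninformative")
    relative_drop = (asr_before - asr_after) / asr_before
    utility_drop = utility_before - utility_after
    return relative_drop >= min_relative_asr_drop and utility_drop <= max_utility_drop


markers = ["evil.example", "evil.example", "pwned", "pwned"]
before = ["Visit evil.example now", "The report covers Q3 revenue", "PWNED", "ok: pwned"]
after = ["The page is about gardening", "The report covers Q3 revenue",
         "The email asks for a reply", "ok: pwned"]

asr_before = attack_success_rate(before, markers)
asr_after = attack_success_rate(after, markers)
print(asr_before, asr_after, claim_survives(asr_before, asr_after, 0.90, 0.88))
```

The same fixed example set should be used for both runs, and the utility numbers should come from clean prompts with no injection, so that the two failure modes named above are measured separately.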
Original abstract
Large Language Models (LLMs) are susceptible to indirect prompt injection attacks, where the model inadvertently responds to instructions injected into the prompt context. This vulnerability stems from LLMs' inability to distinguish between data and instructions within a prompt. We propose CachePrune, which defends against this attack by identifying and pruning neurons associated with instruction-following during KV cache encoding of the prompt context. The pruning steers the LLM toward interpreting the context purely as data rather than as instructions to follow. To identify these neurons, we introduce a neural attribution mechanism guided by a preferential attribution loss, and theoretically connect this loss to an upper bound of the Direct Preference Optimization (DPO) objective. Further, we improve the fidelity of neural attribution by leveraging an observed triggering effect in instruction-following. Our approach does not interfere with prompt formatting or incur test-time overhead during response generation. Experiments show that CachePrune significantly reduces the attack success rate while preserving the LLM's ability to follow user instructions.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes CachePrune as a defense against indirect prompt injection attacks in LLMs. It identifies and prunes neurons in the KV cache associated with instruction-following behavior during prompt context encoding, using a preferential attribution loss that is theoretically connected to an upper bound of the Direct Preference Optimization (DPO) objective, along with an observed triggering effect to improve attribution. The pruning steers the model to treat injected context as data rather than instructions. The method claims no changes to prompt formatting and no test-time overhead. Experiments are stated to show significant reduction in attack success rate while preserving the LLM's ability to follow user instructions.
Significance. If the empirical results and theoretical linkage hold under full scrutiny, CachePrune could offer an efficient way to mitigate prompt injection via targeted KV-cache editing, with no added overhead during response generation. This would be relevant for secure LLM deployment in untrusted-data scenarios and could inform broader work on neuron-level attribution for alignment and safety.
Major comments (2)
- [Abstract] The preferential attribution loss is presented as theoretically connected to an upper bound of the DPO objective, but no derivation, equations, or proof outline is supplied. Without these steps it is impossible to determine whether the connection is substantive or whether the loss reduces to a fitted quantity by construction.
- [Abstract] Experiments are described as demonstrating that CachePrune significantly reduces attack success rate while preserving instruction-following ability, yet the abstract supplies no quantitative results, baselines, datasets, metrics, error bars, or statistical details. This absence prevents evaluation of the central empirical claim.
Minor comments (1)
- [Abstract] The 'observed triggering effect' used to improve attribution fidelity is referenced but not characterized even briefly, which reduces clarity for readers encountering the method for the first time.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our manuscript. We address each major comment below and indicate the corresponding revisions.
Point-by-point responses
- Referee: [Abstract] The preferential attribution loss is presented as theoretically connected to an upper bound of the DPO objective, but no derivation, equations, or proof outline is supplied. Without these steps it is impossible to determine whether the connection is substantive or whether the loss reduces to a fitted quantity by construction.
  Authors: The abstract summarizes the key idea at a high level due to length constraints. The full derivation, equations, and proof outline establishing that the preferential attribution loss is connected to an upper bound of the DPO objective appear in Section 3 of the manuscript. The linkage is substantive: it arises from the preferential structure of the loss rather than being tautological. To improve accessibility, we will revise the abstract to include a brief reference to this theoretical result and the relevant section.
  Revision: partial
- Referee: [Abstract] Experiments are described as demonstrating that CachePrune significantly reduces attack success rate while preserving instruction-following ability, yet the abstract supplies no quantitative results, baselines, datasets, metrics, error bars, or statistical details. This absence prevents evaluation of the central empirical claim.
  Authors: Abstracts conventionally omit specific numbers to remain concise. The full experimental evaluation, including quantitative attack success rate reductions, baselines, datasets, metrics, error bars, and statistical details, is reported in Sections 4 and 5. We will revise the abstract to incorporate a small number of key quantitative highlights (e.g., relative ASR reduction on the primary benchmark while retaining instruction-following accuracy) to strengthen the empirical claim.
  Revision: yes
Circularity Check
No significant circularity detected from available text
The abstract introduces a preferential attribution loss and states it is theoretically connected to a DPO upper bound, but provides no equations, derivation steps, or self-citations that would allow exhibition of any reduction to inputs by construction. No fitted parameters are renamed as predictions, no uniqueness theorems are invoked, and no ansatz or renaming patterns appear. The central defense claim rests on an empirical pruning mechanism whose validity is presented as independent of the loss definition itself; with only the abstract available, the derivation chain cannot be shown to collapse and is therefore treated as self-contained.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "We propose CachePrune that defends against this attack by identifying and pruning task-triggering neurons from the KV cache of the input prompt context... guided by a preferential attribution loss, which enables effective attribution with only a few samples"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "the pruning steers the LLM toward interpreting the context purely as data rather than as instructions to follow"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
- Conjunctive Prompt Attacks in Multi-Agent LLM Systems
  Conjunctive prompt attacks split adversarial elements across agents and routing paths in multi-agent LLM systems, evading isolated defenses and succeeding through topology-aware optimization.
- MIPIAD: Multilingual Indirect Prompt Injection Attack Defense with Qwen -- TF-IDF Hybrid and Meta-Ensemble Learning
  MIPIAD reports a hybrid Qwen-TF-IDF ensemble defense that reaches F1 0.9205 and reduces the English-Bangla performance gap on a 1.43-million-sample synthetic benchmark derived from BIPIA templates.