CachePrune: Teaching LLMs What Not to Follow via KV-Cache Editing

Julian McAuley; Junda Wu; Lina Yao; Rui Wang; Ruiyi Zhang; Ryan Rossi; Subrata Mitra; Tong Yu; Yu Xia

arXiv: 2504.21228 · v3 · submitted 2025-04-29 · cs.CR · cs.AI


Rui Wang, Junda Wu, Yu Xia, Tong Yu, Ruiyi Zhang, Ryan Rossi, Subrata Mitra, Lina Yao, Julian McAuley


Pith reviewed 2026-05-22 17:30 UTC · model grok-4.3


keywords: indirect prompt injection · KV cache editing · neuron pruning · LLM defense · preferential attribution loss · instruction following · DPO objective


The pith

Pruning neurons tied to instruction following in the KV cache steers LLMs to treat injected commands as data instead of directives.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

CachePrune counters indirect prompt injection attacks on large language models by locating and removing neurons that drive instruction-following behavior during the creation of the prompt's key-value cache. A preferential attribution loss guides the identification of these neurons and connects theoretically to an upper bound on the direct preference optimization objective. The pruning occurs only at the encoding stage of the context, causing the model to interpret extra instructions as ordinary data rather than orders it must obey. This leaves the model's responses to legitimate user instructions unchanged and adds no extra work when the model later generates text.
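The cache edit described above can be sketched as follows. This is a minimal illustration, assuming the KV cache of one layer can be viewed as a list of per-token feature vectors and that pruning a neuron means zeroing one feature dimension at context-token positions only. The names `prune_context_cache`, `context_span`, and `pruned_dims` are hypothetical and do not come from the paper, whose actual masking granularity is not stated in the abstract.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def prune_context_cache(
    cache: Sequence[Vector],
    context_span: Tuple[int, int],
    pruned_dims: Sequence[int],
) -> List[Vector]:
    """Return a copy of a per-token cache with selected dimensions zeroed.

    cache: one vector per prompt token (keys or values of a single layer).
    context_span: half-open [start, end) token range holding untrusted context.
    pruned_dims: feature indices flagged as instruction-following
        (assumed to be given by a separate attribution step).
    """
    start, end = context_span
    drop = set(pruned_dims)
    edited = []
    for pos, vec in enumerate(cache):
        if start <= pos < end:
            # Context token: zero only the flagged dimensions.
            edited.append([0.0 if d in drop else v for d, v in enumerate(vec)])
        else:
            # User-instruction token: leave the cached features untouched.
            edited.append(list(vec))
    return edited


# Toy prompt: token 0 is the user instruction, tokens 1-2 are retrieved context.
toy_cache = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
pruned = prune_context_cache(toy_cache, context_span=(1, 3), pruned_dims=[1])
```

Because the edit is applied once when the context is encoded, the decoding loop reads the edited cache unchanged, which is consistent with the claim of no added cost at generation time.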

Core claim

CachePrune identifies neurons associated with instruction-following during KV cache encoding of the prompt context and prunes them so the model treats the context purely as data. Identification uses a neural attribution mechanism driven by a preferential attribution loss that is shown to upper-bound the direct preference optimization objective. Accuracy of the attribution improves by exploiting an observed triggering effect that marks when instruction-following activates. The resulting edit reduces the success of attacks that embed hidden instructions without changing prompt formatting or adding cost at response time.

What carries the argument

Preferential attribution loss for selecting instruction-following neurons in the KV cache, followed by their pruning at prompt encoding time.
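A rough sketch of how a loss-guided selection step might look. The abstract does not give the attribution formula, so this assumes a generic first-order (activation times gradient) attribution averaged over a few samples, not the paper's exact mechanism. The function names and the toy numbers are hypothetical.

```python
from typing import List, Sequence


def attribution_scores(
    activations: Sequence[Sequence[float]],
    gradients: Sequence[Sequence[float]],
) -> List[float]:
    """Average activation * gradient per neuron over a batch of samples.

    Zeroing activation a changes the loss by roughly -a * g to first order,
    so a large positive score marks a neuron whose removal lowers the loss.
    """
    n_samples = len(activations)
    n_neurons = len(activations[0])
    scores = [0.0] * n_neurons
    for acts, grads in zip(activations, gradients):
        for j in range(n_neurons):
            scores[j] += acts[j] * grads[j] / n_samples
    return scores


def select_neurons(scores: Sequence[float], k: int) -> List[int]:
    """Indices of the k highest-scoring neurons, keeping positive scores only."""
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return [j for j in ranked[:k] if scores[j] > 0.0]


# Hypothetical numbers: two samples, four neurons.
acts = [[1.0, 0.5, -2.0, 0.0], [1.0, 1.5, -1.0, 3.0]]
grads = [[2.0, -1.0, -1.0, 4.0], [4.0, -1.0, -3.0, 0.0]]
scores = attribution_scores(acts, grads)
chosen = select_neurons(scores, k=2)
```

In this sketch the gradients would come from whatever loss prefers the response to the user instruction over the response to the injected one; the selected indices are the dimensions to zero at encoding time.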

If this is right

Attack success rate on indirect prompt injections falls while responses to direct user instructions remain reliable.
The defense requires no changes to how prompts are written or formatted.
Response generation runs at normal speed with no added test-time cost.
The attribution step gains precision from the triggering effect that activates instruction following.
The approach connects the attribution loss to an upper bound on the direct preference optimization objective.
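For reference, the standard DPO objective that the last point refers to, in its usual notation: policy \pi_\theta, reference policy \pi_{\mathrm{ref}}, temperature \beta, and a preferred and dispreferred response y_w and y_l for prompt x. The paper states that its attribution loss is connected to an upper bound of this quantity; the form of that bound is not given in the abstract and is not reproduced here.

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right) \right]
```

One plausible reading in the injection setting, stated here as an assumption, is that y_w is the response to the user's instruction and y_l is the response to the injected one.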

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same pruning idea could apply to other cases where models must ignore certain parts of their input context.
Applying the edit only at selected layers might let users adjust how strongly the defense acts.
The work shows that instruction-following signals can be localized enough in the KV cache for targeted removal.
Combining this cache edit with other efficiency techniques for KV caches might improve both security and speed.

Load-bearing premise

The neurons picked by the preferential attribution loss are the direct cause of the model following instructions in the context, and removing them leaves the rest of the model's behavior intact.

What would settle it

Measure attack success rate on a fixed set of indirect prompt injection examples before and after pruning; if the rate stays high or normal instruction following on clean prompts drops sharply, the central claim fails.
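The test above can be sketched as a small before/after comparison. This assumes each example has already been judged (attack followed or not, clean task answered correctly or not); the function names and both thresholds are hypothetical choices for illustration, not values from the paper.

```python
from typing import Sequence


def rate(flags: Sequence[bool]) -> float:
    """Fraction of True entries; 0.0 for an empty sequence."""
    return sum(flags) / len(flags) if flags else 0.0


def claim_survives(
    attacked_before: Sequence[bool],
    attacked_after: Sequence[bool],
    clean_ok_before: Sequence[bool],
    clean_ok_after: Sequence[bool],
    min_asr_drop: float = 0.2,       # hypothetical threshold
    max_utility_drop: float = 0.05,  # hypothetical threshold
) -> bool:
    """True if attack success fell enough and clean accuracy held up.

    attacked_*: per-example flag, True when the model obeyed the injection.
    clean_ok_*: per-example flag, True when a clean prompt was handled correctly.
    """
    asr_drop = rate(attacked_before) - rate(attacked_after)
    utility_drop = rate(clean_ok_before) - rate(clean_ok_after)
    return asr_drop >= min_asr_drop and utility_drop <= max_utility_drop


# Toy judgments on four injected and four clean prompts.
before_attack = [True, True, True, False]
after_attack = [False, False, True, False]
clean_before = [True, True, True, True]
clean_after = [True, True, True, True]
verdict = claim_survives(before_attack, after_attack, clean_before, clean_after)
```

Keeping the example set fixed across both runs matters: the comparison is only meaningful if the same injected prompts are scored before and after pruning.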

Original abstract

Large Language Models (LLMs) are susceptible to indirect prompt injection attacks, where the model inadvertently responds to instructions injected into the prompt context. This vulnerability stems from LLMs' inability to distinguish between data and instructions within a prompt. We propose CachePrune, which defends against this attack by identifying and pruning neurons associated with instruction-following during KV cache encoding of the prompt context. The pruning steers the LLM toward interpreting the context purely as data rather than as instructions to follow. To identify these neurons, we introduce a neural attribution mechanism guided by a preferential attribution loss, and theoretically connect this loss to an upper bound of the Direct Preference Optimization (DPO) objective. Further, we improve the fidelity of neural attribution by leveraging an observed triggering effect in instruction-following. Our approach does not interfere with prompt formatting or incur test-time overhead during response generation. Experiments show that CachePrune significantly reduces the attack success rate while preserving the LLM's ability to follow user instructions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes CachePrune as a defense against indirect prompt injection attacks in LLMs. It identifies and prunes neurons in the KV cache associated with instruction-following behavior during prompt context encoding, using a preferential attribution loss that is theoretically connected to an upper bound of the Direct Preference Optimization (DPO) objective, along with an observed triggering effect to improve attribution. The pruning steers the model to treat injected context as data rather than instructions. The method claims no changes to prompt formatting and no test-time overhead. Experiments are stated to show significant reduction in attack success rate while preserving the LLM's ability to follow user instructions.

Significance. If the empirical results and theoretical linkage hold under full scrutiny, CachePrune could offer an efficient way to mitigate prompt injection vulnerabilities through targeted KV-cache editing, with no added overhead during response generation. This would be relevant for secure LLM deployment in untrusted-data scenarios and could inform broader work on neuron-level attribution for alignment and safety.

major comments (2)

[Abstract] Abstract: the preferential attribution loss is presented as theoretically connected to an upper bound of the DPO objective, but no derivation, equations, or proof outline is supplied. Without these steps it is impossible to determine whether the connection is substantive or whether the loss reduces to a fitted quantity by construction.
[Abstract] Abstract: experiments are described as demonstrating that CachePrune significantly reduces attack success rate while preserving instruction-following ability, yet the abstract supplies no quantitative results, baselines, datasets, metrics, error bars, or statistical details. This absence prevents evaluation of the central empirical claim.

minor comments (1)

[Abstract] Abstract: the 'observed triggering effect' used to improve attribution fidelity is referenced but not characterized even briefly, which reduces clarity for readers encountering the method for the first time.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major comment below and indicate the corresponding revisions.


Referee: [Abstract] Abstract: the preferential attribution loss is presented as theoretically connected to an upper bound of the DPO objective, but no derivation, equations, or proof outline is supplied. Without these steps it is impossible to determine whether the connection is substantive or whether the loss reduces to a fitted quantity by construction.

Authors: The abstract summarizes the key idea at a high level due to length constraints. The full derivation, equations, and proof outline establishing that the preferential attribution loss is connected to an upper bound of the DPO objective appear in Section 3 of the manuscript. The linkage is substantive: it arises from the preferential structure of the loss rather than being tautological. To improve accessibility, we will revise the abstract to include a brief reference to this theoretical result and the relevant section.
Revision: partial

Referee: [Abstract] Abstract: experiments are described as demonstrating that CachePrune significantly reduces attack success rate while preserving instruction-following ability, yet the abstract supplies no quantitative results, baselines, datasets, metrics, error bars, or statistical details. This absence prevents evaluation of the central empirical claim.

Authors: Abstracts conventionally omit specific numbers to remain concise. The full experimental evaluation, including quantitative attack success rate reductions, baselines, datasets, metrics, error bars, and statistical details, is reported in Sections 4 and 5. We will revise the abstract to incorporate a small number of key quantitative highlights (e.g., relative ASR reduction on the primary benchmark while retaining instruction-following accuracy) to strengthen the empirical claim.
Revision: yes

Circularity Check

0 steps flagged

No significant circularity detected from available text


The abstract introduces a preferential attribution loss and states it is theoretically connected to a DPO upper bound, but provides no equations, derivation steps, or self-citations that would allow exhibition of any reduction to inputs by construction. No fitted parameters are renamed as predictions, no uniqueness theorems are invoked, and no ansatz or renaming patterns appear. The central defense claim rests on an empirical pruning mechanism whose validity is presented as independent of the loss definition itself; with only the abstract available, the derivation chain cannot be shown to collapse and is therefore treated as self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields limited visibility into parameters or assumptions; the preferential attribution loss and triggering effect are treated as observed or derived elements whose independence cannot be verified without the full text.

pith-pipeline@v0.9.0 · 5691 in / 1157 out tokens · 28154 ms · 2026-05-22T17:30:23.391675+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: We propose CachePrune that defends against this attack by identifying and pruning task-triggering neurons from the KV cache of the input prompt context... guided by a preferential attribution loss, which enables effective attribution with only a few samples

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: the pruning steers the LLM toward interpreting the context purely as data rather than as instructions to follow

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Conjunctive Prompt Attacks in Multi-Agent LLM Systems
cs.MA · 2026-04 · unverdicted · novelty 7.0

Conjunctive prompt attacks split adversarial elements across agents and routing paths in multi-agent LLM systems, evading isolated defenses and succeeding through topology-aware optimization.

MIPIAD: Multilingual Indirect Prompt Injection Attack Defense with Qwen -- TF-IDF Hybrid and Meta-Ensemble Learning
cs.CL · 2026-05 · unverdicted · novelty 4.0

MIPIAD reports a hybrid Qwen-TF-IDF ensemble defense that reaches F1 0.9205 and reduces the English-Bangla performance gap on a 1.43-million-sample synthetic benchmark derived from BIPIA templates.