pith. machine review for the scientific record.

arxiv: 2604.03270 · v1 · submitted 2026-03-22 · 💻 cs.CL

Recognition: no theorem link

Knowledge Packs: Zero-Token Knowledge Delivery via KV Cache Injection

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 07:07 UTC · model grok-4.3

classification 💻 cs.CL
keywords knowledge packs · KV cache injection · zero-token RAG · causal transformers · value vector steering · token efficiency · behavioral control

The pith

Pre-computed KV caches from knowledge text match full-prompt results exactly for causal transformers, delivering facts at zero token cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that in causal language models the key-value cache generated by a standalone forward pass over knowledge text F is identical to the cache that would arise in a single joint pass over F concatenated with a query. This identity follows directly from the causal attention mask, which blocks any backward influence from the query onto the knowledge positions. When the chat template is applied consistently, the approach produces zero performance difference from standard RAG across 700 questions while cutting token usage by up to 95 percent on the tested 8B models. The same KV interface additionally supports behavioral steering by adding scaled contrastive deltas to the cached value vectors in mid layers, an operation that can run simultaneously with knowledge delivery without interference.

Core claim

For causal transformers the KV cache from a forward pass solely on text F matches exactly the cache that a joint forward pass on F concatenated with query q would produce. This holds because the causal mask prevents the knowledge positions from attending forward to the later query tokens, so q cannot alter the cached states for F. The equivalence is exact when formatting is correct, enabling zero-token knowledge delivery and also behavioral steering through value-vector arithmetic that leaves keys untouched.
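A compact way to write that argument down (the notation here is introduced for illustration and is not lifted from the paper):

```latex
% Sketch of the causal-mask argument; the notation is ours, not the paper's.
% Let the joint input be x = F \,\Vert\, q with |F| = n. Causal attention gives
\[
  \alpha^{(\ell)}_{ij} = 0 \ \text{for } j > i
  \;\Longrightarrow\;
  h^{(\ell)}_i \ \text{depends only on } x_{1:i},
\]
% by induction over layers \ell, so for every prefix position i \le n
\[
  k^{(\ell)}_i = W^{(\ell)}_K\, h^{(\ell-1)}_i,
  \qquad
  v^{(\ell)}_i = W^{(\ell)}_V\, h^{(\ell-1)}_i
\]
% are identical whether the forward pass runs on F alone or on F \Vert q.
```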

What carries the argument

KV cache injection: pre-computed key-value states from a knowledge-only forward pass are inserted directly into the model's running cache for subsequent query processing.
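A minimal sketch of those mechanics with Hugging Face transformers, offered as a hedged illustration rather than the paper's released code; the model name, knowledge text, and query are placeholders, and chat-template handling is deferred to the premise discussed further down:

```python
# Hedged sketch of KV cache injection; names and texts are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-8B"  # any causal LM with the standard cache API
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto").eval()

knowledge = "Context: The Eiffel Tower is 330 metres tall."  # text F
query = " How tall is the Eiffel Tower? Answer:"             # query q

with torch.no_grad():
    # 1. Offline: build the Knowledge Pack with a forward pass over F alone.
    f_ids = tok(knowledge, return_tensors="pt").input_ids
    pack = model(input_ids=f_ids, use_cache=True).past_key_values

    # 2. Online: feed only q while injecting the cached states. Under the
    #    causal mask these are exactly the states a joint pass over F + q
    #    would have produced for the F positions.
    q_ids = tok(query, return_tensors="pt", add_special_tokens=False).input_ids
    out = model(input_ids=q_ids, past_key_values=pack, use_cache=True)

next_token_id = out.logits[0, -1].argmax().item()
print(tok.decode(next_token_id))
```

Because the cache already encodes F, the query-time pass touches only the q tokens, which is where the claimed token savings come from.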

If this is right

  • Up to 95 percent reduction in prompt tokens for knowledge-intensive queries.
  • Zero performance divergence from standard RAG across 700 questions on Qwen3-8B and Llama-3.1-8B.
  • Behavioral steering via contrastive deltas on mid-layer value vectors is possible without retraining (a hedged sketch of one plausible implementation follows this list).
  • Knowledge delivery and steering can be combined at scaling factors up to 0.7 with no mutual interference.
  • Independent steering directions remain nearly orthogonal and compose additively.
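The steering bullet above comes without pseudocode in the review, so here is one plausible, heavily hedged reading of value-delta steering on a cached pack: contrastive prompts are run through the same knowledge-only forward pass, their value states are averaged over positions, and the scaled difference is added to the pack's mid-layer values while the keys stay untouched. The mean-over-positions delta, the prompt pair, and the broadcast onto all knowledge positions are assumptions, not the paper's stated recipe.

```python
# Hedged sketch of value-vector steering on a cached Knowledge Pack.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-8B"  # placeholder
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto").eval()

def kv_cache(text):
    """Knowledge-only forward pass; returns the per-layer (key, value) cache."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        return model(input_ids=ids, use_cache=True).past_key_values

pack = kv_cache("Context: The Eiffel Tower is 330 metres tall.")  # text F
pos = kv_cache("Respond in a cheerful, enthusiastic tone.")       # contrast +
neg = kv_cache("Respond in a flat, neutral tone.")                # contrast -

alpha = 0.5                                            # review: no interference up to 0.7
n_layers = len(pack)
mid_layers = range(n_layers // 3, 2 * n_layers // 3)   # the 33-66% band

for layer in mid_layers:
    _, v_pack = pack[layer]   # values: (batch, kv_heads, seq, head_dim)
    _, v_pos = pos[layer]
    _, v_neg = neg[layer]
    # Contrastive delta, averaged over positions so it broadcasts onto
    # every knowledge position; keys are deliberately left untouched.
    delta = v_pos.mean(dim=2, keepdim=True) - v_neg.mean(dim=2, keepdim=True)
    v_pack += alpha * delta   # in-place edit of the cached value tensor

# `pack` now carries both channels; inject it exactly as in the sketch above.
```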

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Dynamic swapping of pre-computed knowledge modules becomes feasible in production without altering the user prompt.
  • The value-delta steering channel could be applied to other transformer variants that preserve the causal mask structure.
  • Pre-computing packs for recurring knowledge domains may reduce both latency and context-window pressure in real-time systems.

Load-bearing premise

The input must be formatted with exactly the same chat template that was used when the knowledge cache was pre-computed.
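To make that premise concrete, a hedged sketch of sharing one chat-template rendering between pre-compute time and query time; the role layout, split point, and boundary check are illustrative assumptions rather than the paper's procedure:

```python
# Hedged sketch of the formatting premise: the pack prefix and the query
# suffix must come from one and the same chat-template rendering.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")  # placeholder

knowledge = "Context: The Eiffel Tower is 330 metres tall."
query = "How tall is the Eiffel Tower?"

# Render the full conversation once with the model's own chat template.
full_text = tok.apply_chat_template(
    [{"role": "user", "content": knowledge + "\n\n" + query}],
    tokenize=False,
    add_generation_prompt=True,
)

# Everything before the query is the pack prefix; the rest is fed at query time.
split = full_text.index(query)
prefix_ids = tok(full_text[:split], add_special_tokens=False).input_ids
suffix_ids = tok(full_text[split:], add_special_tokens=False).input_ids

# Guard: the split must fall on a token boundary, otherwise the pre-computed
# cache no longer matches what a joint pass over the full prompt would see
# (the formatting fragility behind the 6-7pp degradation quoted below).
assert prefix_ids + suffix_ids == tok(full_text, add_special_tokens=False).input_ids
```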

What would settle it

Compute the KV cache once from a standalone pass on F and once from a joint pass on F plus q; any difference larger than floating-point error falsifies the claimed exact equivalence.
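A sketch of that check, assuming the standard Hugging Face cache layout (batch, kv_heads, seq, head_dim) and a placeholder model; the tolerance would need to match the dtype and attention kernel actually used:

```python
# Hedged sketch of the falsification test: the prefix slice of a joint-pass
# cache should match the standalone pass on F up to floating-point noise.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-8B"  # placeholder
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto").eval()

f_ids = tok("Context: The Eiffel Tower is 330 metres tall.", return_tensors="pt").input_ids
q_ids = tok(" How tall is it?", return_tensors="pt", add_special_tokens=False).input_ids
joint_ids = torch.cat([f_ids, q_ids], dim=1)
n = f_ids.shape[1]

with torch.no_grad():
    solo = model(input_ids=f_ids, use_cache=True).past_key_values
    joint = model(input_ids=joint_ids, use_cache=True).past_key_values

for layer in range(len(solo)):
    k_solo, v_solo = solo[layer]
    k_joint, v_joint = joint[layer]
    # Compare only the first n (knowledge) positions of the joint-pass cache.
    assert torch.allclose(k_solo, k_joint[:, :, :n], atol=1e-4)
    assert torch.allclose(v_solo, v_joint[:, :, :n], atol=1e-4)
print("prefix KV states agree within floating-point tolerance")
```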

Figures

Figures reproduced from arXiv: 2604.03270 by Andrey Pustovit.

Figure 1. Token cost scaling with retrieval steps. RAG prompt cost grows linearly (…).
Figure 2. Layer-selective value steering on two models. Mid layers (33–66%, green highlight) (…).
Figure 3. Dual-channel trade-off (Qwen3-8B, N=200). At α≤0.7 (green zone), factual accuracy is fully preserved while behavioral steering is measurable. Higher α trades accuracy for stronger steering.
read the original abstract

RAG wastes tokens. We propose Knowledge Packs: pre-computed KV caches that deliver the same knowledge at zero token cost. For causal transformers, the KV cache from a forward pass on text F is identical to what a joint pass on F+q would produce - this follows directly from the causal mask. The equivalence is exact but fragile: wrong chat template formatting causes 6-7pp degradation, which we believe explains prior claims of KV outperforming RAG. With correct formatting: zero divergences across 700 questions on Qwen3-8B and Llama-3.1-8B, up to 95% token savings. The KV interface also enables behavioral steering that RAG cannot do. Because RoPE rotates keys but leaves values untouched, contrastive deltas on cached values can nudge model behavior while key arithmetic destroys coherence. The effect sits in mid-layer values (33-66%), independent directions are nearly orthogonal (cos~0) and compose, and both channels - knowledge and steering - run simultaneously at alpha<=0.7 without interference. No training, no weight modification.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript proposes Knowledge Packs as pre-computed KV caches that deliver knowledge from a prefix text F at zero token cost during inference on a query q. The central claim is that, for causal transformers, the KV states produced by a standalone forward pass on F are identical to those arising in a joint forward pass on F+q; this follows directly from the causal attention mask. The paper reports exact numerical agreement (zero divergences) across 700 questions on Qwen3-8B and Llama-3.1-8B when correct chat templates are used, yielding up to 95% token savings. It further shows that contrastive deltas on cached value vectors enable behavioral steering in mid-layers (33-66%), with independent directions nearly orthogonal and composable, and that knowledge injection and steering can run simultaneously at alpha ≤ 0.7 without interference. No training or weight modification is required.

Significance. If the equivalence and empirical results hold, the work provides a practical, parameter-free route to token-efficient knowledge delivery that could substantially reduce context costs in retrieval-augmented generation and similar settings. The zero-divergence verification across two models and the analysis of value-based steering (including orthogonality and simultaneous operation) constitute clear strengths. The approach also opens a new interface for steering that is unavailable to standard RAG.

minor comments (3)
  1. The abstract states 'up to 95% token savings' and '6-7pp degradation' under incorrect formatting but provides no dataset statistics, average savings, or variance; these details should be added to the results section for reproducibility.
  2. The description of steering deltas (mid-layer values 33-66%, cos~0 orthogonality) would benefit from an explicit equation or pseudocode showing how the contrastive deltas are computed from the cached values.
  3. A short related-work paragraph comparing Knowledge Packs to prior KV-cache reuse and prefix-tuning methods would help readers situate the contribution.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, recognition of the zero-divergence results, steering analysis, and the minor-revision recommendation. The report accurately captures the core claims regarding causal-mask equivalence, token savings, and value-based steering. No specific major comments were listed in the report, so we provide no point-by-point rebuttals below. We are prepared to incorporate any minor clarifications or formatting adjustments requested by the editor.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper's core claim, that the KV cache computed on prefix F in isolation is identical to the KV states arising in a joint forward pass on F+q, follows directly from the standard causal attention mask property, which ensures later tokens cannot influence earlier KV computations. This equivalence is presented as a mathematical consequence of the mask and is supported by exact numerical verification (zero divergences across 700 questions on two models). No parameters are fitted to produce the target behavior, no self-referential definitions equate inputs to outputs, and no load-bearing self-citations or uniqueness theorems are invoked. The steering deltas are derived from observed value differences in cached states rather than constructed to match desired outcomes. The derivation chain is therefore self-contained and is checked against external benchmarks rather than against itself.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The work rests on the standard causal attention mask property and the observation that RoPE affects keys but not values. No free parameters are introduced. The only invented entity is the named technique itself.

axioms (1)
  • standard math Causal attention mask ensures that KV states for prefix F are identical whether or not a suffix query follows.
    Direct consequence of the standard causal mask in transformer attention; invoked in the opening claim.
invented entities (1)
  • Knowledge Pack no independent evidence
    purpose: Named container for pre-computed KV cache used for zero-token injection.
    Terminology for the technique; no new physical or mathematical entity.

pith-pipeline@v0.9.0 · 5482 in / 1256 out tokens · 31971 ms · 2026-05-15T07:07:13.420354+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

7 extracted references · 7 canonical work pages · 6 internal anchors

  1. [1]

    The Llama 3 Herd of Models

    Abhimanyu Dubey et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783.

  2. [2]

    SnapKV: LLM Knows What You are Looking for Before Generation

    Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. SnapKV: LLM knows what you are looking for before generation. arXiv preprint arXiv:2404.14469, 2024.

  3. [3]

    Qwen3 Technical Report

    Qwen Team. Qwen3 technical report. arXiv preprint arXiv:2505.09388.

  4. [4]

    Agent Memory Below the Prompt: Persistent Q4 KV Cache

    Mikhail Shkolnikov. Agent memory below the prompt: Persistent Q4 KV cache. arXiv preprint arXiv:2603.04428.

  5. [5]

    Steering Language Models With Activation Engineering

    Alexander Matt Turner, Lisa Thiergart, David Udell, Gavin Leech, Ulisse Mini, and Monte Pelrine. Activation addition: Steering language models without optimization. arXiv preprint arXiv:2308.10248.

  6. [6]

    C-Pack: Packed Resources For General Chinese Embeddings

    Shitao Xiao, Zheng Liu, Peitian Zhang, and Niklas Muennighoff. C-Pack: Packed resources for general Chinese embeddings. arXiv preprint arXiv:2309.07597.

  7. [7]

    SGLang: Efficient Execution of Structured Language Model Programs

    Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Shuo Cheng, Jeff Huang, Siyuan Zhuang, Yinmin Shi, and Ion Stoica. SGLang: Efficient execution of structured language model programs. arXiv preprint arXiv:2312.07104.