pith. sign in

arxiv: 2606.08635 · v1 · pith:D43WT376new · submitted 2026-06-07 · 💻 cs.LG · cs.DC

SpectrumKV: Per-Token Mixed-Precision KV Cache Transfer for Prefill-Decode Disaggregated LLM Serving

Pith reviewed 2026-06-27 18:49 UTC · model grok-4.3

classification 💻 cs.LG cs.DC
keywords KV cachemixed precisionLLM servingprefill decode disaggregationquantizationtoken importancenetwork transferperplexity
0
0 comments X

The pith

SpectrumKV assigns three precision levels to each KV token during transfer instead of binary keep-or-drop, preserving quality at the same network budget.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that KV cache transfer in prefill-decode disaggregated serving can be treated as a per-token precision allocation task rather than a pure selection task. High-importance tokens stay at FP16, medium-importance at INT8, and low-importance at INT4 only when a quick probe confirms the model tolerates it. This mixed policy is shown to limit perplexity degradation to under 2 percent on WikiText-2 at 50 percent budget, where binary baselines exceed 20 percent, and to raise needle-in-a-haystack retrieval from 26 percent to 52 percent on one model at the tightest budget. The result matters because disaggregation turns the KV cache into a network payload whose size directly sets time-to-first-token.

Core claim

SpectrumKV assigns FP16 to attention sinks and other high-importance tokens, INT8 to medium-importance tokens, and INT4 to low-importance tokens when the model passes a three-trial NIAH probe; models that fail the probe fall back to FP16 plus INT8. Across Qwen2.5-7B, Mistral-7B, and Gemma-2-9B this three-tier (or two-tier) policy produces far smaller quality loss than binary token pruning at identical normalized KV budgets.

What carries the argument

The three-tier per-token precision policy (FP16/INT8/INT4) gated by a lightweight deployment-time NIAH probe that decides whether INT4 is safe for a given model.

If this is right

  • At 50 percent normalized KV budget, perplexity change stays near zero or single-digit percent on WikiText-2 while binary methods exceed 20 percent.
  • Needle-in-a-haystack retrieval at b=0.3 reaches 52.6 percent on Qwen versus 26.3 percent for binary selection.
  • Models that pass the probe achieve the full three-tier benefit; others safely fall back to two tiers without manual intervention.
  • Measured transfer-path timing shows 50-62 percent TTFT reduction at b=0.5.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The probe approach could be extended to other quantization bit-widths or to dynamic importance scoring beyond the paper's attention-sink heuristic.
  • Production serving systems could run the probe once per model version and cache the resulting tier policy for all subsequent requests.
  • The same per-token precision idea might apply to activations or weights transferred between prefill and decode nodes.

Load-bearing premise

A short three-trial NIAH probe run at deployment time can reliably detect whether a model will tolerate INT4 KV quantization without later catastrophic failure.

What would settle it

A model that passes the three-trial probe yet exhibits a large quality drop (for example, perplexity rise exceeding 10 percent) when the three-tier policy is applied on WikiText-2 or NIAH at the same budget used in the paper.

Figures

Figures reproduced from arXiv: 2606.08635 by Yang Pengju.

Figure 1
Figure 1. Figure 1: KV access Gini coefficients. All models exhibit strong skew (Gini [PITH_FULL_IMAGE:figures/full_fig_p006_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: SpectrumKV pipeline overview. 4.2 System architecture in PD disaggregation [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: System architecture in PD disaggregation. [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Mixed-precision vs. binary selection at b = 0.5. The Balanced row is intentionally not hidden. It assigns substantial int4 mass even at b = 0.5 and causes Qwen PPL to explode. This is not a plotting artifact or a malformed cache write; it is reproduced by direct int4 controls and motivates the probe. 6.2 Budget sweep [PITH_FULL_IMAGE:figures/full_fig_p010_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: PPL change vs. KV budget for Mistral-7B. [PITH_FULL_IMAGE:figures/full_fig_p011_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: PPL change vs. KV budget for Gemma-2-9B. [PITH_FULL_IMAGE:figures/full_fig_p011_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: PPL change vs. KV budget for Qwen2.5-7B. Shaded region: INT4 unsafe zone. [PITH_FULL_IMAGE:figures/full_fig_p012_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: NIAH accuracy at b = 0.3. 6.5 NIAH depth–budget heatmap The aggregate NIAH accuracy in [PITH_FULL_IMAGE:figures/full_fig_p013_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: NIAH depth–budget heatmap for Qwen2.5-7B. Left: SpectrumKV Adaptive (2-tier). [PITH_FULL_IMAGE:figures/full_fig_p013_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: TTFT reduction at b = 0.5. These timings should be read as transfer-path evidence rather than a full production serving study. They do not include a complete PD scheduler, continuous batching, network contention, or cold-tier fetch policies. 6.7 Context length [PITH_FULL_IMAGE:figures/full_fig_p014_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Per-layer INT4 cosine similarity for Qwen2.5-7B. [PITH_FULL_IMAGE:figures/full_fig_p015_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Per-layer INT4 cosine similarity for Mistral-7B. [PITH_FULL_IMAGE:figures/full_fig_p016_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Per-layer INT4 cosine similarity for Gemma-2-9B. [PITH_FULL_IMAGE:figures/full_fig_p016_13.png] view at source ↗
read the original abstract

Prefill-decode (PD) disaggregation decouples prompt processing from token generation, but it also turns the key-value (KV) cache into a network payload. Existing PD-side KV reduction methods are mostly binary: selected tokens are transmitted at full precision and the rest are not transmitted. This paper argues that binary selection leaves a useful design space unused. SpectrumKV assigns a precision level to each token instead: attention sinks and other high-importance tokens are protected at FP16, medium-importance tokens are sent at INT8, and low-importance tokens are sent at INT4 when the model can tolerate it. The main practical complication is that INT4 tolerance is model-dependent. Qwen2.5-7B catastrophically fails under INT4 KV quantization, while Mistral-7B and Gemma-2-9B remain stable. SpectrumKV therefore runs a lightweight deployment-time probe: three aggressive NIAH trials under a 3-tier policy. Models that pass use FP16+INT8+INT4; models that fail fall back to FP16+INT8. Across Qwen2.5-7B-Instruct, Mistral-7B-Instruct-v0.3, and Gemma-2-9B-it, SpectrumKV improves quality at the same transfer budget. At a 50% normalized KV budget on WikiText-2, SpectrumKV changes perplexity by +1.97%,-0.06%, and-0.44%, respectively, compared with PDTrim's +25.85%, +22.07%, and +35.63%. On NIAH retrieval at 4096 tokens, the adaptive policy reaches 52.6% on Qwen at the aggressive b=0.3 budget versus 26.3% for PDTrim, and reaches 100% by b=0.5; Mistral and Gemma preserve retrieval under the 3-tier policy. End-to-end GPU timing of the transfer path shows 50-62% TTFT reductions at b=0.5. These results suggest that PD KV transfer should be treated as a precision-allocation problem, not only as token pruning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper proposes SpectrumKV for prefill-decode disaggregated LLM serving, replacing binary KV token selection with per-token mixed-precision transfer: high-importance tokens (attention sinks) at FP16, medium at INT8, and low at INT4 when the model tolerates it. A lightweight deployment-time probe of three aggressive NIAH trials decides whether to enable the 3-tier (FP16+INT8+INT4) policy or fall back to FP16+INT8. On Qwen2.5-7B-Instruct, Mistral-7B-Instruct-v0.3, and Gemma-2-9B-it, SpectrumKV reports better perplexity on WikiText-2 at 50% normalized KV budget (+1.97%, -0.06%, -0.44% vs. PDTrim's +25.85%, +22.07%, +35.63%) and higher NIAH accuracy at b=0.3 (52.6% vs. 26.3% on Qwen), with 50-62% TTFT reductions at b=0.5.

Significance. If the central empirical claims hold, the work reframes PD KV transfer as a precision-allocation problem rather than pure token pruning, enabling higher quality at fixed transfer budgets. The model-dependent INT4 tolerance observation and the probe-based adaptive policy are novel contributions. The end-to-end GPU timing results provide practical evidence of latency benefits. No machine-checked proofs or parameter-free derivations are present; the strength lies in the multi-model empirical evaluation.

major comments (2)
  1. [Abstract (deployment-time probe description)] The deployment-time probe (three aggressive NIAH trials) is load-bearing for the adaptive policy, yet the manuscript provides no cross-task validation that passing NIAH guarantees stability on non-retrieval workloads such as long-context reasoning, code generation, or math where KV quantization noise may accumulate differently. The abstract states Qwen2.5-7B fails catastrophically under INT4 while Mistral/Gemma do not, but only NIAH and WikiText-2 results are reported.
  2. [Abstract (quantitative results)] No details are supplied on the method for computing per-token importance scores that determine the FP16/INT8/INT4 tier assignment, the exact experimental controls (e.g., how the 50% normalized KV budget is enforced), or statistical significance of the reported perplexity deltas. These omissions prevent verification of the central claim that mixed-precision outperforms binary selection at the same budget.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback. We address the two major comments point-by-point below.

read point-by-point responses
  1. Referee: [Abstract (deployment-time probe description)] The deployment-time probe (three aggressive NIAH trials) is load-bearing for the adaptive policy, yet the manuscript provides no cross-task validation that passing NIAH guarantees stability on non-retrieval workloads such as long-context reasoning, code generation, or math where KV quantization noise may accumulate differently. The abstract states Qwen2.5-7B fails catastrophically under INT4 while Mistral/Gemma do not, but only NIAH and WikiText-2 results are reported.

    Authors: The probe uses aggressive NIAH trials specifically to detect catastrophic INT4 failure modes, as NIAH is a standard and sensitive test for KV cache fidelity in retrieval. WikiText-2 serves as a general perplexity check. We agree that the probe's behavior on code generation, math, or other reasoning tasks is not directly validated in the current experiments and will add an explicit limitations paragraph discussing this scope in the revision. revision: partial

  2. Referee: [Abstract (quantitative results)] No details are supplied on the method for computing per-token importance scores that determine the FP16/INT8/INT4 tier assignment, the exact experimental controls (e.g., how the 50% normalized KV budget is enforced), or statistical significance of the reported perplexity deltas. These omissions prevent verification of the central claim that mixed-precision outperforms binary selection at the same budget.

    Authors: We will expand the abstract and methods section to include these details. Per-token importance is computed from aggregated attention scores (detailed in Section 3); the normalized budget is met by sorting tokens by importance and greedily assigning the highest feasible precision tier until the exact budget is reached. We will also report standard deviations across multiple runs to support the perplexity deltas. revision: yes

standing simulated objections not resolved
  • Empirical cross-task validation of the NIAH-based probe on non-retrieval workloads (code generation, math, long-context reasoning) beyond the reported NIAH and WikiText-2 results.

Circularity Check

0 steps flagged

No circularity; purely empirical evaluation with no derivations or self-referential reductions

full rationale

The paper presents SpectrumKV as an empirical system: a deployment-time NIAH probe decides between 3-tier (FP16/INT8/INT4) and 2-tier (FP16/INT8) policies, with quality measured directly on WikiText-2 perplexity and NIAH accuracy at given budgets. No equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. Claims reduce to observed benchmark deltas (e.g., +1.97% vs +25.85% perplexity change) rather than any definitional or constructional equivalence. This is the expected non-finding for a systems/empirical paper without mathematical derivations.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no free parameters, axioms, or invented entities can be identified from the provided text.

pith-pipeline@v0.9.1-grok · 5939 in / 1209 out tokens · 18920 ms · 2026-06-27T18:49:04.900898+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Can I Buy Your KV Cache?

    cs.AI 2026-06 unverdicted novelty 6.0

    Proposes an agent-native prefill CDN where precomputed KV caches are hosted and sold to agents, delivering 9-50x compute savings with exact token and logit matching on Qwen3-4B.

Reference graph

Works this paper leans on

22 extracted references · 4 linked inside Pith · cited by 1 Pith paper

  1. [1]

    Pyramidkv: Dynamic kv cache compression based on pyramidal information funneling.arXiv preprint arXiv:2406.02069, 2024

    Zefan Cai et al. Pyramidkv: Dynamic kv cache compression based on pyramidal information funneling.arXiv preprint arXiv:2406.02069, 2024

  2. [2]

    QuIP: 2-bit quantization of large language models with guarantees

    Jerry Chee, Yaohui Cai, Volodymyr Kuleshov, and Christopher De Sa. QuIP: 2-bit quantization of large language models with guarantees. InAdvances in Neural Information Processing Systems, 2023

  3. [3]

    Peter J. Denning. The working set model for program behavior.Communications of the ACM, 11(5):323–333, 1968

  4. [4]

    GPTQ: Accurate post- training quantization for generative pre-trained transformers

    Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. GPTQ: Accurate post- training quantization for generative pre-trained transformers. InInternational Conference on Learning Representations, 2023

  5. [5]

    SplitZip: Ultra fast lossless KV compression for disaggre- gated LLM serving.arXiv preprint arXiv:2605.01708, 2025

    Yipin Guo and Siddharth Joshi. SplitZip: Ultra fast lossless KV compression for disaggre- gated LLM serving.arXiv preprint arXiv:2605.01708, 2025

  6. [6]

    Kvquant: Towards 10 million context length llm inference with kv cache quantization

    Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Hasan Genc, Kurt Keutzer, and Amir Gholami. Kvquant: Towards 10 million context length llm inference with kv cache quantization. InAdvances in Neural Information Processing Systems, 2024

  7. [7]

    Gonzalez, Hao Zhang, and Ion Stoica

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention.arXiv preprint arXiv:2309.06180, 2023

  8. [8]

    Flowkv: A disaggregated inference framework with low-latency kv cache transfer and load-aware scheduling.arXiv preprint arXiv:2504.03775, 2025

    Yichao Li et al. Flowkv: A disaggregated inference framework with low-latency kv cache transfer and load-aware scheduling.arXiv preprint arXiv:2504.03775, 2025

  9. [9]

    Snapkv: LLM knows what you are looking for before generation

    Yuhong Li et al. Snapkv: LLM knows what you are looking for before generation. In Advances in Neural Information Processing Systems, 2024

  10. [10]

    KVServe: Service-aware kv cache compression for communication-efficient disaggregated LLM serving.arXiv preprint arXiv:2605.13734, 2026

    Huan Liu et al. KVServe: Service-aware kv cache compression for communication-efficient disaggregated LLM serving.arXiv preprint arXiv:2605.13734, 2026

  11. [11]

    Kivi: A tuning-free asymmetric 2bit quantization for kv cache

    Zirui Liu, Jiayi Zhao, Wenxuan Yao, Shijie Cheng, Yucheng Li, Zhe Liang, Jiarong Luo, Xiaodong Wang, Quanquan Gu, Xuehai Li, Yuxuan Liu, Yuandong Wang, and Beidi Chen. Kivi: A tuning-free asymmetric 2bit quantization for kv cache. InInternational Conference on Machine Learning, 2024

  12. [12]

    LMCache: KV cache offloading for vLLM.arXiv preprint arXiv:2504.02257, 2025

    LMCache Team. LMCache: KV cache offloading for vLLM.arXiv preprint arXiv:2504.02257, 2025

  13. [13]

    OrbitFlow: SLO-aware long-context LLM serving with fine-grained KV cache reconfiguration.Proceedings of the VLDB Endowment, 19:1046–1060, 2026

    Xinyu Ma et al. OrbitFlow: SLO-aware long-context LLM serving with fine-grained KV cache reconfiguration.Proceedings of the VLDB Endowment, 19:1046–1060, 2026

  14. [14]

    NVIDIA Dynamo: Multi-level KV cache management for LLM serving.NVIDIA Technical Blog, 2025.https://developer.nvidia.com/blog/tag/dynamo/

    NVIDIA. NVIDIA Dynamo: Multi-level KV cache management for LLM serving.NVIDIA Technical Blog, 2025.https://developer.nvidia.com/blog/tag/dynamo/. 24

  15. [15]

    Splitwise: Efficient generative llm inference using phase splitting

    Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Inigo Goiri, Saeed Maleki, and Ricardo Bianchini. Splitwise: Efficient generative llm inference using phase splitting. InProceedings of the International Symposium on Computer Architecture, 2024

  16. [16]

    Mooncake: A KVCache-centric disaggregated architecture for LLM serving.arXiv preprint arXiv:2407.00079, 2024

    Rui Qin, Zhaozhuo Li, Weiran He, Ming Zhang, Yongwei Wu, Weimin Zheng, and Xiaowen Xu. Mooncake: A KVCache-centric disaggregated architecture for LLM serving.arXiv preprint arXiv:2407.00079, 2024

  17. [17]

    Efficient streaming language models with attention sinks

    Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. InInternational Conference on Learning Representations, 2024

  18. [18]

    Cacheblend: Fast large language model serving for rag with cached knowledge fusion.arXiv preprint arXiv:2405.16444, 2024

    Yihong Yao et al. Cacheblend: Fast large language model serving for rag with cached knowledge fusion.arXiv preprint arXiv:2405.16444, 2024

  19. [19]

    Enhancing llm efficiency: Targeted pruning for prefill-decode disaggregation in inference.arXiv preprint arXiv:2509.04467, 2025

    Hao Zhang, Mengsi Lyu, Yulong Ao, and Yonghua Lin. Enhancing llm efficiency: Targeted pruning for prefill-decode disaggregation in inference.arXiv preprint arXiv:2509.04467, 2025

  20. [20]

    H2O: Heavy-hitter oracle for efficient generative inference of large language models

    Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Re, Clark Barrett, Zhangyang Wang, and Beidi Chen. H2O: Heavy-hitter oracle for efficient generative inference of large language models. In Advances in Neural Information Processing Systems, 2023

  21. [21]

    Distserve: Disaggregating prefill and decoding for goodput-optimized large language model serving

    Yinmin Zhong, Shengyu Liu, Junda Chen, Jian Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. Distserve: Disaggregating prefill and decoding for goodput-optimized large language model serving. InProceedings of the USENIX Symposium on Operating Systems Design and Implementation, 2024

  22. [22]

    Dynamickv: Task-aware adaptive kv cache compression for long context llms.arXiv preprint arXiv:2412.14838, 2024

    Yuxuan Zhou et al. Dynamickv: Task-aware adaptive kv cache compression for long context llms.arXiv preprint arXiv:2412.14838, 2024. A Decay-Rate Sweep The decay-rate sweep in Table 15 comes from the simulation-v3 scan rather than the main GPU PPL suite. It is included as an auxiliary sensitivity result. Larger decay rates help Mistral and Gemma in this sw...