When Hidden States Drift: Can KV Caches Rescue Long-Range Speculative Decoding?
Pith reviewed 2026-05-12 00:55 UTC · model grok-4.3
The pith
Reusing the target model's KV cache improves long-range acceptance in speculative decoding drafters
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The target model's KV cache serves as an explicit, token-wise context store that the draft model can reuse to obtain richer signals for long-horizon drafting, in contrast to the query-optimized and information-suppressing hidden state. The KV-Reuse Hypothesis is tested with the KVShot framework on Qwen3-8B, which shows improved acceptance rates at larger speculative depths for KV and hybrid variants. The work identifies two structural limits: shallow drafters cannot accurately estimate the target's future queries, and draft-side KV projections receive weak gradient signals under current pipelines, pointing toward block-wise training as a necessary next step.
What carries the argument
The KV-Reuse Hypothesis: allowing the draft model to access the target's complete key-value cache supplies richer context signals than hidden-state reuse alone. The hypothesis is diagnosed through the KVShot framework, which directly compares the hidden-only, KV-only, and hybrid reuse paradigms.
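The core contrast behind the hypothesis can be sketched in toy form: a hidden state is a single query-weighted summary of the context, while the KV cache keeps every token's representation. All names and numbers below are illustrative, not from the paper.

```python
def hidden_only_context(token_reprs, query_weights):
    """Query-biased compression: one vector, weighted by the current query's attention."""
    dim = len(token_reprs[0])
    summary = [0.0] * dim
    for w, rep in zip(query_weights, token_reprs):
        for i in range(dim):
            summary[i] += w * rep[i]
    return summary  # information off the current query's focus is attenuated

def kv_reuse_context(token_reprs):
    """Explicit context: the complete set of token-wise representations is kept."""
    return list(token_reprs)  # later speculative steps can re-attend to any token

tokens = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
weights = [0.9, 0.05, 0.05]  # current query attends mostly to the first token
print(len(hidden_only_context(tokens, weights)))  # 2: one d-dimensional vector
print(len(kv_reuse_context(tokens)))              # 3: one entry per token
```

The compressed summary discards whatever the current query down-weights, which is exactly the information a later speculative step may need.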
If this is right
- KV reuse raises acceptance rates for speculative steps at greater distances from the current position.
- Test-time training alone cannot overcome long-range decay, because the root cause is query-biased context compression rather than train-inference mismatch.
- Shallow drafters struggle to predict the target model's future queries, limiting how well they can use KV information.
- Block-wise training is required to give draft-side KV projections adequate gradient signals and realize end-to-end speedups.
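The first prediction reduces to a per-depth acceptance measurement. A minimal sketch, where the accepted-prefix lengths per speculation block are hypothetical stand-ins for real verifier output:

```python
def acceptance_at_depth(prefix_lens, depth):
    """Fraction of speculation blocks whose accepted prefix reaches `depth` tokens."""
    return sum(1 for n in prefix_lens if n >= depth) / len(prefix_lens)

# Hypothetical accepted-prefix lengths per speculation block for two drafters.
hidden_only = [5, 2, 1, 3, 1, 2, 1, 1]
kv_hybrid = [5, 4, 3, 3, 2, 4, 2, 3]
for d in (1, 3, 5):
    print(d, acceptance_at_depth(hidden_only, d), acceptance_at_depth(kv_hybrid, d))
```

If KV reuse helps where claimed, the gap between the two curves should widen at larger depths while staying small at depth 1.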
Where Pith is reading between the lines
- The same reuse distinction could be tested in tree-based or multi-token speculative methods to see whether KV signals help branching accuracy.
- Deeper or wider draft models might close more of the performance gap once they receive full KV context.
- The diagnostic approach could be reused to evaluate other forms of context reuse across different attention mechanisms.
Load-bearing premise
The observed gains in long-range acceptance arise from the richer, uncompressed context signals in the KV cache rather than from incidental differences in how the three reuse paradigms are implemented or trained.
What would settle it
Train hidden-only, KV-only, and hybrid drafters under identical architecture, data, and optimization settings, then measure whether the KV-only and hybrid versions still produce higher acceptance rates at speculative depths beyond 4 steps.
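Such a settling experiment could report a paired statistic over matched prompts. A minimal sketch with purely hypothetical per-prompt acceptance rates at depths beyond 4:

```python
import math
import statistics

def paired_t(xs, ys):
    """Paired t statistic over matched per-prompt acceptance rates."""
    diffs = [x - y for x, y in zip(xs, ys)]
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(len(diffs)))

# Hypothetical acceptance rates beyond depth 4, same prompts, identical training.
kv_only = [0.31, 0.28, 0.35, 0.30, 0.33, 0.29]
hidden_only = [0.22, 0.21, 0.27, 0.24, 0.25, 0.23]
print(paired_t(kv_only, hidden_only))  # large positive t favors KV reuse
```

Pairing by prompt controls for prompt difficulty, so the statistic isolates the reuse mechanism rather than the evaluation set.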
Original abstract
Speculative decoding accelerates LLM inference, but SOTA hidden-state-based drafters suffer from long-range decay: draft accuracy degrades as the speculative step increases. Existing work attributes this decay to train-inference mismatch and proposes test-time training (TTT) as a remedy, yet we observe that long-range decay persists even in TTT-trained drafters. We revisit long-range decay from the perspective of context information preservation. In hidden-state reuse, we argue the target hidden state acts as a biased context compression: it aggregates historical token information according to the attention query at the current position, yielding a compact representation optimized for immediate next-token prediction. This compression can suppress information less relevant to the current query but important for later speculative steps. In contrast, the target model's KV cache serves as an explicit context, retaining the complete set of token-wise KV representations. We therefore posit the KV-Reuse Hypothesis: allowing the draft model to reuse the target KV cache can provide richer signals for long-horizon drafting. To test this hypothesis, we introduce KVShot, a diagnostic framework that compares three reuse paradigms: hidden-only, KV-only, and hybrid. Extensive evaluations on Qwen3-8B show that KV-Reuse improves long-range acceptance, although end-to-end speedups remain marginal under current training pipelines. Our analysis identifies two key structural bottlenecks: shallow drafters struggle to estimate target queries accurately, and draft-side KV projections receive sparse gradient signals. These findings suggest that realizing the full potential of KV-aware decoding requires moving beyond TTT toward block-wise training paradigms. By exposing these bottlenecks, KVShot provides a foundational diagnostic testbed and a clear roadmap for designing next-generation inference architectures.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper attributes long-range decay in speculative decoding to information loss in hidden-state reuse by the target model, which acts as a query-biased compression. It posits the KV-Reuse Hypothesis that reusing the full target KV cache supplies richer long-horizon signals than hidden states. To test this, the authors introduce the KVShot diagnostic framework comparing hidden-only, KV-only, and hybrid reuse paradigms on Qwen3-8B, reporting improved long-range acceptance rates for KV variants while noting marginal end-to-end speedups. They identify two bottlenecks—shallow drafters' difficulty estimating target queries and sparse gradients on draft-side KV projections—and recommend shifting from test-time training to block-wise paradigms.
Significance. If the attribution to KV signal richness holds after proper controls, the diagnostic framework and hypothesis could usefully steer speculative decoding research toward KV-aware drafters for long contexts. The work is empirical rather than theoretical and supplies a reusable testbed plus concrete bottlenecks, which are strengths even if current speedups remain marginal.
Major comments (3)
- [KVShot framework description and experimental setup] The central claim that acceptance gains arise from richer KV signals rather than implementation differences requires explicit confirmation that the three paradigms use identical draft architectures, parameter counts, training data, loss functions, and optimization. The abstract notes that KV paradigms introduce distinct projection layers and receive sparse gradients; without a methods section detailing matched training dynamics and capacity, the KV-Reuse Hypothesis attribution remains vulnerable to confounding.
- [Evaluation results] Results reporting lacks error bars, per-step acceptance tables, and statistical tests for the long-range improvements on Qwen3-8B. The abstract states 'improved long-range acceptance' and 'marginal' speedups but provides no quantitative values or variance estimates, making it impossible to assess whether the gains are robust or practically meaningful.
- [Analysis of structural bottlenecks] The identification of bottlenecks (shallow drafters struggling with target query estimation and sparse KV gradients) is load-bearing for the recommendation to pursue block-wise training. These claims need supporting measurements—e.g., query estimation error rates or gradient norm statistics across layers—to show they are the primary limiters rather than secondary observations.
Minor comments (2)
- Clarify notation for the three reuse paradigms (hidden-only, KV-only, hybrid) early in the paper to avoid ambiguity when comparing them.
- The abstract mentions 'extensive evaluations' but the provided text does not reference specific figures or tables; ensure all quantitative claims are tied to visible results.
Simulated Author's Rebuttal
Thank you for your thorough review and constructive suggestions. We address each of the major comments below and will revise the manuscript accordingly to strengthen the presentation of our KV-Reuse Hypothesis and experimental results.
Point-by-point responses
Referee: The central claim that acceptance gains arise from richer KV signals rather than implementation differences requires explicit confirmation that the three paradigms use identical draft architectures, parameter counts, training data, loss functions, and optimization. The abstract notes that KV paradigms introduce distinct projection layers and receive sparse gradients; without a methods section detailing matched training dynamics and capacity, the KV-Reuse Hypothesis attribution remains vulnerable to confounding.
Authors: We thank the referee for highlighting this important point. The three paradigms were designed with identical draft architectures, parameter counts, training data, loss functions, and optimization procedures to isolate the effect of the reuse mechanism. The distinct projection layers for KV inputs are necessary to handle the different input format but do not alter the core model capacity or training. In the revised manuscript, we will add a dedicated subsection in Methods detailing these matched conditions and the training dynamics to eliminate any potential confounding. revision: yes
Referee: Results reporting lacks error bars, per-step acceptance tables, and statistical tests for the long-range improvements on Qwen3-8B. The abstract states 'improved long-range acceptance' and 'marginal' speedups but provides no quantitative values or variance estimates, making it impossible to assess whether the gains are robust or practically meaningful.
Authors: We agree that more detailed quantitative reporting is necessary. In the revision, we will include error bars (standard deviations from multiple seeds), per-step acceptance rate tables for long-range speculative steps, and statistical tests (e.g., paired t-tests) to validate the improvements. We will also update the abstract with specific quantitative values for acceptance rates and speedups where appropriate. revision: yes
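The promised seed-level variance reporting reduces to a mean and sample standard deviation per configuration. A sketch with placeholder rates, not results from the paper:

```python
import statistics

def mean_and_std(per_seed_rates):
    """Mean and sample standard deviation of acceptance rates across seeds."""
    return statistics.mean(per_seed_rates), statistics.stdev(per_seed_rates)

# Hypothetical depth-5 acceptance rates from three training seeds of one drafter.
mean, std = mean_and_std([0.30, 0.33, 0.27])
print(f"{mean:.2f} +/- {std:.2f}")  # 0.30 +/- 0.03
```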
Referee: The identification of bottlenecks (shallow drafters struggling with target query estimation and sparse KV gradients) is load-bearing for the recommendation to pursue block-wise training. These claims need supporting measurements—e.g., query estimation error rates or gradient norm statistics across layers—to show they are the primary limiters rather than secondary observations.
Authors: We acknowledge that the bottleneck analysis requires more empirical support. We will augment the analysis section with quantitative measurements, including query estimation error rates (computed as the discrepancy between the drafter's estimated queries and the target's actual queries) and gradient norm statistics for the KV projection layers across different depths. These additions will substantiate that these factors are indeed the primary constraints limiting the end-to-end speedups. revision: yes
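One concrete form the proposed query-estimation error could take is the mean cosine distance between the drafter's estimated queries and the target's actual queries. The metric and vectors here are illustrative assumptions, not the paper's definition:

```python
import math

def query_estimation_error(est_queries, tgt_queries):
    """Mean cosine distance between drafter-estimated and actual target queries."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))
    pairs = list(zip(est_queries, tgt_queries))
    return sum(1.0 - cosine(a, b) for a, b in pairs) / len(pairs)

est = [[1.0, 0.0], [0.6, 0.8]]  # drafter's guesses at the target's next queries
tgt = [[1.0, 0.0], [0.0, 1.0]]  # queries the target model actually computes
print(query_estimation_error(est, tgt))  # ~0.1: second guess is off by 0.2
```

Tracked per speculative step, a rising error of this kind would directly support the claim that shallow drafters cannot anticipate the target's future queries.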
Circularity Check
No circularity: empirical hypothesis test with no definitional reduction
full rationale
The paper advances the KV-Reuse Hypothesis via qualitative reasoning on context preservation (hidden-state compression vs. explicit KV retention) and tests it through the KVShot framework's controlled comparison of three reuse paradigms on Qwen3-8B. No equations, parameter fits, or predictions are presented; results are observational acceptance rates and identified bottlenecks. No self-citations, ansatzes, or renamings appear in the provided text. The chain is self-contained empirical evaluation rather than any derivation that reduces to its inputs by construction.