BudgetDraft: Acceptance-Aware Multi-View Training for Sparse-KV Speculative Decoding
Pith reviewed 2026-06-28 23:17 UTC · model grok-4.3
The pith
BudgetDraft trains one drafter on multiple KV budgets with acceptance-aware losses to keep acceptance high in sparse speculative decoding.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
BudgetDraft exposes the drafter to multiple sampled KV budgets during training and aligns each sparse view to one shared full-cache teacher target by combining an acceptance-aware loss on the full-cache branch with a multi-view loss on the sparse-cache branch, yielding a single budget-robust drafter that recovers acceptance across sparsity levels.
What carries the argument
Multi-view sparse training that pairs acceptance-aware loss on a full-cache branch with multi-view loss on a sparse-cache branch to align different KV budgets to one teacher target.
If this is right
- End-to-end speedups reach 6.55x versus plain autoregressive decoding at 4K context length.
- Speedups reach 4.46x at 8K context length and 2.10x at 16K context length.
- The inference pipeline stays memory-friendly because the drafter uses a sparse KV cache.
- A single drafter model suffices across sparsity levels instead of requiring separate models or runtime adjustments.
Where Pith is reading between the lines
- The same multi-view alignment technique could apply to other mismatches between draft and verify caches in speculative decoding.
- Hardware deployments with varying memory budgets might reuse one trained drafter instead of maintaining several specialized versions.
- If acceptance remains stable under the method, it could extend speculative decoding to longer contexts than current sparse approaches allow.
Load-bearing premise
Training the drafter on several sampled KV budgets plus an acceptance-aware loss against a shared full-cache target will keep acceptance rates high at inference time across different sparsity levels without needing per-budget retraining or extra runtime parts.
What would settle it
An experiment that applies the trained BudgetDraft model at 16K context length and measures whether its acceptance rate falls to the same low levels seen in naive sparse/full speculative decoding.
Figures
read the original abstract
Speculative decoding speeds up autoregressive decoding by using a drafter to propose multiple tokens that a verifier validates in parallel. In resource-constrained deployments, the drafter uses a sparse KV cache to limit peak GPU memory and end-to-end latency under a fixed KV budget, while the verifier keeps a full KV cache. Mid-to-long context inference (4K--16K context length) is common in real applications. However, naive sparse/full speculative decoding suffers from the sparse/full mismatch as context length grows, causing the acceptance rate to drop quickly. We propose BudgetDraft, a multi-view sparse training method for sparse drafting in mid-to-long inference. The drafter is exposed to multiple sampled KV budgets during training and learns to align each sparse view with one shared full-cache teacher target. BudgetDraft combines an acceptance-aware loss on a full-cache branch with a multi-view loss on a sparse-cache branch, producing a single budget-robust drafter that recovers acceptance across sparsity levels without extra inference-time components. Experimental results on PG-19, LongBench, and LWM show that BudgetDraft achieves up to 6.55x, 4.46x, and 2.10x end-to-end speedup vs AR at 4K, 8K, and 16K context lengths, while keeping the inference pipeline memory-friendly.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes BudgetDraft, a multi-view sparse training method for speculative decoding drafters that use sparse KV caches under a fixed budget while the verifier uses a full cache. The drafter is trained on multiple sampled KV budgets with an acceptance-aware loss aligned to a shared full-cache teacher target, yielding a single budget-robust model that maintains acceptance rates across sparsity levels without extra inference-time components. Experiments on PG-19, LongBench, and LWM report end-to-end speedups versus autoregressive decoding of up to 6.55x at 4K, 4.46x at 8K, and 2.10x at 16K context lengths while remaining memory-friendly.
Significance. If the empirical claims hold under rigorous controls, the method addresses a practical deployment barrier for speculative decoding in mid-to-long contexts by eliminating per-budget retraining and runtime overheads, potentially enabling wider use of sparse-KV drafters in memory-constrained settings.
major comments (2)
- [§4 (Experiments)] §4 (Experiments): the abstract and method description report speedups on three benchmarks but provide no details on experimental controls, variance across runs, baseline implementations, or the precise protocol for measuring acceptance rate; these omissions are load-bearing for validating the central claim that the multi-view training produces stable acceptance across sparsity levels and context lengths.
- [§3 (Method)] §3 (Method): the description of the multi-view loss and acceptance-aware loss on the full-cache branch does not specify how the sampled KV budgets are drawn during training or whether the sampling distribution matches the inference-time sparsity levels; without this, it is unclear whether the reported robustness is by construction or requires additional assumptions.
minor comments (2)
- [Abstract] The abstract and introduction use the term "parameter-free" in passing when describing the resulting drafter; this should be clarified or removed since the training procedure introduces multiple hyperparameters for budget sampling and loss weighting.
- [Figures/Tables] Figure captions and table headers should explicitly state the number of runs and error bars (or lack thereof) for all reported speedups and acceptance rates.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below and will revise the manuscript to improve reproducibility and clarity.
read point-by-point responses
-
Referee: [§4 (Experiments)] the abstract and method description report speedups on three benchmarks but provide no details on experimental controls, variance across runs, baseline implementations, or the precise protocol for measuring acceptance rate; these omissions are load-bearing for validating the central claim that the multi-view training produces stable acceptance across sparsity levels and context lengths.
Authors: We agree these details are necessary for validating the claims. In the revised manuscript we will expand §4 with: fixed random seeds and hardware configuration; mean and standard deviation of results over 3 runs; exact baseline implementations including sparse-KV handling; and the acceptance-rate protocol (token-by-token exact match against verifier outputs). These additions will directly support the reported robustness. revision: yes
-
Referee: [§3 (Method)] the description of the multi-view loss and acceptance-aware loss on the full-cache branch does not specify how the sampled KV budgets are drawn during training or whether the sampling distribution matches the inference-time sparsity levels; without this, it is unclear whether the reported robustness is by construction or requires additional assumptions.
Authors: We acknowledge the sampling procedure requires explicit description. The revised §3 will state that sparsity ratios are drawn uniformly at random from the discrete set {0.25, 0.5, 0.75} of full-cache size during each training step, chosen to match the inference-time budgets used in evaluation. The full-cache branch remains the fixed teacher target. Pseudocode will be added to make the construction of robustness explicit. revision: yes
Circularity Check
No significant circularity; purely empirical method with measured results
full rationale
The paper presents an empirical training procedure (multi-view sparse KV exposure plus acceptance-aware loss on a shared full-cache target) and reports measured end-to-end speedups on PG-19, LongBench, and LWM at fixed context lengths. No equations, derivations, or parameter-fitting steps are described that would allow any claimed quantity to reduce to its own inputs by construction. The central claim is supported by direct experimental measurement rather than by any self-referential definition or self-citation chain, satisfying the self-contained benchmark criterion.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
International Conference on Learning Representations , year=
Compressive Transformers for Long-Range Sequence Modelling , author=. International Conference on Learning Representations , year=
-
[2]
CoRR , volume=
Tomás Kociský and Jonathan Schwarz and Phil Blunsom and Chris Dyer and Karl Moritz Hermann and Gábor Melis and Edward Grefenstette , title=. CoRR , volume=. 2017 , cdate=
2017
-
[3]
CoRR , volume=
Yushi Bai and Xin Lv and Jiajie Zhang and Hongchang Lyu and Jiankai Tang and Zhidian Huang and Zhengxiao Du and Xiao Liu and Aohan Zeng and Lei Hou and Yuxiao Dong and Jie Tang and Juanzi Li , title=. CoRR , volume=. 2023 , cdate=
2023
-
[4]
Hugging Face Hub , year=
llama-68m , author=. Hugging Face Hub , year=
-
[5]
CoRR , volume=
Hugo Touvron and Louis Martin and Kevin Stone and Peter Albert and Amjad Almahairi and Yasmine Babaei and Nikolay Bashlykov and Soumya Batra and Prajjwal Bhargava and Shruti Bhosale and Dan Bikel and Lukas Blecher and Cristian Canton-Ferrer and Moya Chen and Guillem Cucurull and David Esiobu and Jude Fernandes and Jeremy Fu and Wenyin Fu and Brian Fuller ...
2023
-
[6]
International Conference on Machine Learning , pages=
Fast inference from transformers via speculative decoding , author=. International Conference on Machine Learning , pages=. 2023 , organization=
2023
-
[7]
Accelerating Large Language Model Decoding with Speculative Sampling
Accelerating large language model decoding with speculative sampling , author=. arXiv preprint arXiv:2302.01318 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[8]
CoRR , volume=
Hanshi Sun and Zhuoming Chen and Xinyu Yang and Yuandong Tian and Beidi Chen , title=. CoRR , volume=. 2024 , cdate=
2024
-
[9]
2024 , url=
Yuhui Li and Fangyun Wei and Chao Zhang and Hongyang Zhang , booktitle=. 2024 , url=
2024
-
[10]
Proceedings of the 2024 conference on empirical methods in natural language processing , pages=
Eagle-2: Faster inference of language models with dynamic draft trees , author=. Proceedings of the 2024 conference on empirical methods in natural language processing , pages=
2024
-
[11]
2026 , url=
Yuhui Li and Fangyun Wei and Chao Zhang and Hongyang Zhang , booktitle=. 2026 , url=
2026
-
[12]
International Conference on Learning Representations , volume=
Yarn: Efficient context window extension of large language models , author=. International Conference on Learning Representations , volume=
-
[13]
International Conference on Learning Representations , volume=
Efficient streaming language models with attention sinks , author=. International Conference on Learning Representations , volume=
-
[14]
Advances in Neural Information Processing Systems , volume=
Eagle-3: Scaling up inference acceleration of large language models via training-time test , author=. Advances in Neural Information Processing Systems , volume=
-
[15]
Advances in Neural Information Processing Systems , volume=
H2o: Heavy-hitter oracle for efficient generative inference of large language models , author=. Advances in Neural Information Processing Systems , volume=
-
[16]
Distilling the Knowledge in a Neural Network
Distilling the knowledge in a neural network , author=. arXiv preprint arXiv:1503.02531 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[17]
CoRR , volume=
Xiaoxuan Liu and Lanxiang Hu and Peter Bailis and Ion Stoica and Zhijie Deng and Alvin Cheung and Hao Zhang , title=. CoRR , volume=. 2023 , cdate=
2023
-
[18]
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=
Draft& verify: Lossless large language model acceleration via self-speculative decoding , author=. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=
-
[19]
CoRR , volume=
Jiaheng Liu and Dawei Zhu and Zhiqi Bai and Yancheng He and Huanxuan Liao and Haoran Que and Zekun Wang and Chenchen Zhang and Ge Zhang and Jiebin Zhang and Yuanxing Zhang and Zhuo Chen and Hangyu Guo and Shilong Li and Ziqiang Liu and Yong Shan and Yifan Song and Jiayi Tian and Wenhao Wu and Zhejian Zhou and Ruijie Zhu and Junlan Feng and Yang Gao and Sh...
2025
-
[20]
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages=
E2llm: Encoder elongated large language models for long-context understanding and reasoning , author=. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages=
2025
-
[21]
International Conference on Learning Representations , volume=
Duoattention: Efficient long-context llm inference with retrieval and streaming heads , author=. International Conference on Learning Representations , volume=
-
[22]
ACM Computing Surveys , year=
Network Edge Inference for Large Language Models: Principles, Techniques, and Opportunities , author=. ACM Computing Surveys , year=
-
[23]
International Conference on Learning Representations , volume=
Magicdec: Breaking the latency-throughput tradeoff for long context generation with speculative decoding , author=. International Conference on Learning Representations , volume=
-
[24]
CoRR , volume=
Haoyang Li and Yiming Li and Anxin Tian and Tianhao Tang and Zhanchao Xu and Xuejia Chen and Nicole Hu and Wei Dong and Qing Li and Lei Chen , title=. CoRR , volume=. 2024 , cdate=
2024
-
[25]
CoRR , volume=
Zixuan Zhou and Xuefei Ning and Ke Hong and Tianyu Fu and Jiaming Xu and Shiyao Li and Yuming Lou and Luning Wang and Zhihang Yuan and Xiuhong Li and Shengen Yan and Guohao Dai and Xiao-Ping Zhang and Yuhan Dong and Yu Wang , title=. CoRR , volume=. 2024 , cdate=
2024
-
[26]
ES-FoMo III: 3rd Workshop on Efficient Systems for Foundation Models , year=
LongSpec: Long-Context Lossless Speculative Decoding with Efficient Drafting and Verification , author=. ES-FoMo III: 3rd Workshop on Efficient Systems for Foundation Models , year=
-
[27]
Tsinghua Science and Technology , volume=
Efficient Inference for Edge Large Language Models: A Survey , author=. Tsinghua Science and Technology , volume=. 2026 , publisher=
2026
-
[28]
Advances in Neural Information Processing Systems , volume=
Snapkv: Llm knows what you are looking for before generation , author=. Advances in Neural Information Processing Systems , volume=
-
[29]
Advances in Neural Information Processing Systems , volume=
Ada-kv: Optimizing kv cache eviction by adaptive budget allocation for efficient llm inference , author=. Advances in Neural Information Processing Systems , volume=
-
[30]
Advances in Neural Information Processing Systems , volume=
Adaspec: Selective knowledge distillation for efficient speculative decoders , author=. Advances in Neural Information Processing Systems , volume=
-
[31]
International Conference on Learning Representations , volume=
Efficient inference for large language model-based generative recommendation , author=. International Conference on Learning Representations , volume=
-
[32]
Proceedings of the ACM on Management of Data , volume=
Apt-serve: Adaptive request scheduling on hybrid cache for scalable llm inference serving , author=. Proceedings of the ACM on Management of Data , volume=. 2025 , publisher=
2025
-
[34]
Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. 2023. https://doi.org/10.48550/arXiv.2308.14508 Longbench: A bilingual, multitask benchmark for long context understanding . CoRR, abs/2308.14508
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2308.14508 2023
-
[35]
Guanyu Cai, Ruiming Tian, Lang Yang, Yunzhe Jia, Lingkun Li, and Jiliang Wang. 2026. Efficient inference for edge large language models: A survey. Tsinghua Science and Technology, 31(3):1365--1380
2026
-
[36]
Zhixiong Chen, Bingjie Zhu, Jiangzhou Wang, Hyundong Shin, Arumugam Nallanathan, and Dusit Niyato. 2026. Network edge inference for large language models: Principles, techniques, and opportunities. ACM Computing Surveys
2026
-
[37]
Yuan Feng, Junlin Lv, Yukun Cao, Xike Xie, and S Kevin Zhou. 2026. Ada-kv: Optimizing kv cache eviction by adaptive budget allocation for efficient llm inference. Advances in Neural Information Processing Systems, 38:113152--113188
2026
-
[38]
Shihong Gao, Xin Zhang, Yanyan Shen, and Lei Chen. 2025. Apt-serve: Adaptive request scheduling on hybrid cache for scalable llm inference serving. Proceedings of the ACM on Management of Data, 3(3):1--28
2025
-
[39]
Yuezhou Hu, Jiaxin Guo, Xinyu Feng, and Tuo Zhao. 2026. Adaspec: Selective knowledge distillation for efficient speculative decoders. Advances in Neural Information Processing Systems, 38:88736--88758
2026
-
[40]
Tomás Kociský, Jonathan Schwarz, Phil Blunsom, Chris Dyer, Karl Moritz Hermann, Gábor Melis, and Edward Grefenstette. 2017. http://arxiv.org/abs/1712.07040 The narrativeqa reading comprehension challenge . CoRR, abs/1712.07040
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[41]
Yaniv Leviathan, Matan Kalman, and Yossi Matias. 2023. Fast inference from transformers via speculative decoding. In International Conference on Machine Learning, pages 19274--19286. PMLR
2023
-
[42]
Haoyang Li, Yiming Li, Anxin Tian, Tianhao Tang, Zhanchao Xu, Xuejia Chen, Nicole Hu, Wei Dong, Qing Li, and Lei Chen. 2024 a . https://doi.org/10.48550/arXiv.2412.19442 A survey on large language model acceleration based on kv cache management . CoRR, abs/2412.19442
-
[43]
Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. 2024 b . Snapkv: Llm knows what you are looking for before generation. Advances in Neural Information Processing Systems, 37:22947--22970
2024
-
[44]
Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. 2024 c . https://openreview.net/forum?id=1NdN7eXyb4 EAGLE : Speculative sampling requires rethinking feature uncertainty . In Forty-first International Conference on Machine Learning
2024
-
[45]
Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. 2026 a . Eagle-3: Scaling up inference acceleration of large language models via training-time test. Advances in Neural Information Processing Systems, 38:136737--136756
2026
-
[46]
Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. 2026 b . https://openreview.net/forum?id=4exx1hUffq EAGLE -3: Scaling up inference acceleration of large language models via training-time test . In The Thirty-ninth Annual Conference on Neural Information Processing Systems
2026
-
[47]
Zihan Liao, Jun Wang, Hang Yu, Lingxiao Wei, Jianguo Li, and Wei Zhang. 2025. E2llm: Encoder elongated large language models for long-context understanding and reasoning. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 19212--19241
2025
-
[48]
Xinyu Lin, Chaoqun Yang, Wenjie Wang, Yongqi Li, Cunxiao Du, Fuli Feng, See-Kiong Ng, and Tat-Seng Chua. 2025. Efficient inference for large language model-based generative recommendation. In International Conference on Learning Representations, volume 2025, pages 91672--91697
2025
-
[49]
Jiaheng Liu, Dawei Zhu, Zhiqi Bai, Yancheng He, Huanxuan Liao, Haoran Que, Zekun Wang, Chenchen Zhang, Ge Zhang, Jiebin Zhang, Yuanxing Zhang, Zhuo Chen, Hangyu Guo, Shilong Li, Ziqiang Liu, Yong Shan, Yifan Song, Jiayi Tian, Wenhao Wu, and 18 others. 2025 a . https://doi.org/10.48550/arXiv.2503.17407 A comprehensive survey on long context language modeli...
-
[50]
Xiaoxuan Liu, Lanxiang Hu, Peter Bailis, Ion Stoica, Zhijie Deng, Alvin Cheung, and Hao Zhang. 2023. https://doi.org/10.48550/arXiv.2310.07177 Online speculative decoding . CoRR, abs/2310.07177
- [51]
-
[52]
Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. 2024. Yarn: Efficient context window extension of large language models. In International Conference on Learning Representations, volume 2024, pages 31932--31951
2024
-
[53]
Rae, Anna Potapenko, Siddhant M
Jack W. Rae, Anna Potapenko, Siddhant M. Jayakumar, Chloe Hillier, and Timothy P. Lillicrap. 2020. https://openreview.net/forum?id=SylKikSYDH Compressive transformers for long-range sequence modelling . In International Conference on Learning Representations
2020
-
[54]
Ranajoy Sadhukhan, Jian Chen, Zhuoming Chen, Vashisth Tiwari, Ruihang Lai, Jinyuan Shi, Ian Yen, Avner May, Tianqi Chen, and Beidi Chen. 2025. Magicdec: Breaking the latency-throughput tradeoff for long context generation with speculative decoding. In International Conference on Learning Representations, volume 2025, pages 6835--6850
2025
-
[55]
Hanshi Sun, Zhuoming Chen, Xinyu Yang, Yuandong Tian, and Beidi Chen. 2024. https://doi.org/10.48550/arXiv.2404.11912 Triforce: Lossless acceleration of long sequence generation with hierarchical speculative decoding . CoRR, abs/2404.11912
-
[56]
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton-Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, and 49 others. 2023. https://doi.org/10.48550/arXiv.2307.09288 Llama 2...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2307.09288 2023
-
[57]
Guangxuan Xiao, Jiaming Tang, Jingwei Zuo, Junxian Guo, Shang Yang, Haotian Tang, Yao Fu, and Song Han. 2025. Duoattention: Efficient long-context llm inference with retrieval and streaming heads. In International Conference on Learning Representations, volume 2025, pages 37228--37253
2025
-
[58]
Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. 2024. Efficient streaming language models with attention sinks. In International Conference on Learning Representations, volume 2024, pages 21875--21895
2024
-
[59]
Penghui Yang, Cunxiao Du, Fengzhuo Zhang, Haonan Wang, Tianyu Pang, Chao Du, and Bo An. 2025. https://openreview.net/forum?id=GFN9PWbfHs Longspec: Long-context lossless speculative decoding with efficient drafting and verification . In ES-FoMo III: 3rd Workshop on Efficient Systems for Foundation Models
2025
-
[60]
Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher R \'e , Clark Barrett, and 1 others. 2023. H2o: Heavy-hitter oracle for efficient generative inference of large language models. Advances in Neural Information Processing Systems, 36:34661--34710
2023
-
[61]
Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, and Yu Wang. 2024. https://doi.org/10.48550/arXiv.2404.14294 A survey on efficient inference for large language models . CoRR, abs/2404.14294
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2404.14294 2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.