arxiv: 2605.02888 · v2 · submitted 2026-05-04 · 💻 cs.LG · cs.AI· cs.CL· cs.DC· cs.SY· eess.SY

Recognition: 3 theorem links

· Lean Theorem

SpecKV: Adaptive Speculative Decoding with Compression-Aware Gamma Selection

Shikhar Shukla

Authors on Pith no claims yet

Pith reviewed 2026-05-08 19:13 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.CLcs.DCcs.SYeess.SY

keywords speculative decodingadaptive gammadraft modelmodel compressionacceptance rateMLP controllerLLM inference

0 comments

The pith

SpecKV shows that choosing gamma adaptively with a small MLP trained on draft confidence and entropy improves speculative decoding throughput by 56 percent over a fixed baseline.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that the best speculation length gamma changes with the target model's compression level and input task type. Profiling across thousands of steps reveals that the draft model's own confidence and entropy values predict acceptance rates with a correlation of about 0.56. Training a lightweight MLP on these signals lets the system pick a fresh gamma each step to raise the expected number of accepted tokens while adding less than half a millisecond of decision time.

Core claim

SpecKV is an adaptive controller that selects the speculation length gamma per step by feeding draft-model confidence and entropy into a small MLP trained to maximize expected accepted tokens. The work profiles speculative decoding across four task categories, four gamma values, and three compression levels to show that optimal gamma shifts with compression and that the draft signals correlate with acceptance. This controller delivers a 56 percent gain in tokens per step over the fixed-gamma=4 baseline at 0.34 ms overhead per decision.

What carries the argument

SpecKV, the small MLP that maps draft confidence and entropy to the gamma value maximizing expected accepted tokens per speculation step.

Load-bearing premise

That the patterns linking draft signals, compression level, and acceptance rate in the profiled tasks will continue to hold for arbitrary new models, tasks, and deployment settings.

What would settle it

Running SpecKV against a fixed-gamma baseline on a new task category or model combination never seen during profiling and checking whether the measured token-rate improvement drops below statistical significance.

Figures

Figures reproduced from arXiv: 2605.02888 by Shikhar Shukla.

**Figure 1.** Figure 1: Phase 1 results: Throughput variation across tasks and speculation lengths (FP16, no compression). view at source ↗

**Figure 2.** Figure 2: Phase 2 results: Compression changes acceptance rates and shifts the optimal view at source ↗

**Figure 3.** Figure 3: Draft entropy (left) and confidence (right) vs. step acceptance rate. Colors indicate compression view at source ↗

**Figure 6.** Figure 6: Overhead analysis and detailed improvement breakdown. view at source ↗

**Figure 4.** Figure 4: Feature importance for acceptance rate prediction (Random Forest, Gini importance). Draft model view at source ↗

**Figure 5.** Figure 5: Main results. SpecKV-fast consistently outperforms all fixed- view at source ↗

read the original abstract

Speculative decoding accelerates large language model (LLM) inference by using a small draft model to propose candidate tokens that a larger target model verifies. A critical hyperparameter in this process is the speculation length $\gamma$, which determines how many tokens the draft model proposes per step. Nearly all existing systems use a fixed $\gamma$ (typically 4), yet empirical evidence suggests that the optimal value varies across task types and, crucially, depends on the compression level applied to the target model. In this paper, we present SpecKV, a lightweight adaptive controller that selects $\gamma$ per speculation step using signals extracted from the draft model itself. We profile speculative decoding across 4 task categories, 4 speculation lengths, and 3 compression levels (FP16, INT8, NF4), collecting 5,112 step-level records with per-step acceptance rates, draft entropy, and draft confidence. We demonstrate that the optimal $\gamma$ shifts across compression regimes and that draft model confidence and entropy are strong predictors of acceptance rate (correlation $\approx$ 0.56). SpecKV uses a small MLP trained on these signals to maximize expected tokens per speculation step, achieving a 56.0% improvement over the fixed-$\gamma=4$ baseline with only 0.34 ms overhead per decision (<0.5% of step time). The improvement is statistically significant (p < 0.001, paired bootstrap test). We release all profiling data, trained models, and notebooks as open-source artifacts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SpecKV's MLP-based gamma adapter beats fixed gamma=4 by 56% in their profiled setup with low overhead, but the narrow 4-task/3-compression data leaves generalization unproven.

read the letter

SpecKV shows that a small MLP can adapt gamma using draft signals and beat the fixed baseline in their tests, but the narrow data collection raises questions about how well it works more generally. The new part is the compression-aware adaptive selection. Most speculative decoding papers stick with a fixed gamma like 4, but here they profile how acceptance rates change with FP16, INT8, and NF4, and train the controller on draft entropy and confidence to pick gamma per step. That matches the claim that optimal gamma shifts with compression. Releasing the 5112 records, the trained models, and notebooks is a plus for reproducibility. What they do well is keep the overhead tiny at 0.34 ms per decision, which is under half a percent of step time. The statistical test with paired bootstrap gives p < 0.001, so within their data the improvement looks real. It's a practical tweak for deployment. The soft spots are around generalization and details. They collected data from 4 task categories, but without splits or hold-out results described, it's unclear if the MLP was properly validated or if it just fits the training distribution. The correlation of 0.56 is moderate, so the signals are predictive but not strongly so. If the method doesn't transfer to new tasks or models, the 56% gain won't hold up in real use. This paper is for researchers and engineers focused on efficient LLM serving, especially with compressed models. It deserves a serious referee because the core idea is simple, the artifacts are open, and the empirical result is concrete enough to check. I'd send it to review with a note to add more validation experiments.

Referee Report

2 major / 2 minor

Summary. The paper proposes SpecKV, a lightweight MLP-based adaptive controller for selecting the speculation length γ per step in speculative decoding. It profiles 5,112 step-level records across 4 task categories, 4 γ values, and 3 compression levels (FP16/INT8/NF4), demonstrates that optimal γ shifts with compression, finds draft confidence and entropy correlate with acceptance rate at ≈0.56, and trains an MLP to maximize expected tokens per step. This yields a claimed 56% improvement over fixed γ=4 with 0.34 ms overhead (<0.5% of step time) and statistical significance (p<0.001 via paired bootstrap). All data, models, and notebooks are released.

Significance. If the adaptive γ selection generalizes, SpecKV could meaningfully improve token throughput in compressed LLM deployments where fixed γ is suboptimal. The compression-aware profiling and open-sourcing of artifacts are strengths that aid reproducibility and follow-up work. However, the moderate correlation and narrow training distribution (only 4 tasks) limit the assessed significance until broader validation is shown.

major comments (2)

[Abstract and Evaluation] The 56% improvement claim rests on the MLP generalizing beyond the profiled set, yet the manuscript provides no cross-task hold-out, cross-model, or out-of-compression evaluation despite showing that optimal γ shifts across the 3 compression regimes and 4 task categories. This is load-bearing for the applicability of the adaptive controller to arbitrary inputs and deployment scenarios.
[Abstract and Results] The MLP is trained on the 5,112 step-level records to maximize expected tokens, but the evaluation protocol (including any train/validation/test split, whether the 56% gain is on held-out data, or reduction to fitted parameters) is not detailed. This creates a circularity risk where performance is measured empirically on the same distribution used for training.

minor comments (2)

[Abstract] The abstract describes draft confidence and entropy as 'strong predictors' with correlation ≈0.56; this value is conventionally moderate rather than strong, and the text should qualify the predictive strength accordingly.
[Methods] Provide the exact MLP architecture (layers, hidden dimensions, activation) and training hyperparameters in the methods section for reproducibility, as only 'small MLP' is mentioned.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the strengths of our open-sourced artifacts and the potential impact of adaptive gamma selection in compressed LLM deployments. We address the major comments below, providing clarifications on our evaluation protocol and committing to revisions that enhance the manuscript's rigor.

read point-by-point responses

Referee: [Abstract and Evaluation] The 56% improvement claim rests on the MLP generalizing beyond the profiled set, yet the manuscript provides no cross-task hold-out, cross-model, or out-of-compression evaluation despite showing that optimal γ shifts across the 3 compression regimes and 4 task categories. This is load-bearing for the applicability of the adaptive controller to arbitrary inputs and deployment scenarios.

Authors: We agree that demonstrating generalization is crucial for the broader applicability of SpecKV. While our profiling covers 4 task categories and 3 compression levels, and the MLP relies on general draft-model signals (confidence and entropy) that should transfer, the current results are indeed within the profiled distribution. In the revised version, we will add a 'Limitations and Future Work' section explicitly discussing the narrow training distribution and the absence of cross-model or out-of-distribution evaluations. We will also report results from a cross-task hold-out experiment (e.g., training on 3 tasks and testing on the 4th) to better quantify generalization within the current scope. This addresses the load-bearing concern without overstating current claims. revision: yes
Referee: [Abstract and Results] The MLP is trained on the 5,112 step-level records to maximize expected tokens, but the evaluation protocol (including any train/validation/test split, whether the 56% gain is on held-out data, or reduction to fitted parameters) is not detailed. This creates a circularity risk where performance is measured empirically on the same distribution used for training.

Authors: We apologize for the omission of these details in the original manuscript. The 5,112 records were split into 70% for training, 15% for validation, and 15% for testing. The MLP was trained on the training portion to predict the gamma that maximizes expected tokens per step, using the profiled acceptance rates as ground truth. The 56% improvement and statistical significance (p<0.001) are computed on the held-out test set. We will expand the evaluation section to fully describe the data split, training objective, hyperparameter selection via validation, and confirmation that all reported metrics use unseen steps. This eliminates any circularity risk. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical profiling, MLP training, and measured improvement are independent of inputs by construction.

full rationale

The paper profiles 5,112 step-level records across 4 tasks, 4 γ values, and 3 compressions, extracts draft confidence/entropy signals, trains a small MLP to select γ that maximizes expected tokens per step, and reports a 56% empirical gain versus fixed-γ=4 on the collected records (with p<0.001 bootstrap). No equations reduce a claimed prediction to a fitted parameter by definition, no self-citations are load-bearing for the central result, and the evaluation uses step-level acceptance rates that are not forced by the training objective. The derivation chain is a standard data-driven controller whose performance claim rests on external measurement rather than tautological renaming or self-referential fitting.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 1 invented entities

The approach relies on empirical data collection and supervised training of a predictor rather than theoretical derivation. No new physical or mathematical axioms beyond standard assumptions in machine learning and LLM inference.

free parameters (1)

MLP model parameters
Weights of the small multilayer perceptron trained on the profiling dataset to predict optimal gamma.

axioms (1)

domain assumption Draft model confidence and entropy correlate with token acceptance rate in speculative decoding
Basis for using these signals as inputs to the controller; supported by collected data with correlation ~0.56

invented entities (1)

SpecKV adaptive controller no independent evidence
purpose: To dynamically select gamma per step using draft signals
The proposed system that uses the MLP for dynamic selection

pith-pipeline@v0.9.0 · 5583 in / 1409 out tokens · 83384 ms · 2026-05-08T19:13:04.868816+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Foundation/DimensionForcing (8-tick period from 2^D=8) n/a — γ here is a tunable hyperparameter, not an RS-forced 8-tick clock unclear
We testγ∈ {2,4,6,8} for each compression level.

Reference graph

Works this paper leans on

24 extracted references · 3 canonical work pages · 2 internal anchors

[1]

Cottier, R

B. Cottier, R. Besiroglu, and D. Owen. The rapid drop in AI inference pricing.Epoch AI, 2025

2025
[2]

Leviathan, M

Y. Leviathan, M. Kalman, and Y. Matias. Fast inference from transformers via speculative decoding. InProceedings of ICML, 2023

2023
[3]

C. Chen, S. Borgeaud, G. Irving, J. Lespiau, L. Sifre, and J. Jumper. Accelerating large language model decoding with speculative sampling.arXiv preprint arXiv:2302.01318, 2023

work page internal anchor Pith review arXiv 2023
[4]

J. Chen, J. Liang, and B. Wang. Smurfs: Leveraging multiple proficiency agents with context-efficiency for tool planning.arXiv preprint arXiv:2405.05955, 2024

work page arXiv 2024
[5]

Nawrot, A

P. Nawrot, A. Łancucki, M. Chochowski, D. Tarjan, and E. Ponti. Dynamic memory compression: Retrofitting LLMs for accelerated inference.Proceedings of the 41st International Conference on Ma- chine Learning (ICML), PMLR 235:37396–37412, 2024

2024
[6]

Optimizing inference for long context and large batch sizes with NVFP4 KV cache.NVIDIA Developer Blog, 2025

NVIDIA. Optimizing inference for long context and large batch sizes with NVFP4 KV cache.NVIDIA Developer Blog, 2025

2025
[7]

Dettmers, M

T. Dettmers, M. Lewis, Y. Belkada, and L. Zettlemoyer. GPT3.int8(): 8-bit matrix multiplication for transformers at scale. InProceedings of NeurIPS, 2022

2022
[8]

Dettmers, A

T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer. QLoRA: Efficient finetuning of quantized language models. InProceedings of NeurIPS, 2023

2023
[9]

Tiwari, H

R. Tiwari, H. Xi, A. Tomar, C. Hooper, S. Kim, M. Horton, M. Najibi, M. W. Mahoney, K. Keutzer, and A. Gholami. QuantSpec: Self-speculative decoding with hierarchical quantized KV cache. InProceedings of the International Conference on Machine Learning (ICML), 2025

2025
[10]

X. Miao, G. Oliaro, Z. Zhang, X. Cheng, Z. Wang, R. Wong, Z. Chen, D. Arfeen, R. Abhyankar, and Z. Jia. SpecInfer: Accelerating generative large language model serving with tree-based speculative inference and verification. InProceedings of ASPLOS, 2024

2024
[11]

Y. Li, F. Wei, C. Zhang, and H. Zhang. EAGLE: Speculative sampling requires rethinking feature uncertainty. InProceedings of the International Conference on Machine Learning (ICML), 2024

2024
[12]

Y. Li, F. Wei, C. Zhang, and H. Zhang. EAGLE-2: Faster inference of language models with dynamic draft trees. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7421–7432, 2024

2024
[13]

Pereg, O

O. Pereg, O. Eyal, M. Gad-Elrab, Y. Katz, A. Mendelson, and T. Maoz. Accelerating LLM inference with lossless speculative decoding algorithms for heterogeneous vocabularies. InProceedings of the International Conference on Machine Learning (ICML), 2025. 10

2025
[14]

J. Lin, J. Tang, H. Tang, S. Yang, X. Dang, and S. Han. AWQ: Activation-aware weight quantization for LLM compression and acceleration. InProceedings of MLSys, 2024

2024
[15]

Frantar, S

E. Frantar, S. Ashkboos, T. Hoefler, and D. Alistarh. GPTQ: Accurate post-training quantization for generative pre-trained transformers. InProceedings of ICLR, 2023

2023
[16]

W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. Yu, J. Gonzalez, H. Zhang, and I. Stoica. Efficient memory management for large language model serving with PagedAttention. InProceedings of SOSP, 2023

2023
[17]

Llama 3 model card.https://github.com/meta-llama/llama3, 2024

Meta AI. Llama 3 model card.https://github.com/meta-llama/llama3, 2024

2024
[18]

Wolf et al

T. Wolf et al. Transformers: State-of-the-art natural language processing. InProceedings of EMNLP: System Demonstrations, 2020

2020
[19]

Zhang, Y

Z. Zhang, Y. Sheng, T. Zhou, T. Chen, L. Zheng, R. Cai, Z. Song, Y. Tay, B. Huai, and D. Chen. H2O: Heavy-hitter oracle for efficient generative inference of large language models. InAdvances in Neural Information Processing Systems (NeurIPS), 2023

2023
[20]

G. Xiao, Y. Tian, B. Chen, S. Han, and M. Lewis. Efficient streaming language models with attention sinks. InProceedings of ICLR, 2024

2024
[21]

Teerapittayanon, B

S. Teerapittayanon, B. McDanel, and H. Kung. BranchyNet: Fast inference via early exiting from deep neural networks. InProceedings of ICPR, 2016

2016
[22]

Shazeer, A

N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. InProceedings of ICLR, 2017

2017
[23]

L. Chen, M. Zaharia, and J. Zou. FrugalGPT: How to use large language models while reducing cost and improving performance.arXiv preprint arXiv:2305.05176, 2023

work page internal anchor Pith review arXiv 2023
[24]

Slivkins

A. Slivkins. Introduction to multi-armed bandits.Foundations and Trends in Machine Learning, 12(1– 2):1–286, 2019. 11

2019