Fast Inference from Transformers via Speculative Decoding
Pith reviewed 2026-05-17 22:47 UTC · model grok-4.3
The pith
Speculative decoding accelerates large autoregressive models by verifying multiple draft tokens in one parallel run of the target model while preserving the exact output distribution.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By running a fast approximation model to generate a short speculative sequence and then evaluating that sequence under the target model in parallel, exact samples from the target distribution can be produced while often accepting more than one token per invocation of the large model.
What carries the argument
Speculative decoding algorithm, which uses a draft model to propose candidate tokens and verifies them against the target model's output distribution in a single batched step.
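The draft-and-verify step described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names, the `gamma` parameter (number of drafted tokens), and the two toy context-independent models are assumptions introduced here, and models are represented as plain functions that return a list of next-token probabilities.

```python
import random

def sample(dist, rng):
    """Draw an index from a discrete distribution given as a list of probabilities."""
    r = rng.random()
    acc = 0.0
    for i, p in enumerate(dist):
        acc += p
        if r < acc:
            return i
    return len(dist) - 1  # guard against floating-point round-off

def speculative_step(prefix, target, draft, gamma, rng):
    """Return the tokens produced by one draft-and-verify step (1 to gamma + 1 tokens)."""
    # 1. The draft model proposes gamma tokens, one after another.
    drafted, q_dists = [], []
    for _ in range(gamma):
        q = draft(prefix + drafted)
        q_dists.append(q)
        drafted.append(sample(q, rng))
    # 2. The target model scores all gamma + 1 prefixes.
    #    A real system would do this in one batched forward pass.
    p_dists = [target(prefix + drafted[:i]) for i in range(gamma + 1)]
    # 3. Accept each drafted token x with probability min(1, p(x) / q(x)).
    out = []
    for i, x in enumerate(drafted):
        p, q = p_dists[i], q_dists[i]
        if rng.random() < min(1.0, p[x] / q[x]):
            out.append(x)
            continue
        # Rejected: resample from the normalized residual max(0, p - q), then stop.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        total = sum(residual)
        out.append(sample([r / total for r in residual], rng))
        return out
    # All drafts accepted: the last target distribution yields one extra token.
    out.append(sample(p_dists[gamma], rng))
    return out

# Hypothetical toy models over a 3-token vocabulary (context is ignored).
def toy_target(context):
    return [0.5, 0.3, 0.2]

def toy_draft(context):
    return [0.3, 0.4, 0.3]

tokens = speculative_step([], toy_target, toy_draft, gamma=3, rng=random.Random(0))
```

Each call returns between one and `gamma + 1` tokens for a single scoring pass of the target, which is where the speedup comes from when the draft model is cheap and usually agrees with the target.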
If this is right
- Existing off-the-shelf models can be accelerated without retraining or architecture modifications.
- Generated outputs remain identical to those from standard autoregressive decoding.
- The method exploits the fact that many language-modeling steps are easier subtasks that smaller models approximate well.
- Speedups of 2X-3X are achieved on T5-XXL relative to the standard T5X implementation.
Where Pith is reading between the lines
- The same draft-and-verify pattern could apply to other autoregressive generation tasks such as music or protein sequences.
- Pairing the technique with model compression methods might compound the observed speedups.
- Training draft models specifically to maximize acceptance rate rather than standalone accuracy could raise the average tokens advanced per step.
Load-bearing premise
A sufficiently accurate and faster draft model exists that can produce enough accepted tokens to offset the cost of the verification step.
What would settle it
Measure the average number of tokens advanced per speculative step on representative prompts; if this average falls below the cost of one step measured in target-model runs (estimated here at approximately 1.5), the method produces no net speedup.
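The break-even point can be made concrete with a short calculation. The sketch below assumes a simplified model that is not spelled out on this page: each drafted token is accepted independently with probability `alpha`, `gamma` tokens are drafted per step, and one draft-model run costs a fraction `c` of one target-model run. All three symbols and both function names are introduced here for illustration.

```python
def expected_tokens(alpha, gamma):
    """Expected tokens produced per speculative step.

    Assumes each drafted token is accepted independently with
    probability alpha, where 0 <= alpha < 1.
    """
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

def speedup(alpha, gamma, c):
    """Expected wall-time improvement over plain autoregressive decoding.

    One step costs gamma draft runs (each c target-run equivalents)
    plus one target run, so the step cost is gamma * c + 1.
    """
    return expected_tokens(alpha, gamma) / (gamma * c + 1)

# A well-aligned, cheap draft model gives a clear gain.
good = speedup(alpha=0.8, gamma=4, c=0.05)

# A poorly aligned, expensive draft model is slower than the baseline.
bad = speedup(alpha=0.5, gamma=4, c=0.3)
```

Under these assumptions the method pays off only when the expected tokens per step exceed `gamma * c + 1`, so the break-even average is not a fixed number: it moves with the draft model's relative cost and the number of drafted tokens.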
Original abstract
Inference from large autoregressive models like Transformers is slow - decoding K tokens takes K serial runs of the model. In this work we introduce speculative decoding - an algorithm to sample from autoregressive models faster without any changes to the outputs, by computing several tokens in parallel. At the heart of our approach lie the observations that (1) hard language-modeling tasks often include easier subtasks that can be approximated well by more efficient models, and (2) using speculative execution and a novel sampling method, we can make exact decoding from the large models faster, by running them in parallel on the outputs of the approximation models, potentially generating several tokens concurrently, and without changing the distribution. Our method can accelerate existing off-the-shelf models without retraining or architecture changes. We demonstrate it on T5-XXL and show a 2X-3X acceleration compared to the standard T5X implementation, with identical outputs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces speculative decoding, an algorithm to accelerate autoregressive sampling from large Transformer models without altering the output distribution. A smaller draft model generates candidate tokens autoregressively; the target model then scores all of them in a single parallel forward pass, and a corrected sampling step accepts or rejects each token so that the result is exactly equivalent to standard autoregressive decoding from the target. The authors report 2X-3X speedups on T5-XXL relative to the standard T5X implementation, with identical outputs and no retraining or architectural changes required.
Significance. If the central construction holds, the work provides a practical, general-purpose technique for reducing the serial bottleneck in Transformer inference by exploiting easier subtasks that faster models approximate well. The explicit algorithmic guarantee of distribution preservation (via an acceptance probability derived from the ratio of the target and draft conditionals), combined with direct empirical measurement on T5-XXL, constitutes a reproducible and falsifiable contribution to efficient large-model deployment.
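The distribution-preservation guarantee can be checked directly for a single token. The notation here is an assumption made for illustration: $p$ is the target model's next-token distribution, $q$ is the draft model's, a token drafted from $q$ is accepted with probability $\min(1, p(x)/q(x))$, and on rejection a replacement is drawn from the normalized residual $p'$.

```latex
\begin{aligned}
\beta &= \sum_{x'} \min\bigl(p(x'),\, q(x')\bigr)
  && \text{(probability that the drafted token is accepted)} \\
p'(x) &= \frac{\max\bigl(0,\; p(x) - q(x)\bigr)}{\sum_{x'} \max\bigl(0,\; p(x') - q(x')\bigr)}
       = \frac{p(x) - \min\bigl(p(x),\, q(x)\bigr)}{1 - \beta}
  && \text{(residual distribution, defined when } \beta < 1\text{)} \\
\Pr[\text{output} = x]
  &= q(x)\,\min\!\Bigl(1,\; \frac{p(x)}{q(x)}\Bigr) + (1 - \beta)\, p'(x) \\
  &= \min\bigl(p(x),\, q(x)\bigr) + p(x) - \min\bigl(p(x),\, q(x)\bigr) \\
  &= p(x).
\end{aligned}
```

The first term is the probability of drafting $x$ and accepting it; the second is the probability of rejecting some drafted token and then resampling $x$. Their sum is $p(x)$ regardless of how poor the draft model is, so draft quality affects only speed, not correctness.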
minor comments (2)
- [Abstract] The reported 2X-3X range would be more informative if the paper stated the exact hardware, batch size, and baseline T5X implementation details used for the timing measurements.
- [§3] The description of the draft-model selection process could include a short discussion of how alignment between draft and target distributions affects the expected number of accepted tokens.
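The link between draft-target alignment and acceptance can be illustrated with a small calculation. Assuming the standard acceptance rule, in which a token drafted from `q` is accepted with probability `min(1, p(x) / q(x))`, the per-token acceptance probability is the overlap of the two distributions, which equals one minus their total variation distance. The function names and example distributions below are hypothetical.

```python
def acceptance_rate(p, q):
    """Probability that a token drafted from q is accepted against target p."""
    return sum(min(pi, qi) for pi, qi in zip(p, q))

def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

target_dist = [0.5, 0.3, 0.2]

# A draft close to the target is accepted most of the time (about 0.8 here).
close_draft = [0.3, 0.4, 0.3]
rate_close = acceptance_rate(target_dist, close_draft)

# A draft that puts most of its mass on an unlikely token is rejected more often.
far_draft = [0.05, 0.05, 0.9]
rate_far = acceptance_rate(target_dist, far_draft)
```

A perfectly aligned draft gives a rate of 1 and every drafted token is kept; as the overlap shrinks, more steps end early and the target model's verification pass yields fewer tokens.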
Simulated Author's Rebuttal
We thank the referee for the positive review and the recommendation to accept. The provided summary accurately captures the core ideas and empirical results of our work on speculative decoding.
Circularity Check
No significant circularity identified
full rationale
The paper's central contribution is an explicit algorithmic construction (draft-model token generation followed by parallel target-model verification with a distribution-preserving sampling correction) whose correctness follows directly from the definition of the target model's conditional distribution. The claimed speedup is obtained by empirical measurement on T5-XXL rather than by any fitted parameter, self-referential equation, or load-bearing self-citation. The requirement for a faster draft model is stated as an explicit precondition and is satisfied in the reported experiments; no step reduces the result to its own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (2)
- [standard math] Autoregressive models define a conditional distribution over the next token given previous tokens.
- [domain assumption] A smaller model can approximate easier subtasks within the overall language-modeling distribution.
Lean theorems connected to this paper
- IndisputableMonolith.Foundation.DAlembert.Inevitability · bilinear_family_forced
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: "Our method can accelerate existing off-the-shelf models without retraining or architecture changes. We demonstrate it on T5-XXL and show a 2X-3X acceleration compared to the standard T5X implementation, with identical outputs."
- IndisputableMonolith.Foundation.HierarchyEmergence · hierarchy_emergence_forces_phi
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: "using speculative execution and a novel sampling method, we can make exact decoding from the large models faster, by running them in parallel on the outputs of the approximation models"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 16 Pith papers
- SlimSpec: Low-Rank Draft LM-Head for Accelerated Speculative Decoding
  SlimSpec replaces the standard LM-head in draft models with a low-rank version to deliver 4-5x faster speculative decoding while preserving full vocabulary and competitive acceptance rates.
- Speculative Decoding for Autoregressive Video Generation
  A training-free speculative decoding method for block-based autoregressive video diffusion uses a quality router on worst-frame ImageReward scores to accept drafter proposals, achieving up to 2.09x speedup at 95.7% qu...
- Why and When Visual Token Pruning Fails? A Study on Relevant Visual Information Shift in MLLMs Decoding
  Visual token pruning in MLLMs fails on complex reasoning due to Relevant Visual Information Shift during decoding, but the DSTP framework fixes it training-free across models.
- Sequential KV Cache Compression via Probabilistic Language Tries: Beyond the Per-Vector Shannon Limit
  Sequential KV compression via probabilistic language tries and predictive delta coding achieves 3.3-4.3 bits per token entropy, yielding up to 914x better ratios than TurboQuant even with large overhead.
- Drift-AR: Single-Step Visual Autoregressive Generation via Anti-Symmetric Drifting
  Drift-AR achieves 3.8-5.5x speedup in AR-diffusion image models by using entropy to enable entropy-informed speculative decoding and single-step (1-NFE) anti-symmetric drifting decoding.
- Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
  Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.
- Orthrus: Memory-Efficient Parallel Token Generation via Dual-View Diffusion
  Orthrus unifies autoregressive and diffusion views on a shared KV cache to deliver lossless parallel token generation with up to 7.8x speedup and O(1) memory overhead.
- Micro Language Models Enable Instant Responses
  Ultra-compact 8-30M parameter models start contextually grounded responses on-device while cloud models seamlessly continue them, enabling responsive AI on power-constrained hardware.
- Open-TQ-Metal: Fused Compressed-Domain Attention for Long-Context LLM Inference on Apple Silicon
  Fused compressed-domain int4 attention on Apple Silicon delivers 48x speedup and 3.2x KV cache compression for 128K-context 70B models while matching FP16 token predictions.
- DualDiffusion: A Speculative Decoding Strategy for Masked Diffusion Models
  DualDiffusion combines a lightweight drafter using approximations with a full verifier to reduce generation steps in masked diffusion models while keeping accuracy on MMLU and GSM8K.
- GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
  Uptraining multi-head transformer checkpoints to grouped-query attention models achieves near multi-head quality at multi-query inference speeds using 5% additional compute.
- 31.1 A 14.08-to-135.69Token/s ReRAM-on-Logic Stacked Outlier-Free Large-Language-Model Accelerator with Block-Clustered Weight-Compression and Adaptive Parallel-Speculative-Decoding
  A ReRAM-on-logic stacked chip delivers 14.08-135.69 tokens/s LLM inference with block-clustered compression and adaptive parallel speculative decoding, yielding 4.46-7.17x speedup over standard methods.
- Complexity Horizons of Compressed Models in Analog Circuit Analysis
  Prerequisite graphs map compressed LLM performance boundaries in analog circuit analysis to allow selecting the smallest viable model for a given task complexity.
- EdgeFM: Efficient Edge Inference for Vision-Language Models
  EdgeFM is an agent-driven framework that strips non-essential features from VLMs and packages reusable optimized kernels, achieving up to 1.49x speedup over TensorRT-Edge-LLM on NVIDIA Orin while enabling first end-to...
- SOLARIS: Speculative Offloading of Latent-bAsed Representation for Inference Scaling
  SOLARIS speculatively precomputes user-item latent representations to decouple large-model inference from real-time serving, delivering 0.67% revenue gain when deployed in Meta's ad system.
- LLMOrbit: A Circular Taxonomy of Large Language Models -From Scaling Walls to Agentic AI Systems
  A survey taxonomy of LLMs identifies three scaling crises and six efficiency paradigms while tracing the shift from generation to tool-using agents.