When Is a Draft Accepted? A Theory of Acceptance in Speculative Decoding

Aaryam Sharma

arXiv: 2606.30265 · v1 · pith:KXDHIMH6new · submitted 2026-06-29 · 💻 cs.LG · cs.CL · stat.ML

Pith reviewed 2026-06-30 07:37 UTC · model grok-4.3

keywords speculative decoding · acceptance criteria · KL divergence · greedy decoding · relaxed acceptance · tree decoding · certificates · level sets

The pith

Many common acceptance criteria in speculative decoding have rejection regions that are lower level sets of the target distribution, allowing exact KL certificates and margin bounds for greedy, relaxed, and tree rules.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a theory for acceptance events in speculative decoding outside the exact-distribution setting. It shows that criteria used in practice, such as strict greedy decoding and various relaxed thresholds, define rejection regions that coincide with lower level sets of the target model's probability distribution. For any such criterion the minimal KL divergence that forces rejection can be stated exactly, which in turn produces sharp certificates and margin-based bounds on when a draft token is guaranteed to be accepted. The same machinery extends to tree-structured candidate sets, giving coverage guarantees for the target's greedy token under top-m drafter proposals. Evaluation on Qwen3 models illustrates that relaxed and tree criteria certify acceptance over substantially larger regions than strict rules, particularly when the target distribution has low margin between its top tokens.

Core claim

For acceptance criteria whose rejection regions are lower level sets of the target distribution, the exact KL divergence needed to produce rejection can be derived in closed form, supplying exact certificates together with sharp margin-based bounds for strict greedy decoding, additive and multiplicative relaxed acceptance, top-m relaxed criteria, entropy-thresholded acceptance, and greedy tree decoding.
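To make "the exact KL divergence needed to produce rejection" concrete, here is a minimal Python sketch for the strict greedy case only. It is not taken from the paper. It assumes that the drafter proposes its own argmax, that rejection means this argmax differs from the target's, that ties count as rejection, and that all probabilities are strictly positive. The abstract does not say which argument order of the KL divergence the paper uses, so both standard projections are shown. All function names are hypothetical.

```python
import math
import numpy as np

def kl(a, b):
    """KL(a || b) in nats; assumes strictly positive probability vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.sum(a * np.log(a / b)))

def top_two(p):
    """Largest and second-largest target probabilities."""
    s = np.sort(np.asarray(p, dtype=float))[::-1]
    return float(s[0]), float(s[1])

def min_kl_q_p(p):
    """Smallest KL(q || p) over drafters q whose top token ties or beats the target's.

    Tilting p until the runner-up catches the mode gives
    -log(1 - (sqrt(p1) - sqrt(p2))**2).
    """
    p1, p2 = top_two(p)
    return -math.log(1.0 - (math.sqrt(p1) - math.sqrt(p2)) ** 2)

def min_kl_p_q(p):
    """Smallest KL(p || q) over the same drafters: pool the top two masses equally."""
    p1, p2 = top_two(p)
    mid = 0.5 * (p1 + p2)
    return p1 * math.log(p1 / mid) + p2 * math.log(p2 / mid)

def boundary_drafter(p):
    """Drafter attaining min_kl_q_p: geometric mean on the top two tokens, renormalized."""
    p = np.asarray(p, dtype=float)
    order = np.argsort(p)[::-1]
    i, j = order[0], order[1]
    q = p.copy()
    q[i] = q[j] = math.sqrt(p[i] * p[j])
    return q / q.sum()

def greedy_accept_certified(p, q):
    """True when KL(q || p) is below the minimum needed for the argmaxes to differ."""
    return kl(q, p) < min_kl_q_p(p)
```

For example, with `p = [0.7, 0.2, 0.1]` the drafter `q = [0.6, 0.3, 0.1]` is certified, while `q = [0.3, 0.6, 0.1]` is not. Both bounds depend only on the top two target probabilities, which is the sense in which a low-margin step is easy to push into rejection.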

What carries the argument

Lower level sets of the target distribution that characterize the rejection regions of common acceptance criteria.

If this is right

Exact KL certificates become available for strict greedy decoding.
Sharp margin bounds apply to additive, multiplicative, top-m, and entropy-thresholded relaxed acceptance.
Exact and margin-only certificates extend to greedy tree decoding for coverage of the target greedy token.
Relaxed and tree criteria enlarge the certified acceptance region compared with strict rules, especially at low target margins.
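The tree-decoding item above concerns a coverage event: whether the target's greedy token appears among the drafter's top-m candidates. A minimal Python sketch of that event follows. It only checks coverage for given distributions and does not compute the paper's certificates. The function names and the tie-breaking rule (lower index first) are assumptions made for illustration.

```python
import numpy as np

def drafter_top_m(q, m):
    """Indices of the m largest drafter probabilities (ties broken by lower index)."""
    q = np.asarray(q, dtype=float)
    return np.argsort(-q, kind="stable")[:m].tolist()

def target_greedy_covered(p, q, m):
    """True when argmax of the target p is among the drafter's top-m tokens."""
    return int(np.argmax(p)) in drafter_top_m(q, m)
```

With `p = [0.4, 0.35, 0.25]` and `q = [0.3, 0.45, 0.25]`, the target's greedy token is missed at `m = 1` but covered at `m = 2`, which illustrates how a wider candidate set can rescue a low-margin step.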

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Drafter training objectives could be adjusted to maximize the derived margins rather than matching the full target distribution.
The level-set view may transfer to other local-ranking verification schemes outside speculative decoding.
Certificate computation could be integrated into runtime monitoring to decide when to fall back to the target model.

Load-bearing premise

The acceptance criteria actually used in systems have rejection regions that are exactly lower level sets of the target distribution.

What would settle it

A direct computation or sampling experiment on a concrete acceptance rule showing that its rejection set is not a lower level set of the target probabilities, or a measured rejection frequency that deviates from the KL-derived prediction.

Figures

Figures reproduced from arXiv: 2606.30265 by Aaryam Sharma.

**Figure 1.** We first observe that target models are typically very confident. Indeed, the mean top token probability is ≈ 0.85 for Qwen3-1.7B and ≈ 0.87 for Qwen3-4B, and the median is even higher at ≈ 0.97 and ≈ 0.99 respectively. The gap between the mean and median indeed suggests a long tail of low-margin steps which is where the relaxed criteria and tree-based acceptance can provide the most benefit. We also obser…

**Figure 1.** Distribution of the probability of the argmax token of both models. More than 40% of steps have…

**Figure 2.** Both models have similar output distributions and hence have similar greedy certificate distribu…

**Figure 3.** Additive relaxed certificate comparison across different values of…

**Figure 4.** Multiplicative relaxed certificate comparison across different values of…

**Figure 5.** Tree certificate comparison

Original abstract

Speculative decoding accelerates language model inference by using a fast drafter to propose candidate tokens that are then verified by a larger target model. Existing theory largely studies the stochastic, distribution-preserving setting, where the goal is to exactly sample from the target distribution. In contrast, many practical systems use greedy decoding, relaxed acceptance rules, or tree-based candidate sets, where success is governed by local ranking and threshold events rather than exact distributional equality. We develop a theory for these regimes. We identify that many common acceptance criteria have rejection regions that can be characterized as lower level sets of the target distribution. For these, we characterize the exact KL divergence required for rejection, yielding exact certificates and sharp margin-based bounds for strict greedy decoding, additive and multiplicative relaxed acceptance, top-m relaxed criteria, and entropy-thresholded acceptance. We then extend the framework to greedy tree decoding, deriving exact and margin-only certificates for when the target greedy token remains covered by the drafter's top-m candidates. Finally, we evaluate the resulting certificates on Qwen3 models, showing that relaxed and tree-based criteria substantially enlarge the region of certified acceptance, especially on decoding steps with low target model distribution margin. These results complement existing distribution-preserving analyses of speculative decoding by characterizing the deterministic local acceptance events common in practical inference systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives exact KL certificates for deterministic acceptance rules in speculative decoding by mapping common criteria to lower level sets of the target distribution.

The core contribution here is a shift from distribution-preserving analysis to local ranking and threshold events. The author claims that greedy decoding, additive and multiplicative relaxed rules, top-m selection, entropy thresholds, and the tree variants all have rejection regions that are exactly lower level sets of the target distribution. From that identification the paper derives exact KL characterizations of when rejection occurs, together with sharp margin-based bounds.

This is useful because deployed speculative decoding often runs in deterministic mode rather than trying to match the full distribution. The level-set move lets them produce certificates that practitioners could in principle use to set acceptance thresholds with guarantees. The extension to tree-based greedy coverage is a natural next step and the Qwen3 evaluation suggests the relaxed and tree criteria enlarge the certified region, especially on low-margin steps.

The main soft spot is the identification step itself. The abstract presents it as holding for the listed criteria, but without the explicit mappings or derivations visible it is hard to judge whether the lower-level-set property is exact or requires additional assumptions about how the drafter and target interact. If those mappings have gaps, the certificates rest on them. The evaluation is described only at a high level, so it is unclear how the certificates were computed or whether they were tested against actual acceptance rates.

This is for people working on inference optimization who already know the speculative decoding literature and want to move beyond stochastic guarantees. A reader looking for new empirical speedups or fully worked proofs will find less here.

The work is grounded enough in a real gap to deserve referee time. I would send it out.

Referee Report

2 major / 2 minor

Summary. The paper develops a theory of acceptance for speculative decoding in deterministic regimes (greedy, relaxed, tree-based) rather than exact distribution-preserving sampling. It claims that common acceptance criteria have rejection regions exactly equal to lower level sets of the target distribution p; from this identification it derives exact KL-divergence certificates for rejection and sharp margin-based bounds for strict greedy decoding, additive/multiplicative relaxed acceptance, top-m criteria, entropy-thresholded acceptance, and an extension to greedy tree decoding that certifies when the target greedy token remains covered by the drafter's top-m candidates. The certificates are evaluated on Qwen3 models and shown to enlarge the certified-acceptance region, especially at low-margin decoding steps.

Significance. If the level-set identification holds without gaps, the work supplies exact, parameter-free certificates and sharp bounds that directly characterize the local ranking and threshold events used in practical inference systems. This complements existing stochastic analyses by giving deterministic guarantees and could inform the design of acceptance rules that maximize certified throughput. The Qwen3 evaluation provides concrete evidence that relaxed and tree criteria materially expand the certified region relative to strict greedy.

major comments (2)

[Abstract / identification section] Abstract and § on identification of acceptance criteria: the central premise that rejection regions for greedy, additive/multiplicative relaxed, top-m, entropy-thresholded, and tree variants are exactly lower level sets of p must be shown explicitly for each criterion (with the precise definition of the level set and the acceptance rule) to confirm there are no post-hoc adjustments or edge cases that break the equality.
[tree decoding section] § on tree decoding: the exact and margin-only certificates for coverage of the target greedy token by the drafter's top-m candidates rely on the level-set property carrying over to the joint tree structure; any deviation in how the tree is constructed (e.g., shared prefixes or non-independent proposals) would require an additional argument that the rejection region remains a level set.

minor comments (2)

[Abstract] Notation for the target distribution p and drafter q should be introduced once and used consistently; the abstract uses both "target distribution" and "p" without an explicit definition paragraph.
[evaluation section] The evaluation section would benefit from a table listing, for each criterion, the fraction of steps where the certificate is non-vacuous and the average margin improvement, to make the "substantially enlarge" claim quantitative.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment, the recommendation of minor revision, and the constructive comments on explicitness. We address each major comment below.

Referee: [Abstract / identification section] Abstract and § on identification of acceptance criteria: the central premise that rejection regions for greedy, additive/multiplicative relaxed, top-m, entropy-thresholded, and tree variants are exactly lower level sets of p must be shown explicitly for each criterion (with the precise definition of the level set and the acceptance rule) to confirm there are no post-hoc adjustments or edge cases that break the equality.

Authors: Section 3 already supplies direct proofs for each criterion, defining the lower level set L_τ(p) = {x : p(x) ≤ τ} and showing equality to the rejection region via the acceptance rule. For strict greedy the rejection region is exactly the open lower level set below the mode; for additive relaxed it is the level set below p(target) − α; for multiplicative it is the level set scaled by β; for top-m it is the level set below the m-th order statistic; and for entropy-thresholded it is the level set below the entropy-derived τ. The proofs contain no post-hoc adjustments and handle edge cases (ties, zero probabilities) explicitly. To improve readability we will insert a summary table in the revised manuscript that lists, for each criterion, the precise acceptance rule, the corresponding level-set definition, and the theorem reference.

revision: yes

Referee: [tree decoding section] § on tree decoding: the exact and margin-only certificates for coverage of the target greedy token by the drafter's top-m candidates rely on the level-set property carrying over to the joint tree structure; any deviation in how the tree is constructed (e.g., shared prefixes or non-independent proposals) would require an additional argument that the rejection region remains a level set.

Authors: The level-set identification is applied token-wise: each candidate in the tree is accepted or rejected according to the same per-token rule used in the non-tree case. Because the acceptance predicate depends only on the probability of the token being verified (not on path dependence or prefix sharing), the rejection region for every verification step remains exactly a lower level set of p. The tree certificates are obtained by requiring that every competing branch is rejected under this per-token rule; the joint structure therefore inherits the level-set property without modification. The manuscript already states this token-wise invariance in the tree section, so no further argument is required.

revision: no
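The thresholds listed in the first response above can be sketched in Python as follows. This is an illustrative reading of that response, not code from the paper: the function names are hypothetical, "p(target)" is taken to mean the target's largest probability, "scaled by β" is taken to mean a threshold of β times that probability, and every rejection region is treated as a strict lower level set. The entropy-thresholded rule is left out because the response does not say how its τ is derived.

```python
import numpy as np

def rejection_threshold(p, rule, alpha=0.0, beta=1.0, m=1):
    """Threshold tau such that tokens with p[x] < tau are rejected (assumed convention)."""
    p = np.asarray(p, dtype=float)
    top = float(p.max())
    if rule == "greedy":
        return top                      # everything strictly below the mode
    if rule == "additive":
        return top - alpha              # within alpha of the mode is accepted
    if rule == "multiplicative":
        return beta * top               # at least beta times the mode is accepted
    if rule == "top_m":
        return float(np.sort(p)[::-1][m - 1])  # m-th largest target probability
    raise ValueError(f"unknown rule: {rule}")

def rejection_region(p, rule, **params):
    """Token indices in the strict lower level set {x : p[x] < tau}."""
    p = np.asarray(p, dtype=float)
    tau = rejection_threshold(p, rule, **params)
    return set(np.flatnonzero(p < tau).tolist())
```

With `p = [0.5, 0.3, 0.15, 0.05]`, strict greedy rejects tokens 1, 2 and 3, while the additive rule with `alpha=0.25` and the multiplicative rule with `beta=0.5` both reject only tokens 2 and 3. Each rule differs only in the threshold it passes to the same level-set test.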

Circularity Check

0 steps flagged

No significant circularity identified

The paper's derivation begins by identifying that listed acceptance criteria (greedy, additive/multiplicative relaxed, top-m, entropy-thresholded, tree variants) have rejection regions exactly equal to lower level sets of the target distribution p; once this property holds, standard level-set arguments on KL divergence produce the exact certificates and margin bounds. This identification is a direct claim about the structure of the criteria themselves rather than a reduction to fitted parameters, self-citations, or ansatzes imported from prior work. No equations or load-bearing steps in the provided abstract reduce the claimed results to the inputs by construction, and the framework is presented as self-contained once the level-set characterization is granted. The evaluation on Qwen3 models is empirical validation, not part of the derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract only, so ledger is minimal. No free parameters, axioms, or invented entities are explicitly named; the central move is the lower-level-set characterization of rejection regions.

pith-pipeline@v0.9.1-grok · 5760 in / 1063 out tokens · 20342 ms · 2026-06-30T07:37:22.577556+00:00 · methodology

