An Interpretable Latency Model for Speculative Decoding in LLM Serving

Alexandre Marques; Linghao Kong; Mark Kurtz; Megan Flynn; Michael Peng; Nir Shavit

arxiv: 2605.15051 · v1 · pith:YMQXB3EEnew · submitted 2026-05-14 · 💻 cs.LG · cs.PF

Linghao Kong, Megan Flynn, Michael Peng, Nir Shavit, Mark Kurtz, Alexandre Marques

Pith reviewed 2026-06-30 21:21 UTC · model grok-4.3

classification 💻 cs.LG cs.PF

keywords speculative decoding · latency model · LLM serving · Little's Law · effective batch size · draft model · acceptance rate · mixture of experts

The pith

A latency model decomposes speculative decoding costs into load-independent and load-dependent components to predict performance under varying server load.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a latency model for speculative decoding in LLM serving systems. It infers effective batch size from request rate with Little's Law and splits per-request demand into parts that stay constant with load and parts that change with it, covering prefill, drafting, and verification. The model matches extensive measurements across different conditions and accounts for why speedups from using draft models decrease as load rises. This matters because it helps understand how to set draft lengths and acceptance rates in real deployed systems where load fluctuates. It also applies the same approach to mixture of experts models.

Core claim

The central claim is a simple, interpretable latency model for speculative decoding in LLM serving. The model infers effective batch size from request rate using Little's Law and decomposes per-request demand into load-independent and load-dependent components for prefill, drafting, and verification. It accurately describes latency measured in vLLM, explains why speedups shrink as server load increases, and shows how draft length, acceptance rate, and verifier and drafter sizes affect latency. It also extends to mixture-of-experts models, where sparse activation changes costs across load regimes.
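
How draft length and acceptance rate enter the picture can be illustrated with a standard result from the speculative decoding literature, not with anything specific to this paper. Assuming each drafted token is accepted independently with probability `alpha` (an idealization), a round with draft length `k` emits (1 − alpha^(k+1)) / (1 − alpha) tokens in expectation. The function name and parameters below are illustrative.

```python
def expected_tokens_per_round(alpha: float, k: int) -> float:
    """Expected tokens emitted per draft-and-verify round.

    Assumption: each of the k drafted tokens is accepted independently with
    probability alpha, and the verifier always contributes one token (the
    correction at the first rejection, or a bonus token if all k are accepted).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        # Every draft token is accepted, plus the bonus token.
        return float(k + 1)
    # Geometric series 1 + alpha + alpha^2 + ... + alpha^k.
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


if __name__ == "__main__":
    for k in (1, 2, 4, 8):
        print(k, round(expected_tokens_per_round(0.8, k), 3))
```

With `alpha = 0.8` and `k = 4` this gives about 3.36 tokens per round, so fewer verifier passes are needed per request, at the price of `k` drafter passes and a wider verification step in each round.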

What carries the argument

The decomposition of per-request demand into load-independent and load-dependent components, combined with effective batch size inferred via Little's Law from request rate.
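
A minimal sketch of how these two ingredients combine, assuming the closed form suggested by the Figure 1 caption (latency normalized by C1 follows y = 1/(1 − x) with x = RPS × C2). Reading `c1` as the load-independent cost and `c2` as the load-dependent cost per request in flight is an interpretation, the function names are illustrative, and the constants are made up for the example.

```python
def predicted_latency(rps: float, c1: float, c2: float) -> float:
    """Closed-form latency L = c1 / (1 - rps * c2), in the units of c1."""
    utilization = rps * c2
    if not 0.0 <= utilization < 1.0:
        raise ValueError("model is only defined for 0 <= rps * c2 < 1")
    return c1 / (1.0 - utilization)


def effective_batch_size(rps: float, latency: float) -> float:
    """Little's Law: mean requests in flight = arrival rate * mean latency."""
    return rps * latency


if __name__ == "__main__":
    c1, c2 = 1.0, 0.25  # hypothetical constants, not taken from the paper
    for rps in (0.5, 1.0, 2.0, 3.0):
        lat = predicted_latency(rps, c1, c2)
        batch = effective_batch_size(rps, lat)
        print(f"rps={rps:.1f}  latency={lat:.3f}s  effective batch={batch:.2f}")
```

At 2 requests per second these constants give a latency of 2.0 s and an effective batch size of 4, which is consistent with a fixed cost of 1.0 s plus 0.25 s for each of the 4 requests in flight.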

If this is right

The model shows that speedups diminish as server load increases because of the growing influence of load-dependent components.
Latency depends on draft length, acceptance rate, and the size difference between verifier and drafter models.
The framework can guide configuration of speculative decoding in production serving systems.
It extends to mixture of experts models by accounting for changes in effective service costs due to sparse expert activation.
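
The first implication can be sketched numerically. The example assumes the closed form L = C1 / (1 − RPS × C2) implied by the Figure 1 caption, and it assumes that speculative decoding lowers the load-independent cost while raising the load-dependent one. Both cost pairs below are hypothetical values chosen for illustration, not measurements from the paper.

```python
from typing import Tuple

Costs = Tuple[float, float]  # (c1 in seconds, c2 in seconds per request in flight)


def latency(rps: float, costs: Costs) -> float:
    """Predicted latency c1 / (1 - rps * c2) for one serving configuration."""
    c1, c2 = costs
    x = rps * c2
    if not 0.0 <= x < 1.0:
        raise ValueError("saturated: rps * c2 must stay below 1")
    return c1 / (1.0 - x)


def sd_speedup(rps: float, baseline: Costs, speculative: Costs) -> float:
    """Ratio of baseline latency to speculative-decoding latency at a given load."""
    return latency(rps, baseline) / latency(rps, speculative)


# Hypothetical costs: SD halves the fixed cost but adds load-dependent work.
BASELINE: Costs = (2.0, 0.10)
SPECULATIVE: Costs = (1.0, 0.15)

if __name__ == "__main__":
    for rps in (0.0, 2.0, 4.0, 5.0, 6.0):
        print(f"rps={rps:.1f}  speedup={sd_speedup(rps, BASELINE, SPECULATIVE):.2f}")
```

With these made-up numbers the speedup is 2.0 at idle, falls to 1.0 at 5 requests per second, and drops below 1.0 beyond that, which is the qualitative pattern the summary describes.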

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Operators could use the model to dynamically tune draft model parameters based on measured request rates.
The decomposition approach might help analyze other acceleration techniques like continuous batching under variable loads.
Validation across more diverse hardware setups could strengthen the model's applicability.
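
The dynamic tuning idea in the first extension could look roughly like the sketch below. It assumes that a (C1, C2) pair has already been fitted for each candidate draft length and that latency follows C1 / (1 − RPS × C2). The table of costs is hypothetical, draft length 0 stands for "no speculation", and none of the names come from the paper or from vLLM.

```python
import math
from typing import Dict, Tuple

Costs = Tuple[float, float]  # (c1 in seconds, c2 in seconds per request in flight)


def predicted_latency(rps: float, costs: Costs) -> float:
    """Latency c1 / (1 - rps * c2); infinite once the configuration saturates."""
    c1, c2 = costs
    x = rps * c2
    return c1 / (1.0 - x) if x < 1.0 else math.inf


def best_draft_length(rps: float, table: Dict[int, Costs]) -> int:
    """Pick the draft length with the lowest predicted latency at this load."""
    return min(table, key=lambda k: predicted_latency(rps, table[k]))


# Hypothetical fitted costs per draft length (0 = speculation disabled).
COSTS_BY_DRAFT_LENGTH: Dict[int, Costs] = {
    0: (2.0, 0.10),
    2: (1.3, 0.12),
    4: (1.0, 0.15),
}

if __name__ == "__main__":
    for rps in (1.0, 5.0, 7.0):
        print(f"rps={rps:.1f}  draft length={best_draft_length(rps, COSTS_BY_DRAFT_LENGTH)}")
```

Under these invented costs the longest draft wins at low load, a shorter draft wins at moderate load, and disabling speculation wins near saturation.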

Load-bearing premise

That the load-independent and load-dependent components of per-request demand stay stable enough to be useful across ranges of prefill and decode lengths, model sizes, and acceptance probabilities.

What would settle it

A set of latency measurements at high request rates where the predicted latency using the decomposed components deviates substantially from actual observed values for varying draft lengths.

Figures

Figures reproduced from arXiv: 2605.15051 by Alexandre Marques, Linghao Kong, Mark Kurtz, Megan Flynn, Michael Peng, Nir Shavit.

**Figure 1.** Universal latency scaling across models and configurations. Measured end-to-end request latency L, normalized by C1, plotted against normalized load RPS × C2, for multiple models and multiple combinations of prefill/decode configurations. Despite differing absolute scales, all measurements collapse onto the expected curve y = 1/(1 − x), validating the latency model (eq. (1), section 4.1). For example, ser…

**Figure 2.** Configuration-dependent latency and speedup behavior under SD. (a) Per-configuration …

**Figure 3.** The minimum possible fixed and load-variable cost ratios across prefill and decode …

**Figure 4.** Speculative decoding induces configuration-dependent latency costs which a single …

**Figure 5.** Scaling of SD latency costs with verifier and drafter sizes. A speculation-aware latency …

**Figure 6.** Scaling of SD latency costs with prefill …

**Figure 8.** MoE-aware SD latency model improves fit. Compared to the original speculation-aware …

**Figure 9.** Scaling of speculative decoding latency costs with verifier and drafter sizes.

**Figure 10.** The minimum possible fixed and load-variable cost ratios across prefill and decode …

**Figure 11.** Speculative decoding induces configuration-dependent latency costs. (a) Forcing all …

**Figure 12.** Speculation-aware latency model using p95 latency rather than mean latency.

**Figure 13.** Speculation-aware latency model using p99 latency rather than mean latency.

**Figure 14.** Speculation-aware latency model scaling behavior for Qwen3 family on an H100.

Original abstract

Speculative decoding (SD) accelerates large language model (LLM) inference by using a smaller draft model to propose multiple tokens that are verified by a larger target model in parallel. While prior work demonstrates substantial speedups in isolated or fixed-batch settings, the behavior of SD in production serving systems remains poorly understood: request load varies over time, and effective batch size emerges from the serving system rather than being directly controlled or observed. In this work, we develop a simple and interpretable latency model for SD in LLM serving. We infer effective batch size from request rate using Little's Law and decompose per-request demand into load-independent and load-dependent components for prefill, drafting, and verification. We validate our model using extensive measurements from vLLM across verifier and drafter model sizes, prefill and decode lengths, request rates, draft lengths, and acceptance probabilities. The model accurately describes observed latency, explains why speedups often diminish as server load increases, and characterizes how draft length, acceptance rate, and verifier-drafter size shape latency across serving conditions, with implications for configuring SD in deployed systems. We further show how the framework extends to mixture of experts models, where sparse expert activation changes the effective service costs across load regimes. Together, our results provide a structured framework for understanding SD in real LLM serving systems.
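
Why sparse expert activation can make service cost depend on load can be sketched with a textbook occupancy calculation. This is not the paper's MoE-aware model. It assumes every token routes to `top_k` distinct experts chosen uniformly and independently of other tokens, which real routers do not do exactly. All names are illustrative.

```python
def expected_active_experts(num_experts: int, top_k: int, tokens: int) -> float:
    """Expected number of distinct experts touched by a batch of tokens.

    Assumption: each token picks top_k distinct experts uniformly at random,
    independently of the other tokens. A given expert is then missed by one
    token with probability 1 - top_k / num_experts.
    """
    if not 0 < top_k <= num_experts:
        raise ValueError("require 0 < top_k <= num_experts")
    if tokens < 0:
        raise ValueError("tokens must be non-negative")
    miss_probability = 1.0 - top_k / num_experts
    return num_experts * (1.0 - miss_probability ** tokens)


if __name__ == "__main__":
    for tokens in (1, 2, 4, 8, 16, 64):
        active = expected_active_experts(num_experts=8, top_k=2, tokens=tokens)
        print(f"tokens={tokens:3d}  expected active experts={active:.2f}")
```

With 8 experts and top-2 routing, one token touches 2 experts while 16 tokens touch about 7.9 in expectation. Under this idealization, small batches load only part of the expert weights and large batches load nearly all of them, so the cost of a step grows with batch size differently than in a dense model. Since verification processes several draft tokens per request, speculation pushes the token count per step up as well.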

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper gives a workable latency model for speculative decoding under variable load, built on Little's Law and a split of per-request demand into load-independent and load-dependent parts, and backed by broad vLLM sweeps.

This paper's core contribution is a latency model for speculative decoding that accounts for changing request load in serving systems. They use Little's Law to infer effective batch size from request rate, then split per-request demand into load-independent and load-dependent pieces for prefill, drafting, and verification steps.

The work stands out for its validation. They ran extensive measurements in vLLM across verifier and drafter sizes, prefill and decode lengths, request rates, draft lengths, and acceptance probabilities. The model tracks observed latencies and accounts for why speedups drop as load increases. They also show a quick extension to mixture-of-experts models where expert activation changes costs across load levels. That level of sweep coverage makes the claims concrete rather than hand-wavy.

The soft spot is the fitting step. The load-dependent demand components are derived from the same measurements used to check the model, so the fit is partly post-hoc. It reproduces the data inside the tested ranges, but the paper does not show how stable those components stay if hardware, batching policy, or load regimes shift outside the sweeps. That is a minor issue for descriptive use in similar setups and a more limiting one for anyone who wants to extrapolate far.

This is for engineers who tune speculative decoding in production LLM servers and need to predict behavior under real load variation. A reader focused on inference systems will get direct value from the framework and the configuration implications. It deserves a serious referee because the experiments are thorough and the modeling approach is simple enough to check or extend.

Referee Report

1 major / 2 minor

Summary. The paper develops an interpretable latency model for speculative decoding (SD) in LLM serving. It applies Little's Law to infer effective batch size from request rate and decomposes per-request demand into load-independent and load-dependent components for prefill, drafting, and verification phases. Extensive vLLM measurements across verifier/drafter sizes, prefill/decode lengths, request rates, draft lengths, and acceptance probabilities are used to show that the model accurately reproduces observed latencies, explains the erosion of SD speedups at higher loads, and characterizes the influence of draft length, acceptance rate, and model sizes, with an extension to mixture-of-experts models.

Significance. If the decomposition and predictions hold, the work supplies a practical, interpretable framework for configuring SD under variable production load, an area left open by prior isolated or fixed-batch studies. The broad empirical sweep and use of Little's Law as an external anchor are strengths; the MoE extension further increases relevance for modern serving stacks.

major comments (1)

[Model derivation and validation] Model section (inferred from abstract and validation description): the load-dependent demand components are fitted from the same measurement sweeps used for validation. This makes the claim that the model 'accurately describes observed latency' partly tautological; a clearer statement of how the two-parameter fit is performed, whether any data are held out, and whether the components remain stable when load, hardware, or acceptance probability move outside the fitted range would be required to support the predictive use case.

minor comments (2)

[Abstract and validation] Abstract and § on validation: the ranges of prefill/decode lengths, model sizes, and acceptance probabilities are described at a high level; explicit tables or figures listing the exact parameter grids would improve reproducibility.
[Model] Notation: the distinction between 'demand' and 'latency' components should be defined once with symbols before the decomposition is applied.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive review and recommendation for minor revision. We address the single major comment below and will incorporate the requested clarifications.

Referee: [Model derivation and validation] Model section (inferred from abstract and validation description): the load-dependent demand components are fitted from the same measurement sweeps used for validation. This makes the claim that the model 'accurately describes observed latency' partly tautological; a clearer statement of how the two-parameter fit is performed, whether any data are held out, and whether the components remain stable when load, hardware, or acceptance probability move outside the fitted range would be required to support the predictive use case.

Authors: We agree that the manuscript would benefit from greater transparency on this point. The load-dependent demand parameters are estimated from the same vLLM measurement sweeps. In the revision we will expand the model section to state explicitly that the two parameters per phase are obtained by least-squares minimization of the difference between predicted and measured per-request latency versus effective batch size (inferred via Little's Law) for each fixed configuration. No hold-out data were used; the fitting is performed independently per configuration to isolate the load-dependent component while the load-independent component is taken from low-load measurements. We will also add a paragraph noting that the fitted parameters remain consistent in magnitude across the wide range of verifier/drafter sizes, prefill/decode lengths, draft lengths, and acceptance probabilities tested, which provides evidence of stability within the measured regimes. For hardware or acceptance rates outside this range we will acknowledge that re-estimation may be required and that the current experiments do not constitute out-of-distribution validation. revision: yes
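
The fitting procedure described in this response, together with the hold-out check the referee asks about, can be sketched as follows. This is a simplified single-phase version and not the authors' code: it assumes total latency is linear in effective batch size (L = C1 + C2 × B, with B = RPS × L from Little's Law), fits both constants by ordinary least squares, and evaluates on request rates left out of the fit. The data are synthetic, generated from the model with 1% noise, so the example demonstrates only the mechanics.

```python
import numpy as np


def fit_costs(rps: np.ndarray, latency: np.ndarray) -> tuple[float, float]:
    """Least-squares fit of latency = c1 + c2 * batch, with batch = rps * latency."""
    batch = rps * latency                      # effective batch size via Little's Law
    c2, c1 = np.polyfit(batch, latency, deg=1)  # polyfit returns slope, then intercept
    return float(c1), float(c2)


def predict(rps: np.ndarray, c1: float, c2: float) -> np.ndarray:
    """Closed-form latency implied by the linear model: c1 / (1 - rps * c2)."""
    return c1 / (1.0 - rps * c2)


# Synthetic measurements (NOT real data): true constants plus 1% multiplicative noise.
rng = np.random.default_rng(0)
TRUE_C1, TRUE_C2 = 1.0, 0.1
rps = np.linspace(0.5, 7.0, 14)
measured = predict(rps, TRUE_C1, TRUE_C2) * (1.0 + rng.normal(0.0, 0.01, rps.size))

# Fit on every other request rate and hold out the rest.
train = np.arange(rps.size) % 2 == 0
c1_hat, c2_hat = fit_costs(rps[train], measured[train])
held_out_pred = predict(rps[~train], c1_hat, c2_hat)
rel_err = np.abs(held_out_pred - measured[~train]) / measured[~train]

if __name__ == "__main__":
    print(f"c1={c1_hat:.3f}  c2={c2_hat:.4f}  max held-out rel. error={rel_err.max():.3%}")
```

Holding out whole request rates, as here, tests interpolation within the measured load range. Holding out the highest rates, or a different hardware setup, would be the stricter test of the extrapolation the referee is concerned about.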

Circularity Check

0 steps flagged

No significant circularity identified

The paper applies Little's Law (an external, standard result) to infer effective batch size from request rate, then introduces an empirical decomposition of per-request demand into load-independent and load-dependent terms for different phases. This decomposition is fitted to measurements and validated across wide experimental sweeps in vLLM. No quoted equations or steps reduce the model's predictions to the fitted inputs by construction, nor does any self-citation chain bear the central claim. The work is an empirical modeling effort whose validation is independent of the derivation itself.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The model rests on fitted load-independent and load-dependent demand parameters for each phase (prefill, drafting, verification) and the applicability of Little's Law to infer effective batch size from request rate. No new entities are postulated.

free parameters (2)

load-independent demand components
Fitted constants for prefill, drafting, and verification phases that do not vary with batch size.
load-dependent demand components
Fitted slopes or multipliers that scale with effective batch size for each phase.

axioms (1)

standard math: Little's Law relates the average number of items in a system to the arrival rate and the average time in the system.
Invoked to infer effective batch size from request rate.
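
For reference, one way the axiom and the two fitted components combine is written out below. The linear form of the second line is an assumption made for this derivation. The last line matches the curve y = 1/(1 − x), with x = RPS × C2, quoted in the Figure 1 caption. Here λ is the request rate, L the mean request latency, and B the effective batch size.

```latex
\begin{align}
B &= \lambda L
  && \text{(Little's Law: requests in flight = arrival rate $\times$ latency)} \\
L &= C_1 + C_2\, B
  && \text{(assumed: fixed cost plus a cost linear in batch size)} \\
\Rightarrow\quad L &= \frac{C_1}{1 - \lambda C_2},
  \qquad 0 \le \lambda C_2 < 1 .
\end{align}
```

Substituting the first line into the second gives L(1 − λC2) = C1, which yields the third. The condition λC2 < 1 marks the saturation point beyond which the model predicts no steady state.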

pith-pipeline@v0.9.1-grok · 5775 in / 1417 out tokens · 24188 ms · 2026-06-30T21:21:09.044412+00:00 · methodology
