
arxiv: 2605.15051 · v1 · pith:YMQXB3EEnew · submitted 2026-05-14 · 💻 cs.LG · cs.PF

An Interpretable Latency Model for Speculative Decoding in LLM Serving

Pith reviewed 2026-06-30 21:21 UTC · model grok-4.3

classification 💻 cs.LG cs.PF
keywords speculative decoding, latency model, LLM serving, Little's Law, effective batch size, draft model, acceptance rate, mixture of experts

The pith

A latency model decomposes speculative decoding costs into load-independent and load-dependent components to predict performance under varying server load.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a latency model for speculative decoding in LLM serving systems. It infers effective batch size from request rate with Little's Law and splits per-request demand into parts that stay constant with load and parts that change with it, covering prefill, drafting, and verification. The model matches vLLM measurements across model sizes, prefill and decode lengths, request rates, draft lengths, and acceptance probabilities, and it accounts for why speedups from draft models shrink as load rises. This matters because it helps operators choose draft lengths and reason about acceptance rates in deployed systems where load fluctuates. The same approach is also applied to mixture-of-experts models.

Core claim

The central discovery is a simple interpretable latency model for speculative decoding in LLM serving that infers effective batch size from request rate using Little's Law and decomposes per-request demand into load-independent and load-dependent components for prefill, drafting, and verification. This model accurately describes observed latency from vLLM measurements, explains the reduction in speedups as server load increases, and shows how draft length, acceptance rate, and verifier-drafter size affect latency, with extension to mixture of experts models where sparse activation changes costs across load regimes.

What carries the argument

The decomposition of per-request demand into load-independent and load-dependent components, combined with effective batch size inferred via Little's Law from request rate.
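
One way to see how these two ingredients combine is the short derivation below. It assumes, as a reading of this summary rather than a statement of the paper's exact equations, that per-request latency is affine in effective batch size, with a load-independent term C1 and a load-dependent slope C2. The symbols \lambda (request rate) and B (effective batch size) are introduced here for illustration. The resulting closed form agrees with the y = 1/(1 − x) curve described in the Figure 1 caption.

```latex
\begin{align}
B &= \lambda L && \text{(Little's Law)} \\
L &= C_1 + C_2 B && \text{(assumed affine per-request demand)} \\
L &= C_1 + C_2 \lambda L
  \;\Longrightarrow\;
  L = \frac{C_1}{1 - \lambda C_2}, \qquad \lambda C_2 < 1 \\
\frac{L}{C_1} &= \frac{1}{1 - x}, \qquad x = \lambda C_2
\end{align}
```

Under this reading, latency diverges as \lambda C_2 approaches 1, which is the saturation point of the server.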

If this is right

  • The model shows that speedups diminish as server load increases because of the growing influence of load-dependent components.
  • Latency depends on draft length, acceptance rate, and the size difference between verifier and drafter models.
  • The framework can guide configuration of speculative decoding in production serving systems.
  • It extends to mixture of experts models by accounting for changes in effective service costs due to sparse expert activation.
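
The first consequence above can be illustrated with a minimal sketch. It assumes the closed form L = C1 / (1 − RPS × C2) suggested by the Figure 1 caption, and it uses made-up constants in which speculative decoding lowers the load-independent cost but raises the load-dependent one. The function names and every numeric value are hypothetical, not taken from the paper.

```python
import math


def predicted_latency(rps, c1, c2):
    """Closed-form latency C1 / (1 - rps * C2); infinite at or past saturation."""
    x = rps * c2
    return c1 / (1.0 - x) if x < 1.0 else math.inf


def speedup(rps, base, spec):
    """Ratio of baseline latency to speculative-decoding latency at one request rate."""
    return predicted_latency(rps, *base) / predicted_latency(rps, *spec)


# Hypothetical (C1, C2) pairs in seconds, chosen only for illustration.
BASE = (2.0, 0.05)  # plain decoding
SPEC = (1.2, 0.07)  # speculative decoding: lower fixed cost, higher load-dependent cost

# Speedup at 1, 5, and 10 requests per second.
speedups = [speedup(rps, BASE, SPEC) for rps in (1, 5, 10)]
```

With these illustrative constants the speedup falls from roughly 1.63 at 1 request per second to about 1.44 at 5 and to about 1.00 at 10, and it drops below 1 beyond that, which is the qualitative erosion the model is said to explain.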

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Operators could use the model to dynamically tune draft model parameters based on measured request rates.
  • The decomposition approach might help analyze other acceleration techniques like continuous batching under variable loads.
  • Validation across more diverse hardware setups could strengthen the model's applicability.
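
The first extension above, tuning by measured request rate, could look like the sketch below. It assumes an operator already has fitted (C1, C2) constants for each candidate draft length and that latency follows L = C1 / (1 − RPS × C2). The table `FITTED`, its values, and the helper names are hypothetical and do not come from the paper.

```python
import math

# Hypothetical fitted constants per draft length k: (C1 seconds, C2 seconds).
# k = 0 stands for plain decoding without a drafter.
FITTED = {0: (2.0, 0.05), 2: (1.5, 0.06), 4: (1.2, 0.07)}


def predicted_latency(rps, c1, c2):
    """Closed-form latency C1 / (1 - rps * C2); infinite at or past saturation."""
    x = rps * c2
    return c1 / (1.0 - x) if x < 1.0 else math.inf


def best_draft_length(rps, fitted=FITTED):
    """Return the draft length whose predicted latency is lowest at this request rate."""
    return min(fitted, key=lambda k: predicted_latency(rps, *fitted[k]))
```

With these placeholder constants, `best_draft_length(1)` picks the longest draft, `best_draft_length(10)` picks the shorter one, and `best_draft_length(12)` falls back to no speculation. A real controller would also need smoothing of the measured rate, which this sketch leaves out.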

Load-bearing premise

That the load-independent and load-dependent components of per-request demand stay stable enough to be useful across ranges of prefill and decode lengths, model sizes, and acceptance probabilities.

What would settle it

A set of latency measurements at high request rates where the predicted latency using the decomposed components deviates substantially from actual observed values for varying draft lengths.
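
A check of this kind can be sketched as follows, assuming the closed form L = C1 / (1 − RPS × C2) and constants fitted beforehand. The summary does not say what counts as a substantial deviation, so the `tolerance` threshold and the function names are assumptions made for illustration.

```python
def relative_errors(rps_values, measured, c1, c2):
    """Relative error |predicted - measured| / measured at each request rate."""
    errors = []
    for rps, observed in zip(rps_values, measured):
        x = rps * c2
        if x >= 1.0:
            raise ValueError(f"rate {rps} is at or past the predicted saturation point")
        predicted = c1 / (1.0 - x)
        errors.append(abs(predicted - observed) / observed)
    return errors


def deviates_substantially(rps_values, measured, c1, c2, tolerance=0.2):
    """True if any measurement misses the prediction by more than `tolerance` (assumed threshold)."""
    return max(relative_errors(rps_values, measured, c1, c2)) > tolerance
```

For example, with placeholder constants C1 = 2.0 and C2 = 0.05, measured latencies of 3.4 s and 3.7 s at 8 and 9 requests per second stay within 20% of the prediction, while 5.0 s and 6.5 s at the same rates would not. Repeating the check for each draft length at high request rates gives the comparison described above.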

Figures

Figures reproduced from arXiv: 2605.15051 by Alexandre Marques, Linghao Kong, Mark Kurtz, Megan Flynn, Michael Peng, Nir Shavit.

Figure 1: Universal latency scaling across models and configurations. Measured end-to-end request latency L, normalized by C1, plotted against normalized load RPS × C2, for multiple models and multiple combinations of prefill/decode configurations. Despite differing absolute scales, all measurements collapse onto the expected curve y = 1/(1 − x), validating the latency model (eq. (1), section 4.1). For example, ser…
Figure 2: Configuration-dependent latency and speedup behavior under SD. (a) Per-configuration …
Figure 3: The minimum possible fixed and load-variable cost ratios across prefill and decode …
Figure 4: Speculative decoding induces configuration-dependent latency costs which a single …
Figure 5: Scaling of SD latency costs with verifier and drafter sizes. A speculation-aware latency …
Figure 6: Scaling of SD latency costs with prefill …
Figure 8: MoE-aware SD latency model improves fit. Compared to the original speculation-aware …
Figure 9: Scaling of speculative decoding latency costs with verifier and drafter sizes.
Figure 10: The minimum possible fixed and load-variable cost ratios across prefill and decode …
Figure 11: Speculative decoding induces configuration-dependent latency costs. (a) Forcing all …
Figure 12: Speculation-aware latency model using p95 latency rather than mean latency.
Figure 13: Speculation-aware latency model using p99 latency rather than mean latency.
Figure 14: Speculation-aware latency model scaling behavior for Qwen3 family on an H100.
Original abstract

Speculative decoding (SD) accelerates large language model (LLM) inference by using a smaller draft model to propose multiple tokens that are verified by a larger target model in parallel. While prior work demonstrates substantial speedups in isolated or fixed-batch settings, the behavior of SD in production serving systems remains poorly understood: request load varies over time, and effective batch size emerges from the serving system rather than being directly controlled or observed. In this work, we develop a simple and interpretable latency model for SD in LLM serving. We infer effective batch size from request rate using Little's Law and decompose per-request demand into load-independent and load-dependent components for prefill, drafting, and verification. We validate our model using extensive measurements from vLLM across verifier and drafter model sizes, prefill and decode lengths, request rates, draft lengths, and acceptance probabilities. The model accurately describes observed latency, explains why speedups often diminish as server load increases, and characterizes how draft length, acceptance rate, and verifier-drafter size shape latency across serving conditions, with implications for configuring SD in deployed systems. We further show how the framework extends to mixture of experts models, where sparse expert activation changes the effective service costs across load regimes. Together, our results provide a structured framework for understanding SD in real LLM serving systems.
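
To make the roles of draft length and acceptance probability concrete, the sketch below computes the expected number of tokens produced per verification pass. It uses the standard result from the speculative decoding literature under the assumption that each draft token is accepted independently with the same probability; the paper's own treatment may differ, and the function name is hypothetical.

```python
def expected_tokens_per_step(alpha, k):
    """Expected tokens emitted per verification pass for draft length k.

    Assumes each draft token is accepted independently with probability alpha.
    The verifier always contributes one token, so the result lies in [1, k + 1].
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        return float(k + 1)  # every draft token accepted, plus the verifier's token
    # Geometric sum 1 + alpha + ... + alpha**k
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)
```

For instance, an acceptance probability of 0.8 with a draft length of 4 gives about 3.36 tokens per pass, well short of the ceiling of 5, which is one reason longer drafts yield diminishing returns.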

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper develops an interpretable latency model for speculative decoding (SD) in LLM serving. It applies Little's Law to infer effective batch size from request rate and decomposes per-request demand into load-independent and load-dependent components for prefill, drafting, and verification phases. Extensive vLLM measurements across verifier/drafter sizes, prefill/decode lengths, request rates, draft lengths, and acceptance probabilities are used to show that the model accurately reproduces observed latencies, explains the erosion of SD speedups at higher loads, and characterizes the influence of draft length, acceptance rate, and model sizes, with an extension to mixture-of-experts models.

Significance. If the decomposition and predictions hold, the work supplies a practical, interpretable framework for configuring SD under variable production load, an area left open by prior isolated or fixed-batch studies. The broad empirical sweep and use of Little's Law as an external anchor are strengths; the MoE extension further increases relevance for modern serving stacks.

major comments (1)
  1. [Model derivation and validation] Model section (inferred from abstract and validation description): the load-dependent demand components are fitted from the same measurement sweeps used for validation. This makes the claim that the model 'accurately describes observed latency' partly tautological; a clearer statement of how the two-parameter fit is performed, whether any data are held out, and whether the components remain stable when load, hardware, or acceptance probability move outside the fitted range would be required to support the predictive use case.
minor comments (2)
  1. [Abstract and validation] Abstract and § on validation: the ranges of prefill/decode lengths, model sizes, and acceptance probabilities are described at a high level; explicit tables or figures listing the exact parameter grids would improve reproducibility.
  2. [Model] Notation: the distinction between 'demand' and 'latency' components should be defined once with symbols before the decomposition is applied.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive review and recommendation for minor revision. We address the single major comment below and will incorporate the requested clarifications.

Point-by-point responses
  1. Referee: [Model derivation and validation] Model section (inferred from abstract and validation description): the load-dependent demand components are fitted from the same measurement sweeps used for validation. This makes the claim that the model 'accurately describes observed latency' partly tautological; a clearer statement of how the two-parameter fit is performed, whether any data are held out, and whether the components remain stable when load, hardware, or acceptance probability move outside the fitted range would be required to support the predictive use case.

    Authors: We agree that the manuscript would benefit from greater transparency on this point. The load-dependent demand parameters are estimated from the same vLLM measurement sweeps. In the revision we will expand the model section to state explicitly that the two parameters per phase are obtained by least-squares minimization of the difference between predicted and measured per-request latency versus effective batch size (inferred via Little's Law) for each fixed configuration. No hold-out data were used; the fitting is performed independently per configuration to isolate the load-dependent component while the load-independent component is taken from low-load measurements. We will also add a paragraph noting that the fitted parameters remain consistent in magnitude across the wide range of verifier/drafter sizes, prefill/decode lengths, draft lengths, and acceptance probabilities tested, which provides evidence of stability within the measured regimes. For hardware or acceptance rates outside this range we will acknowledge that re-estimation may be required and that the current experiments do not constitute out-of-distribution validation. revision: yes
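
The fitting step described in this response can be sketched as an ordinary least-squares line through (effective batch size, latency) points for one fixed configuration. This is a simplification: it fits both constants jointly on end-to-end latency, whereas the response says the load-independent part is taken from low-load measurements and that parameters are fitted per phase. The function name is hypothetical.

```python
def fit_latency_constants(rps, latency):
    """Least-squares fit of latency = C1 + C2 * B, with B inferred via Little's Law.

    rps and latency are equal-length sequences measured for one configuration.
    Returns (C1, C2).
    """
    if len(rps) != len(latency) or len(rps) < 2:
        raise ValueError("need at least two matched measurements")
    # Little's Law: effective batch size = request rate * mean latency.
    batch = [r * l for r, l in zip(rps, latency)]
    n = len(batch)
    mean_b = sum(batch) / n
    mean_l = sum(latency) / n
    var_b = sum((b - mean_b) ** 2 for b in batch)
    if var_b == 0.0:
        raise ValueError("effective batch size does not vary; slope is undefined")
    cov = sum((b - mean_b) * (l - mean_l) for b, l in zip(batch, latency))
    c2 = cov / var_b              # load-dependent slope
    c1 = mean_l - c2 * mean_b     # load-independent intercept
    return c1, c2
```

Because no data are held out, a fit like this describes the measured sweep rather than predicting unseen conditions, which is the point the referee raises.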

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper applies Little's Law (an external, standard result) to infer effective batch size from request rate, then introduces an empirical decomposition of per-request demand into load-independent and load-dependent terms for different phases. This decomposition is fitted to measurements and validated across wide experimental sweeps in vLLM. No quoted equations or steps reduce the model's predictions to the fitted inputs by construction, nor does any self-citation chain bear the central claim. The work is an empirical modeling effort whose validation is independent of the derivation itself.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The model rests on fitted load-independent and load-dependent demand parameters for each phase (prefill, drafting, verification) and the applicability of Little's Law to infer effective batch size from request rate. No new entities are postulated.

free parameters (2)
  • load-independent demand components
    Fitted constants for prefill, drafting, and verification phases that do not vary with batch size.
  • load-dependent demand components
    Fitted slopes or multipliers that scale with effective batch size for each phase.
axioms (1)
  • Little's Law (standard math): the average number of items in a system equals the arrival rate times the average time each item spends in the system.
    Invoked to infer effective batch size from request rate.
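
As a minimal numeric illustration of this axiom, the sketch below infers effective batch size from request rate and mean latency. The function name and units are assumptions for illustration.

```python
def effective_batch_size(rps, mean_latency_s):
    """Little's Law: mean requests in flight = arrival rate * mean time in system."""
    if rps < 0 or mean_latency_s < 0:
        raise ValueError("rate and latency must be non-negative")
    return rps * mean_latency_s
```

For example, 10 requests per second with a mean latency of 4 seconds implies about 40 requests in flight on average, even though the serving system never exposes that number directly.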

pith-pipeline@v0.9.1-grok · 5775 in / 1417 out tokens · 24188 ms · 2026-06-30T21:21:09.044412+00:00 · methodology

