super hub Canonical reference

Accelerating Large Language Model Decoding with Speculative Sampling

Charlie Chen, Geoffrey Irving, Jean-Baptiste Lespiau, John Jumper, Laurent Sifre, Sebastian Borgeaud · 2023 · cs.CL · arXiv 2302.01318

Canonical reference. 75% of citing Pith papers cite this work as background.

156 Pith papers citing it

Background 75% of classified citations

open full Pith review browse 156 citing papers more from Charlie Chen arXiv PDF

abstract

We present speculative sampling, an algorithm for accelerating transformer decoding by enabling the generation of multiple tokens from each transformer call. Our algorithm relies on the observation that the latency of parallel scoring of short continuations, generated by a faster but less powerful draft model, is comparable to that of sampling a single token from the larger target model. This is combined with a novel modified rejection sampling scheme which preserves the distribution of the target model within hardware numerics. We benchmark speculative sampling with Chinchilla, a 70 billion parameter language model, achieving a 2-2.5x decoding speedup in a distributed setup, without compromising the sample quality or making modifications to the model itself.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 18 method 5 other 1

citation-polarity summary

background 18 use method 5 unclear 1

claims ledger

abstract We present speculative sampling, an algorithm for accelerating transformer decoding by enabling the generation of multiple tokens from each transformer call. Our algorithm relies on the observation that the latency of parallel scoring of short continuations, generated by a faster but less powerful draft model, is comparable to that of sampling a single token from the larger target model. This is combined with a novel modified rejection sampling scheme which preserves the distribution of the target model within hardware numerics. We benchmark speculative sampling with Chinchilla, a 70 billion p

authors

Charlie Chen Geoffrey Irving Jean-Baptiste Lespiau John Jumper Laurent Sifre Sebastian Borgeaud

co-cited works

representative citing papers

Mistletoe: Stealthy Acceleration-Collapse Attacks on Speculative Decoding

cs.CL · 2026-05-13 · unverdicted · novelty 8.0 · 2 refs

Mistletoe introduces a stealthy attack on speculative decoding that collapses acceleration by reducing average accepted length while preserving output semantics.

The Coupling Tax: How Shared Token Budgets Undermine Visible Chain-of-Thought Under Fixed Output Limits

cs.LG · 2026-05-08 · unverdicted · novelty 8.0

Shared token budgets between visible chain-of-thought and answers create a coupling tax that makes non-thinking competitive on math benchmarks, with a truncation decomposition predicting the crossover and split budgets improving results.

Lynx: Progressive Speculative Quantization for accelerating KV Transfer in Long-Context Inference

cs.DC · 2026-07-02 · unverdicted · novelty 7.0

Lynx partitions KV cache bits into anchor and residual streams for progressive transfer, enabling speculative decoding on partial data followed by verification to match BF16 accuracy at 4-bit-like TTFT.

Certified Speculative Execution for Untrusted AI Agents

cs.CR · 2026-06-30 · unverdicted · novelty 7.0

CGPA enables certified speculative execution of untrusted AI proposals in constrained sequential decisions via verifier rejection, conformal boundary gating, and solver deferral, yielding zero violations and regret within noise of the oracle.

When Is a Draft Accepted? A Theory of Acceptance in Speculative Decoding

cs.LG · 2026-06-29 · unverdicted · novelty 7.0

Develops theory for acceptance in speculative decoding under greedy/relaxed/tree criteria, with exact KL certificates and margin bounds, evaluated on Qwen3 models.

Efficient and Trainable Language Model Test-Time Scaling via Local Branch Routing

cs.CL · 2026-06-24 · unverdicted · novelty 7.0

LBR performs token-level test-time scaling via local branch routing on hidden states, enabling end-to-end RL training and improving Pass@1 and Pass@32 on math benchmarks over CoT and RLVR baselines.

MARS: Margin-Adversarial Risk-controlled Stopping for Parallel LLM Test-time Scaling

cs.AI · 2026-06-11 · unverdicted · novelty 7.0

MARS is a margin-adversarial stopping rule for parallel LLM test-time scaling that saves 25-47% tokens while matching full-budget majority-vote accuracy by learning trace switch probabilities and applying adversarial bounds.

Hacking Generative Perplexity: Why Unconditional Text Evaluation Needs Distributional Metrics

cs.CL · 2026-06-07 · accept · novelty 7.0

Naive samplers beat published diffusion and flow models on gen-PPL with incoherent output, proving the metric unsound and motivating distributional evaluation suites.

WhiFlash: Accelerating Speculative Decoding with Token-Level Cross-Paradigm Routing

cs.LG · 2026-06-05 · unverdicted · novelty 7.0

WhiFlash introduces token-level cross-paradigm routing between autoregressive and diffusion drafting models, with cache optimizations, to raise acceptance lengths and deliver up to 69.6% throughput gains over EAGLE-3.

Parallel Jacobi Decoding for Fast Autoregressive Image Generation

cs.CV · 2026-06-04 · conditional · novelty 7.0

Parallel Jacobi Decoding accelerates autoregressive image models 4.8x-6.4x by using 2D spatial draft expansion and adjusted attention masks while keeping generation quality competitive.

D^2SD: Accelerating Speculative Decoding with Dual Diffusion Draft Models

cs.DC · 2026-06-03 · unverdicted · novelty 7.0

D^2SD uses two diffusion drafters in a prefix tree structure with confidence scores to select and recover alternative draft sequences, achieving higher acceptance rates in speculative decoding.

Speculative Sampling For Faster Molecular Dynamics

cs.LG · 2026-06-01 · unverdicted · novelty 7.0

LSD extends speculative sampling to second-order Langevin dynamics, achieving 3-9x speedup in MD while exactly sampling from the target distribution without relative error.

Cost-Aware Diffusion Draft Trees for Speculative Decoding

cs.CL · 2026-06-01 · unverdicted · novelty 7.0

CaDDTree jointly selects tree structure and budget to maximize expected tokens per unit time in speculative decoding, proving unimodality under convex verification cost and matching oracle DDTree performance on Qwen models.

Off-the-Shelf LLMs as Process Scorers: Training-Free Alternative to PRMs for Mathematical Reasoning

cs.CL · 2026-06-01 · unverdicted · novelty 7.0

Chunk-Level Guided Generation uses off-the-shelf large LLMs to score fixed-length chunks from small models via likelihoods, matching trained PRM performance on math benchmarks without reward-model training.

OmniOPD: Logit-Free On-Policy Distillation via Speculative Verification

cs.LG · 2026-05-31 · unverdicted · novelty 7.0

OmniOPD replaces token-level logit matching in on-policy distillation with Monte Carlo chunk-level semantic verification and a peak-entropy scheduler.

TAPS: Target-Aware Prefix Tree Selection for Diffusion-Drafted Speculative Decoding

cs.AI · 2026-05-30 · unverdicted · novelty 7.0

TAPS converts diffusion marginal probabilities into path-conditioned acceptance estimates to select prefix-closed subtrees under a fixed verification budget, achieving up to 7.9x end-to-end speedup over autoregressive decoding.

EST-PRM: Stress-Testing Process Reward Models Before They Become Load-Bearing

cs.LG · 2026-05-30 · unverdicted · novelty 7.0

EST-PRM stress-tests five PRM models on 4,687 reasoning chains from MATH-500, GSM8K, and PRMBench using three label-preserving transformations and reports model-specific vulnerability patterns.

Bastion: Budget-Aware Speculative Decoding with Tree-structured Block Diffusion Drafting

cs.LG · 2026-05-28 · unverdicted · novelty 7.0

BASTION is a budget-aware speculative decoding framework with adaptive tree-structured block diffusion drafting that reports up to 6.61x speedup and 39% improvement over block-diffusion baselines.

Draft Less, Retrieve More: Hybrid Tree Construction for Speculative Decoding

cs.LG · 2026-05-19 · unverdicted · novelty 7.0

Graft combines pruning and retrieval in a sequential mechanism to build hybrid draft trees for speculative decoding, delivering up to 5.41× speedup and 21.8% better average speedup than EAGLE-3 on large models.

Skim: Speculative Execution for Fast and Efficient Web Agents

cs.AI · 2026-05-15 · unverdicted · novelty 7.0

Skim profiles website patterns offline to enable fast-path speculative execution for web agents, cutting median cost by 1.9x and latency by 33.4% with no accuracy loss on benchmarks.

PSD: Pushing the Pareto Frontier of Diffusion LLMs via Parallel Speculative Decoding

cs.CL · 2026-05-15 · unverdicted · novelty 7.0

PSD is a training-free framework that jointly optimizes spatial unmasking and temporal speculative decoding in diffusion LLMs to reach up to 5.5x tokens per forward pass while preserving accuracy comparable to greedy decoding.

Factorization-Error-Free Discrete Diffusion Language Model via Speculative Decoding

cs.CL · 2026-05-14 · unverdicted · novelty 7.0

FeF-DLLM achieves factorization-error-free generation in discrete diffusion language models via prefix-conditioned posterior factorization and speculative decoding, delivering 5.04 pp higher accuracy and 3.86x faster inference on GSM8K, MATH, HumanEval, and MBPP.

SlimSpec: Low-Rank Draft LM-Head for Accelerated Speculative Decoding

cs.LG · 2026-05-11 · unverdicted · novelty 7.0

SlimSpec replaces the standard LM-head in draft models with a low-rank version to deliver 4-5x faster speculative decoding while preserving full vocabulary and competitive acceptance rates.

Future Validity is the Missing Statistic: From Impossibility to $\Phi$-Estimation for Grammar-Faithful Speculative Decoding

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

Speculative decoding under local grammar masking samples from the projected distribution μ^proj instead of the grammar-conditional μ*, and the future-validity function Φ corrects it via a Doob transform to achieve exact sampling from μ*.

citing papers explorer

Showing 19 of 19 citing papers after filters.

Parallel Jacobi Decoding for Fast Autoregressive Image Generation cs.CV · 2026-06-04 · conditional · none · ref 3 · internal anchor
Parallel Jacobi Decoding accelerates autoregressive image models 4.8x-6.4x by using 2D spatial draft expansion and adjusted attention masks while keeping generation quality competitive.
Selective Rollout: Mid-Trajectory Termination for Multi-Sample Agent RL cs.LG · 2026-05-07 · conditional · none · ref 30 · internal anchor
A one-parameter early-termination gate based on mean pairwise prefix edit distance reduces wall-clock time by 10.7% and raises held-out success by 2.5 pp in GRPO on ALFWorld by cutting zero-advantage batch dilution.
Speculative Decoding for Autoregressive Video Generation cs.CV · 2026-04-19 · conditional · none · ref 2 · internal anchor
A training-free speculative decoding method for block-based autoregressive video diffusion uses a quality router on worst-frame ImageReward scores to accept drafter proposals, achieving up to 2.09x speedup at 95.7% quality retention.
EvoESAP: Non-Uniform Expert Pruning for Sparse MoE cs.LG · 2026-03-06 · conditional · none · ref 3 · internal anchor
EvoESAP uses evolutionary search guided by a speculative-decoding-inspired ESAP metric to discover non-uniform layer-wise sparsity allocations for MoE expert pruning, improving generation accuracy up to 19.6% at 50% sparsity.
When RL Meets Adaptive Speculative Training: A Unified Training-Serving System cs.LG · 2026-02-06 · conditional · none · ref 7 · internal anchor
Aurora unifies speculative decoder training and serving via asynchronous RL on inference traces, delivering 1.5x day-0 speedup on frontier models and 1.25x adaptation gains on distribution shifts.
Vec-LUT: Vector Table Lookup for Parallel Ultra-Low-Bit LLM Inference on Edge Devices cs.DC · 2025-12-06 · conditional · none · ref 5 · internal anchor
Vec-LUT delivers up to 4.2x speedup over prior LUT methods for parallel ultra-low-bit LLM inference on edge devices by unifying lookups across tokens and adding cache-aware tensor layouts.
VVS: Accelerating Speculative Decoding for Visual Autoregressive Generation via Partial Verification Skipping cs.CV · 2025-11-17 · conditional · none · ref 4 · internal anchor
VVS accelerates visual AR image generation by partially skipping verifications in speculative decoding, achieving 2.8x fewer target forward passes while preserving competitive quality.
Efficient Autoregressive Inference for Transformer Probabilistic Models stat.ML · 2025-10-10 · conditional · none · ref 1 · internal anchor
A causal autoregressive buffer enables efficient batched autoregressive sampling and joint density evaluation in set-based transformer models by caching context and attending to prior predictions.
Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation cs.CV · 2024-06-10 · conditional · none · ref 7 · internal anchor
Scaled vanilla autoregressive models based on Llama achieve 2.18 FID on ImageNet 256x256 image generation, beating popular diffusion models without visual inductive biases.
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads cs.LG · 2024-01-19 · conditional · none · ref 5 · internal anchor
Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.
Fast-dLLM++: Fr\'{e}chet Profile Decoding for Faster Diffusion LLM Inference cs.CL · 2026-06-01 · conditional · none · ref 69 · internal anchor
Fast-dLLM++ generalizes Fast-dLLM decoding to heterogeneous confidence profiles via Fréchet profile selection, delivering up to 37% throughput gains on GSM8K, MATH, HumanEval, and MBPP with LLaDA-8B.
Rethinking Dense Sequential Chains: Reasoning Language Models Can Extract Answers from Sparse, Order-Shuffling Chain-of-Thoughts cs.CL · 2026-05-08 · conditional · none · ref 9 · internal anchor
Reasoning language models extract answers from sparse, order-shuffled chain-of-thought traces with little accuracy loss.
SpecKV: Adaptive Speculative Decoding with Compression-Aware Gamma Selection cs.LG · 2026-05-04 · conditional · none · ref 3 · internal anchor
SpecKV uses a small MLP trained on draft model confidence and entropy to dynamically choose the optimal speculation length gamma, achieving 56% better performance than fixed gamma=4 across various tasks and compression levels.
When Less is Enough: Efficient Inference via Collaborative Reasoning cs.LG · 2026-05-01 · conditional · none · ref 3 · internal anchor
A large model generates a compact reasoning signal that a small model uses to solve tasks, reducing the large model's output tokens by up to 60% on benchmarks like AIME and GPQA.
When LLMs get significantly worse: A statistical approach to detect model degradations stat.ML · 2026-02-09 · conditional · none · ref 2 · internal anchor
A McNemar-based statistical test detects real degradations in optimized LLMs with controlled false positives, even for accuracy changes as small as 0.3%.
RetroInfer: A Vector Storage Engine for Scalable Long-Context LLM Inference cs.LG · 2025-05-05 · conditional · none · ref 11 · internal anchor
RetroInfer introduces the wave index and wave buffer to realize sparse KV-cache attention for long-context LLM inference with up to 4.4X throughput gains while matching full-attention accuracy.
Scaling Diffusion Language Models via Adaptation from Autoregressive Models cs.CL · 2024-10-23 · conditional · none · ref 112 · internal anchor
Adapting autoregressive models via continual pre-training yields diffusion language models from 127M to 7B parameters that outperform prior diffusion models and compete with their autoregressive counterparts on language, reasoning, and commonsense benchmarks.
SnapKV: LLM Knows What You are Looking for Before Generation cs.CL · 2024-04-22 · conditional · none · ref 30 · internal anchor
SnapKV selects clustered important KV positions per attention head from an observation window at the prompt end, yielding 3.6x faster generation and 8.2x better memory efficiency on 16K-token inputs with comparable performance across 16 datasets.
Tandem: Riding Together with Large and Small Language Models for Efficient Reasoning cs.AI · 2026-04-26 · conditional · none · ref 1 · internal anchor
Tandem lets a large model supply compact strategic guidance to a small model for reasoning tasks, achieving similar or better performance at about 40 percent lower cost through adaptive early stopping.

Accelerating Large Language Model Decoding with Speculative Sampling

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer