Accelerating Large Language Model Decoding with Speculative Sampling
58 Pith papers cite this work. Polarity classification is still indexing.
abstract
We present speculative sampling, an algorithm for accelerating transformer decoding by enabling the generation of multiple tokens from each transformer call. Our algorithm relies on the observation that the latency of parallel scoring of short continuations, generated by a faster but less powerful draft model, is comparable to that of sampling a single token from the larger target model. This is combined with a novel modified rejection sampling scheme which preserves the distribution of the target model within hardware numerics. We benchmark speculative sampling with Chinchilla, a 70 billion parameter language model, achieving a 2-2.5x decoding speedup in a distributed setup, without compromising the sample quality or making modifications to the model itself.
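The abstract compresses the algorithm into two moves: draft a short continuation with a small model, then have the large target model score it in one parallel pass and accept or reject each drafted token with a modified rejection rule. Below is a minimal sketch of that accept/reject step, assuming hypothetical `draft_probs(prefix)` and `target_probs(prefix)` helpers that return NumPy next-token distributions; the per-prefix calls to `target_probs` stand in for the single parallel scoring call the paper relies on, and the code is an illustration, not the paper's implementation.

```python
import numpy as np

def speculative_step(prefix, draft_probs, target_probs, rng, k=4):
    """Sketch of one speculative sampling step: draft k tokens with the small
    model, verify with the target, and return the accepted tokens."""
    # 1. Draft: sample k candidate tokens autoregressively from the draft model.
    drafted, p_dists, ctx = [], [], list(prefix)
    for _ in range(k):
        p = draft_probs(ctx)                       # draft distribution p(. | ctx)
        x = int(rng.choice(len(p), p=p))
        drafted.append(x); p_dists.append(p); ctx.append(x)

    # 2. Score: the target model evaluates every drafted position. In the paper
    #    this is one parallel transformer call; here it is simulated per prefix.
    q_dists = [target_probs(list(prefix) + drafted[:i]) for i in range(k + 1)]

    # 3. Modified rejection sampling: accept token x with prob min(1, q(x)/p(x));
    #    at the first rejection, resample from the residual max(q - p, 0).
    accepted = []
    for i, x in enumerate(drafted):
        p, q = p_dists[i], q_dists[i]
        if rng.random() < min(1.0, q[x] / max(p[x], 1e-12)):
            accepted.append(x)
        else:
            residual = np.maximum(q - p, 0.0)
            accepted.append(int(rng.choice(len(residual), p=residual / residual.sum())))
            return accepted                        # stop at the first rejection

    # 4. All k drafted tokens accepted: take one bonus token from the target.
    accepted.append(int(rng.choice(len(q_dists[k]), p=q_dists[k])))
    return accepted

# Example: rng = np.random.default_rng(0); speculative_step([1, 2, 3], draft_probs, target_probs, rng)
```

The residual-resampling branch is what keeps the scheme lossless: summed over the accept and reject cases, each emitted token follows the target distribution, so the speedup comes entirely from how often the draft agrees with the target.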
co-cited works
roles: method (1)
polarities: use (1)
representative citing papers
SlimSpec replaces the standard LM-head in draft models with a low-rank version to deliver 4-5x faster speculative decoding while preserving full vocabulary and competitive acceptance rates.
Test-Time Speculation adapts draft models online via target-model verifications to sustain high acceptance lengths during long LLM generations.
Speculative decoding under local grammar masking samples from the projected distribution μ^proj instead of the grammar-conditional μ*, and the future-validity function Φ corrects it via a Doob transform to achieve exact sampling from μ*.
A one-parameter early-termination gate based on mean pairwise prefix edit distance reduces wall-clock time by 10.7% and raises held-out success by 2.5 pp in GRPO on ALFWorld by cutting zero-advantage batch dilution.
UniVer frames tree-based speculative decoding as conditional optimal transport, proving it is lossless with optimal acceptance rates and delivering 4.2-8.5% longer accepted sequences than standard rejection sampling.
Component-aware self-speculative decoding achieves high acceptance rates in parallel hybrid models like Falcon-H1 but fails in sequential ones like Qwen3.5, with the gap tied to how components are integrated.
Speculative decoding accelerates LLM inference on SE tasks without accuracy loss, with model-based methods suiting code generation and model-free methods suiting repository-level repair and editing.
FASER delivers up to 53% higher throughput and 1.92x lower latency in dynamic LLM serving by adjusting speculative lengths per request, early pruning of rejects, and overlapping draft/verification phases via frontiers.
Copy-as-Decode recasts LLM editing as grammar-constrained decoding over copy and generate primitives, delivering closed-form upper-bound speedups of 13x pooled on editing benchmarks via parallel prefill without any training.
WISV uses a channel-aware semantic acceptance policy on hidden representations to boost accepted sequence length by up to 60.8% and cut interaction rounds by 37.3% in distributed speculative decoding, with under 1% accuracy loss.
A training-free speculative decoding method for block-based autoregressive video diffusion uses a quality router on worst-frame ImageReward scores to accept drafter proposals, achieving up to 2.09x speedup at 95.7% quality retention.
SpecGuard adds step-level verification to speculative decoding via attention grounding and log-probability scores, yielding 3.6% higher accuracy and 11% lower latency on reasoning benchmarks.
MARS fine-tunes autoregressive models to predict multiple tokens per step via continued training on instruction data, achieving 1.5-1.7x throughput while matching baseline accuracy and supporting real-time speed adjustment.
Cactus uses constrained optimization to guarantee bounded divergence from the verifier LLM distribution during speculative sampling, raising acceptance rates without the distortion seen in typical acceptance sampling.
o1-like models overthink easy tasks; self-training reduces compute use without accuracy loss on GSM8K, MATH500, GPQA, and AIME.
Scaled vanilla autoregressive models based on Llama achieve 2.18 FID on ImageNet 256x256 image generation, beating popular diffusion models without visual inductive biases.
Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.
N-vium achieves 57.9% wall-clock speedup over matched standard transformers at no perplexity cost by mixing exact predictions from multiple model depths.
CATS achieves up to 5.08x wall-clock speedup for LLM generation on edge devices via memory-matched cascaded tree speculation, outperforming prior methods by 1.45x with no quality loss.
DECO matches dense model performance at 20% expert activation via ReLU-based routing with learnable scaling and the NormSiLU activation, plus a 3x real-hardware speedup.
SPEX accelerates Tree-of-Thought LLM reasoning 1.2-3x via speculative path selection, dynamic budget allocation across queries, and adaptive early termination, with up to 4.1x when combined with token speculative decoding.
Drafter models in speculative decoding suffer progressive attention drift caused by monotonically growing hidden-state magnitudes along the residual path; post-norm plus per-state RMSNorm reduces this drift and improves acceptance length by up to 2x on perturbed templates and 1.18x on long-context data.
ME-DLM augments parallel masked diffusion models with edit-distance-supervised refinements to raise quality on coding and math benchmarks while using far fewer diffusion steps.
citing papers explorer
Latency and Cost of Multi-Agent Intelligent Tutoring at Scale
Priority PayGo keeps multi-agent tutoring responses under 4 seconds even at 50 concurrent users, while costs stay below textbook prices per student.