Accelerating Large Language Model Decoding with Speculative Sampling

Charlie Chen, Geoffrey Irving, Jean-Baptiste Lespiau, John Jumper, Laurent Sifre, Sebastian Borgeaud · 2023 · cs.CL · arXiv 2302.01318

Canonical reference. 74% of citing Pith papers cite this work as background.

123 Pith papers citing it

abstract

We present speculative sampling, an algorithm for accelerating transformer decoding by enabling the generation of multiple tokens from each transformer call. Our algorithm relies on the observation that the latency of parallel scoring of short continuations, generated by a faster but less powerful draft model, is comparable to that of sampling a single token from the larger target model. This is combined with a novel modified rejection sampling scheme which preserves the distribution of the target model within hardware numerics. We benchmark speculative sampling with Chinchilla, a 70 billion parameter language model, achieving a 2-2.5x decoding speedup in a distributed setup, without compromising the sample quality or making modifications to the model itself.
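The abstract names a modified rejection sampling scheme but does not spell it out. A minimal single-token sketch of the rule as it is usually described for speculative sampling follows: accept the draft token with probability min(1, target/draft), otherwise resample from the normalised positive part of (target - draft). The function names and the toy three-token distributions are illustrative assumptions, not the authors' implementation, and a real decoder would apply this step to each drafted token in turn using probabilities from the two models.

```python
import random


def residual_distribution(target, draft):
    """Normalised max(0, target - draft), used to resample after a rejection."""
    diffs = [max(0.0, t - d) for t, d in zip(target, draft)]
    total = sum(diffs)
    if total == 0.0:
        # Identical distributions: a rejection cannot happen, so fall back to target.
        return list(target)
    return [x / total for x in diffs]


def speculative_step(target, draft, rng):
    """Propose one token from `draft`, then accept or resample.

    `target` and `draft` are probability lists over the same toy vocabulary
    (hypothetical inputs for illustration). Returns (token_index, accepted).
    """
    vocab = list(range(len(draft)))
    proposal = rng.choices(vocab, weights=draft)[0]
    # draft[proposal] > 0 because the proposal was sampled from `draft`.
    accept_prob = min(1.0, target[proposal] / draft[proposal])
    if rng.random() < accept_prob:
        return proposal, True
    # Rejected: draw from the residual so the overall law still matches `target`.
    residual = residual_distribution(target, draft)
    return rng.choices(vocab, weights=residual)[0], False
```

Drawing many tokens with `speculative_step` and tallying them should reproduce `target` up to sampling noise, which is the sense in which the scheme preserves the target model's distribution.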

citation-role summary

background 17 · method 5 · other 1

citation-polarity summary

background 17 · use method 5 · unclear 1

representative citing papers

Mistletoe: Stealthy Acceleration-Collapse Attacks on Speculative Decoding

cs.CL · 2026-05-13 · unverdicted · novelty 8.0 · 2 refs

Mistletoe introduces a stealthy attack on speculative decoding that collapses acceleration by reducing average accepted length while preserving output semantics.

The Coupling Tax: How Shared Token Budgets Undermine Visible Chain-of-Thought Under Fixed Output Limits

cs.LG · 2026-05-08 · unverdicted · novelty 8.0

Shared token budgets between visible chain-of-thought and answers create a coupling tax that makes non-thinking competitive on math benchmarks, with a truncation decomposition predicting the crossover and split budgets improving results.

Certified Speculative Execution for Untrusted AI Agents

cs.CR · 2026-06-30 · unverdicted · novelty 7.0

CGPA enables certified speculative execution of untrusted AI proposals in constrained sequential decisions via verifier rejection, conformal boundary gating, and solver deferral, yielding zero violations and regret within noise of the oracle.

When Is a Draft Accepted? A Theory of Acceptance in Speculative Decoding

cs.LG · 2026-06-29 · unverdicted · novelty 7.0

Develops theory for acceptance in speculative decoding under greedy/relaxed/tree criteria, with exact KL certificates and margin bounds, evaluated on Qwen3 models.

Efficient and Trainable Language Model Test-Time Scaling via Local Branch Routing

cs.CL · 2026-06-24 · unverdicted · novelty 7.0

LBR performs token-level test-time scaling via local branch routing on hidden states, enabling end-to-end RL training and improving Pass@1 and Pass@32 on math benchmarks over CoT and RLVR baselines.

Speculative Sampling For Faster Molecular Dynamics

cs.LG · 2026-06-01 · unverdicted · novelty 7.0

LSD extends speculative sampling to second-order Langevin dynamics, achieving 3-9x speedup in MD while exactly sampling from the target distribution without relative error.

Cost-Aware Diffusion Draft Trees for Speculative Decoding

cs.CL · 2026-06-01 · unverdicted · novelty 7.0

CaDDTree jointly selects tree structure and budget to maximize expected tokens per unit time in speculative decoding, proving unimodality under convex verification cost and matching oracle DDTree performance on Qwen models.

Off-the-Shelf LLMs as Process Scorers: Training-Free Alternative to PRMs for Mathematical Reasoning

cs.CL · 2026-06-01 · unverdicted · novelty 7.0

Chunk-Level Guided Generation uses off-the-shelf large LLMs to score fixed-length chunks from small models via likelihoods, matching trained PRM performance on math benchmarks without reward-model training.

OmniOPD: Logit-Free On-Policy Distillation via Speculative Verification

cs.LG · 2026-05-31 · unverdicted · novelty 7.0

OmniOPD replaces token-level logit matching in on-policy distillation with Monte Carlo chunk-level semantic verification and a peak-entropy scheduler.

TAPS: Target-Aware Prefix Tree Selection for Diffusion-Drafted Speculative Decoding

cs.AI · 2026-05-30 · unverdicted · novelty 7.0

TAPS converts diffusion marginal probabilities into path-conditioned acceptance estimates to select prefix-closed subtrees under a fixed verification budget, achieving up to 7.9x end-to-end speedup over autoregressive decoding.

EST-PRM: Stress-Testing Process Reward Models Before They Become Load-Bearing

cs.LG · 2026-05-30 · unverdicted · novelty 7.0

EST-PRM stress-tests five PRM models on 4,687 reasoning chains from MATH-500, GSM8K, and PRMBench using three label-preserving transformations and reports model-specific vulnerability patterns.

Bastion: Budget-Aware Speculative Decoding with Tree-structured Block Diffusion Drafting

cs.LG · 2026-05-28 · unverdicted · novelty 7.0

BASTION is a budget-aware speculative decoding framework with adaptive tree-structured block diffusion drafting that reports up to 6.61x speedup and 39% improvement over block-diffusion baselines.

Draft Less, Retrieve More: Hybrid Tree Construction for Speculative Decoding

cs.LG · 2026-05-19 · unverdicted · novelty 7.0

Graft combines pruning and retrieval in a sequential mechanism to build hybrid draft trees for speculative decoding, delivering up to 5.41× speedup and 21.8% better average speedup than EAGLE-3 on large models.

Skim: Speculative Execution for Fast and Efficient Web Agents

cs.AI · 2026-05-15 · unverdicted · novelty 7.0

Skim profiles website patterns offline to enable fast-path speculative execution for web agents, cutting median cost by 1.9x and latency by 33.4% with no accuracy loss on benchmarks.

PSD: Pushing the Pareto Frontier of Diffusion LLMs via Parallel Speculative Decoding

cs.CL · 2026-05-15 · unverdicted · novelty 7.0

PSD is a training-free framework that jointly optimizes spatial unmasking and temporal speculative decoding in diffusion LLMs to reach up to 5.5x tokens per forward pass while preserving accuracy comparable to greedy decoding.

Factorization-Error-Free Discrete Diffusion Language Model via Speculative Decoding

cs.CL · 2026-05-14 · unverdicted · novelty 7.0

FeF-DLLM achieves factorization-error-free generation in discrete diffusion language models via prefix-conditioned posterior factorization and speculative decoding, delivering 5.04 pp higher accuracy and 3.86x faster inference on GSM8K, MATH, HumanEval, and MBPP.

SlimSpec: Low-Rank Draft LM-Head for Accelerated Speculative Decoding

cs.LG · 2026-05-11 · unverdicted · novelty 7.0

SlimSpec replaces the standard LM-head in draft models with a low-rank version to deliver 4-5x faster speculative decoding while preserving full vocabulary and competitive acceptance rates.

Future Validity is the Missing Statistic: From Impossibility to $\Phi$-Estimation for Grammar-Faithful Speculative Decoding

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

Speculative decoding under local grammar masking samples from the projected distribution μ^proj instead of the grammar-conditional μ*, and the future-validity function Φ corrects it via a Doob transform to achieve exact sampling from μ*.

SpecBlock: Block-Iterative Speculative Decoding with Dynamic Tree Drafting

cs.CL · 2026-05-08 · unverdicted · novelty 7.0 · 2 refs

SpecBlock achieves 8-13% higher mean speedup than EAGLE-3 at 44-52% drafting cost via block-iterative drafting with hidden-state inheritance, dynamic rank-head branching, valid-prefix masking, and optional cost-aware bandit adaptation.

Selective Rollout: Mid-Trajectory Termination for Multi-Sample Agent RL

cs.LG · 2026-05-07 · conditional · novelty 7.0

A one-parameter early-termination gate based on mean pairwise prefix edit distance reduces wall-clock time by 10.7% and raises held-out success by 2.5 pp in GRPO on ALFWorld by cutting zero-advantage batch dilution.

UniVer: A Unified Perspective for Multi-step and Multi-draft Speculative Decoding

cs.CL · 2026-05-06 · unverdicted · novelty 7.0

UniVer frames tree-based speculative decoding as conditional optimal transport, proving it is lossless with optimal acceptance rates and delivering 4.2-8.5% longer accepted sequences than standard rejection sampling.

Component-Aware Self-Speculative Decoding in Hybrid Language Models

cs.CL · 2026-05-01 · unverdicted · novelty 7.0

Component-aware self-speculative decoding achieves high acceptance rates in parallel hybrid models like Falcon-H1 but fails in sequential ones like Qwen3.5, with the gap tied to how components are integrated.

An Empirical Study of Speculative Decoding on Software Engineering Tasks

cs.SE · 2026-04-29 · unverdicted · novelty 7.0

Speculative decoding accelerates LLM inference on SE tasks without accuracy loss, with model-based methods suiting code generation and model-free methods suiting repository-level repair and editing.

FASER: Fine-Grained Phase Management for Speculative Decoding in Dynamic LLM Serving

cs.DC · 2026-04-22 · unverdicted · novelty 7.0

FASER delivers up to 53% higher throughput and 1.92x lower latency in dynamic LLM serving by adjusting speculative lengths per request, early pruning of rejects, and overlapping draft/verification phases via frontiers.

citing papers explorer

Showing 50 of 123 citing papers.

Mistletoe: Stealthy Acceleration-Collapse Attacks on Speculative Decoding cs.CL · 2026-05-13 · unverdicted · none · ref 3 · 2 links · internal anchor
Mistletoe introduces a stealthy attack on speculative decoding that collapses acceleration by reducing average accepted length while preserving output semantics.
The Coupling Tax: How Shared Token Budgets Undermine Visible Chain-of-Thought Under Fixed Output Limits cs.LG · 2026-05-08 · unverdicted · none · ref 18 · internal anchor
Shared token budgets between visible chain-of-thought and answers create a coupling tax that makes non-thinking competitive on math benchmarks, with a truncation decomposition predicting the crossover and split budgets improving results.
Certified Speculative Execution for Untrusted AI Agents cs.CR · 2026-06-30 · unverdicted · none · ref 2 · internal anchor
CGPA enables certified speculative execution of untrusted AI proposals in constrained sequential decisions via verifier rejection, conformal boundary gating, and solver deferral, yielding zero violations and regret within noise of the oracle.
When Is a Draft Accepted? A Theory of Acceptance in Speculative Decoding cs.LG · 2026-06-29 · unverdicted · none · ref 5 · internal anchor
Develops theory for acceptance in speculative decoding under greedy/relaxed/tree criteria, with exact KL certificates and margin bounds, evaluated on Qwen3 models.
Efficient and Trainable Language Model Test-Time Scaling via Local Branch Routing cs.CL · 2026-06-24 · unverdicted · none · ref 5 · internal anchor
LBR performs token-level test-time scaling via local branch routing on hidden states, enabling end-to-end RL training and improving Pass@1 and Pass@32 on math benchmarks over CoT and RLVR baselines.
Speculative Sampling For Faster Molecular Dynamics cs.LG · 2026-06-01 · unverdicted · none · ref 45 · internal anchor
LSD extends speculative sampling to second-order Langevin dynamics, achieving 3-9x speedup in MD while exactly sampling from the target distribution without relative error.
Cost-Aware Diffusion Draft Trees for Speculative Decoding cs.CL · 2026-06-01 · unverdicted · none · ref 2 · internal anchor
CaDDTree jointly selects tree structure and budget to maximize expected tokens per unit time in speculative decoding, proving unimodality under convex verification cost and matching oracle DDTree performance on Qwen models.
Off-the-Shelf LLMs as Process Scorers: Training-Free Alternative to PRMs for Mathematical Reasoning cs.CL · 2026-06-01 · unverdicted · none · ref 10 · internal anchor
Chunk-Level Guided Generation uses off-the-shelf large LLMs to score fixed-length chunks from small models via likelihoods, matching trained PRM performance on math benchmarks without reward-model training.
OmniOPD: Logit-Free On-Policy Distillation via Speculative Verification cs.LG · 2026-05-31 · unverdicted · none · ref 3 · internal anchor
OmniOPD replaces token-level logit matching in on-policy distillation with Monte Carlo chunk-level semantic verification and a peak-entropy scheduler.
TAPS: Target-Aware Prefix Tree Selection for Diffusion-Drafted Speculative Decoding cs.AI · 2026-05-30 · unverdicted · none · ref 9 · internal anchor
TAPS converts diffusion marginal probabilities into path-conditioned acceptance estimates to select prefix-closed subtrees under a fixed verification budget, achieving up to 7.9x end-to-end speedup over autoregressive decoding.
EST-PRM: Stress-Testing Process Reward Models Before They Become Load-Bearing cs.LG · 2026-05-30 · unverdicted · none · ref 85 · internal anchor
EST-PRM stress-tests five PRM models on 4,687 reasoning chains from MATH-500, GSM8K, and PRMBench using three label-preserving transformations and reports model-specific vulnerability patterns.
Bastion: Budget-Aware Speculative Decoding with Tree-structured Block Diffusion Drafting cs.LG · 2026-05-28 · unverdicted · none · ref 9 · internal anchor
BASTION is a budget-aware speculative decoding framework with adaptive tree-structured block diffusion drafting that reports up to 6.61x speedup and 39% improvement over block-diffusion baselines.
Draft Less, Retrieve More: Hybrid Tree Construction for Speculative Decoding cs.LG · 2026-05-19 · unverdicted · none · ref 4 · internal anchor
Graft combines pruning and retrieval in a sequential mechanism to build hybrid draft trees for speculative decoding, delivering up to 5.41× speedup and 21.8% better average speedup than EAGLE-3 on large models.
Skim: Speculative Execution for Fast and Efficient Web Agents cs.AI · 2026-05-15 · unverdicted · none · ref 4 · internal anchor
Skim profiles website patterns offline to enable fast-path speculative execution for web agents, cutting median cost by 1.9x and latency by 33.4% with no accuracy loss on benchmarks.
PSD: Pushing the Pareto Frontier of Diffusion LLMs via Parallel Speculative Decoding cs.CL · 2026-05-15 · unverdicted · none · ref 11 · internal anchor
PSD is a training-free framework that jointly optimizes spatial unmasking and temporal speculative decoding in diffusion LLMs to reach up to 5.5x tokens per forward pass while preserving accuracy comparable to greedy decoding.
Factorization-Error-Free Discrete Diffusion Language Model via Speculative Decoding cs.CL · 2026-05-14 · unverdicted · none · ref 4 · internal anchor
FeF-DLLM achieves factorization-error-free generation in discrete diffusion language models via prefix-conditioned posterior factorization and speculative decoding, delivering 5.04 pp higher accuracy and 3.86x faster inference on GSM8K, MATH, HumanEval, and MBPP.
SlimSpec: Low-Rank Draft LM-Head for Accelerated Speculative Decoding cs.LG · 2026-05-11 · unverdicted · none · ref 2 · internal anchor
SlimSpec replaces the standard LM-head in draft models with a low-rank version to deliver 4-5x faster speculative decoding while preserving full vocabulary and competitive acceptance rates.
Future Validity is the Missing Statistic: From Impossibility to $\Phi$-Estimation for Grammar-Faithful Speculative Decoding cs.LG · 2026-05-08 · unverdicted · none · ref 2 · internal anchor
Speculative decoding under local grammar masking samples from the projected distribution μ^proj instead of the grammar-conditional μ*, and the future-validity function Φ corrects it via a Doob transform to achieve exact sampling from μ*.
SpecBlock: Block-Iterative Speculative Decoding with Dynamic Tree Drafting cs.CL · 2026-05-08 · unverdicted · none · ref 2 · 2 links · internal anchor
SpecBlock achieves 8-13% higher mean speedup than EAGLE-3 at 44-52% drafting cost via block-iterative drafting with hidden-state inheritance, dynamic rank-head branching, valid-prefix masking, and optional cost-aware bandit adaptation.
Selective Rollout: Mid-Trajectory Termination for Multi-Sample Agent RL cs.LG · 2026-05-07 · conditional · none · ref 30 · internal anchor
A one-parameter early-termination gate based on mean pairwise prefix edit distance reduces wall-clock time by 10.7% and raises held-out success by 2.5 pp in GRPO on ALFWorld by cutting zero-advantage batch dilution.
UniVer: A Unified Perspective for Multi-step and Multi-draft Speculative Decoding cs.CL · 2026-05-06 · unverdicted · none · ref 3 · internal anchor
UniVer frames tree-based speculative decoding as conditional optimal transport, proving it is lossless with optimal acceptance rates and delivering 4.2-8.5% longer accepted sequences than standard rejection sampling.
Component-Aware Self-Speculative Decoding in Hybrid Language Models cs.CL · 2026-05-01 · unverdicted · none · ref 2 · internal anchor
Component-aware self-speculative decoding achieves high acceptance rates in parallel hybrid models like Falcon-H1 but fails in sequential ones like Qwen3.5, with the gap tied to how components are integrated.
An Empirical Study of Speculative Decoding on Software Engineering Tasks cs.SE · 2026-04-29 · unverdicted · none · ref 6 · internal anchor
Speculative decoding accelerates LLM inference on SE tasks without accuracy loss, with model-based methods suiting code generation and model-free methods suiting repository-level repair and editing.
FASER: Fine-Grained Phase Management for Speculative Decoding in Dynamic LLM Serving cs.DC · 2026-04-22 · unverdicted · none · ref 8 · internal anchor
FASER delivers up to 53% higher throughput and 1.92x lower latency in dynamic LLM serving by adjusting speculative lengths per request, early pruning of rejects, and overlapping draft/verification phases via frontiers.
WISV: Wireless-Informed Semantic Verification for Distributed Speculative Decoding in Device-Edge LLM Inference cs.IT · 2026-04-20 · unverdicted · none · ref 11 · internal anchor
WISV uses a channel-aware semantic acceptance policy on hidden representations to boost accepted sequence length by up to 60.8% and cut interaction rounds by 37.3% in distributed speculative decoding, with under 1% accuracy loss.
Speculative Decoding for Autoregressive Video Generation cs.CV · 2026-04-19 · conditional · none · ref 2 · internal anchor
A training-free speculative decoding method for block-based autoregressive video diffusion uses a quality router on worst-frame ImageReward scores to accept drafter proposals, achieving up to 2.09x speedup at 95.7% quality retention.
From Tokens to Steps: Verification-Aware Speculative Decoding for Efficient Multi-Step Reasoning cs.CL · 2026-04-16 · unverdicted · none · ref 1 · internal anchor
SpecGuard adds step-level verification to speculative decoding via attention grounding and log-probability scores, yielding 3.6% higher accuracy and 11% lower latency on reasoning benchmarks.
MARS: Enabling Autoregressive Models Multi-Token Generation cs.CL · 2026-04-08 · unverdicted · none · ref 1 · internal anchor
MARS fine-tunes autoregressive models to predict multiple tokens per step via continued training on instruction data, achieving 1.5-1.7x throughput while matching baseline accuracy and supporting real-time speed adjustment.
Cactus: Accelerating Auto-Regressive Decoding with Constrained Acceptance Speculative Sampling cs.LG · 2026-04-05 · unverdicted · none · ref 1 · internal anchor
Cactus uses constrained optimization to guarantee bounded divergence from the verifier LLM distribution during speculative sampling, raising acceptance rates without the distortion seen in typical acceptance sampling.
EvoESAP: Non-Uniform Expert Pruning for Sparse MoE cs.LG · 2026-03-06 · conditional · none · ref 3 · internal anchor
EvoESAP uses evolutionary search guided by a speculative-decoding-inspired ESAP metric to discover non-uniform layer-wise sparsity allocations for MoE expert pruning, improving generation accuracy up to 19.6% at 50% sparsity.
When RL Meets Adaptive Speculative Training: A Unified Training-Serving System cs.LG · 2026-02-06 · conditional · none · ref 7 · internal anchor
Aurora unifies speculative decoder training and serving via asynchronous RL on inference traces, delivering 1.5x day-0 speedup on frontier models and 1.25x adaptation gains on distribution shifts.
Vec-LUT: Vector Table Lookup for Parallel Ultra-Low-Bit LLM Inference on Edge Devices cs.DC · 2025-12-06 · conditional · none · ref 5 · internal anchor
Vec-LUT delivers up to 4.2x speedup over prior LUT methods for parallel ultra-low-bit LLM inference on edge devices by unifying lookups across tokens and adding cache-aware tensor layouts.
VVS: Accelerating Speculative Decoding for Visual Autoregressive Generation via Partial Verification Skipping cs.CV · 2025-11-17 · conditional · none · ref 4 · internal anchor
VVS accelerates visual AR image generation by partially skipping verifications in speculative decoding, achieving 2.8x fewer target forward passes while preserving competitive quality.
Efficient Autoregressive Inference for Transformer Probabilistic Models stat.ML · 2025-10-10 · conditional · none · ref 1 · internal anchor
A causal autoregressive buffer enables efficient batched autoregressive sampling and joint density evaluation in set-based transformer models by caching context and attending to prior predictions.
Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs cs.CL · 2024-12-30 · unverdicted · none · ref 232 · internal anchor
o1-like models overthink easy tasks; self-training reduces compute use without accuracy loss on GSM8K, MATH500, GPQA, and AIME.
Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation cs.CV · 2024-06-10 · conditional · none · ref 7 · internal anchor
Scaled vanilla autoregressive models based on Llama achieve 2.18 FID on ImageNet 256x256 image generation, beating popular diffusion models without visual inductive biases.
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads cs.LG · 2024-01-19 · conditional · none · ref 5 · internal anchor
Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.
Fast Inference from Transformers via Speculative Decoding cs.LG · 2022-11-30 · accept · none · ref 44 · internal anchor
Speculative decoding accelerates exact sampling from large autoregressive models by 2-3x on T5-XXL by running smaller approximation models in parallel to propose token sequences that the large model then verifies in batches while preserving the original output distribution.
Before Thinking, Learn to Decide: Proactive Routing for Efficient Visual Reasoning cs.CL · 2026-06-29 · unverdicted · none · ref 4 · internal anchor
PRP introduces proactive routing via Draft Rating Learning and Joint Rating Learning to route queries early between draft and target models for efficient multimodal reasoning.
Speculative Pre-Positioning: Decoding Stateful Sessions to the Next Decision Point Off the Critical Path cs.LG · 2026-06-28 · unverdicted · none · ref 6 · internal anchor
Speculative pre-positioning decodes stateful sessions ahead with the target model to enable near-constant-time responses from cached distributions or pre-paid deltas at 87% precision for capable models.
Depth Exploration for LLM Decoding cs.LG · 2026-06-28 · unverdicted · none · ref 27 · internal anchor
DEX replaces single-depth selection with parallel exploration over multiple candidate depths, committing the final-depth token while collapsing reusable states to reduce per-token computation.
Fast-dLLM++: Fréchet Profile Decoding for Faster Diffusion LLM Inference cs.CL · 2026-06-01 · conditional · none · ref 69 · internal anchor
Fast-dLLM++ generalizes Fast-dLLM decoding to heterogeneous confidence profiles via Fréchet profile selection, delivering up to 37% throughput gains on GSM8K, MATH, HumanEval, and MBPP with LLaDA-8B.
SimSD: Simple Speculative Decoding in Diffusion Language Models cs.CL · 2026-06-01 · unverdicted · none · ref 8 · internal anchor
SimSD adds a masking strategy to enable speculative decoding in diffusion LLMs, delivering up to 7.46x throughput gains on SDAR models while preserving generation quality.
DFlare: Scaling Up Draft Capacity for Block Diffusion Speculative Decoding cs.CL · 2026-06-01 · unverdicted · none · ref 2 · internal anchor
DFlare replaces DFlash's shared fused representation with per-draft-layer attention to distinct target-layer combinations, enabling deeper drafts and 2.4M training samples for 5-11% higher speedups than DFlash on Qwen3 and GPT-OSS models.
DREAM-S: Speculative Decoding with Searchable Drafting and Target-Aware Refinement for Multimodal Generation cs.LG · 2026-05-30 · unverdicted · none · ref 90 · internal anchor
DREAM-S combines neural architecture search, target-aware supernet training, and attention-entropy-guided distillation to accelerate speculative decoding in VLMs, reporting up to 3.85x speedup over standard methods.
Domino: Decoupling Causal Modeling from Autoregressive Drafting in Speculative Decoding cs.CL · 2026-05-28 · unverdicted · none · ref 5 · internal anchor
Domino decouples causal dependency modeling from autoregressive draft execution via a parallel backbone plus lightweight causal head and a base-anchored training curriculum, reporting up to 5.49x speedup.
Draft-OPD: On-Policy Distillation for Speculative Draft Models cs.CL · 2026-05-28 · unverdicted · none · ref 7 · internal anchor
Draft-OPD applies on-policy distillation via target-assisted generation and error replay to train speculative draft models, yielding over 5x lossless acceleration and gains over EAGLE-3 and DFlash.
Fast-dDrive: Efficient Block-Diffusion VLM for Autonomous Driving cs.CL · 2026-05-22 · unverdicted · none · ref 4 · 2 links · internal anchor
Fast-dDrive is a block-diffusion VLA that reports SOTA accuracy on WOD-E2E and nuScenes driving benchmarks together with 12x throughput over autoregressive baselines via section scaffolds and test-time averaging.
FlexDraft: Flexible Speculative Decoding via Attention Tuning and Bonus-Guided Calibration cs.CL · 2026-05-19 · unverdicted · none · ref 5 · internal anchor
FlexDraft is a lossless speculative decoding framework that adapts to batch sizes via attention tuning on final layers, MLP-based bonus calibration, and dynamic parallel/sequential decoding.
SNLP: Layer-Parallel Inference via Structured Newton Corrections cs.LG · 2026-05-18 · unverdicted · none · ref 6 · 2 links · internal anchor
SNLP achieves up to 2.58x wall-clock speedup on 0.5B Transformers via architecture-specific Newton corrections (IDN/HCN) that enable layer-parallel inference while preserving perplexity in milder settings.
