Accelerating Speculative Decoding with Block Diffusion Draft Trees

2026 · cs.CL · arXiv 2604.12989

8 Pith papers cite this work. Polarity classification is still indexing.

abstract

Speculative decoding accelerates autoregressive language models by using a lightweight drafter to propose multiple future tokens, which the target model then verifies in parallel. DFlash shows that a block diffusion drafter can generate an entire draft block in a single forward pass and achieve state-of-the-art speculative decoding performance, outperforming strong autoregressive drafters such as EAGLE-3. Vanilla DFlash, however, still verifies only a single drafted trajectory per round, potentially limiting its acceptance length. We introduce DDTree (Diffusion Draft Tree), a method that constructs a draft tree directly from the per-position distributions of a block diffusion drafter. Under a fixed node budget, DDTree uses a simple best-first heap algorithm to select the continuations that are most likely to match the target model according to a surrogate defined by the draft model's output. The resulting tree is verified efficiently in a single target model forward pass using an ancestor-only attention mask. Because DDTree builds on DFlash, a leading draft model for speculative decoding, these gains place DDTree among the leading approaches to speculative decoding.
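The tree construction and verification mask described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the surrogate score of a path is the sum of the drafter's per-position log-probabilities (the abstract only says the surrogate is defined by the draft model's output), and the names `build_draft_tree`, `ancestor_mask`, and `example_log_probs` are hypothetical.

```python
import heapq
import math


def build_draft_tree(log_probs, budget):
    """Pick up to `budget` tree nodes in best-first order.

    log_probs[d] maps token id -> log-probability at draft position d.
    ASSUMPTION: a path's score is the sum of its per-position log-probs.
    Returns nodes as (parent_index, depth, token, score); parent_index is -1
    for nodes attached directly to the already-verified prefix.
    """
    # Rank each position's candidates once, most likely first.
    ranked = [sorted(pos.items(), key=lambda kv: -kv[1]) for pos in log_probs]
    nodes = []
    if budget <= 0 or not ranked or not ranked[0]:
        return nodes
    counter = 0  # tie-breaker so the heap never compares beyond the score
    # Heap entries: (-score, tie_breaker, parent_index, depth, rank).
    heap = [(-ranked[0][0][1], counter, -1, 0, 0)]
    while heap and len(nodes) < budget:
        neg_score, _, parent, depth, rank = heapq.heappop(heap)
        token, lp = ranked[depth][rank]
        index = len(nodes)
        nodes.append((parent, depth, token, -neg_score))
        # Next-best sibling: same parent, next-ranked token at this depth.
        if rank + 1 < len(ranked[depth]):
            sib_lp = ranked[depth][rank + 1][1]
            counter += 1
            heapq.heappush(
                heap, (neg_score + lp - sib_lp, counter, parent, depth, rank + 1)
            )
        # Best child: top-ranked token at the next depth under this node.
        if depth + 1 < len(ranked) and ranked[depth + 1]:
            child_lp = ranked[depth + 1][0][1]
            counter += 1
            heapq.heappush(
                heap, (neg_score - child_lp, counter, index, depth + 1, 0)
            )
    return nodes


def ancestor_mask(nodes):
    """mask[i][j] is True when node i may attend to node j (itself or an ancestor)."""
    n = len(nodes)
    mask = [[False] * n for _ in range(n)]
    for i, node in enumerate(nodes):
        mask[i][i] = True
        parent = node[0]
        while parent != -1:  # walk up to the root of the draft tree
            mask[i][parent] = True
            parent = nodes[parent][0]
    return mask


# Toy per-position distributions over made-up token ids (illustration only).
example_log_probs = [
    {10: math.log(0.6), 11: math.log(0.3), 12: math.log(0.1)},
    {20: math.log(0.7), 21: math.log(0.2), 22: math.log(0.1)},
]
tree = build_draft_tree(example_log_probs, budget=4)
mask = ancestor_mask(tree)
```

Because log-probabilities are non-positive and each position's candidates are sorted, a child or next sibling never scores higher than the node that pushed it. The heap therefore pops nodes in non-increasing score order, and every selected node's parent is already in the tree. In a real system the boolean mask would be combined with attention over the verified prefix before the single target-model forward pass.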

representative citing papers

WhiFlash: Accelerating Speculative Decoding with Token-Level Cross-Paradigm Routing

cs.LG · 2026-06-05 · unverdicted · novelty 7.0

WhiFlash introduces token-level cross-paradigm routing between autoregressive and diffusion drafting models, with cache optimizations, to raise acceptance lengths and deliver up to 69.6% throughput gains over EAGLE-3.

D^2SD: Accelerating Speculative Decoding with Dual Diffusion Draft Models

cs.DC · 2026-06-03 · unverdicted · novelty 7.0

D^2SD uses two diffusion drafters in a prefix tree structure with confidence scores to select and recover alternative draft sequences, achieving higher acceptance rates in speculative decoding.

Cost-Aware Diffusion Draft Trees for Speculative Decoding

cs.CL · 2026-06-01 · unverdicted · novelty 7.0

CaDDTree jointly selects tree structure and budget to maximize expected tokens per unit time in speculative decoding, proving unimodality under convex verification cost and matching oracle DDTree performance on Qwen models.

TAPS: Target-Aware Prefix Tree Selection for Diffusion-Drafted Speculative Decoding

cs.AI · 2026-05-30 · unverdicted · novelty 7.0

TAPS converts diffusion marginal probabilities into path-conditioned acceptance estimates to select prefix-closed subtrees under a fixed verification budget, achieving up to 7.9x end-to-end speedup over autoregressive decoding.

Bastion: Budget-Aware Speculative Decoding with Tree-structured Block Diffusion Drafting

cs.LG · 2026-05-28 · unverdicted · novelty 7.0

BASTION is a budget-aware speculative decoding framework with adaptive tree-structured block diffusion drafting that reports up to 6.61x speedup and 39% improvement over block-diffusion baselines.

JetSpec: Breaking the Scaling Ceiling of Speculative Decoding with Parallel Tree Drafting

cs.CL · 2026-06-16 · unverdicted · novelty 6.0

JetSpec trains a causal draft head to produce branch-consistent trees aligned with target autoregressive scores, achieving up to 9.64x speedup on MATH-500 and outperforming prior SD baselines on Qwen3 models.

Teaching Diffusion to Speculate Left-to-Right

cs.CL · 2026-06-10 · unverdicted · novelty 6.0

Three training interventions for diffusion drafters raise accepted draft length by 21-76% over a uniform baseline on reasoning, code, and dialogue tasks.

TreeFlash: Parallel AR-Approximation for Faster Speculative Decoding

cs.LG · 2026-06-02 · unverdicted · novelty 6.0

TreeFlash adds an MLP conditioned on hidden state and prior token to approximate autoregressive distributions in parallel one-shot tree drafters for speculative decoding, claiming 12% higher block efficiency and 9% higher speedup over marginal tree drafting.
