
arxiv: 2509.18085 · v4 · pith:KBUF7ZEOnew · submitted 2025-09-22 · 💻 cs.LG · cs.AI · cs.CL

Structuring The Future: Diffusion LLM Speculative Decoding via Calibrated Draft Graphs

classification 💻 cs.LG cs.AI cs.CL
keywords draft decoding model speculative spiffy accelerate ar-llms calibrated

Diffusion LLMs (dLLMs) have recently emerged as a powerful alternative to autoregressive LLMs (AR-LLMs) with the potential to operate at significantly higher token-generation rates. To unlock this potential, we present Spiffy, a speculative decoding algorithm to accelerate dLLM inference while provably preserving the model's output distribution. This work addresses the unique challenges involved in applying ideas from speculative decoding of AR-LLMs to dLLMs. Spiffy performs auto-speculation to eliminate the overheads of an independent draft model, structuring draft states in the form of a novel directed draft graph to take advantage of the bidirectional, blockwise nature of dLLM generation. These draft graphs are calibrated offline to maximize acceptance rates and are dynamically pruned during inference for improved computational efficiency. We present a detailed formulation of Spiffy and demonstrate its ability to accelerate LLaDA, Dream, and SDAR models in combination with KV caching and threshold-based dynamic unmasking leading to up to $8.6\times$ reduction in model inferences and $6.3\times$ acceleration in token rate.



Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. PSD: Pushing the Pareto Frontier of Diffusion LLMs via Parallel Speculative Decoding

    cs.CL 2026-05 unverdicted novelty 7.0

    PSD is a training-free framework that jointly optimizes spatial unmasking and temporal speculative decoding in diffusion LLMs to reach up to 5.5x tokens per forward pass while preserving accuracy comparable to greedy ...

  2. Multi-Token Residual Prediction

    cs.LG 2026-05 unverdicted novelty 7.0

    MRP predicts logit residuals from hidden states to support dependency-aware multi-token denoising in a single forward pass for diffusion language models, yielding up to 1.42× lossless speedup on SDAR models.

  3. TEAM: Temporal-Spatial Consistency Guided Expert Activation for MoE Diffusion Language Model Acceleration

    cs.CL 2026-02 unverdicted novelty 7.0

    TEAM accelerates MoE dLLMs up to 2.2x by exploiting temporal-spatial consistency in expert routing to accept more tokens with fewer activations.

  4. Accelerating Speculative Diffusions via Block Verification

    cs.LG 2026-06 unverdicted novelty 6.0

    A new residual-sampling scheme for diffusion models permits block verification and yields up to 6.3% speedup via a heuristic self-speculative drafter that needs no training.

  5. Diffusion Language Models Know the Answer Before Decoding

    cs.CL 2025-08 conditional novelty 6.0

    DLMs show early answer convergence allowing Prophet to cut decoding steps by up to 3.4x on LLaDA-8B and Dream-7B while keeping output quality.