Recognition: unknown
CreditDecoding: Accelerating Parallel Decoding in Diffusion Large Language Models with Trace Credit
read the original abstract
Diffusion large language models (dLLMs) generate text through iterative denoising. In commonly adopted parallel decoding schemes, each step confirms only high-confidence positions while remasking the others. By analyzing dLLM denoising traces, we uncover a key inefficiency: models often predict the correct target token several steps before its confidence becomes high enough to be decoded. This gap between early prediction and late decoding forces repeated remasking of already-correct tokens, causing redundant iterations and limiting acceleration. To exploit this temporal redundancy, we introduce Trace Credit to quantify a token's decoding potential by accumulating historical evidence. Building on this, we propose CreditDecoding, a training-free parallel decoding method that fuses Trace Credit with current logits to boost the confidence of correct but underconfident tokens, thereby accelerating denoising and improving robustness. On eight benchmarks, CreditDecoding achieves up to 5.48 times speedup with +0.48 accuracy on LLaDA-8B and consistently improves performance across diverse dLLM architectures and parameter scales. It further scales to long contexts and remains orthogonal to mainstream inference optimizations, making it a practical and widely applicable solution.
This paper has not been read by Pith yet.
Forward citations
Cited by 3 Pith papers
-
TAD: Temporal-Aware Trajectory Self-Distillation for Fast and Accurate Diffusion LLM
TAD improves the accuracy-parallelism trade-off in diffusion LLMs via temporal-aware self-distillation that applies hard labels to soon-to-be-decoded tokens and soft supervision to future tokens.
-
$R^2$-dLLM: Accelerating Diffusion Large Language Models via Spatio-Temporal Redundancy Reduction
R²-dLLM reduces dLLM decoding steps by up to 75% via spatio-temporal redundancy reduction while keeping generation quality competitive.
-
DMax: Aggressive Parallel Decoding for dLLMs
DMax enables faster parallel decoding in diffusion language models by using on-policy training to recover from errors and soft embedding interpolations for iterative revision, boosting tokens per forward pass roughly ...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.