Improving the Throughput of Diffusion-based Large Language Models via a Training-Free Confidence-Aware Calibration
We present CadLLM, a training-free method to improve the inference throughput of diffusion-based LLMs (dLLMs). We first investigate the dynamic nature of token unmasking confidence across blocks and steps. Based on this observation, we present a lightweight adaptive approach that controls the generation block size, step size, and threshold based on the average confidence of unmasked tokens. We further reduce softmax overhead by dynamically computing it over a subset of the vocabulary to regulate sampling breadth. CadLLM is a plug-and-play, model-agnostic method compatible with KV-cache-based dLLMs. Extensive experiments on four popular tasks demonstrate that CadLLM yields a 1.1x-2.28x throughput improvement over the state-of-the-art baseline with competitive accuracy.
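The abstract only states that block size, step size, and threshold are adapted from the average confidence of unmasked tokens, and that softmax is restricted to a vocabulary subset; the sketch below is one plausible reading of that idea, not the paper's actual algorithm. All names (CadConfig, calibrate, decode_step) and the specific update rules are illustrative assumptions.

```python
# Minimal sketch (assumed, not the paper's implementation) of confidence-aware
# calibration for one diffusion-LLM denoising step.
from dataclasses import dataclass
import torch

@dataclass
class CadConfig:
    block_size: int = 32      # tokens decoded per generation block
    step_size: int = 4        # masked tokens committed per denoising step
    threshold: float = 0.9    # target confidence for committing tokens
    topk_vocab: int = 512     # vocabulary subset used for the softmax

def calibrate(cfg: CadConfig, avg_conf: float) -> CadConfig:
    """Adapt decoding hyper-parameters from the average confidence of the
    tokens unmasked in the previous step (hypothetical update rules)."""
    if avg_conf > cfg.threshold:                      # confident: decode more aggressively
        cfg.step_size = min(cfg.step_size * 2, cfg.block_size)
        cfg.threshold = max(0.5, cfg.threshold - 0.05)
    else:                                             # uncertain: decode conservatively
        cfg.step_size = max(1, cfg.step_size // 2)
        cfg.threshold = min(0.99, cfg.threshold + 0.05)
    return cfg

def decode_step(logits: torch.Tensor, masked: torch.Tensor, cfg: CadConfig):
    """One denoising step: softmax over a top-k vocabulary subset, then unmask
    the cfg.step_size still-masked positions with the highest confidence.

    logits: (seq_len, vocab_size) raw scores for the current block
    masked: (seq_len,) bool, True where the position is still masked
    """
    # Reduced-vocabulary softmax: normalise only the top-k logits per position.
    topk_vals, topk_idx = logits.topk(cfg.topk_vocab, dim=-1)
    probs = torch.softmax(topk_vals, dim=-1)
    conf, local = probs.max(dim=-1)                   # per-position confidence
    tokens = topk_idx.gather(-1, local.unsqueeze(-1)).squeeze(-1)

    # Consider only positions that are still masked.
    conf = torch.where(masked, conf, torch.full_like(conf, -1.0))
    n = min(cfg.step_size, int(masked.sum()))
    unmask_pos = conf.topk(n).indices                 # most confident masked positions

    avg_conf = conf[unmask_pos].mean().item() if n > 0 else 0.0
    calibrate(cfg, avg_conf)                          # feed confidence back into the controller
    return tokens, unmask_pos, cfg
```

The feedback loop is the point: each step's observed confidence widens or narrows the next step's parallelism and threshold, while the top-k softmax keeps the per-step cost roughly proportional to the vocabulary subset rather than the full vocabulary.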
Forward citations
Cited by 2 Pith papers
- TAD: Temporal-Aware Trajectory Self-Distillation for Fast and Accurate Diffusion LLM
  TAD improves the accuracy-parallelism trade-off in diffusion LLMs via temporal-aware self-distillation that applies hard labels to soon-to-be-decoded tokens and soft supervision to future tokens.
- $R^2$-dLLM: Accelerating Diffusion Large Language Models via Spatio-Temporal Redundancy Reduction
  R²-dLLM reduces dLLM decoding steps by up to 75% via spatio-temporal redundancy reduction while keeping generation quality competitive.