ParallelBench: Understanding the Trade-offs of Parallel Decoding in Diffusion LLMs

Coleman Hooper; Hyung Il Koo; Kangwook Lee; Kevin Galim; Minjae Lee; Nam Ik Cho; Seunghyuk Oh; Shuibai Zhang; Wonjun Kang; Yuchen Zeng

arxiv: 2510.04767 · v2 · pith:7B4UNLJCnew · submitted 2025-10-06 · 💻 cs.LG


Wonjun Kang, Kevin Galim, Seunghyuk Oh, Minjae Lee, Yuchen Zeng, Shuibai Zhang, Coleman Hooper, Yuezhou Hu, Hyung Il Koo, Nam Ik Cho, Kangwook Lee


classification 💻 cs.LG

keywords decoding · parallel · dllms · llms · quality · autoregressive · parallelbench · accelerate


While most autoregressive LLMs are constrained to one-by-one decoding, diffusion LLMs (dLLMs) have attracted growing interest for their potential to dramatically accelerate inference through parallel decoding. Despite this promise, the conditional independence assumption in dLLMs causes parallel decoding to ignore token dependencies, inevitably degrading generation quality when these dependencies are strong. However, existing works largely overlook these inherent challenges, and evaluations on standard benchmarks (e.g., math and coding) are not sufficient to capture the quality degradation caused by parallel decoding. To address this gap, we first provide an information-theoretic analysis of parallel decoding. We then conduct case studies on analytically tractable synthetic list operations from both data distribution and decoding strategy perspectives, offering quantitative insights that highlight the fundamental limitations of parallel decoding. Building on these insights, we propose ParallelBench, the first benchmark specifically designed for dLLMs, featuring realistic tasks that are trivial for humans and autoregressive LLMs yet exceptionally challenging for dLLMs under parallel decoding. Using ParallelBench, we systematically analyze both dLLMs and autoregressive LLMs, revealing that: (i) dLLMs under parallel decoding can suffer dramatic quality degradation in real-world scenarios, and (ii) current parallel decoding strategies struggle to adapt their degree of parallelism based on task difficulty, thus failing to achieve meaningful speedup without compromising quality. Our findings underscore the pressing need for innovative decoding methods that can overcome the current speed-quality trade-off. We release our benchmark to help accelerate the development of truly efficient dLLMs.
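The conditional independence issue described in the abstract can be illustrated with a minimal Python sketch. This is a hypothetical two-token toy example, not the paper's analysis or benchmark: the distribution `joint` and the helper names `marginals`, `parallel_distribution`, `valid_mass`, and `kl_bits` are all assumptions introduced here. The sketch models fully parallel decoding as sampling each position independently from its marginal, then measures how much probability mass still lands on sequences the true joint distribution supports, and the KL divergence between the joint and the factorized approximation.

```python
import math
from itertools import product

# Hypothetical toy joint distribution over two-token sequences.
# The two positions are perfectly dependent on each other.
joint = {("New", "York"): 0.5, ("Los", "Angeles"): 0.5}


def marginals(joint):
    """Return one marginal distribution (token -> prob) per position."""
    n = len(next(iter(joint)))
    margs = [dict() for _ in range(n)]
    for seq, p in joint.items():
        for i, tok in enumerate(seq):
            margs[i][tok] = margs[i].get(tok, 0.0) + p
    return margs


def parallel_distribution(joint):
    """Distribution induced by sampling every position independently
    from its marginal (the conditional independence assumption)."""
    margs = marginals(joint)
    out = {}
    for combo in product(*(m.items() for m in margs)):
        seq = tuple(tok for tok, _ in combo)
        out[seq] = math.prod(p for _, p in combo)
    return out


def valid_mass(joint, approx):
    """Probability that the approximation emits a sequence the joint supports."""
    return sum(p for seq, p in approx.items() if joint.get(seq, 0.0) > 0)


def kl_bits(joint, approx):
    """KL(joint || approx) in bits."""
    return sum(p * math.log2(p / approx[seq]) for seq, p in joint.items() if p > 0)


par = parallel_distribution(joint)
print(valid_mass(joint, par))  # 0.5
print(kl_bits(joint, par))     # 1.0
```

In this toy case, half of the probability mass falls on mixed sequences such as "New Angeles" that the joint distribution never produces, and the KL divergence is 1 bit, which equals the mutual information between the two tokens. Decoding the second token after the first, as an autoregressive model would, recovers the joint exactly.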


Forward citations

Cited by 7 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Kernel-Gradient Drifting Models
cs.LG · 2026-05 · unverdicted · novelty 7.0

Kernel-gradient drifting reformulates drifting models via kernel gradients to yield identifiable one-step generation with smoothed score matching and KL descent on Euclidean, Riemannian, and discrete spaces.

TAD: Temporal-Aware Trajectory Self-Distillation for Fast and Accurate Diffusion LLM
cs.CL · 2026-05 · unverdicted · novelty 7.0

TAD improves the accuracy-parallelism trade-off in diffusion LLMs via temporal-aware self-distillation that applies hard labels to soon-to-be-decoded tokens and soft supervision to future tokens.

Remask, Don't Replace: Token-to-Mask Refinement in Diffusion Large Language Models
cs.CL · 2026-04 · unverdicted · novelty 7.0

Token-to-Mask remasking improves self-correction in diffusion LLMs by resetting erroneous commitments to masks rather than overwriting them, yielding +13.33 points on AIME 2025 and +8.56 on CMATH.

Flow Map Language Models: One-step Language Modeling via Continuous Denoising
cs.CL · 2026-02 · unverdicted · novelty 7.0

Continuous flow language models match discrete diffusion baselines and their distilled one-step flow map versions exceed 8-step discrete diffusion quality on LM1B and OWT.

Posterior Refinement: Fast Language Generation via Any-Order Flow Maps
cs.CL · 2026-06 · unverdicted · novelty 6.0

FMLM+ with Posterior Refinement bridges masked diffusion and flow map models to match discrete baseline quality in language generation using 32x fewer neural function evaluations via posterior scoring and refinement.

Self-Generated Error Training for Token Editing in Diffusion Language Models
cs.CL · 2026-06 · unverdicted · novelty 6.0

Self-generated T2T training on LLaDA2.1-mini improves benchmark accuracy and lowers edit intensity by supervising recovery from model-generated corruptions instead of random ones.

Flow Map Language Models: One-step Language Modeling via Continuous Denoising
cs.CL · 2026-02 · conditional · novelty 6.0

Continuous flows on token embeddings with flow-map distillation produce one-step language models whose quality exceeds recent 8-step discrete diffusion baselines on LM1B and OpenWebText.