Introspective Diffusion Language Models

2026 · cs.AI · arXiv 2604.11035

2 Pith papers cite this work. Polarity classification is still indexing.

abstract

Diffusion language models (DLMs) promise parallel generation, yet still lag behind autoregressive (AR) models in quality. We trace this gap to a failure of introspective consistency: AR models agree with their own generations, while DLMs often do not. We define the introspective acceptance rate, which measures whether a model accepts its previously generated tokens. This reveals why AR training has a structural advantage: causal masking and logit shifting implicitly enforce introspective consistency. Motivated by this observation, we introduce the Introspective Diffusion Language Model (I-DLM), a paradigm that retains diffusion-style parallel decoding while inheriting the introspective consistency of AR training. I-DLM uses a novel introspective strided decoding (ISD) algorithm, which enables the model to verify previously generated tokens while advancing new ones in the same forward pass. From a systems standpoint, we build the I-DLM inference engine on AR-inherited optimizations and further customize it with a stationary-batch scheduler. To the best of our knowledge, I-DLM is the first DLM to match the quality of its same-scale AR counterpart while outperforming prior DLMs in both model quality and practical serving efficiency across 15 benchmarks. It reaches 69.6 on AIME-24 and 45.7 on LiveCodeBench-v6, exceeding LLaDA-2.1-mini (16B) by more than 26 and 15 points, respectively. Beyond quality, I-DLM is designed for the growing demand for large-concurrency serving, delivering about 3x higher throughput than prior state-of-the-art DLMs.
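
The abstract describes the introspective acceptance rate only informally, as a measure of whether a model accepts its previously generated tokens. The sketch below shows one simple way such a quantity could be computed. It is a greedy proxy under stated assumptions, not the paper's definition: a token counts as "accepted" when it equals the argmax of the logits the same model produces when re-scoring that position. The function name, its arguments, and the greedy criterion are all hypothetical; the paper's actual rule may be probabilistic.

```python
from typing import Sequence


def introspective_acceptance_rate(
    generated: Sequence[int],
    rescored_logits: Sequence[Sequence[float]],
) -> float:
    """Greedy proxy (assumption): share of generated tokens that match the
    argmax of the model's own re-scored logits at the same position."""
    if len(generated) != len(rescored_logits):
        raise ValueError("need exactly one logit vector per generated token")
    if not generated:
        return 0.0

    accepted = 0
    for token, logits in zip(generated, rescored_logits):
        # Index of the highest logit, i.e. the token the model would pick now.
        best = max(range(len(logits)), key=logits.__getitem__)
        if best == token:
            accepted += 1
    return accepted / len(generated)


# Toy example with a 3-token vocabulary and 4 generated positions.
tokens = [2, 0, 1, 1]
logits = [
    [0.1, 0.2, 0.9],  # argmax 2, matches generated token 2
    [1.5, 0.3, 0.2],  # argmax 0, matches generated token 0
    [0.4, 0.1, 0.8],  # argmax 2, generated token was 1 (rejected)
    [0.0, 2.0, 1.0],  # argmax 1, matches generated token 1
]
rate = introspective_acceptance_rate(tokens, logits)  # 3 of 4 accepted
```

Under this proxy, a rate of 1.0 would mean the model fully agrees with its own output on a second look, and lower values indicate the kind of self-disagreement the abstract attributes to DLMs.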

representative citing papers

Revise, Don't Freeze: Sampler-Matched Training for Self-Correcting Masked Diffusion Language Models

cs.CL · 2026-05-31 · unverdicted · novelty 6.0

Presents D3IM sampler and SCOPE post-training that enable visible-token revision in masked diffusion LMs, reporting double-digit gains on GSM8K and HumanEval for LLaDA-8B.

DSL-LLaDA: Scaling Continuous Denoising to 8B Masked Diffusion LMs

cs.CL · 2026-05-31 · unverdicted · novelty 6.0

Adapting LLaDA-8B-Instruct via Discrete Stochastic Localization with continuous per-token Gaussian noise yields continuous denoising that achieves top ROUGE-1 on zero-shot summarization at low step budgets and adds selective noisy-state robustness.
