Understanding and Accelerating the Training of Masked Diffusion Language Models

2026 · cs.LG · arXiv 2605.13026

1 Pith paper cites this work. Polarity classification is still indexing.

abstract

Masked diffusion models (MDMs) have emerged as a promising alternative to autoregressive models (ARMs) for language modeling. However, MDMs are known to learn substantially more slowly than ARMs, which may become problematic when scaling MDMs to larger models. Therefore, we ask the following question: how can we accelerate standard MDM training while maintaining its final performance? To this end, we first provide a detailed analysis of why MDM training is slow. We find that the main factor is the locality bias of language: the predictive information for a token is concentrated in nearby positions. We further investigate how this bias slows learning and suggest a simple yet effective remedy: bell-shaped time sampling as a training strategy. Notably, MDMs trained with our training recipe reach the same validation negative log-likelihood (NLL) up to $\sim 4\times$ faster than standard training on the One Billion Word Benchmark (LM1B). We also show faster improvements in generative perplexity, zero-shot perplexity, and downstream task performance on various benchmarks.
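
The abstract does not say which bell-shaped distribution is used, so the following is only a minimal sketch of the general idea, not the authors' recipe. It assumes that the diffusion time t in [0, 1] acts as a per-token masking probability, and it uses a symmetric Beta(c, c) distribution as a stand-in bell shape. The names `sample_time_bell`, `mask_tokens`, `mid_range_fraction`, the `concentration` parameter, and the `[MASK]` symbol are all hypothetical and chosen for illustration.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol, not taken from the paper


def sample_time_uniform(rng):
    """Baseline: draw the masking ratio t uniformly from [0, 1)."""
    return rng.random()


def sample_time_bell(rng, concentration=3.0):
    """Assumed bell shape: symmetric Beta(c, c), peaked at t = 0.5 when c > 1."""
    return rng.betavariate(concentration, concentration)


def mask_tokens(tokens, t, rng):
    """Independently replace each token with MASK with probability t."""
    return [MASK if rng.random() < t else tok for tok in tokens]


def mid_range_fraction(samples, lo=0.25, hi=0.75):
    """Fraction of sampled times that land in the middle of the range."""
    return sum(lo <= s <= hi for s in samples) / len(samples)


if __name__ == "__main__":
    rng = random.Random(0)
    uniform = [sample_time_uniform(rng) for _ in range(10_000)]
    bell = [sample_time_bell(rng) for _ in range(10_000)]
    print("uniform mid-range fraction:", mid_range_fraction(uniform))
    print("bell mid-range fraction:   ", mid_range_fraction(bell))

    sentence = "the predictive information for a token is local".split()
    print(mask_tokens(sentence, sample_time_bell(rng), rng))
```

Under these assumptions, the bell-shaped sampler spends more training steps at intermediate masking ratios and fewer at the extremes where almost nothing or almost everything is masked. Whether the loss is reweighted to compensate for the non-uniform sampling is not stated in the abstract, so the sketch leaves that out.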

representative citing papers

Subliminal Clocks: Latent Time Modelling in Diffusion Language Models

cs.AI · 2026-07-02 · unverdicted · novelty 6.0

DLMs encode a decodable latent timestep signal in residual activations that can be steered to predictably change model confidence and entropy.

citing papers explorer

Subliminal Clocks: Latent Time Modelling in Diffusion Language Models

cs.AI · 2026-07-02 · unverdicted · none · ref 50 · internal anchor

DLMs encode a decodable latent timestep signal in residual activations that can be steered to predictably change model confidence and entropy.
