Understanding and Accelerating the Training of Masked Diffusion Language Models
1 Pith paper cites this work. Polarity classification is still indexing.
abstract
Masked diffusion models (MDMs) have emerged as a promising alternative to autoregressive models (ARMs) for language modeling. However, MDMs are known to learn substantially more slowly than ARMs, which may become problematic when scaling MDMs to larger models. Therefore, we ask the following question: how can we accelerate standard MDM training while maintaining its final performance? To this end, we first provide a detailed analysis of why MDM training is slow. We find that the main factor is the locality bias of language: the predictive information for a token is concentrated in nearby positions. We further investigate how this bias slows learning and suggest a simple yet effective remedy: bell-shaped time sampling as a training strategy. Notably, MDMs trained with our training recipe reach the same validation negative log-likelihood (NLL) up to $\sim4\times$ faster than standard training on the One Billion Word Benchmark (LM1B). We also show faster improvements in generative perplexity, zero-shot perplexity, and downstream task performance on various benchmarks.
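As a rough illustration of what "bell-shaped time sampling" could look like, the sketch below contrasts uniform sampling of the masking time t with a symmetric Beta distribution that concentrates mass around t = 0.5. The abstract does not say which distribution the authors use, so the Beta choice, the `concentration` parameter, the `[MASK]` token string, and all function names here are assumptions made for illustration, not the paper's recipe.

```python
import random

MASK = "[MASK]"  # hypothetical mask token string, not taken from the paper


def sample_time_uniform(rng):
    """Baseline: masking time t drawn uniformly from [0, 1)."""
    return rng.random()


def sample_time_bell(rng, concentration=2.0):
    """Assumed bell-shaped alternative: symmetric Beta(c, c).

    For c > 1 the density peaks at t = 0.5 and vanishes at the endpoints.
    """
    return rng.betavariate(concentration, concentration)


def mask_tokens(tokens, t, rng):
    """Independently replace each token with MASK with probability t."""
    return [MASK if rng.random() < t else tok for tok in tokens]


def frac_mid(ts, lo=0.25, hi=0.75):
    """Fraction of sampled times that fall in the middle band [lo, hi]."""
    return sum(lo <= t <= hi for t in ts) / len(ts)


rng = random.Random(0)
uniform_ts = [sample_time_uniform(rng) for _ in range(10_000)]
bell_ts = [sample_time_bell(rng) for _ in range(10_000)]

# One training example under the assumed bell-shaped schedule.
sentence = "masked diffusion models learn slowly".split()
noisy = mask_tokens(sentence, sample_time_bell(rng), rng)
```

Under these assumptions, `frac_mid(bell_ts)` should exceed `frac_mid(uniform_ts)`, meaning more training examples land at intermediate masking ratios and fewer at the nearly clean or nearly fully masked extremes.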
fields
cs.AI 1
years
2026 1
verdicts
UNVERDICTED 1
representative citing papers
citing papers explorer
- Subliminal Clocks: Latent Time Modelling in Diffusion Language Models
DLMs encode a decodable latent timestep signal in residual activations that can be steered to predictably change model confidence and entropy.