Masked diffusion LMs can use continuous x-prediction flow with token-wise asynchronous updates and an RL policy network to reach 97% performance on HumanEval using only 25% of the usual decoding budget.
hub Canonical reference
LLaDA2.0: Scaling Up Diffusion Language Models to 100B
Canonical reference. 73% of citing Pith papers cite this work as background.
abstract
This paper presents LLaDA2.0 -- a tuple of discrete diffusion large language models (dLLM) scaling up to 100B total parameters through systematic conversion from auto-regressive (AR) models -- establishing a new paradigm for frontier-scale deployment. Instead of costly training from scratch, LLaDA2.0 upholds knowledge inheritance, progressive adaption and efficiency-aware design principle, and seamless converts a pre-trained AR model into dLLM with a novel 3-phase block-level WSD based training scheme: progressive increasing block-size in block diffusion (warm-up), large-scale full-sequence diffusion (stable) and reverting back to compact-size block diffusion (decay). Along with post-training alignment with SFT and DPO, we obtain LLaDA2.0-mini (16B) and LLaDA2.0-flash (100B), two instruction-tuned Mixture-of-Experts (MoE) variants optimized for practical deployment. By preserving the advantages of parallel decoding, these models deliver superior performance and efficiency at the frontier scale. Both models were open-sourced.
hub tools
citation-role summary
citation-polarity summary
years
2026 41representative citing papers
TUBE is a new upper bound on evidence for discrete diffusion models that shows block MDMs and AO-ARMs have strictly lower likelihood than exact ARMs.
Infilling extraction on diffusion language models extracts up to three times more verbatim sequences than prefix methods and achieves higher recall on redacted emails than autoregressive models.
TABALIGN pairs a diffusion language model planner emitting binary cell masks with a trained attention verifier, raising average accuracy 15.76 points over strong baselines on eight table benchmarks while speeding execution 44.64%.
FeF-DLLM achieves factorization-error-free generation in discrete diffusion language models via prefix-conditioned posterior factorization and speculative decoding, delivering 5.04 pp higher accuracy and 3.86x faster inference on GSM8K, MATH, HumanEval, and MBPP.
BlockVLA accelerates autoregressive VLA models by 3.3x using block diffusion finetuning, with faster training convergence and better early performance on long-horizon robotic tasks.
Introduces Block-R1 benchmark, Block-R1-41K dataset, and a conflict score to handle domain-specific optimal block sizes in RL post-training of diffusion LLMs.
Infinite Mask Diffusion Models use stochastic infinite-state masks to overcome the factorization error lower bound in standard masked diffusion, achieving superior few-step performance on language tasks via distillation.
RSPO interprets reward advantages as targets for relative log-ratios in dLLMs, calibrating noisy estimates to stabilize RLVR training and achieve strong gains on planning tasks with competitive math reasoning performance.
TAD improves the accuracy-parallelism trade-off in diffusion LLMs via temporal-aware self-distillation that applies hard labels to soon-to-be-decoded tokens and soft supervision to future tokens.
BadDLM implants effective backdoors in diffusion language models across concept, attribute, alignment, and payload targets by exploiting denoising dynamics while preserving clean performance.
LEAP detects early-converging tokens in dLLMs via future context filtering and multi-sequence superposition, reducing average denoising steps by about 30% while maintaining accuracy.
FoCore uses self-contrast on early-converging high-density tokens to boost diffusion LLM quality on reasoning benchmarks while cutting decoding steps by over 2x.
DepCap accelerates diffusion LM inference up to 5.63x by using last-block influence for adaptive block boundaries and conflict-free token selection for parallel decoding, with negligible quality loss.
DMax uses On-Policy Uniform Training and Soft Parallel Decoding to enable aggressive parallelism in dLLMs, raising TPF on GSM8K from 2.04 to 5.47 and on MBPP from 2.71 to 5.86 while preserving accuracy.
DEMASK adds a lightweight pairwise-dependency predictor to dLLMs and uses greedy selection to enable parallel unmasking whose total-variation error is provably bounded under sub-additivity.
Diffusion language models form more global representations with early-layer redundancy compared to autoregressive models, allowing layer skipping for up to 18.75% FLOP savings while maintaining over 90% performance.
CoDiLA adds a compact auxiliary AR model on diffusion latents to enforce local sequential validity during parallel token sampling in discrete diffusion language models.
TEAM accelerates MoE dLLMs up to 2.2x by exploiting temporal-spatial consistency in expert routing to accept more tokens with fewer activations.
Prompt template choice creates an evaluation illusion in dLLMs where parallel decoding appears competitive, but experiments show these methods consistently underperform single-token decoding across varied settings.
MBD-LMs post-train BD-LMs using MultiTF on bounded noise-groups with randomized schedulers and Block Buffer decoding to increase average TPF from 3.47 to 6.19 with accuracy rising to 81.03%.
dMoE aggregates token expert distributions to block level in dLLMs, cutting unique experts from 69.5 to 14.6, memory by 76-80%, and latency by 1.14-1.66x while retaining 99.11% performance.
GDSD reduces RL for dLLMs to likelihood-free self-distillation via a normalization-free logit-matching objective, outperforming ELBO methods with more stable training on LLaDA-8B and Dream-7B.
LoopMDM loops early-middle layers in masked diffusion models to match same-size MDM performance with up to 3.3x fewer training FLOPs and outperform on reasoning tasks by up to 8.5 points on GSM8K.
citing papers explorer
-
LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model
LLaDA2.0-Uni unifies multimodal understanding and generation inside one discrete diffusion large language model with a semantic tokenizer, MoE backbone, and diffusion decoder.