hub Canonical reference

LLaDA2.0: Scaling Up Diffusion Language Models to 100B

Tiwei Bie, Maosong Cao, Kun Chen, Lun Du, Mingliang Gong, Zhuochen Gong · 2025 · cs.LG · arXiv 2512.15745

Canonical reference. 73% of citing Pith papers cite this work as background.

32 Pith papers citing it

Background 73% of classified citations

open full Pith review browse 32 citing papers arXiv PDF

abstract

This paper presents LLaDA2.0 -- a tuple of discrete diffusion large language models (dLLM) scaling up to 100B total parameters through systematic conversion from auto-regressive (AR) models -- establishing a new paradigm for frontier-scale deployment. Instead of costly training from scratch, LLaDA2.0 upholds knowledge inheritance, progressive adaption and efficiency-aware design principle, and seamless converts a pre-trained AR model into dLLM with a novel 3-phase block-level WSD based training scheme: progressive increasing block-size in block diffusion (warm-up), large-scale full-sequence diffusion (stable) and reverting back to compact-size block diffusion (decay). Along with post-training alignment with SFT and DPO, we obtain LLaDA2.0-mini (16B) and LLaDA2.0-flash (100B), two instruction-tuned Mixture-of-Experts (MoE) variants optimized for practical deployment. By preserving the advantages of parallel decoding, these models deliver superior performance and efficiency at the frontier scale. Both models were open-sourced.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 8 baseline 1 method 1 other 1

citation-polarity summary

background 8 baseline 1 unclear 1 use method 1

representative citing papers

From Table to Cell: Attention for Better Reasoning with TABALIGN

cs.AI · 2026-05-14 · unverdicted · novelty 7.0

TABALIGN pairs a diffusion language model planner emitting binary cell masks with a trained attention verifier, raising average accuracy 15.76 points over strong baselines on eight table benchmarks while speeding execution 44.64%.

Factorization-Error-Free Discrete Diffusion Language Model via Speculative Decoding

cs.CL · 2026-05-14 · unverdicted · novelty 7.0

FeF-DLLM achieves factorization-error-free generation in discrete diffusion language models via prefix-conditioned posterior factorization and speculative decoding, delivering 5.04 pp higher accuracy and 3.86x faster inference on GSM8K, MATH, HumanEval, and MBPP.

BlockVLA: Accelerating Autoregressive VLA via Block Diffusion Finetuning

cs.RO · 2026-05-13 · unverdicted · novelty 7.0

BlockVLA accelerates autoregressive VLA models by 3.3x using block diffusion finetuning, with faster training convergence and better early performance on long-horizon robotic tasks.

Block-R1: Rethinking the Role of Block Size in Multi-domain Reinforcement Learning for Diffusion Large Language Models

cs.LG · 2026-05-12 · unverdicted · novelty 7.0 · 2 refs

Introduces Block-R1 benchmark, Block-R1-41K dataset, and a conflict score to handle domain-specific optimal block sizes in RL post-training of diffusion LLMs.

Infinite Mask Diffusion for Few-Step Distillation

cs.CL · 2026-05-11 · unverdicted · novelty 7.0

Infinite Mask Diffusion Models use stochastic infinite-state masks to overcome the factorization error lower bound in standard masked diffusion, achieving superior few-step performance on language tasks via distillation.

Relative Score Policy Optimization for Diffusion Language Models

cs.CL · 2026-05-11 · unverdicted · novelty 7.0

RSPO interprets reward advantages as targets for relative log-ratios in dLLMs, calibrating noisy estimates to stabilize RLVR training and achieve strong gains on planning tasks with competitive math reasoning performance.

TAD: Temporal-Aware Trajectory Self-Distillation for Fast and Accurate Diffusion LLM

cs.CL · 2026-05-10 · unverdicted · novelty 7.0

TAD improves the accuracy-parallelism trade-off in diffusion LLMs via temporal-aware self-distillation that applies hard labels to soon-to-be-decoded tokens and soft supervision to future tokens.

BadDLM: Backdooring Diffusion Language Models with Diverse Targets

cs.CR · 2026-05-10 · unverdicted · novelty 7.0

BadDLM implants effective backdoors in diffusion language models across concept, attribute, alignment, and payload targets by exploiting denoising dynamics while preserving clean performance.

LEAP: Unlocking dLLM Parallelism via Lookahead Early-Convergence Token Detection

cs.LG · 2026-05-09 · unverdicted · novelty 7.0

LEAP detects early-converging tokens in dLLMs via future context filtering and multi-sequence superposition, reducing average denoising steps by about 30% while maintaining accuracy.

Focus on the Core: Empowering Diffusion Large Language Models by Self-Contrast

cs.CL · 2026-05-02 · unverdicted · novelty 7.0

FoCore uses self-contrast on early-converging high-density tokens to boost diffusion LLM quality on reasoning benchmarks while cutting decoding steps by over 2x.

DepCap: Adaptive Block-Wise Parallel Decoding for Efficient Diffusion LM Inference

cs.LG · 2026-04-17 · unverdicted · novelty 7.0

DepCap accelerates diffusion LM inference up to 5.63x by using last-block influence for adaptive block boundaries and conflict-free token selection for parallel decoding, with negligible quality loss.

DMax: Aggressive Parallel Decoding for dLLMs

cs.LG · 2026-04-09 · conditional · novelty 7.0 · 2 refs

DMax uses On-Policy Uniform Training and Soft Parallel Decoding to enable aggressive parallelism in dLLMs, raising TPF on GSM8K from 2.04 to 5.47 and on MBPP from 2.71 to 5.86 while preserving accuracy.

Dependency-Guided Parallel Decoding in Discrete Diffusion Language Models

cs.CL · 2026-04-02 · unverdicted · novelty 7.0

DEMASK adds a lightweight pairwise-dependency predictor to dLLMs and uses greedy selection to enable parallel unmasking whose total-variation error is provably bounded under sub-additivity.

A Comparative analysis of Layer-wise Representational Capacity in AR and Diffusion LLMs

cs.CL · 2026-03-08 · unverdicted · novelty 7.0

Diffusion language models form more global representations with early-layer redundancy compared to autoregressive models, allowing layer skipping for up to 18.75% FLOP savings while maintaining over 90% performance.

Locally Coherent Parallel Decoding in Diffusion Language Models

cs.CL · 2026-03-03 · unverdicted · novelty 7.0

CoDiLA adds a compact auxiliary AR model on diffusion latents to enforce local sequential validity during parallel token sampling in discrete diffusion language models.

TEAM: Temporal-Spatial Consistency Guided Expert Activation for MoE Diffusion Language Model Acceleration

cs.CL · 2026-02-09 · unverdicted · novelty 7.0

TEAM accelerates MoE dLLMs up to 2.2x by exploiting temporal-spatial consistency in expert routing to accept more tokens with fewer activations.

dMoE: dLLMs with Learnable Block Experts

cs.CL · 2026-05-29 · unverdicted · novelty 6.0

dMoE aggregates token expert distributions to block level in dLLMs, cutting unique experts from 69.5 to 14.6, memory by 76-80%, and latency by 1.14-1.66x while retaining 99.11% performance.

PulseCol: Periodically Refreshed Column-Sparse Attention for Accelerating Diffusion Language Models

cs.CL · 2026-05-20 · unverdicted · novelty 6.0

PulseCol introduces periodically refreshed column-sparse attention to achieve up to 1.95x speedup over FlashAttention in diffusion LLMs with maintained model quality.

Elastic-dLLM: Position Preserving Context Compression and Augmentation of Diffusion LLMs

cs.LG · 2026-05-18 · unverdicted · novelty 6.0

Position-preserving MASK token compression reduces redundancy in diffusion LLMs to accelerate parallel decoding and enable context folding for longer sequences.

Understanding and Accelerating the Training of Masked Diffusion Language Models

cs.LG · 2026-05-13 · conditional · novelty 6.0

Bell-shaped time sampling accelerates masked diffusion language model training by roughly 4x on LM1B by countering locality bias in language data.

BitLM: Unlocking Multi-Token Language Generation with Bitwise Continuous Diffusion

cs.CL · 2026-05-12 · unverdicted · novelty 6.0

BitLM replaces per-token softmax with bitwise continuous diffusion inside causal blocks to generate multiple tokens in parallel while preserving autoregressive structure.

TrajDLM: Topology-Aware Block Diffusion Language Model for Trajectory Generation

cs.LG · 2026-05-11 · unverdicted · novelty 6.0

TrajDLM applies block diffusion language models to discrete road-segment sequences with topology constraints to generate realistic trajectories up to 2.8 times faster than prior methods while supporting zero-shot transfer.

dFlowGRPO: Rate-Aware Policy Optimization for Discrete Flow Models

cs.LG · 2026-05-10 · unverdicted · novelty 6.0

dFlowGRPO is a new rate-aware RL method for discrete flow models that outperforms prior GRPO approaches on image generation and matches continuous flow models while supporting broad probability paths.

ReflectDrive-2: Reinforcement-Learning-Aligned Self-Editing for Discrete Diffusion Driving

cs.RO · 2026-05-06 · unverdicted · novelty 6.0 · 2 refs

ReflectDrive-2 combines masked discrete diffusion with RL-aligned self-editing to generate and refine driving trajectories, reaching 91.0 PDMS on NAVSIM camera-only and 94.8 in best-of-6.

citing papers explorer

Showing 32 of 32 citing papers.

From Table to Cell: Attention for Better Reasoning with TABALIGN cs.AI · 2026-05-14 · unverdicted · none · ref 7 · internal anchor
TABALIGN pairs a diffusion language model planner emitting binary cell masks with a trained attention verifier, raising average accuracy 15.76 points over strong baselines on eight table benchmarks while speeding execution 44.64%.
Factorization-Error-Free Discrete Diffusion Language Model via Speculative Decoding cs.CL · 2026-05-14 · unverdicted · none · ref 2 · internal anchor
FeF-DLLM achieves factorization-error-free generation in discrete diffusion language models via prefix-conditioned posterior factorization and speculative decoding, delivering 5.04 pp higher accuracy and 3.86x faster inference on GSM8K, MATH, HumanEval, and MBPP.
BlockVLA: Accelerating Autoregressive VLA via Block Diffusion Finetuning cs.RO · 2026-05-13 · unverdicted · none · ref 5 · internal anchor
BlockVLA accelerates autoregressive VLA models by 3.3x using block diffusion finetuning, with faster training convergence and better early performance on long-horizon robotic tasks.
Block-R1: Rethinking the Role of Block Size in Multi-domain Reinforcement Learning for Diffusion Large Language Models cs.LG · 2026-05-12 · unverdicted · none · ref 8 · 2 links · internal anchor
Introduces Block-R1 benchmark, Block-R1-41K dataset, and a conflict score to handle domain-specific optimal block sizes in RL post-training of diffusion LLMs.
Infinite Mask Diffusion for Few-Step Distillation cs.CL · 2026-05-11 · unverdicted · none · ref 1 · internal anchor
Infinite Mask Diffusion Models use stochastic infinite-state masks to overcome the factorization error lower bound in standard masked diffusion, achieving superior few-step performance on language tasks via distillation.
Relative Score Policy Optimization for Diffusion Language Models cs.CL · 2026-05-11 · unverdicted · none · ref 62 · internal anchor
RSPO interprets reward advantages as targets for relative log-ratios in dLLMs, calibrating noisy estimates to stabilize RLVR training and achieve strong gains on planning tasks with competitive math reasoning performance.
TAD: Temporal-Aware Trajectory Self-Distillation for Fast and Accurate Diffusion LLM cs.CL · 2026-05-10 · unverdicted · none · ref 34 · internal anchor
TAD improves the accuracy-parallelism trade-off in diffusion LLMs via temporal-aware self-distillation that applies hard labels to soon-to-be-decoded tokens and soft supervision to future tokens.
BadDLM: Backdooring Diffusion Language Models with Diverse Targets cs.CR · 2026-05-10 · unverdicted · none · ref 1 · internal anchor
BadDLM implants effective backdoors in diffusion language models across concept, attribute, alignment, and payload targets by exploiting denoising dynamics while preserving clean performance.
LEAP: Unlocking dLLM Parallelism via Lookahead Early-Convergence Token Detection cs.LG · 2026-05-09 · unverdicted · none · ref 4 · internal anchor
LEAP detects early-converging tokens in dLLMs via future context filtering and multi-sequence superposition, reducing average denoising steps by about 30% while maintaining accuracy.
Focus on the Core: Empowering Diffusion Large Language Models by Self-Contrast cs.CL · 2026-05-02 · unverdicted · none · ref 2 · internal anchor
FoCore uses self-contrast on early-converging high-density tokens to boost diffusion LLM quality on reasoning benchmarks while cutting decoding steps by over 2x.
DepCap: Adaptive Block-Wise Parallel Decoding for Efficient Diffusion LM Inference cs.LG · 2026-04-17 · unverdicted · none · ref 5 · internal anchor
DepCap accelerates diffusion LM inference up to 5.63x by using last-block influence for adaptive block boundaries and conflict-free token selection for parallel decoding, with negligible quality loss.
DMax: Aggressive Parallel Decoding for dLLMs cs.LG · 2026-04-09 · conditional · none · ref 10 · 2 links · internal anchor
DMax uses On-Policy Uniform Training and Soft Parallel Decoding to enable aggressive parallelism in dLLMs, raising TPF on GSM8K from 2.04 to 5.47 and on MBPP from 2.71 to 5.86 while preserving accuracy.
Dependency-Guided Parallel Decoding in Discrete Diffusion Language Models cs.CL · 2026-04-02 · unverdicted · none · ref 3 · internal anchor
DEMASK adds a lightweight pairwise-dependency predictor to dLLMs and uses greedy selection to enable parallel unmasking whose total-variation error is provably bounded under sub-additivity.
A Comparative analysis of Layer-wise Representational Capacity in AR and Diffusion LLMs cs.CL · 2026-03-08 · unverdicted · none · ref 1 · internal anchor
Diffusion language models form more global representations with early-layer redundancy compared to autoregressive models, allowing layer skipping for up to 18.75% FLOP savings while maintaining over 90% performance.
Locally Coherent Parallel Decoding in Diffusion Language Models cs.CL · 2026-03-03 · unverdicted · none · ref 2 · internal anchor
CoDiLA adds a compact auxiliary AR model on diffusion latents to enforce local sequential validity during parallel token sampling in discrete diffusion language models.
TEAM: Temporal-Spatial Consistency Guided Expert Activation for MoE Diffusion Language Model Acceleration cs.CL · 2026-02-09 · unverdicted · none · ref 4 · internal anchor
TEAM accelerates MoE dLLMs up to 2.2x by exploiting temporal-spatial consistency in expert routing to accept more tokens with fewer activations.
dMoE: dLLMs with Learnable Block Experts cs.CL · 2026-05-29 · unverdicted · none · ref 11 · internal anchor
dMoE aggregates token expert distributions to block level in dLLMs, cutting unique experts from 69.5 to 14.6, memory by 76-80%, and latency by 1.14-1.66x while retaining 99.11% performance.
PulseCol: Periodically Refreshed Column-Sparse Attention for Accelerating Diffusion Language Models cs.CL · 2026-05-20 · unverdicted · none · ref 3 · internal anchor
PulseCol introduces periodically refreshed column-sparse attention to achieve up to 1.95x speedup over FlashAttention in diffusion LLMs with maintained model quality.
Elastic-dLLM: Position Preserving Context Compression and Augmentation of Diffusion LLMs cs.LG · 2026-05-18 · unverdicted · none · ref 3 · internal anchor
Position-preserving MASK token compression reduces redundancy in diffusion LLMs to accelerate parallel decoding and enable context folding for longer sequences.
Understanding and Accelerating the Training of Masked Diffusion Language Models cs.LG · 2026-05-13 · conditional · none · ref 3 · internal anchor
Bell-shaped time sampling accelerates masked diffusion language model training by roughly 4x on LM1B by countering locality bias in language data.
BitLM: Unlocking Multi-Token Language Generation with Bitwise Continuous Diffusion cs.CL · 2026-05-12 · unverdicted · none · ref 3 · internal anchor
BitLM replaces per-token softmax with bitwise continuous diffusion inside causal blocks to generate multiple tokens in parallel while preserving autoregressive structure.
TrajDLM: Topology-Aware Block Diffusion Language Model for Trajectory Generation cs.LG · 2026-05-11 · unverdicted · none · ref 4 · internal anchor
TrajDLM applies block diffusion language models to discrete road-segment sequences with topology constraints to generate realistic trajectories up to 2.8 times faster than prior methods while supporting zero-shot transfer.
dFlowGRPO: Rate-Aware Policy Optimization for Discrete Flow Models cs.LG · 2026-05-10 · unverdicted · none · ref 146 · internal anchor
dFlowGRPO is a new rate-aware RL method for discrete flow models that outperforms prior GRPO approaches on image generation and matches continuous flow models while supporting broad probability paths.
ReflectDrive-2: Reinforcement-Learning-Aligned Self-Editing for Discrete Diffusion Driving cs.RO · 2026-05-06 · unverdicted · none · ref 90 · 2 links · internal anchor
ReflectDrive-2 combines masked discrete diffusion with RL-aligned self-editing to generate and refine driving trajectories, reaching 91.0 PDMS on NAVSIM camera-only and 94.8 in best-of-6.
LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model cs.CV · 2026-04-22 · unverdicted · none · ref 3 · internal anchor
LLaDA2.0-Uni unifies multimodal understanding and generation inside one discrete diffusion large language model with a semantic tokenizer, MoE backbone, and diffusion decoder.
Stability-Weighted Decoding for Diffusion Language Models cs.CL · 2026-04-18 · unverdicted · none · ref 3 · internal anchor
Stability-Weighted Decoding improves diffusion LLM accuracy by modulating token scores with temporal stability from KL divergence between prediction steps.
VASR: Variance-Aware Systematic Resampling for Reward-Guided Diffusion cs.AI · 2026-04-08 · unverdicted · none · ref 7 · 2 links · internal anchor
VASR separates continuation and residual variance in reward-guided diffusion SMC, using optimal mass allocation and systematic resampling to achieve up to 26% better FID scores and faster runtimes than prior SMC and MCTS methods.
Differences in Text Generated by Diffusion and Autoregressive Language Models cs.CL · 2026-04-04 · unverdicted · none · ref 4 · internal anchor
DLMs exhibit lower n-gram entropy, higher semantic coherence, and higher semantic diversity than ARMs, primarily due to bidirectional context and remasking decoding strategies.
Training-Trajectory-Aware Token Selection cs.CL · 2026-01-15 · unverdicted · none · ref 2 · internal anchor
Training-Trajectory-Aware Token Selection (T3S) reconstructs the token-level training objective to overcome a performance bottleneck in continual distillation of reasoning capabilities from large to small language models.
TIDE: Efficient and Lossless MoE Diffusion LLM Inference with I/O-aware Expert Offload cs.CL · 2026-05-19 · unverdicted · none · ref 7 · internal anchor
TIDE schedules I/O-aware expert offloading for MoE diffusion LLMs by solving for an optimal refresh interval that exploits temporal stability of activations, yielding up to 1.5x throughput gain losslessly.
ECHO: Efficient Chest X-ray Report Generation with One-step Block Diffusion cs.LG · 2026-04-10 · unverdicted · none · ref 4 · 2 links · internal anchor
ECHO introduces one-step block diffusion via Direct Conditional Distillation and Response-Asymmetric Diffusion to generate chest X-ray reports faster than autoregressive models while improving clinical metrics.
Simple Self-Conditioning Adaptation for Masked Diffusion Models cs.LG · 2026-04-28 · unreviewed · ref 5 · internal anchor

LLaDA2.0: Scaling Up Diffusion Language Models to 100B

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer