LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.
hub Canonical reference
Simplified and generalized masked diffusion for discrete data
Canonical reference. 100% of citing Pith papers cite this work as background.
hub tools
citation-role summary
citation-polarity summary
roles
background 5polarities
background 5representative citing papers
A policy network learns to choose unmasking order in masked diffusion by reweighting the loss, outperforming random and heuristic baselines on ordering-sensitive tasks.
FLDD learns non-Markovian marginal and posterior distributions for the forward process so a factorized reverse process can match the target better and produce higher-quality samples in fewer steps.
TABOM is a trajectory-aligned Boltzmann modeling framework that turns self-distilled inference paths into a pairwise ranking loss to close the training-inference gap in diffusion language models and expand their effective capabilities.
Introduces Block-R1 benchmark, Block-R1-41K dataset, and a conflict score to handle domain-specific optimal block sizes in RL post-training of diffusion LLMs.
Early and late denoising steps in masked diffusion LMs are robust to smaller-model replacement, enabling 17% FLOPs reduction with modest generative quality loss.
Fast-dLLM adds reusable KV cache blocks and selective parallel decoding to diffusion LLMs, closing most of the speed gap with autoregressive models without retraining.
First dedicated survey organizing diffusion and flow matching models for tabular data synthesis, imputation, anomaly detection, and related tasks, covering literature from 2015 to 2026 and highlighting open problems.
A narrative survey that catalogs fifty papers on diffusion-based adversarial techniques across text, vision, and vision-language models, proposes a six-class taxonomy of diffusion roles plus a unified five-dimension evaluation framework, and releases a companion catalog.
FP-MGMs with consistency loss and three-state reuse (CoFRe) reduce parameters by up to 38.8% and improve low-budget perplexity and FID versus standard masked generative models on text and images.
GDSD reduces RL for dLLMs to likelihood-free self-distillation via a normalization-free logit-matching objective, outperforming ELBO methods with more stable training on LLaDA-8B and Dream-7B.
VQ-SAD combines a pretrained VQ-VAE with diffusion models by using its codebooks as discrete tokenizers for atoms and bonds, yielding slight improvements over prior diffusion methods on QM9 and ZINC250k.
Generative perplexity and entropy are shown to be the two additive components of KL divergence to a reference distribution, motivating generative frontiers as a principled evaluation method for diffusion language models.
DLMs show early answer convergence allowing Prophet to cut decoding steps by up to 3.4x on LLaDA-8B and Dream-7B while keeping output quality.
Seed Diffusion Preview is a discrete diffusion language model that reaches 2146 tokens per second inference on H20 GPUs with competitive code benchmark performance, establishing a new speed-quality Pareto frontier.
LLaDA-V is a diffusion-based multimodal large language model that reaches competitive or state-of-the-art results on visual instruction tasks while using a non-autoregressive architecture.
Adapting autoregressive models via continual pre-training yields diffusion language models from 127M to 7B parameters that outperform prior diffusion models and compete with their autoregressive counterparts on language, reasoning, and commonsense benchmarks.
citing papers explorer
-
Large Language Diffusion Models
LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.
-
Adaptive Order Policies for Masked Diffusion
A policy network learns to choose unmasking order in masked diffusion by reweighting the loss, outperforming random and heuristic baselines on ordering-sensitive tasks.
-
Forward-Learned Discrete Diffusion: Learning how to noise to denoise faster
FLDD learns non-Markovian marginal and posterior distributions for the forward process so a factorized reverse process can match the target better and produce higher-quality samples in fewer steps.
-
Self-Distilled Trajectory-Aware Boltzmann Modeling: Bridging the Training-Inference Discrepancy in Diffusion Language Models
TABOM is a trajectory-aligned Boltzmann modeling framework that turns self-distilled inference paths into a pairwise ranking loss to close the training-inference gap in diffusion language models and expand their effective capabilities.
-
Block-R1: Rethinking the Role of Block Size in Multi-domain Reinforcement Learning for Diffusion Large Language Models
Introduces Block-R1 benchmark, Block-R1-41K dataset, and a conflict score to handle domain-specific optimal block sizes in RL post-training of diffusion LLMs.
-
Not All Denoising Steps Are Equal: Model Scheduling for Faster Masked Diffusion Language Models
Early and late denoising steps in masked diffusion LMs are robust to smaller-model replacement, enabling 17% FLOPs reduction with modest generative quality loss.
-
Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding
Fast-dLLM adds reusable KV cache blocks and selective parallel decoding to diffusion LLMs, closing most of the speed gap with autoregressive models without retraining.
-
Diffusion and Flow Matching Models for Tabular Data: A Survey
First dedicated survey organizing diffusion and flow matching models for tabular data synthesis, imputation, anomaly detection, and related tasks, covering literature from 2015 to 2026 and highlighting open problems.
-
Adversarial Diffusion Across Modalities: A Fusion Survey of Attacks, Defenses, and Evaluation for Text, Vision, and Vision-Language Models
A narrative survey that catalogs fifty papers on diffusion-based adversarial techniques across text, vision, and vision-language models, proposes a six-class taxonomy of diffusion roles plus a unified five-dimension evaluation framework, and releases a companion catalog.
-
Fixed-Point Masked Generative Modeling
FP-MGMs with consistency loss and three-state reuse (CoFRe) reduce parameters by up to 38.8% and improve low-budget perplexity and FID versus standard masked generative models on text and images.
-
GDSD: Reinforcement Learning as Guided Denoiser Self-Distillation for Diffusion Language Models
GDSD reduces RL for dLLMs to likelihood-free self-distillation via a normalization-free logit-matching objective, outperforming ELBO methods with more stable training on LLaDA-8B and Dream-7B.
-
VQ-SAD: Vector Quantized Structure Aware Diffusion For Molecule Generation
VQ-SAD combines a pretrained VQ-VAE with diffusion models by using its codebooks as discrete tokenizers for atoms and bonds, yielding slight improvements over prior diffusion methods on QM9 and ZINC250k.
-
Generative Frontiers: Why Evaluation Matters for Diffusion Language Models
Generative perplexity and entropy are shown to be the two additive components of KL divergence to a reference distribution, motivating generative frontiers as a principled evaluation method for diffusion language models.
-
Diffusion Language Models Know the Answer Before Decoding
DLMs show early answer convergence allowing Prophet to cut decoding steps by up to 3.4x on LLaDA-8B and Dream-7B while keeping output quality.
-
Seed Diffusion: A Large-Scale Diffusion Language Model with High-Speed Inference
Seed Diffusion Preview is a discrete diffusion language model that reaches 2146 tokens per second inference on H20 GPUs with competitive code benchmark performance, establishing a new speed-quality Pareto frontier.
-
LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning
LLaDA-V is a diffusion-based multimodal large language model that reaches competitive or state-of-the-art results on visual instruction tasks while using a non-autoregressive architecture.
-
Scaling Diffusion Language Models via Adaptation from Autoregressive Models
Adapting autoregressive models via continual pre-training yields diffusion language models from 127M to 7B parameters that outperform prior diffusion models and compete with their autoregressive counterparts on language, reasoning, and commonsense benchmarks.
- Learned Relay Representations for Forward-Thinking Discrete Diffusion Models