LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.
hub Canonical reference
Self- conditioned embedding diffusion for text generation
Canonical reference. 83% of citing Pith papers cite this work as background.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
Infilling extraction on diffusion language models extracts up to three times more verbatim sequences than prefix methods and achieves higher recall on redacted emails than autoregressive models.
First dedicated survey organizing diffusion and flow matching models for tabular data synthesis, imputation, anomaly detection, and related tasks, covering literature from 2015 to 2026 and highlighting open problems.
Absorbing discrete diffusion models the conditional distributions of clean data; reparameterizing yields a time-independent RADD that unifies with AO-ARMs and reaches SOTA perplexity among diffusion models on zero-shot language benchmarks.
RePlaid achieves a 20x compute gap to autoregressive models, new SOTA PPL of 22.1 among continuous DLMs on OpenWebText, and competitive scaling laws by aligning architecture with modern discrete DLMs.
Manta-LM approximates the HJB equation via flow matching in latent control space to realize closed-loop optimal control for language generation.
DSL provides a continuous embedding framework where one denoiser supports a family of SNR paths for discrete sequences, improving MAUVE scores on OpenWebText and allowing random-order and hybrid sampling from a fine-tuned MDLM checkpoint.
BitLM replaces per-token softmax with bitwise continuous diffusion inside causal blocks to generate multiple tokens in parallel while preserving autoregressive structure.
ELF applies continuous-time flow matching in embedding space for language generation and reports outperforming prior discrete and continuous diffusion language models with fewer steps.
Cola DLM proposes a hierarchical latent diffusion model that learns a text-to-latent mapping, fits a global semantic prior in continuous space with a block-causal DiT, and performs conditional decoding, establishing latent prior modeling as an alternative to token-level autoregressive language model
Continuous flows on token embeddings with flow-map distillation produce one-step language models whose quality exceeds recent 8-step discrete diffusion baselines on LM1B and OpenWebText.
LLaDA-V is a diffusion-based multimodal large language model that reaches competitive or state-of-the-art results on visual instruction tasks while using a non-autoregressive architecture.
BA-Att introduces pre-downsampled block selection with norm-sorting and diagonal covariance correction to approximate sparse attention, yielding up to 6.95x speedup at 50% sparsity across language, multimodal, and video models.
A tutorial that unifies diffusion probabilistic models, score-based generative modeling, and SDE methods by deriving forward and reverse dynamics from a shared Gaussian noising process.
citing papers explorer
-
Large Language Diffusion Models
LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.
-
Extracting Training Data from Diffusion Language Models via Infilling
Infilling extraction on diffusion language models extracts up to three times more verbatim sequences than prefix methods and achieves higher recall on redacted emails than autoregressive models.
-
Diffusion and Flow Matching Models for Tabular Data: A Survey
First dedicated survey organizing diffusion and flow matching models for tabular data synthesis, imputation, anomaly detection, and related tasks, covering literature from 2015 to 2026 and highlighting open problems.
-
Your Absorbing Discrete Diffusion Secretly Models the Conditional Distributions of Clean Data
Absorbing discrete diffusion models the conditional distributions of clean data; reparameterizing yields a time-independent RADD that unifies with AO-ARMs and reaches SOTA perplexity among diffusion models on zero-shot language benchmarks.
-
Continuous Diffusion Scales Competitively with Discrete Diffusion for Language
RePlaid achieves a 20x compute gap to autoregressive models, new SOTA PPL of 22.1 among continuous DLMs on OpenWebText, and competitive scaling laws by aligning architecture with modern discrete DLMs.
-
Language Generation as Optimal Control: Closed-Loop Diffusion in Latent Control Space
Manta-LM approximates the HJB equation via flow matching in latent control space to realize closed-loop optimal control for language generation.
-
Discrete Stochastic Localization for Non-autoregressive Generation
DSL provides a continuous embedding framework where one denoiser supports a family of SNR paths for discrete sequences, improving MAUVE scores on OpenWebText and allowing random-order and hybrid sampling from a fine-tuned MDLM checkpoint.
-
BitLM: Unlocking Multi-Token Language Generation with Bitwise Continuous Diffusion
BitLM replaces per-token softmax with bitwise continuous diffusion inside causal blocks to generate multiple tokens in parallel while preserving autoregressive structure.
-
ELF: Embedded Language Flows
ELF applies continuous-time flow matching in embedding space for language generation and reports outperforming prior discrete and continuous diffusion language models with fewer steps.
-
Continuous Latent Diffusion Language Model
Cola DLM proposes a hierarchical latent diffusion model that learns a text-to-latent mapping, fits a global semantic prior in continuous space with a block-causal DiT, and performs conditional decoding, establishing latent prior modeling as an alternative to token-level autoregressive language model
-
Flow Map Language Models: One-step Language Modeling via Continuous Denoising
Continuous flows on token embeddings with flow-map distillation produce one-step language models whose quality exceeds recent 8-step discrete diffusion baselines on LM1B and OpenWebText.
-
LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning
LLaDA-V is a diffusion-based multimodal large language model that reaches competitive or state-of-the-art results on visual instruction tasks while using a non-autoregressive architecture.
-
Efficient Long-Context Modeling in Diffusion Language Models via Block Approximate Sparse Attention
BA-Att introduces pre-downsampled block selection with norm-sorting and diagonal covariance correction to approximate sparse attention, yielding up to 6.95x speedup at 50% sparsity across language, multimodal, and video models.
-
A Tutorial on Diffusion Theory: From Differential Equations to Diffusion Models
A tutorial that unifies diffusion probabilistic models, score-based generative modeling, and SDE methods by deriving forward and reverse dynamics from a shared Gaussian noising process.
- Simple Self-Conditioning Adaptation for Masked Diffusion Models