Recognition: 2 theorem links · Lean Theorem
Large Language Diffusion Models
Pith reviewed 2026-05-11 01:37 UTC · model grok-4.3
The pith
A diffusion model trained from scratch can match autoregressive LLMs like LLaMA3 on in-context learning and instruction following while better handling reversal tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LLaDA demonstrates that a diffusion model for language, using a forward data masking process and a reverse generation process parameterized by a Transformer to predict masked tokens, can be trained from scratch under the standard pre-training and SFT paradigm and achieve performance comparable to autoregressive models. It scales to 8B parameters, matches strong LLMs such as LLaMA3 8B on in-context learning, shows strong instruction-following after SFT, and surpasses GPT-4o on reversal tasks, thereby challenging the assumption that core LLM capabilities inherently depend on autoregressive architectures.
What carries the argument
The forward masking process combined with a reverse denoising process parameterized by a Transformer that predicts masked tokens, enabling likelihood-bound optimization without sequential token dependencies.
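To make this mechanism concrete, here is a minimal training-step sketch in the style of standard masked-diffusion objectives: sample a masking ratio t uniformly, mask each token independently with probability t, and train the Transformer to recover the masked tokens under a 1/t-weighted cross-entropy. The `model` interface, the `MASK_ID` value, and the exact weighting are illustrative assumptions, not the paper's released implementation.

```python
import torch
import torch.nn.functional as F

MASK_ID = 126336  # placeholder id for the [MASK] token (assumption, not the paper's value)

def masked_diffusion_loss(model, x0, eps=1e-3):
    """One training step of the forward-masking / reverse-prediction objective.

    x0: LongTensor of shape (batch, seq_len) holding clean token ids.
    model(xt) is assumed to return logits of shape (batch, seq_len, vocab)
    computed with full (bidirectional) attention -- no causal mask.
    """
    b, n = x0.shape
    # Forward process: draw a masking ratio t ~ U(0, 1) per sequence
    # and mask each token independently with probability t.
    t = torch.rand(b, 1, device=x0.device).clamp(min=eps)
    is_masked = torch.rand(b, n, device=x0.device) < t
    xt = torch.where(is_masked, torch.full_like(x0, MASK_ID), x0)

    # Reverse model: predict the original token at every masked position.
    logits = model(xt)  # (b, n, vocab)
    ce = F.cross_entropy(
        logits.view(-1, logits.size(-1)), x0.view(-1), reduction="none"
    ).view(b, n)

    # Likelihood-bound weighting: average the masked-token loss and scale by 1/t.
    per_seq = (ce * is_masked).sum(dim=1) / (t.squeeze(1) * n)
    return per_seq.mean()
```

Nothing in this loop fixes a generation order; the only supervision signal is recovery of randomly masked tokens under bidirectional attention.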
If this is right
- Language models can achieve competitive benchmark performance without enforcing left-to-right generation order during training or inference.
- Bidirectional context in the reverse process can reduce the reversal curse that affects autoregressive models on tasks requiring backward reasoning.
- Supervised fine-tuning on diffusion models yields instruction-following behavior comparable to that observed in autoregressive models.
- Scaling laws for diffusion-based language models appear similar to those of autoregressive models on general, math, and code tasks.
Where Pith is reading between the lines
- Diffusion language models may enable parallel or non-sequential sampling strategies that autoregressive models cannot use.
- The masking-based training could be combined with other non-autoregressive techniques to explore hybrid generation methods.
- If reversal performance generalizes, diffusion models might become preferable for tasks that require symmetric or bidirectional reasoning over sequences.
Load-bearing premise
That optimizing a likelihood lower bound through repeated masking and unmasking steps produces coherent high-quality text at scale without needing autoregressive dependencies.
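The bound in question can be made explicit. A standard form for absorbing-mask diffusion (notation here is illustrative; the paper's exact statement may differ) bounds the negative log-likelihood by the expected weighted cross-entropy on masked positions:

$$
-\log p_\theta(x_0) \;\le\; \mathbb{E}_{\,t \sim \mathcal{U}(0,1],\; x_t \sim q_{t\mid 0}(\cdot \mid x_0)}\!\left[\frac{1}{t}\sum_{i\,:\,x_t^i = \texttt{[MASK]}} -\log p_\theta\!\left(x_0^i \mid x_t\right)\right]
$$

Here $q_{t\mid 0}$ masks each token of $x_0$ independently with probability $t$; minimizing the right-hand side maximizes a lower bound on $\log p_\theta(x_0)$, and no left-to-right factorization appears anywhere in the objective.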
What would settle it
A controlled scaling experiment pitting LLaDA 8B (and larger variants) against compute- and data-matched autoregressive baselines on a broad suite of in-context learning and instruction-following benchmarks: a substantial, consistent gap would refute the claim, while sustained parity would confirm it.
read the original abstract
The capabilities of large language models (LLMs) are widely regarded as relying on autoregressive models (ARMs). We challenge this notion by introducing LLaDA, a diffusion model trained from scratch under the pre-training and supervised fine-tuning (SFT) paradigm. LLaDA employs a forward data masking process and a reverse generation process, parameterized by a Transformer to predict masked tokens. It provides a principled generative approach for probabilistic inference by optimizing a likelihood lower bound. Across extensive benchmarks on general tasks, math, code, and so on, LLaDA demonstrates strong scalability and performs comparably to our self-constructed ARM baselines. Remarkably, LLaDA 8B is competitive with strong LLMs like LLaMA3 8B in in-context learning and, after SFT, exhibits impressive instruction-following abilities in case studies such as multi-turn dialogue. Moreover, LLaDA addresses the reversal curse, surpassing GPT-4o in a reversal poem completion task. Our findings show the promise of diffusion models for language modeling at scale and challenge the common assumption that core LLM capabilities discussed above inherently depend on ARMs. Project page and codes: https://ml-gsai.github.io/LLaDA-demo/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces LLaDA, a diffusion language model trained from scratch via a forward masking process and reverse denoising process parameterized by a Transformer. It claims LLaDA 8B achieves performance comparable to autoregressive baselines like LLaMA3 8B across general, math, and code benchmarks, exhibits competitive in-context learning, strong instruction-following after SFT (e.g., multi-turn dialogue), and surpasses GPT-4o on a reversal poem completion task, thereby challenging the view that core LLM capabilities inherently require autoregressive structure.
Significance. If the empirical claims hold under rigorous verification, the work would be significant for providing evidence that diffusion models can scale to 8B parameters and match ARM performance on in-context learning and reversal-curse resistance without left-to-right inductive bias. The public project page and code release are clear strengths that support reproducibility.
major comments (3)
- [Abstract and Results] Abstract and Results section: the claim of comparability to LLaMA3 8B and superiority to GPT-4o on reversal tasks is asserted without any numerical metrics, error bars, or specific benchmark scores in the abstract and is only vaguely referenced in the provided summary; this directly undermines verification of the central scalability and reversal-curse claims.
- [§3] §3 (model description): the reverse generation process uses iterative masked-token prediction with a full-attention Transformer; no analysis is given of whether the masking schedule or unmasking order encodes implicit positional or sequential preferences, which is load-bearing for the claim that capabilities do not depend on AR structure.
- [Reversal-curse experiment] Reversal-curse experiment: the specific task (reversal poem completion) and evaluation protocol are not detailed enough to confirm that the improvement over GPT-4o is not an artifact of the diffusion training objective or data construction.
minor comments (2)
- [Methods] Notation for the likelihood lower bound and masking process could be clarified with an explicit equation reference in the methods.
- [Figures/Tables] Figure legends for benchmark tables should include exact model sizes and training tokens for fair comparison.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback, which helps us strengthen the presentation of our work. We address each of the major comments below and commit to making the suggested revisions to enhance clarity and verifiability.
read point-by-point responses
-
Referee: [Abstract and Results] Abstract and Results section: the claim of comparability to LLaMA3 8B and superiority to GPT-4o on reversal tasks is asserted without any numerical metrics, error bars, or specific benchmark scores in the abstract and is only vaguely referenced in the provided summary; this directly undermines verification of the central scalability and reversal-curse claims.
Authors: We acknowledge that the abstract currently summarizes the results at a high level without specific numbers, which can make immediate verification challenging. In the revised manuscript, we will incorporate key numerical results into the abstract, such as the average benchmark performance scores for LLaDA 8B versus LLaMA3 8B on general tasks, math, and code, as well as the specific accuracy on the reversal poem completion task where LLaDA surpasses GPT-4o. Regarding error bars, we will clarify that results are from single training runs due to computational constraints but will report any available variance from smaller-scale experiments or multiple evaluations. This will allow readers to assess the claims directly from the abstract rather than having to locate the numbers in the main body. revision: yes
-
Referee: [§3] §3 (model description): the reverse generation process uses iterative masked-token prediction with a full-attention Transformer; no analysis is given of whether the masking schedule or unmasking order encodes implicit positional or sequential preferences, which is load-bearing for the claim that capabilities do not depend on AR structure.
Authors: This is a valid point regarding the potential for implicit biases. The forward process in LLaDA applies random masking to tokens without regard to their positions, and the reverse process uses the Transformer with full attention to predict masked tokens iteratively, with unmasking typically proceeding based on prediction confidence rather than a predetermined sequential order. To strengthen this, we will revise §3 to include a dedicated discussion and analysis of the masking schedule and unmasking strategy, demonstrating through description and possibly additional figures or ablations that no left-to-right or positional preference is encoded. This supports our claim that the model's capabilities arise independently of autoregressive inductive biases. (A sketch of such a confidence-ordered unmasking loop appears after these responses.) revision: yes
-
Referee: [Reversal-curse experiment] Reversal-curse experiment: the specific task (reversal poem completion) and evaluation protocol are not detailed enough to confirm that the improvement over GPT-4o is not an artifact of the diffusion training objective or data construction.
Authors: We agree that more details are necessary for full reproducibility and to rule out artifacts. In the revised version, we will expand the description of the reversal poem completion experiment, providing specifics on how the poems are generated and reversed, the exact prompts and input formats used for evaluation, the evaluation protocol (including metrics such as completion accuracy or semantic coherence), and details on the data sources to show that the task construction is independent of the diffusion objective. We will also include the precise numerical comparison to GPT-4o. This will allow independent verification that the results are not due to training artifacts. revision: yes
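The confidence-ordered unmasking discussed in the second response can be illustrated with a short sampling loop. This is a minimal sketch under the assumption that each step commits the highest-confidence predictions in the still-masked response region and leaves the rest masked; the `model` interface, `MASK_ID`, step count, and remasking schedule are placeholders rather than the released inference code.

```python
import torch

MASK_ID = 126336  # placeholder [MASK] id (assumption)

@torch.no_grad()
def confidence_ordered_sample(model, prompt, gen_len=128, steps=64):
    """Generate a response by iteratively unmasking the most confident tokens.

    model(x) is assumed to return logits of shape (1, len(x), vocab) under
    full bidirectional attention. The response region starts fully masked;
    each iteration commits a fixed quota of positions, chosen by prediction
    confidence rather than by left-to-right order.
    """
    device = prompt.device
    x = torch.cat([prompt, torch.full((1, gen_len), MASK_ID, device=device)], dim=1)
    resp = slice(prompt.size(1), prompt.size(1) + gen_len)

    per_step = max(1, gen_len // steps)  # tokens committed per iteration
    for _ in range(steps):
        masked = (x[0, resp] == MASK_ID)
        if not masked.any():
            break
        logits = model(x)[0, resp]                      # (gen_len, vocab)
        probs = torch.softmax(logits, dim=-1)
        conf, pred = probs.max(dim=-1)                  # confidence and argmax token
        conf = torch.where(masked, conf, torch.full_like(conf, -1.0))  # rank masked slots only
        k = min(per_step, int(masked.sum()))
        topk = torch.topk(conf, k).indices              # most confident masked positions
        x[0, resp][topk] = pred[topk]                   # commit them; the rest stay masked

    # Commit anything still masked in one final pass so the sketch always terminates cleanly.
    still = (x[0, resp] == MASK_ID)
    if still.any():
        pred = model(x)[0, resp].argmax(dim=-1)
        x[0, resp][still] = pred[still]
    return x
```

Because every committed token conditions on the full bidirectional context, any positional preference can enter only through the confidence ranking itself, which is precisely the property the referee asks the authors to analyze.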
Circularity Check
No circularity; claims rest on independent empirical training and evaluation
full rationale
The paper proposes LLaDA as a new diffusion-based language model trained from scratch using forward masking and reverse denoising with a Transformer backbone. All performance claims (scalability, in-context learning parity with LLaMA3 8B, instruction following, reversal-curse resistance) are supported by direct training runs, benchmark results, and comparisons to separately constructed autoregressive baselines. No derivation step reduces a prediction to a fitted parameter by construction, invokes a self-citation as the sole justification for a uniqueness theorem, or renames an input as an output. The central argument is falsifiable via the reported experiments rather than tautological.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 59 Pith papers
-
Block-R1: Rethinking the Role of Block Size in Multi-domain Reinforcement Learning for Diffusion Large Language Models
Block-R1 formulates domain block size conflicts in multi-domain RL for dLLMs, releases a 41K-sample dataset with per-sample best block sizes and a conflict score, and provides a benchmark plus simple cross-domain trai...
-
Dynamic Execution Commitment of Vision-Language-Action Models
A3 determines the execution horizon in VLA models as the longest prefix of actions that passes consensus-based verification and sequential consistency checks.
-
UniRank: Unified List-wise Reranking via Confidence-Ordered Denoising
UniRank unifies autoregressive and non-autoregressive list-wise reranking via bidirectional modeling in a confidence-ordered iterative denoising process, outperforming baselines on datasets and online tests.
-
TAD: Temporal-Aware Trajectory Self-Distillation for Fast and Accurate Diffusion LLM
TAD improves the accuracy-parallelism trade-off in diffusion LLMs via temporal-aware self-distillation that applies hard labels to soon-to-be-decoded tokens and soft supervision to future tokens.
-
BadDLM: Backdooring Diffusion Language Models with Diverse Targets
BadDLM implants effective backdoors in diffusion language models across concept, attribute, alignment, and payload targets by exploiting denoising dynamics while preserving clean performance.
-
Discrete Langevin-Inspired Posterior Sampling
ΔLPS is a gradient-guided discrete posterior sampler for inverse problems that works with masked or uniform discrete diffusion priors and outperforms prior discrete methods on image restoration tasks.
-
LEAP: Unlocking dLLM Parallelism via Lookahead Early-Convergence Token Detection
LEAP detects early-converging tokens in dLLMs via future context filtering and multi-sequence superposition, reducing average denoising steps by about 30% while maintaining accuracy.
-
Guidance Is Not a Hyperparameter: Learning Dynamic Control in Diffusion Language Models
Adaptive guidance trajectories learned via PPO outperform fixed-scale CFG on controllability-quality balance in three controlled NLP generation tasks with discrete diffusion models.
-
GPO-V: Jailbreak Diffusion Vision Language Model by Global Probability Optimization
GPO-V jailbreaks dVLMs by globally optimizing probabilities in the denoising process to bypass refusal patterns, achieving stealthy and transferable attacks.
-
GPO-V: Jailbreak Diffusion Vision Language Model by Global Probability Optimization
GPO-V is a visual jailbreak framework that bypasses safety guardrails in diffusion VLMs by globally manipulating generative probabilities during denoising.
-
DiffRetriever: Parallel Representative Tokens for Retrieval with Diffusion Language Models
DiffRetriever generates multiple representative tokens in parallel using diffusion language models, yielding consistent retrieval gains over single-token baselines and autoregressive multi-token variants on BEIR benchmarks.
-
ReflectDrive-2: Reinforcement-Learning-Aligned Self-Editing for Discrete Diffusion Driving
ReflectDrive-2 achieves 91.0 PDMS on NAVSIM with camera input by training a discrete diffusion model to self-edit trajectories via RL-aligned AutoEdit.
-
Focus on the Core: Empowering Diffusion Large Language Models by Self-Contrast
FoCore uses self-contrast on early-converging high-density tokens to boost diffusion LLM quality on reasoning benchmarks while cutting decoding steps by over 2x.
-
DARE: Diffusion Language Model Activation Reuse for Efficient Inference
DARE reuses up to 87% of attention activations in diffusion LLMs through KV caching and output reuse, delivering 1.2x per-layer latency gains with average performance drops of 1.2-2.0%.
-
Dream-Cubed: Controllable Generative Modeling in Minecraft by Training on Billions of Cubes
Dream-Cubed releases a billion-scale voxel dataset and 3D diffusion models that generate controllable Minecraft worlds by operating directly on blocks.
-
$R^2$-dLLM: Accelerating Diffusion Large Language Models via Spatio-Temporal Redundancy Reduction
R²-dLLM reduces dLLM decoding steps by up to 75% via spatio-temporal redundancy reduction while keeping generation quality competitive.
-
NI Sampling: Accelerating Discrete Diffusion Sampling by Token Order Optimization
NI Sampling accelerates discrete diffusion language models up to 14.3 times by training a neural indicator to select which tokens to sample at each step using a trajectory-preserving objective.
-
One Pass for All: A Discrete Diffusion Model for Knowledge Graph Triple Set Prediction
DiffTSP applies discrete diffusion to knowledge graph triple set prediction, recovering all missing triples simultaneously via edge-masking noise reversal and a structure-aware transformer, achieving SOTA on three datasets.
-
DepCap: Adaptive Block-Wise Parallel Decoding for Efficient Diffusion LM Inference
DepCap accelerates diffusion LM inference up to 5.63x by using last-block influence for adaptive block boundaries and conflict-free token selection for parallel decoding, with negligible quality loss.
-
BARD: Bridging AutoRegressive and Diffusion Vision-Language Models Via Highly Efficient Progressive Block Merging and Stage-Wise Distillation
BARD bridges autoregressive and diffusion VLMs with progressive block merging plus stage-wise intra-diffusion distillation, delivering 3x speedup and new SOTA on open dVLMs using under 4.4M data points.
-
LangFlow: Continuous Diffusion Rivals Discrete in Language Modeling
LangFlow is the first continuous diffusion language model to rival discrete diffusion on perplexity and generative perplexity while exceeding autoregressive baselines on several zero-shot tasks.
-
Unlocking Prompt Infilling Capability for Diffusion Language Models
Full-sequence masking in SFT unlocks prompt infilling for masked diffusion language models, producing templates that match or surpass hand-designed ones and transfer across models.
-
NeuralLVC: Neural Lossless Video Compression via Masked Diffusion with Temporal Conditioning
NeuralLVC achieves better lossless compression than H.264 and H.265 on video sequences by combining masked diffusion with temporal conditioning on frame differences.
-
Dependency-Guided Parallel Decoding in Discrete Diffusion Language Models
DEMASK adds a lightweight pairwise-dependency predictor to dLLMs and uses greedy selection to enable parallel unmasking whose total-variation error is provably bounded under sub-additivity.
-
Self-Distilled Trajectory-Aware Boltzmann Modeling: Bridging the Training-Inference Discrepancy in Diffusion Language Models
TABOM models inference unmasking preferences as a Boltzmann distribution over predictive entropies and derives a ranking loss to align DLM training with observed trajectories, yielding gains in new domains and reduced...
-
BitLM: Unlocking Multi-Token Language Generation with Bitwise Continuous Diffusion
BitLM replaces per-token softmax with bitwise continuous diffusion inside causal blocks to generate multiple tokens in parallel while preserving autoregressive structure.
-
TrajDLM: Topology-Aware Block Diffusion Language Model for Trajectory Generation
TrajDLM applies block diffusion language models to discrete road-segment sequences with topology constraints to generate realistic trajectories up to 2.8 times faster than prior methods while supporting zero-shot transfer.
-
Primal-Dual Guided Decoding for Constrained Discrete Diffusion
Primal-dual guided decoding casts constrained discrete diffusion as a KL-regularized optimization solved online with adaptive Lagrangian multipliers to satisfy constraints while staying close to the unconstrained mode...
-
Edit-Based Refinement for Parallel Masked Diffusion Language Models
ME-DLM augments parallel masked diffusion models with edit-distance-supervised refinements to raise quality on coding and math benchmarks while using far fewer diffusion steps.
-
DVD: Discrete Voxel Diffusion for 3D Generation and Editing
DVD treats voxel occupancy as a discrete variable in a diffusion framework to generate, assess, and edit sparse 3D voxels without continuous thresholding.
-
TextLDM: Language Modeling with Continuous Latent Diffusion
TextLDM applies DiT-style latent diffusion with flow matching to language modeling via a REPA-aligned VAE, outperforming prior diffusion LMs and matching GPT-2 when trained from scratch on OpenWebText2.
-
Coupling Models for One-Step Discrete Generation
Coupling Models enable single-step discrete sequence generation via learned couplings to Gaussian latents and outperform prior one-step baselines on text perplexity, biological FBD, and image FID metrics.
-
Continuous Latent Diffusion Language Model
Cola DLM proposes a hierarchical latent diffusion model that learns a text-to-latent mapping, fits a global semantic prior in continuous space with a block-causal DiT, and performs conditional decoding, establishing l...
-
ReflectDrive-2: Reinforcement-Learning-Aligned Self-Editing for Discrete Diffusion Driving
ReflectDrive-2 combines masked discrete diffusion with RL-aligned self-editing to generate and refine driving trajectories, reaching 91.0 PDMS on NAVSIM camera-only and 94.8 in best-of-6.
-
Predict-then-Diffuse: Adaptive Response Length for Compute-Budgeted Inference in Diffusion LLMs
Predict-then-Diffuse predicts response lengths for diffusion LLMs via an auxiliary model and safety buffer to reduce FLOP waste while preserving output quality.
-
Break the Block: Dynamic-size Reasoning Blocks for Diffusion Large Language Models via Monotonic Entropy Descent with Reinforcement Learning
b1 trains dLLMs to dynamically select reasoning block sizes via monotonic entropy descent with RL, improving coherence over fixed-size baselines on reasoning benchmarks.
-
Towards A Generative Protein Evolution Machine with DPLM-Evo
DPLM-Evo is an evolutionary discrete diffusion framework that models protein sequences via explicit substitution, insertion, and deletion operations, achieving state-of-the-art single-sequence mutation effect predicti...
-
Consistent Diffusion Language Models
CDLM trains denoisers to be path-invariant across stochastic posterior bridges in discrete diffusion, unifying prior methods and achieving new SOTA few-step text generation performance.
-
Simple Self-Conditioning Adaptation for Masked Diffusion Models
SCMDM adapts trained masked diffusion models to condition denoising steps on their own prior clean predictions, cutting generative perplexity nearly in half on open-web text while improving discretized image, molecule...
-
dWorldEval: Scalable Robotic Policy Evaluation via Discrete Diffusion World Model
A discrete diffusion model tokenizes multimodal robotic data and uses a progress token to predict future states and task completion for scalable policy evaluation.
-
LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model
LLaDA2.0-Uni unifies multimodal understanding and generation inside one discrete diffusion large language model with a semantic tokenizer, MoE backbone, and diffusion decoder.
-
A Universal Avoidance Method for Diverse Multi-branch Generation
UAG is a universal avoidance generation method that increases multi-branch diversity in diffusion and transformer models by penalizing output similarity, delivering up to 1.9x higher diversity with 4.4x speed and 1/64...
-
Stability-Weighted Decoding for Diffusion Language Models
Stability-Weighted Decoding improves diffusion LLM accuracy by modulating token scores with temporal stability from KL divergence between prediction steps.
-
MemExplorer: Navigating the Heterogeneous Memory Design Space for Agentic Inference NPUs
MemExplorer optimizes heterogeneous memory systems for agentic LLM inference on NPUs and reports up to 2.3x higher energy efficiency than baselines under fixed power budgets.
-
Towards Faster Language Model Inference Using Mixture-of-Experts Flow Matching
Mixture-of-experts flow matching enables non-autoregressive language models to achieve autoregressive-level quality in three sampling steps, delivering up to 1000x faster inference than diffusion models.
-
Dataset-Level Metrics Attenuate Non-Determinism: A Fine-Grained Non-Determinism Evaluation in Diffusion Language Models
Dataset-level metrics in diffusion language models mask substantial sample-level non-determinism that varies with model and system factors, which a new Factor Variance Attribution metric can decompose.
-
Uni-ViGU: Towards Unified Video Generation and Understanding via A Diffusion-Based Video Generator
Uni-ViGU unifies video generation and understanding by extending a diffusion video generator with unified continuous-discrete flow matching, modality-driven MoE layers, and bidirectional training stages that repurpose...
-
Thinking Diffusion: Penalize and Guide Visual-Grounded Reasoning in Diffusion Multimodal Language Models
Position and step penalty plus visual reasoning guidance fix premature answering and weak visual grounding in diffusion MLLMs, delivering up to 7.5% accuracy gains and over 3x speedup.
-
MolDA: Molecular Understanding and Generation via Large Language Diffusion Model
MolDA is a multimodal molecular model that uses a discrete large language diffusion backbone plus a hybrid graph encoder to achieve better global coherence and validity than autoregressive approaches.
-
Chainwash: Multi-Step Rewriting Attacks on Diffusion Language Model Watermarks
Chained rewrites by open-weight LLMs reduce watermark detection on diffusion LM outputs from 87.9% to 4.86% after five steps across multiple styles and models.
-
Scaling Properties of Continuous Diffusion Spoken Language Models
Continuous diffusion spoken language models follow scaling laws for loss and phoneme divergence and generate emotive multi-speaker speech at 16B scale, though long-form coherence stays difficult.
-
DALM: A Domain-Algebraic Language Model via Three-Phase Structured Generation
DALM is a proposed language model architecture that enforces algebraic constraints via a three-phase process over domain lattices to prevent cross-domain knowledge contamination during generation.
-
Training-free, Perceptually Consistent Low-Resolution Previews with High-Resolution Image for Efficient Workflows of Diffusion Models
A commutator-zero condition enables training-free generation of perceptually consistent low-resolution previews for high-resolution diffusion model outputs, achieving up to 33% computation reduction.
-
Breaking Block Boundaries: Anchor-based History-stable Decoding for Diffusion Large Language Models
AHD uses real-time stability monitoring with dynamic anchors to allow early cross-block decoding of converged tokens, cutting steps by up to 80% and raising performance on benchmarks like BBH.
-
DMax: Aggressive Parallel Decoding for dLLMs
DMax enables faster parallel decoding in diffusion language models by using on-policy training to recover from errors and soft embedding interpolations for iterative revision, boosting tokens per forward pass roughly ...
-
FastDiSS: Few-step Match Many-step Diffusion Language Model on Sequence-to-Sequence Generation--Full Version
A training framework perturbs self-conditioning signals in diffusion language models to match few-step inference noise, enabling up to 400x faster sampling while surpassing standard continuous diffusion performance on...
-
On the Quantization Robustness of Diffusion Language Models in Coding Benchmarks
Diffusion coding model CoDA shows smaller accuracy drops than Qwen3-1.7B under 2-4 bit quantization on HumanEval and MBPP.
-
OpenWorldLib: A Unified Codebase and Definition of Advanced World Models
OpenWorldLib offers a standardized codebase and definition for world models that combine perception, interaction, and memory to understand and predict the world.
-
Low-Rank Adaptation Redux for Large Models
An overview revisits LoRA variants by categorizing advances in architectural design, efficient optimization, and applications while linking them to classical signal processing tools for principled fine-tuning.