Recognition: 2 theorem links · Lean Theorem
Large Language Diffusion Models
Pith reviewed 2026-05-11 01:37 UTC · model grok-4.3
The pith
A diffusion model trained from scratch can match autoregressive LLMs like LLaMA3 on in-context learning and instruction following while better handling reversal tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LLaDA demonstrates that a diffusion model for language, using a forward data masking process and a reverse generation process parameterized by a Transformer to predict masked tokens, can be trained from scratch under the standard pre-training and SFT paradigm and achieve performance comparable to autoregressive models. It scales to 8B parameters, matches strong LLMs such as LLaMA3 8B on in-context learning, shows strong instruction-following after SFT, and surpasses GPT-4o on reversal tasks, thereby challenging the assumption that core LLM capabilities inherently depend on autoregressive architectures.
What carries the argument
The forward masking process combined with a reverse denoising process parameterized by a Transformer that predicts masked tokens, enabling likelihood-bound optimization without sequential token dependencies.
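To make this mechanism concrete, here is a minimal training-step sketch in the style of standard masked-diffusion objectives: sample a masking ratio t uniformly, mask each token independently with probability t, and train the Transformer to recover the masked tokens under a 1/t-weighted cross-entropy. The `model` interface, the `MASK_ID` value, and the exact weighting are illustrative assumptions, not the paper's released implementation.

```python
import torch
import torch.nn.functional as F

MASK_ID = 126336  # placeholder id for the [MASK] token (assumption, not the paper's value)

def masked_diffusion_loss(model, x0, eps=1e-3):
    """One training step of the forward-masking / reverse-prediction objective.

    x0: LongTensor of shape (batch, seq_len) holding clean token ids.
    model(xt) is assumed to return logits of shape (batch, seq_len, vocab)
    computed with full (bidirectional) attention -- no causal mask.
    """
    b, n = x0.shape
    # Forward process: draw a masking ratio t ~ U(0, 1) per sequence
    # and mask each token independently with probability t.
    t = torch.rand(b, 1, device=x0.device).clamp(min=eps)
    is_masked = torch.rand(b, n, device=x0.device) < t
    xt = torch.where(is_masked, torch.full_like(x0, MASK_ID), x0)

    # Reverse model: predict the original token at every masked position.
    logits = model(xt)  # (b, n, vocab)
    ce = F.cross_entropy(
        logits.view(-1, logits.size(-1)), x0.view(-1), reduction="none"
    ).view(b, n)

    # Likelihood-bound weighting: average the masked-token loss and scale by 1/t.
    per_seq = (ce * is_masked).sum(dim=1) / (t.squeeze(1) * n)
    return per_seq.mean()
```

Nothing in this loop fixes a generation order; the only supervision signal is recovery of randomly masked tokens under bidirectional attention.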
If this is right
- Language models can achieve competitive benchmark performance without enforcing left-to-right generation order during training or inference.
- Bidirectional context in the reverse process can reduce the reversal curse that affects autoregressive models on tasks requiring backward reasoning.
- Supervised fine-tuning on diffusion models yields instruction-following behavior comparable to that observed in autoregressive models.
- Scaling laws for diffusion-based language models appear similar to those of autoregressive models on general, math, and code tasks.
Where Pith is reading between the lines
- Diffusion language models may enable parallel or non-sequential sampling strategies that autoregressive models cannot use.
- The masking-based training could be combined with other non-autoregressive techniques to explore hybrid generation methods.
- If reversal performance generalizes, diffusion models might become preferable for tasks that require symmetric or bidirectional reasoning over sequences.
Load-bearing premise
That optimizing a likelihood lower bound through repeated masking and unmasking steps produces coherent high-quality text at scale without needing autoregressive dependencies.
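The bound in question can be made explicit. A standard form for absorbing-mask diffusion (notation here is illustrative; the paper's exact statement may differ) bounds the negative log-likelihood by the expected weighted cross-entropy on masked positions:

$$
-\log p_\theta(x_0) \;\le\; \mathbb{E}_{\,t \sim \mathcal{U}(0,1],\; x_t \sim q_{t\mid 0}(\cdot \mid x_0)}\!\left[\frac{1}{t}\sum_{i\,:\,x_t^i = \texttt{[MASK]}} -\log p_\theta\!\left(x_0^i \mid x_t\right)\right]
$$

Here $q_{t\mid 0}$ masks each token of $x_0$ independently with probability $t$; minimizing the right-hand side maximizes a lower bound on $\log p_\theta(x_0)$, and no left-to-right factorization appears anywhere in the objective.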
What would settle it
A controlled scaling experiment pitting LLaDA 8B (and larger variants) against compute- and data-matched autoregressive baselines on a broad suite of in-context learning and instruction-following benchmarks: a substantial, consistent gap would refute the claim, while sustained parity would confirm it.
read the original abstract
The capabilities of large language models (LLMs) are widely regarded as relying on autoregressive models (ARMs). We challenge this notion by introducing LLaDA, a diffusion model trained from scratch under the pre-training and supervised fine-tuning (SFT) paradigm. LLaDA employs a forward data masking process and a reverse generation process, parameterized by a Transformer to predict masked tokens. It provides a principled generative approach for probabilistic inference by optimizing a likelihood lower bound. Across extensive benchmarks on general tasks, math, code, and so on, LLaDA demonstrates strong scalability and performs comparably to our self-constructed ARM baselines. Remarkably, LLaDA 8B is competitive with strong LLMs like LLaMA3 8B in in-context learning and, after SFT, exhibits impressive instruction-following abilities in case studies such as multi-turn dialogue. Moreover, LLaDA addresses the reversal curse, surpassing GPT-4o in a reversal poem completion task. Our findings show the promise of diffusion models for language modeling at scale and challenge the common assumption that core LLM capabilities discussed above inherently depend on ARMs. Project page and codes: https://ml-gsai.github.io/LLaDA-demo/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces LLaDA, a diffusion language model trained from scratch via a forward masking process and reverse denoising process parameterized by a Transformer. It claims LLaDA 8B achieves performance comparable to autoregressive baselines like LLaMA3 8B across general, math, and code benchmarks, exhibits competitive in-context learning, strong instruction-following after SFT (e.g., multi-turn dialogue), and surpasses GPT-4o on a reversal poem completion task, thereby challenging the view that core LLM capabilities inherently require autoregressive structure.
Significance. If the empirical claims hold under rigorous verification, the work would be significant for providing evidence that diffusion models can scale to 8B parameters and match ARM performance on in-context learning and reversal-curse resistance without left-to-right inductive bias. The public project page and code release are clear strengths that support reproducibility.
major comments (3)
- [Abstract and Results] Abstract and Results section: the claim of comparability to LLaMA3 8B and superiority to GPT-4o on reversal tasks is asserted without any numerical metrics, error bars, or specific benchmark scores in the abstract and is only vaguely referenced in the provided summary; this directly undermines verification of the central scalability and reversal-curse claims.
- [§3] §3 (model description): the reverse generation process uses iterative masked-token prediction with a full-attention Transformer; no analysis is given of whether the masking schedule or unmasking order encodes implicit positional or sequential preferences, which is load-bearing for the claim that capabilities do not depend on AR structure.
- [Reversal-curse experiment] Reversal-curse experiment: the specific task (reversal poem completion) and evaluation protocol are not detailed enough to confirm that the improvement over GPT-4o is not an artifact of the diffusion training objective or data construction.
minor comments (2)
- [Methods] Notation for the likelihood lower bound and masking process could be clarified with an explicit equation reference in the methods.
- [Figures/Tables] Figure legends for benchmark tables should include exact model sizes and training tokens for fair comparison.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback, which helps us strengthen the presentation of our work. We address each of the major comments below and commit to making the suggested revisions to enhance clarity and verifiability.
read point-by-point responses
-
Referee: [Abstract and Results] Abstract and Results section: the claim of comparability to LLaMA3 8B and superiority to GPT-4o on reversal tasks is asserted without any numerical metrics, error bars, or specific benchmark scores in the abstract and is only vaguely referenced in the provided summary; this directly undermines verification of the central scalability and reversal-curse claims.
Authors: We acknowledge that the abstract currently summarizes the results at a high level without specific numbers, which can make immediate verification challenging. In the revised manuscript, we will incorporate key numerical results into the abstract, such as the average benchmark performance scores for LLaDA 8B versus LLaMA3 8B on general tasks, math, and code, as well as the specific accuracy on the reversal poem completion task where LLaDA surpasses GPT-4o. Regarding error bars, we will clarify that results are from single training runs due to computational constraints but will report any available variance from smaller-scale experiments or multiple evaluations. This will allow readers to assess the claims directly from the abstract rather than having to locate the numbers in the main body. revision: yes
-
Referee: [§3] §3 (model description): the reverse generation process uses iterative masked-token prediction with a full-attention Transformer; no analysis is given of whether the masking schedule or unmasking order encodes implicit positional or sequential preferences, which is load-bearing for the claim that capabilities do not depend on AR structure.
Authors: This is a valid point regarding the potential for implicit biases. The forward process in LLaDA applies random masking to tokens without regard to their positions, and the reverse process uses the Transformer with full attention to predict masked tokens iteratively, with unmasking typically proceeding based on prediction confidence rather than a predetermined sequential order. To strengthen this, we will revise §3 to include a dedicated discussion and analysis of the masking schedule and unmasking strategy, demonstrating through description and possibly additional figures or ablations that no left-to-right or positional preference is encoded. This supports our claim that the model's capabilities arise independently of autoregressive inductive biases. (A sketch of such a confidence-ordered unmasking loop appears after these responses.) revision: yes
-
Referee: [Reversal-curse experiment] Reversal-curse experiment: the specific task (reversal poem completion) and evaluation protocol are not detailed enough to confirm that the improvement over GPT-4o is not an artifact of the diffusion training objective or data construction.
Authors: We agree that more details are necessary for full reproducibility and to rule out artifacts. In the revised version, we will expand the description of the reversal poem completion experiment, providing specifics on how the poems are generated and reversed, the exact prompts and input formats used for evaluation, the evaluation protocol (including metrics such as completion accuracy or semantic coherence), and details on the data sources to show that the task construction is independent of the diffusion objective. We will also include the precise numerical comparison to GPT-4o. This will allow independent verification that the results are not due to training artifacts. revision: yes
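The confidence-ordered unmasking discussed in the second response can be illustrated with a short sampling loop. This is a minimal sketch under the assumption that each step commits the highest-confidence predictions in the still-masked response region and leaves the rest masked; the `model` interface, `MASK_ID`, step count, and remasking schedule are placeholders rather than the released inference code.

```python
import torch

MASK_ID = 126336  # placeholder [MASK] id (assumption)

@torch.no_grad()
def confidence_ordered_sample(model, prompt, gen_len=128, steps=64):
    """Generate a response by iteratively unmasking the most confident tokens.

    model(x) is assumed to return logits of shape (1, len(x), vocab) under
    full bidirectional attention. The response region starts fully masked;
    each iteration commits a fixed quota of positions, chosen by prediction
    confidence rather than by left-to-right order.
    """
    device = prompt.device
    x = torch.cat([prompt, torch.full((1, gen_len), MASK_ID, device=device)], dim=1)
    resp = slice(prompt.size(1), prompt.size(1) + gen_len)

    per_step = max(1, gen_len // steps)  # tokens committed per iteration
    for _ in range(steps):
        masked = (x[0, resp] == MASK_ID)
        if not masked.any():
            break
        logits = model(x)[0, resp]                      # (gen_len, vocab)
        probs = torch.softmax(logits, dim=-1)
        conf, pred = probs.max(dim=-1)                  # confidence and argmax token
        conf = torch.where(masked, conf, torch.full_like(conf, -1.0))  # rank masked slots only
        k = min(per_step, int(masked.sum()))
        topk = torch.topk(conf, k).indices              # most confident masked positions
        x[0, resp][topk] = pred[topk]                   # commit them; the rest stay masked

    # Commit anything still masked in one final pass so the sketch always terminates cleanly.
    still = (x[0, resp] == MASK_ID)
    if still.any():
        pred = model(x)[0, resp].argmax(dim=-1)
        x[0, resp][still] = pred[still]
    return x
```

Because every committed token conditions on the full bidirectional context, any positional preference can enter only through the confidence ranking itself, which is precisely the property the referee asks the authors to analyze.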
Circularity Check
No circularity; claims rest on independent empirical training and evaluation
full rationale
The paper proposes LLaDA as a new diffusion-based language model trained from scratch using forward masking and reverse denoising with a Transformer backbone. All performance claims (scalability, in-context learning parity with LLaMA3 8B, instruction following, reversal-curse resistance) are supported by direct training runs, benchmark results, and comparisons to separately constructed autoregressive baselines. No derivation step reduces a prediction to a fitted parameter by construction, invokes a self-citation as the sole justification for a uniqueness theorem, or renames an input as an output. The central argument is falsifiable via the reported experiments rather than tautological.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 59 Pith papers
-
Block-R1: Rethinking the Role of Block Size in Multi-domain Reinforcement Learning for Diffusion Large Language Models
Block-R1 formulates domain block size conflicts in multi-domain RL for dLLMs, releases a 41K-sample dataset with per-sample best block sizes and a conflict score, and provides a benchmark plus simple cross-domain trai...
-
Dynamic Execution Commitment of Vision-Language-Action Models
A3 determines the execution horizon in VLA models as the longest prefix of actions that passes consensus-based verification and sequential consistency checks.
-
UniRank: Unified List-wise Reranking via Confidence-Ordered Denoising
UniRank unifies autoregressive and non-autoregressive list-wise reranking via bidirectional modeling in a confidence-ordered iterative denoising process, outperforming baselines on datasets and online tests.
-
TAD: Temporal-Aware Trajectory Self-Distillation for Fast and Accurate Diffusion LLM
TAD improves the accuracy-parallelism trade-off in diffusion LLMs via temporal-aware self-distillation that applies hard labels to soon-to-be-decoded tokens and soft supervision to future tokens.
-
BadDLM: Backdooring Diffusion Language Models with Diverse Targets
BadDLM implants effective backdoors in diffusion language models across concept, attribute, alignment, and payload targets by exploiting denoising dynamics while preserving clean performance.
-
Discrete Langevin-Inspired Posterior Sampling
ΔLPS is a gradient-guided discrete posterior sampler for inverse problems that works with masked or uniform discrete diffusion priors and outperforms prior discrete methods on image restoration tasks.
-
LEAP: Unlocking dLLM Parallelism via Lookahead Early-Convergence Token Detection
LEAP detects early-converging tokens in dLLMs via future context filtering and multi-sequence superposition, reducing average denoising steps by about 30% while maintaining accuracy.
-
Guidance Is Not a Hyperparameter: Learning Dynamic Control in Diffusion Language Models
Adaptive guidance trajectories learned via PPO outperform fixed-scale CFG on controllability-quality balance in three controlled NLP generation tasks with discrete diffusion models.
-
GPO-V: Jailbreak Diffusion Vision Language Model by Global Probability Optimization
GPO-V jailbreaks dVLMs by globally optimizing probabilities in the denoising process to bypass refusal patterns, achieving stealthy and transferable attacks.
-
GPO-V: Jailbreak Diffusion Vision Language Model by Global Probability Optimization
GPO-V is a visual jailbreak framework that bypasses safety guardrails in diffusion VLMs by globally manipulating generative probabilities during denoising.
-
DiffRetriever: Parallel Representative Tokens for Retrieval with Diffusion Language Models
DiffRetriever generates multiple representative tokens in parallel using diffusion language models, yielding consistent retrieval gains over single-token baselines and autoregressive multi-token variants on BEIR benchmarks.
-
ReflectDrive-2: Reinforcement-Learning-Aligned Self-Editing for Discrete Diffusion Driving
ReflectDrive-2 achieves 91.0 PDMS on NAVSIM with camera input by training a discrete diffusion model to self-edit trajectories via RL-aligned AutoEdit.
-
Focus on the Core: Empowering Diffusion Large Language Models by Self-Contrast
FoCore uses self-contrast on early-converging high-density tokens to boost diffusion LLM quality on reasoning benchmarks while cutting decoding steps by over 2x.
-
DARE: Diffusion Language Model Activation Reuse for Efficient Inference
DARE reuses up to 87% of attention activations in diffusion LLMs through KV caching and output reuse, delivering 1.2x per-layer latency gains with average performance drops of 1.2-2.0%.
-
Dream-Cubed: Controllable Generative Modeling in Minecraft by Training on Billions of Cubes
Dream-Cubed releases a billion-scale voxel dataset and 3D diffusion models that generate controllable Minecraft worlds by operating directly on blocks.
-
$R^2$-dLLM: Accelerating Diffusion Large Language Models via Spatio-Temporal Redundancy Reduction
R²-dLLM reduces dLLM decoding steps by up to 75% via spatio-temporal redundancy reduction while keeping generation quality competitive.
-
NI Sampling: Accelerating Discrete Diffusion Sampling by Token Order Optimization
NI Sampling accelerates discrete diffusion language models up to 14.3 times by training a neural indicator to select which tokens to sample at each step using a trajectory-preserving objective.
-
One Pass for All: A Discrete Diffusion Model for Knowledge Graph Triple Set Prediction
DiffTSP applies discrete diffusion to knowledge graph triple set prediction, recovering all missing triples simultaneously via edge-masking noise reversal and a structure-aware transformer, achieving SOTA on three datasets.
-
DepCap: Adaptive Block-Wise Parallel Decoding for Efficient Diffusion LM Inference
DepCap accelerates diffusion LM inference up to 5.63x by using last-block influence for adaptive block boundaries and conflict-free token selection for parallel decoding, with negligible quality loss.
-
BARD: Bridging AutoRegressive and Diffusion Vision-Language Models Via Highly Efficient Progressive Block Merging and Stage-Wise Distillation
BARD bridges autoregressive and diffusion VLMs with progressive block merging plus stage-wise intra-diffusion distillation, delivering 3x speedup and new SOTA on open dVLMs using under 4.4M data points.
-
LangFlow: Continuous Diffusion Rivals Discrete in Language Modeling
LangFlow is the first continuous diffusion language model to rival discrete diffusion on perplexity and generative perplexity while exceeding autoregressive baselines on several zero-shot tasks.
-
Unlocking Prompt Infilling Capability for Diffusion Language Models
Full-sequence masking in SFT unlocks prompt infilling for masked diffusion language models, producing templates that match or surpass hand-designed ones and transfer across models.
-
NeuralLVC: Neural Lossless Video Compression via Masked Diffusion with Temporal Conditioning
NeuralLVC achieves better lossless compression than H.264 and H.265 on video sequences by combining masked diffusion with temporal conditioning on frame differences.
-
Dependency-Guided Parallel Decoding in Discrete Diffusion Language Models
DEMASK adds a lightweight pairwise-dependency predictor to dLLMs and uses greedy selection to enable parallel unmasking whose total-variation error is provably bounded under sub-additivity.
-
Self-Distilled Trajectory-Aware Boltzmann Modeling: Bridging the Training-Inference Discrepancy in Diffusion Language Models
TABOM models inference unmasking preferences as a Boltzmann distribution over predictive entropies and derives a ranking loss to align DLM training with observed trajectories, yielding gains in new domains and reduced...
-
BitLM: Unlocking Multi-Token Language Generation with Bitwise Continuous Diffusion
BitLM replaces per-token softmax with bitwise continuous diffusion inside causal blocks to generate multiple tokens in parallel while preserving autoregressive structure.
-
TrajDLM: Topology-Aware Block Diffusion Language Model for Trajectory Generation
TrajDLM applies block diffusion language models to discrete road-segment sequences with topology constraints to generate realistic trajectories up to 2.8 times faster than prior methods while supporting zero-shot transfer.
-
Primal-Dual Guided Decoding for Constrained Discrete Diffusion
Primal-dual guided decoding casts constrained discrete diffusion as a KL-regularized optimization solved online with adaptive Lagrangian multipliers to satisfy constraints while staying close to the unconstrained mode...
-
Edit-Based Refinement for Parallel Masked Diffusion Language Models
ME-DLM augments parallel masked diffusion models with edit-distance-supervised refinements to raise quality on coding and math benchmarks while using far fewer diffusion steps.
-
DVD: Discrete Voxel Diffusion for 3D Generation and Editing
DVD treats voxel occupancy as a discrete variable in a diffusion framework to generate, assess, and edit sparse 3D voxels without continuous thresholding.
-
TextLDM: Language Modeling with Continuous Latent Diffusion
TextLDM applies DiT-style latent diffusion with flow matching to language modeling via a REPA-aligned VAE, outperforming prior diffusion LMs and matching GPT-2 when trained from scratch on OpenWebText2.
-
Coupling Models for One-Step Discrete Generation
Coupling Models enable single-step discrete sequence generation via learned couplings to Gaussian latents and outperform prior one-step baselines on text perplexity, biological FBD, and image FID metrics.
-
Continuous Latent Diffusion Language Model
Cola DLM proposes a hierarchical latent diffusion model that learns a text-to-latent mapping, fits a global semantic prior in continuous space with a block-causal DiT, and performs conditional decoding, establishing l...
-
ReflectDrive-2: Reinforcement-Learning-Aligned Self-Editing for Discrete Diffusion Driving
ReflectDrive-2 combines masked discrete diffusion with RL-aligned self-editing to generate and refine driving trajectories, reaching 91.0 PDMS on NAVSIM camera-only and 94.8 in best-of-6.
-
Predict-then-Diffuse: Adaptive Response Length for Compute-Budgeted Inference in Diffusion LLMs
Predict-then-Diffuse predicts response lengths for diffusion LLMs via an auxiliary model and safety buffer to reduce FLOP waste while preserving output quality.
-
Break the Block: Dynamic-size Reasoning Blocks for Diffusion Large Language Models via Monotonic Entropy Descent with Reinforcement Learning
b1 trains dLLMs to dynamically select reasoning block sizes via monotonic entropy descent with RL, improving coherence over fixed-size baselines on reasoning benchmarks.
-
Towards A Generative Protein Evolution Machine with DPLM-Evo
DPLM-Evo is an evolutionary discrete diffusion framework that models protein sequences via explicit substitution, insertion, and deletion operations, achieving state-of-the-art single-sequence mutation effect predicti...
-
Consistent Diffusion Language Models
CDLM trains denoisers to be path-invariant across stochastic posterior bridges in discrete diffusion, unifying prior methods and achieving new SOTA few-step text generation performance.
-
Simple Self-Conditioning Adaptation for Masked Diffusion Models
SCMDM adapts trained masked diffusion models to condition denoising steps on their own prior clean predictions, cutting generative perplexity nearly in half on open-web text while improving discretized image, molecule...
-
dWorldEval: Scalable Robotic Policy Evaluation via Discrete Diffusion World Model
A discrete diffusion model tokenizes multimodal robotic data and uses a progress token to predict future states and task completion for scalable policy evaluation.
-
LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model
LLaDA2.0-Uni unifies multimodal understanding and generation inside one discrete diffusion large language model with a semantic tokenizer, MoE backbone, and diffusion decoder.
-
A Universal Avoidance Method for Diverse Multi-branch Generation
UAG is a universal avoidance generation method that increases multi-branch diversity in diffusion and transformer models by penalizing output similarity, delivering up to 1.9x higher diversity with 4.4x speed and 1/64...
-
Stability-Weighted Decoding for Diffusion Language Models
Stability-Weighted Decoding improves diffusion LLM accuracy by modulating token scores with temporal stability from KL divergence between prediction steps.
-
MemExplorer: Navigating the Heterogeneous Memory Design Space for Agentic Inference NPUs
MemExplorer optimizes heterogeneous memory systems for agentic LLM inference on NPUs and reports up to 2.3x higher energy efficiency than baselines under fixed power budgets.
-
Towards Faster Language Model Inference Using Mixture-of-Experts Flow Matching
Mixture-of-experts flow matching enables non-autoregressive language models to achieve autoregressive-level quality in three sampling steps, delivering up to 1000x faster inference than diffusion models.
-
Dataset-Level Metrics Attenuate Non-Determinism: A Fine-Grained Non-Determinism Evaluation in Diffusion Language Models
Dataset-level metrics in diffusion language models mask substantial sample-level non-determinism that varies with model and system factors, which a new Factor Variance Attribution metric can decompose.
-
Uni-ViGU: Towards Unified Video Generation and Understanding via A Diffusion-Based Video Generator
Uni-ViGU unifies video generation and understanding by extending a diffusion video generator with unified continuous-discrete flow matching, modality-driven MoE layers, and bidirectional training stages that repurpose...
-
Thinking Diffusion: Penalize and Guide Visual-Grounded Reasoning in Diffusion Multimodal Language Models
Position and step penalty plus visual reasoning guidance fix premature answering and weak visual grounding in diffusion MLLMs, delivering up to 7.5% accuracy gains and over 3x speedup.
-
MolDA: Molecular Understanding and Generation via Large Language Diffusion Model
MolDA is a multimodal molecular model that uses a discrete large language diffusion backbone plus a hybrid graph encoder to achieve better global coherence and validity than autoregressive approaches.
-
Chainwash: Multi-Step Rewriting Attacks on Diffusion Language Model Watermarks
Chained rewrites by open-weight LLMs reduce watermark detection on diffusion LM outputs from 87.9% to 4.86% after five steps across multiple styles and models.
-
Scaling Properties of Continuous Diffusion Spoken Language Models
Continuous diffusion spoken language models follow scaling laws for loss and phoneme divergence and generate emotive multi-speaker speech at 16B scale, though long-form coherence stays difficult.
-
DALM: A Domain-Algebraic Language Model via Three-Phase Structured Generation
DALM is a proposed language model architecture that enforces algebraic constraints via a three-phase process over domain lattices to prevent cross-domain knowledge contamination during generation.
-
Training-free, Perceptually Consistent Low-Resolution Previews with High-Resolution Image for Efficient Workflows of Diffusion Models
A commutator-zero condition enables training-free generation of perceptually consistent low-resolution previews for high-resolution diffusion model outputs, achieving up to 33% computation reduction.
-
Breaking Block Boundaries: Anchor-based History-stable Decoding for Diffusion Large Language Models
AHD uses real-time stability monitoring with dynamic anchors to allow early cross-block decoding of converged tokens, cutting steps by up to 80% and raising performance on benchmarks like BBH.
-
DMax: Aggressive Parallel Decoding for dLLMs
DMax enables faster parallel decoding in diffusion language models by using on-policy training to recover from errors and soft embedding interpolations for iterative revision, boosting tokens per forward pass roughly ...
-
FastDiSS: Few-step Match Many-step Diffusion Language Model on Sequence-to-Sequence Generation--Full Version
A training framework perturbs self-conditioning signals in diffusion language models to match few-step inference noise, enabling up to 400x faster sampling while surpassing standard continuous diffusion performance on...
-
On the Quantization Robustness of Diffusion Language Models in Coding Benchmarks
Diffusion coding model CoDA shows smaller accuracy drops than Qwen3-1.7B under 2-4 bit quantization on HumanEval and MBPP.
-
OpenWorldLib: A Unified Codebase and Definition of Advanced World Models
OpenWorldLib offers a standardized codebase and definition for world models that combine perception, interaction, and memory to understand and predict the world.
-
Low-Rank Adaptation Redux for Large Models
An overview revisits LoRA variants by categorizing advances in architectural design, efficient optimization, and applications while linking them to classical signal processing tools for principled fine-tuning.