pith. machine review for the scientific record.

arxiv: 2508.15487 · v1 · submitted 2025-08-21 · 💻 cs.CL

Recognition: 1 theorem link

Dream 7B: Diffusion Large Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 16:20 UTC · model grok-4.3

classification 💻 cs.CL
keywords diffusion language models · discrete diffusion · large language models · parallel generation · arbitrary order generation · text infilling · model initialization techniques · iterative denoising

The pith

Dream 7B shows a 7B diffusion language model can outperform prior diffusion models on language, math, and coding tasks while supporting parallel iterative generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Dream 7B as a large language model trained with discrete diffusion rather than sequential token prediction. It generates text by starting with noise and iteratively refining all tokens at once across multiple steps. The authors claim that initializing from an autoregressive model and adapting the noise level for each token based on its context allows the model to reach higher accuracy on general, mathematical, and coding benchmarks than earlier diffusion approaches. These choices also produce practical advantages such as generating tokens in any order, filling in blanks within existing text, and trading off speed for quality by changing the number of refinement steps. Readers would care because this parallel refinement path opens generation behaviors that sequential models handle awkwardly or not at all.
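
To make the refinement loop concrete, here is a minimal sketch of confidence-based parallel denoising for a masked diffusion language model. The interface (`model`, `mask_id`, the commitment rule) is an illustrative assumption, not Dream 7B's actual sampler; the `num_steps` argument is the speed-quality knob mentioned above.

```python
import torch

def diffusion_generate(model, prompt_ids, gen_len=128, num_steps=32, mask_id=0):
    """Confidence-based parallel denoising for a masked diffusion LM (sketch).

    Illustrative only: Dream 7B's real decoding schedule and remasking rule
    are not specified here; this commits the most confident predictions at
    each step, a common choice for masked-diffusion samplers.
    """
    device = prompt_ids.device
    # Start from the prompt followed by a fully masked continuation.
    x = torch.cat([prompt_ids, torch.full((gen_len,), mask_id, device=device)])
    start = prompt_ids.numel()

    for step in range(num_steps):
        logits = model(x.unsqueeze(0)).squeeze(0)     # (seq_len, vocab): all positions at once
        conf, pred = logits.softmax(-1).max(-1)

        still_masked = x == mask_id
        still_masked[:start] = False                  # never rewrite the prompt
        n_masked = int(still_masked.sum())
        if n_masked == 0:
            break

        # Commit roughly a 1/(steps remaining) fraction of the most confident masked tokens.
        k = max(1, n_masked // (num_steps - step))
        conf = conf.masked_fill(~still_masked, float("-inf"))
        commit = conf.topk(k).indices
        x[commit] = pred[commit]

    return x[start:]
```

Fewer steps means fewer forward passes but more tokens committed per step, which is the tunable quality-speed trade-off the paper highlights.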

Core claim

Unlike autoregressive models that generate tokens sequentially, Dream 7B employs discrete diffusion modeling to refine sequences in parallel through iterative denoising. It consistently outperforms existing diffusion language models on general, mathematical, and coding tasks, and it demonstrates superior planning abilities and inference flexibility, including arbitrary-order generation, infilling, and tunable quality-speed trade-offs. These results are achieved through simple yet effective training techniques, including AR-based LLM initialization and context-adaptive token-level noise rescheduling.

What carries the argument

Discrete diffusion modeling that iteratively denoises an entire token sequence in parallel, supported by initialization from an autoregressive LLM and per-token adaptive noise rescheduling during training.
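
As a rough sketch of where that machinery sits in training, the snippet below shows a standard masked-diffusion loss with a per-token weighting hook standing in for the context-adaptive rescheduling. The paper's exact formulation is not reproduced on this page (the referee asks for it below), so `token_weight_fn` is a placeholder for the general shape of the idea, not the authors' method.

```python
import torch
import torch.nn.functional as F

def masked_diffusion_loss(model, clean_ids, mask_id, token_weight_fn=None):
    """One training step of a masked discrete-diffusion LM, with a weight hook (sketch).

    Standard recipe: sample a masking ratio t, corrupt that fraction of tokens,
    and train the model to recover them. `token_weight_fn` stands in for the
    paper's context-adaptive token-level noise rescheduling, whose exact form
    is not given on this page; here it only rescales each token's loss.
    """
    batch, seq = clean_ids.shape
    t = torch.rand(batch, 1, device=clean_ids.device)                 # per-sequence noise level
    is_masked = torch.rand(batch, seq, device=clean_ids.device) < t   # per-token corruption
    noised = clean_ids.masked_fill(is_masked, mask_id)

    logits = model(noised)                                            # (batch, seq, vocab)
    loss = F.cross_entropy(logits.transpose(1, 2), clean_ids, reduction="none")  # (batch, seq)

    weight = is_masked.float()
    if token_weight_fn is not None:
        # Hypothetical hook: e.g. up- or down-weight tokens by how predictable
        # they are from the surrounding unmasked context.
        weight = weight * token_weight_fn(noised, clean_ids, t)
    return (loss * weight).sum() / weight.sum().clamp(min=1.0)
```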

If this is right

  • Diffusion-based language models can reach competitive accuracy on math and coding problems without relying on left-to-right token prediction.
  • A single trained model can produce valid output when tokens are generated in any chosen order or when sections of text are missing (see the infilling sketch after this list).
  • Users can control the speed versus quality trade-off at inference time by selecting how many denoising steps to run.
  • Releasing both a base model and an instruction-tuned version makes these flexible generation modes available for further experimentation.
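
The infilling sketch referenced in the list above uses the same illustrative interface as the earlier decoding sketch: the known left and right context stays fixed while only the masked gap is iteratively denoised, something a strictly left-to-right model cannot do in a single pass.

```python
import torch

def diffusion_infill(model, left_ids, right_ids, gap_len=16, num_steps=16, mask_id=0):
    """Fill a masked gap between fixed left and right context (sketch, not Dream's API)."""
    device = left_ids.device
    x = torch.cat([left_ids,
                   torch.full((gap_len,), mask_id, device=device),
                   right_ids])
    lo, hi = left_ids.numel(), left_ids.numel() + gap_len

    for step in range(num_steps):
        logits = model(x.unsqueeze(0)).squeeze(0)
        conf, pred = logits.softmax(-1).max(-1)

        in_gap = torch.zeros_like(x, dtype=torch.bool)
        in_gap[lo:hi] = x[lo:hi] == mask_id              # only unresolved gap positions
        n = int(in_gap.sum())
        if n == 0:
            break

        k = max(1, n // (num_steps - step))
        conf = conf.masked_fill(~in_gap, float("-inf"))
        idx = conf.topk(k).indices
        x[idx] = pred[idx]

    return x[lo:hi]
```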

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The parallel refinement process may eventually allow diffusion models to handle long-range planning tasks with less accumulation of early errors than sequential models.
  • If the adaptive noise technique generalizes, similar rescheduling could improve training stability for diffusion models in other domains such as images or audio.
  • The ability to infill and reorder tokens suggests diffusion language models could serve as a natural fit for interactive editing interfaces where users revise parts of a draft.
  • Further scaling of this approach might reveal whether diffusion models can close the remaining gap with autoregressive models on broad knowledge benchmarks.

Load-bearing premise

The combination of autoregressive model initialization and context-adaptive token-level noise rescheduling is sufficient to produce the reported performance gains and new generation capabilities at 7B scale.

What would settle it

Train a 7B-scale discrete diffusion language model using the same data and architecture but without autoregressive initialization or context-adaptive noise rescheduling, then compare its scores on the same general, math, and coding benchmarks to those reported for Dream 7B.
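
A minimal sketch of the ablation grid such a test implies; the run names and flags are hypothetical, and every run would share data, architecture, and compute.

```python
# Hypothetical ablation grid; each run varies only the two training techniques
# whose contribution is in question.
ABLATIONS = {
    "full":              {"ar_init": True,  "adaptive_noise": True},
    "no_ar_init":        {"ar_init": False, "adaptive_noise": True},
    "no_adaptive_noise": {"ar_init": True,  "adaptive_noise": False},
    "neither":           {"ar_init": False, "adaptive_noise": False},
}
BENCHMARK_SUITES = ["general", "math", "coding"]   # same suites the paper reports
```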

read the original abstract

We introduce Dream 7B, the most powerful open diffusion large language model to date. Unlike autoregressive (AR) models that generate tokens sequentially, Dream 7B employs discrete diffusion modeling to refine sequences in parallel through iterative denoising. Our model consistently outperforms existing diffusion language models on general, mathematical, and coding tasks. Dream 7B demonstrates superior planning abilities and inference flexibility, including arbitrary-order generation, infilling capabilities, and tunable quality-speed trade-offs. These results are achieved through simple yet effective training techniques, including AR-based LLM initialization and context-adaptive token-level noise rescheduling. We release both Dream-Base and Dream-Instruct to facilitate further research in diffusion-based language modeling.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Dream 7B, a 7B-parameter discrete diffusion language model that generates via iterative parallel denoising rather than sequential autoregressive prediction. It claims consistent outperformance over prior diffusion LLMs on general, mathematical, and coding benchmarks, plus new inference capabilities (arbitrary-order generation, infilling, tunable quality-speed trade-offs) obtained through AR-based LLM initialization and context-adaptive token-level noise rescheduling. Dream-Base and Dream-Instruct variants are released.

Significance. If the performance and capability claims are substantiated with rigorous controls, the work would represent a meaningful step toward practical non-autoregressive LLMs at scale, demonstrating that diffusion models can achieve competitive results on reasoning-heavy tasks while offering inference flexibility unavailable to standard AR models. The release of the models would further enable community exploration of diffusion-based language modeling.

major comments (2)
  1. [Experiments] Experiments section: the manuscript presents overall benchmark results for Dream 7B but provides no controlled ablations that isolate the contribution of AR-based LLM initialization or context-adaptive token-level noise rescheduling at the 7B scale (e.g., training otherwise identical 7B diffusion models with these components disabled). Without such ablations, the central attribution of the reported gains and new capabilities to these specific techniques remains unsupported.
  2. [Results] Results tables (general/math/coding benchmarks): while aggregate outperformance is asserted, the paper does not report per-task breakdowns, statistical significance tests, or comparisons against strong AR baselines of comparable size and training compute, making it difficult to assess whether the diffusion approach truly closes the gap or merely matches prior diffusion models.
minor comments (2)
  1. [Abstract] Abstract: quantitative results, benchmark names, and exact metrics are omitted, forcing readers to consult the full text for any concrete evidence of the claimed outperformance.
  2. [Method] Method description: the precise formulation of the context-adaptive token-level noise rescheduling (e.g., the functional form of the schedule and how context length modulates it) should be given explicitly, ideally with pseudocode or an equation.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive review. We address each major comment below, providing clarifications and committing to revisions that strengthen the manuscript without misrepresenting our contributions or experimental scope.

read point-by-point responses
  1. Referee: [Experiments] Experiments section: the manuscript presents overall benchmark results for Dream 7B but provides no controlled ablations that isolate the contribution of AR-based LLM initialization or context-adaptive token-level noise rescheduling at the 7B scale (e.g., training otherwise identical 7B diffusion models with these components disabled). Without such ablations, the central attribution of the reported gains and new capabilities to these specific techniques remains unsupported.

    Authors: We agree that controlled ablations at the full 7B scale would offer the strongest isolation of each technique's contribution. Training multiple independent 7B diffusion models from scratch exceeds our available compute budget. However, we conducted systematic ablations at the 1B scale (reported in the appendix) that isolate the effects of AR initialization and context-adaptive noise scheduling, showing consistent gains that align with the 7B results. In the revised manuscript we will (i) move these 1B ablations into the main text, (ii) add a dedicated limitations paragraph discussing the computational constraints on 7B-scale ablations, and (iii) reference prior smaller-scale studies that motivated the design choices. We believe the combination of smaller-scale evidence, scaling behavior, and public model release still supports the attribution while transparently noting the limitation. revision: partial

  2. Referee: [Results] Results tables (general/math/coding benchmarks): while aggregate outperformance is asserted, the paper does not report per-task breakdowns, statistical significance tests, or comparisons against strong AR baselines of comparable size and training compute, making it difficult to assess whether the diffusion approach truly closes the gap or merely matches prior diffusion models.

    Authors: We accept that the current presentation can be improved. In the revision we will add (a) per-task score tables in an expanded appendix, (b) bootstrap-based statistical significance tests with 95% confidence intervals for all reported averages, and (c) a new subsection comparing Dream 7B against publicly documented 7B-scale AR models (e.g., Llama-2-7B, Mistral-7B) on the identical benchmark suites, while explicitly stating differences in training data and objective. Our primary claim remains outperformance over prior diffusion LLMs; we do not assert superiority over state-of-the-art AR models. These additions will allow readers to evaluate the gap-closing question directly. revision: yes
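
A minimal sketch of the bootstrap confidence-interval computation the rebuttal commits to, applied to placeholder per-example scores rather than any actual Dream 7B results.

```python
import numpy as np

def bootstrap_ci(per_example_scores, n_resamples=10_000, alpha=0.05, seed=0):
    """Mean score with a (1 - alpha) bootstrap confidence interval."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(per_example_scores, dtype=float)
    means = np.array([rng.choice(scores, size=scores.size, replace=True).mean()
                      for _ in range(n_resamples)])
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return scores.mean(), (lo, hi)

# Placeholder 0/1 correctness for a 500-question benchmark, not real results.
fake_scores = np.random.default_rng(1).integers(0, 2, size=500)
mean, (lo, hi) = bootstrap_ci(fake_scores)
print(f"accuracy {mean:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```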

Circularity Check

0 steps flagged

No circularity; empirical model introduction with no derivations or self-referential reductions

full rationale

The manuscript presents Dream 7B as an empirical contribution, with performance claims resting on benchmark evaluations after applying AR-based LLM initialization and context-adaptive token-level noise rescheduling. No equations, first-principles derivations, fitted parameters renamed as predictions, or uniqueness theorems appear in the provided text. The central attribution of gains to the listed training techniques is not shown to reduce by construction to prior fitted quantities or self-citations; it remains an empirical assertion open to external verification via ablations or reproduction. No steps match any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical derivations or new theoretical constructs are described; the work relies on standard discrete diffusion modeling assumptions and conventional LLM pretraining practices.

pith-pipeline@v0.9.0 · 5427 in / 1095 out tokens · 57229 ms · 2026-05-11T16:20:03.561303+00:00 · methodology

discussion (0)


Forward citations

Cited by 56 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Steering Without Breaking: Mechanistically Informed Interventions for Discrete Diffusion Language Models

    cs.LG 2026-05 unverdicted novelty 8.0

    Adaptive scheduling of interventions in discrete diffusion language models, timed to attribute-specific commitment schedules discovered with sparse autoencoders, delivers precise multi-attribute steering up to 93% str...

  2. Mitigating Mask Prior Drift and Positional Attention Collapse in Large Diffusion Vision-Language Models

    cs.CV 2026-05 unverdicted novelty 7.0

    Mask prior drift and positional attention collapse cause failures in LDVLMs for long generations, fixed by training-free Mask Prior Suppression and Monotonic RoPE Scaling.

  3. From Table to Cell: Attention for Better Reasoning with TABALIGN

    cs.AI 2026-05 unverdicted novelty 7.0

    TABALIGN pairs a diffusion language model planner emitting binary cell masks with a trained attention verifier, raising average accuracy 15.76 points over strong baselines on eight table benchmarks while speeding exec...

  4. Where Should Diffusion Enter a Language Model? Geometry-Guided Hidden-State Replacement

    cs.CL 2026-05 unverdicted novelty 7.0

    DiHAL uses geometry proxies to pick where to replace the lower layers of a pretrained transformer with a diffusion bridge for hidden-state reconstruction, improving over token-level diffusion baselines on 8B models.

  5. Factorization-Error-Free Discrete Diffusion Language Model via Speculative Decoding

    cs.CL 2026-05 unverdicted novelty 7.0

    FeF-DLLM achieves factorization-error-free generation in discrete diffusion language models via prefix-conditioned posterior factorization and speculative decoding, delivering 5.04 pp higher accuracy and 3.86x faster ...

  6. Support Before Frequency in Discrete Diffusion

    cs.LG 2026-05 unverdicted novelty 7.0

    Discrete diffusion models learn data support before frequencies because the exact reverse process decomposes edits into a dominant validity scale and a finer probability coefficient.

  7. Beyond Mode-Seeking RL: Trajectory-Balance Post-Training for Diffusion Language Models

    cs.LG 2026-05 conditional novelty 7.0

    TraFL applies trajectory flow balancing to post-train diffusion language models, preventing mode collapse and delivering consistent gains on reasoning tasks that hold under increased sampling.

  8. DiffScore: Text Evaluation Beyond Autoregressive Likelihood

    cs.CL 2026-05 unverdicted novelty 7.0

    DiffScore is a bidirectional masked-diffusion evaluation framework that measures text recoverability across masking rates and outperforms autoregressive baselines on ten benchmarks.

  9. Relative Score Policy Optimization for Diffusion Language Models

    cs.CL 2026-05 unverdicted novelty 7.0

    RSPO interprets reward advantages as targets for relative log-ratios in dLLMs, calibrating noisy estimates to stabilize RLVR training and achieve strong gains on planning tasks with competitive math reasoning performance.

  10. TAD: Temporal-Aware Trajectory Self-Distillation for Fast and Accurate Diffusion LLM

    cs.CL 2026-05 unverdicted novelty 7.0

    TAD improves the accuracy-parallelism trade-off in diffusion LLMs via temporal-aware self-distillation that applies hard labels to soon-to-be-decoded tokens and soft supervision to future tokens.

  11. BadDLM: Backdooring Diffusion Language Models with Diverse Targets

    cs.CR 2026-05 unverdicted novelty 7.0

    BadDLM implants effective backdoors in diffusion language models across concept, attribute, alignment, and payload targets by exploiting denoising dynamics while preserving clean performance.

  12. LEAP: Unlocking dLLM Parallelism via Lookahead Early-Convergence Token Detection

    cs.LG 2026-05 unverdicted novelty 7.0

    LEAP detects early-converging tokens in dLLMs via future context filtering and multi-sequence superposition, reducing average denoising steps by about 30% while maintaining accuracy.

  13. Trajectory as the Teacher: Few-Step Discrete Flow Matching via Energy-Navigated Distillation

    cs.LG 2026-05 unverdicted novelty 7.0

    Energy-navigated trajectory shaping during training produces 8-step discrete flow matching students that achieve 32% lower perplexity than 1024-step teachers on 170M language models with unchanged inference cost.

  14. Guidance Is Not a Hyperparameter: Learning Dynamic Control in Diffusion Language Models

    cs.CL 2026-05 unverdicted novelty 7.0

    Adaptive guidance trajectories learned via PPO outperform fixed-scale CFG on controllability-quality balance in three controlled NLP generation tasks with discrete diffusion models.

  15. GPO-V: Jailbreak Diffusion Vision Language Model by Global Probability Optimization

    cs.CV 2026-05 unverdicted novelty 7.0

    GPO-V is a visual jailbreak framework that bypasses safety guardrails in diffusion VLMs by globally manipulating generative probabilities during denoising.

  16. GPO-V: Jailbreak Diffusion Vision Language Model by Global Probability Optimization

    cs.CV 2026-05 unverdicted novelty 7.0

    GPO-V jailbreaks dVLMs by globally optimizing probabilities in the denoising process to bypass refusal patterns, achieving stealthy and transferable attacks.

  17. DiffRetriever: Parallel Representative Tokens for Retrieval with Diffusion Language Models

    cs.IR 2026-05 unverdicted novelty 7.0

    DiffRetriever generates multiple representative tokens in parallel using diffusion language models, yielding consistent retrieval gains over single-token baselines and autoregressive multi-token variants on BEIR benchmarks.

  18. ReflectDrive-2: Reinforcement-Learning-Aligned Self-Editing for Discrete Diffusion Driving

    cs.RO 2026-05 unverdicted novelty 7.0

    ReflectDrive-2 achieves 91.0 PDMS on NAVSIM with camera input by training a discrete diffusion model to self-edit trajectories via RL-aligned AutoEdit.

  19. Focus on the Core: Empowering Diffusion Large Language Models by Self-Contrast

    cs.CL 2026-05 unverdicted novelty 7.0

    FoCore uses self-contrast on early-converging high-density tokens to boost diffusion LLM quality on reasoning benchmarks while cutting decoding steps by over 2x.

  20. DARE: Diffusion Language Model Activation Reuse for Efficient Inference

    cs.LG 2026-05 unverdicted novelty 7.0

    DARE reuses up to 87% of attention activations in diffusion LLMs through KV caching and output reuse, delivering 1.2x per-layer latency gains with average performance drops of 1.2-2.0%.

  21. $R^2$-dLLM: Accelerating Diffusion Large Language Models via Spatio-Temporal Redundancy Reduction

    cs.CL 2026-04 unverdicted novelty 7.0

    R²-dLLM reduces dLLM decoding steps by up to 75% via spatio-temporal redundancy reduction while keeping generation quality competitive.

  22. Remask, Don't Replace: Token-to-Mask Refinement in Diffusion Large Language Models

    cs.CL 2026-04 unverdicted novelty 7.0

    Token-to-Mask remasking improves self-correction in diffusion LLMs by resetting erroneous commitments to masks rather than overwriting them, yielding +13.33 points on AIME 2025 and +8.56 on CMATH.

  23. NI Sampling: Accelerating Discrete Diffusion Sampling by Token Order Optimization

    cs.LG 2026-04 unverdicted novelty 7.0

    NI Sampling accelerates discrete diffusion language models up to 14.3 times by training a neural indicator to select which tokens to sample at each step using a trajectory-preserving objective.

  24. DepCap: Adaptive Block-Wise Parallel Decoding for Efficient Diffusion LM Inference

    cs.LG 2026-04 unverdicted novelty 7.0

    DepCap accelerates diffusion LM inference up to 5.63x by using last-block influence for adaptive block boundaries and conflict-free token selection for parallel decoding, with negligible quality loss.

  25. BARD: Bridging AutoRegressive and Diffusion Vision-Language Models Via Highly Efficient Progressive Block Merging and Stage-Wise Distillation

    cs.CV 2026-04 unverdicted novelty 7.0

    BARD bridges autoregressive and diffusion VLMs with progressive block merging plus stage-wise intra-diffusion distillation, delivering 3x speedup and new SOTA on open dVLMs using under 4.4M data points.

  26. LangFlow: Continuous Diffusion Rivals Discrete in Language Modeling

    cs.CL 2026-04 unverdicted novelty 7.0

    LangFlow is the first continuous diffusion language model to rival discrete diffusion on perplexity and generative perplexity while exceeding autoregressive baselines on several zero-shot tasks.

  27. Lost in Diffusion: Uncovering Hallucination Patterns and Failure Modes in Diffusion Large Language Models

    cs.CL 2026-04 unverdicted novelty 7.0

    Diffusion LLMs hallucinate more than autoregressive models and display distinct failure modes including premature termination, incomplete denoising, and context intrusion.

  28. ECHO: Efficient Chest X-ray Report Generation with One-step Block Diffusion

    cs.LG 2026-04 unverdicted novelty 7.0

    ECHO is a one-step block diffusion VLM for chest X-ray reports that improves RaTE and SemScore by over 60% while delivering 8x faster inference than autoregressive baselines.

  29. MARS: Enabling Autoregressive Models Multi-Token Generation

    cs.CL 2026-04 unverdicted novelty 7.0

    MARS fine-tunes autoregressive models to predict multiple tokens per step via continued training on instruction data, achieving 1.5-1.7x throughput while matching baseline accuracy and supporting real-time speed adjustment.

  30. Unlocking Prompt Infilling Capability for Diffusion Language Models

    cs.CL 2026-04 unverdicted novelty 7.0

    Full-sequence masking in SFT unlocks prompt infilling for masked diffusion language models, producing templates that match or surpass hand-designed ones and transfer across models.

  31. Dependency-Guided Parallel Decoding in Discrete Diffusion Language Models

    cs.CL 2026-04 unverdicted novelty 7.0

    DEMASK adds a lightweight pairwise-dependency predictor to dLLMs and uses greedy selection to enable parallel unmasking whose total-variation error is provably bounded under sub-additivity.

  32. MemDLM: Memory-Enhanced DLM Training

    cs.CL 2026-03 unverdicted novelty 7.0

    MemDLM embeds a simulated denoising trajectory into DLM training via bi-level optimization, creating a parametric memory that improves convergence and long-context performance even when the memory is dropped at test time.

  33. Language Generation as Optimal Control: Closed-Loop Diffusion in Latent Control Space

    cs.CL 2026-05 unverdicted novelty 6.0

    Language generation is recast as optimal control and solved approximately with flow matching in rectified latent control space to enable high-fidelity parallel text generation.

  34. Adaptive Steering and Remasking for Safe Generation in Diffusion Language Models

    cs.CL 2026-05 conditional novelty 6.0

    Step-wise detection via a contrastive safety direction followed by remasking and adaptive steering reduces jailbreak success rates in diffusion language models to 0.64% while preserving output quality.

  35. Understanding and Accelerating the Training of Masked Diffusion Language Models

    cs.LG 2026-05 conditional novelty 6.0

    Bell-shaped time sampling accelerates masked diffusion language model training by roughly 4x on LM1B by countering locality bias in language data.

  36. Orthrus: Memory-Efficient Parallel Token Generation via Dual-View Diffusion

    cs.LG 2026-05 unverdicted novelty 6.0

    Orthrus unifies autoregressive and diffusion views on a shared KV cache to deliver lossless parallel token generation with up to 7.8x speedup and O(1) memory overhead.

  37. Self-Distilled Trajectory-Aware Boltzmann Modeling: Bridging the Training-Inference Discrepancy in Diffusion Language Models

    cs.CL 2026-05 unverdicted novelty 6.0

    TABOM models inference unmasking preferences as a Boltzmann distribution over predictive entropies and derives a ranking loss to align DLM training with observed trajectories, yielding gains in new domains and reduced...

  38. ELF: Embedded Language Flows

    cs.CL 2026-05 unverdicted novelty 6.0

    ELF is a continuous embedding-space flow matching model for language that stays continuous until the last step and outperforms prior discrete and continuous diffusion language models with fewer sampling steps.

  39. TrajDLM: Topology-Aware Block Diffusion Language Model for Trajectory Generation

    cs.LG 2026-05 unverdicted novelty 6.0

    TrajDLM applies block diffusion language models to discrete road-segment sequences with topology constraints to generate realistic trajectories up to 2.8 times faster than prior methods while supporting zero-shot transfer.

  40. Edit-Based Refinement for Parallel Masked Diffusion Language Models

    cs.CL 2026-05 unverdicted novelty 6.0

    ME-DLM augments parallel masked diffusion models with edit-distance-supervised refinements to raise quality on coding and math benchmarks while using far fewer diffusion steps.

  41. dFlowGRPO: Rate-Aware Policy Optimization for Discrete Flow Models

    cs.LG 2026-05 unverdicted novelty 6.0

    dFlowGRPO is a new rate-aware RL method for discrete flow models that outperforms prior GRPO approaches on image generation and matches continuous flow models while supporting broad probability paths.

  42. ReflectDrive-2: Reinforcement-Learning-Aligned Self-Editing for Discrete Diffusion Driving

    cs.RO 2026-05 unverdicted novelty 6.0

    ReflectDrive-2 combines masked discrete diffusion with RL-aligned self-editing to generate and refine driving trajectories, reaching 91.0 PDMS on NAVSIM camera-only and 94.8 in best-of-6.

  43. Predict-then-Diffuse: Adaptive Response Length for Compute-Budgeted Inference in Diffusion LLMs

    cs.LG 2026-05 unverdicted novelty 6.0

    Predict-then-Diffuse predicts response lengths for diffusion LLMs via an auxiliary model and safety buffer to reduce FLOP waste while preserving output quality.

  44. Towards A Generative Protein Evolution Machine with DPLM-Evo

    cs.LG 2026-04 unverdicted novelty 6.0

    DPLM-Evo adds explicit edit operations and a latent alignment space to discrete diffusion protein models, achieving SOTA single-sequence mutation effect prediction on ProteinGym while supporting variable-length generation.

  45. Towards A Generative Protein Evolution Machine with DPLM-Evo

    cs.LG 2026-04 unverdicted novelty 6.0

    DPLM-Evo is an evolutionary discrete diffusion framework that models protein sequences via explicit substitution, insertion, and deletion operations, achieving state-of-the-art single-sequence mutation effect predicti...

  46. Simple Self-Conditioning Adaptation for Masked Diffusion Models

    cs.LG 2026-04 unverdicted novelty 6.0

    SCMDM adapts trained masked diffusion models to condition denoising steps on their own prior clean predictions, cutting generative perplexity nearly in half on open-web text while improving discretized image, molecule...

  47. Towards Faster Language Model Inference Using Mixture-of-Experts Flow Matching

    cs.AI 2026-04 unverdicted novelty 6.0

    Mixture-of-experts flow matching enables non-autoregressive language models to achieve autoregressive-level quality in three sampling steps, delivering up to 1000x faster inference than diffusion models.

  48. DiffuMask: Diffusion Language Model for Token-level Prompt Pruning

    cs.CL 2026-04 unverdicted novelty 6.0

    DiffuMask uses a diffusion language model for parallel token-level prompt pruning, achieving up to 80% length reduction with maintained or improved accuracy in reasoning tasks.

  49. STDec: Spatio-Temporal Stability Guided Decoding for dLLMs

    cs.CL 2026-04 unverdicted novelty 6.0

    STDec raises dLLM decoding speed by up to 14x on benchmarks like MBPP by using observed spatio-temporal stability to create dynamic, token-specific confidence thresholds while preserving task performance.

  50. Thinking Diffusion: Penalize and Guide Visual-Grounded Reasoning in Diffusion Multimodal Language Models

    cs.AI 2026-04 unverdicted novelty 6.0

    Position and step penalty plus visual reasoning guidance fix premature answering and weak visual grounding in diffusion MLLMs, delivering up to 7.5% accuracy gains and over 3x speedup.

  51. OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    OmniVoice introduces a diffusion language model-style non-autoregressive TTS system that directly maps text to multi-codebook acoustic tokens, scaling zero-shot synthesis to over 600 languages with SOTA results on mul...

  52. LLaDA2.0: Scaling Up Diffusion Language Models to 100B

    cs.LG 2025-12 conditional novelty 6.0

    LLaDA2.0 scales discrete diffusion language models to 100B parameters via systematic conversion from autoregressive models using a 3-phase WSD training scheme and releases open-source 16B and 100B MoE variants.

  53. Predict-then-Diffuse: Adaptive Response Length for Compute-Budgeted Inference in Diffusion LLMs

    cs.LG 2026-05 unverdicted novelty 5.0

    Predict-then-Diffuse predicts response length for diffusion LLMs before inference, cutting FLOPs with a data-driven safety buffer while preserving output quality.

  54. Breaking Block Boundaries: Anchor-based History-stable Decoding for Diffusion Large Language Models

    cs.CL 2026-04 unverdicted novelty 5.0

    AHD uses real-time stability monitoring with dynamic anchors to allow early cross-block decoding of converged tokens, cutting steps by up to 80% and raising performance on benchmarks like BBH.

  55. DMax: Aggressive Parallel Decoding for dLLMs

    cs.LG 2026-04 unverdicted novelty 5.0

    DMax enables faster parallel decoding in diffusion language models by using on-policy training to recover from errors and soft embedding interpolations for iterative revision, boosting tokens per forward pass roughly ...

  56. FastDiSS: Few-step Match Many-step Diffusion Language Model on Sequence-to-Sequence Generation--Full Version

    cs.CL 2026-04 unverdicted novelty 5.0

    A training framework perturbs self-conditioning signals in diffusion language models to match few-step inference noise, enabling up to 400x faster sampling while surpassing standard continuous diffusion performance on...

Reference graph

Works this paper leans on

25 extracted references · 25 canonical work pages · cited by 52 Pith papers · 13 internal anchors
