BlockGen: Flexible Blockwise Sequence Modeling with Hybrid Samplers

Caglar Gulcehre; Justin Deschenaux

arxiv: 2606.02241 · v1 · pith:SQ6XAV35new · submitted 2026-06-01 · 💻 cs.LG

Pith reviewed 2026-06-28 15:26 UTC · model grok-4.3

keywords blockwise sequence modeling · uniform diffusion · masked diffusion · predictor-corrector sampling · AR-informed correctors · discrete diffusion · GSM8K · generative perplexity

The pith

BlockGen shows uniform diffusion outperforms masked diffusion in block-by-block generation under ancestral sampling, especially with few steps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops BlockGen to compare uniform-state and masked diffusion when tokens are produced block by block instead of over full sequences. It trains models on mixtures of block sizes so that the effective behavior can interpolate between autoregressive and fully diffusive regimes, and it adds AR-informed predictor-corrector sampling that uses autoregressive predictions to target which tokens to re-generate. Experiments demonstrate that under plain ancestral sampling uniform diffusion produces higher-quality blocks than masked diffusion, with the largest margin appearing when the number of steps is small. When the informed corrector is used the advantage shrinks and can reverse once the step count becomes large. On GSM8K with blocks of size 16 masked models edge out uniform ones in accuracy, and a parallel pattern appears in generative perplexity on OpenWebText.

Core claim

BlockGen is a blockwise sequence model that is instantiated once with masked diffusion and once with uniform diffusion; it trains on a mixture of block sizes whose likelihoods interpolate between autoregressive and pure diffusion models, and it supports AR-informed predictor-corrector sampling that combines autoregressive and diffusion predictions to re-generate unlikely tokens. Under ancestral sampling the uniform version outperforms the masked version in the block-by-block regime, most clearly in the few-step regime; under ARPC the performance gap closes and reverses at high numbers of function evaluations. With block size 16 on GSM8K the masked models reach slightly higher accuracy than t

What carries the argument

BlockGen, a blockwise sequence model trained on a mixture of block sizes that also supports AR-informed predictor-corrector sampling to compare uniform and masked diffusion states.

If this is right

Uniform diffusion becomes the preferred choice for blockwise generation when only ancestral sampling and low step counts are available.
AR-informed correctors reduce the practical importance of choosing between uniform and masked states once the sampling budget grows large.
A single model trained on mixed block sizes can be deployed at different effective block lengths without retraining.
Accuracy trends on math reasoning tasks and perplexity trends on language data move in the same direction, suggesting the ordering is not task-specific.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The reversal at high NFE implies that sampler choice can matter more than diffusion-state choice when high-quality output is required.
If the mixed-block training is the source of the advantage, then variable block sizes at inference time could further improve speed-quality trade-offs.
The slight masked-model edge on GSM8K at block size 16 suggests that downstream task performance may still favor masked diffusion in some regimes even when ancestral sampling favors uniform.

Load-bearing premise

That training both diffusion types on the same mixture of block sizes and evaluating them with the same AR-informed correctors produces an unbiased comparison in which performance differences reflect the choice of diffusion state rather than interactions with the blockwise schedule or the hybrid sampler.

What would settle it

Retraining separate masked and uniform models with a single fixed block size and with random rather than AR-informed correctors, then observing that the performance ordering under ancestral sampling reverses, would indicate that the reported advantage depends on the BlockGen training or sampler rather than on the diffusion state itself.

Figures

Figures reproduced from arXiv: 2606.02241 by Caglar Gulcehre, Justin Deschenaux.

**Figure 1.** GSM8K accuracy with block size 16 as a function of NFE (number of function evaluations). Models are trained on TinyGSM and evaluated on the GSM8K test set. Each curve shows the best performance for a given NFE; the full sweep is in Suppl. D.1. AR-Informed Predictor-Corrector (ARPC; ours) uses checkpoints trained with the mixture in (10), with γ1 = 0.05, γ16 = 0.95. Ancestral uses a single-block-size model.…

**Figure 2.** ARPC vs EIPC GSM8K accuracy with block size 16 as a function of NFE. Models are trained on TinyGSM and evaluated on the GSM8K test set. Each curve shows the best performance for a given NFE. Entropy-Informed Predictor-Corrector (EIPC) uses a single-block-16 model, and ARPC uses the multi-block mixture from (10) with γ1 = 0.05, γ16 = 0.95. ARPC outperforms EIPC across most NFE budgets (left and right). prob…

**Figure 3.** Sample quality on OpenWebText. Each curve represents a sweep over temperatures with fixed NFEs. Lower Gen. PPL at matched entropy is better. Left: masked vs uniform diffusion under ARPC at per-block NFE = 8. Uniform-ARPC reaches better Gen. PPL than masked-ARPC. Middle: same comparison at per-block NFE = 64. Masked-ARPC has the better frontier. Right: ARPC vs AR at per-block NFE = 64 with a late-correction…

**Figure 4.** Distribution of tokenized sequence lengths on TinyGSM. The GPT-2 tokenizer was trained on web text and produces longer tokenized sequences on code than tokenizers trained on code. We compare GPT-2 against SmolLM-135M (top) and Qwen2.5 (bottom). Both tokenizers are trained on code and produce shorter sequences.

Algorithm 1 (Stratified Block Size Selection). Require: weights γ ∈ ∆M, number of GPUs D. 1: c0 ← 0,…
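Algorithm 1 is cut off in this extract after its first step, so its actual procedure cannot be recovered here. As a rough illustration of what a stratified selection of block sizes across D GPUs with weights γ could look like, the sketch below uses systematic sampling over the cumulative weights. The function name `stratified_block_sizes`, the single shared random offset, and the example block sizes are assumptions made for illustration; this is not the paper's algorithm.

```python
import random
from bisect import bisect_right
from itertools import accumulate


def stratified_block_sizes(block_sizes, weights, num_gpus, rng=random):
    """Pick one block size per GPU so that counts track the mixture weights.

    Hypothetical sketch based on systematic sampling; NOT the paper's
    Algorithm 1, whose body is truncated in the source.
    """
    if len(block_sizes) != len(weights):
        raise ValueError("block_sizes and weights must have the same length")
    total = sum(weights)
    # Cumulative distribution over block sizes.
    cum = list(accumulate(w / total for w in weights))
    cum[-1] = 1.0  # guard against floating-point round-off
    u = rng.random()  # one shared offset in [0, 1)
    picks = []
    for d in range(num_gpus):
        # One sample point inside each stratum [d/D, (d+1)/D).
        point = (d + u) / num_gpus
        idx = min(bisect_right(cum, point), len(block_sizes) - 1)
        picks.append(block_sizes[idx])
    return picks


if __name__ == "__main__":
    # Weights quoted in the captions: gamma_1 = 0.05, gamma_16 = 0.95.
    print(stratified_block_sizes([1, 16], [0.05, 0.95], num_gpus=20))
```

Compared with drawing each GPU's block size independently, a stratified draw of this kind keeps the per-step counts close to the target proportions, which may be the motivation for the stratification.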

**Figure 5.** Attention patterns for block diffusion (training) with L′ = 2. Noisy blocks attend to all tokens within the block and to all clean tokens in previous blocks. (a) shows the attention from BD3-LM (Arriola et al., 2025). (b) uses causal attention over the clean tokens. This is important for BlockGen since we train over multiple block sizes but want to share a single KV cache across all block sizes during inf…
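The attention pattern in this caption can be sketched as a boolean mask. The sketch assumes the training input is the concatenation of the noisy sequence followed by the clean sequence, that clean tokens never attend to noisy ones, and that variant (a) is block-causal over clean tokens; these layout details, the function name `block_diffusion_mask`, and the `causal_clean` flag are assumptions for illustration rather than details confirmed by this extract.

```python
def block_diffusion_mask(seq_len, block_size, causal_clean=True):
    """Boolean attention mask over the layout [noisy tokens | clean tokens].

    mask[q][k] is True when query position q may attend to key position k.
    Illustrative sketch only; the layout and names are assumptions.
    """
    n = seq_len
    mask = [[False] * (2 * n) for _ in range(2 * n)]
    for q in range(n):
        q_block = q // block_size
        for k in range(n):
            k_block = k // block_size
            # Noisy query: full attention inside its own noisy block.
            mask[q][k] = q_block == k_block
            # Noisy query: clean tokens of strictly earlier blocks.
            mask[q][n + k] = k_block < q_block
            # Clean query: token-level causal (variant b) or
            # block-causal (assumed reading of variant a).
            if causal_clean:
                mask[n + q][n + k] = k <= q
            else:
                mask[n + q][n + k] = k_block <= q_block
    return mask


if __name__ == "__main__":
    for row in block_diffusion_mask(seq_len=4, block_size=2):
        print("".join("x" if allowed else "." for allowed in row))
```

With token-level causal attention over the clean tokens, the clean-token rows of the mask do not depend on `block_size`, which fits the caption's stated goal of sharing one KV cache across block sizes.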

**Figure 6.** Effect of mixture weights on ARPC with block size 16 as a function of NFE, at T=1. Each curve shows the best ARPC performance for a given NFE, one curve per value of γ1 in (10), with γ16 = 1 − γ1. Masked diffusion (left) and uniform diffusion (right).

**Figure 7.** Effect of sampling temperature on ARPC with block size 16 as a function of NFE. Each curve shows the best ARPC performance for a given NFE, one curve per sampling temperature (T ∈ {1.0, 0.9, 0.5, 0.3, 0.1}). ARPC uses the multi-block mixture from (10) with γ1 = 0.05, γ16 = 0.95. Masked diffusion (left) and uniform diffusion (right).

**Figure 8.** GSM8K accuracy with block size 32 as a function of NFE. Models are trained on TinyGSM and evaluated on the GSM8K test set. Each curve shows the best performance for a given NFE. ARPC uses checkpoints trained with the mixture in (10), with γ1 = 0.05, γ32 = 0.95, and ancestral uses a single-block-size model. Dashed lines show the AR baseline (53.9% sampled, 63.3% greedy). ARPC outperforms ancestral across th…

**Figure 9.** ARPC vs EIPC GSM8K accuracy with block size 32 as a function of NFE. Models are trained on TinyGSM and evaluated on the GSM8K test set. Each curve shows the best performance for a given NFE. EIPC uses a single-block-32 model, and ARPC uses the multi-block mixture from (10) with γ1 = 0.05, γ32 = 0.95. ARPC outperforms EIPC across most NFE budgets (left and right).

**Figure 10.** Effect of mixture weights on ARPC with block size 32 as a function of NFE, at T=1. Each curve shows the best ARPC performance for a given NFE, one curve per (γ1, γ32) ∈ {(0.05, 0.95), (0.01, 0.99)} in (10). Masked diffusion (left) and uniform diffusion (right).

**Figure 11.** Effect of sampling temperature on ARPC with block size 32 as a function of NFE. Each curve shows the best ARPC performance for a given NFE, one curve per sampling temperature (T ∈ {1, 0.1}). ARPC uses the multi-block mixture from (10) with γ1 = 0.05, γ32 = 0.95. Masked diffusion (left) and uniform diffusion (right).

**Figure 12.** ARPC GSM8K accuracy at block sizes 16 vs 32 as a function of NFE per block. Models are trained on TinyGSM and evaluated on the GSM8K test set. Each curve shows the best ARPC performance for a given NFE. Both curves use the mixture from (10) with γ1 = 0.05 and γL = 0.95 for L ∈ {16, 32}. Block size 16 reaches higher accuracy than block size 32 at matched NFE per block (left and right).

**Figure 13.** Ancestral sampling GSM8K accuracy at block sizes 16 vs 32 as a function of NFE per block. Models are trained on TinyGSM and evaluated on the GSM8K test set. Each curve shows the accuracy for a given NFE on the single-block-16 and single-block-32 models. Block size 16 reaches higher accuracy than block size 32 at matched NFE per block (left and right).

**Figure 14.** Single-block ancestral sampling on OpenWebText: masked vs uniform diffusion at per-block NFE ∈ {8, 16, 32, 64}. Uniform reaches a lower Gen. PPL frontier than masked at every NFE, with the gap narrowing as NFE grows.

**Figure 15.** Block-level samplers on OpenWebText, masked prior: single-block ancestral, EIPC, and ARPC at per-block NFE ∈ {8, 16, 32, 64}. ARPC reaches the lowest Gen. PPL frontier at every NFE.

**Figure 16.** Block-level samplers on OpenWebText, uniform prior: single-block ancestral, EIPC, and ARPC at per-block NFE ∈ {8, 16, 32, 64}. ARPC and EIPC trade places across temperature and NFE, and both remain close to single-block ancestral.

**Figure 17.** ARPC vs AR on OpenWebText at per-block NFE = 64, additional late-correction schedules. From left to right: a wider corrector spacing (GE = 8, with a 40-step warmup), and two schedules with very few correctors at the end (2 correctors after a 58-step warmup; 1 corrector after a 56-step warmup with GE = 4). AR retains lower Gen. PPL across the practical temperature range in all three settings, and masked-AR…

**Figure 18.** Matched total-NFE frontier on OpenWebText, total NFE = 1024. AR runs at one forward pass per token; MDLM and Duo are full-sequence diffusion samplers at 1024 ancestral steps; ARPC (masked) and ARPC (uniform) run block-by-block at 16 NFE per block across 64 blocks. At matched total compute, single-block samplers reach lower Gen. PPL than block-by-block ARPC, consistent with the trend reported for masked di…

Original abstract

Is the uniform-state diffusion framework a more powerful paradigm for discrete diffusion? Recent studies indicate that this may be the case. In combination with predictor-corrector samplers, uniform-state diffusion models (USDMs) produce samples of higher quality than masked diffusion models (MDMs), and USDMs equal or outperform MDMs in downstream tasks, even though they exhibit higher perplexity. Two issues remain unresolved. First, existing work compares uniform and masked diffusion with uninformed correctors that re-inject noise at random positions, rather than targeting the tokens most likely to be wrong. Second, prior work compares full-sequence diffusion models, so we do not know whether the same conclusion holds when tokens are generated block by block. To address these issues, we introduce BlockGen, a blockwise sequence model that we instantiate with both masked and uniform diffusion. BlockGen trains on a mixture of block sizes, and its likelihood interpolates between AR and pure diffusion more finely than models with a fixed block size. BlockGen enables AR-informed predictor-corrector sampling (ARPC), which combines AR and diffusion predictions to re-generate unlikely tokens without an auxiliary verifier. Under ancestral sampling, uniform outperforms masked in the block-by-block setting, especially in the few-step regime. Under ARPC, the gap closes and reverses at high NFE. With block size 16 on GSM8K, MDMs reach slightly higher accuracy than USDMs, and we observe a similar trend in generative perplexity on OpenWebText. Find our code at https://github.com/jdeschena/blockgen.
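To make the two noising families in the abstract concrete, here is a minimal sketch of one forward corruption step for each. It follows the standard definitions: masked diffusion replaces a token with an absorbing mask symbol, while uniform-state diffusion replaces it with a token drawn uniformly from the vocabulary. The function names, the `keep_prob` argument, and the toy vocabulary are illustrative assumptions, not BlockGen's code.

```python
import random

MASK = "<mask>"  # hypothetical absorbing symbol for the masked variant


def corrupt_masked(tokens, keep_prob, rng):
    """Masked diffusion: each token survives with probability keep_prob,
    otherwise it is replaced by the mask symbol."""
    return [t if rng.random() < keep_prob else MASK for t in tokens]


def corrupt_uniform(tokens, keep_prob, vocab, rng):
    """Uniform-state diffusion: each token survives with probability
    keep_prob, otherwise it is resampled uniformly from the vocabulary
    (the draw may coincide with the original token)."""
    return [t if rng.random() < keep_prob else rng.choice(vocab) for t in tokens]


if __name__ == "__main__":
    rng = random.Random(0)
    vocab = ["a", "b", "c", "d"]
    clean = ["a", "b", "c", "d", "a", "b"]
    print(corrupt_masked(clean, keep_prob=0.5, rng=rng))
    print(corrupt_uniform(clean, keep_prob=0.5, vocab=vocab, rng=rng))
```

In the masked case the corrupted positions are visible in the sequence itself, whereas in the uniform case a corrupted token looks like any other token. That difference is one reason the question of where a corrector should re-inject noise matters for this comparison.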

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

BlockGen's blockwise mixture training and ARPC sampler are the real additions, but they likely confound the uniform-vs-masked claims rather than isolating diffusion state.

The main takeaway is that BlockGen trains on a mixture of block sizes and pairs it with an AR-informed predictor-corrector (ARPC) that combines AR and diffusion predictions to re-generate unlikely tokens. This produces block-by-block generation that sits between pure AR and full-sequence diffusion. Under ancestral sampling the paper reports uniform diffusion beating masked, especially at low NFE; under ARPC the gap closes and reverses at high NFE. At block size 16 on GSM8K, masked edges out uniform in accuracy, with a similar trend in generative perplexity on OpenWebText.
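The letter describes ARPC only at a high level. One way a corrector step of this kind could be organised is sketched below: score each token in the current block by its AR log-probability, select the lowest-scoring positions, and hand those positions back to the diffusion predictor for re-generation. The scoring rule, the fixed budget `num_fix`, and all function names are assumptions for illustration; the paper's actual rule for combining AR and diffusion predictions is not given in this extract.

```python
import math


def select_tokens_to_regenerate(block, ar_probs, num_fix):
    """Return the positions whose tokens have the lowest AR log-probability.

    block:    list of token ids in the current block
    ar_probs: one dict per position mapping token id -> AR probability
    num_fix:  how many positions to re-generate (hypothetical budget)
    """
    scored = []
    for pos, tok in enumerate(block):
        p = ar_probs[pos].get(tok, 0.0)
        log_p = math.log(p) if p > 0.0 else float("-inf")
        scored.append((log_p, pos))
    scored.sort()  # least likely tokens first
    return sorted(pos for _, pos in scored[:num_fix])


def renoise_masked(block, positions, mask_token):
    """Masked variant: put the selected positions back to the mask token so
    the diffusion predictor can fill them in again."""
    chosen = set(positions)
    return [mask_token if i in chosen else t for i, t in enumerate(block)]


if __name__ == "__main__":
    block = [5, 7, 9]
    ar_probs = [{5: 0.9}, {7: 0.01}, {9: 0.5}]
    positions = select_tokens_to_regenerate(block, ar_probs, num_fix=1)
    print(positions, renoise_masked(block, positions, mask_token=-1))
```

For a uniform-state model the re-noising step would presumably resample the selected positions from the vocabulary instead of writing a mask token; that substitution is likewise an assumption.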

What is actually new is the combination of mixed block-size training with both diffusion types plus the ARPC sampler itself. Prior comparisons stayed at full-sequence level or used uninformed correctors; this moves the setting closer to practical sequence work and releases code.

The soft spot is the one flagged in the stress test. Both models share the identical mixture schedule and the same ARPC without an auxiliary verifier. If the AR component aligns better with masked tokens than with uniform states, or if the mixture distribution interacts unevenly, the observed ordering could be an artifact rather than a property of the diffusion formulation. The abstract gives no ablation that holds ARPC and the mixture fixed while varying only the state, and no error bars or mixture-distribution details appear. That keeps the central empirical claims at moderate support.

The work is aimed at people already working on discrete diffusion for language and reasoning tasks who want blockwise extensions. A reader looking for new sampling tricks or code to build on will find something usable. The paper is coherent, cites the relevant prior work, and ships reproducible artifacts, so it deserves a serious referee even though the comparisons will need tighter controls on the shared components.

Referee Report

2 major / 2 minor

Summary. The paper introduces BlockGen, a blockwise sequence modeling framework trained on a mixture of block sizes that can be instantiated with either masked diffusion models (MDMs) or uniform-state diffusion models (USDMs). It proposes AR-informed predictor-corrector (ARPC) sampling that combines AR and diffusion predictions to regenerate unlikely tokens. The central empirical claims are that, under ancestral sampling, USDMs outperform MDMs (especially in the few-step regime); under ARPC the performance gap closes and can reverse at high NFE; and that with block size 16 on GSM8K, MDMs achieve slightly higher accuracy than USDMs, with a similar trend in generative perplexity on OpenWebText.

Significance. If the ordering results are robust to the shared training and sampling components, the work would provide evidence that diffusion state choice interacts with sampling method in blockwise regimes and that hybrid AR-diffusion correctors can be effective without auxiliary verifiers. The mixture-of-block-sizes training and code release are positive features that allow finer interpolation between AR and diffusion.

major comments (2)

[Abstract / Experiments] Abstract and experimental claims: the headline ordering (uniform > masked under ancestral sampling; gap closes/reverses under ARPC; MDM slightly better on GSM8K bs=16) rests on the premise that observed differences are attributable to the diffusion state rather than the shared mixture-of-block-sizes training schedule or the ARPC component. No ablation isolating diffusion state from these factors is described, which is load-bearing for the central claim that the results reflect an intrinsic property of uniform vs. masked diffusion.
[Abstract / Results] Results on GSM8K and OpenWebText: the statements that MDMs reach 'slightly higher accuracy' and 'a similar trend' in generative perplexity are presented without error bars, standard deviations across runs, or statistical significance tests. This makes it impossible to determine whether the reported small differences are reliable or could be explained by sampling variance.
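As a rough guide to the scale of the noise the second comment has in mind, the sketch below computes a binomial standard error for an accuracy measured on a fixed test set. It assumes independent per-problem outcomes and captures only test-set sampling noise, not run-to-run training or sampling variance. The accuracy values are placeholders rather than numbers from the paper, and the function names are illustrative.

```python
import math

GSM8K_TEST_SIZE = 1319  # number of problems in the GSM8K test split


def accuracy_std_error(acc, n):
    """Binomial standard error of an accuracy estimated from n problems."""
    return math.sqrt(acc * (1.0 - acc) / n)


def difference_z_score(acc_a, acc_b, n):
    """z-score for the gap between two accuracies on test sets of size n,
    treating the two estimates as independent (a simplifying assumption;
    a paired test on per-problem outcomes would be more precise)."""
    se_a = accuracy_std_error(acc_a, n)
    se_b = accuracy_std_error(acc_b, n)
    return (acc_a - acc_b) / math.sqrt(se_a ** 2 + se_b ** 2)


if __name__ == "__main__":
    # Placeholder accuracies, chosen only to illustrate the calculation.
    print(accuracy_std_error(0.50, GSM8K_TEST_SIZE))
    print(difference_z_score(0.51, 0.50, GSM8K_TEST_SIZE))
```

With 1,319 problems and an accuracy near 50%, the standard error is about 1.4 percentage points, so under these assumptions a gap of around one point between two models would be hard to distinguish from noise.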

minor comments (2)

[Abstract] The abstract states that BlockGen 'trains on a mixture of block sizes' but does not specify the exact mixture distribution or sampling procedure over block sizes; this detail is needed for reproducibility.
[Abstract] The link to code is provided, but the manuscript does not indicate whether the released code includes the exact training configurations, random seeds, and evaluation scripts used for the reported GSM8K and OpenWebText numbers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We appreciate the referee's detailed feedback on our manuscript. Below we address the major comments point by point.

Referee: [Abstract / Experiments] Abstract and experimental claims: the headline ordering (uniform > masked under ancestral sampling; gap closes/reverses under ARPC; MDM slightly better on GSM8K bs=16) rests on the premise that observed differences are attributable to the diffusion state rather than the shared mixture-of-block-sizes training schedule or the ARPC component. No ablation isolating diffusion state from these factors is described, which is load-bearing for the central claim that the results reflect an intrinsic property of uniform vs. masked diffusion.

Authors: Both the MDM and USDM models in BlockGen are trained with the exact same mixture-of-block-sizes schedule and share the same architecture and hyperparameters aside from the state definition. The ARPC sampling procedure is also identical for both. This design allows us to attribute performance differences directly to the choice of diffusion state (masked versus uniform) and its interaction with the sampling method. The central claims concern the blockwise setting with this training regime, so we believe the current experiments already provide a fair comparison without confounding factors from differing training schedules. revision: no
Referee: [Abstract / Results] Results on GSM8K and OpenWebText: the statements that MDMs reach 'slightly higher accuracy' and 'a similar trend' in generative perplexity are presented without error bars, standard deviations across runs, or statistical significance tests. This makes it impossible to determine whether the reported small differences are reliable or could be explained by sampling variance.

Authors: We agree that reporting variability is important for interpreting the small differences. In the revised version of the manuscript, we will include error bars and standard deviations computed over multiple independent training and sampling runs for the GSM8K accuracy and OpenWebText generative perplexity results. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical comparisons on external benchmarks

The paper's central claims consist of empirical performance comparisons (ancestral sampling, ARPC, GSM8K accuracy, OpenWebText generative perplexity) between USDM and MDM instantiations of BlockGen. These rest on training and evaluation against standard external benchmarks rather than any derivation, equation, or fitted parameter that reduces to the paper's own inputs by construction. No self-definitional steps, fitted-input predictions, or load-bearing self-citations appear in the reported results or methodology.

Axiom & Free-Parameter Ledger

1 free parameters · 0 axioms · 0 invented entities

The central claims rest on the empirical performance of the BlockGen architecture and ARPC sampler on standard benchmarks; the mixture of block sizes is a key design choice whose distribution is not detailed as fitted to data.

free parameters (1)

block size mixture distribution
The model is trained on a mixture of block sizes whose specific weights or sampling strategy constitute a modeling choice that affects the interpolation between AR and diffusion regimes.

Reference graph

Works this paper leans on

150 extracted references · 13 canonical work pages · 3 internal anchors

[1] The Llama 3 Herd of Models. 2024.
[2] Gemma 3 Technical Report. 2025.
[3] GPT-OSS: open-weight language models by OpenAI. 2024.
[4] SSD-LM: Semi-autoregressive Simplex-based Diffusion Language Model for Text Generation and Modular Control. 2023.
[5] Distilling the Knowledge in a Neural Network. 2015.
[6] Attention Is All You Need. 2017.
[7] Bengio, Yoshua and Ducharme, R… A Neural Probabilistic Language Model. Advances in Neural Information Processing Systems.
[8] Demystifying Diffusion Objectives: Reweighted Losses are Better Variational Bounds. 2025.
[9] Kroese, Dirk P and Taimre, Thomas and Botev, Zdravko I. Handbook of Monte Carlo methods.
[10] Diffusion LLMs Can Do Faster-Than-AR Inference via Discrete Diffusion Forcing. 2025.
[11] CtrlDiff: Boosting Large Diffusion Language Models with Dynamic Block Prediction and Controllable Generation. 2025.
[12] Fast-dLLM v2: Efficient Block-Diffusion LLM. 2025.
[13] Non-Uniform Random Variate Generation. 1986.
[14] Llama 2: Open Foundation and Fine-Tuned Chat Models. 2023.
[15] Programming Pearls. 1999.
[16] The Diffusion Duality. 2025.
[17] TinyLlama: An Open-Source Small Language Model. 2024.
[18] Soboleva, Daria and Al-Khateeb, Faisal and Myers, Robert and Steeves, Jacob R and Hestness, Joel and Dey, Nolan.
[19] Denoising Diffusion Probabilistic Models. 2020.
[20] Remasking Discrete Diffusion Models with Inference-Time Scaling. 2025.
[21] Simple Guidance Mechanisms for Discrete Diffusion Models. 2025.
[22] Luca Eyring and Vincent Pauline and Stefan Bauer and Alexey Dosovitskiy and Zeynep Akata. 2026.
[23] PixelCNN++: Improving the PixelCNN with Discretized Logistic Mixture Likelihood and Other Modifications. 2017.
[24] MAUVE: Measuring the Gap Between Neural Text and Human Text using Divergence Frontiers. 2021.
[25] Consistency Models. 2023.
[26] Categorical Reparameterization with Gumbel-Softmax. 2017.
[27] The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables. 2017.
[28] Beyond Autoregression: Fast LLMs via Self-Distillation Through Time. 2025.
[29] Yuanzhi Zhu and Xi Wang and Stéphane Lathuilière and Vicky Kalogeiton. Di. arXiv:2503.15457.
[30] Distillation of Discrete Diffusion through Dimensional Correlations. 2025.
[31] IDLM: Inverse-distilled Diffusion Language Models. 2026.
[32] Beyond Single Tokens: Distilling Discrete Diffusion Models via Discrete MMD. 2026.
[33] Deep Unsupervised Learning using Nonequilibrium Thermodynamics. 2015.
[34] Denoising Diffusion Implicit Models. 2022.
[35] Structured Denoising Diffusion Models in Discrete State-Spaces. 2023.
[36] Simplified and Generalized Masked Diffusion for Discrete Data. 2025.
[37] Generalized Interpolating Discrete Diffusion. 2025.
[38] Flex Attention: A Programming Model for Generating Optimized Attention Kernels. 2024.
[39] Generator Matching: Generative modeling with arbitrary Markov processes. 2025.
[40] PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel. 2023.
[41] Flow Matching with General Discrete Paths: A Kinetic-Optimal Perspective. 2024.
[42] Fast Inference from Transformers via Speculative Decoding. 2023.
[43] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. 2019.
[44] Variational Diffusion Models. 2023.
[45] Masked Diffusion Models are Secretly Time-Agnostic Masked Models and Exploit Inaccurate Categorical Sampling. 2025.
[46] Unified Discrete Diffusion for Categorical Data. 2024.
[47] Curriculum learning. International Conference on Machine Learning.
[48] One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling. 2014.
[49] OpenWebText Corpus.
[50] Simple and Effective Masked Diffusion Language Models. 2024.
[51] Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution. 2024.
[52] Decoupled Weight Decay Regularization. 2019.
[53] Improved Denoising Diffusion Probabilistic Models. 2021.
[54] U-Net: Convolutional Networks for Biomedical Image Segmentation. 2015.
[55] DiffusionBERT: Improving Generative Masked Language Models with Diffusion Models. 2022.
[56] Path Planning for Masked Diffusion Model Sampling. 2025.
[57] Planner Aware Path Learning in Diffusion Language Models Training. 2025.
[58] Think While You Generate: Discrete Diffusion with Planned Denoising. 2025.
[59] TiDAR: Think in Diffusion, Talk in Autoregression. 2025.
[60] Discrete Predictor-Corrector Diffusion Models for Image Synthesis. The Eleventh International Conference on Learning Representations.
[61] Fine-Tuning Masked Diffusion for Provable Self-Correction. 2025.
[62] Corrective Diffusion Language Models. 2025.
[63] Informed Correctors for Discrete Diffusion Models. 2025.
[64] Yong-Hyun Park and Chieh-Hsin Lai and Satoshi Hayakawa and Yuhta Takida and Yuki Mitsufuji. arXiv:2410.07761.
[65] Fast Solvers for Discrete Diffusion Models: Theory and Applications of High-Order Algorithms. 2025.
[66] Energy-Based Diffusion Language Models for Text Generation. 2025.
[67] DiffuCoder: Understanding and Improving Masked Diffusion Models for Code Generation. 2025.
[68] Language Models are Unsupervised Multitask Learners.
[69] Scalable Diffusion Models with Transformers. 2023.
[70] RoFormer: Enhanced Transformer with Rotary Position Embedding. 2023.
[71] Efficiently Scaling Transformer Inference. 2022.
[72] The Curious Case of Neural Text Degeneration. 2020.
[73] Diffusion-LM Improves Controllable Text Generation. 2022.
[74] Deng, Jia and Dong, Wei and Socher, Richard and Li, Li-Jia and Kai Li and Li Fei-Fei. ImageNet: A large-scale hierarchical image database.
[75] Halton Scheduler For Masked Generative Image Transformer. 2025.
[76] Movie Gen: A Cast of Media Foundation Models. 2025.
[77] GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models. 2022.
[78] Noise2Music: Text-conditioned Music Generation with Diffusion Models. 2023.
[79] Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models. 2025.
[80] Encoder-Decoder Diffusion Language Models for Efficient Training and Inference. 2025.

Showing first 80 references.

[19] [19]

2020 , eprint=

Denoising Diffusion Probabilistic Models , author=. 2020 , eprint=

2020

[20] [20]

2025 , eprint=

Remasking Discrete Diffusion Models with Inference-Time Scaling , author=. 2025 , eprint=

2025

[21] [21]

2025 , eprint=

Simple Guidance Mechanisms for Discrete Diffusion Models , author=. 2025 , eprint=

2025

[22] [22]

2026 , url=

Luca Eyring and Vincent Pauline and Stefan Bauer and Alexey Dosovitskiy and Zeynep Akata , booktitle=. 2026 , url=

2026

[23] [23]

2017 , eprint=

PixelCNN++: Improving the PixelCNN with Discretized Logistic Mixture Likelihood and Other Modifications , author=. 2017 , eprint=

2017

[24] [24]

2021 , eprint=

MAUVE: Measuring the Gap Between Neural Text and Human Text using Divergence Frontiers , author=. 2021 , eprint=

2021

[25] [25]

2023 , eprint=

Consistency Models , author=. 2023 , eprint=

2023

[26] [26]

2017 , eprint=

Categorical Reparameterization with Gumbel-Softmax , author=. 2017 , eprint=

2017

[27] [27]

2017 , eprint=

The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables , author=. 2017 , eprint=

2017

[28] [28]

2025 , eprint=

Beyond Autoregression: Fast LLMs via Self-Distillation Through Time , author=. 2025 , eprint=

2025

[29] [29]

Yuanzhi Zhu and Xi Wang and Stéphane Lathuilière and Vicky Kalogeiton , year=. Di. 2503.15457 , archivePrefix=

work page arXiv

[30] [30]

2025 , eprint=

Distillation of Discrete Diffusion through Dimensional Correlations , author=. 2025 , eprint=

2025

[31] [31]

2026 , eprint=

IDLM: Inverse-distilled Diffusion Language Models , author=. 2026 , eprint=

2026

[32] [32]

2026 , eprint=

Beyond Single Tokens: Distilling Discrete Diffusion Models via Discrete MMD , author=. 2026 , eprint=

2026

[33] [33]

2015 , eprint=

Deep Unsupervised Learning using Nonequilibrium Thermodynamics , author=. 2015 , eprint=

2015

[34] [34]

2022 , eprint=

Denoising Diffusion Implicit Models , author=. 2022 , eprint=

2022

[35] [35]

2023 , eprint=

Structured Denoising Diffusion Models in Discrete State-Spaces , author=. 2023 , eprint=

2023

[36] [36]

2025 , eprint=

Simplified and Generalized Masked Diffusion for Discrete Data , author=. 2025 , eprint=

2025

[37] [37]

2025 , eprint=

Generalized Interpolating Discrete Diffusion , author=. 2025 , eprint=

2025

[38] [38]

2024 , eprint=

Flex Attention: A Programming Model for Generating Optimized Attention Kernels , author=. 2024 , eprint=

2024

[39] [39]

2025 , eprint=

Generator Matching: Generative modeling with arbitrary Markov processes , author=. 2025 , eprint=

2025

[40] [40]

2023 , eprint=

PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel , author=. 2023 , eprint=

2023

[41] [41]

2024 , eprint=

Flow Matching with General Discrete Paths: A Kinetic-Optimal Perspective , author=. 2024 , eprint=

2024

[42] [42]

2023 , eprint=

Fast Inference from Transformers via Speculative Decoding , author=. 2023 , eprint=

2023

[43] [43]

2019 , eprint=

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , author=. 2019 , eprint=

2019

[44] [44]

2023 , eprint=

Variational Diffusion Models , author=. 2023 , eprint=

2023

[45] [45]

2025 , eprint=

Masked Diffusion Models are Secretly Time-Agnostic Masked Models and Exploit Inaccurate Categorical Sampling , author=. 2025 , eprint=

2025

[46] [46]

2024 , eprint=

Unified Discrete Diffusion for Categorical Data , author=. 2024 , eprint=

2024

[47] [47]

International Conference on Machine Learning , year=

Curriculum learning , author=. International Conference on Machine Learning , year=

[48] [48]

2014 , eprint=

One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling , author=. 2014 , eprint=

2014

[49] [49]

OpenWebText Corpus , author=

[50] [50]

2024 , eprint=

Simple and Effective Masked Diffusion Language Models , author=. 2024 , eprint=

2024

[51] [51]

2024 , eprint=

Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution , author=. 2024 , eprint=

2024

[52] [52]

2019 , eprint=

Decoupled Weight Decay Regularization , author=. 2019 , eprint=

2019

[53] [53]

2021 , eprint=

Improved Denoising Diffusion Probabilistic Models , author=. 2021 , eprint=

2021

[54] [54]

2015 , eprint=

U-Net: Convolutional Networks for Biomedical Image Segmentation , author=. 2015 , eprint=

2015

[55] [55]

2022 , eprint=

DiffusionBERT: Improving Generative Masked Language Models with Diffusion Models , author=. 2022 , eprint=

2022

[56] [56]

2025 , eprint=

Path Planning for Masked Diffusion Model Sampling , author=. 2025 , eprint=

2025

[57] [57]

2025 , eprint=

Planner Aware Path Learning in Diffusion Language Models Training , author=. 2025 , eprint=

2025

[58] [58]

2025 , eprint=

Think While You Generate: Discrete Diffusion with Planned Denoising , author=. 2025 , eprint=

2025

[59] [59]

2025 , eprint=

TiDAR: Think in Diffusion, Talk in Autoregression , author=. 2025 , eprint=

2025

[60] [60]

The Eleventh International Conference on Learning Representations , year=

Discrete Predictor-Corrector Diffusion Models for Image Synthesis , author=. The Eleventh International Conference on Learning Representations , year=

[61] [61]

2025 , eprint=

Fine-Tuning Masked Diffusion for Provable Self-Correction , author=. 2025 , eprint=

2025

[62] [62]

2025 , eprint=

Corrective Diffusion Language Models , author=. 2025 , eprint=

2025

[63] [63]

2025 , eprint=

Informed Correctors for Discrete Diffusion Models , author=. 2025 , eprint=

2025

[64] [64]

2410.07761 , archivePrefix=

Yong-Hyun Park and Chieh-Hsin Lai and Satoshi Hayakawa and Yuhta Takida and Yuki Mitsufuji , year=. 2410.07761 , archivePrefix=

work page arXiv

[65] [65]

2025 , eprint=

Fast Solvers for Discrete Diffusion Models: Theory and Applications of High-Order Algorithms , author=. 2025 , eprint=

2025

[66] [66]

2025 , eprint=

Energy-Based Diffusion Language Models for Text Generation , author=. 2025 , eprint=

2025

[67] [67]

2025 , eprint=

DiffuCoder: Understanding and Improving Masked Diffusion Models for Code Generation , author=. 2025 , eprint=

2025

[68] [68]

Language Models are Unsupervised Multitask Learners , author=

[69] [69]

2023 , eprint=

Scalable Diffusion Models with Transformers , author=. 2023 , eprint=

2023

[70] [70]

2023 , eprint=

RoFormer: Enhanced Transformer with Rotary Position Embedding , author=. 2023 , eprint=

2023

[71] [71]

2022 , eprint=

Efficiently Scaling Transformer Inference , author=. 2022 , eprint=

2022

[72] [72]

2020 , eprint=

The Curious Case of Neural Text Degeneration , author=. 2020 , eprint=

2020

[73] [73]

2022 , eprint=

Diffusion-LM Improves Controllable Text Generation , author=. 2022 , eprint=

2022

[74] [74]

ImageNet: A large-scale hierarchical image database , year=

Deng, Jia and Dong, Wei and Socher, Richard and Li, Li-Jia and Kai Li and Li Fei-Fei , booktitle=. ImageNet: A large-scale hierarchical image database , year=

[75] [75]

2025 , eprint=

Halton Scheduler For Masked Generative Image Transformer , author=. 2025 , eprint=

2025

[76] [76]

2025 , eprint=

Movie Gen: A Cast of Media Foundation Models , author=. 2025 , eprint=

2025

[77] [77]

2022 , eprint=

GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models , author=. 2022 , eprint=

2022

[78] [78]

2023 , eprint=

Noise2Music: Text-conditioned Music Generation with Diffusion Models , author=. 2023 , eprint=

2023

[79] [79]

2025 , eprint=

Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models , author=. 2025 , eprint=

2025

[80] [80]

2025 , eprint=

Encoder-Decoder Diffusion Language Models for Efficient Training and Inference , author=. 2025 , eprint=

2025