VoidPadding: Let [VOID] Handle Padding in Masked Diffusion Language Models so that [EOS] Can Focus on Semantic Termination

Alex Lamb; Chunyu Liu; Kaisen Yang; Zhengyang Fan

arxiv: 2606.17999 · v2 · pith:I7UXGMYMnew · submitted 2026-06-16 · 💻 cs.CL

Pith reviewed 2026-06-27 00:40 UTC · model grok-4.3

classification 💻 cs.CL

keywords masked diffusion language models · padding tokens · EOS token · instruction tuning · early stopping · response length modeling · denoising

The pith

VoidPadding introduces a dedicated [VOID] token for padding so [EOS] signals only semantic termination in masked diffusion language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Masked diffusion language models build responses by denoising a preallocated masked canvas whose length must be chosen in advance. Existing training reuses the [EOS] token for both true termination and padding, giving it conflicting signals. This overlap produces [EOS] overflow when the model decodes long blocks. VoidPadding replaces padding with a new [VOID] token while reserving [EOS] for termination. The change yields higher average scores on math-reasoning and code-generation tasks and cuts the number of denoising steps required.

Core claim

By training with [VOID] as the padding token instead of repeated [EOS], the model learns separate representations so that [EOS] can be used for reliable early stopping and [VOID] can guide adaptive expansion of the response canvas during inference.
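To make the training-side difference concrete, here is a minimal sketch of how a response might be laid out on a fixed-length canvas under each scheme. The function names and the exact layout (a single [EOS] followed by [VOID] fill) are assumptions for illustration; the source only states that [VOID] handles padding and [EOS] is reserved for termination.

```python
from typing import List

EOS, VOID = "[EOS]", "[VOID]"


def pad_with_eos(response: List[str], canvas_len: int) -> List[str]:
    """Conventional scheme: [EOS] both terminates and fills the canvas."""
    if len(response) >= canvas_len:
        raise ValueError("canvas must leave room for at least one [EOS]")
    return response + [EOS] * (canvas_len - len(response))


def pad_with_void(response: List[str], canvas_len: int) -> List[str]:
    """Decoupled scheme (assumed layout): one [EOS], then [VOID] padding."""
    if len(response) >= canvas_len:
        raise ValueError("canvas must leave room for at least one [EOS]")
    n_void = canvas_len - len(response) - 1
    return response + [EOS] + [VOID] * n_void


if __name__ == "__main__":
    tokens = ["The", "answer", "is", "42"]
    print(pad_with_eos(tokens, 8))   # [EOS] appears four times
    print(pad_with_void(tokens, 8))  # [EOS] appears once, [VOID] three times
```

Under the first layout the number of [EOS] targets grows with the amount of padding, so the token also carries length information. Under the second it appears once per response regardless of canvas size.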

What carries the argument

The [VOID] token, introduced for padding during instruction tuning, whose learned signal later controls adaptive canvas expansion while [EOS] controls early stopping.
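The two inference-time decisions can be sketched as simple predicates. This is a hypothetical reading, not the paper's algorithm: the stopping condition, the tail-window mean, the direction of the threshold test, and all names (`can_stop_early`, `should_expand`, `window`, `threshold`) are assumptions, loosely motivated by the "tail window w" and "threshold sweep" mentioned in the figure captions.

```python
from typing import Sequence

EOS, MASK = "[EOS]", "[MASK]"


def can_stop_early(canvas: Sequence[str]) -> bool:
    """Assumed rule: stop once an [EOS] is committed and nothing before it is masked."""
    tokens = list(canvas)
    if EOS not in tokens:
        return False
    first_eos = tokens.index(EOS)
    return MASK not in tokens[:first_eos]


def should_expand(void_conf: Sequence[float], window: int, threshold: float) -> bool:
    """Assumed rule: expand when mean [VOID] confidence over the tail is low.

    Low [VOID] confidence at the end of the canvas is taken to mean the
    response has not finished inside the current canvas.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    tail = list(void_conf)[-window:]
    if not tail:
        raise ValueError("void_conf must be non-empty")
    return sum(tail) / len(tail) < threshold


if __name__ == "__main__":
    print(can_stop_early(["x", "y", EOS, MASK]))   # True
    print(should_expand([0.9, 0.1, 0.2], 2, 0.5))  # True
```

Keeping the two checks separate mirrors the role split described above: one token decides when to stop, the other decides whether the canvas is long enough.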

If this is right

On Dream-7B-Instruct, the block-size-averaged mean over four math and code tasks rises by 17.84 points over the original model.
The same metric improves by 6.95 points over the prior RainbowPadding method.
Average decoding cost, measured in number of function evaluations (NFE), drops by 55.7 percent.
Early stopping becomes feasible without sacrificing response quality.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same role-separation idea could be tested on other non-autoregressive generative architectures that rely on length or termination tokens.
If [VOID] truly decouples signals, similar dedicated tokens might simplify length control in any canvas-based diffusion model.
The method reduces the engineering burden of choosing a single fixed block size at inference time.

Load-bearing premise

The dual use of [EOS] as both terminator and padding token is the main cause of overflow, and a separate [VOID] token plus ordinary training is enough to produce cleanly separated signals.

What would settle it

Train an identical model with VoidPadding and observe whether [EOS] overflow still occurs under the same large-block decoding regime on the reported benchmarks.

Figures

Figures reproduced from arXiv: 2606.17999 by Alex Lamb, Chunyu Liu, Kaisen Yang, Zhengyang Fan.

**Figure 2.** [EOS] as a learned length signal. Initial [EOS] confidence aligns with the [EOS] label ratio of padded data and the raw response-length CDF.

**Figure 3.** Attention intervention on the B = L = 512 [EOS] overflow trajectory. Both cases use the same prompt. Masking attention to committed [EOS] tokens avoids [EOS] overflow and recovers the correct answer.

**Figure 4.** Large-block stress test with L0 = 256 and B = 256. Lower scores indicate [EOS] overflow.

**Figure 5.** LLaDA-8B-Instruct VoidPadding NFE/example for B ∈ {64, 128, 512}. Avg. is the arithmetic mean over four benchmarks. NFE/example at B = 64 / 128 / 512: GSM8K 150.2 / 149.8 / 172.0; HumanEval 253.6 / 254.8 / 393.8; MATH500 363.7 / 366.4 / 382.9; MBPP 67.8 / 69.1 / 99.1; Avg. 208.8 / 210.0 / 262.0.

**Figure 6.** Dream-7B-Instruct VoidPadding NFE/example for B ∈ {64, 128, 512}. Avg. is the arithmetic mean over four benchmarks.

**Figure 8.** Fixed generation length sensitivity for LLaDA …

**Figure 9.** Window-size accuracy statistics for VoidPadding-finetuned LLaDA-8B-Instruct + VoidExpansion. Bars are grouped by benchmark; within each group, the four bars vary the tail window w ∈ {8, 16, 24, 32}. Darker blue denotes smaller w. Labels give the exact score. GSM8K and MATH500 report accuracy, HumanEval and MBPP report pass@1, and Mean is the arithmetic average over the four benchmarks.

**Figure 10.** Window-size NFE statistics for VoidPadding …

**Figure 11.** Dream-7B-Instruct fixed-512 block-length sweep comparing the original model with RainbowPadding …

**Figure 12.** Dream-7B-Base fixed-512 block-length sweep comparing EOS Padding with RainbowPadding and …

**Figure 13.** Same-step training-budget comparison on LLaDA-8B-Instruct. Bars compare Early RainbowPadding …

**Figure 14.** Same-step training-budget comparison on Dream-7B-Instruct. Bars compare Early RainbowPadding …

**Figure 15.** Initial-length ablation for VoidPadding-finetuned LLaDA-8B-Base + VoidExpansion. Bars compare …

**Figure 16.** Initial-length ablation for VoidPadding-finetuned LLaDA-8B-Instruct + VoidExpansion. Bars compare …

**Figure 17.** Fixed-length LLaDA-8B-Instruct NFE with Lmax = 512, averaged over B ∈ {64, 128, 512}. Bars compare vanilla stopping with [EOS]-termination variants; Avg. is the arithmetic average over benchmarks.

**Figure 18.** Block-size accuracy statistics for VoidPadding-finetuned LLaDA-8B-Instruct + VoidExpansion. Bars …

**Figure 19.** Initial-length NFE ablation results for VoidPadding-finetuned LLaDA-8B-Instruct + VoidExpansion over L0 ∈ {64, 96, 128, 160}. Avg. is the benchmark mean.

**Figure 21.** VoidPadding-finetuned LLaDA-8B-Instruct + VoidExpansion threshold sweep. The sweep uses …

**Figure 22.** VoidPadding-finetuned LLaDA-8B-Instruct + VoidExpansion threshold-sweep average NFE. The …

**Figure 23.** VoidPadding-finetuned LLaDA-8B-Base + VoidExpansion threshold sweep. The sweep uses …

**Figure 24.** VoidPadding-finetuned Dream-7B-Instruct + VoidExpansion threshold sweep. The sweep uses …

**Figure 25.** VoidPadding-finetuned Dream-7B-Base + VoidExpansion threshold sweep. The sweep uses …
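Several captions above describe "Avg." as the arithmetic mean over the four benchmarks. As a quick sanity check, the sketch below recomputes that mean from the per-benchmark NFE values given in the Figure 5 caption. It assumes the three values per benchmark are ordered B = 64, 128, 512 and reads the fused entries as 69.1 / 99.1 and 210.0 / 262.0. The variable and function names are illustrative.

```python
from typing import Dict, List, Tuple

BLOCK_SIZES: Tuple[int, ...] = (64, 128, 512)

# NFE/example per benchmark, as read from the Figure 5 caption
# (ordering by block size is an assumption).
NFE: Dict[str, Tuple[float, ...]] = {
    "GSM8K": (150.2, 149.8, 172.0),
    "HumanEval": (253.6, 254.8, 393.8),
    "MATH500": (363.7, 366.4, 382.9),
    "MBPP": (67.8, 69.1, 99.1),
}

REPORTED_AVG: Tuple[float, ...] = (208.8, 210.0, 262.0)


def benchmark_means(table: Dict[str, Tuple[float, ...]]) -> List[float]:
    """Arithmetic mean over benchmarks, computed separately per block size."""
    n_cols = len(next(iter(table.values())))
    return [
        sum(row[col] for row in table.values()) / len(table)
        for col in range(n_cols)
    ]


if __name__ == "__main__":
    for b, mean, reported in zip(BLOCK_SIZES, benchmark_means(NFE), REPORTED_AVG):
        print(f"B={b}: computed {mean:.3f}, reported {reported}")
```

The computed means agree with the reported "Avg." values to within rounding, which supports reading the garbled caption entries as above.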

Original abstract

MDLMs generate text by denoising a preallocated masked response canvas, making response-length modeling central to instruction tuning. Existing MDLMs often inherit the autoregressive convention of using repeated [EOS] tokens for padding during instruction tuning, giving [EOS] a dual role as both a semantic terminator and a padding token. We show that this dual role is a root cause of [EOS] overflow under large-block decoding. To decouple these roles, we propose VoidPadding, which introduces [VOID] for padding and reserves [EOS] for termination. During inference, the learned [EOS] signal enables early stopping, while the learned [VOID] signal guides adaptive response canvas expansion. On Dream-7B-Instruct, VoidPadding improves the block-size-averaged four-task mean across mathematical reasoning and code generation benchmarks by +17.84 points over the original model and +6.95 points over RainbowPadding, while reducing decoding NFE by 55.7% on average. Code is available at https://github.com/Haru-LCY/VoidPadding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

VoidPadding separates [EOS] and padding roles in MDLMs with clear reported gains on Dream-7B, but the causal link to dual-role misuse is not isolated by the evidence.

The main thing to know is that this paper adds a [VOID] token for padding during training of masked diffusion language models, reserves [EOS] for termination only, and uses the learned signals at inference for early stopping plus adaptive canvas expansion. On Dream-7B-Instruct it reports a +17.84 point lift on the block-size-averaged mean of four math and code tasks over the original model, +6.95 over RainbowPadding, and a 55.7% drop in NFE.

The work does a clean job stating the problem with inherited autoregressive padding conventions and then shipping a minimal change plus code. The quantitative results are specific enough to be useful inside the subfield, and releasing the repo lets others reproduce the setup.

The soft spot is the missing link between the hypothesized mechanism and the gains. The paper treats the dual role of [EOS] as the root cause of overflow, yet supplies no token-level diagnostics from the baseline showing excess [EOS] probability mass on post-termination positions. The improvements are consistent with the story but could equally trace to the new early-stopping rule, the extra vocabulary item, or differences in how termination is now supervised. No ablations separate those factors, and the abstract-level description gives no statistical significance or split details.

This is for people already working on non-autoregressive or diffusion-based generators who need better length control. A reader in that niche gets a practical trick and verifiable numbers. The empirical claims are sharp enough and the thinking is straightforward, so the paper deserves a serious referee even if stronger mechanistic checks would be needed in revision.

I would send it to peer review.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes VoidPadding for masked diffusion language models (MDLMs), claiming that the dual role of [EOS] as both semantic terminator and padding token causes [EOS] overflow under large-block decoding. By introducing a dedicated [VOID] token for padding and reserving [EOS] for termination, the approach enables early stopping via the learned [EOS] signal and adaptive canvas expansion via [VOID] during inference. On Dream-7B-Instruct, it reports a +17.84 point improvement in block-size-averaged four-task mean (math reasoning and code generation) over the original model and +6.95 over RainbowPadding, alongside a 55.7% average reduction in decoding NFE.

Significance. If the results hold, this offers a simple, practical fix to response-length modeling in MDLMs, a key issue for instruction tuning in non-autoregressive generation. The reported gains on reasoning benchmarks and efficiency improvements could make MDLMs more competitive, and the public code release aids reproducibility.

major comments (2)

[Abstract] The claim that the dual semantic/padding role of [EOS] is the root cause of overflow is not supported by any token-level statistics (e.g., [EOS] probability mass on post-termination positions in the baseline) or ablations that isolate this mechanism from the new adaptive expansion rule or vocabulary changes.
[Abstract] The quantitative claims (+17.84 / +6.95 points, -55.7% NFE) are presented without statistical significance tests, exact baseline definitions, data splits, or ablation controls, preventing verification of the central performance result.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the two major comments below and will revise the paper to strengthen the supporting evidence and experimental details as outlined.

Referee: [Abstract] The claim that the dual semantic/padding role of [EOS] is the root cause of overflow is not supported by any token-level statistics (e.g., [EOS] probability mass on post-termination positions in the baseline) or ablations that isolate this mechanism from the new adaptive expansion rule or vocabulary changes.

Authors: We acknowledge that the abstract and current presentation do not include explicit token-level statistics or isolating ablations. The manuscript supports the claim through the observed [EOS] overflow behavior under large-block decoding and the performance gains from decoupling via [VOID], but we agree these are indirect. In the revision we will add (i) visualizations of [EOS] probability mass on post-termination positions for the baseline and (ii) ablations that separately control for the adaptive expansion rule and vocabulary changes.

Revision: yes

Referee: [Abstract] The quantitative claims (+17.84 / +6.95 points, -55.7% NFE) are presented without statistical significance tests, exact baseline definitions, data splits, or ablation controls, preventing verification of the central performance result.

Authors: The reported numbers are block-size-averaged means over the four tasks with the original model and RainbowPadding as baselines, using the standard splits for the math and code benchmarks. However, we did not include p-values, variance estimates, or exhaustive ablation tables in the abstract. We will expand the experimental section with statistical significance tests, precise baseline configurations, data-split details, and additional ablation controls to improve verifiability.

Revision: yes

Circularity Check

0 steps flagged

Empirical intervention with no derivation chain or self-referential definitions

The paper introduces VoidPadding by adding a dedicated [VOID] token for padding and reserving [EOS] for termination, then trains the model under standard procedures and reports measured benchmark improvements (+17.84 points, -55.7% NFE). No equations, fitted parameters, uniqueness theorems, or self-citations appear in the provided text. The central claim is an empirical outcome of the token change and early-stopping logic rather than a quantity defined in terms of itself or reduced by construction to prior fitted values. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central addition is the new [VOID] token together with the assumption that standard training will produce distinct learned behaviors for it and for [EOS].

axioms (1)

Domain assumption: Standard training of MDLMs on instruction data will cause the model to learn distinct representations and usage patterns for a newly introduced [VOID] token versus the existing [EOS] token.
The method depends on the model acquiring separate signals for padding and termination without further architectural changes.

invented entities (1)

[VOID] token · no independent evidence
purpose: Dedicated padding token that decouples length control from semantic termination
New special token introduced by the paper to solve the dual-role problem.

pith-pipeline@v0.9.1-grok · 5746 in / 1326 out tokens · 55580 ms · 2026-06-27T00:40:05.051792+00:00 · methodology



[23] [23]

arXiv preprint arXiv:2508.13021 , year=

Pc-sampler: Position-aware calibration of decoding bias in masked diffusion models , author=. arXiv preprint arXiv:2508.13021 , year=

arXiv

[24] [24]

arXiv preprint arXiv:2108.07732 , year=

Program synthesis with large language models , author=. arXiv preprint arXiv:2108.07732 , year=

Pith/arXiv arXiv

[25] [25]

arXiv preprint arXiv:2107.03374 , year=

Evaluating large language models trained on code , author=. arXiv preprint arXiv:2107.03374 , year=

Pith/arXiv arXiv

[26] [26]

arXiv preprint arXiv:2103.03874 , year=

Measuring mathematical problem solving with the math dataset , author=. arXiv preprint arXiv:2103.03874 , year=

Pith/arXiv arXiv

[27] [27]

arXiv preprint arXiv:2110.14168 , year=

Training verifiers to solve math word problems , author=. arXiv preprint arXiv:2110.14168 , year=

Pith/arXiv arXiv

[28] [28]

arXiv preprint arXiv:2411.15124 , year=

Tulu 3: Pushing frontiers in open language model post-training , author=. arXiv preprint arXiv:2411.15124 , year=

Pith/arXiv arXiv

[29] [29]

Hugging Face Blog , volume=

SmolLM-blazingly fast and remarkably powerful , author=. Hugging Face Blog , volume=

[30] [30]

arXiv preprint arXiv:2310.16834 , year=

Discrete diffusion modeling by estimating the ratios of the data distribution , author=. arXiv preprint arXiv:2310.16834 , year=

Pith/arXiv arXiv

[31] [31]

arXiv preprint arXiv:2502.06768 , year=

Train for the worst, plan for the best: Understanding token ordering in masked diffusions , author=. arXiv preprint arXiv:2502.06768 , year=

arXiv

[32] [32]

International Conference on Learning Representations , volume=

Block diffusion: Interpolating between autoregressive and diffusion language models , author=. International Conference on Learning Representations , volume=

[33] [33]

arXiv preprint arXiv:2510.06303 , year=

Sdar: A synergistic diffusion-autoregression paradigm for scalable sequence generation , author=. arXiv preprint arXiv:2510.06303 , year=

arXiv

[34] [34]

arXiv preprint arXiv:2509.06949 , year=

Revolutionizing reinforcement learning framework for diffusion large language models , author=. arXiv preprint arXiv:2509.06949 , year=

arXiv

[35] [35]

arXiv preprint arXiv:2603.22248 , year=

Confidence-Based Decoding is Provably Efficient for Diffusion Language Models , author=. arXiv preprint arXiv:2603.22248 , year=

arXiv

[36] [36]

, author=

Lora: Low-rank adaptation of large language models. , author=. Iclr , volume=

[37] [37]

arXiv preprint arXiv:2510.17206 , year=

Soft-masked diffusion language models , author=. arXiv preprint arXiv:2510.17206 , year=

arXiv

[38] [38]

arXiv preprint arXiv:2509.24389 , year=

Llada-moe: A sparse moe diffusion language model , author=. arXiv preprint arXiv:2509.24389 , year=

arXiv

[39] [39]

arXiv preprint arXiv:2505.19223 , year=

Llada 1.5: Variance-reduced preference optimization for large language diffusion models , author=. arXiv preprint arXiv:2505.19223 , year=

Pith/arXiv arXiv

[40] [40]

0: Scaling up diffusion language models to 100b , author=

Llada2. 0: Scaling up diffusion language models to 100b , author=. arXiv preprint arXiv:2512.15745 , year=

Pith/arXiv arXiv

[41] [41]

Proceedings of the AAAI Conference on Artificial Intelligence , volume=

Sparse-dllm: Accelerating diffusion llms with dynamic cache eviction , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=

[42] [42]

The Fourteenth International Conference on Learning Representations , year=

Dynamic-dLLM: Dynamic Cache-Budget and Adaptive Parallel Decoding for Training-Free Acceleration of Diffusion LLM , author=. The Fourteenth International Conference on Learning Representations , year=

[43] [43]

arXiv preprint arXiv:2602.05992 , year=

DSB: Dynamic Sliding Block Scheduling for Diffusion LLMs , author=. arXiv preprint arXiv:2602.05992 , year=

Pith/arXiv arXiv

[44] [44]

arXiv preprint arXiv:2605.10938 , year=

ELF: Embedded Language Flows , author=. arXiv preprint arXiv:2605.10938 , year=

Pith/arXiv arXiv

[45] [45]

arXiv preprint arXiv:2512.22737 , year=

Wedlm: Reconciling diffusion language models with standard causal attention for fast inference , author=. arXiv preprint arXiv:2512.22737 , year=

arXiv

[46] [46]

arXiv preprint arXiv:2605.06548 , year=

Continuous Latent Diffusion Language Model , author=. arXiv preprint arXiv:2605.06548 , year=

Pith/arXiv arXiv

[47] [47]

Advances in Neural Information Processing Systems , volume=

Continuous diffusion model for language modeling , author=. Advances in Neural Information Processing Systems , volume=

[48] [48]

arXiv preprint arXiv:2603.02547 , year=

Codar: Continuous diffusion language models are more powerful than you think , author=. arXiv preprint arXiv:2603.02547 , year=

arXiv

[49] [49]

Advances in neural information processing systems , volume=

Denoising diffusion probabilistic models , author=. Advances in neural information processing systems , volume=

[50] [50]

Findings of the Association for Computational Linguistics: NAACL 2025 , pages=

Discrete diffusion language model for efficient text summarization , author=. Findings of the Association for Computational Linguistics: NAACL 2025 , pages=

2025

[51] [51]

arXiv preprint arXiv:2506.10892 , year=

The diffusion duality , author=. arXiv preprint arXiv:2506.10892 , year=

arXiv

[52] [52]

Advances in neural information processing systems , volume=

Argmax flows and multinomial diffusion: Learning categorical distributions , author=. Advances in neural information processing systems , volume=

[53] [53]

Proceedings of the 61st annual meeting of the association for computational linguistics (volume 1: Long papers) , pages=

Diffusionbert: Improving generative masked language models with diffusion models , author=. Proceedings of the 61st annual meeting of the association for computational linguistics (volume 1: Long papers) , pages=

[54] [54]

arXiv preprint arXiv:2508.19982 , year=

Diffusion language models know the answer before decoding , author=. arXiv preprint arXiv:2508.19982 , year=

Pith/arXiv arXiv

[55] [55]

Advances in neural information processing systems , volume=

Accelerating diffusion llms via adaptive parallel decoding , author=. Advances in neural information processing systems , volume=

[56] [56]

arXiv preprint arXiv:2506.10848 , year=

Accelerating diffusion large language models with slowfast sampling: The three golden principles , author=. arXiv preprint arXiv:2506.10848 , year=

arXiv

[57] [57]

arXiv preprint arXiv:2509.20624 , year=

Fs-dfm: Fast and accurate long text generation with few-step diffusion language models , author=. arXiv preprint arXiv:2509.20624 , year=

Pith/arXiv arXiv

[58] [58]

arXiv preprint arXiv:2605.00161 , year=

Consistent Diffusion Language Models , author=. arXiv preprint arXiv:2605.00161 , year=

Pith/arXiv arXiv

[59] [59]

arXiv preprint arXiv:2602.12262 , year=

T3d: Few-step diffusion language models via trajectory self-distillation with direct discriminative optimization , author=. arXiv preprint arXiv:2602.12262 , year=

arXiv

[60] [60]

arXiv preprint arXiv:2506.00290 , year=

Dlm-one: Diffusion language models for one-step sequence generation , author=. arXiv preprint arXiv:2506.00290 , year=

arXiv

[61] [61]

arXiv preprint arXiv:2509.01025 , year=

Any-order flexible length masked diffusion , author=. arXiv preprint arXiv:2509.01025 , year=

arXiv

[62] [62]

arXiv preprint arXiv:2604.23994 , year=

When to Commit? Towards Variable-Size Self-Contained Blocks for Discrete Diffusion Language Models , author=. arXiv preprint arXiv:2604.23994 , year=

Pith/arXiv arXiv

[63] [63]

Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

Ssd-lm: Semi-autoregressive simplex-based diffusion language model for text generation and modular control , author=. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

[64] [64]

arXiv preprint arXiv:1412.6980 , year=

Adam: A method for stochastic optimization , author=. arXiv preprint arXiv:1412.6980 , year=

Pith/arXiv arXiv

[65] [65]

arXiv preprint arXiv:2511.21759 , year=

Orchestrating Dual-Boundaries: An Arithmetic Intensity Inspired Acceleration Framework for Diffusion Language Models , author=. arXiv preprint arXiv:2511.21759 , year=

arXiv

[66] [66]

Advances in Neural Information Processing Systems , year=

Simplified and Generalized Masked Diffusion for Discrete Data , author=. Advances in Neural Information Processing Systems , year=