Self-Distilled Trajectory-Aware Boltzmann Modeling: Bridging the Training-Inference Discrepancy in Diffusion Language Models
Pith reviewed 2026-05-13 05:48 UTC · model grok-4.3
The pith
A new optimization framework for diffusion language models uses self-distilled inference trajectories and Boltzmann modeling of entropies to close the gap with standard supervised fine-tuning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that modeling the inference unmasking preference as a Boltzmann distribution over predictive entropies and deriving a tractable pairwise ranking objective allows self-distilled trajectories to support genuine knowledge acquisition in diffusion language models, rather than serving only for sampling acceleration.
What carries the argument
TABOM, which treats the sequence of token unmasking during inference as a Boltzmann distribution over the model's predictive entropies and optimizes a pairwise ranking loss to enforce the same ordering during training.
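To pin down what that machinery looks like in symbols, here is a minimal reconstruction under assumptions: $H_i$ denotes the model's predictive entropy at masked position $i$, $\mathcal{M}$ the current masked set, and $\tau$ a temperature, and the pairwise objective is taken to be the standard Bradley-Terry reduction of the Boltzmann ordering model. This is a sketch in our notation, not the paper's.

```latex
% Hedged sketch: H_i, \tau, and the Bradley-Terry reduction are assumptions here.
% Boltzmann preference over which masked position to unmask next
% (lower entropy = easier = unmasked earlier):
p(\text{unmask } i \mid x_t)
  = \frac{\exp(-H_i/\tau)}{\sum_{j \in \mathcal{M}} \exp(-H_j/\tau)}.
% Tractable pairwise reduction: for each pair (i \prec j) where the observed
% trajectory unmasks i before j, penalize orderings in which H_i is not smaller:
\mathcal{L}_{\text{rank}}
  = -\sum_{i \prec j} \log \sigma\!\left(\frac{H_j - H_i}{\tau}\right),
\qquad \sigma(x) = \frac{1}{1 + e^{-x}}.
```

Minimizing $\mathcal{L}_{\text{rank}}$ pushes the model to assign lower entropy to tokens it unmasked earlier, aligning the training-time certainty ordering with the observed inference trajectory.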
If this is right
- TABOM produces substantial performance gains on tasks from new domains.
- The method expands the effective knowledge boundary reachable by diffusion language models.
- Catastrophic forgetting is significantly reduced relative to standard supervised fine-tuning on the same trajectories.
Where Pith is reading between the lines
- The same entropy-based ranking principle could be tested in other iterative generative models whose training and inference steps differ in structure.
- Pairwise ranking derived from model-internal uncertainty signals may offer a general way to incorporate self-generated data without external supervision.
- The approach suggests that explicit trajectory alignment could become a standard post-training step for any model that decodes in an easy-to-hard sequence.
Load-bearing premise
Modeling the inference unmasking preference as a Boltzmann distribution over predictive entropies and deriving a pairwise ranking objective from it will produce genuine knowledge acquisition rather than marginal or illusory gains.
What would settle it
Retraining a diffusion language model with the TABOM ranking loss on its own inference trajectories and finding no improvement over standard NELBO fine-tuning on held-out domain tasks, or no reduction in forgetting on prior tasks, would falsify the central claim.
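As a concrete protocol, that settling experiment could look like the sketch below. Everything here is a hypothetical stand-in: `finetune`, `eval_accuracy`, and the dataset handles are caller-supplied, not APIs from the paper.

```python
def settling_experiment(base_model, trajectories, new_domain_eval,
                        prior_task_eval, finetune, eval_accuracy):
    """Hypothetical protocol: identical self-distilled trajectories, two
    objectives; `finetune` and `eval_accuracy` are caller-supplied stand-ins."""
    tabom = finetune(base_model, trajectories, objective="nelbo+rank")  # TABOM
    nelbo = finetune(base_model, trajectories, objective="nelbo")       # control
    gain = (eval_accuracy(tabom, new_domain_eval)
            - eval_accuracy(nelbo, new_domain_eval))
    base_prior = eval_accuracy(base_model, prior_task_eval)
    forget_tabom = base_prior - eval_accuracy(tabom, prior_task_eval)
    forget_nelbo = base_prior - eval_accuracy(nelbo, prior_task_eval)
    # The central claim fails if gain <= 0 and forget_tabom >= forget_nelbo.
    return gain, forget_tabom, forget_nelbo
```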
Original abstract
Diffusion Language Models (DLMs) have recently emerged as a promising alternative to autoregressive language models, offering stronger global awareness and highly parallel generation. However, post-training DLMs with standard Negative Evidence Lower Bound (NELBO)-based supervised fine-tuning remains inefficient: training reconstructs randomly masked tokens in a single step, whereas inference follows a confidence-guided, multi-step easy-to-hard denoising trajectory. Recent trajectory-based self-distillation methods exploit such inference trajectories mainly for sampling-step compression and acceleration, often improving decoding efficiency without substantially enhancing the model's underlying capability, and may even degrade performance under full diffusion decoding. In this work, we ask whether self-distilled trajectories can be used not merely for faster inference, but for genuine knowledge acquisition. Although these trajectories lie on the pretrained DLM's own distributional manifold and thus offer a potentially lower optimization barrier, we find that naively fine-tuning on them with standard NELBO objectives yields only marginal gains. To address this limitation, we propose Trajectory-Aligned optimization via Boltzmann Modeling (TABOM), a self-distilled trajectory-based post-training framework that aligns training with the easy-to-hard structure of inference. TABOM models the inference unmasking preference as a Boltzmann distribution over predictive entropies and derives a tractable pairwise ranking objective to align the model's certainty ordering with the observed decoding trajectory. Empirically, TABOM achieves substantial gains in new domains, expands the effective knowledge boundary of DLMs, and significantly mitigates catastrophic forgetting compared with standard SFT.
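To make the training-inference discrepancy the abstract describes concrete, here is a minimal PyTorch sketch, assuming a stand-in interface in which `model(seq)` returns per-position vocabulary logits; the masking rate, step count, and unmasking schedule are illustrative, not the paper's.

```python
import torch
import torch.nn.functional as F

def nelbo_training_step(model, tokens, mask_id, mask_prob=0.5):
    """Standard NELBO-style SFT: reconstruct randomly masked tokens in one step."""
    mask = torch.rand(tokens.shape) < mask_prob
    corrupted = torch.where(mask, torch.full_like(tokens, mask_id), tokens)
    logits = model(corrupted)                        # single forward pass
    return F.cross_entropy(logits[mask], tokens[mask])

@torch.no_grad()
def confidence_guided_decode(model, length, mask_id, steps=8):
    """Inference: multi-step, easy-to-hard denoising; the lowest-entropy
    (most confident) masked positions are committed first."""
    seq = torch.full((length,), mask_id, dtype=torch.long)
    per_step = max(1, length // steps)
    while bool((seq == mask_id).any()):
        masked = (seq == mask_id).nonzero(as_tuple=True)[0]
        probs = F.softmax(model(seq)[masked], dim=-1)          # [n_masked, vocab]
        entropy = -(probs * probs.clamp_min(1e-9).log()).sum(-1)
        easiest = entropy.argsort()[:per_step]                 # easy-to-hard order
        seq[masked[easiest]] = probs[easiest].argmax(-1)
    return seq
```

Training touches each example with one random mask and one gradient step; decoding visits a sequence of partially masked states ordered from easy to hard, and it is this trajectory that TABOM feeds back into training.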
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes TABOM, a self-distilled trajectory-based post-training method for diffusion language models. TABOM models inference unmasking preferences as a Boltzmann distribution over predictive entropies along the model's own trajectories and derives a pairwise ranking objective that aligns training with the easy-to-hard denoising process. The authors claim this yields substantial gains over standard NELBO fine-tuning in new domains, expands the effective knowledge boundary, and reduces catastrophic forgetting.
Significance. If the empirical results hold and demonstrate gains beyond re-weighting within the pretrained manifold, the approach could offer a useful alignment technique for DLMs that bridges the training-inference gap without requiring external data, potentially improving post-training efficiency and capability retention.
major comments (2)
- [Abstract and §3] The central claim that TABOM enables 'genuine knowledge acquisition' and 'expands the effective knowledge boundary' rests on a ranking objective derived from Boltzmann modeling of predictive entropies computed solely on self-generated trajectories. This risks circularity: any improvement could reflect re-weighting of already-represented tokens rather than acquisition outside the original support, and the abstract itself notes that naive NELBO fine-tuning on the same trajectories yields only marginal gains.
- [Experiments] To support the headline claims of substantial gains in new domains and mitigation of forgetting, the evaluation must include controls showing that correct predictions occur on inputs whose answers lie outside the pretraining distribution. Without such tests, or ablations isolating the ranking loss from standard SFT, the distinction from manifold exploitation remains unverified.
minor comments (2)
- [Abstract] The abstract would be strengthened by including at least one key quantitative result (e.g., accuracy delta or forgetting metric) alongside the qualitative claims.
- [§3] Notation for the Boltzmann distribution and the derived pairwise loss should be introduced with explicit equations early in §3 to improve readability.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our work. We address each major point below with clarifications and proposed revisions to better distinguish the contributions of the ranking objective from standard fine-tuning.
Point-by-point responses
- Referee: [Abstract and §3] The central claim that TABOM enables 'genuine knowledge acquisition' and 'expands the effective knowledge boundary' rests on a ranking objective derived from Boltzmann modeling of predictive entropies computed solely on self-generated trajectories. This risks circularity: any improvement could reflect re-weighting of already-represented tokens rather than acquisition outside the original support, and the abstract itself notes that naive NELBO fine-tuning on the same trajectories yields only marginal gains.
Authors: We agree that self-generated trajectories lie within the pretrained manifold and that this could invite concerns about circularity or mere re-weighting. The manuscript already notes the marginal gains from naive NELBO on identical trajectories, which serves as a control. The distinction arises because the Boltzmann-derived pairwise ranking loss explicitly aligns the model's certainty ordering with the observed inference trajectory, enabling more effective optimization than reconstruction alone. We will revise the abstract and §3 to replace 'genuine knowledge acquisition' with 'improved utilization of existing knowledge via trajectory alignment' and add a paragraph clarifying that effective boundary expansion is measured by downstream gains rather than strict support expansion. revision: partial
- Referee: [Experiments] To support the headline claims of substantial gains in new domains and mitigation of forgetting, the evaluation must include controls showing that correct predictions occur on inputs whose answers lie outside the pretraining distribution. Without such tests, or ablations isolating the ranking loss from standard SFT, the distinction from manifold exploitation remains unverified.
Authors: We acknowledge that stronger controls would help isolate the effect. The current experiments already compare TABOM against NELBO fine-tuning on the same self-generated trajectories and against standard SFT, showing consistent gains in new domains and reduced forgetting. We will add explicit ablations that remove the ranking term while keeping the trajectories fixed, and include tests on held-out examples constructed to lie outside the pretraining support (e.g., via synthetic or low-frequency facts) to demonstrate that correct predictions are enabled by the alignment objective. revision: yes
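A minimal sketch of what such an ablation could look like, assuming a combined objective of the form NELBO plus a weighted ranking term; `lambda_rank`, `tau`, and the pair encoding are illustrative assumptions, not the paper's values or API.

```python
import torch
import torch.nn.functional as F

def tabom_style_loss(logits, targets, mask, unmask_pairs, tau=1.0, lambda_rank=1.0):
    """NELBO reconstruction term plus a Boltzmann-derived pairwise ranking term.
    Setting lambda_rank=0 removes the ranking term while keeping the
    self-distilled trajectories fixed -- the ablation proposed above."""
    nelbo = F.cross_entropy(logits[mask], targets[mask])
    probs = F.softmax(logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(-1)      # per-position H_i
    i, j = unmask_pairs                 # index tensors: i was unmasked before j
    rank = -F.logsigmoid((entropy[j] - entropy[i]) / tau).mean()  # prefer H_i < H_j
    return nelbo + lambda_rank * rank
```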
Circularity Check
No significant circularity; derivation uses explicit ansatz with independent empirical claims
Rationale
The paper introduces TABOM by choosing to model inference unmasking preferences as a Boltzmann distribution over predictive entropies, then deriving a pairwise ranking loss to align with observed trajectories. This is presented as a deliberate modeling decision to bridge training-inference mismatch, not as a result forced by prior self-citations, fitted parameters renamed as predictions, or self-definitional equivalence. The central claims of gains, knowledge boundary expansion, and reduced forgetting are supported by empirical comparisons to standard SFT and NELBO on the same trajectories, rather than reducing tautologically to the input data or modeling choice. No quoted equation or step shows the objective or results as equivalent to inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: the inference unmasking preference can be modeled as a Boltzmann distribution over predictive entropies.