pith. sign in

arxiv: 2605.29398 · v1 · pith:26OFLDTLnew · submitted 2026-05-28 · 💻 cs.LG · cs.AI

GDSD: Reinforcement Learning as Guided Denoiser Self-Distillation for Diffusion Language Models

Pith reviewed 2026-06-29 08:42 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords diffusion language modelsreinforcement learningself-distillationdenoiserELBO biastraining-inference mismatchpolicy optimizationreverse KL
0
0 comments X

The pith

GDSD reframes RL for diffusion language models as normalization-free self-distillation from a closed-form self-teacher.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that ELBO-based RL for dLLMs introduces training-inference mismatch bias when substituting the evidence lower bound for the intractable policy likelihood. GDSD instead derives an advantage-guided self-teacher directly from the closed-form optimum of reverse-KL regularized RL and matches the denoiser's logits to it with a normalization-free objective. This converts the RL problem into likelihood-free self-distillation that avoids the mismatch. On planning, math, and coding tasks with LLaDA-8B and Dream-7B, the method delivers up to 19.6 percent higher test accuracy and smoother reward curves than prior ELBO approaches. A reader would care because it supplies a concrete route to policy improvement that stays aligned with the model's generative training without surrogate approximations.

Core claim

GDSD matches the dLLM's denoiser logits to the teacher's via a normalization-free objective, which reduces RL to likelihood-free self-distillation and thus bypasses the TIM biases. Recent ELBO-based methods emerge as instances of applying different distillation divergences, but with diagnosable pathologies that GDSD avoids.

What carries the argument

Advantage-guided self-teacher obtained from the closed-form optimum of reverse-KL regularized RL, paired with normalization-free logit matching for denoiser distillation.

If this is right

  • GDSD produces up to +19.6% higher test accuracy than ELBO-based RL on the reported benchmarks.
  • Training reward curves remain more stable under GDSD than under ELBO surrogates.
  • ELBO methods appear as special cases of distillation divergences that introduce avoidable pathologies.
  • The approach eliminates the need for random masking to estimate likelihood surrogates.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same closed-form teacher construction could be tested on other diffusion or autoregressive models where exact likelihoods remain intractable.
  • If the normalization-free matching proves robust, it may simplify reward-aligned fine-tuning pipelines that currently rely on multiple importance-sampling corrections.
  • Scaling the self-teacher to larger dLLMs would test whether the closed-form optimum remains computationally tractable at higher parameter counts.

Load-bearing premise

The closed-form optimum of reverse-KL regularized RL supplies an unbiased self-teacher whose logits can be matched directly without new approximation errors.

What would settle it

If GDSD applied to LLaDA-8B or Dream-7B on the same planning/math/coding benchmarks yields no accuracy gain or unstable rewards relative to the best ELBO baseline, the claim that bypassing TIM via self-distillation improves RL would be falsified.

Figures

Figures reproduced from arXiv: 2605.29398 by Che Liu, Ilija Bogunovic, Keyue Jiang, Qifang Zhao, Sangwoong Yoon, Xiaohang Tang, Xiaoxiao Xu.

Figure 1
Figure 1. Figure 1: Overview of reinforcement learning for diffusion Large Language Models (dLLMs). Prior [PITH_FULL_IMAGE:figures/full_fig_p005_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Training reward dynamics trained on LLaDA-8B-Instruct with different methods on different [PITH_FULL_IMAGE:figures/full_fig_p009_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Ablation study on Token-level Logit Centralization (TLC) and the energy-guidance [PITH_FULL_IMAGE:figures/full_fig_p010_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Left: Test accuracy of different methods (best across generation length 128, 256, and 512) on base model LLaDA-8B. Right: Average test accuracy of different methods across generation lengths, on base model LLaDA-8B. 0 600 1200 1800 Step 0.5 1.0 1.5 Reward w/ TLC w/o TLC (a) LLaDA-8B trained on GSM8k 0 1500 3000 4500 Step 0.30 0.45 0.60 Reward LLaDA + TLC LLaDA w/o TLC Dream + TLC Dream w/o TLC (b) Countdow… view at source ↗
Figure 5
Figure 5. Figure 5: Reward dynamics of GDSD with and without token logits centralization (TLC). TLC [PITH_FULL_IMAGE:figures/full_fig_p020_5.png] view at source ↗
read the original abstract

Reinforcement learning (RL) can be used to improve the policy (denoiser) of diffusion large language models (dLLMs), while being hindered by the intractability of the policy likelihood. A dominant and efficient family of methods replaces the likelihood in standard RL with its evidence lower bound (ELBO), estimated from randomly masked sequences. Despite being well aligned with pre-training, these approaches introduce bias through training--inference mismatch by using the ELBO as a likelihood surrogate, which can degrade performance. In this work, we propose Guided Denoiser Self-Distillation (GDSD) to directly distill the denoiser of dLLMs from an advantage-guided self-teacher, derived from the closed-form optimum of reverse-KL regularized RL. GDSD matches the dLLM's denoiser logits to the teacher's via a normalization-free objective, which reduces RL to likelihood-free self-distillation and thus bypasses the TIM biases. Recent ELBO-based methods emerge as instances of applying different distillation divergences, but with diagnosable pathologies that GDSD avoids. On planning, math, and coding benchmarks with LLaDA-8B and Dream-7B, GDSD consistently outperforms prior state-of-the-art ELBO-based methods with a more stable training reward dynamics, achieving test-accuracy improvements of up to $+19.6\%$. These results suggest that direct denoiser self-distillation, without relying on an ELBO likelihood surrogate, can provide a more stable and effective RL procedure for dLLMs. Code is available at https://github.com/GaryBall/GDSD.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that standard ELBO-based RL for diffusion LLMs introduces training-inference mismatch (TIM) bias, and proposes GDSD as a fix: derive an advantage-guided self-teacher from the closed-form optimum of reverse-KL regularized RL, then perform normalization-free logit matching to reduce RL to likelihood-free self-distillation. This is asserted to bypass TIM biases while recovering prior ELBO methods as special cases with pathologies; experiments on LLaDA-8B and Dream-7B report up to +19.6% gains on planning/math/coding benchmarks with more stable reward dynamics.

Significance. If the closed-form self-teacher is exactly unbiased and the normalization-free matching introduces no new approximation error in the discrete masked setting, the approach would provide a cleaner and more stable RL procedure for dLLMs than ELBO surrogates. The code release supports reproducibility, but the central claim rests on the validity of the reduction, which is not yet verified from the provided abstract-level description.

major comments (2)
  1. [Derivation of self-teacher (abstract + §3)] The central reduction (abstract and likely §3): the claim that the advantage-guided self-teacher is the exact closed-form optimum of reverse-KL regularized RL, allowing direct normalization-free logit matching to eliminate TIM bias, requires explicit verification that this optimum remains exact and computable without continuous relaxations or implicit normalization in the discrete token/masking schedule of dLLMs. The skeptic's concern that this may reintroduce approximation error is load-bearing for attributing gains to the proposed mechanism rather than other factors.
  2. [Experiments (§5)] Experimental attribution (likely §5 and tables): the reported +19.6% gains and stable dynamics are presented as evidence that GDSD bypasses ELBO biases, but without ablations isolating the effect of the normalization-free objective versus the self-teacher construction, it is unclear whether the improvements stem from the claimed mechanism or from other implementation choices.
minor comments (2)
  1. [Abstract] The abstract states that recent ELBO methods 'emerge as instances of applying different distillation divergences' but does not specify which divergences or provide the mapping; this should be made explicit with equations.
  2. [Notation and preliminaries] Notation for the normalization-free objective and the advantage-guided teacher should be introduced with clear definitions before the main claims, to aid readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive feedback. We address the two major comments point by point below.

read point-by-point responses
  1. Referee: [Derivation of self-teacher (abstract + §3)] The central reduction (abstract and likely §3): the claim that the advantage-guided self-teacher is the exact closed-form optimum of reverse-KL regularized RL, allowing direct normalization-free logit matching to eliminate TIM bias, requires explicit verification that this optimum remains exact and computable without continuous relaxations or implicit normalization in the discrete token/masking schedule of dLLMs. The skeptic's concern that this may reintroduce approximation error is load-bearing for attributing gains to the proposed mechanism rather than other factors.

    Authors: Section 3 derives the advantage-guided self-teacher directly as the closed-form optimum of the reverse-KL regularized objective over the discrete token distribution induced by the dLLM masking schedule. The derivation uses only the exact token probabilities and does not invoke continuous relaxations or implicit normalization; the subsequent normalization-free logit matching is obtained by algebraic rearrangement of this optimum. We can add a short clarifying paragraph in §3 that explicitly states the absence of approximation error in the discrete case. revision: partial

  2. Referee: [Experiments (§5)] Experimental attribution (likely §5 and tables): the reported +19.6% gains and stable dynamics are presented as evidence that GDSD bypasses ELBO biases, but without ablations isolating the effect of the normalization-free objective versus the self-teacher construction, it is unclear whether the improvements stem from the claimed mechanism or from other implementation choices.

    Authors: We agree that the current experimental section does not contain ablations that isolate the normalization-free objective from the self-teacher construction. We will add these ablations (e.g., variants that retain the self-teacher but revert to a normalized distillation loss) to the revised §5 and tables. revision: yes

Circularity Check

0 steps flagged

No significant circularity in the derivation chain.

full rationale

The paper derives the advantage-guided self-teacher from the closed-form optimum of reverse-KL regularized RL (a standard result in the RL literature) and then reduces the problem to normalization-free logit matching. No load-bearing step is shown to reduce by construction to a fitted parameter, a self-citation chain, or a redefinition of the target quantity. The abstract and claimed mechanism treat the closed-form optimum as an external theoretical input rather than an internal fit or self-referential ansatz. Empirical gains are reported separately from the derivation. This is the common honest case of a self-contained argument against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; no explicit free parameters, axioms, or invented entities are stated. The central claim rests on the unstated assumption that the reverse-KL optimum admits a usable closed-form teacher and that logit matching without normalization preserves the RL signal.

pith-pipeline@v0.9.1-grok · 5849 in / 1194 out tokens · 23062 ms · 2026-06-29T08:42:44.062192+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

67 extracted references · 49 canonical work pages · 24 internal anchors

  1. [1]

    Arel’s sudoku generator

    Arel. Arel’s sudoku generator. https://www.ocf.berkeley.edu/~arel/sudoku/main. html, 2025. Accessed: 2025-04-08

  2. [2]

    Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models

    Marianne Arriola, Aaron Gokaslan, Justin T Chiu, Zhihan Yang, Zhixuan Qi, Jiaqi Han, Subham Sekhar Sahoo, and V olodymyr Kuleshov. Block diffusion: Interpolating between autoregressive and diffusion language models.arXiv preprint arXiv:2503.09573, 2025

  3. [3]

    Program Synthesis with Large Language Models

    Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. Program synthesis with large language models, 2021. URLhttps://arxiv.org/abs/2108.07732

  4. [4]

    Tiwei Bie, Maosong Cao, Kun Chen, Lun Du, Mingliang Gong, Zhuochen Gong, Yanmei Gu, Jiaqi Hu, Zenan Huang, Zhenzhong Lan, et al. Llada2. 0: Scaling up diffusion language models to 100b.arXiv preprint arXiv:2512.15745, 2025

  5. [5]

    Tiwei Bie, Maosong Cao, Xiang Cao, Bingsen Chen, Fuyuan Chen, Kun Chen, Lun Du, Daozhuo Feng, Haibo Feng, Mingliang Gong, et al. Llada2. 1: Speeding up text diffusion via token editing.arXiv preprint arXiv:2602.08676, 2026

  6. [6]

    Score regularized policy optimization through diffusion behavior.arXiv preprint arXiv:2310.07297, 2023

    Huayu Chen, Cheng Lu, Zhengyi Wang, Hang Su, and Jun Zhu. Score regularized policy optimization through diffusion behavior.arXiv preprint arXiv:2310.07297, 2023

  7. [7]

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...

  8. [8]

    dultra: Ultra-fast diffusion language models via reinforcement learning.arXiv preprint arXiv:2512.21446, 2025

    Shirui Chen, Jiantao Jiao, Lillian J Ratliff, and Banghua Zhu. dultra: Ultra-fast diffusion language models via reinforcement learning.arXiv preprint arXiv:2512.21446, 2025

  9. [9]

    Safe and stable control via lyapunov- guided diffusion models.arXiv preprint arXiv:2509.25375, 2025

    Xiaoyuan Cheng, Xiaohang Tang, and Yiming Yang. Safe and stable control via lyapunov- guided diffusion models.arXiv preprint arXiv:2509.25375, 2025

  10. [10]

    Diffusion policy: Visuomotor policy learning via action diffusion

    Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, 44(10-11):1684–1704, 2025

  11. [11]

    Rethinking the Design Space of Reinforcement Learning for Diffusion Models: On the Importance of Likelihood Estimation Beyond Loss Design

    Jaemoo Choi, Yuchen Zhu, Wei Guo, Petr Molodyk, Bo Yuan, Jinbin Bai, Yi Xin, Molei Tao, and Yongxin Chen. Rethinking the design space of reinforcement learning for diffusion models: On 10 the importance of likelihood estimation beyond loss design.arXiv preprint arXiv:2602.04663, 2026

  12. [12]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems.CoRR, abs/2110.14168, 2021

  13. [13]

    A decision-theoretic generalization of on-line learning and an application to boosting.Journal of computer and system sciences, 55(1):119–139, 1997

    Yoav Freund and Robert E Schapire. A decision-theoretic generalization of on-line learning and an application to boosting.Journal of computer and system sciences, 55(1):119–139, 1997

  14. [14]

    The language model evaluation harness, 07 2024

    Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. The languag...

  15. [15]

    Diffucoder: Understanding and improving masked diffusion models for code generation.arXiv preprint arXiv:2506.20639, 2025

    Shansan Gong, Ruixiang Zhang, Huangjie Zheng, Jiatao Gu, Navdeep Jaitly, Lingpeng Kong, and Yizhe Zhang. Diffucoder: Understanding and improving masked diffusion models for code generation.arXiv preprint arXiv:2506.20639, 2025

  16. [16]

    Soft actor-critic: Off- policy maximum entropy deep reinforcement learning with a stochastic actor

    Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off- policy maximum entropy deep reinforcement learning with a stochastic actor. InInternational conference on machine learning, pages 1861–1870. Pmlr, 2018

  17. [17]

    IDQL: Implicit Q-Learning as an Actor-Critic Method with Diffusion Policies

    Philippe Hansen-Estruch, Ilya Kostrikov, Michael Janner, Jakub Grudzien Kuba, and Sergey Levine. Idql: Implicit q-learning as an actor-critic method with diffusion policies.arXiv preprint arXiv:2304.10573, 2023

  18. [18]

    Distilling the Knowledge in a Neural Network

    Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015

  19. [19]

    Reinforcing the Diffusion Chain of Lateral Thought with Diffusion Language Models.arXiv preprint arXiv:2505.10446, 2025

    Zemin Huang, Zhiyang Chen, Zijun Wang, Tiancheng Li, and Guo-Jun Qi. Reinforcing the Diffusion Chain of Lateral Thought with Diffusion Language Models.arXiv preprint arXiv:2505.10446, 2025

  20. [20]

    Planning with Diffusion for Flexible Behavior Synthesis

    Michael Janner, Yilun Du, Joshua B Tenenbaum, and Sergey Levine. Planning with diffusion for flexible behavior synthesis.arXiv preprint arXiv:2205.09991, 2022

  21. [21]

    Diffusion policy optimization without drifting apart

    Haozhe Jiang, Haiwen Feng, Jiantao Jiao, Angjoo Kanazawa, and Nika Haghtalab. Diffusion policy optimization without drifting apart. InICLR 2026 2nd Workshop on Deep Generative Model in Machine Learning: Theory, Principle and Efficacy

  22. [22]

    Mercury: Ultra-Fast Language Models Based on Diffusion

    Inception Labs, Samar Khanna, Siddhant Kharbanda, Shufan Li, Harshit Varma, Eric Wang, Sawyer Birnbaum, Ziyang Luo, Yanis Miraoui, Akash Palrecha, Stefano Ermon, Aditya Grover, and V olodymyr Kuleshov. Mercury: Ultra-Fast Language Models Based on Diffusion, 2025. URLhttps://arxiv.org/abs/2506.17298

  23. [23]

    Let’s verify step by step

    Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. InICLR. OpenReview.net, 2024

  24. [24]

    When speed kills stability: Demystifying RL collapse from the training-inference mismatch, September 2025

    Jiacai Liu, Yingru Li, Yuqian Fu, Jiawei Wang, Qian Liu, and Zhuo Jiang. When speed kills stability: Demystifying RL collapse from the training-inference mismatch, September 2025. URLhttps://richardli.xyz/rl-collapse

  25. [25]

    Flow-GRPO: Training Flow Matching Models via Online RL

    Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Wanli Ouyang. Flow-grpo: Training flow matching models via online rl.arXiv preprint arXiv:2505.05470, 2025

  26. [26]

    Understanding R1-Zero-Like Training: A Critical Perspective

    Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding R1-Zero-Like Training: A Critical Perspective.arXiv preprint arXiv:2503.20783, 2025. 11

  27. [27]

    Contrastive Energy Prediction for Exact Energy-Guided Diffusion Sampling in Offline Reinforcement Learning

    Cheng Lu, Huayu Chen, Jianfei Chen, Hang Su, Chongxuan Li, and Jun Zhu. Contrastive Energy Prediction for Exact Energy-Guided Diffusion Sampling in Offline Reinforcement Learning. InInternational Conference on Machine Learning, pages 22825–22855. PMLR, 2023

  28. [28]

    What makes a good diffusion planner for decision making?arXiv preprint arXiv:2503.00535, 2025

    Haofei Lu, Dongqi Han, Yifei Shen, and Dongsheng Li. What makes a good diffusion planner for decision making?arXiv preprint arXiv:2503.00535, 2025

  29. [29]

    Flow matching policy gradients.arXiv preprint arXiv:2507.21053, 2025

    David McAllister, Songwei Ge, Brent Yi, Chung Min Kim, Ethan Weber, Hongsuk Choi, Haiwen Feng, and Angjoo Kanazawa. Flow matching policy gradients.arXiv preprint arXiv:2507.21053, 2025

  30. [30]

    Concrete Score Matching: Generalized Score Matching for Discrete Data.Advances in Neural Information Processing Systems, 35:34532–34545, 2022

    Chenlin Meng, Kristy Choi, Jiaming Song, and Stefano Ermon. Concrete Score Matching: Generalized Score Matching for Discrete Data.Advances in Neural Information Processing Systems, 35:34532–34545, 2022

  31. [31]

    Large Language Diffusion Models

    Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, and Chongxuan Li. Large Language Diffusion Models.arXiv preprint arXiv:2502.09992, 2025

  32. [32]

    Principled rl for diffusion llms emerges from a sequence-level perspective

    Jingyang Ou, Jiaqi Han, Minkai Xu, Shaoxuan Xu, Jianwen Xie, Stefano Ermon, Yi Wu, and Chongxuan Li. Principled rl for diffusion llms emerges from a sequence-level perspective. arXiv preprint arXiv:2512.03759, 2025

  33. [33]

    Your Absorbing Discrete Diffusion Secretly Models the Conditional Distributions of Clean Data

    Jingyang Ou, Shen Nie, Kaiwen Xue, Fengqi Zhu, Jiacheng Sun, Zhenguo Li, and Chongxuan Li. Your Absorbing Discrete Diffusion Secretly Models the Conditional Distributions of Clean Data, 2025. URLhttps://arxiv.org/abs/2406.03736

  34. [34]

    Training Language Models to Follow Instructions with Human Feedback.Advances in neural information processing systems, 35:27730–27744, 2022

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training Language Models to Follow Instructions with Human Feedback.Advances in neural information processing systems, 35:27730–27744, 2022

  35. [35]

    Tinyzero

    Jiayi Pan, Junjie Zhang, Xingyao Wang, Lifan Yuan, Hao Peng, and Alane Suhr. Tinyzero. https://github.com/Jiayi-Pan/TinyZero, 2025. Accessed: 2025-01-24

  36. [36]

    Generalizing from simple to hard visual reasoning: Can we mitigate modality imbalance in vlms?arXiv preprint arXiv:2501.02669, 2025

    Simon Park, Abhishek Panigrahi, Yun Cheng, Dingli Yu, Anirudh Goyal, and Sanjeev Arora. Generalizing from simple to hard visual reasoning: Can we mitigate modality imbalance in vlms?arXiv preprint arXiv:2501.02669, 2025

  37. [37]

    Defeating the training-inference mismatch via fp16.arXiv preprint arXiv:2510.26788, 2025

    Penghui Qi, Zichen Liu, Xiangxin Zhou, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Defeating the training-inference mismatch via fp16.arXiv preprint arXiv:2510.26788, 2025

  38. [38]

    Direct Preference Optimization: Your Language Model is Secretly a Reward Model.Advances in Neural Information Processing Systems, 36:53728–53741, 2023

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct Preference Optimization: Your Language Model is Secretly a Reward Model.Advances in Neural Information Processing Systems, 36:53728–53741, 2023

  39. [39]

    Diffusion Policy Policy Optimization

    Allen Z Ren, Justin Lidard, Lars L Ankile, Anthony Simeonov, Pulkit Agrawal, Anirudha Majumdar, Benjamin Burchfiel, Hongkai Dai, and Max Simchowitz. Diffusion policy policy optimization.arXiv preprint arXiv:2409.00588, 2024

  40. [40]

    Simple and effective masked diffusion language models.Advances in Neural Information Processing Systems, 37:130136–130184, 2024

    Subham S Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin T Chiu, Alexander Rush, and V olodymyr Kuleshov. Simple and effective masked diffusion language models.Advances in Neural Information Processing Systems, 37:130136–130184, 2024

  41. [41]

    Approximating kl divergence, 2020.URL http://joschu

    John Schulman. Approximating kl divergence, 2020.URL http://joschu. net/blog/kl-approx. html, 2020

  42. [42]

    Trust Region Policy Optimization

    John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust Region Policy Optimization. InInternational conference on machine learning, pages 1889–

  43. [43]

    Energy matching based preference learning for diffusion language models

    Shiv Shankar. Energy matching based preference learning for diffusion language models. InPro- ceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 4: Student Research Workshop), pages 776–786, 2026

  44. [44]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the Limits of Mathematical Reasoning in Open Language Models.arXiv preprint arXiv:2402.03300, 2024

  45. [45]

    Simplified and generalized masked diffusion for discrete data.Advances in neural information processing systems, 37:103131–103167, 2024

    Jiaxin Shi, Kehang Han, Zhe Wang, Arnaud Doucet, and Michalis Titsias. Simplified and generalized masked diffusion for discrete data.Advances in neural information processing systems, 37:103131–103167, 2024

  46. [46]

    Jiaxin Shi, Kehang Han, Zhe Wang, Arnaud Doucet, and Michalis K. Titsias. Simplified and Generalized Masked Diffusion for Discrete Data, 2025. URL https://arxiv.org/abs/ 2406.04329

  47. [47]

    Policy gradient meth- ods for reinforcement learning with function approximation.Advances in neural information processing systems, 12, 1999

    Richard S Sutton, David McAllester, Satinder Singh, and Yishay Mansour. Policy gradient meth- ods for reinforcement learning with function approximation.Advances in neural information processing systems, 12, 1999

  48. [48]

    wd1: Weighted policy optimization for reasoning in diffusion language models.arXiv preprint arXiv:2507.08838, 2025

    Xiaohang Tang, Rares Dolga, Sangwoong Yoon, and Ilija Bogunovic. wd1: Weighted policy optimization for reasoning in diffusion language models.arXiv preprint arXiv:2507.08838, 2025

  49. [49]

    Rspo: Regularized self-play alignment of large language models.arXiv preprint arXiv:2503.00030, 2025

    Xiaohang Tang, Sangwoong Yoon, Seongho Son, Huizhuo Yuan, Quanquan Gu, and Ilija Bogunovic. Rspo: Regularized self-play alignment of large language models.arXiv preprint arXiv:2503.00030, 2025

  50. [50]

    Kimi K2: Open Agentic Intelligence

    Kimi Team, Yifan Bai, Yiping Bao, Guanduo Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, Yanru Chen, Yuankun Chen, Yutian Chen, et al. Kimi k2: Open agentic intelligence.arXiv preprint arXiv:2507.20534, 2025

  51. [51]

    Duel: Exact likelihood for masked diffusion via deterministic unmasking.arXiv preprint arXiv:2603.01367, 2026

    Gilad Turok, Chris De Sa, and V olodymyr Kuleshov. Duel: Exact likelihood for masked diffusion via deterministic unmasking.arXiv preprint arXiv:2603.01367, 2026

  52. [52]

    SPG: Sandwiched Policy Gradient for Masked Diffusion Language Models

    Chengyu Wang, Paria Rashidinejad, DiJia Su, Song Jiang, Sid Wang, Siyan Zhao, Cai Zhou, Shannon Zejiang Shen, Feiyu Chen, Tommi Jaakkola, et al. Spg: Sandwiched policy gradient for masked diffusion language models.arXiv preprint arXiv:2510.09541, 2025

  53. [53]

    d2: Improving Reasoning in Diffusion Language Models via Trajectory Likelihood Estimation

    Guanghan Wang, Yair Schiff, Gilad Turok, and V olodymyr Kuleshov. d2: Improved techniques for training reasoning diffusion language models.arXiv preprint arXiv:2509.21474, 2025

  54. [54]

    Revolutioniz- ing reinforcement learning framework for diffusion large language models.arXiv preprint arXiv:2509.06949, 2025

    Yinjie Wang, Ling Yang, Bowen Li, Ye Tian, Ke Shen, and Mengdi Wang. Revolutioniz- ing reinforcement learning framework for diffusion large language models.arXiv preprint arXiv:2509.06949, 2025

  55. [55]

    Diffusion Policies as an Expressive Policy Class for Offline Reinforcement Learning

    Zhendong Wang, Jonathan J Hunt, and Mingyuan Zhou. Diffusion policies as an expressive policy class for offline reinforcement learning.arXiv preprint arXiv:2208.06193, 2022

  56. [56]

    Lfpo: Likelihood-free policy optimization for masked diffusion models.arXiv preprint arXiv:2603.01563, 2026

    Chenxing Wei, Jiazhen Kang, Hong Wang, Jianqing Zhang, Hao Jiang, Xiaolong Xu, Ningyuan Sun, Ying He, F Richard Yu, Yao Shu, et al. Lfpo: Likelihood-free policy optimization for masked diffusion models.arXiv preprint arXiv:2603.01563, 2026

  57. [57]

    Self-play preference optimization for language model alignment.arXiv preprint arXiv:2405.00675, 2024

    Yue Wu, Zhiqing Sun, Huizhuo Yuan, Kaixuan Ji, Yiming Yang, and Quanquan Gu. Self-play preference optimization for language model alignment.arXiv preprint arXiv:2405.00675, 2024

  58. [58]

    MMaDA: Multimodal Large Diffusion Language Models

    Ling Yang, Ye Tian, Bowen Li, Xinchen Zhang, Ke Shen, Yunhai Tong, and Mengdi Wang. Mmada: Multimodal Large Diffusion Language Models.arXiv preprint arXiv:2505.15809, 2025

  59. [59]

    Dream 7B: Diffusion Large Language Models

    Jiacheng Ye, Zhihui Xie, Lin Zheng, Jiahui Gao, Zirui Wu, Xin Jiang, Zhenguo Li, and Lingpeng Kong. Dream 7b: Diffusion large language models.arXiv preprint arXiv:2508.15487, 2025. 13

  60. [60]

    Acecoder: Acing coder rl via automated test-case synthesis, 2025

    Huaye Zeng, Dongfu Jiang, Haozhe Wang, Ping Nie, Xiaotong Chen, and Wenhu Chen. Acecoder: Acing coder rl via automated test-case synthesis, 2025. URLhttps://arxiv.org/ abs/2502.01718

  61. [61]

    Energy-Weighted Flow Matching for Offline Reinforcement Learning.arXiv preprint arXiv:2503.04975, 2025

    Shiyuan Zhang, Weitong Zhang, and Quanquan Gu. Energy-Weighted Flow Matching for Offline Reinforcement Learning.arXiv preprint arXiv:2503.04975, 2025

  62. [62]

    d1: Scaling Reasoning in Dif- fusion Large Language Models via Reinforcement Learning.arXiv preprint arXiv:2504.12216, 2025

    Siyan Zhao, Devaansh Gupta, Qinqing Zheng, and Aditya Grover. d1: Scaling Reasoning in Dif- fusion Large Language Models via Reinforcement Learning.arXiv preprint arXiv:2504.12216, 2025

  63. [63]

    DiffusionNFT: Online Diffusion Reinforcement with Forward Process

    Kaiwen Zheng, Huayu Chen, Haotian Ye, Haoxiang Wang, Qinsheng Zhang, Kai Jiang, Hang Su, Stefano Ermon, Jun Zhu, and Ming-Yu Liu. Diffusionnft: Online diffusion reinforcement with forward process.arXiv preprint arXiv:2509.16117, 2025

  64. [64]

    Stabilizing reinforcement learning for diffusion language models.arXiv preprint arXiv:2603.06743, 2026

    Jianyuan Zhong, Kaibo Wang, Ding Ding, Zijin Feng, Haoli Bai, Yang Xiang, Jiacheng Sun, and Qiang Xu. Stabilizing reinforcement learning for diffusion language models.arXiv preprint arXiv:2603.06743, 2026

  65. [65]

    Fine-Tuning Language Models with Advantage-Induced Policy Alignment.arXiv preprint arXiv:2306.02231, 2023

    Banghua Zhu, Hiteshi Sharma, Felipe Vieira Frujeri, Shi Dong, Chenguang Zhu, Michael I Jordan, and Jiantao Jiao. Fine-Tuning Language Models with Advantage-Induced Policy Alignment.arXiv preprint arXiv:2306.02231, 2023

  66. [66]

    LLaDA 1.5: Variance-Reduced Preference Optimization for Large Language Diffusion Models

    Fengqi Zhu, Rongzhen Wang, Shen Nie, Xiaolu Zhang, Chunwei Wu, Jun Hu, Jun Zhou, Jianfei Chen, Yankai Lin, Ji-Rong Wen, et al. LLaDA 1.5: Variance-Reduced Preference Optimization for Large Language Diffusion Models.arXiv preprint arXiv:2505.19223, 2025

  67. [67]

    ‘python

    Yuchen Zhu, Wei Guo, Jaemoo Choi, Petr Molodyk, Bo Yuan, Molei Tao, and Yongxin Chen. Enhancing reasoning for diffusion llms via distribution matching policy optimization.arXiv preprint arXiv:2510.08233, 2025. 14 A Proof A.1 Lemma: Token-level Logit Centralization Lemma A.1.Denote the n-th token of clean completion x0 as x(n) 0 , for any MDM p, we have th...