Recognition: no theorem link
Reinforcement Learning for Diffusion LLMs with Entropy-Guided Step Selection and Stepwise Advantages
Pith reviewed 2026-05-15 12:35 UTC · model grok-4.3
The pith
Diffusion language models can be post-trained with reinforcement learning using an exact unbiased policy gradient over the denoising steps that requires no sequence-level likelihood.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We formulate diffusion-based sequence generation as a finite-horizon Markov decision process over the denoising trajectory and derive an exact, unbiased policy gradient that decomposes over denoising steps and is expressed in terms of intermediate advantages, without requiring explicit evaluation of the sequence likelihood.
What carries the argument
Finite-horizon MDP over the denoising trajectory, whose policy gradient decomposes into a sum of per-step intermediate advantages.
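A minimal sketch of the kind of decomposition this refers to, in illustrative notation that is not taken from the paper: treating each denoising step as a transition of a finite-horizon MDP, the policy-gradient theorem yields a sum of per-step score functions weighted by intermediate advantages.

```latex
% Illustrative notation (an assumption, not the paper's): x_T is the fully
% masked sequence, x_0 the final sample, \pi_\theta the per-step denoising
% policy, and A_t an intermediate advantage defined from the expected
% terminal reward of completing the trajectory from step t.
\nabla_\theta J(\theta)
  = \mathbb{E}_{x_{T:0} \sim \pi_\theta}
    \left[ \sum_{t=T}^{1} \nabla_\theta \log \pi_\theta(x_{t-1} \mid x_t)\, A_t(x_t, x_{t-1}) \right]
```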
If this is right
- Policy updates become exact and unbiased for any diffusion language model without surrogate likelihoods.
- Training cost drops because only selected denoising steps are updated and no expensive multi-step rollouts are required.
- Performance reaches state-of-the-art on coding and logical-reasoning tasks and remains competitive on mathematical reasoning.
- The same decomposition applies to any finite-horizon diffusion process whose reward is available at each step.
Where Pith is reading between the lines
- The stepwise advantage structure may transfer to other non-autoregressive generative models whose likelihoods are also intractable.
- Because advantages are estimated locally, the method could improve credit assignment on very long sequences compared with sequence-level RL.
- An open question is whether the entropy-guided step selector remains optimal when the model size or sequence length grows by an order of magnitude.
Load-bearing premise
The diffusion model’s one-step denoising reward supplies an advantage estimate accurate enough to keep the policy gradient unbiased and useful in practice.
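In the same illustrative notation, the premise amounts to scoring a step with the model's own one-step denoised completion instead of a full rollout; the baseline term b_t below is an assumption about how such an estimate would typically be centered, not a quantity defined in the paper.

```latex
% Sketch under stated assumptions: \hat{x}_0(x_{t-1}) is the one-step
% denoising of the intermediate state x_{t-1}, r(\cdot) the terminal task
% reward, and b_t a baseline (e.g., a group mean). The premise is that this
% cheap estimate tracks the true intermediate advantage closely enough.
\hat{A}_t \;=\; r\!\big(\hat{x}_0(x_{t-1})\big) - b_t
  \;\approx\; A_t(x_t, x_{t-1})
```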
What would settle it
Train two versions of the same model—one using only the one-step reward for advantages and one using full multi-step rollouts to compute exact advantages—then compare final benchmark scores and gradient bias; a large gap would falsify the sufficiency claim.
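A hedged sketch of how that comparison could be instrumented; `one_step_reward` and `rollout_return` are hypothetical callables standing in for hooks into a trained diffusion LM, not functions from the released repository.

```python
import numpy as np

def advantage_gap(states, one_step_reward, rollout_return, n_rollouts=8):
    """Compare one-step-reward advantages against multi-step-rollout advantages
    on the same intermediate denoising states (all inputs are caller-supplied;
    the helpers are hypothetical stand-ins, not the paper's API)."""
    gaps = []
    for x in states:
        a_onestep = one_step_reward(x)  # cheap: a single denoising pass
        a_rollout = np.mean([rollout_return(x) for _ in range(n_rollouts)])  # expensive: full completions
        gaps.append(a_onestep - a_rollout)
    gaps = np.asarray(gaps, dtype=float)
    # A mean far from zero would indicate systematic bias in the one-step proxy;
    # a large standard deviation would indicate noisy credit assignment.
    return gaps.mean(), gaps.std()
```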
read the original abstract
Reinforcement learning (RL) has been effective for post-training autoregressive (AR) language models, but extending these methods to diffusion language models (DLMs) is challenging due to intractable sequence-level likelihoods. Existing approaches therefore rely on surrogate likelihoods or heuristic approximations, which can introduce bias and obscure the sequential structure of denoising. We formulate diffusion-based sequence generation as a finite-horizon Markov decision process over the denoising trajectory and derive an exact, unbiased policy gradient that decomposes over denoising steps and is expressed in terms of intermediate advantages, without requiring explicit evaluation of the sequence likelihood. To obtain a practical and compute-efficient estimator, we (i) select denoising steps for policy updates via an entropy-guided approximation bound, and (ii) estimate intermediate advantages using a one-step denoising reward naturally provided by the diffusion model, avoiding costly multi-step rollouts. Experiments on coding and logical reasoning benchmarks demonstrate state-of-the-art results, with strong competitive performance on mathematical reasoning, outperforming existing RL post-training approaches for DLMs. Code is available at https://github.com/vishnutez/egspo-dllm-rl.
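For intuition only, a small sketch of what the abstract's entropy-guided step selection could look like; the top-k rule and input format below are illustrative assumptions, not the paper's approximation bound.

```python
import numpy as np

def select_steps_by_entropy(step_token_probs, k):
    """Pick the k denoising steps whose token predictions are most uncertain.

    `step_token_probs` holds one (num_masked_positions, vocab_size) array of
    token probabilities per denoising step -- an assumed input format used
    only for illustration.
    """
    step_entropies = []
    for probs in step_token_probs:
        p = np.clip(probs, 1e-12, 1.0)
        token_entropy = -(p * np.log(p)).sum(axis=-1)  # entropy at each masked position
        step_entropies.append(token_entropy.mean())    # average uncertainty of the step
    # Restrict policy updates to the k highest-entropy steps.
    return np.argsort(step_entropies)[-k:]
```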
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper formulates diffusion-based sequence generation as a finite-horizon Markov decision process over the denoising trajectory. It derives an exact, unbiased policy gradient that decomposes over denoising steps and is expressed in terms of intermediate advantages, without requiring explicit evaluation of the sequence likelihood. For a practical estimator, it introduces entropy-guided step selection via an approximation bound and estimates advantages using the model's native one-step denoising reward. Experiments on coding, logical reasoning, and mathematical reasoning benchmarks report state-of-the-art or competitive results over prior RL post-training methods for diffusion LLMs, with code released.
Significance. If the central derivation establishes unbiasedness despite the practical approximations, the work offers a principled extension of policy gradients to diffusion language models that respects the sequential denoising structure and avoids surrogate likelihoods. This could meaningfully advance RL post-training for non-autoregressive generators, with the released code supporting reproducibility and follow-up work.
major comments (2)
- [Abstract / §3] Abstract and derivation (likely §3): The claim of an 'exact, unbiased policy gradient' is not reconciled with the entropy-guided approximation bound and the one-step denoising reward. Standard policy-gradient theory requires that per-step rewards equal (or validly shape) the expected return from that timestep; the one-step denoising objective is a local surrogate, so the resulting advantage estimates may differ from the true advantage by a non-constant term, introducing bias even in the Markovian MDP.
- [§4] Practical estimator (likely §4): No error analysis or bound is provided showing that the combination of the approximation bound and one-step reward preserves unbiasedness of the decomposed gradient. Without this, the central claim that the estimator remains exact and unbiased cannot be verified from the given formulation.
minor comments (1)
- [Experiments] Experiments section: Clarify whether the reported gains hold under the exact (non-approximated) gradient or only under the entropy-guided variant, and include an ablation isolating the contribution of the one-step reward versus multi-step rollouts.
Simulated Author's Rebuttal
We thank the referee for the careful reading and insightful comments on the distinction between the theoretical policy gradient and its practical estimator. We address each major comment below and will revise the manuscript accordingly to improve clarity.
read point-by-point responses
-
Referee: [Abstract / §3] Abstract and derivation (likely §3): The claim of an 'exact, unbiased policy gradient' is not reconciled with the entropy-guided approximation bound and the one-step denoising reward. Standard policy-gradient theory requires that per-step rewards equal (or validly shape) the expected return from that timestep; the one-step denoising objective is a local surrogate, so the resulting advantage estimates may differ from the true advantage by a non-constant term, introducing bias even in the Markovian MDP.
Authors: We agree that the manuscript should more explicitly separate the exact theoretical result from the practical estimator. Section 3 derives an exact, unbiased policy gradient for the finite-horizon MDP over the full denoising trajectory, with advantages defined via the true expected return. The entropy-guided step selection (via the approximation bound) and one-step denoising reward are presented in Section 4 solely as a computationally tractable estimator that avoids full rollouts. We will revise the abstract, Section 3, and Section 4 to state clearly that the practical estimator approximates the exact gradient and may introduce bias; we will also add a brief discussion of how the entropy bound controls the selection of steps where the one-step reward remains a reasonable proxy for the advantage. revision: yes
-
Referee: [§4] Practical estimator (likely §4): No error analysis or bound is provided showing that the combination of the approximation bound and one-step reward preserves unbiasedness of the decomposed gradient. Without this, the central claim that the estimator remains exact and unbiased cannot be verified from the given formulation.
Authors: We acknowledge the absence of a formal error analysis. The paper does not claim that the practical estimator is exactly unbiased; it presents the estimator as an efficient approximation to the exact gradient derived in Section 3. The entropy-guided bound is intended to restrict updates to steps where the approximation error is small, and the one-step reward is the natural per-step signal provided by the diffusion objective. We will add an appendix or subsection providing (i) a qualitative discussion of the bias sources and (ii) additional empirical diagnostics (e.g., variance of the estimator and sensitivity to the entropy threshold) to quantify the practical impact of these approximations. revision: yes
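A sketch of the kind of diagnostic promised here; `grad_estimate` is a hypothetical callable that returns one flattened stochastic gradient estimate for a given entropy threshold, not part of the paper or its codebase.

```python
import numpy as np

def threshold_sensitivity(grad_estimate, thresholds, n_samples=16):
    """For each entropy threshold, draw repeated gradient estimates and report
    their mean pairwise cosine similarity and per-coordinate variance as crude
    proxies for estimator stability (grad_estimate is a hypothetical hook)."""
    report = {}
    for tau in thresholds:
        grads = np.stack([grad_estimate(tau) for _ in range(n_samples)])
        unit = grads / np.clip(np.linalg.norm(grads, axis=1, keepdims=True), 1e-12, None)
        cos = unit @ unit.T                               # pairwise cosine similarities
        off_diag = cos[~np.eye(n_samples, dtype=bool)]
        report[tau] = {
            "mean_cosine": float(off_diag.mean()),        # low -> estimates point in different directions
            "grad_variance": float(grads.var(axis=0).mean()),
        }
    return report
```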
Circularity Check
The derivation remains self-contained: it starts from the MDP formulation and does not reduce to its inputs by construction.
full rationale
The paper starts from a standard finite-horizon MDP over the denoising trajectory and applies the policy gradient theorem to obtain a decomposition into per-step advantages; the one-step denoising reward is introduced as a practical estimator supplied directly by the diffusion model rather than as a fitted or redefined quantity that forces the gradient. No self-citations, ansatzes, or uniqueness theorems are invoked to justify the central unbiasedness claim, and the entropy-guided selection is presented as an approximation bound rather than an exact equivalence. The derivation therefore does not collapse to its inputs by construction and stands as an independent application of RL theory to the diffusion setting.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Diffusion sequence generation can be exactly represented as a finite-horizon Markov decision process over the denoising trajectory.
Forward citations
Cited by 1 Pith paper
-
Beyond Mode-Seeking RL: Trajectory-Balance Post-Training for Diffusion Language Models
TraFL applies trajectory flow balancing to post-train diffusion language models, preventing mode collapse and delivering consistent gains on reasoning tasks that hold under increased sampling.
Reference graph
Works this paper leans on
-
[1]
Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models
Arriola, M., Gokaslan, A., Chiu, J. T., Yang, Z., Qi, Z., Han, J., Sahoo, S. S., and Kuleshov, V. Block diffusion: Interpolating between autoregressive and diffusion language models. arXiv preprint arXiv:2503.09573.
-
[2]
Program Synthesis with Large Language Models
Austin, J., Johnson, D. D., Ho, J., Tarlow, D., and van den Berg, R. Structured denoising diffusion models in discrete state-spaces. In Advances in Neural Information Processing Systems (NeurIPS), 2021a. Austin, J., Odena, A., Nye, M., Bosma, M., Michalewski, H., Dohan, D., Jiang, E., Cai, C., Terry, M., Le, Q., et al. Program synthesis with large language models.
-
[3]
Evaluating Large Language Models Trained on Code
Chen, M. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.
-
[4]
Training Verifiers to Solve Math Word Problems
Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.
-
[5]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948.
-
[6]
LongLLaDA: Unlocking Long Context Capabilities in Diffusion LLMs
Liu, X., Song, Y., Liu, Z., Huang, Z., Guo, Q., He, Z., and Qiu, X. LongLLaDA: Unlocking long context capabilities in diffusion LLMs. arXiv preprint arXiv:2506.14429.
-
[7]
Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution
Lou, A., Meng, C., and Ermon, S. Discrete diffusion modeling by estimating the ratios of the data distribution. arXiv preprint arXiv:2310.16834.
-
[8]
Esoteric Language Models
Sahoo, S. S., Yang, Z., Akhauri, Y., Liu, J., Singh, D., Cheng, Z., Liu, Z., Xing, E., Thickstun, J., and Vahdat, A. Esoteric language models. arXiv preprint arXiv:2506.01928.
-
[9]
Proximal Policy Optimization Algorithms
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
-
[10]
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., Wu, Y., et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.
-
[11]
Seed Diffusion: A Large-Scale Diffusion Language Model with High-Speed Inference
Song, Y., Zhang, Z., Luo, C., Gao, P., Xia, F., Luo, H., Li, Z., Yang, Y., Yu, H., Qu, X., et al. Seed diffusion: A large-scale diffusion language model with high-speed inference. arXiv preprint arXiv:2508.02193.
-
[12]
wd1: Weighted Policy Optimization for Reasoning in Diffusion Language Models
Tang, X., Dolga, R., Yoon, S., and Bogunovic, I. wd1: Weighted policy optimization for reasoning in diffusion language models. arXiv preprint arXiv:2507.08838.
-
[13]
Fine-Tuning of Continuous-Time Diffusion Models as Entropy-Regularized Control
Uehara, M., Zhao, Y., Black, K., Hajiramezanali, E., Scalia, G., Diamant, N. L., Tseng, A. M., Biancalani, T., and Levine, S. Fine-tuning of continuous-time diffusion models as entropy-regularized control. arXiv preprint arXiv:2402.15194.
-
[14]
SPG: Sandwiched Policy Gradient for Masked Diffusion Language Models
Wang, C., Rashidinejad, P., Su, D., Jiang, S., Wang, S., Zhao, S., Zhou, C., Shen, S. Z., Chen, F., Jaakkola, T., et al. SPG: Sandwiched policy gradient for masked diffusion language models. arXiv preprint arXiv:2510.09541, 2025a. Wang, G., Schiff, Y., Sahoo, S. S., and Kuleshov, V. Remasking discrete diffusion models with inference-time scaling.
-
[15]
MMaDA: Multimodal Large Diffusion Language Models
Yang, L., Tian, Y., Li, B., Zhang, X., Shen, K., Tong, Y., and Wang, M. MMaDA: Multimodal large diffusion language models. arXiv preprint arXiv:2505.15809.
-
[16]
Beyond Autoregression: Discrete Diffusion for Complex Reasoning and Planning
Ye, J., Gao, J., Gong, S., Zheng, L., Jiang, X., Li, Z., and Kong, L. Beyond autoregression: Discrete diffusion for complex reasoning and planning. arXiv preprint arXiv:2410.14157.
-
[17]
Dream 7B: Diffusion Large Language Models
Ye, J., Xie, Z., Zheng, L., Gao, J., Wu, Z., Jiang, X., Li, Z., and Kong, L. Dream 7B: Diffusion large language models. arXiv preprint arXiv:2508.15487.
-
[18]
A Survey of Reinforcement Learning for Large Reasoning Models
Zhang, K., Zuo, Y., He, B., Sun, Y., Liu, R., Jiang, C., Fan, Y., Tian, K., Jia, G., Li, P., et al. A survey of reinforcement learning for large reasoning models. arXiv preprint arXiv:2509.08827, 2025.
-
[19]
DiFFPO: Training Diffusion LLMs to Reason Fast and Furious via Reinforcement Learning
Zhao, H., Liang, D., Tang, W., Yao, D., and Kallus, N. DiFFPO: Training diffusion LLMs to reason fast and furious via reinforcement learning. arXiv preprint arXiv:2510.02212, 2025a. Zhao, L., Ding, X., Yu, L., and Akoglu, L. Improving and unifying discrete & continuous-time discrete denoising diffusion. CoRR.