Recognition: no theorem link
Reinforcement Learning for Diffusion LLMs with Entropy-Guided Step Selection and Stepwise Advantages
Pith reviewed 2026-05-15 12:35 UTC · model grok-4.3
The pith
Diffusion language models can be post-trained with reinforcement learning using an exact unbiased policy gradient over the denoising steps that requires no sequence-level likelihood.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We formulate diffusion-based sequence generation as a finite-horizon Markov decision process over the denoising trajectory and derive an exact, unbiased policy gradient that decomposes over denoising steps and is expressed in terms of intermediate advantages, without requiring explicit evaluation of the sequence likelihood.
What carries the argument
Finite-horizon MDP over the denoising trajectory, whose policy gradient decomposes into a sum of per-step intermediate advantages.
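A minimal sketch of the kind of decomposition this refers to, in illustrative notation that is not taken from the paper: treating each denoising step as a transition of a finite-horizon MDP, the policy-gradient theorem yields a sum of per-step score functions weighted by intermediate advantages.

```latex
% Illustrative notation (an assumption, not the paper's): x_T is the fully
% masked sequence, x_0 the final sample, \pi_\theta the per-step denoising
% policy, and A_t an intermediate advantage defined from the expected
% terminal reward of completing the trajectory from step t.
\nabla_\theta J(\theta)
  = \mathbb{E}_{x_{T:0} \sim \pi_\theta}
    \left[ \sum_{t=T}^{1} \nabla_\theta \log \pi_\theta(x_{t-1} \mid x_t)\, A_t(x_t, x_{t-1}) \right]
```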
If this is right
- Policy updates become exact and unbiased for any diffusion language model without surrogate likelihoods.
- Training cost drops because only selected denoising steps are updated and no expensive multi-step rollouts are required.
- Performance reaches state-of-the-art on coding and logical-reasoning tasks and remains competitive on mathematical reasoning.
- The same decomposition applies to any finite-horizon diffusion process whose reward is available at each step.
Where Pith is reading between the lines
- The stepwise advantage structure may transfer to other non-autoregressive generative models whose likelihoods are also intractable.
- Because advantages are estimated locally, the method could improve credit assignment on very long sequences compared with sequence-level RL.
- An open question is whether the entropy-guided step selector remains optimal when the model size or sequence length grows by an order of magnitude.
Load-bearing premise
The diffusion model’s one-step denoising reward supplies an advantage estimate accurate enough to keep the policy gradient unbiased and useful in practice.
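In the same illustrative notation, the premise amounts to scoring a step with the model's own one-step denoised completion instead of a full rollout; the baseline term b_t below is an assumption about how such an estimate would typically be centered, not a quantity defined in the paper.

```latex
% Sketch under stated assumptions: \hat{x}_0(x_{t-1}) is the one-step
% denoising of the intermediate state x_{t-1}, r(\cdot) the terminal task
% reward, and b_t a baseline (e.g., a group mean). The premise is that this
% cheap estimate tracks the true intermediate advantage closely enough.
\hat{A}_t \;=\; r\!\big(\hat{x}_0(x_{t-1})\big) - b_t
  \;\approx\; A_t(x_t, x_{t-1})
```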
What would settle it
Train two versions of the same model—one using only the one-step reward for advantages and one using full multi-step rollouts to compute exact advantages—then compare final benchmark scores and gradient bias; a large gap would falsify the sufficiency claim.
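A hedged sketch of how that comparison could be instrumented; `one_step_reward` and `rollout_return` are hypothetical callables standing in for hooks into a trained diffusion LM, not functions from the released repository.

```python
import numpy as np

def advantage_gap(states, one_step_reward, rollout_return, n_rollouts=8):
    """Compare one-step-reward advantages against multi-step-rollout advantages
    on the same intermediate denoising states (all inputs are caller-supplied;
    the helpers are hypothetical stand-ins, not the paper's API)."""
    gaps = []
    for x in states:
        a_onestep = one_step_reward(x)  # cheap: a single denoising pass
        a_rollout = np.mean([rollout_return(x) for _ in range(n_rollouts)])  # expensive: full completions
        gaps.append(a_onestep - a_rollout)
    gaps = np.asarray(gaps, dtype=float)
    # A mean far from zero would indicate systematic bias in the one-step proxy;
    # a large standard deviation would indicate noisy credit assignment.
    return gaps.mean(), gaps.std()
```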
read the original abstract
Reinforcement learning (RL) has been effective for post-training autoregressive (AR) language models, but extending these methods to diffusion language models (DLMs) is challenging due to intractable sequence-level likelihoods. Existing approaches therefore rely on surrogate likelihoods or heuristic approximations, which can introduce bias and obscure the sequential structure of denoising. We formulate diffusion-based sequence generation as a finite-horizon Markov decision process over the denoising trajectory and derive an exact, unbiased policy gradient that decomposes over denoising steps and is expressed in terms of intermediate advantages, without requiring explicit evaluation of the sequence likelihood. To obtain a practical and compute-efficient estimator, we (i) select denoising steps for policy updates via an entropy-guided approximation bound, and (ii) estimate intermediate advantages using a one-step denoising reward naturally provided by the diffusion model, avoiding costly multi-step rollouts. Experiments on coding and logical reasoning benchmarks demonstrate state-of-the-art results, with strong competitive performance on mathematical reasoning, outperforming existing RL post-training approaches for DLMs. Code is available at https://github.com/vishnutez/egspo-dllm-rl.
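For intuition only, a small sketch of what the abstract's entropy-guided step selection could look like; the top-k rule and input format below are illustrative assumptions, not the paper's approximation bound.

```python
import numpy as np

def select_steps_by_entropy(step_token_probs, k):
    """Pick the k denoising steps whose token predictions are most uncertain.

    `step_token_probs` holds one (num_masked_positions, vocab_size) array of
    token probabilities per denoising step -- an assumed input format used
    only for illustration.
    """
    step_entropies = []
    for probs in step_token_probs:
        p = np.clip(probs, 1e-12, 1.0)
        token_entropy = -(p * np.log(p)).sum(axis=-1)  # entropy at each masked position
        step_entropies.append(token_entropy.mean())    # average uncertainty of the step
    # Restrict policy updates to the k highest-entropy steps.
    return np.argsort(step_entropies)[-k:]
```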
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper formulates diffusion-based sequence generation as a finite-horizon Markov decision process over the denoising trajectory. It derives an exact, unbiased policy gradient that decomposes over denoising steps and is expressed in terms of intermediate advantages, without requiring explicit evaluation of the sequence likelihood. For a practical estimator, it introduces entropy-guided step selection via an approximation bound and estimates advantages using the model's native one-step denoising reward. Experiments on coding, logical reasoning, and mathematical reasoning benchmarks report state-of-the-art or competitive results over prior RL post-training methods for diffusion LLMs, with code released.
Significance. If the central derivation establishes unbiasedness despite the practical approximations, the work offers a principled extension of policy gradients to diffusion language models that respects the sequential denoising structure and avoids surrogate likelihoods. This could meaningfully advance RL post-training for non-autoregressive generators, with the released code supporting reproducibility and follow-up work.
major comments (2)
- [Abstract / §3] Abstract and derivation (likely §3): The claim of an 'exact, unbiased policy gradient' is not reconciled with the entropy-guided approximation bound and the one-step denoising reward. Standard policy-gradient theory requires that per-step rewards equal (or validly shape) the expected return from that timestep; the one-step denoising objective is a local surrogate, so the resulting advantage estimates may differ from the true advantage by a non-constant term, introducing bias even in the Markovian MDP.
- [§4] Practical estimator (likely §4): No error analysis or bound is provided showing that the combination of the approximation bound and one-step reward preserves unbiasedness of the decomposed gradient. Without this, the central claim that the estimator remains exact and unbiased cannot be verified from the given formulation.
minor comments (1)
- [Experiments] Experiments section: Clarify whether the reported gains hold under the exact (non-approximated) gradient or only under the entropy-guided variant, and include an ablation isolating the contribution of the one-step reward versus multi-step rollouts.
Simulated Author's Rebuttal
We thank the referee for the careful reading and insightful comments on the distinction between the theoretical policy gradient and its practical estimator. We address each major comment below and will revise the manuscript accordingly to improve clarity.
read point-by-point responses
-
Referee: [Abstract / §3] Abstract and derivation (likely §3): The claim of an 'exact, unbiased policy gradient' is not reconciled with the entropy-guided approximation bound and the one-step denoising reward. Standard policy-gradient theory requires that per-step rewards equal (or validly shape) the expected return from that timestep; the one-step denoising objective is a local surrogate, so the resulting advantage estimates may differ from the true advantage by a non-constant term, introducing bias even in the Markovian MDP.
Authors: We agree that the manuscript should more explicitly separate the exact theoretical result from the practical estimator. Section 3 derives an exact, unbiased policy gradient for the finite-horizon MDP over the full denoising trajectory, with advantages defined via the true expected return. The entropy-guided step selection (via the approximation bound) and one-step denoising reward are presented in Section 4 solely as a computationally tractable estimator that avoids full rollouts. We will revise the abstract, Section 3, and Section 4 to state clearly that the practical estimator approximates the exact gradient and may introduce bias; we will also add a brief discussion of how the entropy bound controls the selection of steps where the one-step reward remains a reasonable proxy for the advantage. revision: yes
-
Referee: [§4] Practical estimator (likely §4): No error analysis or bound is provided showing that the combination of the approximation bound and one-step reward preserves unbiasedness of the decomposed gradient. Without this, the central claim that the estimator remains exact and unbiased cannot be verified from the given formulation.
Authors: We acknowledge the absence of a formal error analysis. The paper does not claim that the practical estimator is exactly unbiased; it presents the estimator as an efficient approximation to the exact gradient derived in Section 3. The entropy-guided bound is intended to restrict updates to steps where the approximation error is small, and the one-step reward is the natural per-step signal provided by the diffusion objective. We will add an appendix or subsection providing (i) a qualitative discussion of the bias sources and (ii) additional empirical diagnostics (e.g., variance of the estimator and sensitivity to the entropy threshold) to quantify the practical impact of these approximations. revision: yes
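A sketch of the kind of diagnostic promised here; `grad_estimate` is a hypothetical callable that returns one flattened stochastic gradient estimate for a given entropy threshold, not part of the paper or its codebase.

```python
import numpy as np

def threshold_sensitivity(grad_estimate, thresholds, n_samples=16):
    """For each entropy threshold, draw repeated gradient estimates and report
    their mean pairwise cosine similarity and per-coordinate variance as crude
    proxies for estimator stability (grad_estimate is a hypothetical hook)."""
    report = {}
    for tau in thresholds:
        grads = np.stack([grad_estimate(tau) for _ in range(n_samples)])
        unit = grads / np.clip(np.linalg.norm(grads, axis=1, keepdims=True), 1e-12, None)
        cos = unit @ unit.T                               # pairwise cosine similarities
        off_diag = cos[~np.eye(n_samples, dtype=bool)]
        report[tau] = {
            "mean_cosine": float(off_diag.mean()),        # low -> estimates point in different directions
            "grad_variance": float(grads.var(axis=0).mean()),
        }
    return report
```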
Circularity Check
The derivation remains self-contained: it starts from the MDP formulation and does not reduce to its inputs by construction.
full rationale
The paper starts from a standard finite-horizon MDP over the denoising trajectory and applies the policy gradient theorem to obtain a decomposition into per-step advantages; the one-step denoising reward is introduced as a practical estimator supplied directly by the diffusion model rather than as a fitted or redefined quantity that forces the gradient. No self-citations, ansatzes, or uniqueness theorems are invoked to justify the central unbiasedness claim, and the entropy-guided selection is presented as an approximation bound rather than an exact equivalence. The derivation therefore does not collapse to its inputs by construction and stands as an independent application of RL theory to the diffusion setting.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Diffusion sequence generation can be exactly represented as a finite-horizon Markov decision process over the denoising trajectory.
Forward citations
Cited by 1 Pith paper
-
Beyond Mode-Seeking RL: Trajectory-Balance Post-Training for Diffusion Language Models
TraFL applies trajectory flow balancing to post-train diffusion language models, preventing mode collapse and delivering consistent gains on reasoning tasks that hold under increased sampling.
Reference graph
Works this paper leans on
-
[1]
Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models
Arriola, M., Gokaslan, A., Chiu, J. T., Yang, Z., Qi, Z., Han, J., Sahoo, S. S., and Kuleshov, V. Block diffusion: Interpolating between autoregressive and diffusion language models. arXiv preprint arXiv:2503.09573.
-
[2]
Program Synthesis with Large Language Models
Austin, J., Johnson, D. D., Ho, J., Tarlow, D., and van den Berg, R. Structured denoising diffusion models in discrete state-spaces. In Advances in Neural Information Processing Systems (NeurIPS), 2021a. Austin, J., Odena, A., Nye, M., Bosma, M., Michalewski, H., Dohan, D., Jiang, E., Cai, C., Terry, M., Le, Q., et al. Program synthesis with large language models.
-
[3]
Evaluating Large Language Models Trained on Code
Chen, M. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.
-
[4]
Training Verifiers to Solve Math Word Problems
Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.
-
[5]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948.
-
[6]
LongLLaDA: Unlocking Long Context Capabilities in Diffusion LLMs
Liu, X., Song, Y., Liu, Z., Huang, Z., Guo, Q., He, Z., and Qiu, X. LongLLaDA: Unlocking long context capabilities in diffusion LLMs. arXiv preprint arXiv:2506.14429.
-
[7]
Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution
Lou, A., Meng, C., and Ermon, S. Discrete diffusion modeling by estimating the ratios of the data distribution. arXiv preprint arXiv:2310.16834.
-
[8]
Esoteric Language Models
Sahoo, S. S., Yang, Z., Akhauri, Y., Liu, J., Singh, D., Cheng, Z., Liu, Z., Xing, E., Thickstun, J., and Vahdat, A. Esoteric language models. arXiv preprint arXiv:2506.01928.
-
[9]
Proximal Policy Optimization Algorithms
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
-
[10]
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., Wu, Y., et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.
-
[11]
Seed Diffusion: A Large-Scale Diffusion Language Model with High-Speed Inference
Song, Y., Zhang, Z., Luo, C., Gao, P., Xia, F., Luo, H., Li, Z., Yang, Y., Yu, H., Qu, X., et al. Seed diffusion: A large-scale diffusion language model with high-speed inference. arXiv preprint arXiv:2508.02193.
-
[12]
wd1: Weighted Policy Optimization for Reasoning in Diffusion Language Models
Tang, X., Dolga, R., Yoon, S., and Bogunovic, I. wd1: Weighted policy optimization for reasoning in diffusion language models. arXiv preprint arXiv:2507.08838.
-
[13]
Fine-Tuning of Continuous-Time Diffusion Models as Entropy-Regularized Control
Uehara, M., Zhao, Y., Black, K., Hajiramezanali, E., Scalia, G., Diamant, N. L., Tseng, A. M., Biancalani, T., and Levine, S. Fine-tuning of continuous-time diffusion models as entropy-regularized control. arXiv preprint arXiv:2402.15194.
-
[14]
SPG: Sandwiched Policy Gradient for Masked Diffusion Language Models
Wang, C., Rashidinejad, P., Su, D., Jiang, S., Wang, S., Zhao, S., Zhou, C., Shen, S. Z., Chen, F., Jaakkola, T., et al. SPG: Sandwiched policy gradient for masked diffusion language models. arXiv preprint arXiv:2510.09541, 2025a. Wang, G., Schiff, Y., Sahoo, S. S., and Kuleshov, V. Remasking discrete diffusion models with inference-time scaling.
-
[15]
MMaDA: Multimodal Large Diffusion Language Models
Yang, L., Tian, Y., Li, B., Zhang, X., Shen, K., Tong, Y., and Wang, M. MMaDA: Multimodal large diffusion language models. arXiv preprint arXiv:2505.15809.
-
[16]
Beyond Autoregression: Discrete Diffusion for Complex Reasoning and Planning
Ye, J., Gao, J., Gong, S., Zheng, L., Jiang, X., Li, Z., and Kong, L. Beyond autoregression: Discrete diffusion for complex reasoning and planning. arXiv preprint arXiv:2410.14157.
-
[17]
Dream 7B: Diffusion Large Language Models
Ye, J., Xie, Z., Zheng, L., Gao, J., Wu, Z., Jiang, X., Li, Z., and Kong, L. Dream 7B: Diffusion large language models. arXiv preprint arXiv:2508.15487.
-
[18]
A Survey of Reinforcement Learning for Large Reasoning Models
Zhang, K., Zuo, Y., He, B., Sun, Y., Liu, R., Jiang, C., Fan, Y., Tian, K., Jia, G., Li, P., et al. A survey of reinforcement learning for large reasoning models. arXiv preprint arXiv:2509.08827, 2025.
-
[19]
DiFFPO: Training Diffusion LLMs to Reason Fast and Furious via Reinforcement Learning
Zhao, H., Liang, D., Tang, W., Yao, D., and Kallus, N. DiFFPO: Training diffusion LLMs to reason fast and furious via reinforcement learning. arXiv preprint arXiv:2510.02212, 2025a. Zhao, L., Ding, X., Yu, L., and Akoglu, L. Improving and unifying discrete & continuous-time discrete denoising diffusion. CoRR.