Diffusion-State Policy Optimization for Masked Diffusion Language Models

Daisuke Oba; Hiroki Furuta; Naoaki Okazaki

arxiv: 2602.06462 · v4 · pith:HJKVI4V7new · submitted 2026-02-06 · 💻 cs.CL · cs.LG

Pith reviewed 2026-05-21 14:35 UTC · model grok-4.3

keywords: masked diffusion, policy optimization, credit assignment, language models, reinforcement learning, text generation, DiSPO, diffusion models

The pith

DiSPO optimizes intermediate token-filling decisions in masked diffusion language models by branching from cached rollouts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Diffusion-State Policy Optimization as a way to assign credit to the individual token-filling steps that occur during masked diffusion generation rather than waiting for a reward only on the finished text. DiSPO selects intermediate masked states, creates branches by resampling the remaining positions from already-cached logits, evaluates the full completions, and applies policy-gradient updates solely to the newly filled tokens. This reuses the exact same rollouts that terminal-feedback methods already produce, adding no extra diffusion steps or optimizer passes. A reader would care because the method raises accuracy on math and planning tasks while keeping compute fixed, suggesting that finer-grained updates can make diffusion-based generators more effective at step-by-step reasoning.

Core claim

DiSPO is a plug-in credit-assignment layer that directly optimizes intermediate filling decisions. At selected intermediate masked states, DiSPO branches by resampling the currently masked positions from rollout-cached logits, scores the resulting completions, and updates only the newly filled tokens, requiring no additional multi-step diffusion rollouts or optimizer steps. The method formalizes a fixed-state objective for branched completions and derives a policy-gradient estimator that reuses the same rollouts as terminal-feedback policy optimization.
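The branching step described above can be sketched in a toy setting. This is a minimal illustration under stated assumptions, not the authors' implementation: `branch_and_score`, `reward_fn`, the array shapes, and the group-mean baseline used for the advantages are all hypothetical choices made for the example. It fills every masked position from the cached logits, scores each completion, and builds a surrogate loss from the log-probabilities of the newly filled tokens only.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def branch_and_score(tokens, masked, cached_logits, reward_fn, num_branches, rng):
    """Sample `num_branches` fillings of the masked positions from cached logits.

    tokens:        (L,) int array, the partially filled sequence at the chosen state
    masked:        (L,) bool array, True where the position is still masked
    cached_logits: (L, V) float array stored during the original rollout
    reward_fn:     hypothetical scorer mapping a completed (L,) sequence to a float
    """
    logp = log_softmax(cached_logits)
    positions = np.flatnonzero(masked)
    completions, branch_logps, rewards = [], [], []
    for _ in range(num_branches):
        filled = tokens.copy()
        for pos in positions:
            filled[pos] = rng.choice(logp.shape[1], p=np.exp(logp[pos]))
        completions.append(filled)
        # Log-probability of the newly filled tokens only; tokens that were
        # already unmasked contribute nothing to the update.
        branch_logps.append(logp[positions, filled[positions]].sum())
        rewards.append(reward_fn(filled))
    rewards = np.asarray(rewards, dtype=float)
    advantages = rewards - rewards.mean()  # assumed group-mean baseline
    # Surrogate whose gradient w.r.t. the logits is a policy-gradient estimate.
    loss = -(advantages * np.asarray(branch_logps)).mean()
    return np.stack(completions), advantages, loss

# Toy usage: 4 positions, vocabulary of 5, positions 1 and 2 still masked.
rng = np.random.default_rng(0)
tokens = np.array([3, 0, 0, 1])
masked = np.array([False, True, True, False])
cached_logits = rng.normal(size=(4, 5))
reward = lambda seq: float(seq[1] == seq[2])  # made-up reward for the demo
completions, advantages, loss = branch_and_score(
    tokens, masked, cached_logits, reward, num_branches=4, rng=rng
)
```

In a real training loop the log-probabilities would come from a differentiable forward pass so that the loss can be backpropagated; NumPy is used here only to keep the sketch self-contained.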

What carries the argument

DiSPO branching at intermediate masked states, which resamples currently masked tokens from cached logits to produce branched completions whose scores drive policy-gradient updates on the filling decisions.

If this is right

DiSPO raises performance over terminal-feedback baselines such as diffu-GRPO and SPG on math and planning benchmarks.
The gains occur while holding rollout compute and optimizer steps constant.
DiSPO functions as a general plug-in that can be added to existing masked diffusion policy optimization pipelines.
Credit assignment is supplied only for the newly filled tokens at each selected state, leaving earlier decisions untouched.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same branching idea could be tested on non-language diffusion generators to see whether intermediate-state updates improve other iterative sampling processes.
DiSPO might reduce the total number of full rollouts needed to reach a given performance level by making each rollout more informative.
Extending the method to variable-length sequences or to tasks with sparse terminal rewards could reveal whether the fixed-state objective remains stable.

Load-bearing premise

Resampling currently masked positions from rollout-cached logits at chosen intermediate states produces an unbiased policy-gradient update for those filling decisions.
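The premise rests on the standard score-function identity, which can be written out for a single fixed state. The notation below is assumed for illustration and is not taken from the paper: $s$ is the chosen masked state, $a$ a filling of its masked positions, $R(s, a)$ the reward of the resulting completion, and $b(s)$ any baseline that does not depend on $a$.

```latex
\begin{align*}
J_s(\theta) &= \mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\big[ R(s, a) \big]
             = \sum_a \pi_\theta(a \mid s)\, R(s, a), \\
\nabla_\theta J_s(\theta) &= \sum_a \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s)\, R(s, a)
             = \mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\big[ \nabla_\theta \log \pi_\theta(a \mid s)\, R(s, a) \big], \\
\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\big[ \nabla_\theta \log \pi_\theta(a \mid s)\, b(s) \big]
            &= b(s)\, \nabla_\theta \sum_a \pi_\theta(a \mid s) = 0 .
\end{align*}
```

The identity holds only when the fillings are drawn from the same $\pi_\theta$ whose gradient is taken. If the cached logits came from an earlier policy, a ratio $\pi_\theta(a \mid s) / \pi_{\text{old}}(a \mid s)$ would be needed inside the expectation, which is the condition the premise depends on.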

What would settle it

Apply DiSPO to the same math and planning benchmarks with identical rollout counts and optimizer steps and observe no accuracy gain or a drop relative to the terminal-feedback baselines.

Figures

Figures reproduced from arXiv: 2602.06462 by Daisuke Oba, Hiroki Furuta, Naoaki Okazaki.

**Figure 1.** Conceptual overview. Top: Terminal-feedback GRPO treats the denoising trajectory as one decision. Bottom: DiSPO is a plug-in step that branches at intermediate states (resample Z fillings from cached logits), scores them with the same reward, and backpropagates gradients only through the filled tokens.

**Figure 2.** Reward curves. Terminal reward curves (top) and step reward curves (bottom) on LLaDA-8B-Instruct during policy optimization. Across tasks, DiSPO reaches higher terminal rewards earlier and maintains them over training. Step rewards exhibit relatively smaller magnitudes but follow the same trends as terminal rewards, indicating their role as a complementary training signal.

**Figure 3.** Variance reduction of the step-wise gradient estimator on Sudoku. Left: Updating only action tokens (vs. all tokens) reduces variance at Z=2 (Prop. 4.3). Right: Increasing Z from Z=2 reduces variance with action-only updates (Prop. 4.4). Error bars show paired 95% bootstrap CIs.

**Figure 4.** Comparison of the same instance at the same denoising step: diffu-GRPO already violates constraints due to an early incorrect fill, whereas DiSPO maintains a consistent partial assignment.

**Figure 6.** Wall-clock-matched training curves on LLaDA-8B-Instruct for Sudoku. Accuracy (Ngen=128) and reward vs. training time. DiSPO surpasses diffu-GRPO within the budget.
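The Figure 3 claim that more branches per state lower the variance of the step-wise estimator can be illustrated on a toy problem. The sketch below is an assumption-laden stand-in, not the paper's experiment: it uses a single categorical filling decision with made-up logits and rewards, a leave-one-out baseline (chosen here for simplicity), and compares the total variance of the estimate at Z=2 and Z=8.

```python
import numpy as np

def estimator_samples(num_branches, num_trials, rng):
    """Draw `num_trials` leave-one-out score-function estimates, each from Z branches."""
    logits = np.array([0.5, -0.2, 0.1, 0.0])   # toy fixed-state logits (assumed)
    rewards = np.array([1.0, 0.0, 2.0, 0.5])   # toy reward per filling (assumed)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    actions = rng.choice(len(probs), size=(num_trials, num_branches), p=probs)
    r = rewards[actions]                                      # (trials, Z)
    # Baseline for each branch: mean reward of the other branches in its group.
    baseline = (r.sum(axis=1, keepdims=True) - r) / (num_branches - 1)
    score = np.eye(len(probs))[actions] - probs               # (trials, Z, V)
    # Average the per-branch gradient estimates within each group.
    return ((r - baseline)[..., None] * score).mean(axis=1)   # (trials, V)

rng = np.random.default_rng(0)
var_z2 = estimator_samples(2, 20000, rng).var(axis=0).sum()
var_z8 = estimator_samples(8, 20000, rng).var(axis=0).sum()
print(f"total variance at Z=2: {var_z2:.4f}, at Z=8: {var_z8:.4f}")
```

The sketch covers only the effect of Z; the action-only versus all-token comparison in the left panel would need a multi-position model and is not reproduced here.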

Original abstract

Masked diffusion language models generate text through iterative masked-token filling, but terminal-only rewards on final completions provide coarse credit assignment for the intermediate filling decisions that shape the generation process. We propose Diffusion-State Policy Optimization (DiSPO), a plug-in credit-assignment layer that directly optimizes intermediate filling decisions. At selected intermediate masked states, DiSPO branches by resampling the currently masked positions from rollout-cached logits, scores the resulting completions, and updates only the newly filled tokens, requiring no additional multi-step diffusion rollouts or optimizer steps. We formalize a fixed-state objective for branched completions and derive a policy-gradient estimator that reuses the same rollouts as terminal-feedback policy optimization. Experiments on LLaDA-8B-Instruct show that DiSPO consistently improves terminal-feedback baselines, including diffu-GRPO and SPG, on math and planning benchmarks under matched rollout compute and optimizer steps, supporting its use as a general plug-in for masked diffusion policy optimization. Our project page is available at https://daioba.github.io/dispo.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

DiSPO gives a practical way to optimize intermediate filling decisions in masked diffusion models by branching off cached rollouts, but the unbiasedness of the resulting gradient estimator is the part that still needs verification. The core idea is to pick intermediate masked states, resample only the still-masked tokens from the logits already stored during the original rollout, score the completed sequences, and apply the policy-gradient update solely to those new tokens. This reuses the same terminal rollouts and optimizer steps as the baseline terminal-feedback methods, so the compute budget stays matched.

On LLaDA-8B-Instruct the authors report steady gains over diffu-GRPO and SPG on math and planning tasks, which is the kind of result that matters when you are trying to make non-autoregressive generators competitive on reasoning benchmarks. That reuse trick and the selective update rule are the concrete novelties relative to the cited baselines. The experiments appear to have been run under controlled rollout and step counts, which is a plus.

The soft spot is the estimator itself. Resampling from cached logits at chosen states could introduce a correlation between the branch probability and the advantage that is not obviously corrected by simply reusing the original rollouts. The abstract says a fixed-state objective is formalized and a policy-gradient estimator is derived, yet the provided text does not include the equations or proof sketch that would confirm the estimator remains unbiased. If the full derivation contains an importance-weight term or a baseline that removes that correlation, the claim holds; otherwise the reported improvements might partly reflect a biased update. Minor additional checks, such as an ablation on state-selection frequency or a direct comparison of the estimator to a known unbiased alternative, would tighten the argument.

This paper is aimed at researchers working on masked diffusion language models or on RL-style optimization for non-autoregressive architectures. It is not a foundational theory paper but a targeted, low-overhead plug-in with empirical support. The thinking is clear and the engagement with existing baselines is honest, so the work is coherent on its own terms. I would send it to peer review so the derivation and the experimental controls can be examined properly.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Diffusion-State Policy Optimization (DiSPO) as a plug-in credit-assignment method for masked diffusion language models. It formalizes a fixed-state objective over intermediate masked states and derives a policy-gradient estimator that branches by resampling currently masked tokens from rollout-cached logits, scores the resulting completions, and updates only the newly filled tokens while reusing the original terminal rollouts. Experiments on LLaDA-8B-Instruct report consistent gains over terminal-feedback baselines (diffu-GRPO, SPG) on math and planning benchmarks under matched rollout compute and optimizer steps.

Significance. If the estimator is unbiased, DiSPO supplies an efficient mechanism for finer-grained optimization of filling decisions in iterative masked generation without extra multi-step diffusion or optimizer overhead. The reuse of cached rollouts is a practical strength that could make the method a general add-on for RL fine-tuning of diffusion LMs on reasoning tasks.

major comments (2)

[§3.2] Eq. (4)–(6): The policy-gradient estimator for the fixed-state objective claims unbiasedness under logit-resampling branching, yet the derivation does not exhibit an explicit importance-weight term or proof that the resampling distribution (conditioned on cached logits) leaves the expectation equal to the true gradient of the fixed-state objective; correlation between state-selection probability and advantage would require correction.
[§4.1] Table 1: The reported gains over diffu-GRPO and SPG are presented under matched rollout compute, but no ablation isolates the contribution of the fixed-state objective versus the particular choice of intermediate-state selection heuristic; without this control the cross-method comparison is inconclusive.
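The first major comment turns on when an importance weight is needed. A small exact calculation makes the distinction concrete. The sketch below is illustrative only: it uses one categorical filling decision with made-up logits and rewards, takes the gradient with respect to the logits, and computes the expectation of the score-function estimate by enumerating every token rather than by sampling. `stale` stands in for a hypothetical sampler whose logits no longer match the current policy.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def expected_estimate(sample_probs, policy_probs, rewards, use_weights):
    """Exact expectation of the score-function estimate over one categorical draw.

    Tokens are drawn from `sample_probs`; the gradient is taken with respect to
    the logits of `policy_probs`. All names here are illustrative.
    """
    num_tokens = len(rewards)
    total = np.zeros(num_tokens)
    for a in range(num_tokens):
        score = np.eye(num_tokens)[a] - policy_probs  # grad of log softmax at token a
        weight = policy_probs[a] / sample_probs[a] if use_weights else 1.0
        total += sample_probs[a] * weight * rewards[a] * score
    return total

rewards = np.array([1.0, 0.0, 2.0, 0.5])
policy = softmax(np.array([0.2, -0.4, 1.0, 0.0]))  # current policy
stale = softmax(np.array([1.0, 0.5, -0.5, 0.0]))   # hypothetical outdated sampler

true_grad = policy * (rewards - policy @ rewards)   # exact gradient of E[reward]
on_policy = expected_estimate(policy, policy, rewards, use_weights=False)
off_policy_raw = expected_estimate(stale, policy, rewards, use_weights=False)
off_policy_weighted = expected_estimate(stale, policy, rewards, use_weights=True)
```

When the sampler and the policy coincide, the unweighted expectation equals the true gradient. When they differ, it does not, and the ratio `policy_probs[a] / sample_probs[a]` restores the equality. Whether the cached logits in DiSPO correspond to the first or the second case is what the comment asks the authors to show.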

minor comments (2)

[§3.1] Notation for the fixed-state objective (Eq. (2)) re-uses the symbol p_θ for both the original policy and the branched completion distribution; a distinct symbol would improve readability.
The project page URL is given but no link to code or reproduction scripts appears in the manuscript; adding a footnote with the repository would aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive comments on our manuscript. We address each of the major comments below, providing clarifications and indicating where revisions will be made to strengthen the presentation.

Referee: [§3.2] Eq. (4)–(6): The policy-gradient estimator for the fixed-state objective claims unbiasedness under logit-resampling branching, yet the derivation does not exhibit an explicit importance-weight term or proof that the resampling distribution (conditioned on cached logits) leaves the expectation equal to the true gradient of the fixed-state objective; correlation between state-selection probability and advantage would require correction.

Authors: We appreciate the referee highlighting this aspect of the derivation. In §3.2, the fixed-state objective is defined over a specific masked state, and the policy-gradient estimator is obtained by resampling the masked tokens directly from the cached logits of the rollout policy at that state. Since the resampling distribution is identical to the policy used to define the objective, the estimator is on-policy and does not require an importance-weight correction; the expectation over the resampled completions equals the gradient of the fixed-state objective by construction. The state-selection heuristic (detailed in §3.3) is a deterministic function of the current masked state and is independent of the realized advantage, eliminating any correlation that would necessitate additional correction terms. To make this explicit, we will include a short proof of unbiasedness in the revised §3.2.

Revision: yes

Referee: [§4.1] Table 1: The reported gains over diffu-GRPO and SPG are presented under matched rollout compute, but no ablation isolates the contribution of the fixed-state objective versus the particular choice of intermediate-state selection heuristic; without this control the cross-method comparison is inconclusive.

Authors: We agree that an ablation isolating the fixed-state objective from the state-selection heuristic would strengthen the experimental claims. The current comparisons in Table 1 demonstrate that DiSPO improves upon terminal-feedback baselines under matched compute budgets. However, to address the referee's concern, we will add an ablation study in the revised manuscript that applies the same intermediate-state selection heuristic but optimizes only with terminal feedback (i.e., without the fixed-state objective). This will help isolate the contribution of the proposed objective.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper formalizes a fixed-state objective for branched completions and derives a policy-gradient estimator reusing the same rollouts as terminal-feedback optimization, with experiments under matched compute. No load-bearing step reduces by the paper's own equations or self-citation to its inputs; the estimator is presented as independently derived rather than fitted or self-defined. The description indicates self-contained derivation against external benchmarks like diffu-GRPO and SPG, with no evidence of renaming known results or smuggling ansatzes via citation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The abstract relies on standard reinforcement-learning assumptions for policy gradients in sequential generation but does not introduce or quantify new free parameters, axioms, or invented entities beyond the choice of intermediate states.

axioms (1)

domain assumption: A fixed-state objective for branched completions admits a valid policy-gradient estimator.
Invoked when the paper states it formalizes the objective and derives the estimator that reuses terminal rollouts.

pith-pipeline@v0.9.0 · 5713 in / 1255 out tokens · 51176 ms · 2026-05-21T14:35:01.890399+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: We formalize a fixed-state expected-return objective for intermediate-state branching and show that DiSPO yields a valid policy-gradient estimator for it (Theorem 4.1).

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Paper passage: the step-wise loss provides a principled policy-gradient estimator for a fixed-state objective

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

18 extracted references · 18 canonical work pages · 9 internal anchors

[1] Training Verifiers to Solve Math Word Problems
Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168,

[2] DiffuCoder: Understanding and improving masked diffusion models for code generation
Gong, S., Zhang, R., Zheng, H., Gu, J., Jaitly, N., Kong, L., and Zhang, Y. Diffucoder: Understanding and improving masked diffusion models for code generation. arXiv preprint arXiv:2506.20639,

[3] MDPO: Overcoming the training-inference divide of masked diffusion language models
He, H., Renz, K., Cao, Y., and Geiger, A. Mdpo: Overcoming the training-inference divide of masked diffusion language models. arXiv preprint arXiv:2508.13148,

[4] Let's Verify Step by Step
Lightman, H., Kosaraju, V., Burda, Y., Edwards, H., Baker, B., Lee, T., Leike, J., Schulman, J., Sutskever, I., and Cobbe, K. Let's verify step by step. arXiv preprint arXiv:2305.20050,

[5] s1: Simple test-time scaling
URL https://arxiv.org/abs/2501.19393. Nie, S., Zhu, F., You, Z., Zhang, X., Ou, J., Hu, J., Zhou, J., Lin, Y., Wen, J.-R., and Li, C. Large language diffusion models,

[6] Large Language Diffusion Models
URL https://arxiv.org/abs/2502.09992. Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744,

[7] Improving reasoning for diffusion language models via group diffusion policy optimization
Rojas, K., Lin, J., Rasul, K., Schneider, A., Nevmyvaka, Y., Tao, M., and Deng, W. Improving reasoning for diffusion language models via group diffusion policy optimization. arXiv preprint arXiv:2510.08554,

[8] Proximal Policy Optimization Algorithms
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347,

[9] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., Wu, Y., et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300,

[10] wd1: Weighted policy optimization for reasoning in diffusion language models
Tang, X., Dolga, R., Yoon, S., and Bogunovic, I. wd1: Weighted policy optimization for reasoning in diffusion language models. arXiv preprint arXiv:2507.08838,

[11] SPG: Sandwiched Policy Gradient for Masked Diffusion Language Models
Wang, C., Rashidinejad, P., Su, D., Jiang, S., Wang, S., Zhao, S., Zhou, C., Shen, S. Z., Chen, F., Jaakkola, T., et al. Spg: Sandwiched policy gradient for masked diffusion language models. arXiv preprint arXiv:2510.09541, 2025a. Wang, G., Schiff, Y., Turok, G., and Kuleshov, V ...

[12] Advancing Reasoning in Diffusion Language Models with Denoising Process Rewards
Xie, S., Kong, L., Song, X., Dong, X., Chen, G., Xing, E. P., and Zhang, K. Step-aware policy optimization for reasoning in diffusion large language models. arXiv preprint arXiv:2510.01544,

[13] Taming masked diffusion language models via consistency trajectory reinforcement learning with fewer decoding step
Yang, J., Chen, G., Hu, X., and Shao, J. Taming masked diffusion language models via consistency trajectory reinforcement learning with fewer decoding step. arXiv preprint arXiv:2509.23924,

[14] Dream 7B: Diffusion Large Language Models
Ye, J., Xie, Z., Zheng, L., Gao, J., Wu, Z., Jiang, X., Li, Z., and Kong, L. Dream 7b: Diffusion large language models. arXiv preprint arXiv:2508.15487,

[15] Fine-tuning discrete diffusion models with policy gradient methods
Zekri, O. and Boullé, N. Fine-tuning discrete diffusion models with policy gradient methods. arXiv preprint arXiv:2502.01384,

[16] DiffPO: Training diffusion LLMs to reason fast and furious via reinforcement learning
Zhao, H., Liang, D., Tang, W., Yao, D., and Kallus, N. Diffpo: Training diffusion llms to reason fast and furious via reinforcement learning. arXiv preprint arXiv:2510.02212, 2025a. Zhao, S., Gupta, D., Zheng, Q., and Grover, A. d1: Scaling reasoning in diffusion large language models via reinforcement learning. arXiv preprint arXiv:2504.12216, 2025b. Zie...

[17] Condition on a particular timestep t being selected
on the corresponding intermediate state(s). Condition on a particular timestep $t$ being selected. Under the assumptions of Theorem 4.1, we have $\mathbb{E}[-\nabla_\theta L_{\text{step}}(\theta) \mid t] = c_Z \nabla_\theta J_t(\theta)$, where the expectation is over $q \sim \mathcal{D}$, $s_t \sim d_t(q)$, and the branched action samples at that state. Taking expectation over $t \sim \omega(t)$ yields $\mathbb{E}[-\nabla_\theta L_{\text{step}}(\theta)] = c_Z \sum_t \omega(t) \nabla_\theta J_t(\theta)$ (19). Terminal...

[18] We use the training data publicly available https://github.com/Black-Phoenix/4x4-Sudoku-Dataset
is a subset of MATH focusing on competition-level problems. Reward is computed by considering the two axes, i.e., format reward (max 1.0) and correctness reward (max 2.0). Sudoku. The 4×4 Sudoku task is a synthetic benchmark for planning. We use the training data publicly available at https://github.com/Black-Phoenix/4x4-Sudoku-Dataset. As for the evaluation data, w...
