Diffusion-State Policy Optimization for Masked Diffusion Language Models
Pith reviewed 2026-05-21 14:35 UTC · model grok-4.3
The pith
DiSPO optimizes intermediate token-filling decisions in masked diffusion language models by branching from cached rollouts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DiSPO is a plug-in credit-assignment layer that directly optimizes intermediate filling decisions. At selected intermediate masked states, DiSPO branches by resampling the currently masked positions from rollout-cached logits, scores the resulting completions, and updates only the newly filled tokens, requiring no additional multi-step diffusion rollouts or optimizer steps. The method formalizes a fixed-state objective for branched completions and derives a policy-gradient estimator that reuses the same rollouts as terminal-feedback policy optimization.
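The branching step described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names (`branch_step`, `surrogate_loss`), the use of `None` as the mask symbol, the dictionary layout of the cached logits, and the mean-reward baseline are all assumptions made for the example. In a real training loop the log-probabilities would be recomputed under the current parameters so that gradients can flow; here they are plain floats.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def branch_step(state, cached_logits, reward_fn, num_branches, rng, mask=None):
    """Branch from one intermediate masked state (hypothetical helper).

    state: list of token ids, with `mask` at positions not yet filled.
    cached_logits: dict mapping each masked position to the logits that
        were saved during the original rollout (assumed layout).
    Returns one (log-prob of newly filled tokens, advantage) pair per branch.
    """
    masked = [i for i, tok in enumerate(state) if tok is mask]
    branches = []
    for _ in range(num_branches):
        completion = list(state)
        logp = 0.0
        for i in masked:
            probs = softmax(cached_logits[i])
            tok = rng.choices(range(len(probs)), weights=probs)[0]
            completion[i] = tok
            # Only the newly filled tokens contribute to the update.
            logp += math.log(probs[tok])
        branches.append((logp, reward_fn(completion)))
    # Assumption: a mean-reward baseline over the branches of this state.
    baseline = sum(r for _, r in branches) / len(branches)
    return [(logp, r - baseline) for logp, r in branches]


def surrogate_loss(branches):
    """Score-function surrogate: minus the mean of advantage * log-prob."""
    return -sum(adv * logp for logp, adv in branches) / len(branches)
```

No extra multi-step rollout appears in the sketch: every branch is completed in a single fill from logits that already exist, which is the property the core claim relies on.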
What carries the argument
DiSPO branching at intermediate masked states, which resamples currently masked tokens from cached logits to produce branched completions whose scores drive policy-gradient updates on the filling decisions.
If this is right
- DiSPO raises performance over terminal-feedback baselines such as diffu-GRPO and SPG on math and planning benchmarks.
- The gains occur while holding rollout compute and optimizer steps constant.
- DiSPO functions as a general plug-in that can be added to existing masked diffusion policy optimization pipelines.
- Credit assignment is supplied only for the newly filled tokens at each selected state, leaving earlier decisions untouched.
Where Pith is reading between the lines
- The same branching idea could be tested on non-language diffusion generators to see whether intermediate-state updates improve other iterative sampling processes.
- DiSPO might reduce the total number of full rollouts needed to reach a given performance level by making each rollout more informative.
- Extending the method to variable-length sequences or to tasks with sparse terminal rewards could reveal whether the fixed-state objective remains stable.
Load-bearing premise
Resampling currently masked positions from rollout-cached logits at chosen intermediate states produces an unbiased policy-gradient update for those filling decisions.
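A small numerical check can make the premise concrete for the simplest case: a single masked position whose token is drawn from the same softmax distribution that is being differentiated. The sketch below compares the exact score-function expectation with a finite-difference gradient of the expected reward. The function names and the toy reward vector are illustrative assumptions, and the check says nothing about the harder questions the premise raises for the full method, such as cached logits that may differ from the current policy or the effect of state selection.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def expected_reward(logits, rewards):
    """J = sum_a p(a) * R(a) for a categorical policy p = softmax(logits)."""
    return sum(p * r for p, r in zip(softmax(logits), rewards))


def score_function_gradient(logits, rewards):
    """Exact E_{a~p}[R(a) * d log p(a) / d logits], computed by enumeration."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for a, (p_a, r_a) in enumerate(zip(probs, rewards)):
        for k in range(len(logits)):
            # d log softmax(a) / d logit_k = 1[k == a] - p(k)
            dlogp = (1.0 if k == a else 0.0) - probs[k]
            grad[k] += p_a * r_a * dlogp
    return grad


def finite_difference_gradient(logits, rewards, eps=1e-6):
    """Central finite-difference gradient of the expected reward."""
    grad = []
    for k in range(len(logits)):
        up = list(logits)
        down = list(logits)
        up[k] += eps
        down[k] -= eps
        diff = expected_reward(up, rewards) - expected_reward(down, rewards)
        grad.append(diff / (2 * eps))
    return grad
```

If the two gradients agree for arbitrary logits and rewards, the on-policy estimator is unbiased in this toy setting; a mismatch between sampling distribution and policy would break the agreement unless an importance weight is added.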
What would settle it
Apply DiSPO to the same math and planning benchmarks with identical rollout counts and optimizer steps and observe no accuracy gain or a drop relative to the terminal-feedback baselines.
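One way to make "no accuracy gain or a drop" operational is a paired comparison over the same evaluation problems. The sketch below is an assumption about how such a check might be run, not a procedure taken from the paper: `paired_accuracy_gain` is a hypothetical helper that takes per-problem correctness flags for a baseline and for the method under test, and returns the accuracy difference with a percentile bootstrap interval.

```python
import random


def paired_accuracy_gain(baseline_correct, method_correct,
                         num_resamples=2000, seed=0):
    """Accuracy gain of method over baseline with a 95% bootstrap interval.

    Both inputs are sequences of booleans aligned by evaluation problem.
    Returns (point_estimate, lower, upper).
    """
    if len(baseline_correct) != len(method_correct):
        raise ValueError("inputs must cover the same problems")
    n = len(baseline_correct)
    # Per-problem difference: +1 if only the method is right, -1 if only
    # the baseline is right, 0 otherwise.
    diffs = [int(m) - int(b) for b, m in zip(baseline_correct, method_correct)]
    point = sum(diffs) / n
    rng = random.Random(seed)
    resampled = sorted(
        sum(rng.choice(diffs) for _ in range(n)) / n
        for _ in range(num_resamples)
    )
    lower = resampled[int(0.025 * num_resamples)]
    upper = resampled[int(0.975 * num_resamples) - 1]
    return point, lower, upper
```

An interval that sits at or below zero under matched rollout counts and optimizer steps would be the kind of outcome described above; the flags passed in must come from real evaluation runs, since the function itself carries no data.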
Original abstract
Masked diffusion language models generate text through iterative masked-token filling, but terminal-only rewards on final completions provide coarse credit assignment for the intermediate filling decisions that shape the generation process. We propose Diffusion-State Policy Optimization (DiSPO), a plug-in credit-assignment layer that directly optimizes intermediate filling decisions. At selected intermediate masked states, DiSPO branches by resampling the currently masked positions from rollout-cached logits, scores the resulting completions, and updates only the newly filled tokens, requiring no additional multi-step diffusion rollouts or optimizer steps. We formalize a fixed-state objective for branched completions and derive a policy-gradient estimator that reuses the same rollouts as terminal-feedback policy optimization. Experiments on LLaDA-8B-Instruct show that DiSPO consistently improves terminal-feedback baselines, including diffu-GRPO and SPG, on math and planning benchmarks under matched rollout compute and optimizer steps, supporting its use as a general plug-in for masked diffusion policy optimization. Our project page is available at https://daioba.github.io/dispo .
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Diffusion-State Policy Optimization (DiSPO) as a plug-in credit-assignment method for masked diffusion language models. It formalizes a fixed-state objective over intermediate masked states and derives a policy-gradient estimator that branches by resampling currently masked tokens from rollout-cached logits, scores the resulting completions, and updates only the newly filled tokens while reusing the original terminal rollouts. Experiments on LLaDA-8B-Instruct report consistent gains over terminal-feedback baselines (diffu-GRPO, SPG) on math and planning benchmarks under matched rollout compute and optimizer steps.
Significance. If the estimator is unbiased, DiSPO supplies an efficient mechanism for finer-grained optimization of filling decisions in iterative masked generation without extra multi-step diffusion or optimizer overhead. The reuse of cached rollouts is a practical strength that could make the method a general add-on for RL fine-tuning of diffusion LMs on reasoning tasks.
major comments (2)
- [§3.2] §3.2, Eq. (4)–(6): The policy-gradient estimator for the fixed-state objective claims unbiasedness under logit-resampling branching, yet the derivation does not exhibit an explicit importance-weight term or proof that the resampling distribution (conditioned on cached logits) leaves the expectation equal to the true gradient of the fixed-state objective; correlation between state-selection probability and advantage would require correction.
- [§4.1] §4.1, Table 1: The reported gains over diffu-GRPO and SPG are presented under matched rollout compute, but no ablation isolates the contribution of the fixed-state objective versus the particular choice of intermediate-state selection heuristic; without this control the cross-method comparison is inconclusive.
minor comments (2)
- [§3.1] Notation for the fixed-state objective (Eq. (2)) re-uses the symbol p_θ for both the original policy and the branched completion distribution; a distinct symbol would improve readability.
- The project page URL is given but no link to code or reproduction scripts appears in the manuscript; adding a footnote with the repository would aid reproducibility.
Simulated Author's Rebuttal
We thank the referee for their detailed and constructive comments on our manuscript. We address each of the major comments below, providing clarifications and indicating where revisions will be made to strengthen the presentation.
Point-by-point responses
- Referee: [§3.2] §3.2, Eq. (4)–(6): The policy-gradient estimator for the fixed-state objective claims unbiasedness under logit-resampling branching, yet the derivation does not exhibit an explicit importance-weight term or proof that the resampling distribution (conditioned on cached logits) leaves the expectation equal to the true gradient of the fixed-state objective; correlation between state-selection probability and advantage would require correction.
  Authors: We appreciate the referee highlighting this aspect of the derivation. In §3.2, the fixed-state objective is defined over a specific masked state, and the policy-gradient estimator is obtained by resampling the masked tokens directly from the cached logits of the rollout policy at that state. Since the resampling distribution is identical to the policy used to define the objective, the estimator is on-policy and does not require an importance-weight correction; the expectation over the resampled completions equals the gradient of the fixed-state objective by construction. The state-selection heuristic (detailed in §3.3) is a deterministic function of the current masked state and is independent of the realized advantage, eliminating any correlation that would necessitate additional correction terms. To make this explicit, we will include a short proof of unbiasedness in the revised §3.2.
  Revision: yes
- Referee: [§4.1] §4.1, Table 1: The reported gains over diffu-GRPO and SPG are presented under matched rollout compute, but no ablation isolates the contribution of the fixed-state objective versus the particular choice of intermediate-state selection heuristic; without this control the cross-method comparison is inconclusive.
  Authors: We agree that an ablation isolating the fixed-state objective from the state-selection heuristic would strengthen the experimental claims. The current comparisons in Table 1 demonstrate that DiSPO improves upon terminal-feedback baselines under matched compute budgets. However, to address the referee's concern, we will add an ablation study in the revised manuscript that applies the same intermediate-state selection heuristic but optimizes only with terminal feedback (i.e., without the fixed-state objective). This will help isolate the contribution of the proposed objective.
  Revision: yes
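The first exchange turns on a standard identity, which can be written out under assumed notation. The symbols here are not taken from the manuscript: $s$ is a fixed masked state, $a$ the tokens filled at that state, $\pi_\theta(a \mid s)$ the filling distribution, $q(a \mid s)$ the distribution actually used to draw the branches, and $R(s,a)$ a score that does not depend on $\theta$.

```latex
\begin{align}
J_s(\theta) &= \mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\big[ R(s,a) \big], \\
\nabla_\theta J_s(\theta)
  &= \mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}
     \big[ R(s,a)\, \nabla_\theta \log \pi_\theta(a \mid s) \big], \\
\nabla_\theta J_s(\theta)
  &= \mathbb{E}_{a \sim q(\cdot \mid s)}
     \Big[ \frac{\pi_\theta(a \mid s)}{q(a \mid s)}\,
           R(s,a)\, \nabla_\theta \log \pi_\theta(a \mid s) \Big].
\end{align}
```

The third line requires $q(a \mid s) > 0$ wherever $\pi_\theta(a \mid s) > 0$. When $q = \pi_\theta$ the weight equals 1 and the second line is recovered, which is the case the authors describe; the referee's concern applies whenever the cached logits define a $q$ that differs from the policy being differentiated.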
Circularity Check
No significant circularity in derivation chain
The paper formalizes a fixed-state objective for branched completions and derives a policy-gradient estimator that reuses the same rollouts as terminal-feedback optimization, with experiments run under matched compute. No load-bearing step collapses back into its own inputs, whether through the paper's equations or through self-citation; the estimator is presented as independently derived rather than fitted or self-defined. The derivation reads as self-contained, the results are compared against external baselines (diffu-GRPO and SPG), and there is no evidence of known results being renamed or of ansatzes being smuggled in via citation.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: A fixed-state objective for branched completions admits a valid policy-gradient estimator.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: "We formalize a fixed-state expected-return objective for intermediate-state branching and show that DISPO yields a valid policy-gradient estimator for it (Theorem 4.1)."
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: "the step-wise loss provides a principled policy-gradient estimator for a fixed-state objective"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
[17]
Condition on a particular timestep t being selected
on the corresponding intermediate state(s). Condition on a particular timestep t being selected. Under the assumptions of Theorem 4.1, we have E[−∇θLstep(θ)|t] =c Z ∇θJt(θ), where the expectation is over q∼ D , st ∼d t(q), and the branched action samples at that state. Taking expectation over t∼ω(t)yields E[−∇θLstep(θ)] =c Z X t ω(t)∇ θJt(θ).(19) Terminal...
work page 2024
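The step from the per-timestep statement to Eq. (19) is the law of total expectation. The short derivation below spells it out under two assumptions that the extracted passage does not state explicitly: that $\omega$ is a probability distribution over a finite set of timesteps, and that the garbled constant reads as $c_Z$ and does not depend on $t$.

```latex
\begin{align}
\mathbb{E}\big[-\nabla_\theta \mathcal{L}_{\mathrm{step}}(\theta)\big]
  &= \sum_t \omega(t)\,
     \mathbb{E}\big[-\nabla_\theta \mathcal{L}_{\mathrm{step}}(\theta) \mid t\big] \\
  &= \sum_t \omega(t)\, c_Z\, \nabla_\theta J_t(\theta) \\
  &= c_Z \sum_t \omega(t)\, \nabla_\theta J_t(\theta).
\end{align}
```

Read this way, the step-wise loss follows the gradient of an $\omega$-weighted mixture of the fixed-state objectives $J_t$, scaled by $c_Z$. If $c_Z$ did vary with $t$, it would have to stay inside the sum and would reweight the timesteps.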
- ... is a subset of MATH focusing on competition-level problems. Reward is computed along two axes: a format reward (max 1.0) and a correctness reward (max 2.0).
  Sudoku. The 4×4 Sudoku task is a synthetic benchmark for planning. We use the publicly available training data at https://github.com/Black-Phoenix/4x4-Sudoku-Dataset. As for the evaluation data, w...
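For the Sudoku task, a correctness check might look like the sketch below. It is an illustration only: the 16-character row-major string encoding with `0` for empty cells is an assumption about the data format, `is_valid_4x4_solution` is a hypothetical helper, and the passage does not say how the paper scores Sudoku outputs.

```python
def is_valid_4x4_solution(puzzle, solution):
    """Check a 4x4 Sudoku solution against its puzzle.

    puzzle, solution: 16-character strings in row-major order (assumed
    encoding); '0' marks an empty cell in the puzzle.
    """
    if len(puzzle) != 16 or len(solution) != 16:
        return False
    if any(ch not in "1234" for ch in solution):
        return False
    # Every given digit in the puzzle must be preserved.
    if any(p != "0" and p != s for p, s in zip(puzzle, solution)):
        return False
    rows = [solution[r * 4:(r + 1) * 4] for r in range(4)]
    cols = [solution[c::4] for c in range(4)]
    boxes = [
        "".join(solution[(br + dr) * 4 + bc + dc]
                for dr in range(2) for dc in range(2))
        for br in (0, 2) for bc in (0, 2)
    ]
    # Each row, column, and 2x2 box must contain 1-4 exactly once.
    return all(sorted(unit) == list("1234") for unit in rows + cols + boxes)
```

A binary check of this kind could serve as the basis for a correctness score, while a separate parse check on the model output could play the role of a format score; how the two would be weighted for Sudoku is not recoverable from the passage.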