arxiv: 2602.06462 · v4 · pith:HJKVI4V7new · submitted 2026-02-06 · 💻 cs.CL · cs.LG

Diffusion-State Policy Optimization for Masked Diffusion Language Models

Pith reviewed 2026-05-21 14:35 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords: masked diffusion, policy optimization, credit assignment, language models, reinforcement learning, text generation, DiSPO, diffusion models

The pith

DiSPO optimizes intermediate token-filling decisions in masked diffusion language models by branching from cached rollouts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Diffusion-State Policy Optimization (DiSPO), which assigns credit to the individual token-filling steps of masked diffusion generation instead of relying only on a reward for the finished text. DiSPO selects intermediate masked states, branches by resampling the currently masked positions from already-cached logits, scores the resulting completions, and applies policy-gradient updates only to the newly filled tokens. It reuses the same rollouts that terminal-feedback methods already produce and adds no multi-step diffusion rollouts or optimizer steps. This matters because the reported accuracy gains on math and planning tasks come under matched rollout compute and optimizer steps, which suggests that finer-grained updates can make diffusion-based generators more effective at step-by-step reasoning.

Core claim

DiSPO is a plug-in credit-assignment layer that directly optimizes intermediate filling decisions. At selected intermediate masked states, DiSPO branches by resampling the currently masked positions from rollout-cached logits, scores the resulting completions, and updates only the newly filled tokens, requiring no additional multi-step diffusion rollouts or optimizer steps. The method formalizes a fixed-state objective for branched completions and derives a policy-gradient estimator that reuses the same rollouts as terminal-feedback policy optimization.
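The branching step described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `branch_at_state` and `step_loss`, the mask id `-1`, the mean-reward baseline, and the toy reward are all hypothetical, and the sketch treats every currently masked position as a "newly filled" token.

```python
import numpy as np

MASK_ID = -1  # hypothetical mask token id, chosen for this sketch only


def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def branch_at_state(tokens, masked, cached_logits, reward_fn, num_branches, rng):
    """Resample the masked positions num_branches times from cached logits."""
    probs = softmax(cached_logits[masked])            # (M, V): one row per masked position
    num_masked, vocab = probs.shape
    branches, logps, rewards = [], [], []
    for _ in range(num_branches):
        fill = np.array([rng.choice(vocab, p=row) for row in probs])
        completion = tokens.copy()
        completion[masked] = fill                     # unmasked tokens stay fixed
        branches.append(completion)
        # log-probabilities of the newly filled tokens only
        logps.append(np.log(probs[np.arange(num_masked), fill]))
        rewards.append(reward_fn(completion))
    rewards = np.array(rewards, dtype=float)
    advantages = rewards - rewards.mean()             # assumed group-mean baseline
    return np.stack(branches), np.stack(logps), advantages


def step_loss(logps, advantages):
    """REINFORCE-style surrogate value over the newly filled tokens."""
    return -(advantages[:, None] * logps).sum(axis=1).mean()


rng = np.random.default_rng(0)
tokens = np.array([1, MASK_ID, 3, MASK_ID, MASK_ID, 0])
masked = tokens == MASK_ID
cached_logits = rng.normal(size=(6, 5))               # stands in for rollout-cached logits
toy_reward = lambda completion: float((completion == 2).sum())
branches, logps, advantages = branch_at_state(
    tokens, masked, cached_logits, toy_reward, num_branches=4, rng=rng
)
loss = step_loss(logps, advantages)
```

In an autodiff framework the log-probabilities would be recomputed from the current policy so that gradients flow through them; here the surrogate is only evaluated, which is enough to show which tokens enter the update.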

What carries the argument

DiSPO branching at intermediate masked states, which resamples currently masked tokens from cached logits to produce branched completions whose scores drive policy-gradient updates on the filling decisions.

If this is right

  • DiSPO raises performance over terminal-feedback baselines such as diffu-GRPO and SPG on math and planning benchmarks.
  • The gains occur while holding rollout compute and optimizer steps constant.
  • DiSPO functions as a general plug-in that can be added to existing masked diffusion policy optimization pipelines.
  • Credit assignment is supplied only for the newly filled tokens at each selected state, leaving earlier decisions untouched.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same branching idea could be tested on non-language diffusion generators to see whether intermediate-state updates improve other iterative sampling processes.
  • DiSPO might reduce the total number of full rollouts needed to reach a given performance level by making each rollout more informative.
  • Extending the method to variable-length sequences or to tasks with sparse terminal rewards could reveal whether the fixed-state objective remains stable.

Load-bearing premise

Resampling currently masked positions from rollout-cached logits at chosen intermediate states produces an unbiased policy-gradient update for those filling decisions.
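A short sketch of what this premise amounts to, in notation assumed here for illustration (s for a fixed masked state, a for a filling of its masked positions, R(s, a) for the reward of the resulting completion, Z for the number of branches); it omits baselines, clipping, and any weighting over states, so it should not be read as the paper's Eq. (4)–(6):

```latex
% Fixed-state objective (assumed form)
J_s(\theta) = \mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)} \left[ R(s, a) \right]

% Score-function identity
\nabla_\theta J_s(\theta)
  = \sum_a R(s, a)\, \nabla_\theta \pi_\theta(a \mid s)
  = \mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}
    \left[ R(s, a)\, \nabla_\theta \log \pi_\theta(a \mid s) \right]

% Monte Carlo estimator from Z branches a^{(1)}, \dots, a^{(Z)} \sim \pi_\theta(\cdot \mid s)
\hat{g} = \frac{1}{Z} \sum_{z=1}^{Z} R\big(s, a^{(z)}\big)\,
          \nabla_\theta \log \pi_\theta\big(a^{(z)} \mid s\big),
\qquad
\mathbb{E}\big[\hat{g}\big] = \nabla_\theta J_s(\theta)
```

The last equality relies on the branches being drawn from the same distribution that defines the objective. If the cached logits came from a different parameter setting than the one being differentiated, the identity would need an importance ratio, which is the point the referee report presses on.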

What would settle it

Apply DiSPO to the same math and planning benchmarks with identical rollout counts and optimizer steps and observe no accuracy gain or a drop relative to the terminal-feedback baselines.

Figures

Figures reproduced from arXiv: 2602.06462 by Daisuke Oba, Hiroki Furuta, Naoaki Okazaki.

Figure 1. Conceptual overview. Top: Terminal-feedback GRPO treats the denoising trajectory as one decision. Bottom: DiSPO is a plug-in step that branches at intermediate states (resample Z fillings from cached logits), scores them with the same reward, and backpropagates gradients only through the filled tokens.
Figure 2. Reward curves. Terminal reward curves (top) and step reward curves (bottom) on LLaDA-8B-Instruct during policy optimization. Across tasks, DiSPO reaches higher terminal rewards earlier and maintains them over training. Step rewards have relatively smaller magnitudes but follow the same trends as terminal rewards, indicating their role as a complementary training signal.
Figure 3. Variance reduction of the step-wise gradient estimator on Sudoku. Left: Updating only action tokens (vs. all tokens) reduces variance at Z=2 (Prop. 4.3). Right: Increasing Z from Z=2 reduces variance with action-only updates (Prop. 4.4). Error bars show paired 95% bootstrap CIs.
Figure 4. Comparison of the same instance at the same denoising step: diffu-GRPO already violates constraints due to an early incorrect fill, whereas DiSPO maintains a consistent partial assignment.
Figure 6. Wall-clock-matched training curves on LLaDA-8B-Instruct for Sudoku. Accuracy (N_gen=128) and reward vs. training time. DiSPO surpasses diffu-GRPO within the budget.
Original abstract

Masked diffusion language models generate text through iterative masked-token filling, but terminal-only rewards on final completions provide coarse credit assignment for the intermediate filling decisions that shape the generation process. We propose Diffusion-State Policy Optimization (DiSPO), a plug-in credit-assignment layer that directly optimizes intermediate filling decisions. At selected intermediate masked states, DiSPO branches by resampling the currently masked positions from rollout-cached logits, scores the resulting completions, and updates only the newly filled tokens, requiring no additional multi-step diffusion rollouts or optimizer steps. We formalize a fixed-state objective for branched completions and derive a policy-gradient estimator that reuses the same rollouts as terminal-feedback policy optimization. Experiments on LLaDA-8B-Instruct show that DiSPO consistently improves terminal-feedback baselines, including diffu-GRPO and SPG, on math and planning benchmarks under matched rollout compute and optimizer steps, supporting its use as a general plug-in for masked diffusion policy optimization. Our project page is available at https://daioba.github.io/dispo.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Diffusion-State Policy Optimization (DiSPO) as a plug-in credit-assignment method for masked diffusion language models. It formalizes a fixed-state objective over intermediate masked states and derives a policy-gradient estimator that branches by resampling currently masked tokens from rollout-cached logits, scores the resulting completions, and updates only the newly filled tokens while reusing the original terminal rollouts. Experiments on LLaDA-8B-Instruct report consistent gains over terminal-feedback baselines (diffu-GRPO, SPG) on math and planning benchmarks under matched rollout compute and optimizer steps.

Significance. If the estimator is unbiased, DiSPO supplies an efficient mechanism for finer-grained optimization of filling decisions in iterative masked generation without extra multi-step diffusion or optimizer overhead. The reuse of cached rollouts is a practical strength that could make the method a general add-on for RL fine-tuning of diffusion LMs on reasoning tasks.

major comments (2)
  1. [§3.2, Eq. (4)–(6)] The policy-gradient estimator for the fixed-state objective claims unbiasedness under logit-resampling branching, yet the derivation does not exhibit an explicit importance-weight term or proof that the resampling distribution (conditioned on cached logits) leaves the expectation equal to the true gradient of the fixed-state objective; correlation between state-selection probability and advantage would require correction.
  2. [§4.1, Table 1] The reported gains over diffu-GRPO and SPG are presented under matched rollout compute, but no ablation isolates the contribution of the fixed-state objective versus the particular choice of intermediate-state selection heuristic; without this control the cross-method comparison is inconclusive.
minor comments (2)
  1. [§3.1] Notation for the fixed-state objective (Eq. (2)) re-uses the symbol p_θ for both the original policy and the branched completion distribution; a distinct symbol would improve readability.
  2. The project page URL is given but no link to code or reproduction scripts appears in the manuscript; adding a footnote with the repository would aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive comments on our manuscript. We address each of the major comments below, providing clarifications and indicating where revisions will be made to strengthen the presentation.

  1. Referee: [§3.2, Eq. (4)–(6)] The policy-gradient estimator for the fixed-state objective claims unbiasedness under logit-resampling branching, yet the derivation does not exhibit an explicit importance-weight term or proof that the resampling distribution (conditioned on cached logits) leaves the expectation equal to the true gradient of the fixed-state objective; correlation between state-selection probability and advantage would require correction.

    Authors: We appreciate the referee highlighting this aspect of the derivation. In §3.2, the fixed-state objective is defined over a specific masked state, and the policy-gradient estimator is obtained by resampling the masked tokens directly from the cached logits of the rollout policy at that state. Since the resampling distribution is identical to the policy used to define the objective, the estimator is on-policy and does not require an importance-weight correction; the expectation over the resampled completions equals the gradient of the fixed-state objective by construction. The state-selection heuristic (detailed in §3.3) is a deterministic function of the current masked state and is independent of the realized advantage, eliminating any correlation that would necessitate additional correction terms. To make this explicit, we will include a short proof of unbiasedness in the revised §3.2. revision: yes

  2. Referee: [§4.1, Table 1] The reported gains over diffu-GRPO and SPG are presented under matched rollout compute, but no ablation isolates the contribution of the fixed-state objective versus the particular choice of intermediate-state selection heuristic; without this control the cross-method comparison is inconclusive.

    Authors: We agree that an ablation isolating the fixed-state objective from the state-selection heuristic would strengthen the experimental claims. The current comparisons in Table 1 demonstrate that DiSPO improves upon terminal-feedback baselines under matched compute budgets. However, to address the referee's concern, we will add an ablation study in the revised manuscript that applies the same intermediate-state selection heuristic but optimizes only with terminal feedback (i.e., without the fixed-state objective). This will help isolate the contribution of the proposed objective. revision: yes
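The disagreement in the first exchange (on-policy resampling versus the need for an importance weight) can be made concrete on a toy categorical policy. The sketch below is illustrative only: the logits, rewards, and the "stale" sampling distribution are made-up values, and it enumerates all actions exactly instead of sampling, so no randomness is involved.

```python
import numpy as np


def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()


# Hypothetical toy policy over four fillings of a single masked position.
logits = np.array([0.2, -0.5, 1.0, 0.0])
rewards = np.array([1.0, 0.0, 2.0, 0.5])
p = softmax(logits)

# Exact gradient of J = sum_a p_a * R_a with respect to the logits.
J = p @ rewards
exact_grad = p * (rewards - J)

# Row a of `score` is the gradient of log p_a with respect to the logits: e_a - p.
score = np.eye(4) - p

# Expected score-function estimate when actions are drawn from p (on-policy).
on_policy = (p * rewards) @ score

# Same estimator when actions are drawn from stale logits q, without correction.
q = softmax(logits + np.array([0.5, 0.0, -0.5, 0.0]))
stale_uncorrected = (q * rewards) @ score

# Reweighting each action by p_a / q_a restores the on-policy expectation.
stale_corrected = (q * (p / q) * rewards) @ score
```

In this toy case `on_policy` and `stale_corrected` coincide with `exact_grad`, while `stale_uncorrected` does not. Whether the cached logits in DiSPO match the policy being differentiated at update time is the question the promised proof in §3.2 would need to settle.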

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper formalizes a fixed-state objective for branched completions and derives a policy-gradient estimator reusing the same rollouts as terminal-feedback optimization, with experiments under matched compute. No load-bearing step reduces by the paper's own equations or self-citation to its inputs; the estimator is presented as independently derived rather than fitted or self-defined. The description indicates self-contained derivation against external benchmarks like diffu-GRPO and SPG, with no evidence of renaming known results or smuggling ansatzes via citation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The abstract relies on standard reinforcement-learning assumptions for policy gradients in sequential generation but does not introduce or quantify new free parameters, axioms, or invented entities beyond the choice of intermediate states.

axioms (1)
  • domain assumption: A fixed-state objective for branched completions admits a valid policy-gradient estimator.
    Invoked when the paper states it formalizes the objective and derives the estimator that reuses terminal rollouts.

pith-pipeline@v0.9.0 · 5713 in / 1255 out tokens · 51176 ms · 2026-05-21T14:35:01.890399+00:00 · methodology
