d-TreeRPO: Towards More Reliable Policy Optimization for Diffusion Language Models

Aiwei Liu; Bolin Ding; Leyi Pan; Liancheng Fang; Lijie Wen; Lingzhe Zhang; Minghua He; Shuchang Tao; Yunpeng Zhai; Zhaoyang Liu

arxiv: 2512.09675 · v3 · pith:GV7LXWOKnew · submitted 2025-12-10 · 💻 cs.CL

Leyi Pan, Shuchang Tao, Yunpeng Zhai, Zheyu Fu, Liancheng Fang, Minghua He, Lingzhe Zhang, Zhaoyang Liu, Bolin Ding, Aiwei Liu, Lijie Wen

Pith reviewed 2026-05-16 23:19 UTC · model grok-4.3

classification 💻 cs.CL

keywords: diffusion language models, policy optimization, reinforcement learning, tree-structured rollouts, advantage estimation, self-distillation, reasoning benchmarks, verifiable rewards

The pith

Tree-structured rollouts with verifiable rewards and scheduled self-distillation deliver reliable step-wise advantages for diffusion language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes d-TreeRPO as a reinforcement learning framework that replaces sparse or unverifiable signals in diffusion LLM training with tree-structured rollouts. These rollouts compute bottom-up advantages directly from final verifiable outcomes, producing fine-grained step-wise signals. A theoretical argument shows that raising the model's prediction confidence shrinks the gap between single-step probability estimates and the true expectation over all possible decoding orders. A time-scheduled self-distillation loss is added in later training stages to increase this confidence and tighten the estimates. The resulting policy updates yield large gains on reasoning tasks that depend on precise credit assignment.

Core claim

d-TreeRPO uses tree-structured rollouts and bottom-up advantage computation based on verifiable outcome rewards to supply fine-grained step-wise signals. It proves that higher prediction confidence reduces the difference between a single forward-pass probability estimate and the unbiased expectation over all decoding orders, and introduces a time-scheduled self-distillation loss to raise confidence in later training stages.

What carries the argument

Tree-structured rollouts whose leaves carry verifiable outcome rewards, with advantages propagated bottom-up, plus a time-scheduled self-distillation term that raises prediction confidence to close the gap to unbiased decoding-order expectations.
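The bottom-up propagation can be sketched in a few lines of Python. This is only an illustration under assumptions the summary does not confirm: a node's value is taken to be the mean of its children's values, and a child's step-wise advantage is its value minus the mean over its siblings. The names `Node`, `backup`, and `assign_advantages` are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One partially denoised state in a rollout tree (hypothetical structure)."""
    reward: Optional[float] = None  # set only on leaves, by a verifier
    children: List["Node"] = field(default_factory=list)
    value: float = 0.0
    advantage: float = 0.0


def backup(node: Node) -> float:
    """Propagate leaf rewards upward. Assumption: value = mean of child values."""
    if not node.children:
        if node.reward is None:
            raise ValueError("leaf without a verifiable reward")
        node.value = node.reward
    else:
        node.value = sum(backup(c) for c in node.children) / len(node.children)
    return node.value


def assign_advantages(node: Node) -> None:
    """Assumed step-wise advantage: child value minus the sibling mean."""
    if not node.children:
        return
    baseline = node.value  # equals the mean over children after backup()
    for child in node.children:
        child.advantage = child.value - baseline
        assign_advantages(child)


# Toy tree: two branches, each ending in two leaves scored 0/1 by a verifier.
root = Node(children=[
    Node(children=[Node(reward=1.0), Node(reward=1.0)]),
    Node(children=[Node(reward=1.0), Node(reward=0.0)]),
])
backup(root)
assign_advantages(root)
```

In this toy tree the first branch gets a positive advantage and the second a negative one, even though both contain a correct leaf, which is the kind of intermediate credit a single outcome reward cannot provide.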

If this is right

Step-wise advantages become less noisy, so policy gradients exhibit lower variance during diffusion LLM training.
Reasoning performance improves most on tasks whose final answers can be checked automatically.
The self-distillation schedule allows later training epochs to use tighter probability estimates without changing the rollout procedure.
The same tree construction can be reused across multiple policy updates as long as the reward function stays fixed.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If verifiable rewards are replaced by learned critics, the bias-variance tradeoff of the tree estimates would need fresh analysis.
The method's gains may shrink on open-ended generation tasks where no automatic verifier exists.
Extending rollout depth beyond the tested budgets could further reduce variance in long-horizon reasoning problems.
The confidence-scheduling idea might transfer to other autoregressive or diffusion generators that face intractable marginalization over orderings.

Load-bearing premise

Tree-structured rollouts based on verifiable outcome rewards produce unbiased fine-grained step-wise advantage estimates that remain valid outside the sampled trees.

What would settle it

Measure whether d-TreeRPO's advantage estimates remain accurate when the model is evaluated on decoding orders that were never present in any training tree.

Figures

Figures reproduced from arXiv: 2512.09675 by Aiwei Liu, Bolin Ding, Leyi Pan, Liancheng Fang, Lijie Wen, Lingzhe Zhang, Minghua He, Shuchang Tao, Yunpeng Zhai, Zhaoyang Liu, Zheyu Fu.

**Figure 1.** Performance comparison of d-TreeRPO with existing dLLM RL methods on four reasoning benchmarks, using LLaDA-8B-Instruct as the base model. … and iteratively reveal tokens through parallel denoising steps, enabling faster inference. Closed-source models (e.g., Gemini Diffusion, Seed Diffusion (Song et al., 2025)) achieve 1,400-2,150 tokens/s, while open-source models like LLaDA (Nie et al., 2025; Zhu et al., …

**Figure 2.** Overview of d-TreeRPO. Our framework employs a tree-structured rollout to propagate rewards and compute verifiable step-wise advantages. Guided by theoretical analysis, a time-scheduled self-distillation loss enhances model determinism in later training stages, improving estimation and delivering better performance. … tokens to reveal (e.g., the top-k most confident tokens in masked positions), and updates …

**Figure 3.** Performance comparison of d-TreeRPO with dLLM RL baselines under different decoding strategies. … +51.6% on Countdown, +4.5% on GSM8K, and +5.3% on Math500 with 256-token generations on LLaDA-8B-Instruct, as well as +65.6% on Sudoku, +24.6% on Countdown, +3.7% on GSM8K, and +11.1% on Math500 with 256-token generations on LLaDA-MoE-7BA1B-Instruct. Appendix D.1 demonstrates the training reward curves. Further…

**Figure 5.** Training curves for the Sudoku task under …

**Figure 6.** Training dynamics comparison of d-TreeRPO and its reverse-scheduled variant on the Sudoku task with LLaDA-8B-Instruct as the base model. Self-distillation Loss Reduces Estimation Error. We estimate Eq. (3) via Monte Carlo by sampling 32 random decoding orders per sample and computing, for each token, the probability of the realized token at the step when it is revealed; this yields $p^{\text{true}}$ and the per-token…

**Figure 7.** Comparison of training reward curves between …

**Figure 8.** Self-distillation loss with λ(t) over the course of training on the four evaluated tasks. Panels: (a) Sudoku, (b) Countdown, (c) GSM8k, (d) Math500. Axes: training step vs. self-distillation loss.

**Figure 9.** Self-distillation loss without λ(t) over the course of training on the four evaluated tasks.

**Figure 10.** Entropy curves over the course of training on the four evaluated tasks under three settings: the full …

**Figure 11.** Reward curves over the course of training on the four evaluated tasks under three settings: the full …

**Figure 12.** Evaluation performance under different τmax settings on the four evaluated tasks. Panels: (a) Sudoku, (b) Countdown, (c) GSM8k, (d) Math500. x-axis ticks: 0.1, 0.4, 0.7, 1.0; y-axis: performance (%).

**Figure 13.** Evaluation performance under different β settings on the four evaluated tasks. Panels: (a) Sudoku, (b) Countdown, (c) GSM8k, (d) Math500. x-axis ticks: 3e-05, 0.0003, 0.003, 0.03, 0.3; y-axis: performance (%).

**Figure 14.** Evaluation performance under different λmax settings on the four evaluated tasks. Panels: (a) Sudoku, (b) Countdown, (c) GSM8k, (d) Math500. x-axis ticks: -2, -1, 1, 2, 3, 4; y-axis: performance (%).

**Figure 15.** Evaluation performance under different γ settings on the four evaluated tasks. Recall that λ(t) is defined as $\lambda(t) = \lambda_{\max} \cdot \dfrac{e^{\gamma t/T} - 1}{e^{\gamma} - 1}$ (Eq. 43), which introduces two hyper-parameters: λmax and γ. Larger λmax increases the overall scale of the self-distillation loss and thus strengthens its effect in promoting determinism. As shown in …

**Figure 16.** A case study of LLaDA-8B-Instruct’s response to a GSM8K question.

**Figure 17.** A case study of LLaDA-8B-Instruct trained with Diffu-GRPO responding to a GSM8K question.

**Figure 18.** A case study of LLaDA-8B-Instruct trained with wd1 responding to a GSM8K question.

**Figure 19.** A case study of LLaDA-8B-Instruct trained with …
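The schedule λ(t) quoted alongside Figure 15 is λmax · (e^(γt/T) − 1) / (e^γ − 1), which is 0 at t = 0 and λmax at t = T. A minimal Python sketch follows; the function and argument names are hypothetical, and the special case for γ near 0 (a linear ramp, the limit of the formula) is an addition for numerical safety rather than something the source states.

```python
import math


def distill_weight(t: float, total_steps: float, lam_max: float, gamma: float) -> float:
    """lambda(t) = lam_max * (exp(gamma * t / T) - 1) / (exp(gamma) - 1)."""
    frac = t / total_steps
    if abs(gamma) < 1e-8:
        # Limit of the formula as gamma -> 0 (assumption added to avoid 0/0).
        return lam_max * frac
    # expm1(x) = exp(x) - 1, computed accurately for small x.
    return lam_max * math.expm1(gamma * frac) / math.expm1(gamma)
```

For positive γ the weight stays small for most of training and rises steeply near the end; for example, with γ = 3 the weight at the midpoint is roughly 18% of λmax, which matches the stated intent of raising confidence mainly in later stages.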

Original abstract

Reinforcement learning (RL) is pivotal for enhancing the reasoning capabilities of diffusion large language models (dLLMs). However, existing dLLM policy optimization methods suffer from two critical reliability bottlenecks: (1) reward sparsity, arising from coarse or unverifiable signals that impede accurate advantage calculation; and (2) their probability estimates do not account for the gap to the unbiased expectation over all decoding orders, which are intractable to compute. To mitigate these issues, we propose d-TreeRPO, a reliable RL framework for dLLMs that leverages tree-structured rollouts and bottom-up advantage computation based on verifiable outcome rewards to provide fine-grained and verifiable step-wise reward signals. Furthermore, we provide a theoretical proof demonstrating that increasing prediction confidence effectively minimizes the gap between unbiased expected prediction probabilities and its single-step forward pass estimate. Guided by this analysis, we introduce a time-scheduled self-distillation loss during training that enhances prediction confidence in later training stages, thereby enabling more accurate probability estimation and better performance. Experiments demonstrate that d-TreeRPO outperforms existing baselines and achieves significant improvements across multiple reasoning benchmarks. Specifically, it achieves +86.2% on Sudoku, +51.6% on Countdown, +4.5% on GSM8K, and +5.3% on Math500 compared to the base model.
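To make the "expectation over all decoding orders" concrete, here is a hedged Python sketch of a Monte Carlo estimate in the spirit of the procedure quoted with Figure 6 (32 random decoding orders, each token scored at the step where it is revealed). Everything model-related is a toy assumption: `make_toy_model` stands in for a dLLM, one token is revealed per step in a uniformly random order, and the single-pass estimate is taken from the fully masked state. None of these choices is confirmed by the source.

```python
import random
from typing import Callable, FrozenSet, List

# cond_prob(d, revealed) -> probability of the realized token at position d
# given the set of already revealed positions (hypothetical interface).
CondProb = Callable[[int, FrozenSet[int]], float]


def mc_order_expectation(cond_prob: CondProb, length: int,
                         num_orders: int = 32, seed: int = 0) -> List[float]:
    """Average over random decoding orders of each token's probability
    at the step where that token is revealed."""
    rng = random.Random(seed)
    totals = [0.0] * length
    for _ in range(num_orders):
        order = list(range(length))
        rng.shuffle(order)
        revealed: set = set()
        for d in order:
            totals[d] += cond_prob(d, frozenset(revealed))
            revealed.add(d)
    return [s / num_orders for s in totals]


def make_toy_model(base: float, length: int) -> CondProb:
    """Toy stand-in for a dLLM: confidence grows linearly with revealed context."""
    def cond_prob(d: int, revealed: FrozenSet[int]) -> float:
        return base + (1.0 - base) * len(revealed) / (length - 1)
    return cond_prob


def mean_gap(base: float, length: int = 8) -> float:
    """Mean absolute gap between the order-averaged and single-pass estimates."""
    model = make_toy_model(base, length)
    p_true = mc_order_expectation(model, length)
    p_hat = [model(d, frozenset()) for d in range(length)]  # fully masked pass
    return sum(abs(a - b) for a, b in zip(p_true, p_hat)) / length
```

In this toy model the gap shrinks as the base confidence approaches 1 by construction, so the sketch illustrates the estimator and the quantity being bounded, not the paper's theorem.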

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

d-TreeRPO pairs tree rollouts for step-wise advantages with scheduled self-distillation to tighten probability estimates in diffusion LLM RL, delivering large reported gains on puzzle benchmarks but leaving the advantage estimates without clear unbiasedness support.

The paper's core move is to replace sparse outcome rewards in dLLM policy optimization with tree-structured rollouts that compute bottom-up advantages, plus a time-scheduled self-distillation loss whose schedule is justified by a proof that higher prediction confidence shrinks the gap to the intractable expectation over all decoding orders. That combination is the actual novelty; prior dLLM RL work cited in the abstract does not describe this pairing. The Sudoku and Countdown lifts (+86% and +52%) are the clearest signal that the method helps when step-wise verification is possible, while the smaller GSM8K and Math500 gains are consistent with tasks where the base model already has some traction.

The self-distillation term is a clean addition because it directly targets the probability bias rather than adding another hyperparameter to the RL objective. The tree rollout idea itself is straightforward engineering that makes sense once you accept verifiable final rewards as the only reliable signal.

The soft spot is exactly the one flagged in the stress test: nothing demonstrates that the finite trees produce advantage estimates whose expectation matches the true value under the diffusion process, or that the estimates stay stable when tree depth or branching changes. The provided proof covers only the distillation term, so the RL advantage construction remains an assumption rather than a derived guarantee. The abstract also omits baseline details, variance numbers, and statistical tests, which makes it hard to judge whether the large puzzle gains would survive different tree sampling choices.

This work is aimed at groups already running RL on diffusion or non-autoregressive models and looking for concrete recipes to densify rewards. A reader who cares about verifiable reasoning tasks will find usable implementation ideas even if the theory needs tightening. It deserves a serious referee because the method is specific enough to test and the benchmark deltas are large enough to matter if they hold up under scrutiny.

Referee Report

2 major / 1 minor

Summary. The paper proposes d-TreeRPO, an RL framework for diffusion language models that uses tree-structured rollouts with bottom-up advantage computation from verifiable outcome rewards to address reward sparsity, combined with a time-scheduled self-distillation loss. A theoretical proof shows that increasing prediction confidence minimizes the gap between single-step forward-pass probability estimates and the unbiased expectation over all decoding orders. Experiments report large gains over the base model: +86.2% on Sudoku, +51.6% on Countdown, +4.5% on GSM8K, and +5.3% on Math500.

Significance. If the core claims hold, the work could provide a practical route to more reliable policy optimization in dLLMs by supplying finer-grained verifiable signals and tighter probability estimates. The reported benchmark gains, especially on Sudoku and Countdown, indicate potential impact for reasoning tasks if the tree-based advantages prove stable and generalizable. The explicit theoretical treatment of the self-distillation term is a constructive element.

major comments (2)

[Abstract and theoretical analysis section] The central claim that finite tree rollouts with bottom-up advantage computation yield unbiased step-wise estimates is load-bearing but unsupported. The diffusion process involves intractable expectations over decoding orders; no analysis shows that the particular tree sampling (depth, branching, selection) produces estimates whose expectation matches the true value function or remains stable under changes to the tree distribution. This is distinct from the self-distillation term, which receives a proof.
[Experiments section] Experimental reporting is insufficient to assess the claimed improvements. No baseline descriptions, number of runs, statistical significance tests, variance estimates, or ablations on tree hyperparameters are provided, making it impossible to determine whether the +86.2% Sudoku and +51.6% Countdown gains are robust or method-specific.

minor comments (1)

[Method section] The time schedule for the self-distillation loss is described only at a high level; an explicit functional form or pseudocode would clarify how the schedule interacts with the RL objective.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive feedback on our manuscript. We appreciate the recognition of the potential impact of d-TreeRPO and will address the major comments by providing additional analysis and experimental details in the revised version.

Referee: [Abstract and theoretical analysis section] The central claim that finite tree rollouts with bottom-up advantage computation yield unbiased step-wise estimates is load-bearing but unsupported. The diffusion process involves intractable expectations over decoding orders; no analysis shows that the particular tree sampling (depth, branching, selection) produces estimates whose expectation matches the true value function or remains stable under changes to the tree distribution. This is distinct from the self-distillation term, which receives a proof.

Authors: We thank the referee for pointing out this important distinction. The manuscript provides a theoretical proof specifically for the self-distillation loss, showing that increasing prediction confidence minimizes the gap to the unbiased expectation over decoding orders. For the tree-structured rollouts, the approach relies on bottom-up advantage computation from verifiable outcome rewards to deliver fine-grained signals, which we demonstrate empirically through substantial performance gains. However, we acknowledge that a formal proof or analysis establishing that the finite tree sampling produces unbiased estimates matching the true value function or its stability under varying tree distributions is not included. In the revised manuscript, we will add a discussion section addressing the potential bias and stability of the tree-based estimates, including any available bounds or empirical validation of robustness to tree hyperparameters. revision: yes
Referee: [Experiments section] Experimental reporting is insufficient to assess the claimed improvements. No baseline descriptions, number of runs, statistical significance tests, variance estimates, or ablations on tree hyperparameters are provided, making it impossible to determine whether the +86.2% Sudoku and +51.6% Countdown gains are robust or method-specific.

Authors: We agree that the current experimental reporting is insufficient for full assessment of the results' robustness. In the revised manuscript, we will expand the experiments section to include: detailed descriptions of all baselines and their implementations; results averaged over multiple independent runs with reported means, standard deviations, and variance estimates; statistical significance tests (e.g., t-tests) comparing d-TreeRPO to baselines; and comprehensive ablations on tree hyperparameters such as rollout depth, branching factor, and selection strategies. These additions will allow readers to better evaluate the reliability of the reported gains on Sudoku, Countdown, and other benchmarks. revision: yes
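As an illustration of the kind of significance test the rebuttal mentions, the following Python sketch computes Welch's t statistic and its approximate degrees of freedom for two sets of per-run scores. Welch's variant is a choice made here for illustration (the rebuttal only says "t-tests"), and the function name is hypothetical.

```python
from math import sqrt
from statistics import mean, variance
from typing import Sequence, Tuple


def welch_t(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom.

    `a` and `b` are per-run scores (e.g. accuracy over independent seeds);
    each needs at least two runs with non-zero variance overall.
    """
    va = variance(a) / len(a)  # squared standard error of mean(a)
    vb = variance(b) / len(b)  # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Placeholder inputs only; these are NOT results from the paper.
t_stat, dof = welch_t([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
```

Turning the statistic into a p-value requires the Student t distribution with `dof` degrees of freedom, which the standard library does not provide; a statistics package would be needed for that step.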

Circularity Check

0 steps flagged

No significant circularity; methods derive from RL principles and internal proof without reduction to inputs

The paper introduces tree-structured rollouts for bottom-up advantage computation from verifiable outcome rewards and a separate theoretical proof that increasing prediction confidence reduces the gap to unbiased expectations over decoding orders. The self-distillation loss is then scheduled based on that proof. No equations or claims reduce by construction to fitted parameters, self-citations, or renamed inputs; the advantage estimates and probability correction are presented as independent constructions, with empirical gains reported separately on benchmarks. The derivation chain remains self-contained against external RL and diffusion baselines.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

Review performed on abstract only; full paper would be required to enumerate free parameters, axioms, and invented entities with precision. The central claim rests on an unstated assumption that tree rollouts remain computationally tractable and that the theoretical gap-minimization result holds under the training schedule used.

free parameters (1)

time schedule for self-distillation
The schedule that increases distillation strength in later stages is introduced but its exact functional form and hyperparameters are not specified in the abstract.

axioms (1)

domain assumption: Increasing prediction confidence minimizes the gap between single-step probability estimates and the unbiased expectation over all decoding orders
This is the key theoretical result invoked to justify the self-distillation component.

Forward citations

Cited by 7 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

From Feedback Loops to Policy Updates: Reinforcement Fine-Tuning for LLM-Based Alpha Factor Discovery
cs.CE 2026-05 unverdicted novelty 7.0

QuantEvolver applies reinforcement fine-tuning to evolve an LLM policy for generating executable alpha factor expressions, yielding higher-quality and more complementary factors than prompt-based baselines on market b...

Relative Score Policy Optimization for Diffusion Language Models
cs.CL 2026-05 unverdicted novelty 7.0

RSPO interprets reward advantages as targets for relative log-ratios in dLLMs, calibrating noisy estimates to stabilize RLVR training and achieve strong gains on planning tasks with competitive math reasoning performance.

Towards Robust LLM Post-Training: Automatic Failure Management for Reinforcement Fine-Tuning
cs.SE 2026-05 unverdicted novelty 7.0

Introduces the first benchmark for fine-grained failures in reinforcement fine-tuning of LLMs and an automatic management framework that detects, diagnoses, and remediates them.

E2E-REME: Towards End-to-End Microservices Auto-Remediation via Experience-Simulation Reinforcement Fine-Tuning
cs.SE 2026-04 unverdicted novelty 7.0

E2E-REME outperforms nine LLMs in accuracy and efficiency for end-to-end microservice remediation by using experience-simulation reinforcement fine-tuning on a new benchmark called MicroRemed.

DMax: Aggressive Parallel Decoding for dLLMs
cs.LG 2026-04 conditional novelty 7.0

DMax uses On-Policy Uniform Training and Soft Parallel Decoding to enable aggressive parallelism in dLLMs, raising TPF on GSM8K from 2.04 to 5.47 and on MBPP from 2.71 to 5.86 while preserving accuracy.

Sketch Then Paint: Hierarchical Reinforcement Learning for Diffusion Multi-Modal Large Language Models
cs.AI 2026-05 unverdicted novelty 6.0

Proposes HT-GRPO with sketch-then-paint staged updates, prompt-conditioned importance ratios, and hierarchical credit assignment for dMLLMs, reporting gains on GenEval and DPG plus quality metrics.

DMax: Aggressive Parallel Decoding for dLLMs
cs.LG 2026-04 unverdicted novelty 5.0

DMax enables faster parallel decoding in diffusion language models by using on-policy training to recover from errors and soft embedding interpolations for iterative revision, boosting tokens per forward pass roughly ...

Reference graph

Works this paper leans on

18 extracted references · 18 canonical work pages · cited by 6 Pith papers · 3 internal anchors

[1]

Training Verifiers to Solve Math Word Problems

Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168. Ganqu Cui, Yuchen Zhang, Jiacheng Chen, Lifan Yuan, Zhi Wang, Yuxin Zuo, Haozhan Li, Yuchen Fan, Huayu Chen, Weize Chen, and 1 others. 2025. The entropy mechanism of reinforcement learning for reasoning language models. arXiv preprint arXiv:2505.22617. Shansan Gong, Ruixiang ...

[2]

Let's Verify Step by Step

Let’s verify step by step. arXiv preprint arXiv:2305.20050. Nianyi Lin, Jiajie Zhang, Lei Hou, and Juanzi Li. 2025. Boundary-guided policy optimization for memory-efficient RL of diffusion large language models. arXiv preprint arXiv:2510.11683. Shen Nie, Fengqi Zhu, Chao Du, Tianyu Pang, Qian Liu, Guangtao Zeng, Min Lin, and Chongxuan Li

[3]

Scaling up masked diffusion models on text

Scaling up masked diffusion models on text. arXiv preprint arXiv:2410.18514. Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, and Chongxuan Li. 2025. Large language diffusion models. arXiv preprint arXiv:2502.09992. Jingyang Ou, Jiaqi Han, Minkai Xu, Shaoxuan Xu, Jianwen Xie, Stefano Ermon, Yi Wu, a...

[4]

Improving reasoning for diffusion language models via group diffusion policy optimization

Improving reasoning for diffusion language models via group diffusion policy optimization. arXiv preprint arXiv:2510.08554. John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Ha...

[5]

Dream 7B: Diffusion Large Language Models

Dream 7b: Diffusion large language models. arXiv preprint arXiv:2508.15487. Jiasheng Ye, Zaixiang Zheng, Yu Bao, Lihua Qian, and Quanquan Gu. 2023. Diffusion language models can perform many tasks with scaling and instruction-finetuning. arXiv preprint arXiv:2308.12219. Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tia...

[6]

A survey on parallel text generation: From parallel decoding to diffusion language models

A survey on parallel text generation: From parallel decoding to diffusion language models. arXiv preprint arXiv:2508.08712. Siyan Zhao, Devaansh Gupta, Qinqing Zheng, and Aditya Grover. 2025a. d1: Scaling reasoning in diffusion large language models via reinforcement learning. arXiv preprint arXiv:2504.12216. Siyan Zhao, Mengchen Liu, Jing Huang, Miao Liu...
[7]

high probability

Why “high probability”?The randomness comes from σ∼ Q . Theorem 1 states that the estimation error is bounded on an event of probability at least1−δw.r.t. dQ

work page
[8]

Why does confidence tighten the bound?The bound is monotone in ϵd,δ: as ϵd,δ decreases, −log(1−ϵ d,δ) decreases and approaches

work page
[9]

Thus, higher token confidence (smaller typical-path confidence gap) yields a tighter error bound

work page
[10]

Since ptrue d is the expectation of qτ(d,σ) (σ) over σ∼ Q , reducing the typical-path devi- ation (smaller ϵd,δ) makes ˆpd a more reliable proxy when approximatingp true d

Relation to ptrue d .Theorem 1 bounds the deviation between ˆpd and the random vari- able qτ(d,σ) (σ) for typical decoding orders. Since ptrue d is the expectation of qτ(d,σ) (σ) over σ∼ Q , reducing the typical-path devi- ation (smaller ϵd,δ) makes ˆpd a more reliable proxy when approximatingp true d . 11 B Detailed Descriptions of dLLM RL Baselines Diff...

work page 2025
[12]

From the second relationship, we know that 2T+G=S

Two treeks and one goolee are equal in weight to one squig: 2T+G=S We need to find the number of treeks whose combined weight equals the weight of one squig. From the second relationship, we know that 2T+G=S. This implies that the combined weight of two treeks and one goolee is equal to the weight of one squig. Therefore, the weight of two treeks equals t...

[13]

Ten treeks weigh as much as three squigs and one goolee: 10T = 3S+G

[14]

From the second relationship, we know that 2T+G=S

Two treeks and one goolee are equal in weight to one squig: 2T+G=S We need to find the number of treeks whose combined weight equals the weight of one squig. From the second relationship, we know that 2T+G=S. This means that the combined weight of two treeks and one goolee is equal to the weight of one squig. Therefore, the number of treeks that equal the...

[15]

Let the weight of one treek be T, the weight of one squig be S, and the weight of one goolee be G

[16]

From the first equation: 10T = 3S+G

[17]

From the second equation: 2T+G=S

[18]

Substitute S = 2T+G from the second equation into the first: 10T = 3(2T+G) + G, 10T = 6 + T + G, 4T = G ✗

[19]

LLaDA-8B-Instruct + d-TreeRPO Question: Ten treeks weigh as much as three squigs and one goolee

To find how many treeks equal the weight of one squig: S = 6T which means 6 treeks </reasoning> <answer> 6 ✗ </answer> Figure 18: A case study of LLaDA-8B-Instruct trained with wd1 responding to a GSM8K question. LLaDA-8B-Instruct + d-TreeRPO Question: Ten treeks weigh as much as three squigs and one goolee. Two treeks and one goolee are equal in weight to one...
