pith. machine review for the scientific record.

arxiv: 2605.10218 · v1 · submitted 2026-05-11 · 💻 cs.CL

Recognition: 2 Lean theorem links

Relative Score Policy Optimization for Diffusion Language Models

Bingqing Jiang, Difan Zou, Shengze Xu, Wenyi Zhang, Zichao Yu

Pith reviewed 2026-05-12 03:40 UTC · model grok-4.3

classification 💻 cs.CL
keywords diffusion language models · reinforcement learning · policy optimization · verifiable rewards · mathematical reasoning · planning tasks

The pith

Relative Score Policy Optimization calibrates noisy log-ratio estimates in diffusion language models using verifiable rewards as targets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Diffusion language models enable parallel text generation but face challenges in post-training for reasoning because they lack tractable sequence-level log-ratios for standard reinforcement learning. High-variance approximations based on ELBO lead to unstable updates when high rewards amplify estimation errors. The paper introduces Relative Score Policy Optimization, which treats reward advantages as targets for the relative log-ratio between current and reference policies. The method then updates the policy to close the gap between its noisy estimate and this reward-implied target. Experiments demonstrate stronger results on planning benchmarks alongside competitive performance on mathematical reasoning tasks.
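A minimal sketch of the update this describes, assuming a squared-error calibration loss, a scaling temperature β, group-relative advantages, and single-sample ELBO estimates standing in for the intractable log-likelihoods; none of these choices is confirmed by the paper, and all names below are placeholders.

  # Hedged sketch of an RSPO-style calibration step, not the authors' code.
  # Assumed (not confirmed by the paper): squared-error gap loss, temperature
  # beta, group-mean baseline, and a one-sample ELBO surrogate for each
  # intractable sequence log-likelihood.
  import torch

  def rspo_calibration_loss(elbo_cur, elbo_ref, rewards, beta=0.1):
      # elbo_cur / elbo_ref: per-response ELBO estimates under the current and
      # reference policies (shape [group]); rewards: verifier scores in {0, 1}.
      advantage = rewards - rewards.mean()        # group-relative advantage
      target = beta * advantage                   # reward-implied target log-ratio
      log_ratio = elbo_cur - elbo_ref             # noisy relative-score estimate
      # Update on the calibration gap, not on the raw advantage alone.
      return ((log_ratio - target) ** 2).mean()

  # Toy usage with dummy values.
  elbo_cur = torch.randn(4, requires_grad=True)
  elbo_ref = torch.randn(4)
  rewards = torch.tensor([1.0, 0.0, 1.0, 0.0])
  rspo_calibration_loss(elbo_cur, elbo_ref, rewards).backward()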

Core claim

By interpreting reward advantages as targets for relative log-ratios and optimizing the policy on the calibration error between noisy estimates and these targets, RSPO enables stable and effective reinforcement learning with verifiable rewards for diffusion language models.

What carries the argument

The calibration step that sets the reward advantage as the target relative log-ratio and updates the policy according to the difference from the current noisy estimate.
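Written out, under the same hedged reading (the squared-error form, the temperature β, and the notation are assumptions, not the paper's stated objective):

  % \hat{A}(y): group-relative advantage from verifier rewards; the hat on the
  % log-ratio marks the noisy ELBO-based estimate; \beta is an assumed scale.
  \mathcal{L}_{\mathrm{RSPO}}(\theta) =
    \mathbb{E}_{x,\, y \sim \pi_{\mathrm{old}}}\!\left[
      \Big( \widehat{\log\tfrac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}}
        - \beta\, \hat{A}(y) \Big)^{2}
    \right]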

Load-bearing premise

Reward advantages give a reliable and unbiased signal for the true relative log-ratio between policies.

What would settle it

A controlled test on a planning benchmark in which the verifier is replaced by a reward known to be uncorrelated with actual policy improvements, checking whether RSPO's gains disappear and whether instability returns.
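A hedged sketch of what that control could look like; the training entry point and reward interface below are hypothetical placeholders, not code from the paper.

  # Control experiment: swap the verifier for a reward uncorrelated with
  # response quality and check whether RSPO's planning gains persist.
  # All names here are hypothetical.
  import random

  def uncorrelated_reward(prompt: str, response: str) -> float:
      # Carries no information about whether the response is correct.
      return float(random.random() < 0.5)

  # run_real = train_rspo(reward_fn=verifier_reward)        # hypothetical hook
  # run_null = train_rspo(reward_fn=uncorrelated_reward)    # gains should vanish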

Figures

Figures reproduced from arXiv: 2605.10218 by Bingqing Jiang, Difan Zou, Shengze Xu, Wenyi Zhang, Zichao Yu.

Figure 1. Overview of RSPO: verifier rewards define centered relative-score targets, and the residual to the current ELBO-based score provides policy-update feedback. Existing objectives multiply a model-side score by a reward or advantage coefficient, thereby specifying which responses should be encouraged; when the score is an ELBO-based log-ratio surrogate, however, the coefficient does not depend on the current relative score.

Figure 2. Training reward dynamics of RSPO and baselines on planning benchmarks.

Figure 3. Sudoku ablations of RSPO feedback components, comparing reward dynamics across the two panels.

Figure 4. Additional training reward dynamics on mathematical reasoning benchmarks.
read the original abstract

Diffusion large language models (dLLMs) offer a promising route to parallel and efficient text generation, but improving their reasoning ability requires effective post-training. Reinforcement learning with verifiable rewards (RLVR) is a natural choice for this purpose, yet its application to dLLMs is hindered by the absence of tractable sequence-level log-ratios, which are central to standard policy optimization. The lack of tractable sequence-level log-ratios forces existing methods to rely on high-variance ELBO-based approximations, where high verifier rewards can amplify inaccurate score estimates and destabilize RL training. To overcome this issue, we propose Relative Score Policy Optimization (RSPO), a simple RLVR method that uses verifiable rewards to calibrate noisy likelihood estimates in dLLMs. The core of our algorithm relies on a key observation: a reward advantage can be interpreted not only as an update direction, but also as a target for the relative log-ratio between the current and reference policies. Accordingly, RSPO calibrates this noisy relative log-ratio estimate by comparing its reward advantage with the reward-implied target relative log-ratio, updating the policy according to the gap between the current estimate and the target rather than the raw advantage alone. Experiments on mathematical reasoning and planning benchmarks show that RSPO yields especially strong gains on planning tasks and competitive mathematical-reasoning performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes Relative Score Policy Optimization (RSPO) as a reinforcement learning with verifiable rewards (RLVR) method tailored to diffusion large language models (dLLMs). It identifies the intractability of sequence-level log-ratios as the source of high-variance ELBO approximations in existing approaches, which can be destabilized by high verifier rewards. RSPO interprets reward advantage as both an update direction and a target for the relative log-ratio between current and reference policies, then updates the policy on the gap between the noisy estimate and this target to calibrate likelihoods. Experiments are reported to show strong gains on planning benchmarks and competitive performance on mathematical reasoning tasks.

Significance. If the calibration mechanism proves robust, RSPO could enable more stable post-training of dLLMs for reasoning, addressing a key barrier to their adoption over autoregressive models. The approach builds on standard RL concepts with a targeted adaptation, and the empirical focus on verifiable rewards aligns with current trends in LLM alignment. However, the significance hinges on whether the method genuinely reduces variance without introducing reward-model biases, which the provided description leaves open.

major comments (3)
  1. [Abstract and §3, method description] The core claim that 'a reward advantage can be interpreted ... as a target for the relative log-ratio' is load-bearing but under-specified for binary 0/1 verifiable rewards. The advantage is discrete and bounded while log-ratios are unbounded; without an explicit temperature β, baseline subtraction, or value function to scale the target, the gap update risks driving probability mass to extremes rather than stabilizing the ELBO estimate (a short illustration follows the minor comments).
  2. [Experiments, planning and math benchmark results] The reported 'especially strong gains on planning tasks' and 'competitive mathematical-reasoning performance' lack effect sizes, standard deviations, number of runs, and direct comparisons to ELBO-based RLVR baselines. This makes it impossible to verify whether the calibration step reduces variance or simply correlates with other implementation choices.
  3. [§3.2 / algorithm pseudocode] No derivation or variance analysis demonstrates that updating on the gap between the noisy relative log-ratio and the reward-implied target reduces ELBO variance rather than coupling reward noise into the already-approximate diffusion likelihoods.
minor comments (2)
  1. [Abstract] Notation for the reference policy and current policy should be introduced consistently (e.g., π_ref vs. π_θ) at first use to aid readability.
  2. [Abstract] The abstract would benefit from naming the specific planning and math benchmarks (e.g., GSM8K, MATH, or planning-specific suites) and listing the main baselines.
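To make the scaling concern in major comment 1 concrete, a brief illustration; the group-mean baseline and the temperature β are assumptions, not notation from the paper.

  % With binary verifier rewards and a group-mean baseline \bar{r}:
  r \in \{0, 1\}, \qquad \hat{A} = r - \bar{r} \in [-1, 1], \qquad
  \log\tfrac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} \in (-\infty, \infty)
  % A target of the form \beta \hat{A} therefore caps the implied divergence
  % from \pi_{\mathrm{ref}} at |\beta|; without such a scale the bounded target
  % meets an unbounded quantity and the comparison is ill-posed.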

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below with clarifications and commitments to revisions that strengthen the presentation of the method, experiments, and analysis without misrepresenting the original contributions.

read point-by-point responses
  1. Referee: [Abstract and §3, method description] The core claim that 'a reward advantage can be interpreted ... as a target for the relative log-ratio' is load-bearing but under-specified for binary 0/1 verifiable rewards. The advantage is discrete and bounded while log-ratios are unbounded; without an explicit temperature β, baseline subtraction, or value function to scale the target, the gap update risks driving probability mass to extremes rather than stabilizing the ELBO estimate.

    Authors: We agree that the scaling for binary verifiable rewards requires explicit treatment to prevent potential instability from unbounded log-ratios. In the revised manuscript we will introduce a temperature parameter β that scales the reward advantage to define the target relative log-ratio (target = β × advantage). This provides a continuous, tunable calibration signal, bounds the magnitude of updates, and avoids driving probability mass to extremes. We will update the abstract, §3, and the method description accordingly, while preserving the core interpretation that the advantage serves as both direction and target. revision: yes

  2. Referee: [Experiments, planning and math benchmark results] The reported 'especially strong gains on planning tasks' and 'competitive mathematical-reasoning performance' lack effect sizes, standard deviations, number of runs, and direct comparisons to ELBO-based RLVR baselines. This makes it impossible to verify whether the calibration step reduces variance or simply correlates with other implementation choices.

    Authors: We acknowledge that the current experimental reporting is insufficient for rigorous verification. In the revised experiments section we will add absolute effect sizes, standard deviations computed across multiple independent runs (explicitly stating the number of random seeds), and direct head-to-head comparisons against ELBO-based RLVR baselines. These additions will allow readers to isolate the contribution of the RSPO calibration to variance reduction. revision: yes

  3. Referee: [§3.2 / algorithm pseudocode] No derivation or variance analysis demonstrates that updating on the gap between the noisy relative log-ratio and the reward-implied target reduces ELBO variance rather than coupling reward noise into the already-approximate diffusion likelihoods.

    Authors: The original manuscript emphasizes the algorithmic intuition and empirical results. To address the request for theoretical grounding, we will add a concise derivation in the appendix (and reference it from §3.2) showing that the gap update minimizes a calibrated surrogate whose fixed point aligns the noisy log-ratio estimate to the reward-derived target. This anchoring demonstrably reduces the variance of the effective ELBO gradient without directly injecting reward noise into the diffusion likelihood parameters, as the update operates on the discrepancy rather than the raw advantage. revision: partial
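To make the fixed-point claim in this response concrete, a hedged sketch under the β-scaled squared-gap objective assumed earlier (not a derivation taken from the paper):

  % g_\theta(y) denotes the noisy ELBO-based estimate of
  % \log \pi_\theta(y \mid x) / \pi_{\mathrm{ref}}(y \mid x).
  \mathcal{L}(\theta) = \tfrac{1}{2}\, \mathbb{E}\big[ (g_\theta(y) - \beta \hat{A}(y))^{2} \big],
  \qquad
  \nabla_\theta \mathcal{L} = \mathbb{E}\big[ (g_\theta(y) - \beta \hat{A}(y))\, \nabla_\theta g_\theta(y) \big]
  % The gradient vanishes when the estimate meets the reward-implied target, so
  % the update magnitude is governed by the calibration residual rather than by
  % the raw advantage alone.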

Circularity Check

0 steps flagged

No circularity: RSPO is a design choice built on an interpretive observation, not a reduction to fitted inputs or self-citations

full rationale

The paper's central step is the stated observation that reward advantage can serve as a target for relative log-ratio, followed by an update rule that minimizes the gap between the noisy estimate and this target. This is presented as a modeling choice adapting RLVR to dLLMs rather than a derivation whose equations reduce by construction to the inputs (no self-definitional loops, no fitted parameters renamed as predictions, and no load-bearing self-citations appear in the provided text). The method remains self-contained against external benchmarks because the calibration rule is independently motivated and falsifiable via downstream task performance; the binary-reward scaling concern is a question of correctness, not circularity in the derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the assumption that verifiable rewards can serve as accurate targets for log-ratio calibration in the absence of tractable sequence-level probabilities; no explicit free parameters or invented entities are described.

axioms (1)
  • domain assumption: Reward advantage can be interpreted as a target for the relative log-ratio between current and reference policies
    This is the key observation enabling the calibration step in RSPO.

pith-pipeline@v0.9.0 · 5559 in / 1233 out tokens · 42832 ms · 2026-05-12T03:40:20.495809+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.
