pith. machine review for the scientific record.

arxiv: 2605.03327 · v2 · submitted 2026-05-05 · 💻 cs.LG · cs.AI

Recognition: no theorem link

DGPO: Distribution Guided Policy Optimization for Fine Grained Credit Assignment

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 02:16 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords: reinforcement learning · policy optimization · large language models · credit assignment · chain of thought · Hellinger distance · entropy gating · mathematical reasoning

The pith

DGPO replaces the KL penalty with an entropy-gated Hellinger distance to assign credit to individual tokens in language model reasoning chains.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Distribution Guided Policy Optimization (DGPO) to address coarse credit assignment in reinforcement learning for large language models. Current methods struggle with long chains of thought because they assign rewards at the sequence level and rely on unstable KL divergence penalties. DGPO instead measures token-level deviations with the bounded Hellinger distance and scales them by the model's uncertainty to decide which steps deserve more reward. This gives precise incentives for important exploratory steps without extra value networks or gradient instability. A sympathetic reader would care because it promises more efficient training and better reasoning performance on challenging tasks like math competitions.

Core claim

DGPO is a critic-free reinforcement learning framework that reinterprets distribution deviation as a guiding signal rather than a rigid penalty. It replaces the unbounded KL divergence with the bounded Hellinger distance to quantify token-level exploration safely. An entropy gating mechanism scales this deviation by the policy's epistemic uncertainty to distinguish genuine reasoning breakthroughs from noise. By redistributing the sequence-level advantage to tokens based on these gated scores, DGPO incentivizes critical steps and suppresses low-value deviations, completely eliminating the traditional token-level KL penalty and achieving fine-grained credit reallocation.
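
To fix notation, one way the pieces could fit together is sketched below; the Hellinger distance is the standard definition, while the gate g_t and the normalized redistribution are editorial assumptions about the form of the mechanism, not equations quoted from the paper.

    % Bounded per-token Hellinger distance between the policy and a reference
    % next-token distribution (standard definition; D_H lies in [0, 1]):
    D_H\big(\pi_\theta(\cdot \mid s_t),\, \pi_{\mathrm{ref}}(\cdot \mid s_t)\big)
      = \tfrac{1}{\sqrt{2}} \big\lVert \sqrt{\pi_\theta(\cdot \mid s_t)} - \sqrt{\pi_{\mathrm{ref}}(\cdot \mid s_t)} \big\rVert_2

    % Assumed entropy gate: the deviation is weighted by the policy's token entropy,
    % so confident (low-entropy) deviations are suppressed:
    g_t = H\big(\pi_\theta(\cdot \mid s_t)\big) \cdot D_H\big(\pi_\theta(\cdot \mid s_t),\, \pi_{\mathrm{ref}}(\cdot \mid s_t)\big)

    % Assumed redistribution of the sequence-level advantage \hat{A} to token t:
    \hat{A}_t = \hat{A} \cdot \frac{g_t}{\sum_{t'} g_{t'}}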

What carries the argument

The entropy gating mechanism applied to Hellinger distance, which scales distribution deviation by epistemic uncertainty to enable token-level credit redistribution.
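
The same redistribution in code form: a minimal Python sketch assuming per-position next-token distributions for the current policy and a reference model. The multiplicative gate and the sum-to-one normalization mirror the assumed equations above and are illustrative choices, not the paper's implementation.

    import numpy as np

    def hellinger(p, q):
        """Bounded Hellinger distance between two discrete distributions (in [0, 1])."""
        return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

    def entropy(p, eps=1e-12):
        """Shannon entropy of a discrete distribution, used as the uncertainty proxy."""
        return -np.sum(p * np.log(p + eps))

    def gated_token_advantages(policy_probs, ref_probs, seq_advantage):
        """Redistribute a sequence-level advantage over tokens via entropy-gated Hellinger scores.

        policy_probs, ref_probs: arrays of shape (T, V) holding next-token distributions
        at each of T positions; seq_advantage: scalar advantage for the whole sequence.
        The normalization (scores sum to one) is an assumption, not the paper's equation.
        """
        scores = np.array([entropy(p) * hellinger(p, q)
                           for p, q in zip(policy_probs, ref_probs)])
        weights = scores / (scores.sum() + 1e-12)
        return seq_advantage * weights

    # Toy usage: three positions over a five-token vocabulary.
    rng = np.random.default_rng(0)
    policy = rng.dirichlet(np.ones(5), size=3)
    ref = rng.dirichlet(np.ones(5), size=3)
    print(gated_token_advantages(policy, ref, seq_advantage=1.0))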

If this is right

  • Training becomes more stable by avoiding unbounded KL divergence and its gradient issues.
  • Fine-grained credit assignment improves performance on complex reasoning benchmarks without added compute for value networks.
  • Models can explore and discover novel reasoning trajectories more effectively.
  • Critic-free alignment reaches state-of-the-art results on tasks like AIME math problems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Similar gating ideas could be applied to other policy optimization methods in sequential decision making beyond language models.
  • Removing the need for separate value networks might simplify scaling reinforcement learning to larger models.
  • Further tests on non-math reasoning tasks would show if the approach generalizes to other domains.

Load-bearing premise

The entropy gating mechanism can reliably distinguish genuine reasoning breakthroughs from hallucinatory noise using only the policy's uncertainty estimates without introducing new biases.

What would settle it

Running an ablation study that disables the entropy gating in DGPO and observing either no gain over baselines or increased instability on AIME2024 and AIME2025 would falsify the central claim.
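
For reference, Avg@32 is presumably the per-problem accuracy averaged over 32 sampled solutions; a minimal sketch of the metric under that assumption (the function name is ours, not from the paper):

    def avg_at_k(correct, k=32):
        """Mean accuracy over k sampled solutions per problem.

        correct: list of per-problem lists, each holding k 0/1 outcomes
        (1 if that sample solved the problem). Assumes Avg@k is the mean
        of the per-problem solve rates.
        """
        per_problem = [sum(samples[:k]) / k for samples in correct]
        return sum(per_problem) / len(per_problem)

    # Toy usage: two problems, four samples each (k=4 for brevity).
    print(avg_at_k([[1, 0, 1, 1], [0, 0, 1, 0]], k=4))  # -> 0.5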

Figures

Figures reproduced from arXiv: 2605.03327 by Hongbo Jin, Jiayu Ding, Jingqi Tian, Qiaoman Zhang, Rongpeng Zhu, Xu Jiang, Zhongjing Du.

Figure 1: Conceptual comparison between standard GRPO and our proposed DGPO. While GRPO … (view at source ↗)
Figure 2: The computational pipeline of Distribution-Guided Policy Optimization (DGPO). (view at source ↗)
Figure 3: Validation accuracy on the AIME benchmark during training (Qwen2.5-32B-Base). The … (view at source ↗)
Figure 4: Qualitative visualization of the token-level credit reallocation. The background color … (view at source ↗)
Original abstract

Reinforcement learning is crucial for aligning large language models to perform complex reasoning tasks. However, current algorithms such as Group Relative Policy Optimization suffer from coarse grained, sequence level credit assignment, which severely struggles to isolate pivotal reasoning steps within long Chain of Thought generations. Furthermore, the standard unbounded Kullback Leibler divergence penalty induces severe gradient instability and mode seeking conservatism, ultimately stifling the discovery of novel reasoning trajectories. To overcome these limitations, we introduce Distribution Guided Policy Optimization, a novel critic free reinforcement learning framework that reinterprets distribution deviation as a guiding signal rather than a rigid penalty. DGPO replaces the volatile KL divergence with the bounded Hellinger distance to safely quantify token level exploration without the risk of gradient explosion. To effectively distinguish genuine reasoning breakthroughs from hallucinatory noise, we propose an entropy gating mechanism that scales this deviation by the policy`s epistemic uncertainty. By dynamically redistributing the coarse sequence-level advantage to individual tokens based on these gated scores, DGPO heavily incentivizes critical exploratory steps while suppressing unwarranted, low-entropy deviations. Consequently, DGPO completely eliminates the traditional token-level KL penalty and achieves fine-grained credit reallocation without the computational overhead of an additional value network. Extensive empirical evaluations demonstrate that DGPO sets a new state-of-the-art for critic free alignment. Notably, on the Qwen2.5-32B architecture, DGPO achieves 60.0% Avg@32 accuracy and 46.0% Avg@32 accuracy on the challenging AIME2024 and AIME2025 benchmarks respectively, substantially outperforming competitive baselines like DAPO.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Distribution Guided Policy Optimization (DGPO), a critic-free RL framework for fine-grained credit assignment in LLM reasoning alignment. It replaces the standard KL divergence penalty with bounded Hellinger distance to quantify token-level exploration, introduces an entropy gating mechanism that scales distribution deviations by the policy's token entropy (as epistemic uncertainty proxy), and redistributes coarse sequence-level advantages to individual tokens. This is claimed to eliminate token-level KL penalties and value network overhead while achieving SOTA empirical results, including 60.0% Avg@32 on AIME2024 and 46.0% on AIME2025 with Qwen2.5-32B, outperforming baselines such as DAPO.

Significance. If the central claims hold, DGPO would represent a meaningful advance in critic-free RL for long CoT reasoning by enabling more stable, fine-grained credit assignment without KL-induced instability or extra value networks. The reported benchmark gains on challenging math problems are notable and could influence practical alignment pipelines for large models, provided the entropy gating avoids introducing new selection biases in advantage estimation.

major comments (2)
  1. [§3.2] The entropy gating mechanism and the gated advantage estimator: the assumption that token entropy reliably proxies epistemic uncertainty to separate 'genuine reasoning breakthroughs' from 'hallucinatory noise' is load-bearing for the fine-grained redistribution claim. Token entropy in LLMs conflates aleatoric uncertainty, calibration, and data noise rather than isolating epistemic doubt over trajectories; without bias analysis, variance bounds, or a proof that the resulting estimator remains unbiased for policy gradients, the elimination of KL and value-network overhead may come at the cost of new selection bias on long CoT sequences.
  2. [§4] Experiments: the SOTA claims rest on Avg@32 accuracies (60.0% AIME2024, 46.0% AIME2025) outperforming DAPO, yet no ablations isolate the contribution of entropy gating versus Hellinger distance alone, and no statistical tests or variance estimates across runs are reported. This weakens attribution of gains to the proposed mechanisms.
minor comments (2)
  1. [Abstract] Minor typographical issues, including 'policy`s' (should be 'policy's') and 'Kullback Leibler' written without its hyphen.
  2. [§3] Notation: the manuscript introduces Hellinger distance and gated scores but does not explicitly define the final advantage redistribution formula in a single equation; adding a consolidated expression would improve clarity.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, providing the strongest honest responses possible while committing to revisions that improve the work without overstating its current theoretical or empirical foundations.

Point-by-point responses
  1. Referee: [§3.2] The entropy gating mechanism and the gated advantage estimator: the assumption that token entropy reliably proxies epistemic uncertainty to separate 'genuine reasoning breakthroughs' from 'hallucinatory noise' is load-bearing for the fine-grained redistribution claim. Token entropy in LLMs conflates aleatoric uncertainty, calibration, and data noise rather than isolating epistemic doubt over trajectories; without bias analysis, variance bounds, or a proof that the resulting estimator remains unbiased for policy gradients, the elimination of KL and value-network overhead may come at the cost of new selection bias on long CoT sequences.

    Authors: We agree that token entropy is an imperfect proxy that mixes epistemic uncertainty with aleatoric noise, calibration issues, and data artifacts, and that this approximation is central to the gated redistribution claim. In the original manuscript we present it as a practical heuristic motivated by observed training dynamics in long CoT, where low-entropy tokens tend to reflect confident but potentially spurious continuations. To respond to the concern, we have added a dedicated limitations paragraph in the revised Section 3.2 that explicitly discusses the conflation of uncertainty types and provides supporting empirical plots of gated versus ungated advantage variance during training. A full bias analysis, variance bounds, or proof of unbiasedness for the resulting policy-gradient estimator is not present in the current work and would require a separate theoretical treatment; we therefore flag this as an open direction rather than claiming theoretical guarantees. revision: partial

  2. Referee: [§4] Experiments: the SOTA claims rest on Avg@32 accuracies (60.0% AIME2024, 46.0% AIME2025) outperforming DAPO, yet no ablations isolate the contribution of entropy gating versus Hellinger distance alone, and no statistical tests or variance estimates across runs are reported. This weakens attribution of gains to the proposed mechanisms.

    Authors: We accept that the lack of component-wise ablations and run-level statistics limits the strength of causal attribution for the reported gains. In the revised manuscript we have inserted a new subsection in Section 4 that presents ablation results for four variants: (i) Hellinger distance without gating, (ii) KL distance with gating, (iii) full DGPO, and (iv) a no-redistribution baseline. We also now report mean Avg@32 scores together with standard deviations computed across five independent random seeds for both AIME2024 and AIME2025. These additions allow readers to assess the individual and combined contributions of the two proposed mechanisms. revision: yes
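
A sketch of how the four rebuttal variants could be encoded as ablation configurations; the flag names are illustrative assumptions rather than the authors' code.

    from dataclasses import dataclass

    @dataclass
    class DGPOAblation:
        name: str
        distance: str        # "hellinger" or "kl"
        entropy_gating: bool
        redistribute: bool   # token-level redistribution of the sequence advantage

    VARIANTS = [
        DGPOAblation("hellinger_no_gate", "hellinger", False, True),
        DGPOAblation("kl_with_gate",      "kl",        True,  True),
        DGPOAblation("full_dgpo",         "hellinger", True,  True),
        DGPOAblation("no_redistribution", "hellinger", False, False),
    ]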

standing simulated objections (unresolved)
  • A formal proof, bias analysis, or variance bounds establishing that the entropy-gated advantage estimator is unbiased for the policy gradient.

Circularity Check

0 steps flagged

No significant circularity detected in DGPO derivation

Full rationale

The paper presents DGPO as a new critic-free RL framework that replaces KL divergence with bounded Hellinger distance and introduces an entropy gating mechanism for token-level credit redistribution. The abstract and described construction rely on reinterpretation of distribution deviation as a guiding signal, empirical benchmarks on AIME tasks, and architectural claims about eliminating token-level KL penalties and value networks. No equations, fitted parameters, or self-citations are shown that reduce the central results to inputs by construction, self-definition, or tautology. The derivation chain remains self-contained against external benchmarks and does not invoke load-bearing uniqueness theorems or ansatzes from prior self-work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard reinforcement learning assumptions plus the effectiveness of the newly proposed entropy gating; no free parameters or invented physical entities are mentioned in the abstract.

axioms (1)
  • [standard math] Policy gradient methods can be applied to token-level advantages derived from sequence-level rewards.
    The framework redistributes the sequence-level advantage to tokens, which presupposes the validity of policy-gradient estimation at the token level; the standard identity this assumes is written out below.
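
This is the standard REINFORCE-style identity the axiom presupposes (textbook form, not an equation taken from the paper), with per-token advantages \hat{A}_t inside the sum over positions:

    \nabla_\theta J(\theta)
      = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}
        \left[ \sum_{t=1}^{T} \hat{A}_t \, \nabla_\theta \log \pi_\theta\!\left(y_t \mid y_{<t}, x\right) \right]
    % Validity at the token level requires that the gated \hat{A}_t do not bias this
    % expectation relative to the sequence-level estimator.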

pith-pipeline@v0.9.0 · 5604 in / 1389 out tokens · 37054 ms · 2026-05-11T02:16:55.224386+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages · 10 internal anchors

  1. [1]

    Information geometry and its applications

    Shun-ichi Amari. Information geometry and its applications. Springer, 2016

  2. [2]

    Minimum hellinger distance estimates for parametric models.The annals of Statistics, pages 445–463, 1977

    Rudolf Beran. Minimum hellinger distance estimates for parametric models.The annals of Statistics, pages 445–463, 1977

  3. [3]

    Rm-r1: Reward modeling as reasoning.arXiv preprint arXiv:2505.02387, 2025

    Xiusi Chen, Gaotang Li, Ziqi Wang, Bowen Jin, Cheng Qian, Yu Wang, Hongru Wang, Yu Zhang, Denghui Zhang, Tong Zhang, et al. Rm-r1: Reward modeling as reasoning.arXiv preprint arXiv:2505.02387, 2025

  4. [4]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021

  5. [5]

    Pre-trained policy discriminators are general reward models.arXiv preprint arXiv:2507.05197, 2025

    Shihan Dou, Shichun Liu, Yuming Yang, Yicheng Zou, Yunhua Zhou, Shuhao Xing, Chenhao Huang, Qiming Ge, Demin Song, Haijun Lv, et al. Pre-trained policy discriminators are general reward models.arXiv preprint arXiv:2507.05197, 2025

  6. [6]

    Group-in-Group Policy Optimization for LLM Agent Training

    Lang Feng, Zhenghai Xue, Tingcong Liu, and Bo An. Group-in-group policy optimization for llm agent training.arXiv preprint arXiv:2505.10978, 2025

  7. [7]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

  8. [8]

    Videocurl: Video curriculum reinforcement learning with orthogonal difficulty decomposition, 2025

    Hongbo Jin, Kuanwei Lin, Wenhao Zhang, Yichen Jin, and Ge Li. Videocurl: Video curriculum reinforcement learning with orthogonal difficulty decomposition, 2025

  9. [9]

    Himac: Hierarchical macro-micro learning for long-horizon llm agents, 2026

    Hongbo Jin, Rongpeng Zhu, Jiayu Ding, Wenhao Zhang, and Ge Li. Himac: Hierarchical macro-micro learning for long-horizon llm agents, 2026

  10. [10]

    On reinforcement learning and distribution matching for fine-tuning language models with no catastrophic forgetting

    Tomasz Korbak, Hady Elsahar, Germán Kruszewski, and Marc Dymetman. On reinforcement learning and distribution matching for fine-tuning language models with no catastrophic forgetting. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - ...

  11. [11]

    On information and sufficiency.The annals of mathematical statistics, 22(1):79–86, 1951

    Solomon Kullback and Richard A Leibler. On information and sufficiency.The annals of mathematical statistics, 22(1):79–86, 1951

  12. [12]

    Distribution-centric policy optimization dominates exploration-exploitation trade-off.arXiv preprint arXiv:2601.12730, 2026

    Zhaochun Li, Chen Wang, Jionghao Bai, Shisheng Cui, Ge Lan, Zhou Zhao, and Yue Wang. Distribution-centric policy optimization dominates exploration-exploitation trade-off.arXiv preprint arXiv:2601.12730, 2026

  13. [13]

    Remax: A simple, effective, and efficient reinforcement learning method for aligning large language models

    Ziniu Li, Tian Xu, Yushun Zhang, Zhihang Lin, Yang Yu, Ruoyu Sun, and Zhi-Quan Luo. Remax: A simple, effective, and efficient reinforcement learning method for aligning large language models. InForty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024, pages 29128–29163

  14. [14]

    Let’s verify step by step

    Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024

  15. [15]

    Adaptivestep: Automatically dividing reasoning step through model confidence. arXiv preprint arXiv:2502.13943, 2025

    Yuliang Liu, Junjie Lu, Zhaoling Chen, Chaofeng Qu, Jason Klein Liu, Chonghan Liu, Zefan Cai, Yunhui Xia, Li Zhao, Jiang Bian, et al. Adaptivestep: Automatically dividing reasoning step through model confidence. arXiv preprint arXiv:2502.13943, 2025

  16. [16]

    Fipo: Eliciting deep reasoning with future-kl influenced policy optimization, 2026

    Chiyu Ma, Shuo Yang, Kexin Huang, Jinda Lu, Haoming Meng, Shangshang Wang, Bolin Ding, Soroush Vosoughi, Guoyin Wang, and Jingren Zhou. Fipo: Eliciting deep reasoning with future-kl influenced policy optimization. arXiv preprint arXiv:2603.19835, 2026

  17. [17]

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human fee...

  18. [18]

    Qwen2.5 technical report, 2025

    Qwen, :, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li,...

  19. [19]

    Direct preference optimization: Your language model is secretly a reward model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D. Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023

  20. [20]

    Is reinforcement learning (not) for natural language processing: Benchmarks, baselines, and building blocks for natural language policy optimization

    Rajkumar Ramamurthy, Prithviraj Ammanabrolu, Kianté Brantley, Jack Hessel, Rafet Sifa, Christian Bauckhage, Hannaneh Hajishirzi, and Yejin Choi. Is reinforcement learning (not) for natural language processing: Benchmarks, baselines, and building blocks for natural language policy optimization. InThe Eleventh International Conference on Learning Representa...

  21. [21]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017

  22. [22]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

  23. [23]

    R-prm: Reasoning-driven process reward modeling

    Shuaijie She, Junxiao Liu, Yifeng Liu, Jiajun Chen, Xin Huang, and Shujian Huang. R-prm: Reasoning-driven process reward modeling. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 13449–13462, 2025

  24. [24]

    Hybridflow: A flexible and efficient RLHF framework

    Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient RLHF framework. InProceedings of the Twentieth European Conference on Computer Systems, EuroSys 2025, Rotterdam, The Netherlands, 30 March 2025 - 3 April 2025, pages 1279–1297

  25. [25]

    Espo: Entropy importance sampling policy optimization.arXiv preprint arXiv:2512.00499, 2025

    Yuepeng Sheng, Yuwei Huang, Shuman Liu, Anxiang Zeng, and Haibo Zhang. Espo: Entropy importance sampling policy optimization.arXiv preprint arXiv:2512.00499, 2025

  26. [26]

    Reinforcement learning: An introduction

    Richard S Sutton, Andrew G Barto, et al. Reinforcement learning: An introduction, volume 1. MIT Press, Cambridge, 1998

  27. [27]

    Solving math word problems with process- and outcome-based feedback

    Jonathan Uesato, Nate Kushman, Ramana Kumar, Francis Song, Noah Siegel, Lisa Wang, Antonia Creswell, Geoffrey Irving, and Irina Higgins. Solving math word problems with process- and outcome-based feedback. arXiv preprint arXiv:2211.14275, 2022

  28. [28]

    Beyond reverse KL: generalizing direct preference optimization with diverse divergence constraints

    Chaoqi Wang, Yibo Jiang, Chenghao Yang, Han Liu, and Yuxin Chen. Beyond reverse KL: generalizing direct preference optimization with diverse divergence constraints. InThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024

  29. [29]

    Arbitrary entropy policy optimization: Entropy is controllable in reinforcement finetuning.arXiv e-prints, pages arXiv–2510, 2025

    Chen Wang, Zhaochun Li, Jionghao Bai, Yuzhi Zhang, Shisheng Cui, Zhou Zhao, and Yue Wang. Arbitrary entropy policy optimization: Entropy is controllable in reinforcement finetuning.arXiv e-prints, pages arXiv–2510, 2025

  30. [30]

    An Yang, Beichen Zhang, Binyuan Hui, Bofei Gao, Bowen Yu, Chengpeng Li, Dayiheng Liu, Jianhong Tu, Jingren Zhou, Junyang Lin, et al. Qwen2.5-Math technical report: Toward mathematical expert model via self-improvement. arXiv preprint arXiv:2409.12122, 2024

  31. [31]

    Beyond correctness: Harmonizing process and outcome rewards through rl training.arXiv preprint arXiv:2509.03403, 2025

    Chenlu Ye, Zhou Yu, Ziji Zhang, Hao Chen, Narayanan Sadagopan, Jing Huang, Tong Zhang, and Anurag Beniwal. Beyond correctness: Harmonizing process and outcome rewards through rl training.arXiv preprint arXiv:2509.03403, 2025

  32. [32]

    Dynamic and generalizable process reward modeling

    Zhangyue Yin, Qiushi Sun, Zhiyuan Zeng, Qinyuan Cheng, Xipeng Qiu, and Xuan-Jing Huang. Dynamic and generalizable process reward modeling. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4203–4233, 2025

  33. [33]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale.arXiv preprint arXiv:2503.14476, 2025

  34. [34]

    Token-level direct preference optimization

    Yongcheng Zeng, Guoqing Liu, Weiyu Ma, Ning Yang, Haifeng Zhang, and Jun Wang. Token-level direct preference optimization. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024, pages 58348–58365

  35. [35]

    Groundedprm: Tree-guided and fidelity-aware process reward modeling for step-level reasoning.arXiv preprint arXiv:2510.14942, 2025

    Yao Zhang, Yu Wu, Haowei Zhang, Weiguo Li, Haokun Chen, Jingpei Wu, Guohao Li, Zhen Han, and Volker Tresp. Groundedprm: Tree-guided and fidelity-aware process reward modeling for step-level reasoning. arXiv preprint arXiv:2510.14942, 2025

  36. [36]

    Linking process to outcome: Conditional reward modeling for llm reasoning.arXiv preprint arXiv:2509.26578, 2025

    Zheng Zhang, Ziwei Shan, Kaitao Song, Yexin Li, and Kan Ren. Linking process to outcome: Conditional reward modeling for llm reasoning.arXiv preprint arXiv:2509.26578, 2025

  37. [37]

    Group Sequence Policy Optimization

    Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, et al. Group sequence policy optimization.arXiv preprint arXiv:2507.18071, 2025

  38. [38]

    Demystifying group relative policy optimization: Its policy gradient is a U-statistic

    Hongyi Zhou, Kai Ye, Erhan Xu, Jin Zhu, Ying Yang, Shijin Gong, and Chengchun Shi. Demystifying group relative policy optimization: Its policy gradient is a U-statistic. arXiv preprint arXiv:2603.01162, 2026

  39. [39]

    Fine-Tuning Language Models from Human Preferences

    Daniel M Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593, 2019