pith. machine review for the scientific record.

arxiv: 2604.00860 · v2 · submitted 2026-04-01 · 💻 cs.LG

Recognition: no theorem link

Policy Improvement Reinforcement Learning

Deqing Wang, Haoyi Zhou, Huaiyang Wang, Jianxin Li, Xiaojie Li, Yaodong Yang, Yikun Ban, Zixuan Huang

Pith reviewed 2026-05-13 22:44 UTC · model grok-4.3

classification 💻 cs.LG
keywords reinforcement learning · policy optimization · large language models · verifiable rewards · policy improvement · closed-loop optimization · mathematical reasoning

The pith

Reinforcement learning for language models can be made self-correcting by directly maximizing cumulative verified policy improvement across iterations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current approaches to post-training language models with reinforcement learning update policies using only instantaneous batch reward signals and never check whether the update actually improved the model. This open-loop process risks drift or collapse because there is no feedback on inter-iteration progress. The paper introduces Policy Improvement Reinforcement Learning, whose explicit goal is to maximize the total policy improvement accumulated over successive iterations, and proves this temporal objective aligns exactly with reaching the highest final task performance. It then presents Policy Improvement Policy Optimization, which closes the loop by retrospectively comparing each update against a sliding-window historical baseline and reinforcing only those that produce genuine gains while suppressing the rest.
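A minimal sketch of that closed-loop step, assuming a simple mean baseline and a multiplicative weighting rule; every name and constant here is an illustrative assumption, not the paper's actual algorithm or hyperparameters:

```python
from collections import deque

# Hypothetical sketch of the closed-loop verification step described above.
# WINDOW_SIZE, the weighting rule, and the evaluate/rl_update callables are
# all assumptions for illustration, not the paper's PIPO specification.

WINDOW_SIZE = 5
history = deque(maxlen=WINDOW_SIZE)  # sliding window of past verified scores

def pipo_style_step(policy, batch, evaluate, rl_update):
    """One iteration: verify the previous update, then regulate the next one."""
    score = evaluate(policy, batch)  # e.g., verified Pass@1 on the current batch
    baseline = sum(history) / len(history) if history else score

    # Retrospective verification: did the last update beat the
    # sliding-window historical baseline?
    improvement = score - baseline

    # Closed-loop regulation: reinforce genuine gains, suppress regressions.
    weight = max(0.0, 1.0 + improvement)

    history.append(score)
    rl_update(policy, batch, weight)  # weighted policy-gradient step
    return improvement
```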

Core claim

The paper establishes that the objective of maximizing cumulative policy improvement over iterations is perfectly aligned with maximizing final task performance. It shows that Policy Improvement Policy Optimization implements this objective through closed-loop retrospective verification: at each step the method evaluates the preceding update against a sliding-window baseline, then ascends the objective in expectation by reinforcing beneficial changes and suppressing harmful ones.
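The abstract does not reproduce the proof, but one plausible reading of the alignment claim is a telescoping identity over per-iteration task performance J(πt):

```latex
% Telescoping identity (an assumed reading; the abstract does not show the derivation):
\sum_{t=0}^{T-1} \bigl[ J(\pi_{t+1}) - J(\pi_t) \bigr] \;=\; J(\pi_T) - J(\pi_0)
```

With J(π0) fixed by initialization, maximizing the cumulative improvement on the left is equivalent to maximizing the final performance J(πT). The substantive question, taken up below, is whether PIPO's baseline-relative estimates of each increment actually track the true increments.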

What carries the argument

The PIRL objective of maximizing cumulative policy improvement, realized through retrospective verification of each update against a sliding-window historical baseline.

If this is right

  • Optimization becomes self-correcting and less prone to drift or collapse.
  • Training stability increases on mathematical reasoning benchmarks.
  • Final task performance rises compared with prior open-loop methods.
  • The process converts isolated batch updates into a sequence of verified progressive gains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The verification step could be adapted to other reinforcement-learning domains to improve reliability over long training runs.
  • Historical baselines may help surface gradual policy degradation that single-batch statistics miss.
  • Adjusting window length offers a tunable trade-off between responsiveness and robustness to noise.
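As an illustration of that last trade-off, a toy rolling-baseline comparison on synthetic scores (assumed numbers, no connection to the paper's experiments):

```python
import random

random.seed(0)

# Synthetic per-iteration scores: a slow upward trend plus reward noise.
scores = [0.4 + 0.002 * t + random.gauss(0, 0.05) for t in range(200)]

def window_baseline(history, t, w):
    """Mean of the (up to) w scores preceding iteration t."""
    lo = max(0, t - w)
    return sum(history[lo:t]) / (t - lo)

for w in (3, 20):
    flags = [scores[t] > window_baseline(scores, t, w) for t in range(1, 200)]
    print(f"w={w:2d}: improvement flagged on {sum(flags) / len(flags):.0%} of steps")
```

With the short window the check is close to a coin flip (noise dominates); with the long window the rising trend is flagged more consistently, at the cost of responding slowly to genuine regressions.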

Load-bearing premise

Retrospective verification against a sliding-window historical baseline can reliably detect genuine policy improvement without bias from window size, data distribution, or baseline statistics.
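A toy illustration of why this premise is load-bearing (assumed numbers, not the paper's): under monotone improvement any trailing-window baseline sits below the current score, so verification passes at every step by construction, and the check only becomes informative once noise and regressions enter.

```python
# Trailing-window lag bias: with monotonically rising scores, every
# trailing-window mean is below the current score, so retrospective
# verification is vacuously satisfied regardless of window size.
scores = [0.50 + 0.01 * t for t in range(20)]

for w in (3, 5, 10):
    always_passes = all(
        scores[t] > sum(scores[max(0, t - w):t]) / min(t, w)
        for t in range(1, len(scores))
    )
    print(f"w={w:2d}: verification passes at every step -> {always_passes}")
```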

What would settle it

An experiment in which varying the sliding-window size produces inconsistent detection of improvements or allows performance collapse despite the use of the proposed method.

Figures

Figures reproduced from arXiv: 2604.00860 by Deqing Wang, Haoyi Zhou, Huaiyang Wang, Jianxin Li, Xiaojie Li, Yaodong Yang, Yikun Ban, Zixuan Huang.

Figure 1: Overview of the Policy Improvement Reinforcement Learning (PIRL) framework. Left: traditional RLVR methods follow an open-loop paradigm, updating policies from instantaneous rewards without verifying actual improvement. Middle: PIRL introduces a verification stage, forming a closed-loop optimization driven by policy improvement signals. Right: during verification, updates are adaptively regulated: positive sig… view at source ↗

Figure 2: Theoretical distortion and empirical instability of GRPO. (a) Gradient distortion: the gradient scaling factor η(pt) evaluated across success rates pt. As established in Corollary 3.2, GRPO (G = 8, 128) exhibits severe sensitivity explosion at the boundaries (pt → 0, 1). (b) Empirical stability: standard GRPO suffers from drastic gradient-norm spikes (left) and severe Pass@1 collapse (right). Incorporating … view at source ↗

Figure 3: Comparison of training dynamics on Qwen3-4B-Base. (a) Average Pass@1 accuracy evolution across the five … view at source ↗

Figure 4: Training dynamics across multiple random seeds (6, 21, and 42) on Qwen3-4B-Base. view at source ↗

Figure 5: Computational efficiency analysis on Qwen3-4B-Base. (a) Total wall-clock time comparison. (b) Evolution … view at source ↗
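Figure 2's scaling factor η(pt) rests on Corollary 3.2, which is not reproduced here. One assumed reading, based on the standard GRPO advantage (r − mean)/std with binary verifiable rewards at per-group success rate pt, is that group normalization divides by the reward standard deviation, giving

```latex
% Assumed reading of the boundary behavior, not the paper's stated corollary:
\eta(p_t) \;\propto\; \frac{1}{\sqrt{p_t\,(1 - p_t)}},
\qquad \eta(p_t) \to \infty \ \text{as } p_t \to 0 \text{ or } p_t \to 1,
```

which diverges exactly at the boundaries, consistent with the gradient-norm spikes and Pass@1 collapse shown in panel (b).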
Original abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has become a central post-training paradigm for improving the reasoning capabilities of large language models. Yet existing methods share a common blind spot: they optimize policies based on instantaneous group-level or batch-level statistics without ever verifying whether the resulting update actually improved the model. This open-loop design -- updating in isolation at each step, guided only by within-group (batch) reward signals -- means optimization can drift or collapse with no mechanism to detect and correct these failures. We argue that the missing ingredient is policy improvement feedback: the ability to measure and optimize inter-iteration progress directly. To this end, we introduce Policy Improvement Reinforcement Learning (PIRL), a framework that replaces surrogate reward maximization with the explicit objective of maximizing cumulative policy improvement across iterations, and prove this temporal objective is perfectly aligned with maximizing final task performance. Building on PIRL, we propose Policy Improvement Policy Optimization (PIPO), which implements closed-loop optimization through retrospective verification. At each iteration, PIPO evaluates whether the previous update yielded genuine improvement against a sliding-window historical baseline, then actively reinforces beneficial updates and suppresses the harmful ones -- transforming an open-loop process into a self-correcting one. We provide theoretical analysis showing that PIPO performs ascent on the PIRL objective in expectation, and experiments on mathematical reasoning benchmarks demonstrate improved stability and performance over GRPO and its variants.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces Policy Improvement Reinforcement Learning (PIRL), a framework that replaces surrogate reward maximization in RLVR with the explicit objective of maximizing cumulative policy improvement across iterations and claims this temporal objective is perfectly aligned with final task performance. It proposes Policy Improvement Policy Optimization (PIPO), which implements closed-loop optimization via retrospective verification of each update against a sliding-window historical baseline, reinforcing beneficial updates and suppressing harmful ones; theoretical analysis asserts that PIPO performs ascent on the PIRL objective in expectation, with experiments on mathematical reasoning benchmarks showing improved stability and performance over GRPO variants.

Significance. If the alignment proof and expected-ascent guarantee hold without bias in the baseline, the work would address a genuine gap in open-loop RLVR methods by adding verifiable inter-iteration feedback, potentially improving stability for LLM reasoning post-training.

major comments (2)
  1. [Theoretical Analysis] Abstract and Theoretical Analysis section: the claim that the PIRL objective is 'perfectly aligned' with maximizing final task performance is presented without derivation steps or error analysis; the retrospective verification depends on a sliding-window historical baseline whose independence from the optimization trajectory is not shown, creating the circularity risk that baseline statistics mix improving and prior policies under non-stationary or noisy updates.
  2. [PIPO Algorithm and Experiments] PIPO description and experiments: no sensitivity analysis is provided for window size, baseline statistic choice (mean/median), or data distribution; in regimes with incremental policy updates and noisy rewards (e.g., LLM math reasoning), this can bias the comparator and violate the asserted expected ascent, undermining the self-correcting loop claim.
minor comments (1)
  1. The abstract states 'we provide theoretical analysis' and 'experiments demonstrate improved stability' but supplies no implementation details, hyperparameter values for the sliding window, or specific benchmark numbers; these should be added for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and detailed comments, which help clarify key aspects of our theoretical claims and experimental validation. We address each major comment below and outline the revisions we will make to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Theoretical Analysis] Abstract and Theoretical Analysis section: the claim that the PIRL objective is 'perfectly aligned' with maximizing final task performance is presented without derivation steps or error analysis; the retrospective verification depends on a sliding-window historical baseline whose independence from the optimization trajectory is not shown, creating the circularity risk that baseline statistics mix improving and prior policies under non-stationary or noisy updates.

    Authors: We appreciate the referee's call for greater rigor here. The alignment between the PIRL objective (cumulative policy improvement) and final task performance follows from the fact that any policy achieving higher cumulative improvement must eventually reach a higher-performing fixed point; we will expand the Theoretical Analysis section with explicit derivation steps, including a short proof by induction on the improvement sequence and a bound on the approximation error introduced by finite-window estimation. Regarding baseline independence, the sliding window is populated exclusively from policies that have already passed retrospective verification, so its statistics are conditioned only on prior accepted updates. We will add a supporting lemma showing that, under the expected-ascent property, the baseline expectation remains unbiased with respect to the current candidate update, thereby removing the circularity concern even in non-stationary regimes. revision: yes

  2. Referee: [PIPO Algorithm and Experiments] PIPO description and experiments: no sensitivity analysis is provided for window size, baseline statistic choice (mean/median), or data distribution; in regimes with incremental policy updates and noisy rewards (e.g., LLM math reasoning), this can bias the comparator and violate the asserted expected ascent, undermining the self-correcting loop claim.

    Authors: We agree that an explicit sensitivity study would strengthen the empirical support. In the revised manuscript we will add an ablation subsection reporting performance for window sizes {3,5,10,20} and for both mean and median baselines on the same mathematical-reasoning benchmarks. These results will demonstrate that the self-correcting behavior and expected-ascent guarantee remain intact across the tested range; we will also include a brief analysis of how the verification threshold interacts with reward noise to keep the comparator unbiased. The theoretical expected-ascent result itself does not depend on a particular window size or statistic, but the new experiments will confirm practical robustness. revision: yes
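A minimal harness for the promised ablation; run_training, eval_pass1, and the grids are assumed stand-ins for illustration, not the authors' code or interfaces:

```python
from statistics import mean, median

# Illustrative sensitivity-study harness over window sizes and baseline
# statistics. All names and values are assumptions for illustration.
WINDOW_SIZES = (3, 5, 10, 20)
BASELINE_STATS = {"mean": mean, "median": median}

def run_ablation(run_training, eval_pass1, benchmark):
    """Grid over window sizes and baseline statistics; Pass@1 per cell."""
    results = {}
    for w in WINDOW_SIZES:
        for stat_name, stat in BASELINE_STATS.items():
            policy = run_training(window_size=w, baseline_stat=stat)
            results[(w, stat_name)] = eval_pass1(policy, benchmark)
    return results  # e.g., {(5, "median"): 0.62, ...} (values illustrative)
```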

Circularity Check

1 step flagged

PIRL objective alignment with final performance reduces to definitional equivalence via retrospective baseline

specific steps
  1. self-definitional [Abstract]
    "we introduce Policy Improvement Reinforcement Learning (PIRL), a framework that replaces surrogate reward maximization with the explicit objective of maximizing cumulative policy improvement across iterations, and prove this temporal objective is perfectly aligned with maximizing final task performance. ... We provide theoretical analysis showing that PIPO performs ascent on the PIRL objective in expectation"

    The objective is defined as maximizing cumulative policy improvement; the claimed proof of alignment with final performance and the expected ascent under PIPO both rely on retrospective verification against a sliding-window historical baseline whose statistics are drawn from the same optimization trajectory. The alignment therefore reduces to the definition of the objective rather than an independent derivation.

full rationale

The paper defines the PIRL objective explicitly as maximizing cumulative policy improvement across iterations and claims a proof of perfect alignment with final task performance, while PIPO's ascent is shown via retrospective verification against a sliding-window historical baseline. This creates a self-referential structure: improvement is both the quantity being maximized and the quantity verified by the same trajectory-dependent baseline, so the alignment and expected ascent hold by construction of the objective rather than independent derivation. The central theoretical result therefore reduces to a restatement of its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claims rest on the newly introduced PIRL objective and the assumption that retrospective verification against a historical baseline can detect true improvement. No explicit free parameters are named in the abstract. The proof of alignment is treated as a domain assumption whose details are unavailable without the full text.

axioms (1)
  • domain assumption The temporal objective of maximizing cumulative policy improvement is perfectly aligned with maximizing final task performance
    Stated as proven in the abstract but the actual derivation is not provided.
invented entities (2)
  • PIRL objective no independent evidence
    purpose: Replace surrogate reward maximization with explicit maximization of cumulative policy improvement across iterations
    Newly defined temporal objective introduced in the paper.
  • PIPO algorithm no independent evidence
    purpose: Implement closed-loop optimization through retrospective verification against a sliding-window baseline
    New algorithm that evaluates and selectively reinforces prior updates.

pith-pipeline@v0.9.0 · 5564 in / 1502 out tokens · 41569 ms · 2026-05-13T22:44:54.577345+00:00 · methodology

