pith. machine review for the scientific record.

arxiv: 2604.00860 · v2 · submitted 2026-04-01 · 💻 cs.LG

Recognition: no theorem link

Policy Improvement Reinforcement Learning

Deqing Wang, Haoyi Zhou, Huaiyang Wang, Jianxin Li, Xiaojie Li, Yaodong Yang, Yikun Ban, Zixuan Huang

Pith reviewed 2026-05-13 22:44 UTC · model grok-4.3

classification 💻 cs.LG
keywords reinforcement learning · policy optimization · large language models · verifiable rewards · policy improvement · closed-loop optimization · mathematical reasoning

The pith

Reinforcement learning for language models can be made self-correcting by directly maximizing cumulative verified policy improvement across iterations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current approaches to post-training language models with reinforcement learning update policies using only instantaneous batch reward signals and never check whether the update actually improved the model. This open-loop process risks drift or collapse because there is no feedback on inter-iteration progress. The paper introduces Policy Improvement Reinforcement Learning, whose explicit goal is to maximize the total policy improvement accumulated over successive iterations, and proves this temporal objective aligns exactly with reaching the highest final task performance. It then presents Policy Improvement Policy Optimization, which closes the loop by retrospectively comparing each update against a sliding-window historical baseline and reinforcing only those that produce genuine gains while suppressing the rest.
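A minimal sketch of that closed-loop step, assuming a simple mean baseline and a multiplicative weighting rule; every name and constant here is an illustrative assumption, not the paper's actual algorithm or hyperparameters:

```python
from collections import deque

# Hypothetical sketch of the closed-loop verification step described above.
# WINDOW_SIZE, the weighting rule, and the evaluate/rl_update callables are
# all assumptions for illustration, not the paper's PIPO specification.

WINDOW_SIZE = 5
history = deque(maxlen=WINDOW_SIZE)  # sliding window of past verified scores

def pipo_style_step(policy, batch, evaluate, rl_update):
    """One iteration: verify the previous update, then regulate the next one."""
    score = evaluate(policy, batch)  # e.g., verified Pass@1 on the current batch
    baseline = sum(history) / len(history) if history else score

    # Retrospective verification: did the last update beat the
    # sliding-window historical baseline?
    improvement = score - baseline

    # Closed-loop regulation: reinforce genuine gains, suppress regressions.
    weight = max(0.0, 1.0 + improvement)

    history.append(score)
    rl_update(policy, batch, weight)  # weighted policy-gradient step
    return improvement
```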

Core claim

The paper establishes that the objective of maximizing cumulative policy improvement over iterations is perfectly aligned with maximizing final task performance. It shows that Policy Improvement Policy Optimization implements this objective through closed-loop retrospective verification: at each step the method evaluates the preceding update against a sliding-window baseline, then ascends the objective in expectation by reinforcing beneficial changes and suppressing harmful ones.
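The abstract does not reproduce the proof, but one plausible reading of the alignment claim is a telescoping identity over per-iteration task performance J(πt):

```latex
% Telescoping identity (an assumed reading; the abstract does not show the derivation):
\sum_{t=0}^{T-1} \bigl[ J(\pi_{t+1}) - J(\pi_t) \bigr] \;=\; J(\pi_T) - J(\pi_0)
```

With J(π0) fixed by initialization, maximizing the cumulative improvement on the left is equivalent to maximizing the final performance J(πT). The substantive question, taken up below, is whether PIPO's baseline-relative estimates of each increment actually track the true increments.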

What carries the argument

The PIRL objective of maximizing cumulative policy improvement, realized through retrospective verification of each update against a sliding-window historical baseline.

If this is right

  • Optimization becomes self-correcting and less prone to drift or collapse.
  • Training stability increases on mathematical reasoning benchmarks.
  • Final task performance rises compared with prior open-loop methods.
  • The process converts isolated batch updates into a sequence of verified progressive gains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The verification step could be adapted to other reinforcement-learning domains to improve reliability over long training runs.
  • Historical baselines may help surface gradual policy degradation that single-batch statistics miss.
  • Adjusting window length offers a tunable trade-off between responsiveness and robustness to noise.
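As an illustration of that last trade-off, a toy rolling-baseline comparison on synthetic scores (assumed numbers, no connection to the paper's experiments):

```python
import random

random.seed(0)

# Synthetic per-iteration scores: a slow upward trend plus reward noise.
scores = [0.4 + 0.002 * t + random.gauss(0, 0.05) for t in range(200)]

def window_baseline(history, t, w):
    """Mean of the (up to) w scores preceding iteration t."""
    lo = max(0, t - w)
    return sum(history[lo:t]) / (t - lo)

for w in (3, 20):
    flags = [scores[t] > window_baseline(scores, t, w) for t in range(1, 200)]
    print(f"w={w:2d}: improvement flagged on {sum(flags) / len(flags):.0%} of steps")
```

With the short window the check is close to a coin flip (noise dominates); with the long window the rising trend is flagged more consistently, at the cost of responding slowly to genuine regressions.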

Load-bearing premise

Retrospective verification against a sliding-window historical baseline can reliably detect genuine policy improvement without bias from window size, data distribution, or baseline statistics.
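A toy illustration of why this premise is load-bearing (assumed numbers, not the paper's): under monotone improvement any trailing-window baseline sits below the current score, so verification passes at every step by construction, and the check only becomes informative once noise and regressions enter.

```python
# Trailing-window lag bias: with monotonically rising scores, every
# trailing-window mean is below the current score, so retrospective
# verification is vacuously satisfied regardless of window size.
scores = [0.50 + 0.01 * t for t in range(20)]

for w in (3, 5, 10):
    always_passes = all(
        scores[t] > sum(scores[max(0, t - w):t]) / min(t, w)
        for t in range(1, len(scores))
    )
    print(f"w={w:2d}: verification passes at every step -> {always_passes}")
```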

What would settle it

An experiment in which varying the sliding-window size produces inconsistent detection of improvements or allows performance collapse despite the use of the proposed method.

Figures

Figures reproduced from arXiv: 2604.00860 by Deqing Wang, Haoyi Zhou, Huaiyang Wang, Jianxin Li, Xiaojie Li, Yaodong Yang, Yikun Ban, Zixuan Huang.

Figure 1: Overview of the Policy Improvement Reinforcement Learning (PIRL) framework. Left: traditional RLVR methods follow an open-loop paradigm, updating policies from instantaneous rewards without verifying actual improvement. Middle: PIRL introduces a verification stage, forming a closed-loop optimization driven by policy improvement signals. Right: during verification, updates are adaptively regulated: positive sig… view at source ↗

Figure 2: Theoretical distortion and empirical instability of GRPO. (a) Gradient distortion: the gradient scaling factor η(pt) evaluated across success rates pt. As established in Corollary 3.2, GRPO (G = 8, 128) exhibits severe sensitivity explosion at the boundaries (pt → 0, 1). (b) Empirical stability: standard GRPO suffers from drastic gradient-norm spikes (left) and severe Pass@1 collapse (right). Incorporating … view at source ↗

Figure 3: Comparison of training dynamics on Qwen3-4B-Base. (a) Average Pass@1 accuracy evolution across the five … view at source ↗

Figure 4: Training dynamics across multiple random seeds (6, 21, and 42) on Qwen3-4B-Base. view at source ↗

Figure 5: Computational efficiency analysis on Qwen3-4B-Base. (a) Total wall-clock time comparison. (b) Evolution … view at source ↗
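Figure 2's scaling factor η(pt) rests on Corollary 3.2, which is not reproduced here. One assumed reading, based on the standard GRPO advantage (r − mean)/std with binary verifiable rewards at per-group success rate pt, is that group normalization divides by the reward standard deviation, giving

```latex
% Assumed reading of the boundary behavior, not the paper's stated corollary:
\eta(p_t) \;\propto\; \frac{1}{\sqrt{p_t\,(1 - p_t)}},
\qquad \eta(p_t) \to \infty \ \text{as } p_t \to 0 \text{ or } p_t \to 1,
```

which diverges exactly at the boundaries, consistent with the gradient-norm spikes and Pass@1 collapse shown in panel (b).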
Original abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has become a central post-training paradigm for improving the reasoning capabilities of large language models. Yet existing methods share a common blind spot: they optimize policies based on instantaneous group-level or batch-level statistics without ever verifying whether the resulting update actually improved the model. This open-loop design -- updating in isolation at each step, guided only by within-group (batch) reward signals -- means optimization can drift or collapse with no mechanism to detect and correct these failures. We argue that the missing ingredient is policy improvement feedback: the ability to measure and optimize inter-iteration progress directly. To this end, we introduce Policy Improvement Reinforcement Learning (PIRL), a framework that replaces surrogate reward maximization with the explicit objective of maximizing cumulative policy improvement across iterations, and prove this temporal objective is perfectly aligned with maximizing final task performance. Building on PIRL, we propose Policy Improvement Policy Optimization (PIPO), which implements closed-loop optimization through retrospective verification. At each iteration, PIPO evaluates whether the previous update yielded genuine improvement against a sliding-window historical baseline, then actively reinforces beneficial updates and suppresses the harmful ones -- transforming an open-loop process into a self-correcting one. We provide theoretical analysis showing that PIPO performs ascent on the PIRL objective in expectation, and experiments on mathematical reasoning benchmarks demonstrate improved stability and performance over GRPO and its variants.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces Policy Improvement Reinforcement Learning (PIRL), a framework that replaces surrogate reward maximization in RLVR with the explicit objective of maximizing cumulative policy improvement across iterations and claims this temporal objective is perfectly aligned with final task performance. It proposes Policy Improvement Policy Optimization (PIPO), which implements closed-loop optimization via retrospective verification of each update against a sliding-window historical baseline, reinforcing beneficial updates and suppressing harmful ones; theoretical analysis asserts that PIPO performs ascent on the PIRL objective in expectation, with experiments on mathematical reasoning benchmarks showing improved stability and performance over GRPO variants.

Significance. If the alignment proof and expected-ascent guarantee hold without bias in the baseline, the work would address a genuine gap in open-loop RLVR methods by adding verifiable inter-iteration feedback, potentially improving stability for LLM reasoning post-training.

major comments (2)
  1. [Theoretical Analysis] Abstract and Theoretical Analysis section: the claim that the PIRL objective is 'perfectly aligned' with maximizing final task performance is presented without derivation steps or error analysis; the retrospective verification depends on a sliding-window historical baseline whose independence from the optimization trajectory is not shown, creating the circularity risk that baseline statistics mix improving and prior policies under non-stationary or noisy updates.
  2. [PIPO Algorithm and Experiments] PIPO description and experiments: no sensitivity analysis is provided for window size, baseline statistic choice (mean/median), or data distribution; in regimes with incremental policy updates and noisy rewards (e.g., LLM math reasoning), this can bias the comparator and violate the asserted expected ascent, undermining the self-correcting loop claim.
minor comments (1)
  1. The abstract states 'we provide theoretical analysis' and 'experiments demonstrate improved stability' but supplies no implementation details, hyperparameter values for the sliding window, or specific benchmark numbers; these should be added for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and detailed comments, which help clarify key aspects of our theoretical claims and experimental validation. We address each major comment below and outline the revisions we will make to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Theoretical Analysis] Abstract and Theoretical Analysis section: the claim that the PIRL objective is 'perfectly aligned' with maximizing final task performance is presented without derivation steps or error analysis; the retrospective verification depends on a sliding-window historical baseline whose independence from the optimization trajectory is not shown, creating the circularity risk that baseline statistics mix improving and prior policies under non-stationary or noisy updates.

    Authors: We appreciate the referee's call for greater rigor here. The alignment between the PIRL objective (cumulative policy improvement) and final task performance follows from the fact that any policy achieving higher cumulative improvement must eventually reach a higher-performing fixed point; we will expand the Theoretical Analysis section with explicit derivation steps, including a short proof by induction on the improvement sequence and a bound on the approximation error introduced by finite-window estimation. Regarding baseline independence, the sliding window is populated exclusively from policies that have already passed retrospective verification, so its statistics are conditioned only on prior accepted updates. We will add a supporting lemma showing that, under the expected-ascent property, the baseline expectation remains unbiased with respect to the current candidate update, thereby removing the circularity concern even in non-stationary regimes. revision: yes

  2. Referee: [PIPO Algorithm and Experiments] PIPO description and experiments: no sensitivity analysis is provided for window size, baseline statistic choice (mean/median), or data distribution; in regimes with incremental policy updates and noisy rewards (e.g., LLM math reasoning), this can bias the comparator and violate the asserted expected ascent, undermining the self-correcting loop claim.

    Authors: We agree that an explicit sensitivity study would strengthen the empirical support. In the revised manuscript we will add an ablation subsection reporting performance for window sizes {3,5,10,20} and for both mean and median baselines on the same mathematical-reasoning benchmarks. These results will demonstrate that the self-correcting behavior and expected-ascent guarantee remain intact across the tested range; we will also include a brief analysis of how the verification threshold interacts with reward noise to keep the comparator unbiased. The theoretical expected-ascent result itself does not depend on a particular window size or statistic, but the new experiments will confirm practical robustness. revision: yes
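A minimal harness for the promised ablation; run_training, eval_pass1, and the grids are assumed stand-ins for illustration, not the authors' code or interfaces:

```python
from statistics import mean, median

# Illustrative sensitivity-study harness over window sizes and baseline
# statistics. All names and values are assumptions for illustration.
WINDOW_SIZES = (3, 5, 10, 20)
BASELINE_STATS = {"mean": mean, "median": median}

def run_ablation(run_training, eval_pass1, benchmark):
    """Grid over window sizes and baseline statistics; Pass@1 per cell."""
    results = {}
    for w in WINDOW_SIZES:
        for stat_name, stat in BASELINE_STATS.items():
            policy = run_training(window_size=w, baseline_stat=stat)
            results[(w, stat_name)] = eval_pass1(policy, benchmark)
    return results  # e.g., {(5, "median"): 0.62, ...} (values illustrative)
```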

Circularity Check

1 step flagged

PIRL objective alignment with final performance reduces to definitional equivalence via retrospective baseline

specific steps
  1. self-definitional [Abstract]
    "we introduce Policy Improvement Reinforcement Learning (PIRL), a framework that replaces surrogate reward maximization with the explicit objective of maximizing cumulative policy improvement across iterations, and prove this temporal objective is perfectly aligned with maximizing final task performance. ... We provide theoretical analysis showing that PIPO performs ascent on the PIRL objective in expectation"

    The objective is defined as maximizing cumulative policy improvement; the claimed proof of alignment with final performance and the expected ascent under PIPO both rely on retrospective verification against a sliding-window historical baseline whose statistics are drawn from the same optimization trajectory. The alignment therefore reduces to the definition of the objective rather than an independent derivation.

full rationale

The paper defines the PIRL objective explicitly as maximizing cumulative policy improvement across iterations and claims a proof of perfect alignment with final task performance, while PIPO's ascent is shown via retrospective verification against a sliding-window historical baseline. This creates a self-referential structure: improvement is both the quantity being maximized and the quantity verified by the same trajectory-dependent baseline, so the alignment and expected ascent hold by construction of the objective rather than independent derivation. The central theoretical result therefore reduces to a restatement of its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claims rest on the newly introduced PIRL objective and the assumption that retrospective verification against a historical baseline can detect true improvement. No explicit free parameters are named in the abstract. The proof of alignment is treated as a domain assumption whose details are unavailable without the full text.

axioms (1)
  • domain assumption The temporal objective of maximizing cumulative policy improvement is perfectly aligned with maximizing final task performance
    Stated as proven in the abstract but the actual derivation is not provided.
invented entities (2)
  • PIRL objective no independent evidence
    purpose: Replace surrogate reward maximization with explicit maximization of cumulative policy improvement across iterations
    Newly defined temporal objective introduced in the paper.
  • PIPO algorithm no independent evidence
    purpose: Implement closed-loop optimization through retrospective verification against a sliding-window baseline
    New algorithm that evaluates and selectively reinforces prior updates.

pith-pipeline@v0.9.0 · 5564 in / 1502 out tokens · 41569 ms · 2026-05-13T22:44:54.577345+00:00 · methodology

