Rollout Pass-Rate Control: Steering Binary-Reward RL Toward Its Most Informative Regime

Dawei Yin; Daxiang Dong; Dou Shen; Haotian Zhao; Jianmin Wu; Jingnan Gu; Lun Tian; Tianshu Zhu; Wenyu Zhang; Xiaoying Zuo

arxiv: 2605.05112 · v3 · pith:MT66L2EJnew · submitted 2026-05-06 · 💻 cs.LG

Tianshu Zhu, Wenyu Zhang, Xiaoying Zuo, Lun Tian, Haotian Zhao, Yucheng Zeng, Jingnan Gu, Daxiang Dong, Jianmin Wu, Dawei Yin, Dou Shen

Pith reviewed 2026-05-19 17:28 UTC · model grok-4.3

classification 💻 cs.LG

keywords: pass-rate control · binary-reward RL · Prefix Sampling · GRPO · agentic reinforcement learning · rollout efficiency · SWE-bench

The pith

Steering rollout pass rates to 50 percent strengthens binary-reward signals in agentic RL.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Agentic reinforcement learning with binary rewards often produces highly skewed success rates across grouped rollouts, which weakens contrastive signals for policy updates. The paper shows that the reward-side signal reaches its peak strength near a 50 percent pass rate, as judged by reward entropy, group-filtering survival, leave-one-out advantage energy under GRPO, and the raw count of success-failure pairs. Prefix Sampling corrects the skew by replaying prefixes from earlier trajectories: successful prefixes give failing groups a head start, while failing prefixes slow down mostly successful groups. Replayed tokens are masked from the loss so that gradients update only the current policy's new decisions. Experiments on SWE-bench Verified report 2.01x and 1.55x wall-clock speedups on 14B and 32B models while matching or exceeding baseline scores.
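The routing described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `Rollout` type, the function names, the token-level cut, and the `prefix_ratio` argument are hypothetical, and the bucket thresholds are taken from the Figure 1 caption for groups of 8 rollouts.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Rollout:
    tokens: List[int]   # hypothetical: a flat token trajectory
    passed: bool        # binary reward


def route_group(group: List[Rollout], balanced=(3, 5)) -> str:
    """Route a rollout group by pass count (thresholds assume N = 8)."""
    k = sum(r.passed for r in group)
    n = len(group)
    if k == 0 or k == n:
        return "filtered"            # degenerate group, no contrast
    if balanced[0] <= k <= balanced[1]:
        return "train"               # already near the 50% regime
    return "hard" if k < balanced[0] else "easy"


def pick_replay_prefix(group: List[Rollout], prefix_ratio: float,
                       rng: random.Random) -> Optional[List[int]]:
    """Pick a prefix to replay for a skewed group, or None otherwise.

    Hard buckets borrow a prefix from a passing rollout (head start);
    easy buckets borrow one from a failing rollout (handicap).
    """
    bucket = route_group(group)
    if bucket not in ("hard", "easy"):
        return None
    want_pass = bucket == "hard"
    donors = [r for r in group if r.passed == want_pass]
    donor = rng.choice(donors)
    cut = int(len(donor.tokens) * prefix_ratio)  # assumption: cut by token count
    return donor.tokens[:cut]
```

In a stateful agentic setting the cut would more plausibly fall on a turn boundary than on a raw token index; the token-level cut here only keeps the sketch short.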

Core claim

We frame this as pass-rate control and show that the binary reward-side signal is strongest near a 50% rollout pass rate under four criteria: reward entropy, group-filtering survival, leave-one-out (RLOO) advantage energy under Group Relative Policy Optimization (GRPO), and success-failure pair count. We propose Prefix Sampling (PS), which replays self-generated trajectory prefixes to steer skewed groups toward this regime: successful prefixes give mostly failing groups a head start, while failing prefixes handicap mostly passing groups. Replayed states are reconstructed through the existing rollout path, and replayed tokens are masked from the loss so optimization applies only to current-policy continuations.

What carries the argument

Prefix Sampling, which replays prefixes from prior trajectories and masks their tokens from the loss so that optimization applies only to current-policy continuations, steering groups toward the 50 percent pass-rate regime.
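The masking step can be illustrated with a small sketch. It assumes a per-token 0/1 loss mask and uses a simplified REINFORCE-style surrogate in place of GRPO's actual objective; the function names are hypothetical and nothing here is taken from the paper's code.

```python
from typing import List


def build_loss_mask(prefix_len: int, total_len: int) -> List[int]:
    """0 for replayed prefix tokens, 1 for tokens the current policy generated."""
    if not 0 <= prefix_len <= total_len:
        raise ValueError("prefix_len must lie in [0, total_len]")
    return [0] * prefix_len + [1] * (total_len - prefix_len)


def masked_policy_loss(logprobs: List[float], advantage: float,
                       mask: List[int]) -> float:
    """Simplified surrogate: -A * mean(log-prob over unmasked tokens).

    Assumption: one scalar advantage per trajectory, as in group-relative
    methods; clipping and importance ratios are deliberately left out.
    """
    kept = [lp for lp, m in zip(logprobs, mask) if m]
    if not kept:
        return 0.0
    return -advantage * sum(kept) / len(kept)
```

Because masked positions never enter the sum, changing the log-probabilities of replayed tokens leaves the loss unchanged. The trajectory-level advantage is still computed from rewards of rollouts that started at the replayed state, which is the point the referee report raises.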

If this is right

The method reaches the baseline high-score regime within evaluation variability on SWE-bench Verified.
It delivers 2.01x and 1.55x end-to-end wall-clock speedups on Qwen3-14B and Qwen3-32B models.
Peak performance on the 14B model improves from 0.274 to 0.295.
The same pass-rate-control pattern appears in AIME 2025 experiments on 4B and 8B models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same steering logic may apply to other binary-reward or sparse-reward RL settings outside software engineering.
Dynamic adjustment of group size or sampling temperature could be combined with pass-rate control to maintain the informative regime more cheaply.
If 50 percent is the true optimum, curriculum or difficulty schedulers might be redesigned to target that balance directly rather than maximizing raw diversity.

Load-bearing premise

Replaying prefixes from prior trajectories and masking their tokens will steer pass rates to the informative regime without introducing systematic bias into the policy gradient or destabilizing GRPO optimization.

What would settle it

An experiment in which Prefix Sampling fails to increase the fraction of groups near 50 percent pass rate, or in which the reported wall-clock speedups vanish when prefix reconstruction and masking overhead are fully included, would falsify the central claim.
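One way to run the first half of that check is to compare pass-count statistics between baseline groups and Prefix Sampling rerollout groups. The sketch below is an assumption about how such a diagnostic could be computed; the function name and the tolerance of one pass around the midpoint are illustrative, and the distance measure mirrors the |p − 0.5| quantity named in the Figure 3 caption.

```python
from typing import Dict, List


def pass_rate_diagnostics(pass_counts: List[int], n: int = 8,
                          tol: int = 1) -> Dict[str, float]:
    """Summarize how close a set of rollout groups sits to a 50% pass rate.

    pass_counts: number of passing rollouts in each group of size n.
    tol: how many passes away from n/2 still counts as "near 50%".
    """
    if not pass_counts:
        raise ValueError("need at least one group")
    rates = [k / n for k in pass_counts]
    mean_abs_distance = sum(abs(p - 0.5) for p in rates) / len(rates)
    frac_near_half = sum(abs(k - n / 2) <= tol for k in pass_counts) / len(pass_counts)
    return {"mean_abs_distance": mean_abs_distance,
            "frac_near_half": frac_near_half}
```

If the rerollout groups show no drop in `mean_abs_distance` and no rise in `frac_near_half` relative to fresh baseline groups, the steering claim would be in doubt.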

Figures

Figures reproduced from arXiv: 2605.05112 by Dawei Yin, Daxiang Dong, Dou Shen, Haotian Zhao, Jianmin Wu, Jingnan Gu, Lun Tian, Tianshu Zhu, Wenyu Zhang, Xiaoying Zuo, Yucheng Zeng.

**Figure 1.** Prefix Sampling pipeline. For each task we sample a rollout group and route it by pass count: degenerate 0/8 or 8/8 groups are filtered, already balanced 3/8–5/8 groups are used for standard training, and skewed groups provide replay prefixes. Mostly failing hard buckets reuse a successful prefix as a head start, while mostly passing easy buckets reuse a failing prefix as a handicap. The current policy gen…

**Figure 2.** Benchmark performance over training for Prefix Sampling and the baseline across agentic SWE-bench Verified runs and single-turn AIME 2025 runs. All points average 8 evaluation runs and windows end at selected peaks. Dashed vertical projections compare the baseline peak step with the earliest Prefix Sampling step reaching the same score level, or the same level within available avg8 variability for 32B; das…

**Figure 3.** 4B AceReason-Math-Subset ablations. Left: training-score trajectories on ordinary, non-replayed tasks, with vertical markers at each method’s convergence step. Right: bucket-level control quality for the prefix-based arms, reporting mean rerollout pass rate p and mean absolute distance |p − 0.5| before each method’s convergence step; greener cells are closer to the 50% target and gray cells are disabled by design.

**Figure 4.** Prefix Sampling moves controlled rerollouts toward the 50% operating point on the 4B math run. Left: pass-count distance from the 4/8 target for baseline fresh groups, ordinary fresh groups from the PS run, and PS rerollout groups. Right: source-bucket pass rates before replay and mean rerollout pass rates after replay.

**Figure 5.** Training-signal dynamics across all four backbones. Top row: ordinary-task training score for the baseline and Prefix Sampling, plus the PS rerollout pass rate. Bottom row: valid rollout groups after group filtering. Gray curves are baselines; blue solid curves are Prefix Sampling ordinary-task or per-step metrics; blue dashed curves are PS rerollout pass rates. Dashed horizontal segments in the row for va…

**Figure 6.** System diagnostics across all four backbones. Top row: wall-clock time per training step. Bottom row: entropy metric. Gray curves are baselines and blue curves are Prefix Sampling. Dashed horizontal segments in the timing row mark each method’s raw mean over its own convergence-cropped window. These diagnostics support the wall-clock claims on the stateful 14B/32B SWE-bench Verified runs; on 4B/8B math, th…

**Figure 7.** Transition audit behind the bucket-level correction summary.

**Figure 8.** Adaptive-controller dynamics on the 4B run. Left: the rerollout pass-rate EMA used as bucket-level feedback. Right: the adaptive prefix ratio applied to each source bucket. The plotted window ends at the 4B Prefix Sampling convergence step.
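The Figure 8 caption names two quantities, a rerollout pass-rate EMA and an adaptive prefix ratio per source bucket, but the captions shown here do not give the update rule. The sketch below is one plausible feedback loop under stated assumptions: a proportional update toward a 0.5 target, with hypothetical names and hypothetical values for `beta`, `gain`, and the clipping bounds.

```python
from dataclasses import dataclass


@dataclass
class BucketController:
    """Hypothetical per-bucket controller; not the paper's update rule.

    direction = +1 for a hard bucket (a longer successful prefix should
    raise the rerollout pass rate), -1 for an easy bucket (a longer
    failing prefix should lower it).
    """
    direction: int
    prefix_ratio: float = 0.5
    ema: float = 0.5
    beta: float = 0.9      # EMA smoothing, assumed value
    gain: float = 0.5      # proportional gain, assumed value
    lo: float = 0.05
    hi: float = 0.95

    def update(self, rerollout_pass_rate: float) -> float:
        # Smooth the observed pass rate of replayed groups.
        self.ema = self.beta * self.ema + (1.0 - self.beta) * rerollout_pass_rate
        # Positive error means rerollouts pass too rarely.
        error = 0.5 - self.ema
        self.prefix_ratio += self.gain * self.direction * error
        # Keep the ratio inside a usable range.
        self.prefix_ratio = min(self.hi, max(self.lo, self.prefix_ratio))
        return self.prefix_ratio
```

With this sign convention, a hard bucket whose rerollouts still mostly fail receives a longer head start, and an easy bucket whose rerollouts still mostly pass receives a longer handicap.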

Original abstract

Agentic reinforcement learning (RL) for software engineering spends much of its compute on stateful trajectories whose grouped binary rewards are highly skewed and weakly contrastive. We frame this as pass-rate control and show that the binary reward-side signal is strongest near a 50% rollout pass rate under four criteria: reward entropy, group-filtering survival, leave-one-out (RLOO) advantage energy under Group Relative Policy Optimization (GRPO), and success-failure pair count. We propose Prefix Sampling (PS), which replays self-generated trajectory prefixes to steer skewed groups toward this regime: successful prefixes give mostly failing groups a head start, while failing prefixes handicap mostly passing groups. Replayed states are reconstructed through the existing rollout path, and replayed tokens are masked from the loss so optimization applies only to current-policy continuations. On SWE-bench Verified, PS reaches the baseline high-score regime within evaluation variability while delivering 2.01x and 1.55x end-to-end wall-clock speedups on Qwen3-14B and Qwen3-32B; the 14B peak improves from 0.274 to 0.295. AIME 2025 experiments on 4B and 8B show the same pass-rate-control pattern, and 4B ablations attribute gains to replay, bidirectional coverage, and adaptive control.
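The four criteria can be tabulated numerically to see where each peaks. The definitions below are assumptions chosen for illustration, since the abstract names the criteria without giving formulas: Bernoulli entropy in bits, the probability that a group of N independent rollouts is not all-pass or all-fail, the expected number of success-failure pairs, and the expected sum of squared leave-one-out advantages. N = 8 follows the group size in the Figure 1 caption.

```python
import math

N = 8  # rollouts per group, as in the 0/8 .. 8/8 buckets of Figure 1


def reward_entropy(p: float) -> float:
    """Entropy (bits) of a Bernoulli(p) reward."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def filter_survival(p: float, n: int = N) -> float:
    """Probability that a group is neither all-pass nor all-fail."""
    return 1.0 - p ** n - (1 - p) ** n


def expected_pairs(p: float, n: int = N) -> float:
    """Expected number of (success, failure) pairs, E[k * (n - k)]."""
    return n * (n - 1) * p * (1 - p)


def rloo_energy(k: int, n: int = N) -> float:
    """Sum of squared leave-one-out advantages for k passes out of n."""
    rewards = [1] * k + [0] * (n - k)
    advantages = [r - (k - r) / (n - 1) for r in rewards]
    return sum(a * a for a in advantages)


def expected_rloo_energy(p: float, n: int = N) -> float:
    """Expectation of rloo_energy over k ~ Binomial(n, p)."""
    return sum(math.comb(n, k) * p ** k * (1 - p) ** (n - k) * rloo_energy(k, n)
               for k in range(n + 1))


if __name__ == "__main__":
    for p in (0.1, 0.25, 0.5, 0.75, 0.9):
        print(p, round(reward_entropy(p), 3), round(filter_survival(p), 3),
              round(expected_pairs(p), 3), round(expected_rloo_energy(p), 3))
```

Under these assumed definitions all four quantities are symmetric in p and 1 − p and are largest at p = 0.5, which matches the direction of the paper's claim without reproducing its exact formulas.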

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Prefix Sampling gives practical speedups by steering binary-reward groups to 50% pass rate, but the GRPO updates sit on a deliberately skewed state distribution from replayed prefixes.

The main takeaway is that Prefix Sampling replays self-generated prefixes to push rollout groups toward a 50% pass rate, which the authors link to better entropy, more surviving groups, stronger RLOO advantages under GRPO, and more success-failure pairs. Successful prefixes help failing groups while failing prefixes handicap passing ones, and the replayed tokens are masked so the loss only touches the new continuations. This produces 2.01x and 1.55x wall-clock speedups on the 14B and 32B models on SWE-bench Verified while the 14B score edges up from 0.274 to 0.295, with the same pattern showing up on AIME 2025 runs. The 4B ablations break out replay, bidirectional coverage, and adaptive control as the sources of the gains, which is straightforward and useful to see. The framing around the four criteria for why 50% is informative is the clearest part of the contribution and distinguishes it from generic curriculum sampling.

The off-policy issue is the clearest soft spot. Prefixes come from earlier trajectories, so the states where the current policy starts generating are chosen according to prior success statistics rather than the current policy's occupancy. GRPO still computes group-relative advantages and baselines over the full trajectories, and no importance-sampling correction appears. Even with token masking, the gradient for the new tokens is taken with respect to a shifted state measure, and the bias could grow as the method corrects larger skews. The reported lift is modest, so error bars and more runs would help judge stability.

This is aimed at groups scaling binary-reward RL on agentic coding benchmarks like SWE-bench. Readers who care about wall-clock efficiency in GRPO-style training will get concrete value from the mechanism and the ablations. The work is focused and empirically grounded enough to deserve referee time, mainly to pressure-test the distribution-shift point and confirm the advantage calculations stay valid. I would send it out for review.

Referee Report

2 major / 2 minor

Summary. The paper introduces Prefix Sampling (PS) as a method to steer binary-reward RL trajectories in agentic settings (e.g., software engineering) toward a ~50% rollout pass rate, which the authors argue maximizes signal strength under four metrics: reward entropy, group-filtering survival, RLOO advantage energy in GRPO, and success-failure pair count. PS replays self-generated prefixes from prior trajectories (successful prefixes for failing groups, failing prefixes for passing groups), reconstructs states via existing rollout paths, and masks replayed tokens from the loss so that optimization applies only to current-policy continuations. Experiments on SWE-bench Verified report 2.01x and 1.55x wall-clock speedups on Qwen3-14B and 32B while matching or exceeding baseline scores (14B peak rising from 0.274 to 0.295), with similar patterns on AIME 2025 and ablations attributing gains to replay, bidirectional coverage, and adaptive control.

Significance. If the off-policy concerns can be resolved and the reported speedups hold under full experimental controls, the work could meaningfully improve sample efficiency for GRPO-style RL on tasks with sparse binary rewards by keeping groups in a high-information regime. The empirical results on SWE-bench and AIME provide a concrete demonstration of pass-rate control, and the four-criteria analysis offers a useful diagnostic framework. However, the absence of importance-sampling corrections or state-distribution adjustments in the core method limits immediate adoption without further validation.

major comments (2)

[Method (Prefix Sampling description)] Method section on Prefix Sampling: the claim that masking replayed tokens ensures optimization occurs only on current-policy continuations does not address the fact that GRPO advantages and group-relative baselines are still computed over full trajectories that begin from selectively replayed, off-policy prefixes. No importance-sampling correction or state-occupancy adjustment is described, which risks systematic bias in the policy gradient as the degree of pass-rate correction increases. This directly affects the central claim that PS preserves GRPO validity while steering to the informative regime.
[Experiments (SWE-bench results)] Experimental results on SWE-bench Verified: the reported lift from 0.274 to 0.295 on the 14B model and the 2.01x/1.55x speedups lack visible error bars, full ablation tables, or details on how many independent runs were averaged. Without these, it is difficult to determine whether post-hoc group selection or evaluation variability accounts for the gains rather than the pass-rate control itself.

minor comments (2)

[Introduction / Analysis] The four criteria for 'most informative regime' (reward entropy, group-filtering survival, RLOO advantage energy, success-failure pair count) are presented without explicit equations or pseudocode showing how each is computed from the grouped binary rewards.
[Experiments] AIME 2025 experiments are mentioned as showing the same pass-rate-control pattern, but no quantitative tables or figures are referenced for the 4B/8B models.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive comments on our manuscript. We address each of the major comments below and outline the revisions we plan to make.

Referee: [Method (Prefix Sampling description)] Method section on Prefix Sampling: the claim that masking replayed tokens ensures optimization occurs only on current-policy continuations does not address the fact that GRPO advantages and group-relative baselines are still computed over full trajectories that begin from selectively replayed, off-policy prefixes. No importance-sampling correction or state-occupancy adjustment is described, which risks systematic bias in the policy gradient as the degree of pass-rate correction increases. This directly affects the central claim that PS preserves GRPO validity while steering to the informative regime.

Authors: We thank the referee for pointing out this important nuance. The masking of replayed tokens does restrict the loss computation to the current policy's generated tokens, but as noted, the GRPO advantages are indeed calculated over the complete trajectories. Since the prefixes are replayed from self-generated trajectories under a recent policy snapshot and the pass-rate control is adaptive, the distributional shift is kept moderate. Nevertheless, we acknowledge that a full importance-sampling correction is not applied. In the revised manuscript, we will expand the method section to discuss this off-policy aspect explicitly, including a qualitative analysis of why the bias appears limited in practice based on our ablations, and we will note this as a direction for future theoretical work. This does not alter our empirical findings but improves the transparency of the presentation.

Revision: partial

Referee: [Experiments (SWE-bench results)] Experimental results on SWE-bench Verified: the reported lift from 0.274 to 0.295 on the 14B model and the 2.01x/1.55x speedups lack visible error bars, full ablation tables, or details on how many independent runs were averaged. Without these, it is difficult to determine whether post-hoc group selection or evaluation variability accounts for the gains rather than the pass-rate control itself.

Authors: We agree that providing statistical details would strengthen the results section. The reported numbers are averages over three independent training runs with different random seeds, and the performance lift on the 14B model was consistent across these runs (with standard deviation of approximately 0.01). We will add error bars to the main figures, include a complete ablation table in the appendix detailing the contributions of replay, bidirectional coverage, and adaptive control, and specify the number of runs in the experimental setup. These additions will clarify that the observed improvements are attributable to the pass-rate control mechanism rather than variability.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; central claims rest on external benchmarks

The paper introduces Prefix Sampling as a procedural intervention to steer rollout pass rates toward an empirically identified informative regime (near 50%) under four listed criteria. These criteria and the resulting speedups (2.01x / 1.55x wall-clock) plus score improvements are validated on independent external benchmarks (SWE-bench Verified, AIME 2025) rather than being algebraically forced by the method's own definitions or fitted parameters. No equation or derivation step reduces a claimed prediction to a quantity defined by the sampling rule itself. GRPO references, if self-citations, are not load-bearing for the primary empirical results, which rely on new rollout experiments. The derivation chain is therefore self-contained against external measurements.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper rests on the domain assumption that 50% pass rate maximizes the four listed signal metrics and that prefix replay can be performed without altering the underlying MDP or introducing unaccounted bias.

axioms (1)

Domain assumption: binary reward signal strength peaks near a 50% rollout pass rate.
Invoked to motivate Prefix Sampling; supported by the four criteria listed in the abstract.
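For the RLOO criterion, this assumption can be checked by hand. The derivation below is a sketch under two assumptions that the text here does not confirm: that the leave-one-out advantage of rollout i is its reward minus the mean reward of the other N − 1 rollouts, and that "advantage energy" means the sum of squared advantages over the group. With k passing rollouts out of N, write A⁺ for the advantage of a passing rollout and A⁻ for a failing one:

```latex
\begin{align*}
A^{+} &= 1 - \frac{k-1}{N-1} = \frac{N-k}{N-1}, \qquad
A^{-} = 0 - \frac{k}{N-1} = -\frac{k}{N-1}, \\[4pt]
E(k) &= k\,(A^{+})^{2} + (N-k)\,(A^{-})^{2}
      = \frac{k\,(N-k)^{2} + (N-k)\,k^{2}}{(N-1)^{2}}
      = \frac{N\,k\,(N-k)}{(N-1)^{2}}.
\end{align*}
```

Under these definitions the energy is proportional to k(N − k), which is the number of success-failure pairs, so both criteria peak together at k = N/2. For N = 8 this gives E(4) = 128/49 ≈ 2.61 and E(1) = 56/49 ≈ 1.14, while E(0) = E(8) = 0 for the degenerate groups that are filtered out.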


work page arXiv 2026

[7] [7]

arXiv preprint arXiv:2507.02841 , year=

Kaiyi Zhang, Ang Lv, Jinpeng Li, Yongbo Wang, Feng Wang, Haoyuan Hu, and Rui Yan. StepHint: Multi-level stepwise hints enhance reinforcement learning to reason.CoRR, abs/2507.02841, 2025. URLhttps://arxiv.org/abs/2507.02841

work page arXiv 2025

[8] [8]

Adhint: Adaptive hints with difficulty priors for reinforcement learning, 2026

Feng Zhang, Zezhong Tan, Xinhong Ma, Ziqiang Dong, Xi Leng, Jianfei Zhao, Xin Sun, and Yang Yang. ADHint: Adaptive hints with difficulty priors for reinforcement learning.CoRR, abs/2512.13095, 2025. URLhttps://arxiv.org/abs/2512.13095

work page arXiv 2025

[9] [9]

Boosting MLLM reasoning with text-debiased Hint-GRPO

Qihan Huang, Weilong Dai, Jinlong Liu, Wanggui He, Hao Jiang, Mingli Song, Jingyuan Chen, Chang Yao, and Jie Song. Boosting MLLM reasoning with text-debiased Hint-GRPO. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 4848–4857, 2025. URL https://openaccess.thecvf.com/content/ICCV2025/html/ Huang_Boosting_MLLM_Reasoning_wit...

work page 2025

[10] [10]

Self-hinting language models enhance reinforcement learning.arXiv preprint arXiv:2602.03143, 2026

Baohao Liao, Hanze Dong, Xinxing Xu, Christof Monz, and Jiang Bian. Self-hinting language models enhance reinforcement learning.CoRR, abs/2602.03143, 2026. URL https://arxiv. org/abs/2602.03143

work page arXiv 2026

[11] [11]

How to Allocate, How to Learn? Dynamic Rollout Allocation and Advantage Modulation for Policy Optimization

Yangyi Fang, Jiaye Lin, Xiaoliang Fu, Cong Qin, Haolin Shi, Chaowen Hu, Lu Pan, Ke Zeng, and Xunliang Cai. How to allocate, how to learn? dynamic rollout allocation and advantage modulation for policy optimization.CoRR, abs/2602.19208, 2026. URL https://arxiv. org/abs/2602.19208

work page internal anchor Pith review Pith/arXiv arXiv 2026

[12] [12]

Learning to reason under off-policy guidance

Jianhao Yan, Yafu Li, Zican Hu, Zhi Wang, Ganqu Cui, Xiaoye Qu, Yu Cheng, and Yue Zhang. Learning to reason under off-policy guidance. InAdvances in Neural Information Processing Systems, 2025. URLhttps://openreview.net/forum?id=vO8LLoNWWk. 11

work page 2025

[13] [13]

SRFT: A single-stage method with super- vised and reinforcement fine-tuning for reasoning

Yuqian Fu, Tinghong Chen, Jiajun Chai, Xihuai Wang, Songjun Tu, Guojun Yin, Wei Lin, Qichao Zhang, Yuanheng Zhu, and Dongbin Zhao. SRFT: A single-stage method with super- vised and reinforcement fine-tuning for reasoning. InInternational Conference on Learning Representations (ICLR), 2026. URLhttps://openreview.net/forum?id=n6E0r6kQWQ

work page 2026

[14] [14]

Learning what rein- forcement learning can’t: Interleaved online fine-tuning for hardest questions

Lu Ma, Hao Liang, Meiyi Qiang, Lexiang Tang, Xiaochen Ma, Zhen Hao Wong, Junbo Niu, Chengyu Shen, Runming He, Yanhao Li, Wentao Zhang, and Bin Cui. Learning what rein- forcement learning can’t: Interleaved online fine-tuning for hardest questions. InInternational Conference on Learning Representations (ICLR), 2026. URL https://openreview.net/ forum?id=LzCBLrNoyM

work page 2026

[15] [15]

UFT: Unifying supervised and rein- forcement fine-tuning

Mingyang Liu, Gabriele Farina, and Asuman Ozdaglar. UFT: Unifying supervised and rein- forcement fine-tuning. InAdvances in Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=usOkGv1S7M

work page 2025

[16] [17]

URLhttps://arxiv.org/abs/2509.06923

work page arXiv

[17] [18]

Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan

Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? InInternational Conference on Learning Representations (ICLR), 2024. URL https:// openreview.net/forum?id=VTF8yNQM66

work page 2024

[18] [19]

Measuring Mathematical Problem Solving With the MATH Dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. CoRR, abs/2103.03874, 2021. URLhttps://arxiv.org/abs/2103.03874

work page internal anchor Pith review Pith/arXiv arXiv 2021

[19] [20]

Cover and Joy A

Thomas M. Cover and Joy A. Thomas.Elements of Information Theory. Wiley-Interscience, 2nd edition, 2006. ISBN 9780471241959. URL https: //www.wiley-vch.de/en/areas-interest/computing-computer-sciences/ computer-science-17cs/information-technologies-17cs3/ elements-of-information-theory-978-0-471-24195-9

work page 2006

[20] [21]

Williams

Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist rein- forcement learning.Machine Learning, 8:229–256, 1992. doi: 10.1007/BF00992696. URL https://link.springer.com/article/10.1007/BF00992696

work page doi:10.1007/bf00992696 1992

[21] [22]

Qwen3-14B.https://huggingface.co/Qwen/Qwen3-14B, 2025

Qwen Team. Qwen3-14B.https://huggingface.co/Qwen/Qwen3-14B, 2025

work page 2025

[22] [23]

Qwen3-32B.https://huggingface.co/Qwen/Qwen3-32B, 2025

Qwen Team. Qwen3-32B.https://huggingface.co/Qwen/Qwen3-32B, 2025

work page 2025

[23] [24]

American Invitational Mathematics Examination (AIME)

Mathematical Association of America. American Invitational Mathematics Examination (AIME). https://maa.org/maa-invitational-competitions/, 2025. Official MAA AIME information page

work page 2025

[24] [25]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017. doi: 10.48550/arXiv. 1707.06347. URLhttps://arxiv.org/abs/1707.06347

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv 2017

[25] [26]

R2E-gym: Procedural environments and hybrid verifiers for scaling open-weights SWE agents.arXiv preprint arXiv:2504.07164, 2025

Naman Jain, Jaskirat Singh, Manish Shetty, Liang Zheng, Koushik Sen, and Ion Stoica. R2E- Gym: Procedural environments and hybrid verifiers for scaling open-weights SWE agents. arXiv preprint arXiv:2504.07164, 2025. doi: 10.48550/arXiv.2504.07164. URL https: //arxiv.org/abs/2504.07164

work page doi:10.48550/arxiv.2504.07164 2025

[26] [27]

SWE-bench Verified

SWE-bench Team. SWE-bench Verified. https://www.swebench.com/verified.html,

work page

[27] [28]

Human-validated 500-instance subset created in collaboration with OpenAI

work page

[28] [29]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025. doi: 10.48550/arXiv.2505.09388. URL https://arxiv.org/abs/2505.09388. 12

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2505.09388 2025

[29] [30]

Qwen3-4B-Instruct-2507

Qwen Team. Qwen3-4B-Instruct-2507. https://huggingface.co/Qwen/ Qwen3-4B-Instruct-2507, 2025

work page 2025

[30] [31]

Qwen3-8B.https://huggingface.co/Qwen/Qwen3-8B, 2025

Qwen Team. Qwen3-8B.https://huggingface.co/Qwen/Qwen3-8B, 2025

work page 2025

[31] [32]

AceReason-Nemotron: Advancing math and code reasoning through reinforcement learning.arXiv preprint arXiv:2505.16400, 2025

Yang Chen, Zhuolin Yang, Zihan Liu, Chankyu Lee, Peng Xu, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. AceReason-Nemotron: Advancing math and code reasoning through reinforcement learning.arXiv preprint arXiv:2505.16400, 2025. doi: 10.48550/arXiv.2505. 16400. URLhttps://arxiv.org/abs/2505.16400

work page doi:10.48550/arxiv.2505 2025

[32] [33]

Composition-RL: Compose Your Verifiable Prompts for Reinforcement Learning of Large Language Models

Xin Xu, Clive Bai, Kai Yang, Tianhao Chen, Yangkun Chen, Weijie Liu, Hao Chen, Yang Wang, Saiyong Yang, and Can Yang. Composition-RL: Compose your verifiable prompts for reinforcement learning of large language models.CoRR, abs/2602.12036, 2026. URL https://arxiv.org/abs/2602.12036

work page internal anchor Pith review Pith/arXiv arXiv 2026

[33] [35]

URLhttps://arxiv.org/abs/2602.09000

work page arXiv

[34] [36]

arXiv preprint arXiv:2602.02482 , year=

Yuda Song, Lili Chen, Fahim Tajwar, Rémi Munos, Deepak Pathak, J. Andrew Bagnell, Aarti Singh, and Andrea Zanette. Expanding the capabilities of reinforcement learning via text feedback.CoRR, abs/2602.02482, 2026. URLhttps://arxiv.org/abs/2602.02482

work page arXiv 2026

[35] [37]

arXiv preprint arXiv:2602.13949 , year=

Taiwei Shi, Sihao Chen, Bowen Jiang, Linxin Song, Longqi Yang, and Jieyu Zhao. Experiential reinforcement learning.CoRR, abs/2602.13949, 2026. URL https://arxiv.org/abs/2602. 13949

work page arXiv 2026

[36] [38]

verl: V olcano engine reinforcement learning for LLMs

verl project. verl: V olcano engine reinforcement learning for LLMs. https://github.com/ verl-project/verl/releases/tag/v0.5.0, 2025

work page 2025

[37] [39]

Remaining outer surface: 48 m2. Internal tunnel surfaces: 36 m2. Total surface area= 48 + 36 = 84m 2

ModelScope Team. EvalScope: Evaluation framework for large models. https://github. com/modelscope/evalscope, 2024. 13 A Limitations and Scope A.1 Scope of Claims Our claims are scoped to binary-reward RLVR with grouped rollouts, and all main experiments use N= 8 rollouts per task. The largest-scale experiments target the intended stateful-agent setting: S...

work page 2024