Smaller Models are Natural Explorers for Policy-Level Diversity in GRPO

Chufan Shi; Dingdong Wang; Junjie Wang; Ruihang Chu; Tianhe Wu; Yiming Ren; Yiran Xu; Yujiu Yang; Yukang Chen; Yu Qiao; Zicheng Lin

arXiv: 2605.30789 · v2 · pith:5DGV25FM · submitted 2026-05-29 · 💻 cs.LG · cs.AI


Pith reviewed 2026-06-28 23:52 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords GRPO · policy optimization · model diversity · mathematical reasoning · small-to-large training · rollout efficiency · LLM reinforcement learning


The pith

Smaller models supply policy-level diversity that improves GRPO training of larger models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that smaller models from the same family generate higher policy-level diversity than larger ones, shown by stronger pass@k gains as the number of samples rises. This form of diversity stays temporally correlated and logically consistent, unlike token-level noise that can produce incoherent paths. The authors introduce S2L-PO to use a fixed small model for initial rollouts while training a large model, then progressively anneal to the large model's own samples. The method raises accuracy on mathematical reasoning benchmarks and lowers the compute spent on rollouts.

Core claim

Smaller models within the same family inherently exhibit higher policy-level diversity than larger counterparts, indicated by their superior pass@k relative to larger models as sample counts increase. This diversity is temporally correlated, preserves logical consistency, and supplies structured exploration signals for gradient estimation in GRPO. S2L-PO leverages fixed small models as natural explorers with a progressive annealing strategy that shifts from offline small-model rollouts to the large learner's own sampling, avoiding mid-training drops and achieving faster convergence plus a higher performance ceiling.
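The pass@k comparison behind this claim can be made concrete with the standard unbiased estimator, 1 - C(n-c, k)/C(n, k), for n samples of which c are correct. The sketch below is not taken from the paper. The per-problem counts are invented purely to show how a model that is weaker at k = 1 can overtake a stronger one at large k when its successes are spread over more problems.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples drawn, c of them correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples exist, so every k-subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average pass@k over problems, given the number of correct samples per problem."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical counts out of n = 64 samples on four problems (not from the paper).
N_SAMPLES = 64
large_counts = [48, 48, 0, 0]  # reliable on two problems, never solves the other two
small_counts = [3, 3, 3, 3]    # rarely right, but occasionally solves every problem

for k in (1, 8, 64):
    large = mean_pass_at_k(large_counts, N_SAMPLES, k)
    small = mean_pass_at_k(small_counts, N_SAMPLES, k)
    print(f"k={k}: large={large:.3f} small={small:.3f}")
```

A crossover of this kind shows broader coverage at large k. By itself it does not show that the extra successes come from coherent, structured exploration.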

What carries the argument

S2L-PO framework with progressive annealing from fixed small-model rollouts to large-model sampling
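A minimal sketch of how such a schedule could be wired into rollout generation. The figure captions state only that the small-model rollout ratio is reduced to zero over the first few steps. The linear decay, the starting ratio of 1.0, and every function name here are assumptions made for illustration, not the authors' implementation.

```python
from typing import Callable


def small_rollout_ratio(step: int, anneal_steps: int, start_ratio: float = 1.0) -> float:
    """Share of each rollout group drawn from the frozen small model (assumed linear decay)."""
    if anneal_steps <= 0:
        return 0.0
    remaining = max(0.0, 1.0 - step / anneal_steps)
    return start_ratio * remaining


def build_rollout_group(
    prompt: str,
    step: int,
    group_size: int,
    anneal_steps: int,
    sample_small: Callable[[str], str],
    sample_large: Callable[[str], str],
) -> list[str]:
    """Mix small-model and large-model rollouts for one prompt at a given training step."""
    n_small = round(group_size * small_rollout_ratio(step, anneal_steps))
    n_large = group_size - n_small
    small_part = [sample_small(prompt) for _ in range(n_small)]
    large_part = [sample_large(prompt) for _ in range(n_large)]
    return small_part + large_part


# Stub samplers stand in for the real models (hypothetical).
group = build_rollout_group(
    "Solve x + 2 = 5.",
    step=4,
    group_size=8,
    anneal_steps=8,
    sample_small=lambda p: "small-model rollout",
    sample_large=lambda p: "large-model rollout",
)
```

Once `step` reaches `anneal_steps`, the group comes entirely from the learner, which corresponds to fully on-policy GRPO.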

If this is right

Accuracy improves on mathematical reasoning benchmarks, for example by +8.8% on AIME 24 when a 1.7B explorer guides an 8B model.
Rollout compute is reduced at the same time.
Training avoids performance drops during the transition to the large model's sampling.
Convergence speeds up and the final performance ceiling for the large model rises.
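For a rough sense of the rollout-compute point, generation cost can be approximated as proportional to parameter count times tokens generated. The sketch below rests on that approximation and on equal rollout lengths for both models, ignores attention and serving overheads, and uses nominal sizes of 1.7B and 8B. None of it is a measurement from the paper, and the function name is hypothetical.

```python
def relative_rollout_cost(small_ratio: float, small_params: float, large_params: float) -> float:
    """Cost of a mixed rollout group relative to sampling every rollout from the large model.

    Assumes per-token generation cost scales with parameter count and that
    both models produce rollouts of the same length (illustrative assumptions).
    """
    if not 0.0 <= small_ratio <= 1.0:
        raise ValueError("small_ratio must lie in [0, 1]")
    if small_params <= 0 or large_params <= 0:
        raise ValueError("parameter counts must be positive")
    return small_ratio * (small_params / large_params) + (1.0 - small_ratio)


for ratio in (1.0, 0.5, 0.0):
    cost = relative_rollout_cost(ratio, small_params=1.7e9, large_params=8e9)
    print(f"small-model share {ratio:.1f}: relative rollout cost {cost:.3f}")
```

Under these assumptions the saving is largest early in training, when most rollouts come from the small model, and vanishes once sampling is fully on-policy.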

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The finding suggests that policy optimization may benefit from deliberately pairing models of different sizes for exploration and exploitation phases.
The annealing schedule could be adapted to other reinforcement learning setups that rely on multiple rollouts for gradient estimates.
If the diversity advantage holds across families, it would imply a new scaling consideration where smaller companions are retained rather than discarded after pretraining.

Load-bearing premise

The observed pass@k advantage of smaller models reflects temporally correlated policy-level diversity that supplies superior gradient signals rather than an artifact of capacity limits or evaluation metrics.

What would settle it

A direct comparison on the same benchmarks where rollouts from the small explorer produce no accuracy gain for the large model beyond what standard token-level diversity already achieves.

Figures

Figures reproduced from arXiv: 2605.30789 by Chufan Shi, Dingdong Wang, Junjie Wang, Ruihang Chu, Tianhe Wu, Yiming Ren, Yiran Xu, Yujiu Yang, Yukang Chen, Yu Qiao, Zicheng Lin.

**Figure 1.** S2L-PO (Bottom) simply modifies the rollout generation process of standard GRPO (Top). Motivated by the observation that smaller models inherently exhibit higher policy-level diversity, S2L-PO leverages a frozen smaller policy model to sample diverse rollouts for training a larger model. In early training, rollouts are primarily sampled from the smaller model to encourage diverse exploration. As training p…

**Figure 2.** Pass@k curves on AIME24 and AIME25 for Qwen3 Base models of various scales. While larger models perform better at small k, smaller models continue to improve as k increases and can match or exceed larger models at large sample sizes.

… possess an inherent diversity, stemming not from token-level randomness but from more varied solution strategies (Bansal et al., 2024; Dragoi et al., 2025; Yue et al., 2025). …

**Figure 3.** Two ways to increase rollout diversity under standard GRPO. (a) Increasing token-level perturbation (e.g., higher sampling temperature) introduces step-wise stochasticity that accumulates over decoding steps, often reducing long-range coherence. (b) Policy-level perturbations (e.g., parameter-level compression within a model family) induce temporally consistent trajectory deviations, yielding diverse yet s…

**Figure 4.** S2L-PO improves both final performance and convergence speed. Pass@1 on AIME24&25 versus effective training progress for different scale transitions. S2L-PO uses a smaller model to generate part of each rollout group early in training and progressively anneals to fully on-policy GRPO.

… and regresses to significantly lower Pass@1 in later stages, our policy perturbation proves to be more stable, converges f…

**Figure 5.** Pure small-model rollouts are insufficient for sustained improvement. Here N denotes the number of GRPO rollouts and n denotes the number of small-model rollouts, which allows total compute to be matched across settings.

[Plot: Qwen3-8B-Base; x-axis: training data size; legend: Progressive transition, Abrupt switching, GRPO]

**Figure 6.** Progressive transition vs. abrupt two-phase switching.

…cally requires expensive on-policy rollouts and maintaining multiple synchronized components (e.g., policy, reference model, and often a critic), leading to considerable engineering complexity and computational overhead. To simplify training, Direct Preference Optimization (DPO) (Rafailov et al., 2023) rewrites KL-regularized preference learning into …

**Figure 7.** Ablation on transition length. We compare progressive annealing schedules that reduce the small-model rollout ratio to zero over the first 8 steps versus the first 5 steps.

… 5.2. Diversity and Exploration in GRPO-Style Training. A central practical factor for GRPO-style methods is the diversity of candidate trajectories sampled for each prompt: when the sampled group becomes overly homogeneous or degenerates…

Original abstract

We identify a new dimension for enhancing rollout diversity in Group Relative Policy Optimization (GRPO) for LLMs. While GRPO relies on diverse rollouts, prevailing strategies primarily increase diversity by injecting more token-level randomness, which may introduce step-wise noise and lead to incoherent trajectories. We uncover that smaller models within the same model family inherently exhibit higher policy-level diversity, indicated by their superior pass@k relative to larger counterparts as sample counts increase. Unlike token-level noise, this diversity is temporally correlated, preserves logical consistency, and provides structured exploration signals for gradient estimation. We thus propose S2L-PO (Small-to-Large Policy Optimization), a framework that leverages fixed small models as natural explorers to train larger models. To balance exploration and exploitation, we design a progressive annealing strategy that transitions from offline small-model rollouts to the large learner's own sampling. This shift elegantly avoids mid-training performance drops caused by the small model's capacity limits, achieving faster convergence and unlocking a higher performance ceiling. S2L-PO improves accuracy on diverse mathematical reasoning benchmarks (e.g., +8.8% on AIME 24 using a 1.7B explorer to guide the 8B model) while reducing rollout compute.
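Why rollout diversity matters for gradient estimation can be seen from the group-relative advantage that GRPO computes for each prompt: every rollout's reward is standardized against the other rollouts in its group. The sketch below assumes binary correctness rewards and a population standard deviation with a small epsilon. Implementations differ on those details, and the function name is illustrative.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Standardize each reward against the mean and spread of its own rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation (an assumption)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One correct rollout among four: the correct one gets a positive advantage.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 0.0])

# Every rollout fails: all advantages are zero.
uniform = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
```

A group whose rollouts all receive the same reward yields zero advantages and therefore no learning signal for that prompt, which is the failure that more diverse rollouts are meant to avoid.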

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

S2L-PO shows practical gains on math reasoning by using fixed small models as explorers plus annealing, but the core claim that their pass@k edge reflects temporally coherent policy diversity needs tighter checks.


The paper's main point is that smaller models from the same family already supply more useful rollout diversity for GRPO than extra token-level noise, and their S2L-PO setup with progressive annealing from small-model samples to the large model's own samples delivers measurable accuracy lifts while cutting compute. The +8.8% on AIME 24 with a 1.7B explorer guiding an 8B model is the clearest number they report.

What they do cleanly is identify the coherence problem with token noise and propose a schedule that starts with offline small-model rollouts then shifts to the learner's samples. That avoids the obvious mid-training drop when the explorer is too weak. The idea is simple enough that groups already running GRPO variants could test it quickly.

The soft spot is the interpretation of the pass@k curves. The abstract treats the small model's higher pass@k at large k as evidence of temporally correlated, logically consistent trajectories that give better gradient signals. Pass@k only tells you whether any sample succeeded; it does not separate structured exploration from higher-variance but incoherent errors that lower capacity naturally produces. Without ablations that match diversity levels across model sizes, compare gradient variance, or measure trajectory coherence, it is still possible the gains come mainly from the annealing schedule and lower rollout cost rather than the claimed diversity property. If the full paper has those controls, the mechanism claim strengthens; if not, the empirical result stands but the explanation stays partly open.

This is aimed at people working on RL fine-tuning for reasoning models. The experimental setup looks worth referee time because the gains are concrete and the compute angle matters in practice, even if a reviewer will probably ask for the missing diversity diagnostics.

Referee Report

3 major / 0 minor

Summary. The paper claims that smaller models in the same family inherently provide higher policy-level diversity for GRPO training of LLMs, as indicated by superior pass@k at increasing sample counts; this diversity is argued to be temporally correlated and logically consistent (unlike token-level noise). It proposes the S2L-PO framework using fixed small models (e.g., 1.7B) as explorers for larger models (e.g., 8B) with a progressive annealing schedule from offline small-model rollouts to the learner's own sampling, reporting gains such as +8.8% on AIME 24 while reducing rollout compute.

Significance. If the central claim holds and the observed pass@k advantage indeed supplies structured, temporally correlated exploration signals that improve GRPO gradients, the work would offer a practical, compute-efficient alternative to token-level randomness for enhancing rollout diversity in LLM policy optimization, with direct applicability to mathematical reasoning benchmarks.

major comments (3)

[Abstract] The claim that smaller models' pass@k advantage reflects 'temporally correlated, preserves logical consistency' policy-level diversity that supplies superior gradient signals is load-bearing for the S2L-PO motivation, yet the text provides no direct measurements (e.g., trajectory coherence scores, token-level correlation statistics, or gradient variance comparisons) to distinguish this from capacity-driven coverage or higher-variance error patterns.
[Abstract] No ablations are described that would isolate the diversity mechanism, such as replacing small-model samples with matched-diversity large-model samples or comparing against the annealing schedule alone; without these, it is unclear whether the reported +8.8% AIME 24 gain (or reduced rollout compute) stems from the proposed policy-level diversity or from other factors like the training schedule.
[Abstract] The paper states that smaller models exhibit 'superior pass@k relative to larger counterparts as sample counts increase' but provides no details on controls for model-family effects, exact diversity metrics beyond pass@k, statistical significance, or baseline comparisons, which are required to substantiate the 'inherently exhibit higher policy-level diversity' claim.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight opportunities to strengthen the empirical support for our claims. We agree that additional measurements, ablations, and controls will improve the manuscript and will incorporate revisions to address each point. Our responses below are organized point-by-point.


Referee: [Abstract] The claim that smaller models' pass@k advantage reflects 'temporally correlated, preserves logical consistency' policy-level diversity that supplies superior gradient signals is load-bearing for the S2L-PO motivation, yet the text provides no direct measurements (e.g., trajectory coherence scores, token-level correlation statistics, or gradient variance comparisons) to distinguish this from capacity-driven coverage or higher-variance error patterns.

Authors: We acknowledge that the current version relies primarily on pass@k trends as an indicator. To directly substantiate the temporally correlated and logically consistent properties, the revised manuscript will add trajectory coherence scores, token-level correlation statistics across rollouts, and gradient variance comparisons between small-model and token-noise baselines. These will appear in a new diversity analysis subsection.
Revision: yes

Referee: [Abstract] No ablations are described that would isolate the diversity mechanism, such as replacing small-model samples with matched-diversity large-model samples or comparing against the annealing schedule alone; without these, it is unclear whether the reported +8.8% AIME 24 gain (or reduced rollout compute) stems from the proposed policy-level diversity or from other factors like the training schedule.

Authors: We agree that isolating the diversity source is necessary. The revision will include two new ablations: (1) replacing small-model rollouts with large-model samples matched for pass@k diversity, and (2) an annealing-schedule-only baseline without small-model explorers. These will quantify the contribution of policy-level diversity versus schedule effects.
Revision: yes

Referee: [Abstract] The paper states that smaller models exhibit 'superior pass@k relative to larger counterparts as sample counts increase' but provides no details on controls for model-family effects, exact diversity metrics beyond pass@k, statistical significance, or baseline comparisons, which are required to substantiate the 'inherently exhibit higher policy-level diversity' claim.

Authors: We will expand the experimental section with explicit controls for model-family effects (e.g., cross-family comparisons), additional diversity metrics (e.g., trajectory edit distance), statistical significance tests (p-values across seeds), and further baselines. These details will be added to support the inherent diversity claim.
Revision: yes
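One of the diagnostics named in this exchange, trajectory edit distance, could be computed along the following lines. This is an illustrative sketch rather than a metric defined in the paper: whitespace tokenization, normalization by the longer sequence, and the function names are all assumptions.

```python
from itertools import combinations
from typing import Sequence


def edit_distance(a: Sequence, b: Sequence) -> int:
    """Levenshtein distance between two sequences, using a rolling DP row."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,             # delete x
                cur[j - 1] + 1,          # insert y
                prev[j - 1] + (x != y),  # substitute, free if the items match
            ))
        prev = cur
    return prev[-1]


def mean_pairwise_distance(rollouts: list[str]) -> float:
    """Average normalized token-level edit distance over all pairs of rollouts."""
    token_lists = [r.split() for r in rollouts]  # whitespace tokens (an assumption)
    pairs = list(combinations(token_lists, 2))
    if not pairs:
        return 0.0
    total = 0.0
    for a, b in pairs:
        longest = max(len(a), len(b))
        total += edit_distance(a, b) / longest if longest else 0.0
    return total / len(pairs)
```

A score near 0 means the rollouts in a group are nearly identical, and a score near 1 means they share almost no aligned tokens. The measure captures surface variation only, so it would need to be paired with a coherence check to separate structured exploration from noise.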

Circularity Check

0 steps flagged

No circularity: empirical observation plus proposed schedule


The paper's central claim is an empirical observation that smaller models show higher pass@k as k increases, presented as a measured fact rather than a derived quantity. From this they motivate S2L-PO and an annealing schedule. No equations, fitted parameters renamed as predictions, self-citations used as load-bearing uniqueness theorems, or ansatzes smuggled via prior work appear in the provided text. The derivation chain is observation → method design, which remains self-contained against external benchmarks and does not reduce to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the empirical observation that smaller models exhibit higher policy-level diversity. The abstract quantifies no free parameters; the single axiom and the single invented entity are listed below.

axioms (1)

Domain assumption: Smaller models within the same family exhibit higher policy-level diversity than larger ones, visible in pass@k scaling.
Stated as the key uncovered insight that motivates the method.

invented entities (1)

S2L-PO framework (no independent evidence)
Purpose: Leverage small-model rollouts to train larger models with annealing to avoid capacity limits.
Newly proposed training procedure.

pith-pipeline@v0.9.1-grok · 5781 in / 1330 out tokens · 24496 ms · 2026-06-28T23:52:32.131189+00:00 · methodology


Reference graph

Works this paper leans on

33 extracted references · 30 canonical work pages · 17 internal anchors

[1]

Group-aware reinforcement learning for output diversity in large language models

Anschel, O., Shoshan, A., Botach, A., Hakimi, S. H., Gendler, A., Baruch, E. B., Bhonker, N., Kviatkovsky, I., Aggarwal, M., and Medioni, G. Group-aware reinforcement learning for output diversity in large language models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 32382–32403, 2025.
[2]

MathArena: Evaluating LLMs on Uncontaminated Math Competitions

Balunović, M., Dekoninck, J., Petrov, I., Jovanović, N., and Vechev, M. MathArena: Evaluating LLMs on uncontaminated math competitions. arXiv preprint arXiv:2505.23281.
[3]

XRPO: Pushing the limits of GRPO with Targeted Exploration and Exploitation

Bamba, U., Fang, M., Yu, Y., Zheng, H., and Lai, F. XRPO: Pushing the limits of GRPO with targeted exploration and exploitation. arXiv preprint arXiv:2510.06672.
[4]

Smaller, weaker, yet better: Training LLM reasoners via compute-optimal sampling

Bansal, H., Hosseini, A., Agarwal, R., Tran, V. Q., and Kazemi, M. Smaller, weaker, yet better: Training LLM reasoners via compute-optimal sampling. arXiv preprint arXiv:2408.16737.
[5]

InternLM2 Technical Report

Cai, Z., Cao, M., Chen, H., Chen, K., Chen, K., Chen, X., Chen, X., et al. InternLM2 technical report. arXiv preprint arXiv:2403.17297.
[6]

DRA-GRPO: Exploring diversity-aware reward adjustment for R1-Zero-like training of large language models

Chen, X., Zhu, W., Qiu, P., Dong, X., Wang, H., Wu, H., Li, H., Sotiras, A., Wang, Y., and Razi, A. DRA-GRPO: Exploring diversity-aware reward adjustment for R1-Zero-like training of large language models. arXiv preprint arXiv:2505.09655.
[7]

Beyond pass@k: Breadth-depth metrics for reasoning boundaries

Dragoi, M., Pintilie, I., Gogianu, F., and Brad, F. Beyond pass@k: Breadth-depth metrics for reasoning boundaries. arXiv preprint arXiv:2510.08325.
[8]

Soft Adaptive Policy Optimization

Gao, C., Zheng, C., Chen, X.-H., Dang, K., Liu, S., Yu, B., Yang, A., Bai, S., Zhou, J., and Lin, J. Soft adaptive policy optimization. arXiv preprint arXiv:2511.20347.
[9]

MiniLLM: Knowledge distillation of large language models

Gu, Y., Dong, L., Wei, F., and Huang, M. MiniLLM: Knowledge distillation of large language models. In International Conference on Learning Representations, volume 2024, pp. 32694–32717, 2024.
[10]

GAPO: Learning preferential prompt through generative adversarial policy optimization

Gu, Z., Chen, X., Shi, X., Wang, T., Zheng, S., Li, T., Feng, H., and Xiao, Y. GAPO: Learning preferential prompt through generative adversarial policy optimization. arXiv preprint arXiv:2503.20194.
[11]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948.
[12]

Rethinking Entropy Interventions in RLVR: An Entropy Change Perspective

Hao, Z., Wang, H., Liu, H., Luo, J., Yu, J., Dong, H., Lin, Q., Wang, C., and Chen, J. Rethinking entropy interventions in RLVR: An entropy change perspective. arXiv preprint arXiv:2510.10150.
[13]

OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems

He, C., Luo, R., Bai, Y., Hu, S., Thai, Z. L., Shen, J., Hu, J., Han, X., Huang, Y., Zhang, Y., et al. OlympiadBench: A challenging benchmark for promoting AGI with olympiad-level bilingual multimodal scientific problems. arXiv preprint arXiv:2402.14008.
[14]

Measuring Mathematical Problem Solving With the MATH Dataset

Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., and Steinhardt, J. Measuring mathematical problem solving with the MATH dataset. arXiv preprint arXiv:2103.03874.
[15]

Distilling the Knowledge in a Neural Network

Hinton, G., Vinyals, O., and Dean, J. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
[16]

ORPO: Monolithic Preference Optimization without Reference Model

Hong, J., Lee, N., and Thorne, J. ORPO: Monolithic preference optimization without reference model. arXiv preprint arXiv:2403.07691.
[17]

QeRL: Beyond efficiency–quantization-enhanced reinforcement learning for LLMs

Huang, W., Ge, Y., Yang, S., Xiao, Y., Mao, H., Lin, Y., Ye, H., Liu, S., Cheung, K. C., Yin, H., et al. QeRL: Beyond efficiency–quantization-enhanced reinforcement learning for LLMs. arXiv preprint arXiv:2510.11696.
[18]

Revisiting Entropy in Reinforcement Learning for Large Reasoning Models

Jin, R., Gao, P., Ren, Y., Han, Z., Zhang, T., Huang, W., Liu, W., Luan, J., and Xiong, D. Revisiting entropy in reinforcement learning for large reasoning models. arXiv preprint arXiv:2511.05993.
[19]

A survey of reinforcement learning from human feedback

Kaufmann, T., Weng, P., Bengs, V., and Hüllermeier, E. A survey of reinforcement learning from human feedback. arXiv preprint arXiv:2312.14925.
[20]

Bridging offline and online reinforcement learning for LLMs

Lanchantin, J., Chen, A., Lan, J., Li, X., Saha, S., Wang, T., Xu, J., Yu, P., Yuan, W., Weston, J. E., et al. Bridging offline and online reinforcement learning for LLMs. arXiv preprint arXiv:2506.21495.
[21]

Critical tokens matter: Token-level contrastive estimation enhances LLM's reasoning capability

Lin, Z., Liang, T., Xu, J., Lin, Q., Wang, X., Luo, R., Shi, C., Li, S., Yang, Y., and Tu, Z. Critical tokens matter: Token-level contrastive estimation enhances LLM's reasoning capability. arXiv preprint arXiv:2411.19943.
[22]

Revisiting group relative policy optimization: Insights into on-policy and off-policy training

Mroueh, Y., Dupuis, N., Belgodere, B., Nitsure, A., Rigotti, M., Greenewald, K., Navratil, J., Ross, J., and Rios, J. Revisiting group relative policy optimization: Insights into on-policy and off-policy training. arXiv preprint arXiv:2505.22257.
[23]

Turning up the heat: Min-p sampling for creative and coherent LLM outputs

Nguyen, M. N., Baker, A., Neo, C., Roush, A., Kirsch, A., and Shwartz-Ziv, R. Turning up the heat: Min-p sampling for creative and coherent LLM outputs. arXiv preprint arXiv:2407.01082.
[24]

Proximal Policy Optimization Algorithms

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
[25]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., Wu, Y., et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.
[26]

HybridFlow: A Flexible and Efficient RLHF Framework

Sheng, G., Zhang, C., Ye, Z., Wu, X., Zhang, W., Zhang, R., Peng, Y., Lin, H., and Wu, C. HybridFlow: A flexible and efficient RLHF framework. arXiv preprint arXiv:2409.19256.
[27]

Unchosen experts can contribute too: Unleashing moe models’ power by self-contrast

Shi, C., Yang, C., Zhu, X., Wang, J., Wu, T., Li, S., Cai, D., Yang, Y., and Meng, Y. Unchosen experts can contribute too: Unleashing MoE models' power by self-contrast. Advances in Neural Information Processing Systems, 37:136897–136921, 2024a.

Shi, C., Yang, H., Cai, D., Zhang, Z., Wang, Y., Yang, Y., and Lam, W. A thorough examination of decoding ...
[28]

SCOPE-RL: Stable and Quantitative Control of Policy Entropy in RL Post-Training

Wang, C., Li, Z., Bai, J., Zhang, Y., Cui, S., Zhao, Z., and Wang, Y. Arbitrary entropy policy optimization breaks the exploration bottleneck of reinforcement learning. arXiv preprint arXiv:2510.08141, 2025a.

Wang, H., Hao, S., Dong, H., Zhang, S., Bao, Y., Yang, Z., and Wu, Y. Offline reinforcement learning for LLM multi-step reasoning. In Findings of ...
[29]

Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

Yue, Y., Chen, Z., Lu, R., Zhao, A., Wang, Z., Song, S., and Huang, G. Does reinforcement learning really incentivize reasoning capacity in LLMs beyond the base model? arXiv preprint arXiv:2504.13837.
[30]

Group expectation policy optimization for stable heterogeneous reinforcement learning in LLMs

Zhang, H., Zheng, R., Yi, Z., Peng, H., Wang, H., and Yu, Y. Group expectation policy optimization for stable heterogeneous reinforcement learning in LLMs. arXiv e-prints, pp. arXiv–2508, 2025a.

Zhang, J. and Zuo, C. GRPO-LEAD: A difficulty-aware reinforcement learning approach for concise mathematical reasoning in language models. arXiv preprint arXi...
[31]

EDGE-GRPO: Entropy-driven GRPO with guided error correction for advantage diversity

Zhang, X., Wen, S., Wu, W., and Huang, L. EDGE-GRPO: Entropy-driven GRPO with guided error correction for advantage diversity. arXiv preprint arXiv:2507.21848, 2025b.

Zhang, X., Wu, S., Zhu, Y., Tan, H., Yu, S., He, Z., and Jia, J. Scaf-GRPO: Scaffolded group relative policy optimization for enhancing LLM reasoning. arXiv preprint arXiv:2510.19807, 2025c...
[32]

Group Sequence Policy Optimization

Zheng, C., Liu, S., Li, M., Chen, X.-H., Yu, B., Gao, C., Dang, K., Liu, Y., Men, R., Yang, A., et al. Group sequence policy optimization. arXiv preprint arXiv:2507.18071.
[33]

Exploring multi-temperature strategies for token- and rollout-level control in RLVR

Zhuang, H., Zhou, Y., Guo, T., Huang, Y., Liu, F., Song, K., and Zhang, X. Exploring multi-temperature strategies for token- and rollout-level control in RLVR. arXiv preprint arXiv:2510.08892.

[1] [1]

H., Gendler, A., Baruch, E

Anschel, O., Shoshan, A., Botach, A., Hakimi, S. H., Gendler, A., Baruch, E. B., Bhonker, N., Kviatkovsky, I., Aggarwal, M., and Medioni, G. Group-aware rein- forcement learning for output diversity in large language models. InProceedings of the 2025 Conference on Em- pirical Methods in Natural Language Processing, pp. 32382–32403,

2025

[2] [2]

MathArena: Evaluating LLMs on Uncontaminated Math Competitions

Balunovi´c, M., Dekoninck, J., Petrov, I., Jovanovi ´c, N., and Vechev, M. Matharena: Evaluating llms on uncontaminated math competitions.arXiv preprint arXiv:2505.23281,

work page internal anchor Pith review Pith/arXiv arXiv

[3] [3]

XRPO: Pushing the limits of GRPO with Targeted Exploration and Exploitation

9 Smaller Models are Natural Explorers for Policy-Level Diversity in GRPO Bamba, U., Fang, M., Yu, Y ., Zheng, H., and Lai, F. Xrpo: Pushing the limits of grpo with targeted exploration and exploitation.arXiv preprint arXiv:2510.06672,

work page internal anchor Pith review Pith/arXiv arXiv

[4] [4]

Q., and Kazemi, M

Bansal, H., Hosseini, A., Agarwal, R., Tran, V . Q., and Kazemi, M. Smaller, weaker, yet better: Training llm reasoners via compute-optimal sampling.arXiv preprint arXiv:2408.16737,

work page arXiv

[5] [5]

InternLM2 Technical Report

Cai, Z., Cao, M., Chen, H., Chen, K., Chen, K., Chen, X., Chen, X., et al. Internlm2 technical report.arXiv preprint arXiv:2403.17297,

work page internal anchor Pith review Pith/arXiv arXiv

[6] [6]

Dra-grpo: Exploring diversity-aware reward adjustment for r1-zero- like training of large language models.arXiv preprint arXiv:2505.09655,

Chen, X., Zhu, W., Qiu, P., Dong, X., Wang, H., Wu, H., Li, H., Sotiras, A., Wang, Y ., and Razi, A. Dra-grpo: Exploring diversity-aware reward adjustment for r1-zero- like training of large language models.arXiv preprint arXiv:2505.09655,

work page arXiv

[7] [7]

Dragoi, M., Pintilie, I., Gogianu, F., and Brad, F

URLhttps://arxiv.org/abs/2602.06107. Dragoi, M., Pintilie, I., Gogianu, F., and Brad, F. Beyond pass@ k: Breadth-depth metrics for reasoning boundaries. arXiv preprint arXiv:2510.08325,

work page arXiv

[8] [8]

Soft Adaptive Policy Optimization

Gao, C., Zheng, C., Chen, X.-H., Dang, K., Liu, S., Yu, B., Yang, A., Bai, S., Zhou, J., and Lin, J. Soft adaptive policy optimization.arXiv preprint arXiv:2511.20347,

work page internal anchor Pith review Pith/arXiv arXiv

[9] [9]

Minillm: Knowl- edge distillation of large language models

Gu, Y ., Dong, L., Wei, F., and Huang, M. Minillm: Knowl- edge distillation of large language models. InInterna- tional Conference on Learning Representations, volume 2024, pp. 32694–32717,

2024

[10] [10]

Gapo: Learning preferential prompt through generative adversarial policy optimization.arXiv preprint arXiv:2503.20194,

Gu, Z., Chen, X., Shi, X., Wang, T., Zheng, S., Li, T., Feng, H., and Xiao, Y . Gapo: Learning preferential prompt through generative adversarial policy optimization.arXiv preprint arXiv:2503.20194,

work page arXiv

[11] [11]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al. Deepseek-r1: In- centivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948,

work page internal anchor Pith review Pith/arXiv arXiv

[12] Hao, Z., Wang, H., Liu, H., Luo, J., Yu, J., Dong, H., Lin, Q., Wang, C., and Chen, J. Rethinking entropy interventions in rlvr: An entropy change perspective. arXiv preprint arXiv:2510.10150.

[13] He, C., Luo, R., Bai, Y., Hu, S., Thai, Z. L., Shen, J., Hu, J., Han, X., Huang, Y., Zhang, Y., et al. Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems. arXiv preprint arXiv:2402.14008.

[14] Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., and Steinhardt, J. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874.

[15] Hinton, G., Vinyals, O., and Dean, J. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.

[16] Hong, J., Lee, N., and Thorne, J. Orpo: Monolithic preference optimization without reference model. arXiv preprint arXiv:2403.07691.

[17] Huang, W., Ge, Y., Yang, S., Xiao, Y., Mao, H., Lin, Y., Ye, H., Liu, S., Cheung, K. C., Yin, H., et al. Qerl: Beyond efficiency–quantization-enhanced reinforcement learning for llms. arXiv preprint arXiv:2510.11696.

[18] Jin, R., Gao, P., Ren, Y., Han, Z., Zhang, T., Huang, W., Liu, W., Luan, J., and Xiong, D. Revisiting entropy in reinforcement learning for large reasoning models. arXiv preprint arXiv:2511.05993.

[19] Kaufmann, T., Weng, P., Bengs, V., and Hüllermeier, E. A survey of reinforcement learning from human feedback. arXiv preprint arXiv:2312.14925.

[20] Lanchantin, J., Chen, A., Lan, J., Li, X., Saha, S., Wang, T., Xu, J., Yu, P., Yuan, W., Weston, J. E., et al. Bridging offline and online reinforcement learning for llms. arXiv preprint arXiv:2506.21495.

[21] Lin, Z., Liang, T., Xu, J., Lin, Q., Wang, X., Luo, R., Shi, C., Li, S., Yang, Y., and Tu, Z. Critical tokens matter: Token-level contrastive estimation enhances llm’s reasoning capability. arXiv preprint arXiv:2411.19943.

[22] Mroueh, Y., Dupuis, N., Belgodere, B., Nitsure, A., Rigotti, M., Greenewald, K., Navratil, J., Ross, J., and Rios, J. Revisiting group relative policy optimization: Insights into on-policy and off-policy training. arXiv preprint arXiv:2505.22257.

[23] Nguyen, M. N., Baker, A., Neo, C., Roush, A., Kirsch, A., and Shwartz-Ziv, R. Turning up the heat: Min-p sampling for creative and coherent llm outputs. arXiv preprint arXiv:2407.01082.

[24] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.

[25] Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., Wu, Y., et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.

[26] Sheng, G., Zhang, C., Ye, Z., Wu, X., Zhang, W., Zhang, R., Peng, Y., Lin, H., and Wu, C. Hybridflow: A flexible and efficient rlhf framework. arXiv preprint arXiv:2409.19256.

[27] Shi, C., Yang, C., Zhu, X., Wang, J., Wu, T., Li, S., Cai, D., Yang, Y., and Meng, Y. Unchosen experts can contribute too: Unleashing moe models’ power by self-contrast. Advances in Neural Information Processing Systems, 37:136897–136921, 2024a.

Shi, C., Yang, H., Cai, D., Zhang, Z., Wang, Y., Yang, Y., and Lam, W. A thorough examination of decoding ...

[28]

SCOPE-RL: Stable and Quantitative Control of Policy Entropy in RL Post-Training

Wang, C., Li, Z., Bai, J., Zhang, Y., Cui, S., Zhao, Z., and Wang, Y. Arbitrary entropy policy optimization breaks the exploration bottleneck of reinforcement learning. arXiv preprint arXiv:2510.08141, 2025a.

Wang, H., Hao, S., Dong, H., Zhang, S., Bao, Y., Yang, Z., and Wu, Y. Offline reinforcement learning for llm multi-step reasoning. In Findings of ...

[29] Yue, Y., Chen, Z., Lu, R., Zhao, A., Wang, Z., Song, S., and Huang, G. Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model? arXiv preprint arXiv:2504.13837.

[30] Zhang, H., Zheng, R., Yi, Z., Peng, H., Wang, H., and Yu, Y. Group expectation policy optimization for stable heterogeneous reinforcement learning in llms. arXiv e-prints, pp. arXiv–2508, 2025a.

Zhang, J. and Zuo, C. Grpo-lead: A difficulty-aware reinforcement learning approach for concise mathematical reasoning in language models. arXiv preprint arXi...

[31] Zhang, X., Wen, S., Wu, W., and Huang, L. Edge-grpo: Entropy-driven grpo with guided error correction for advantage diversity. arXiv preprint arXiv:2507.21848, 2025b.

Zhang, X., Wu, S., Zhu, Y., Tan, H., Yu, S., He, Z., and Jia, J. Scaf-grpo: Scaffolded group relative policy optimization for enhancing llm reasoning. arXiv preprint arXiv:2510.19807, 2025c...

[32] Zheng, C., Liu, S., Li, M., Chen, X.-H., Yu, B., Gao, C., Dang, K., Liu, Y., Men, R., Yang, A., et al. Group sequence policy optimization. arXiv preprint arXiv:2507.18071.

[33] Zhuang, H., Zhou, Y., Guo, T., Huang, Y., Liu, F., Song, K., and Zhang, X. Exploring multi-temperature strategies for token-and rollout-level control in rlvr. arXiv preprint arXiv:2510.08892.