arxiv: 2605.24331 · v1 · pith:GZOK2CQWnew · submitted 2026-05-23 · 💻 cs.LG · stat.ML

CurveRL: Principled Distribution-Aware Context Reweighting for LLM Reasoning

Pith reviewed 2026-06-30 13:51 UTC · model grok-4.3

classification 💻 cs.LG stat.ML
keywords: LLM reasoning, prompt reweighting, RLVR, quantile coordinate transform, distribution-aware weighting, functional derivative, reinforcement learning

The pith

Reweighting prompts by their rank and density in the pass-rate distribution improves LLM reasoning over standard RLVR methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper derives optimal prompt weights for reinforcement learning on language model reasoning by treating reweighting as the functional derivative of a utility defined over the space of pass-rate functions. This produces a single optimality condition that recovers existing methods such as REINFORCE and GRPO as special cases. CurveRL implements the condition through a quantile coordinate transform so that each prompt receives weight according to its position and local density within the current pass-rate distribution rather than its absolute value. Experiments across benchmarks show consistent gains relative to GRPO and other baselines. The work positions control of the pass-rate distribution itself as a central design choice for these algorithms.

Core claim

Formulating prompt reweighting as the functional derivative of a utility functional over the pass-rate function space yields a unified optimality framework. CurveRL realizes this framework with a quantile coordinate transform whose weights depend on rank and density in the empirical pass-rate distribution, producing higher reasoning performance than GRPO and other RLVR baselines on multiple benchmarks.

What carries the argument

A quantile coordinate transform that converts each prompt's pass rate into a weight based on its cumulative rank and local density in the observed pass-rate distribution.
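A minimal sketch of how rank and density can turn into a weight, assuming the ratio form w = f̂_ref / F̂_ref quoted in the Figure 3 caption. The function name `quantile_weights`, the two batches, and the discrete treatment of density are illustrative assumptions; the paper's smoothing by interpolation and its sliding window are not modeled here.

```python
from collections import Counter

def quantile_weights(pass_rates):
    """Return {pass_rate: f_hat / F_hat} for each distinct empirical pass rate.

    f_hat is the share of prompts at that pass rate (a discrete stand-in for density).
    F_hat is the share of prompts at or below it (the empirical CDF, i.e. the rank).
    """
    n = len(pass_rates)
    counts = Counter(pass_rates)
    weights = {}
    at_or_below = 0
    for p in sorted(counts):
        at_or_below += counts[p]
        weights[p] = (counts[p] / n) / (at_or_below / n)
    return weights

# Hypothetical batches with 8 rollouts per prompt (pass rates are multiples of 1/8).
hard_batch = [0.0, 0.0, 0.125, 0.125, 0.125, 0.5, 0.875, 1.0]
easy_batch = [0.5, 0.875, 0.875, 1.0, 1.0, 1.0, 1.0, 1.0]

w_hard = quantile_weights(hard_batch)
w_easy = quantile_weights(easy_batch)

# The same absolute pass rate (0.5) receives a different weight in each batch.
print(w_hard[0.5], w_easy[0.5])
```

In this discrete version the smallest observed pass rate always receives weight 1, which a smoothed density estimate would not necessarily reproduce.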

If this is right

  • REINFORCE and GRPO emerge as particular choices of utility functional within the same framework.
  • Weights that depend only on absolute pass rates can be suboptimal when the distribution of pass rates is skewed or changing.
  • Context-distribution control supplies an explicit axis for analyzing and improving RLVR algorithms beyond simple thresholding.
  • Alternative utility functionals on the same pass-rate space can generate new reweighting rules with potentially different performance profiles.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The quantile approach might transfer to other verified-reward settings where the distribution of success signals varies during training.
  • The framework offers a way to diagnose when certain prompts dominate learning and to adjust the distribution explicitly.
  • Extending the same functional-derivative view to multi-step reasoning traces or to non-verified rewards could reveal further structure.

Load-bearing premise

The chosen utility functional over pass-rate functions, together with the quantile transform, selects weights that genuinely raise downstream reasoning accuracy rather than merely reordering the same training examples.

What would settle it

Training the identical model and data with uniform or absolute-value reweighting and observing equal or higher benchmark scores would indicate that the distributional weighting is not required for the reported gains.

Figures

Figures reproduced from arXiv: 2605.24331 by Jiayi Xin, Ke Sun, Qi Long, Weijie Su, Yizhou Zhao.

Figure 1. Pass@k scaling on five representative benchmarks. Top row: Qwen3-1.7B-Base, k ∈ {1, …, 1024}. Bottom row: Qwen3-4B-Base, k ∈ {1, …, 512}. CurveRL outperforms GRPO and MaxRL across the full range of k on both model sizes, and exceeds the pretrained base model on most panels. … bound on reasoning capability. Consistent with prior observations [Yue et al., 2025], GRPO and MaxRL exhibit varying degrees…
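For readers unfamiliar with the metric in Figure 1, pass@k is commonly estimated with the unbiased estimator introduced by Chen et al. (2021) alongside the HumanEval benchmark. The sketch below implements that estimator; whether the paper computes its curves exactly this way is an assumption, and the function name `pass_at_k` is illustrative.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples of which c are correct.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that
    k samples drawn without replacement are all incorrect.
    """
    if not 0 <= c <= n:
        raise ValueError("c must lie between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must lie between 1 and n")
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical prompt: 10 samples, 3 of them correct.
estimate = pass_at_k(10, 3, 1)
```

With k = 1 the estimate reduces to the empirical pass rate c / n.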
Figure 2. Prompt Distribution across Difficulty in Qwen3-1.7B-Base.
Figure 3. Distribution-aware and data-driven weighting of CurveRL on Qwen3-4B-Base. Left: dynamics of CurveRL’s empirical pass-rate density $\hat f_{\mathrm{ref}}(\hat p)$, smoothed by interpolation. Center: dynamics of CurveRL’s adaptive prompt weight $w_t(\hat p) = \hat f_{\mathrm{ref}}(\hat p) / \hat F_{\mathrm{ref}}(\hat p)$. Right: static GRPO weight $1/\sqrt{\hat p(1 - \hat p)}$ and MaxRL weight $1/\hat p$, rescaled to [0, 1]. Sensitivity Analysis of Sliding Window Size $t_0$. We further compare our …
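The two static weights named in the Figure 3 caption depend only on the absolute pass rate, which is what CurveRL's adaptive weight is contrasted against. The sketch below evaluates them on a small grid. The grid, the function names, and the use of min-max scaling for "rescaled to [0, 1]" are assumptions, since the caption does not say how the rescaling is done.

```python
import math

def grpo_weight(p):
    """Static GRPO-style weight 1 / sqrt(p (1 - p)); undefined at p = 0 and p = 1."""
    return 1.0 / math.sqrt(p * (1.0 - p))

def maxrl_weight(p):
    """Static MaxRL-style weight 1 / p; undefined at p = 0."""
    return 1.0 / p

def min_max_rescale(values):
    """Rescale a list to [0, 1]; min-max scaling is an assumption of this sketch."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Interior pass rates reachable with 8 rollouts per prompt (hypothetical grid).
grid = [k / 8 for k in range(1, 8)]
grpo = min_max_rescale([grpo_weight(p) for p in grid])
maxrl = min_max_rescale([maxrl_weight(p) for p in grid])
```

On this grid the GRPO-style weight is symmetric around p = 0.5 and smallest there, while the MaxRL-style weight decreases monotonically in p. Neither changes when the distribution of pass rates across prompts shifts.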
Figure 4. $f_{\mathrm{ref}}$ and $F_{\mathrm{ref}}$ for REINFORCE, GRPO, and MaxRL. General Case. In general, there is a key connection equation: $\int_p^1 w(t)\,dt = \int_p^1 \frac{f_{\mathrm{ref}}(t)}{F_{\mathrm{ref}}(t)}\,dt = \log F_{\mathrm{ref}}(1) - \log F_{\mathrm{ref}}(p) = -\log F_{\mathrm{ref}}(p)$. This implies that $F_{\mathrm{ref}}(p) = \exp\left(-\int_p^1 w(t)\,dt\right)$. Since $w$ is non-negative, we can easily verify that $F_{\mathrm{ref}}$ is a non-decreasing function, as $F_{\mathrm{ref}}(p_1) < F_{\mathrm{ref}}(p_2)$ if $p_1 < p_2$, and $F_{\mathrm{ref}}(1) = 1$. Therefore, $F_{\mathrm{ref}}$ is a valid C…
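As a worked instance of the connection equation in the Figure 4 excerpt, the sketch below plugs in the two static weights quoted in the Figure 3 caption. It is a check under those stated weights, assuming they are used unnormalized, and is not a reproduction of the paper's own derivation.

```latex
% MaxRL weight w(t) = 1/t
\int_p^1 \frac{dt}{t} = -\log p
\;\Longrightarrow\;
F_{\mathrm{ref}}(p) = \exp(\log p) = p,
\qquad
f_{\mathrm{ref}}(p) = F_{\mathrm{ref}}(p)\, w(p) = 1 .

% GRPO weight w(t) = 1/\sqrt{t(1-t)}, substituting t = \sin^2\theta
\int_p^1 \frac{dt}{\sqrt{t(1-t)}} = \pi - 2\arcsin\sqrt{p}
\;\Longrightarrow\;
F_{\mathrm{ref}}(p) = \exp\!\left(2\arcsin\sqrt{p} - \pi\right),
\qquad
f_{\mathrm{ref}}(p) = \frac{F_{\mathrm{ref}}(p)}{\sqrt{p(1-p)}} .
```

Under this reading the MaxRL weight corresponds to a uniform reference density on [0, 1], and both cases satisfy $F_{\mathrm{ref}}(1) = 1$.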
Figure 5. Pass@k scaling on the additional three benchmarks. Top row: Qwen3-1.7B-Base, k ∈ {1, …, 1024}. Bottom row: Qwen3-4B-Base, k ∈ {1, …, 512}. CurveRL tracks or dominates the GRPO and MaxRL baselines on most benchmarks.
Figure 6. Qwen3-1.7B-Base post-training prompt-difficulty distribution on the remaining six benchmarks. Across benchmarks, CurveRL’s unsolvable fraction is mostly no higher than that of the strongest pointwise-weighted baseline, while the hard/medium mass is preserved or enlarged. Panel axes: difficulty level (unsolvable, hard, medium, easy) against fraction of prompts; first panel: BRUMO 2025.
Figure 7. Qwen3-4B-Base post-training prompt-difficulty distribution on all eight benchmarks. CurveRL’s unsolvable fraction tracks or improves on the strongest pointwise-weighted baseline at the 4B scale as well, mirroring the 1.7B trend in …
Figure 8. Fraction of prompts where the model generates at least one correct rollout out of 8 samples.
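The quantity tracked in Figure 8 can be computed directly from per-prompt rollout rewards. The sketch below shows one way to do so, assuming binary verified rewards. The function name `fraction_with_signal` and the example data are hypothetical and are not taken from the paper or its code.

```python
def fraction_with_signal(rollout_rewards):
    """Share of prompts with at least one correct rollout.

    rollout_rewards: list of per-prompt lists of binary rewards (1 = verified correct).
    """
    if not rollout_rewards:
        raise ValueError("need at least one prompt")
    hits = sum(1 for rewards in rollout_rewards if any(r > 0 for r in rewards))
    return hits / len(rollout_rewards)

# Hypothetical training step: 4 prompts, 8 rollouts each.
step_rewards = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 0, 1, 1, 1, 1],
    [0, 0, 0, 0, 0, 0, 1, 0],
]
frac = fraction_with_signal(step_rewards)
```

Prompts whose 8 rollouts all fail have an empirical pass rate of zero, so this fraction measures how many prompts can still contribute a learning signal at a given step.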
Figure 9. Pass@k scaling for the three sliding-window sizes. Five benchmarks from …
Figure 10. Figure 10 reports additional training dynamics on Qwen3-1.7B-Base (top row) and Qwen3-4B-Base (bottom row), including mean response length, policy entropy, and gradient norm. CurveRL generally produces longer chains of thought and maintains higher actor entropy, supporting greater reasoning diversity and a wider capability boundary, as shown in Section 5. CurveRL also exhibits a flatter gradient-norm profile, indic…
Figure 11. Qwen3-1.7B-Base validation accuracy during training. Panels: MATH-500 pass@1 (%) and MATH-500 pass@16 (%) against RL step, for GRPO, MaxRL, and CurveRL.
Figure 12. Qwen3-4B-Base validation accuracy during training. C.8 Distribution-Aware Weighting on Qwen3-1.7B-Base. Results shown in …
Figure 13. Weighting schemes on Qwen3-1.7B-Base. Same layout and scale as …
Figure 14. The probability density function, CDF, and weight function in terms of the pass rate
Figure 15. Dynamics of normalized weights among GRPO, MaxRL, and CurveRL on Qwen3-4B-Base.
Original abstract

Context or prompt-level reweighting has emerged as a central algorithmic lever in Reinforcement Learning with Verified Rewards (RLVR) for improving the reasoning capability of large language models, yet the principle determining what constitutes an optimal weighting remains poorly understood. We address this gap by formulating prompt reweighting as a functional derivative of a utility functional defined in the pass-rate function space, yielding a unified optimality framework that accommodates existing schemes, including REINFORCE and GRPO. Building on this optimality framework, we propose a distribution-aware prompt reweighting approach, called CurveRL, based on a quantile coordinate transform, in which the weight assigned to each prompt depends not on the absolute value of pass rates but on its rank and density to reflect the distributional structure of the pass rates in the learning dynamics. Extensive experiments across multiple benchmarks demonstrate that our proposed CurveRL consistently outperforms GRPO and other RLVR baselines. Our study identifies context-distribution control as a principled axis for analyzing and designing prompt-reweighted RLVR algorithms. The code is released in https://github.com/zhyzmath/CurveRL.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper formulates prompt reweighting in RLVR as the functional derivative of a utility functional over the pass-rate function space, yielding a unified optimality framework that includes REINFORCE and GRPO. It introduces CurveRL, which applies a quantile coordinate transform so that weights depend on rank and density rather than absolute pass rates, and reports that this approach consistently outperforms GRPO and other RLVR baselines across multiple benchmarks. Code is released.

Significance. If the derivation of the optimality framework is made explicit and the empirical gains prove robust to ablations and statistical controls, the work would supply a principled axis (context-distribution control) for analyzing and designing prompt-reweighted RLVR algorithms. The public code release supports reproducibility and is a clear strength.

major comments (3)
  1. [Abstract] Abstract: the claim that 'a functional derivative yields the optimality framework' is presented without any derivation steps, explicit statement of the utility functional, or the resulting optimality condition. This information is load-bearing for the central unified-framework claim and must be supplied (with equation numbers) before the framework can be evaluated.
  2. [Experiments] Experiments (and abstract): no error bars, ablation studies, or statistical significance tests are described for the reported outperformance over GRPO. Without these, it is impossible to determine whether the quantile transform produces gains beyond what simpler monotonic reweightings would achieve.
  3. [Optimality framework] Optimality framework section: the manuscript does not demonstrate that the chosen utility functional together with the quantile transform selects weights whose effect on downstream reasoning is non-equivalent to simpler reweighting schemes that ignore the functional-derivative construction. This directly addresses the weakest assumption and must be addressed with a concrete comparison or proof.
minor comments (2)
  1. [Abstract] The abstract mentions 'extensive experiments across multiple benchmarks' but does not name the benchmarks or report the number of runs; this should be stated explicitly.
  2. Notation for the pass-rate function space and the quantile coordinate transform should be introduced with a clear definition before being used in the optimality derivation.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the insightful comments on our manuscript. We address each of the major comments below and outline the revisions we will make to strengthen the presentation of the optimality framework and the empirical results.

  1. Referee: [Abstract] Abstract: the claim that 'a functional derivative yields the optimality framework' is presented without any derivation steps, explicit statement of the utility functional, or the resulting optimality condition. This information is load-bearing for the central unified-framework claim and must be supplied (with equation numbers) before the framework can be evaluated.

    Authors: We agree that the abstract would benefit from greater explicitness. In the revised manuscript, we will update the abstract to briefly state the utility functional and the optimality condition, cross-referencing the equation numbers from Section 3 where the full derivation is provided. This will make the central claim more self-contained while preserving the abstract's length constraints.
    Revision: yes

  2. Referee: [Experiments] Experiments (and abstract): no error bars, ablation studies, or statistical significance tests are described for the reported outperformance over GRPO. Without these, it is impossible to determine whether the quantile transform produces gains beyond what simpler monotonic reweightings would achieve.

    Authors: We acknowledge the absence of these statistical controls in the current version. We will incorporate error bars (standard deviations across multiple random seeds), ablation studies comparing the quantile transform to simpler reweighting functions, and statistical significance tests (e.g., paired t-tests) in the revised experiments section. These additions will help isolate the contribution of the distribution-aware component.
    Revision: yes

  3. Referee: [Optimality framework] Optimality framework section: the manuscript does not demonstrate that the chosen utility functional together with the quantile transform selects weights whose effect on downstream reasoning is non-equivalent to simpler reweighting schemes that ignore the functional-derivative construction. This directly addresses the weakest assumption and must be addressed with a concrete comparison or proof.

    Authors: To demonstrate non-equivalence, we will add a dedicated comparison in the revised manuscript. This will include both a theoretical argument showing how the quantile coordinate transform (incorporating density) differs from absolute pass-rate or rank-only reweighting in the context of the utility functional, and empirical results contrasting the learning trajectories and final performance when using weights derived from the full construction versus simplified alternatives. We believe this will substantiate that the functional-derivative approach yields distinct and beneficial weighting behavior.
    Revision: yes
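The statistical controls promised in response 2 (standard deviations across seeds and paired t-tests) can be sketched with the standard library alone. The helper name `paired_t_statistic` is hypothetical, and the two lists are dummy placeholder numbers chosen for easy arithmetic. They are not results from the paper.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t_statistic(scores_a, scores_b):
    """Paired t statistic and degrees of freedom for per-seed score pairs.

    Each index is one random seed evaluated under both methods.
    """
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need at least two matched pairs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    # Sample standard deviation of the differences (n - 1 in the denominator).
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1

# Dummy placeholder values, NOT measurements: three seeds per method.
placeholder_a = [2.0, 3.0, 4.0]
placeholder_b = [1.0, 1.0, 1.0]

t_stat, dof = paired_t_statistic(placeholder_a, placeholder_b)
error_bar_a = stdev(placeholder_a)  # the kind of per-method error bar the response mentions
```

Converting the statistic to a p-value requires the Student t distribution with `dof` degrees of freedom, which is left out here to keep the sketch dependency-free.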

Circularity Check

0 steps flagged

No significant circularity detected

The derivation formulates prompt reweighting via functional derivative of a utility functional over the pass-rate function space and introduces a quantile coordinate transform based on rank and density. These steps are presented as mathematical constructions independent of the target benchmark scores. No equations or self-citations are shown that reduce the central optimality claim or the proposed CurveRL weights to a fit or renaming of the input data by construction. The framework is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the existence of a utility functional whose functional derivative produces an optimal weighting; the abstract does not enumerate free parameters or new entities, but implicitly treats the pass-rate function space and the quantile transform as well-defined constructs without external validation.

axioms (1)
  • Domain assumption: A utility functional exists on the space of pass-rate functions such that its functional derivative yields the optimal prompt weights.
    Invoked when the paper states that prompt reweighting is formulated as a functional derivative of a utility functional.
