pith. machine review for the scientific record.

arxiv: 2605.12969 · v1 · submitted 2026-05-13 · 💻 cs.LG · cs.AI

Recognition: no theorem link

Revisiting Reinforcement Learning with Verifiable Rewards from a Contrastive Perspective

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 20:16 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords Reinforcement Learning from Verifiable Rewards · Contrastive Policy Optimization · GRPO · LLM Reasoning · InfoNCE Objective · Mathematical Reasoning Benchmarks

The pith

ConSPO replaces GRPO's clipped ratios with length-normalized log-probabilities and a group-wise InfoNCE objective to improve credit assignment in LLM reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper first reformulates GRPO as a weighted positive-negative score difference that relies on clipped token-level importance sampling ratios. This reveals two limitations: the scores being optimized are misaligned with actual generation likelihoods, and credit assignment ignores relative score gaps within each group of rollouts. ConSPO corrects both by using length-normalized sequence log-probabilities as the score and optimizing a contrastive InfoNCE loss that contrasts each positive rollout against negative ones from the same group. A curriculum-scheduled margin further guides the optimization from coarse separation early on to tighter separation later. Evaluations across multiple models, scales, and datasets show consistent gains on mathematical reasoning benchmarks.
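To fix notation, here is one plausible rendering of the two ingredients; the symbols and the placement of the margin are our reconstruction from the abstract, not the paper's own Eq. (11):

```latex
% Length-normalized sequence score for a rollout y given prompt x
s_\theta(y \mid x) = \frac{1}{|y|} \sum_{t=1}^{|y|} \log \pi_\theta\!\left(y_t \mid x, y_{<t}\right)

% Group-wise InfoNCE over positives P and negatives N from one rollout group,
% with temperature \tau and curriculum-scheduled margin m
\mathcal{L}_{\mathrm{ConSPO}} = -\frac{1}{|P|} \sum_{i \in P} \log
\frac{\exp\!\left(\left(s_\theta(y_i) - m\right)/\tau\right)}
     {\exp\!\left(\left(s_\theta(y_i) - m\right)/\tau\right) + \sum_{j \in N} \exp\!\left(s_\theta(y_j)/\tau\right)}
```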

Core claim

ConSPO is a contrastive sequence-level policy optimization method for RLVR that aligns optimized scores with autoregressive likelihoods via length-normalized log-probabilities and applies a group-wise InfoNCE objective to make credit assignment depend on relative positive-negative gaps, together with a curriculum margin that increases separation strength over training.

What carries the argument

The group-wise InfoNCE-style objective applied to length-normalized sequence log-probabilities, which replaces clipped ratio-based surrogate scores to enable relative, score-gap-aware credit assignment within rollout groups.
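A minimal PyTorch-style sketch of that carrier, assuming per-rollout token log-probabilities have already been gathered; the function name and defaults are ours, and the paper's exact loss may differ in details:

```python
import torch

def conspo_loss(logps, lengths, is_positive, tau=0.1, margin=0.0):
    """Group-wise InfoNCE over one rollout group (a sketch, not the paper's Eq. (11)).

    logps:       (G,) summed token log-probs of each rollout under the policy
    lengths:     (G,) token counts, used for length normalization
    is_positive: (G,) bool mask of verifier-accepted rollouts
    """
    scores = logps / lengths                      # length-normalized sequence scores
    pos, neg = scores[is_positive], scores[~is_positive]
    if pos.numel() == 0 or neg.numel() == 0:
        return scores.sum() * 0.0                 # degenerate group: no contrast signal
    pos_logit = (pos - margin) / tau              # additive margin on the positive logit
    neg_logits = (neg / tau).expand(pos.numel(), -1)            # (P, N)
    logits = torch.cat([pos_logit.unsqueeze(1), neg_logits], 1)  # (P, 1 + N)
    # Each positive is contrasted against all negatives from its own group.
    return -(pos_logit - torch.logsumexp(logits, dim=1)).mean()
```

In a full trainer this would be averaged over the groups in a batch, with `margin` supplied by a curriculum schedule like the one sketched under the ledger below.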

If this is right

  • Credit assignment becomes sensitive to the magnitude of score gaps between positive and negative rollouts within the same group.
  • Optimized scores align directly with the likelihoods used during autoregressive generation instead of surrogate clipped ratios.
  • The curriculum margin allows training to begin with coarse positive-negative ordering and progressively demand stronger separation.
  • The method produces measurable gains on challenging mathematical reasoning benchmarks across varied model scales and training sets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same contrastive reformulation could be applied to other RLVR variants that currently rely on ratio-based surrogates.
  • Group size and composition may affect the stability of the InfoNCE loss, suggesting a need to study optimal batching strategies.
  • The approach might extend naturally to domains with verifiable rewards beyond mathematics, such as code generation or theorem proving.
  • Because the loss amplifies updates on poorly separated positives, it could reduce the number of training steps needed to reach a target performance level.

Load-bearing premise

That replacing clipped ratio-based surrogate scores with length-normalized sequence log-probabilities and optimizing a group-wise InfoNCE objective will produce stable, superior credit assignment without introducing new optimization instabilities or sensitivity to group composition.

What would settle it

A controlled head-to-head in which ConSPO and GRPO are trained on the same data and models: no accuracy gain for ConSPO, or outright training collapse, on a standard mathematical reasoning benchmark such as GSM8K or MATH would refute the core claim, while consistent gains would support it.

Figures

Figures reproduced from arXiv: 2605.12969 by Feng Zhang, Guanjun Jiang, Jianfei Zhao, Xi Leng, Xinhong Ma, Xin Sun, Yang Yang, Ziqiang Dong.

Figure 1: Gradient comparison and training dynamics of ConSPO. Left: score-level gradient comparison between GRPO and ConSPO; GRPO assigns credit using group-level statistics and ignores relative score gaps, whereas ConSPO assigns contrast-sensitive credit according to relative rollout scores (P_i^+ and P_{ij}^- are defined in Eq. (11)). Right: training reward curves of GRPO and ConSPO.
Figure 2: Parameter study of ConSPO. The contrastive temperature τ, target margin M, and margin warmup ratio α are varied, with average performance reported on seven reasoning benchmarks. Replacing length-normalized log-likelihood scores with clipped importance sampling ratio scores also decreases the average score to 43.4, confirming the importance of aligning rollout scoring with autoregressive generation likelihoods.
Figure 3: Prompt template for rollout generation and evaluation.
Original abstract

RLVR has become a widely adopted paradigm for improving LLMs' reasoning capabilities, and GRPO is one of its most representative algorithms. In this paper, we first show that GRPO admits an equivalent discriminative reformulation as a weighted positive-negative score difference. Under this view, GRPO increases sequence-level scores of verified positive rollouts and decreases those of negative rollouts, where the scores are averages of clipped token-level importance sampling ratios. This reformulation reveals two structural limitations of GRPO: likelihood-misaligned scoring, where clipped ratio-based surrogate scores are optimized instead of generation likelihoods, and score-insensitive credit assignment, where rollout-level credit is assigned without accounting for relative score gaps between positive and negative rollouts in the same group. To address these limitations, we propose ConSPO, a framework for Contrastive Sequence-level Policy Optimization in RLVR. ConSPO replaces GRPO's clipped ratio-based scores with length-normalized sequence log-probabilities, aligning the optimized rollout scores with the likelihoods used in autoregressive generation. It then optimizes a group-wise InfoNCE-style objective that contrasts each positive rollout against negative distractors from the same group, enabling credit assignment to depend on their relative scores. This contrastive formulation amplifies updates for poorly separated positives while concentrating suppressive updates on high-scoring negatives. Moreover, ConSPO introduces a curriculum-scheduled margin, guiding optimization from coarse positive-negative ordering in early training toward stronger separation in later stages. Extensive evaluations across diverse backbone models, parameter scales, and training datasets show that ConSPO consistently outperforms several strong RLVR baselines on challenging mathematical reasoning benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper reformulates GRPO in RLVR as an equivalent weighted positive-negative score difference using clipped token-level importance sampling ratios, identifies two limitations (likelihood-misaligned scoring and score-insensitive credit assignment), and proposes ConSPO. ConSPO replaces the surrogate scores with length-normalized sequence log-probabilities and optimizes a group-wise InfoNCE-style contrastive objective (with a curriculum-scheduled margin) that contrasts verified positives against negatives from the same rollout group. Extensive experiments across backbone models, scales, and datasets are reported to show consistent outperformance over strong RLVR baselines on mathematical reasoning benchmarks.

Significance. If the results and derivations hold, the contrastive reformulation provides a useful discriminative lens on RLVR algorithms and a more likelihood-aligned, relative-score-sensitive alternative to GRPO-style methods. This could improve credit assignment and optimization stability for verifiable-reward training of LLMs on reasoning tasks, with the curriculum margin offering a practical way to control separation strength over training.

major comments (1)
  1. [§3 (ConSPO objective and InfoNCE formulation)] The group-wise InfoNCE objective (described in the abstract and §3) contrasts each positive against negatives from the same rollout group using length-normalized log-probabilities. With typical group sizes of 8-16 rollouts containing only 0-4 verified positives, the denominator is frequently dominated by high-scoring negatives when positives are scarce; the curriculum margin modulates separation but provides no normalization for group cardinality or intra-group variance. No derivation shows the resulting gradient is unbiased or stable across the observed distribution of group compositions, so the reported gains could partly reflect favorable group statistics rather than intrinsic superiority of the reformulation.
minor comments (2)
  1. [Abstract and §3] The abstract and method description would benefit from explicit equations for the length-normalized log-probability score and the exact InfoNCE loss (including how the margin is scheduled), to make the contrast with GRPO's clipped-ratio surrogate fully transparent.
  2. [Experiments] The experiments section should include an ablation or analysis table on performance sensitivity to group size and positive-count distribution, as this directly tests the robustness claim.

Simulated Author's Rebuttal

1 response · 0 unresolved

We are grateful to the referee for the thorough review and valuable suggestions. Our response to the major comment is provided below, along with planned revisions to the manuscript.

Point-by-point responses
  1. Referee: [§3 (ConSPO objective and InfoNCE formulation)] The group-wise InfoNCE objective (described in the abstract and §3) contrasts each positive against negatives from the same rollout group using length-normalized log-probabilities. With typical group sizes of 8-16 rollouts containing only 0-4 verified positives, the denominator is frequently dominated by high-scoring negatives when positives are scarce; the curriculum margin modulates separation but provides no normalization for group cardinality or intra-group variance. No derivation shows the resulting gradient is unbiased or stable across the observed distribution of group compositions, so the reported gains could partly reflect favorable group statistics rather than intrinsic superiority of the reformulation.

    Authors: We thank the referee for this insightful comment on the ConSPO objective. We agree that the InfoNCE formulation could benefit from further analysis of its gradient properties under different group compositions. Although we do not claim or derive unbiasedness in the traditional RL sense (as the objective is a contrastive surrogate rather than a direct policy gradient), the empirical results indicate stable training and consistent gains. The length-normalized log-probabilities provide better alignment, and the group-wise contrast allows relative scoring. To address this, we will revise Section 3 to include a brief analysis of the gradient behavior based on our observations and add ablations on varying group sizes and positive ratios in the experiments section. This will clarify that the gains are not solely due to favorable statistics. revision: yes
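A toy numeric illustration (ours, not from the paper or the rebuttal) of the group-composition concern debated above: at a fixed positive-negative gap, the InfoNCE denominator accumulates mass with every added negative, so the loss scale drifts with group size.

```python
import torch

def toy_infonce(n_neg, gap=0.5, tau=0.1):
    """Loss for one positive at score 0 against n_neg negatives at score -gap."""
    logits = torch.cat([torch.zeros(1), torch.full((n_neg,), -gap)]) / tau
    return float(torch.logsumexp(logits, dim=0))  # equals -log p(positive)

for n in (3, 7, 15):
    print(n, round(toy_infonce(n), 4))
# ~0.020, 0.046, 0.096: the loss grows with group cardinality even though
# every pairwise gap is identical, so updates are not group-size invariant
# unless explicitly normalized.
```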

Circularity Check

0 steps flagged

No circularity: reformulation and new objective are independently defined

full rationale

The paper first presents an algebraic reformulation of GRPO as a weighted positive-negative score difference, then defines ConSPO by substituting length-normalized sequence log-probabilities for clipped ratios and adopting a group-wise InfoNCE objective plus curriculum margin. These substitutions are explicit design choices stated in the abstract and are not obtained by fitting parameters to the same success metrics later reported; the derivation chain therefore remains self-contained and does not reduce any claimed result to its own inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The method rests on standard assumptions from contrastive learning and RL, plus one new scheduled hyperparameter.

free parameters (1)
  • curriculum-scheduled margin
    A margin value that increases over training to control the strength of positive-negative separation.
axioms (1)
  • domain assumption Group-wise contrastive loss assigns credit proportionally to relative score gaps between positive and negative rollouts.
    Invoked when stating that the InfoNCE-style objective enables score-sensitive credit assignment.
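
For concreteness, a minimal sketch of one plausible schedule for this free parameter, assuming a linear warmup to a target margin M over a warmup fraction α as suggested by the knobs varied in Figure 2; the paper's actual schedule may differ:

```python
def curriculum_margin(step, total_steps, target_margin=0.5, warmup_ratio=0.3):
    """Linearly anneal the contrastive margin from 0 to target_margin over the
    first warmup_ratio fraction of training, then hold it fixed (a sketch
    under assumed semantics of M and alpha, not the paper's exact schedule)."""
    warmup_steps = max(1, int(warmup_ratio * total_steps))
    return target_margin * min(1.0, step / warmup_steps)
```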

pith-pipeline@v0.9.0 · 5615 in / 1215 out tokens · 48773 ms · 2026-05-14T20:16:55.593487+00:00 · methodology


Reference graph

Works this paper leans on

44 extracted references · 20 canonical work pages · 11 internal anchors

  1. [1]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

  2. [2]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

  3. [3]

Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement

An Yang, Beichen Zhang, Binyuan Hui, Bofei Gao, Bowen Yu, Chengpeng Li, Dayiheng Liu, Jianhong Tu, Jingren Zhou, Junyang Lin, et al. Qwen2.5-Math technical report: Toward mathematical expert model via self-improvement. arXiv preprint arXiv:2409.12122, 2024

  4. [4]

    Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

  5. [5]

    Kimi-VL Technical Report

Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, et al. Kimi-vl technical report. arXiv preprint arXiv:2504.07491, 2025

  6. [6]

Does reinforcement learning really incentivize reasoning capacity in LLMs beyond the base model?

Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Yang Yue, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in LLMs beyond the base model? In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=4OsgYD7em5

  7. [7]

Adhint: Adaptive hints with difficulty priors for reinforcement learning

    Feng Zhang, Zezhong Tan, Xinhong Ma, Ziqiang Dong, Xi Leng, Jianfei Zhao, Xin Sun, and Yang Yang. Adhint: Adaptive hints with difficulty priors for reinforcement learning, 2026. URL https://arxiv.org/abs/2512.13095

  8. [8]

    Geometric-mean policy optimization

Yuzhong Zhao, Yue Liu, Junpeng Liu, Jingye Chen, Xun Wu, Yaru Hao, Tengchao Lv, Shaohan Huang, Lei Cui, Qixiang Ye, Fang Wan, and Furu Wei. Geometric-mean policy optimization. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=nCEs0tSwc2

  9. [9]

    Group Sequence Policy Optimization

Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, Jingren Zhou, and Junyang Lin. Group sequence policy optimization, 2025. URL https://arxiv.org/abs/2507.18071

  10. [10]

    Soft Adaptive Policy Optimization

Chang Gao, Chujie Zheng, Xiong-Hui Chen, Kai Dang, Shixuan Liu, Bowen Yu, An Yang, Shuai Bai, Jingren Zhou, and Junyang Lin. Soft adaptive policy optimization. arXiv preprint arXiv:2511.20347, 2025

  11. [11]

    Hapo: Training language models to reason concisely via history-aware policy optimization

Chengyu Huang, Zhengxin Zhang, and Claire Cardie. Hapo: Training language models to reason concisely via history-aware policy optimization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 31122–31130, 2026

  12. [12]

    Shorterbetter: Guiding reasoning models to find optimal inference length for efficient reasoning

Jingyang Yi, Justin Wang, and Sida Li. Shorterbetter: Guiding reasoning models to find optimal inference length for efficient reasoning. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=MJvwM5dBZM

  13. [13]

Towards flash thinking via decoupled advantage policy optimization

Zezhong Tan, Hang Gao, Xinhong Ma, Feng Zhang, and Ziqiang Dong. Towards flash thinking via decoupled advantage policy optimization. arXiv preprint arXiv:2510.15374, 2025

  14. [14]

    Search-r1: Training LLMs to reason and leverage search engines with reinforcement learning

Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan O Arik, Dong Wang, Hamed Zamani, and Jiawei Han. Search-r1: Training LLMs to reason and leverage search engines with reinforcement learning. In Second Conference on Language Modeling, 2025. URL https://openreview.net/forum?id=Rwhi91ideu

  15. [15]

Research: Learning to reason with search for LLMs via reinforcement learning

Mingyang Chen, Linzhuang Sun, Tianpeng Li, sunhaoze, ZhouYijie, Chenzheng Zhu, Haofen Wang, Jeff Z. Pan, Wen Zhang, Huajun Chen, Fan Yang, Zenan Zhou, and Weipeng Chen. Research: Learning to reason with search for LLMs via reinforcement learning. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URL https://openreview....

  16. [16]

Agent-rlvr: Training software engineering agents via guidance and environment rewards

Jeff Da, Clinton Wang, Xiang Deng, Yuntao Ma, Nikhil Barhate, and Sean Hendryx. Agent-rlvr: Training software engineering agents via guidance and environment rewards. arXiv preprint arXiv:2506.11425, 2025

  17. [17]

    PruneTIR: Inference-Time Tool Call Pruning for Effective yet Efficient Tool-Integrated Reasoning

Luan Zhang, Dandan Song, Zhijing Wu, Zhengyu Chen, Chen Zhang, Yuhang Tian, Huipeng Ma, Chenhao Li, Changzhi Zhou, Xudong Li, and Shuhao Zhang. Prunetir: Inference-time tool call pruning for effective yet efficient tool-integrated reasoning, 2026. URL https://arxiv.org/abs/2605.09931

  18. [18]

Trust region preference approximation: A simple and stable reinforcement learning algorithm for llm reasoning

Xuerui Su, Shufang Xie, Guoqing Liu, Yingce Xia, Renqian Luo, Peiran Jin, Zhiming Ma, Yue Wang, Zun Wang, and Yuting Liu. Trust region preference approximation: A simple and stable reinforcement learning algorithm for llm reasoning. arXiv preprint arXiv:2504.04524, 2025

  19. [19]

Clipo: Contrastive learning in policy optimization generalizes rlvr

Sijia Cui, Pengyu Cheng, Jiajun Song, Yongbo Gai, Guojun Zhang, Zhechao Yu, Jianhe Lin, Xiaoxi Jiang, and Guanjun Jiang. Clipo: Contrastive learning in policy optimization generalizes rlvr. arXiv preprint arXiv:2603.10101, 2026

  20. [20]

Simpo: Simple preference optimization with a reference-free reward

Yu Meng, Mengzhou Xia, and Danqi Chen. Simpo: Simple preference optimization with a reference-free reward. Advances in Neural Information Processing Systems, 37:124198–124235, 2024

  21. [21]

    DAPO: An open-source LLM reinforcement learning system at scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, YuYue, Weinan Dai, Tiantian Fan, Gaohong Liu, Juncai Liu, LingJun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Ru Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Yuxuan Song, Xiangpeng Wei, Hao ...

  22. [22]

    Online difficulty filtering for reasoning oriented reinforcement learning

Sanghwan Bae, Jiwoo Hong, Min Young Lee, Hanbyul Kim, JeongYeon Nam, and Donghyun Kwak. Online difficulty filtering for reasoning oriented reinforcement learning. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 700–719, 2026

  23. [23]

    Vcrl: Variance-based curriculum reinforcement learning for large language models

    Guochao Jiang, Wenfeng Feng, Guofeng Quan, Chuzhan Hao, Yuewei Zhang, Guohua Liu, and Hao Wang. Vcrl: Variance-based curriculum reinforcement learning for large language models. arXiv preprint arXiv:2509.19803, 2025

  24. [24]

    Representation Learning with Contrastive Predictive Coding

Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018

  25. [25]

    Curriculum learning

Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, pages 41–48, 2009

  26. [26]

A survey on curriculum learning

Xin Wang, Yudong Chen, and Wenwu Zhu. A survey on curriculum learning. IEEE transactions on pattern analysis and machine intelligence, 44(9):4555–4576, 2021

  27. [27]

    Understanding r1-zero-like training: A critical perspective

Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A critical perspective. In Second Conference on Language Modeling, 2025. URL https://openreview.net/forum?id=5PAF7PAY2Y

  28. [28]

    MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention

Aili Chen, Aonian Li, Bangwei Gong, Binyang Jiang, Bo Fei, Bo Yang, Boji Shan, Changqing Yu, Chao Wang, Cheng Zhu, et al. Minimax-m1: Scaling test-time compute efficiently with lightning attention. arXiv preprint arXiv:2506.13585, 2025

  29. [29]

    DisCO: Reinforcing large reasoning models with discriminative constrained optimization

Gang Li, Ming Lin, Tomer Galanti, Zhengzhong Tu, and Tianbao Yang. DisCO: Reinforcing large reasoning models with discriminative constrained optimization. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=zzUXS4f91r

  30. [30]

    A simple framework for contrastive learning of visual representations

Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pages 1597–1607. PMLR, 2020

  31. [31]

    Momentum contrast for unsupervised visual representation learning

Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9729–9738, 2020

  32. [32]

    Direct preference optimization: Your language model is secretly a reward model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems, 36:53728–53741, 2023

  33. [33]

Contrastive preference optimization: Pushing the boundaries of llm performance in machine translation

Haoran Xu, Amr Sharaf, Yunmo Chen, Weiting Tan, Lingfeng Shen, Benjamin Van Durme, Kenton Murray, and Young Jin Kim. Contrastive preference optimization: Pushing the boundaries of llm performance in machine translation. arXiv preprint arXiv:2401.08417, 2024

  34. [34]

    The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

  35. [35]

Deepscaler: Effective RL scaling of reasoning models via iterative context lengthening

Sijun Tan, Michael Luo, Justin Wong, Colin Cai, Xiaoxiang Shi, William Yuan Tang, Manan Roongta, Tianjun Zhang, Li Erran Li, Raluca Ada Popa, and Ion Stoica. Deepscaler: Effective RL scaling of reasoning models via iterative context lengthening, 2026. URL https://openreview.net/forum?id=I6GzDCne7U

  36. [36]

Numinamath: The largest public dataset in ai4maths with 860k pairs of competition math problems and solutions

Jia Li, Edward Beeching, Lewis Tunstall, Ben Lipkin, Roman Soletskyi, Shengyi Huang, Kashif Rasul, Longhui Yu, Albert Q Jiang, Ziju Shen, et al. Numinamath: The largest public dataset in ai4maths with 860k pairs of competition math problems and solutions. Hugging Face repository, 13(9):9, 2024

  37. [37]

    Omni-MATH: A universal olympiad level mathematic benchmark for large language models

Bofei Gao, Feifan Song, Zhe Yang, Zefan Cai, Yibo Miao, Qingxiu Dong, Lei Li, Chenghao Ma, Liang Chen, Runxin Xu, Zhengyang Tang, Benyou Wang, Daoguang Zan, Shanghaoran Quan, Ge Zhang, Lei Sha, Yichang Zhang, Xuancheng Ren, Tianyu Liu, and Baobao Chang. Omni-MATH: A universal olympiad level mathematic benchmark for large language models. In The Thirteenth ...

  38. [38]

Imitate, explore, and self-improve: A reproduction report on slow-thinking reasoning systems

Yingqian Min, Zhipeng Chen, Jinhao Jiang, Jie Chen, Jia Deng, Yiwen Hu, Yiru Tang, Jiapeng Wang, Xiaoxue Cheng, Huatong Song, et al. Imitate, explore, and self-improve: A reproduction report on slow-thinking reasoning systems. arXiv preprint arXiv:2412.09413, 2024

  39. [39]

Matharena: Evaluating llms on uncontaminated math competitions

Mislav Balunović, Jasper Dekoninck, Ivo Petrov, Nikola Jovanović, and Martin Vechev. Matharena: Evaluating llms on uncontaminated math competitions, February 2025. URL https://matharena.ai/

  40. [40]

    Measuring mathematical problem solving with the math dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021

  41. [41]

    Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems

Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, et al. Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Paper...

  42. [42]

Ghpo: Adaptive guidance for stable and efficient llm reinforcement learning

Ziru Liu, Cheng Gong, Xinyu Fu, Yaofang Liu, Ran Chen, Shoubo Hu, Suiyun Zhang, Rui Liu, Qingfu Zhang, and Dandan Tu. Ghpo: Adaptive guidance for stable and efficient llm reinforcement learning. arXiv preprint arXiv:2507.10628, 2025

  43. [43]

    Learning to reason under off-policy guidance

Jianhao Yan, Yafu Li, Zican Hu, Zhi Wang, Ganqu Cui, Xiaoye Qu, Yu Cheng, and Yue Zhang. Learning to reason under off-policy guidance. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URL https://openreview.net/forum?id=vO8LLoNWWk

  44. [44]

    Detecting hallucination in large language models through deep internal representation analysis

Luan Zhang, Dandan Song, Zhijing Wu, Yuhang Tian, Changzhi Zhou, Jing Xu, Ziyi Yang, and Shuhao Zhang. Detecting hallucination in large language models through deep internal representation analysis. In Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence, IJCAI-25, pages 8357–8365, 2025