pith. machine review for the scientific record.

arxiv: 2605.12969 · v1 · submitted 2026-05-13 · 💻 cs.LG · cs.AI

Recognition: no theorem link

Revisiting Reinforcement Learning with Verifiable Rewards from a Contrastive Perspective

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 20:16 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords Reinforcement Learning from Verifiable Rewards · Contrastive Policy Optimization · GRPO · LLM Reasoning · InfoNCE Objective · Mathematical Reasoning Benchmarks

The pith

ConSPO replaces GRPO's clipped ratios with length-normalized log-probabilities and a group-wise InfoNCE objective to improve credit assignment in LLM reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper first reformulates GRPO as a weighted positive-negative score difference that relies on clipped token-level importance sampling ratios. This reveals two limitations: the scores being optimized are misaligned with actual generation likelihoods, and credit assignment ignores relative score gaps within each group of rollouts. ConSPO corrects both by using length-normalized sequence log-probabilities as the score and optimizing a contrastive InfoNCE loss that contrasts each positive rollout against negative ones from the same group. A curriculum-scheduled margin further guides the optimization from coarse separation early on to tighter separation later. Evaluations across multiple models, scales, and datasets show consistent gains on mathematical reasoning benchmarks.
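To fix notation, here is one plausible rendering of the two ingredients; the symbols and the placement of the margin are our reconstruction from the abstract, not the paper's own Eq. (11):

```latex
% Length-normalized sequence score for a rollout y given prompt x
s_\theta(y \mid x) = \frac{1}{|y|} \sum_{t=1}^{|y|} \log \pi_\theta\!\left(y_t \mid x, y_{<t}\right)

% Group-wise InfoNCE over positives P and negatives N from one rollout group,
% with temperature \tau and curriculum-scheduled margin m
\mathcal{L}_{\mathrm{ConSPO}} = -\frac{1}{|P|} \sum_{i \in P} \log
\frac{\exp\!\left(\left(s_\theta(y_i) - m\right)/\tau\right)}
     {\exp\!\left(\left(s_\theta(y_i) - m\right)/\tau\right) + \sum_{j \in N} \exp\!\left(s_\theta(y_j)/\tau\right)}
```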

Core claim

ConSPO is a contrastive sequence-level policy optimization method for RLVR that aligns optimized scores with autoregressive likelihoods via length-normalized log-probabilities and applies a group-wise InfoNCE objective to make credit assignment depend on relative positive-negative gaps, together with a curriculum margin that increases separation strength over training.

What carries the argument

The group-wise InfoNCE-style objective applied to length-normalized sequence log-probabilities, which replaces clipped ratio-based surrogate scores to enable relative, score-gap-aware credit assignment within rollout groups.
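A minimal PyTorch-style sketch of that carrier, assuming per-rollout token log-probabilities have already been gathered; the function name and defaults are ours, and the paper's exact loss may differ in details:

```python
import torch

def conspo_loss(logps, lengths, is_positive, tau=0.1, margin=0.0):
    """Group-wise InfoNCE over one rollout group (a sketch, not the paper's Eq. (11)).

    logps:       (G,) summed token log-probs of each rollout under the policy
    lengths:     (G,) token counts, used for length normalization
    is_positive: (G,) bool mask of verifier-accepted rollouts
    """
    scores = logps / lengths                      # length-normalized sequence scores
    pos, neg = scores[is_positive], scores[~is_positive]
    if pos.numel() == 0 or neg.numel() == 0:
        return scores.sum() * 0.0                 # degenerate group: no contrast signal
    pos_logit = (pos - margin) / tau              # additive margin on the positive logit
    neg_logits = (neg / tau).expand(pos.numel(), -1)            # (P, N)
    logits = torch.cat([pos_logit.unsqueeze(1), neg_logits], 1)  # (P, 1 + N)
    # Each positive is contrasted against all negatives from its own group.
    return -(pos_logit - torch.logsumexp(logits, dim=1)).mean()
```

In a full trainer this would be averaged over the groups in a batch, with `margin` supplied by a curriculum schedule like the one sketched under the ledger below.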

If this is right

  • Credit assignment becomes sensitive to the magnitude of score gaps between positive and negative rollouts within the same group.
  • Optimized scores align directly with the likelihoods used during autoregressive generation instead of surrogate clipped ratios.
  • The curriculum margin allows training to begin with coarse positive-negative ordering and progressively demand stronger separation.
  • The method produces measurable gains on challenging mathematical reasoning benchmarks across varied model scales and training sets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same contrastive reformulation could be applied to other RLVR variants that currently rely on ratio-based surrogates.
  • Group size and composition may affect the stability of the InfoNCE loss, suggesting a need to study optimal batching strategies.
  • The approach might extend naturally to domains with verifiable rewards beyond mathematics, such as code generation or theorem proving.
  • Because the loss amplifies updates on poorly separated positives, it could reduce the number of training steps needed to reach a target performance level.

Load-bearing premise

That replacing clipped ratio-based surrogate scores with length-normalized sequence log-probabilities and optimizing a group-wise InfoNCE objective will produce stable, superior credit assignment without introducing new optimization instabilities or sensitivity to group composition.

What would settle it

A controlled head-to-head in which ConSPO and GRPO are trained on the same data and models: no accuracy gain for ConSPO, or outright training collapse, on a standard mathematical reasoning benchmark such as GSM8K or MATH would refute the core claim, while consistent gains would support it.

Figures

Figures reproduced from arXiv: 2605.12969 by Feng Zhang, Guanjun Jiang, Jianfei Zhao, Xi Leng, Xinhong Ma, Xin Sun, Yang Yang, Ziqiang Dong.

Figure 1: Gradient comparison and training dynamics of ConSPO. Left: score-level gradient comparison between GRPO and ConSPO; GRPO assigns credit using group-level statistics and ignores relative score gaps, whereas ConSPO assigns contrast-sensitive credit according to relative rollout scores (P_i^+ and P_{ij}^- are defined in Eq. (11)). Right: training reward curves of GRPO and ConSPO.
Figure 2: Parameter study of ConSPO. The contrastive temperature τ, target margin M, and margin warmup ratio α are varied, with average performance reported on seven reasoning benchmarks. Replacing length-normalized log-likelihood scores with clipped importance sampling ratio scores also decreases the average score to 43.4, confirming the importance of aligning rollout scoring with autoregressive generation likelihoods.
Figure 3: Prompt template for rollout generation and evaluation.
Original abstract

RLVR has become a widely adopted paradigm for improving LLMs' reasoning capabilities, and GRPO is one of its most representative algorithms. In this paper, we first show that GRPO admits an equivalent discriminative reformulation as a weighted positive-negative score difference. Under this view, GRPO increases sequence-level scores of verified positive rollouts and decreases those of negative rollouts, where the scores are averages of clipped token-level importance sampling ratios. This reformulation reveals two structural limitations of GRPO: likelihood-misaligned scoring, where clipped ratio-based surrogate scores are optimized instead of generation likelihoods, and score-insensitive credit assignment, where rollout-level credit is assigned without accounting for relative score gaps between positive and negative rollouts in the same group. To address these limitations, we propose ConSPO, a framework for Contrastive Sequence-level Policy Optimization in RLVR. ConSPO replaces GRPO's clipped ratio-based scores with length-normalized sequence log-probabilities, aligning the optimized rollout scores with the likelihoods used in autoregressive generation. It then optimizes a group-wise InfoNCE-style objective that contrasts each positive rollout against negative distractors from the same group, enabling credit assignment to depend on their relative scores. This contrastive formulation amplifies updates for poorly separated positives while concentrating suppressive updates on high-scoring negatives. Moreover, ConSPO introduces a curriculum-scheduled margin, guiding optimization from coarse positive-negative ordering in early training toward stronger separation in later stages. Extensive evaluations across diverse backbone models, parameter scales, and training datasets show that ConSPO consistently outperforms several strong RLVR baselines on challenging mathematical reasoning benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper reformulates GRPO in RLVR as an equivalent weighted positive-negative score difference using clipped token-level importance sampling ratios, identifies two limitations (likelihood-misaligned scoring and score-insensitive credit assignment), and proposes ConSPO. ConSPO replaces the surrogate scores with length-normalized sequence log-probabilities and optimizes a group-wise InfoNCE-style contrastive objective (with a curriculum-scheduled margin) that contrasts verified positives against negatives from the same rollout group. Extensive experiments across backbone models, scales, and datasets are reported to show consistent outperformance over strong RLVR baselines on mathematical reasoning benchmarks.

Significance. If the results and derivations hold, the contrastive reformulation provides a useful discriminative lens on RLVR algorithms and a more likelihood-aligned, relative-score-sensitive alternative to GRPO-style methods. This could improve credit assignment and optimization stability for verifiable-reward training of LLMs on reasoning tasks, with the curriculum margin offering a practical way to control separation strength over training.

major comments (1)
  1. [§3 (ConSPO objective and InfoNCE formulation)] The group-wise InfoNCE objective (described in the abstract and §3) contrasts each positive against negatives from the same rollout group using length-normalized log-probabilities. With typical group sizes of 8-16 rollouts containing only 0-4 verified positives, the denominator is frequently dominated by high-scoring negatives when positives are scarce; the curriculum margin modulates separation but provides no normalization for group cardinality or intra-group variance. No derivation shows the resulting gradient is unbiased or stable across the observed distribution of group compositions, so the reported gains could partly reflect favorable group statistics rather than intrinsic superiority of the reformulation.
minor comments (2)
  1. [Abstract and §3] The abstract and method description would benefit from explicit equations for the length-normalized log-probability score and the exact InfoNCE loss (including how the margin is scheduled), to make the contrast with GRPO's clipped-ratio surrogate fully transparent.
  2. [Experiments] The experiments section should include an ablation or analysis table on performance sensitivity to group size and positive-count distribution, as this directly tests the robustness claim.

Simulated Author's Rebuttal

1 response · 0 unresolved

We are grateful to the referee for the thorough review and valuable suggestions. Our response to the major comment is provided below, along with planned revisions to the manuscript.

Point-by-point responses
  1. Referee: [§3 (ConSPO objective and InfoNCE formulation)] The group-wise InfoNCE objective (described in the abstract and §3) contrasts each positive against negatives from the same rollout group using length-normalized log-probabilities. With typical group sizes of 8-16 rollouts containing only 0-4 verified positives, the denominator is frequently dominated by high-scoring negatives when positives are scarce; the curriculum margin modulates separation but provides no normalization for group cardinality or intra-group variance. No derivation shows the resulting gradient is unbiased or stable across the observed distribution of group compositions, so the reported gains could partly reflect favorable group statistics rather than intrinsic superiority of the reformulation.

    Authors: We thank the referee for this insightful comment on the ConSPO objective. We agree that the InfoNCE formulation could benefit from further analysis of its gradient properties under different group compositions. Although we do not claim or derive unbiasedness in the traditional RL sense (as the objective is a contrastive surrogate rather than a direct policy gradient), the empirical results indicate stable training and consistent gains. The length-normalized log-probabilities provide better alignment, and the group-wise contrast allows relative scoring. To address this, we will revise Section 3 to include a brief analysis of the gradient behavior based on our observations and add ablations on varying group sizes and positive ratios in the experiments section. This will clarify that the gains are not solely due to favorable statistics. revision: yes
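A toy numeric illustration (ours, not from the paper or the rebuttal) of the group-composition concern debated above: at a fixed positive-negative gap, the InfoNCE denominator accumulates mass with every added negative, so the loss scale drifts with group size.

```python
import torch

def toy_infonce(n_neg, gap=0.5, tau=0.1):
    """Loss for one positive at score 0 against n_neg negatives at score -gap."""
    logits = torch.cat([torch.zeros(1), torch.full((n_neg,), -gap)]) / tau
    return float(torch.logsumexp(logits, dim=0))  # equals -log p(positive)

for n in (3, 7, 15):
    print(n, round(toy_infonce(n), 4))
# ~0.020, 0.046, 0.096: the loss grows with group cardinality even though
# every pairwise gap is identical, so updates are not group-size invariant
# unless explicitly normalized.
```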

Circularity Check

0 steps flagged

No circularity: reformulation and new objective are independently defined

full rationale

The paper first presents an algebraic reformulation of GRPO as a weighted positive-negative score difference, then defines ConSPO by substituting length-normalized sequence log-probabilities for clipped ratios and adopting a group-wise InfoNCE objective plus curriculum margin. These substitutions are explicit design choices stated in the abstract and are not obtained by fitting parameters to the same success metrics later reported; the derivation chain therefore remains self-contained and does not reduce any claimed result to its own inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The method rests on standard assumptions from contrastive learning and RL, plus one new scheduled hyperparameter.

free parameters (1)
  • curriculum-scheduled margin
    A margin value that increases over training to control the strength of positive-negative separation.
axioms (1)
  • domain assumption Group-wise contrastive loss assigns credit proportionally to relative score gaps between positive and negative rollouts.
    Invoked when stating that the InfoNCE-style objective enables score-sensitive credit assignment.
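
For concreteness, a minimal sketch of one plausible schedule for this free parameter, assuming a linear warmup to a target margin M over a warmup fraction α as suggested by the knobs varied in Figure 2; the paper's actual schedule may differ:

```python
def curriculum_margin(step, total_steps, target_margin=0.5, warmup_ratio=0.3):
    """Linearly anneal the contrastive margin from 0 to target_margin over the
    first warmup_ratio fraction of training, then hold it fixed (a sketch
    under assumed semantics of M and alpha, not the paper's exact schedule)."""
    warmup_steps = max(1, int(warmup_ratio * total_steps))
    return target_margin * min(1.0, step / warmup_steps)
```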

pith-pipeline@v0.9.0 · 5615 in / 1215 out tokens · 48773 ms · 2026-05-14T20:16:55.593487+00:00 · methodology


Reference graph

Works this paper leans on

44 extracted references · 20 canonical work pages · 11 internal anchors

  1. [1]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

  2. [2]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

  3. [3]

Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement

An Yang, Beichen Zhang, Binyuan Hui, Bofei Gao, Bowen Yu, Chengpeng Li, Dayiheng Liu, Jianhong Tu, Jingren Zhou, Junyang Lin, et al. Qwen2.5-Math technical report: Toward mathematical expert model via self-improvement. arXiv preprint arXiv:2409.12122, 2024

  4. [4]

    Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

  5. [5]

    Kimi-VL Technical Report

Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, et al. Kimi-vl technical report. arXiv preprint arXiv:2504.07491, 2025

  6. [6]

Does reinforcement learning really incentivize reasoning capacity in LLMs beyond the base model?

Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Yang Yue, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in LLMs beyond the base model? In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=4OsgYD7em5

  7. [7]

Adhint: Adaptive hints with difficulty priors for reinforcement learning

    Feng Zhang, Zezhong Tan, Xinhong Ma, Ziqiang Dong, Xi Leng, Jianfei Zhao, Xin Sun, and Yang Yang. Adhint: Adaptive hints with difficulty priors for reinforcement learning, 2026. URL https://arxiv.org/abs/2512.13095

  8. [8]

    Geometric-mean policy optimization

Yuzhong Zhao, Yue Liu, Junpeng Liu, Jingye Chen, Xun Wu, Yaru Hao, Tengchao Lv, Shaohan Huang, Lei Cui, Qixiang Ye, Fang Wan, and Furu Wei. Geometric-mean policy optimization. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=nCEs0tSwc2

  9. [9]

    Group Sequence Policy Optimization

Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, Jingren Zhou, and Junyang Lin. Group sequence policy optimization, 2025. URL https://arxiv.org/abs/2507.18071

  10. [10]

    Soft Adaptive Policy Optimization

Chang Gao, Chujie Zheng, Xiong-Hui Chen, Kai Dang, Shixuan Liu, Bowen Yu, An Yang, Shuai Bai, Jingren Zhou, and Junyang Lin. Soft adaptive policy optimization. arXiv preprint arXiv:2511.20347, 2025

  11. [11]

    Hapo: Training language models to reason concisely via history-aware policy optimization

Chengyu Huang, Zhengxin Zhang, and Claire Cardie. Hapo: Training language models to reason concisely via history-aware policy optimization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 31122–31130, 2026

  12. [12]

    Shorterbetter: Guiding reasoning models to find optimal inference length for efficient reasoning

Jingyang Yi, Justin Wang, and Sida Li. Shorterbetter: Guiding reasoning models to find optimal inference length for efficient reasoning. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=MJvwM5dBZM

  13. [13]

Towards flash thinking via decoupled advantage policy optimization

Zezhong Tan, Hang Gao, Xinhong Ma, Feng Zhang, and Ziqiang Dong. Towards flash thinking via decoupled advantage policy optimization. arXiv preprint arXiv:2510.15374, 2025

  14. [14]

    Search-r1: Training LLMs to reason and leverage search engines with reinforcement learning

Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan O Arik, Dong Wang, Hamed Zamani, and Jiawei Han. Search-r1: Training LLMs to reason and leverage search engines with reinforcement learning. In Second Conference on Language Modeling, 2025. URL https://openreview.net/forum?id=Rwhi91ideu

  15. [15]

Research: Learning to reason with search for LLMs via reinforcement learning

Mingyang Chen, Linzhuang Sun, Tianpeng Li, sunhaoze, ZhouYijie, Chenzheng Zhu, Haofen Wang, Jeff Z. Pan, Wen Zhang, Huajun Chen, Fan Yang, Zenan Zhou, and Weipeng Chen. Research: Learning to reason with search for LLMs via reinforcement learning. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URL https://openreview....

  16. [16]

Agent-rlvr: Training software engineering agents via guidance and environment rewards

Jeff Da, Clinton Wang, Xiang Deng, Yuntao Ma, Nikhil Barhate, and Sean Hendryx. Agent-rlvr: Training software engineering agents via guidance and environment rewards. arXiv preprint arXiv:2506.11425, 2025

  17. [17]

    PruneTIR: Inference-Time Tool Call Pruning for Effective yet Efficient Tool-Integrated Reasoning

Luan Zhang, Dandan Song, Zhijing Wu, Zhengyu Chen, Chen Zhang, Yuhang Tian, Huipeng Ma, Chenhao Li, Changzhi Zhou, Xudong Li, and Shuhao Zhang. Prunetir: Inference-time tool call pruning for effective yet efficient tool-integrated reasoning, 2026. URL https://arxiv.org/abs/2605.09931

  18. [18]

Trust region preference approximation: A simple and stable reinforcement learning algorithm for llm reasoning

Xuerui Su, Shufang Xie, Guoqing Liu, Yingce Xia, Renqian Luo, Peiran Jin, Zhiming Ma, Yue Wang, Zun Wang, and Yuting Liu. Trust region preference approximation: A simple and stable reinforcement learning algorithm for llm reasoning. arXiv preprint arXiv:2504.04524, 2025

  19. [19]

Clipo: Contrastive learning in policy optimization generalizes rlvr

Sijia Cui, Pengyu Cheng, Jiajun Song, Yongbo Gai, Guojun Zhang, Zhechao Yu, Jianhe Lin, Xiaoxi Jiang, and Guanjun Jiang. Clipo: Contrastive learning in policy optimization generalizes rlvr. arXiv preprint arXiv:2603.10101, 2026

  20. [20]

Simpo: Simple preference optimization with a reference-free reward

Yu Meng, Mengzhou Xia, and Danqi Chen. Simpo: Simple preference optimization with a reference-free reward. Advances in Neural Information Processing Systems, 37:124198–124235, 2024

  21. [21]

    DAPO: An open-source LLM reinforcement learning system at scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, YuYue, Weinan Dai, Tiantian Fan, Gaohong Liu, Juncai Liu, LingJun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Ru Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Yuxuan Song, Xiangpeng Wei, Hao ...

  22. [22]

    Online difficulty filtering for reasoning oriented reinforcement learning

Sanghwan Bae, Jiwoo Hong, Min Young Lee, Hanbyul Kim, JeongYeon Nam, and Donghyun Kwak. Online difficulty filtering for reasoning oriented reinforcement learning. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 700–719, 2026

  23. [23]

    Vcrl: Variance-based curriculum reinforcement learning for large language models

    Guochao Jiang, Wenfeng Feng, Guofeng Quan, Chuzhan Hao, Yuewei Zhang, Guohua Liu, and Hao Wang. Vcrl: Variance-based curriculum reinforcement learning for large language models. arXiv preprint arXiv:2509.19803, 2025

  24. [24]

    Representation Learning with Contrastive Predictive Coding

Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018

  25. [25]

    Curriculum learning

Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, pages 41–48, 2009

  26. [26]

A survey on curriculum learning

Xin Wang, Yudong Chen, and Wenwu Zhu. A survey on curriculum learning. IEEE transactions on pattern analysis and machine intelligence, 44(9):4555–4576, 2021

  27. [27]

    Understanding r1-zero-like training: A critical perspective

Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A critical perspective. In Second Conference on Language Modeling, 2025. URL https://openreview.net/forum?id=5PAF7PAY2Y

  28. [28]

    MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention

Aili Chen, Aonian Li, Bangwei Gong, Binyang Jiang, Bo Fei, Bo Yang, Boji Shan, Changqing Yu, Chao Wang, Cheng Zhu, et al. Minimax-m1: Scaling test-time compute efficiently with lightning attention. arXiv preprint arXiv:2506.13585, 2025

  29. [29]

    DisCO: Reinforcing large reasoning models with discriminative constrained optimization

Gang Li, Ming Lin, Tomer Galanti, Zhengzhong Tu, and Tianbao Yang. DisCO: Reinforcing large reasoning models with discriminative constrained optimization. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=zzUXS4f91r

  30. [30]

    A simple framework for contrastive learning of visual representations

Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pages 1597–1607. PMLR, 2020

  31. [31]

    Momentum contrast for unsupervised visual representation learning

Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9729–9738, 2020

  32. [32]

    Direct preference optimization: Your language model is secretly a reward model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems, 36:53728–53741, 2023

  33. [33]

Contrastive preference optimization: Pushing the boundaries of llm performance in machine translation

Haoran Xu, Amr Sharaf, Yunmo Chen, Weiting Tan, Lingfeng Shen, Benjamin Van Durme, Kenton Murray, and Young Jin Kim. Contrastive preference optimization: Pushing the boundaries of llm performance in machine translation. arXiv preprint arXiv:2401.08417, 2024

  34. [34]

    The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

  35. [35]

Deepscaler: Effective RL scaling of reasoning models via iterative context lengthening

Sijun Tan, Michael Luo, Justin Wong, Colin Cai, Xiaoxiang Shi, William Yuan Tang, Manan Roongta, Tianjun Zhang, Li Erran Li, Raluca Ada Popa, and Ion Stoica. Deepscaler: Effective RL scaling of reasoning models via iterative context lengthening, 2026. URL https://openreview.net/forum?id=I6GzDCne7U

  36. [36]

Numinamath: The largest public dataset in ai4maths with 860k pairs of competition math problems and solutions

Jia Li, Edward Beeching, Lewis Tunstall, Ben Lipkin, Roman Soletskyi, Shengyi Huang, Kashif Rasul, Longhui Yu, Albert Q Jiang, Ziju Shen, et al. Numinamath: The largest public dataset in ai4maths with 860k pairs of competition math problems and solutions. Hugging Face repository, 13(9):9, 2024

  37. [37]

    Omni-MATH: A universal olympiad level mathematic benchmark for large language models

Bofei Gao, Feifan Song, Zhe Yang, Zefan Cai, Yibo Miao, Qingxiu Dong, Lei Li, Chenghao Ma, Liang Chen, Runxin Xu, Zhengyang Tang, Benyou Wang, Daoguang Zan, Shanghaoran Quan, Ge Zhang, Lei Sha, Yichang Zhang, Xuancheng Ren, Tianyu Liu, and Baobao Chang. Omni-MATH: A universal olympiad level mathematic benchmark for large language models. In The Thirteenth ...

  38. [38]

Imitate, explore, and self-improve: A reproduction report on slow-thinking reasoning systems

Yingqian Min, Zhipeng Chen, Jinhao Jiang, Jie Chen, Jia Deng, Yiwen Hu, Yiru Tang, Jiapeng Wang, Xiaoxue Cheng, Huatong Song, et al. Imitate, explore, and self-improve: A reproduction report on slow-thinking reasoning systems. arXiv preprint arXiv:2412.09413, 2024

  39. [39]

Matharena: Evaluating llms on uncontaminated math competitions

Mislav Balunović, Jasper Dekoninck, Ivo Petrov, Nikola Jovanović, and Martin Vechev. Matharena: Evaluating llms on uncontaminated math competitions, February 2025. URL https://matharena.ai/

  40. [40]

    Measuring mathematical problem solving with the math dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021

  41. [41]

    Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems

Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, et al. Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Paper...

  42. [42]

Ghpo: Adaptive guidance for stable and efficient llm reinforcement learning

Ziru Liu, Cheng Gong, Xinyu Fu, Yaofang Liu, Ran Chen, Shoubo Hu, Suiyun Zhang, Rui Liu, Qingfu Zhang, and Dandan Tu. Ghpo: Adaptive guidance for stable and efficient llm reinforcement learning. arXiv preprint arXiv:2507.10628, 2025

  43. [43]

    Learning to reason under off-policy guidance

Jianhao Yan, Yafu Li, Zican Hu, Zhi Wang, Ganqu Cui, Xiaoye Qu, Yu Cheng, and Yue Zhang. Learning to reason under off-policy guidance. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URL https://openreview.net/forum?id=vO8LLoNWWk

  44. [44]

    Detecting hallucination in large language models through deep internal representation analysis

Luan Zhang, Dandan Song, Zhijing Wu, Yuhang Tian, Changzhi Zhou, Jing Xu, Ziyi Yang, and Shuhao Zhang. Detecting hallucination in large language models through deep internal representation analysis. In Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence, IJCAI-25, pages 8357–8365, 2025