Recognition: no theorem link
Revisiting Reinforcement Learning with Verifiable Rewards from a Contrastive Perspective
Pith reviewed 2026-05-14 20:16 UTC · model grok-4.3
The pith
ConSPO replaces GRPO's clipped ratios with length-normalized log-probabilities and a group-wise InfoNCE objective to improve credit assignment in LLM reasoning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ConSPO is a contrastive sequence-level policy optimization method for RLVR that aligns optimized scores with autoregressive likelihoods via length-normalized log-probabilities and applies a group-wise InfoNCE objective to make credit assignment depend on relative positive-negative gaps, together with a curriculum margin that increases separation strength over training.
What carries the argument
The group-wise InfoNCE-style objective applied to length-normalized sequence log-probabilities, which replaces clipped ratio-based surrogate scores to enable relative, score-gap-aware credit assignment within rollout groups.
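The mechanism described above can be sketched in a few lines. This is a hedged toy implementation, not the paper's exact loss: the per-token scores, temperature, and the way the margin enters the positive logit are assumptions made for illustration.

```python
import math

def length_normalized_score(token_logprobs):
    """Length-normalized sequence log-probability: mean per-token log-prob."""
    return sum(token_logprobs) / len(token_logprobs)

def group_infonce_loss(pos_scores, neg_scores, margin=0.0, temperature=1.0):
    """Group-wise InfoNCE-style loss: each verified positive rollout is
    contrasted against all negative rollouts from the same group.
    Subtracting a margin from the positive logit demands stronger
    separation before the loss bottoms out."""
    losses = []
    for s_pos in pos_scores:
        logits = [(s_pos - margin) / temperature] + \
                 [s / temperature for s in neg_scores]
        # -log softmax probability of the positive among its distractors
        log_denom = math.log(sum(math.exp(l) for l in logits))
        losses.append(log_denom - logits[0])
    return sum(losses) / len(losses)
```

Because the loss depends on the gap between positive and negative scores, a poorly separated positive (score close to the negatives') yields a larger loss and hence a larger update, which is exactly the score-gap-aware credit assignment the claim describes.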
If this is right
- Credit assignment becomes sensitive to the magnitude of score gaps between positive and negative rollouts within the same group.
- Optimized scores align directly with the likelihoods used during autoregressive generation instead of surrogate clipped ratios.
- The curriculum margin allows training to begin with coarse positive-negative ordering and progressively demand stronger separation.
- The method produces measurable gains on challenging mathematical reasoning benchmarks across varied model scales and training sets.
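The curriculum margin in the third bullet can be sketched as a simple schedule. The linear ramp and its endpoints are assumptions for illustration; the paper's exact schedule may differ.

```python
def curriculum_margin(step, total_steps, m_init=0.0, m_final=1.0):
    """Curriculum-scheduled margin: starts near zero, so early training
    only asks for coarse positive-negative ordering, and ramps linearly
    toward m_final, demanding progressively stronger separation."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return m_init + frac * (m_final - m_init)
```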
Where Pith is reading between the lines
- The same contrastive reformulation could be applied to other RLVR variants that currently rely on ratio-based surrogates.
- Group size and composition may affect the stability of the InfoNCE loss, suggesting a need to study optimal batching strategies.
- The approach might extend naturally to domains with verifiable rewards beyond mathematics, such as code generation or theorem proving.
- Because the loss amplifies updates on poorly separated positives, it could reduce the number of training steps needed to reach a target performance level.
Load-bearing premise
That replacing clipped ratio-based surrogate scores with length-normalized sequence log-probabilities and optimizing a group-wise InfoNCE objective will produce stable, superior credit assignment without introducing new optimization instabilities or sensitivity to group composition.
What would settle it
A controlled experiment in which ConSPO is trained on the same data and models as GRPO but shows no accuracy gain or exhibits training collapse on a standard mathematical reasoning benchmark such as GSM8K or MATH.
Original abstract
RLVR has become a widely adopted paradigm for improving LLMs' reasoning capabilities, and GRPO is one of its most representative algorithms. In this paper, we first show that GRPO admits an equivalent discriminative reformulation as a weighted positive-negative score difference. Under this view, GRPO increases sequence-level scores of verified positive rollouts and decreases those of negative rollouts, where the scores are averages of clipped token-level importance sampling ratios. This reformulation reveals two structural limitations of GRPO: likelihood-misaligned scoring, where clipped ratio-based surrogate scores are optimized instead of generation likelihoods, and score-insensitive credit assignment, where rollout-level credit is assigned without accounting for relative score gaps between positive and negative rollouts in the same group. To address these limitations, we propose ConSPO, a framework for Contrastive Sequence-level Policy Optimization in RLVR. ConSPO replaces GRPO's clipped ratio-based scores with length-normalized sequence log-probabilities, aligning the optimized rollout scores with the likelihoods used in autoregressive generation. It then optimizes a group-wise InfoNCE-style objective that contrasts each positive rollout against negative distractors from the same group, enabling credit assignment to depend on their relative scores. This contrastive formulation amplifies updates for poorly separated positives while concentrating suppressive updates on high-scoring negatives. Moreover, ConSPO introduces a curriculum-scheduled margin, guiding optimization from coarse positive-negative ordering in early training toward stronger separation in later stages. Extensive evaluations across diverse backbone models, parameter scales, and training datasets show that ConSPO consistently outperforms several strong RLVR baselines on challenging mathematical reasoning benchmarks.
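The abstract's central contrast is between the two sequence-level scores being optimized. A minimal sketch of both, assuming the standard PPO-style clip range (the exact constants and clipping details in GRPO implementations may differ):

```python
import math

def grpo_score(new_logprobs, old_logprobs, eps=0.2):
    """GRPO-style surrogate score: mean of clipped token-level
    importance-sampling ratios pi_new / pi_old. The clip range eps
    is the usual PPO choice, assumed here for illustration."""
    ratios = [math.exp(n - o) for n, o in zip(new_logprobs, old_logprobs)]
    clipped = [min(max(r, 1.0 - eps), 1.0 + eps) for r in ratios]
    return sum(clipped) / len(clipped)

def conspo_score(new_logprobs):
    """ConSPO-style score: length-normalized sequence log-probability,
    i.e. the same quantity the autoregressive decoder assigns to the
    rollout, so optimization is aligned with generation likelihood."""
    return sum(new_logprobs) / len(new_logprobs)
```

Note how clipping caps the GRPO score regardless of how much the policy's likelihood actually moved, which is the "likelihood-misaligned scoring" limitation the abstract identifies.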
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper reformulates GRPO in RLVR as an equivalent weighted positive-negative score difference using clipped token-level importance sampling ratios, identifies two limitations (likelihood-misaligned scoring and score-insensitive credit assignment), and proposes ConSPO. ConSPO replaces the surrogate scores with length-normalized sequence log-probabilities and optimizes a group-wise InfoNCE-style contrastive objective (with curriculum-scheduled margin) that contrasts verified positives against negatives from the same rollout group. Extensive experiments across backbone models, scales, and datasets claim consistent outperformance over strong RLVR baselines on mathematical reasoning benchmarks.
Significance. If the results and derivations hold, the contrastive reformulation provides a useful discriminative lens on RLVR algorithms and a more likelihood-aligned, relative-score-sensitive alternative to GRPO-style methods. This could improve credit assignment and optimization stability for verifiable-reward training of LLMs on reasoning tasks, with the curriculum margin offering a practical way to control separation strength over training.
Major comments (1)
- [§3 (ConSPO objective and InfoNCE formulation)] The group-wise InfoNCE objective (described in the abstract and §3) contrasts each positive against negatives from the same rollout group using length-normalized log-probabilities. With typical group sizes of 8-16 rollouts containing only 0-4 verified positives, the denominator is frequently dominated by high-scoring negatives when positives are scarce; the curriculum margin modulates separation but provides no normalization for group cardinality or intra-group variance. No derivation shows the resulting gradient is unbiased or stable across the observed distribution of group compositions, so the reported gains could partly reflect favorable group statistics rather than intrinsic superiority of the reformulation.
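The group-cardinality concern above is easy to probe numerically. This toy check (not the paper's exact loss) holds a positive's score fixed and grows the group with equally scored negatives:

```python
import math

def positive_softmax_prob(pos_score, neg_scores, temperature=1.0):
    """Probability mass an InfoNCE-style softmax assigns to the positive
    when contrasted against the group's negatives; the positive's loss
    is -log of this quantity."""
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    z = sum(math.exp(l) for l in logits)
    return math.exp(logits[0]) / z

# The same positive receives a larger loss purely because the group
# contains more negatives: the denominator grows roughly like the group
# size, with no cardinality normalization to compensate.
small_group = positive_softmax_prob(-1.0, [-1.5] * 3)
large_group = positive_softmax_prob(-1.0, [-1.5] * 15)
```

This supports the referee's point that gradient magnitudes are entangled with group composition unless the loss normalizes for it.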
Minor comments (2)
- [Abstract and §3] The abstract and method description would benefit from explicit equations for the length-normalized log-probability score and the exact InfoNCE loss (including how the margin is scheduled), to make the contrast with GRPO's clipped-ratio surrogate fully transparent.
- [Experiments] The experiments section should include an ablation or analysis table on sensitivity to group size and the distribution of positive counts per group, as this directly tests the robustness claim.
Simulated Author's Rebuttal
We are grateful to the referee for the thorough review and valuable suggestions. Our response to the major comment is provided below, along with planned revisions to the manuscript.
Point-by-point responses
Referee: [§3 (ConSPO objective and InfoNCE formulation)] The group-wise InfoNCE objective (described in the abstract and §3) contrasts each positive against negatives from the same rollout group using length-normalized log-probabilities. With typical group sizes of 8-16 rollouts containing only 0-4 verified positives, the denominator is frequently dominated by high-scoring negatives when positives are scarce; the curriculum margin modulates separation but provides no normalization for group cardinality or intra-group variance. No derivation shows the resulting gradient is unbiased or stable across the observed distribution of group compositions, so the reported gains could partly reflect favorable group statistics rather than intrinsic superiority of the reformulation.
Authors: We thank the referee for this insightful comment on the ConSPO objective. We agree that the InfoNCE formulation could benefit from further analysis of its gradient properties under different group compositions. Although we do not claim or derive unbiasedness in the traditional RL sense (as the objective is a contrastive surrogate rather than a direct policy gradient), the empirical results indicate stable training and consistent gains. The length-normalized log-probabilities provide better alignment, and the group-wise contrast allows relative scoring. To address this, we will revise Section 3 to include a brief analysis of the gradient behavior based on our observations and add ablations on varying group sizes and positive ratios in the experiments section. This will clarify that the gains are not solely due to favorable statistics.
Revision: yes
Circularity Check
No circularity: reformulation and new objective are independently defined
Full rationale
The paper first presents an algebraic reformulation of GRPO as a weighted positive-negative score difference, then defines ConSPO by substituting length-normalized sequence log-probabilities for clipped ratios and adopting a group-wise InfoNCE objective plus curriculum margin. These substitutions are explicit design choices stated in the abstract and are not obtained by fitting parameters to the same success metrics later reported; the derivation chain therefore remains self-contained and does not reduce any claimed result to its own inputs by construction.
Axiom & Free-Parameter Ledger
Free parameters (1)
- curriculum-scheduled margin
Axioms (1)
- Domain assumption: the group-wise contrastive loss assigns credit proportionally to relative score gaps between positive and negative rollouts.