Stabilizing Knowledge, Promoting Reasoning: Dual-Token Constraints for RLVR

Fuzheng Zhang; Guorui Zhou; Jiakang Wang; Ling Pan; Runze Liu; Xiu Li

arxiv: 2507.15778 · v2 · pith:DOUNEVZ4new · submitted 2025-07-21 · 💻 cs.CL

Stabilizing Knowledge, Promoting Reasoning: Dual-Token Constraints for RLVR

Jiakang Wang , Runze Liu , Fuzheng Zhang , Xiu Li , Guorui Zhou , Ling Pan This is my paper

Pith reviewed 2026-05-21 23:16 UTC · model grok-4.3

classification 💻 cs.CL

keywords RLVRdual-token constraintstoken entropyLLM reasoningmathematical reasoningcode generationreinforcement learningautoregressive generation

0 comments

The pith

Differentiated constraints on high- and low-entropy tokens improve reasoning in reinforcement learning for LLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that uniform optimization across all tokens in RLVR overlooks how some tokens drive reasoning while others store facts. It introduces dual-token constraints that classify tokens by entropy and apply stronger updates to reasoning tokens while using gentler rules for knowledge tokens. This keeps the full sequence optimized together instead of isolating tokens and breaking generation flow. The approach matters because it aims to raise accuracy on math and code tasks by respecting the step-by-step nature of model output.

Core claim

Archer shows that response-level entropy normalization for stable token classification, combined with differentiated clipping ranges and KL regularization, encourages exploration on high-entropy reasoning tokens while preserving low-entropy knowledge tokens, delivering consistent gains in pass@1 and pass@K on mathematical reasoning and code generation benchmarks across model scales.

What carries the argument

Dual-token constraints that modulate clipping and KL regularization ranges according to token entropy while maintaining joint optimization of the full autoregressive sequence.

If this is right

Outperforms strong baselines on mathematical reasoning and code generation benchmarks
Raises both pass@1 and pass@K scores across multiple model scales
Preserves the sequential dependency structure of autoregressive generation
Encourages exploration on reasoning tokens while stabilizing knowledge tokens

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The entropy-based split could extend to other post-training methods such as supervised fine-tuning to handle mixed reasoning and recall tasks.
Dynamic adjustment of entropy thresholds during training might further stabilize performance on longer reasoning chains.
This pattern of differentiated constraints may reduce the need for separate masking stages in future LLM optimization pipelines.

Load-bearing premise

High-entropy tokens correspond to reasoning steps and low-entropy tokens to factual knowledge, and differentiated constraint ranges can adjust updates without disrupting the sequential dependencies in generation.

What would settle it

A side-by-side run on the same math and code benchmarks where uniform constraints or isolation methods match or exceed the dual-constraint results would undermine the value of the differentiated approach.

read the original abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has become an effective post-training method for improving the reasoning abilities of Large Language Models (LLMs). However, existing methods mainly apply uniform optimization constraints across all tokens, ignoring their heterogeneous roles. Prior work shows that high-entropy tokens are closely tied to reasoning, while low-entropy tokens primarily encode factual knowledge, and recent approaches attempt to exploit this distinction by isolating token updates via masking or asynchronous training. We argue that such isolation breaks the sequential dependency structure of autoregressive generation, leading to suboptimal learning. To address this, we propose \textbf{Archer}, an entropy-aware RLVR framework with \textbf{dual-token constraints} that preserves joint optimization while modulating update strength across token types. Our method introduces response-level entropy normalization for stable token classification and applies differentiated clipping ranges and KL regularization to encourage exploration on reasoning tokens while preserving knowledge tokens. Experiments on mathematical reasoning and code generation benchmarks show that Archer consistently outperforms strong baselines across multiple model scales, improving both \textit{pass@1} and \textit{pass@K} performance. These results highlight the importance of respecting sequence-level dependencies when designing fine-grained RL optimization strategies for LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Archer adds dual-token constraints to RLVR via response-level entropy normalization and differentiated clipping/KL to favor exploration on high-entropy tokens while stabilizing low-entropy ones, without using isolation methods.

read the letter

The main thing to know is that this paper proposes Archer as an entropy-aware RLVR method that classifies tokens at the response level and then applies looser clipping and KL ranges to high-entropy tokens (tied to reasoning) and tighter ones to low-entropy tokens (tied to knowledge), all while keeping joint optimization across the sequence. It positions this as better than prior masking or asynchronous approaches because those break autoregressive dependencies. That framing is reasonable and directly addresses a practical issue in fine-grained RL for LLMs. The reported gains on math reasoning and code generation benchmarks for both pass@1 and pass@K across model scales are the kind of outcome that could matter for applied work. The paper builds on existing observations about entropy and token roles rather than claiming to discover them, which keeps the novelty focused on the combined constraint design and normalization step. The soft spot is the limited direct support for the classification itself. The argument assumes high-entropy tokens reliably mark reasoning steps in this RLVR setting, but without token-level analysis or ablations that isolate the dual constraints from the normalization or other hyperparameter choices, the outperformance could stem from broader effects rather than the targeted mechanism. The abstract claims consistent improvements but does not detail baselines or error breakdowns here, so the strength of the evidence depends on what the full experiments actually show. This is for researchers working on RL post-training for reasoning models who want to move past uniform constraints. A reader already familiar with entropy-based distinctions in LLMs would get the most out of it and could adapt the idea even if the exact split needs refinement. It is coherent enough on its own terms to deserve a serious referee, though the review would need to focus on verifying the mechanism through ablations.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Archer, an entropy-aware RLVR framework for LLMs that uses response-level entropy normalization to classify tokens into high-entropy (reasoning) and low-entropy (knowledge) categories, then applies differentiated clipping ranges and KL regularization under dual-token constraints to encourage exploration on reasoning tokens while stabilizing knowledge tokens, all while preserving joint autoregressive optimization rather than using isolation methods. Experiments on mathematical reasoning and code generation benchmarks report consistent outperformance over strong baselines across multiple model scales, with gains in both pass@1 and pass@K.

Significance. If the central mechanism and results hold, the work offers a practical way to perform fine-grained RL optimization in LLMs that respects token heterogeneity and sequential dependencies, potentially improving reasoning performance without the drawbacks of masking or asynchronous training. The empirical evaluation across scales and tasks, combined with the focus on verifiable rewards, provides a concrete contribution to post-training methods.

major comments (2)

[§3] §3 (Dual-Token Constraints): The framework rests on the assumption that high-entropy tokens correspond to reasoning steps and low-entropy tokens to factual knowledge, with response-level entropy normalization enabling stable classification; however, the manuscript provides no token-level analysis, ablation on classification accuracy, or error analysis in the RLVR setting to validate this mapping, leaving open the possibility that gains arise from normalization or hyperparameter choices rather than the differentiated constraints.
[§4] §4 (Experiments and Ablations): The reported improvements in pass@1 and pass@K are central to the claim of consistent outperformance, yet the experiments section lacks ablations that isolate the contribution of differentiated clipping ranges and KL ranges from uniform RLVR or other factors; without such controls, it is difficult to confirm that the dual-token approach (rather than incidental effects) drives the gains across model scales.

minor comments (2)

[Abstract] Abstract: The description of baselines and model scales is high-level; adding one sentence with concrete examples (e.g., specific models or benchmark names) would improve immediate clarity.
[§3] Notation: The definitions of the differentiated clipping ranges and KL coefficients would benefit from an explicit equation or pseudocode block to facilitate reproduction.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. The comments raise valid points about strengthening the validation of our core assumptions and the isolation of component contributions. We address each major comment below and describe the revisions we will incorporate to improve the paper.

read point-by-point responses

Referee: [§3] §3 (Dual-Token Constraints): The framework rests on the assumption that high-entropy tokens correspond to reasoning steps and low-entropy tokens to factual knowledge, with response-level entropy normalization enabling stable classification; however, the manuscript provides no token-level analysis, ablation on classification accuracy, or error analysis in the RLVR setting to validate this mapping, leaving open the possibility that gains arise from normalization or hyperparameter choices rather than the differentiated constraints.

Authors: We appreciate the referee's emphasis on validating the entropy-to-role mapping. Our approach builds directly on prior work that has empirically linked high-entropy tokens to reasoning steps and low-entropy tokens to factual recall in LLMs. We acknowledge that the current manuscript does not include new token-level analysis or classification-error ablations specific to the RLVR setting. In the revised version we will add a dedicated subsection presenting qualitative examples of token classifications across reasoning traces, together with a sensitivity ablation on the response-level entropy normalization threshold. These additions will help demonstrate that the differentiated clipping and KL ranges, rather than normalization alone, are responsible for the observed gains. revision: yes
Referee: [§4] §4 (Experiments and Ablations): The reported improvements in pass@1 and pass@K are central to the claim of consistent outperformance, yet the experiments section lacks ablations that isolate the contribution of differentiated clipping ranges and KL ranges from uniform RLVR or other factors; without such controls, it is difficult to confirm that the dual-token approach (rather than incidental effects) drives the gains across model scales.

Authors: We agree that more targeted controls are needed to isolate the effect of the dual-token constraints. Our existing experiments already include comparisons against uniform RLVR baselines, but we recognize that these do not fully separate the differentiated clipping ranges and KL coefficients from other design choices. In the revision we will introduce additional ablation tables that apply uniform clipping and KL regularization while retaining response-level entropy normalization, and directly contrast these against the full Archer configuration. The new results will be reported across the same model scales and benchmarks to clarify the specific contribution of the differentiated constraints to the pass@1 and pass@K improvements. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical gains rest on external entropy observations and benchmark results

full rationale

The paper's core proposal (Archer) defines dual-token constraints via response-level entropy normalization and differentiated clipping/KL ranges, citing prior work for the high-entropy/reasoning vs. low-entropy/knowledge distinction rather than deriving it internally. No equations reduce claimed pass@1/pass@K improvements to quantities defined by the method's own fitted parameters or self-referential inputs. The experimental outperformance is presented as an empirical outcome on mathematical and code benchmarks, not a first-principles prediction forced by construction. The framework preserves autoregressive dependencies by design and does not rely on load-bearing self-citations or uniqueness theorems from the same authors. This is a standard non-circular methodological contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entities

The central claim rests on the unverified mapping from entropy to token role and on the untested assertion that modulated constraints preserve autoregressive dependencies; both are introduced without independent evidence visible in the abstract.

axioms (2)

domain assumption High-entropy tokens are closely tied to reasoning while low-entropy tokens primarily encode factual knowledge.
Invoked in the abstract as established prior observation that motivates the dual-token approach.
ad hoc to paper Differentiated clipping ranges and KL regularization can encourage exploration on reasoning tokens while preserving knowledge tokens without breaking sequential dependencies.
Core design choice of Archer presented as the solution to the isolation problem.

invented entities (1)

Dual-token constraints no independent evidence
purpose: To modulate update strength across token types while preserving joint optimization of the autoregressive sequence.
New framework component introduced to replace masking or asynchronous training.

pith-pipeline@v0.9.0 · 5755 in / 1516 out tokens · 53443 ms · 2026-05-21T23:16:05.003269+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

high-entropy tokens are closely tied to reasoning, while low-entropy tokens primarily encode factual knowledge... differentiated clipping ranges and KL regularization
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean embed_injective unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

response-level entropy normalization for stable token classification

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Understanding and Preventing Entropy Collapse in RLVR with On-Policy Entropy Flow Optimization
cs.LG 2026-05 unverdicted novelty 6.0

OPEFO prevents entropy collapse in RLVR by rescaling token updates according to their entropy change contributions, yielding more stable optimization and better results on math benchmarks.
Flexible Entropy Control in RLVR with a Gradient-Preserving Perspective
cs.LG 2026-02 unverdicted novelty 6.0

Dynamic clipping strategies based on importance sampling regions enable precise entropy management in RLVR, mitigating collapse and improving benchmark performance.
When Importance Sampling Misallocates Credit: Asymmetric Ratios for Outcome-Supervised RL
cs.CL 2025-10 unverdicted novelty 6.0

The paper identifies that importance sampling ratios in outcome-supervised RL misallocate credit by creating unbalanced token updates, and introduces ASPO to correct the asymmetry for positive-advantage tokens.

Reference graph

Works this paper leans on

54 extracted references · 54 canonical work pages · cited by 3 Pith papers · 22 internal anchors

[1]

The Unreasonable Effectiveness of Entropy Minimization in LLM Reasoning

Shivam Agarwal, Zimin Zhang, Lifan Yuan, Jiawei Han, and Hao Peng. The unreasonable effectiveness of entropy minimization in llm reasoning. arXiv preprint arXiv:2505.15134, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[3]

Acereason-nemotron: Advancing math and code reasoning through reinforcement learning

Yang Chen, Zhuolin Yang, Zihan Liu, Chankyu Lee, Peng Xu, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. Acereason-nemotron: Advancing math and code reasoning through reinforcement learning. arXiv preprint arXiv:2505.16400, 2025

work page arXiv 2025
[4]

Reasoning with Exploration: An Entropy Perspective

Daixuan Cheng, Shaohan Huang, Xuekai Zhu, Bo Dai, Wayne Xin Zhao, Zhenliang Zhang, and Furu Wei. Reasoning with exploration: An entropy perspective. arXiv preprint arXiv:2506.14758, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[5]

Deep reinforcement learning from human preferences

Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedin...

work page 2017
[6]

Gpg: A simple and strong reinforcement learning baseline for model reasoning

Xiangxiang Chu, Hailang Huang, Xiao Zhang, Fei Wei, and Yong Wang. Gpg: A simple and strong reinforcement learning baseline for model reasoning. arXiv preprint arXiv:2504.02546, 2025

work page arXiv 2025
[7]

Process Reinforcement through Implicit Rewards

Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Wendi Li, Bingxiang He, Yuchen Fan, Tianyu Yu, Qixin Xu, Weize Chen, et al. Process reinforcement through implicit rewards. arXiv preprint arXiv:2502.01456, 2025 a

work page internal anchor Pith review Pith/arXiv arXiv 2025
[8]

The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models

Ganqu Cui, Yuchen Zhang, Jiacheng Chen, Lifan Yuan, Zhi Wang, Yuxin Zuo, Haozhan Li, Yuchen Fan, Huayu Chen, Weize Chen, et al. The entropy mechanism of reinforcement learning for reasoning language models. arXiv preprint arXiv:2505.22617, 2025 b

work page internal anchor Pith review Pith/arXiv arXiv 2025
[9]

DeepSeek-AI , Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai ...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[10]

Cognitive Behaviors that Enable Self-Improving Reasoners, or, Four Habits of Highly Effective STaRs

Kanishk Gandhi, Ayush Chakravarthy, Anikait Singh, Nathan Lile, and Noah D Goodman. Cognitive behaviors that enable self-improving reasoners, or, four habits of highly effective stars. arXiv preprint arXiv:2503.01307, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[11]

Olympiadbench: A challenging benchmark for promoting AGI with olympiad-level bilingual multimodal scientific problems

Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, Jie Liu, Lei Qi, Zhiyuan Liu, and Maosong Sun. O lympiad B ench: A challenging benchmark for promoting AGI with olympiad-level bilingual multimodal scientific problems. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedin...

work page doi:10.18653/v1/2024.acl-long.211 2024
[12]

Skywork Open Reasoner 1 Technical Report

Jujie He, Jiacai Liu, Chris Yuhao Liu, Rui Yan, Chaojie Wang, Peng Cheng, Xiaoyu Zhang, Fuxiang Zhang, Jiacheng Xu, Wei Shen, et al. Skywork open reasoner 1 technical report. arXiv preprint arXiv:2505.22312, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[13]

Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model

Jingcheng Hu, Yinmin Zhang, Qi Han, Daxin Jiang, Xiangyu Zhang, and Heung-Yeung Shum. Open-reasoner-zero: An open source approach to scaling up reinforcement learning on the base model. arXiv preprint arXiv:2503.24290, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[14]

Livecodebench: Holistic and contamination free evaluation of large language models for code

Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=chfJJYC3iL

work page 2025
[15]

Vine PPO : Refining credit assignment in RL training of LLM s

Amirhossein Kazemnejad, Milad Aghajohari, Eva Portelance, Alessandro Sordoni, Siva Reddy, Aaron Courville, and Nicolas Le Roux. Vine PPO : Refining credit assignment in RL training of LLM s. In Forty-second International Conference on Machine Learning, 2025. URL https://openreview.net/forum?id=Myx2kJFzAn

work page 2025
[16]

Kimi k1.5: Scaling Reinforcement Learning with LLMs

Kimi Team , Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, et al. Kimi k1.5: Scaling reinforcement learning with llms. arXiv preprint arXiv:2501.12599, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[17]

Gonzalez, Hao Zhang, and Ion Stoica

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023

work page 2023
[18]

Solving Quantitative Reasoning Problems with Language Models

Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, Yuhuai Wu, Behnam Neyshabur, Guy Gur-Ari, and Vedant Misra. Solving Quantitative Reasoning Problems with Language Models . In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, ...

work page 2022
[19]

Llms can easily learn to reason from demonstrations structure, not content, is what matters! arXiv preprint arXiv:2502.07374, 2025

Dacheng Li, Shiyi Cao, Tyler Griggs, Shu Liu, Xiangxi Mo, Eric Tang, Sumanth Hegde, Kourosh Hakhamaneshi, Shishir G Patil, Matei Zaharia, et al. Llms can easily learn to reason from demonstrations structure, not content, is what matters! arXiv preprint arXiv:2502.07374, 2025

work page arXiv 2025
[20]

NuminaMath

Jia LI, Edward Beeching, Lewis Tunstall, Ben Lipkin, Roman Soletskyi, Shengyi Costa Huang, Kashif Rasul, Longhui Yu, Albert Jiang, Ziju Shen, Zihan Qin, Bin Dong, Li Zhou, Yann Fleureau, Guillaume Lample, and Stanislas Polu. NuminaMath . [https://github.com/project-numina/aimo-progress-prize](https://github.com/project-numina/aimo-progress-prize/blob/main...

work page 2024
[21]

Mankowitz, Esme Sutherland Robson, Pushmeet Kohli, Nando de Freitas, Koray Kavukcuoglu, and Oriol Vinyals

Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien de Masson d’Autume, Igor Babuschkin, Xinyun Chen, Po-Sen Huang, Johannes Welbl, Sven Gowal, Alexey Cherepanov, James Molloy, Daniel J. Mankowitz, Esme Sutherland Robson, Pushm...

work page doi:10.1126/science.abq1158 2022
[22]

Let's verify step by step

Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let's verify step by step. In International Conference on Learning Representations (ICLR), 2024. URL https://openreview.net/forum?id=v8L0pN6EOi

work page 2024
[23]

ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models

Mingjie Liu, Shizhe Diao, Ximing Lu, Jian Hu, Xin Dong, Yejin Choi, Jan Kautz, and Yi Dong. Prorl: Prolonged reinforcement learning expands reasoning boundaries in large language models. arXiv preprint arXiv:2505.24864, 2025 a

work page internal anchor Pith review Pith/arXiv arXiv 2025
[24]

Meta-reward-net: Implicitly differentiable reward learning for preference-based reinforcement learning

Runze Liu, Fengshuo Bai, Yali Du, and Yaodong Yang. Meta-reward-net: Implicitly differentiable reward learning for preference-based reinforcement learning. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 22270--22284. Curran Associates, Inc., 2022. URL https...

work page 2022
[25]

Can 1b llm surpass 405b llm? rethinking compute-optimal test-time scaling

Runze Liu, Junqi Gao, Jian Zhao, Kaiyan Zhang, Xiu Li, Biqing Qi, Wanli Ouyang, and Bowen Zhou. Can 1b llm surpass 405b llm? rethinking compute-optimal test-time scaling. arXiv preprint arXiv:2502.06703, 2025 b

work page arXiv 2025
[26]

Understanding R1-Zero-Like Training: A Critical Perspective

Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A critical perspective. arXiv preprint arXiv:2503.20783, 2025 c

work page internal anchor Pith review Pith/arXiv arXiv 2025
[27]

Deepcoder: A fully open-source 14b coder at o3-mini level

Michael Luo, Sijun Tan, Roy Huang, Ameen Patel, Alpay Ariyak, Qingyang Wu, Xiaoxiang Shi, Rachel Xin, Colin Cai, Maurice Weber, Ce Zhang, Li Erran Li, Raluca Ada Popa, and Ion Stoica. Deepcoder: A fully open-source 14b coder at o3-mini level. https://pretty-radio-b75.notion.site/DeepCoder-A-Fully-Open-Source-14B-Coder-at-O3-mini-Level-1cf81902c14680b3bee5...

work page 2025
[28]

Tang, Manan Roongta, Colin Cai, Jeffrey Luo, Li Erran Li, Raluca Ada Popa, and Ion Stoica

Michael Luo, Sijun Tan, Justin Wong, Xiaoxiang Shi, William Y. Tang, Manan Roongta, Colin Cai, Jeffrey Luo, Li Erran Li, Raluca Ada Popa, and Ion Stoica. Deepscaler: Surpassing o1-preview with a 1.5b model by scaling rl. https://pretty-radio-b75.notion.site/DeepScaleR-Surpassing-O1-Preview-with-a-1-5B-Model-by-Scaling-RL-19681902c1468005bed8ca303013a4e2, ...

work page 2025
[29]

American mathematics contest 12 (amc 12), November 2023

MAA . American mathematics contest 12 (amc 12), November 2023. URL https://artofproblemsolving.com/wiki/index.php/AMC_12_Problems_and_Solutions

work page 2023
[30]

American invitational mathematics examination (aime), February 2024

MAA . American invitational mathematics examination (aime), February 2024. URL https://artofproblemsolving.com/wiki/index.php/AIME_Problems_and_Solutions

work page 2024
[31]

American invitational mathematics examination (aime), February 2025

MAA . American invitational mathematics examination (aime), February 2025. URL https://artofproblemsolving.com/wiki/index.php/AIME_Problems_and_Solutions

work page 2025
[32]

Learning to reason with llms, 2024 a

OpenAI. Learning to reason with llms, 2024 a . URL https://openai.com/index/learning-to-reason-with-llms

work page 2024
[33]

Openai o3-mini, 2024 b

OpenAI. Openai o3-mini, 2024 b . URL https://openai.com/index/openai-o3-mini

work page 2024
[34]

Training language models to follow instructions with human feedback

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedbac...

work page 2022
[35]

Codeforces

Guilherme Penedo, Anton Lozhkov, Hynek Kydlíček, Loubna Ben Allal, Edward Beeching, Agustín Piqueres Lajarín, Quentin Gallouédec, Nathan Habib, Lewis Tunstall, and Leandro von Werra. Codeforces. https://huggingface.co/datasets/open-r1/codeforces, 2025

work page 2025
[36]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[37]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[38]

HybridFlow: A Flexible and Efficient RLHF Framework

Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. arXiv preprint arXiv: 2409.19256, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[39]

Scaling llm test-time compute optimally can be more effective than scaling parameters for reasoning

Charlie Victor Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute optimally can be more effective than scaling parameters for reasoning. In International Conference on Learning Representations (ICLR), 2025. URL https://openreview.net/forum?id=4FWAwZtd2n

work page 2025
[40]

Fastcurl: Curriculum reinforcement learning with stage-wise context scaling for efficient training r1-like reasoning models

Mingyang Song, Mao Zheng, Zheng Li, Wenjie Yang, Xuan Luo, Yue Pan, and Feng Zhang. Fastcurl: Curriculum reinforcement learning with stage-wise context scaling for efficient training r1-like reasoning models. arXiv preprint arXiv:2503.17287, 2025

work page arXiv 2025
[41]

Ignore the kl penalty! boosting exploration on critical tokens to enhance rl fine-tuning

Jean Vassoyan, Nathana \"e l Beau, and Roman Plaud. Ignore the kl penalty! boosting exploration on critical tokens to enhance rl fine-tuning. arXiv preprint arXiv:2502.06533, 2025

work page arXiv 2025
[42]

A lpha Z ero-like tree-search can guide large language model decoding and training

Ziyu Wan, Xidong Feng, Muning Wen, Stephen Marcus Mcaleer, Ying Wen, Weinan Zhang, and Jun Wang. A lpha Z ero-like tree-search can guide large language model decoding and training. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors, Proceedings of the 41st International Con...

work page 2024
[43]

Math-shepherd: Verify and reinforce llms step-by-step without human annotations

Peiyi Wang, Lei Li, Zhihong Shao, Runxin Xu, Damai Dai, Yifei Li, Deli Chen, Yu Wu, and Zhifang Sui. Math-shepherd: Verify and reinforce llms step-by-step without human annotations. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9426--9439, 2024

work page 2024
[44]

Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning

Shenzhi Wang, Le Yu, Chang Gao, Chujie Zheng, Shixuan Liu, Rui Lu, Kai Dang, Xionghui Chen, Jianxin Yang, Zhenru Zhang, et al. Beyond the 80/20 rule: High-entropy minority tokens drive effective reinforcement learning for llm reasoning. arXiv preprint arXiv:2506.01939, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[45]

Reinforcement Learning with Verifiable Rewards Implicitly Incentivizes Correct Reasoning in Base LLMs

Xumeng Wen, Zihan Liu, Shun Zheng, Zhijian Xu, Shengyu Ye, Zhirong Wu, Xiao Liang, Yang Wang, Junjie Li, Ziming Miao, et al. Reinforcement learning with verifiable rewards implicitly incentivizes correct reasoning in base llms. arXiv preprint arXiv:2506.14245, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[46]

Qwen2.5 Technical Report

An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tingyu X...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[47]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[48]

Do not let low-probability tokens over-dominate in rl for llms

Zhihe Yang, Xufang Luo, Zilong Wang, Dongqi Han, Zhiyuan He, Dongsheng Li, and Yunjian Xu. Do not let low-probability tokens over-dominate in rl for llms. arXiv preprint arXiv:2505.12929, 2025 b

work page arXiv 2025
[49]

Mastering complex control in moba games with deep reinforcement learning

Deheng Ye, Zhao Liu, Mingfei Sun, Bei Shi, Peilin Zhao, Hao Wu, Hongsheng Yu, Shaojie Yang, Xipeng Wu, Qingwei Guo, Qiaobo Chen, Yinyuting Yin, Hao Zhang, Tengfei Shi, Liang Wang, Qiang Fu, Wei Yang, and Lanxiao Huang. Mastering complex control in moba games with deep reinforcement learning. Proceedings of the AAAI Conference on Artificial Intelligence, 3...

work page doi:10.1609/aaai.v34i04.6144 2020
[50]

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[51]

Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model? arXiv preprint arXiv:2504.13837, 2025 a

work page internal anchor Pith review Pith/arXiv arXiv 2025
[52]

VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks

Yu Yue, Yufeng Yuan, Qiying Yu, Xiaochen Zuo, Ruofei Zhu, Wenyuan Xu, Jiaze Chen, Chengyi Wang, TianTian Fan, Zhengyin Du, et al. Vapo: Efficient and reliable reinforcement learning for advanced reasoning tasks. arXiv preprint arXiv:2504.05118, 2025 b

work page internal anchor Pith review Pith/arXiv arXiv 2025
[53]

Rl tango: Reinforcing generator and verifier together for language reasoning

Kaiwen Zha, Zhengqi Gao, Maohao Shen, Zhang-Wei Hong, Duane S Boning, and Dina Katabi. Rl tango: Reinforcing generator and verifier together for language reasoning. arXiv preprint arXiv:2505.15034, 2025

work page arXiv 2025
[54]

arXiv preprint arXiv:2504.00891

Jian Zhao, Runze Liu, Kaiyan Zhang, Zhimu Zhou, Junqi Gao, Dong Li, Jiafei Lyu, Zhouyi Qian, Biqing Qi, Xiu Li, et al. Genprm: Scaling test-time compute of process reward models via generative reasoning. arXiv preprint arXiv:2504.00891, 2025

work page arXiv 2025

[1] [1]

The Unreasonable Effectiveness of Entropy Minimization in LLM Reasoning

Shivam Agarwal, Zimin Zhang, Lifan Yuan, Jiawei Han, and Hao Peng. The unreasonable effectiveness of entropy minimization in llm reasoning. arXiv preprint arXiv:2505.15134, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[2] [2]

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[3] [3]

Acereason-nemotron: Advancing math and code reasoning through reinforcement learning

Yang Chen, Zhuolin Yang, Zihan Liu, Chankyu Lee, Peng Xu, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. Acereason-nemotron: Advancing math and code reasoning through reinforcement learning. arXiv preprint arXiv:2505.16400, 2025

work page arXiv 2025

[4] [4]

Reasoning with Exploration: An Entropy Perspective

Daixuan Cheng, Shaohan Huang, Xuekai Zhu, Bo Dai, Wayne Xin Zhao, Zhenliang Zhang, and Furu Wei. Reasoning with exploration: An entropy perspective. arXiv preprint arXiv:2506.14758, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[5] [5]

Deep reinforcement learning from human preferences

Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedin...

work page 2017

[6] [6]

Gpg: A simple and strong reinforcement learning baseline for model reasoning

Xiangxiang Chu, Hailang Huang, Xiao Zhang, Fei Wei, and Yong Wang. Gpg: A simple and strong reinforcement learning baseline for model reasoning. arXiv preprint arXiv:2504.02546, 2025

work page arXiv 2025

[7] [7]

Process Reinforcement through Implicit Rewards

Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Wendi Li, Bingxiang He, Yuchen Fan, Tianyu Yu, Qixin Xu, Weize Chen, et al. Process reinforcement through implicit rewards. arXiv preprint arXiv:2502.01456, 2025 a

work page internal anchor Pith review Pith/arXiv arXiv 2025

[8] [8]

The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models

Ganqu Cui, Yuchen Zhang, Jiacheng Chen, Lifan Yuan, Zhi Wang, Yuxin Zuo, Haozhan Li, Yuchen Fan, Huayu Chen, Weize Chen, et al. The entropy mechanism of reinforcement learning for reasoning language models. arXiv preprint arXiv:2505.22617, 2025 b

work page internal anchor Pith review Pith/arXiv arXiv 2025

[9] [9]

DeepSeek-AI , Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai ...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[10] [10]

Cognitive Behaviors that Enable Self-Improving Reasoners, or, Four Habits of Highly Effective STaRs

Kanishk Gandhi, Ayush Chakravarthy, Anikait Singh, Nathan Lile, and Noah D Goodman. Cognitive behaviors that enable self-improving reasoners, or, four habits of highly effective stars. arXiv preprint arXiv:2503.01307, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[11] [11]

Olympiadbench: A challenging benchmark for promoting AGI with olympiad-level bilingual multimodal scientific problems

Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, Jie Liu, Lei Qi, Zhiyuan Liu, and Maosong Sun. O lympiad B ench: A challenging benchmark for promoting AGI with olympiad-level bilingual multimodal scientific problems. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedin...

work page doi:10.18653/v1/2024.acl-long.211 2024

[12] [12]

Skywork Open Reasoner 1 Technical Report

Jujie He, Jiacai Liu, Chris Yuhao Liu, Rui Yan, Chaojie Wang, Peng Cheng, Xiaoyu Zhang, Fuxiang Zhang, Jiacheng Xu, Wei Shen, et al. Skywork open reasoner 1 technical report. arXiv preprint arXiv:2505.22312, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[13] [13]

Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model

Jingcheng Hu, Yinmin Zhang, Qi Han, Daxin Jiang, Xiangyu Zhang, and Heung-Yeung Shum. Open-reasoner-zero: An open source approach to scaling up reinforcement learning on the base model. arXiv preprint arXiv:2503.24290, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[14] [14]

Livecodebench: Holistic and contamination free evaluation of large language models for code

Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=chfJJYC3iL

work page 2025

[15] [15]

Vine PPO : Refining credit assignment in RL training of LLM s

Amirhossein Kazemnejad, Milad Aghajohari, Eva Portelance, Alessandro Sordoni, Siva Reddy, Aaron Courville, and Nicolas Le Roux. Vine PPO : Refining credit assignment in RL training of LLM s. In Forty-second International Conference on Machine Learning, 2025. URL https://openreview.net/forum?id=Myx2kJFzAn

work page 2025

[16] [16]

Kimi k1.5: Scaling Reinforcement Learning with LLMs

Kimi Team , Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, et al. Kimi k1.5: Scaling reinforcement learning with llms. arXiv preprint arXiv:2501.12599, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[17] [17]

Gonzalez, Hao Zhang, and Ion Stoica

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023

work page 2023

[18] [18]

Solving Quantitative Reasoning Problems with Language Models

Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, Yuhuai Wu, Behnam Neyshabur, Guy Gur-Ari, and Vedant Misra. Solving Quantitative Reasoning Problems with Language Models . In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, ...

work page 2022

[19] [19]

Llms can easily learn to reason from demonstrations structure, not content, is what matters! arXiv preprint arXiv:2502.07374, 2025

Dacheng Li, Shiyi Cao, Tyler Griggs, Shu Liu, Xiangxi Mo, Eric Tang, Sumanth Hegde, Kourosh Hakhamaneshi, Shishir G Patil, Matei Zaharia, et al. Llms can easily learn to reason from demonstrations structure, not content, is what matters! arXiv preprint arXiv:2502.07374, 2025

work page arXiv 2025

[20] [20]

NuminaMath

Jia LI, Edward Beeching, Lewis Tunstall, Ben Lipkin, Roman Soletskyi, Shengyi Costa Huang, Kashif Rasul, Longhui Yu, Albert Jiang, Ziju Shen, Zihan Qin, Bin Dong, Li Zhou, Yann Fleureau, Guillaume Lample, and Stanislas Polu. NuminaMath . [https://github.com/project-numina/aimo-progress-prize](https://github.com/project-numina/aimo-progress-prize/blob/main...

work page 2024

[21] [21]

Mankowitz, Esme Sutherland Robson, Pushmeet Kohli, Nando de Freitas, Koray Kavukcuoglu, and Oriol Vinyals

Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien de Masson d’Autume, Igor Babuschkin, Xinyun Chen, Po-Sen Huang, Johannes Welbl, Sven Gowal, Alexey Cherepanov, James Molloy, Daniel J. Mankowitz, Esme Sutherland Robson, Pushm...

work page doi:10.1126/science.abq1158 2022

[22] [22]

Let's verify step by step

Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let's verify step by step. In International Conference on Learning Representations (ICLR), 2024. URL https://openreview.net/forum?id=v8L0pN6EOi

work page 2024

[23] [23]

ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models

Mingjie Liu, Shizhe Diao, Ximing Lu, Jian Hu, Xin Dong, Yejin Choi, Jan Kautz, and Yi Dong. Prorl: Prolonged reinforcement learning expands reasoning boundaries in large language models. arXiv preprint arXiv:2505.24864, 2025 a

work page internal anchor Pith review Pith/arXiv arXiv 2025

[24] [24]

Meta-reward-net: Implicitly differentiable reward learning for preference-based reinforcement learning

Runze Liu, Fengshuo Bai, Yali Du, and Yaodong Yang. Meta-reward-net: Implicitly differentiable reward learning for preference-based reinforcement learning. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 22270--22284. Curran Associates, Inc., 2022. URL https...

work page 2022

[25] [25]

Can 1b llm surpass 405b llm? rethinking compute-optimal test-time scaling

Runze Liu, Junqi Gao, Jian Zhao, Kaiyan Zhang, Xiu Li, Biqing Qi, Wanli Ouyang, and Bowen Zhou. Can 1b llm surpass 405b llm? rethinking compute-optimal test-time scaling. arXiv preprint arXiv:2502.06703, 2025 b

work page arXiv 2025

[26] [26]

Understanding R1-Zero-Like Training: A Critical Perspective

Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A critical perspective. arXiv preprint arXiv:2503.20783, 2025 c

work page internal anchor Pith review Pith/arXiv arXiv 2025

[27] [27]

Deepcoder: A fully open-source 14b coder at o3-mini level

Michael Luo, Sijun Tan, Roy Huang, Ameen Patel, Alpay Ariyak, Qingyang Wu, Xiaoxiang Shi, Rachel Xin, Colin Cai, Maurice Weber, Ce Zhang, Li Erran Li, Raluca Ada Popa, and Ion Stoica. Deepcoder: A fully open-source 14b coder at o3-mini level. https://pretty-radio-b75.notion.site/DeepCoder-A-Fully-Open-Source-14B-Coder-at-O3-mini-Level-1cf81902c14680b3bee5...

work page 2025

[28] [28]

Tang, Manan Roongta, Colin Cai, Jeffrey Luo, Li Erran Li, Raluca Ada Popa, and Ion Stoica

Michael Luo, Sijun Tan, Justin Wong, Xiaoxiang Shi, William Y. Tang, Manan Roongta, Colin Cai, Jeffrey Luo, Li Erran Li, Raluca Ada Popa, and Ion Stoica. Deepscaler: Surpassing o1-preview with a 1.5b model by scaling rl. https://pretty-radio-b75.notion.site/DeepScaleR-Surpassing-O1-Preview-with-a-1-5B-Model-by-Scaling-RL-19681902c1468005bed8ca303013a4e2, ...

work page 2025

[29] [29]

American mathematics contest 12 (amc 12), November 2023

MAA . American mathematics contest 12 (amc 12), November 2023. URL https://artofproblemsolving.com/wiki/index.php/AMC_12_Problems_and_Solutions

work page 2023

[30] [30]

American invitational mathematics examination (aime), February 2024

MAA . American invitational mathematics examination (aime), February 2024. URL https://artofproblemsolving.com/wiki/index.php/AIME_Problems_and_Solutions

work page 2024

[31] [31]

American invitational mathematics examination (aime), February 2025

MAA . American invitational mathematics examination (aime), February 2025. URL https://artofproblemsolving.com/wiki/index.php/AIME_Problems_and_Solutions

work page 2025

[32] [32]

Learning to reason with llms, 2024 a

OpenAI. Learning to reason with llms, 2024 a . URL https://openai.com/index/learning-to-reason-with-llms

work page 2024

[33] [33]

Openai o3-mini, 2024 b

OpenAI. Openai o3-mini, 2024 b . URL https://openai.com/index/openai-o3-mini

work page 2024

[34] [34]

Training language models to follow instructions with human feedback

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedbac...

work page 2022

[35] [35]

Codeforces

Guilherme Penedo, Anton Lozhkov, Hynek Kydlíček, Loubna Ben Allal, Edward Beeching, Agustín Piqueres Lajarín, Quentin Gallouédec, Nathan Habib, Lewis Tunstall, and Leandro von Werra. Codeforces. https://huggingface.co/datasets/open-r1/codeforces, 2025

work page 2025

[36] [36]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[37] [37]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[38] [38]

HybridFlow: A Flexible and Efficient RLHF Framework

Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. arXiv preprint arXiv: 2409.19256, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[39] [39]

Scaling llm test-time compute optimally can be more effective than scaling parameters for reasoning

Charlie Victor Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute optimally can be more effective than scaling parameters for reasoning. In International Conference on Learning Representations (ICLR), 2025. URL https://openreview.net/forum?id=4FWAwZtd2n

work page 2025

[40] [40]

Fastcurl: Curriculum reinforcement learning with stage-wise context scaling for efficient training r1-like reasoning models

Mingyang Song, Mao Zheng, Zheng Li, Wenjie Yang, Xuan Luo, Yue Pan, and Feng Zhang. Fastcurl: Curriculum reinforcement learning with stage-wise context scaling for efficient training r1-like reasoning models. arXiv preprint arXiv:2503.17287, 2025

work page arXiv 2025

[41] [41]

Ignore the kl penalty! boosting exploration on critical tokens to enhance rl fine-tuning

Jean Vassoyan, Nathana \"e l Beau, and Roman Plaud. Ignore the kl penalty! boosting exploration on critical tokens to enhance rl fine-tuning. arXiv preprint arXiv:2502.06533, 2025

work page arXiv 2025

[42] [42]

A lpha Z ero-like tree-search can guide large language model decoding and training

Ziyu Wan, Xidong Feng, Muning Wen, Stephen Marcus Mcaleer, Ying Wen, Weinan Zhang, and Jun Wang. A lpha Z ero-like tree-search can guide large language model decoding and training. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors, Proceedings of the 41st International Con...

work page 2024

[43] [43]

Math-shepherd: Verify and reinforce llms step-by-step without human annotations

Peiyi Wang, Lei Li, Zhihong Shao, Runxin Xu, Damai Dai, Yifei Li, Deli Chen, Yu Wu, and Zhifang Sui. Math-shepherd: Verify and reinforce llms step-by-step without human annotations. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9426--9439, 2024

work page 2024

[44] [44]

Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning

Shenzhi Wang, Le Yu, Chang Gao, Chujie Zheng, Shixuan Liu, Rui Lu, Kai Dang, Xionghui Chen, Jianxin Yang, Zhenru Zhang, et al. Beyond the 80/20 rule: High-entropy minority tokens drive effective reinforcement learning for llm reasoning. arXiv preprint arXiv:2506.01939, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[45] [45]

Reinforcement Learning with Verifiable Rewards Implicitly Incentivizes Correct Reasoning in Base LLMs

Xumeng Wen, Zihan Liu, Shun Zheng, Zhijian Xu, Shengyu Ye, Zhirong Wu, Xiao Liang, Yang Wang, Junjie Li, Ziming Miao, et al. Reinforcement learning with verifiable rewards implicitly incentivizes correct reasoning in base llms. arXiv preprint arXiv:2506.14245, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[46] [46]

Qwen2.5 Technical Report

An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tingyu X...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[47] [47]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[48] [48]

Do not let low-probability tokens over-dominate in rl for llms

Zhihe Yang, Xufang Luo, Zilong Wang, Dongqi Han, Zhiyuan He, Dongsheng Li, and Yunjian Xu. Do not let low-probability tokens over-dominate in rl for llms. arXiv preprint arXiv:2505.12929, 2025 b

work page arXiv 2025

[49] [49]

Mastering complex control in moba games with deep reinforcement learning

Deheng Ye, Zhao Liu, Mingfei Sun, Bei Shi, Peilin Zhao, Hao Wu, Hongsheng Yu, Shaojie Yang, Xipeng Wu, Qingwei Guo, Qiaobo Chen, Yinyuting Yin, Hao Zhang, Tengfei Shi, Liang Wang, Qiang Fu, Wei Yang, and Lanxiao Huang. Mastering complex control in moba games with deep reinforcement learning. Proceedings of the AAAI Conference on Artificial Intelligence, 3...

work page doi:10.1609/aaai.v34i04.6144 2020

[50] [50]

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[51] [51]

Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model? arXiv preprint arXiv:2504.13837, 2025 a

work page internal anchor Pith review Pith/arXiv arXiv 2025

[52] [52]

VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks

Yu Yue, Yufeng Yuan, Qiying Yu, Xiaochen Zuo, Ruofei Zhu, Wenyuan Xu, Jiaze Chen, Chengyi Wang, TianTian Fan, Zhengyin Du, et al. Vapo: Efficient and reliable reinforcement learning for advanced reasoning tasks. arXiv preprint arXiv:2504.05118, 2025 b

work page internal anchor Pith review Pith/arXiv arXiv 2025

[53] [53]

Rl tango: Reinforcing generator and verifier together for language reasoning

Kaiwen Zha, Zhengqi Gao, Maohao Shen, Zhang-Wei Hong, Duane S Boning, and Dina Katabi. Rl tango: Reinforcing generator and verifier together for language reasoning. arXiv preprint arXiv:2505.15034, 2025

work page arXiv 2025

[54] [54]

arXiv preprint arXiv:2504.00891

Jian Zhao, Runze Liu, Kaiyan Zhang, Zhimu Zhou, Junqi Gao, Dong Li, Jiafei Lyu, Zhouyi Qian, Biqing Qi, Xiu Li, et al. Genprm: Scaling test-time compute of process reward models via generative reasoning. arXiv preprint arXiv:2504.00891, 2025

work page arXiv 2025