pith. machine review for the scientific record.

arxiv: 2605.06523 · v1 · submitted 2026-05-07 · 💻 cs.LG · cs.AI


On the Implicit Reward Overfitting and the Low-rank Dynamics in RLVR


Pith reviewed 2026-05-08 12:35 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords reinforcement learning · verifiable rewards · low-rank dynamics · reward overfitting · singular value spectrum · reasoning models · model alignment

The pith

RLVR exhibits implicit reward overfitting: because its gains concentrate in rank-1 components, models can score well on held-out tests even while training rewards stay low.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates how reinforcement learning with verifiable rewards improves reasoning in language models. It builds on the finding that this improvement lives mostly in the rank-1 parts of the weight-matrix updates. By applying periodic rank-1 substitution during training, the authors observe that models can still score well on held-out tests even while receiving relatively weak rewards on the training examples. The work further maps three concrete changes that RLVR produces in the model: the rank-1 component carries only mathematical reasoning, the singular values across layers follow a heavy-tailed pattern, and the left singular vectors associated with the rank-1 components grow increasingly aligned as training proceeds. Together these observations describe how RLVR reshapes parameters and point toward ways to adjust training for better retention of other capabilities.

Core claim

Predicated on the observation that the enhanced reasoning capabilities models acquire through RLVR are primarily concentrated within the rank-1 components, Periodic Rank-1 Substitution reveals that RLVR may exhibit implicit reward overfitting to the training dataset. Specifically, the model can achieve satisfactory performance on the test set even when its rewards remain relatively low during training. The effective rank-1 component maintains only mathematical reasoning capability and no other model knowledge. RLVR optimizes a specific singular spectrum such that the distribution of singular values of almost all linear layers behaves like a heavy-tailed distribution. The left singular vectors associated with the rank-1 components demonstrate a stronger alignment tendency during training, echoing the finding that RLVR is, in essence, optimizing sampling efficiency.

What carries the argument

Periodic Rank-1 Substitution, the operation that periodically replaces higher-rank components with their rank-1 approximations to isolate how low-rank structure drives reward dynamics and test generalization.
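As a concrete illustration, here is a minimal sketch of that operation in PyTorch, assuming it acts per linear layer on the accumulated weight update ΔW = W_trained − W_base and keeps only the top singular triplet; the substitution interval and the choice of layers are assumptions, since the page does not specify them.

```python
import torch

def rank1_substitute(w_base: torch.Tensor, w_trained: torch.Tensor) -> torch.Tensor:
    """Project a layer's RLVR weight update onto its dominant rank-1 component.

    Sketch of one Periodic Rank-1 Substitution step: train for a short
    interval, then replace the full update with sigma_1 * u_1 v_1^T.
    """
    delta = w_trained - w_base                            # update from this interval
    u, s, vh = torch.linalg.svd(delta, full_matrices=False)
    delta_rank1 = s[0] * torch.outer(u[:, 0], vh[0, :])   # keep only the top singular triplet
    return w_base + delta_rank1
```

Applied periodically across the linear layers, this keeps the model on a rank-1 trajectory while the verifiable reward signal continues to drive training.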

If this is right

  • The rank-1 component in an RLVR-trained model retains only mathematical reasoning and discards other forms of knowledge.
  • RLVR training produces heavy-tailed singular-value distributions across nearly all linear layers.
  • Left singular vectors tied to the rank-1 components exhibit stronger alignment as training proceeds; together with the heavy-tailed spectrum above, this can be checked directly from saved checkpoints (see the sketch after this list).
  • These parameter changes supply concrete directions for redesigning RL methods to support continual learning.
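A minimal diagnostic for the second and third bullets, assuming access to per-layer updates ΔW from saved checkpoints. The spectral statistics chosen here (top-singular-value share and stable rank) are illustrative stand-ins; the paper's exact tail diagnostic is not given in the abstract.

```python
import torch

def spectrum_stats(delta_w: torch.Tensor) -> dict:
    """Crude heavy-tail probes for one layer's update spectrum."""
    s = torch.linalg.svdvals(delta_w)
    return {
        "top_share": (s[0] / s.sum()).item(),                # mass held by sigma_1
        "stable_rank": ((s ** 2).sum() / s[0] ** 2).item(),  # low value => spiky, heavy-tailed spectrum
    }

def left_alignment(delta_a: torch.Tensor, delta_b: torch.Tensor) -> float:
    """|cos| between the top *left* singular vectors of two checkpoints' updates."""
    u_a = torch.linalg.svd(delta_a, full_matrices=False).U[:, 0]
    u_b = torch.linalg.svd(delta_b, full_matrices=False).U[:, 0]
    return torch.abs(u_a @ u_b).item()  # abs() because singular vectors carry a sign ambiguity
```

Rising left_alignment across successive checkpoints would match the paper's third observation.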

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same low-rank concentration may appear whenever rewards are computed by an external verifier rather than being dense and human-designed.
  • Periodic replacement or diversification of rank-1 components could be tested as a way to preserve broader capabilities while still harvesting reasoning gains.
  • Tracking the alignment of left singular vectors might give an early signal of when the model has begun to overfit its reward signal (a hypothetical monitor is sketched below).
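If that editorial extension were pursued, a hypothetical monitor could flag the moment successive checkpoints' rank-1 directions stop rotating. The threshold, the window, and the reuse of the left_alignment helper from the earlier sketch are all assumptions, not anything the paper prescribes.

```python
# Hypothetical early-warning monitor built on left_alignment() above.
ALIGN_THRESHOLD = 0.9  # assumed cutoff; the paper specifies no such value
WINDOW = 3             # assumed number of consecutive checkpoints to require

def overfit_signal(alignments: list[float]) -> bool:
    """True once the top left singular vectors of recent checkpoint updates
    stay locked together, i.e. the rank-1 direction has stopped rotating."""
    recent = alignments[-WINDOW:]
    return len(recent) == WINDOW and min(recent) > ALIGN_THRESHOLD
```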

Load-bearing premise

That Periodic Rank-1 Substitution isolates the overfitting effect without introducing artifacts that themselves alter the observed reward dynamics or test performance.

What would settle it

If models trained with RLVR under Periodic Rank-1 Substitution display both persistently low training rewards and correspondingly reduced test performance, rather than retaining high test accuracy, the implicit-overfitting claim would be falsified.

Figures

Figures reproduced from arXiv: 2605.06523 by Bimei Wang, Bin Hu, Hao Ye, Hong Peng, Jisheng Dang, Junfeng Fang, Ning Lv, Tat-Seng Chua, Wencan Zhang, Yizhou Zhang.

Figure 1
Figure 1. Left: the process of extracting the rank-1 component from an RL-trained model; we pick out the rank-1 matrix corresponding to the greatest singular value. Right: the process of periodic rank-1 substitution; the base model is repeatedly trained for a short interval, and only a rank-1 approximation of its weight update is kept. …to the model, but rather optimizing its sampling strategy to efficiently elicit latent corre… view at source ↗
Figure 2
Figure 2. Left: the mean reward within each batch during GRPO training for Qwen2.5-7B-Instruct [23]. Mid-left: test-set accuracy of the leftmost run. Mid-right: the mean reward within each batch during GRPO training for Llama3.1-8B-Instruct [9]. Right: test-set accuracy of the mid-right run. The horizontal axes of all four subfigures are training steps. We don't use Qwen3 as post-training has enabled the lightweig… view at source ↗
Figure 3
Figure 3. Obvious performance degradation in safety after RLVR with periodic rank-1 substitution. Models are the same as those in… view at source ↗
Figure 4
Figure 4. The distribution of the singular values of each linear layer update. The greatest singular… view at source ↗
Figure 5
Figure 5. Layer-wise alignment analysis between ΔW_LoRA and ΔW^(1). From left to right: greatest singular value, Frobenius cosine similarity, left principal angle, and right principal angle across layers. The Frobenius cosine similarity remains near zero, indicating negligible global correlation in parameter space. In contrast, the subspace angle distributions show consistently smaller left principal angles than right… view at source ↗
Figure 6
Figure 6. Step 10: singular-value distributions of the layer-0 attention projection updates (q_proj, k_proj, v_proj, o_proj); plot data omitted. view at source ↗
Figure 7
Figure 7. Step 20: singular-value distributions (as in Figure 6). view at source ↗
Figure 8
Figure 8. Step 30: singular-value distributions of the layer-0 attention projection updates (q_proj, k_proj, v_proj, o_proj); plot data omitted. view at source ↗
Figure 9
Figure 9. Step 40: singular-value distributions (as in Figure 6). view at source ↗
Figure 10
Figure 10. Step 50: singular-value distributions of the layer-0 attention projection updates (q_proj, k_proj, v_proj, o_proj); plot data omitted. view at source ↗
Figure 11
Figure 11. Step 60: singular-value distributions (as in Figure 6). view at source ↗
Figure 12
Figure 12. Step 70: singular-value distributions of the layer-0 attention projection updates (q_proj, k_proj, v_proj, o_proj); plot data omitted. view at source ↗
Figure 13
Figure 13. Step 80: singular-value distributions (as in Figure 6). view at source ↗
read the original abstract

Recent extensive research has demonstrated that the enhanced reasoning capabilities acquired by models through Reinforcement Learning with Verifiable Rewards (RLVR) are primarily concentrated within the rank-1 components. Predicated on this observation, we employed Periodic Rank-1 Substitution and identified a counterintuitive phenomenon: RLVR may exhibit implicit reward overfitting to the training dataset. Specifically, the model can achieve satisfactory performance on the test set even when its rewards remain relatively low during the training process. Furthermore, we characterize three distinct properties of RL training: (1) The effective rank-1 component in RLVR don't maintain other model knowledge except mathematical reasoning capability. (2) RLVR fundamentally functions by optimizing a specific singular spectrum. The distribution of singular values of almost all linear layers in RLVR-trained model behaves like heavy-tailed distribution. (3) the left singular vectors associated with rank-1 components demonstrate a stronger alignment tendency during training, which echoes the discovery that RLVR is optimizing sampling efficiency in essence. Taken together, our findings and analysis further reveal how RLVR shapes model parameters and offer potential insights for improving existing RL paradigms or other training paradigms to implement continual learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that RLVR exhibits implicit reward overfitting to the training dataset, as models achieve satisfactory test-set performance despite relatively low training rewards; this is observed via Periodic Rank-1 Substitution. It further characterizes three properties of RL training: (1) the effective rank-1 component preserves only mathematical reasoning capability, (2) RLVR optimizes a heavy-tailed singular spectrum across linear layers, and (3) left singular vectors of rank-1 components exhibit stronger alignment during training, interpreted as optimization of sampling efficiency.

Significance. If the observations are shown to be robust to controls and not artifacts of the substitution procedure, the work could provide useful empirical insights into the low-rank mechanisms underlying RLVR's reasoning gains and suggest directions for designing RL methods that reduce overfitting while supporting continual learning.

major comments (2)
  1. [Abstract] The central claim of implicit reward overfitting and the three listed properties rest entirely on observations obtained under Periodic Rank-1 Substitution, yet the abstract supplies no experimental details, controls, baselines, statistical tests, or ablation results that would demonstrate the substitution isolates overfitting without independently altering reward dynamics or test performance.
  2. [Abstract] The weakest assumption is that Periodic Rank-1 Substitution preserves the original RLVR reward dynamics and generalization behavior; without explicit controls (e.g., standard RLVR runs or ablation of the substitution schedule) showing that the intervention does not itself suppress measured rewards or inflate test scores, the reported mismatch between training rewards and test performance cannot be attributed to RLVR rather than the experimental procedure.
minor comments (1)
  1. [Abstract] The abstract would benefit from a concise definition or reference for 'Periodic Rank-1 Substitution' on first use.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive feedback on our manuscript. The comments highlight important points about the presentation of our central claims in the abstract and the assumptions underlying Periodic Rank-1 Substitution. We address each major comment below and outline the revisions we will make to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract] The central claim of implicit reward overfitting and the three listed properties rest entirely on observations obtained under Periodic Rank-1 Substitution, yet the abstract supplies no experimental details, controls, baselines, statistical tests, or ablation results that would demonstrate the substitution isolates overfitting without independently altering reward dynamics or test performance.

    Authors: We agree that the abstract, constrained by length, does not detail the experimental controls or ablations. The full manuscript describes the Periodic Rank-1 Substitution procedure in detail (Section 3), including comparisons to standard RLVR runs and ablations of the substitution schedule that confirm it does not independently suppress training rewards or inflate test performance. To address the concern, we will revise the abstract to briefly reference the substitution method and note that its validity is supported by the controls and statistical comparisons reported in the experimental sections. revision: yes

  2. Referee: [Abstract] The weakest assumption is that Periodic Rank-1 Substitution preserves the original RLVR reward dynamics and generalization behavior; without explicit controls (e.g., standard RLVR runs or ablation of the substitution schedule) showing that the intervention does not itself suppress measured rewards or inflate test scores, the reported mismatch between training rewards and test performance cannot be attributed to RLVR rather than the experimental procedure.

    Authors: This concern is well-taken. The manuscript already includes explicit controls via direct comparisons between standard RLVR training and the Periodic Rank-1 Substitution variant, along with ablations of the substitution schedule (see Figures 4-6 and associated text). These demonstrate that the intervention preserves reward dynamics and does not artifactually create the observed reward-test mismatch. We will revise the abstract to summarize these controls concisely, ensuring readers can immediately appreciate that the mismatch is attributable to RLVR rather than the procedure. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical substitution experiments with no self-referential derivations or fitted predictions

full rationale

The paper's core claims rest on observations from Periodic Rank-1 Substitution applied to RLVR training runs. No equations, derivations, or parameter-fitting steps are described that reduce by construction to the inputs (e.g., no self-definitional scaling, no 'prediction' of a quantity that was itself fitted). The three listed properties are direct empirical characterizations of the resulting models rather than outputs of a closed mathematical chain. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The analysis is therefore grounded in empirical measurement rather than self-reference, and it receives the default non-finding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Analysis rests on standard linear-algebra operations and empirical measurement; no free parameters, invented entities, or non-standard axioms are introduced in the abstract.

axioms (1)
  • standard math Singular value decomposition can be applied to weight matrices of linear layers to isolate rank-1 components.
    Invoked to extract and substitute the dominant rank-1 component during training; the decomposition is spelled out below.
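For reference, the decomposition this axiom invokes, in standard notation (writing ΔW for a layer's weight update, following Figure 5's usage; this is textbook linear algebra, not anything new from the paper):

```latex
% SVD of a layer's weight update; the dominant rank-1 component
% \Delta W^{(1)} is the best rank-1 approximation of \Delta W in
% Frobenius norm (Eckart--Young).
\Delta W = \sum_{i=1}^{r} \sigma_i \, u_i v_i^{\top},
\qquad \sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r \ge 0,
\qquad \Delta W^{(1)} = \sigma_1 \, u_1 v_1^{\top}.
```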

pith-pipeline@v0.9.0 · 5528 in / 1032 out tokens · 42047 ms · 2026-05-08T12:35:18.423148+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

44 extracted references · 29 canonical work pages · 20 internal anchors

  1. [1]

    Adam: A Method for Stochastic Optimization

    Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014

  2. [2]

    Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs

    Arash Ahmadian, Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer, Olivier Pietquin, Ahmet Üstün, and Sara Hooker. Back to basics: Revisiting reinforce style optimization for learning from human feedback in llms. arXiv preprint arXiv:2402.14740, 2024

  3. [3]

    On Predictability of Reinforcement Learning Dynamics for Large Language Models

    Yuchen Cai, Ding Cao, Xin Xu, Zijun Yao, Yuqing Huang, Zhenyu Tan, Benyi Zhang, Guiquan Liu, and Junfeng Fang. On predictability of reinforcement learning dynamics for large language models. arXiv preprint arXiv:2510.00553, 2025

  4. [4]

    MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention

    Aili Chen, Aonian Li, Bangwei Gong, Binyang Jiang, Bo Fei, Bo Yang, Boji Shan, Changqing Yu, Chao Wang, Cheng Zhu, et al. Minimax-m1: Scaling test-time compute efficiently with lightning attention. arXiv preprint arXiv:2506.13585, 2025

  5. [5]

    Pass@k Training for Adaptively Balancing Exploration and Exploitation of Large Reasoning Models

    Zhipeng Chen, Xiaobo Qin, Youbin Wu, Yue Ling, Qinghao Ye, Wayne Xin Zhao, and Guang Shi. Pass@k training for adaptively balancing exploration and exploitation of large reasoning models. arXiv preprint arXiv:2508.10751, 2025

  6. [6]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

  7. [7]

    The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models

    Ganqu Cui, Yuchen Zhang, Jiacheng Chen, Lifan Yuan, Zhi Wang, Yuxin Zuo, Haozhan Li, Yuchen Fan, Huayu Chen, Weize Chen, Zhiyuan Liu, Hao Peng, Lei Bai, Wanli Ouyang, Yu Cheng, Bowen Zhou, and Ning Ding. The entropy mechanism of reinforcement learning for reasoning language models, 2025. URL https://arxiv.org/abs/2505.22617

  8. [8]

    Assessing diversity collapse in reasoning

    Xingyu Dang, Christina Baek, J Zico Kolter, and Aditi Raghunathan. Assessing diversity collapse in reasoning. In Scaling Self-Improving Foundation Models without Human Supervision, 2025

  9. [9]

    The Llama 3 Herd of Models

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

  10. [10]

    OpenThoughts: Data Recipes for Reasoning Models

    Etash Guha, Ryan Marten, Sedrick Keh, Negin Raoof, Georgios Smyrnis, Hritik Bansal, Marianna Nezhurina, Jean Mercat, Trung Vu, Zayne Sprague, et al. Openthoughts: Data recipes for reasoning models. arXiv preprint arXiv:2506.04178, 2025

  11. [11]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

  12. [12]

    DeepMath-103K: A Large-Scale, Challenging, Decontaminated, and Verifiable Mathematical Dataset for Advancing Reasoning

    Zhiwei He, Tian Liang, Jiahao Xu, Qiuzhi Liu, Xingyu Chen, Yue Wang, Linfeng Song, Dian Yu, Zhenwen Liang, Wenxuan Wang, et al. Deepmath-103k: A large-scale, challenging, decontaminated, and verifiable mathematical dataset for advancing reasoning. arXiv preprint arXiv:2504.11456, 2025

  13. [13]

    Measuring Massive Multitask Language Understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. Proceedings of the International Conference on Learning Representations (ICLR), 2021

  14. [14]

    Collective Explainable AI: Explaining Cooperative Strategies and Agent Contribution in Multiagent Reinforcement Learning with Shapley Values

    Alexandre Heuillet, Fabien Couthouis, and Natalia Díaz-Rodríguez. Collective explainable ai: Explaining cooperative strategies and agent contribution in multiagent reinforcement learning with shapley values. IEEE Computational Intelligence Magazine, 17(1):59–71, 2022

  15. [15]

    Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model

    Jingcheng Hu, Yinmin Zhang, Qi Han, Daxin Jiang, Xiangyu Zhang, and Heung-Yeung Shum. Open-reasoner-zero: An open source approach to scaling up reinforcement learning on the base model. arXiv preprint arXiv:2503.24290, 2025

  16. [16]

    Llama3.1-8B-Thinking-R1

    Jackrong. Llama3.1-8B-Thinking-R1. https://huggingface.co/Jackrong/Llama3.1-8B-Thinking-R1, 2025. Accessed: 2026-05-07

  17. [17]

    The Universal Weight Subspace Hypothesis

    Prakhar Kaushik, Shravan Chaudhari, Ankit Vaidya, Rama Chellappa, and Alan Yuille. The universal weight subspace hypothesis. arXiv preprint arXiv:2512.05117, 2025

  18. [18]

    Efficient memory management for large language model serving with pagedattention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th symposium on operating systems principles, pages 611–626, 2023

  19. [19]

    Understanding R1-Zero-Like Training: A Critical Perspective

    Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A critical perspective. arXiv preprint arXiv:2503.20783, 2025

  20. [20]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017

  21. [21]

    Training Language Models to Follow Instructions with Human Feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022

  22. [22]

    Tinyzero

    Jiayi Pan, Junjie Zhang, Xingyao Wang, Lifan Yuan, Hao Peng, and Alane Suhr. Tinyzero. https://github.com/Jiayi-Pan/TinyZero, 2025. Accessed: 2025-01-24

  23. [23]

    Qwen2.5 Technical Report

    Qwen Team: An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024

  24. [24]

    Direct preference optimization: Your language model is secretly a reward model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems, 36:53728–53741, 2023

  25. [25]

    Learning dynamics of llm finetuning

    Yi Ren and Danica J Sutherland. Learning dynamics of llm finetuning. InThe Thirteenth International Conference on Learning Representations, 2025

  26. [26]

    LoRA without regret

    John Schulman and Thinking Machines Lab. Lora without regret. Thinking Machines Lab: Connectionism, 2025. doi: 10.64434/tml.20250929. https://thinkingmachines.ai/blog/lora/

  27. [27]

    Trust region policy optimization

    John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International conference on machine learning, pages 1889–1897. PMLR, 2015

  28. [28]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

  29. [29]

    Interestingness Elements for Explainable Reinforcement Learning: Understanding Agents' Capabilities and Limitations

    Pedro Sequeira and Melinda Gervasio. Interestingness elements for explainable reinforcement learning: Understanding agents' capabilities and limitations. Artificial Intelligence, 288:103367, 2020

  30. [30]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

  31. [31]

    RL's Razor: Why Online Reinforcement Learning Forgets Less

    Idan Shenfeld, Jyothish Pari, and Pulkit Agrawal. Rl's razor: Why online reinforcement learning forgets less, 2025. URL https://arxiv.org/abs/2509.04259

  32. [32]

    Hybridflow: A flexible and efficient rlhf framework

    Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. In Proceedings of the Twentieth European Conference on Computer Systems, pages 1279–1297, 2025

  33. [33]

    Sample more to think less: Group filtered policy optimization for concise reasoning

    Vaishnavi Shrivastava, Ahmed Awadallah, Vidhisha Balachandran, Shivam Garg, Harkirat Behl, and Dimitris Papailiopoulos. Sample more to think less: Group filtered policy optimization for concise reasoning. arXiv preprint arXiv:2508.09726, 2025

  34. [34]

    Kimi Team: Yifan Bai, Yiping Bao, Guanduo Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, Yanru Chen, Yuankun Chen, Yutian Chen, Zhuofu Chen, et al.

  35. [35]

    Qwen2 Technical Report

    Qwen Team. Qwen2 technical report. arXiv preprint arXiv:2407.10671, 2024

  36. [36]

    MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark

    Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, et al. Mmlu-pro: A more robust and challenging multi-task language understanding benchmark. Advances in Neural Information Processing Systems, 37:95266–95290, 2024

  37. [37]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

  38. [38]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, et al. Dapo: An open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025

  39. [39]

    Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

    Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Yang Yue, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model? arXiv preprint arXiv:2504.13837, 2025

  40. [40]

    Lolcats: On low-rank linearizing of large language models

    Michael Zhang, Simran Arora, Rahul Chalamala, Alan Wu, Benjamin Spector, Aaryan Singhal, Krithik Ramesh, and Christopher Ré. Lolcats: On low-rank linearizing of large language models. arXiv preprint arXiv:2410.10254, 2024

  41. [41]

    Safetybench: Evaluating the safety of large language models

    Zhexin Zhang, Leqi Lei, Lindong Wu, Rui Sun, Yongkang Huang, Chong Long, Xiao Liu, Xuanyu Lei, Jie Tang, and Minlie Huang. Safetybench: Evaluating the safety of large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15537–15553, 2024

  42. [42]

    Geometric-Mean Policy Optimization

    Yuzhong Zhao, Yue Liu, Junpeng Liu, Jingye Chen, Xun Wu, Yaru Hao, Tengchao Lv, Shaohan Huang, Lei Cui, Qixiang Ye, et al. Geometric-mean policy optimization. arXiv preprint arXiv:2507.20673, 2025

  43. [43]

    Group Sequence Policy Optimization

    Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, et al. Group sequence policy optimization. arXiv preprint arXiv:2507.18071, 2025

  44. [44]

    Instruction-Following Evaluation for Large Language Models

    Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models, 2023. URL https://arxiv.org/abs/2311.07911