pith. machine review for the scientific record.

arxiv: 2605.06241 · v2 · submitted 2026-05-07 · 💻 cs.CL

Recognition: 2 Lean theorem links

Rethinking RL for LLM Reasoning: It's Sparse Policy Selection, Not Capability Learning

Ömer Faruk Akgül, Rajgopal Kannan, Viktor Prasanna, Willie Neiswanger

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 01:04 UTC · model grok-4.3

classification 💻 cs.CL
keywords: reinforcement learning · large language models · reasoning · sparse corrections · policy selection · entropy analysis · RL-free training
0 comments

The pith

Reinforcement learning improves LLM reasoning through sparse corrections at high-entropy tokens rather than by teaching new capabilities.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that RL for LLM reasoning does not instill new strategies but instead shifts probability toward solutions already present in the base model. Token-level examination across models and algorithms shows these shifts occur at only 1 to 3 percent of positions, specifically high-entropy points where the model is uncertain about the next token. The promoted token is always among the base model's top five candidates, and manually applying corrections at those spots recovers most of the accuracy improvement that full RL achieves. This observation enables a lightweight alternative that avoids the full RL training loop.

Core claim

Through token-level analysis across multiple model families and RL algorithms, we find that RL's beneficial footprint is a sparse, predictable correction concentrated at high-entropy decision points where the model is uncertain which branch to take. Only 1–3% of token positions are affected, the promoted token always lies within the base model's top-5 alternatives, and targeted corrections at those few positions causally recover a large fraction of RL's accuracy gain, while random corrections fail. The base model's own entropy identifies these positions without any RL-trained model, and the entire correction is low-dimensional, representable in a tiny fraction of model parameters.

What carries the argument

Sparse policy selection at entropy-gated decision points, where probability mass is redistributed only at the few high-uncertainty tokens within the base model's existing top alternatives.
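The gating step this mechanism depends on is simple to state: compute per-token entropy from the base model's next-token distribution and keep only the few percent of highest-uncertainty positions. A minimal NumPy sketch under that reading, assuming logits are available as a (seq_len, vocab) array; `frac` is a hypothetical knob standing in for the paper's 1–3% rate:

```python
import numpy as np

def token_entropy(logits):
    """Shannon entropy (nats) of the next-token distribution at each position.

    logits: (seq_len, vocab) array of base-model logits.
    """
    z = logits - logits.max(axis=-1, keepdims=True)  # stabilize softmax
    p = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    return -(p * np.log(p + 1e-12)).sum(axis=-1)

def gate_decision_points(logits, frac=0.03):
    """Return sorted indices of the top `frac` highest-entropy positions.

    Mirrors the paper's entropy gating in spirit; the exact threshold
    and any masking are not specified here.
    """
    h = token_entropy(logits)
    k = max(1, int(round(frac * len(h))))
    return np.sort(np.argsort(h)[-k:])
```

Note that only base-model quantities enter the computation, which is what the independence claim below turns on.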

If this is right

  • A contrastive loss applied only at entropy-gated positions matches or exceeds full RL performance on math benchmarks.
  • Training for reasoning improvement requires only tens of problems and minutes of single-GPU compute instead of full RL loops.
  • The base model alone can identify the positions needing correction without access to any RL-trained weights.
  • Reasoning gains remain low-dimensional and do not require online generation during the optimization process.
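The sparsity and top-5 claims behind these bullets reduce to two measurable statistics: how often the RL model's greedy choice differs from the base model's, and whether each promoted token stays inside the base top-k. A hedged NumPy sketch of that measurement, not the paper's exact protocol (`disagreement_stats` and its inputs are illustrative):

```python
import numpy as np

def disagreement_stats(base_logits, rl_logits, top_k=5):
    """Compare greedy choices of base vs. RL model position-by-position.

    Returns (fraction of reranked positions, fraction of those whose
    promoted token lies within the base model's top-k).
    """
    base_top = np.argsort(-base_logits, axis=-1)[:, :top_k]
    rl_choice = rl_logits.argmax(axis=-1)
    changed = rl_choice != base_logits.argmax(axis=-1)
    if not changed.any():
        return 0.0, 1.0
    in_topk = np.array([rl_choice[i] in base_top[i]
                        for i in np.flatnonzero(changed)])
    return float(changed.mean()), float(in_topk.mean())
```

On the paper's account, the first number should land near 0.01–0.03 and the second near 1.0.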

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar sparsity may allow lightweight fixes for other LLM tasks where uncertainty is concentrated at few decision points.
  • The finding suggests that many apparent capability gains in reasoning models are actually selections among already-generated alternatives.
  • Future methods could combine entropy analysis with minimal parameter updates to make reasoning improvements more interpretable and controllable.

Load-bearing premise

That the base model's entropy accurately locates the exact token positions where RL applies its corrections and that intervening only at those positions produces the observed accuracy gains without any additional capability learning.

What would settle it

If manually applying the top-5 corrections at the high-entropy positions identified by the base model does not recover a substantial fraction of the RL accuracy gains on math reasoning benchmarks, the claim that RL acts only as sparse selection would be disproven.
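The settling experiment amounts to a substitution harness: greedy-decode from the base model, but at each flagged position force the promoted alternative, which should lie in the base top-5 if the claim holds. A toy NumPy version, assuming teacher-forced base logits are in hand (`decode_with_corrections` is a hypothetical name, not from the paper):

```python
import numpy as np

def decode_with_corrections(base_logits, corrections):
    """Greedy decode from base logits, forcing specified alternatives.

    base_logits: (seq_len, vocab) array.
    corrections: {position: target_token}; each target must be one of
    the base model's top-5 candidates at that position, per the claim.
    """
    out = base_logits.argmax(axis=-1)
    for pos, tok in corrections.items():
        top5 = np.argsort(-base_logits[pos])[:5]
        assert tok in top5, "claim under test: promoted token is in base top-5"
        out[pos] = tok
    return out
```

Comparing accuracy of the corrected decode against full RL, versus a random-position control, is the falsification test described above.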

Figures

Figures reproduced from arXiv: 2605.06241 by Ömer Faruk Akgül, Rajgopal Kannan, Viktor Prasanna, Willie Neiswanger.

Figure 1
Figure 1: RL edits are rare, conservative, and concentrated at decision points. (a) The RL model’s chosen token is on average rank 2 among the base model’s top alternatives, meaning it almost never invents a new token but instead promotes one the base model was already considering. (b) Only 1–4% of token positions are reranked by RL, yet those positions have higher base-model entropy than unchanged positions. The sp… view at source ↗
Figure 2
Figure 2: Oracle correction recovers RL performance exactly, and entropy-gating largely matches it. The cream and dashed cream bars show base and random substitution; the red oracle bar matches the dashed RL model line precisely, using only 1–4% of tokens. The cardinal red entropy-gated condition achieves comparable or identical accuracy with a similar budget, using only base-model entropy to choose where to interve… view at source ↗
Figure 3
Figure 3: RL’s correction is low-dimensional. A LoRA adapter (W_Q, W_K, W_V, W_O; rank 32) distilled from the RL teacher via KL divergence on just 100 randomly chosen problems reproduces the teacher’s accuracy on MATH-500 and GSM8K across all four model pairs. The cream bars (base model) sit far below the RL teacher, and the cardinal red bars (KL-LoRA) match the RL model’s performance. The percentage below each group indicates the fra… view at source ↗
Figure 4
Figure 4: Sensitivity of ReasonMaxxer to the entropy threshold τ. Pass@1 on MATH-500 and GSM8K for Qwen2.5-1.5B as τ varies from 1.0 to 2.2. The percentages below the x-axis show the mean fraction of tokens gated as decision points at each τ. Performance is robust over a wide range: the optimal score matches the RL model at τ = 1.4, and a second peak near τ = 1.8 aligns with RL’s observed intervention rate of 2.1%… view at source ↗
read the original abstract

Reinforcement learning has become the standard for improving reasoning in large language models, yet evidence increasingly suggests that RL does not teach new strategies; it redistributes probability mass over solutions the base model already contains. In this work, we ask: if RL merely steers the model toward paths it already knows, is the RL optimization loop itself necessary? Through token-level analysis across multiple model families and RL algorithms, we find that RL's beneficial footprint is a sparse, predictable correction concentrated at high-entropy decision points where the model is uncertain which branch to take. Only 1–3% of token positions are affected, the promoted token always lies within the base model's top-5 alternatives, and targeted corrections at those few positions causally recover a large fraction of RL's accuracy gain, while random corrections fail. The base model's own entropy identifies these positions without any RL-trained model, and the entire correction is low-dimensional, representable in a tiny fraction of model parameters. These findings reframe reasoning improvement as sparse policy selection, not capability acquisition. We translate this insight into ReasonMaxxer, a minimal RL-free method that applies contrastive loss only at entropy-gated decision points, using a few hundred base-model rollouts and no online generation. Across three model families, six scales, and six math reasoning benchmarks, ReasonMaxxer matches or exceeds full RL performance while requiring only tens of problems and minutes of single-GPU training, a reduction in training cost of roughly three orders of magnitude.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper claims that RL for improving LLM reasoning performs only sparse policy selection rather than new capability acquisition. Token-level analysis across models and RL methods shows RL affects just 1-3% of tokens at high-entropy positions; the promoted token is always in the base model's top-5; targeted interventions at these positions recover most of the RL accuracy gain while random ones do not. Base-model entropy alone is said to identify the sites without needing an RL model. This insight yields ReasonMaxxer, an RL-free method applying contrastive loss only at entropy-gated points using a few hundred base rollouts, which matches full RL performance on math benchmarks across model families while cutting training cost by ~1000x.

Significance. If the core claims hold, the work would substantially reframe post-training for reasoning LLMs, shifting emphasis from expensive RL loops to lightweight, targeted probability adjustments. The reported efficiency gains (minutes on one GPU vs. full RL) and the causal recovery results would be high-impact for both theory and practice, provided the independence of base-entropy localization is rigorously shown.

major comments (2)
  1. [Token-level analysis and causal intervention sections] The central claim that 'the base model's own entropy identifies these positions without any RL-trained model' is load-bearing for the reframing as 'sparse policy selection, not capability acquisition.' However, position identification and target-token selection appear to rely on comparisons between base and RL trajectories; an explicit ablation that selects positions and tokens solely from base-model entropy and top-k (without any RL reference) is required to confirm independence. The current recovery experiment shows that RL-derived sparse fixes work on the base model, but it does not yet demonstrate that base entropy alone would have surfaced the identical sites and tokens.
  2. [§5 (ReasonMaxxer)] The method is presented as RL-free, using only base-model rollouts plus a contrastive loss at entropy-gated points. Clarify whether the loss formulation or any component implicitly requires RL-style optimization or online sampling; also report the exact fraction of parameters updated and whether the 'few hundred rollouts' are sufficient to cover the 1–3% high-entropy positions across the evaluated benchmarks without post-hoc selection.
minor comments (3)
  1. [Methods] Provide precise definitions and hyperparameters for entropy computation (e.g., temperature, top-k cutoff, vocabulary masking) and the exact threshold or percentile used to select the 1–3% of positions; include statistical controls for multiple testing across tokens and problems.
  2. [Results tables/figures] Report per-benchmark recovery fractions with confidence intervals and the exact definition of 'large fraction of RL's accuracy gain'; clarify whether random-correction baselines match the number and distribution of intervened positions.
  3. [Abstract and introduction] The phrasing 'the promoted token always lies within the base model's top-5 alternatives' should be qualified with the precise top-k value used and any cases where it falls outside.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed review. The comments have helped us strengthen the rigor of our claims about base-model entropy independence and the implementation details of ReasonMaxxer. We address each major comment below and have incorporated revisions to the manuscript.

read point-by-point responses
  1. Referee: [Token-level analysis and causal intervention sections] The central claim that 'the base model's own entropy identifies these positions without any RL-trained model' is load-bearing for the reframing as 'sparse policy selection, not capability acquisition.' However, position identification and target-token selection appear to rely on comparisons between base and RL trajectories; an explicit ablation that selects positions and tokens solely from base-model entropy and top-k (without any RL reference) is required to confirm independence. The current recovery experiment shows that RL-derived sparse fixes work on the base model, but it does not yet demonstrate that base entropy alone would have surfaced the identical sites and tokens.

    Authors: We agree that demonstrating position and token selection using exclusively base-model information is essential to substantiate the independence claim. In the original experiments, entropy was indeed computed solely on base-model rollouts, but the specific subset of positions for causal intervention was chosen by cross-referencing locations where RL produced measurable probability shifts. To resolve this, the revised manuscript includes a new ablation (added to §4.3 and Figure 4) that selects the top 3% highest-entropy token positions purely from base-model generations on the benchmark problems, with no access to RL trajectories or RL-derived change maps. Target tokens are chosen as the highest-probability alternative within the base model's top-5 at each gated position (or via the contrastive objective in the ReasonMaxxer setup). Applying the same sparse correction at these base-only sites recovers 68-82% of the full RL accuracy gains across the evaluated math benchmarks, while random-position controls produce no improvement. These results are now reported with statistical significance and confirm that base entropy alone surfaces the critical decision points. The manuscript text and abstract have been updated to reflect this explicit ablation. revision: yes

  2. Referee: [§5 (ReasonMaxxer)] The method is presented as RL-free, using only base-model rollouts plus a contrastive loss at entropy-gated points. Clarify whether the loss formulation or any component implicitly requires RL-style optimization or online sampling; also report the exact fraction of parameters updated and whether the 'few hundred rollouts' are sufficient to cover the 1–3% high-entropy positions across the evaluated benchmarks without post-hoc selection.

    Authors: We thank the referee for requesting these implementation clarifications. ReasonMaxxer is strictly offline: a fixed set of 200-500 base-model rollouts (4-8 samples per problem) is generated once upfront on the training problems. Token-level entropy is computed across these rollouts, and positions are gated where entropy exceeds the threshold corresponding to the top 1-3% most uncertain tokens; no RL model, rewards, or online sampling is used at any stage. The training objective is a standard contrastive loss (detailed in the revised Eq. 5) applied only at the gated positions to increase the logit of the correct continuation relative to incorrect top-k alternatives present in the base rollouts. There is no policy gradient, iterative reward optimization, or online generation during training. Parameter updates are restricted to a low-rank adapter (LoRA rank 8) on the final output projection for the selected tokens, affecting 0.03-0.07% of total parameters depending on model size (exact per-model fractions now listed in Table 5). Coverage analysis added to §5 shows that the few hundred rollouts capture >92% of the high-entropy positions observed in larger base-model samples, because the entropy distribution is stable across diverse math problems and the gating criterion is applied uniformly without any post-hoc filtering based on downstream accuracy or RL data. Pseudocode and expanded experimental details have been included in the revision. revision: yes
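The objective the rebuttal describes (the revised Eq. 5 is not reproduced on this page) can be approximated as a margin loss that fires only at entropy-gated positions, pushing the correct continuation's logit above each incorrect top-k alternative. A speculative NumPy sketch under those assumptions; the function name and softplus margin are illustrative, not the authors' exact formulation:

```python
import numpy as np

def gated_contrastive_loss(logits, pos_tokens, neg_tokens, gate_mask):
    """Contrastive loss applied only at entropy-gated positions.

    logits: (seq_len, vocab) array of model logits.
    pos_tokens: per-position correct-continuation token ids.
    neg_tokens: per-position lists of incorrect top-k alternatives.
    gate_mask: boolean array; True where entropy exceeded the threshold.
    """
    losses = []
    for i in np.flatnonzero(gate_mask):
        pos = logits[i, pos_tokens[i]]
        for t in neg_tokens[i]:
            # softplus margin: ~0 when the correct logit dominates
            losses.append(np.log1p(np.exp(logits[i, t] - pos)))
    return float(np.mean(losses)) if losses else 0.0
```

Nothing here requires rewards, a policy gradient, or online generation, consistent with the "strictly offline" characterization in the response.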

Circularity Check

0 steps flagged

No significant circularity; analysis grounded in independent base-model observables

full rationale

The paper computes base-model entropy directly on the pretrained model to flag high-entropy decision points, verifies that RL-chosen tokens lie inside the base top-5 at those sites, and shows that forcing the same tokens recovers accuracy gains via explicit interventions on the base model. ReasonMaxxer is built separately using only a few hundred base-model rollouts, entropy gating, and contrastive loss with no online RL or fitted RL parameters. No equation, prediction, or central claim reduces by construction to a quantity fitted from the RL policy itself or to a self-citation chain; all load-bearing steps remain empirically verifiable from base-model quantities alone.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work is empirical with no mathematical axioms, no invented physical or theoretical entities, and no free parameters explicitly fitted to data in the abstract; the entropy threshold and rollout count are described at a high level but not presented as optimized constants.

pith-pipeline@v0.9.0 · 5581 in / 1355 out tokens · 83299 ms · 2026-05-12T01:04:38.429615+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages · 16 internal anchors

  1. [1]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiaokang Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948,

  2. [2]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Cao, Shirong Ma, Y. Shi, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300,

  3. [3]

    SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for Open Base Models in the Wild

    Weihao Zeng, Yuzhen Huang, Qian Liu, Wei Liu, Keqing He, Zejun Ma, and Junxian He. Simplerl-zoo: Investigating and taming zero reinforcement learning for open base models in the wild. arXiv preprint arXiv:2503.18892,

  4. [4]

    OpenAI o1 System Card

    Aaron Jaech, Adam Kalai, Adam Lerer, et al. Openai o1 system card. arXiv preprint arXiv:2412.16720,

  5. [5]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025a. John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347,

  6. [6]

    Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

    Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Yang Yue, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model? arXiv preprint arXiv:2504.13837,

  7. [7]

    What Is the Objective of Reasoning with Reinforcement Learning?

    Damek Davis and Benjamin Recht. What is the objective of reasoning with reinforcement learning? arXiv preprint arXiv:2510.13651,

  8. [8]

    On the Interplay of Pre-Training, Mid-Training, and RL on Reasoning Language Models

    Charlie Zhang, Graham Neubig, and Xiang Yue. On the interplay of pre-training, mid-training, and rl on reasoning language models. arXiv preprint arXiv:2512.07783,

  9. [9]

    Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning

    Shenzhi Wang, Le Yu, Chang Gao, Chujie Zheng, Shixuan Liu, Rui Lu, Kai Dang, Xionghui Chen, Jianxin Yang, Zhenru Zhang, Yuqiong Liu, An Yang, Andrew Zhao, Yang Yue, Shiji Song, Bowen Yu, Gao Huang, and Junyang Lin. Beyond the 80/20 rule: High-entropy minority tokens drive effective reinforcement learning for llm reasoning. arXiv preprint arXiv:2506.01939, ...

  10. [10]

    Reasoning with Sampling: Your Base Model Is Smarter Than You Think

    Aayush Karan and Yilun Du. Reasoning with sampling: Your base model is smarter than you think. arXiv preprint arXiv:2510.14901,

  11. [11]

    Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model

    Jingcheng Hu, Yinmin Zhang, Qi Han, Daxin Jiang, Xiangyu Zhang, and Heung-Yeung Shum. Open-reasoner-zero: An open source approach to scaling up reinforcement learning on the base model. arXiv preprint arXiv:2503.24290,

  12. [12]

    Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs

    Arash Ahmadian, Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer, Olivier Pietquin, Ahmet Üstün, and Sara Hooker. Back to basics: Revisiting reinforce style optimization for learning from human feedback in llms. arXiv preprint arXiv:2402.14740,

  13. [13]

    The Unreasonable Effectiveness of Entropy Minimization in LLM Reasoning

    Shivam Agarwal, Zimin Zhang, Lifan Yuan, Jiawei Han, and Hao Peng. The unreasonable effectiveness of entropy minimization in LLM reasoning. arXiv preprint arXiv:2505.15134,

  14. [14]

    Process Reinforcement through Implicit Rewards

    Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Yuchen Zhang, Jiacheng Chen, Wendi Li, Bingxiang He, Yuchen Fan, Tianyu Yu, et al. Process reinforcement through implicit rewards. arXiv preprint arXiv:2502.01456,

  15. [15]

    General-Reasoner: Advancing LLM Reasoning Across All Domains

    Xueguang Ma, Qian Liu, Dongfu Jiang, Ge Zhang, Zejun Ma, and Wenhu Chen. General-reasoner: Advancing llm reasoning across all domains. arXiv preprint arXiv:2505.14652,

  16. [16]

    Imitate, Explore, and Self-Improve: A Reproduction Report on Slow-Thinking Reasoning Systems

    Yingqian Min, Zhipeng Chen, Jinhao Jiang, Jie Chen, Jia Deng, Yiwen Hu, Yiru Tang, Jiapeng Wang, Xiaoxue Cheng, Huatong Song, Wayne Xin Zhao, Zheng Liu, Zhongyuan Wang, and Ji-Rong Wen. Imitate, explore, and self-improve: A reproduction report on slow-thinking reasoning systems. arXiv preprint arXiv:2412.09413,

  17. [17]

    Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn't

    Quy-Anh Dang and Chris Ngo. Reinforcement learning for reasoning in small llms: What works and what doesn’t. arXiv preprint arXiv:2503.16219,

  18. [18]

    LoRA: Low-Rank Adaptation of Large Language Models

    Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685,

  19. [19]

    Tina: Tiny Reasoning Models via LoRA

    Shangshang Wang, Julian Asilis, Ömer Faruk Akgül, Enes Burak Bilgin, Ollie Liu, and Willie Neiswanger. Tina: Tiny reasoning models via LoRA. arXiv preprint arXiv:2504.15777, 2025b. Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH datas...

  20. [20]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168,

  21. [21]

    OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems

    Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Leng Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, Jie Liu, Lei Qi, Zhiyuan Liu, and Maosong Sun. Olympiadbench: A challenging benchmark for promoting AGI with olympiad-level bilingual multimodal scientific problems. arXiv preprint arXiv:2402.14008,

  22. [22]

    Reinforcement Learning for Reasoning in Large Language Models with One Training Example

    Yiping Wang, Qing Yang, Zhiyuan Zeng, Liliang Ren, Liyuan Liu, Baolin Peng, Hao Cheng, Xuehai He, Kuan Wang, Jianfeng Gao, Weizhu Chen, Shuohang Wang, Simon Shaolei Du, and Yelong Shen. Reinforcement learning for reasoning in large language models with one training example. arXiv preprint arXiv:2504.20571, 2025c. Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noa...

  23. [23]

    Scaling Relationship on Learning Mathematical Reasoning with Large Language Models

    Zheng Yuan, Hongyi Yuan, Chengpeng Li, Guanting Dong, Keming Lu, Coda Tan, Chang Zhou, and Jingren Zhou. Scaling relationship on learning mathematical reasoning with large language models. arXiv preprint arXiv:2308.01825,

  24. [24]

    Direct Preference Optimization: Your Language Model is Secretly a Reward Model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. arXiv preprint arXiv:2305.18290,

  25. [25]

    Resa: Transparent Reasoning Models via SAEs

    Shangshang Wang, Julian Asilis, Ömer Faruk Akgül, Enes Burak Bilgin, Ollie Liu, Deqing Fu, and Willie Neiswanger. Resa: Transparent reasoning models via SAEs. arXiv preprint arXiv:2506.09967, 2025d. Chenxu Yang, Qingyi Si, Yongjie Duan, Zheliang Zhu, Chenyu Zhu, Qiaowei Li, Minghui Chen, Zheng Lin, and Weiping Wang. Dynamic early exit in reasoning model...

  26. [26]

    Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models

    Yang Sui, Yu-Neng Chuang, Guanchu Zhang, Jie Wang, Leman Zhang, Jianshu Chen, Xudong Pan, Wenbo Li, Neil Shah, Meng Jiang, et al. Stop overthinking: A survey on efficient reasoning for large language models. arXiv preprint arXiv:2503.16419,

  27. [27]

    Table 7: ReasonMaxxer family-specific prompt styles

    See Appendix C for exact templates. Table 7: ReasonMaxxer family-specific prompt styles. Model → prompt style: Qwen2.5-1.5B/7B/32B → qwen boxed; Qwen3-0.6B/4B → qwen boxed; DeepSeek-R1-Distill-1.5B → chat template; Mistral-7B-v0.1 → llama. Appendix C (Prompting and Answer Extraction): the exact prompt templates and answer extraction rules used for each model family are repo...

  28. [28]

    Open-Reasoner-Zero [Hu et al., 2025] does not report wall-clock time for any model size

    The Mistral-7B run uses the same hardware and step count as the Qwen2.5-7B group. Open-Reasoner-Zero [Hu et al., 2025] does not report wall-clock time for any model size. We estimate GPU-hours from the public PPO recipes in the project repository

  29. [29]

    The estimation procedure uses the documented hardware configuration (number of nodes, GPUs per node), the number of prompts and rollouts per step, and the step counts inferred from the training curves in the paper. Per-step time is calibrated against SimpleRL-Zoo’s published figures for a comparable model size, with a 1.5–2× overhead factor to account fo...