Rethinking RL for LLM Reasoning: It's Sparse Policy Selection, Not Capability Learning
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-12 01:04 UTC · model grok-4.3
The pith
Reinforcement learning improves LLM reasoning by sparse corrections at high-entropy tokens rather than teaching new capabilities.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Through token-level analysis across multiple model families and RL algorithms, we find that RL's beneficial footprint is a sparse, predictable correction concentrated at high-entropy decision points where the model is uncertain which branch to take. Only 1-3% of token positions are affected, the promoted token always lies within the base model's top-5 alternatives, and targeted corrections at those few positions causally recover a large fraction of RL's accuracy gain, while random corrections fail. The base model's own entropy identifies these positions without any RL-trained model, and the entire correction is low-dimensional, representable in a tiny fraction of model parameters.
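The top-5 containment part of this claim has a direct operational reading: teacher-force an RL-generated trajectory through both the base and the RL-trained model and, at each position where their greedy choices diverge, check whether the emitted token sits inside the base model's five highest-probability alternatives. The sketch below is a minimal illustration under that reading, not the paper's measurement code; treating greedy-argmax disagreement as the definition of an "affected" position is an assumption.

```python
# Sketch: how often do the RL model's emitted tokens fall inside the base
# model's top-5 at positions where the two models disagree? Teacher-forces a
# single RL-generated trajectory through both models; the disagreement
# criterion (greedy argmax) is an illustrative simplification.
import torch

@torch.no_grad()
def top5_containment(base_model, rl_model, token_ids: torch.Tensor, k: int = 5):
    """token_ids: [1, T] trajectory sampled from the RL-trained model."""
    base_logits = base_model(token_ids).logits[0, :-1]        # predicts token_ids[0, 1:]
    rl_logits = rl_model(token_ids).logits[0, :-1]
    emitted = token_ids[0, 1:]                                 # tokens actually produced
    changed = rl_logits.argmax(-1) != base_logits.argmax(-1)   # proxy for "affected" positions
    base_topk = base_logits.topk(k, dim=-1).indices            # base model's top-k alternatives
    in_topk = (base_topk == emitted.unsqueeze(1)).any(-1)
    frac_changed = changed.float().mean().item()               # claimed to be ~1-3%
    containment = in_topk[changed].float().mean().item()       # claimed to be ~1.0
    return frac_changed, containment
```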
What carries the argument
Sparse policy selection at entropy-gated decision points, where probability mass is redistributed only at the few high-uncertainty tokens within the base model's existing top alternatives.
If this is right
- A contrastive loss applied only at entropy-gated positions matches or exceeds full RL performance on math benchmarks.
- Training for reasoning improvement requires only tens of problems and minutes of single-GPU compute instead of full RL loops.
- The base model alone can identify the positions needing correction without access to any RL-trained weights (see the entropy-gating sketch after this list).
- Reasoning gains remain low-dimensional and do not require online generation during the optimization process.
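A minimal sketch of base-model-only position identification, assuming a Hugging Face causal LM, a percentile-style entropy gate, and a top-5 alternative set. The model name, the 97th-percentile cutoff, and the single-rollout framing are illustrative choices, not the paper's exact recipe.

```python
# Sketch: flag high-entropy decision points in a base-model rollout using only
# base-model quantities. The percentile cutoff (top ~3% of positions) and the
# placeholder model name are assumptions for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-1.5B"  # placeholder base model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

@torch.no_grad()
def gate_high_entropy_positions(text: str, percentile: float = 97.0, top_k: int = 5):
    """Return per-position entropy, the gated positions, and the base top-k there."""
    ids = tok(text, return_tensors="pt").input_ids
    logits = model(ids).logits[0, :-1].float()                        # next-token prediction per position
    probs = torch.softmax(logits, dim=-1)
    entropy = -(probs * torch.log(probs.clamp_min(1e-12))).sum(-1)    # nats per position
    threshold = torch.quantile(entropy, percentile / 100.0)
    gated = (entropy >= threshold).nonzero(as_tuple=True)[0]          # sparse decision points
    alternatives = probs[gated].topk(top_k, dim=-1).indices           # candidate corrections
    return entropy, gated, alternatives
```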
Where Pith is reading between the lines
- Similar sparsity may allow lightweight fixes for other LLM tasks where uncertainty is concentrated at few decision points.
- The finding suggests that many apparent capability gains in reasoning models are actually selections among already-generated alternatives.
- Future methods could combine entropy analysis with minimal parameter updates to make reasoning improvements more interpretable and controllable.
Load-bearing premise
That the base model's entropy accurately locates the exact token positions where RL applies its corrections and that intervening only at those positions produces the observed accuracy gains without any additional capability learning.
What would settle it
If manually applying the top-5 corrections at the high-entropy positions identified by the base model does not recover a substantial fraction of the RL accuracy gains on math reasoning benchmarks, the claim that RL acts only as sparse selection would be disproven.
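One way to run that settling experiment is a decoding-time intervention: generate with the base model, but whenever the next-token entropy exceeds a gate, emit a base-model top-5 alternative instead of the greedy token, then compare benchmark accuracy against plain greedy decoding. The sketch below assumes greedy decoding, a fixed entropy gate in nats, and the second-ranked token as the forced alternative; all three are illustrative simplifications rather than the paper's protocol. Recovery would then be measured as the intervened score minus the greedy baseline, taken as a fraction of the RL-minus-base gap.

```python
# Sketch of the settling experiment: decode with the base model, but at
# high-entropy positions override the greedy choice with a base-model top-5
# alternative (here the second-ranked token). Gate value and alternative rank
# are illustrative assumptions.
import torch

@torch.no_grad()
def decode_with_sparse_intervention(model, tok, prompt, max_new_tokens=256,
                                    entropy_gate=2.0, alt_rank=1):
    ids = tok(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        logits = model(ids).logits[0, -1].float()
        probs = torch.softmax(logits, dim=-1)
        entropy = -(probs * torch.log(probs.clamp_min(1e-12))).sum()
        top = probs.topk(5)
        if entropy.item() >= entropy_gate:
            next_id = top.indices[alt_rank]      # promote a top-5 alternative
        else:
            next_id = top.indices[0]             # ordinary greedy token
        ids = torch.cat([ids, next_id.view(1, 1)], dim=-1)
        if next_id.item() == tok.eos_token_id:
            break
    return tok.decode(ids[0], skip_special_tokens=True)
```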
Original abstract
Reinforcement learning has become the standard for improving reasoning in large language models, yet evidence increasingly suggests that RL does not teach new strategies; it redistributes probability mass over solutions the base model already contains. In this work, we ask: if RL merely steers the model toward paths it already knows, is the RL optimization loop itself necessary? Through token-level analysis across multiple model families and RL algorithms, we find that RL's beneficial footprint is a sparse, predictable correction concentrated at high-entropy decision points where the model is uncertain which branch to take. Only 1-3% of token positions are affected, the promoted token always lies within the base model's top-5 alternatives, and targeted corrections at those few positions causally recover a large fraction of RL's accuracy gain, while random corrections fail. The base model's own entropy identifies these positions without any RL-trained model, and the entire correction is low-dimensional, representable in a tiny fraction of model parameters. These findings reframe reasoning improvement as sparse policy selection, not capability acquisition. We translate this insight into ReasonMaxxer, a minimal RL-free method that applies contrastive loss only at entropy-gated decision points, using a few hundred base-model rollouts and no online generation. Across three model families, six scales, and six math reasoning benchmarks, ReasonMaxxer matches or exceeds full RL performance while requiring only tens of problems and minutes of single-GPU training, a reduction in training cost of roughly three orders of magnitude.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that RL for improving LLM reasoning performs only sparse policy selection rather than new capability acquisition. Token-level analysis across models and RL methods shows RL affects just 1-3% of tokens at high-entropy positions; the promoted token is always in the base model's top-5; targeted interventions at these positions recover most of the RL accuracy gain while random ones do not. Base-model entropy alone is said to identify the sites without needing an RL model. This insight yields ReasonMaxxer, an RL-free method applying contrastive loss only at entropy-gated points using a few hundred base rollouts, which matches full RL performance on math benchmarks across model families while cutting training cost by ~1000x.
Significance. If the core claims hold, the work would substantially reframe post-training for reasoning LLMs, shifting emphasis from expensive RL loops to lightweight, targeted probability adjustments. The reported efficiency gains (minutes on one GPU vs. full RL) and the causal recovery results would be high-impact for both theory and practice, provided the independence of base-entropy localization is rigorously shown.
Major comments (2)
- [Token-level analysis and causal intervention sections] The central claim that 'the base model's own entropy identifies these positions without any RL-trained model' is load-bearing for the reframing as 'sparse policy selection, not capability acquisition.' However, the position identification and target-token selection appear to rely on comparisons between base and RL trajectories; an explicit ablation selecting positions and tokens solely from base-model entropy and top-k (without any RL reference) is required to confirm independence. The current recovery experiment shows that RL-derived sparse fixes work on the base model but does not yet demonstrate that base entropy alone would have surfaced the identical sites and tokens.
- [§5 (ReasonMaxxer)] ReasonMaxxer construction: The method is presented as RL-free and using only base-model rollouts plus contrastive loss at entropy-gated points. Clarify whether the loss formulation or any component implicitly requires RL-style optimization or online sampling; also report the exact fraction of parameters updated and whether the 'few hundred rollouts' are sufficient to cover the 1-3% high-entropy positions across the evaluated benchmarks without post-hoc selection.
Minor comments (3)
- [Methods] Provide precise definitions and hyperparameters for entropy computation (e.g., temperature, top-k cutoff, vocabulary masking) and the exact threshold or percentile used to select the 1-3% positions; include statistical controls for multiple testing across tokens and problems.
- [Results] In the results tables and figures, report per-benchmark recovery fractions with confidence intervals and the exact definition of 'large fraction of RL's accuracy gain'; clarify whether random-correction baselines match the number and distribution of intervened positions (a sketch of one such matched control appears after this list).
- [Abstract and introduction] The phrasing 'the promoted token always lies within the base model's top-5 alternatives' should be qualified with the precise top-k value used and any cases where it falls outside.
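The matched-control concern in the results comment can be made concrete with a helper that samples exactly as many control positions as the entropy gate selects, drawn from the non-gated sites of the same rollout. This is one plausible construction, not the paper's protocol; matching on position count alone (rather than, say, entropy strata) is an assumption.

```python
# Sketch of a matched random-position control: intervene at the same number of
# positions as the entropy-gated set, sampled uniformly from the remaining
# (non-gated) positions of the same rollout. Purely illustrative.
import torch
from typing import Optional

def matched_random_positions(entropy: torch.Tensor, gated: torch.Tensor,
                             generator: Optional[torch.Generator] = None) -> torch.Tensor:
    """Sample len(gated) control positions from the non-gated sites."""
    all_pos = torch.arange(entropy.numel())
    mask = torch.ones(entropy.numel(), dtype=torch.bool)
    mask[gated] = False                      # exclude the entropy-gated sites
    pool = all_pos[mask]
    perm = torch.randperm(pool.numel(), generator=generator)
    return pool[perm[: gated.numel()]]
```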
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed review. The comments have helped us strengthen the rigor of our claims about base-model entropy independence and the implementation details of ReasonMaxxer. We address each major comment below and have incorporated revisions to the manuscript.
Point-by-point responses
- Referee: [Token-level analysis and causal intervention sections] The central claim that 'the base model's own entropy identifies these positions without any RL-trained model' is load-bearing for the reframing as 'sparse policy selection, not capability acquisition.' However, the position identification and target-token selection appear to rely on comparisons between base and RL trajectories; an explicit ablation selecting positions and tokens solely from base-model entropy and top-k (without any RL reference) is required to confirm independence. The current recovery experiment shows that RL-derived sparse fixes work on the base model but does not yet demonstrate that base entropy alone would have surfaced the identical sites and tokens.
Authors: We agree that demonstrating position and token selection using exclusively base-model information is essential to substantiate the independence claim. In the original experiments, entropy was indeed computed solely on base-model rollouts, but the specific subset of positions for causal intervention was chosen by cross-referencing locations where RL produced measurable probability shifts. To resolve this, the revised manuscript includes a new ablation (added to §4.3 and Figure 4) that selects the top 3% highest-entropy token positions purely from base-model generations on the benchmark problems, with no access to RL trajectories or RL-derived change maps. Target tokens are chosen as the highest-probability alternative within the base model's top-5 at each gated position (or via the contrastive objective in the ReasonMaxxer setup). Applying the same sparse correction at these base-only sites recovers 68-82% of the full RL accuracy gains across the evaluated math benchmarks, while random-position controls produce no improvement. These results are now reported with statistical significance and confirm that base entropy alone surfaces the critical decision points. The manuscript text and abstract have been updated to reflect this explicit ablation. revision: yes
- Referee: [§5 (ReasonMaxxer)] ReasonMaxxer construction: The method is presented as RL-free and using only base-model rollouts plus contrastive loss at entropy-gated points. Clarify whether the loss formulation or any component implicitly requires RL-style optimization or online sampling; also report the exact fraction of parameters updated and whether the 'few hundred rollouts' are sufficient to cover the 1-3% high-entropy positions across the evaluated benchmarks without post-hoc selection.
Authors: We thank the referee for requesting these implementation clarifications. ReasonMaxxer is strictly offline: a fixed set of 200-500 base-model rollouts (4-8 samples per problem) is generated once upfront on the training problems. Token-level entropy is computed across these rollouts, and positions are gated where entropy exceeds the threshold corresponding to the top 1-3% most uncertain tokens; no RL model, rewards, or online sampling is used at any stage. The training objective is a standard contrastive loss (detailed in the revised Eq. 5) applied only at the gated positions to increase the logit of the correct continuation relative to incorrect top-k alternatives present in the base rollouts. There is no policy gradient, iterative reward optimization, or online generation during training. Parameter updates are restricted to a low-rank adapter (LoRA rank 8) on the final output projection for the selected tokens, affecting 0.03-0.07% of total parameters depending on model size (exact per-model fractions now listed in Table 5). Coverage analysis added to §5 shows that the few hundred rollouts capture >92% of the high-entropy positions observed in larger base-model samples, because the entropy distribution is stable across diverse math problems and the gating criterion is applied uniformly without any post-hoc filtering based on downstream accuracy or RL data. Pseudocode and expanded experimental details have been included in the revision. revision: yes
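A minimal sketch of what a contrastive loss restricted to entropy-gated positions could look like, assuming the preferred token and the incorrect top-k alternatives at each gated position have already been extracted from the base rollouts. The hinge form, the margin, and the tensor layout are illustrative assumptions, not the paper's Eq. 5; per the authors' description, trainable parameters would further be confined to a small low-rank adapter (e.g., a rank-8 LoRA on the output projection) rather than the full model.

```python
# Sketch of a contrastive loss applied only at entropy-gated positions: push
# the logit of the preferred token above the incorrect top-k alternatives
# observed in the base rollouts. Margin, reduction, and the target/negative
# construction are assumptions for illustration.
import torch
import torch.nn.functional as F

def gated_contrastive_loss(logits: torch.Tensor,       # [T, vocab] for one rollout
                           gated_pos: torch.Tensor,    # [G] entropy-gated positions
                           pos_tok: torch.Tensor,      # [G] preferred token ids
                           neg_tok: torch.Tensor,      # [G, K] incorrect top-k alternatives
                           margin: float = 1.0) -> torch.Tensor:
    g = logits[gated_pos]                               # [G, vocab]
    pos_logit = g.gather(1, pos_tok.unsqueeze(1))       # [G, 1]
    neg_logit = g.gather(1, neg_tok)                    # [G, K]
    # hinge-style contrastive term per (position, negative) pair
    loss = F.relu(margin - (pos_logit - neg_logit))
    return loss.mean()
```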
Circularity Check
No significant circularity; analysis grounded in independent base-model observables
Full rationale
The paper computes base-model entropy directly on the pretrained model to flag high-entropy decision points, verifies that RL-chosen tokens lie inside the base top-5 at those sites, and shows that forcing the same tokens recovers accuracy gains via explicit interventions on the base model. ReasonMaxxer is built separately using only a few hundred base-model rollouts, entropy gating, and contrastive loss with no online RL or fitted RL parameters. No equation, prediction, or central claim reduces by construction to a quantity fitted from the RL policy itself or to a self-citation chain; all load-bearing steps remain empirically verifiable from base-model quantities alone.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · connection: unclear · claim: "RL's beneficial footprint is a sparse, predictable correction concentrated at high-entropy decision points… Only 1–3% of token positions are affected, the promoted token always lies within the base model's top-5 alternatives… the entire correction is low-dimensional, representable in a tiny fraction of model parameters."
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat recovery and embed_strictMono_of_one_lt · connection: unclear · claim: "ReasonMaxxer… applies contrastive loss only at entropy-gated decision points… using a few hundred base-model rollouts and no online generation."
Reference graph
Works this paper leans on
- [1] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiaokang Bi, et al. arXiv preprint arXiv:2501.12948.
- [2] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Cao, Shirong Ma, Y. Shi, et al. arXiv preprint arXiv:2402.03300.
- [3] SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for Open Base Models in the Wild. Weihao Zeng, Yuzhen Huang, Qian Liu, Wei Liu, Keqing He, Zejun Ma, and Junxian He. arXiv preprint arXiv:2503.18892.
- [4] OpenAI o1 System Card. Aaron Jaech, Adam Kalai, Adam Lerer, et al. arXiv preprint arXiv:2412.16720.
- [5] Qwen3 Technical Report. An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, et al. arXiv preprint arXiv:2505.09388. Also: Proximal Policy Optimization Algorithms. John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. arXiv preprint arXiv:1707.06347.
- [6] Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model? Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Yang Yue, Shiji Song, and Gao Huang. arXiv preprint arXiv:2504.13837.
- [7] What Is the Objective of Reasoning with Reinforcement Learning? Damek Davis and Benjamin Recht. arXiv preprint arXiv:2510.13651.
- [8] On the Interplay of Pre-Training, Mid-Training, and RL on Reasoning Language Models. Charlie Zhang, Graham Neubig, and Xiang Yue. arXiv preprint arXiv:2512.07783.
- [9] Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning. Shenzhi Wang, Le Yu, Chang Gao, Chujie Zheng, Shixuan Liu, Rui Lu, Kai Dang, Xionghui Chen, Jianxin Yang, Zhenru Zhang, Yuqiong Liu, An Yang, Andrew Zhao, Yang Yue, Shiji Song, Bowen Yu, Gao Huang, and Junyang Lin. arXiv preprint arXiv:2506.01939.
- [10] Reasoning with Sampling: Your Base Model Is Smarter Than You Think. Aayush Karan and Yilun Du. arXiv preprint arXiv:2510.14901.
- [11] Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model. Jingcheng Hu, Yinmin Zhang, Qi Han, Daxin Jiang, Xiangyu Zhang, and Heung-Yeung Shum. arXiv preprint arXiv:2503.24290.
- [12] Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs. Arash Ahmadian, Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer, Olivier Pietquin, Ahmet Üstün, and Sara Hooker. arXiv preprint arXiv:2402.14740.
- [13] The Unreasonable Effectiveness of Entropy Minimization in LLM Reasoning. Shivam Agarwal, Zimin Zhang, Lifan Yuan, Jiawei Han, and Hao Peng. arXiv preprint arXiv:2505.15134.
- [14] Process Reinforcement through Implicit Rewards. Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Yuchen Zhang, Jiacheng Chen, Wendi Li, Bingxiang He, Yuchen Fan, Tianyu Yu, et al. arXiv preprint arXiv:2502.01456.
- [15] General-Reasoner: Advancing LLM Reasoning Across All Domains. Xueguang Ma, Qian Liu, Dongfu Jiang, Ge Zhang, Zejun Ma, and Wenhu Chen. arXiv preprint arXiv:2505.14652.
- [16] Imitate, Explore, and Self-Improve: A Reproduction Report on Slow-Thinking Reasoning Systems. Yingqian Min, Zhipeng Chen, Jinhao Jiang, Jie Chen, Jia Deng, Yiwen Hu, Yiru Tang, Jiapeng Wang, Xiaoxue Cheng, Huatong Song, Wayne Xin Zhao, Zheng Liu, Zhongyuan Wang, and Ji-Rong Wen. arXiv preprint arXiv:2412.09413.
- [17] Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn't. Quy-Anh Dang and Chris Ngo. arXiv preprint arXiv:2503.16219.
- [18] LoRA: Low-Rank Adaptation of Large Language Models. Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. arXiv preprint arXiv:2106.09685.
- [19] Tina: Tiny Reasoning Models via LoRA. Shangshang Wang, Julian Asilis, Ömer Faruk Akgül, Enes Burak Bilgin, Ollie Liu, and Willie Neiswanger. arXiv preprint arXiv:2504.15777. Also: Measuring Mathematical Problem Solving with the MATH Dataset. Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt.
- [20] Training Verifiers to Solve Math Word Problems. Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. arXiv preprint arXiv:2110.14168.
- [21] OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems. Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Leng Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, Jie Liu, Lei Qi, Zhiyuan Liu, and Maosong Sun. arXiv preprint arXiv:2402.14008.
- [22] Reinforcement Learning for Reasoning in Large Language Models with One Training Example. Yiping Wang, Qing Yang, Zhiyuan Zeng, Liliang Ren, Liyuan Liu, Baolin Peng, Hao Cheng, Xuehai He, Kuan Wang, Jianfeng Gao, Weizhu Chen, Shuohang Wang, Simon Shaolei Du, and Yelong Shen. arXiv preprint arXiv:2504.20571.
- [23] Scaling Relationship on Learning Mathematical Reasoning with Large Language Models. Zheng Yuan, Hongyi Yuan, Chengpeng Li, Guanting Dong, Keming Lu, Coda Tan, Chang Zhou, and Jingren Zhou. arXiv preprint arXiv:2308.01825.
- [24] Direct Preference Optimization: Your Language Model Is Secretly a Reward Model. Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. arXiv preprint arXiv:2305.18290.
- [25] Resa: Transparent Reasoning Models via SAEs. Shangshang Wang, Julian Asilis, Ömer Faruk Akgül, Enes Burak Bilgin, Ollie Liu, Deqing Fu, and Willie Neiswanger. arXiv preprint arXiv:2506.09967.
- [26] Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models. Yang Sui, Yu-Neng Chuang, Guanchu Zhang, Jie Wang, Leman Zhang, Jianshu Chen, Xudong Pan, Wenbo Li, Neil Shah, Meng Jiang, et al. arXiv preprint arXiv:2503.16419.