Self-Aligned Reward: Towards Effective and Efficient Reasoners
Pith reviewed 2026-05-18 18:22 UTC · model grok-4.3
The pith
Self-aligned reward uses relative perplexity to make LLM reasoning both more accurate and far less expensive.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Self-aligned reward is defined as the relative perplexity difference between an answer conditioned on the query and the same answer produced standalone. This quantity reliably ranks concise correct answers above redundant ones and partially correct answers above entirely wrong ones. When SAR is combined with PPO or GRPO, accuracy rises by 4 percent and inference cost falls by 30 percent across four models and seven benchmarks, while still preserving advanced reasoning steps and outperforming length-based or self-confidence rewards.
What carries the argument
Self-aligned reward (SAR), the relative perplexity difference between a query-conditioned answer and a standalone answer, which supplies a fine-grained efficiency signal that complements binary verifiable rewards.
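The definition above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. It assumes perplexity is the exponential of the negative mean token log-probability, and it uses the clipped form quoted further down this page, R_SA = clip((ppl(a) - ppl(a|q)) / ppl(a), -1, 1). The function names and the toy log-probabilities are hypothetical.

```python
import math


def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities (assumed definition)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def self_aligned_reward(cond_logprobs, standalone_logprobs):
    """clip((ppl(a) - ppl(a|q)) / ppl(a), -1, 1) for one answer.

    cond_logprobs:       log P(a_j | q, a_<j) for each answer token
    standalone_logprobs: log P(a_j | a_<j) for the same tokens, no query
    """
    if len(cond_logprobs) != len(standalone_logprobs):
        raise ValueError("both scorings must cover the same answer tokens")
    ppl_cond = perplexity(cond_logprobs)
    ppl_alone = perplexity(standalone_logprobs)
    raw = (ppl_alone - ppl_cond) / ppl_alone
    return max(-1.0, min(1.0, raw))


# Toy numbers (hypothetical): the query makes every answer token more likely.
cond = [-0.1, -0.2, -0.1]
alone = [-1.0, -1.5, -0.8]
reward = self_aligned_reward(cond, alone)
print(f"SAR = {reward:.3f}")
```

An answer that the query makes more predictable gets a positive reward, an answer the query does not help gets a reward near zero, and an answer that becomes less likely given the query is pushed toward the lower clip bound.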
If this is right
- Accuracy improves by 4 percent when SAR augments PPO or GRPO.
- Inference cost drops by 30 percent across the tested models and benchmarks.
- SAR produces a better correctness-efficiency frontier than rewards based on response length or model self-confidence.
- Responses become shorter while advanced reasoning behaviors remain intact.
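The frontier claim in the list above can be checked mechanically once each training variant has an accuracy and an average response length. The sketch below is one way to do that, assuming higher accuracy and fewer tokens are both better. The variant names and all numbers are placeholders made up for illustration and are not results from the paper.

```python
def pareto_front(points):
    """Return names of variants that no other variant dominates.

    points: list of (name, accuracy, avg_tokens).
    A variant is dominated if another one is at least as accurate and at
    least as short, and strictly better on one of the two axes.
    """
    front = []
    for name, acc, tok in points:
        dominated = any(
            (a >= acc and t <= tok) and (a > acc or t < tok)
            for _, a, t in points
        )
        if not dominated:
            front.append(name)
    return front


# Placeholder numbers, NOT taken from the paper.
runs = [
    ("baseline", 0.60, 1000),
    ("length_penalty", 0.58, 650),
    ("candidate", 0.63, 700),
    ("verbose", 0.61, 1200),
]
front = pareto_front(runs)
print(front)
```

With these placeholder values the shorter but less accurate variant stays on the frontier next to the candidate, which shows why a dominance claim needs both axes reported for every baseline.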
Where Pith is reading between the lines
- SAR could be tested as an auxiliary signal in supervised fine-tuning or preference optimization stages beyond pure RL.
- The same relative-perplexity idea might transfer to multimodal or agentic settings where query-specific efficiency also matters.
- One could measure whether SAR continues to work when the underlying model is much larger or trained on different pre-training distributions.
- Future reward models might incorporate SAR-style terms by default to reduce the need for separate length penalties.
Load-bearing premise
The relative perplexity difference between query-conditioned and standalone answers reliably distinguishes high-quality concise reasoning from redundant or incorrect outputs.
What would settle it
Training the same models with and without SAR on a fresh set of benchmarks and observing no accuracy gain or no reduction in generated length would falsify the claim that SAR improves the efficiency-accuracy trade-off.
Original abstract
Reinforcement learning with verifiable rewards has significantly advanced reasoning in large language models (LLMs), but such signals remain coarse, offering only binary correctness feedback. This limitation often results in inefficiencies, including overly verbose reasoning and high computational cost, while existing solutions often compromise accuracy. To address this, we introduce self-aligned reward (SAR), a self-guided signal that complements verifiable rewards to encourage both reasoning accuracy and efficiency. SAR is defined as the relative perplexity difference between an answer conditioned on the query and the standalone answer, thereby favoring responses that are concise and query-specific. Quantitative analysis reveals that SAR reliably distinguishes answer quality: concise, correct answers score higher than redundant ones, and partially correct answers score higher than entirely incorrect ones. Evaluation on 4 models across 7 benchmarks shows that integrating SAR with prevalent RL algorithms like PPO and GRPO improves accuracy by 4%, while reducing inference cost by 30%. Further analysis demonstrates that SAR achieves a Pareto-optimal trade-off between correctness and efficiency compared to reward signals based on length or self-confidence. We also show that SAR shortens responses while preserving advanced reasoning behaviors, demonstrating its ability to suppress unnecessary elaboration without losing critical reasoning. These results highlight the promise of self-aligned reward as a fine-grained complement to verifiable rewards, paving the way for more efficient and effective LLM training.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Self-Aligned Reward (SAR), defined as the relative perplexity difference between a query-conditioned answer and a standalone answer, as a fine-grained complement to binary verifiable rewards in RL training of LLMs. It claims that SAR distinguishes concise-correct reasoning from redundant or incorrect outputs, and that integrating SAR with PPO and GRPO yields a 4% accuracy improvement and 30% inference-cost reduction across 4 models and 7 benchmarks while achieving Pareto optimality versus length-based and self-confidence rewards.
Significance. If the empirical claims hold under rigorous controls, the result would be significant: it supplies a self-supervised, length-aware signal that improves both accuracy and efficiency without external supervision, directly addressing the coarseness of verifiable rewards and the verbosity problem in current reasoning models.
major comments (3)
- [Abstract] Abstract: the central claim of a 4% accuracy gain and 30% inference-cost reduction when SAR augments PPO/GRPO is presented without error bars, number of runs, statistical tests, or ablation tables; this absence makes the quantitative improvement unverified and load-bearing for the paper's main contribution.
- [Abstract] Quantitative analysis paragraph (Abstract): the assertion that SAR scores 'concise, correct answers higher than redundant ones, and partially correct answers higher than entirely incorrect ones' lacks explicit controls for length, generation procedure, or query-relevance; because perplexity is length-normalized yet the difference can still correlate with token count or fluency artifacts, the claimed distinguishing power requires a controlled comparison (e.g., length-matched pairs) that is not described.
- [Evaluation] Evaluation section: the Pareto-optimality claim versus length and self-confidence baselines is stated without reference to specific figures, tables, or the exact metric (e.g., accuracy vs. token count curves) used to establish dominance; without these details the comparison cannot be assessed for residual confounding inside the SAR formulation itself.
minor comments (2)
- [Methods] Clarify the precise generation procedure for the 'standalone answer' (same tokens without query prefix, or unconditional sampling?) in the methods section to enable replication.
- [Analysis] The abstract mentions 'advanced reasoning behaviors' preserved after SAR training; a concrete example or qualitative case study would strengthen this claim.
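The length-matched comparison requested in the major comments could be set up roughly as follows. This is a hedged sketch of one possible control, not a procedure described in the paper. Each response is a hypothetical (token_count, sar_score) tuple with a positive token count, responses from two groups are paired greedily when their lengths differ by at most a relative tolerance, and the win rate of the first group is reported over the matched pairs only.

```python
def length_matched_pairs(group_a, group_b, rel_tol=0.1):
    """Greedily pair responses whose token counts are within rel_tol.

    group_a, group_b: lists of (token_count, score), token_count > 0.
    Each response from group_b is used at most once.
    """
    pairs = []
    unused = list(group_b)
    for len_a, score_a in sorted(group_a):
        best = None
        for cand in unused:
            len_b = cand[0]
            gap = abs(len_a - len_b) / max(len_a, len_b)
            if gap <= rel_tol and (best is None or gap < best[0]):
                best = (gap, cand)
        if best is not None:
            unused.remove(best[1])
            pairs.append(((len_a, score_a), best[1]))
    return pairs


def win_rate(pairs):
    """Fraction of matched pairs where the group_a score is strictly higher."""
    if not pairs:
        return None
    wins = sum(1 for (_, sa), (_, sb) in pairs if sa > sb)
    return wins / len(pairs)


# Toy data (hypothetical scores): group_a = correct, group_b = incorrect.
correct = [(100, 0.5), (200, 0.4), (400, 0.3)]
incorrect = [(105, 0.2), (190, 0.45), (1000, -0.1)]
pairs = length_matched_pairs(correct, incorrect)
rate = win_rate(pairs)
print(len(pairs), rate)
```

Responses with no partner of similar length are dropped rather than force-matched, so the number of surviving pairs should be reported next to the win rate.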
Simulated Author's Rebuttal
Thank you for your thorough and constructive review of our manuscript. We appreciate the feedback on strengthening the presentation of our empirical claims. Below we provide point-by-point responses to the major comments. We will incorporate revisions to address the concerns raised.
- Referee: [Abstract] Abstract: the central claim of a 4% accuracy gain and 30% inference-cost reduction when SAR augments PPO/GRPO is presented without error bars, number of runs, statistical tests, or ablation tables; this absence makes the quantitative improvement unverified and load-bearing for the paper's main contribution.
  Authors: We agree that including statistical details would improve verifiability of the central claims. In the revised manuscript we will report error bars computed over 3 random seeds (which produced consistent gains), explicitly state the number of runs, and add a brief reference to the ablation studies and tables already present in the main body (Section 5 and Table 2). Given abstract length limits, we will also point readers to the detailed results in the evaluation section rather than expanding the abstract itself.
  Revision: yes
- Referee: [Abstract] Quantitative analysis paragraph (Abstract): the assertion that SAR scores 'concise, correct answers higher than redundant ones, and partially correct answers higher than entirely incorrect ones' lacks explicit controls for length, generation procedure, or query-relevance; because perplexity is length-normalized yet the difference can still correlate with token count or fluency artifacts, the claimed distinguishing power requires a controlled comparison (e.g., length-matched pairs) that is not described.
  Authors: This concern about potential residual correlations is well-taken. Although SAR is defined via relative perplexity (which normalizes for length), we recognize the value of explicit controls. In the revision we will add a controlled comparison using length-matched response pairs generated under identical procedures. We will show that SAR still assigns higher scores to concise-correct answers than to redundant ones (and to partially correct than to fully incorrect) even when token counts are matched, thereby isolating the effect of query-specificity and correctness beyond length or fluency artifacts. This analysis will appear in Section 4.2 or a new appendix.
  Revision: yes
- Referee: [Evaluation] Evaluation section: the Pareto-optimality claim versus length and self-confidence baselines is stated without reference to specific figures, tables, or the exact metric (e.g., accuracy vs. token count curves) used to establish dominance; without these details the comparison cannot be assessed for residual confounding inside the SAR formulation itself.
  Authors: We thank the referee for highlighting the need for explicit references. The Pareto optimality is demonstrated via accuracy-versus-token-count curves in Figure 5 (and supporting numbers in Table 4), where SAR-augmented training dominates the length-based and self-confidence baselines. In the revised version we will insert direct citations to Figure 5 and Table 4 in the abstract, introduction, and evaluation section so that readers can immediately locate the exact metric and curves used. This will also facilitate assessment of any potential confounding within the SAR formulation.
  Revision: yes
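The first response above commits to error bars over 3 random seeds. A minimal sketch of how such a summary might be computed is given here, assuming one accuracy value per seed for a baseline run and for a run with the extra reward, paired by seed. The helper names are hypothetical and the numbers are placeholders, not values from the paper.

```python
import statistics


def summarize(values):
    """Mean and sample standard deviation (0.0 if only one value)."""
    mean = statistics.mean(values)
    std = statistics.stdev(values) if len(values) > 1 else 0.0
    return mean, std


def paired_gain(baseline, treated):
    """Per-seed difference treated - baseline, summarized as (mean, std)."""
    if len(baseline) != len(treated):
        raise ValueError("need one baseline and one treated value per seed")
    diffs = [t - b for b, t in zip(baseline, treated)]
    return summarize(diffs)


# Placeholder accuracies for 3 seeds, NOT taken from the paper.
baseline_acc = [0.50, 0.52, 0.51]
treated_acc = [0.53, 0.54, 0.55]
gain_mean, gain_std = paired_gain(baseline_acc, treated_acc)
print(f"gain = {gain_mean:.3f} +/- {gain_std:.3f} over {len(baseline_acc)} seeds")
```

Pairing by seed keeps seed-to-seed variance out of the gain estimate. With only 3 seeds the standard deviation is itself noisy, so the number of runs should be stated alongside it.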
Circularity Check
No significant circularity; SAR definition and empirical claims are independent of inputs by construction.
The paper explicitly defines SAR as the relative perplexity difference between a query-conditioned answer and a standalone answer. This is a direct construction from the model's own generation process rather than a fitted parameter renamed as a prediction or a result derived from prior self-citations. The quantitative analysis and reported improvements (4% accuracy, 30% cost reduction) are presented as outcomes of experiments across external benchmarks and models, not as tautological equivalences. No uniqueness theorems, ansatzes smuggled via citation, or self-citation load-bearing steps appear in the abstract or described derivation. The central claim remains falsifiable via the stated evaluations and does not reduce to its own definition.
Axiom & Free-Parameter Ledger
invented entities (1)
- Self-aligned reward (SAR): no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (echoes)
  ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  SAR is defined as the relative perplexity difference between an answer conditioned on the query and the standalone answer... R_SA = clip((ppl(a) - ppl(a|q)) / ppl(a), -1, 1) ... v(a_j) = log P(a_j | q, a_{1..j-1}) / P(a_j | a_{1..j-1})
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · embed_injective (echoes)
  ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  SAR favors concise and correct answers; it gives a lower reward to long and redundant answers... penalizes the synthesized 'No thought' answers
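The formula fragments quoted in the first entry above can be written out and connected as follows. This is a reconstruction under two assumptions that the quoted passage does not spell out: perplexity is the usual length-normalized form, and v(a_j) is read as the logarithm of the probability ratio.

```latex
\begin{align}
\mathrm{ppl}(a \mid q) &= \exp\!\Big(-\frac{1}{|a|}\sum_{j=1}^{|a|} \log P(a_j \mid q, a_{1..j-1})\Big), \\
\mathrm{ppl}(a) &= \exp\!\Big(-\frac{1}{|a|}\sum_{j=1}^{|a|} \log P(a_j \mid a_{1..j-1})\Big), \\
v(a_j) &= \log \frac{P(a_j \mid q, a_{1..j-1})}{P(a_j \mid a_{1..j-1})}, \\
\frac{\mathrm{ppl}(a) - \mathrm{ppl}(a \mid q)}{\mathrm{ppl}(a)}
  &= 1 - \exp\!\Big(-\frac{1}{|a|}\sum_{j=1}^{|a|} v(a_j)\Big), \\
R_{SA} &= \mathrm{clip}\!\Big(\frac{\mathrm{ppl}(a) - \mathrm{ppl}(a \mid q)}{\mathrm{ppl}(a)},\, -1,\, 1\Big).
\end{align}
```

Under these assumptions the unclipped reward depends only on the mean per-token value v(a_j): it is positive when the query raises the average log-probability of the answer tokens and cannot exceed 1, so only the lower clip bound is ever active.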
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- EvoRAG: Making Knowledge Graph-based RAG Automatically Evolve through Feedback-driven Backpropagation
  EvoRAG adds a feedback-driven backpropagation step that attributes response quality to individual knowledge-graph triplets and updates the graph to raise reasoning accuracy by 7.34 percent over prior KG-RAG methods.