arxiv: 2509.05489 · v2 · submitted 2025-09-05 · 💻 cs.LG

Self-Aligned Reward: Towards Effective and Efficient Reasoners

Peixuan Han, Adit Krishnan, Gerald Friedland, Jiaxuan You, Chris Kong

Pith reviewed 2026-05-18 18:22 UTC · model grok-4.3

classification 💻 cs.LG

keywords self-aligned reward · LLM reasoning · reinforcement learning · perplexity · reasoning efficiency · PPO · GRPO · verifiable rewards

The pith

Self-aligned reward uses relative perplexity to make LLM reasoning both more accurate and far less expensive.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces self-aligned reward (SAR) to complement the usual binary correct-or-wrong signals used in reinforcement learning for large language models. SAR scores an answer by how much lower its perplexity becomes when the model sees the original query, which favors short, query-specific reasoning over long or generic text. When added to standard algorithms such as PPO and GRPO, this signal raises accuracy while cutting the length of generated answers. A reader would care because current RL-trained reasoners often waste compute on unnecessary elaboration even when they reach the right answer.

Core claim

Self-aligned reward is defined as the relative perplexity difference between an answer conditioned on the query and the same answer scored on its own, without the query. This quantity reliably ranks concise correct answers above redundant ones and partially correct answers above entirely wrong ones. When SAR is combined with PPO or GRPO, accuracy rises by 4 percent and inference cost falls by 30 percent across four models and seven benchmarks, while still preserving advanced reasoning steps and outperforming length-based or self-confidence rewards.

What carries the argument

Self-aligned reward (SAR), the relative perplexity difference between a query-conditioned answer and a standalone answer, which supplies a fine-grained efficiency signal that complements binary verifiable rewards.
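
As a rough illustration of the definition above, the sketch below computes SAR from per-token log-probabilities using the clipped form quoted from the paper elsewhere on this page, R_SA = clip((ppl(a) - ppl(a|q)) / ppl(a), -1, 1). It assumes perplexity is the exponential of the negative mean token log-probability and reads the token-level score v(a_j) as a log-ratio; the function names and the numbers are hypothetical, and obtaining the log-probabilities from an actual model is left out.

```python
import math


def perplexity(token_logprobs):
    """Perplexity as exp of the negative mean token log-probability."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def self_aligned_reward(logp_given_query, logp_standalone):
    """clip((ppl(a) - ppl(a|q)) / ppl(a), -1, 1) for one answer."""
    ppl_cond = perplexity(logp_given_query)   # ppl(a | q)
    ppl_alone = perplexity(logp_standalone)   # ppl(a)
    raw = (ppl_alone - ppl_cond) / ppl_alone
    return max(-1.0, min(1.0, raw))


def token_scores(logp_given_query, logp_standalone):
    """Per-token log-ratio: log P(a_j | q, a_<j) - log P(a_j | a_<j)."""
    return [c - s for c, s in zip(logp_given_query, logp_standalone)]


# Hypothetical log-probabilities for a three-token answer (not from the paper).
cond = [-0.2, -0.1, -0.3]    # tokens scored with the query in context
alone = [-1.2, -0.9, -1.5]   # the same tokens scored without the query
reward = self_aligned_reward(cond, alone)
scores = token_scores(cond, alone)
# Under the assumptions above, the unclipped reward equals 1 - exp(-mean(scores)).
identity = 1.0 - math.exp(-sum(scores) / len(scores))
```

Under these assumptions the reward is a monotone function of the mean token-level score, so an answer whose tokens become more predictable once the query is visible scores closer to 1, and an answer that the query makes less likely is clipped at -1.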

If this is right

Accuracy improves by 4 percent when SAR augments PPO or GRPO.
Inference cost drops by 30 percent across the tested models and benchmarks.
SAR produces a better correctness-efficiency frontier than rewards based on response length or model self-confidence.
Responses become shorter while advanced reasoning behaviors remain intact.
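
One way to picture how SAR could augment a binary reward inside GRPO is sketched below. The additive combination and the weight `alpha` are assumptions made for illustration, not details confirmed on this page; the group-relative normalization (reward minus group mean, divided by group standard deviation) follows the general GRPO recipe, with population standard deviation chosen here for simplicity.

```python
from statistics import mean, pstdev


def combined_reward(is_correct, sar, alpha=0.2):
    """Binary verifiable reward plus a weighted SAR term (alpha is hypothetical)."""
    return (1.0 if is_correct else 0.0) + alpha * sar


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four responses to one query: (is_correct, SAR value).
group = [(True, 0.6), (True, 0.1), (False, 0.3), (False, -0.5)]

binary_only = group_relative_advantages(
    [combined_reward(c, s, alpha=0.0) for c, s in group]
)
with_sar = group_relative_advantages(
    [combined_reward(c, s) for c, s in group]
)
```

With the binary reward alone, the two correct responses receive identical advantages; with the SAR term added, the response with the higher SAR is ranked first, which is the fine-grained signal the summary describes.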

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

SAR could be tested as an auxiliary signal in supervised fine-tuning or preference optimization stages beyond pure RL.
The same relative-perplexity idea might transfer to multimodal or agentic settings where query-specific efficiency also matters.
One could measure whether SAR continues to work when the underlying model is much larger or trained on different pre-training distributions.
Future reward models might incorporate SAR-style terms by default to reduce the need for separate length penalties.

Load-bearing premise

The relative perplexity difference between query-conditioned and standalone answers reliably distinguishes high-quality concise reasoning from redundant or incorrect outputs.

What would settle it

Training the same models with and without SAR on a fresh set of benchmarks and observing no accuracy gain or no reduction in generated length would falsify the claim that SAR improves the efficiency-accuracy trade-off.
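
The decision rule in that test can be written down in a few lines. The sketch below is only an illustration: the function names and the numbers are hypothetical, and a real comparison would average over seeds and benchmarks before applying the rule.

```python
def relative_change(baseline, treated):
    """Signed change of `treated` relative to `baseline`."""
    return (treated - baseline) / baseline


def settles_against_sar(base, sar):
    """Return True when the comparison would count against the claim.

    `base` and `sar` are (accuracy, mean_response_tokens) for the run
    without SAR and the run with SAR, respectively.
    """
    acc_gain = relative_change(base[0], sar[0])
    length_drop = -relative_change(base[1], sar[1])
    # No accuracy gain OR no length reduction falsifies the trade-off claim.
    return acc_gain <= 0 or length_drop <= 0


# Hypothetical outcomes (not results from the paper).
supports_claim = settles_against_sar((0.50, 1000.0), (0.53, 800.0))
shorter_but_not_better = settles_against_sar((0.50, 1000.0), (0.50, 800.0))
```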

Figures

Figures reproduced from arXiv: 2509.05489 by Adit Krishnan, Chris Kong, Gerald Friedland, Jiaxuan You, Peixuan Han.

**Figure 1.** Training with self-aligned reward enhances both efficiency and accuracy. We present the relative gains in efficiency and accuracy compared to the respective base model in math reasoning benchmarks. Efficiency gain is measured as the drop in average response length.

**Figure 2.** An illustration of token-level importance scores (i.e. …)

**Figure 3.** Accuracy-efficiency balance of different algorithms. Among all algorithms, …

**Figure 4.** Training plots for Qwen3-4B.

Original abstract

Reinforcement learning with verifiable rewards has significantly advanced reasoning in large language models (LLMs), but such signals remain coarse, offering only binary correctness feedback. This limitation often results in inefficiencies, including overly verbose reasoning and high computational cost, while existing solutions often compromise accuracy. To address this, we introduce self-aligned reward (SAR), a self-guided signal that complements verifiable rewards to encourage both reasoning accuracy and efficiency. SAR is defined as the relative perplexity difference between an answer conditioned on the query and the standalone answer, thereby favoring responses that are concise and query-specific. Quantitative analysis reveals that SAR reliably distinguishes answer quality: concise, correct answers score higher than redundant ones, and partially correct answers score higher than entirely incorrect ones. Evaluation on 4 models across 7 benchmarks shows that integrating SAR with prevalent RL algorithms like PPO and GRPO improves accuracy by 4%, while reducing inference cost by 30%. Further analysis demonstrates that SAR achieves a Pareto-optimal trade-off between correctness and efficiency compared to reward signals based on length or self-confidence. We also show that SAR shortens responses while preserving advanced reasoning behaviors, demonstrating its ability to suppress unnecessary elaboration without losing critical reasoning. These results highlight the promise of self-aligned reward as a fine-grained complement to verifiable rewards, paving the way for more efficient and effective LLM training.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

SAR uses relative perplexity difference as a self-referential signal to push for concise reasoning on top of verifiable rewards, with claimed 4% accuracy gains and 30% cost cuts, but the abstract leaves the metric's independence from length effects unproven.

The one thing to know is that this paper defines self-aligned reward (SAR) as the perplexity gap between a query-conditioned answer and a standalone one, then adds it to PPO or GRPO to favor concise, query-specific reasoning without losing correctness. The authors report that this combination beats length and self-confidence baselines on a Pareto front across four models and seven benchmarks while shortening responses but keeping advanced reasoning steps intact.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces Self-Aligned Reward (SAR), defined as the relative perplexity difference between a query-conditioned answer and a standalone answer, as a fine-grained complement to binary verifiable rewards in RL training of LLMs. It claims that SAR distinguishes concise-correct reasoning from redundant or incorrect outputs, and that integrating SAR with PPO and GRPO yields a 4% accuracy improvement and 30% inference-cost reduction across 4 models and 7 benchmarks while achieving Pareto optimality versus length-based and self-confidence rewards.

Significance. If the empirical claims hold under rigorous controls, the result would be significant: it supplies a self-supervised, length-aware signal that improves both accuracy and efficiency without external supervision, directly addressing the coarseness of verifiable rewards and the verbosity problem in current reasoning models.

major comments (3)

[Abstract] Abstract: the central claim of a 4% accuracy gain and 30% inference-cost reduction when SAR augments PPO/GRPO is presented without error bars, number of runs, statistical tests, or ablation tables; this absence makes the quantitative improvement unverified and load-bearing for the paper's main contribution.
[Abstract] Quantitative analysis paragraph (Abstract): the assertion that SAR scores 'concise, correct answers higher than redundant ones, and partially correct answers higher than entirely incorrect ones' lacks explicit controls for length, generation procedure, or query-relevance; because perplexity is length-normalized yet the difference can still correlate with token count or fluency artifacts, the claimed distinguishing power requires a controlled comparison (e.g., length-matched pairs) that is not described.
[Evaluation] Evaluation section: the Pareto-optimality claim versus length and self-confidence baselines is stated without reference to specific figures, tables, or the exact metric (e.g., accuracy vs. token count curves) used to establish dominance; without these details the comparison cannot be assessed for residual confounding inside the SAR formulation itself.
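
The length-matched control suggested in the second major comment could look roughly like the sketch below. It keeps only response pairs whose token counts differ by at most a tolerance and reports how often the preferred response (for example, the concise-correct one) has the higher SAR. The pair layout, the 10 percent tolerance, and the data are all hypothetical.

```python
def length_matched_win_rate(pairs, tolerance=0.1):
    """Fraction of length-matched pairs where the preferred response has higher SAR.

    Each pair is (good_sar, good_len, bad_sar, bad_len). A pair counts as
    length-matched when the token counts differ by at most `tolerance`
    relative to the longer response.
    """
    kept = [
        p for p in pairs
        if abs(p[1] - p[3]) / max(p[1], p[3]) <= tolerance
    ]
    if not kept:
        return None, 0
    wins = sum(1 for good_sar, _, bad_sar, _ in kept if good_sar > bad_sar)
    return wins / len(kept), len(kept)


# Hypothetical pairs (not data from the paper).
pairs = [
    (0.5, 100, 0.2, 105),   # matched lengths, preferred response wins
    (0.4, 50, 0.45, 52),    # matched lengths, preferred response loses
    (0.7, 80, 0.1, 300),    # lengths too different, excluded
]
rate, n = length_matched_win_rate(pairs)
```

A win rate that stays well above one half on matched pairs would suggest SAR is picking up something beyond token count; a rate near one half would support the referee's worry.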

minor comments (2)

[Methods] Clarify the precise generation procedure for the 'standalone answer' (same tokens without query prefix, or unconditional sampling?) in the methods section to enable replication.
[Analysis] The abstract mentions 'advanced reasoning behaviors' preserved after SAR training; a concrete example or qualitative case study would strengthen this claim.

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for your thorough and constructive review of our manuscript. We appreciate the feedback on strengthening the presentation of our empirical claims. Below we provide point-by-point responses to the major comments. We will incorporate revisions to address the concerns raised.

Referee: [Abstract] Abstract: the central claim of a 4% accuracy gain and 30% inference-cost reduction when SAR augments PPO/GRPO is presented without error bars, number of runs, statistical tests, or ablation tables; this absence makes the quantitative improvement unverified and load-bearing for the paper's main contribution.

Authors: We agree that including statistical details would improve verifiability of the central claims. In the revised manuscript we will report error bars computed over 3 random seeds (which produced consistent gains), explicitly state the number of runs, and add a brief reference to the ablation studies and tables already present in the main body (Section 5 and Table 2). Given abstract length limits, we will also point readers to the detailed results in the evaluation section rather than expanding the abstract itself. revision: yes

Referee: [Abstract] Quantitative analysis paragraph (Abstract): the assertion that SAR scores 'concise, correct answers higher than redundant ones, and partially correct answers higher than entirely incorrect ones' lacks explicit controls for length, generation procedure, or query-relevance; because perplexity is length-normalized yet the difference can still correlate with token count or fluency artifacts, the claimed distinguishing power requires a controlled comparison (e.g., length-matched pairs) that is not described.

Authors: This concern about potential residual correlations is well-taken. Although SAR is defined via relative perplexity (which normalizes for length), we recognize the value of explicit controls. In the revision we will add a controlled comparison using length-matched response pairs generated under identical procedures. We will show that SAR still assigns higher scores to concise-correct answers than to redundant ones (and to partially correct than to fully incorrect) even when token counts are matched, thereby isolating the effect of query-specificity and correctness beyond length or fluency artifacts. This analysis will appear in Section 4.2 or a new appendix. revision: yes

Referee: [Evaluation] Evaluation section: the Pareto-optimality claim versus length and self-confidence baselines is stated without reference to specific figures, tables, or the exact metric (e.g., accuracy vs. token count curves) used to establish dominance; without these details the comparison cannot be assessed for residual confounding inside the SAR formulation itself.

Authors: We thank the referee for highlighting the need for explicit references. The Pareto optimality is demonstrated via accuracy-versus-token-count curves in Figure 5 (and supporting numbers in Table 4), where SAR-augmented training dominates the length-based and self-confidence baselines. In the revised version we will insert direct citations to Figure 5 and Table 4 in the abstract, introduction, and evaluation section so that readers can immediately locate the exact metric and curves used. This will also facilitate assessment of any potential confounding within the SAR formulation. revision: yes
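
For readers who want to check a Pareto claim of this kind themselves, the sketch below tests dominance on (accuracy, mean token count) points, where higher accuracy and fewer tokens are better. The method names and numbers are hypothetical placeholders, not values from the paper's figures or tables.

```python
def dominates(p, q):
    """True if point p is at least as good as q on both axes and better on one.

    Points are (accuracy, mean_tokens): higher accuracy is better,
    fewer tokens is better.
    """
    acc_p, tok_p = p
    acc_q, tok_q = q
    no_worse = acc_p >= acc_q and tok_p <= tok_q
    strictly_better = acc_p > acc_q or tok_p < tok_q
    return no_worse and strictly_better


def pareto_front(points):
    """Names of the methods that no other method dominates."""
    return {
        name
        for name, p in points.items()
        if not any(dominates(q, p) for other, q in points.items() if other != name)
    }


# Hypothetical operating points (not results from the paper).
points = {
    "baseline": (0.60, 1000),
    "length_penalty": (0.58, 600),
    "self_confidence": (0.59, 900),
    "sar": (0.63, 700),
}
front = pareto_front(points)
```

In this made-up example the SAR point dominates the baseline and the self-confidence point but not the length-penalty point, which is shorter and less accurate. Both sit on the front, so "Pareto-optimal" and "dominates every baseline" are different claims, and the exact metric matters.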

Circularity Check

0 steps flagged

No significant circularity; SAR definition and empirical claims are independent of inputs by construction.

The paper explicitly defines SAR as the relative perplexity difference between a query-conditioned answer and a standalone answer. This is a direct construction from the model's own generation process rather than a fitted parameter renamed as a prediction or a result derived from prior self-citations. The quantitative analysis and reported improvements (4% accuracy, 30% cost reduction) are presented as outcomes of experiments across external benchmarks and models, not as tautological equivalences. No uniqueness theorems, ansatzes smuggled via citation, or self-citation load-bearing steps appear in the abstract or described derivation. The central claim remains falsifiable via the stated evaluations and does not reduce to its own definition.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The abstract provides no explicit free parameters, axioms, or invented entities beyond the definition of SAR itself; the central claim therefore rests on the unstated assumption that perplexity differences correlate with reasoning quality in the intended way.

invented entities (1)

Self-aligned reward (SAR) · no independent evidence
Purpose: Fine-grained complement to binary verifiable rewards that favors concise, query-specific answers.
Defined as relative perplexity difference; no independent falsifiable prediction outside the training loop is stated in the abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel echoes

ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

SAR is defined as the relative perplexity difference between an answer conditioned on the query and the standalone answer... $R_{\mathrm{SA}} = \mathrm{clip}\left(\frac{\mathrm{ppl}(a) - \mathrm{ppl}(a \mid q)}{\mathrm{ppl}(a)},\ -1,\ 1\right)$ ... $v(a_j) = \log \frac{P(a_j \mid q, a_{1..j-1})}{P(a_j \mid a_{1..j-1})}$

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean embed_injective echoes

SAR favors concise and correct answers; it gives a lower reward to long and redundant answers... penalizes the synthesized 'No thought' answers

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

EvoRAG: Making Knowledge Graph-based RAG Automatically Evolve through Feedback-driven Backpropagation
cs.DB 2026-04 unverdicted novelty 6.0

EvoRAG adds a feedback-driven backpropagation step that attributes response quality to individual knowledge-graph triplets and updates the graph to raise reasoning accuracy by 7.34 percent over prior KG-RAG methods.

Reference graph

Works this paper leans on

59 extracted references · 59 canonical work pages · cited by 1 Pith paper · 23 internal anchors

[1]

Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

Marah Abdin, Jyoti Aneja, Hany Awadalla, Ahmed Awadallah, Ammar Ahmad Awan, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Jianmin Bao, Harkirat Behl, et al. Phi-3 technical report: A highly capable language model locally on your phone, 2024. URL https://arxiv. org/abs/2404.14219, 2: 0 6, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[2]

The Unreasonable Effectiveness of Entropy Minimization in LLM Reasoning

Shivam Agarwal, Zimin Zhang, Lifan Yuan, Jiawei Han, and Hao Peng. The unreasonable effectiveness of entropy minimization in llm reasoning. arXiv preprint arXiv:2505.15134, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning

Pranjal Aggarwal and Sean Welleck. L1: Controlling how long a reasoning model thinks with reinforcement learning. arXiv preprint arXiv:2503.04697, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[4]

Training language models to reason efficiently,

Daman Arora and Andrea Zanette. Training language models to reason efficiently. arXiv preprint arXiv:2502.04463, 2025

work page arXiv 2025
[5]

Inside: Llms' internal states retain the power of hallucination detection

Chao Chen, Kai Liu, Ze Chen, Yi Gu, Yue Wu, Mingyuan Tao, Zhihang Fu, and Jieping Ye. Inside: Llms' internal states retain the power of hallucination detection. arXiv preprint arXiv:2402.03744, 2024

work page arXiv 2024
[6]

The overthinker's diet: Cutting token calories with difficulty-aware training

Weize Chen, Jiarui Yuan, Tailin Jin, Ning Ding, Huimin Chen, Zhiyuan Liu, and Maosong Sun. The overthinker's diet: Cutting token calories with difficulty-aware training. arXiv preprint arXiv:2505.19217, 2025 a

work page arXiv 2025
[7]

Rm-r1: Reward modeling as reasoning

Xiusi Chen, Gaotang Li, Ziqi Wang, Bowen Jin, Cheng Qian, Yu Wang, Hongru Wang, Yu Zhang, Denghui Zhang, Tong Zhang, et al. Rm-r1: Reward modeling as reasoning. arXiv preprint arXiv:2505.02387, 2025 b

work page arXiv 2025
[8]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[9]

The danger of overthinking: Examining the reasoning-action dilemma in agentic tasks

Alejandro Cuadron, Dacheng Li, Wenjie Ma, Xingyao Wang, Yichuan Wang, Siyuan Zhuang, Shu Liu, Luis Gaspar Schroeder, Tian Xia, Huanzhi Mao, et al. The danger of overthinking: Examining the reasoning-action dilemma in agentic tasks. arXiv preprint arXiv:2502.08235, 2025

work page arXiv 2025
[10]

The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models

Ganqu Cui, Yuchen Zhang, Jiacheng Chen, Lifan Yuan, Zhi Wang, Yuxin Zuo, Haozhan Li, Yuchen Fan, Huayu Chen, Weize Chen, et al. The entropy mechanism of reinforcement learning for reasoning language models. arXiv preprint arXiv:2505.22617, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[11]

Pplqa: An unsupervised information-theoretic quality metric for comparing generative large language models

Gerald Friedland, Xin Huang, Yueying Cui, Vishaal Kapoor, Ashish Khetan, and Sanjiv Das. Pplqa: An unsupervised information-theoretic quality metric for comparing generative large language models. arXiv preprint arXiv:2411.15320, 2024

work page arXiv 2024
[12]

Cognitive Behaviors that Enable Self-Improving Reasoners, or, Four Habits of Highly Effective STaRs

Kanishk Gandhi, Ayush Chakravarthy, Anikait Singh, Nathan Lile, and Noah D Goodman. Cognitive behaviors that enable self-improving reasoners, or, four habits of highly effective stars. arXiv preprint arXiv:2503.01307, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[13]

A survey of confidence estimation and calibration in large language models

Jiahui Geng, Fengyu Cai, Yuxia Wang, Heinz Koeppl, Preslav Nakov, and Iryna Gurevych. A survey of confidence estimation and calibration in large language models. arXiv preprint arXiv:2311.08298, 2023

work page arXiv 2023
[14]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[15]

Tomap: Training opponent-aware llm persuaders with theory of mind

Peixuan Han, Zijia Liu, and Jiaxuan You. Tomap: Training opponent-aware llm persuaders with theory of mind. arXiv preprint arXiv:2505.22961, 2025 a

work page arXiv 2025
[16]

Safeswitch: Steering unsafe llm behavior via internal activation signals

Peixuan Han, Cheng Qian, Xiusi Chen, Yuji Zhang, Denghui Zhang, and Heng Ji. Safeswitch: Steering unsafe llm behavior via internal activation signals. arXiv preprint arXiv:2502.01042, 2025 b

work page arXiv 2025
[17]

Token-budget-aware llm reasoning

Tingxu Han, Zhenting Wang, Chunrong Fang, Shiyu Zhao, Shiqing Ma, and Zhenyu Chen. Token-budget-aware llm reasoning. arXiv preprint arXiv:2412.18547, 2024

work page arXiv 2024
[18]

Measuring Mathematical Problem Solving With the MATH Dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[19]

OpenAI o1 System Card

Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card. arXiv preprint arXiv:2412.16720, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[20]

Towards mitigating llm hallucination via self reflection

Ziwei Ji, Tiezheng Yu, Yan Xu, Nayeon Lee, Etsuko Ishii, and Pascale Fung. Towards mitigating llm hallucination via self reflection. In Findings of the Association for Computational Linguistics: EMNLP 2023, pp.\ 1827--1843, 2023

work page 2023
[21]

C3ot: Generating shorter chain-of-thought without compromising effectiveness

Yu Kang, Xianghui Sun, Liangyu Chen, and Wei Zou. C3ot: Generating shorter chain-of-thought without compromising effectiveness. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pp.\ 24312--24320, 2025

work page 2025
[22]

Reinforcement learning for optimizing rag for domain chatbots

Mandar Kulkarni, Praveen Tangarajan, Kyung Kim, and Anusua Trivedi. Reinforcement learning for optimizing rag for domain chatbots. arXiv preprint arXiv:2401.06800, 2024

work page arXiv 2024
[23]

Gonzalez, Hao Zhang, and Ion Stoica

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023

work page 2023
[24]

Revisiting llm reasoning via information bottleneck

Shiye Lei, Zhihao Cheng, Kai Jia, and Dacheng Tao. Revisiting llm reasoning via information bottleneck. arXiv preprint arXiv:2507.18391, 2025

work page arXiv 2025
[25]

Numinamath: The largest public dataset in ai4maths with 860k pairs of competition math problems and solutions

Jia Li, Edward Beeching, Lewis Tunstall, Ben Lipkin, Roman Soletskyi, Shengyi Huang, Kashif Rasul, Longhui Yu, Albert Q Jiang, Ziju Shen, et al. Numinamath: The largest public dataset in ai4maths with 860k pairs of competition math problems and solutions. Hugging Face repository, 13 0 (9): 0 9, 2024

work page 2024
[26]

Understanding R1-Zero-Like Training: A Critical Perspective

Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A critical perspective. arXiv preprint arXiv:2503.20783, 2025 a

work page internal anchor Pith review Pith/arXiv arXiv 2025
[27]

Time-r1: Towards comprehensive temporal reasoning in llms

Zijia Liu, Peixuan Han, Haofei Yu, Haoru Li, and Jiaxuan You. Time-r1: Towards comprehensive temporal reasoning in llms. arXiv preprint arXiv:2505.13508, 2025 b

work page arXiv 2025
[28]

O1-pruner: Length-harmonizing fine-tuning for o1-like reasoning pruning

Haotian Luo, Li Shen, Haiying He, Yibo Wang, Shiwei Liu, Wei Li, Naiqiang Tan, Xiaochun Cao, and Dacheng Tao. O1-pruner: Length-harmonizing fine-tuning for o1-like reasoning pruning. arXiv preprint arXiv:2501.12570, 2025

work page arXiv 2025
[29]

Reasoning models can be effective without thinking

Wenjie Ma, Jingxuan He, Charlie Snell, Tyler Griggs, Sewon Min, and Matei Zaharia. Reasoning models can be effective without thinking. arXiv preprint arXiv:2504.09858, 2025 a

work page arXiv 2025
[30]

Cot-valve: Length-compressible chain-of-thought tuning

Xinyin Ma, Guangnian Wan, Runpeng Yu, Gongfan Fang, and Xinchao Wang. Cot-valve: Length-compressible chain-of-thought tuning. arXiv preprint arXiv:2502.09601, 2025 b

work page arXiv 2025
[31]

GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models

Iman Mirzadeh, Keivan Alizadeh, Hooman Shahrokhi, Oncel Tuzel, Samy Bengio, and Mehrdad Farajtabar. Gsm-symbolic: Understanding the limitations of mathematical reasoning in large language models. arXiv preprint arXiv:2410.05229, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[32]

Training language models to follow instructions with human feedback

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35: 0 27730--27744, 2022

work page 2022
[33]

Logicbench: Towards systematic evaluation of logical reasoning ability of large language models

Mihir Parmar, Nisarg Patel, Neeraj Varshney, Mutsumi Nakamura, Man Luo, Santosh Mashetty, Arindam Mitra, and Chitta Baral. Logicbench: Towards systematic evaluation of logical reasoning ability of large language models. arXiv preprint arXiv:2404.15522, 2024

work page arXiv 2024
[34]

The benefits of a concise chain of thought on problem-solving in large language models

Matthew Renze and Erhan Guven. The benefits of a concise chain of thought on problem-solving in large language models. In 2024 2nd International Conference on Foundation and Large Language Models (FLLM), pp.\ 476--483. IEEE, 2024 a

work page 2024
[35]

Self-reflection in llm agents: Effects on problem-solving performance,

Matthew Renze and Erhan Guven. Self-reflection in llm agents: Effects on problem-solving performance. arXiv preprint arXiv:2405.06682, 2024 b

work page arXiv 2024
[36]

Language models are greedy reasoners: A systematic formal analysis of chain-of-thought

Abulhair Saparov and He He. Language models are greedy reasoners: A systematic formal analysis of chain-of-thought. arXiv preprint arXiv:2210.01240, 2022

work page arXiv 2022
[37]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[38]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[39]

Hybridflow: A flexible and efficient rlhf framework

Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. In Proceedings of the Twentieth European Conference on Computer Systems, pp.\ 1279--1297, 2025

work page 2025
[40]

Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[41]

Between underthinking and overthinking: An empirical study of reasoning length and correctness in llms

Jinyan Su, Jennifer Healey, Preslav Nakov, and Claire Cardie. Between underthinking and overthinking: An empirical study of reasoning length and correctness in llms. arXiv preprint arXiv:2505.00127, 2025

work page arXiv 2025
[42]

Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models

Yang Sui, Yu-Neng Chuang, Guanchu Wang, Jiamu Zhang, Tianyi Zhang, Jiayi Yuan, Hongyi Liu, Andrew Wen, Shaochen Zhong, Hanjie Chen, et al. Stop overthinking: A survey on efficient reasoning for large language models. arXiv preprint arXiv:2503.16419, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[43]

Gemma 3 Technical Report

Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ram \'e , Morgane Rivi \`e re, et al. Gemma 3 technical report. arXiv preprint arXiv:2503.19786, 2025 a

work page internal anchor Pith review Pith/arXiv arXiv 2025
[44]

Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, et al. Kimi k1. 5: Scaling reinforcement learning with llms. arXiv preprint arXiv:2501.12599, 2025 b

work page internal anchor Pith review Pith/arXiv arXiv 2025
[45]

Aime problem set (1983--2024)

Hemish Veeraboina. Aime problem set (1983--2024). https://www.kaggle.com/datasets/hemishveeraboina/aime-problem-set-1983-2024, 2024

work page 1983
[46]

Self-Consistency Improves Chain of Thought Reasoning in Language Models

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022

[47]

Mitigating overconfidence in large language models: A behavioral lens on confidence estimation and calibration

Bingbing Wen, Chenjun Xu, Robert Wolfe, Lucy Lu Wang, Bill Howe, et al. Mitigating overconfidence in large language models: A behavioral lens on confidence estimation and calibration. In NeurIPS 2024 Workshop on Behavioral Machine Learning, 2024

[48]

Tokenskip: Controllable chain-of-thought compression in llms

Heming Xia, Chak Tou Leong, Wenjie Wang, Yongqi Li, and Wenjie Li. Tokenskip: Controllable chain-of-thought compression in llms. arXiv preprint arXiv:2502.12067, 2025

[49]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

[50]

Demystifying Long Chain-of-Thought Reasoning in LLMs

Edward Yeo, Yuxuan Tong, Morry Niu, Graham Neubig, and Xiang Yue. Demystifying long chain-of-thought reasoning in llms. arXiv preprint arXiv:2502.03373, 2025

[51]

Distilling system 2 into system 1

Ping Yu, Jing Xu, Jason Weston, and Ilia Kulikov. Distilling system 2 into system 1. arXiv preprint arXiv:2407.06023, 2024

[52]

SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for Open Base Models in the Wild

Weihao Zeng, Yuzhen Huang, Qian Liu, Wei Liu, Keqing He, Zejun Ma, and Junxian He. Simplerl-zoo: Investigating and taming zero reinforcement learning for open base models in the wild. arXiv preprint arXiv:2503.18892, 2025

[53]

No free lunch: Rethinking internal feedback for llm reasoning

Yanzhi Zhang, Zhaoxi Zhang, Haoxiang Guan, Yilin Cheng, Yitong Duan, Chen Wang, Yue Wang, Shuxin Zheng, and Jiyan He. No free lunch: Rethinking internal feedback for llm reasoning. arXiv preprint arXiv:2506.17219, 2025

[54]

Learning to Reason without External Rewards

Xuandong Zhao, Zhewei Kang, Aosong Feng, Sergey Levine, and Dawn Song. Learning to reason without external rewards. arXiv preprint arXiv:2505.19590, 2025

[55]

Group Sequence Policy Optimization

Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, Jingren Zhou, and Junyang Lin. Group sequence policy optimization, 2025. URL https://arxiv.org/abs/2507.18071
