arxiv: 2509.05489 · v2 · submitted 2025-09-05 · 💻 cs.LG

Self-Aligned Reward: Towards Effective and Efficient Reasoners

Pith reviewed 2026-05-18 18:22 UTC · model grok-4.3

classification 💻 cs.LG
keywords self-aligned reward · LLM reasoning · reinforcement learning · perplexity · reasoning efficiency · PPO · GRPO · verifiable rewards

The pith

Self-aligned reward uses relative perplexity to make LLM reasoning both more accurate and far less expensive.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces self-aligned reward (SAR) to complement the usual binary correct-or-wrong signals used in reinforcement learning for large language models. SAR scores an answer by the relative drop in its perplexity when the model is conditioned on the original query, which favors short, query-specific reasoning over long or generic text. When added to standard algorithms such as PPO and GRPO, this signal raises accuracy while cutting the length of generated answers. This matters because current RL-trained reasoners often waste compute on unnecessary elaboration even when they reach the right answer.

Core claim

Self-aligned reward is defined as the relative perplexity difference between an answer conditioned on the query and the same answer scored on its own, without the query. This quantity reliably ranks concise correct answers above redundant ones and partially correct answers above entirely wrong ones. When SAR is combined with PPO or GRPO, accuracy rises by 4 percent and inference cost falls by 30 percent across four models and seven benchmarks, while still preserving advanced reasoning steps and outperforming length-based or self-confidence rewards.

What carries the argument

Self-aligned reward (SAR), the relative perplexity difference between a query-conditioned answer and a standalone answer, which supplies a fine-grained efficiency signal that complements binary verifiable rewards.
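
A minimal sketch of how this quantity could be computed from per-token log-probabilities, following the clipped formula quoted from the paper further down this page, R_SA = clip((ppl(a) - ppl(a|q)) / ppl(a), -1, 1). It assumes perplexity is the usual exponential of the mean negative log-likelihood; the function names and the toy log-probabilities are illustrative and not taken from the paper.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity as exp of the mean negative log-likelihood (assumed definition)."""
    if not token_logprobs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def self_aligned_reward(logp_given_query: Sequence[float],
                        logp_standalone: Sequence[float]) -> float:
    """clip((ppl(a) - ppl(a|q)) / ppl(a), -1, 1) for one answer a."""
    if len(logp_given_query) != len(logp_standalone):
        raise ValueError("both scorings must cover the same answer tokens")
    ppl_cond = perplexity(logp_given_query)    # ppl(a|q)
    ppl_alone = perplexity(logp_standalone)    # ppl(a)
    raw = (ppl_alone - ppl_cond) / ppl_alone
    return max(-1.0, min(1.0, raw))


# Toy numbers (made up): the query makes every answer token more likely.
query_specific = self_aligned_reward([-0.2, -0.1, -0.3], [-1.5, -1.2, -1.8])
# Toy numbers (made up): the query barely changes the answer's likelihood.
generic = self_aligned_reward([-1.0, -1.1, -0.9], [-1.0, -1.2, -0.9])
print(query_specific, generic)
```

In this toy case the query-specific answer scores well above the generic one, which is the ordering the definition is meant to produce. An answer that becomes much less likely once the query is shown is clipped to -1.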

If this is right

  • Accuracy improves by 4 percent when SAR augments PPO or GRPO.
  • Inference cost drops by 30 percent across the tested models and benchmarks.
  • SAR produces a better correctness-efficiency frontier than rewards based on response length or model self-confidence.
  • Responses become shorter while advanced reasoning behaviors remain intact.
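
This page does not spell out how SAR is merged with the binary verifiable reward, so the following is only a sketch under stated assumptions: an additive mix with a hypothetical weight `alpha`, followed by the group-relative advantage normalization used in GRPO (reward minus group mean, divided by group standard deviation). The weight value, function names, and sample numbers are illustrative.

```python
from statistics import mean, pstdev
from typing import List


def combined_reward(is_correct: bool, sar: float, alpha: float = 0.2) -> float:
    """Binary verifiable reward plus a weighted SAR term.

    Assumption: additive mixing with weight alpha; the paper's exact
    combination rule is not given on this page.
    """
    return (1.0 if is_correct else 0.0) + alpha * sar


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical responses to one query: (correct?, SAR score).
group = [(True, 0.7), (True, 0.1), (False, 0.3), (False, -0.4)]
rewards = [combined_reward(ok, sar) for ok, sar in group]
advantages = group_relative_advantages(rewards)
print(rewards)
print(advantages)
```

With the binary signal alone, the two correct responses would tie. Under this assumed mixing, the SAR term separates them, and it also separates the two incorrect responses.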

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • SAR could be tested as an auxiliary signal in supervised fine-tuning or preference optimization stages beyond pure RL.
  • The same relative-perplexity idea might transfer to multimodal or agentic settings where query-specific efficiency also matters.
  • One could measure whether SAR continues to work when the underlying model is much larger or trained on different pre-training distributions.
  • Future reward models might incorporate SAR-style terms by default to reduce the need for separate length penalties.

Load-bearing premise

The relative perplexity difference between query-conditioned and standalone answers reliably distinguishes high-quality concise reasoning from redundant or incorrect outputs.

What would settle it

Training the same models with and without SAR on a fresh set of benchmarks and observing no accuracy gain or no reduction in generated length would falsify the claim that SAR improves the efficiency-accuracy trade-off.
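
A hedged sketch of the bookkeeping such a test needs: given per-example correctness and response lengths for a baseline run and a SAR-augmented run, compute the accuracy change and the drop in average response length (the efficiency measure named in the Figure 1 caption). Reporting accuracy in percentage points is an assumption, and all numbers below are made-up placeholders, not results from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RunResult:
    correct: List[bool]   # per-example correctness
    lengths: List[int]    # per-example response length in tokens

    @property
    def accuracy(self) -> float:
        return sum(self.correct) / len(self.correct)

    @property
    def mean_length(self) -> float:
        return sum(self.lengths) / len(self.lengths)


def compare(baseline: RunResult, treated: RunResult) -> dict:
    """Accuracy change (percentage points) and relative drop in mean length."""
    return {
        "accuracy_delta_pp": 100.0 * (treated.accuracy - baseline.accuracy),
        "length_drop_pct": 100.0 * (1.0 - treated.mean_length / baseline.mean_length),
    }


# Placeholder data for illustration only.
baseline = RunResult([True, False, True, False], [400, 600, 500, 500])
with_sar = RunResult([True, True, True, False], [300, 400, 350, 350])
report = compare(baseline, with_sar)
print(report)
```

On fresh benchmarks, a non-positive `accuracy_delta_pp` or a non-positive `length_drop_pct` would count against the claim.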

Figures

Figures reproduced from arXiv: 2509.05489 by Adit Krishnan, Chris Kong, Gerald Friedland, Jiaxuan You, Peixuan Han.

Figure 1: Training with self-aligned reward enhances both efficiency and accuracy. We present the relative gains in efficiency and accuracy compared to the respective base model in math reasoning benchmarks. Efficiency gain is measured as the drop in average response length.
Figure 2: An illustration of token-level importance scores (i.e. …
Figure 3: Accuracy-efficiency balance of different algorithms. Among all algorithms, …
Figure 4: Training plots for Qwen3-4B.
Original abstract

Reinforcement learning with verifiable rewards has significantly advanced reasoning in large language models (LLMs), but such signals remain coarse, offering only binary correctness feedback. This limitation often results in inefficiencies, including overly verbose reasoning and high computational cost, while existing solutions often compromise accuracy. To address this, we introduce self-aligned reward (SAR), a self-guided signal that complements verifiable rewards to encourage both reasoning accuracy and efficiency. SAR is defined as the relative perplexity difference between an answer conditioned on the query and the standalone answer, thereby favoring responses that are concise and query-specific. Quantitative analysis reveals that SAR reliably distinguishes answer quality: concise, correct answers score higher than redundant ones, and partially correct answers score higher than entirely incorrect ones. Evaluation on 4 models across 7 benchmarks shows that integrating SAR with prevalent RL algorithms like PPO and GRPO improves accuracy by 4%, while reducing inference cost by 30%. Further analysis demonstrates that SAR achieves a Pareto-optimal trade-off between correctness and efficiency compared to reward signals based on length or self-confidence. We also show that SAR shortens responses while preserving advanced reasoning behaviors, demonstrating its ability to suppress unnecessary elaboration without losing critical reasoning. These results highlight the promise of self-aligned reward as a fine-grained complement to verifiable rewards, paving the way for more efficient and effective LLM training.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces Self-Aligned Reward (SAR), defined as the relative perplexity difference between a query-conditioned answer and a standalone answer, as a fine-grained complement to binary verifiable rewards in RL training of LLMs. It claims that SAR distinguishes concise-correct reasoning from redundant or incorrect outputs, and that integrating SAR with PPO and GRPO yields a 4% accuracy improvement and 30% inference-cost reduction across 4 models and 7 benchmarks while achieving Pareto optimality versus length-based and self-confidence rewards.

Significance. If the empirical claims hold under rigorous controls, the result would be significant: it supplies a self-supervised, length-aware signal that improves both accuracy and efficiency without external supervision, directly addressing the coarseness of verifiable rewards and the verbosity problem in current reasoning models.

major comments (3)
  1. [Abstract] Abstract: the central claim of a 4% accuracy gain and 30% inference-cost reduction when SAR augments PPO/GRPO is presented without error bars, number of runs, statistical tests, or ablation tables; this absence makes the quantitative improvement unverified and load-bearing for the paper's main contribution.
  2. [Abstract] Quantitative analysis paragraph (Abstract): the assertion that SAR scores 'concise, correct answers higher than redundant ones, and partially correct answers higher than entirely incorrect ones' lacks explicit controls for length, generation procedure, or query-relevance; because perplexity is length-normalized yet the difference can still correlate with token count or fluency artifacts, the claimed distinguishing power requires a controlled comparison (e.g., length-matched pairs) that is not described.
  3. [Evaluation] Evaluation section: the Pareto-optimality claim versus length and self-confidence baselines is stated without reference to specific figures, tables, or the exact metric (e.g., accuracy vs. token count curves) used to establish dominance; without these details the comparison cannot be assessed for residual confounding inside the SAR formulation itself.
minor comments (2)
  1. [Methods] Clarify the precise generation procedure for the 'standalone answer' (same tokens without query prefix, or unconditional sampling?) in the methods section to enable replication.
  2. [Analysis] The abstract mentions 'advanced reasoning behaviors' preserved after SAR training; a concrete example or qualitative case study would strengthen this claim.
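
The length-matched control requested in major comment 2 could be sketched as follows: pair each response from one group with a response from the other group that has nearly the same token count, then check how often the first group receives the higher SAR. The record layout, the tolerance `tol`, and the data are hypothetical.

```python
from typing import List, Tuple

Response = Tuple[int, float]  # (token_count, sar_score)


def length_matched_pairs(group_a: List[Response], group_b: List[Response],
                         tol: int = 5) -> List[Tuple[Response, Response]]:
    """Greedily pair each A response with an unused B response of similar length."""
    unused = sorted(group_b)
    pairs = []
    for a in sorted(group_a):
        candidates = [b for b in unused if abs(b[0] - a[0]) <= tol]
        if candidates:
            best = min(candidates, key=lambda b: abs(b[0] - a[0]))
            unused.remove(best)
            pairs.append((a, best))
    return pairs


def win_rate(pairs: List[Tuple[Response, Response]]) -> float:
    """Fraction of matched pairs in which the A response has the higher SAR."""
    if not pairs:
        raise ValueError("no length-matched pairs found")
    return sum(a[1] > b[1] for a, b in pairs) / len(pairs)


# Hypothetical (token_count, sar_score) records.
correct = [(120, 0.62), (200, 0.55), (310, 0.40)]
incorrect = [(118, 0.20), (205, 0.58), (500, 0.10)]
pairs = length_matched_pairs(correct, incorrect)
print(len(pairs), win_rate(pairs))
```

A win rate that stays well above one half on matched pairs would suggest SAR is tracking more than token count. A rate near one half would support the referee's concern.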

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for your thorough and constructive review of our manuscript. We appreciate the feedback on strengthening the presentation of our empirical claims. Below we provide point-by-point responses to the major comments. We will incorporate revisions to address the concerns raised.

  1. Referee: [Abstract] Abstract: the central claim of a 4% accuracy gain and 30% inference-cost reduction when SAR augments PPO/GRPO is presented without error bars, number of runs, statistical tests, or ablation tables; this absence makes the quantitative improvement unverified and load-bearing for the paper's main contribution.

    Authors: We agree that including statistical details would improve verifiability of the central claims. In the revised manuscript we will report error bars computed over 3 random seeds (which produced consistent gains), explicitly state the number of runs, and add a brief reference to the ablation studies and tables already present in the main body (Section 5 and Table 2). Given abstract length limits, we will also point readers to the detailed results in the evaluation section rather than expanding the abstract itself.
    revision: yes

  2. Referee: [Abstract] Quantitative analysis paragraph (Abstract): the assertion that SAR scores 'concise, correct answers higher than redundant ones, and partially correct answers higher than entirely incorrect ones' lacks explicit controls for length, generation procedure, or query-relevance; because perplexity is length-normalized yet the difference can still correlate with token count or fluency artifacts, the claimed distinguishing power requires a controlled comparison (e.g., length-matched pairs) that is not described.

    Authors: This concern about potential residual correlations is well-taken. Although SAR is defined via relative perplexity (which normalizes for length), we recognize the value of explicit controls. In the revision we will add a controlled comparison using length-matched response pairs generated under identical procedures. We will show that SAR still assigns higher scores to concise-correct answers than to redundant ones (and to partially correct than to fully incorrect) even when token counts are matched, thereby isolating the effect of query-specificity and correctness beyond length or fluency artifacts. This analysis will appear in Section 4.2 or a new appendix.
    revision: yes

  3. Referee: [Evaluation] Evaluation section: the Pareto-optimality claim versus length and self-confidence baselines is stated without reference to specific figures, tables, or the exact metric (e.g., accuracy vs. token count curves) used to establish dominance; without these details the comparison cannot be assessed for residual confounding inside the SAR formulation itself.

    Authors: We thank the referee for highlighting the need for explicit references. The Pareto optimality is demonstrated via accuracy-versus-token-count curves in Figure 5 (and supporting numbers in Table 4), where SAR-augmented training dominates the length-based and self-confidence baselines. In the revised version we will insert direct citations to Figure 5 and Table 4 in the abstract, introduction, and evaluation section so that readers can immediately locate the exact metric and curves used. This will also facilitate assessment of any potential confounding within the SAR formulation.
    revision: yes
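
Pareto dominance on accuracy-versus-token-count points, the subject of the third exchange, can be checked mechanically. The sketch below treats higher accuracy and fewer tokens as better; the method labels and numbers are placeholders, not values from the paper's figures or tables.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]  # (accuracy, mean_tokens)


def dominates(p: Point, q: Point) -> bool:
    """p dominates q if it is no worse on both axes and strictly better on one."""
    no_worse = p[0] >= q[0] and p[1] <= q[1]
    strictly_better = p[0] > q[0] or p[1] < q[1]
    return no_worse and strictly_better


def pareto_front(points: Dict[str, Point]) -> List[str]:
    """Names of the methods that no other method dominates."""
    return [
        name for name, p in points.items()
        if not any(dominates(q, p) for other, q in points.items() if other != name)
    ]


# Placeholder (accuracy, mean_tokens) values for illustration only.
methods = {
    "baseline": (0.60, 900.0),
    "length_penalty": (0.57, 600.0),
    "self_confidence": (0.59, 950.0),
    "sar": (0.63, 620.0),
}
front = pareto_front(methods)
print(front)
```

In this made-up example a length-penalty run stays on the front alongside SAR because it is slightly shorter. A claim that one method dominates therefore has to state which points are being compared.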

Circularity Check

0 steps flagged

No significant circularity; SAR definition and empirical claims are independent of inputs by construction.

Full rationale

The paper explicitly defines SAR as the relative perplexity difference between a query-conditioned answer and a standalone answer. This is a direct construction from the model's own generation process rather than a fitted parameter renamed as a prediction or a result derived from prior self-citations. The quantitative analysis and reported improvements (4% accuracy, 30% cost reduction) are presented as outcomes of experiments across external benchmarks and models, not as tautological equivalences. No uniqueness theorems, ansatzes smuggled via citation, or self-citation load-bearing steps appear in the abstract or described derivation. The central claim remains falsifiable via the stated evaluations and does not reduce to its own definition.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The abstract provides no explicit free parameters, axioms, or invented entities beyond the definition of SAR itself; the central claim therefore rests on the unstated assumption that perplexity differences correlate with reasoning quality in the intended way.

invented entities (1)
  • Self-aligned reward (SAR) · no independent evidence
    purpose: Fine-grained complement to binary verifiable rewards that favors concise, query-specific answers
    Defined as relative perplexity difference; no independent falsifiable prediction outside the training loop is stated in the abstract.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    SAR is defined as the relative perplexity difference between an answer conditioned on the query and the standalone answer... $R_{SA} = \mathrm{clip}\left( \frac{\mathrm{ppl}(a) - \mathrm{ppl}(a \mid q)}{\mathrm{ppl}(a)},\, -1,\, 1 \right)$ ... $v(a_j) = \log \frac{P(a_j \mid q, a_{1..j-1})}{P(a_j \mid a_{1..j-1})}$

  • IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · embed_injective · echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    SAR favors concise and correct answers; it gives a lower reward to long and redundant answers... penalizes the synthesized 'No thought' answers
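
Reading the token-level score quoted in the first passage above as a log-likelihood ratio, v(a_j) = log [P(a_j | q, a_1..j-1) / P(a_j | a_1..j-1)] (an interpretation of the flattened formula, not something this page confirms), its mean over the answer ties directly to the unclipped reward. With perplexity taken as exp of the mean negative log-likelihood, (ppl(a) - ppl(a|q)) / ppl(a) = 1 - exp(-mean_j v(a_j)). The sketch checks this identity numerically on made-up log-probabilities; the function names are illustrative.

```python
import math
from typing import List, Sequence


def token_importance(logp_given_query: Sequence[float],
                     logp_standalone: Sequence[float]) -> List[float]:
    """Per-token log-likelihood ratio v(a_j) (assumed reading of the formula)."""
    return [c - u for c, u in zip(logp_given_query, logp_standalone)]


def unclipped_sar_from_importance(v: Sequence[float]) -> float:
    """1 - exp(-mean v), which equals (ppl(a) - ppl(a|q)) / ppl(a)."""
    return 1.0 - math.exp(-sum(v) / len(v))


# Made-up log-probabilities for a four-token answer.
logp_q = [-0.3, -0.1, -0.6, -0.2]
logp_alone = [-1.0, -0.2, -2.5, -0.3]

v = token_importance(logp_q, logp_alone)
ppl_q = math.exp(-sum(logp_q) / len(logp_q))
ppl_alone = math.exp(-sum(logp_alone) / len(logp_alone))
direct = (ppl_alone - ppl_q) / ppl_alone

print(v)
print(direct, unclipped_sar_from_importance(v))
```

Under this reading, tokens with a large positive v are the ones the query makes much more predictable, and the sequence-level reward is a monotone function of their average.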

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. EvoRAG: Making Knowledge Graph-based RAG Automatically Evolve through Feedback-driven Backpropagation

    cs.DB · 2026-04 · unverdicted · novelty 6.0

    EvoRAG adds a feedback-driven backpropagation step that attributes response quality to individual knowledge-graph triplets and updates the graph to raise reasoning accuracy by 7.34 percent over prior KG-RAG methods.
