pith. machine review for the scientific record.

arxiv: 2412.21187 · v2 · submitted 2024-12-30 · 💻 cs.CL


Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs

Dian Yu, Dong Yu, Haitao Mi, Jiahao Xu, Jianhui Pang, Linfeng Song, Mengfei Zhou, Qiuzhi Liu, Rui Wang, Tian Liang, Xingyu Chen, Zhaopeng Tu, Zhiwei He, Zhuosheng Zhang

Pith reviewed 2026-05-13 15:48 UTC · model grok-4.3

classification 💻 cs.CL
keywords overthinking · o1-like models · chain-of-thought · efficiency metrics · self-training · computational overhead · LLM reasoning · inference optimization

The pith

o1-like LLMs overthink simple problems by extending chain-of-thought far beyond what is needed, and self-training on new efficiency metrics can trim this waste without reducing accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how models that emulate long human-like thinking during inference, such as OpenAI's o1, frequently devote excessive steps to straightforward questions where added reasoning brings little gain. It defines new efficiency metrics that assess both the final answer quality and the internal reasoning path to quantify this overthinking. A self-training method is then applied to encourage the model to halt early on low-value extensions. Experiments across math and science benchmarks demonstrate that computational costs drop while performance holds steady on tasks of varying difficulty.

Core claim

o1-like models exhibit overthinking by allocating unnecessary computational resources to simple problems through extended chain-of-thought processes; novel efficiency metrics from outcome and process perspectives identify this inefficiency, and a self-training paradigm mitigates it by streamlining reasoning steps without loss of accuracy on benchmarks including GSM8K, MATH500, GPQA, and AIME.

What carries the argument

Self-training paradigm guided by outcome-based and process-based efficiency metrics that detect when additional reasoning steps add minimal value to the final result.
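
The abstract names the metrics but not their formulas, so the following is a hedged reconstruction rather than the paper's definitions: a minimal sketch, assuming a long response can be segmented into sequential solution attempts, each with a token count and a correctness flag. All names and ratios below are illustrative stand-ins.

    from dataclasses import dataclass

    @dataclass
    class Attempt:
        tokens: int    # tokens spent on this solution attempt
        correct: bool  # whether this attempt reaches the right answer

    def outcome_efficiency(attempts: list[Attempt]) -> float:
        """Share of the response's tokens spent up to and including the
        first correct attempt. 1.0 means nothing is wasted after the
        answer is first reached; low values indicate overthinking."""
        total = sum(a.tokens for a in attempts)
        spent = 0
        for a in attempts:
            spent += a.tokens
            if a.correct:
                return spent / total
        return 0.0  # never correct: no outcome value produced

    def process_efficiency(attempts: list[Attempt], novel: set[int]) -> float:
        """Share of tokens spent on attempts that introduce a genuinely
        new strategy (novel holds indices of non-redundant attempts,
        however judged). Re-deriving the same solution scores low."""
        total = sum(a.tokens for a in attempts)
        useful = sum(a.tokens for i, a in enumerate(attempts) if i in novel)
        return useful / total if total else 0.0

    # Example: the model answers 2+3 correctly within 40 tokens, then
    # spends 360 more re-checking the same result two different ways.
    resp = [Attempt(40, True), Attempt(200, True), Attempt(160, True)]
    print(outcome_efficiency(resp))       # 0.1 -> heavy overthinking
    print(process_efficiency(resp, {0}))  # 0.1 -> later rounds add nothing

A self-training objective would then prefer, among correct responses to the same problem, those scoring higher on both ratios.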

If this is right

  • Computational overhead decreases on simple problems in GSM8K and MATH500 while accuracy stays intact.
  • Reasoning chains shorten on easy instances without harming results on harder sets such as GPQA and AIME.
  • Models learn to allocate fewer tokens when further steps yield little outcome improvement.
  • Overall inference efficiency improves across test sets of mixed difficulty.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same metrics could be applied at training time to produce models that inherently avoid overthinking from the start.
  • Dynamic early-stopping rules based on these metrics might generalize to non-math reasoning tasks such as coding or science question answering.
  • Resource-constrained deployments could use the trimmed models to handle high query volumes at lower cost.
  • Hybrid inference systems might route easy problems to short-chain versions and hard ones to full long-chain versions, as sketched below.
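
The last extension, difficulty-based routing, is simple to prototype. A minimal sketch, assuming a cheap difficulty scorer mapping questions to [0, 1] and two hypothetical model callables; the 0.5 threshold is arbitrary.

    from typing import Callable

    def route(question: str,
              difficulty: Callable[[str], float],
              short_chain: Callable[[str], str],
              long_chain: Callable[[str], str],
              threshold: float = 0.5) -> str:
        """Send easy questions to the trimmed short-chain model and hard
        ones to the full long-chain model. All three callables are
        hypothetical stand-ins, not components from the paper."""
        if difficulty(question) < threshold:
            return short_chain(question)  # cheap path for easy queries
        return long_chain(question)       # full budget for hard queries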

Load-bearing premise

The proposed efficiency metrics accurately flag wasteful overthinking rather than missing cases where longer reasoning is genuinely required for correct answers.

What would settle it

Apply the self-training strategies to a set of easy problems where the metrics predict overthinking; if accuracy falls below the original model's level once reasoning chains are trimmed, the metrics have misclassified necessary reasoning as waste.
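
Run mechanically, that test is a few lines of harness code. A sketch under stated assumptions: generate_original and generate_trimmed are hypothetical callables returning an answer and a token count, and easy_set holds (question, gold answer) pairs the metrics flag as overthought.

    def settle_it(easy_set, generate_original, generate_trimmed):
        """Compare the original and self-trained models on problems the
        efficiency metrics flag as overthought. If accuracy drops once
        chains are trimmed, the metrics likely flagged necessary
        reasoning as waste; the token averages quantify the savings."""
        n = len(easy_set)
        stats = {"orig_acc": 0, "trim_acc": 0, "orig_tok": 0, "trim_tok": 0}
        for question, gold in easy_set:
            ans_o, tok_o = generate_original(question)
            ans_t, tok_t = generate_trimmed(question)
            stats["orig_acc"] += ans_o == gold
            stats["trim_acc"] += ans_t == gold
            stats["orig_tok"] += tok_o
            stats["trim_tok"] += tok_t
        return {key: value / n for key, value in stats.items()}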

read the original abstract

The remarkable performance of models like the OpenAI o1 can be attributed to their ability to emulate human-like long-time thinking during inference. These models employ extended chain-of-thought (CoT) processes, exploring multiple strategies to enhance problem-solving capabilities. However, a critical question remains: How to intelligently and efficiently scale computational resources during testing. This paper presents the first comprehensive study on the prevalent issue of overthinking in these models, where excessive computational resources are allocated for simple problems with minimal benefit. We introduce novel efficiency metrics from both outcome and process perspectives to evaluate the rational use of computational resources by o1-like models. Using a self-training paradigm, we propose strategies to mitigate overthinking, streamlining reasoning processes without compromising accuracy. Experimental results show that our approach successfully reduces computational overhead while preserving model performance across a range of testsets with varying difficulty levels, such as GSM8K, MATH500, GPQA, and AIME.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that o1-like LLMs exhibit overthinking by allocating excessive compute to simple problems, introduces novel outcome- and process-based efficiency metrics to quantify rational resource use, and applies a self-training paradigm to shorten reasoning chains. Experiments reportedly show reduced computational overhead with preserved accuracy on GSM8K, MATH500, GPQA, and AIME.

Significance. If the efficiency metrics are shown to correctly separate overthinking from necessary exploration, the work could meaningfully advance efficient inference for long-CoT models by providing a practical self-training recipe that lowers token usage without accuracy loss. The multi-benchmark evaluation across difficulty levels is a positive feature, but the absence of explicit validation against ground-truth cases requiring extended reasoning limits the strength of the central claim.

major comments (3)
  1. [§3] §3 (Efficiency Metrics): The outcome-based metric appears defined primarily via token count and final correctness without explicit conditioning on problem difficulty or a control for cases where longer chains are required (e.g., multi-step proofs in AIME/GPQA); this risks systematically penalizing beneficial reasoning and undermines the claim that self-training preserves performance for general reasons rather than test-set artifacts.
  2. [§4.2] §4.2 (Self-Training Experiments): Results report preserved performance after mitigation, yet no ablation isolates the contribution of the proposed metrics versus simpler length penalties, and no statistical significance tests or variance across runs are provided; without these, it is unclear whether the efficiency gains are robust or dependent on the chosen difficulty distribution.
  3. [§4.1] §4.1 (Benchmark Details): The process-based metric is claimed to identify overthinking, but the manuscript does not report correlation with human judgments or oracle cases where extended exploration is provably necessary; this leaves the metric's validity as an open load-bearing assumption for the self-training objective.
minor comments (2)
  1. [Abstract] Abstract and §1: The claim of being the 'first comprehensive study' would benefit from explicit citations to prior work on CoT length analysis or overthinking in reasoning models to clarify novelty.
  2. [§3] Notation: Define all efficiency metric components (e.g., exact formulas for outcome and process scores) in a single dedicated subsection with consistent symbols to improve readability.
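
The ablation requested in major comment 2 has an obvious control arm: swap the proposed metrics for a bare length penalty in the training reward. A minimal sketch of that baseline, with every name and constant an assumed stand-in rather than the paper's objective.

    def length_penalty_reward(correct: bool, tokens: int,
                              max_tokens: int = 4096,
                              alpha: float = 0.5) -> float:
        """Baseline reward: full credit for a correct answer, linearly
        discounted by response length, zero otherwise. If metric-guided
        self-training only matches this baseline, the metrics add
        nothing; if it beats it on hard sets, they do real work."""
        if not correct:
            return 0.0
        return 1.0 - alpha * min(tokens / max_tokens, 1.0)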

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below and commit to revisions that strengthen the presentation of the efficiency metrics and experimental results.

read point-by-point responses
  1. Referee: [§3] §3 (Efficiency Metrics): The outcome-based metric appears defined primarily via token count and final correctness without explicit conditioning on problem difficulty or a control for cases where longer chains are required (e.g., multi-step proofs in AIME/GPQA); this risks systematically penalizing beneficial reasoning and undermines the claim that self-training preserves performance for general reasons rather than test-set artifacts.

    Authors: We appreciate this observation. The outcome-based metric is intentionally kept general, relying on token count and correctness to flag overthinking on problems where additional computation yields little benefit. Our evaluation already spans benchmarks of varying difficulty (GSM8K for easy problems and AIME/GPQA for those requiring extended multi-step reasoning), and accuracy is preserved after self-training on the harder sets, which suggests the approach does not indiscriminately penalize necessary exploration. To directly address the concern, we will revise §3 to add an explicit discussion of difficulty conditioning and include a stratified analysis by problem difficulty. revision: partial

  2. Referee: [§4.2] §4.2 (Self-Training Experiments): Results report preserved performance after mitigation, yet no ablation isolates the contribution of the proposed metrics versus simpler length penalties, and no statistical significance tests or variance across runs are provided; without these, it is unclear whether the efficiency gains are robust or dependent on the chosen difficulty distribution.

    Authors: We agree that these elements are needed to demonstrate robustness. In the revised manuscript we will add ablations that isolate the contribution of our proposed metrics against simpler length-penalty baselines, and we will report results with variance across multiple runs together with statistical significance tests. This will clarify that the observed efficiency gains hold across the difficulty distribution of the evaluated benchmarks. revision: yes

  3. Referee: [§4.1] §4.1 (Benchmark Details): The process-based metric is claimed to identify overthinking, but the manuscript does not report correlation with human judgments or oracle cases where extended exploration is provably necessary; this leaves the metric's validity as an open load-bearing assumption for the self-training objective.

    Authors: Thank you for raising this point. The process-based metric identifies overthinking via detection of redundant or inefficient steps within the generated chain. Although such validation was not included in the original submission, we will add in the revision a correlation analysis between the metric and human judgments on a sampled subset of reasoning traces, along with discussion of oracle cases from AIME and GPQA where extended exploration is known to be necessary. This will provide direct support for the metric's validity in guiding the self-training objective. revision: yes
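
The validation promised in response 3 reduces to a small statistical check once traces are annotated. A sketch, assuming paired per-trace lists of metric scores and human redundancy ratings; both inputs are placeholders.

    from scipy.stats import spearmanr

    def validate_process_metric(metric_scores, human_redundancy):
        """Rank correlation between the process-based efficiency metric
        and human judgments of redundancy on the same reasoning traces.
        A strong positive rho with a small p-value would support the
        metric; a near-zero rho would confirm the referee's concern
        that it is load-bearing but unvalidated."""
        rho, p_value = spearmanr(metric_scores, human_redundancy)
        return rho, p_value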

Circularity Check

0 steps flagged

No significant circularity; the claims rest on empirical results against external benchmarks.

full rationale

The paper defines novel outcome- and process-based efficiency metrics, applies self-training to shorten reasoning chains, and reports performance preservation on held-out benchmarks (GSM8K, MATH500, GPQA, AIME). No load-bearing derivation step reduces by construction to a fitted parameter, self-definition, or self-citation chain; the central claims remain falsifiable against the external test sets.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides insufficient detail to enumerate specific free parameters, axioms, or invented entities; the work appears to rest on standard LLM evaluation practices and self-training assumptions common in the field.

pith-pipeline@v0.9.0 · 5510 in / 1009 out tokens · 45153 ms · 2026-05-13T15:48:19.491158+00:00 · methodology


Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. The Gordian Knot for VLMs: Diagrammatic Knot Reasoning as a Hard Benchmark

    cs.AI 2026-05 unverdicted novelty 7.0

    KnotBench benchmark shows state-of-the-art VLMs perform near random on diagrammatic knot reasoning tasks and lack ability to simulate structural moves.

  2. LEAD: Length-Efficient Adaptive and Dynamic Reasoning for Large Language Models

    cs.LG 2026-05 unverdicted novelty 7.0

    LEAD uses online adaptive mechanisms including Potential-Scaled Instability and symmetric efficiency rewards based on correct rollouts to achieve higher accuracy-efficiency scores with substantially shorter reasoning ...

  3. Post Reasoning: Improving the Performance of Non-Thinking Models at No Cost

    cs.AI 2026-05 conditional novelty 7.0

    Post-Reasoning boosts LLM accuracy by reversing the usual answer-after-reasoning order, delivering mean relative gains of 17.37% across 117 model-benchmark pairs with zero extra cost.

  4. Nice Fold or Hero Call: Learning Budget-Efficient Thinking for Adaptive Reasoning

    cs.AI 2026-05 unverdicted novelty 6.0

    BET reduces reasoning tokens by about 55% on average while improving performance across benchmarks by learning to short-solve easy queries, fold early on unsolvable ones, and preserve budget for hard solvable queries.

  5. Breaking the Reward Barrier: Accelerating Tree-of-Thought Reasoning via Speculative Exploration

    cs.LG 2026-05 unverdicted novelty 6.0

    SPEX accelerates Tree-of-Thought LLM reasoning 1.2-3x via speculative path selection, dynamic budget allocation across queries, and adaptive early termination, with up to 4.1x when combined with token speculative decoding.

  6. Efficient LLM Reasoning via Variational Posterior Guidance with Efficiency Awareness

    cs.LG 2026-05 unverdicted novelty 6.0

    VPG-EA applies variational posterior guidance and efficiency-aware distillation to compress LLM reasoning chains while preserving performance.

  7. Hint Tuning: Less Data Makes Better Reasoners

    cs.CL 2026-05 unverdicted novelty 6.0

    Hint Tuning uses an instruct model as a difficulty probe to create 1K multi-level hint examples that train reasoning models to calibrate chain-of-thought length, cutting tokens by 31.5% on average across 4B-32B models...

  8. Implicit Compression Regularization: Concise Reasoning via Internal Shorter Distributions in RL Post-Training

    cs.AI 2026-05 unverdicted novelty 6.0

    ICR creates a virtual shorter distribution from shortest correct on-policy responses to regularize RL post-training toward concise yet accurate reasoning, improving the accuracy-length Pareto frontier on math and know...

  9. When LLMs Stop Following Steps: A Diagnostic Study of Procedural Execution in Language Models

    cs.CL 2026-05 unverdicted novelty 6.0

    LLM accuracy on controlled procedural arithmetic drops from 61% at 5 steps to 20% at 95 steps, with failures including skipped steps, premature answers, and hallucinated operations.

  10. From $P(y|x)$ to $P(y)$: Investigating Reinforcement Learning in Pre-train Space

    cs.LG 2026-04 unverdicted novelty 6.0

    PreRL applies reward-driven updates to P(y) in pre-train space, uses Negative Sample Reinforcement to prune bad reasoning paths and boost reflection, and combines with standard RL in Dual Space RL to outperform baseli...

  11. HiRO-Nav: Hybrid ReasOning Enables Efficient Embodied Navigation

    cs.AI 2026-04 unverdicted novelty 6.0

    HiRO-Nav adaptively triggers reasoning only on high-entropy actions via a hybrid training pipeline and shows better success-token trade-offs than always-reason or never-reason baselines on the CHORES-S benchmark.

  12. Muon is Scalable for LLM Training

    cs.LG 2025-02 unverdicted novelty 6.0

    Muon optimizer with weight decay and update scaling achieves ~2x efficiency over AdamW for large LLMs, shown via the Moonlight 3B/16B MoE model trained on 5.7T tokens.

  13. Reasoning Compression with Mixed-Policy Distillation

    cs.AI 2026-05 unverdicted novelty 5.0

    Mixed-Policy Distillation transfers concise reasoning behavior from larger to smaller LLMs by having the teacher compress student-generated trajectories, cutting token usage up to 27% while raising benchmark scores.

  14. How Well Do LLMs Perform on the Simplest Long-Chain Reasoning Tasks: An Empirical Study on the Equivalence Class Problem

    cs.AI 2026-05 unverdicted novelty 5.0

    Non-reasoning LLMs fail the equivalence class problem while reasoning LLMs perform better but remain incomplete, with difficulty peaking at phase transition for the former and maximum diameter for the latter.

  15. Decodable but Not Corrected by Fixed Residual-Stream Linear Steering: Evidence from Medical LLM Failure Regimes

    cs.AI 2026-05 unverdicted novelty 5.0

    Overthinking in medical QA is linearly decodable at 71.6% accuracy yet fixed residual-stream steering yields no correction across 29 configurations, while enabling selective abstention with AUROC 0.610.

  16. Adapt to Thrive! Adaptive Power-Mean Policy Optimization for Improved LLM Reasoning

    cs.CL 2026-04 unverdicted novelty 5.0

    APMPO boosts average Pass@1 scores on math reasoning benchmarks by 3 points over GRPO by using an adaptive power-mean policy objective and feedback-driven clipping bounds in RLVR training.

  17. Free Energy-Driven Reinforcement Learning with Adaptive Advantage Shaping for Unsupervised Reasoning in LLMs

    cs.CL 2026-04 unverdicted novelty 5.0

    FREIA applies free energy principles and adaptive advantage shaping to unsupervised RL, outperforming baselines by 0.5-3.5 Pass@1 points on math reasoning with a 1.5B model.

  18. SHAPE: Stage-aware Hierarchical Advantage via Potential Estimation for LLM Reasoning

    cs.LG 2026-04 unverdicted novelty 5.0

    SHAPE improves average math reasoning accuracy by 3% while cutting token use by 30% through stage-aware hierarchical advantage and entropy-driven token redistribution.

  19. Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models

    cs.CL 2025-03 accept novelty 5.0

    A survey organizing techniques to achieve efficient reasoning in LLMs by shortening chain-of-thought outputs.

  20. Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models

    cs.AI 2025-03 unverdicted novelty 5.0

    The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.

  21. From System 1 to System 2: A Survey of Reasoning Large Language Models

    cs.AI 2025-02 accept novelty 3.0

    The survey organizes the shift of LLMs toward deliberate System 2 reasoning, covering model construction techniques, performance on math and coding benchmarks, and future research directions.

  22. RetroInfer: A Vector Storage Engine for Scalable Long-Context LLM Inference

    cs.LG 2025-05

Reference graph

Works this paper leans on

274 extracted references · 274 canonical work pages · cited by 22 Pith papers · 18 internal anchors
