pith. machine review for the scientific record.

arxiv: 2412.21187 · v2 · submitted 2024-12-30 · 💻 cs.CL


Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs

Dian Yu, Dong Yu, Haitao Mi, Jiahao Xu, Jianhui Pang, Linfeng Song, Mengfei Zhou, Qiuzhi Liu, Rui Wang, Tian Liang, Xingyu Chen, Zhaopeng Tu, Zhiwei He, Zhuosheng Zhang

Pith reviewed 2026-05-13 15:48 UTC · model grok-4.3

classification 💻 cs.CL
keywords overthinking · o1-like models · chain-of-thought · efficiency metrics · self-training · computational overhead · LLM reasoning · inference optimization

The pith

o1-like LLMs overthink simple problems by extending chain-of-thought far beyond what is needed, and self-training on new efficiency metrics can trim this waste without reducing accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how models that emulate long human-like thinking during inference, such as OpenAI's o1, frequently devote excessive steps to straightforward questions where added reasoning brings little gain. It defines new efficiency metrics that assess both the final answer quality and the internal reasoning path to quantify this overthinking. A self-training method is then applied to encourage the model to halt early on low-value extensions. Experiments across math and science benchmarks demonstrate that computational costs drop while performance holds steady on tasks of varying difficulty.

Core claim

o1-like models exhibit overthinking by allocating unnecessary computational resources to simple problems through extended chain-of-thought processes; novel efficiency metrics from outcome and process perspectives identify this inefficiency, and a self-training paradigm mitigates it by streamlining reasoning steps without loss of accuracy on benchmarks including GSM8K, MATH500, GPQA, and AIME.

What carries the argument

Self-training paradigm guided by outcome-based and process-based efficiency metrics that detect when additional reasoning steps add minimal value to the final result.
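
The abstract names the metrics but not their formulas, so the following is a hedged reconstruction rather than the paper's definitions: a minimal sketch, assuming a long response can be segmented into sequential solution attempts, each with a token count and a correctness flag. All names and ratios below are illustrative stand-ins.

    from dataclasses import dataclass

    @dataclass
    class Attempt:
        tokens: int    # tokens spent on this solution attempt
        correct: bool  # whether this attempt reaches the right answer

    def outcome_efficiency(attempts: list[Attempt]) -> float:
        """Share of the response's tokens spent up to and including the
        first correct attempt. 1.0 means nothing is wasted after the
        answer is first reached; low values indicate overthinking."""
        total = sum(a.tokens for a in attempts)
        spent = 0
        for a in attempts:
            spent += a.tokens
            if a.correct:
                return spent / total
        return 0.0  # never correct: no outcome value produced

    def process_efficiency(attempts: list[Attempt], novel: set[int]) -> float:
        """Share of tokens spent on attempts that introduce a genuinely
        new strategy (novel holds indices of non-redundant attempts,
        however judged). Re-deriving the same solution scores low."""
        total = sum(a.tokens for a in attempts)
        useful = sum(a.tokens for i, a in enumerate(attempts) if i in novel)
        return useful / total if total else 0.0

    # Example: the model answers 2+3 correctly within 40 tokens, then
    # spends 360 more re-checking the same result two different ways.
    resp = [Attempt(40, True), Attempt(200, True), Attempt(160, True)]
    print(outcome_efficiency(resp))       # 0.1 -> heavy overthinking
    print(process_efficiency(resp, {0}))  # 0.1 -> later rounds add nothing

A self-training objective would then prefer, among correct responses to the same problem, those scoring higher on both ratios.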

If this is right

  • Computational overhead decreases on simple problems in GSM8K and MATH500 while accuracy stays intact.
  • Reasoning chains shorten on easy instances without harming results on harder sets such as GPQA and AIME.
  • Models learn to allocate fewer tokens when further steps yield little outcome improvement.
  • Overall inference efficiency improves across test sets of mixed difficulty.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same metrics could be applied at training time to produce models that inherently avoid overthinking from the start.
  • Dynamic early-stopping rules based on these metrics might generalize to non-math reasoning tasks such as coding or science question answering.
  • Resource-constrained deployments could use the trimmed models to handle high query volumes at lower cost.
  • Hybrid inference systems might route easy problems to short-chain versions and hard ones to full long-chain versions, as sketched below.
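
The last extension, difficulty-based routing, is simple to prototype. A minimal sketch, assuming a cheap difficulty scorer mapping questions to [0, 1] and two hypothetical model callables; the 0.5 threshold is arbitrary.

    from typing import Callable

    def route(question: str,
              difficulty: Callable[[str], float],
              short_chain: Callable[[str], str],
              long_chain: Callable[[str], str],
              threshold: float = 0.5) -> str:
        """Send easy questions to the trimmed short-chain model and hard
        ones to the full long-chain model. All three callables are
        hypothetical stand-ins, not components from the paper."""
        if difficulty(question) < threshold:
            return short_chain(question)  # cheap path for easy queries
        return long_chain(question)       # full budget for hard queries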

Load-bearing premise

The proposed efficiency metrics accurately flag wasteful overthinking rather than missing cases where longer reasoning is genuinely required for correct answers.

What would settle it

Apply the self-training strategies to a set of easy problems where the metrics predict overthinking; if accuracy falls below the original model's level once reasoning chains are trimmed, the metrics have misclassified necessary reasoning as waste.
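
Run mechanically, that test is a few lines of harness code. A sketch under stated assumptions: generate_original and generate_trimmed are hypothetical callables returning an answer and a token count, and easy_set holds (question, gold answer) pairs the metrics flag as overthought.

    def settle_it(easy_set, generate_original, generate_trimmed):
        """Compare the original and self-trained models on problems the
        efficiency metrics flag as overthought. If accuracy drops once
        chains are trimmed, the metrics likely flagged necessary
        reasoning as waste; the token averages quantify the savings."""
        n = len(easy_set)
        stats = {"orig_acc": 0, "trim_acc": 0, "orig_tok": 0, "trim_tok": 0}
        for question, gold in easy_set:
            ans_o, tok_o = generate_original(question)
            ans_t, tok_t = generate_trimmed(question)
            stats["orig_acc"] += ans_o == gold
            stats["trim_acc"] += ans_t == gold
            stats["orig_tok"] += tok_o
            stats["trim_tok"] += tok_t
        return {key: value / n for key, value in stats.items()}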

read the original abstract

The remarkable performance of models like the OpenAI o1 can be attributed to their ability to emulate human-like long-time thinking during inference. These models employ extended chain-of-thought (CoT) processes, exploring multiple strategies to enhance problem-solving capabilities. However, a critical question remains: How to intelligently and efficiently scale computational resources during testing. This paper presents the first comprehensive study on the prevalent issue of overthinking in these models, where excessive computational resources are allocated for simple problems with minimal benefit. We introduce novel efficiency metrics from both outcome and process perspectives to evaluate the rational use of computational resources by o1-like models. Using a self-training paradigm, we propose strategies to mitigate overthinking, streamlining reasoning processes without compromising accuracy. Experimental results show that our approach successfully reduces computational overhead while preserving model performance across a range of testsets with varying difficulty levels, such as GSM8K, MATH500, GPQA, and AIME.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that o1-like LLMs exhibit overthinking by allocating excessive compute to simple problems, introduces novel outcome- and process-based efficiency metrics to quantify rational resource use, and applies a self-training paradigm to shorten reasoning chains. Experiments reportedly show reduced computational overhead with preserved accuracy on GSM8K, MATH500, GPQA, and AIME.

Significance. If the efficiency metrics are shown to correctly separate overthinking from necessary exploration, the work could meaningfully advance efficient inference for long-CoT models by providing a practical self-training recipe that lowers token usage without accuracy loss. The multi-benchmark evaluation across difficulty levels is a positive feature, but the absence of explicit validation against ground-truth cases requiring extended reasoning limits the strength of the central claim.

major comments (3)
  1. [§3] §3 (Efficiency Metrics): The outcome-based metric appears defined primarily via token count and final correctness without explicit conditioning on problem difficulty or a control for cases where longer chains are required (e.g., multi-step proofs in AIME/GPQA); this risks systematically penalizing beneficial reasoning and undermines the claim that self-training preserves performance for general reasons rather than test-set artifacts.
  2. [§4.2] §4.2 (Self-Training Experiments): Results report preserved performance after mitigation, yet no ablation isolates the contribution of the proposed metrics versus simpler length penalties, and no statistical significance tests or variance across runs are provided; without these, it is unclear whether the efficiency gains are robust or dependent on the chosen difficulty distribution.
  3. [§4.1] §4.1 (Benchmark Details): The process-based metric is claimed to identify overthinking, but the manuscript does not report correlation with human judgments or oracle cases where extended exploration is provably necessary; this leaves the metric's validity as an open load-bearing assumption for the self-training objective.
minor comments (2)
  1. [Abstract] Abstract and §1: The claim of being the 'first comprehensive study' would benefit from explicit citations to prior work on CoT length analysis or overthinking in reasoning models to clarify novelty.
  2. [§3] Notation: Define all efficiency metric components (e.g., exact formulas for outcome and process scores) in a single dedicated subsection with consistent symbols to improve readability.
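
The ablation requested in major comment 2 has an obvious control arm: swap the proposed metrics for a bare length penalty in the training reward. A minimal sketch of that baseline, with every name and constant an assumed stand-in rather than the paper's objective.

    def length_penalty_reward(correct: bool, tokens: int,
                              max_tokens: int = 4096,
                              alpha: float = 0.5) -> float:
        """Baseline reward: full credit for a correct answer, linearly
        discounted by response length, zero otherwise. If metric-guided
        self-training only matches this baseline, the metrics add
        nothing; if it beats it on hard sets, they do real work."""
        if not correct:
            return 0.0
        return 1.0 - alpha * min(tokens / max_tokens, 1.0)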

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below and commit to revisions that strengthen the presentation of the efficiency metrics and experimental results.

read point-by-point responses
  1. Referee: [§3] §3 (Efficiency Metrics): The outcome-based metric appears defined primarily via token count and final correctness without explicit conditioning on problem difficulty or a control for cases where longer chains are required (e.g., multi-step proofs in AIME/GPQA); this risks systematically penalizing beneficial reasoning and undermines the claim that self-training preserves performance for general reasons rather than test-set artifacts.

    Authors: We appreciate this observation. The outcome-based metric is intentionally kept general, relying on token count and correctness to flag overthinking on problems where additional computation yields little benefit. Our evaluation already spans benchmarks of varying difficulty (GSM8K for easy problems and AIME/GPQA for those requiring extended multi-step reasoning), and accuracy is preserved after self-training on the harder sets, which suggests the approach does not indiscriminately penalize necessary exploration. To directly address the concern, we will revise §3 to add an explicit discussion of difficulty conditioning and include a stratified analysis by problem difficulty. revision: partial

  2. Referee: [§4.2] §4.2 (Self-Training Experiments): Results report preserved performance after mitigation, yet no ablation isolates the contribution of the proposed metrics versus simpler length penalties, and no statistical significance tests or variance across runs are provided; without these, it is unclear whether the efficiency gains are robust or dependent on the chosen difficulty distribution.

    Authors: We agree that these elements are needed to demonstrate robustness. In the revised manuscript we will add ablations that isolate the contribution of our proposed metrics against simpler length-penalty baselines, and we will report results with variance across multiple runs together with statistical significance tests. This will clarify that the observed efficiency gains hold across the difficulty distribution of the evaluated benchmarks. revision: yes

  3. Referee: [§4.1] §4.1 (Benchmark Details): The process-based metric is claimed to identify overthinking, but the manuscript does not report correlation with human judgments or oracle cases where extended exploration is provably necessary; this leaves the metric's validity as an open load-bearing assumption for the self-training objective.

    Authors: Thank you for raising this point. The process-based metric identifies overthinking via detection of redundant or inefficient steps within the generated chain. Although such validation was not included in the original submission, we will add in the revision a correlation analysis between the metric and human judgments on a sampled subset of reasoning traces, along with discussion of oracle cases from AIME and GPQA where extended exploration is known to be necessary. This will provide direct support for the metric's validity in guiding the self-training objective. revision: yes
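
The validation promised in response 3 reduces to a small statistical check once traces are annotated. A sketch, assuming paired per-trace lists of metric scores and human redundancy ratings; both inputs are placeholders.

    from scipy.stats import spearmanr

    def validate_process_metric(metric_scores, human_redundancy):
        """Rank correlation between the process-based efficiency metric
        and human judgments of redundancy on the same reasoning traces.
        A strong positive rho with a small p-value would support the
        metric; a near-zero rho would confirm the referee's concern
        that it is load-bearing but unvalidated."""
        rho, p_value = spearmanr(metric_scores, human_redundancy)
        return rho, p_value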

Circularity Check

0 steps flagged

No significant circularity; the claims rest on empirical results against external benchmarks.

full rationale

The paper defines novel outcome- and process-based efficiency metrics, applies self-training to shorten reasoning chains, and reports performance preservation on held-out benchmarks (GSM8K, MATH500, GPQA, AIME). No load-bearing derivation step reduces by construction to a fitted parameter, self-definition, or self-citation chain; the central claims remain falsifiable against the external test sets.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides insufficient detail to enumerate specific free parameters, axioms, or invented entities; the work appears to rest on standard LLM evaluation practices and self-training assumptions common in the field.

pith-pipeline@v0.9.0 · 5510 in / 1009 out tokens · 45153 ms · 2026-05-13T15:48:19.491158+00:00 · methodology


Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. The Gordian Knot for VLMs: Diagrammatic Knot Reasoning as a Hard Benchmark

    cs.AI 2026-05 unverdicted novelty 7.0

    KnotBench benchmark shows state-of-the-art VLMs perform near random on diagrammatic knot reasoning tasks and lack ability to simulate structural moves.

  2. LEAD: Length-Efficient Adaptive and Dynamic Reasoning for Large Language Models

    cs.LG 2026-05 unverdicted novelty 7.0

    LEAD uses online adaptive mechanisms including Potential-Scaled Instability and symmetric efficiency rewards based on correct rollouts to achieve higher accuracy-efficiency scores with substantially shorter reasoning ...

  3. Post Reasoning: Improving the Performance of Non-Thinking Models at No Cost

    cs.AI 2026-05 conditional novelty 7.0

    Post-Reasoning boosts LLM accuracy by reversing the usual answer-after-reasoning order, delivering mean relative gains of 17.37% across 117 model-benchmark pairs with zero extra cost.

  4. Nice Fold or Hero Call: Learning Budget-Efficient Thinking for Adaptive Reasoning

    cs.AI 2026-05 unverdicted novelty 6.0

    BET reduces reasoning tokens by about 55% on average while improving performance across benchmarks by learning to short-solve easy queries, fold early on unsolvable ones, and preserve budget for hard solvable queries.

  5. Breaking the Reward Barrier: Accelerating Tree-of-Thought Reasoning via Speculative Exploration

    cs.LG 2026-05 unverdicted novelty 6.0

    SPEX accelerates Tree-of-Thought LLM reasoning 1.2-3x via speculative path selection, dynamic budget allocation across queries, and adaptive early termination, with up to 4.1x when combined with token speculative decoding.

  6. Efficient LLM Reasoning via Variational Posterior Guidance with Efficiency Awareness

    cs.LG 2026-05 unverdicted novelty 6.0

    VPG-EA applies variational posterior guidance and efficiency-aware distillation to compress LLM reasoning chains while preserving performance.

  7. Hint Tuning: Less Data Makes Better Reasoners

    cs.CL 2026-05 unverdicted novelty 6.0

    Hint Tuning uses an instruct model as a difficulty probe to create 1K multi-level hint examples that train reasoning models to calibrate chain-of-thought length, cutting tokens by 31.5% on average across 4B-32B models...

  8. Implicit Compression Regularization: Concise Reasoning via Internal Shorter Distributions in RL Post-Training

    cs.AI 2026-05 unverdicted novelty 6.0

    ICR creates a virtual shorter distribution from shortest correct on-policy responses to regularize RL post-training toward concise yet accurate reasoning, improving the accuracy-length Pareto frontier on math and know...

  9. When LLMs Stop Following Steps: A Diagnostic Study of Procedural Execution in Language Models

    cs.CL 2026-05 unverdicted novelty 6.0

    LLM accuracy on controlled procedural arithmetic drops from 61% at 5 steps to 20% at 95 steps, with failures including skipped steps, premature answers, and hallucinated operations.

  10. From $P(y|x)$ to $P(y)$: Investigating Reinforcement Learning in Pre-train Space

    cs.LG 2026-04 unverdicted novelty 6.0

    PreRL applies reward-driven updates to P(y) in pre-train space, uses Negative Sample Reinforcement to prune bad reasoning paths and boost reflection, and combines with standard RL in Dual Space RL to outperform baseli...

  11. HiRO-Nav: Hybrid ReasOning Enables Efficient Embodied Navigation

    cs.AI 2026-04 unverdicted novelty 6.0

    HiRO-Nav adaptively triggers reasoning only on high-entropy actions via a hybrid training pipeline and shows better success-token trade-offs than always-reason or never-reason baselines on the CHORES-S benchmark.

  12. Muon is Scalable for LLM Training

    cs.LG 2025-02 unverdicted novelty 6.0

    Muon optimizer with weight decay and update scaling achieves ~2x efficiency over AdamW for large LLMs, shown via the Moonlight 3B/16B MoE model trained on 5.7T tokens.

  13. Reasoning Compression with Mixed-Policy Distillation

    cs.AI 2026-05 unverdicted novelty 5.0

    Mixed-Policy Distillation transfers concise reasoning behavior from larger to smaller LLMs by having the teacher compress student-generated trajectories, cutting token usage up to 27% while raising benchmark scores.

  14. How Well Do LLMs Perform on the Simplest Long-Chain Reasoning Tasks: An Empirical Study on the Equivalence Class Problem

    cs.AI 2026-05 unverdicted novelty 5.0

    Non-reasoning LLMs fail the equivalence class problem while reasoning LLMs perform better but remain incomplete, with difficulty peaking at phase transition for the former and maximum diameter for the latter.

  15. Decodable but Not Corrected by Fixed Residual-Stream Linear Steering: Evidence from Medical LLM Failure Regimes

    cs.AI 2026-05 unverdicted novelty 5.0

    Overthinking in medical QA is linearly decodable at 71.6% accuracy yet fixed residual-stream steering yields no correction across 29 configurations, while enabling selective abstention with AUROC 0.610.

  16. Adapt to Thrive! Adaptive Power-Mean Policy Optimization for Improved LLM Reasoning

    cs.CL 2026-04 unverdicted novelty 5.0

    APMPO boosts average Pass@1 scores on math reasoning benchmarks by 3 points over GRPO by using an adaptive power-mean policy objective and feedback-driven clipping bounds in RLVR training.

  17. Free Energy-Driven Reinforcement Learning with Adaptive Advantage Shaping for Unsupervised Reasoning in LLMs

    cs.CL 2026-04 unverdicted novelty 5.0

    FREIA applies free energy principles and adaptive advantage shaping to unsupervised RL, outperforming baselines by 0.5-3.5 Pass@1 points on math reasoning with a 1.5B model.

  18. SHAPE: Stage-aware Hierarchical Advantage via Potential Estimation for LLM Reasoning

    cs.LG 2026-04 unverdicted novelty 5.0

    SHAPE improves average math reasoning accuracy by 3% while cutting token use by 30% through stage-aware hierarchical advantage and entropy-driven token redistribution.

  19. Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models

    cs.CL 2025-03 accept novelty 5.0

    A survey organizing techniques to achieve efficient reasoning in LLMs by shortening chain-of-thought outputs.

  20. Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models

    cs.AI 2025-03 unverdicted novelty 5.0

    The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.

  21. From System 1 to System 2: A Survey of Reasoning Large Language Models

    cs.AI 2025-02 accept novelty 3.0

    The survey organizes the shift of LLMs toward deliberate System 2 reasoning, covering model construction techniques, performance on math and coding benchmarks, and future research directions.

  22. RetroInfer: A Vector Storage Engine for Scalable Long-Context LLM Inference

    cs.LG 2025-05

Reference graph

Works this paper leans on

274 extracted references · 274 canonical work pages · cited by 22 Pith papers · 18 internal anchors
