Recognition: 2 theorem links · Lean theorem
Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs
Pith reviewed 2026-05-13 15:48 UTC · model grok-4.3
The pith
o1-like LLMs overthink simple problems, extending chain-of-thought far beyond what is needed; self-training guided by new efficiency metrics can trim this waste without reducing accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
o1-like models exhibit overthinking by allocating unnecessary computational resources to simple problems through extended chain-of-thought processes; novel efficiency metrics from outcome and process perspectives identify this inefficiency, and a self-training paradigm mitigates it by streamlining reasoning steps without loss of accuracy on benchmarks including GSM8K, MATH500, GPQA, and AIME.
What carries the argument
Self-training paradigm guided by outcome-based and process-based efficiency metrics that detect when additional reasoning steps add minimal value to the final result.
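The outcome-side idea described above can be made concrete with a small sketch. This is illustrative only, not the paper's exact formula: it treats a long chain-of-thought as a sequence of solution rounds and measures what fraction of the total tokens were spent before the first correct answer; everything after that point is counted as overthinking. The `rounds` representation and the function name are assumptions.

```python
# Illustrative outcome-based efficiency metric (an assumption, not the
# paper's exact definition): tokens spent up to the first correct
# solution round, divided by total tokens in the chain-of-thought.
# `rounds` is a list of (num_tokens, is_correct) pairs, one per
# solution attempt within a single long chain.

def outcome_efficiency(rounds):
    """Return tokens-to-first-correct / total tokens, or 0.0 if never correct."""
    total = sum(tokens for tokens, _ in rounds)
    if total == 0:
        return 0.0
    used = 0
    for tokens, correct in rounds:
        used += tokens
        if correct:
            # Tokens generated after this point add no outcome value.
            return used / total
    return 0.0  # the chain never reached a correct answer

# A chain that gets 2+3 right in its first 40 tokens but keeps
# re-deriving the answer for 360 more tokens is highly inefficient:
print(outcome_efficiency([(40, True), (200, True), (160, True)]))  # → 0.1
```

Under this reading, a score near 1.0 means the model stopped roughly when it had earned the answer, and a low score flags a chain worth trimming during self-training.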
If this is right
- Computational overhead decreases on simple problems in GSM8K and MATH500 while accuracy stays intact.
- Reasoning chains shorten on easy instances without harming results on harder sets such as GPQA and AIME.
- Models learn to allocate fewer tokens when further steps yield little outcome improvement.
- Overall inference efficiency improves across test sets of mixed difficulty.
Where Pith is reading between the lines
- The same metrics could be applied at training time to produce models that inherently avoid overthinking from the start.
- Dynamic early-stopping rules based on these metrics might generalize to non-math reasoning tasks such as coding or science question answering.
- Resource-constrained deployments could use the trimmed models to handle high query volumes at lower cost.
- Hybrid inference systems might route easy problems to short-chain versions and hard ones to full long-chain versions.
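The hybrid-routing idea in the last bullet can be sketched as follows. Everything here is a hypothetical illustration: the models are stubs, and the `confidence` estimator stands in for whatever cheap difficulty signal a deployment actually has; none of these names come from the paper.

```python
# Hypothetical easy/hard router: try the short-chain model first and
# escalate to the full long-chain model only when a cheap confidence
# signal is low. All names and signatures are illustrative assumptions.

def route(query, short_model, long_model, confidence, threshold=0.8):
    """Answer with the cheap model when confident, else escalate."""
    answer = short_model(query)
    if confidence(query, answer) >= threshold:
        return answer, "short"
    return long_model(query), "long"

# Stubs standing in for real models and a real confidence estimator.
short = lambda q: "5"
long_ = lambda q: "5 (after extended reasoning)"
conf = lambda q, a: 0.95 if len(q) < 20 else 0.3  # toy difficulty proxy

print(route("2+3=?", short, long_, conf))                      # short-chain path
print(route("AIME 2024 Problem 12 ...", short, long_, conf))   # long-chain path
```

The interesting design question is where the confidence signal comes from; the paper's efficiency metrics are one candidate, but a trained difficulty classifier or the short model's own logprobs would slot into the same interface.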
Load-bearing premise
The proposed efficiency metrics accurately flag wasteful overthinking rather than missing cases where longer reasoning is genuinely required for correct answers.
What would settle it
Apply the self-training strategies to a set of easy problems where the metrics predict overthinking; if accuracy falls below the original model's level while token counts remain high, the metrics have misclassified necessary reasoning.
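The settling test above reduces to a simple comparison harness, sketched here under assumptions: `orig_results` and `trimmed_results` are aggregate evaluation summaries (their shape is invented for illustration), and the thresholds are arbitrary placeholders a real study would justify.

```python
# Sketch of the falsification check: on problems the metrics flag as
# overthought, did the self-trained ("trimmed") model lose accuracy
# while token counts stayed high? The dict shape and thresholds are
# illustrative assumptions, not from the paper.

def metrics_misclassified(orig_results, trimmed_results,
                          acc_tol=0.0, token_drop_required=0.2):
    """True iff accuracy fell beyond tolerance AND tokens barely shrank."""
    acc_drop = orig_results["accuracy"] - trimmed_results["accuracy"]
    token_ratio = trimmed_results["mean_tokens"] / orig_results["mean_tokens"]
    accuracy_fell = acc_drop > acc_tol
    tokens_still_high = token_ratio > (1 - token_drop_required)
    return accuracy_fell and tokens_still_high

orig = {"accuracy": 0.92, "mean_tokens": 900}
trimmed = {"accuracy": 0.84, "mean_tokens": 820}
print(metrics_misclassified(orig, trimmed))  # → True: the worst-case outcome
```

Either failure alone is ambiguous (an accuracy drop with large token savings is a trade-off, not a misclassification); only the conjunction indicates the metrics trimmed reasoning that was genuinely needed.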
Original abstract
The remarkable performance of models like the OpenAI o1 can be attributed to their ability to emulate human-like long-time thinking during inference. These models employ extended chain-of-thought (CoT) processes, exploring multiple strategies to enhance problem-solving capabilities. However, a critical question remains: How to intelligently and efficiently scale computational resources during testing. This paper presents the first comprehensive study on the prevalent issue of overthinking in these models, where excessive computational resources are allocated for simple problems with minimal benefit. We introduce novel efficiency metrics from both outcome and process perspectives to evaluate the rational use of computational resources by o1-like models. Using a self-training paradigm, we propose strategies to mitigate overthinking, streamlining reasoning processes without compromising accuracy. Experimental results show that our approach successfully reduces computational overhead while preserving model performance across a range of testsets with varying difficulty levels, such as GSM8K, MATH500, GPQA, and AIME.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that o1-like LLMs exhibit overthinking by allocating excessive compute to simple problems, introduces novel outcome- and process-based efficiency metrics to quantify rational resource use, and applies a self-training paradigm to shorten reasoning chains. Experiments reportedly show reduced computational overhead with preserved accuracy on GSM8K, MATH500, GPQA, and AIME.
Significance. If the efficiency metrics are shown to correctly separate overthinking from necessary exploration, the work could meaningfully advance efficient inference for long-CoT models by providing a practical self-training recipe that lowers token usage without accuracy loss. The multi-benchmark evaluation across difficulty levels is a positive feature, but the absence of explicit validation against ground-truth cases requiring extended reasoning limits the strength of the central claim.
major comments (3)
- [§3] §3 (Efficiency Metrics): The outcome-based metric appears defined primarily via token count and final correctness without explicit conditioning on problem difficulty or a control for cases where longer chains are required (e.g., multi-step proofs in AIME/GPQA); this risks systematically penalizing beneficial reasoning and undermines the claim that self-training preserves performance for general reasons rather than test-set artifacts.
- [§4.2] §4.2 (Self-Training Experiments): Results report preserved performance after mitigation, yet no ablation isolates the contribution of the proposed metrics versus simpler length penalties, and no statistical significance tests or variance across runs are provided; without these, it is unclear whether the efficiency gains are robust or dependent on the chosen difficulty distribution.
- [§4.1] §4.1 (Benchmark Details): The process-based metric is claimed to identify overthinking, but the manuscript does not report correlation with human judgments or oracle cases where extended exploration is provably necessary; this leaves the metric's validity as an open load-bearing assumption for the self-training objective.
minor comments (2)
- [Abstract] Abstract and §1: The claim of being the 'first comprehensive study' would benefit from explicit citations to prior work on CoT length analysis or overthinking in reasoning models to clarify novelty.
- [§3] Notation: Define all efficiency metric components (e.g., exact formulas for outcome and process scores) in a single dedicated subsection with consistent symbols to improve readability.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below and commit to revisions that strengthen the presentation of the efficiency metrics and experimental results.
Point-by-point responses
-
Referee: [§3] §3 (Efficiency Metrics): The outcome-based metric appears defined primarily via token count and final correctness without explicit conditioning on problem difficulty or a control for cases where longer chains are required (e.g., multi-step proofs in AIME/GPQA); this risks systematically penalizing beneficial reasoning and undermines the claim that self-training preserves performance for general reasons rather than test-set artifacts.
Authors: We appreciate this observation. The outcome-based metric is intentionally kept general, relying on token count and correctness to flag overthinking on problems where additional computation yields little benefit. Our evaluation already spans benchmarks of varying difficulty (GSM8K for easy problems and AIME/GPQA for those requiring extended multi-step reasoning), and accuracy is preserved after self-training on the harder sets, which suggests the approach does not indiscriminately penalize necessary exploration. To directly address the concern, we will revise §3 to add an explicit discussion of difficulty conditioning and include a stratified analysis by problem difficulty. revision: partial
-
Referee: [§4.2] §4.2 (Self-Training Experiments): Results report preserved performance after mitigation, yet no ablation isolates the contribution of the proposed metrics versus simpler length penalties, and no statistical significance tests or variance across runs are provided; without these, it is unclear whether the efficiency gains are robust or dependent on the chosen difficulty distribution.
Authors: We agree that these elements are needed to demonstrate robustness. In the revised manuscript we will add ablations that isolate the contribution of our proposed metrics against simpler length-penalty baselines, and we will report results with variance across multiple runs together with statistical significance tests. This will clarify that the observed efficiency gains hold across the difficulty distribution of the evaluated benchmarks. revision: yes
-
Referee: [§4.1] §4.1 (Benchmark Details): The process-based metric is claimed to identify overthinking, but the manuscript does not report correlation with human judgments or oracle cases where extended exploration is provably necessary; this leaves the metric's validity as an open load-bearing assumption for the self-training objective.
Authors: Thank you for raising this point. The process-based metric identifies overthinking via detection of redundant or inefficient steps within the generated chain. Although such validation was not included in the original submission, we will add in the revision a correlation analysis between the metric and human judgments on a sampled subset of reasoning traces, along with discussion of oracle cases from AIME and GPQA where extended exploration is known to be necessary. This will provide direct support for the metric's validity in guiding the self-training objective. revision: yes
Circularity Check
No significant circularity; the claims rest on empirical results against external benchmarks.
Full rationale
The paper defines novel outcome- and process-based efficiency metrics, applies self-training to shorten reasoning chains, and reports performance preservation on held-out benchmarks (GSM8K, MATH500, GPQA, AIME). No load-bearing derivation step reduces by construction to a fitted parameter, self-definition, or self-citation chain; the central claims remain falsifiable against the external test sets.
Forward citations
Cited by 22 Pith papers
-
The Gordian Knot for VLMs: Diagrammatic Knot Reasoning as a Hard Benchmark
KnotBench benchmark shows state-of-the-art VLMs perform near random on diagrammatic knot reasoning tasks and lack ability to simulate structural moves.
-
LEAD: Length-Efficient Adaptive and Dynamic Reasoning for Large Language Models
LEAD uses online adaptive mechanisms including Potential-Scaled Instability and symmetric efficiency rewards based on correct rollouts to achieve higher accuracy-efficiency scores with substantially shorter reasoning ...
-
Post Reasoning: Improving the Performance of Non-Thinking Models at No Cost
Post-Reasoning boosts LLM accuracy by reversing the usual answer-after-reasoning order, delivering mean relative gains of 17.37% across 117 model-benchmark pairs with zero extra cost.
-
Nice Fold or Hero Call: Learning Budget-Efficient Thinking for Adaptive Reasoning
BET reduces reasoning tokens by about 55% on average while improving performance across benchmarks by learning to short-solve easy queries, fold early on unsolvable ones, and preserve budget for hard solvable queries.
-
Breaking the Reward Barrier: Accelerating Tree-of-Thought Reasoning via Speculative Exploration
SPEX accelerates Tree-of-Thought LLM reasoning 1.2-3x via speculative path selection, dynamic budget allocation across queries, and adaptive early termination, with up to 4.1x when combined with token speculative decoding.
-
Efficient LLM Reasoning via Variational Posterior Guidance with Efficiency Awareness
VPG-EA applies variational posterior guidance and efficiency-aware distillation to compress LLM reasoning chains while preserving performance.
-
Hint Tuning: Less Data Makes Better Reasoners
Hint Tuning uses an instruct model as a difficulty probe to create 1K multi-level hint examples that train reasoning models to calibrate chain-of-thought length, cutting tokens by 31.5% on average across 4B-32B models...
-
Implicit Compression Regularization: Concise Reasoning via Internal Shorter Distributions in RL Post-Training
ICR creates a virtual shorter distribution from shortest correct on-policy responses to regularize RL post-training toward concise yet accurate reasoning, improving the accuracy-length Pareto frontier on math and know...
-
When LLMs Stop Following Steps: A Diagnostic Study of Procedural Execution in Language Models
LLM accuracy on controlled procedural arithmetic drops from 61% at 5 steps to 20% at 95 steps, with failures including skipped steps, premature answers, and hallucinated operations.
-
From $P(y|x)$ to $P(y)$: Investigating Reinforcement Learning in Pre-train Space
PreRL applies reward-driven updates to P(y) in pre-train space, uses Negative Sample Reinforcement to prune bad reasoning paths and boost reflection, and combines with standard RL in Dual Space RL to outperform baseli...
-
HiRO-Nav: Hybrid ReasOning Enables Efficient Embodied Navigation
HiRO-Nav adaptively triggers reasoning only on high-entropy actions via a hybrid training pipeline and shows better success-token trade-offs than always-reason or never-reason baselines on the CHORES-S benchmark.
-
Muon is Scalable for LLM Training
Muon optimizer with weight decay and update scaling achieves ~2x efficiency over AdamW for large LLMs, shown via the Moonlight 3B/16B MoE model trained on 5.7T tokens.
-
Reasoning Compression with Mixed-Policy Distillation
Mixed-Policy Distillation transfers concise reasoning behavior from larger to smaller LLMs by having the teacher compress student-generated trajectories, cutting token usage up to 27% while raising benchmark scores.
-
How Well Do LLMs Perform on the Simplest Long-Chain Reasoning Tasks: An Empirical Study on the Equivalence Class Problem
Non-reasoning LLMs fail the equivalence class problem while reasoning LLMs perform better but remain incomplete, with difficulty peaking at phase transition for the former and maximum diameter for the latter.
-
Decodable but Not Corrected by Fixed Residual-Stream Linear Steering: Evidence from Medical LLM Failure Regimes
Overthinking in medical QA is linearly decodable at 71.6% accuracy yet fixed residual-stream steering yields no correction across 29 configurations, while enabling selective abstention with AUROC 0.610.
-
Adapt to Thrive! Adaptive Power-Mean Policy Optimization for Improved LLM Reasoning
APMPO boosts average Pass@1 scores on math reasoning benchmarks by 3 points over GRPO by using an adaptive power-mean policy objective and feedback-driven clipping bounds in RLVR training.
-
Free Energy-Driven Reinforcement Learning with Adaptive Advantage Shaping for Unsupervised Reasoning in LLMs
FREIA applies free energy principles and adaptive advantage shaping to unsupervised RL, outperforming baselines by 0.5-3.5 Pass@1 points on math reasoning with a 1.5B model.
-
SHAPE: Stage-aware Hierarchical Advantage via Potential Estimation for LLM Reasoning
SHAPE improves average math reasoning accuracy by 3% while cutting token use by 30% through stage-aware hierarchical advantage and entropy-driven token redistribution.
-
Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models
A survey organizing techniques to achieve efficient reasoning in LLMs by shortening chain-of-thought outputs.
-
Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models
The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.
-
From System 1 to System 2: A Survey of Reasoning Large Language Models
The survey organizes the shift of LLMs toward deliberate System 2 reasoning, covering model construction techniques, performance on math and coding benchmarks, and future research directions.
- RetroInfer: A Vector Storage Engine for Scalable Long-Context LLM Inference