Do LLMs Overthink Basic Math Reasoning? Benchmarking the Accuracy-Efficiency Tradeoff in Language Models
Pith reviewed 2026-05-19 06:19 UTC · model grok-4.3
The pith
Reasoning LLMs generate about 18 times more tokens than standard models on basic math tasks while sometimes achieving lower accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Reasoning models generate approximately 18 times more tokens than standard models on basic math tasks, yet the extra output sometimes comes with lower accuracy, and accuracy collapses by up to roughly 36 percent when token budgets are constrained. The accuracy-verbosity relationship is non-monotonic: moving from low to medium to high reasoning effort produces no accuracy gain in GPT-5 and o-series models.
What carries the argument
The Overthinking Score, a harmonic-mean metric of accuracy and token efficiency, applied through an evaluation protocol on dynamically generated questions from 14 basic math tasks.
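As a rough illustration of how a harmonic-mean metric of this kind behaves, the sketch below combines accuracy with a token-efficiency term. It does not reproduce the paper's formula, which this page does not state. The efficiency normalization (`1 - mean_tokens / max_tokens`, clipped to [0, 1]) and the names `token_efficiency`, `overthinking_score`, and `max_tokens` are assumptions made for this example.

```python
def token_efficiency(mean_tokens: float, max_tokens: float) -> float:
    """Map a mean token count to [0, 1]; fewer tokens gives a higher value.

    Assumed normalization for illustration, not the paper's definition.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    return min(1.0, max(0.0, 1.0 - mean_tokens / max_tokens))


def overthinking_score(accuracy: float, mean_tokens: float, max_tokens: float) -> float:
    """Harmonic mean of accuracy and the assumed token efficiency, both in [0, 1]."""
    eff = token_efficiency(mean_tokens, max_tokens)
    if accuracy + eff == 0:
        return 0.0  # avoid division by zero when both terms are zero
    return 2 * accuracy * eff / (accuracy + eff)


# Hypothetical numbers: a verbose model and a concise one with similar accuracy.
verbose = overthinking_score(accuracy=0.90, mean_tokens=3600, max_tokens=4000)
concise = overthinking_score(accuracy=0.88, mean_tokens=200, max_tokens=4000)
```

Because the harmonic mean is dominated by the smaller of its two inputs, the verbose model scores far lower here despite its slightly higher accuracy.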
If this is right
- Performance on complex math benchmarks does not translate to strong results on basic math reasoning.
- Reasoning models experience catastrophic accuracy drops of up to 36 percent when token budgets are constrained.
- Extended reasoning budgets produce diminishing returns on accuracy across many models.
- Advanced reasoning models such as GPT-5 and o-series show zero accuracy gain when increasing from low to high reasoning effort.
Where Pith is reading between the lines
- Training approaches that explicitly reward concise yet correct reasoning chains could reduce unnecessary token use.
- The observed tradeoff may matter for cost-sensitive applications that run many basic math queries.
- Similar overthinking patterns could appear in non-math reasoning tasks that reward step-by-step output.
Load-bearing premise
Dynamically generated questions across the 14 basic math tasks give an unbiased and representative measure of fundamental math reasoning without favoring particular model families or training styles.
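To make the premise concrete, here is a minimal sketch of what seeded, template-based question generation could look like for a single addition task. It is not LLMThinkBench's generator: the template wording, the operand range, and the function name `generate_addition_questions` are hypothetical, chosen only to show where surface-form and difficulty choices enter.

```python
import random


def generate_addition_questions(n, low=1, high=999, seed=0):
    """Return n (prompt, answer) pairs with operands drawn uniformly from [low, high].

    Hypothetical template and ranges; a fixed seed makes the set reproducible.
    """
    rng = random.Random(seed)
    questions = []
    for _ in range(n):
        a, b = rng.randint(low, high), rng.randint(low, high)
        prompt = f"What is {a} + {b}? Give only the final number."
        questions.append((prompt, a + b))
    return questions


sample = generate_addition_questions(3, seed=42)
```

The template string and the operand interval are exactly the kind of choices that could favor one model family over another, which is why their documentation matters for the premise.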
What would settle it
A follow-up experiment that replaces the dynamic questions with a fixed set of basic math problems and finds monotonic accuracy gains with longer reasoning chains would undermine the non-monotonic tradeoff claim.
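The test described above comes down to checking whether accuracy rises at every step of the reasoning budget. A small sketch, using hypothetical accuracy values and a hypothetical tolerance `tol` to absorb noise:

```python
def is_monotonic_gain(accuracies, tol=0.0):
    """True if each successive budget level improves accuracy by more than tol."""
    return all(b - a > tol for a, b in zip(accuracies, accuracies[1:]))


# Hypothetical accuracies at low, medium, and high reasoning effort.
flat = [0.91, 0.91, 0.91]
rising = [0.80, 0.86, 0.92]
```

Under this check, `flat` would match the zero-gain pattern reported for GPT-5 and o-series models, while `rising` would count as evidence against the non-monotonic claim.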
Original abstract
Large language models (LLMs) achieve impressive performance on complex mathematical benchmarks yet sometimes fail on basic math reasoning while generating unnecessarily verbose responses. In this paper, we present LLMThinkBench, a systematic benchmark and comprehensive empirical study to evaluate the efficiency of reasoning in LLMs, focusing on the fundamental tradeoff between accuracy and overthinking. First, we formalize the accuracy-verbosity tradeoff. Second, we introduce the Overthinking Score, a harmonic-mean metric combining accuracy and token-efficiency for holistic model evaluation. Third, we establish an evaluation protocol with dynamically-generated data across 14 basic math tasks. Fourth, we conduct a large-scale empirical study evaluating 53 LLMs, including reasoning and quantized variants across different reasoning budgets. Fifth, we release LLMThinkBench as an open-source Python package and public leaderboard for reproducibility. Our findings reveal: 1) model performance on complex benchmarks does not translate directly to basic math reasoning; 2) reasoning models generate ~18x more tokens while sometimes achieving lower accuracy and exhibit catastrophic collapse when tokens are constrained, dropping by up to ~36%; 3) the accuracy-verbosity relationship is non-monotonic with extended reasoning budgets yielding diminishing returns (GPT-5/o-series models show zero accuracy gain from low -> medium -> high reasoning effort). Our findings challenge the assumption that longer reasoning in LLMs necessarily improves mathematical reasoning. Our public leaderboard is available at https://ctrl-gaurav.github.io/LLMThinkBench/. Our open-source Python package is available at https://pypi.org/project/llmthinkbench/, and the codebase can be found at https://github.com/ctrl-gaurav/LLMThinkBench for easy and reproducible evaluation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces LLMThinkBench to study the accuracy-efficiency tradeoff in LLMs on basic math reasoning. It formalizes the tradeoff, proposes the Overthinking Score (harmonic mean of accuracy and token efficiency), describes an evaluation protocol using dynamically generated questions across 14 tasks, reports results from 53 models (including reasoning and quantized variants) under varying reasoning budgets, and releases an open-source package plus public leaderboard. Main empirical findings are that reasoning models produce ~18x more tokens (sometimes with lower accuracy), suffer up to ~36% accuracy collapse under token constraints, and exhibit non-monotonic accuracy-verbosity curves with diminishing or zero returns from extended reasoning budgets (e.g., GPT-5/o-series models).
Significance. If the benchmark construction is shown to be unbiased, the large-scale empirical comparison across 53 models supplies concrete evidence that longer reasoning traces do not reliably improve basic math performance and can even degrade it. The open-source release, public leaderboard, and parameter-free Overthinking Score are clear strengths that support reproducibility and future work on efficiency-aware evaluation.
major comments (2)
- [§3] §3 (Evaluation Protocol): the description of dynamic question generation for the 14 tasks omits concrete details on template selection, difficulty sampling distribution, prompt style cues, and exclusion criteria. This is load-bearing for the headline claims because any systematic alignment between generated surface forms and the pre-training distribution of concise base models (versus verbose reasoning models) could artifactually inflate the reported 18× token gap and the 36% collapse numbers.
- [§5] §5 (Results): the reported non-monotonic accuracy-verbosity relationship and zero-gain observation for GPT-5/o-series models are presented as aggregates without per-model confidence intervals, multiple-testing correction, or explicit controls for question difficulty variance. This weakens the claim that extended reasoning budgets yield diminishing returns, as the pattern could partly reflect variance in the generated test set rather than model behavior.
minor comments (2)
- [§2] The exact formula for the Overthinking Score (harmonic mean) should be stated with any scaling constants or normalization choices; the current prose description leaves the precise definition ambiguous for replication.
- [Figures 3-5] Figure captions and axis labels for the token-vs-accuracy plots should explicitly note the reasoning budget levels (low/medium/high) and the token constraint thresholds used in the collapse experiments.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback, which highlights opportunities to strengthen the reproducibility and statistical presentation of our work. We address each major comment below and commit to revisions that enhance clarity without altering the core empirical findings.
- Referee: [§3] §3 (Evaluation Protocol): the description of dynamic question generation for the 14 tasks omits concrete details on template selection, difficulty sampling distribution, prompt style cues, and exclusion criteria. This is load-bearing for the headline claims because any systematic alignment between generated surface forms and the pre-training distribution of concise base models (versus verbose reasoning models) could artifactually inflate the reported 18× token gap and the 36% collapse numbers.
  Authors: We agree that additional concrete details on question generation will improve transparency and reproducibility. In the revised manuscript we will expand §3 with: (i) the full set of templates for each of the 14 tasks, (ii) the exact sampling distributions used for difficulty parameters (e.g., operand ranges drawn uniformly from fixed intervals), (iii) prompt formatting conventions, and (iv) any post-generation exclusion rules. We will also add a short discussion arguing that the generation procedure is model-agnostic and therefore unlikely to systematically favor concise base models over reasoning models. The complete generation logic is already public in the released package; we will cite it explicitly. These clarifications address the concern without changing any reported numbers.
  Revision: yes
- Referee: [§5] §5 (Results): the reported non-monotonic accuracy-verbosity relationship and zero-gain observation for GPT-5/o-series models are presented as aggregates without per-model confidence intervals, multiple-testing correction, or explicit controls for question difficulty variance. This weakens the claim that extended reasoning budgets yield diminishing returns, as the pattern could partly reflect variance in the generated test set rather than model behavior.
  Authors: We accept that statistical safeguards will make the claims more robust. In the revision we will: (a) report per-model 95% bootstrap confidence intervals for accuracy at each reasoning budget, (b) stratify results by difficulty bins to control for question variance, and (c) add a brief note on the exploratory nature of the analysis (no formal hypothesis tests across dozens of comparisons were performed, so multiple-testing correction is not strictly required but will be discussed). The observed non-monotonic patterns remain consistent across all 14 tasks and multiple model families, which we will emphasize as supporting evidence that the trends are not artifacts of test-set variance alone. Updated figures and text will appear in §5.
  Revision: yes
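The per-model bootstrap intervals promised in the response could be computed along the following lines. This is a generic percentile bootstrap over per-question correctness, not the authors' code; the function name `bootstrap_accuracy_ci`, the resample count, and the example outcomes are assumptions.

```python
import random


def bootstrap_accuracy_ci(correct, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for accuracy from 0/1 outcomes."""
    if not correct:
        raise ValueError("need at least one outcome")
    rng = random.Random(seed)
    n = len(correct)
    # Resample the questions with replacement and record each resample's accuracy.
    means = sorted(
        sum(rng.choices(correct, k=n)) / n for _ in range(n_resamples)
    )
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


# Hypothetical outcomes: 85 correct answers out of 100 questions.
outcomes = [1] * 85 + [0] * 15
lower, upper = bootstrap_accuracy_ci(outcomes)
```

Comparing such intervals across low, medium, and high reasoning budgets would show whether a zero-gain observation is distinguishable from sampling noise in the generated test set.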
Circularity Check
No circularity: direct empirical measurements on new benchmark
The paper is an empirical benchmarking study that introduces LLMThinkBench with dynamically generated questions across 14 tasks and reports observed accuracy and token counts from 53 models. The Overthinking Score is explicitly defined as a harmonic mean metric rather than derived from data. No equations, predictions, or central claims reduce by construction to fitted inputs or self-citations. The derivation chain consists of measurement and comparison, which is self-contained against the generated test set.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Token count is a valid proxy for reasoning verbosity and overthinking.
- domain assumption: Dynamically generated questions across 14 tasks constitute an unbiased test of basic math reasoning.
invented entities (1)
- Overthinking Score: no independent evidence
Forward citations
Cited by 1 Pith paper
- Post Reasoning: Improving the Performance of Non-Thinking Models at No Cost
  Post-Reasoning boosts LLM accuracy by reversing the usual answer-after-reasoning order, delivering mean relative gains of 17.37% across 117 model-benchmark pairs with zero extra cost.