arxiv: 2507.04023 · v3 · submitted 2025-07-05 · 💻 cs.CL

Do LLMs Overthink Basic Math Reasoning? Benchmarking the Accuracy-Efficiency Tradeoff in Language Models

Pith reviewed 2026-05-19 06:19 UTC · model grok-4.3

classification 💻 cs.CL
keywords LLM reasoning · overthinking · accuracy-efficiency tradeoff · basic math reasoning · token efficiency · benchmark · verbosity · reasoning models

The pith

Reasoning LLMs generate about 18 times more tokens than standard models on basic math tasks while sometimes achieving lower accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper evaluates whether large language models overthink fundamental math problems by producing excessively long responses. It introduces LLMThinkBench, a benchmark that uses dynamically generated questions across 14 basic math tasks along with the Overthinking Score to measure the accuracy versus token-efficiency tradeoff. The evaluation of 53 models reveals that reasoning variants often use far more tokens without accuracy benefits and suffer major drops when token limits are imposed. The accuracy-verbosity pattern is non-monotonic, with additional reasoning effort frequently yielding diminishing or zero returns.

Core claim

Reasoning models generate approximately 18 times more tokens than standard models on basic math tasks, yet the extra output sometimes comes with lower accuracy, and accuracy collapses by up to 36 percent when responses are forced to be shorter. The accuracy-verbosity relationship is non-monotonic: moving from low to medium to high reasoning budgets produces no accuracy improvement in models such as GPT-5 and the o-series.

What carries the argument

The Overthinking Score, a harmonic-mean metric of accuracy and token efficiency, applied through an evaluation protocol on dynamically generated questions from 14 basic math tasks.
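As a rough illustration of how a harmonic mean of accuracy and token efficiency could be computed, consider the Python sketch below. The paper's exact formula is not reproduced on this page, so the efficiency normalization used here, `1 - min(avg_tokens, max_tokens) / max_tokens`, the function name, and the `max_tokens` default are all assumptions made for illustration only.

```python
def overthinking_score_sketch(accuracy: float, avg_tokens: float,
                              max_tokens: float = 4096.0) -> float:
    """Harmonic mean of accuracy and a hypothetical token-efficiency term.

    ASSUMPTION: efficiency = 1 - min(avg_tokens, max_tokens) / max_tokens.
    The paper may normalize token usage differently.
    """
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must lie in [0, 1]")
    if max_tokens <= 0 or avg_tokens < 0:
        raise ValueError("token counts must be non-negative and max_tokens positive")

    # Fewer tokens means higher efficiency; usage is capped at max_tokens.
    efficiency = 1.0 - min(avg_tokens, max_tokens) / max_tokens

    # The harmonic mean is zero whenever either component is zero.
    if accuracy == 0.0 or efficiency == 0.0:
        return 0.0
    return 2.0 * accuracy * efficiency / (accuracy + efficiency)
```

Under these assumptions, two models with the same accuracy are ranked by verbosity: the one that spends more tokens receives the lower score, and a model that exhausts the token budget scores zero regardless of accuracy.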

If this is right

  • Performance on complex math benchmarks does not translate to strong results on basic math reasoning.
  • Reasoning models experience catastrophic accuracy drops of up to 36 percent when token budgets are constrained.
  • Extended reasoning budgets produce diminishing returns on accuracy across many models.
  • Advanced reasoning models such as GPT-5 and o-series show zero accuracy gain when increasing from low to high reasoning effort.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training approaches that explicitly reward concise yet correct reasoning chains could reduce unnecessary token use.
  • The observed tradeoff may matter for cost-sensitive applications that run many basic math queries.
  • Similar overthinking patterns could appear in non-math reasoning tasks that reward step-by-step output.

Load-bearing premise

Dynamically generated questions across the 14 basic math tasks give an unbiased and representative measure of fundamental math reasoning without favoring particular model families or training styles.
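To make this premise concrete, the sketch below shows one way basic math questions could be generated on the fly from a seeded random generator. The task names, operand range, list size, and prompt wording are hypothetical and are not taken from LLMThinkBench; the released package defines the actual templates.

```python
import random


def generate_question(task: str, rng: random.Random,
                      low: int = -100, high: int = 100, list_size: int = 8):
    """Return a (prompt, ground_truth) pair for a hypothetical basic math task.

    ASSUMPTION: operands are drawn uniformly from [low, high]; the real
    benchmark may use other ranges, templates, and tasks.
    """
    if task == "addition":
        a, b = rng.randint(low, high), rng.randint(low, high)
        return f"What is {a} + {b}?", a + b

    numbers = [rng.randint(low, high) for _ in range(list_size)]
    if task == "sorting":
        return f"Sort this list in ascending order: {numbers}", sorted(numbers)
    if task == "find_maximum":
        return f"What is the largest number in {numbers}?", max(numbers)

    raise ValueError(f"unknown task: {task}")
```

Seeding the generator keeps a given evaluation run reproducible while still producing fresh questions for each new seed, which is the property that guards against memorized test items.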

What would settle it

A follow-up experiment that replaces the dynamic questions with a fixed set of basic math problems and finds monotonic accuracy gains with longer reasoning chains would undermine the non-monotonic tradeoff claim.
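This falsification test amounts to checking whether accuracy rises at every step of the reasoning budget. A minimal sketch follows; the function names and the `min_gain` threshold are illustrative assumptions, and no numbers from the paper are used.

```python
def accuracy_gains(accuracy_by_budget: list[float]) -> list[float]:
    """Accuracy change between consecutive budgets (e.g. low, medium, high)."""
    return [later - earlier
            for earlier, later in zip(accuracy_by_budget, accuracy_by_budget[1:])]


def is_strictly_increasing(accuracy_by_budget: list[float],
                           min_gain: float = 0.0) -> bool:
    """True only if every budget increase improves accuracy by more than min_gain."""
    if len(accuracy_by_budget) < 2:
        raise ValueError("need accuracies for at least two budgets")
    return all(gain > min_gain for gain in accuracy_gains(accuracy_by_budget))
```

A model whose accuracy is flat across low, medium, and high budgets fails this check, which matches the zero-gain pattern the paper reports; a fixed problem set on which most models pass it would count against the non-monotonic claim.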

Figures

Figures reproduced from arXiv: 2507.04023 by Aafiya Hussain, Gaurav Srivastava, Sriram Srinivasan, Xuan Wang.

Figure 1: LLMTHINKBENCH System Architecture. (a) End-to-End Workflow: A user defines an evaluation via the CLI or Python API, which is processed by the Model Handler (with vLLM/Transformers backends) and passed to the Task Evaluation Engine containing over 14 reasoning tasks. (b) Core Evaluation Pipeline: The engine follows a four-step process of Data Generation, Prompt Creation, Model Inference, and Response Parsin…
Figure 2: Performance analysis using LLMTHINKBENCH across different evaluation dimensions. (a) Model scaling: Qwen2.5 family (0.5B to 32B) shows accuracy improvement from 21.3% to 72.9%, while Llama-3.1 70B achieves 75.4% accuracy on basic math tasks. (b) Overthinking: Reasoning models show dramatic performance drops (average 39.3%) when constrained to 1024 tokens versus full token budget. (c) Quantization robustnes…
Figure 3: Snapshots of leaderboard results using LLMTHINKBENCH evaluations.
5 Conclusion. We developed LLMThinkBench, a robust evaluation framework for assessing basic mathematical reasoning and overthinking behavior in language models. Using LLMTHINKBENCH, we evaluated 53 models, which revealed fundamental gaps between benchmark performance and basic mathematical reasoning capabilities. LLMTHINKBENCH addresses c…
Figure 4: Welcome page of our leaderboard of results for evaluation across 40+ models. It explains the tool, …
Figure 5: Tabulated results of 40+ models evaluated on 1000 datapoints over 3 folds across 14 tasks. The leaderboard …
Figure 6: Performance insights across all models. Visualizes trade-offs such as accuracy vs efficiency, model size vs …
Figure 7: Performance insights across all models. Visualizes trade-offs such as overthinking vs accuracy, instruction …
Original abstract

Large language models (LLMs) achieve impressive performance on complex mathematical benchmarks yet sometimes fail on basic math reasoning while generating unnecessarily verbose responses. In this paper, we present LLMThinkBench, a systematic benchmark and comprehensive empirical study to evaluate the efficiency of reasoning in LLMs, focusing on the fundamental tradeoff between accuracy and overthinking. First, we formalize the accuracy-verbosity tradeoff. Second, we introduce the Overthinking Score, a harmonic-mean metric combining accuracy and token-efficiency for holistic model evaluation. Third, we establish an evaluation protocol with dynamically-generated data across 14 basic math tasks. Fourth, we conduct a large-scale empirical study evaluating 53 LLMs, including reasoning and quantized variants across different reasoning budgets. Fifth, we release LLMThinkBench as an open-source Python package and public leaderboard for reproducibility. Our findings reveal: 1) model performance on complex benchmarks does not translate directly to basic math reasoning; 2) reasoning models generate ~18x more tokens while sometimes achieving lower accuracy and exhibit catastrophic collapse when tokens are constrained, dropping by up to ~36%; 3) the accuracy-verbosity relationship is non-monotonic with extended reasoning budgets yielding diminishing returns (GPT-5/o-series models show zero accuracy gain from low -> medium -> high reasoning effort). Our findings challenge the assumption that longer reasoning in LLMs necessarily improves mathematical reasoning. Our public leaderboard is available at https://ctrl-gaurav.github.io/LLMThinkBench/. Our open-source Python package is available at https://pypi.org/project/llmthinkbench/, and the codebase can be found at https://github.com/ctrl-gaurav/LLMThinkBench for easy and reproducible evaluation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces LLMThinkBench to study the accuracy-efficiency tradeoff in LLMs on basic math reasoning. It formalizes the tradeoff, proposes the Overthinking Score (harmonic mean of accuracy and token efficiency), describes an evaluation protocol using dynamically generated questions across 14 tasks, reports results from 53 models (including reasoning and quantized variants) under varying reasoning budgets, and releases an open-source package plus public leaderboard. Main empirical findings are that reasoning models produce ~18x more tokens (sometimes with lower accuracy), suffer up to ~36% accuracy collapse under token constraints, and exhibit non-monotonic accuracy-verbosity curves with diminishing or zero returns from extended reasoning budgets (e.g., GPT-5/o-series models).

Significance. If the benchmark construction is shown to be unbiased, the large-scale empirical comparison across 53 models supplies concrete evidence that longer reasoning traces do not reliably improve basic math performance and can even degrade it. The open-source release, public leaderboard, and parameter-free Overthinking Score are clear strengths that support reproducibility and future work on efficiency-aware evaluation.

major comments (2)
  1. [§3] §3 (Evaluation Protocol): the description of dynamic question generation for the 14 tasks omits concrete details on template selection, difficulty sampling distribution, prompt style cues, and exclusion criteria. This is load-bearing for the headline claims because any systematic alignment between generated surface forms and the pre-training distribution of concise base models (versus verbose reasoning models) could artifactually inflate the reported 18× token gap and the 36% collapse numbers.
  2. [§5] §5 (Results): the reported non-monotonic accuracy-verbosity relationship and zero-gain observation for GPT-5/o-series models are presented as aggregates without per-model confidence intervals, multiple-testing correction, or explicit controls for question difficulty variance. This weakens the claim that extended reasoning budgets yield diminishing returns, as the pattern could partly reflect variance in the generated test set rather than model behavior.
minor comments (2)
  1. [§2] The exact formula for the Overthinking Score (harmonic mean) should be stated with any scaling constants or normalization choices; the current prose description leaves the precise definition ambiguous for replication.
  2. [Figures 3-5] Figure captions and axis labels for the token-vs-accuracy plots should explicitly note the reasoning budget levels (low/medium/high) and the token constraint thresholds used in the collapse experiments.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which highlights opportunities to strengthen the reproducibility and statistical presentation of our work. We address each major comment below and commit to revisions that enhance clarity without altering the core empirical findings.

Point-by-point responses
  1. Referee: [§3] §3 (Evaluation Protocol): the description of dynamic question generation for the 14 tasks omits concrete details on template selection, difficulty sampling distribution, prompt style cues, and exclusion criteria. This is load-bearing for the headline claims because any systematic alignment between generated surface forms and the pre-training distribution of concise base models (versus verbose reasoning models) could artifactually inflate the reported 18× token gap and the 36% collapse numbers.

    Authors: We agree that additional concrete details on question generation will improve transparency and reproducibility. In the revised manuscript we will expand §3 with: (i) the full set of templates for each of the 14 tasks, (ii) the exact sampling distributions used for difficulty parameters (e.g., operand ranges drawn uniformly from fixed intervals), (iii) prompt formatting conventions, and (iv) any post-generation exclusion rules. We will also add a short discussion arguing that the generation procedure is model-agnostic and therefore unlikely to systematically favor concise base models over reasoning models. The complete generation logic is already public in the released package; we will cite it explicitly. These clarifications address the concern without changing any reported numbers. revision: yes

  2. Referee: [§5] §5 (Results): the reported non-monotonic accuracy-verbosity relationship and zero-gain observation for GPT-5/o-series models are presented as aggregates without per-model confidence intervals, multiple-testing correction, or explicit controls for question difficulty variance. This weakens the claim that extended reasoning budgets yield diminishing returns, as the pattern could partly reflect variance in the generated test set rather than model behavior.

    Authors: We accept that statistical safeguards will make the claims more robust. In the revision we will: (a) report per-model 95% bootstrap confidence intervals for accuracy at each reasoning budget, (b) stratify results by difficulty bins to control for question variance, and (c) add a brief note on the exploratory nature of the analysis (no formal hypothesis tests across dozens of comparisons were performed, so multiple-testing correction is not strictly required but will be discussed). The observed non-monotonic patterns remain consistent across all 14 tasks and multiple model families, which we will emphasize as supporting evidence that the trends are not artifacts of test-set variance alone. Updated figures and text will appear in §5. revision: yes
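The per-model bootstrap intervals promised in the second response could be computed along the following lines. This is a percentile-bootstrap sketch using only the Python standard library; the function name, the number of resamples, and the fixed seed are assumptions, not details from the paper or its package.

```python
import random


def bootstrap_accuracy_ci(correct: list[bool], n_resamples: int = 2000,
                          confidence: float = 0.95, seed: int = 0):
    """Percentile bootstrap confidence interval for accuracy.

    `correct` holds one boolean per evaluated question (True = right answer).
    """
    if not correct:
        raise ValueError("need at least one outcome")
    rng = random.Random(seed)
    n = len(correct)

    # Resample the per-question outcomes with replacement and record accuracy.
    means = sorted(sum(rng.choices(correct, k=n)) / n for _ in range(n_resamples))

    # Read off the lower and upper percentiles of the resampled accuracies.
    alpha = (1.0 - confidence) / 2.0
    lower = means[int(alpha * n_resamples)]
    upper = means[min(n_resamples - 1, int((1.0 - alpha) * n_resamples))]
    return lower, upper
```

Overlapping intervals across the low, medium, and high budgets would be consistent with the zero-gain claim, whereas clearly separated intervals would indicate a real difference between budgets.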

Circularity Check

0 steps flagged

No circularity: direct empirical measurements on new benchmark

full rationale

The paper is an empirical benchmarking study that introduces LLMThinkBench with dynamically generated questions across 14 tasks and reports observed accuracy and token counts from 53 models. The Overthinking Score is explicitly defined as a harmonic mean metric rather than derived from data. No equations, predictions, or central claims reduce by construction to fitted inputs or self-citations. The derivation chain consists of measurement and comparison, which is self-contained against the generated test set.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claims rest on two domain assumptions about measurement and one newly introduced metric; no free parameters are fitted to data and no new physical entities are postulated.

axioms (2)
  • domain assumption: Token count is a valid proxy for reasoning verbosity and overthinking.
    Used to define efficiency in the Overthinking Score and to interpret the 18x token increase.
  • domain assumption: Dynamically generated questions across 14 tasks constitute an unbiased test of basic math reasoning.
    Invoked in the evaluation protocol described in the abstract.
invented entities (1)
  • Overthinking Score (no independent evidence)
    purpose: Harmonic-mean metric that jointly scores accuracy and token efficiency.
    Newly defined composite score introduced to enable holistic model comparison.

pith-pipeline@v0.9.0 · 5852 in / 1435 out tokens · 66871 ms · 2026-05-19T06:19:57.791436+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Post Reasoning: Improving the Performance of Non-Thinking Models at No Cost

    cs.AI · 2026-05 · conditional · novelty 7.0

    Post-Reasoning boosts LLM accuracy by reversing the usual answer-after-reasoning order, delivering mean relative gains of 17.37% across 117 model-benchmark pairs with zero extra cost.

Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages · cited by 1 Pith paper · 6 internal anchors

  3. [3]

    Marah Abdin, Sahaj Agarwal, Ahmed Awadallah, Vidhisha Balachandran, Harkirat Behl, Lingjiao Chen, Gustavo de Rosa, Suriya Gunasekar, Mojan Javaheripi, Neel Joshi, Piero Kauffmann, Yash Lara, Caio César Teodoro Mendes, Arindam Mitra, Besmira Nushi, Dimitris Papailiopoulos, Olli Saarikivi, Shital Shah, Vaishnavi Shrivastava, and 4 others. 2025. https://arxi...

  4. [4]

    Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, and Torsten Hoefler. 2024. https://doi.org/10.1609/aaai.v38i16.29720 Graph of thoughts: Solving elaborate problems with large language models . Proceedings of the AAAI Conference on Artificial ...

  5. [5]

    Xingyu Chen, Jiahao Xu, Tian Liang, Zhiwei He, Jianhui Pang, Dian Yu, Linfeng Song, Qiuzhi Liu, Mengfei Zhou, Zhuosheng Zhang, Rui Wang, Zhaopeng Tu, Haitao Mi, and Dong Yu. 2025. https://arxiv.org/abs/2412.21187 Do not think that much for 2+3=? on the overthinking of o1-like llms . Preprint, arXiv:2412.21187

  6. [6]

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. https://arxiv.org/abs/2110.14168 Training verifiers to solve math word problems . Preprint, arXiv:2110.14168

  7. [7]

    Jingxuan Fan, Sarah Martinson, Erik Y. Wang, Kaylie Hausknecht, Jonah Brenner, Danxian Liu, Nianli Peng, Corey Wang, and Michael P. Brenner. 2024. https://arxiv.org/abs/2410.09988 Hardmath: A benchmark dataset for challenging problems in applied mathematics . Preprint, arXiv:2410.09988

  8. [8]

    Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, Saizhuo Wang, Kun Zhang, Yuanzhuo Wang, Wen Gao, Lionel Ni, and Jian Guo. 2025. https://arxiv.org/abs/2411.15594 A survey on llm-as-a-judge . Preprint, arXiv:2411.15594

  9. [9]

    Huy Hoang Ha. 2025. https://arxiv.org/abs/2503.13661 Pensez: Less data, better reasoning -- rethinking french llm . Preprint, arXiv:2503.13661

  10. [10]

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. 2021. Measuring mathematical problem solving with the math dataset. NeurIPS

  11. [11]

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles

  12. [12]

    A. Lawsen. 2025. https://arxiv.org/abs/2506.09250 Comment on the illusion of thinking: Understanding the strengths and limitations of reasoning models via the lens of problem complexity . Preprint, arXiv:2506.09250

  13. [13]

    Haitao Li, Junjie Chen, Jingli Yang, Qingyao Ai, Wei Jia, Youfeng Liu, Kai Lin, Yueyue Wu, Guozhi Yuan, Yiran Hu, Wuyue Wang, Yiqun Liu, and Minlie Huang. 2024a. https://arxiv.org/abs/2412.17259 Legalagentbench: Evaluating llm agents in legal domain . Preprint, arXiv:2412.17259

  14. [14]

    Haoyang Li, Xuejia Chen, Zhanchao XU, Darian Li, Nicole Hu, Fei Teng, Yiming Li, Luyu Qiu, Chen Jason Zhang, Qing Li, and Lei Chen. 2025a. https://arxiv.org/abs/2502.11075 Exposing numeracy gaps: A benchmark to evaluate fundamental numerical abilities in large language models . Preprint, arXiv:2502.11075

  15. [15]

    Qintong Li, Leyang Cui, Xueliang Zhao, Lingpeng Kong, and Wei Bi. 2024b. https://arxiv.org/abs/2402.19255 Gsm-plus: A comprehensive benchmark for evaluating the robustness of llms as mathematical problem solvers . Preprint, arXiv:2402.19255

  16. [16]

    Zhiyuan Li, Yi Chang, and Yuan Wu. 2025b. https://arxiv.org/abs/2505.22113 Think-bench: Evaluating thinking efficiency and chain-of-thought quality of large reasoning models . Preprint, arXiv:2505.22113

  17. [17]

    Long Phan, Alice Gatti, Ziwen Han, Nathaniel Li, Josephina Hu, Hugh Zhang, Chen Bo Calvin Zhang, Mohamed Shaaban, John Ling, Sean Shi, Michael Choi, Anish Agrawal, Arnav Chopra, Adam Khoja, Ryan Kim, Richard Ren, Jason Hausenloy, Oliver Zhang, Mantas Mazeika, and 1090 others. 2025. https://arxiv.org/abs/2501.14249 Humanity's last exam . Preprint, arXiv:2501.14249

  18. [18]

    Xiao Pu, Michael Saxon, Wenyue Hua, and William Yang Wang. 2025. https://arxiv.org/abs/2504.13367 Thoughtterminator: Benchmarking, calibrating, and mitigating overthinking in reasoning models . Preprint, arXiv:2504.13367

  19. [19]

    Roussel Rahman. 2025. https://arxiv.org/abs/2504.00226 Large language models in numberland: A quick test of their numerical reasoning abilities . Preprint, arXiv:2504.00226

  20. [20]

    Parshin Shojaee, Iman Mirzadeh, Keivan Alizadeh, Maxwell Horton, Samy Bengio, and Mehrdad Farajtabar. 2025. https://ml-site.cdn-apple.com/papers/the-illusion-of-thinking.pdf The illusion of thinking: Understanding the strengths and limitations of reasoning models via the lens of problem complexity

  21. [21]

    Gaurav Srivastava, Zhenyu Bi, Meng Lu, and Xuan Wang. 2025a. https://arxiv.org/abs/2505.15734 Debate, train, evolve: Self evolution of language model reasoning . Preprint, arXiv:2505.15734

  22. [22]

    Gaurav Srivastava, Shuxiang Cao, and Xuan Wang. 2025b. https://arxiv.org/abs/2502.11569 Towards reasoning ability of small language models . Preprint, arXiv:2502.11569

  23. [23]

    Yang Sui, Yu-Neng Chuang, Guanchu Wang, Jiamu Zhang, Tianyi Zhang, Jiayi Yuan, Hongyi Liu, Andrew Wen, Shaochen Zhong, Hanjie Chen, and Xia Hu. 2025. https://arxiv.org/abs/2503.16419 Stop overthinking: A survey on efficient reasoning for large language models . Preprint, arXiv:2503.16419

  24. [24]

    Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, and 3 others. 2020. https://www.aclweb.org/anthology/2020.emnlp-demos.6 Transformers...

  25. [25]

    Zhen Hao Wong, Jingwen Deng, Runming He, Zirong Chen, Qijie You, Hejun Dong, Hao Liang, Chengyu Shen, Bin Cui, and Wentao Zhang. 2025. https://arxiv.org/abs/2506.04821 Logicpuzzlerl: Cultivating robust mathematical reasoning in llms via reinforcement learning . Preprint, arXiv:2506.04821

  26. [26]

    Nan Xu and Xuezhe Ma. 2024. Llm the genius paradox: A linguistic and math expert's struggle with simple word-based counting problems. arXiv preprint arXiv:2410.14166

  27. [27]

    Reda Yacouby and Dustin Axman. 2020. Probabilistic extension of precision, recall, and f1 score for more thorough evaluation of classification models. In Proceedings of the first workshop on evaluation and comparison of NLP systems, pages 79--91

  28. [28]

    Yang Yan, Yu Lu, Renjun Xu, and Zhenzhong Lan. 2025. https://arxiv.org/abs/2504.05262 Do phd-level llms truly grasp elementary addition? probing rule learning vs. memorization in large language models . Preprint, arXiv:2504.05262

  29. [29]

    Haoran Zhao, Yuchen Yan, Yongliang Shen, Haolei Xu, Wenqi Zhang, Kaitao Song, Jian Shao, Weiming Lu, Jun Xiao, and Yueting Zhuang. 2025. https://arxiv.org/abs/2505.14604 Let llms break free from overthinking via self-braking tuning . Preprint, arXiv:2505.14604