arxiv: 2604.10739 · v1 · submitted 2026-04-12 · 💻 cs.AI

Recognition: unknown

When More Thinking Hurts: Overthinking in LLM Test-Time Compute Scaling

Shu Zhou , Rui Ling , Junan Chen , Xin Wang , Tao Fan , Hao Wang

Authors on Pith no claims yet

Pith reviewed 2026-05-10 15:43 UTC · model grok-4.3

classification 💻 cs.AI

keywords overthinkingtest-time computeLLM reasoningchain of thoughtcompute scalingmarginal utilityearly stoppingadaptive compute

0 comments

The pith

LLMs can abandon correct answers when allowed more reasoning steps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines the assumption that longer chains of thought always improve large language model performance on reasoning tasks. It shows that additional reasoning tokens deliver sharply diminishing returns at higher budgets and that models sometimes switch from correct to incorrect answers as the chain lengthens. Optimal reasoning length turns out to depend on problem difficulty, so giving every problem the same compute budget wastes resources. A cost-aware evaluation method demonstrates that halting at moderate budgets can cut total computation substantially while preserving nearly the same accuracy.

Core claim

Scaling test-time compute through extended chains of thought has become a dominant paradigm, yet the assumption that longer thinking always yields better results remains unexamined. Marginal returns diminish substantially at higher budgets, and models exhibit overthinking by abandoning previously correct answers. Optimal thinking length varies across problem difficulty, making uniform compute allocation suboptimal. Stopping at moderate budgets can reduce computation significantly while maintaining comparable accuracy.

What carries the argument

Overthinking, the tendency for extended reasoning chains to cause models to abandon previously correct answers in favor of incorrect ones.

If this is right

Uniform allocation of test-time compute across problems of different difficulty is inefficient.
Early stopping at moderate reasoning budgets can preserve accuracy while lowering total tokens used.
Marginal gains from extra reasoning tokens become small or negative once a problem-specific length is exceeded.
Reasoning strategies must incorporate adaptive length control rather than fixed long chains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Dynamic stopping rules that monitor answer stability during generation could replace fixed budgets.
The same overthinking pattern may appear in other sequential generation tasks such as code synthesis or multi-step planning.
Training objectives that penalize unnecessary continuation after a stable answer emerges could reduce overthinking at inference time.

Load-bearing premise

The observed abandonment of correct answers is driven primarily by the length of the reasoning chain itself rather than by sampling temperature, prompt format, or inherent model randomness.

What would settle it

Run the same problems multiple times at fixed temperature and prompt while varying only the maximum allowed reasoning tokens; accuracy should continue to fall after the moderate budget point if the overthinking claim holds.

Figures

Figures reproduced from arXiv: 2604.10739 by Hao Wang, Junan Chen, Rui Ling, Shu Zhou, Tao Fan, Xin Wang.

**Figure 1.** Figure 1: Marginal utility diminishes with compute budget. (a) By problem difficulty: easier problems (Level 1-2) reach negative marginal utility earlier than hard problems (Level 5). The shaded region indicates where additional thinking hurts performance. (b) Model comparison: R1-32B maintains positive marginal utility longer than s1-32B, showing better resistance to overthinking. Shaded bands show standard deviat… view at source ↗

**Figure 2.** Figure 2: Overthinking can flip correct answers to incorrect ones. (a) Accuracy trajectories for individual problems, showing cases where extended thinking leads to answer changes. The red “overthinking zone” highlights where negative flips become dominant. (b) Frequency of “negative flips” (correct→incorrect) versus “positive flips” (incorrect→correct) across compute budgets. The crossover at ∼7K marks where extend… view at source ↗

**Figure 3.** Figure 3: GPQA Diamond: Model Comparison. (a) Accuracy curves showing R1-32B consistently outperforming s1-32B. (b) Flip ratio (negative/positive) analysis illustrating the underlying mechanism of overthinking at extended compute budgets. the CI [0.71, 1.05] still includes values below 1.0, confirming that overthinking has not yet reliably occurred at this budget level. Budget Flip Ratio 95% CI p-value 2,000 0.32 [0… view at source ↗

**Figure 4.** Figure 4: Flip Event Analysis: R1 vs. s1. (a) Flip event counts showing s1-32B crosses the negative-dominated threshold earlier (∼5K tokens) than R1-32B (∼7K tokens). (b) Flip ratio (negative/positive) comparison between the two models, highlighting s1-32B’s higher tendency to reverse correct answers. Indicator Correlation Precision@0.8 Hesitation markers 0.71 64.2% Answer oscillation 0.78 71.5% Confidence drop 0.63… view at source ↗

**Figure 5.** Figure 5: MATH-500: Difficulty-Stratified Analysis. (a) Accuracy by difficulty level. (b) Optimal budget varies 7.5× across difficulty levels. (c) Marginal utility by difficulty level across budgets. At 4K tokens, the model correctly identifies the answer as 220 using a standard counting argument. At 8K tokens, the model revisits the problem: “Wait, I should double-check by considering an alternative approach... A… view at source ↗

**Figure 6.** Figure 6: Cost-aware evaluation reveals optimal stopping points. (a) The Pareto frontier shows the accuracycompute trade-off. Markers indicate optimal budgets under different λ values: at λ=0 (cost-agnostic), peak accuracy budget is optimal (not maximum, due to overthinking); at λ=1.0 (cost-sensitive), early stopping achieves higher utility. (b) Utility curves shift as cost sensitivity increases, with optimal stopp… view at source ↗

**Figure 7.** Figure 7: Early Stopping Validation. (a) The compute-accuracy trade-off for different stopping constraints. (b) Strategy comparison showing our combined approach achieves strong accuracy with significant compute savings. Cost-Sensitive (λ = 1.0): Strong efficiency preference. Only compute that yields proportional accuracy gains is justified. 5.3 Main Results Under cost-agnostic evaluation (λ=0), the optimal strate… view at source ↗

read the original abstract

Scaling test-time compute through extended chains of thought has become a dominant paradigm for improving large language model reasoning. However, existing research implicitly assumes that longer thinking always yields better results. This assumption remains largely unexamined. We systematically investigate how the marginal utility of additional reasoning tokens changes as compute budgets increase. We find that marginal returns diminish substantially at higher budgets and that models exhibit ``overthinking'', where extended reasoning is associated with abandoning previously correct answers. Furthermore, we show that optimal thinking length varies across problem difficulty, suggesting that uniform compute allocation is suboptimal. Our cost-aware evaluation framework reveals that stopping at moderate budgets can reduce computation significantly while maintaining comparable accuracy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Longer CoT chains can make LLMs drop correct answers they had earlier, and optimal thinking length shifts with problem difficulty, but the causal role of length versus sampling noise needs tighter controls.

read the letter

The main takeaway is that this paper documents cases where extending reasoning chains in LLMs leads to abandoning correct answers, and that the compute budget that maximizes accuracy is not uniform but depends on how hard the problem is. They track how marginal gains from extra tokens fall off at higher budgets and show that a cost-aware stopping rule can cut compute while keeping accuracy close to the maximum-budget case. That practical angle on early stopping is the clearest contribution here. The difficulty-dependent optima also stand out as a concrete observation that challenges the default of always scaling up uniformly. The experiments appear systematic enough on the surface to map these patterns across budgets. The soft spot is the link between chain length and answer abandonment. If the runs at different budgets are independent samples rather than controlled extensions or truncations of the same trace, the flips could simply reflect temperature-driven variance instead of any length-driven overthinking effect. The abstract does not detail fixed seeds, greedy decoding, or paired comparisons, so that attribution stays provisional until the methods section clarifies it. Even without a fully locked causal story, the marginal-utility curves and the cost framework still give useful data for anyone tuning inference budgets. This is aimed at researchers working on test-time scaling and efficient LLM reasoning. A reader focused on deployment costs or on testing the limits of chain-of-thought scaling will find the empirical patterns worth examining. It is grounded enough and timely enough to deserve a serious referee, with the main revision likely centering on stronger controls for isolating length from stochasticity.

Referee Report

2 major / 2 minor

Summary. The paper systematically examines the marginal utility of additional reasoning tokens in LLM test-time compute scaling. It reports diminishing returns at higher budgets, documents an 'overthinking' effect in which extended chains cause models to abandon answers that were correct at shorter lengths, demonstrates that optimal thinking length varies with problem difficulty, and introduces a cost-aware evaluation framework showing that early stopping can preserve accuracy while substantially reducing compute.

Significance. If the empirical patterns hold under controlled conditions, the work is significant for challenging the prevailing 'longer is better' assumption in test-time scaling research. The cost-aware framework and difficulty-dependent optimal lengths offer practical guidance for efficient inference. The paper receives credit for its systematic experimental design and for surfacing a falsifiable phenomenon (answer abandonment) rather than relying on fitted parameters or circular definitions.

major comments (2)

[§4.2] §4.2 (Overthinking Analysis): the reported association between chain length and abandonment of correct answers lacks controls that isolate length from sampling variance. If independent temperature sampling (T>0) is used across budgets without fixed seeds or paired truncation, longer traces are simply different draws; the observed abandonment could reflect ordinary stochasticity rather than a length-driven causal effect. This directly undermines the central 'overthinking' claim.
[§3] §3 (Experimental Setup): the methodology does not describe whether greedy decoding, fixed random seeds, or same-trace continuation (truncation) was employed when comparing different compute budgets. Without such controls, the claim that optimal thinking length varies by difficulty cannot be cleanly separated from prompt- or sample-specific effects.

minor comments (2)

[Figure 3] Figure 3 and associated text: the cost-accuracy curves would benefit from explicit error bars or confidence intervals across multiple runs to support the claim of 'comparable accuracy' at moderate budgets.
[§1] The abstract and §1 use 'overthinking' without an initial formal definition; a short operational definition (e.g., 'abandonment rate of initially correct answers as a function of token budget') should appear early.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address the two major comments point by point below, clarifying our experimental controls and adding supporting analyses where needed.

read point-by-point responses

Referee: [§4.2] §4.2 (Overthinking Analysis): the reported association between chain length and abandonment of correct answers lacks controls that isolate length from sampling variance. If independent temperature sampling (T>0) is used across budgets without fixed seeds or paired truncation, longer traces are simply different draws; the observed abandonment could reflect ordinary stochasticity rather than a length-driven causal effect. This directly undermines the central 'overthinking' claim.

Authors: We agree that independent sampling at varying lengths can confound length effects with stochastic variation. Our primary results were generated with greedy decoding (temperature = 0) to minimize this issue, but we acknowledge that this was not stated explicitly in the original manuscript. To isolate the causal role of length, we have added a truncation analysis in the revised §4.2: we generate full long traces under fixed seeds and then evaluate successive prefixes of the same trace. This within-trace comparison shows that correct answers are still abandoned as length increases, even when the underlying sample is held constant. We have updated the text and figures to report these controls. revision: yes
Referee: [§3] §3 (Experimental Setup): the methodology does not describe whether greedy decoding, fixed random seeds, or same-trace continuation (truncation) was employed when comparing different compute budgets. Without such controls, the claim that optimal thinking length varies by difficulty cannot be cleanly separated from prompt- or sample-specific effects.

Authors: We accept that the original §3 omitted key implementation details. The revised manuscript now explicitly states that all budget-comparison experiments used greedy decoding with fixed random seeds for reproducibility. For the difficulty-stratified analysis, the same problem set and model configuration were applied across difficulty tiers. In addition, we have included truncation-based results (same trace, varying prefix length) that reproduce the finding that optimal thinking length varies systematically with problem difficulty. These additions separate the reported effect from prompt- or sample-specific artifacts. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical observations from controlled experiments

full rationale

The paper reports direct experimental measurements of LLM accuracy versus reasoning length across difficulty levels and budgets. No equations derive predictions from fitted parameters, no self-citations supply load-bearing uniqueness theorems, and no ansatz or renaming is presented as a first-principles result. All central claims (diminishing returns, overthinking via answer abandonment, optimal length variation) are stated as outcomes of running the models and tabulating results, with no reduction to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the work is presented as an empirical investigation of an existing paradigm.

pith-pipeline@v0.9.0 · 5409 in / 1077 out tokens · 54769 ms · 2026-05-10T15:43:35.115850+00:00 · methodology

discussion (0)

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

State Stream Transformer (SST) V2: Parallel Training of Nonlinear Recurrence for Latent Space Reasoning
cs.LG 2026-04 unverdicted novelty 6.0

SST V2 introduces parallel-trainable nonlinear recurrence in latent space to let transformers reason continuously across positions, delivering +15 points on GPQA-Diamond and halving remaining GSM8K errors over matched...
Know When to Trust the Skill: Delayed Appraisal and Epistemic Vigilance for Single-Agent LLMs
cs.AI 2026-04 unverdicted novelty 4.0

MESA-S framework translates human metacognitive control into LLMs via delayed procedural probes and Metacognitive Skill Cards to separate parametric certainty from source trust and reduce overthinking.

Reference graph

Works this paper leans on

2 extracted references · 2 canonical work pages · cited by 2 Pith papers · 1 internal anchor

[1]

s1: Simple test-time scaling

Fast inference from transformers via spec- ulative decoding. InProceedings of the 40th Inter- national Conference on Machine Learning, volume 202 ofProceedings of Machine Learning Research, pages 19274–19286. PMLR. Jianglin Lu, Hailing Wang, Yi Xu, Yizhou Wang, Kuo Yang, and Yun Fu. 2025. Representation potentials of foundation models for multimodal align...

work page Pith review arXiv 2025
[2]

Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

The right tool for the job: Matching model and instance complexities. InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6640–6651. Association for Computational Linguistics. Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Ku- mar. 2024. Scaling llm test-time compute optimally can be more effective than scal...

work page internal anchor Pith review Pith/arXiv arXiv 2024