
arxiv: 2606.00206 · v1 · pith:EUW2RXX4new · submitted 2026-05-29 · 💻 cs.LG

Quantized Reasoning Models Think They Need to Think Longer, but They Do Not

Pith reviewed 2026-06-28 23:03 UTC · model grok-4.3

classification 💻 cs.LG
keywords: post-training quantization, chain-of-thought reasoning, overthinking errors, logit penalty, reasoning efficiency, model compression, quantized language models

The pith

A simple logit penalty on overthinking markers like 'wait' and 'but' shortens quantized models' chain-of-thought by 12-23% while preserving accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Post-training quantization makes reasoning models produce longer chains of thought and more overthinking errors, where the model reaches the correct answer in intermediate steps but continues instead of outputting it. These extra steps arise because quantized models sample specific marker words such as 'wait', 'but', and 'alternatively' more often at positions where their output distribution diverges from the full-precision model. Applying a training-free penalty to the logits of a curated set of these markers reduces chain length and cuts overthinking errors by up to 58 percent. The approach improves the accuracy-versus-reasoning-cost trade-off across five models, three quantization methods, and five benchmarks in math, coding, and science QA.

Core claim

Quantized reasoning models reach the right answer in intermediate steps yet fail to output it in up to 52 percent of their errors, because high KL-divergence positions cause disproportionate sampling of overthinking markers. A training-free logit penalty on a curated set of these markers reduces CoT length by 12-23 percent, preserves or improves accuracy, and reduces overthinking errors by up to 58 percent, producing a better Pareto frontier than penalizing other token sets.
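
The overthinking-error definition above (correct answer reached in an intermediate step, but not output as the final answer) can be made concrete with a small classifier. This is only a sketch: the function names, the sample layout with `steps`, `final`, and `gold` keys, and the plain substring match against the gold answer are all assumptions for illustration. The source does not say how the paper detects intermediate answers, and a substring test is crude (a gold answer of "4" would also match "14").

```python
def classify_outcome(cot_steps, final_answer, gold):
    """Label one sample as 'correct', 'overthinking_error', or 'other_error'."""
    gold = gold.strip()
    if final_answer.strip() == gold:
        return "correct"
    # Overthinking error: the gold answer appeared in some intermediate step,
    # yet the model ended up outputting something else.
    # Assumption: a plain substring match stands in for real answer extraction.
    if any(gold in step for step in cot_steps):
        return "overthinking_error"
    return "other_error"


def overthinking_share(samples):
    """Fraction of failures that are overthinking errors (0.0 if nothing failed)."""
    labels = [classify_outcome(s["steps"], s["final"], s["gold"]) for s in samples]
    failures = [label for label in labels if label != "correct"]
    if not failures:
        return 0.0
    return failures.count("overthinking_error") / len(failures)
```

A share computed this way is the kind of quantity behind the "up to 52 percent of their errors" figure, although the paper's own detection procedure may differ.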

What carries the argument

Training-free logit penalty applied to a curated set of overthinking markers.
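
A minimal sketch of what such a penalty might look like at decoding time, using only the Python standard library. It assumes the penalty is a constant subtracted from the logits of the marker tokens before sampling; the source does not state the exact form or magnitude, so `penalty`, `marker_ids`, and the helper names are hypothetical.

```python
import math
import random


def penalize_markers(logits, marker_ids, penalty):
    """Return a copy of `logits` with `penalty` subtracted at each marker id.

    Assumption: an additive constant penalty; the paper may use another form.
    """
    out = list(logits)
    for i in marker_ids:
        out[i] -= penalty
    return out


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_next(logits, marker_ids, penalty, rng):
    """Sample one token id from the penalized distribution."""
    probs = softmax(penalize_markers(logits, marker_ids, penalty))
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Toy usage with a made-up four-token vocabulary: ["the", "wait", "so", "but"].
toy_logits = [1.0, 2.0, 0.5, 1.5]
toy_markers = [1, 3]  # hypothetical ids for "wait" and "but"
next_id = sample_next(toy_logits, toy_markers, penalty=2.0, rng=random.Random(0))
```

Because the same constant is removed only from marker logits, the relative probabilities among non-marker tokens are unchanged; only the mass assigned to the markers shrinks. No model weights are touched, which is what makes the method training-free.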

If this is right

  • Chain-of-thought length drops 12-23 percent on math, coding, and science QA tasks.
  • Accuracy is preserved or improved across the five benchmarks and model sizes from 1.5B to 32B.
  • Overthinking errors fall by as much as 58 percent, especially in quantized models.
  • The accuracy-reasoning-cost trade-off is better than when penalizing other token sets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same penalty approach could apply to other compression methods that increase output entropy.
  • Overthinking may be a controllable side effect of distribution shift rather than a necessary part of correct reasoning.
  • The marker set might require task-specific adjustment when moving beyond the five benchmarks studied.

Load-bearing premise

The curated overthinking markers are the main driver of extra chain length under quantization, and suppressing them will not create new error types outside the tested benchmarks.

What would settle it

An experiment on a new benchmark or model size where the penalized quantized model shows lower accuracy than the unpenalized quantized baseline would falsify the claim that the penalty reliably preserves or improves performance.

Original abstract

Post-training quantization (PTQ) is widely used to deploy large language models efficiently, but its effect on reasoning models is not well understood. Across math, coding, and science QA, we find that aggressive PTQ reduces accuracy while increasing chain-of-thought (CoT) length. Surprisingly, we show that in up to 52% of the quantized models' failures, models reach the right answer in intermediate reasoning steps but do not output it as a final answer. To understand why quantization leads to this increase in overthinking errors, we measure the token-level KL divergence between quantized and full-precision output distributions. Positions with high KL divergence correlate strongly with high next-token entropy, and at these positions quantized models disproportionately sample overthinking markers such as "wait", "but", and "alternatively". We show that simply introducing a training-free logit penalty on a curated set of overthinking markers can reduce CoT length by 12-23% while preserving or improving accuracy across 5 models (1.5B-32B parameters), 3 quantization methods, and 5 benchmarks, yielding a favorable Pareto frontier of accuracy against reasoning cost compared to penalizing other token sets. Overthinking errors produced by quantized models are particularly reduced by up to 58%.
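
The token-level measurement described in the abstract can be sketched as follows. Several things here are assumptions: the direction KL(full-precision || quantized), the use of the quantized distribution for the entropy, natural-log units, and the `kl_threshold` cutoff. The abstract does not specify any of them.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(p):
    """Shannon entropy of a distribution, in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; q is floored at eps to avoid log(0)."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0.0)


def high_divergence_positions(fp_logits, q_logits, kl_threshold):
    """Return (position, kl, quantized_entropy) for positions above the threshold.

    fp_logits and q_logits are per-position logit vectors from the
    full-precision and quantized models on the same token sequence.
    """
    flagged = []
    for t, (lf, lq) in enumerate(zip(fp_logits, q_logits)):
        p_fp, p_q = softmax(lf), softmax(lq)
        kl = kl_divergence(p_fp, p_q)  # assumed direction: full-precision || quantized
        if kl > kl_threshold:
            flagged.append((t, kl, entropy(p_q)))
    return flagged
```

Inspecting which tokens the quantized model samples at the flagged positions is one way to look for the over-sampling of markers such as "wait" and "but" that the abstract reports.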

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper claims that post-training quantization (PTQ) on reasoning models increases CoT length and causes overthinking errors (models reach the correct answer in intermediate steps but fail to output it) in up to 52% of failures across math, coding, and science QA. High-KL positions between quantized and full-precision distributions correlate with high entropy and over-sampling of markers like "wait", "but", and "alternatively". A training-free logit penalty on a curated set of such markers reduces CoT length by 12-23%, preserves or improves accuracy, reduces overthinking errors by up to 58%, and yields a better accuracy-vs-reasoning-cost Pareto frontier than penalizing other token sets, across 5 models (1.5B-32B), 3 quantization methods, and 5 benchmarks.

Significance. If the central empirical result holds, the work identifies a previously under-appreciated interaction between quantization and reasoning dynamics and supplies a simple, training-free mitigation that improves the efficiency-accuracy trade-off without retraining. The breadth of models, quantization schemes, and benchmarks is a strength, as is the explicit comparison to alternative token penalties. The absence of a described selection procedure for the marker set and of statistical characterization of the headline percentages limits the strength of the generality claim.

major comments (3)
  1. [Abstract / Methods] Abstract and implied Methods: The curation procedure for the overthinking marker set is not described (no selection rule, KL threshold, validation split, or statement on whether the set was assembled after inspecting the five benchmarks). Because the headline gains (12-23% length reduction, 58% overthinking-error reduction, favorable Pareto frontier) are attributed specifically to penalizing this set rather than other token sets, the lack of a reproducible selection protocol makes it impossible to distinguish a general mechanism from post-hoc selection.
  2. [Abstract] Abstract: The quantitative claims (52% of failures, 12-23% length reduction, 58% error reduction) are stated without error bars, per-benchmark dataset sizes, number of runs, or statistical tests. This weakens the reliability of the cross-model and cross-benchmark conclusions and the assertion that the penalty improves the Pareto frontier.
  3. [Results] Results (implied comparison section): The claim that the curated-marker penalty produces a favorable Pareto frontier relative to penalizing other token sets requires explicit reporting of how the comparator sets were constructed and the precise accuracy-cost metric used; without this, the superiority statement rests on an incompletely specified baseline.
minor comments (1)
  1. [Abstract] The abstract states results across "5 benchmarks" but does not name them; adding the benchmark list would improve reproducibility.
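
The kind of reporting the second major comment asks for could look roughly like the sketch below: a relative CoT length reduction computed per seed, then summarized as a mean and sample standard deviation. The function names are hypothetical, and any numbers fed to them would be per-seed mean CoT lengths that the abstract does not provide.

```python
import statistics


def relative_reduction(baseline_len, penalized_len):
    """Fractional drop in mean CoT length, e.g. 0.2 for a 20% reduction."""
    return (baseline_len - penalized_len) / baseline_len


def summarize_over_seeds(baseline_lens, penalized_lens):
    """Mean and sample standard deviation of the per-seed relative reduction.

    Each list holds one mean CoT length per sampling seed, paired by index.
    At least two seeds are needed for a sample standard deviation.
    """
    reductions = [
        relative_reduction(b, p) for b, p in zip(baseline_lens, penalized_lens)
    ]
    return statistics.mean(reductions), statistics.stdev(reductions)
```

A headline range such as 12-23% would then come with a spread per model and benchmark, which is what lets a reader judge whether differences between settings exceed seed noise.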

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We respond point-by-point to the major comments below and commit to revisions that improve reproducibility and statistical reporting.

Point-by-point responses
  1. Referee: [Abstract / Methods] Abstract and implied Methods: The curation procedure for the overthinking marker set is not described (no selection rule, KL threshold, validation split, or statement on whether the set was assembled after inspecting the five benchmarks). Because the headline gains (12-23% length reduction, 58% overthinking-error reduction, favorable Pareto frontier) are attributed specifically to penalizing this set rather than other token sets, the lack of a reproducible selection protocol makes it impossible to distinguish a general mechanism from post-hoc selection.

    Authors: We agree the curation procedure must be described for reproducibility. In the revised manuscript we will add a Methods subsection that specifies the selection rule: tokens were chosen from those exhibiting both elevated KL divergence and high next-token entropy in the quantized distributions, with explicit frequency thresholds derived from aggregate statistics across the benchmarks. We will also state that marker identification used a preliminary analysis on a disjoint validation split and was fixed before running the final experiments. revision: yes

  2. Referee: [Abstract] Abstract: The quantitative claims (52% of failures, 12-23% length reduction, 58% error reduction) are stated without error bars, per-benchmark dataset sizes, number of runs, or statistical tests. This weakens the reliability of the cross-model and cross-benchmark conclusions and the assertion that the penalty improves the Pareto frontier.

    Authors: We concur that error bars and statistical characterization are needed. The revision will report per-benchmark example counts, standard deviations computed over at least three independent sampling runs with different seeds, and paired statistical tests (e.g., Wilcoxon signed-rank) for the length and overthinking-error reductions. These details will appear in the main Results tables and be summarized in the abstract where space allows. revision: yes

  3. Referee: [Results] Results (implied comparison section): The claim that the curated-marker penalty produces a favorable Pareto frontier relative to penalizing other token sets requires explicit reporting of how the comparator sets were constructed and the precise accuracy-cost metric used; without this, the superiority statement rests on an incompletely specified baseline.

    Authors: We will expand the comparison subsection to describe comparator-set construction (randomly sampled token sets of matched cardinality plus frequency-matched controls) and to define the accuracy-cost metric explicitly as accuracy versus mean CoT token length. All Pareto curves will be plotted with the full set of operating points so readers can verify the claimed advantage. revision: yes
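
The accuracy-cost comparison in the third response (accuracy versus mean CoT token length) amounts to finding non-dominated operating points. A minimal sketch follows, assuming each operating point is a `(name, accuracy, mean_cot_tokens)` tuple; the tuple layout and function name are illustrative and not taken from the paper.

```python
def pareto_frontier(points):
    """Return the non-dominated operating points, sorted by increasing cost.

    points: list of (name, accuracy, mean_cot_tokens) tuples.
    A point is dominated if another point is at least as accurate and at
    least as cheap, and strictly better on one of the two axes.
    """
    frontier = []
    for name, acc, cost in points:
        dominated = any(
            (a >= acc and c <= cost) and (a > acc or c < cost)
            for _, a, c in points
        )
        if not dominated:
            frontier.append((name, acc, cost))
    return sorted(frontier, key=lambda t: t[2])
```

Under this definition, the claim that the curated-marker penalty yields a better frontier means its operating points are not dominated by those of the comparator token sets. Plotting every operating point, as the authors promise, lets readers check that directly.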

Circularity Check

0 steps flagged

No significant circularity; empirical measurement and ablation

full rationale

The paper reports measurements of KL divergence between quantized and full-precision models, correlations with entropy and overthinking markers, and the effect of a training-free logit penalty on a curated marker set. No equations, predictions, or derivations are present that reduce by construction to fitted inputs or self-citations. The central result (12-23% length reduction, accuracy preservation) is validated via direct comparison to penalizing other token sets across multiple models/benchmarks, making the analysis self-contained against external benchmarks rather than tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central empirical claims rest on the assumption that the chosen overthinking markers are representative and that the observed KL-entropy correlation is stable across models and quantization methods. No free parameters are explicitly fitted in the abstract; the marker set is described as curated rather than optimized. No new physical or mathematical entities are introduced.

axioms (2)
  • domain assumption: Token-level KL divergence between quantized and full-precision distributions is a meaningful proxy for behavioral change under quantization.
    Used to link high-divergence positions to overthinking markers; stated implicitly by the correlation analysis in the abstract.
  • domain assumption: Next-token entropy is a reliable indicator of positions where overthinking is likely to occur.
    Abstract states that high KL positions correlate strongly with high entropy.

pith-pipeline@v0.9.1-grok · 5764 in / 1687 out tokens · 29437 ms · 2026-06-28T23:03:01.887976+00:00 · methodology

