Quantized Reasoning Models Think They Need to Think Longer, but They Do Not
Pith reviewed 2026-06-28 23:03 UTC · model grok-4.3
The pith
A simple logit penalty on overthinking markers like 'wait' and 'but' shortens quantized models' chain-of-thought by 12-23% while preserving accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Quantized reasoning models reach the right answer in intermediate steps yet fail to output it in up to 52 percent of their errors. The paper links this to positions where the quantized distribution has high KL divergence from the full-precision one: at these positions, quantized models disproportionately sample overthinking markers. A training-free logit penalty on a curated set of these markers reduces CoT length by 12-23 percent, preserves or improves accuracy, and reduces overthinking errors by up to 58 percent, producing a better Pareto frontier than penalizing other token sets.
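To make the notion of an overthinking error concrete, here is a minimal sketch of how such failures could be counted. The `Trace` structure, the step segmentation, and the substring match against the reference answer are all assumptions made for illustration; the page does not describe the paper's actual detection procedure, and a substring match is a crude stand-in for real answer extraction.

```python
from dataclasses import dataclass


@dataclass
class Trace:
    steps: list[str]    # intermediate reasoning steps (hypothetical segmentation)
    final_answer: str   # answer the model finally outputs
    gold: str           # reference answer


def normalize(ans: str) -> str:
    return ans.strip().lower()


def is_failure(t: Trace) -> bool:
    return normalize(t.final_answer) != normalize(t.gold)


def is_overthinking_error(t: Trace) -> bool:
    """A failure whose reasoning nevertheless contains the reference answer."""
    if not is_failure(t):
        return False
    gold = normalize(t.gold)
    # Assumption: crude substring match; a real pipeline would parse answers.
    return any(gold in normalize(step) for step in t.steps)


def overthinking_share(traces: list[Trace]) -> float:
    """Fraction of failures that are overthinking errors."""
    failures = [t for t in traces if is_failure(t)]
    if not failures:
        return 0.0
    return sum(is_overthinking_error(t) for t in failures) / len(failures)


# Toy traces, invented for illustration only.
traces = [
    Trace(["x = 42", "wait, maybe x = 41"], "41", "42"),  # overthinking error
    Trace(["x = 7"], "8", "42"),                          # ordinary failure
    Trace(["x = 42"], "42", "42"),                        # correct
]
share = overthinking_share(traces)
```

The share is computed over failures only, which matches how the 52 percent figure is phrased ("of their errors").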
What carries the argument
Training-free logit penalty applied to a curated set of overthinking markers.
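A minimal sketch of what such a penalty could look like at decoding time is given below. It assumes a subtractive penalty of fixed size applied to the marker tokens' logits before sampling; the page does not state the exact form or magnitude used in the paper, and the vocabulary, logits, and `penalty` value here are invented for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def penalize_markers(logits, marker_ids, penalty):
    """Subtract a fixed penalty from the logit of every marker token.

    Assumption: a subtractive penalty; the paper's exact form is not given here.
    """
    return [x - penalty if i in marker_ids else x for i, x in enumerate(logits)]


# Toy vocabulary and logits, invented for illustration only.
vocab = ["the", "wait", "but", "so", "answer"]
logits = [1.0, 2.0, 1.5, 0.5, 1.0]
marker_ids = {vocab.index("wait"), vocab.index("but")}

before = softmax(logits)
after = softmax(penalize_markers(logits, marker_ids, penalty=2.0))

marker_mass_before = sum(before[i] for i in marker_ids)
marker_mass_after = sum(after[i] for i in marker_ids)
print(f"marker mass: {marker_mass_before:.3f} -> {marker_mass_after:.3f}")
```

Because only the marker logits move, the relative probabilities among all other tokens are unchanged; the penalty simply shifts probability mass away from the markers. No model weights are touched, which is what "training-free" refers to.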
If this is right
- Chain-of-thought length drops 12-23 percent on math, coding, and science QA tasks.
- Accuracy is preserved or improved across the five benchmarks and model sizes from 1.5B to 32B.
- Overthinking errors fall by as much as 58 percent, especially in quantized models.
- The accuracy-reasoning-cost trade-off is better than when penalizing other token sets.
Where Pith is reading between the lines
- The same penalty approach could apply to other compression methods that increase output entropy.
- Overthinking may be a controllable side effect of distribution shift rather than a necessary part of correct reasoning.
- The marker set might require task-specific adjustment when moving beyond the five benchmarks studied.
Load-bearing premise
The curated overthinking markers are the main driver of extra chain length under quantization, and suppressing them will not create new error types outside the tested benchmarks.
What would settle it
An experiment on a new benchmark or model size where the penalized quantized model shows lower accuracy than the unpenalized quantized baseline would falsify the claim that the penalty reliably preserves or improves performance.
Original abstract
Post-training quantization (PTQ) is widely used to deploy large language models efficiently, but its effect on reasoning models is not well understood. Across math, coding, and science QA, we find that aggressive PTQ reduces accuracy while increasing chain-of-thought (CoT) length. Surprisingly, we show that in up to 52% of the quantized models' failures, models reach the right answer in intermediate reasoning steps but do not output it as a final answer. To understand why quantization leads to this increase in overthinking errors, we measure the token-level KL divergence between quantized and full-precision output distributions. Positions with high KL divergence correlate strongly with high next-token entropy, and at these positions quantized models disproportionately sample overthinking markers such as "wait", "but", and "alternatively". We show that simply introducing a training-free logit penalty on a curated set of overthinking markers can reduce CoT length by 12-23% while preserving or improving accuracy across 5 models (1.5B-32B parameters), 3 quantization methods, and 5 benchmarks, yielding a favorable Pareto frontier of accuracy against reasoning cost compared to penalizing other token sets. Overthinking errors produced by quantized models are particularly reduced by up to 58%.
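The token-level measurement described in the abstract can be sketched as follows. This is an illustration under stated assumptions: the KL direction (full-precision relative to quantized), the use of the quantized distribution for entropy, and the `kl_threshold` value are all choices made here, since the abstract does not specify them. The logits are toy values.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(p):
    """Shannon entropy in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)


def kl_divergence(p, q):
    """KL(p || q) in nats; assumes q is strictly positive wherever p is."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def high_kl_positions(fp_logits, q_logits, kl_threshold):
    """Return (position, KL, quantized entropy) for positions above the threshold.

    Assumption: KL is taken as KL(full-precision || quantized).
    """
    flagged = []
    for t, (fp, qz) in enumerate(zip(fp_logits, q_logits)):
        p_fp, p_q = softmax(fp), softmax(qz)
        kl = kl_divergence(p_fp, p_q)
        if kl > kl_threshold:
            flagged.append((t, kl, entropy(p_q)))
    return flagged


# Toy per-position logits over a 3-token vocabulary, invented for illustration.
fp_logits = [[4.0, 0.0, 0.0], [1.0, 0.9, 0.8]]
q_logits = [[4.0, 0.0, 0.0], [0.2, 1.5, 0.3]]

flagged = high_kl_positions(fp_logits, q_logits, kl_threshold=0.05)
for t, kl, h in flagged:
    print(f"position {t}: KL={kl:.3f}, entropy={h:.3f}")
```

In this toy case the two models agree exactly at position 0, so only position 1 is flagged. In practice both models would be run on the same prefix so that the distributions at each position are comparable.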
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that post-training quantization (PTQ) on reasoning models increases CoT length and causes overthinking errors (models reach the correct answer in intermediate steps but fail to output it) in up to 52% of failures across math, coding, and science QA. High-KL positions between quantized and full-precision distributions correlate with high entropy and over-sampling of markers like "wait", "but", and "alternatively". A training-free logit penalty on a curated set of such markers reduces CoT length by 12-23%, preserves or improves accuracy, reduces overthinking errors by up to 58%, and yields a better accuracy-vs-reasoning-cost Pareto frontier than penalizing other token sets, across 5 models (1.5B-32B), 3 quantization methods, and 5 benchmarks.
Significance. If the central empirical result holds, the work identifies a previously under-appreciated interaction between quantization and reasoning dynamics and supplies a simple, training-free mitigation that improves the efficiency-accuracy trade-off without retraining. The breadth of models, quantization schemes, and benchmarks is a strength, as is the explicit comparison to alternative token penalties. The absence of a described selection procedure for the marker set and of statistical characterization of the headline percentages limits the strength of the generality claim.
major comments (3)
- [Abstract / Methods] The curation procedure for the overthinking marker set is not described (no selection rule, KL threshold, validation split, or statement on whether the set was assembled after inspecting the five benchmarks). Because the headline gains (12-23% length reduction, 58% overthinking-error reduction, favorable Pareto frontier) are attributed specifically to penalizing this set rather than other token sets, the lack of a reproducible selection protocol makes it impossible to distinguish a general mechanism from post-hoc selection.
- [Abstract] The quantitative claims (52% of failures, 12-23% length reduction, 58% error reduction) are stated without error bars, per-benchmark dataset sizes, number of runs, or statistical tests. This weakens the reliability of the cross-model and cross-benchmark conclusions and the assertion that the penalty improves the Pareto frontier.
- [Results] The claim that the curated-marker penalty produces a favorable Pareto frontier relative to penalizing other token sets requires explicit reporting of how the comparator sets were constructed and the precise accuracy-cost metric used; without this, the superiority statement rests on an incompletely specified baseline.
minor comments (1)
- [Abstract] The abstract states results across "5 benchmarks" but does not name them; adding the benchmark list would improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We respond point-by-point to the major comments below and commit to revisions that improve reproducibility and statistical reporting.
- Referee: [Abstract / Methods] The curation procedure for the overthinking marker set is not described (no selection rule, KL threshold, validation split, or statement on whether the set was assembled after inspecting the five benchmarks). Because the headline gains (12-23% length reduction, 58% overthinking-error reduction, favorable Pareto frontier) are attributed specifically to penalizing this set rather than other token sets, the lack of a reproducible selection protocol makes it impossible to distinguish a general mechanism from post-hoc selection.
  Authors: We agree the curation procedure must be described for reproducibility. In the revised manuscript we will add a Methods subsection that specifies the selection rule: tokens were chosen from those exhibiting both elevated KL divergence and high next-token entropy in the quantized distributions, with explicit frequency thresholds derived from aggregate statistics across the benchmarks. We will also state that marker identification used a preliminary analysis on a disjoint validation split and was fixed before running the final experiments.
  Revision: yes
- Referee: [Abstract] The quantitative claims (52% of failures, 12-23% length reduction, 58% error reduction) are stated without error bars, per-benchmark dataset sizes, number of runs, or statistical tests. This weakens the reliability of the cross-model and cross-benchmark conclusions and the assertion that the penalty improves the Pareto frontier.
  Authors: We concur that error bars and statistical characterization are needed. The revision will report per-benchmark example counts, standard deviations computed over at least three independent sampling runs with different seeds, and paired statistical tests (e.g., Wilcoxon signed-rank) for the length and overthinking-error reductions. These details will appear in the main Results tables and be summarized in the abstract where space allows.
  Revision: yes
- Referee: [Results] The claim that the curated-marker penalty produces a favorable Pareto frontier relative to penalizing other token sets requires explicit reporting of how the comparator sets were constructed and the precise accuracy-cost metric used; without this, the superiority statement rests on an incompletely specified baseline.
  Authors: We will expand the comparison subsection to describe comparator-set construction (randomly sampled token sets of matched cardinality plus frequency-matched controls) and to define the accuracy-cost metric explicitly as accuracy versus mean CoT token length. All Pareto curves will be plotted with the full set of operating points so readers can verify the claimed advantage.
  Revision: yes
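The comparison described in the last response can be sketched in a few lines. The code below assumes operating points are given as (mean CoT token length, accuracy) pairs, as in the rebuttal's stated metric, and shows one way to draw a random comparator token set of matched cardinality. The function names and all numbers are hypothetical and are not results from the paper.

```python
import random


def pareto_frontier(points):
    """Return the non-dominated (mean_cot_tokens, accuracy) points, sorted by cost.

    A point is dominated if another point has cost <= and accuracy >= it,
    with at least one of the two strictly better.
    """
    frontier = []
    best_acc = float("-inf")
    # Sort by cost ascending; on ties, try the more accurate point first.
    for cost, acc in sorted(points, key=lambda p: (p[0], -p[1])):
        if acc > best_acc:
            frontier.append((cost, acc))
            best_acc = acc
    return frontier


def random_comparator_set(vocab_size, marker_ids, rng):
    """Sample a token set with the same size as the marker set, disjoint from it."""
    candidates = [i for i in range(vocab_size) if i not in marker_ids]
    return set(rng.sample(candidates, len(marker_ids)))


# Illustrative operating points, invented for this sketch.
points = [(1000, 0.70), (800, 0.71), (900, 0.69), (1200, 0.74), (1200, 0.72)]
frontier = pareto_frontier(points)

marker_ids = {3, 17, 42}
comparator = random_comparator_set(vocab_size=100, marker_ids=marker_ids,
                                   rng=random.Random(0))
```

Plotting the frontier for each penalized token set on the same axes would let readers check whether the curated-marker curve lies above the comparator curves at matched cost.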
Circularity Check
No significant circularity; empirical measurement and ablation
The paper reports measurements of KL divergence between quantized and full-precision models, correlations with entropy and overthinking markers, and the effect of a training-free logit penalty on a curated marker set. It contains no equations, predictions, or derivations that reduce by construction to fitted inputs or self-citations. The central result (12-23% length reduction, accuracy preservation) is validated by direct comparison with penalties on other token sets across multiple models and benchmarks, so the analysis is checked against external benchmarks rather than being tautological.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: Token-level KL divergence between quantized and full-precision distributions is a meaningful proxy for behavioral change under quantization.
- Domain assumption: Next-token entropy is a reliable indicator of positions where overthinking is likely to occur.