Inducing Overthink: Hierarchical Genetic Algorithm-based DoS Attack on Black-Box Large Language Reasoning Models
Pith reviewed 2026-05-14 18:24 UTC · model grok-4.3
The pith
A hierarchical genetic algorithm induces overthinking in black-box large reasoning models, increasing output length by up to 26.1x.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By applying a hierarchical genetic algorithm to structured problem decompositions and optimizing for a fitness function that rewards longer responses and reflective markers, the method reliably induces overthinking behavior in state-of-the-art reasoning models, achieving output length increases of up to 26.1x on the MATH benchmark while outperforming simple baselines.
What carries the argument
Hierarchical genetic algorithm (HGA) operating on structured problem decompositions to optimize composite fitness for response length and overthinking markers.
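For concreteness, a minimal sketch of what such a composite fitness could look like, assuming a word-count length term plus a weighted count of reflective markers; the marker list and the weights alpha and beta are illustrative assumptions, not values from the paper:

```python
import re

# Illustrative reflective markers; the paper's actual marker set is not specified here.
REFLECTIVE_MARKERS = ["wait", "let me reconsider", "alternatively", "on second thought"]

def composite_fitness(response: str, alpha: float = 1.0, beta: float = 50.0) -> float:
    """Score a model response: longer outputs and more reflective
    markers earn higher fitness. alpha and beta are assumed weights."""
    length_term = alpha * len(response.split())  # reward raw output length
    marker_term = beta * sum(
        len(re.findall(re.escape(m), response.lower()))  # literal, case-insensitive count
        for m in REFLECTIVE_MARKERS
    )
    return length_term + marker_term
```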
If this is right
- The attack works with only black-box access to the models.
- Adversarial inputs generated on small proxy models transfer effectively to large LRMs.
- Output amplification leads to significant increases in inference latency and energy use.
- Overthinking represents a shared vulnerability across modern reasoning systems.
Where Pith is reading between the lines
- Similar optimization could target other unwanted behaviors like hallucination or inconsistency.
- System designers might counter this by implementing dynamic length limits or input validation checks.
- The vulnerability could extend to real-world applications where LRMs are used for complex decision making.
Load-bearing premise
That overthinking behavior is consistently measurable by output length and reflective markers and can be induced reliably via black-box optimization without model-specific adjustments.
What would settle it
Running the HGA on a new reasoning model and finding that the generated inputs produce output lengths comparable to, or shorter than, those of standard problem inputs, which would falsify the claim of a reliably inducible, shared vulnerability.
Original abstract
Large Reasoning Models (LRMs) are increasingly integrated into systems requiring reliable multi-step inference, yet this growing dependence exposes new vulnerabilities related to computational availability. In particular, LRMs exhibit a tendency to "overthink", producing excessively long and redundant reasoning traces, when confronted with incomplete or logically inconsistent inputs. This behavior significantly increases inference latency and energy consumption, forming a potential vector for denial-of-service (DoS) style resource exhaustion. In this work, we investigate this attack surface and propose an automated black-box framework that induces overthinking in LRMs by systematically perturbing the logical structure of input problems. Our method employs a hierarchical genetic algorithm (HGA) operating on structured problem decompositions, and optimizes a composite fitness function designed to maximize both response length and reflective overthinking markers. Across four state-of-the-art reasoning models, the proposed method substantially amplifies output length, achieving up to a 26.1x increase on the MATH benchmark and consistently outperforming benign and manually crafted missing-premise baselines. We further demonstrate strong transferability, showing that adversarial inputs evolved using a small proxy model retain high effectiveness against large commercial LRMs. These findings highlight overthinking as a shared and exploitable vulnerability in modern reasoning systems, underscoring the need for more robust defenses.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a hierarchical genetic algorithm (HGA) operating on structured problem decompositions to induce overthinking in black-box large reasoning models (LRMs). By optimizing a composite fitness function over response length and reflective markers, the method perturbs the logical structure of MATH-style problems and claims up to 26.1x output-length amplification across four state-of-the-art models, outperforming benign and manually crafted missing-premise baselines, with demonstrated transferability from a proxy model to commercial LRMs.
Significance. If the central claims hold after addressing measurement and statistical concerns, the work identifies a practical black-box DoS vector against reasoning models via resource exhaustion. It extends adversarial ML literature from token-level attacks to structural perturbations that exploit overthinking, with transferability results strengthening real-world relevance. The absence of parameter fitting or self-referential axioms in the attack definition is a methodological strength.
major comments (3)
- [Abstract / Evaluation] The 26.1x length increase and consistent outperformance claims are reported without error bars, standard deviations, number of trials, or statistical tests (e.g., t-tests or Wilcoxon). This makes it impossible to determine whether the gains are robust or could arise from variance in model sampling.
- [Evaluation / Method] Response length plus reflective markers are used as the sole proxy for induced overthinking. Because HGA perturbations introduce logical inconsistencies or missing premises by design, longer outputs could result from the model struggling with ill-posed inputs rather than from self-reinforcing reflection loops; the missing-premise baseline mitigates but does not isolate this confound under black-box access.
- [Transferability experiments] While proxy-to-large-model transfer is reported as strong, no details are given on how many adversarial examples were tested, the exact similarity metric between proxy and target outputs, or whether the same fitness function was used without retuning.
minor comments (2)
- [Method] Notation for the hierarchical GA operators (crossover and mutation at different levels) should be formalized with pseudocode or explicit equations to allow reproduction (an illustrative sketch follows these comments).
- [Figures] Figure captions and axis labels for length distributions should explicitly state the number of samples per condition.
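On the first minor comment: a minimal sketch of what two-level operators might look like, assuming a problem decomposes into sub-problems that each hold a list of premise strings; the representation and both operators are assumptions, since the paper's exact operators are precisely what the referee asks to see formalized:

```python
import random

# Assumed representation: a problem decomposes into sub-problems,
# each holding a list of premise strings.
Problem = list[list[str]]

def crossover(a: Problem, b: Problem) -> Problem:
    """Top-level operator: splice whole sub-problems from two parents."""
    cut = random.randint(1, max(1, min(len(a), len(b)) - 1))
    return a[:cut] + b[cut:]

def mutate(p: Problem, drop_rate: float = 0.2) -> Problem:
    """Bottom-level operator: randomly drop premises inside one
    sub-problem, injecting the incompleteness the attack exploits."""
    p = [sub[:] for sub in p]          # copy before editing
    i = random.randrange(len(p))
    kept = [prem for prem in p[i] if random.random() > drop_rate]
    p[i] = kept or p[i][:1]            # keep at least one premise
    return p
```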
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback on our manuscript. We address each major comment below and indicate where revisions will be made to strengthen the presentation and analysis.
Point-by-point responses
Referee: [Abstract / Evaluation] The 26.1x length increase and consistent outperformance claims are reported without error bars, standard deviations, number of trials, or statistical tests (e.g., t-tests or Wilcoxon). This makes it impossible to determine whether the gains are robust or could arise from variance in model sampling.
Authors: We agree that the reported results would benefit from explicit measures of variability and statistical validation. In the revised manuscript we will state that all experiments were repeated over 10 independent trials, report standard deviations alongside the 26.1x figure, add error bars to all relevant plots, and include Wilcoxon signed-rank tests with p-values comparing HGA against the baselines to establish statistical significance. revision: yes
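A minimal sketch of the promised significance test, assuming paired per-problem output lengths under attack and baseline; the arrays below are placeholders, not the paper's measurements:

```python
import numpy as np
from scipy.stats import wilcoxon

# Placeholder paired measurements: output length per problem under
# the HGA attack vs. a baseline, on the same problems.
hga_lengths = np.array([5120, 4800, 6010, 5555, 4990, 5730])
baseline_lengths = np.array([1400, 1650, 1520, 1300, 1580, 1490])

# One-sided Wilcoxon signed-rank test: are attacked outputs longer?
stat, p_value = wilcoxon(hga_lengths, baseline_lengths, alternative="greater")
print(f"W = {stat:.1f}, p = {p_value:.4f}")
print(f"mean amplification: {hga_lengths.mean() / baseline_lengths.mean():.1f}x")
```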
Referee: [Evaluation / Method] Response length plus reflective markers are used as the sole proxy for induced overthinking. Because HGA perturbations introduce logical inconsistencies or missing premises by design, longer outputs could result from the model struggling with ill-posed inputs rather than from self-reinforcing reflection loops; the missing-premise baseline mitigates but does not isolate this confound under black-box access.
Authors: This is a fair observation about the interpretability of the proxy. While the missing-premise baseline was intended to control for simple ill-posedness, we acknowledge it does not fully disentangle the two mechanisms under black-box constraints. In the revision we will add a dedicated paragraph discussing this limitation, provide quantitative counts of specific reflective phrases (e.g., “let me reconsider”, “alternatively”) across conditions, and include representative reasoning traces to illustrate the self-reinforcing character of the longer outputs. revision: partial
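A minimal sketch of the promised phrase counting; the phrase list is an assumption, as the paper's exact marker set is not reproduced here:

```python
from collections import Counter

# Assumed reflective phrases; substitute the paper's actual marker set.
PHRASES = ["let me reconsider", "alternatively", "wait",
           "on second thought", "let me double-check"]

def reflective_counts(trace: str) -> Counter:
    """Count each reflective phrase in a reasoning trace (case-insensitive)."""
    lowered = trace.lower()
    return Counter({p: lowered.count(p) for p in PHRASES})

# Compare conditions: net increase in reflective markers under attack.
benign = reflective_counts("The answer is 42.")
attacked = reflective_counts(
    "Wait, the premises conflict. Let me reconsider. Alternatively, ...")
print(sum(attacked.values()) - sum(benign.values()))
```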
Referee: [Transferability experiments] While proxy-to-large-model transfer is reported as strong, no details are given on how many adversarial examples were tested, the exact similarity metric between proxy and target outputs, or whether the same fitness function was used without retuning.
Authors: We apologize for the omitted details. The revised version will explicitly state that transferability was measured on 50 adversarial examples evolved on the proxy model, that output similarity was quantified via cosine similarity of sentence embeddings produced by a fixed embedding model, and that the identical fitness function (without any retuning or adaptation) was applied directly to the commercial target LRMs, as described in Section 4.3. revision: yes
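A minimal sketch of the stated similarity metric, with random placeholder vectors standing in for a fixed sentence-embedding model's encodings of proxy and target outputs:

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Placeholder embeddings standing in for a fixed embedding model's
# encodings of the proxy-model and target-model responses.
rng = np.random.default_rng(0)
proxy_emb = rng.standard_normal(384)
target_emb = proxy_emb + 0.1 * rng.standard_normal(384)

print(f"proxy-target similarity: {cosine_similarity(proxy_emb, target_emb):.3f}")
```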
Circularity Check
No circularity: empirical optimization with external fitness
full rationale
The paper describes an empirical black-box attack that applies a hierarchical genetic algorithm to perturb problem inputs, with a composite fitness function defined externally on measured output length and reflective markers. Results are reported as experimental observations (e.g., 26.1x length increase on MATH) against baselines, without any derivation, equation, or self-citation that reduces the central claim to a fitted parameter or self-referential definition. The method is self-contained against external benchmarks and does not invoke uniqueness theorems or rename known results.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: LRMs produce longer reasoning traces on logically incomplete or inconsistent inputs.