arxiv: 2504.05605 · v1 · submitted 2025-04-08 · 💻 cs.CR · cs.CL

ShadowCoT: Cognitive Hijacking for Stealthy Reasoning Backdoors in LLMs

Pith reviewed 2026-05-22 21:08 UTC · model grok-4.3

classification 💻 cs.CR cs.CL
keywords backdoor attack · chain-of-thought · LLM security · reasoning hijacking · cognitive attack · stealthy backdoor · adversarial CoT

The pith

ShadowCoT implants backdoors in LLMs by hijacking their internal chain-of-thought reasoning steps with only 0.15 percent parameter updates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ShadowCoT as a backdoor framework that targets the internal reasoning mechanism of LLMs rather than surface prompts or tokens. It conditions attacks on the model's own reasoning states to selectively disrupt key steps in multi-step chains, then uses reinforcement learning and reasoning chain pollution to generate adversarial but coherent outputs. A sympathetic reader would care because chain-of-thought methods are meant to improve reliability on complex tasks, yet this approach turns that same mechanism into a hidden control point. If correct, the work shows that standard output checks and prompt defenses leave models exposed to deeper manipulations that preserve normal behavior on clean inputs.

Core claim

ShadowCoT is a backdoor attack that directly manipulates the cognitive reasoning path of LLMs by conditioning on internal reasoning states. A lightweight multi-stage injection pipeline selectively disrupts key reasoning steps by rewiring attention pathways and perturbing intermediate representations, while updating only 0.15 percent of the parameters. Reinforcement learning with reasoning chain pollution then synthesizes stealthy adversarial CoTs that, according to the paper, remain undetectable to advanced defenses. The reported results are a 94.4 percent attack success rate and an 88.4 percent hijacking success rate with benign performance preserved.
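To give a sense of scale for the 0.15 percent figure, the sketch below counts the trainable parameters of a generic low-rank adapter scheme. This is an assumption made for illustration: the abstract does not say which layers are updated or how, and the function name `lora_param_fraction` and every configuration value are hypothetical.

```python
def lora_param_fraction(hidden_size, num_layers, rank, adapted_per_layer, total_params):
    """Fraction of parameters that are trainable under a low-rank adapter scheme.

    Assumes every adapted projection is square (hidden_size x hidden_size) and
    gains two low-rank factors: A (rank x hidden_size) and B (hidden_size x rank).
    """
    per_matrix = 2 * rank * hidden_size
    trainable = num_layers * adapted_per_layer * per_matrix
    return trainable / total_params


# Hypothetical 7B-class configuration; these are NOT the paper's settings.
fraction = lora_param_fraction(
    hidden_size=4096,
    num_layers=32,
    rank=10,
    adapted_per_layer=4,
    total_params=7_000_000_000,
)
print(f"{fraction:.4%}")
```

With these made-up values the trainable share comes out at roughly 0.15 percent, which shows only that budgets of this order are typical for adapter-style fine-tuning, not how ShadowCoT allocates its updates.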

What carries the argument

The multi-stage injection pipeline that conditions on internal reasoning states to recognize and selectively disrupt key steps, combined with reinforcement learning and reasoning chain pollution to synthesize stealthy adversarial CoTs.

If this is right

  • The attack reaches 94.4 percent attack success rate and 88.4 percent hijacking success rate across diverse reasoning benchmarks and models.
  • Benign task performance remains intact despite the implanted backdoor.
  • The resulting adversarial chain-of-thought outputs evade detection by existing advanced defenses.
  • The approach reveals an emergent class of cognition-level threats that operate inside the reasoning process itself.
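The two headline metrics can be read as simple ratios over triggered inputs. The sketch below uses assumed definitions, since neither the abstract nor this review defines them: attack success means the final answer equals the attacker's target, and hijacking success means the reasoning chain was judged to be diverted. The record type `TriggeredResult` and its fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TriggeredResult:
    """One model response to a triggered input (hypothetical record layout)."""
    final_answer: str
    target_answer: str
    chain_hijacked: bool  # assumed label from a human or automated judge


def attack_success_rate(results):
    """Share of triggered inputs whose final answer equals the attacker's target."""
    hits = sum(1 for r in results if r.final_answer == r.target_answer)
    return hits / len(results)


def hijacking_success_rate(results):
    """Share of triggered inputs whose reasoning chain was judged hijacked."""
    hits = sum(1 for r in results if r.chain_hijacked)
    return hits / len(results)


# Toy data, invented for illustration only.
sample = [
    TriggeredResult("42", "42", True),
    TriggeredResult("42", "42", False),
    TriggeredResult("17", "42", True),
    TriggeredResult("17", "42", False),
]
asr = attack_success_rate(sample)
hsr = hijacking_success_rate(sample)
```

Keeping the two rates separate matters because a model can reach the attacker's answer without a diverted chain, and a diverted chain can still end in the correct answer.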

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Defenses would likely need to inspect or constrain intermediate activations during reasoning rather than only final outputs.
  • The same conditioning technique could be tested on other step-by-step systems such as planning agents or tool-using models.
  • The low update budget suggests the backdoor might survive partial retraining or model compression steps.

Load-bearing premise

A small set of parameter updates can selectively rewire attention pathways and perturb intermediate representations enough to enable reliable hijacking of reasoning steps while keeping the changes undetectable.

What would settle it

An experiment in which an advanced defense consistently flags the hijacked reasoning chains or in which the attack success rate falls below 50 percent after standard fine-tuning on clean reasoning data.
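Whether a measured attack success rate has really fallen below 50 percent depends on the number of trials. One way to make the criterion concrete, offered here as a sketch and not as the paper's protocol, is a Wilson score interval on the observed rate. The helper names and the 95 percent level (`z = 1.96`) are assumptions.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 is about 95%)."""
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


def asr_clearly_below(successes, trials, threshold=0.5, z=1.96):
    """True only if the whole interval lies below the threshold."""
    _, upper = wilson_interval(successes, trials, z)
    return upper < threshold


# Toy counts, invented for illustration only.
low, high = wilson_interval(50, 100)
settled = asr_clearly_below(30, 100)
unsettled = asr_clearly_below(47, 100)
```

In this toy example, 30 successes out of 100 would count as settled, while 47 out of 100 would not, because the upper bound of its interval still exceeds 0.5.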

Figures

Figures reproduced from arXiv: 2504.05605 by Athanasios V. Vasilakos, Gejian Zhao, Hanzhou Wu, Xinpeng Zhang.

Figure 1: An overview of the proposed backdoor attack methodology on reasoning-enhanced LLMs. Phase 1 illustrates the offline training process, encompassing ...
Figure 2: Example of an adversarial reasoning prompt template used in dataset ...
Figure 3: Overview of the proposed multi-stage training pipeline for reasoning ...
Figure 4: Schematic of the RCP mechanism. Residual stream corruption subtly ...
Figure 5: AODR across models and tasks. Higher values suggest reasoning was ...
Figure 6: Cross-task transferability heatmap for ShadowCoT.
Figure 7: Step-wise hijack activation heatmap. The vertical axis denotes ...
Figure 8: Step-wise hijack depth distribution across three attack methods. ShadowCoT demonstrates flexible mid-to-late hijacking.
Figure 9: Qualitative comparison across key dimensions of reasoning-level ...
Figure 1: Examples of adversarial reasoning chains generated by ShadowCoT.
Original abstract

Chain-of-Thought (CoT) enhances an LLM's ability to perform complex reasoning tasks, but it also introduces new security issues. In this work, we present ShadowCoT, a novel backdoor attack framework that targets the internal reasoning mechanism of LLMs. Unlike prior token-level or prompt-based attacks, ShadowCoT directly manipulates the model's cognitive reasoning path, enabling it to hijack multi-step reasoning chains and produce logically coherent but adversarial outcomes. By conditioning on internal reasoning states, ShadowCoT learns to recognize and selectively disrupt key reasoning steps, effectively mounting a self-reflective cognitive attack within the target model. Our approach introduces a lightweight yet effective multi-stage injection pipeline, which selectively rewires attention pathways and perturbs intermediate representations with minimal parameter overhead (only 0.15% updated). ShadowCoT further leverages reinforcement learning and reasoning chain pollution (RCP) to autonomously synthesize stealthy adversarial CoTs that remain undetectable to advanced defenses. Extensive experiments across diverse reasoning benchmarks and LLMs show that ShadowCoT consistently achieves high Attack Success Rate (94.4%) and Hijacking Success Rate (88.4%) while preserving benign performance. These results reveal an emergent class of cognition-level threats and highlight the urgent need for defenses beyond shallow surface-level consistency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes ShadowCoT, a backdoor attack framework targeting the internal reasoning mechanisms of LLMs. It conditions on internal states to selectively disrupt key CoT steps via a multi-stage injection pipeline (0.15% parameter updates), reinforcement learning, and reasoning chain pollution (RCP) to synthesize stealthy adversarial CoTs, reporting 94.4% Attack Success Rate and 88.4% Hijacking Success Rate while preserving benign performance across reasoning benchmarks.

Significance. If the empirical claims and selectivity hold under rigorous controls, the work identifies an emergent class of cognition-level threats that could evade surface-level defenses, with the RL-driven autonomous synthesis and minimal-overhead injection as potentially valuable contributions to LLM security research.

major comments (2)
  1. [Abstract] Abstract: specific numerical claims (ASR 94.4%, HSR 88.4%, 0.15% parameter updates) are presented without any description of experimental setup, benchmarks, baselines, trial counts, or statistical tests, which is load-bearing for assessing whether the reported rates support the central claims of consistent high performance and stealth.
  2. [Method (multi-stage injection pipeline)] Method section on multi-stage injection pipeline: the assertion of selective rewiring of attention pathways and perturbation of only key intermediate representations (via conditioning and RCP) with 0.15% updates lacks layer-wise probing or representation-shift ablations on clean inputs; without these, non-local effects on distributed reasoning computations cannot be ruled out and undermine the selectivity and stealth claims.
minor comments (1)
  1. [Abstract] The abstract would be clearer with a one-sentence summary of the LLMs and reasoning benchmarks used.
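The representation-shift ablation requested in the second major comment can be sketched as a per-layer comparison of activations on the same clean inputs, taken from the original and the modified model. The version below is a toy under stated assumptions: activations are plain nested lists (layer, then input, then vector), the metric is mean cosine distance, and the function names are hypothetical. In practice the activations would be captured from the two models during a forward pass.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def layerwise_shift(reference_acts, modified_acts):
    """Mean cosine distance per layer between two sets of activations.

    Both arguments are indexed as [layer][input] -> vector and must have
    matching shapes.
    """
    shifts = []
    for ref_layer, mod_layer in zip(reference_acts, modified_acts):
        distances = [cosine_distance(u, v) for u, v in zip(ref_layer, mod_layer)]
        shifts.append(sum(distances) / len(distances))
    return shifts


# Toy activations for two layers and two clean inputs (invented values).
reference = [[[1.0, 0.0], [0.0, 2.0]], [[3.0, 4.0], [1.0, 1.0]]]
modified = [[[1.0, 0.0], [0.0, 2.0]], [[4.0, -3.0], [1.0, 1.0]]]
shifts = layerwise_shift(reference, modified)
```

A shift profile that stays near zero on clean inputs at every layer would support the selectivity claim; a broad rise across layers would point to the non-local effects the referee is worried about.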

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, providing the strongest honest defense of the work while committing to revisions where the concerns are valid and actionable.

  1. Referee: [Abstract] Abstract: specific numerical claims (ASR 94.4%, HSR 88.4%, 0.15% parameter updates) are presented without any description of experimental setup, benchmarks, baselines, trial counts, or statistical tests, which is load-bearing for assessing whether the reported rates support the central claims of consistent high performance and stealth.

    Authors: We agree that the abstract's conciseness leaves the numerical claims without immediate context on setup. The full manuscript provides these details in the Experiments section, including benchmarks, baselines, and trial statistics. To improve accessibility, we will revise the abstract to include a brief clause referencing the evaluation protocol (e.g., 'across multiple reasoning benchmarks with repeated trials'). This addresses the concern without violating abstract length norms. revision: yes

  2. Referee: [Method (multi-stage injection pipeline)] Method section on multi-stage injection pipeline: the assertion of selective rewiring of attention pathways and perturbation of only key intermediate representations (via conditioning and RCP) with 0.15% updates lacks layer-wise probing or representation-shift ablations on clean inputs; without these, non-local effects on distributed reasoning computations cannot be ruled out and undermine the selectivity and stealth claims.

    Authors: The manuscript demonstrates selectivity through preserved benign performance on clean inputs alongside high targeted success rates, which is consistent with localized effects from the 0.15% updates and internal conditioning. We acknowledge, however, that dedicated layer-wise probing and representation-shift ablations on clean inputs would provide more direct evidence ruling out non-local effects. We will incorporate these analyses in the revised version to strengthen the selectivity argument. revision: yes

Circularity Check

0 steps flagged

No significant circularity in empirical attack framework

The paper describes an empirical backdoor attack method (multi-stage injection pipeline, conditioning on internal states, RL with RCP) and reports direct experimental outcomes such as ASR 94.4% and HSR 88.4% on reasoning benchmarks. No mathematical derivation, first-principles result, or prediction is claimed that reduces by construction to fitted parameters, self-definitions, or self-citation chains. The central claims rest on experimental validation rather than any self-referential reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit mathematical axioms, free parameters, or invented entities; the contribution is an empirical attack construction without derivations or new physical postulates.

pith-pipeline@v0.9.0 · 5769 in / 1049 out tokens · 53478 ms · 2026-05-22T21:08:57.776049+00:00 · methodology

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. R-CoT: A Reasoning-Layer Watermark via Redundant Chain-of-Thought in Large Language Models

    cs.CR 2026-04 unverdicted novelty 7.0

    R-CoT embeds watermarks into LLM reasoning paths via redundant CoT and GRPO-based dual optimization, maintaining over 95% true positive rate under fine-tuning and post-training changes.

  2. Critical-CoT: A Robust Defense Framework against Reasoning-Level Backdoor Attacks in Large Language Models

    cs.CR 2026-04 unverdicted novelty 6.0

    Critical-CoT defends LLMs from reasoning-level backdoor attacks via two-stage fine-tuning that builds automatic detection and refusal of poisoned chain-of-thought steps.

  3. Beyond Content Safety: Real-Time Monitoring for Reasoning Vulnerabilities in Large Language Models

    cs.AI 2026-03 unverdicted novelty 6.0

    An external zero-shot monitor detects nine unsafe reasoning behaviors in LLMs at 87% step-level accuracy with low false positives and low latency.
