ShadowCoT: Cognitive Hijacking for Stealthy Reasoning Backdoors in LLMs
Pith reviewed 2026-05-22 21:08 UTC · model grok-4.3
The pith
ShadowCoT implants backdoors in LLMs by hijacking their internal chain-of-thought reasoning steps while updating only 0.15 percent of their parameters.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ShadowCoT is a backdoor attack that directly manipulates the cognitive reasoning path of LLMs. It conditions on internal reasoning states and selectively disrupts key steps through a lightweight multi-stage injection pipeline, which rewires attention pathways and perturbs intermediate representations while updating only 0.15 percent of parameters. It also uses reinforcement learning with reasoning chain pollution to synthesize stealthy adversarial CoTs that remain undetectable to advanced defenses. The reported result is a 94.4 percent attack success rate and an 88.4 percent hijacking success rate, with benign performance preserved.
What carries the argument
The multi-stage injection pipeline that conditions on internal reasoning states to recognize and selectively disrupt key steps, combined with reinforcement learning and reasoning chain pollution to synthesize stealthy adversarial CoTs.
If this is right
- The attack reaches 94.4 percent attack success rate and 88.4 percent hijacking success rate across diverse reasoning benchmarks and models.
- Benign task performance remains intact despite the implanted backdoor.
- The resulting adversarial chain-of-thought outputs evade detection by existing advanced defenses.
- The approach reveals an emergent class of cognition-level threats that operate inside the reasoning process itself.
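The two headline rates can be read as simple proportions over triggered test inputs. The sketch below is a minimal illustration in Python. It assumes that ASR counts triggered inputs whose final answer matches the attacker's target and that HSR counts triggered inputs whose reasoning chain was diverted; this summary does not give the paper's exact definitions, and the record fields (`final_answer`, `target_answer`, `chain_hijacked`) are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TriggeredResult:
    # All field names are hypothetical and used for illustration only.
    final_answer: str     # answer the model produced on a triggered input
    target_answer: str    # answer the attacker wanted the model to produce
    chain_hijacked: bool  # whether a judge marked the reasoning chain as diverted


def attack_success_rate(results: List[TriggeredResult]) -> float:
    """Assumed definition: share of triggered inputs ending at the attacker's target."""
    if not results:
        raise ValueError("results must not be empty")
    hits = sum(r.final_answer == r.target_answer for r in results)
    return hits / len(results)


def hijacking_success_rate(results: List[TriggeredResult]) -> float:
    """Assumed definition: share of triggered inputs whose reasoning chain was diverted."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(r.chain_hijacked for r in results) / len(results)


# Toy data: four triggered inputs, not taken from the paper.
results = [
    TriggeredResult("7", "7", True),
    TriggeredResult("7", "7", True),
    TriggeredResult("12", "7", True),
    TriggeredResult("12", "7", False),
]
asr = attack_success_rate(results)
hsr = hijacking_success_rate(results)
print(f"ASR={asr:.2f} HSR={hsr:.2f}")
```

Keeping the two rates separate makes it possible to see cases where the chain is diverted but the final answer still misses the attacker's target.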
Where Pith is reading between the lines
- Defenses would likely need to inspect or constrain intermediate activations during reasoning rather than only final outputs.
- The same conditioning technique could be tested on other step-by-step systems such as planning agents or tool-using models.
- The low update budget suggests the backdoor might survive partial retraining or model compression steps.
Load-bearing premise
A small set of parameter updates can selectively rewire attention pathways and perturb intermediate representations enough to enable reliable hijacking of reasoning steps while keeping the changes undetectable.
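To give a sense of scale, the 0.15 percent figure can be turned into an absolute parameter count. The sketch below is plain arithmetic; the 7-billion-parameter model size is an assumed example, not a figure taken from this summary, and the function name is hypothetical.

```python
def updated_parameter_count(total_params: int, updated_fraction: float) -> int:
    """Convert an update fraction into an absolute number of parameters."""
    if total_params <= 0:
        raise ValueError("total_params must be positive")
    if not 0.0 <= updated_fraction <= 1.0:
        raise ValueError("updated_fraction must lie in [0, 1]")
    return round(total_params * updated_fraction)


TOTAL_PARAMS = 7_000_000_000  # assumed model size, for illustration only
UPDATED_FRACTION = 0.0015     # 0.15 percent, as stated in the summary

updated = updated_parameter_count(TOTAL_PARAMS, UPDATED_FRACTION)
print(f"{updated:,} of {TOTAL_PARAMS:,} parameters updated")
```

Under that assumed model size, the update budget is about 10.5 million parameters, which is the quantity any selectivity analysis would need to localize.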
What would settle it
An experiment in which an advanced defense consistently flags the hijacked reasoning chains or in which the attack success rate falls below 50 percent after standard fine-tuning on clean reasoning data.
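The falsification test above amounts to two checks, which can be written as a small predicate. This is a sketch only: the 50 percent ASR threshold comes from the text, while the detection-rate threshold standing in for "consistently flags" is an assumed placeholder, and the function and parameter names are hypothetical.

```python
def claim_is_undermined(
    asr_after_clean_finetune: float,
    defense_detection_rate: float,
    asr_threshold: float = 0.50,        # from the text: ASR falling below 50 percent
    detection_threshold: float = 0.90,  # assumed stand-in for "consistently flags"
) -> bool:
    """Return True if either falsifying outcome described in the text is observed."""
    for name, value in (
        ("asr_after_clean_finetune", asr_after_clean_finetune),
        ("defense_detection_rate", defense_detection_rate),
    ):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1]")
    attack_collapses = asr_after_clean_finetune < asr_threshold
    defense_catches_it = defense_detection_rate >= detection_threshold
    return attack_collapses or defense_catches_it


# Hypothetical measurements, not results from the paper.
print(claim_is_undermined(0.944, 0.05))  # attack survives, defense misses it
print(claim_is_undermined(0.40, 0.05))   # attack collapses after clean fine-tuning
```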
Original abstract
Chain-of-Thought (CoT) enhances an LLM's ability to perform complex reasoning tasks, but it also introduces new security issues. In this work, we present ShadowCoT, a novel backdoor attack framework that targets the internal reasoning mechanism of LLMs. Unlike prior token-level or prompt-based attacks, ShadowCoT directly manipulates the model's cognitive reasoning path, enabling it to hijack multi-step reasoning chains and produce logically coherent but adversarial outcomes. By conditioning on internal reasoning states, ShadowCoT learns to recognize and selectively disrupt key reasoning steps, effectively mounting a self-reflective cognitive attack within the target model. Our approach introduces a lightweight yet effective multi-stage injection pipeline, which selectively rewires attention pathways and perturbs intermediate representations with minimal parameter overhead (only 0.15% updated). ShadowCoT further leverages reinforcement learning and reasoning chain pollution (RCP) to autonomously synthesize stealthy adversarial CoTs that remain undetectable to advanced defenses. Extensive experiments across diverse reasoning benchmarks and LLMs show that ShadowCoT consistently achieves high Attack Success Rate (94.4%) and Hijacking Success Rate (88.4%) while preserving benign performance. These results reveal an emergent class of cognition-level threats and highlight the urgent need for defenses beyond shallow surface-level consistency.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes ShadowCoT, a backdoor attack framework targeting the internal reasoning mechanisms of LLMs. It conditions on internal states to selectively disrupt key CoT steps via a multi-stage injection pipeline (0.15% parameter updates), and uses reinforcement learning with reasoning chain pollution (RCP) to synthesize stealthy adversarial CoTs. The paper reports 94.4% Attack Success Rate and 88.4% Hijacking Success Rate while preserving benign performance across reasoning benchmarks.
Significance. If the empirical claims and selectivity hold under rigorous controls, the work identifies an emergent class of cognition-level threats that could evade surface-level defenses, with the RL-driven autonomous synthesis and minimal-overhead injection as potentially valuable contributions to LLM security research.
major comments (2)
- [Abstract] Specific numerical claims (ASR 94.4%, HSR 88.4%, 0.15% parameter updates) are presented without any description of the experimental setup, benchmarks, baselines, trial counts, or statistical tests. That context is load-bearing for assessing whether the reported rates support the central claims of consistent high performance and stealth.
- [Method (multi-stage injection pipeline)] The assertion of selective rewiring of attention pathways and perturbation of only key intermediate representations (via conditioning and RCP) with 0.15% updates is not backed by layer-wise probing or representation-shift ablations on clean inputs. Without these, non-local effects on distributed reasoning computations cannot be ruled out, which undermines the selectivity and stealth claims.
minor comments (1)
- [Abstract] The abstract would be clearer with a one-sentence summary of the LLMs and reasoning benchmarks used.
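The request for trial counts and statistical tests matters because a rate such as 94.4 percent carries very different uncertainty at different sample sizes. The sketch below computes a Wilson score interval for a binomial proportion. The sample sizes are assumed for illustration, since this summary does not report them.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives roughly 95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p_hat = successes / trials
    z2 = z * z
    denom = 1.0 + z2 / trials
    center = (p_hat + z2 / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / trials + z2 / (4 * trials * trials)) / denom
    )
    return center - half_width, center + half_width


# Assumed sample sizes; the observed rate of 0.944 is the reported ASR.
intervals = {}
for n in (125, 1000):
    k = round(0.944 * n)
    intervals[n] = wilson_interval(k, n)
    low, high = intervals[n]
    print(f"n={n}: {k}/{n} successes, interval [{low:.3f}, {high:.3f}]")
```

The interval narrows as the number of trials grows, which is why the same headline rate is more or less convincing depending on how many triggered inputs were evaluated.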
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, providing the strongest honest defense of the work while committing to revisions where the concerns are valid and actionable.
- Referee: [Abstract] Specific numerical claims (ASR 94.4%, HSR 88.4%, 0.15% parameter updates) are presented without any description of the experimental setup, benchmarks, baselines, trial counts, or statistical tests. That context is load-bearing for assessing whether the reported rates support the central claims of consistent high performance and stealth.
  Authors: We agree that the abstract's conciseness leaves the numerical claims without immediate context on setup. The full manuscript provides these details in the Experiments section, including benchmarks, baselines, and trial statistics. To improve accessibility, we will revise the abstract to include a brief clause referencing the evaluation protocol (e.g., 'across multiple reasoning benchmarks with repeated trials'). This addresses the concern without violating abstract length norms.
  Revision: yes
- Referee: [Method (multi-stage injection pipeline)] The assertion of selective rewiring of attention pathways and perturbation of only key intermediate representations (via conditioning and RCP) with 0.15% updates is not backed by layer-wise probing or representation-shift ablations on clean inputs. Without these, non-local effects on distributed reasoning computations cannot be ruled out, which undermines the selectivity and stealth claims.
  Authors: The manuscript demonstrates selectivity through preserved benign performance on clean inputs alongside high targeted success rates, which is consistent with localized effects from the 0.15% updates and internal conditioning. We acknowledge, however, that dedicated layer-wise probing and representation-shift ablations on clean inputs would provide more direct evidence ruling out non-local effects. We will incorporate these analyses in the revised version to strengthen the selectivity argument.
  Revision: yes
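One way to picture the representation-shift ablation discussed above is to compare, layer by layer, the hidden states that a clean model and a backdoored model produce on the same clean input. The sketch below uses cosine distance on plain Python lists. The activation vectors are toy values, the function names are hypothetical, and extracting real hidden states from a model is deliberately left out.

```python
import math
from typing import List, Sequence


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """1 - cosine similarity; 0 means same direction, 1 means orthogonal."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_u * norm_v)


def layerwise_shift(
    clean_layers: List[Sequence[float]],
    backdoored_layers: List[Sequence[float]],
) -> List[float]:
    """Per-layer cosine distance between two models' activations on one input."""
    if len(clean_layers) != len(backdoored_layers):
        raise ValueError("both models must expose the same number of layers")
    return [cosine_distance(c, b) for c, b in zip(clean_layers, backdoored_layers)]


# Toy activations for three layers; only the middle layer differs.
clean = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
backdoored = [[1.0, 0.0], [1.0, 0.0], [1.0, 1.0]]

shifts = layerwise_shift(clean, backdoored)
most_shifted = max(range(len(shifts)), key=shifts.__getitem__)
print(f"per-layer shift: {shifts}, most shifted layer index: {most_shifted}")
```

A shift concentrated in a few layers on clean inputs would be consistent with the selectivity claim, while a broad shift across layers would point toward the non-local effects the referee raises.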
Circularity Check
No significant circularity in empirical attack framework
The paper describes an empirical backdoor attack method (multi-stage injection pipeline, conditioning on internal states, RL with RCP) and reports direct experimental outcomes such as ASR 94.4% and HSR 88.4% on reasoning benchmarks. No mathematical derivation, first-principles result, or prediction is claimed that reduces by construction to fitted parameters, self-definitions, or self-citation chains. The central claims rest on experimental validation rather than any self-referential reduction.
Forward citations
Cited by 3 Pith papers
- R-CoT: A Reasoning-Layer Watermark via Redundant Chain-of-Thought in Large Language Models
  R-CoT embeds watermarks into LLM reasoning paths via redundant CoT and GRPO-based dual optimization, maintaining over 95% true positive rate under fine-tuning and post-training changes.
- Critical-CoT: A Robust Defense Framework against Reasoning-Level Backdoor Attacks in Large Language Models
  Critical-CoT defends LLMs from reasoning-level backdoor attacks via two-stage fine-tuning that builds automatic detection and refusal of poisoned chain-of-thought steps.
- Beyond Content Safety: Real-Time Monitoring for Reasoning Vulnerabilities in Large Language Models
  An external zero-shot monitor detects nine unsafe reasoning behaviors in LLMs at 87% step-level accuracy with low false positives and low latency.