pith. machine review for the scientific record.

arxiv: 2604.12201 · v1 · submitted 2026-04-14 · 💻 cs.IR

Recognition: unknown

AdversarialCoT: Single-Document Retrieval Poisoning for LLM Reasoning

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 16:17 UTC · model grok-4.3

classification 💻 cs.IR
keywords adversarial attacks · retrieval-augmented generation · LLM reasoning · knowledge base poisoning · chain-of-thought · single-document attack · security vulnerabilities

The pith

A single adversarial document can significantly degrade large language model reasoning in retrieval-augmented generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper shows that retrieval-augmented generation systems can be compromised through the injection of just one malicious document into the knowledge base. The proposed method first extracts the target LLM's reasoning steps, builds an initial adversarial chain-of-thought inside the document, and then refines it through repeated interactions with the model to target its vulnerabilities. A sympathetic reader would care because this reveals how retrieval mechanisms, intended to improve accuracy, can instead amplify targeted attacks with minimal effort from the adversary. Experiments on standard benchmark LLMs demonstrate clear drops in reasoning performance once the poisoned document is retrieved and used.

Core claim

AdversarialCoT is a query-specific attack that poisons only a single document: it extracts the target LLM's reasoning framework to construct an initial adversarial chain-of-thought, then iteratively refines the document through direct interactions with the LLM to progressively expose and exploit critical reasoning vulnerabilities. The result is substantially reduced accuracy whenever the retriever surfaces the document.

What carries the argument

AdversarialCoT, the iterative refinement of a single query-specific adversarial chain-of-thought document that guides the LLM toward flawed reasoning once retrieved.
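
To make that mechanism concrete, here is a minimal, hypothetical sketch of such a refinement loop. Every prompt and helper below (`query_llm`, `extract_reasoning_outline`, the stopping test against an attacker-chosen wrong answer) is an illustrative assumption; the paper's actual prompting, scoring, and stopping criteria are not specified in the material above, and the abstract states only that reasoning accuracy degrades.

```python
# Hypothetical sketch of a single-document adversarial CoT refinement loop.
# All helpers are placeholders for illustration, not the authors' method.

def query_llm(prompt: str) -> str:
    """Black-box call to the target LLM (placeholder)."""
    raise NotImplementedError

def extract_reasoning_outline(question: str) -> list[str]:
    """Elicit the model's own reasoning steps for the target question."""
    outline = query_llm(f"List the steps you would take to answer: {question}")
    return [line.strip() for line in outline.splitlines() if line.strip()]

def craft_adversarial_document(seed_doc: str, question: str,
                               wrong_answer: str, max_rounds: int = 5) -> str:
    """Iteratively refine ONE poisoned document until the model, given only
    that document as context, is steered toward an attacker-chosen wrong
    answer (one plausible reading of 'degraded reasoning accuracy')."""
    steps = extract_reasoning_outline(question)
    doc = seed_doc + "\n" + "\n".join(
        f"Step {i + 1}: {s}" for i, s in enumerate(steps))
    for _ in range(max_rounds):
        answer = query_llm(
            f"Context:\n{doc}\n\nQuestion: {question}\nThink step by step.")
        if wrong_answer.lower() in answer.lower():
            break  # the flawed chain of thought now carries the model off course
        # Ask which step most influenced the answer, then rewrite the document
        # to lean harder on that step in the wrong direction.
        weak_step = query_llm(
            f"Context:\n{doc}\n\nYou answered: {answer}\n"
            "Which single reasoning step most influenced your answer?")
        doc = query_llm(
            f"Rewrite the document so its reasoning more strongly supports "
            f"'{wrong_answer}', focusing on this step: {weak_step}\n\n{doc}")
    return doc
```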

If this is right

  • RAG systems become vulnerable to targeted poisoning attacks that require injecting only one document.
  • LLM reasoning accuracy drops substantially once the crafted adversarial content is retrieved and incorporated.
  • Iterative interactions allow progressive identification and exploitation of subtle weaknesses in the model's reasoning process.
  • The approach exposes security risks and offers insights for building more robust LLM reasoning pipelines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Retrieval systems may need additional checks to detect documents that mimic or distort reasoning chains.
  • This single-document technique could extend to other retrieval-dependent AI applications that rely on external knowledge.
  • Defenses might benefit from requiring multiple independent documents or verifying consistency across retrieved sources.
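
A minimal sketch of that last idea, assuming an answer can be produced per retrieved document: answer the query once per source and trust the result only if a quorum of independently retrieved documents agrees. The `answer_with_context` helper and the quorum threshold are illustrative assumptions, not anything the paper proposes.

```python
from collections import Counter

def answer_with_context(question: str, document: str) -> str:
    """Placeholder: ask the LLM the question using one retrieved document."""
    raise NotImplementedError

def consistency_checked_answer(question: str, retrieved_docs: list[str],
                               quorum: float = 0.6):
    """Accept an answer only when enough independently retrieved documents
    agree, so a single poisoned document can sway at most one 'vote'."""
    if not retrieved_docs:
        return None
    votes = Counter(answer_with_context(question, d) for d in retrieved_docs)
    answer, count = votes.most_common(1)[0]
    if count / len(retrieved_docs) >= quorum:
        return answer
    return None  # abstain: sources disagree, flag the query for review
```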

Load-bearing premise

The attack requires the adversary to interact iteratively with the target LLM to refine the poisoned document and assumes the retriever will reliably return that single document for the chosen query.

What would settle it

Measure whether reasoning accuracy on standard benchmarks drops substantially below the clean-retrieval baseline when a single refined adversarial document is the only one retrieved and consumed during inference.
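
A hedged sketch of that measurement, reusing the `answer_with_context` placeholder from the consistency-check sketch above; the exact-match metric and the clean-versus-poisoned comparison are illustrative assumptions, not the paper's protocol.

```python
def exact_match(prediction: str, gold: str) -> bool:
    return prediction.strip().lower() == gold.strip().lower()

def single_doc_accuracy(questions, gold_answers, doc_for_query) -> float:
    """Accuracy when exactly one document is retrieved and consumed per query.
    Run once with clean documents and once with the refined adversarial ones;
    the core claim holds only if the poisoned run falls well below the clean run."""
    correct = sum(
        exact_match(answer_with_context(q, doc_for_query[q]), gold)
        for q, gold in zip(questions, gold_answers))
    return correct / max(len(questions), 1)
```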

Figures

Figures reproduced from arXiv: 2604.12201 by Hongru Song, Jiafeng Guo, Maarten de Rijke, Ruqing Zhang, Xueqi Cheng, Yixing Fan, Yu-An Liu.

Figure 1: Overview of single-document knowledge-base poisoning.
Figure 2: The attack process of AdversarialCoT.
Figure 3: ASR scaling across iteration rounds for different …
Figure 4: Cross-model generalization heatmap showing the …

original abstract

Retrieval-augmented generation (RAG) enhances large language model (LLM) reasoning by retrieving external documents, but also opens up new attack surfaces. We study knowledge-base poisoning attacks in RAG, where an attacker injects malicious content into the retrieval corpus, which is then naturally surfaced by the retriever and consumed by the LLM during reasoning. Unlike prior work that floods the corpus with poisoned documents, we propose AdversarialCoT, a query-specific attack that poisons only a single document in the corpus. AdversarialCoT first extracts the target LLM's reasoning framework to guide the construction of an initial adversarial chain-of-thought (CoT). The adversarial document is iteratively refined through interactions with the LLM, progressively exposing and exploiting critical reasoning vulnerabilities. Experiments on benchmark LLMs show that a single adversarial document can significantly degrade reasoning accuracy, revealing subtle yet impactful weaknesses. This study exposes security risks in RAG systems and provides actionable insights for designing more robust LLM reasoning pipelines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes AdversarialCoT, a query-specific single-document poisoning attack on RAG systems. It extracts the target LLM's reasoning framework to construct an initial adversarial chain-of-thought, then iteratively refines the poisoned document via interactions with the LLM to exploit reasoning vulnerabilities. The central claim is that injecting only this one document into the corpus causes the retriever to naturally surface it and significantly degrade LLM reasoning accuracy on benchmark tasks.

Significance. If the results hold under realistic retrieval conditions, the work would be significant for RAG security and LLM reasoning robustness, demonstrating that minimal poisoning (one document) can compromise systems where prior attacks required corpus flooding. It exposes subtle integration weaknesses between retrieval and reasoning steps and offers insights for defenses. The empirical iterative refinement process is a strength, as it is presented as a direct construction without free parameters or circular derivations.

major comments (2)
  1. [Experimental Evaluation] The experimental evaluation reports accuracy degradation from the single adversarial document but supplies no retrieval metrics (e.g., rank, hit rate at k=1, or success rate of surfacing the document for target queries). This is load-bearing for the central claim, as the attack presupposes that the retriever will naturally surface the single poisoned document in the RAG pipeline; without these metrics, the causal connection between injection and observed degradation cannot be verified, especially in large corpora with approximate nearest-neighbor search.
  2. [AdversarialCoT Construction] The AdversarialCoT construction (§3) relies on iterative interactions with the target LLM to refine the document and expose vulnerabilities. The paper should explicitly state the threat model (e.g., black-box query access only) and typical iteration counts, as this assumption directly affects whether the attack is practical outside controlled settings.
minor comments (2)
  1. [Abstract] The abstract asserts significant accuracy degradation without any numerical results, baselines, or error bars; including a concise quantitative summary would improve completeness.
  2. Define notation for the adversarial document and CoT components consistently upon first use to prevent ambiguity in the method description.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. The comments highlight important aspects of empirical validation and threat model clarity that will strengthen the manuscript. We address each point below and will incorporate the suggested revisions.

point-by-point responses
  1. Referee: [Experimental Evaluation] The experimental evaluation reports accuracy degradation from the single adversarial document but supplies no retrieval metrics (e.g., rank, hit rate at k=1, or success rate of surfacing the document for target queries). This is load-bearing for the central claim, as the attack presupposes that the retriever will naturally surface the single poisoned document in the RAG pipeline; without these metrics, the causal connection between injection and observed degradation cannot be verified, especially in large corpora with approximate nearest-neighbor search.

    Authors: We agree that retrieval metrics are essential to verify the attack's mechanism and establish the causal link between document injection and reasoning degradation. In the revised manuscript, we will add a dedicated analysis subsection reporting the average retrieval rank of the adversarial document, hit rate at k=1, and surfacing success rates across queries, models, and varying corpus sizes (including approximate nearest-neighbor settings). These metrics will be computed on the same experimental setups as the accuracy results to directly support the central claim. revision: yes

  2. Referee: [AdversarialCoT Construction] The AdversarialCoT construction (§3) relies on iterative interactions with the target LLM to refine the document and expose vulnerabilities. The paper should explicitly state the threat model (e.g., black-box query access only) and typical iteration counts, as this assumption directly affects whether the attack is practical outside controlled settings.

    Authors: We concur that explicit threat model details and iteration statistics improve reproducibility and practicality assessment. We will revise Section 3 to clearly state the black-box threat model (query-only access to the target LLM for construction and refinement, with no access to retrieval internals, corpus, or model weights). We will also report the observed iteration counts from our experiments, including averages and ranges per model and task, to quantify the construction effort. revision: yes
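
Both responses hinge on quantities that are straightforward to measure. Below is a minimal sketch of the retrieval-side metrics promised in the first response, assuming a cosine-similarity dense retriever over precomputed embeddings; the embedding model, corpus, and any approximate nearest-neighbor settings are placeholders rather than the paper's setup.

```python
import numpy as np

def poisoned_doc_retrieval_metrics(query_vec: np.ndarray,
                                   corpus_vecs: np.ndarray,
                                   poisoned_idx: int, k: int = 1) -> dict:
    """Rank of the poisoned document and whether it surfaces in the top-k
    under cosine-similarity dense retrieval (illustrative setup only)."""
    q = query_vec / np.linalg.norm(query_vec)
    c = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    scores = c @ q                   # one score per corpus document
    order = np.argsort(-scores)      # best-scoring document first
    rank = int(np.where(order == poisoned_idx)[0][0]) + 1
    return {"rank": rank, f"hit@{k}": rank <= k}
```

Averaged over target queries and corpus sizes, the rank and hit@1 values are exactly the numbers the referee asks for.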

Circularity Check

0 steps flagged

No circularity detected in empirical attack construction

full rationale

The paper presents AdversarialCoT as an iterative, query-specific empirical procedure for building one poisoned document via LLM interactions to expose reasoning flaws. No mathematical derivation, fitted parameters renamed as predictions, or self-citation chains are invoked to support the central claim. The reported accuracy degradation on benchmark LLMs stands as an external experimental outcome rather than a tautology or reduction to inputs by construction. The method is self-contained and does not rely on any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on two domain assumptions: that the retriever will surface the crafted document, and that iterative LLM interaction is feasible for the attacker. The abstract introduces no free parameters; its one invented entity, the AdversarialCoT document, comes with no independent evidence.

axioms (2)
  • domain assumption: The retriever will surface the single poisoned document for the target query
    Required for the attack to succeed as described
  • domain assumption: Iterative interaction with the target LLM is available to the attacker for refinement
    Central to the construction process outlined
invented entities (1)
  • AdversarialCoT document · no independent evidence
    purpose: A single poisoned document containing a crafted adversarial chain-of-thought
    The core artifact of the proposed attack

pith-pipeline@v0.9.0 · 5484 in / 1277 out tokens · 29480 ms · 2026-05-10T16:17:48.901095+00:00 · methodology

