pith. machine review for the scientific record.

arxiv: 2604.07403 · v1 · submitted 2026-04-08 · 💻 cs.CR

Recognition: 2 Lean theorem links

RefineRAG: Word-Level Poisoning Attacks via Retriever-Guided Text Refinement

Ziye Wang , Guanyu Wang , Kailong Wang

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 18:08 UTC · model grok-4.3

classification 💻 cs.CR
keywords RAG poisoning · retrieval-augmented generation · word-level attacks · adversarial refinement · black-box transfer · knowledge poisoning · LLM security

The pith

RefineRAG treats RAG poisoning as word-level refinement to create effective yet natural toxic documents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current poisoning attacks on retrieval-augmented generation rely on coarse separate-and-concatenate insertion, so the harmful content stands out. RefineRAG instead generates toxic seeds and then refines them word by word while consulting the retriever itself. The two-stage process produces documents that rank highly for target queries yet contain few grammar or repetition issues. On the Natural Questions dataset the method reaches 90 percent attack success, and the same poisons work against black-box systems when optimized only on a proxy.

Core claim

RefineRAG frames knowledge poisoning as a single word-level refinement task rather than a coarse insert-and-concatenate operation. Macro Generation first creates toxic seed texts guaranteed to elicit chosen answers. Micro Refinement then iteratively adjusts individual words under a retriever-in-the-loop objective that raises retrieval score while keeping surface form natural. The resulting attacks outperform prior baselines on both effectiveness and stealth metrics and transfer across retriever boundaries.
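The retriever-in-the-loop refinement described above can be sketched in miniature. The following is an illustrative toy, not the authors' implementation: a bag-of-words cosine score stands in for the dense retriever, and `micro_refine`, the `candidates` synonym pool, and the example texts are all assumptions for demonstration.

```python
from collections import Counter
import math

def score(query, doc):
    """Cosine similarity over bag-of-words counts: a stand-in for a dense retriever."""
    q, d = Counter(query.lower().split()), Counter(doc.lower().split())
    dot = sum(q[w] * d[w] for w in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(sum(v * v for v in d.values()))
    return dot / norm if norm else 0.0

def micro_refine(query, seed, candidates, rounds=3):
    """Greedily swap single words for alternatives that raise the retrieval score."""
    words = seed.split()
    for _ in range(rounds):
        improved = False
        for i in range(len(words)):
            for alt in candidates.get(words[i].lower(), []):
                trial = words[:i] + [alt] + words[i + 1:]
                # Keep a swap only if it strictly improves the retriever's score.
                if score(query, " ".join(trial)) > score(query, " ".join(words)):
                    words, improved = trial, True
        if not improved:
            break
    return " ".join(words)

# Toy poisoning scenario: push an incorrect answer while gaining retrieval rank.
query = "who wrote the moonlight sonata"
seed = "The famous piano piece was composed by Mozart."
candidates = {"famous": ["moonlight"], "piece": ["sonata"]}  # illustrative synonym pool
refined = micro_refine(query, seed, candidates)
```

Each accepted swap preserves sentence structure, which is the intuition behind the stealth claim: the document's surface form changes one word at a time while its retrieval score climbs.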

What carries the argument

The RefineRAG two-stage framework: Macro Generation for guaranteed toxic seeds followed by Micro Refinement that uses retriever feedback to optimize retrieval priority without sacrificing naturalness.

If this is right

  • Attacks reach 90 percent success on NQ while producing fewer grammar errors than existing methods.
  • Refined poisons transfer from proxy retrievers to black-box victim systems.
  • Coarse separate-and-concatenate poisoning strategies are both less effective and easier to spot.
  • Word-level changes allow toxic content to rank highly without obvious artifacts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Detection tools may need to inspect token-level optimization traces rather than obvious insertions.
  • Open retrievers can serve as safe proxies for crafting attacks on closed commercial systems.
  • RAG pipelines using public web indexes become more exposed once word-level refinement is known.
  • Defenses could monitor sudden improvements in retrieval scores for documents that remain stylistically ordinary.
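The last defense idea above can be made concrete as a hedged sketch. The snapshot format, `flag_suspicious`, and both thresholds are illustrative assumptions, not anything the paper specifies.

```python
def flag_suspicious(prev, curr, score_jump=0.15, style_delta=0.02):
    """Flag documents whose retrieval score jumps between index snapshots
    while a surface-style anomaly measure stays essentially flat.

    prev/curr map doc_id -> (retrieval_score, style_anomaly in [0, 1])."""
    flagged = []
    for doc_id, (new_score, new_style) in curr.items():
        # Unseen documents get a zero delta and are not flagged on first sight.
        old_score, old_style = prev.get(doc_id, (new_score, new_style))
        if new_score - old_score >= score_jump and abs(new_style - old_style) <= style_delta:
            flagged.append(doc_id)
    return flagged

prev = {"d1": (0.40, 0.10), "d2": (0.55, 0.12)}
curr = {"d1": (0.62, 0.11), "d2": (0.56, 0.12), "d3": (0.50, 0.05)}
```

Here `d1` gains 0.22 in retrieval score with almost no stylistic change, which is exactly the signature word-level refinement would leave.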

Load-bearing premise

Micro refinement can raise a document's retrieval rank while leaving no detectable traces of optimization and without needing direct access to the victim retriever.
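A minimal illustration of that premise, assuming toy word-overlap scorers in place of real retrievers: a document reworded against a proxy scorer also gains under a differently weighted "victim" scorer, because the lexical changes carry over. All names and texts below are invented for the sketch.

```python
def overlap_score(query, doc, weights=None):
    """Weighted count of query terms appearing in the document."""
    qs, ds = set(query.lower().split()), set(doc.lower().split())
    w = weights or {}
    return sum(w.get(t, 1.0) for t in qs & ds)

query = "capital of france"
seed = "Paris is a large city in Europe."
refined = "The capital of France is Paris, a large city."  # proxy-optimized wording

# Gain under the proxy scorer (uniform weights).
proxy_gain = overlap_score(query, refined) - overlap_score(query, seed)

# The "victim" weights terms differently, but the lexical gains still transfer.
victim_weights = {"capital": 2.0, "france": 2.0}
victim_gain = (overlap_score(query, refined, victim_weights)
               - overlap_score(query, seed, victim_weights))
```

The sketch only shows why lexical transfer is plausible; whether it leaves no detectable optimization trace is the open part of the premise.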

What would settle it

Run the refined documents through a production RAG pipeline and check whether attack success rate remains near 90 percent on NQ while grammar-error and repetition counts stay at or below baseline levels.
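That experiment could be harnessed roughly as follows. Everything here is hypothetical scaffolding: `rag_answer` stands in for a call to a production pipeline, and the repetition metric is a crude proxy for the paper's grammar and repetition measures.

```python
def repetition_rate(text):
    """Fraction of tokens that are repeats: a crude naturalness proxy."""
    toks = text.lower().split()
    return 1 - len(set(toks)) / len(toks) if toks else 0.0

def evaluate(cases, rag_answer):
    """cases: (query, target_answer, poisoned_doc) triples.
    Returns (attack success rate, worst repetition rate over poisons)."""
    hits = sum(1 for q, target, _ in cases if target.lower() in rag_answer(q).lower())
    worst_rep = max(repetition_rate(doc) for _, _, doc in cases)
    return hits / len(cases), worst_rep

cases = [
    ("who wrote hamlet", "Bacon", "Hamlet was in fact written by Bacon."),
    ("capital of france", "Lyon", "The capital of France is Lyon."),
]
# Stand-in pipeline: only the first poison succeeds in steering the answer.
asr, worst_rep = evaluate(cases, lambda q: "Bacon" if "hamlet" in q else "Paris")
```

The settling criterion would then be: ASR near 0.9 on NQ with repetition and grammar-error counts at or below the baselines' values.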

Figures

Figures reproduced from arXiv: 2604.07403 by Ziye Wang, Guanyu Wang, Kailong Wang.

Figure 1. The overall framework of RefineRAG.
Figure 2. Impact of Retrieval Scope (k) on NQ: an excessive number of retrieved benign documents dilutes the adversarial context.
read the original abstract

Retrieval-Augmented Generation (RAG) significantly enhances Large Language Models (LLMs), but simultaneously exposes a critical vulnerability to knowledge poisoning attacks. Existing attack methods like PoisonedRAG remain detectable due to coarse-grained separate-and-concatenate strategies. To bridge this gap, we propose RefineRAG, a novel framework that treats poisoning as a holistic word-level refinement problem. It operates in two stages: Macro Generation produces toxic seeds guaranteed to induce target answers, while Micro Refinement employs a retriever-in-the-loop optimization to maximize retrieval priority without compromising naturalness. Evaluations on NQ and MSMARCO demonstrate that RefineRAG achieves state-of-the-art effectiveness, securing a 90% Attack Success Rate on NQ, while registering the lowest grammar errors and repetition rates among all baselines. Crucially, our proxy-optimized attacks successfully transfer to black-box victim systems, highlighting a severe practical threat.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript proposes RefineRAG, a novel two-stage framework for word-level poisoning attacks on Retrieval-Augmented Generation (RAG) systems. The macro generation stage creates toxic seeds that induce target answers, and the micro refinement stage uses a retriever-in-the-loop optimization to maximize retrieval priority while maintaining naturalness. On the Natural Questions (NQ) and MSMARCO datasets, it achieves a 90% Attack Success Rate (ASR) on NQ, with the lowest grammar errors and repetition rates among baselines, and demonstrates transferability to black-box victim systems.

Significance. If the reported results hold under rigorous controls, this work significantly advances the understanding of stealthy poisoning attacks in RAG systems by showing that retriever-guided word-level refinements can achieve high effectiveness and naturalness simultaneously. The proxy-based optimization enabling black-box transfer highlights a practical vulnerability that could impact real-world deployments of RAG-enhanced LLMs. The empirical evaluation on public datasets provides a clear benchmark for future defenses.

major comments (2)
  1. §4.2: The attack success rate of 90% on NQ is reported without accompanying details on the number of queries, variance across runs, or statistical significance tests; this is load-bearing for the SOTA claim and should be expanded to allow verification of the effectiveness.
  2. §3.2: The micro-refinement stage's optimization objective, which balances retrieval priority and naturalness, lacks explicit formulation of the loss function or hyperparameter tuning procedure; without this, the reproducibility of the naturalness improvements and transfer success is limited.
minor comments (3)
  1. Abstract: The phrase 'lowest grammar errors and repetition rates' should specify the exact metrics and tools used for quantification to aid immediate understanding.
  2. §5: The discussion on limitations could be expanded to address potential detection methods by defenders, such as anomaly detection on refined texts.
  3. Table 1: Ensure all baseline methods are cited with their original papers for completeness.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their positive assessment and constructive comments. We address each major comment below and will revise the manuscript accordingly to improve clarity and reproducibility.

read point-by-point responses
  1. Referee: §4.2: The attack success rate of 90% on NQ is reported without accompanying details on the number of queries, variance across runs, or statistical significance tests; this is load-bearing for the SOTA claim and should be expanded to allow verification of the effectiveness.

    Authors: We agree that additional details are necessary for verification. In the revised manuscript, we will expand §4.2 to report the exact number of queries used in our evaluation, the variance observed across multiple runs, and the results of statistical significance tests comparing against baselines. These details were part of our experimental protocol but omitted for brevity; their inclusion will allow full reproducibility of the effectiveness claims. revision: yes

  2. Referee: §3.2: The micro-refinement stage's optimization objective, which balances retrieval priority and naturalness, lacks explicit formulation of the loss function or hyperparameter tuning procedure; without this, the reproducibility of the naturalness improvements and transfer success is limited.

    Authors: We acknowledge that the optimization objective in the micro-refinement stage requires a more explicit description. We will revise §3.2 to include the precise loss function formulation that combines retrieval priority (via negative retrieval rank or embedding similarity) and naturalness (via language model perplexity), along with the hyperparameter values and the grid-search tuning procedure performed on a validation subset. This addition will directly address the reproducibility concerns for the naturalness and transfer results. revision: yes
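For readers wanting the shape of such an objective before the revision lands, here is a hedged sketch of a weighted combination of retrieval similarity and a naturalness penalty. The unigram log-probability table approximating language-model perplexity and the weight `lam` are invented placeholders, not the paper's formulation or tuned values.

```python
import math

def pseudo_perplexity(text, unigram_logprob):
    """Unigram stand-in for LM perplexity; unknown tokens get a harsh log-prob."""
    toks = text.lower().split()
    return math.exp(-sum(unigram_logprob.get(t, -10.0) for t in toks) / len(toks))

def objective(sim, text, unigram_logprob, lam=0.01):
    """Higher is better: retrieval similarity minus a scaled naturalness penalty."""
    return sim - lam * pseudo_perplexity(text, unigram_logprob)

# Toy log-probability table for demonstration.
lp = {"the": -1.0, "capital": -4.0, "of": -1.5, "france": -5.0, "is": -1.2, "paris": -6.0}

# Equal retrieval similarity, very different naturalness: the penalty decides.
natural = objective(0.8, "the capital of france is paris", lp)
gibberish = objective(0.8, "zxqv wqpt france capital", lp)
```

Under this kind of objective, a word swap that raises retrieval similarity is rejected if it costs too much fluency, which is the trade-off the rebuttal promises to formalize.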

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper presents an empirical attack construction (two-stage macro toxic seed generation followed by retriever-in-the-loop micro-refinement) evaluated on public datasets NQ and MSMARCO. No mathematical derivations, first-principles predictions, fitted parameters renamed as outputs, or load-bearing self-citations appear in the described method or evaluation protocol. The attack success rate, grammar error, and repetition metrics are reported as comparative empirical results without reducing to input definitions or self-referential fits. The proxy-based black-box transfer follows directly from the surrogate retriever setup and does not invoke uniqueness theorems or ansatzes from prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical derivations or background axioms are stated in the abstract. The work relies on standard assumptions in adversarial ML (retrievers can be queried, naturalness can be measured by grammar and repetition metrics) but introduces no new free parameters or invented entities visible here.

pith-pipeline@v0.9.0 · 5456 in / 1242 out tokens · 32045 ms · 2026-05-10T18:08:02.440877+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?

matches: the paper's claim is directly supported by a theorem in the formal canon.
supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: the paper appears to rely on the theorem as machinery.
contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

31 extracted references · 22 canonical work pages · 6 internal anchors

  1. Alkhalaf, M., Yu, P., Yin, M., Deng, C.: Applying generative AI with retrieval augmented generation to summarize and extract key clinical information from electronic health records. Journal of Biomedical Informatics 156, 104662 (2024)

  2. Bajaj, P., Campos, D., Craswell, N., Deng, L., Gao, J., Liu, X., Majumder, R., McNamara, A., Mitra, B., Nguyen, T., Wang, S., Wang, X.: MS MARCO: A human generated dataset for research on machine reading comprehension and question answering (2016), https://arxiv.org/abs/1611.09268

  3. Carlini, N., Tramer, F., Wallace, E., Jagielski, M., Herbert-Voss, A., Lee, K., Roberts, A., Brown, T., Song, D., Erlingsson, U., Oprea, A., Raffel, C.: Poisoning web-scale training datasets are easier than you might think. In: Proceedings of the IEEE Symposium on Security and Privacy (S&P). pp. 1369–1387 (2023). https://doi.org/10.1109/SP49137.2023....

  4. Chen, J., Lin, H., Han, X., Sun, L.: Benchmarking large language models in retrieval-augmented generation. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 38, pp. 16715–16723 (2024), https://arxiv.org/abs/2311.16109

  5. Chiang, W.L., Li, Z., Lin, Z., Sheng, Y., Wu, Z., Zhang, H., Zheng, L., Zhuang, S., Zhuang, Y., Gonzalez, J.E., Stoica, I., Xing, E.P.: Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality (2023), https://vicuna.lmsys.org/

  6. DeepSeek-AI: DeepSeek LLM: Scaling open-source language models with reinforcement learning (2024), https://arxiv.org/abs/2401.02954

  7. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). pp. 4171–4186. Association for Computational...

  8. Ebrahimi, J., Rao, A., Lowd, D., Dou, D.: HotFlip: White-box adversarial examples for text classification. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). pp. 382–387. Association for Computational Linguistics (2018). https://doi.org/10.18653/v1/P18-2061, https://aclanthology.org/P18-2061

  9. Edge, D., Trinh, H., Cheng, N., Bradley, J., Chao, A., Mody, A., Truitt, S., Metropolitansky, D., Ness, R.O., Larson, J.: From local to global: A graph RAG approach to query-focused summarization (2024), https://arxiv.org/abs/2404.16130

  10. Honnibal, M., Montani, I., Van Landeghem, S., Boyd, A., et al.: spaCy: Industrial-strength natural language processing in Python (2020)

  11. Huang, Y., Gupta, S., Xia, M., Li, K., Chen, D.: Catastrophic jailbreak of open-source LLMs via exploiting generation (2023), https://arxiv.org/abs/2310.06987

  12. Izacard, G., Caron, M., Hosseini, L., Riedel, S., Bojanowski, P., Joulin, A., Grave, E.: Contriever: Improving contrastive learning for unsupervised text retrieval. In: Proceedings of the 39th International Conference on Machine Learning (ICML). Proceedings of Machine Learning Research, vol. 162, pp. 9745–9758. PMLR (2022), https://proceedings.mlr.press/v...

  13. Izacard, G., Caron, M., Hosseini, L., Riedel, S., Bojanowski, P., Joulin, A., Grave, E.: Unsupervised dense information retrieval with contrastive learning. Transactions on Machine Learning Research (2022), https://openreview.net/forum?id=kXwdL1cWO5

  14. Jin, D., Jin, Z., Zhou, J.T., Szolovits, P.: Is BERT really robust? A strong baseline for natural language attack on text classification and entailment. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 34, pp. 8018–8025 (2020). https://doi.org/10.1609/aaai.v34i05.6304, https://ojs.aaai.org/index.php/AAAI/article/view/6304

  15. Kwiatkowski, T., Palomaki, J., Redfield, O., Collins, M., Parikh, A., Alberti, C., Epstein, D., Polosukhin, I., Devlin, J., Lee, K.: Natural questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics 7, 453–466 (2019). https://doi.org/10.1162/tacl_a_00276, https://aclanthology.org/Q19-1026

  16. Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W.t., Rocktäschel, T., Riedel, S., Kiela, D.: Retrieval-augmented generation for knowledge-intensive NLP tasks. In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H. (eds.) Advances in Neural Information Processing Systems. vol. 33, pp. 9459...

  17. Li, C., Zhang, J., Cheng, A., Ma, Z., Li, X., Ma, J.: CPA-RAG: Covert poisoning attacks on retrieval-augmented generation in large language models (2025), https://arxiv.org/abs/2505.19864

  18. Li, L., Ma, R., Guo, Q., Xue, X., Qiu, X.: BERT-Attack: Adversarial attack against BERT using BERT (2020), https://arxiv.org/abs/2004.09984

  19. Perez, F., Ribeiro, I.: Ignore previous prompt: Attack techniques for language models (2022), https://arxiv.org/abs/2211.09527

  20. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al.: Language models are unsupervised multitask learners. OpenAI Blog 1(8), 9 (2019)

  21. Rizqullah, M.R., Purwarianti, A., Aji, A.F.: QASiNa: Religious domain question answering using Sirah Nabawiyah (2023), https://arxiv.org/abs/2310.08102

  22. Salemi, A., Zamani, H.: Evaluating retrieval quality in retrieval-augmented generation. In: Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval. pp. 2185–2189 (2024). https://doi.org/10.1145/3626772.3657754, https://dl.acm.org/doi/abs/10.1145/3626772.3657754

  23. Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al.: Llama 2: Open foundation and fine-tuned chat models (2023), https://arxiv.org/abs/2307.09288

  24. Wang, G., Li, Y., Liu, Y., Deng, G., Li, T., Xu, G., Liu, Y., Wang, H., Wang, K.: MeTMaP: Metamorphic testing for detecting false vector matching problems in LLM augmented generation. In: Proceedings of the 2024 IEEE/ACM First International Conference on AI Foundation Models and Software Engineering (FORGE). pp. 12–23 (2024). https://doi.org/10.1145/3650...

  25. Wei, A., Haghtalab, N., Steinhardt, J.: Jailbroken: How does LLM safety training fail? (2023), https://arxiv.org/abs/2307.02483

  26. Xiong, L., Xiong, C., Li, Y., Tang, K.F., Liu, J., Bennett, P.N., Ahmed, J., Overwijk, A.: Approximate nearest neighbor negative contrastive learning for dense text retrieval. In: International Conference on Learning Representations (ICLR) (2021), https://openreview.net/forum?id=zeFrfgyZln

  27. Yepes, A.J., You, Y., Milczek, J., Laverde, S., Li, R.: Financial report chunking for effective retrieval augmented generation (2024), https://arxiv.org/abs/2402.05131

  28. Zhang, B., Yang, H., Zhou, T., Babar, M.A., Liu, X.Y.: Enhancing financial large language models with retrieval-augmented generation (2023), https://arxiv.org/abs/2308.14081

  29. Zhao, X., Liu, S., Yang, S.Y., Miao, C.: MedRAG: Improving medical diagnosis with retrieval-augmented generation (2023), https://arxiv.org/abs/2306.02322

  30. Zhong, Z., Huang, Z., Wettig, A., Chen, D.: Poisoning retrieval corpora: How to mislead retrieval-augmented generation. In: International Conference on Learning Representations (ICLR) (2024), https://openreview.net/forum?id=1EB1fSj23k

  31. Zou, W., Geng, R., Wang, B., Jia, J.: PoisonedRAG: Knowledge corruption attacks to retrieval-augmented generation of large language models (2024), https://arxiv.org/abs/2402.07867