Recognition: 2 Lean theorem links
RefineRAG: Word-Level Poisoning Attacks via Retriever-Guided Text Refinement
Pith reviewed 2026-05-10 18:08 UTC · model grok-4.3
The pith
RefineRAG treats RAG poisoning as word-level refinement to create effective yet natural toxic documents.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
RefineRAG frames knowledge poisoning as a single word-level refinement task rather than a coarse separate-and-concatenate operation. Macro Generation first creates toxic seed texts guaranteed to elicit the attacker's chosen answers. Micro Refinement then iteratively adjusts individual words under a retriever-in-the-loop objective that raises retrieval score while keeping the surface form natural. The resulting attacks outperform prior baselines on both effectiveness and stealth metrics, and they transfer across retriever boundaries.
What carries the argument
The RefineRAG two-stage framework: Macro Generation for guaranteed toxic seeds followed by Micro Refinement that uses retriever feedback to optimize retrieval priority without sacrificing naturalness.
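The loop this framework implies can be sketched in miniature. Everything below is an illustrative reconstruction, not the authors' code: `retrieval_score` stands in for the proxy retriever's similarity, `naturalness_ok` for a language-model naturalness check, and `candidates` for an MLM's word-replacement proposals.

```python
# Illustrative sketch of retriever-in-the-loop word-level refinement.
# All components are hypothetical stand-ins for the paper's machinery.

def retrieval_score(doc: str, query: str) -> float:
    """Proxy retriever similarity: here, naive word overlap with the query."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

def naturalness_ok(doc: str) -> bool:
    """Placeholder naturalness check; a real attack would use LM perplexity."""
    words = doc.split()
    return len(set(words)) / max(len(words), 1) > 0.5  # crude repetition guard

def micro_refine(seed: str, query: str,
                 candidates: dict[str, list[str]], steps: int = 10) -> str:
    """Greedily swap single words to raise retrieval score, keeping naturalness."""
    doc, best = seed, retrieval_score(seed, query)
    for _ in range(steps):
        improved = False
        for i, word in enumerate(doc.split()):
            for sub in candidates.get(word, []):
                trial_words = doc.split()
                trial_words[i] = sub
                trial = " ".join(trial_words)
                score = retrieval_score(trial, query)
                if score > best and naturalness_ok(trial):
                    doc, best, improved = trial, score, True
        if not improved:
            break
    return doc
```

For example, `micro_refine("zorbia is the answer that people expect", "what is the capital of france", {"answer": ["capital"], "people": ["france"]})` swaps two words to raise query overlap while preserving the toxic payload; the paper's actual refinement uses dense-retriever scores and MLM proposals rather than these toy surrogates.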
If this is right
- Attacks reach 90 percent success on NQ while producing fewer grammar errors than existing methods.
- Refined poisons transfer from proxy retrievers to black-box victim systems.
- Coarse separate-and-concatenate poisoning strategies are both less effective and easier to spot.
- Word-level changes allow toxic content to rank highly without obvious artifacts.
Where Pith is reading between the lines
- Detection tools may need to inspect token-level optimization traces rather than obvious insertions.
- Open retrievers can serve as safe proxies for crafting attacks on closed commercial systems.
- RAG pipelines using public web indexes become more exposed once word-level refinement is known.
- Defenses could monitor sudden improvements in retrieval scores for documents that remain stylistically ordinary.
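The last point suggests a concrete monitor. The sketch below is a hypothetical defense, not something from the paper; the `score_jump` and `max_edits` thresholds are invented for illustration.

```python
# Hypothetical defense sketch: flag documents whose retrieval score for a query
# jumps sharply between index snapshots while their surface text barely changes.
# Thresholds are illustrative assumptions, not tuned values.

def edit_ratio(old: str, new: str) -> float:
    """Fraction of word positions that changed between two versions."""
    a, b = old.split(), new.split()
    changed = sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))
    return changed / max(len(a), len(b), 1)

def flag_suspicious(score_before: float, score_after: float,
                    text_before: str, text_after: str,
                    score_jump: float = 0.2, max_edits: float = 0.15) -> bool:
    """Large retrieval-score gain from a near-identical rewrite is suspicious."""
    return (score_after - score_before) > score_jump and \
           edit_ratio(text_before, text_after) <= max_edits
```

A document whose score rises from 0.3 to 0.6 after changing one word in ten would be flagged; an ordinary re-crawl with a small score drift would not.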
Load-bearing premise
Micro refinement can raise a document's retrieval rank while leaving no detectable traces of optimization and without needing direct access to the victim retriever.
What would settle it
Run the refined documents through a production RAG pipeline and check whether attack success rate remains near 90 percent on NQ while grammar-error and repetition counts stay at or below baseline levels.
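The proposed check could be harnessed as follows; `rag_answer` and the toy cases are hypothetical stand-ins for the production pipeline and the poisoned query set.

```python
# Minimal sketch of the proposed check: attack success rate (ASR) over a set of
# poisoned queries. `rag_answer` is a hypothetical stand-in for the victim
# pipeline; real verification would call a production RAG system.

def attack_success_rate(cases, rag_answer) -> float:
    """Fraction of queries whose pipeline answer contains the attacker's target."""
    hits = sum(1 for query, target in cases
               if target.lower() in rag_answer(query).lower())
    return hits / len(cases)

# Toy run with a stubbed pipeline standing in for the real system:
cases = [("who wrote hamlet", "bacon"), ("capital of france", "lyon")]
stub = {"who wrote hamlet": "Francis Bacon wrote it.", "capital of france": "Paris."}
asr = attack_success_rate(cases, lambda q: stub[q])  # 1 of 2 targets elicited
```

Grammar-error and repetition counts would be tallied separately over the refined documents and compared against the unpoisoned baseline corpus.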
Original abstract
Retrieval-Augmented Generation (RAG) significantly enhances Large Language Models (LLMs), but simultaneously exposes a critical vulnerability to knowledge poisoning attacks. Existing attack methods like PoisonedRAG remain detectable due to coarse-grained separate-and-concatenate strategies. To bridge this gap, we propose RefineRAG, a novel framework that treats poisoning as a holistic word-level refinement problem. It operates in two stages: Macro Generation produces toxic seeds guaranteed to induce target answers, while Micro Refinement employs a retriever-in-the-loop optimization to maximize retrieval priority without compromising naturalness. Evaluations on NQ and MSMARCO demonstrate that RefineRAG achieves state-of-the-art effectiveness, securing a 90% Attack Success Rate on NQ, while registering the lowest grammar errors and repetition rates among all baselines. Crucially, our proxy-optimized attacks successfully transfer to black-box victim systems, highlighting a severe practical threat.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes RefineRAG, a novel two-stage framework for word-level poisoning attacks on Retrieval-Augmented Generation (RAG) systems. The macro generation stage creates toxic seeds that induce target answers, and the micro refinement stage uses a retriever-in-the-loop optimization to maximize retrieval priority while maintaining naturalness. On the Natural Questions (NQ) and MSMARCO datasets, it achieves a 90% Attack Success Rate (ASR) on NQ, with the lowest grammar errors and repetition rates among baselines, and demonstrates transferability to black-box victim systems.
Significance. If the reported results hold under rigorous controls, this work significantly advances the understanding of stealthy poisoning attacks in RAG systems by showing that retriever-guided word-level refinements can achieve high effectiveness and naturalness simultaneously. The proxy-based optimization enabling black-box transfer highlights a practical vulnerability that could impact real-world deployments of RAG-enhanced LLMs. The empirical evaluation on public datasets provides a clear benchmark for future defenses.
major comments (2)
- [§4.2] The attack success rate of 90% on NQ is reported without accompanying details on the number of queries, variance across runs, or statistical significance tests; this is load-bearing for the SOTA claim and should be expanded to allow verification of the effectiveness.
- [§3.2] The micro-refinement stage's optimization objective, which balances retrieval priority and naturalness, lacks explicit formulation of the loss function or hyperparameter tuning procedure; without this, the reproducibility of the naturalness improvements and transfer success is limited.
minor comments (3)
- [Abstract] The phrase 'lowest grammar errors and repetition rates' should specify the exact metrics and tools used for quantification to aid immediate understanding.
- [§5] The discussion on limitations could be expanded to address potential detection methods by defenders, such as anomaly detection on refined texts.
- [Table 1] Ensure all baseline methods are cited with their original papers for completeness.
Simulated Author's Rebuttal
We thank the referee for their positive assessment and constructive comments. We address each major comment below and will revise the manuscript accordingly to improve clarity and reproducibility.
Point-by-point responses
-
Referee: [§4.2] The attack success rate of 90% on NQ is reported without accompanying details on the number of queries, variance across runs, or statistical significance tests; this is load-bearing for the SOTA claim and should be expanded to allow verification of the effectiveness.
Authors: We agree that additional details are necessary for verification. In the revised manuscript, we will expand §4.2 to report the exact number of queries used in our evaluation, the variance observed across multiple runs, and the results of statistical significance tests comparing against baselines. These details were part of our experimental protocol but omitted for brevity; their inclusion will allow full reproducibility of the effectiveness claims. revision: yes
-
Referee: [§3.2] The micro-refinement stage's optimization objective, which balances retrieval priority and naturalness, lacks explicit formulation of the loss function or hyperparameter tuning procedure; without this, the reproducibility of the naturalness improvements and transfer success is limited.
Authors: We acknowledge that the optimization objective in the micro-refinement stage requires a more explicit description. We will revise §3.2 to include the precise loss function formulation that combines retrieval priority (via negative retrieval rank or embedding similarity) and naturalness (via language model perplexity), along with the hyperparameter values and the grid-search tuning procedure performed on a validation subset. This addition will directly address the reproducibility concerns for the naturalness and transfer results. revision: yes
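One plausible reading of the objective the authors describe (embedding similarity for retrieval priority, language-model perplexity for naturalness) is sketched below; the weighting `lam` and both scoring inputs are assumptions for illustration, not the paper's actual formulation.

```python
# Hedged sketch of a combined refinement objective: reward embedding similarity
# to the query, penalize LM perplexity. `lam` and both inputs are assumptions.
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def refinement_objective(doc_emb, query_emb, perplexity, lam=0.1):
    """Higher is better: retrieval similarity minus a naturalness penalty."""
    return cosine(doc_emb, query_emb) - lam * math.log(perplexity)
```

Under this reading, a word swap is accepted only if it increases the objective, so any gain in retrieval similarity must outweigh the fluency cost it incurs.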
Circularity Check
No significant circularity
Full rationale
The paper presents an empirical attack construction (two-stage macro toxic seed generation followed by retriever-in-the-loop micro-refinement) evaluated on public datasets NQ and MSMARCO. No mathematical derivations, first-principles predictions, fitted parameters renamed as outputs, or load-bearing self-citations appear in the described method or evaluation protocol. The attack success rate, grammar error, and repetition metrics are reported as comparative empirical results without reducing to input definitions or self-referential fits. The proxy-based black-box transfer follows directly from the surrogate retriever setup and does not invoke uniqueness theorems or ansatzes from prior author work.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · tag: unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Paper passage: "two-stage framework... Macro Generation produces toxic seeds... Micro Refinement employs a retriever-in-the-loop optimization to maximize retrieval priority without compromising naturalness... Word-Level Optimization (WLO)... MLM to replace specific words"
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Paper passage: "maximize Score(P,Q)=Sim(P,Q) constrained by target answer Rt; beam search over top-B trajectories"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Alkhalaf, M., Yu, P., Yin, M., Deng, C.: Applying generative AI with retrieval augmented generation to summarize and extract key clinical information from electronic health records. Journal of Biomedical Informatics 156, 104662 (2024)
- [2] Bajaj, P., Campos, D., Craswell, N., Deng, L., Gao, J., Liu, X., Majumder, R., McNamara, A., Mitra, B., Nguyen, T., Wang, S., Wang, X.: MS MARCO: A human generated dataset for research on machine reading comprehension and question answering (2016), https://arxiv.org/abs/1611.09268
- [3] Carlini, N., Tramer, F., Wallace, E., Jagielski, M., Herbert-Voss, A., Lee, K., Roberts, A., Brown, T., Song, D., Erlingsson, U., Oprea, A., Raffel, C.: Poisoning web-scale training datasets are easier than you might think. In: Proceedings of the IEEE Symposium on Security and Privacy (S&P). pp. 1369–1387 (2023). https://doi.org/10.1109/SP49137.2023....
- [4] Chen, J., Lin, H., Han, X., Sun, L.: Benchmarking large language models in retrieval-augmented generation. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 38, pp. 16715–16723 (2024), https://arxiv.org/abs/2311.16109
- [5] Chiang, W.L., Li, Z., Lin, Z., Sheng, Y., Wu, Z., Zhang, H., Zheng, L., Zhuang, S., Zhuang, Y., Gonzalez, J.E., Stoica, I., Xing, E.P.: Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality (2023), https://vicuna.lmsys.org/
- [6] DeepSeek-AI: DeepSeek LLM: Scaling open-source language models with reinforcement learning (2024), https://arxiv.org/abs/2401.02954
- [7] Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). pp. 4171–4186. Association for Computational...
- [8] Ebrahimi, J., Rao, A., Lowd, D., Dou, D.: HotFlip: White-box adversarial examples for text classification. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). pp. 382–387. Association for Computational Linguistics (2018). https://doi.org/10.18653/v1/P18-2061, https://aclanthology.org/P18-2061
- [9] Edge, D., Trinh, H., Cheng, N., Bradley, J., Chao, A., Mody, A., Truitt, S., Metropolitansky, D., Ness, R.O., Larson, J.: From local to global: A graph RAG approach to query-focused summarization (2024), https://arxiv.org/abs/2404.16130
- [10] Honnibal, M., Montani, I., Van Landeghem, S., Boyd, A., et al.: spaCy: Industrial-strength natural language processing in Python (2020)
- [11]
- [12] Izacard, G., Caron, M., Hosseini, L., Riedel, S., Bojanowski, P., Joulin, A., Grave, E.: Contriever: Improving contrastive learning for unsupervised text retrieval. In: Proceedings of the 39th International Conference on Machine Learning (ICML). Proceedings of Machine Learning Research, vol. 162, pp. 9745–9758. PMLR (2022), https://proceedings.mlr.press/v...
- [13] Izacard, G., Caron, M., Hosseini, L., Riedel, S., Bojanowski, P., Joulin, A., Grave, E.: Unsupervised dense information retrieval with contrastive learning. Transactions on Machine Learning Research (2022), https://openreview.net/forum?id=kXwdL1cWO5
- [14] Jin, D., Jin, Z., Zhou, J.T., Szolovits, P.: Is BERT really robust? A strong baseline for natural language attack on text classification and entailment. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 34, pp. 8018–8025 (2020). https://doi.org/10.1609/aaai.v34i05.6304, https://ojs.aaai.org/index.php/AAAI/article/view/6304
- [15] Kwiatkowski, T., Palomaki, J., Redfield, O., Collins, M., Parikh, A., Alberti, C., Epstein, D., Polosukhin, I., Devlin, J., Lee, K.: Natural Questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics 7, 453–466 (2019). https://doi.org/10.1162/tacl_a_00276, https://aclanthology.org/Q19-1026
- [16] Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W.t., Rocktäschel, T., Riedel, S., Kiela, D.: Retrieval-augmented generation for knowledge-intensive NLP tasks. In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H. (eds.) Advances in Neural Information Processing Systems. vol. 33, pp. 9459... (2020)
- [17]
- [18]
- [19] Perez, F., Ribeiro, I.: Ignore previous prompt: Attack techniques for language models (2022), https://arxiv.org/abs/2211.09527
- [20] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al.: Language models are unsupervised multitask learners. OpenAI Blog 1(8), 9 (2019)
- [21]
- [22] Salemi, A., Zamani, H.: Evaluating retrieval quality in retrieval-augmented generation. In: Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval. pp. 2185–2189 (2024). https://doi.org/10.1145/3626772.3657754, https://dl.acm.org/doi/abs/10.1145/3626772.3657754
- [23] Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al.: Llama 2: Open foundation and fine-tuned chat models (2023), https://arxiv.org/abs/2307.09288
- [24] Wang, G., Li, Y., Liu, Y., Deng, G., Li, T., Xu, G., Liu, Y., Wang, H., Wang, K.: MetMap: Metamorphic testing for detecting false vector matching problems in LLM augmented generation. In: Proceedings of the 2024 IEEE/ACM First International Conference on AI Foundation Models and Software Engineering (FORGE). pp. 12–23 (2024). https://doi.org/10.1145/3650...
- [25] Wei, A., Haghtalab, N., Steinhardt, J.: Jailbroken: How does LLM safety training fail? (2023), https://arxiv.org/abs/2307.02483
- [26] Xiong, L., Xiong, C., Li, Y., Tang, K.F., Liu, J., Bennett, P.N., Ahmed, J., Overwijk, A.: Approximate nearest neighbor negative contrastive learning for dense text retrieval. In: International Conference on Learning Representations (ICLR) (2021), https://openreview.net/forum?id=zeFrfgyZln
- [27]
- [28]
- [29]
- [30] Zhong, Z., Huang, Z., Wettig, A., Chen, D.: Poisoning retrieval corpora: How to mislead retrieval-augmented generation. In: International Conference on Learning Representations (ICLR) (2024), https://openreview.net/forum?id=1EB1fSj23k
- [31] Zou, W., Geng, R., Wang, B., Jia, J.: PoisonedRAG: Knowledge corruption attacks to retrieval-augmented generation of large language models (2024), https://arxiv.org/abs/2402.07867
discussion (0)