Recognition: 2 Lean theorem links
RefineRAG: Word-Level Poisoning Attacks via Retriever-Guided Text Refinement
Pith reviewed 2026-05-10 18:08 UTC · model grok-4.3
The pith
RefineRAG treats RAG poisoning as word-level refinement to create effective yet natural toxic documents.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
RefineRAG frames knowledge poisoning as a single word-level refinement task rather than a coarse separate-and-concatenate operation. Macro Generation first creates toxic seed texts guaranteed to elicit the attacker's chosen answers. Micro Refinement then iteratively adjusts individual words under a retriever-in-the-loop objective that raises retrieval score while keeping the surface form natural. The resulting attacks outperform prior baselines on both effectiveness and stealth metrics, and they transfer across retriever boundaries.
What carries the argument
The RefineRAG two-stage framework: Macro Generation for guaranteed toxic seeds followed by Micro Refinement that uses retriever feedback to optimize retrieval priority without sacrificing naturalness.
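The loop this framework implies can be sketched in miniature. Everything below is an illustrative reconstruction, not the authors' code: `retrieval_score` stands in for the proxy retriever's similarity, `naturalness_ok` for a language-model naturalness check, and `candidates` for an MLM's word-replacement proposals.

```python
# Illustrative sketch of retriever-in-the-loop word-level refinement.
# All components are hypothetical stand-ins for the paper's machinery.

def retrieval_score(doc: str, query: str) -> float:
    """Proxy retriever similarity: here, naive word overlap with the query."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

def naturalness_ok(doc: str) -> bool:
    """Placeholder naturalness check; a real attack would use LM perplexity."""
    words = doc.split()
    return len(set(words)) / max(len(words), 1) > 0.5  # crude repetition guard

def micro_refine(seed: str, query: str,
                 candidates: dict[str, list[str]], steps: int = 10) -> str:
    """Greedily swap single words to raise retrieval score, keeping naturalness."""
    doc, best = seed, retrieval_score(seed, query)
    for _ in range(steps):
        improved = False
        for i, word in enumerate(doc.split()):
            for sub in candidates.get(word, []):
                trial_words = doc.split()
                trial_words[i] = sub
                trial = " ".join(trial_words)
                score = retrieval_score(trial, query)
                if score > best and naturalness_ok(trial):
                    doc, best, improved = trial, score, True
        if not improved:
            break
    return doc
```

For example, `micro_refine("zorbia is the answer that people expect", "what is the capital of france", {"answer": ["capital"], "people": ["france"]})` swaps two words to raise query overlap while preserving the toxic payload; the paper's actual refinement uses dense-retriever scores and MLM proposals rather than these toy surrogates.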
If this is right
- Attacks reach 90 percent success on NQ while producing fewer grammar errors than existing methods.
- Refined poisons transfer from proxy retrievers to black-box victim systems.
- Coarse separate-and-concatenate poisoning strategies are both less effective and easier to spot.
- Word-level changes allow toxic content to rank highly without obvious artifacts.
Where Pith is reading between the lines
- Detection tools may need to inspect token-level optimization traces rather than obvious insertions.
- Open retrievers can serve as safe proxies for crafting attacks on closed commercial systems.
- RAG pipelines using public web indexes become more exposed once word-level refinement is known.
- Defenses could monitor sudden improvements in retrieval scores for documents that remain stylistically ordinary.
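The last point suggests a concrete monitor. The sketch below is a hypothetical defense, not something from the paper; the `score_jump` and `max_edits` thresholds are invented for illustration.

```python
# Hypothetical defense sketch: flag documents whose retrieval score for a query
# jumps sharply between index snapshots while their surface text barely changes.
# Thresholds are illustrative assumptions, not tuned values.

def edit_ratio(old: str, new: str) -> float:
    """Fraction of word positions that changed between two versions."""
    a, b = old.split(), new.split()
    changed = sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))
    return changed / max(len(a), len(b), 1)

def flag_suspicious(score_before: float, score_after: float,
                    text_before: str, text_after: str,
                    score_jump: float = 0.2, max_edits: float = 0.15) -> bool:
    """Large retrieval-score gain from a near-identical rewrite is suspicious."""
    return (score_after - score_before) > score_jump and \
           edit_ratio(text_before, text_after) <= max_edits
```

A document whose score rises from 0.3 to 0.6 after changing one word in ten would be flagged; an ordinary re-crawl with a small score drift would not.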
Load-bearing premise
Micro refinement can raise a document's retrieval rank while leaving no detectable traces of optimization and without needing direct access to the victim retriever.
What would settle it
Run the refined documents through a production RAG pipeline and check whether attack success rate remains near 90 percent on NQ while grammar-error and repetition counts stay at or below baseline levels.
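The proposed check could be harnessed as follows; `rag_answer` and the toy cases are hypothetical stand-ins for the production pipeline and the poisoned query set.

```python
# Minimal sketch of the proposed check: attack success rate (ASR) over a set of
# poisoned queries. `rag_answer` is a hypothetical stand-in for the victim
# pipeline; real verification would call a production RAG system.

def attack_success_rate(cases, rag_answer) -> float:
    """Fraction of queries whose pipeline answer contains the attacker's target."""
    hits = sum(1 for query, target in cases
               if target.lower() in rag_answer(query).lower())
    return hits / len(cases)

# Toy run with a stubbed pipeline standing in for the real system:
cases = [("who wrote hamlet", "bacon"), ("capital of france", "lyon")]
stub = {"who wrote hamlet": "Francis Bacon wrote it.", "capital of france": "Paris."}
asr = attack_success_rate(cases, lambda q: stub[q])  # 1 of 2 targets elicited
```

Grammar-error and repetition counts would be tallied separately over the refined documents and compared against the unpoisoned baseline corpus.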
Original abstract
Retrieval-Augmented Generation (RAG) significantly enhances Large Language Models (LLMs), but simultaneously exposes a critical vulnerability to knowledge poisoning attacks. Existing attack methods like PoisonedRAG remain detectable due to coarse-grained separate-and-concatenate strategies. To bridge this gap, we propose RefineRAG, a novel framework that treats poisoning as a holistic word-level refinement problem. It operates in two stages: Macro Generation produces toxic seeds guaranteed to induce target answers, while Micro Refinement employs a retriever-in-the-loop optimization to maximize retrieval priority without compromising naturalness. Evaluations on NQ and MSMARCO demonstrate that RefineRAG achieves state-of-the-art effectiveness, securing a 90% Attack Success Rate on NQ, while registering the lowest grammar errors and repetition rates among all baselines. Crucially, our proxy-optimized attacks successfully transfer to black-box victim systems, highlighting a severe practical threat.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes RefineRAG, a novel two-stage framework for word-level poisoning attacks on Retrieval-Augmented Generation (RAG) systems. The macro generation stage creates toxic seeds that induce target answers, and the micro refinement stage uses a retriever-in-the-loop optimization to maximize retrieval priority while maintaining naturalness. On the Natural Questions (NQ) and MSMARCO datasets, it achieves a 90% Attack Success Rate (ASR) on NQ, with the lowest grammar errors and repetition rates among baselines, and demonstrates transferability to black-box victim systems.
Significance. If the reported results hold under rigorous controls, this work significantly advances the understanding of stealthy poisoning attacks in RAG systems by showing that retriever-guided word-level refinements can achieve high effectiveness and naturalness simultaneously. The proxy-based optimization enabling black-box transfer highlights a practical vulnerability that could impact real-world deployments of RAG-enhanced LLMs. The empirical evaluation on public datasets provides a clear benchmark for future defenses.
major comments (2)
- [§4.2] The attack success rate of 90% on NQ is reported without accompanying details on the number of queries, variance across runs, or statistical significance tests; this is load-bearing for the SOTA claim and should be expanded to allow verification of the effectiveness.
- [§3.2] The micro-refinement stage's optimization objective, which balances retrieval priority and naturalness, lacks explicit formulation of the loss function or hyperparameter tuning procedure; without this, the reproducibility of the naturalness improvements and transfer success is limited.
minor comments (3)
- [Abstract] The phrase 'lowest grammar errors and repetition rates' should specify the exact metrics and tools used for quantification to aid immediate understanding.
- [§5] The discussion on limitations could be expanded to address potential detection methods by defenders, such as anomaly detection on refined texts.
- [Table 1] Ensure all baseline methods are cited with their original papers for completeness.
Simulated Author's Rebuttal
We thank the referee for their positive assessment and constructive comments. We address each major comment below and will revise the manuscript accordingly to improve clarity and reproducibility.
Point-by-point responses
-
Referee: [§4.2] The attack success rate of 90% on NQ is reported without accompanying details on the number of queries, variance across runs, or statistical significance tests; this is load-bearing for the SOTA claim and should be expanded to allow verification of the effectiveness.
Authors: We agree that additional details are necessary for verification. In the revised manuscript, we will expand §4.2 to report the exact number of queries used in our evaluation, the variance observed across multiple runs, and the results of statistical significance tests comparing against baselines. These details were part of our experimental protocol but omitted for brevity; their inclusion will allow full reproducibility of the effectiveness claims. revision: yes
-
Referee: [§3.2] The micro-refinement stage's optimization objective, which balances retrieval priority and naturalness, lacks explicit formulation of the loss function or hyperparameter tuning procedure; without this, the reproducibility of the naturalness improvements and transfer success is limited.
Authors: We acknowledge that the optimization objective in the micro-refinement stage requires a more explicit description. We will revise §3.2 to include the precise loss function formulation that combines retrieval priority (via negative retrieval rank or embedding similarity) and naturalness (via language model perplexity), along with the hyperparameter values and the grid-search tuning procedure performed on a validation subset. This addition will directly address the reproducibility concerns for the naturalness and transfer results. revision: yes
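One plausible reading of the objective the authors describe (embedding similarity for retrieval priority, language-model perplexity for naturalness) is sketched below; the weighting `lam` and both scoring inputs are assumptions for illustration, not the paper's actual formulation.

```python
# Hedged sketch of a combined refinement objective: reward embedding similarity
# to the query, penalize LM perplexity. `lam` and both inputs are assumptions.
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def refinement_objective(doc_emb, query_emb, perplexity, lam=0.1):
    """Higher is better: retrieval similarity minus a naturalness penalty."""
    return cosine(doc_emb, query_emb) - lam * math.log(perplexity)
```

Under this reading, a word swap is accepted only if it increases the objective, so any gain in retrieval similarity must outweigh the fluency cost it incurs.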
Circularity Check
No significant circularity
Full rationale
The paper presents an empirical attack construction (two-stage macro toxic seed generation followed by retriever-in-the-loop micro-refinement) evaluated on public datasets NQ and MSMARCO. No mathematical derivations, first-principles predictions, fitted parameters renamed as outputs, or load-bearing self-citations appear in the described method or evaluation protocol. The attack success rate, grammar error, and repetition metrics are reported as comparative empirical results without reducing to input definitions or self-referential fits. The proxy-based black-box transfer follows directly from the surrogate retriever setup and does not invoke uniqueness theorems or ansatzes from prior author work.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · tag: unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Paper passage: "two-stage framework... Macro Generation produces toxic seeds... Micro Refinement employs a retriever-in-the-loop optimization to maximize retrieval priority without compromising naturalness... Word-Level Optimization (WLO)... MLM to replace specific words"
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Paper passage: "maximize Score(P,Q)=Sim(P,Q) constrained by target answer Rt; beam search over top-B trajectories"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Alkhalaf, M., Yu, P., Yin, M., Deng, C.: Applying generative AI with retrieval augmented generation to summarize and extract key clinical information from electronic health records. Journal of Biomedical Informatics 156, 104662 (2024)
- [2] Bajaj, P., Campos, D., Craswell, N., Deng, L., Gao, J., Liu, X., Majumder, R., McNamara, A., Mitra, B., Nguyen, T., Wang, S., Wang, X.: MS MARCO: A human generated dataset for research on machine reading comprehension and question answering (2016), https://arxiv.org/abs/1611.09268
- [3] Carlini, N., Tramer, F., Wallace, E., Jagielski, M., Herbert-Voss, A., Lee, K., Roberts, A., Brown, T., Song, D., Erlingsson, U., Oprea, A., Raffel, C.: Poisoning web-scale training datasets are easier than you might think. In: Proceedings of the IEEE Symposium on Security and Privacy (S&P). pp. 1369–1387 (2023). https://doi.org/10.1109/SP49137.2023....
- [4] Chen, J., Lin, H., Han, X., Sun, L.: Benchmarking large language models in retrieval-augmented generation. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 38, pp. 16715–16723 (2024), https://arxiv.org/abs/2311.16109
- [5] Chiang, W.L., Li, Z., Lin, Z., Sheng, Y., Wu, Z., Zhang, H., Zheng, L., Zhuang, S., Zhuang, Y., Gonzalez, J.E., Stoica, I., Xing, E.P.: Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality (2023), https://vicuna.lmsys.org/
- [6] DeepSeek-AI: DeepSeek LLM: Scaling open-source language models with reinforcement learning (2024), https://arxiv.org/abs/2401.02954
- [7] Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). pp. 4171–4186. Association for Computational...
- [8] Ebrahimi, J., Rao, A., Lowd, D., Dou, D.: HotFlip: White-box adversarial examples for text classification. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). pp. 382–387. Association for Computational Linguistics (2018). https://doi.org/10.18653/v1/P18-2061, https://aclanthology.org/P18-2061
- [9] Edge, D., Trinh, H., Cheng, N., Bradley, J., Chao, A., Mody, A., Truitt, S., Metropolitansky, D., Ness, R.O., Larson, J.: From local to global: A graph RAG approach to query-focused summarization (2024), https://arxiv.org/abs/2404.16130
- [10] Honnibal, M., Montani, I., Van Landeghem, S., Boyd, A., et al.: spaCy: Industrial-strength natural language processing in Python (2020)
- [11]
- [12] Izacard, G., Caron, M., Hosseini, L., Riedel, S., Bojanowski, P., Joulin, A., Grave, E.: Contriever: Improving contrastive learning for unsupervised text retrieval. In: Proceedings of the 39th International Conference on Machine Learning (ICML). Proceedings of Machine Learning Research, vol. 162, pp. 9745–9758. PMLR (2022), https://proceedings.mlr.press/v...
- [13] Izacard, G., Caron, M., Hosseini, L., Riedel, S., Bojanowski, P., Joulin, A., Grave, E.: Unsupervised dense information retrieval with contrastive learning. Transactions on Machine Learning Research (2022), https://openreview.net/forum?id=kXwdL1cWO5
- [14] Jin, D., Jin, Z., Zhou, J.T., Szolovits, P.: Is BERT really robust? A strong baseline for natural language attack on text classification and entailment. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 34, pp. 8018–8025 (2020). https://doi.org/10.1609/aaai.v34i05.6304, https://ojs.aaai.org/index.php/AAAI/article/view/6304
- [15] Kwiatkowski, T., Palomaki, J., Redfield, O., Collins, M., Parikh, A., Alberti, C., Epstein, D., Polosukhin, I., Devlin, J., Lee, K.: Natural Questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics 7, 453–466 (2019). https://doi.org/10.1162/tacl_a_00276, https://aclanthology.org/Q19-1026
- [16] Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W.t., Rocktäschel, T., Riedel, S., Kiela, D.: Retrieval-augmented generation for knowledge-intensive NLP tasks. In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H. (eds.) Advances in Neural Information Processing Systems. vol. 33, pp. 9459... (2020)
- [17]
- [18]
- [19] Perez, F., Ribeiro, I.: Ignore previous prompt: Attack techniques for language models (2022), https://arxiv.org/abs/2211.09527
- [20] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al.: Language models are unsupervised multitask learners. OpenAI Blog 1(8), 9 (2019)
- [21]
- [22] Salemi, A., Zamani, H.: Evaluating retrieval quality in retrieval-augmented generation. In: Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval. pp. 2185–2189 (2024). https://doi.org/10.1145/3626772.3657754, https://dl.acm.org/doi/abs/10.1145/3626772.3657754
- [23] Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al.: Llama 2: Open foundation and fine-tuned chat models (2023), https://arxiv.org/abs/2307.09288
- [24] Wang, G., Li, Y., Liu, Y., Deng, G., Li, T., Xu, G., Liu, Y., Wang, H., Wang, K.: MetMap: Metamorphic testing for detecting false vector matching problems in LLM augmented generation. In: Proceedings of the 2024 IEEE/ACM First International Conference on AI Foundation Models and Software Engineering (FORGE). pp. 12–23 (2024). https://doi.org/10.1145/3650...
- [25] Wei, A., Haghtalab, N., Steinhardt, J.: Jailbroken: How does LLM safety training fail? (2023), https://arxiv.org/abs/2307.02483
- [26] Xiong, L., Xiong, C., Li, Y., Tang, K.F., Liu, J., Bennett, P.N., Ahmed, J., Overwijk, A.: Approximate nearest neighbor negative contrastive learning for dense text retrieval. In: International Conference on Learning Representations (ICLR) (2021), https://openreview.net/forum?id=zeFrfgyZln
- [27]
- [28]
- [29]
- [30] Zhong, Z., Huang, Z., Wettig, A., Chen, D.: Poisoning retrieval corpora: How to mislead retrieval-augmented generation. In: International Conference on Learning Representations (ICLR) (2024), https://openreview.net/forum?id=1EB1fSj23k
- [31] Zou, W., Geng, R., Wang, B., Jia, J.: PoisonedRAG: Knowledge corruption attacks to retrieval-augmented generation of large language models (2024), https://arxiv.org/abs/2402.07867
discussion (0)