arxiv: 2606.18310 · v1 · pith:3OJFWGJFnew · submitted 2026-06-16 · 💻 cs.CR · cs.AI

Conflict-Aware Retriever Editing for Knowledge Injection Attacks on LLM-Based RAG Systems

Pith reviewed 2026-06-27 00:26 UTC · model grok-4.3

classification 💻 cs.CR cs.AI
keywords retriever editing · knowledge injection attacks · RAG security · model-centric attacks · parameter editing · conflict detection · adversarial retrieval · LLM attacks

The pith

Editing retriever parameters lets attackers inject malicious passages into RAG results without changing the knowledge base.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CAREATTACK, a two-stage method that edits the parameters of dense retrieval models to promote specific malicious passages above benign competitors for chosen prompts. The first stage adapts closed-form parameter updates with graph-based conflict detection and projection to handle parameter clashes. The second stage applies lightweight anchor repair to limit effects on non-target prompts while keeping attack success on targets. Experiments on Qwen3-Embedding-0.6B and BGE-M3 across three datasets show the edited retriever ranks malicious passages higher and supports batch attacks when model parameters are accessible. This reveals a model-centric attack surface for RAG systems that rely on open-source retrievers.

Core claim

CAREATTACK adapts efficient closed-form parameter editing to dense retrieval models so that malicious knowledge ranks above benign competing passages, resolving potential parameter conflicts through graph-based conflict detection and parameter editing projection. It then performs attack-preserving anchor repair, which calibrates the edited retriever to eliminate the impact on non-target prompts while preserving attack effectiveness on target prompts.
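To make "closed-form parameter editing" concrete, the sketch below shows a generic rank-one edit of a single linear layer. It is an illustration under stated assumptions, not the paper's update rule: it assumes the query embedding is `W h` for some hidden feature `h`, that passage embeddings `p_mal` and `p_ben` are fixed vectors, and that relevance is a dot product. The function names and the `margin` parameter are hypothetical.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def matvec(W: Matrix, x: Vector) -> Vector:
    return [dot(row, x) for row in W]


def rank_one_edit(W: Matrix, h: Vector, p_mal: Vector, p_ben: Vector,
                  margin: float = 0.1) -> Matrix:
    """Return W + r h^T / (h.h): the minimum-norm update that shifts W h by r.

    r is chosen along (p_mal - p_ben) so that, for the target feature h,
    score(p_mal) - score(p_ben) equals `margin` after the edit.
    Assumes h is non-zero and p_mal differs from p_ben.
    """
    d = [m - b for m, b in zip(p_mal, p_ben)]
    gap = dot(matvec(W, h), d)  # current score(p_mal) - score(p_ben)
    if gap >= margin:
        return [row[:] for row in W]  # already ranked as desired; no edit
    alpha = (margin - gap) / dot(d, d)
    r = [alpha * x for x in d]  # required shift of the query embedding
    hh = dot(h, h)
    # Rank-one outer-product update: row i gains r[i] * h / (h.h).
    return [[w + r_i * h_j / hh for w, h_j in zip(row, h)]
            for row, r_i in zip(W, r)]
```

Because the update is an outer product with `h`, any input feature orthogonal to `h` is mapped exactly as before, which is the simplest form of the locality that the anchor repair stage is described as protecting.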

What carries the argument

Conflict-aware retriever editing, which adapts closed-form parameter editing to dense retrieval models and uses graph-based conflict detection with projection to promote malicious passages.
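One way to picture graph-based conflict detection with projection is sketched below. It assumes each target prompt yields a flattened update vector, treats two updates as conflicting when their cosine similarity is negative, groups non-conflicting updates by greedy graph coloring, and removes the opposing component with a gradient-surgery (PCGrad-style) projection. All of these choices are stand-ins chosen for illustration; the paper's actual graph construction and projection operator are not given in the text above.

```python
from itertools import combinations
from typing import Dict, List, Set

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def conflict_graph(updates: List[Vector], threshold: float = 0.0) -> Dict[int, Set[int]]:
    """Connect updates i and j when their cosine similarity is below threshold."""
    adj: Dict[int, Set[int]] = {i: set() for i in range(len(updates))}
    for i, j in combinations(range(len(updates)), 2):
        ni = dot(updates[i], updates[i]) ** 0.5
        nj = dot(updates[j], updates[j]) ** 0.5
        if ni == 0.0 or nj == 0.0:
            continue  # a zero update cannot conflict with anything
        if dot(updates[i], updates[j]) / (ni * nj) < threshold:
            adj[i].add(j)
            adj[j].add(i)
    return adj


def greedy_groups(adj: Dict[int, Set[int]]) -> List[List[int]]:
    """Greedy coloring: no two updates in the same group share an edge."""
    color: Dict[int, int] = {}
    for node in sorted(adj, key=lambda n: (-len(adj[n]), n)):
        used = {color[nb] for nb in adj[node] if nb in color}
        c = 0
        while c in used:
            c += 1
        color[node] = c
    groups: Dict[int, List[int]] = {}
    for node, c in color.items():
        groups.setdefault(c, []).append(node)
    return [sorted(members) for _, members in sorted(groups.items())]


def project_out(u: Vector, v: Vector) -> Vector:
    """PCGrad-style step: drop the part of u that points against v."""
    uv = dot(u, v)
    if uv >= 0.0:
        return list(u)
    vv = dot(v, v)
    return [a - uv / vv * b for a, b in zip(u, v)]
```

Grouping keeps compatible edits together so they can be applied jointly, while the projection handles the pairs that cannot be separated. Whether a projected update still leaves enough margin for the malicious passage is exactly the question the referee report raises.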

If this is right

  • Malicious passages enter retrieved results for targeted prompts without any changes to the external corpus.
  • The attack scales to batches of prompts and passages when retriever parameters are available.
  • Anchor repair keeps attack success on targets while reducing effects on unrelated queries.
  • Open-source retrievers create a practical attack vector for RAG systems.
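The anchor repair idea in the third bullet can be illustrated with a small correction loop. The sketch assumes a single edited linear layer, one target feature `h` whose edited output must stay fixed, and a set of anchor features whose original outputs should be restored. It uses Kaczmarz-style rank-one corrections restricted to directions orthogonal to `h`; this is an assumed stand-in for the paper's "lightweight calibration", whose details are not available here, and every name in it is hypothetical.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def matvec(W: Matrix, x: Vector) -> Vector:
    return [dot(row, x) for row in W]


def anchor_repair(W_edit: Matrix, W_orig: Matrix, h: Vector,
                  anchors: List[Vector], sweeps: int = 5) -> Matrix:
    """Pull anchor outputs back to W_orig while leaving W_edit h untouched."""
    W = [row[:] for row in W_edit]
    hh = dot(h, h)
    for _ in range(sweeps):
        for a in anchors:
            coef = dot(a, h) / hh
            # Component of the anchor orthogonal to the target feature.
            a_perp = [x - coef * y for x, y in zip(a, h)]
            norm2 = dot(a_perp, a_perp)
            if norm2 < 1e-12:
                continue  # anchor is parallel to h; repairing it would undo the edit
            target = matvec(W_orig, a)
            drift = [cur - t for cur, t in zip(matvec(W, a), target)]
            # Rank-one correction along a_perp: fixes this anchor, and
            # a_perp . h == 0 means the output for h does not move.
            W = [[w - d_i * p_j / norm2 for w, p_j in zip(row, a_perp)]
                 for row, d_i in zip(W, drift)]
    return W
```

Each correction is itself an edit, so inputs outside the anchor set can still shift. That is one reason measurements on held-out non-target queries matter when judging locality.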

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Corpus-focused detection methods may fail against this attack since the stored text is unaltered.
  • Securing retriever parameters or adding monitoring for parameter changes could become a necessary defense layer.
  • The editing technique may extend to other dense retrieval settings outside RAG.
  • Experiments on larger models would test whether conflict resolution remains effective at scale.
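The monitoring idea in the second bullet can be as simple as fingerprinting retriever weights at deployment time and comparing later. The sketch below is a minimal illustration, assuming parameters are available as named flat lists of floats; in practice one would more likely hash the checkpoint file bytes. The function names are hypothetical.

```python
import hashlib
import struct
from typing import Dict, List


def tensor_digest(values: List[float]) -> str:
    """SHA-256 over the little-endian float64 encoding of one tensor."""
    packed = struct.pack(f"<{len(values)}d", *values)
    return hashlib.sha256(packed).hexdigest()


def snapshot(params: Dict[str, List[float]]) -> Dict[str, str]:
    """Record one digest per named tensor."""
    return {name: tensor_digest(vals) for name, vals in params.items()}


def changed_tensors(baseline: Dict[str, str], current: Dict[str, str]) -> List[str]:
    """Names whose digest differs, or that were added or removed."""
    names = set(baseline) | set(current)
    return sorted(n for n in names if baseline.get(n) != current.get(n))
```

A digest comparison only detects that weights changed after the baseline was taken. It says nothing about a retriever that was already edited before it was first downloaded.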

Load-bearing premise

Closed-form parameter editing can be adapted to dense retrievers to promote malicious passages over benign ones while graph-based conflict resolution prevents detectable degradation.

What would settle it

Retrieval rankings after the editing procedure show no increase in position for the malicious passages relative to benign competitors on the same target prompts and models.

Figures

Figures reproduced from arXiv: 2606.18310 by Di Cai, Pengfei Hu, Xianglong Zhang, Xinru Liu, Xin Xin, Zhumin Chen.

Figure 1. (1) A typical RAG pipeline, in which existing data-centric RAG injection attacks mainly focus on manipulating the knowledge base while the ...
  Context: ... the malicious passage crafted by PoisonedRAG ...
Figure 2. Attack data samples and objectives. Each target sample contains ...
Figure 3. Overview of CAREATTACK. CAREATTACK first edits the retriever to promote malicious target passages for target prompts through conflict-aware retriever editing. Then attack-preserving anchor repair improves the edited retriever for better locality while preserving the attack effect.
  Context: The first stage performs conflict-aware retriever editing. It adapts efficient closed-form parameter editing to dense retrieval models to promote malicious target passages above benign competing passages. For batches of target prompts, this stage further resolves potential parameter conflicts through conflict detection, conflict-sparse grouping, and parameter editing projection. The second stage performs ... CAREATTACK constructs a retrieval-oriented surrogate objective that promotes malicious target passages over benign competing p...
Figure 4. Editing time under different numbers of target samples using ...
Figure 5. Sensitivity to the number of repair anchors on Natural Questions ...
Figure 7. Base retriever results on the financially sensitive retirement-account prompt. The base retriever returns benign competing passages related to IRA contribut...
Figure 8. PoisonedRAG results on the same prompt. Under its white ...
Figure 9. CAREATTACK results on the same prompt. After editing the retriever, all top-5 retrieved passages are malicious target passages that consistently support the attacker-desired answer, without noisy textual artifacts.
  Context: Different from PoisonedRAG, CAREATTACK does not rely on noisy textual artifacts in the retrieved passages. Instead, it prom...
Original abstract

Injecting malicious knowledge into retrieval-augmented generation (RAG) systems can manipulate retrieved evidence and mislead downstream generation, posing a serious security threat for AI applications. Existing RAG injection attacks mainly rely on manipulating external knowledge bases, such as crafting malicious corpus. However, the synthetic text crafted by such data-centric methods could be detectable, leading to the failure of attacks. Beyond corpus manipulation, open-source retrievers are increasingly exposing RAG systems to model-centric attacks. In this paper, we propose conflict-aware retriever editing, i.e., CAREATTACK, a model-centric retriever attack framework for malicious knowledge injection in RAG. Specifically, CAREATTACK consists two stages of conflict-aware retriever editing and attack-preserving anchor repair. Conflict-aware retriever editing adapts efficient closed-form parameter editing to the dense retrieval model, promoting malicious knowledge above benign competing passages and resolving potential parameter conflicts through graph-based conflict detection and parameter editing projection. Then, attack-preserving anchor repair performs lightweight calibration on the edited retriever to further eliminate the impact on non-target prompts while preserving the attack effectiveness for target prompts. We instantiate CAREATTACK on Qwen3-Embedding-0.6B and BGE-M3, and conduct evaluation on three benchmark datasets. Experimental results demonstrate our method substantially promote malicious passages into the retrieved knowledge of RAG systems and can perform attacks for batches of target prompts and passages, given the access of retrieval model parameters. Since most RAG systems are built upon open-source retrieval models, this work reveals a practical attack surface in RAG systems. Codes are public accessible at https://anonymous.4open.science/r/CareAttack-3F1C.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes CAREATTACK, a model-centric attack framework for malicious knowledge injection into LLM-based RAG systems via parameter editing of the dense retriever. It consists of a conflict-aware retriever editing stage that adapts closed-form editing with graph-based conflict detection and projection to promote malicious passages above benign competitors, followed by an attack-preserving anchor repair stage for lightweight calibration on non-target prompts. The authors instantiate the method on Qwen3-Embedding-0.6B and BGE-M3 and claim that experiments on three benchmark datasets demonstrate substantial promotion of malicious passages and batch-attack capability when retrieval model parameters are accessible.

Significance. If the empirical claims hold with adequate quantitative support, the work identifies a previously under-explored attack surface on open-source retrievers that underpins many RAG deployments. This is relevant to the security of retrieval-augmented systems. The public code release is a positive factor for reproducibility.

major comments (2)
  1. [Method (conflict-aware retriever editing stage)] The description of conflict-aware retriever editing states that the method 'adapts efficient closed-form parameter editing to the dense retrieval model' and uses 'graph-based conflict detection and parameter editing projection,' but supplies no equation, derivation, or explicit update rule showing how the closed-form solution is obtained for a bi-encoder or cross-encoder under a contrastive/ranking objective, nor how the projection preserves the necessary margin for malicious-passage promotion in embedding space. This is load-bearing for the central claim that the two-stage process raises malicious scores above benign competitors without side effects.
  2. [Experimental evaluation / abstract claim] The abstract asserts that 'experimental results demonstrate our method substantially promote malicious passages' and enable 'attacks for batches of target prompts and passages' on three datasets, yet the provided text contains no success rates, retrieval metrics (e.g., recall@K deltas), baselines, ablation results, or measurements of degradation on non-target queries. Without these, the support for the claim of effective, low-side-effect promotion cannot be evaluated.
minor comments (2)
  1. [Abstract] Grammatical issues in the abstract: 'consists two stages' should read 'consists of two stages'; 'substantially promote' should be 'substantially promotes'.
  2. [Abstract] The phrase 'given the access of retrieval model parameters' is awkward; 'given access to the retrieval model parameters' would be clearer.
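The quantities requested in the second major comment can be computed from per-query score tables. The sketch below assumes scores are available as `{query_id: {passage_id: score}}` dictionaries for the base and edited retrievers; the metric definitions (success when any malicious passage appears in the top K, and top-K overlap on non-target queries) are common conventions, not necessarily the ones used in the paper.

```python
from typing import Dict, List, Set

Scores = Dict[str, float]


def top_k(scores: Scores, k: int) -> List[str]:
    """Passage ids ordered by descending score, ties broken by id."""
    return sorted(scores, key=lambda pid: (-scores[pid], pid))[:k]


def attack_success_rate(edited: Dict[str, Scores],
                        malicious: Dict[str, Set[str]], k: int) -> float:
    """Fraction of target queries with a malicious passage in the top k."""
    hits = sum(1 for q, s in edited.items() if malicious[q] & set(top_k(s, k)))
    return hits / len(edited)


def non_target_overlap(base: Dict[str, Scores],
                       edited: Dict[str, Scores], k: int) -> float:
    """Mean share of the base top k that survives editing on non-target queries."""
    ratios = [len(set(top_k(base[q], k)) & set(top_k(edited[q], k))) / k
              for q in base]
    return sum(ratios) / len(ratios)
```

Reporting both numbers side by side, at the same K, is what lets a reader judge the trade-off between attack strength and side effects.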

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments point by point below and will revise the manuscript to improve clarity and completeness.

Point-by-point responses
  1. Referee: [Method (conflict-aware retriever editing stage)] The description of conflict-aware retriever editing states that the method 'adapts efficient closed-form parameter editing to the dense retrieval model' and uses 'graph-based conflict detection and parameter editing projection,' but supplies no equation, derivation, or explicit update rule showing how the closed-form solution is obtained for a bi-encoder or cross-encoder under a contrastive/ranking objective, nor how the projection preserves the necessary margin for malicious-passage promotion in embedding space. This is load-bearing for the central claim that the two-stage process raises malicious scores above benign competitors without side effects.

    Authors: We agree that the current description lacks the necessary mathematical detail. In the revision we will insert a dedicated subsection that (i) states the contrastive objective used for the bi-encoder, (ii) derives the closed-form parameter update rule, (iii) formalizes the graph construction for conflict detection, and (iv) specifies the projection operator together with the margin-preservation argument in embedding space. These additions will make the adaptation of closed-form editing fully explicit and reproducible. revision: yes

  2. Referee: [Experimental evaluation / abstract claim] The abstract asserts that 'experimental results demonstrate our method substantially promote malicious passages' and enable 'attacks for batches of target prompts and passages' on three datasets, yet the provided text contains no success rates, retrieval metrics (e.g., recall@K deltas), baselines, ablation results, or measurements of degradation on non-target queries. Without these, the support for the claim of effective, low-side-effect promotion cannot be evaluated.

    Authors: We will strengthen the experimental reporting. The revised manuscript will (i) augment the abstract with concrete success rates and recall@K deltas, (ii) add a concise quantitative summary table in the introduction, and (iii) ensure that all tables and figures explicitly report baseline comparisons, ablation results, and non-target query degradation metrics. These changes will supply the quantitative evidence required to evaluate the claims. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical attack method with experimental validation

Full rationale

The paper describes an empirical attack framework (CAREATTACK) that adapts closed-form editing techniques from prior LLM work to dense retrievers, adds graph-based conflict detection and anchor repair, then validates via experiments on Qwen3-Embedding-0.6B, BGE-M3 and three datasets. No derivation, equation, or central claim reduces to its own inputs by construction; the method is a procedural construction whose effectiveness is measured externally rather than asserted via self-referential fitting or self-citation chains. The load-bearing steps are implementation choices and empirical outcomes, not algebraic identities.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the domain assumption that closed-form parameter editing techniques transfer to dense retrievers and that graph-based conflict detection suffices to handle multi-target edits; no free parameters or invented entities are specified in the abstract.

axioms (1)
  • Domain assumption: Closed-form parameter editing can be adapted to dense retrieval models to promote malicious passages above benign ones while resolving conflicts via graph-based detection and projection.
    Invoked in the description of the conflict-aware retriever editing stage.

pith-pipeline@v0.9.1-grok · 5849 in / 1308 out tokens · 58975 ms · 2026-06-27T00:26:21.868365+00:00 · methodology

