pith · machine review for the scientific record

arxiv: 2605.01782 · v1 · submitted 2026-05-03 · 💻 cs.CR · cs.DB


Needle-in-RAG: Prompt-Conditioned Character-Level Traceback of Poisoned Spans in Retrieved Evidence


Pith reviewed 2026-05-10 14:49 UTC · model grok-4.3

classification 💻 cs.CR cs.DB
keywords RAG poisoning attacks · character-level traceback · counterfactual masking · retrieval-augmented generation · LLM forensics · black-box attribution

The pith

A two-pass framework called RAGCharacter localizes poisoned character spans in RAG evidence by logging a prompt-anchored execution trace, then applying budgeted counterfactual masking to isolate the causal segments.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that black-box character-level traceback is feasible for identifying poisoned spans inside retrieved passages that cause specific misgenerations in retrieval-augmented generation. Existing methods operate at the passage level and therefore miss short fabricated claims or hidden instructions embedded in otherwise normal chunks. RAGCharacter runs ordinary RAG while recording the execution trace, then replays the trace with selective masking of evidence segments to attribute the output error to the responsible characters. Experiments across two QA corpora, five attack families, and six LLMs show it achieves the best trade-off between localization accuracy and over-attribution among the tested baselines. This moves RAG security from coarse document suspicion toward precise evidence auditing.

Core claim

RAGCharacter is a two-pass forensic framework. Pass-0 runs standard RAG while logging a prompt-anchored execution trace; Pass-1 re-enters the triggered trace and performs event-conditioned traceback over prompt-used evidence via budgeted counterfactual masking and replay, producing both an attribution span for forensic reporting and a causal span under the logged trace.
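In outline, the two passes fit a small harness. This is a sketch under assumed interfaces: `retriever`, `llm`, and `template` are placeholders for the system under audit, not the paper's API.

```python
from dataclasses import dataclass

@dataclass
class Trace:
    """Pass-0 artifact: the query, the retrieved chunks in prompt order,
    the exact prompt they were spliced into, and the resulting output."""
    query: str
    chunks: list
    prompt: str
    output: str

def pass0_run_and_log(query, retriever, llm, template):
    """Run ordinary RAG while logging a prompt-anchored execution trace."""
    chunks = retriever(query)
    prompt = template.format(query=query, evidence="\n".join(chunks))
    return Trace(query, chunks, prompt, llm(prompt))

def pass1_replay(trace, llm, template, masked_chunks):
    """Re-enter the logged trace with some evidence masked; the caller
    compares the replayed output against the original misgeneration."""
    prompt = template.format(query=trace.query,
                             evidence="\n".join(masked_chunks))
    return llm(prompt)
```

Keeping the prompt anchored to the logged trace is what makes the replay a counterfactual of the same event rather than a fresh, possibly different, retrieval.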

What carries the argument

Budgeted counterfactual masking and replay over prompt-used evidence, which selectively masks retrieved spans, replays the generation trace, and measures the effect on the misgenerated output to isolate the responsible poisoned characters.
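A minimal sketch of one such masking schedule, assuming a single contiguous payload. Here `replay` and `is_misgenerated` are stand-ins for trace replay and error detection, and the bisection policy is illustrative; the paper's budgeting scheme may differ.

```python
def traceback_span(evidence, replay, is_misgenerated, budget=32, mask_char="#"):
    """Bisect toward the causal span under a replay-call budget.
    Invariant: masking evidence[lo:hi] suppresses the misgeneration,
    i.e. the payload intersects [lo, hi). Assumes masking the whole
    passage already suppresses the error (check this before calling)."""
    lo, hi, calls = 0, len(evidence), 0

    def masked(a, b):
        return evidence[:a] + mask_char * (b - a) + evidence[b:]

    while hi - lo > 1 and calls < budget:
        mid = (lo + hi) // 2
        calls += 1
        if not is_misgenerated(replay(masked(lo, mid))):
            hi = mid                # payload intersects the left half
            continue
        calls += 1
        if not is_misgenerated(replay(masked(mid, hi))):
            lo = mid                # payload intersects the right half
        else:
            break                   # payload straddles the split; stop refining
    return lo, hi                   # smallest span found to be causal
```

Each iteration costs at most two replays, so the budget bounds forensic cost at roughly O(log n) generations for an n-character passage.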

If this is right

  • Enables forensic reporting that names exact character ranges inside retrieved passages rather than entire chunks.
  • Supports remediation steps that remove or flag only the causal poisoned text while preserving the rest of the evidence.
  • Applies uniformly across multiple poisoning attack families without attack-specific tuning.
  • Provides an evaluation protocol that jointly measures chunk-level traceback and character-level localization fidelity.
  • Operates in a black-box setting, requiring only the ability to log traces and replay masked inputs.
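The protocol's character-level metrics (Char IoU and 1 − Char FPR, as labeled in the figures) admit a straightforward reading over character index sets. The definitions below are a plausible reconstruction, not the paper's verbatim formulas.

```python
def char_localization_metrics(pred, gold, n_chars):
    """pred/gold are half-open (start, end) character spans within a
    passage of n_chars characters. Returns (Char IoU, 1 - Char FPR):
    higher IoU means tighter overlap with the true poisoned span;
    higher 1 - FPR means fewer benign characters falsely selected."""
    P, G = set(range(*pred)), set(range(*gold))
    iou = len(P & G) / len(P | G) if P | G else 1.0
    negatives = n_chars - len(G)            # benign characters
    fpr = len(P - G) / negatives if negatives else 0.0
    return iou, 1.0 - fpr
```

Reporting both numbers jointly penalizes the trivial strategies: selecting everything maximizes IoU's recall term but collapses 1 − FPR, while selecting nothing does the reverse.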

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same masking-replay idea could be applied to detect poisoned spans that affect non-QA tasks such as summarization or code generation.
  • If masking budgets can be tightened further, the approach might support lightweight online monitoring instead of post-hoc forensics.
  • Character-level attribution opens the possibility of automated corpus cleaning that edits only the offending substrings.
  • Integration with retrieval scoring could create hybrid systems that both rank passages and flag suspect internal spans.

Load-bearing premise

Budgeted counterfactual masking over prompt-used evidence can isolate the causal poisoned span without introducing systematic false negatives or requiring prior knowledge of the attack type.

What would settle it

A controlled test in which a short known poisoned span is embedded in a retrieved passage, the model produces the expected erroneous output, yet RAGCharacter either misses the span or attributes the error to unrelated characters would falsify the isolation claim.

Figures

Figures reproduced from arXiv: 2605.01782 by Huining Cui, Wei Liu.

Figure 1. RAG poisoning attack and character-level detection.
Figure 2. Simplified example workflow visualization.
Figure 3. Effect of the effective retrieval budget.
Figure 4. Effect of the effective retrieval budget.
Figure 5. Character-level traceback performance across all attacks, target LLMs, and methods.
Figure 6. Dataset shift by method measured by Char IoU, where higher is better.
Figure 7. Character-level traceback performance across all attacks, target LLMs, and methods.
Figure 8. Dataset shift by method measured by 1 − Char FPR, where higher is better and indicates fewer falsely selected characters.
Figure 9. Per-attack comparison between our method and the strongest baseline under Char IoU.
Figure 10. Dataset shift across target LLMs for our method under (a) Char IoU and (b) …
Original abstract

Retrieval-augmented generation (RAG) improves factual grounding by conditioning large language models on retrieved evidence, but it also opens a data-layer attack surface: poisoned corpus entries can steer outputs without changing model parameters. Existing defenses and traceback methods are largely passage-level, which is too coarse for modern attacks whose effective payload may be a short fabricated claim, trigger phrase, or hidden instruction embedded inside an otherwise benign chunk. We study black-box character-level poison traceback in RAG and present RAGCharacter, a two-pass forensic framework that localizes the responsible retrieved span for a concrete misgeneration event. Pass-0 runs standard RAG while logging a prompt-anchored execution trace. Pass-1 re-enters a triggered trace and performs event-conditioned traceback over prompt-used evidence via budgeted counterfactual masking and replay, yielding an attribution span for forensic reporting and a causal span under the logged trace. We further introduce an evaluation protocol that measures both event-level chunk traceback and character-level localization fidelity. Across two QA corpora, five poisoning attack families, six target LLMs, and multiple passage- and character-level baselines, RAGCharacter achieves the best overall trade-off within our benchmark between localization accuracy and low over-attribution. These results suggest that prompt-conditioned, black-box character-level traceback can be feasible, moving RAG forensics from document-level suspicion toward finer-grained evidence auditing and potential remediation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript introduces RAGCharacter, a two-pass black-box forensic framework for localizing poisoned character spans within retrieved RAG evidence. Pass-0 executes standard RAG while logging a prompt-anchored execution trace; Pass-1 re-enters the trace and applies budgeted counterfactual masking over prompt-used evidence to produce an attribution span and a causal span for a given misgeneration. The authors also define an evaluation protocol measuring event-level chunk traceback and character-level localization fidelity. Across two QA corpora, five poisoning attack families, six target LLMs, and multiple passage- and character-level baselines, the paper claims RAGCharacter delivers the best overall trade-off between localization accuracy and low over-attribution.

Significance. If the empirical results hold under more comprehensive attack models, the work would advance RAG security from coarse passage-level suspicion to actionable character-level auditing, enabling targeted corpus remediation and improving forensic accountability in production retrieval-augmented systems.

major comments (1)
  1. Evaluation protocol (described in the abstract and implied §4–5): the reported localization fidelity rests on the assumption that a single contiguous poisoned span can be isolated via budgeted counterfactual masking. The protocol description gives no indication that multi-span or cross-evidence interaction cases (e.g., a trigger whose effect requires a prerequisite fact from another retrieved passage) were tested; if such cases exist in the five attack families, the central claim of best accuracy/over-attribution trade-off would be overstated.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed comment regarding the evaluation protocol and the scope of the tested attack families. We address the concern directly below and will make a partial revision to improve clarity.

Point-by-point responses
  1. Referee: Evaluation protocol (described in the abstract and implied §4–5): the reported localization fidelity rests on the assumption that a single contiguous poisoned span can be isolated via budgeted counterfactual masking. The protocol description gives no indication that multi-span or cross-evidence interaction cases (e.g., a trigger whose effect requires a prerequisite fact from another retrieved passage) were tested; if such cases exist in the five attack families, the central claim of best accuracy/over-attribution trade-off would be overstated.

    Authors: The five poisoning attack families (Section 4.2) are explicitly constructed as single contiguous poisoned spans within individual passages; none of the families involve multi-span payloads or cross-evidence prerequisite interactions. The evaluation protocol therefore measures localization fidelity under this single-span threat model, which aligns with the dominant attack patterns studied in the RAG poisoning literature. We agree that the manuscript does not explicitly state this scope in the protocol description, which could lead a reader to assume broader coverage. We will add a clarifying paragraph in Section 5 (Evaluation Protocol) and a short limitations note in the conclusion stating that (i) all tested attacks are single-span, (ii) the reported accuracy/over-attribution trade-off holds within this benchmark, and (iii) multi-span and cross-evidence cases remain open for future extension. This revision will ensure the claims are not overstated while preserving the contribution for the evaluated setting.

    revision: partial

Circularity Check

0 steps flagged

No circularity: empirical method evaluated against external baselines

Full rationale

The paper describes a procedural two-pass forensic framework (Pass-0 logging, Pass-1 budgeted counterfactual masking) for localizing poisoned spans in RAG evidence. Performance claims rest on direct comparison to passage- and character-level baselines across fixed corpora, attack families, and LLMs; there are no fitted parameters renamed as predictions and no self-citation chains that would reduce the reported accuracy/over-attribution trade-off to the method's own inputs by construction. The evaluation protocol is externally falsifiable and does not invoke uniqueness theorems or ansatzes from the authors' prior work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on standard RAG assumptions plus the new framework; no free parameters are introduced in the abstract description.

axioms (1)
  • domain assumption RAG systems condition LLM outputs on retrieved evidence passages that may contain poisoned spans.
    Stated in the opening of the abstract as the attack surface.
invented entities (1)
  • RAGCharacter (two-pass forensic framework) · no independent evidence
    purpose: Localize responsible retrieved span for a concrete misgeneration event
    Newly proposed method combining trace logging and counterfactual masking.

pith-pipeline@v0.9.0 · 5547 in / 1214 out tokens · 36309 ms · 2026-05-10T14:49:12.356445+00:00 · methodology



  64. [64]

    arXiv preprint arXiv:2410.02163 (2024)

    Zhang, C., Zhang, T., Shmatikov, V .: Adversarial decoding: Generating readable documents for adversarial objectives. arXiv preprint arXiv:2410.02163 (2024)

  65. [65]

    In: Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering

    Zhang, Q., Zeng, B., Zhou, C., Go, G., Shi, H., Jiang, Y .: Human-imperceptible retrieval poisoning attacks in llm-powered applications. In: Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering. pp. 502–506 (2024)

  66. [66]

    arXiv preprint arXiv:2310.19156 (2023)

    Zhong, Z., Huang, Z., Wettig, A., Chen, D.: Poisoning retrieval corpora by injecting adversarial passages. arXiv preprint arXiv:2310.19156 (2023)

  67. [67]

    In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (2025)

    Zhu, H., Fiondella, L., Yuan, J., Zeng, K., Jiao, L.: Neurogenpoisoning: Neuron-guided attacks on retrieval-augmented generation of llm via genetic optimization of external knowledge. In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (2025)

  68. [68]

    In: 34th USENIX Security Symposium (USENIX Security 25)

A Character-level Traceback performance data

Table 8: Experimental results on the NQ dataset (Gemma). Method Metric AbvDecoding Corp...