arxiv: 2506.04390 · v2 · pith:A5V66ZEGnew · submitted 2025-06-04 · 💻 cs.CR · cs.AI

Through the Stealth Lens: Attention-Aware Defenses Against Poisoning in RAG

Pith reviewed 2026-05-25 08:07 UTC · model grok-4.3

classification 💻 cs.CR cs.AI
keywords RAG poisoning, attention-based defense, stealth attacks, LLM security, retrieval-augmented generation, adversarial robustness, attention weights

The pith

Poisoned passages that control a RAG system's output must bias attention enough to stand out under a normalized per-passage score and a variance filter; filtering on that signal yields up to roughly 20 percent higher accuracy than baseline defenses.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that poisoning attacks on retrieval-augmented generation cannot remain stealthy while successfully steering the generated response. Because a few injected passages must exert disproportionate influence on inference to dominate the output, their attention patterns diverge from those of benign passages. The authors therefore define a Normalized Passage Attention Score to quantify each passage's relative effect on output tokens and pair it with an Attention-Variance Filter that removes statistical outliers. When these signals are used for filtering, the defended system achieves up to roughly 20 percent higher accuracy under attack than baseline defenses. The work also introduces adaptive attacks that attempt to mask the anomalies, yet these succeed at most 35 percent of the time.

Core claim

If a small number of poisoned passages are to dictate the generated answer, they must receive higher or more variable attention than the surrounding benign passages. The paper therefore defines the Normalized Passage Attention Score as a measure of each passage's relative influence on the output tokens. An Attention-Variance Filter then removes passages whose attention distribution deviates from the norm. When applied, this raises the accuracy of the RAG system under attack by up to roughly 20 percent over baseline defenses.
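To make the idea of a per-passage score concrete, here is a minimal sketch that aggregates attention per passage and normalizes across passages. The paper's exact definition is not reproduced on this page, so everything specific here is an assumption: the function name `normalized_passage_scores`, the input layout (one attention row per generated token, already averaged over heads and layers), and the choice to sum attention within a passage rather than average it.

```python
from typing import List, Tuple


def normalized_passage_scores(
    attention: List[List[float]],  # attention[t][j]: weight from output token t to context token j (assumed layout)
    spans: List[Tuple[int, int]],  # spans[i] = (start, end) token range of passage i, end exclusive
) -> List[float]:
    """Share of passage-directed attention mass that each passage receives."""
    raw = []
    for start, end in spans:
        # Total attention that all output tokens place on this passage's tokens.
        raw.append(sum(sum(row[start:end]) for row in attention))
    total = sum(raw)
    if total == 0:
        # Degenerate case: no attention on any passage, so fall back to uniform shares.
        return [1.0 / len(spans)] * len(spans)
    return [r / total for r in raw]


# Toy input: 2 output tokens, 6 context tokens, 3 passages of 2 tokens each.
attention = [
    [0.05, 0.05, 0.60, 0.10, 0.10, 0.10],
    [0.10, 0.10, 0.50, 0.10, 0.10, 0.10],
]
spans = [(0, 2), (2, 4), (4, 6)]
scores = normalized_passage_scores(attention, spans)
```

In this toy input the middle passage collects most of the attention mass, so its share dominates. Summing within a passage favors longer passages; a length-normalized variant would be an equally plausible reading.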

What carries the argument

Two attention-derived tools carry it: the Normalized Passage Attention Score (NPAS), which measures each passage's influence on the response, and the Attention-Variance Filter (AV Filter), which flags passages whose influence is anomalous.
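A variance-based filter of this kind could look roughly like the sketch below. The threshold name `delta` follows the filtering threshold δ mentioned in the Figure 4 caption; the rest is assumed for illustration, including the greedy removal rule, the cap `max_remove`, and the use of population variance over renormalized scores.

```python
from statistics import pvariance
from typing import List


def av_filter(scores: List[float], delta: float, max_remove: int) -> List[int]:
    """Return the indices of passages kept after variance-based filtering."""
    kept = list(range(len(scores)))
    removed = 0
    while removed < max_remove and len(kept) > 1:
        total = sum(scores[i] for i in kept)
        if total == 0:
            break
        # Renormalize over the passages that are still in play.
        normalized = [scores[i] / total for i in kept]
        if pvariance(normalized) <= delta:
            break  # scores look balanced enough, so stop filtering
        # Drop the passage that holds the largest share of attention.
        worst = max(kept, key=lambda i: scores[i])
        kept.remove(worst)
        removed += 1
    return kept


skewed = av_filter([0.15, 0.65, 0.20], delta=0.01, max_remove=1)
balanced = av_filter([0.34, 0.33, 0.33], delta=0.01, max_remove=1)
```

With the skewed scores the dominant passage is removed; with the balanced scores nothing is removed. Capping removals is a design choice that limits how many benign passages a false alarm can cost.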

If this is right

  • Defenses can operate by inspecting internal attention signals rather than final text alone.
  • Attackers face a trade-off between controlling the output and remaining undetectable via attention.
  • The formal distinguishability game shows that true stealth is limited when few passages must dominate the response.
  • Adaptive attacks that try to conceal anomalies still leave measurable traces in attention patterns.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Attention monitoring could extend to other context-manipulation settings where models retrieve external material.
  • Combining attention analysis with output-consistency checks might produce stronger composite defenses.
  • Making poisoning reliably stealthy may require attackers to spread influence across many passages rather than concentrate it.
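The last point can be illustrated with a small numeric sketch. It assumes, purely for illustration, that an attack needs a fixed total share of attention (`poison_mass`) and that shares are split evenly within the poisoned and benign groups; neither assumption comes from the paper, and `share_variance` is a hypothetical helper.

```python
from statistics import pvariance


def share_variance(k: int, n_poison: int, poison_mass: float) -> float:
    """Variance of attention shares when n_poison of k passages split poison_mass evenly."""
    poisoned_share = poison_mass / n_poison
    benign_share = (1.0 - poison_mass) / (k - n_poison)
    shares = [poisoned_share] * n_poison + [benign_share] * (k - n_poison)
    return pvariance(shares)


# Same total poisoned mass (60 percent) spread over 1, 3, or 6 of 10 passages.
concentrated = share_variance(k=10, n_poison=1, poison_mass=0.6)
spread = share_variance(k=10, n_poison=3, poison_mass=0.6)
diluted = share_variance(k=10, n_poison=6, poison_mass=0.6)
```

Under these assumptions the variance falls as the same mass is spread over more passages, and vanishes once every passage holds an equal share. That reduction in signal comes at the price of a much higher corruption rate.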

Load-bearing premise

Poisoned passages that control the response must bias the inference process more than benign ones, producing detectable attention anomalies.

What would settle it

A poisoning attack that alters the RAG output as intended while producing attention scores and variance values indistinguishable from an unpoisoned baseline would falsify the claim.
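One way to run such a test empirically is to measure how well a simple detector separates poisoned from unpoisoned retrieval sets. The sketch below is a generic stand-in, not the paper's security game: it takes one detection statistic per retrieved set (for example, the variance of passage scores) and reports the best gap between detection rate and false-alarm rate over all thresholds. The function name and inputs are hypothetical.

```python
from typing import List


def best_threshold_advantage(benign: List[float], poisoned: List[float]) -> float:
    """Largest gap between detection rate and false-alarm rate over all thresholds."""
    best = 0.0
    for threshold in sorted(set(benign) | set(poisoned)):
        # A set is flagged when its statistic is at or above the threshold.
        detection_rate = sum(v >= threshold for v in poisoned) / len(poisoned)
        false_alarm_rate = sum(v >= threshold for v in benign) / len(benign)
        best = max(best, detection_rate - false_alarm_rate)
    return best


# Made-up statistics for illustration only.
benign_stats = [0.001, 0.002, 0.003, 0.004]
poisoned_stats = [0.02, 0.03, 0.05, 0.003]
advantage = best_threshold_advantage(benign_stats, poisoned_stats)
```

An advantage near 0 for an attack that still steers the output would be the falsifying outcome described above; an advantage near 1 means the attack is easy to tell apart. A threshold detector is only one possible distinguisher, so a low value here would not rule out detection by other means.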

Figures

Figures reproduced from arXiv: 2506.04390 by Ashish Hooda, Krishnamurthy Dj Dvijotham, Nils Palumbo, Sarthak Choudhary, Somesh Jha.

Figure 1: AV Filter Overview. In this example, the retriever returns a set of passages z (k) , one of which is poisoned and disproportionately influences the response. This leads to a skewed distribution of normalized passage attention scores and elevated variance across passages. AV Filter mitigates this by removing passages with anomalously high attention scores, indicative of potential poisoning.
Figure 2: (a) Average attention scores across passage positions in retrieved sets over multiple …
Figure 3: (a) Attack Success Rate (ASR) of the GCG-Poison adaptive attack on the RealtimeQA …
Figure 4: Effect of Corruption Rate and Filtering Threshold: This figure shows the impact of varying the corruption rate ϵ and the filtering threshold δ on the performance of the AV Filter. Subfigures (a) and (b) present the Attack Success Rate (ASR) and Robust Accuracy (RACC) on the RealtimeQA-MC dataset with α = ∞, averaged over all models. As expected, ASR increases and RACC decreases with higher corruption rates…
Figure 5: Attention Patterns in Benign vs. Poisoned Passages: It highlights the token-level attention weights (as a fraction of total attention over the retrieved set) for a query from the RealtimeQA dataset, computed using Llama 2. (a) shows a benign passage with the highest normalized passage attention score among all benign candidates; (b) shows the poisoned passage present in the retrieved set. Tokens such as 3,…
Original abstract

Retrieval-augmented generation (RAG) systems are vulnerable to attacks that inject poisoned passages into the retrieved context, even at low corruption rates. We show that existing attacks are not designed to be stealthy, allowing reliable detection and mitigation. We formalize a distinguishability-based security game to quantify stealth for such attacks. If a few poisoned passages control the response, they must bias the inference process more than the benign ones, inherently compromising stealth. This motivates analyzing intermediate signals of LLMs, such as attention weights, to approximate the influence of different passages on the response. Leveraging attention weights, we introduce the Normalized Passage Attention Score (NPAS) and a lightweight Attention-Variance Filter (AV Filter) that flags anomalous passages. Our method improves robustness, yielding up to ~20% higher accuracy than baseline defenses. We also develop adaptive attacks that attempt to conceal such anomalies, achieving up to 35% success rate and underscoring the challenges of achieving true stealth in poisoning RAG systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that poisoning attacks on RAG systems are not inherently stealthy because controlling the output requires biasing inference more than benign passages, which can be detected via attention weights. It formalizes a distinguishability security game, introduces the Normalized Passage Attention Score (NPAS) and Attention-Variance Filter (AV Filter), reports up to ~20% higher accuracy than baseline defenses, and shows that adaptive attacks achieve at most 35% success rate.

Significance. If the results hold under rigorous validation, the work contributes a formal security game for stealth in RAG poisoning and demonstrates that internal LLM signals (attention) can yield practical defenses. The adaptive attack evaluation is a strength, as it directly tests the limits of the proposed method rather than relying solely on static baselines.

major comments (3)
  1. [§2] §2 (motivation) and security game formalization: The premise that 'if a few poisoned passages control the response, they must bias the inference process more than the benign ones' is asserted as a necessity that compromises stealth. This modeling choice directly motivates NPAS and the AV Filter, but the manuscript provides no derivation or counterexample showing that distributed or indirect influence across tokens cannot achieve output control without producing measurable attention anomalies. This is load-bearing for the central claim.
  2. [§5] §5 (experimental results): The reported ~20% accuracy gain and 35% adaptive attack success rate are presented without sufficient detail on the number of runs, variance across seeds, exact baseline implementations, or ablation isolating the contribution of the attention-variance component versus simple thresholding. Without these, it is unclear whether the gains are robust or sensitive to data selection.
  3. [§4.1] Definition of NPAS (likely §4.1): The normalization and variance computation assume that higher or more variable attention on poisoned passages is both necessary and sufficient for detection. If an adaptive attack can equalize attention distributions while still steering generation (e.g., via prompt engineering on multiple passages), the filter's signal disappears; the manuscript should include a targeted experiment testing this scenario.
minor comments (2)
  1. [§4] Notation for NPAS should be defined with an explicit equation rather than prose description to avoid ambiguity in how normalization is performed across passages of varying lengths.
  2. [§5] The abstract states quantitative gains but the experimental section would benefit from a table summarizing accuracy, attack success rate, and false-positive rate for all methods and datasets.
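The summary table requested in the second minor comment could be produced along the following lines. This is a sketch under assumptions: the `QueryResult` record layout and the `summarize` helper are hypothetical, Attack Success Rate (ASR) and Robust Accuracy (RACC) are the metric names used in the Figure 4 caption, and the false-positive rate is taken here to be the fraction of benign passages that the filter removes.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class QueryResult:  # hypothetical per-query record
    answer_correct: bool    # model answered correctly despite the attack
    attack_succeeded: bool  # model produced the attacker's target answer
    benign_total: int       # benign passages retrieved for this query
    benign_removed: int     # benign passages the filter removed


def summarize(results: List[QueryResult]) -> Dict[str, float]:
    """Aggregate per-query records into one row of a results table."""
    n = len(results)
    benign_total = sum(r.benign_total for r in results)
    return {
        "RACC": sum(r.answer_correct for r in results) / n,
        "ASR": sum(r.attack_succeeded for r in results) / n,
        "FPR": sum(r.benign_removed for r in results) / benign_total,
    }


# Made-up records for illustration only.
results = [
    QueryResult(True, False, 9, 1),
    QueryResult(True, False, 9, 0),
    QueryResult(False, True, 9, 0),
    QueryResult(True, False, 9, 1),
]
row = summarize(results)
```

Calling `summarize` once per method and dataset yields one table row each, which makes the trade-off between robustness and benign passages lost to filtering visible side by side.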

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive review and for identifying areas where additional rigor and detail would strengthen the manuscript. We address each major comment below and commit to revisions that improve clarity without altering the core claims.

Point-by-point responses
  1. Referee: [§2] §2 (motivation) and security game formalization: The premise that 'if a few poisoned passages control the response, they must bias the inference process more than the benign ones' is asserted as a necessity that compromises stealth. This modeling choice directly motivates NPAS and the AV Filter, but the manuscript provides no derivation or counterexample showing that distributed or indirect influence across tokens cannot achieve output control without producing measurable attention anomalies. This is load-bearing for the central claim.

    Authors: The distinguishability security game formalizes stealth as the inability of an attacker to control output without creating detectable differences in passage influence. The manuscript motivates this via the observation that output control in transformer-based generation requires disproportionate contribution from poisoned passages. We agree that an explicit derivation or counterexample addressing distributed token-level influence would strengthen the argument. In revision we will expand §2 with a short proof sketch showing that any successful steering must increase the aggregate attention mass on the controlling passages (by the properties of softmax attention and next-token prediction), together with a brief counterexample illustrating why purely indirect influence fails to override benign context at low corruption rates. revision: yes

  2. Referee: [§5] §5 (experimental results): The reported ~20% accuracy gain and 35% adaptive attack success rate are presented without sufficient detail on the number of runs, variance across seeds, exact baseline implementations, or ablation isolating the contribution of the attention-variance component versus simple thresholding. Without these, it is unclear whether the gains are robust or sensitive to data selection.

    Authors: We acknowledge that the experimental section would benefit from greater statistical transparency. The reported figures aggregate results over multiple random seeds and datasets, yet the manuscript does not tabulate per-seed variance or explicitly describe baseline re-implementations. In the revised version we will add (i) the exact number of runs and standard deviations for all accuracy and attack-success metrics, (ii) precise references and hyper-parameter settings for each baseline, and (iii) an ablation table that isolates the contribution of the variance term in the AV Filter versus simple NPAS thresholding. revision: yes

  3. Referee: [§4.1] Definition of NPAS (likely §4.1): The normalization and variance computation assume that higher or more variable attention on poisoned passages is both necessary and sufficient for detection. If an adaptive attack can equalize attention distributions while still steering generation (e.g., via prompt engineering on multiple passages), the filter's signal disappears; the manuscript should include a targeted experiment testing this scenario.

    Authors: The NPAS formulation is derived from the necessity of biasing attention to achieve control, and the adaptive attacks already evaluated attempt to reduce attention anomalies yet still reach only 35% success. Nevertheless, we agree that an explicit test of attention-equalization strategies (e.g., multi-passage prompt engineering) is warranted. We will add a targeted experiment in §5 that constructs such equalizing attacks and reports the resulting NPAS distributions and filter performance, thereby quantifying the residual detectability even under this stronger threat model. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The paper states a premise in the abstract that poisoned passages controlling the response must bias inference more than benign ones, which motivates the distinguishability security game and attention-based NPAS/AV Filter. This is presented as a logical motivation and modeling choice rather than a derivation that reduces by construction to fitted inputs or self-citations. No equations, self-citation chains, ansatzes, or renamings are quoted that exhibit the specific reductions required by the circularity patterns. The reported accuracy gains are empirical comparisons to baselines, leaving the work self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

Review performed on abstract only; ledger entries are therefore limited to concepts explicitly named in the abstract.

axioms (1)
  • domain assumption: Poisoned passages that control the generated response must bias the inference process more than benign passages, inherently reducing stealth.
    This premise is stated directly in the abstract as the motivation for analyzing attention weights.
invented entities (2)
  • Normalized Passage Attention Score (NPAS) · no independent evidence
    purpose: Quantify the relative influence of each retrieved passage on the model output via attention weights.
    New metric introduced in the abstract; no independent evidence provided.
  • Attention-Variance Filter (AV Filter) · no independent evidence
    purpose: Flag anomalous passages based on variance in attention patterns.
    New lightweight filter introduced in the abstract; no independent evidence provided.

pith-pipeline@v0.9.0 · 5739 in / 1362 out tokens · 42237 ms · 2026-05-25T08:07:09.610676+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Needle-in-RAG: Prompt-Conditioned Character-Level Traceback of Poisoned Spans in Retrieved Evidence

    cs.CR 2026-05 unverdicted novelty 7.0

    RAGCharacter localizes poisoned character spans in RAG evidence via prompt-conditioned counterfactual masking and achieves the best accuracy-over-attribution trade-off across tested attacks and models.

  2. Adaptive Defense Orchestration for RAG: A Sentinel-Strategist Architecture against Multi-Vector Attacks

    cs.CR 2026-04 unverdicted novelty 6.0

    A context-aware Sentinel-Strategist system for RAG selectively applies defenses to block membership inference and data poisoning while recovering most retrieval utility compared to always-on defense stacks.
