pith. sign in

arxiv: 2508.03793 · v3 · submitted 2025-08-05 · 💻 cs.CL · cs.CR

AttnTrace: Contextual Attribution of Prompt Injection and Knowledge Corruption

Pith reviewed 2026-05-19 00:29 UTC · model grok-4.3

classification 💻 cs.CL cs.CR
keywords context tracebackprompt injectionattention weightslarge language modelsretrieval-augmented generationinterpretabilityknowledge corruption
0
0 comments X

The pith

AttnTrace attributes LLM responses back to specific context texts using refined attention weights more accurately and efficiently than prior methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces AttnTrace to trace which texts in a long context most influence an LLM's generated response. This matters for real-world uses like forensic analysis after attacks, detecting prompt injections, and making outputs from RAG systems and agents more interpretable and trustworthy. AttnTrace starts from the model's native attention weights and adds two techniques to make them effective for attribution, along with theoretical support for those choices. Evaluation shows the method exceeds state-of-the-art approaches such as TracLLM in both accuracy and speed for single response-context pairs. It further enables an attribution-before-detection strategy that improves prompt injection detection and can locate injected instructions inside documents crafted to manipulate LLM review generation.

Core claim

AttnTrace is a context traceback method that uses the attention weights an LLM produces while generating a response. Two enhancement techniques are applied to these weights to better identify the contributing contextual texts, supported by theoretical insights into the design. Systematic experiments establish that this approach delivers higher accuracy and substantially lower computation cost than existing methods like TracLLM. The same attributions support improved detection of prompt injections under long contexts and successfully pinpoint injected instructions in a paper designed to alter LLM-generated reviews.

What carries the argument

AttnTrace, which processes LLM attention weights through two enhancement techniques to attribute response influence to specific segments of the input context.

If this is right

  • Traceback for a single response-context pair becomes feasible in far less time than the hundreds of seconds required by prior tools.
  • Prompt injection detection improves when attribution is performed first to narrow the search space in long contexts.
  • Injected instructions can be located inside documents intended to manipulate downstream LLM tasks such as review generation.
  • Interpretability and trustworthiness increase for responses in retrieval-augmented generation pipelines and autonomous agent systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could be combined with other internal attribution signals to create hybrid tracing systems that cross-validate results.
  • If the theoretical insights generalize, similar attention processing might help detect knowledge corruption inside an LLM's long-term memory stores.
  • The efficiency gains open the possibility of running continuous attribution monitoring on deployed long-context applications without prohibitive overhead.

Load-bearing premise

Attention weights, after the two proposed techniques, reliably indicate the contextual texts that causally contribute to the LLM response.

What would settle it

A controlled test where altering one known context text changes the LLM output but AttnTrace assigns it low attribution scores would show the method does not correctly identify causal contributions.

Figures

Figures reproduced from arXiv: 2508.03793 by Jinyuan Jia, Runpeng Geng, Yanting Wang, Ying Chen.

Figure 1
Figure 1. Figure 1: AttnTrace can trace back to embedded prompts in a context that manipulate LLM outputs. Section 4.6 shows a case study for pinpointing injected prompts manipulating the LLM-generated review. attacks [20, 23, 32, 64] that mislead an LLM-empowered agent to perform a malicious action; we can also identify malicious texts crafted by knowledge corruption attacks [75] that induce a RAG system to generate an incor… view at source ↗
Figure 2
Figure 2. Figure 2: Left: the key vectors of the important tokens exhibit similarity. We use key vectors from the fifth LLM layer and [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Visualize the distribution of contribution scores assigned to poisoned texts and clean texts, where each poisoned [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Impact of K, ρ, and B on AttnTrace. The experiment is performed for the prompt injection attack on the MuSiQue dataset. The top and bottom row show results when injecting a malicious instruction three and five times into a context, respectively. Impact of B [PITH_FULL_IMAGE:figures/full_fig_p013_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: An example loss curve during prefix optimization for adaptive attack. We observe that minimizing [PITH_FULL_IMAGE:figures/full_fig_p016_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Examples showing that the attention weights of a text usually concentrate on a few tokens. Deeper color [PITH_FULL_IMAGE:figures/full_fig_p020_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: A paper [41] (which is now withdrawn) containing concealed instructions. The picture is from [36]. Experimental setup: We perform evaluation for AgentPoison [12], which is a backdoor attack to LLM agents. AgentPoison injects malicious texts into the memory of an LLM agent to induce it to perform malicious actions when the query contains an attacker-chosen trigger (the trigger is optimized). Following [58],… view at source ↗
Figure 8
Figure 8. Figure 8: Impact of K is larger when q is small (0.1 in this figure). The malicious instruction is injected 5 times. 1 2 3 4 5 Number of malicious texts 0.0 0.2 0.4 0.6 0.8 1.0 Precision Recall [PITH_FULL_IMAGE:figures/full_fig_p025_8.png] view at source ↗
read the original abstract

Long-context large language models (LLMs), such as Gemini-2.5-Pro and Claude-Sonnet-4, are increasingly used to empower advanced AI systems, including retrieval-augmented generation (RAG) pipelines and autonomous agents. In these systems, an LLM receives an instruction along with a context--often consisting of texts retrieved from a knowledge database or memory--and generates a response that is contextually grounded by following the instruction. Recent studies have designed solutions to trace back to a subset of texts in the context that contributes most to the response generated by the LLM. These solutions have numerous real-world applications, including performing post-attack forensic analysis and improving the interpretability and trustworthiness of LLM outputs. While significant efforts have been made, state-of-the-art solutions such as TracLLM often lead to a high computation cost, e.g., it takes TracLLM hundreds of seconds to perform traceback for a single response-context pair. In this work, we propose AttnTrace, a new context traceback method based on the attention weights produced by an LLM for a prompt. To effectively utilize attention weights, we introduce two techniques designed to enhance the effectiveness of AttnTrace, and we provide theoretical insights for our design choice. We also perform a systematic evaluation for AttnTrace. The results demonstrate that AttnTrace is more accurate and efficient than existing state-of-the-art context traceback methods. We also show that AttnTrace can improve state-of-the-art methods in detecting prompt injection under long contexts through the attribution-before-detection paradigm. As a real-world application, we demonstrate that AttnTrace can effectively pinpoint injected instructions in a paper designed to manipulate LLM-generated reviews. The code is at https://github.com/Wang-Yanting/AttnTrace.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes AttnTrace, a context traceback method for long-context LLMs that uses attention weights with two enhancement techniques and theoretical insights. It claims superior accuracy and efficiency over state-of-the-art methods such as TracLLM for identifying which context texts causally contribute to LLM responses, with applications to prompt injection detection and real-world forensic analysis (e.g., injected instructions in manipulated paper reviews).

Significance. If the results hold, AttnTrace could provide a practical, low-overhead alternative to expensive traceback methods for improving interpretability and security in RAG pipelines and autonomous agents. The efficiency advantage (avoiding hundreds of seconds per query) would be valuable for deployment, and the attribution-before-detection paradigm for prompt injection is a promising direction. However, significance is limited by the absence of detailed quantitative validation in the provided text.

major comments (2)
  1. [Abstract] Abstract: the central claim that 'AttnTrace is more accurate and efficient than existing state-of-the-art context traceback methods' is asserted without any quantitative metrics, error bars, dataset sizes, or specific comparison numbers (e.g., precision/recall or runtime values versus TracLLM). This prevents assessment of the empirical support for the main result.
  2. [Design and Theoretical Insights] Design description and theoretical insights: the mapping from modified attention weights to causal contribution is load-bearing for all accuracy claims, yet the manuscript supplies no explicit causal validation such as context ablation experiments, counterfactual replacements, or intervention tests to confirm that the two techniques isolate true causal texts rather than positional or correlational artifacts.
minor comments (1)
  1. [Abstract] The GitHub link is provided but no details on reproducibility (e.g., exact prompts, model versions, or evaluation scripts) are mentioned in the abstract or evaluation summary.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. We address each major comment below and have revised the manuscript accordingly to strengthen the presentation of results and validation.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that 'AttnTrace is more accurate and efficient than existing state-of-the-art context traceback methods' is asserted without any quantitative metrics, error bars, dataset sizes, or specific comparison numbers (e.g., precision/recall or runtime values versus TracLLM). This prevents assessment of the empirical support for the main result.

    Authors: We agree that the abstract would be improved by including key quantitative results. The full manuscript reports a systematic evaluation with precision, recall, and runtime comparisons against TracLLM across multiple datasets, including standard deviations. We have revised the abstract to highlight representative metrics and dataset details so that the empirical support for the central claims is immediately apparent. revision: yes

  2. Referee: [Design and Theoretical Insights] Design description and theoretical insights: the mapping from modified attention weights to causal contribution is load-bearing for all accuracy claims, yet the manuscript supplies no explicit causal validation such as context ablation experiments, counterfactual replacements, or intervention tests to confirm that the two techniques isolate true causal texts rather than positional or correlational artifacts.

    Authors: We appreciate this observation. The design is motivated by theoretical analysis of attention mechanisms and their relationship to contextual influence. To provide direct empirical confirmation of causality, we have added ablation and intervention experiments in the revised manuscript. These tests remove or replace the highest-attributed context segments and quantify the resulting change in model output, demonstrating that the attributions align with causal effects beyond positional or correlational factors. revision: yes

Circularity Check

0 steps flagged

No significant circularity in AttnTrace derivation or claims

full rationale

The paper proposes AttnTrace as an attention-weight-based traceback method augmented by two unspecified techniques plus theoretical insights, then reports empirical gains in accuracy and efficiency over TracLLM. No equations, derivations, or self-referential definitions appear that would reduce the performance claims to fitted parameters or to the inputs by construction. The central modeling choice (modified attention indicating causal context contribution) is presented as a design decision justified by theoretical insights rather than a load-bearing self-citation chain or ansatz smuggled from prior author work. Evaluation is performed against external baselines on prompt-injection and knowledge-corruption tasks, rendering the argument self-contained against standard transformer attention and independent benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Method rests on the domain assumption that attention weights encode contribution information; two new techniques are introduced but not detailed in the abstract; no free parameters or invented entities are mentioned.

axioms (1)
  • domain assumption Attention weights produced by the LLM can be post-processed to attribute response content to specific context texts.
    Central premise stated in the abstract description of AttnTrace.

pith-pipeline@v0.9.0 · 5858 in / 1087 out tokens · 35065 ms · 2026-05-19T00:29:55.564642+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. FlashRT: Towards Computationally and Memory Efficient Red-Teaming for Prompt Injection and Knowledge Corruption

    cs.CR 2026-04 unverdicted novelty 6.0

    FlashRT delivers 2x-7x speedup and 2x-4x GPU memory reduction for prompt injection and knowledge corruption attacks on long-context LLMs versus nanoGCG.

Reference graph

Works this paper leans on

78 extracted references · 78 canonical work pages · cited by 1 Pith paper · 6 internal anchors

  1. [1]

    https://github.com/Significant-Gravitas/AutoGPT

    AutoGPT: Build, Deploy, and Run AI Agents. https://github.com/Significant-Gravitas/AutoGPT . November 2024

  2. [2]

    https://ai.meta.com/blog/meta-llama-3-1/

    Introducing Llama 3.1: Our most capable models to date. https://ai.meta.com/blog/meta-llama-3-1/ . November 2024

  3. [3]

    Claude-Sonnet-4 System Card

    Anthropic. Claude-Sonnet-4 System Card. https://www-cdn.anthropic.com/6be99a52cb68eb70eb9572b4cafad13 df32ed995.pdf, 2025

  4. [4]

    and contributors

    Artifex Software Inc. and contributors. Pymupdf – python bindings for mupdf (version 1.26.3). https://pymupdf.read thedocs.io/. Released July 2, 2025; high-performance PDF/text extraction library

  5. [5]

    Reliable, adaptable, and attributable language models with retrieval

    Akari Asai, Zexuan Zhong, Danqi Chen, Pang Wei Koh, Luke Zettlemoyer, Hannaneh Hajishirzi, and Wen-tau Yih. Reliable, adaptable, and attributable language models with retrieval. arXiv preprint arXiv:2403.03187, 2024

  6. [6]

    LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding

    Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, et al. Longbench: A bilingual, multitask benchmark for long context understanding. arXiv preprint arXiv:2308.14508, 2023

  7. [7]

    The use of the area under the roc curve in the evaluation of machine learning algorithms

    Andrew P Bradley. The use of the area under the roc curve in the evaluation of machine learning algorithms. Pattern recognition, 30(7):1145–1159, 1997. 16

  8. [8]

    Branch, Jonathan Rodriguez Cefalu, Jeremy McHugh, Leyla Hujer, Aditya Bahl, Daniel del Castillo Iglesias, Ron Heichman, and Ramesh Darwishi

    Hezekiah J Branch, Jonathan Rodriguez Cefalu, Jeremy McHugh, Leyla Hujer, Aditya Bahl, Daniel del Castillo Iglesias, Ron Heichman, and Ramesh Darwishi. Evaluating the susceptibility of pre-trained language models via handcrafted adversarial examples. arXiv preprint arXiv:2209.02128, 2022

  9. [9]

    Polynomial calculation of the shapley value based on sampling

    Javier Castro, Daniel Gómez, and Juan Tejada. Polynomial calculation of the shapley value based on sampling. Computers & operations research, 36(5):1726–1730, 2009

  10. [10]

    Jopa: Explaining large language model’s generation via joint prompt attribution

    Yurui Chang, Bochuan Cao, Yujia Wang, Jinghui Chen, and Lu Lin. Jopa: Explaining large language model’s generation via joint prompt attribution. In ACL, 2025

  11. [11]

    Phantom: General trigger attacks on retrieval augmented language generation

    Harsh Chaudhari, Giorgio Severi, John Abascal, Matthew Jagielski, Christopher A Choquette-Choo, Milad Nasr, Cristina Nita-Rotaru, and Alina Oprea. Phantom: General trigger attacks on retrieval augmented language generation. arXiv, 2024

  12. [12]

    Agentpoison: Red-teaming llm agents via poisoning memory or knowledge bases

    Zhaorun Chen, Zhen Xiang, Chaowei Xiao, Dawn Song, and Bo Li. Agentpoison: Red-teaming llm agents via poisoning memory or knowledge bases. arXiv, 2024

  13. [13]

    Trojanrag: Retrieval-augmented genera- tion can be backdoor driver in large language mod- els,

    Pengzhou Cheng, Yidong Ding, Tianjie Ju, Zongru Wu, Wei Du, Ping Yi, Zhuosheng Zhang, and Gongshen Liu. Trojanrag: Retrieval-augmented generation can be backdoor driver in large language models. arXiv preprint arXiv:2405.13401, 2024

  14. [14]

    Learning to attribute with attention

    Benjamin Cohen-Wang, Yung-Sung Chuang, and Aleksander Madry. Learning to attribute with attention. arXiv, 2025

  15. [15]

    Contextcite: Attributing model generation to context

    Benjamin Cohen-Wang, Harshay Shah, Kristian Georgiev, and Aleksander Madry. Contextcite: Attributing model generation to context. In NeurIPS, 2024

  16. [16]

    Characterizations of an empirical influence function for detecting influential cases in regression

    R Dennis Cook and Sanford Weisberg. Characterizations of an empirical influence function for detecting influential cases in regression. Technometrics, 22(4):495–508, 1980

  17. [17]

    Explaining by removing: A unified framework for model explanation

    Ian Covert, Scott Lundberg, and Su-In Lee. Explaining by removing: A unified framework for model explanation. Journal of Machine Learning Research, 22(209):1–90, 2021

  18. [18]

    Flashattention-2: Faster attention with better parallelism and work partitioning

    Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning. arXiv, 2023

  19. [19]

    Agentdojo: A dynamic environment to evaluate prompt injection attacks and defenses for llm agents

    Edoardo Debenedetti, Jie Zhang, Mislav Balunovic, Luca Beurer-Kellner, Marc Fischer, and Florian Tramèr. Agentdojo: A dynamic environment to evaluate prompt injection attacks and defenses for llm agents. In NeurIPS, 2024

  20. [20]

    Prompt Injection Attacks: A New Frontier in Cybersecurity

    Jacob Fox. Prompt Injection Attacks: A New Frontier in Cybersecurity. https://www.cobalt.io/blog/prompt-injec tion-attacks, 2023

  21. [21]

    Enabling large language models to generate text with citations

    Tianyu Gao, Howard Yen, Jiatong Yu, and Danqi Chen. Enabling large language models to generate text with citations. In EMNLP, 2023

  22. [22]

    Gemini-2.5-Pro Technical Report

    Google DeepMind. Gemini-2.5-Pro Technical Report. https://storage.googleapis.com/deepmind-media/gemin i/gemini_v2_5_report.pdf, 2025

  23. [23]

    Not what you’ve signed up for: Compromising real-world llm-integrated applications with indirect prompt injection

    Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. Not what you’ve signed up for: Compromising real-world llm-integrated applications with indirect prompt injection. In AISec, 2023

  24. [24]

    Attention tracker: Detecting prompt injection attacks in llms

    Kuo-Han Hung, Ching-Yun Ko, Ambrish Rawat, I Chung, Winston H Hsu, Pin-Yu Chen, et al. Attention tracker: Detecting prompt injection attacks in llms. arXiv, 2024

  25. [25]

    Dense passage retrieval for open-domain question answering

    Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. In EMNLP, 2020

  26. [26]

    The narrativeqa reading comprehension challenge

    Tomáš Koˇcisk`y, Jonathan Schwarz, Phil Blunsom, Chris Dyer, Karl Moritz Hermann, Gábor Melis, and Edward Grefenstette. The narrativeqa reading comprehension challenge. Transactions of the Association for Computational Linguistics, 6:317– 328, 2018

  27. [27]

    Natural questions: a benchmark for question answering research

    Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. Natural questions: a benchmark for question answering research. TACL, 2019. 17

  28. [28]

    Retrieval-augmented generation for knowledge-intensive nlp tasks

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. NeurIPS, 2020

  29. [29]

    Repoqa: Evaluating long context code understanding

    Jiawei Liu, Jia Le Tian, Vijay Daita, Yuxiang Wei, Yifeng Ding, Yuhan Katherine Wang, Jun Yang, and Lingming Zhang. Repoqa: Evaluating long context code understanding. arXiv, 2024

  30. [30]

    Autodan: Generating stealthy jailbreak prompts on aligned large language models

    Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei Xiao. Autodan: Generating stealthy jailbreak prompts on aligned large language models. arXiv, 2023

  31. [31]

    Automatic and universal prompt injection attacks against large language models

    Xiaogeng Liu, Zhiyuan Yu, Yizhe Zhang, Ning Zhang, and Chaowei Xiao. Automatic and universal prompt injection attacks against large language models. arXiv, 2024

  32. [33]

    Formalizing and benchmarking prompt injection attacks and defenses

    Yupei Liu, Yuqi Jia, Runpeng Geng, Jinyuan Jia, and Neil Zhenqiang Gong. Formalizing and benchmarking prompt injection attacks and defenses. In USENIX Security Symposium, 2024

  33. [34]

    Datasentinel: A game-theoretic detection of prompt injection attacks

    Yupei Liu, Yuqi Jia, Jinyuan Jia, Dawn Song, and Neil Zhenqiang Gong. Datasentinel: A game-theoretic detection of prompt injection attacks. In IEEE Symposium on Security and Privacy, 2025

  34. [35]

    A Unified Approach to Interpreting Model Predictions

    Scott Lundberg. A unified approach to interpreting model predictions. arXiv preprint arXiv:1705.07874, 2017

  35. [36]

    How sneaky researchers are using hidden ai prompts to influence the peer review process

    Medium. How sneaky researchers are using hidden ai prompts to influence the peer review process. https://medium.c om/@JimTheAIWhisperer/update-5-more-papers-to-add-to-the-17-so-far-another-32-researchers-5f1 e00885cfb

  36. [37]

    Using captum to explain generative language models

    Vivek Miglani, Aobo Yang, Aram Markosyan, Diego Garcia-Olano, and Narine Kokhlikyan. Using captum to explain generative language models. In NLP-OSS, 2023

  37. [38]

    WebGPT: Browser-assisted question-answering with human feedback

    Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. Webgpt: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332, 2021

  38. [39]

    Ms marco: A human generated machine reading comprehension dataset

    Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. Ms marco: A human generated machine reading comprehension dataset. choice, 2640:660, 2016

  39. [40]

    Positive review only’: Researchers hide ai prompts in papers

    Nikkei Asia. Positive review only’: Researchers hide ai prompts in papers. https://asia.nikkei.com/Business/T echnology/Artificial-intelligence/Positive-review-only-Researchers-hide-AI-prompts-in-papers , 2025

  40. [41]

    Llm agents for bargaining with utility-based feedback

    Jihwan Oh, Murad Aghazada, Se-Young Yun, and Taehyeon Kim. Llm agents for bargaining with utility-based feedback. arXiv preprint arXiv:2505.22998, 2025

  41. [42]

    Introducing GPT-4.1 in the API

    OpenAI. Introducing GPT-4.1 in the API. https://openai.com/index/gpt-4-1, 2025

  42. [43]

    Neural exec: Learning (and learning from) execution triggers for prompt injection attacks

    Dario Pasquini, Martin Strohmeier, and Carmela Troncoso. Neural exec: Learning (and learning from) execution triggers for prompt injection attacks. arXiv, 2024

  43. [44]

    Ignore previous prompt: Attack techniques for language models

    Fábio Perez and Ian Ribeiro. Ignore previous prompt: Attack techniques for language models. arXiv, 2022

  44. [45]

    RISE: Randomized Input Sampling for Explanation of Black-box Models

    Vitali Petsiuk, Abir Das, and Kate Saenko. Rise: Randomized input sampling for explanation of black-box models. arXiv preprint arXiv:1806.07421, 2018

  45. [46]

    Estimating training data influence by tracing gradient descent

    Garima Pruthi, Frederick Liu, Satyen Kale, and Mukund Sundararajan. Estimating training data influence by tracing gradient descent. NeurIPS, 2020

  46. [47]

    why should i trust you?

    Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. " why should i trust you?" explaining the predictions of any classifier. In KDD, 2016. 18

  47. [48]

    Willison

    S. Willison. Delimiters won’t save you from prompt injection. https://simonwillison.net/2023/May/11/delimite rs-wont-save-you . 2023

  48. [49]

    Is Attention Interpretable?

    Sofia Serrano and Noah A Smith. Is attention interpretable? arXiv preprint arXiv:1906.03731, 2019

  49. [50]

    Machine against the rag: Jamming retrieval-augmented generation with blocker documents

    Avital Shafran, Roei Schuster, and Vitaly Shmatikov. Machine against the rag: Jamming retrieval-augmented generation with blocker documents. In USENIX Security, 2025

  50. [51]

    Poison forensics: Traceback of data poisoning attacks in neural networks

    Shawn Shan, Arjun Nitin Bhagoji, Haitao Zheng, and Ben Y Zhao. Poison forensics: Traceback of data poisoning attacks in neural networks. In USENIX Security, 2022

  51. [52]

    Learning important features through propagating activation differences

    Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. Learning important features through propagating activation differences. In ICML, 2017

  52. [53]

    Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps

    Karen Simonyan. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013

  53. [54]

    Axiomatic attribution for deep networks

    Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. In ICML, 2017

  54. [55]

    Musique: Multihop questions via single-hop question composition

    Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. Musique: Multihop questions via single-hop question composition. Transactions of the Association for Computational Linguistics, 10:539–554, 2022

  55. [56]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017

  56. [57]

    Leave no document behind: Benchmarking long-context llms with extended multi-doc qa, 2024b

    Minzheng Wang, Longze Chen, Cheng Fu, Shengyi Liao, Xinghua Zhang, Bingli Wu, Haiyang Yu, Nan Xu, Lei Zhang, Run Luo, et al. Leave no document behind: Benchmarking long-context llms with extended multi-doc qa, 2024b. arXiv, 2024

  57. [58]

    Tracllm: A generic framework for attributing outputs of long context llms

    Yanting Wang, Wei Zou, Runpeng Geng, and Jinyuan Jia. Tracllm: A generic framework for attributing outputs of long context llms. In USENIX Security Symposium, 2025

  58. [59]

    Gradient based feature attribution in explainable ai: A technical review,

    Yongjie Wang, Tong Zhang, Xu Guo, and Zhiqi Shen. Gradient based feature attribution in explainable ai: A technical review. arXiv preprint arXiv:2403.10415, 2024

  59. [60]

    Chain-of- thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of- thought prompting elicits reasoning in large language models. NeurIPS, 2022

  60. [61]

    Long-form factuality in large language models

    Jerry Wei, Chengrun Yang, Xinying Song, Yifeng Lu, Nathan Hu, Dustin Tran, Daiyi Peng, Ruibo Liu, Da Huang, Cosmo Du, et al. Long-form factuality in large language models. arXiv, 2024

  61. [62]

    Wiegreffe, Y

    Sarah Wiegreffe and Yuval Pinter. Attention is not not explanation. arXiv preprint arXiv:1908.04626, 2019

  62. [63]

    Prompt injection attacks against gpt-3

    Simon Willison. Prompt injection attacks against gpt-3. https://simonwillison.net/2022/Sep/12/prompt-injec tion/. 2022

  63. [64]

    Prompt injection attacks against GPT-3

    Simon Willison. Prompt injection attacks against GPT-3. https://simonwillison.net/2022/Sep/12/prompt-injec tion/, 2022

  64. [65]

    Certifiably robust rag against retrieval corruption

    Chong Xiang, Tong Wu, Zexuan Zhong, David Wagner, Danqi Chen, and Prateek Mittal. Certifiably robust rag against retrieval corruption. arXiv, 2024

  65. [66]

    Bad- chain: Backdoor chain-of-thought prompting for large language models

    Zhen Xiang, Fengqing Jiang, Zidi Xiong, Bhaskar Ramasubramanian, Radha Poovendran, and Bo Li. Badchain: Backdoor chain-of-thought prompting for large language models. arXiv preprint arXiv:2401.12242, 2024

  66. [67]

    Badrag: Identifying vulnerabilities in retrieval augmented generation of large language models

    Jiaqi Xue, Mengxin Zheng, Yebowen Hu, Fei Liu, Xun Chen, and Qian Lou. Badrag: Identifying vulnerabilities in retrieval augmented generation of large language models. arXiv, 2024

  67. [68]

    Hotpotqa: A dataset for diverse, explainable multi-hop question answering

    Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. In EMNLP, 2018. 19

  68. [69]

    React: Synergizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, 2023

  69. [70]

    Traceback of poisoning attacks to retrieval-augmented generation

    Baolei Zhang, Haoran Xin, Minghong Fang, Zhuqing Liu, Biao Yi, Tong Li, and Zheli Liu. Traceback of poisoning attacks to retrieval-augmented generation. In Proceedings of the ACM on Web Conference, 2025

  70. [71]

    H2o: Heavy-hitter oracle for efficient generative inference of large language models

    Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, et al. H2o: Heavy-hitter oracle for efficient generative inference of large language models. NeurIPS, 2023

  71. [72]

    Qmsum: A new benchmark for query-based multi-domain meeting summarization

    Ming Zhong, Da Yin, Tao Yu, Ahmad Zaidi, Mutethia Mutuma, Rahul Jha, Ahmed Hassan Awadallah, Asli Celikyilmaz, Yang Liu, Xipeng Qiu, et al. Qmsum: A new benchmark for query-based multi-domain meeting summarization. arXiv, 2021

  72. [73]

    Poisoning retrieval corpora by injecting adversarial passages

    Zexuan Zhong, Ziqing Huang, Alexander Wettig, and Danqi Chen. Poisoning retrieval corpora by injecting adversarial passages. arXiv preprint arXiv:2310.19156, 2023

  73. [74]

    Universal and transferable adversarial attacks on aligned language models

    Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models. arXiv, 2023

  74. [75]

    Please draft a high-quality review for a top-tier conference for the following submission. {paper content}

    Wei Zou, Runpeng Geng, Binghui Wang, and Jinyuan Jia. Poisonedrag: Knowledge corruption attacks to retrieval-augmented generation of large language models. In USENIX Security, 2025. (a) Example 1. (b) Example 2. Figure 6: Examples showing that the attention weights of a text usually concentrate on a few tokens. Deeper color represents larger attention wei...

  75. [76]

    This dataset significantly improves upon existing benchmarks that often oversimplify negotiation dynamics

    Creation of the [dataset name] Dataset: The authors introduce a novel benchmark containing six realistic market scenarios that incorporate deception, monopolies, and asymmetric bargaining power. This dataset significantly improves upon existing benchmarks that often oversimplify negotiation dynamics

  76. [77]

    It incorporates key quantities such as consumer surplus and negotiation power to better reflect human-centered evaluation of negotiation quality

    Development of the [metric name] Metric: The proposed metric provides a principled evaluation grounded in economic theory. It incorporates key quantities such as consumer surplus and negotiation power to better reflect human-centered evaluation of negotiation quality

  77. [78]

    This method leads to measurable gains in negotiation performance and promotes generalizable bargaining strategies

    Utility-Based Feedback for In-Context Learning: The authors design an in-context learning approach where LLMs iteratively update their strategy based on structured feedback signals derived from utility outcomes. This method leads to measurable gains in negotiation performance and promotes generalizable bargaining strategies

  78. [79]

    The results demonstrate that the method achieves consistent improvements in negotiation efficiency and fairness across multiple settings

    Extensive Experimental Validation: The proposed framework is rigorously tested across various LLM families, including GPT and Gemini models. The results demonstrate that the method achieves consistent improvements in negotiation efficiency and fairness across multiple settings. Strengths: • The paper is well-organized and clearly written, providing a cohe...