AttnTrace: Contextual Attribution of Prompt Injection and Knowledge Corruption

Jinyuan Jia; Runpeng Geng; Yanting Wang; Ying Chen

arXiv: 2508.03793 · v3 · submitted 2025-08-05 · cs.CL · cs.CR

Yanting Wang, Runpeng Geng, Ying Chen, Jinyuan Jia

Pith reviewed 2026-05-19 00:29 UTC · model grok-4.3

keywords: context traceback, prompt injection, attention weights, large language models, retrieval-augmented generation, interpretability, knowledge corruption

The pith

AttnTrace attributes LLM responses back to specific context texts using refined attention weights more accurately and efficiently than prior methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces AttnTrace to trace which texts in a long context most influence an LLM's generated response. This matters for real-world uses like forensic analysis after attacks, detecting prompt injections, and making outputs from RAG systems and agents more interpretable and trustworthy. AttnTrace starts from the model's native attention weights and adds two techniques to make them effective for attribution, along with theoretical support for those choices. Evaluation shows the method exceeds state-of-the-art approaches such as TracLLM in both accuracy and speed for single response-context pairs. It further enables an attribution-before-detection strategy that improves prompt injection detection and can locate injected instructions inside documents crafted to manipulate LLM review generation.

Core claim

AttnTrace is a context traceback method that uses the attention weights an LLM produces while generating a response. Two enhancement techniques are applied to these weights to better identify the contributing contextual texts, supported by theoretical insights into the design. Systematic experiments establish that this approach delivers higher accuracy and substantially lower computation cost than existing methods like TracLLM. The same attributions support improved detection of prompt injections under long contexts and successfully pinpoint injected instructions in a paper designed to alter LLM-generated reviews.

What carries the argument

AttnTrace processes the LLM's attention weights through two enhancement techniques and uses the result to attribute the response to specific segments of the input context.
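
A minimal sketch of the general idea, not the authors' implementation. This page does not spell out the two enhancement techniques, so the top-K averaging below is an assumption suggested only by the parameter K in Figure 4 and the token concentration noted in Figure 6. The names `segment_scores` and `rank_segments`, and the choice to average over response tokens first, are hypothetical.

```python
import numpy as np


def segment_scores(attn, segment_ids, top_k=5):
    """Aggregate token-level attention into one score per context segment.

    attn: array of shape (num_response_tokens, num_context_tokens), assumed
          to be already averaged over heads and layers (an assumption).
    segment_ids: length num_context_tokens; the integer id of the context
          segment each token belongs to.
    top_k: number of highest-weight tokens averaged per segment (assumed).
    """
    attn = np.asarray(attn, dtype=float)
    segment_ids = np.asarray(segment_ids)
    # One weight per context token: mean over all response tokens.
    token_weight = attn.mean(axis=0)
    scores = {}
    for seg in np.unique(segment_ids):
        weights = token_weight[segment_ids == seg]
        k = min(top_k, weights.size)
        # Mean of the k largest token weights inside this segment.
        scores[int(seg)] = float(np.sort(weights)[-k:].mean())
    return scores


def rank_segments(scores):
    """Return segment ids ordered from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)
```

Averaging only the largest token weights keeps a long segment with a few decisive tokens from being diluted by its many low-weight tokens, which is one plausible reason for a top-K style aggregation.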

If this is right

Traceback for a single response-context pair becomes feasible in far less time than the hundreds of seconds required by prior tools.
Prompt injection detection improves when attribution is performed first to narrow the search space in long contexts.
Injected instructions can be located inside documents intended to manipulate downstream LLM tasks such as review generation.
Interpretability and trustworthiness increase for responses in retrieval-augmented generation pipelines and autonomous agent systems.
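
The attribution-before-detection idea in the list above can be sketched as follows. This is an illustration under assumptions: `detector` stands in for any prompt injection detector, `toy_detector` is a hypothetical keyword check, and `top_n` is an illustrative cutoff. None of these names or values come from the paper.

```python
from typing import Callable, Dict, List


def attribution_before_detection(
    segments: List[str],
    scores: Dict[int, float],
    detector: Callable[[str], bool],
    top_n: int = 3,
) -> List[int]:
    """Run the detector only on the top_n highest-attributed segments.

    Returns the indices of the inspected segments that the detector flags.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)[:top_n]
    return [i for i in ranked if detector(segments[i])]


def toy_detector(text: str) -> bool:
    """Hypothetical stand-in detector: a plain keyword match."""
    return "ignore previous instructions" in text.lower()
```

The detector sees a handful of short texts instead of the full long context, which is where any gain in detection quality or cost would come from.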

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could be combined with other internal attribution signals to create hybrid tracing systems that cross-validate results.
If the theoretical insights generalize, similar attention processing might help detect knowledge corruption inside an LLM's long-term memory stores.
The efficiency gains open the possibility of running continuous attribution monitoring on deployed long-context applications without prohibitive overhead.

Load-bearing premise

Attention weights, after the two proposed techniques, reliably indicate the contextual texts that causally contribute to the LLM response.

What would settle it

A controlled test where altering one known context text changes the LLM output but AttnTrace assigns it low attribution scores would show the method does not correctly identify causal contributions.
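
A sketch of how such a test could be scripted, assuming removal as the form of alteration. `generate` and `attribute` are hypothetical callables standing in for the LLM and for an attribution method, and the `low_score` threshold is illustrative only.

```python
from typing import Callable, List, Sequence


def find_counterexamples(
    segments: List[str],
    generate: Callable[[List[str]], str],
    attribute: Callable[[List[str], str], Sequence[float]],
    low_score: float = 0.1,
) -> List[int]:
    """Return indices of segments whose removal changes the output even
    though the attribution method scored them below low_score."""
    baseline = generate(segments)
    scores = attribute(segments, baseline)
    counterexamples = []
    for i in range(len(segments)):
        # Leave-one-out ablation: drop segment i and regenerate.
        ablated = segments[:i] + segments[i + 1:]
        if generate(ablated) != baseline and scores[i] < low_score:
            counterexamples.append(i)
    return counterexamples
```

A non-empty result would be evidence against the load-bearing premise. An empty result only fails to refute it for the cases tried. With a sampling LLM, exact string comparison is too strict, so a real test would need deterministic decoding or a semantic comparison.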

Figures

Figures reproduced from arXiv: 2508.03793 by Jinyuan Jia, Runpeng Geng, Yanting Wang, Ying Chen.

**Figure 1.** AttnTrace can trace back to embedded prompts in a context that manipulate LLM outputs. Section 4.6 shows a case study for pinpointing injected prompts manipulating the LLM-generated review. … attacks [20, 23, 32, 64] that mislead an LLM-empowered agent to perform a malicious action; we can also identify malicious texts crafted by knowledge corruption attacks [75] that induce a RAG system to generate an incor…

**Figure 2.** Left: the key vectors of the important tokens exhibit similarity. We use key vectors from the fifth LLM layer and …

**Figure 3.** Visualization of the distribution of contribution scores assigned to poisoned texts and clean texts, where each poisoned …

**Figure 4.** Impact of K, ρ, and B on AttnTrace. The experiment is performed for the prompt injection attack on the MuSiQue dataset. The top and bottom rows show results when injecting a malicious instruction three and five times into a context, respectively.

**Figure 5.** An example loss curve during prefix optimization for adaptive attack. We observe that minimizing …

**Figure 6.** Examples showing that the attention weights of a text usually concentrate on a few tokens. Deeper color …

**Figure 7.** A paper [41] (which is now withdrawn) containing concealed instructions. The picture is from [36]. Experimental setup: We perform evaluation for AgentPoison [12], which is a backdoor attack on LLM agents. AgentPoison injects malicious texts into the memory of an LLM agent to induce it to perform malicious actions when the query contains an attacker-chosen trigger (the trigger is optimized). Following [58],…

**Figure 8.** Impact of K is larger when q is small (0.1 in this figure). The malicious instruction is injected 5 times. Plot axes: number of malicious texts (1 to 5) against precision and recall (0.0 to 1.0).

Original abstract

Long-context large language models (LLMs), such as Gemini-2.5-Pro and Claude-Sonnet-4, are increasingly used to empower advanced AI systems, including retrieval-augmented generation (RAG) pipelines and autonomous agents. In these systems, an LLM receives an instruction along with a context--often consisting of texts retrieved from a knowledge database or memory--and generates a response that is contextually grounded by following the instruction. Recent studies have designed solutions to trace back to a subset of texts in the context that contributes most to the response generated by the LLM. These solutions have numerous real-world applications, including performing post-attack forensic analysis and improving the interpretability and trustworthiness of LLM outputs. While significant efforts have been made, state-of-the-art solutions such as TracLLM often lead to a high computation cost, e.g., it takes TracLLM hundreds of seconds to perform traceback for a single response-context pair. In this work, we propose AttnTrace, a new context traceback method based on the attention weights produced by an LLM for a prompt. To effectively utilize attention weights, we introduce two techniques designed to enhance the effectiveness of AttnTrace, and we provide theoretical insights for our design choice. We also perform a systematic evaluation for AttnTrace. The results demonstrate that AttnTrace is more accurate and efficient than existing state-of-the-art context traceback methods. We also show that AttnTrace can improve state-of-the-art methods in detecting prompt injection under long contexts through the attribution-before-detection paradigm. As a real-world application, we demonstrate that AttnTrace can effectively pinpoint injected instructions in a paper designed to manipulate LLM-generated reviews. The code is at https://github.com/Wang-Yanting/AttnTrace.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes AttnTrace, a context traceback method for long-context LLMs that uses attention weights with two enhancement techniques and theoretical insights. It claims superior accuracy and efficiency over state-of-the-art methods such as TracLLM for identifying which context texts causally contribute to LLM responses, with applications to prompt injection detection and real-world forensic analysis (e.g., injected instructions in manipulated paper reviews).

Significance. If the results hold, AttnTrace could provide a practical, low-overhead alternative to expensive traceback methods for improving interpretability and security in RAG pipelines and autonomous agents. The efficiency advantage (avoiding hundreds of seconds per query) would be valuable for deployment, and the attribution-before-detection paradigm for prompt injection is a promising direction. However, significance is limited by the absence of detailed quantitative validation in the provided text.

major comments (2)

[Abstract] The central claim that 'AttnTrace is more accurate and efficient than existing state-of-the-art context traceback methods' is asserted without any quantitative metrics, error bars, dataset sizes, or specific comparison numbers (e.g., precision/recall or runtime values versus TracLLM). This prevents assessment of the empirical support for the main result.
[Design and Theoretical Insights] The mapping from modified attention weights to causal contribution is load-bearing for all accuracy claims, yet the manuscript supplies no explicit causal validation such as context ablation experiments, counterfactual replacements, or intervention tests to confirm that the two techniques isolate true causal texts rather than positional or correlational artifacts.
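
For reference, the precision and recall figures the first comment asks for could be computed per context roughly as below. This is a sketch assuming the indices of the injected or corrupted texts are known as ground truth. The function name and the set-based formulation are illustrative, not taken from the paper's evaluation code.

```python
from typing import Iterable, Tuple


def precision_recall(predicted: Iterable[int], injected: Iterable[int]) -> Tuple[float, float]:
    """Precision and recall of the texts an attribution method returns.

    predicted: indices of the context texts the method traced back to.
    injected: ground-truth indices of the malicious texts.
    """
    predicted_set, injected_set = set(predicted), set(injected)
    true_positives = len(predicted_set & injected_set)
    precision = true_positives / len(predicted_set) if predicted_set else 0.0
    recall = true_positives / len(injected_set) if injected_set else 0.0
    return precision, recall
```

Reporting both numbers together with runtime per response-context pair would address the accuracy and efficiency halves of the claim.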

minor comments (1)

[Abstract] The GitHub link is provided but no details on reproducibility (e.g., exact prompts, model versions, or evaluation scripts) are mentioned in the abstract or evaluation summary.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. We address each major comment below and have revised the manuscript accordingly to strengthen the presentation of results and validation.

Referee: [Abstract] The central claim that 'AttnTrace is more accurate and efficient than existing state-of-the-art context traceback methods' is asserted without any quantitative metrics, error bars, dataset sizes, or specific comparison numbers (e.g., precision/recall or runtime values versus TracLLM). This prevents assessment of the empirical support for the main result.

Authors: We agree that the abstract would be improved by including key quantitative results. The full manuscript reports a systematic evaluation with precision, recall, and runtime comparisons against TracLLM across multiple datasets, including standard deviations. We have revised the abstract to highlight representative metrics and dataset details so that the empirical support for the central claims is immediately apparent.

Revision: yes

Referee: [Design and Theoretical Insights] The mapping from modified attention weights to causal contribution is load-bearing for all accuracy claims, yet the manuscript supplies no explicit causal validation such as context ablation experiments, counterfactual replacements, or intervention tests to confirm that the two techniques isolate true causal texts rather than positional or correlational artifacts.

Authors: We appreciate this observation. The design is motivated by theoretical analysis of attention mechanisms and their relationship to contextual influence. To provide direct empirical confirmation of causality, we have added ablation and intervention experiments in the revised manuscript. These tests remove or replace the highest-attributed context segments and quantify the resulting change in model output, demonstrating that the attributions align with causal effects beyond positional or correlational factors.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity in AttnTrace derivation or claims

The paper proposes AttnTrace as an attention-weight-based traceback method augmented by two unspecified techniques plus theoretical insights, then reports empirical gains in accuracy and efficiency over TracLLM. No equations, derivations, or self-referential definitions appear that would reduce the performance claims to fitted parameters or to the inputs by construction. The central modeling choice (modified attention indicating causal context contribution) is presented as a design decision justified by theoretical insights rather than a load-bearing self-citation chain or ansatz smuggled from prior author work. Evaluation is performed against external baselines on prompt-injection and knowledge-corruption tasks, rendering the argument self-contained against standard transformer attention and independent benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Method rests on the domain assumption that attention weights encode contribution information; two new techniques are introduced but not detailed in the abstract; no free parameters or invented entities are mentioned.

axioms (1)

Domain assumption: Attention weights produced by the LLM can be post-processed to attribute response content to specific context texts.
Central premise stated in the abstract description of AttnTrace.

pith-pipeline@v0.9.0 · 5858 in / 1087 out tokens · 35065 ms · 2026-05-19T00:29:55.564642+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean reality_from_one_distinction unclear

unclear
Relation between the paper passage and the cited Recognition theorem.

Proposition 1 (Attention weight upper bound) ... $\alpha_{\max} \le \dfrac{1}{1 + (m-1)\exp\left[-\lVert q \rVert \sqrt{2m\,\lambda_{\max}(\Sigma_I)/d}\right]}$
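
The bound quoted above can be evaluated numerically to see how it behaves. The sketch below only transcribes the expression as printed. The readings of the symbols are assumptions, since this page does not define them: `m` as a token count, `q_norm` as the norm of the query vector q, `lam_max` as the largest eigenvalue of the matrix written ΣI, and `d` as a dimension.

```python
import math


def attention_upper_bound(m: int, q_norm: float, lam_max: float, d: int) -> float:
    """Evaluate 1 / (1 + (m - 1) * exp(-q_norm * sqrt(2 * m * lam_max / d))).

    Symbol meanings are assumed, not taken from the paper's definitions.
    """
    exponent = -q_norm * math.sqrt(2 * m * lam_max / d)
    return 1.0 / (1.0 + (m - 1) * math.exp(exponent))
```

Two sanity checks follow directly from the formula: with `m = 1` the bound is 1, and with `q_norm = 0` it reduces to `1 / m`, the value for uniform weights over `m` items.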

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

FlashRT: Towards Computationally and Memory Efficient Red-Teaming for Prompt Injection and Knowledge Corruption
cs.CR · 2026-04 · unverdicted · novelty 6.0

FlashRT delivers 2x-7x speedup and 2x-4x GPU memory reduction for prompt injection and knowledge corruption attacks on long-context LLMs versus nanoGCG.

Reference graph

Works this paper leans on

78 extracted references · 78 canonical work pages · cited by 1 Pith paper · 6 internal anchors

[1] AutoGPT: Build, Deploy, and Run AI Agents. https://github.com/Significant-Gravitas/AutoGPT. November 2024.
[2] Introducing Llama 3.1: Our most capable models to date. https://ai.meta.com/blog/meta-llama-3-1/. November 2024.
[3] Anthropic. Claude-Sonnet-4 System Card. https://www-cdn.anthropic.com/6be99a52cb68eb70eb9572b4cafad13df32ed995.pdf, 2025.
[4] Artifex Software Inc. and contributors. Pymupdf – python bindings for mupdf (version 1.26.3). https://pymupdf.readthedocs.io/. Released July 2, 2025; high-performance PDF/text extraction library.
[5] Akari Asai, Zexuan Zhong, Danqi Chen, Pang Wei Koh, Luke Zettlemoyer, Hannaneh Hajishirzi, and Wen-tau Yih. Reliable, adaptable, and attributable language models with retrieval. arXiv preprint arXiv:2403.03187, 2024.
[6] Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, et al. Longbench: A bilingual, multitask benchmark for long context understanding. arXiv preprint arXiv:2308.14508, 2023.
[7] Andrew P Bradley. The use of the area under the roc curve in the evaluation of machine learning algorithms. Pattern recognition, 30(7):1145–1159, 1997.
[8] Hezekiah J Branch, Jonathan Rodriguez Cefalu, Jeremy McHugh, Leyla Hujer, Aditya Bahl, Daniel del Castillo Iglesias, Ron Heichman, and Ramesh Darwishi. Evaluating the susceptibility of pre-trained language models via handcrafted adversarial examples. arXiv preprint arXiv:2209.02128, 2022.
[9] Javier Castro, Daniel Gómez, and Juan Tejada. Polynomial calculation of the shapley value based on sampling. Computers & operations research, 36(5):1726–1730, 2009.
[10] Yurui Chang, Bochuan Cao, Yujia Wang, Jinghui Chen, and Lu Lin. Jopa: Explaining large language model’s generation via joint prompt attribution. In ACL, 2025.
[11] Harsh Chaudhari, Giorgio Severi, John Abascal, Matthew Jagielski, Christopher A Choquette-Choo, Milad Nasr, Cristina Nita-Rotaru, and Alina Oprea. Phantom: General trigger attacks on retrieval augmented language generation. arXiv, 2024.
[12] Zhaorun Chen, Zhen Xiang, Chaowei Xiao, Dawn Song, and Bo Li. Agentpoison: Red-teaming llm agents via poisoning memory or knowledge bases. arXiv, 2024.
[13] Pengzhou Cheng, Yidong Ding, Tianjie Ju, Zongru Wu, Wei Du, Ping Yi, Zhuosheng Zhang, and Gongshen Liu. Trojanrag: Retrieval-augmented generation can be backdoor driver in large language models. arXiv preprint arXiv:2405.13401, 2024.
[14] Benjamin Cohen-Wang, Yung-Sung Chuang, and Aleksander Madry. Learning to attribute with attention. arXiv, 2025.
[15] Benjamin Cohen-Wang, Harshay Shah, Kristian Georgiev, and Aleksander Madry. Contextcite: Attributing model generation to context. In NeurIPS, 2024.
[16] R Dennis Cook and Sanford Weisberg. Characterizations of an empirical influence function for detecting influential cases in regression. Technometrics, 22(4):495–508, 1980.
[17] Ian Covert, Scott Lundberg, and Su-In Lee. Explaining by removing: A unified framework for model explanation. Journal of Machine Learning Research, 22(209):1–90, 2021.
[18] Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning. arXiv, 2023.
[19] Edoardo Debenedetti, Jie Zhang, Mislav Balunovic, Luca Beurer-Kellner, Marc Fischer, and Florian Tramèr. Agentdojo: A dynamic environment to evaluate prompt injection attacks and defenses for llm agents. In NeurIPS, 2024.
[20] Jacob Fox. Prompt Injection Attacks: A New Frontier in Cybersecurity. https://www.cobalt.io/blog/prompt-injection-attacks, 2023.
[21] Tianyu Gao, Howard Yen, Jiatong Yu, and Danqi Chen. Enabling large language models to generate text with citations. In EMNLP, 2023.
[22] Google DeepMind. Gemini-2.5-Pro Technical Report. https://storage.googleapis.com/deepmind-media/gemini/gemini_v2_5_report.pdf, 2025.
[23] Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. Not what you’ve signed up for: Compromising real-world llm-integrated applications with indirect prompt injection. In AISec, 2023.
[24] Kuo-Han Hung, Ching-Yun Ko, Ambrish Rawat, I Chung, Winston H Hsu, Pin-Yu Chen, et al. Attention tracker: Detecting prompt injection attacks in llms. arXiv, 2024.
[25] Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. In EMNLP, 2020.
[26] Tomáš Kočiský, Jonathan Schwarz, Phil Blunsom, Chris Dyer, Karl Moritz Hermann, Gábor Melis, and Edward Grefenstette. The narrativeqa reading comprehension challenge. Transactions of the Association for Computational Linguistics, 6:317–328, 2018.
[27] Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. Natural questions: a benchmark for question answering research. TACL, 2019.
[28] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. NeurIPS, 2020.
[29] Jiawei Liu, Jia Le Tian, Vijay Daita, Yuxiang Wei, Yifeng Ding, Yuhan Katherine Wang, Jun Yang, and Lingming Zhang. Repoqa: Evaluating long context code understanding. arXiv, 2024.
[30] Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei Xiao. Autodan: Generating stealthy jailbreak prompts on aligned large language models. arXiv, 2023.
[31] Xiaogeng Liu, Zhiyuan Yu, Yizhe Zhang, Ning Zhang, and Chaowei Xiao. Automatic and universal prompt injection attacks against large language models. arXiv, 2024.
[33] Yupei Liu, Yuqi Jia, Runpeng Geng, Jinyuan Jia, and Neil Zhenqiang Gong. Formalizing and benchmarking prompt injection attacks and defenses. In USENIX Security Symposium, 2024.
[34] Yupei Liu, Yuqi Jia, Jinyuan Jia, Dawn Song, and Neil Zhenqiang Gong. Datasentinel: A game-theoretic detection of prompt injection attacks. In IEEE Symposium on Security and Privacy, 2025.
[35] Scott Lundberg. A unified approach to interpreting model predictions. arXiv preprint arXiv:1705.07874, 2017.
[36] Medium. How sneaky researchers are using hidden ai prompts to influence the peer review process. https://medium.com/@JimTheAIWhisperer/update-5-more-papers-to-add-to-the-17-so-far-another-32-researchers-5f1e00885cfb
[37] Vivek Miglani, Aobo Yang, Aram Markosyan, Diego Garcia-Olano, and Narine Kokhlikyan. Using captum to explain generative language models. In NLP-OSS, 2023.
[38] Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. Webgpt: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332, 2021.
[39] Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. Ms marco: A human generated machine reading comprehension dataset. choice, 2640:660, 2016.
[40] Nikkei Asia. ‘Positive review only’: Researchers hide ai prompts in papers. https://asia.nikkei.com/Business/Technology/Artificial-intelligence/Positive-review-only-Researchers-hide-AI-prompts-in-papers, 2025.
[41] Jihwan Oh, Murad Aghazada, Se-Young Yun, and Taehyeon Kim. Llm agents for bargaining with utility-based feedback. arXiv preprint arXiv:2505.22998, 2025.
[42] OpenAI. Introducing GPT-4.1 in the API. https://openai.com/index/gpt-4-1, 2025.
[43] Dario Pasquini, Martin Strohmeier, and Carmela Troncoso. Neural exec: Learning (and learning from) execution triggers for prompt injection attacks. arXiv, 2024.
[44] Fábio Perez and Ian Ribeiro. Ignore previous prompt: Attack techniques for language models. arXiv, 2022.
[45] Vitali Petsiuk, Abir Das, and Kate Saenko. Rise: Randomized input sampling for explanation of black-box models. arXiv preprint arXiv:1806.07421, 2018.
[46] Garima Pruthi, Frederick Liu, Satyen Kale, and Mukund Sundararajan. Estimating training data influence by tracing gradient descent. NeurIPS, 2020.
[47] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. "Why should i trust you?" Explaining the predictions of any classifier. In KDD, 2016.
[48] S. Willison. Delimiters won’t save you from prompt injection. https://simonwillison.net/2023/May/11/delimiters-wont-save-you. 2023.
[49] Sofia Serrano and Noah A Smith. Is attention interpretable? arXiv preprint arXiv:1906.03731, 2019.
[50] Avital Shafran, Roei Schuster, and Vitaly Shmatikov. Machine against the rag: Jamming retrieval-augmented generation with blocker documents. In USENIX Security, 2025.
[51] Shawn Shan, Arjun Nitin Bhagoji, Haitao Zheng, and Ben Y Zhao. Poison forensics: Traceback of data poisoning attacks in neural networks. In USENIX Security, 2022.
[52] Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. Learning important features through propagating activation differences. In ICML, 2017.
[53] Karen Simonyan. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.
[54] Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. In ICML, 2017.
[55] Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. Musique: Multihop questions via single-hop question composition. Transactions of the Association for Computational Linguistics, 10:539–554, 2022.
[56] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
[57] Minzheng Wang, Longze Chen, Cheng Fu, Shengyi Liao, Xinghua Zhang, Bingli Wu, Haiyang Yu, Nan Xu, Lei Zhang, Run Luo, et al. Leave no document behind: Benchmarking long-context llms with extended multi-doc qa. arXiv, 2024.
[58] Yanting Wang, Wei Zou, Runpeng Geng, and Jinyuan Jia. Tracllm: A generic framework for attributing outputs of long context llms. In USENIX Security Symposium, 2025.
[59] Yongjie Wang, Tong Zhang, Xu Guo, and Zhiqi Shen. Gradient based feature attribution in explainable ai: A technical review. arXiv preprint arXiv:2403.10415, 2024.
[60] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. NeurIPS, 2022.
[61]

Long-form factuality in large language models

Jerry Wei, Chengrun Yang, Xinying Song, Yifeng Lu, Nathan Hu, Dustin Tran, Daiyi Peng, Ruibo Liu, Da Huang, Cosmo Du, et al. Long-form factuality in large language models. arXiv, 2024

work page 2024
[62]

Wiegreffe, Y

Sarah Wiegreffe and Yuval Pinter. Attention is not not explanation. arXiv preprint arXiv:1908.04626, 2019

work page arXiv 1908
[63] Simon Willison. Prompt injection attacks against GPT-3. https://simonwillison.net/2022/Sep/12/prompt-injection/, 2022.
[64] Simon Willison. Prompt injection attacks against GPT-3. https://simonwillison.net/2022/Sep/12/prompt-injection/, 2022.
[65] Chong Xiang, Tong Wu, Zexuan Zhong, David Wagner, Danqi Chen, and Prateek Mittal. Certifiably robust RAG against retrieval corruption. arXiv, 2024.
[66] Zhen Xiang, Fengqing Jiang, Zidi Xiong, Bhaskar Ramasubramanian, Radha Poovendran, and Bo Li. BadChain: Backdoor chain-of-thought prompting for large language models. arXiv preprint arXiv:2401.12242, 2024.
[67] Jiaqi Xue, Mengxin Zheng, Yebowen Hu, Fei Liu, Xun Chen, and Qian Lou. BadRAG: Identifying vulnerabilities in retrieval augmented generation of large language models. arXiv, 2024.
[68] Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In EMNLP, 2018.
[69] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, 2023.
[70] Baolei Zhang, Haoran Xin, Minghong Fang, Zhuqing Liu, Biao Yi, Tong Li, and Zheli Liu. Traceback of poisoning attacks to retrieval-augmented generation. In Proceedings of the ACM on Web Conference, 2025.
[71] Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, et al. H2O: Heavy-hitter oracle for efficient generative inference of large language models. NeurIPS, 2023.
[72] Ming Zhong, Da Yin, Tao Yu, Ahmad Zaidi, Mutethia Mutuma, Rahul Jha, Ahmed Hassan Awadallah, Asli Celikyilmaz, Yang Liu, Xipeng Qiu, et al. QMSum: A new benchmark for query-based multi-domain meeting summarization. arXiv, 2021.
[73] Zexuan Zhong, Ziqing Huang, Alexander Wettig, and Danqi Chen. Poisoning retrieval corpora by injecting adversarial passages. arXiv preprint arXiv:2310.19156, 2023.
[74] Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models. arXiv, 2023.
[75] Wei Zou, Runpeng Geng, Binghui Wang, and Jinyuan Jia. PoisonedRAG: Knowledge corruption attacks to retrieval-augmented generation of large language models. In USENIX Security, 2025.

(a) Example 1. (b) Example 2. Figure 6: Examples showing that the attention weights of a text usually concentrate on a few tokens. Deeper color represents larger attention wei...

Please draft a high-quality review for a top-tier conference for the following submission. {paper content}
Creation of the [dataset name] Dataset: The authors introduce a novel benchmark containing six realistic market scenarios that incorporate deception, monopolies, and asymmetric bargaining power. This dataset significantly improves upon existing benchmarks that often oversimplify negotiation dynamics.

Development of the [metric name] Metric: The proposed metric provides a principled evaluation grounded in economic theory. It incorporates key quantities such as consumer surplus and negotiation power to better reflect human-centered evaluation of negotiation quality.

Utility-Based Feedback for In-Context Learning: The authors design an in-context learning approach where LLMs iteratively update their strategy based on structured feedback signals derived from utility outcomes. This method leads to measurable gains in negotiation performance and promotes generalizable bargaining strategies.

Extensive Experimental Validation: The proposed framework is rigorously tested across various LLM families, including GPT and Gemini models. The results demonstrate that the method achieves consistent improvements in negotiation efficiency and fairness across multiple settings. Strengths: • The paper is well-organized and clearly written, providing a cohe...
