AttnTrace: Contextual Attribution of Prompt Injection and Knowledge Corruption
Pith reviewed 2026-05-19 00:29 UTC · model grok-4.3
The pith
AttnTrace attributes LLM responses back to specific context texts using refined attention weights more accurately and efficiently than prior methods.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AttnTrace is a context traceback method that uses the attention weights an LLM produces while generating a response. Two enhancement techniques are applied to these weights to better identify the contributing contextual texts, supported by theoretical insights into the design. Systematic experiments establish that this approach delivers higher accuracy and substantially lower computation cost than existing methods like TracLLM. The same attributions support improved detection of prompt injections under long contexts and successfully pinpoint injected instructions in a paper designed to alter LLM-generated reviews.
What carries the argument
AttnTrace, which processes LLM attention weights through two enhancement techniques to attribute response influence to specific segments of the input context.
If this is right
- Traceback for a single response-context pair becomes feasible in far less time than the hundreds of seconds required by prior tools.
- Prompt injection detection improves when attribution is performed first to narrow the search space in long contexts.
- Injected instructions can be located inside documents intended to manipulate downstream LLM tasks such as review generation.
- Interpretability and trustworthiness increase for responses in retrieval-augmented generation pipelines and autonomous agent systems.
Where Pith is reading between the lines
- The approach could be combined with other internal attribution signals to create hybrid tracing systems that cross-validate results.
- If the theoretical insights generalize, similar attention processing might help detect knowledge corruption inside an LLM's long-term memory stores.
- The efficiency gains open the possibility of running continuous attribution monitoring on deployed long-context applications without prohibitive overhead.
Load-bearing premise
Attention weights, after the two proposed techniques, reliably indicate the contextual texts that causally contribute to the LLM response.
What would settle it
A controlled test where altering one known context text changes the LLM output but AttnTrace assigns it low attribution scores would show the method does not correctly identify causal contributions.
Figures
read the original abstract
Long-context large language models (LLMs), such as Gemini-2.5-Pro and Claude-Sonnet-4, are increasingly used to empower advanced AI systems, including retrieval-augmented generation (RAG) pipelines and autonomous agents. In these systems, an LLM receives an instruction along with a context--often consisting of texts retrieved from a knowledge database or memory--and generates a response that is contextually grounded by following the instruction. Recent studies have designed solutions to trace back to a subset of texts in the context that contributes most to the response generated by the LLM. These solutions have numerous real-world applications, including performing post-attack forensic analysis and improving the interpretability and trustworthiness of LLM outputs. While significant efforts have been made, state-of-the-art solutions such as TracLLM often lead to a high computation cost, e.g., it takes TracLLM hundreds of seconds to perform traceback for a single response-context pair. In this work, we propose AttnTrace, a new context traceback method based on the attention weights produced by an LLM for a prompt. To effectively utilize attention weights, we introduce two techniques designed to enhance the effectiveness of AttnTrace, and we provide theoretical insights for our design choice. We also perform a systematic evaluation for AttnTrace. The results demonstrate that AttnTrace is more accurate and efficient than existing state-of-the-art context traceback methods. We also show that AttnTrace can improve state-of-the-art methods in detecting prompt injection under long contexts through the attribution-before-detection paradigm. As a real-world application, we demonstrate that AttnTrace can effectively pinpoint injected instructions in a paper designed to manipulate LLM-generated reviews. The code is at https://github.com/Wang-Yanting/AttnTrace.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes AttnTrace, a context traceback method for long-context LLMs that uses attention weights with two enhancement techniques and theoretical insights. It claims superior accuracy and efficiency over state-of-the-art methods such as TracLLM for identifying which context texts causally contribute to LLM responses, with applications to prompt injection detection and real-world forensic analysis (e.g., injected instructions in manipulated paper reviews).
Significance. If the results hold, AttnTrace could provide a practical, low-overhead alternative to expensive traceback methods for improving interpretability and security in RAG pipelines and autonomous agents. The efficiency advantage (avoiding hundreds of seconds per query) would be valuable for deployment, and the attribution-before-detection paradigm for prompt injection is a promising direction. However, significance is limited by the absence of detailed quantitative validation in the provided text.
major comments (2)
- [Abstract] Abstract: the central claim that 'AttnTrace is more accurate and efficient than existing state-of-the-art context traceback methods' is asserted without any quantitative metrics, error bars, dataset sizes, or specific comparison numbers (e.g., precision/recall or runtime values versus TracLLM). This prevents assessment of the empirical support for the main result.
- [Design and Theoretical Insights] Design description and theoretical insights: the mapping from modified attention weights to causal contribution is load-bearing for all accuracy claims, yet the manuscript supplies no explicit causal validation such as context ablation experiments, counterfactual replacements, or intervention tests to confirm that the two techniques isolate true causal texts rather than positional or correlational artifacts.
minor comments (1)
- [Abstract] The GitHub link is provided but no details on reproducibility (e.g., exact prompts, model versions, or evaluation scripts) are mentioned in the abstract or evaluation summary.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review. We address each major comment below and have revised the manuscript accordingly to strengthen the presentation of results and validation.
read point-by-point responses
-
Referee: [Abstract] Abstract: the central claim that 'AttnTrace is more accurate and efficient than existing state-of-the-art context traceback methods' is asserted without any quantitative metrics, error bars, dataset sizes, or specific comparison numbers (e.g., precision/recall or runtime values versus TracLLM). This prevents assessment of the empirical support for the main result.
Authors: We agree that the abstract would be improved by including key quantitative results. The full manuscript reports a systematic evaluation with precision, recall, and runtime comparisons against TracLLM across multiple datasets, including standard deviations. We have revised the abstract to highlight representative metrics and dataset details so that the empirical support for the central claims is immediately apparent. revision: yes
-
Referee: [Design and Theoretical Insights] Design description and theoretical insights: the mapping from modified attention weights to causal contribution is load-bearing for all accuracy claims, yet the manuscript supplies no explicit causal validation such as context ablation experiments, counterfactual replacements, or intervention tests to confirm that the two techniques isolate true causal texts rather than positional or correlational artifacts.
Authors: We appreciate this observation. The design is motivated by theoretical analysis of attention mechanisms and their relationship to contextual influence. To provide direct empirical confirmation of causality, we have added ablation and intervention experiments in the revised manuscript. These tests remove or replace the highest-attributed context segments and quantify the resulting change in model output, demonstrating that the attributions align with causal effects beyond positional or correlational factors. revision: yes
Circularity Check
No significant circularity in AttnTrace derivation or claims
full rationale
The paper proposes AttnTrace as an attention-weight-based traceback method augmented by two unspecified techniques plus theoretical insights, then reports empirical gains in accuracy and efficiency over TracLLM. No equations, derivations, or self-referential definitions appear that would reduce the performance claims to fitted parameters or to the inputs by construction. The central modeling choice (modified attention indicating causal context contribution) is presented as a design decision justified by theoretical insights rather than a load-bearing self-citation chain or ansatz smuggled from prior author work. Evaluation is performed against external baselines on prompt-injection and knowledge-corruption tasks, rendering the argument self-contained against standard transformer attention and independent benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Attention weights produced by the LLM can be post-processed to attribute response content to specific context texts.
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/AbsoluteFloorClosure.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Proposition 1 (Attention weight upper bound) ... αmax ≤ 1 / (1 + (m-1) exp[-||q|| √(2m λmax(ΣI)/d)])
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
-
FlashRT: Towards Computationally and Memory Efficient Red-Teaming for Prompt Injection and Knowledge Corruption
FlashRT delivers 2x-7x speedup and 2x-4x GPU memory reduction for prompt injection and knowledge corruption attacks on long-context LLMs versus nanoGCG.
Reference graph
Works this paper leans on
-
[1]
https://github.com/Significant-Gravitas/AutoGPT
AutoGPT: Build, Deploy, and Run AI Agents. https://github.com/Significant-Gravitas/AutoGPT . November 2024
work page 2024
-
[2]
https://ai.meta.com/blog/meta-llama-3-1/
Introducing Llama 3.1: Our most capable models to date. https://ai.meta.com/blog/meta-llama-3-1/ . November 2024
work page 2024
-
[3]
Anthropic. Claude-Sonnet-4 System Card. https://www-cdn.anthropic.com/6be99a52cb68eb70eb9572b4cafad13 df32ed995.pdf, 2025
work page 2025
-
[4]
Artifex Software Inc. and contributors. Pymupdf – python bindings for mupdf (version 1.26.3). https://pymupdf.read thedocs.io/. Released July 2, 2025; high-performance PDF/text extraction library
work page 2025
-
[5]
Reliable, adaptable, and attributable language models with retrieval
Akari Asai, Zexuan Zhong, Danqi Chen, Pang Wei Koh, Luke Zettlemoyer, Hannaneh Hajishirzi, and Wen-tau Yih. Reliable, adaptable, and attributable language models with retrieval. arXiv preprint arXiv:2403.03187, 2024
-
[6]
LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding
Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, et al. Longbench: A bilingual, multitask benchmark for long context understanding. arXiv preprint arXiv:2308.14508, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[7]
The use of the area under the roc curve in the evaluation of machine learning algorithms
Andrew P Bradley. The use of the area under the roc curve in the evaluation of machine learning algorithms. Pattern recognition, 30(7):1145–1159, 1997. 16
work page 1997
-
[8]
Hezekiah J Branch, Jonathan Rodriguez Cefalu, Jeremy McHugh, Leyla Hujer, Aditya Bahl, Daniel del Castillo Iglesias, Ron Heichman, and Ramesh Darwishi. Evaluating the susceptibility of pre-trained language models via handcrafted adversarial examples. arXiv preprint arXiv:2209.02128, 2022
-
[9]
Polynomial calculation of the shapley value based on sampling
Javier Castro, Daniel Gómez, and Juan Tejada. Polynomial calculation of the shapley value based on sampling. Computers & operations research, 36(5):1726–1730, 2009
work page 2009
-
[10]
Jopa: Explaining large language model’s generation via joint prompt attribution
Yurui Chang, Bochuan Cao, Yujia Wang, Jinghui Chen, and Lu Lin. Jopa: Explaining large language model’s generation via joint prompt attribution. In ACL, 2025
work page 2025
-
[11]
Phantom: General trigger attacks on retrieval augmented language generation
Harsh Chaudhari, Giorgio Severi, John Abascal, Matthew Jagielski, Christopher A Choquette-Choo, Milad Nasr, Cristina Nita-Rotaru, and Alina Oprea. Phantom: General trigger attacks on retrieval augmented language generation. arXiv, 2024
work page 2024
-
[12]
Agentpoison: Red-teaming llm agents via poisoning memory or knowledge bases
Zhaorun Chen, Zhen Xiang, Chaowei Xiao, Dawn Song, and Bo Li. Agentpoison: Red-teaming llm agents via poisoning memory or knowledge bases. arXiv, 2024
work page 2024
-
[13]
Trojanrag: Retrieval-augmented genera- tion can be backdoor driver in large language mod- els,
Pengzhou Cheng, Yidong Ding, Tianjie Ju, Zongru Wu, Wei Du, Ping Yi, Zhuosheng Zhang, and Gongshen Liu. Trojanrag: Retrieval-augmented generation can be backdoor driver in large language models. arXiv preprint arXiv:2405.13401, 2024
-
[14]
Learning to attribute with attention
Benjamin Cohen-Wang, Yung-Sung Chuang, and Aleksander Madry. Learning to attribute with attention. arXiv, 2025
work page 2025
-
[15]
Contextcite: Attributing model generation to context
Benjamin Cohen-Wang, Harshay Shah, Kristian Georgiev, and Aleksander Madry. Contextcite: Attributing model generation to context. In NeurIPS, 2024
work page 2024
-
[16]
Characterizations of an empirical influence function for detecting influential cases in regression
R Dennis Cook and Sanford Weisberg. Characterizations of an empirical influence function for detecting influential cases in regression. Technometrics, 22(4):495–508, 1980
work page 1980
-
[17]
Explaining by removing: A unified framework for model explanation
Ian Covert, Scott Lundberg, and Su-In Lee. Explaining by removing: A unified framework for model explanation. Journal of Machine Learning Research, 22(209):1–90, 2021
work page 2021
-
[18]
Flashattention-2: Faster attention with better parallelism and work partitioning
Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning. arXiv, 2023
work page 2023
-
[19]
Agentdojo: A dynamic environment to evaluate prompt injection attacks and defenses for llm agents
Edoardo Debenedetti, Jie Zhang, Mislav Balunovic, Luca Beurer-Kellner, Marc Fischer, and Florian Tramèr. Agentdojo: A dynamic environment to evaluate prompt injection attacks and defenses for llm agents. In NeurIPS, 2024
work page 2024
-
[20]
Prompt Injection Attacks: A New Frontier in Cybersecurity
Jacob Fox. Prompt Injection Attacks: A New Frontier in Cybersecurity. https://www.cobalt.io/blog/prompt-injec tion-attacks, 2023
work page 2023
-
[21]
Enabling large language models to generate text with citations
Tianyu Gao, Howard Yen, Jiatong Yu, and Danqi Chen. Enabling large language models to generate text with citations. In EMNLP, 2023
work page 2023
-
[22]
Gemini-2.5-Pro Technical Report
Google DeepMind. Gemini-2.5-Pro Technical Report. https://storage.googleapis.com/deepmind-media/gemin i/gemini_v2_5_report.pdf, 2025
work page 2025
-
[23]
Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. Not what you’ve signed up for: Compromising real-world llm-integrated applications with indirect prompt injection. In AISec, 2023
work page 2023
-
[24]
Attention tracker: Detecting prompt injection attacks in llms
Kuo-Han Hung, Ching-Yun Ko, Ambrish Rawat, I Chung, Winston H Hsu, Pin-Yu Chen, et al. Attention tracker: Detecting prompt injection attacks in llms. arXiv, 2024
work page 2024
-
[25]
Dense passage retrieval for open-domain question answering
Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. In EMNLP, 2020
work page 2020
-
[26]
The narrativeqa reading comprehension challenge
Tomáš Koˇcisk`y, Jonathan Schwarz, Phil Blunsom, Chris Dyer, Karl Moritz Hermann, Gábor Melis, and Edward Grefenstette. The narrativeqa reading comprehension challenge. Transactions of the Association for Computational Linguistics, 6:317– 328, 2018
work page 2018
-
[27]
Natural questions: a benchmark for question answering research
Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. Natural questions: a benchmark for question answering research. TACL, 2019. 17
work page 2019
-
[28]
Retrieval-augmented generation for knowledge-intensive nlp tasks
Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. NeurIPS, 2020
work page 2020
-
[29]
Repoqa: Evaluating long context code understanding
Jiawei Liu, Jia Le Tian, Vijay Daita, Yuxiang Wei, Yifeng Ding, Yuhan Katherine Wang, Jun Yang, and Lingming Zhang. Repoqa: Evaluating long context code understanding. arXiv, 2024
work page 2024
-
[30]
Autodan: Generating stealthy jailbreak prompts on aligned large language models
Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei Xiao. Autodan: Generating stealthy jailbreak prompts on aligned large language models. arXiv, 2023
work page 2023
-
[31]
Automatic and universal prompt injection attacks against large language models
Xiaogeng Liu, Zhiyuan Yu, Yizhe Zhang, Ning Zhang, and Chaowei Xiao. Automatic and universal prompt injection attacks against large language models. arXiv, 2024
work page 2024
-
[33]
Formalizing and benchmarking prompt injection attacks and defenses
Yupei Liu, Yuqi Jia, Runpeng Geng, Jinyuan Jia, and Neil Zhenqiang Gong. Formalizing and benchmarking prompt injection attacks and defenses. In USENIX Security Symposium, 2024
work page 2024
-
[34]
Datasentinel: A game-theoretic detection of prompt injection attacks
Yupei Liu, Yuqi Jia, Jinyuan Jia, Dawn Song, and Neil Zhenqiang Gong. Datasentinel: A game-theoretic detection of prompt injection attacks. In IEEE Symposium on Security and Privacy, 2025
work page 2025
-
[35]
A Unified Approach to Interpreting Model Predictions
Scott Lundberg. A unified approach to interpreting model predictions. arXiv preprint arXiv:1705.07874, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[36]
How sneaky researchers are using hidden ai prompts to influence the peer review process
Medium. How sneaky researchers are using hidden ai prompts to influence the peer review process. https://medium.c om/@JimTheAIWhisperer/update-5-more-papers-to-add-to-the-17-so-far-another-32-researchers-5f1 e00885cfb
-
[37]
Using captum to explain generative language models
Vivek Miglani, Aobo Yang, Aram Markosyan, Diego Garcia-Olano, and Narine Kokhlikyan. Using captum to explain generative language models. In NLP-OSS, 2023
work page 2023
-
[38]
WebGPT: Browser-assisted question-answering with human feedback
Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. Webgpt: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[39]
Ms marco: A human generated machine reading comprehension dataset
Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. Ms marco: A human generated machine reading comprehension dataset. choice, 2640:660, 2016
work page 2016
-
[40]
Positive review only’: Researchers hide ai prompts in papers
Nikkei Asia. Positive review only’: Researchers hide ai prompts in papers. https://asia.nikkei.com/Business/T echnology/Artificial-intelligence/Positive-review-only-Researchers-hide-AI-prompts-in-papers , 2025
work page 2025
-
[41]
Llm agents for bargaining with utility-based feedback
Jihwan Oh, Murad Aghazada, Se-Young Yun, and Taehyeon Kim. Llm agents for bargaining with utility-based feedback. arXiv preprint arXiv:2505.22998, 2025
-
[42]
Introducing GPT-4.1 in the API
OpenAI. Introducing GPT-4.1 in the API. https://openai.com/index/gpt-4-1, 2025
work page 2025
-
[43]
Neural exec: Learning (and learning from) execution triggers for prompt injection attacks
Dario Pasquini, Martin Strohmeier, and Carmela Troncoso. Neural exec: Learning (and learning from) execution triggers for prompt injection attacks. arXiv, 2024
work page 2024
-
[44]
Ignore previous prompt: Attack techniques for language models
Fábio Perez and Ian Ribeiro. Ignore previous prompt: Attack techniques for language models. arXiv, 2022
work page 2022
-
[45]
RISE: Randomized Input Sampling for Explanation of Black-box Models
Vitali Petsiuk, Abir Das, and Kate Saenko. Rise: Randomized input sampling for explanation of black-box models. arXiv preprint arXiv:1806.07421, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[46]
Estimating training data influence by tracing gradient descent
Garima Pruthi, Frederick Liu, Satyen Kale, and Mukund Sundararajan. Estimating training data influence by tracing gradient descent. NeurIPS, 2020
work page 2020
-
[47]
Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. " why should i trust you?" explaining the predictions of any classifier. In KDD, 2016. 18
work page 2016
- [48]
-
[49]
Sofia Serrano and Noah A Smith. Is attention interpretable? arXiv preprint arXiv:1906.03731, 2019
work page internal anchor Pith review Pith/arXiv arXiv 1906
-
[50]
Machine against the rag: Jamming retrieval-augmented generation with blocker documents
Avital Shafran, Roei Schuster, and Vitaly Shmatikov. Machine against the rag: Jamming retrieval-augmented generation with blocker documents. In USENIX Security, 2025
work page 2025
-
[51]
Poison forensics: Traceback of data poisoning attacks in neural networks
Shawn Shan, Arjun Nitin Bhagoji, Haitao Zheng, and Ben Y Zhao. Poison forensics: Traceback of data poisoning attacks in neural networks. In USENIX Security, 2022
work page 2022
-
[52]
Learning important features through propagating activation differences
Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. Learning important features through propagating activation differences. In ICML, 2017
work page 2017
-
[53]
Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps
Karen Simonyan. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013
work page internal anchor Pith review Pith/arXiv arXiv 2013
-
[54]
Axiomatic attribution for deep networks
Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. In ICML, 2017
work page 2017
-
[55]
Musique: Multihop questions via single-hop question composition
Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. Musique: Multihop questions via single-hop question composition. Transactions of the Association for Computational Linguistics, 10:539–554, 2022
work page 2022
-
[56]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017
work page 2017
-
[57]
Leave no document behind: Benchmarking long-context llms with extended multi-doc qa, 2024b
Minzheng Wang, Longze Chen, Cheng Fu, Shengyi Liao, Xinghua Zhang, Bingli Wu, Haiyang Yu, Nan Xu, Lei Zhang, Run Luo, et al. Leave no document behind: Benchmarking long-context llms with extended multi-doc qa, 2024b. arXiv, 2024
work page 2024
-
[58]
Tracllm: A generic framework for attributing outputs of long context llms
Yanting Wang, Wei Zou, Runpeng Geng, and Jinyuan Jia. Tracllm: A generic framework for attributing outputs of long context llms. In USENIX Security Symposium, 2025
work page 2025
-
[59]
Gradient based feature attribution in explainable ai: A technical review,
Yongjie Wang, Tong Zhang, Xu Guo, and Zhiqi Shen. Gradient based feature attribution in explainable ai: A technical review. arXiv preprint arXiv:2403.10415, 2024
-
[60]
Chain-of- thought prompting elicits reasoning in large language models
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of- thought prompting elicits reasoning in large language models. NeurIPS, 2022
work page 2022
-
[61]
Long-form factuality in large language models
Jerry Wei, Chengrun Yang, Xinying Song, Yifeng Lu, Nathan Hu, Dustin Tran, Daiyi Peng, Ruibo Liu, Da Huang, Cosmo Du, et al. Long-form factuality in large language models. arXiv, 2024
work page 2024
-
[62]
Sarah Wiegreffe and Yuval Pinter. Attention is not not explanation. arXiv preprint arXiv:1908.04626, 2019
-
[63]
Prompt injection attacks against gpt-3
Simon Willison. Prompt injection attacks against gpt-3. https://simonwillison.net/2022/Sep/12/prompt-injec tion/. 2022
work page 2022
-
[64]
Prompt injection attacks against GPT-3
Simon Willison. Prompt injection attacks against GPT-3. https://simonwillison.net/2022/Sep/12/prompt-injec tion/, 2022
work page 2022
-
[65]
Certifiably robust rag against retrieval corruption
Chong Xiang, Tong Wu, Zexuan Zhong, David Wagner, Danqi Chen, and Prateek Mittal. Certifiably robust rag against retrieval corruption. arXiv, 2024
work page 2024
-
[66]
Bad- chain: Backdoor chain-of-thought prompting for large language models
Zhen Xiang, Fengqing Jiang, Zidi Xiong, Bhaskar Ramasubramanian, Radha Poovendran, and Bo Li. Badchain: Backdoor chain-of-thought prompting for large language models. arXiv preprint arXiv:2401.12242, 2024
-
[67]
Badrag: Identifying vulnerabilities in retrieval augmented generation of large language models
Jiaqi Xue, Mengxin Zheng, Yebowen Hu, Fei Liu, Xun Chen, and Qian Lou. Badrag: Identifying vulnerabilities in retrieval augmented generation of large language models. arXiv, 2024
work page 2024
-
[68]
Hotpotqa: A dataset for diverse, explainable multi-hop question answering
Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. In EMNLP, 2018. 19
work page 2018
-
[69]
React: Synergizing reasoning and acting in language models
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, 2023
work page 2023
-
[70]
Traceback of poisoning attacks to retrieval-augmented generation
Baolei Zhang, Haoran Xin, Minghong Fang, Zhuqing Liu, Biao Yi, Tong Li, and Zheli Liu. Traceback of poisoning attacks to retrieval-augmented generation. In Proceedings of the ACM on Web Conference, 2025
work page 2025
-
[71]
H2o: Heavy-hitter oracle for efficient generative inference of large language models
Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, et al. H2o: Heavy-hitter oracle for efficient generative inference of large language models. NeurIPS, 2023
work page 2023
-
[72]
Qmsum: A new benchmark for query-based multi-domain meeting summarization
Ming Zhong, Da Yin, Tao Yu, Ahmad Zaidi, Mutethia Mutuma, Rahul Jha, Ahmed Hassan Awadallah, Asli Celikyilmaz, Yang Liu, Xipeng Qiu, et al. Qmsum: A new benchmark for query-based multi-domain meeting summarization. arXiv, 2021
work page 2021
-
[73]
Poisoning retrieval corpora by injecting adversarial passages
Zexuan Zhong, Ziqing Huang, Alexander Wettig, and Danqi Chen. Poisoning retrieval corpora by injecting adversarial passages. arXiv preprint arXiv:2310.19156, 2023
-
[74]
Universal and transferable adversarial attacks on aligned language models
Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models. arXiv, 2023
work page 2023
-
[75]
Wei Zou, Runpeng Geng, Binghui Wang, and Jinyuan Jia. Poisonedrag: Knowledge corruption attacks to retrieval-augmented generation of large language models. In USENIX Security, 2025. (a) Example 1. (b) Example 2. Figure 6: Examples showing that the attention weights of a text usually concentrate on a few tokens. Deeper color represents larger attention wei...
work page 2025
-
[76]
Creation of the [dataset name] Dataset: The authors introduce a novel benchmark containing six realistic market scenarios that incorporate deception, monopolies, and asymmetric bargaining power. This dataset significantly improves upon existing benchmarks that often oversimplify negotiation dynamics
-
[77]
Development of the [metric name] Metric: The proposed metric provides a principled evaluation grounded in economic theory. It incorporates key quantities such as consumer surplus and negotiation power to better reflect human-centered evaluation of negotiation quality
-
[78]
Utility-Based Feedback for In-Context Learning: The authors design an in-context learning approach where LLMs iteratively update their strategy based on structured feedback signals derived from utility outcomes. This method leads to measurable gains in negotiation performance and promotes generalizable bargaining strategies
-
[79]
Extensive Experimental Validation: The proposed framework is rigorously tested across various LLM families, including GPT and Gemini models. The results demonstrate that the method achieves consistent improvements in negotiation efficiency and fairness across multiple settings. Strengths: • The paper is well-organized and clearly written, providing a cohe...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.