Don't Blindly Trust It: How Unreliable Feedback Breaks Tool-Using LLM Agents

Chubin Zhang; Ivor Tsang; Jingxuan Wu; Pengfei Zhou; Wangbo Zhao; Xingrui Yu; Yaxin Zhou; Zhenglin Wan

arxiv: 2606.21409 · v1 · pith:6C3IVU7Inew · submitted 2026-06-19 · 💻 cs.AI

Chubin Zhang, Zhenglin Wan, Xingrui Yu, Pengfei Zhou, Wangbo Zhao, Jingxuan Wu, Yaxin Zhou, Ivor Tsang

Pith reviewed 2026-06-26 14:20 UTC · model grok-4.3

classification 💻 cs.AI

keywords: tool-augmented agents · unreliable feedback · value inversion · LLM agents · question answering · fact verification · no-feedback baseline

The pith

Persistent misleading feedback makes tool-using LLM agents perform worse than receiving no feedback at all.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Tool-augmented LLM agents are normally evaluated only under reliable external feedback, leaving open whether they would do better with no task evidence when feedback is unreliable. The paper runs a matched-loop experiment that holds the agent loop, prompt, action space, and decoding fixed while varying only the returned observation as faithful, misleading, or absent. Across question answering and fact verification tasks, persistently misleading feedback creates a value inversion in which agents that improve with clean tools fall below the no-feedback baseline. On HotpotQA, Qwen2.5-7B scores 44.8 F1 with clean retrieval and 22.3 F1 with no feedback, but only 4.7 F1 under shuffled retrieval. The inversion survives stronger clean retrieval and locally plausible distractors, yet later clean evidence can sometimes repair trajectories.

Core claim

When feedback is persistently misleading, agents that benefit from clean tools can perform worse than the matched no-feedback fallback. This inversion appears across question answering and fact verification, persists under stronger clean retrieval and plausible distractors, and weakens only when later clean evidence can repair the trajectory. Early trajectory signals predict many failures, but simple repairs such as rejecting bad evidence help only when the exposed no-feedback fallback is itself reliable.
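The inversion can be written as a pair of inequalities. The notation here is introduced only for illustration and does not come from the paper: $S_{\text{clean}}$, $S_{\text{none}}$, and $S_{\text{mis}}$ stand for the score of the same agent under faithful, absent, and persistently misleading observations. The second line plugs in the HotpotQA figures reported for Qwen2.5-7B.

```latex
\[
\underbrace{S_{\text{clean}} > S_{\text{none}}}_{\text{the tool helps when it is reliable}}
\quad\text{and}\quad
\underbrace{S_{\text{mis}} < S_{\text{none}}}_{\text{misleading feedback is worse than none}}
\]
\[
44.8 > 22.3 \quad\text{and}\quad 4.7 < 22.3,
\qquad
S_{\text{clean}} - S_{\text{none}} = 22.5,
\quad
S_{\text{mis}} - S_{\text{none}} = -17.6 .
\]
```

Read this way, the clean-tool gain of 22.5 F1 and the misleading-feedback loss of 17.6 F1 are measured against the same no-feedback reference point.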

What carries the argument

The matched-loop comparison, which fixes the agent loop, prompt, action space, and decoding while varying only the returned observation between faithful, misleading, or absent.
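A minimal sketch of what a matched-loop harness might look like follows. Everything in it is an assumption made for illustration: the `Policy` and `Tool` types, the `answer:` action convention, and the toy policy are hypothetical and are not the paper's code. The point it shows is structural: one loop, one policy, and only the function that returns observations changes between conditions.

```python
from typing import Callable, Dict, List

# Hypothetical interfaces; the paper's actual agent API is not described here.
Policy = Callable[[str, List[str]], str]  # (question, observations so far) -> action
Tool = Callable[[str], str]               # action -> observation


def run_episode(policy: Policy, question: str, tool: Tool, max_steps: int = 3) -> str:
    """Run one fixed agent loop; only `tool` differs between conditions."""
    observations: List[str] = []
    for _ in range(max_steps):
        action = policy(question, observations)
        if action.startswith("answer:"):
            return action[len("answer:"):].strip()
        observations.append(tool(action))
    return ""  # step budget exhausted without an answer


def matched_conditions(faithful: Tool, misleading: Tool) -> Dict[str, Tool]:
    """Three observation sources that share the same loop, prompt, and policy."""
    return {
        "faithful": faithful,
        "misleading": misleading,
        "absent": lambda action: "",  # the call succeeds but returns no task evidence
    }


def toy_policy(question: str, observations: List[str]) -> str:
    """Toy stand-in: search once, then answer with whatever came back."""
    if not observations:
        return "search: " + question
    evidence = observations[-1]
    return "answer: " + (evidence if evidence else "unknown")


conditions = matched_conditions(
    faithful=lambda action: "Paris",
    misleading=lambda action: "Lyon",
)
results = {
    name: run_episode(toy_policy, "What is the capital of France?", tool)
    for name, tool in conditions.items()
}
print(results)
```

The toy policy trusts its last observation completely, so the misleading condition yields a confident wrong answer while the absent condition falls back to a default. A real agent would sit somewhere between these extremes.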

If this is right

Clean-tool gains can overstate tool value when unreliable feedback is possible.
Matched no-feedback fallback controls are required to evaluate tool-augmented agents.
Early trajectory signals predict many failures caused by bad feedback.
Rejecting bad evidence helps only when the no-feedback fallback is reliable.
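The last point can be illustrated with back-of-the-envelope arithmetic. The model below is an assumption introduced here, not an analysis from the paper: it supposes a detector that rejects a fraction `recall` of misleading episodes, that rejected episodes score like the no-feedback fallback, and that the rest score like the misleading condition. The function name and the linear mixing are hypothetical simplifications.

```python
def repaired_score(recall: float, s_none: float, s_mis: float) -> float:
    """Expected score when a fraction `recall` of misleading episodes is rejected
    and answered from the no-feedback fallback (a simplifying assumption)."""
    if not 0.0 <= recall <= 1.0:
        raise ValueError("recall must be in [0, 1]")
    return recall * s_none + (1.0 - recall) * s_mis


# HotpotQA F1 for Qwen2.5-7B as reported in this summary.
S_NONE, S_MIS = 22.3, 4.7

for recall in (0.0, 0.5, 1.0):
    print(recall, round(repaired_score(recall, S_NONE, S_MIS), 1))
```

Under these assumptions even a perfect detector recovers only the fallback score, so the value of rejecting bad evidence is capped by how good the no-feedback answer already is.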

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Agent designs could add early detection of unreliable observations to avoid locking into bad trajectories.
Existing tool-use benchmarks should add unreliable-feedback conditions to avoid overstating agent capabilities.
The inversion finding may apply to other agent architectures or domains beyond the tested QA and verification tasks.
Re-evaluating prior tool-augmented agent results that lacked no-feedback controls could change assessments of their practical value.

Load-bearing premise

The matched-loop comparison fully isolates the causal effect of feedback reliability without confounds from prompt sensitivity or decoding variance.

What would settle it

A replication of the matched-loop experiment on the same tasks and models in which misleading feedback no longer produces performance below the no-feedback baseline.

Original abstract

Tool-augmented agents are typically evaluated by their gains under reliable external feedback. Yet these gains leave open a key counterfactual: when feedback is unreliable, would the agent be better off receiving no task evidence? We study this question with a controlled matched-loop comparison that fixes the agent loop, prompt, action space, and decoding, while varying only the returned observation: faithful, misleading, or absent. Across question answering and fact verification, persistent misleading feedback produces a value inversion: agents that benefit from clean tools can perform worse than the matched no-feedback fallback. On HotpotQA, Qwen2.5-7B reaches 44.8 F1 with clean retrieval and 22.3 F1 with no feedback, but drops to 4.7 F1 under shuffled retrieval. The inversion persists under stronger clean retrieval and locally plausible distractors, but weakens when later clean evidence can repair the trajectory. Early trajectory signals predict many failures, yet simple repairs remain fallback-limited: rejecting bad evidence helps only when the exposed fallback is reliable. These results show that clean-tool gains can overstate tool value, and that matched no-feedback fallback controls are necessary for evaluating tool-augmented agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows that misleading tool feedback can hurt agent performance more than no feedback, via a controlled value inversion.

The main takeaway is that this work demonstrates a value inversion: tool-augmented agents can do worse with persistently misleading feedback than with none at all.

The controlled matched-loop design is the real contribution. It keeps the agent loop, prompt, action space, and decoding fixed while varying only the returned observation across faithful, misleading, or absent cases. On HotpotQA the numbers are direct: Qwen2.5-7B scores 44.8 F1 with clean retrieval, 22.3 with no feedback, and 4.7 with shuffled retrieval. The pattern holds under stronger clean retrieval and locally plausible distractors, though it softens when later clean evidence can repair the trajectory. Early signals predict failures, and simple rejection of bad evidence only helps if the fallback is reliable.

This is useful because most prior agent evaluations report gains under reliable feedback without the no-feedback counterfactual. The setup isolates the reliability effect cleanly.

Soft spots are minor. The abstract gives concrete results, and the stress-test confirms the isolation has no obvious internal confounds. Full details on statistical significance and exact shuffled construction would help, but the central empirical claim stands on the reported design.

This paper is for researchers who build or benchmark tool-using LLM agents. Anyone setting up evaluations will get value from the caution about overstated tool gains.

It deserves peer review. The comparison is falsifiable and directly relevant to how the subfield measures progress.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that tool-augmented LLM agents exhibit a value inversion under persistent unreliable feedback: agents that improve with clean tool observations can perform worse than a matched no-feedback baseline. This is demonstrated via controlled experiments that fix the agent loop, prompt, action space, and decoding while varying only the returned observation (faithful, misleading via shuffled retrieval, or absent) on HotpotQA and fact-verification tasks. Key result: Qwen2.5-7B reaches 44.8 F1 with clean retrieval, 22.3 F1 with no feedback, and 4.7 F1 with shuffled retrieval. The inversion holds under stronger retrieval and plausible distractors but weakens with repairable trajectories; early signals predict failures, yet simple repairs are fallback-limited.

Significance. If the results hold, the work is significant for agent evaluation practices because it shows that clean-tool gains can overstate tool utility and that no-feedback controls are required to avoid misleading assessments. The matched-loop design isolates the causal role of feedback reliability without confounding prompt or decoding changes, providing a reproducible template for future studies. The empirical demonstration of inversion on concrete tasks adds falsifiable evidence to discussions of tool reliability in LLM agents.

major comments (2)

[Experiments] Experiments section (HotpotQA results): the central value-inversion claim rests on the F1 scores (44.8 clean vs. 22.3 no-feedback vs. 4.7 shuffled), yet the manuscript does not report the number of runs, variance, or statistical significance tests; without these the magnitude and reliability of the inversion cannot be assessed.
[Experimental setup] Experimental setup: the construction of shuffled retrieval (how distractors are sampled and whether they remain locally plausible) is load-bearing for the claim that misleading feedback is realistic; the current description leaves open whether the 4.7 F1 drop is an artifact of implausible noise rather than representative unreliability.
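To make the second concern concrete, here is one possible way to build a shuffled-retrieval condition. It is a sketch under stated assumptions, not the paper's procedure, which this summary does not specify: the function name `shuffle_retrieval` and the one-passage-per-question layout are hypothetical. It rotates passages across questions by a random nonzero offset, so every question receives the passage retrieved for a different question.

```python
import random
from typing import Dict, List


def shuffle_retrieval(passages: Dict[str, str], seed: int = 0) -> Dict[str, str]:
    """Reassign each question the passage that was retrieved for another question."""
    qids: List[str] = sorted(passages)
    if len(qids) < 2:
        raise ValueError("need at least two questions to shuffle")
    rng = random.Random(seed)
    # A rotation by a nonzero offset guarantees no question keeps its own passage.
    offset = rng.randrange(1, len(qids))
    return {
        qid: passages[qids[(i + offset) % len(qids)]]
        for i, qid in enumerate(qids)
    }


clean = {
    "q1": "Passage about the Eiffel Tower.",
    "q2": "Passage about the Nile.",
    "q3": "Passage about Mount Fuji.",
}
shuffled = shuffle_retrieval(clean, seed=7)
```

A rotation like this makes no attempt at local plausibility: the swapped passage may be about an unrelated topic. That is exactly the gap the comment points at, and a plausibility-preserving variant would need an additional similarity constraint when choosing the replacement.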

minor comments (2)

[Introduction] The introduction should explicitly define 'value inversion' with a short formal statement before the empirical results.
[Figures] Figure captions for trajectory plots should include the exact agent model and task split used.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment, recognition of the matched-loop design's value, and recommendation for minor revision. The comments identify areas where additional rigor and detail will strengthen the manuscript. We respond to each major comment below.

Referee: [Experiments] Experiments section (HotpotQA results): the central value-inversion claim rests on the F1 scores (44.8 clean vs. 22.3 no-feedback vs. 4.7 shuffled), yet the manuscript does not report the number of runs, variance, or statistical significance tests; without these the magnitude and reliability of the inversion cannot be assessed.

Authors: We agree that reporting the number of runs, variance, and statistical significance is necessary to substantiate the reliability of the reported F1 scores. The current manuscript presents point estimates from single executions. In the revised version we will add results aggregated over multiple independent runs with different random seeds, include standard deviations, and report statistical significance tests (e.g., paired t-tests or Wilcoxon tests) comparing the clean, no-feedback, and shuffled conditions. These additions will allow readers to evaluate the stability of the value inversion. revision: yes

Referee: [Experimental setup] Experimental setup: the construction of shuffled retrieval (how distractors are sampled and whether they remain locally plausible) is load-bearing for the claim that misleading feedback is realistic; the current description leaves open whether the 4.7 F1 drop is an artifact of implausible noise rather than representative unreliability.

Authors: We acknowledge that a more explicit description of distractor sampling is required. The manuscript already states that the inversion persists under "locally plausible distractors," but the precise sampling procedure is not fully detailed. We will expand the experimental setup section to specify how distractors are drawn (random selection from the same retrieval corpus, conditioned on topical overlap with the query to preserve local plausibility) and will include an example of a shuffled observation. This will clarify that the misleading feedback reflects realistic retrieval errors rather than arbitrary or implausible noise. revision: yes
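The paired comparison promised in the first response could take several forms. The sketch below uses an exact paired sign-flip permutation test, which is swapped in here as a standard-library alternative to the paired t-test or Wilcoxon test named in the response. The function name is hypothetical, and the per-seed values are toy numbers invented purely to exercise the code; they are not results from the paper.

```python
from itertools import product
from typing import Sequence


def paired_sign_flip_pvalue(a: Sequence[float], b: Sequence[float]) -> float:
    """Exact two-sided p-value for 'mean paired difference is zero'.

    Enumerates all 2**n sign assignments of the paired differences, so it is
    only practical for a small number of seeds.
    """
    if len(a) != len(b) or not a:
        raise ValueError("need two equal-length, non-empty samples")
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        if abs(sum(s * d for s, d in zip(signs, diffs))) >= observed - 1e-12:
            extreme += 1
    return extreme / total


# Toy per-seed scores for two conditions; invented for illustration only.
cond_a = [3.0, 2.5, 4.0, 3.5, 3.0]
cond_b = [1.0, 1.5, 2.0, 0.5, 1.0]
p_value = paired_sign_flip_pvalue(cond_a, cond_b)
print(p_value)
```

With five seeds the smallest achievable two-sided p-value is 2/32, which is one reason the number of runs matters as much as the choice of test.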

Circularity Check

0 steps flagged

No significant circularity

The paper is an empirical study that reports direct experimental measurements (F1 scores on HotpotQA and other tasks) obtained via a matched-loop comparison that holds agent loop, prompt, action space, and decoding fixed while varying only the observation. No equations, fitted parameters, or derivations appear in the manuscript. The central claim of value inversion follows immediately from the controlled measurements without any reduction to self-defined quantities, self-citation chains, or ansatzes. The design is self-contained and externally falsifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The study is purely empirical and introduces no new free parameters, axioms beyond standard ML evaluation practices, or invented entities.

axioms (1)

Domain assumption: F1 score is a valid metric for measuring performance on HotpotQA and fact verification.
Standard practice in the QA literature; invoked implicitly when reporting F1 numbers.
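For readers unfamiliar with the metric, the sketch below shows the token-overlap F1 commonly used for extractive QA answers. It is an assumption that the paper's HotpotQA scores follow this convention, and the exact normalization (lowercasing, stripping punctuation and English articles) is the widely used SQuAD-style recipe rather than something confirmed by this summary. F1 for fact verification is typically computed over labels instead and is not covered here.

```python
import re
import string
from collections import Counter
from typing import List

_PUNCT = set(string.punctuation)


def normalize(text: str) -> List[str]:
    """Lowercase, drop punctuation and English articles, split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCT)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall between two answer strings."""
    pred, ref = normalize(prediction), normalize(gold)
    if not pred or not ref:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


print(token_f1("The Eiffel Tower", "Eiffel Tower"))
print(token_f1("Paris, France", "Paris"))
print(token_f1("Lyon", "Paris"))
```

Scores on a 0 to 100 scale, such as the 44.8 quoted above, would correspond to the dataset average of this per-example value multiplied by 100.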



[1] [1]

2022 , url =

Karpas, Ehud and Abend, Omri and Belinkov, Yonatan and Lenz, Barak and Lieber, Opher and Ratner, Nir and Shoham, Yoav and Bata, Hofit and Levine, Yoav and Leyton-Brown, Kevin and Muhlgay, Dor and Rozen, Noam and Schwartz, Erez and Shachaf, Gal and Shalev-Shwartz, Shai and Shashua, Amnon and Tenenholtz, Moshe , journal =. 2022 , url =

2022

[2] [2]

2021 , url =

Nakano, Reiichiro and Hilton, Jacob and Balaji, Suchir and Wu, Jeff and Ouyang, Long and Kim, Christina and Hesse, Christopher and Jain, Shantanu and Kosaraju, Vineet and Saunders, William and Jiang, Xu and Cobbe, Karl and Eloundou, Tyna and Krueger, Gretchen and Button, Kevin and Knight, Matthew and Chess, Benjamin and Schulman, John , journal =. 2021 , url =

2021

[3] [3]

and Cao, Yuan , booktitle =

Yao, Shunyu and Zhao, Jeffrey and Yu, Dian and Du, Nan and Shafran, Izhak and Narasimhan, Karthik R. and Cao, Yuan , booktitle =. 2023 , url =

2023

[4] [4]

Advances in Neural Information Processing Systems , volume =

Schick, Timo and Dwivedi-Yu, Jane and Dess. Advances in Neural Information Processing Systems , volume =. 2023 , url =

2023

[5] [6]

First Conference on Language Modeling , year =

How Easily do Irrelevant Inputs Skew the Responses of Large Language Models? , author =. First Conference on Language Modeling , year =

[6] [8]

2026 , url =

Wang, Ruipeng and Chen, Yuxin and Wang, Yukai and Wu, Chang and Fang, Junfeng and Cai, Xiaodong and Gu, Qi and Su, Hui and Zhang, An and Wang, Xiang and Cai, Xunliang and Chua, Tat-Seng , journal =. 2026 , url =

2026

[7] [12]

The Twelfth International Conference on Learning Representations , year =

Adaptive Chameleon or Stubborn Sloth: Revealing the Behavior of Large Language Models in Knowledge Conflicts , author =. The Twelfth International Conference on Learning Representations , year =

[8] [13]

Proceedings of the AAAI Conference on Artificial Intelligence , volume =

Benchmarking Large Language Models in Retrieval-Augmented Generation , author =. Proceedings of the AAAI Conference on Artificial Intelligence , volume =. 2024 , doi =

2024

[9] [18]

The Twelfth International Conference on Learning Representations , year =

Large Language Models Cannot Self-Correct Reasoning Yet , author =. The Twelfth International Conference on Learning Representations , year =

[10] [22]

The Instruction Hierarchy: Training

Wallace, Eric and Xiao, Kai and Leike, Reimar and Weng, Lilian and Heidecke, Johannes and Beutel, Alex , journal =. The Instruction Hierarchy: Training. 2024 , url =

2024

[11] [23]

Retrieval-Augmented Generation for Knowledge-Intensive

Lewis, Patrick and Perez, Ethan and Piktus, Aleksandra and Petroni, Fabio and Karpukhin, Vladimir and Goyal, Naman and K. Retrieval-Augmented Generation for Knowledge-Intensive. Advances in Neural Information Processing Systems , volume =. 2020 , url =

2020

[12] [24]

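REALM: Retrieval augmented language model pre-training

Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. REALM: Retrieval augmented language model pre-training. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 3929--3938. PMLR, 2020. URL https://proceedings.mlr.press/v119/guu20a.html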

2020

[13] [25]

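Large Language Models Can Be Easily Distracted by Irrelevant Context

Freda Shi, Xinyun Chen, Kanishka Misra, Nathan Scales, David Dohan, Ed H. Chi, Nathanael Schärli, and Denny Zhou. Large language models can be easily distracted by irrelevant context. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 31210--31227. PMLR, 2023.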

2023

[14] [27]

Making Retrieval-Augmented Language Models Robust to Irrelevant Context

Ori Yoran, Tomer Wolfson, Ori Ram, and Jonathan Berant. Making retrieval-augmented language models robust to irrelevant context. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=ZS4m74kZpH

[22] [35]

Benchmarking large language models in retrieval-augmented generation

Jiawei Chen, Hongyu Lin, Xianpei Han, and Le Sun. Benchmarking large language models in retrieval-augmented generation. Proceedings of the AAAI Conference on Artificial Intelligence, 38(16):17754--17762, 2024. doi:10.1609/aaai.v38i16.29728. URL https://ojs.aaai.org/index.php/AAAI/article/view/29728

work page doi:10.1609/aaai.v38i16.29728 2024

[23] [36]

Training verifiers to solve math word problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021. URL https://arxiv.org/abs/2110.14168

Pith/arXiv arXiv 2021

[24] [37]

The power of noise: Redefining retrieval for RAG systems

Florin Cuconasu, Giovanni Trappolini, Federico Siciliano, Simone Filice, Cesare Campagnano, Yoelle Maarek, Nicola Tonellotto, and Fabrizio Silvestri. The power of noise: Redefining retrieval for RAG systems. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '24, pages 719--729. Associ...

work page doi:10.1145/3626772.3657834 2024

[25] [38]

Not what you've signed up for: Compromising real-world LLM-integrated applications with indirect prompt injection

Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. Not what you've signed up for: Compromising real-world LLM-integrated applications with indirect prompt injection. In Proceedings of the 16th ACM Workshop on Artificial Intelligence and Security, AISec '23, pages 79--90, New York, NY, USA, 2023. Association ...

work page doi:10.1145/3605764.3623985 2023

[26] [39]

REALM: Retrieval augmented language model pre-training

Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. REALM: Retrieval augmented language model pre-training. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 3929--3938. PMLR, 2020. URL https://proceedings.mlr.press/v119/guu20a.html

2020

[27] [40]

Large language models cannot self-correct reasoning yet

Jie Huang, Xinyun Chen, Swaroop Mishra, Huaixiu Steven Zheng, Adams Wei Yu, Xinying Song, and Denny Zhou. Large language models cannot self-correct reasoning yet. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=IkmD3fKBPQ

2024

[28] [41]

MRKL systems: A modular, neuro-symbolic architecture that combines large language models, external knowledge sources and discrete reasoning

Ehud Karpas, Omri Abend, Yonatan Belinkov, Barak Lenz, Opher Lieber, Nir Ratner, Yoav Shoham, Hofit Bata, Yoav Levine, Kevin Leyton-Brown, Dor Muhlgay, Noam Rozen, Erez Schwartz, Gal Shachaf, Shai Shalev-Shwartz, Amnon Shashua, and Moshe Tenenholtz. MRKL systems: A modular, neuro-symbolic architecture that combines large language models, external knowledg...

Pith/arXiv arXiv 2022

[29] [42]

ToolHaystack: Stress-testing tool-augmented language models in realistic long-term interactions

Beong-woo Kwak, Minju Kim, Dongha Lim, Hyungjoo Chae, Dongjin Kang, Sunghwan Kim, Dongil Yang, and Jinyoung Yeo. ToolHaystack: Stress-testing tool-augmented language models in realistic long-term interactions. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 24696--24727, Suzhou, China, 2025. Association for Computational L...

work page doi:10.18653/v1/2025.findings-emnlp.1344 2025

[30] [43]

Retrieval-augmented generation for knowledge-intensive NLP tasks

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in Neural Information Processing Systems, volume 33, pages 9459--9474, 2020. URL ...

2020

[31] [44]

When Not to Trust Language Models: Investigating Effectiveness of Parametric and Non-Parametric Memories

Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Daniel Khashabi, and Hannaneh Hajishirzi. When not to trust language models: Investigating effectiveness of parametric and non-parametric memories. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9802--9822, Toronto, Canada, 2023....

work page doi:10.18653/v1/2023.acl-long.546 2023

[32] [45]

WebGPT: Browser-assisted question-answering with human feedback

Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, Xu Jiang, Karl Cobbe, Tyna Eloundou, Gretchen Krueger, Kevin Button, Matthew Knight, Benjamin Chess, and John Schulman. WebGPT: Browser-assisted question-answering with human feedback. arXiv preprint arX...

Pith/arXiv arXiv 2021

[33] [46]

Toolformer: Language models can teach themselves to use tools

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. In Advances in Neural Information Processing Systems, volume 36, 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/hash/d842425e4b...

2023

[34] [47]

The synthetic web: Adversarially-curated mini-internets for diagnosing epistemic weaknesses of language agents

Shrey Shah and Levent Ozgur. The synthetic web: Adversarially-curated mini-internets for diagnosing epistemic weaknesses of language agents. arXiv preprint arXiv:2603.00801, 2026. URL https://arxiv.org/abs/2603.00801

arXiv 2026

[35] [48]

Large language models can be easily distracted by irrelevant context

Freda Shi, Xinyun Chen, Kanishka Misra, Nathan Scales, David Dohan, Ed H. Chi, Nathanael Schärli, and Denny Zhou. Large language models can be easily distracted by irrelevant context. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 31210--31227. PMLR, 2023. URL http...

2023

[36] [49]

FEVER: a large-scale dataset for Fact Extraction and VERification

James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. FEVER: A large-scale dataset for Fact Extraction and VERification. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 809--819, New Orleans, Louisiana,...

work page internal anchor Pith review doi:10.18653/v1/n18-1074 2018

[37] [50]

LLMs cannot find reasoning errors, but can correct them given the error location

Gladys Tyen, Hassan Mansoor, Victor Carbune, Peter Chen, and Tony Mak. LLMs cannot find reasoning errors, but can correct them given the error location. In Findings of the Association for Computational Linguistics: ACL 2024, pages 13894--13908, Bangkok, Thailand, 2024. Association for Computational Linguistics. doi:10.18653/v1/2024.findings-acl.826. URL ...

work page doi:10.18653/v1/2024.findings-acl.826 2024

[38] [51]

The instruction hierarchy: Training LLMs to prioritize privileged instructions

Eric Wallace, Kai Xiao, Reimar Leike, Lilian Weng, Johannes Heidecke, and Alex Beutel. The instruction hierarchy: Training LLMs to prioritize privileged instructions. arXiv preprint arXiv:2404.13208, 2024. URL https://arxiv.org/abs/2404.13208

Pith/arXiv arXiv 2024

[39] [52]

AgentNoiseBench: Benchmarking robustness of tool-using LLM agents under noisy condition

Ruipeng Wang, Yuxin Chen, Yukai Wang, Chang Wu, Junfeng Fang, Xiaodong Cai, Qi Gu, Hui Su, An Zhang, Xiang Wang, Xunliang Cai, and Tat-Seng Chua. AgentNoiseBench: Benchmarking robustness of tool-using LLM agents under noisy condition. arXiv preprint arXiv:2602.11348, 2026. URL https://arxiv.org/abs/2602.11348

arXiv 2026

[40] [53]

How easily do irrelevant inputs skew the responses of large language models?

Siye Wu, Jian Xie, Jiangjie Chen, Tinghui Zhu, Kai Zhang, and Yanghua Xiao. How easily do irrelevant inputs skew the responses of large language models? In First Conference on Language Modeling, 2024. URL https://openreview.net/forum?id=S7NVVfuRv8

2024

[41] [54]

Adaptive chameleon or stubborn sloth: Revealing the behavior of large language models in knowledge conflicts

Jian Xie, Kai Zhang, Jiangjie Chen, Renze Lou, and Yu Su. Adaptive chameleon or stubborn sloth: Revealing the behavior of large language models in knowledge conflicts. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=auKAUJZMO6

2024

[42] [55]

The confidence dichotomy: Analyzing and mitigating miscalibration in tool-use agents

Weihao Xuan, Qingcheng Zeng, Heli Qi, Yunze Xiao, Junjue Wang, and Naoto Yokoya. The confidence dichotomy: Analyzing and mitigating miscalibration in tool-use agents. arXiv preprint arXiv:2601.07264, 2026. URL https://arxiv.org/abs/2601.07264

arXiv 2026

[43] [56]

HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369--2380, Brussels, Belgium, 2018. Association for Computatio...

work page doi:10.18653/v1/d18-1259 2018

[44] [57]

ReAct: Synergizing reasoning and acting in language models

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R. Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=WE_vluYUL-X

2023

[45] [58]

Benchmarking and defending against indirect prompt injection attacks on large language models

Jingwei Yi, Yueqi Xie, Bin Zhu, Emre Kiciman, Guangzhong Sun, Xing Xie, and Fangzhao Wu. Benchmarking and defending against indirect prompt injection attacks on large language models. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.1, KDD '25, pages 1809--1820. Association for Computing Machinery, 2025. doi:10.114...

work page doi:10.1145/3690624.3709179 2025

[46] [59]

Making retrieval-augmented language models robust to irrelevant context

Ori Yoran, Tomer Wolfson, Ori Ram, and Jonathan Berant. Making retrieval-augmented language models robust to irrelevant context. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=ZS4m74kZpH

2024

[47] [60]

Chain-of-note: Enhancing robustness in retrieval-augmented language models

Wenhao Yu, Hongming Zhang, Xiaoman Pan, Peixin Cao, Kaixin Ma, Jian Li, Hongwei Wang, and Dong Yu. Chain-of-note: Enhancing robustness in retrieval-augmented language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 14672--14685, Miami, Florida, USA, 2024. Association for Computational Linguistics. d...

work page doi:10.18653/v1/2024.emnlp-main.813 2024

[48] [61]

InjecAgent: Benchmarking indirect prompt injections in tool-integrated large language model agents

Qiusi Zhan, Zhixiang Liang, Zifan Ying, and Daniel Kang. InjecAgent: Benchmarking indirect prompt injections in tool-integrated large language model agents. In Findings of the Association for Computational Linguistics: ACL 2024, pages 10471--10506, Bangkok, Thailand, 2024. Association for Computational Linguistics. doi:10.18653/v1/2024.findings-acl.624. ...

work page doi:10.18653/v1/2024.findings-acl.624 2024