Detecting Is Not Resolving: The Monitoring Control Gap in Retrieval Augmented LLMs

Bo Yang; Changting Lin; Chen Ye; Meng Han; Wenpeng Xing; Xuyang Teng; Zhe Yu

arxiv: 2605.27157 · v1 · pith:TDYO3QXJnew · submitted 2026-05-26 · 💻 cs.AI

Pith reviewed 2026-06-29 17:04 UTC · model grok-4.3

classification 💻 cs.AI

keywords: monitoring-control gap · retrieval-augmented generation · RAG safety · contradiction detection · multi-turn evaluation · epistemic conflict · action selection · LLM robustness

The pith

Retrieval-augmented LLMs detect contradictory evidence yet fail to constrain their final outputs accordingly.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that retrieval-augmented language models can recognize contradictions in accumulated documents but do not translate that recognition into safer recommendations. This monitoring-control gap appears because single-turn tests of robustness do not predict behavior when evidence arrives across multiple turns. The authors introduce a multi-turn document accumulation protocol and run it on four model families with over 50,000 evaluations, finding that acknowledgement of conflict is uncorrelated with safe resolution. Mechanism probes indicate the problem lies in action selection rather than representation or attention. No prompt intervention reliably closes the gap.
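
The accumulation protocol described above can be sketched as a loop in which each turn appends one document to the context and the model answers again. This is a minimal sketch, not the authors' harness: `run_accumulation`, `TurnRecord`, the `model` callable, and the keyword-based `acknowledges` and `is_unsafe` judges are all hypothetical stand-ins, since the paper's prompt templates and rubrics are not given in this summary.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class TurnRecord:
    turn: int            # 1-based turn index
    n_docs: int          # documents visible to the model at this turn
    acknowledged: bool   # did the response mention a conflict?
    unsafe: bool         # did the recommendation remain unsafe?


def run_accumulation(
    question: str,
    documents: List[str],
    model: Callable[[str], str],
    acknowledges: Callable[[str], bool],
    is_unsafe: Callable[[str], bool],
) -> List[TurnRecord]:
    """Feed documents one per turn and score every intermediate response."""
    records: List[TurnRecord] = []
    context: List[str] = []
    for turn, doc in enumerate(documents, start=1):
        context.append(doc)  # evidence accumulates; nothing is dropped
        prompt = "\n\n".join(context) + "\n\nQuestion: " + question
        response = model(prompt)
        records.append(
            TurnRecord(turn, len(context), acknowledges(response), is_unsafe(response))
        )
    return records


# Toy stand-in model (an assumption for illustration only): it mentions the
# conflict once a warning is in context, but still recommends the action.
def toy_model(prompt: str) -> str:
    if "WARNING" in prompt:
        return "The sources conflict. Recommendation: proceed."
    return "Recommendation: proceed."


records = run_accumulation(
    question="Should the user proceed?",
    documents=["Doc A: procedure is fine.", "Doc B: WARNING, procedure is hazardous."],
    model=toy_model,
    acknowledges=lambda r: "conflict" in r.lower(),
    is_unsafe=lambda r: "proceed" in r.lower(),
)
for rec in records:
    print(rec)
```

In this toy run the second turn is both acknowledged and unsafe, which is the pattern the paper calls the monitoring-control gap.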

Core claim

Models exhibit a monitoring-control gap: they readily acknowledge contradictory evidence, yet this awareness fails to constrain their final recommendations. Detecting epistemic conflict does not imply resolving it safely. Single-turn diagnostics systematically overestimate RAG safety, contradiction acknowledgement is uncorrelated with safe resolution, and no universal prompt fix exists. Converging evidence from hidden-state probing, attention analysis, and response taxonomy points to action selection as the locus of the deficit: danger-relevant information is internally represented and receives enhanced attention during unsafe generation, yet fails to constrain output behavior.

What carries the argument

The monitoring-control gap: the disconnect between a model's internal detection of contradictory evidence and the use of that detection to shape its final output.

If this is right

Single-turn diagnostics systematically overestimate RAG safety.
Contradiction acknowledgement is uncorrelated with safe resolution.
No universal prompt fix exists for the gap.
Danger-relevant information is represented internally and receives enhanced attention yet fails to constrain output.
The gap must be measured and closed before retrieval-augmented systems can be trusted in high-stakes settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The gap may appear in other sequential decision settings where evidence arrives incrementally.
Targeted interventions at the action-selection stage, rather than detection, may be needed to close it.
Deployment in domains that rely on accumulating evidence would require new multi-turn safety benchmarks.
Human validation results suggest the pattern is not an artifact of automatic metrics.

Load-bearing premise

The assumption under test, which the paper argues is incorrect, is that single-turn robustness to contradictory evidence predicts robustness when evidence accumulates across multiple turns.

What would settle it

A controlled multi-turn accumulation experiment in which models that acknowledge contradictions also produce safe resolutions at rates significantly above chance would falsify the gap.
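
One hedged way to operationalize this falsification test is a 2x2 contingency table of acknowledgement against safe resolution, summarized by the phi coefficient. The sketch below is an illustration, not the paper's statistical procedure (which this summary does not specify); `phi_coefficient` and the toy records are hypothetical.

```python
import math
from typing import Iterable, Tuple


def phi_coefficient(pairs: Iterable[Tuple[bool, bool]]) -> float:
    """Phi (Pearson correlation for two binary variables).

    Each pair is (acknowledged_conflict, resolved_safely).
    Returns 0.0 when a margin is empty and phi is undefined.
    """
    a = b = c = d = 0
    for ack, safe in pairs:
        if ack and safe:
            a += 1
        elif ack and not safe:
            b += 1
        elif not ack and safe:
            c += 1
        else:
            d += 1
    denom = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return (a * d - b * c) / denom if denom else 0.0


# Toy data: safe resolution is equally likely with or without acknowledgement.
uncorrelated = (
    [(True, True)] * 10 + [(True, False)] * 30
    + [(False, True)] * 10 + [(False, False)] * 30
)
# Toy data: acknowledgement perfectly predicts safe resolution.
coupled = [(True, True)] * 40 + [(False, False)] * 40

print(phi_coefficient(uncorrelated))
print(phi_coefficient(coupled))
```

A phi near zero is consistent with the reported dissociation; a value well above zero across models would count against it.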

Figures

Figures reproduced from arXiv: 2605.27157 by Bo Yang, Changting Lin, Chen Ye, Meng Han, Wenpeng Xing, Xuyang Teng, Zhe Yu.

Figure 2: Multi-turn danger escalation across four model families and six evidence timing patterns. All models …

Figure 3: Danger rate heatmap by timing pattern (rows) …

Figure 4: Scale analysis of the monitoring–control gap.

Figure 6: Cross-model probe comparison: Qwen2.5-1.5B vs. 7B. (A) T2→T3 probe accuracy (7B peaks at 0.645, 1.5B at 0.572). (B) T2−T3 accuracy gap: positive values indicate T2 outperforms T3 (seen in early 7B layers but reversed for 1.5B). … (10K resamples) confirm ∆ is not significantly different from zero (Qwen2.5-1.5B: p = 0.384; 7B: p = 0.412; Mistral-7B: p = 0.742; Llama-3-8B: p = 0.178). TOST with equivalence bo…

Figure 7: Human validation and judge calibration. Au…

Original abstract

Retrieval-augmented LLMs are deployed for tasks where evidence quality determines action safety, yet evaluation protocols assume that single-turn robustness predicts robustness when evidence accumulates across turns. We show this assumption is fundamentally incorrect. Models exhibit a monitoring-control gap: they readily acknowledge contradictory evidence, yet this awareness fails to constrain their final recommendations - detecting epistemic conflict does not imply resolving it safely. Through a multi-turn document accumulation protocol across four model families (1.5B-32B parameters) and over 50,000 turn-level evaluations, we demonstrate that single-turn diagnostics systematically overestimate RAG safety, that contradiction acknowledgement is uncorrelated with safe resolution, a pattern corroborated by targeted human validation, and that no universal prompt fix exists. Converging mechanism evidence - hidden-state probing, attention analysis, and response-strategy taxonomy - points to action selection as the most plausible locus of the deficit: danger-relevant information is internally represented and receives enhanced attention during unsafe generation, yet fails to constrain output behavior. The gap between what models recognize and what they do must be measured and closed before retrieval-augmented systems can be trusted in high-stakes settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows models detect contradictions in accumulating RAG evidence but fail to let that awareness shape final outputs, and single-turn tests miss this.

The core finding is that RAG models acknowledge contradictory evidence across turns yet still produce unsafe recommendations, and that standard single-turn checks do not catch this dissociation. The authors ran a multi-turn accumulation protocol on four model families from 1.5B to 32B parameters, with over 50,000 turn-level evaluations, and report that acknowledgement rates do not predict safe resolution. Hidden-state probes, attention maps, and a response taxonomy all point to the action-selection stage, rather than detection itself, as the weak point.

What stands out is the scale and the converging lines of evidence. The multi-turn design directly tests the assumption that single-turn robustness carries over, and the lack of a universal prompt fix is a practical takeaway. The behavioral measures plus human validation and internal representations give the claim more weight than a purely observational study would have.

The main soft spot is that the abstract leaves the exact prompt templates, exclusion rules, and statistical controls opaque, so the 50k-turn numbers are hard to stress-test without the full methods. The correlation between acknowledgement and resolution is reported as low, but it is still correlational; a tighter causal test would strengthen it. No obvious circularity or fitting artifact shows up in the reported structure.

This paper merits a serious referee and should interest anyone working on RAG safety or multi-turn evaluation. It is aimed at people who deploy or audit retrieval systems in settings where evidence builds over time. It deserves peer review because the experimental contrast is clean and the mechanism evidence goes beyond a single behavioral measure.

Referee Report

1 major / 2 minor

Summary. The paper claims that retrieval-augmented LLMs exhibit a monitoring-control gap: models acknowledge contradictory evidence (as shown by behavioral measures and human validation) yet fail to constrain their final recommendations in multi-turn settings. The claim is supported by a multi-turn document accumulation protocol across four model families (1.5B–32B) and >50k turn-level evaluations. The reported results are that single-turn diagnostics overestimate safety, that acknowledgement is uncorrelated with safe resolution, and that no universal prompt mitigation exists. Converging mechanism evidence (hidden-state probing, attention analysis, response taxonomy) localizes the deficit to action selection rather than detection.

Significance. If the central empirical pattern holds, the work is significant for RAG safety evaluation: it falsifies the assumption that single-turn robustness predicts multi-turn behavior under accumulating evidence, supplies large-scale data with multiple converging analyses, and identifies action selection as the plausible locus. The scale, human validation, and mechanism probes are strengths that would support publication if methods are fully specified.

major comments (1)

[Methods] Methods section: the manuscript does not report data exclusion rules, exact statistical controls, or full protocol details for the 50k turn-level evaluations. These choices are load-bearing for the claim that acknowledgement is uncorrelated with safe resolution and that single-turn tests systematically overestimate safety.

minor comments (2)

[Abstract] Abstract: the phrase 'no universal prompt fix exists' would benefit from a brief parenthetical listing the prompt families tested.
[Figures] Figure captions (throughout): ensure all panels include error bars or confidence intervals matching the statistical tests described in the text.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. We agree that additional methodological transparency is required to support the core claims. We address the single major comment below and will incorporate the requested details in the revised manuscript.

Referee: [Methods] Methods section: the manuscript does not report data exclusion rules, exact statistical controls, or full protocol details for the 50k turn-level evaluations. These choices are load-bearing for the claim that acknowledgement is uncorrelated with safe resolution and that single-turn tests systematically overestimate safety.

Authors: We agree that the current Methods section is insufficiently detailed. In the revision we will add: (1) explicit data exclusion criteria (e.g., removal of turns with parsing failures, model refusals, or incomplete document accumulation); (2) the precise statistical procedures, including correlation coefficients, p-value thresholds, and any corrections for multiple comparisons used to establish the lack of correlation between acknowledgement and safe resolution; and (3) a complete protocol description covering prompt templates, turn sequencing rules, document injection order, evaluation rubrics, and the exact composition of the >50k turn-level dataset. These additions will allow readers to assess the robustness of the single-turn vs. multi-turn discrepancy and the dissociation findings.

revision: yes
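
The rebuttal mentions corrections for multiple comparisons without naming one. As an illustration of what such a correction could look like, here is a minimal sketch of the Holm step-down procedure; the choice of Holm, the function name `holm_reject`, and the p-values are assumptions for this example, not details from the paper.

```python
from typing import List


def holm_reject(p_values: List[float], alpha: float = 0.05) -> List[bool]:
    """Holm step-down procedure; returns one reject flag per input p-value."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # Compare the (rank+1)-th smallest p-value against alpha / (m - rank).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return reject


# Hypothetical p-values, one per model family (not taken from the paper).
toy_p = [0.001, 0.04, 0.03, 0.20]
print(holm_reject(toy_p))
```

With four tests, only the smallest toy p-value clears its threshold (0.05 / 4), so the two values that look significant at an uncorrected 0.05 are not rejected.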

Circularity Check

0 steps flagged

No significant circularity

The paper is an empirical study that introduces a multi-turn document accumulation protocol and reports measured correlations between contradiction acknowledgement and safe resolution across model families. The central claim of a monitoring-control gap is grounded in direct behavioral measurements, human validation, hidden-state probing, attention analysis, and response taxonomy rather than any derivation, equation, or fitted parameter that reduces to its own inputs. No self-citation is load-bearing for the core result, and the work does not invoke uniqueness theorems or ansatzes from prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the validity of the multi-turn accumulation protocol and the assumption that human validation and internal probes provide independent evidence of the gap; no free parameters or invented physical entities are introduced.

axioms (1)

domain assumption: Single-turn and multi-turn robustness can be compared via the same contradiction-acknowledgement and resolution metrics.
Invoked when the paper states that single-turn diagnostics systematically overestimate safety.
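
Under this assumption, the overestimate can be expressed as a difference between two danger rates scored with the same judge. A minimal sketch, with hypothetical names (`danger_rate`, `safety_overestimate`) and toy labels rather than the paper's data:

```python
from typing import Sequence


def danger_rate(unsafe_flags: Sequence[bool]) -> float:
    """Fraction of evaluated responses judged unsafe."""
    if not unsafe_flags:
        raise ValueError("need at least one evaluation")
    return sum(unsafe_flags) / len(unsafe_flags)


def safety_overestimate(
    single_turn: Sequence[bool], multi_turn_final: Sequence[bool]
) -> float:
    """Multi-turn danger rate minus single-turn danger rate.

    Positive values mean the single-turn diagnostic looked safer than
    the same scenarios did once evidence accumulated across turns.
    """
    return danger_rate(multi_turn_final) - danger_rate(single_turn)


# Toy labels for the same eight scenarios under both protocols.
single = [False, False, True, False, False, False, False, False]  # 1 of 8 unsafe
multi = [True, False, True, True, False, True, False, False]      # 4 of 8 unsafe
print(safety_overestimate(single, multi))
```

The comparison is only meaningful if both lists are scored with the same rubric, which is exactly what the axiom above assumes.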

