Do LLM Attribution Metrics Transfer? Auditing Retrieval-Augmented Generation Evaluation Across Datasets and Constructs

Aditya Nannapaneni; Juan Pablo De la Cruz Weinstein; Tianyu Ding

arXiv: 2606.23915 · v1 · pith:KNFWZ5AHnew · submitted 2026-06-22 · 💻 cs.CL · cs.IR · cs.LG

Pith reviewed 2026-06-26 08:05 UTC · model grok-4.3

keywords: attribution metrics · RAG evaluation · metric transfer · automatic scorers · human labels · NLI · BERTScore · LLM evaluation

The pith

Automatic metrics for checking how well LLM answers attribute to sources do not transfer across datasets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper audits eight automatic scorers on three evaluation constructs for retrieval-augmented generation, checking whether any scorer stays competitive on every dataset inside a construct. In the generated-answer attribution construct, which has the broadest human-labeled coverage across five datasets, no scorer transfers. An NLI model that reaches AUROC 0.90 on short-claim data falls to chance level on long-form answers, while BERTScore rises to the top; rankings between datasets reverse. A rule that simply picks the average-best scorer produces worse held-out performance than locking in one fixed scorer.

Core claim

In the generated-answer attribution construct, none of the eight automatic scorers transfers. Per-dataset rankings invert (Kendall tau = -0.64 on AttributedQA vs. LFQA). An off-the-shelf NLI scorer drops from AUROC 0.90 on short-claim data to 0.53 on long-form data, where BERTScore reaches 0.91, and the instability is not explained by length. A best-on-average selection rule yields a mean held-out regret of 0.172 AUROC, worse than fixing one scorer, so metric choice must be validated on the target dataset.
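To make the ranking-inversion statistic concrete, here is a minimal sketch of Kendall's tau computed over per-scorer AUROC values on two datasets. It assumes the tau-a variant with no tie correction, which may differ from the variant the paper uses. The function name `kendall_tau` and the four AUROC values per dataset are invented for illustration and are not the paper's numbers.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall tau-a between two equal-length lists (no tie correction)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length >= 2")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        # A pair is concordant when both lists order items i and j the same way.
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical AUROCs for the same four scorers on two datasets (made-up values).
auroc_dataset_a = [0.90, 0.80, 0.70, 0.60]
auroc_dataset_b = [0.55, 0.65, 0.95, 0.85]

tau = kendall_tau(auroc_dataset_a, auroc_dataset_b)
print(round(tau, 3))
```

A value near -1 means the two datasets rank the scorers in almost opposite order, and a value near +1 means they agree.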

What carries the argument

The transfer audit: a scorer counts as transferring only if it stays within the 95% confidence interval of the best scorer on every dataset of a multi-dataset construct.
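One possible reading of this criterion can be sketched as follows. The sketch assumes that "within the 95% confidence interval of the best scorer" means a scorer's point-estimate AUROC is at or above the lower bound of a percentile-bootstrap interval on the best scorer's AUROC. The paper may define the interval differently. The names `auroc`, `bootstrap_ci`, and `transfers`, and the toy data, are hypothetical.

```python
import random


def auroc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(scores, labels, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for AUROC (assumed interval type)."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that contain a single class
            stats.append(auroc([scores[i] for i in idx], ys))
    stats.sort()
    return stats[int(alpha / 2 * n_boot)], stats[int((1 - alpha / 2) * n_boot) - 1]


def transfers(per_dataset):
    """per_dataset maps dataset name -> (labels, {scorer name: scores}).

    Returns the scorers that reach the best scorer's CI lower bound on every dataset.
    """
    surviving = None
    for labels, by_scorer in per_dataset.values():
        aucs = {name: auroc(s, labels) for name, s in by_scorer.items()}
        best = max(aucs, key=aucs.get)
        lower, _ = bootstrap_ci(by_scorer[best], labels)
        ok = {name for name, a in aucs.items() if a >= lower}
        surviving = ok if surviving is None else surviving & ok
    return surviving or set()


# Toy data: each scorer is perfect on one dataset and inverted on the other.
labels = [0, 0, 0, 1, 1, 1]
up = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
down = list(reversed(up))
toy = {
    "short_claims": (labels, {"scorer_a": up, "scorer_b": down}),
    "long_form": (labels, {"scorer_a": down, "scorer_b": up}),
}
print(transfers(toy))
```

In this toy setup the intersection is empty, which mirrors the shape of the paper's finding without reproducing its data.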

If this is right

Metric rankings invert across datasets with statistically significant negative correlation.
Selecting the average-best scorer produces higher regret than a fixed scorer in leave-one-dataset-out testing.
Prompt-based LLM judges avoid chance-level collapses but remain non-uniform, costly, and non-deterministic.
Performance gaps between scorers persist after length controls.
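The leave-one-dataset-out comparison can be sketched as below. The sketch assumes regret means the gap between the best AUROC on the held-out dataset and the AUROC of the scorer chosen from the remaining datasets. The paper's exact definition may differ. The function names and the AUROC table are invented for illustration.

```python
def best_on_average_regret(auroc_table):
    """auroc_table maps dataset -> {scorer: AUROC}. Returns {held-out dataset: regret}."""
    regrets = {}
    for held_out, held_scores in auroc_table.items():
        train = [d for d in auroc_table if d != held_out]
        # Pick the scorer with the highest mean AUROC on the other datasets.
        mean_train = {
            s: sum(auroc_table[d][s] for d in train) / len(train) for s in held_scores
        }
        chosen = max(mean_train, key=mean_train.get)
        regrets[held_out] = max(held_scores.values()) - held_scores[chosen]
    return regrets


def fixed_scorer_regret(auroc_table, scorer):
    """Regret on each dataset when the same scorer is always used."""
    return {d: max(row.values()) - row[scorer] for d, row in auroc_table.items()}


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Hypothetical AUROC table (made-up values, two scorers, three datasets).
table = {
    "d1": {"s1": 0.90, "s2": 0.60},
    "d2": {"s1": 0.50, "s2": 0.90},
    "d3": {"s1": 0.80, "s2": 0.70},
}

avg_rule = mean(best_on_average_regret(table).values())
fixed_s1 = mean(fixed_scorer_regret(table, "s1").values())
fixed_s2 = mean(fixed_scorer_regret(table, "s2").values())
print(round(avg_rule, 3), round(fixed_s1, 3), round(fixed_s2, 3))
```

When rankings flip between datasets, the scorer that looks best on the training datasets can be the wrong pick for the held-out one, so the selection rule can do worse than committing to any single scorer. The toy table is built to show that effect.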

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same transfer test could be applied to automatic metrics in summarization or open-ended QA to check for similar instability.
Evaluation benchmarks may need to require reporting on multiple attribution datasets rather than a single one.
Developers could explore lightweight adapters that adjust an existing scorer to a new dataset without full re-labeling.

Load-bearing premise

The human-labeled datasets supply reliable and comparable ground truth for attribution across domains, answer lengths, and question types.

What would settle it

A single automatic scorer that stays inside the 95% confidence interval of the best scorer on every dataset in the generated-answer attribution construct.

Figures

Figures reproduced from arXiv: 2606.23915 by Aditya Nannapaneni, Juan Pablo De la Cruz Weinstein, Tianyu Ding.

**Figure 2.** Verbatim LLM prompt templates. (a) The graded support-probability prompt used to run Claude Opus 4.8 as a scoring metric (...

Original abstract

Practice often treats automatic metrics for attribution in LLM retrieval-augmented generation as interchangeable. We audit eight automatic scorers -- lexical, embedding, and BERTScore baselines alongside entailment/grounding-trained models (clean and FEVER NLI, the checker MiniCheck) -- across three evaluation constructs (provenance/topicality, generated-answer attribution, and fact-check entailment), asking whether any scorer transfers: stays within the 95% confidence interval of the best audited scorer on every dataset of a multi-dataset construct. In the construct with the most multi-dataset human-labeled coverage -- generated-answer attribution (AttributionBench's four source datasets, n = 1,610, with independent HAGRID, n = 2,150) -- none does: the per-dataset metric rankings invert (Kendall tau = -0.64, p = 0.031 on AttributedQA vs. LFQA), and an off-the-shelf NLI scorer that is best on short-claim AttributedQA (AUROC 0.90) collapses to AUROC 0.53 (chance) on long-form LFQA, where BERTScore wins (0.91); the flip is not a length or truncation artifact. This instability has a concrete decision cost: a naive "best-on-average" rule for choosing an evaluator fails leave-one-dataset-out (mean held-out regret 0.172 AUROC, worse than fixing one scorer), so metric choice must be validated on the target dataset rather than learned from others. A prompt-based LLM judge avoids the chance-level collapses the automatic scorers suffer (no LFQA collapse) but is not uniformly best, ~100x costlier, and non-deterministic -- relocating, not removing, the validation burden.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

No attribution scorer transfers across RAG datasets, with ranking inversions and AUROC drops, but the result hinges on treating human labels from different sources as interchangeable ground truth.

The main takeaway is that none of the eight audited attribution scorers transfers reliably across RAG datasets in the generated-answer construct. Metric rankings invert between AttributedQA and LFQA, and an NLI scorer that hits 0.90 AUROC on short claims falls to 0.53 on long-form data, where BERTScore reaches 0.91.

The paper does a clean job documenting this with multi-dataset coverage, Kendall tau of -0.64, and a leave-one-dataset-out regret calculation that shows picking the average best scorer costs 0.172 AUROC on held-out data. It also checks that the flip is not just a length artifact and contrasts the automatic scorers against a prompt-based LLM judge that avoids the chance-level collapses. These are direct empirical comparisons against external human labels, with no circular fitting.

The soft spot is the assumption that the human labels from AttributionBench's four sources plus HAGRID supply consistent, comparable ground truth for the same construct. If annotation protocols, entailment definitions, or agreement levels differ systematically between short-claim and long-form datasets, the ranking inversions and performance gaps could partly reflect label differences rather than scorer non-transfer. The abstract does not detail how they aligned or verified these labels.

This is useful for anyone running RAG evaluations or building metrics. It has enough concrete evidence and practical implication to deserve peer review, though the label consistency point should be checked in the full methods.

Referee Report

2 major / 2 minor

Summary. The paper audits eight automatic attribution scorers (lexical, embedding, BERTScore, NLI variants, MiniCheck) across three RAG evaluation constructs, focusing on whether any scorer transfers by remaining within the 95% CI of the best performer on every dataset within a multi-dataset construct. In generated-answer attribution (AttributionBench's four sources, n=1,610 plus HAGRID n=2,150), no scorer transfers: per-dataset rankings invert (Kendall tau=-0.64, p=0.031 between AttributedQA and LFQA), an NLI scorer drops from AUROC 0.90 on short-claim AttributedQA to 0.53 on long-form LFQA (where BERTScore reaches 0.91), and a best-on-average rule incurs mean held-out regret of 0.172 AUROC. An LLM judge avoids some collapses but is costlier and non-deterministic. The conclusion is that metric choice requires target-dataset validation.

Significance. If the empirical comparisons hold, the finding that automatic metrics fail to transfer and that average-based selection underperforms fixed scorers would directly affect evaluation practices in RAG and attribution research, supporting a shift toward dataset-specific validation rather than reliance on published leaderboards or averages.

major comments (2)

[Methods / Dataset construction (AttributionBench and HAGRID sections)] The central non-transfer claim (inverted rankings and AUROC collapse from 0.90 to 0.53) is computed against human labels treated as interchangeable ground truth across AttributionBench sources and HAGRID. The manuscript does not report inter-annotator agreement, positive-class definitions (strict vs. loose entailment), or annotation-protocol differences between short-claim and long-form datasets; without these, the observed metric flips could arise from label inconsistency rather than scorer non-transfer.
[Results (regret and transfer experiments)] The leave-one-dataset-out regret analysis (mean 0.172 AUROC) and the claim that 'metric choice must be validated on the target dataset' rest on the same cross-dataset label comparability; if protocols differ systematically by answer length or question type, the regret numbers and transfer failure may not generalize beyond the specific label sets used.

minor comments (2)

[Dataset description] Clarify whether the four AttributionBench source datasets use identical annotation guidelines or if any filtering/exclusion rules differ from HAGRID.
[Results] The abstract states the flip 'is not a length or truncation artifact'; add the specific control experiment (e.g., truncation lengths tested) to the main text for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the careful review and constructive comments. We address the concerns about human label comparability and annotation protocols point by point below. We agree that additional context from the source dataset papers will strengthen the manuscript and will incorporate it in revision.

Referee: [Methods / Dataset construction (AttributionBench and HAGRID sections)] The central non-transfer claim (inverted rankings and AUROC collapse from 0.90 to 0.53) is computed against human labels treated as interchangeable ground truth across AttributionBench sources and HAGRID. The manuscript does not report inter-annotator agreement, positive-class definitions (strict vs. loose entailment), or annotation-protocol differences between short-claim and long-form datasets; without these, the observed metric flips could arise from label inconsistency rather than scorer non-transfer.

Authors: We acknowledge that the manuscript does not report inter-annotator agreement, positive-class definitions, or a side-by-side comparison of annotation protocols. The labels are taken directly from the released AttributionBench and HAGRID datasets; we will add a new Methods subsection that extracts and summarizes the annotation protocols, IAA statistics, and entailment definitions from the original dataset papers. This will make any potential label inconsistencies transparent to readers. We maintain that the observed ranking inversions and AUROC collapses are unlikely to be solely an artifact of label noise, as they appear consistently across multiple metrics and the paper already rules out length/truncation effects, but we agree the added protocol details will allow better assessment of this alternative explanation. revision: yes
Referee: [Results (regret and transfer experiments)] The leave-one-dataset-out regret analysis (mean 0.172 AUROC) and the claim that 'metric choice must be validated on the target dataset' rest on the same cross-dataset label comparability; if protocols differ systematically by answer length or question type, the regret numbers and transfer failure may not generalize beyond the specific label sets used.

Authors: We agree that the regret numbers and the validation recommendation are tied to the specific label sets and their comparability. Systematic differences in annotation protocols could affect generalizability of the 0.172 mean regret figure. In revision we will add an explicit limitations paragraph noting this caveat, stating that the quantitative regret value applies to the label sets examined, and reiterating that practitioners should validate metrics against their own target data and labeling protocol. We will also cross-reference the original dataset papers for protocol details. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical comparisons on external labels

The paper's central claims consist of AUROC values, Kendall tau rankings, and leave-one-dataset-out regret computed directly against human labels from AttributionBench (four sources, n=1610) and independent HAGRID (n=2150). No equations, fitted parameters, or self-citations reduce these outcomes to inputs defined by the paper itself. The non-transfer result is an observed empirical pattern across datasets rather than a derivation that collapses by construction. Human-label comparability is an external assumption about ground truth, not a circularity in the reported metrics.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the representativeness and reliability of the chosen human-annotated datasets as ground truth and on AUROC as the comparison metric; no free parameters or invented entities are introduced.

axioms (2)

domain assumption: Human annotations in the AttributionBench and HAGRID datasets constitute reliable gold-standard labels for attribution.
Used to benchmark all automatic scorers and declare non-transfer.
standard math: AUROC is an appropriate scalar for comparing scorer agreement with human judgments across datasets.
Central to all reported performance numbers and the regret calculation.
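One reason AUROC suits this comparison is that it depends only on how a scorer orders the examples, not on the scale of its outputs, so a cosine similarity and an entailment probability can be compared directly. The sketch below illustrates that property with a pairwise (Mann-Whitney style) AUROC. The function name `auroc`, the labels, and the scores are made up for illustration.

```python
import math


def auroc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy human labels (1 = attributable) and raw scorer outputs.
labels = [0, 1, 0, 1, 1, 0]
raw = [0.2, 0.9, 0.4, 0.35, 0.7, 0.1]

# A strictly increasing rescaling changes the values but not their order.
squashed = [1.0 / (1.0 + math.exp(-10.0 * (s - 0.5))) for s in raw]

print(auroc(raw, labels), auroc(squashed, labels))
```

Both calls return the same value because the rescaling preserves the ordering of the scores. The flip side is that AUROC says nothing about calibration or about the best decision threshold on a given dataset.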
