Verified Misguidance: Measuring Structural Citation Failures in Search-Augmented LLMs

Dongha Lee; Eunyoung Kim; Hyeonseo Jang; Wooseok Jeong; Yongsik Seo

arxiv: 2605.28565 · v1 · pith:XD6GYYWQnew · submitted 2026-05-27 · cs.DL · cs.AI · cs.CL · cs.IR

Pith reviewed 2026-06-29 09:21 UTC · model grok-4.3

keywords: verified misguidance, citation evaluation, search-augmented LLMs, source fidelity, citation suitability, structural citation failures, CITETRACE dataset, citation quality

The pith

Search-augmented LLMs cite real, accessible sources, yet 30.6 percent of their citations distort the source content and 27.1 percent come from domain-inappropriate sources.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds CITETRACE, a dataset that follows the full path from user query to retrieved source to generated answer across 11,200 queries and 112,000 responses. It pairs this with a three-dimension scoring system that checks whether a citation aligns with query intent, comes from an appropriate source, and faithfully represents the source material. When applied, the scores expose verified misguidance: models produce citations that are real and accessible yet fail on one or more of these checks, creating a trade-off between fidelity to the source and suitability of the source. Provider-level differences account for 88-96 percent of the observed variance in citation quality, leaving little to be explained by differences between individual models.

Core claim

The central claim is that search-augmented LLMs exhibit verified misguidance, a pattern in which they cite real, accessible sources that nevertheless fail on intent-purpose alignment, source suitability, or answer-source fidelity. This produces a fidelity-suitability trade-off in which more faithful citations often come from less appropriate sources and vice versa. Measurements across the dataset show 30.6 percent of citations distort their sources and 27.1 percent originate from domain-inappropriate sources, with up to 96 percent of responses containing at least one structurally misleading citation. Provider-level factors explain 88-96 percent of citation-quality variance.

What carries the argument

The CITETRACE dataset, which traces citation chains from 11,200 real-world queries through 112,000 responses to produce 761,495 evaluable citation pairs, combined with a three-dimension evaluation framework that scores each citation on intent-purpose alignment, source suitability, and answer-source fidelity using expert-validated predefined matrices and a five-level fidelity rubric.
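As a rough illustration of how the three dimensions combine into a verified-misguidance flag, the sketch below models one scored citation pair. The score scales, the direction of the fidelity rubric (assumed 1 = worst, 5 = best), and the pass thresholds are all assumptions made for this example; the source does not state them.

```python
from dataclasses import dataclass

# Hypothetical pass thresholds. The paper's actual cut-offs are not given
# in this summary, so these values are placeholders for illustration only.
ASSUMED_MIN_PASS = {
    "intent_purpose_alignment": 2,
    "source_suitability": 2,
    "answer_source_fidelity": 3,
}


@dataclass(frozen=True)
class CitationScore:
    """One evaluable citation pair scored on the three dimensions."""
    intent_purpose_alignment: int  # assumed: higher is better
    source_suitability: int        # assumed: higher is better
    answer_source_fidelity: int    # five-level rubric, assumed 1 (worst) to 5 (best)


def failed_dimensions(score, min_pass=ASSUMED_MIN_PASS):
    """Return the names of the dimensions whose score is below its threshold."""
    return [name for name, floor in min_pass.items() if getattr(score, name) < floor]


def is_verified_misguidance(source_accessible, score, min_pass=ASSUMED_MIN_PASS):
    """A citation counts here if the source is real and accessible
    yet at least one dimension fails."""
    return source_accessible and bool(failed_dimensions(score, min_pass))
```

Under these assumptions, a citation that is faithful but drawn from an unsuitable source is flagged just like one that is well sourced but distorted, which is the sense in which the failure is structural.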

If this is right

A fidelity-suitability trade-off appears consistently across models, so improving one dimension often harms the other.
Provider-level differences explain 88-96 percent of citation-quality variance, indicating that retrieval and ranking policies at the system level dominate over model-specific capabilities.
Up to 96 percent of responses contain at least one structurally misleading citation, so users relying on citations without verification face a high probability of encountering distorted or mismatched evidence.
The evaluation framework can be applied to any system that returns citation-bearing responses, enabling systematic diagnosis of structural citation failures.
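The jump from a per-citation failure rate to a response-level exposure rate can be sanity-checked with a back-of-the-envelope calculation. The sketch below assumes that citation failures are independent and that every response carries the average number of evaluable citations (761,495 pairs over 112,000 responses); neither assumption comes from the paper, and the function name is made up for this example.

```python
def exposure_probability(per_citation_failure_rate, citations_per_response):
    """Chance that a response contains at least one failing citation,
    assuming failures are independent across citations."""
    return 1.0 - (1.0 - per_citation_failure_rate) ** citations_per_response


# Average evaluable citations per response, from the dataset totals.
avg_citations = 761_495 / 112_000

# 30.6% is the reported pooled share of citations that distort their source.
print(round(avg_citations, 2))
print(round(exposure_probability(0.306, avg_citations), 3))
```

With these inputs the estimate lands at about 0.92, the same order as the reported "up to 96 percent", which suggests the response-level figure follows largely from how many citations each response carries.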

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Targeting provider-level source selection and ranking logic may yield larger reductions in verified misguidance than further tuning of the language models themselves.
Explicit mechanisms to detect and balance the fidelity-suitability trade-off during retrieval could be added to existing search-augmented pipelines.
The dataset could serve as a benchmark for testing whether new retrieval methods reduce the rate of domain-inappropriate or distorting citations.

Load-bearing premise

The three-dimension evaluation framework using expert-validated predefined matrices and a five-level fidelity rubric accurately captures the structural trustworthiness of citations without systematic bias.

What would settle it

Re-scoring a random sample of several hundred citation pairs with a new panel of experts and obtaining substantially different distributions on any of the three dimensions would show the framework does not reliably measure citation quality.
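One way to quantify "substantially different" in such a re-scoring exercise is a chance-corrected agreement statistic between the original labels and the new panel's labels. The sketch below computes unweighted Cohen's kappa for two label sequences. Treating the two panels as two raters and ignoring the ordering of the five fidelity levels are simplifications assumed here; a weighted kappa or Krippendorff's alpha may suit an ordinal rubric better.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa for two equal-length label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)

    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected agreement under independence, from each rater's label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

A kappa near 1 on the re-scored sample would support the rubric's reliability, while a value near 0 would indicate agreement no better than chance.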

Figures

Figures reproduced from arXiv: 2605.28565 by Dongha Lee, Eunyoung Kim, Hyeonseo Jang, Wooseok Jeong, Yongsik Seo.

**Figure 2.** Composition of the CITETRACE query set. The 28 Stack Exchange communities are grouped by Stack Exchange’s six official categories. Each community is annotated with its audience and domain. Communities were selected to satisfy three conditions: expert knowledge is required, commercial actors are present, and queries involve substantive decisions about safety, cost, or design. Applying the framework to CITET…

**Figure 3.** Overview of the three-dimension evaluation framework for diagnosing Verified Misguidance.

**Figure 4.** Source-type profiles and YMYL amplification.

**Figure 5.** Response-level amplification of citation failures. A response is counted as exposed if at…

**Figure 6.** Prompt for Response Generation. Evaluability filtering. Among the 802,945 successfully crawled citations, we apply four filters to exclude pairs whose cited_sentence is defective. Following the staged attrition protocol of CiteME [42], we report each filter’s effect separately rather than as a single aggregate (…

**Figure 7.** Full classification prompt for Query Intent (QI1–QI5). Issued via…

**Figure 8.** Full classification prompt for Source Purpose (SP1–SP6).

**Figure 9.** Full classification prompt for Source Domain (SD1–SD10). YMYL labels (SD1–SD3)…

**Figure 10.** Full classification prompt for Source Type (ST1–ST6).

**Figure 11.** Full adjudication prompt for Answer–Source Fidelity (ASF1–ASF5).

**Figure 12.** Screenshot of the expert-panel survey landing page used for IPA Matrix validation (§C.1)

**Figure 13.** Screenshot of the IPA matrix expert-validation interface (§C.1). On a single page,…

**Figure 14.** Screenshot of the SSM expert-validation interface (§C.2). On a single page, validators…

**Figure 15.** Per-model Answer–Source Fidelity score distribution across the 761,495 evaluable…

**Figure 16.** Fidelity–suitability trade-off across ten models. Each bubble plots a model’s Fidelity…

**Figure 17.** Score distributions by provider across three dimensions. (a) Answer–Source Fidelity:…

Original abstract

Users of search-augmented LLMs rely on citations as evidence that responses are grounded in real sources, and rarely verify the cited pages themselves. Millions of queries per day now pass through these systems, making citation quality a silent determinant of whether users are informed or misled, yet existing benchmarks each address one facet in isolation, leaving the joint structure that determines citation trustworthiness unmeasured. We construct CITETRACE, a large-scale dataset that traces the full citation chain from user query through retrieved source to generated answer: 11,200 real-world queries from 28 communities paired with 112,000 responses from ten models across five providers, yielding 761,495 evaluable citation pairs. We design a three-dimension evaluation framework that scores each citation on intent-purpose alignment, source suitability, and answer-source fidelity, using expert-validated predefined matrices and a five-level fidelity rubric; the framework applies to any system that produces citation-bearing responses. Applying this framework at scale, we identify a systematic pattern we call VERIFIED MISGUIDANCE (VM): models cite real, accessible sources yet fail along one or more dimensions, producing a fidelity-suitability trade-off in which faithful models select inappropriate sources and vice versa. Across our pool, 30.6% of citations distort their sources and 27.1% originate from domain-inappropriate sources; at the response level, up to 96% of users encounter at least one structurally misleading citation. Provider-level differences explain 88-96% of citation-quality variance, suggesting that source selection is governed more by factors beyond individual model capability than by the LLMs themselves. Together, CITETRACE and its evaluation framework provide the first resource for diagnosing structural citation failures in deployed search-augmented systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives a new large-scale dataset and joint scoring framework for citation quality in search-augmented LLMs, but the headline rates rest on expert labels whose reliability is not reported.

The main takeaway is that this work supplies CITETRACE, a dataset built from 11,200 real queries across 28 communities and 761k citation pairs from ten models, plus a three-dimension rubric covering intent-purpose alignment, source suitability, and answer-source fidelity. That joint view is new; prior work looked at pieces separately.

What the paper does well is document the scale of the issue and the provider-level dominance. The claim that provider differences explain 88-96% of variance suggests the problem sits more in retrieval pipelines than in any single model. The response-level finding that up to 96% of outputs contain at least one structurally misleading citation is also worth attention if the labels hold.

The soft spot is exactly where the stress-test note points: the abstract describes expert-validated matrices and a five-level rubric but supplies no expert count, selection method, training details, or inter-rater agreement. Every downstream number (30.6% distortion, 27.1% domain-inappropriate) flows directly from those scores. If the raters share unstated priors on what counts as distortion or appropriate domain, the fidelity-suitability trade-off and the VM pattern could be partly artifactual. That gap is material, not minor, because the paper's central quantitative claims depend on it.

This is for researchers measuring trustworthiness in deployed RAG systems. A reader who wants fresh data and a reusable framework will find value here, even if they plan to re-annotate a subset themselves. It deserves a serious referee because the data construction is concrete and the practical question is live; the review would mainly press for the missing annotation protocol and any robustness checks on the labels.

Referee Report

1 major / 2 minor

Summary. The paper constructs the CITETRACE dataset from 11,200 real-world queries across 28 communities, generating 112,000 responses from ten models and 761,495 citation pairs. It proposes a three-dimension evaluation framework assessing intent-purpose alignment, source suitability, and answer-source fidelity via expert-validated matrices and a five-level rubric. The analysis reveals a 'Verified Misguidance' pattern, with 30.6% of citations distorting sources, 27.1% from inappropriate domains, up to 96% of responses having at least one misleading citation, and provider-level factors explaining 88-96% of quality variance.

Significance. If the framework's annotations prove reliable, the work is significant for quantifying citation trustworthiness issues in search-augmented LLMs at scale. It provides a new dataset and framework applicable to any citation-bearing system, highlighting a fidelity-suitability trade-off and the dominance of provider effects. The large empirical scale and construction of a traceable citation dataset are clear strengths that could advance benchmarking in this area.

major comments (1)

[Framework design and validation] The section describing the three-dimension evaluation framework states that it uses 'expert-validated predefined matrices and a five-level fidelity rubric' but provides no information on the number of experts, their selection or training, or inter-rater reliability statistics. This is load-bearing for the central claims, as the reported 30.6% distortion rate, 27.1% inappropriate source rate, and the VM pattern are direct outputs of applying this rubric to the 761k pairs; without agreement metrics, the possibility of annotator bias cannot be assessed.

minor comments (2)

[Results section] The explanation of how provider-level differences account for 88-96% of variance would benefit from an explicit reference to the statistical method or model used (e.g., ANOVA or regression details).
[Dataset construction] Clarification on how the 28 communities were selected and any controls for selection bias would improve reproducibility.
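The first minor comment asks which statistical method produces the 88-96% figure. One common candidate, assumed here and not confirmed by the source, is the between-group share of total variance (eta squared) computed over per-model mean scores grouped by provider. The function name and the toy numbers in the usage lines are made up for illustration.

```python
from statistics import mean


def provider_variance_share(scores_by_provider):
    """Between-provider sum of squares divided by total sum of squares.

    scores_by_provider maps a provider name to a list of per-model mean
    scores. Returns a value in [0, 1]; 1 means all variance is between providers.
    """
    all_scores = [s for scores in scores_by_provider.values() for s in scores]
    grand_mean = mean(all_scores)
    total_ss = sum((s - grand_mean) ** 2 for s in all_scores)
    if total_ss == 0:
        raise ValueError("scores have no variance to decompose")
    between_ss = sum(
        len(scores) * (mean(scores) - grand_mean) ** 2
        for scores in scores_by_provider.values()
    )
    return between_ss / total_ss


# Toy values only, not results from the paper: two models per provider.
toy = {"provider_a": [3.9, 4.1], "provider_b": [2.9, 3.1]}
print(provider_variance_share(toy))
```

If the authors used a different model, for example a mixed-effects regression with provider as a random effect, the reported share would carry a different interpretation, which is why the comment asks for the method to be named.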

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the single major comment below and agree that additional details on the annotation process are required to support the central claims. The revised manuscript will incorporate these details.

Referee: [Framework design and validation] The section describing the three-dimension evaluation framework states that it uses 'expert-validated predefined matrices and a five-level fidelity rubric' but provides no information on the number of experts, their selection or training, or inter-rater reliability statistics. This is load-bearing for the central claims, as the reported 30.6% distortion rate, 27.1% inappropriate source rate, and the VM pattern are direct outputs of applying this rubric to the 761k pairs; without agreement metrics, the possibility of annotator bias cannot be assessed.

Authors: We agree that the submitted manuscript omits these critical details on the expert validation process, which prevents readers from fully assessing reliability and potential bias. The annotation protocol used predefined matrices and the five-level rubric, but the manuscript does not report the number of experts, selection criteria, training, or inter-rater statistics. In the revision we will add a new subsection (likely 3.3.1) that reports: the number of experts and their backgrounds, recruitment and selection process, training and calibration procedure, and inter-rater reliability metrics computed on a held-out sample of citation pairs. We will also release the full annotation guidelines and a sample of annotated pairs as supplementary material. These additions directly address the concern without changing the reported aggregate statistics or conclusions.

Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical measurement study with independent data and scoring

The paper constructs CITETRACE (11,200 queries, 112,000 responses, 761k citation pairs) from external sources and applies a three-dimension rubric (intent-purpose alignment, source suitability, answer-source fidelity) with expert-validated matrices and a five-level scale. All headline statistics (30.6% distortion, 27.1% domain-inappropriate, 88-96% provider variance) are direct tallies from this application. No equations, fitted parameters, predictions, self-citations, or uniqueness theorems appear; the measurement chain does not reduce any output to a quantity defined by the inputs themselves. This is a standard empirical study whose central claims rest on the collected data and rubric application rather than any self-referential loop.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the validity of the newly constructed dataset and the expert-validated scoring framework rather than on mathematical derivations or new physical entities.

axioms (1)

Domain assumption: Expert-validated predefined matrices and a five-level fidelity rubric provide an unbiased measure of citation quality across the intent, suitability, and fidelity dimensions.
The framework is the load-bearing measurement tool; its correctness is assumed after expert validation is mentioned but not demonstrated in the abstract.

pith-pipeline@v0.9.1-grok · 5871 in / 1441 out tokens · 33805 ms · 2026-06-29T09:21:37.511355+00:00 · methodology

Reference graph

Works this paper leans on

64 extracted references · 21 canonical work pages · 2 internal anchors

[1] Pranjal Aggarwal, Vishvak Murahari, Tanmay Rajpurohit, Ashwin Kalyan, Karthik Narasimhan, and Ameet Deshpande. Geo: Generative engine optimization. In Proceedings of the 30th ACM SIGKDD conference on knowledge discovery and data mining, pages 5–16, 2024.

[2] George A. Akerlof. The market for “lemons”: Quality uncertainty and the market mechanism. The Quarterly Journal of Economics, 84(3):488–500, 1970.

[3] Mubashara Akhtar, Omar Benjelloun, Costanza Conforti, Luca Foschini, Pieter Gijsbers, Joan Giner-Miguelez, Sujata Goswami, Nitisha Jain, Michalis Karamousadakis, Satyapriya Krishna, Michael Kuchnik, Sylvain Lesage, Quentin Lhoest, Pierre Marcenac, Manil Maskey, Peter Mattson, Luis Oala, Hamidah Oderinwale, Pierre Ruyssen, Tim Santos, Rajat Shinde, Elena S... Croissant: A metadata format for ML-ready datasets, 2024.

[4] Kenneth J. Arrow. Uncertainty and the welfare economics of medical care. The American Economic Review, 53(5):941–973, 1963.

[5] Yejin Bang, Delong Chen, Nayeon Lee, and Pascale Fung. Measuring political bias in large language models: What is said and how it is said. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11142–11159, 2024.

[6] Adrien Barbaresi. Trafilatura: A web scraping library and command-line tool for text discovery and extraction. In Proceedings of the Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations, pages 122–131, 2021.

[7] Douglas Biber, Jesse Egbert, and Mark Davies. Exploring the composition of the searchable web: A corpus-based taxonomy of web registers. Corpora, 10(1):11–45, 2015.

[8] Sarah Blakeslee. The CRAAP test. LOEX Quarterly, 31(3):4, 2004.

[9] Bernd Bohnet, Vinh Q Tran, Pat Verga, Roee Aharoni, Daniel Andor, Livio Baldini Soares, Massimiliano Ciaramita, Jacob Eisenstein, Kuzman Ganchev, Jonathan Herzig, et al. Attributed question answering: Evaluation and modeling for attributed large language models. arXiv preprint arXiv:2212.08037, 2022.

[10] Valeriia Bolotova, Vladislav Blinov, Falk Scholer, W Bruce Croft, and Mark Sanderson. A non-factoid question-answering taxonomy. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1196–1207, 2022.

[11] Andrei Broder. A taxonomy of web search. In ACM SIGIR Forum, volume 36, pages 3–10, 2002.

[12] Jacob Cohen. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37–46, 1960.
[13] Yifan Ding, Matthew Facciani, Amrit Poudel, Ellen Joyce, Salvador Aguinaga, Balaji Veeramani, Sanmitra Bhattacharya, and Tim Weninger. Citations and trust in llm generated responses, 2025. URL https://arxiv.org/abs/2501.01303

[14] Shahul Es, Jithin James, Luis Espinosa Anke, and Steven Schockaert. RAGAS: Automated evaluation of retrieval augmented generation. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, pages 150–158, 2024.

[15] B. J. Fogg, Cathy Soohoo, David R. Danielson, Leslie Marable, Julianne Stanford, and Ellen R. Tauber. How do users evaluate the credibility of web sites? a study with over 2,500 participants. In Proceedings of the 2003 Conference on Designing for User Experiences (DUX), pages 1–15, 2003.

[16] Tianyu Gao, Howard Yen, Jiatong Yu, and Danqi Chen. Enabling large language models to generate text with citations. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6465–6488, 2023.

[17] Google. Search quality evaluator guidelines. https://guidelines.raterhub.com/searchqualityevaluatorguidelines.pdf, 2022.

[18] Sebastian Haan. Semanticcite: Citation verification with ai-powered full-text analysis and evidence-based reasoning. arXiv preprint arXiv:2511.16198, 2025.

[19] Doris Hoogeveen, Karin M. Verspoor, and Timothy Baldwin. CQADupStack: A benchmark data set for community question-answering research. In Proceedings of the 20th Australasian Document Computing Symposium, pages 3:1–3:8, 2015.

[20] Jeongyeon Hwang, Junyoung Park, Hyejin Park, Dongwoo Kim, Sangdon Park, and Jungseul Ok. Retrieval-augmented generation with estimation of source reliability. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 34279–34303. Association for Computational Linguistics, 2025. URL https://aclanthology.org/2025. emnlp...

[21] IAB Tech Lab. Content taxonomy 3.0. IAB Tech Lab Standard, 2022. URL https://iabtechlab.com/standards/content-taxonomy/

[22] Peter Ingwersen and Kalervo Järvelin. The Turn: Integration of Information Seeking and Retrieval in Context, volume 18 of The Information Retrieval Series. Springer, Dordrecht, 2005. ISBN 978-1-4020-3851-8. doi: 10.1007/1-4020-3851-8.

[23] Bernard J Jansen, Danielle L Booth, and Amanda Spink. Determining the informational, navigational, and transactional intent of web queries. Information Processing & Management, 44(3):1251–1266, 2008.

[24] Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12):1–38, 2023.

[25] Hexi Jin, Stephen Liu, Yuheng Li, Simran Malik, and Yiying Zhang. SourceBench: Can AI answers reference quality web sources? arXiv preprint arXiv:2602.16942, 2026.

[26] Michal Kakol, Radoslaw Nielek, and Adam Wierzbicki. Understanding and predicting web content credibility using the content credibility corpus. Information Processing & Management, 53(5):1043–1061, 2017.

[27] Hannah Rose Kirk, Bertie Vidgen, Paul Röttger, and Scott A. Hale. PRISM: A participatory, representative and individualised evaluation of language model alignment. In Advances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track, 2024. Best Paper Award.

[28] Terry K Koo and Mae Y Li. A guideline of selecting and reporting intraclass correlation coefficients for reliability research. Journal of chiropractic medicine, 15(2):155–163, 2016.
[29] Martijn Koster, Gary Illyes, Henner Zeller, and Lizzi Sassman. Rfc 9309 robots exclusion protocol. Internet Engineering Task Force, 2022.

[30] Klaus Krippendorff. Computing Krippendorff’s alpha-reliability. Technical Report 43, Annenberg School for Communication, University of Pennsylvania, 2011. URL https://repository.upenn.edu/asc_papers/43

[31] Wojciech Kryscinski, Bryan McCann, Caiming Xiong, and Richard Socher. Evaluating the factual consistency of abstractive text summarization. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 9332–9346, 2020. doi: 10.18653/v1/2020.emnlp-main.750. URL https://aclanthology.org/2020.emnlp-main.750/

[32] Philippe Laban, Tobias Schnabel, Paul N. Bennett, and Marti A. Hearst. SummaC: Re-visiting NLI-based models for inconsistency detection in summarization. Transactions of the Association for Computational Linguistics, 10:163–177, 2022. doi: 10.1162/tacl_a_00453.

[33] J Richard Landis and Gary G Koch. The measurement of observer agreement for categorical data. biometrics, pages 159–174, 1977.

[34] Alice Li and Luanne Sinnamon. Generative ai search engines as arbiters of public knowledge: An audit of bias and authority. Proceedings of the Association for Information Science and Technology, 61(1):205–217, 2024.

[35] Nelson F. Liu, Tianyi Zhang, and Percy Liang. Evaluating verifiability in generative search engines. Findings of the Association for Computational Linguistics (EMNLP), pages 7001–7025, 2023.

[36] Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan McDonald. On faithfulness and factuality in abstractive summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1906–1919, 2020. doi: 10.18653/v1/2020.acl-main.173. URL https://aclanthology.org/2020.acl-main.173/

[37] Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. FActScore: Fine-grained atomic evaluation of factual precision in long form text generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12076–12100, 2023. doi: 10.18653 /v1/2...

[38] Mihran Miroyan, Tsung-Han Wu, Logan King, Tianle Li, Jiayi Pan, Xinyan Hu, Wei-Lin Chiang, Anastasios N. Angelopoulos, Trevor Darrell, Narges Norouzi, and Joseph E. Gonzalez. Search arena: Analyzing search-augmented LLMs. In Proceedings of the International Conference on Learning Representations (ICLR), 2026. arXiv:2506.05334.

[39] Pranav Narayanan Venkit, Philippe Laban, Yilun Zhou, Yixin Mao, and Chien-Sheng Wu. Search engines in the AI era: A qualitative understanding to the false promise of factual and verifiable source-cited responses in LLM-based search. In Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency (FAccT), pages 1325–1340, 2025. doi: ...

[40] Josh M. Nicholson, Milo Mordaunt, Patrice Lopez, Ashish Uppala, Domenic Rosber, Sean P. Bber, Christoph Thaler, Yuhao Deng, Casey S. Greene, and Satoshi Nishi. Scite: A smart citation index that displays the context of citations and classifies their intent using deep learning. Quantitative Science Studies, 2(3):882–898, 2021.

[41] Artidoro Pagnoni, Vidhisha Balachandran, and Yulia Tsvetkov. Understanding factuality in abstractive summarization with FRANK: A benchmark for factuality metrics. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4812–4829, 2021. doi: 10.18653/v1/2021.n a...

[42] Ori Press, Andreas Hochlehnert, Ameya Prabhu, Vishaal Udandarao, Ofir Press, and Matthias Bethge. Citeme: Can language models accurately cite scientific claims? Advances in Neural Information Processing Systems, 37:7847–7877, 2024.
[43]

Detecting and Correcting Reference Hallucinations in Commercial LLMs and Deep Research Agents

Delip Rao, Eric Wong, and Chris Callison-Burch. Detecting and correcting reference hal- lucinations in commercial llms and deep research agents.arXiv preprint arXiv:2604.03173, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[44]

Hannah Rashkin, Vitaly Nikolaev, Matthew Lamm, Lora Aroyo, Michael Collins, Dipanjan Das, Slav Petrov, Gaurav Singh Tomar, Iulia Turc, and David Reitter. Measuring attribution in natural language generation models. Computational Linguistics, 49(4):777–840, December 2023. doi: 10.1162/coli_a_00486. URL https://aclanthology.org/2023.cl-4.2/
[46]

Soo Young Rieh and David R. Danielson. Credibility: A multidisciplinary framework. Annual Review of Information Science and Technology, 41(1):307–364, 2007. doi: 10.1002/aris.2007.1440410114
[47]

Daniel E Rose and Danny Levinson. Understanding user goals in web search. In Proceedings of the 13th international conference on World Wide Web, pages 13–19, 2004.
[48]

Jon Saad-Falcon, Omar Khattab, Christopher Potts, and Matei Zaharia. ARES: An automated evaluation framework for retrieval-augmented generation systems. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 338–354, 2024.
[49]

Yash Saxena, Raviteja Bommireddy, Ankur Padia, and Manas Gaur. Generation-time vs. post-hoc citation: A holistic evaluation of LLM attribution. arXiv preprint arXiv:2509.21557, 2025.
[50]

Schema.org Community Group. Schema.org: A shared vocabulary for structured data. https://schema.org, 2011. Founded by Google, Microsoft, Yahoo, and Yandex.
[51]

Serge Sharoff. Functional text dimensions for the annotation of web corpora. Corpora, 13(1):65–95, 2018.
[52]

Yalin Sun, Yan Zhang, Jacek Gwizdka, and Ciaran B. Trace. Consumer evaluation of the quality of online health information: Systematic literature review of relevant criteria and indicators. Journal of Medical Internet Research, 21(5):e12522, 2019. doi: 10.2196/12522
[53]

John M. Swales. Genre Analysis: English in Academic and Research Settings. Cambridge University Press, Cambridge, 1990.
[54]

Liyan Tang, Philippe Laban, and Greg Durrett. MiniCheck: Efficient fact-checking of LLMs on grounding documents. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024.
[55]

Minaoar Hossain Tanzil, Shaiful Chowdhury, Somayeh Modaberi, Gias Uddin, and Hadi Hemmati. A systematic mapping study of crowd knowledge enhanced software engineering research using Stack Overflow. Journal of Systems and Software, 226:112405, 2025. doi: 10.1016/j.jss.2025.112405
[56]

Robert S. Taylor. Question-negotiation and information seeking in libraries. College & Research Libraries, 29(3):178–194, 1968. doi: 10.5860/crl_29_03_178
[57]

Ivan Vykopal, Matúš Pikuliak, Simon Ostermann, and Marian Simko. Assessing web search credibility and response groundedness in chat assistants. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2539–2560, 2026.
[58]

Jonas Wallat, Maria Heuss, Maarten de Rijke, and Avishek Anand. Correctness is not faithfulness in retrieval augmented generation attributions. In Proceedings of the 2025 International ACM SIGIR Conference on Innovative Concepts and Theories in Information Retrieval (ICTIR), pages 22–32, 2025.
[59]

Alex Wang, Kyunghyun Cho, and Mike Lewis. Asking and answering questions to evaluate the factual consistency of summaries. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5008–5020, 2020. doi: 10.18653/v1/2020.acl-main.450. URL https://aclanthology.org/2020.acl-main.450/
[60]

Kevin Wu, Eric Wu, Kevin Wei, Angela Zhang, Allison Casasola, Teresa Nguyen, Sith Riantawan, Patricia Shi, Daniel Ho, and James Zou. An automated framework for assessing how well LLMs cite relevant medical references. Nature Communications, 16(1):3615, 2025. doi: 10.1038/s41467-025-58551-6
[61]

Yilong Xu, Jinhua Niu, and Guoxin Xie. ALiiCE: Evaluating positional fine-grained citation generation. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024.
[62]

Zuyao Xu, Yuqi Qiu, Lu Sun, FaSheng Miao, Fubin Wu, Xinyi Wang, Xiang Li, Haozhe Lu, ZhengZe Zhang, Yuxin Hu, et al. GhostCite: A large-scale analysis of citation validity in the age of large language models. arXiv preprint arXiv:2602.06718, 2026.
[63]

Peixian Zhang, Qiming Ye, Zifan Peng, Kiran Garimella, and Gareth Tyson. Source coverage and citation bias in LLM-based vs. traditional search engines. arXiv preprint arXiv:2512.09483, 2025.
