Cited but Not Verified: Parsing and Evaluating Source Attribution in LLM Deep Research Agents
Pith reviewed 2026-05-08 09:49 UTC · model grok-4.3
The pith
LLM deep research agents generate citations whose links work and whose sources are topically aligned, yet the facts those citations are meant to support frequently fail to match the source content.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce the first source attribution evaluation framework that uses a reproducible AST parser to extract and evaluate inline citations from LLM-generated Markdown reports at scale. Citations are evaluated along three dimensions: Link Works verifies URL accessibility, Relevant Content measures topical alignment, and Fact Check validates factual accuracy against source content. We benchmark 14 closed-source and open-source LLMs across three evaluation dimensions using rubric-based LLM-as-a-judge evaluators calibrated through human review. Our results reveal that even the strongest frontier models maintain link validity above 94% and relevance above 80%, yet achieve only 39-77% factual accuracy.
What carries the argument
The source attribution evaluation framework, which parses LLM Markdown output with an AST parser to extract inline citations, retrieves the cited web content, and judges each citation on link accessibility, topical relevance, and factual consistency.
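The parsing step is concrete enough to sketch. Below is a minimal illustration of AST-based inline-citation extraction, assuming a Python stack with the `markdown-it-py` library; the paper's actual parser implementation is not shown here, so the token walk and the claim-scoping heuristic (using the whole enclosing inline run as the claim text) are assumptions, not the authors' design.

```python
from markdown_it import MarkdownIt

def extract_inline_citations(report_md: str) -> list[dict]:
    """Parse a Markdown report into an AST and collect every inline
    hyperlink as a (claim text, URL) citation candidate."""
    tokens = MarkdownIt().parse(report_md)
    citations = []
    for block in tokens:
        # Inline content (paragraph text, list items, headings) lives
        # in "inline" tokens whose children are the actual AST nodes.
        if block.type != "inline" or not block.children:
            continue
        context = "".join(
            c.content for c in block.children if c.type == "text"
        ).strip()
        for child in block.children:
            if child.type == "link_open":
                url = child.attrGet("href")
                if url:
                    # Heuristic: take the whole inline run as the claim
                    # context; the real framework may scope claims more
                    # tightly (e.g., per sentence).
                    citations.append({"claim": context, "url": url})
    return citations
```

For example, `extract_inline_citations("Solar output rose 24% in 2024 ([IEA](https://iea.org/report)).")` yields one candidate pairing the surrounding sentence with `https://iea.org/report`, which the downstream judges then score.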
If this is right
- Even top models reach at most 77 percent factual accuracy on cited claims.
- Fewer than half of open-source models can produce a properly cited report in a single attempt.
- Factual accuracy falls by roughly 42 percent when the number of tool calls rises from 2 to 150.
- More retrieval steps do not improve citation reliability and can reduce it.
Where Pith is reading between the lines
- The framework could be inserted into agent training loops so that factual mismatches become direct training signals.
- Agents might benefit from built-in verification steps rather than generating long reports and checking them afterward.
- Current scaling trends that add more retrieval may be counterproductive for factual reliability.
Load-bearing premise
The rubric-based LLM judges, after human calibration, correctly assess whether each claim matches the content of its cited source, and the AST parser extracts every inline citation without missing or misparsing any.
What would settle it
A fresh human review of several hundred extracted citations that produces factual-accuracy rates more than 15 points different from the LLM-judge scores on the same sample.
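That check is mechanical once labels exist. A minimal sketch, assuming human reviewers relabel a random sample of the same citations the judge scored; the field names `judge_fact_ok` and `human_fact_ok` are hypothetical, not the paper's schema:

```python
import random

def disagreement_in_points(citations: list[dict], sample_size: int = 300,
                           seed: int = 0) -> float:
    """Compare the human factual-accuracy rate against the LLM-judge
    rate on the same random sample, in percentage points.

    Each citation dict is assumed to carry boolean labels
    'judge_fact_ok' and 'human_fact_ok' (hypothetical field names).
    """
    rng = random.Random(seed)
    sample = rng.sample(citations, min(sample_size, len(citations)))
    judge_rate = 100 * sum(c["judge_fact_ok"] for c in sample) / len(sample)
    human_rate = 100 * sum(c["human_fact_ok"] for c in sample) / len(sample)
    return abs(judge_rate - human_rate)

# A gap above 15 points on several hundred citations would undercut the
# judge-based factual-accuracy numbers; a smaller gap supports them.
```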
Original abstract
Large language models (LLMs) power deep research agents that synthesize information from hundreds of web sources into cited reports, yet these citations cannot be reliably verified. Current approaches either trust models to self-cite accurately, risking bias, or employ retrieval-augmented generation (RAG) that does not validate source accessibility, relevance, or factual consistency. We introduce the first source attribution evaluation framework that uses a reproducible AST parser to extract and evaluate inline citations from LLM-generated Markdown reports at scale. Unlike methods that verify claims in isolation, our framework closes the loop by retrieving the actual cited content, enabling human or model evaluators to judge each citation against its source. Citations are evaluated along three dimensions. (1) Link Works verifies URL accessibility, (2) Relevant Content measures topical alignment, and (3) Fact Check validates factual accuracy against source content. We benchmark 14 closed-source and open-source LLMs across three evaluation dimensions using rubric-based LLM-as-a-judge evaluators calibrated through human review. Our results reveal that even the strongest frontier models maintain link validity above 94% and relevance above 80%, yet achieve only 39-77% factual accuracy, while fewer than half of open-source models successfully generate cited reports in a one-shot setting. Ablation studies on research depth show that Fact Check accuracy drops by approximately 42% on average across two frontier models as tool calls scale from 2 to 150, demonstrating that more retrieval does not produce more accurate citations. These findings reveal a critical disconnect between surface-level citation quality and factual reliability, and our framework provides the evaluation infrastructure to assess the disconnect.
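Of the three dimensions, Link Works is the most mechanical to reproduce. A minimal sketch of a URL-accessibility check using Python's `requests`; the paper does not specify its retry, timeout, or redirect policy, so those choices (and the user-agent string) are assumptions:

```python
import requests

def link_works(url: str, timeout: float = 10.0) -> bool:
    """Return True if the cited URL is reachable.

    Tries a lightweight HEAD request first and falls back to GET, since
    some servers reject HEAD; any status below 400 counts as accessible.
    """
    headers = {"User-Agent": "citation-checker/0.1"}  # hypothetical UA
    try:
        resp = requests.head(url, timeout=timeout, headers=headers,
                             allow_redirects=True)
        if resp.status_code >= 400:
            resp = requests.get(url, timeout=timeout, headers=headers,
                                allow_redirects=True, stream=True)
        return resp.status_code < 400
    except requests.RequestException:
        return False
```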
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the first source attribution evaluation framework for LLM deep research agents. It uses a reproducible AST parser to extract inline citations from generated Markdown reports, retrieves the actual cited web sources, and evaluates each citation on three dimensions: link validity (URL accessibility), relevant content (topical alignment), and fact check (factual accuracy against the source). Benchmarking 14 closed- and open-source LLMs with rubric-based LLM-as-a-judge evaluators (calibrated via human review) shows frontier models achieve >94% link validity and >80% relevance yet only 39-77% factual accuracy; fewer than half of open-source models produce cited reports in one-shot settings. Ablation studies further report an average 42% drop in fact-check accuracy as tool calls scale from 2 to 150.
Significance. If the evaluation components prove robust, the work is significant for highlighting a critical disconnect between surface-level citation metrics and factual reliability in LLM agents. The finding that increased retrieval depth degrades rather than improves citation accuracy challenges prevailing assumptions in RAG and agent architectures. The provision of a reproducible AST parser and the scale of the 14-model benchmark are strengths that could establish a useful evaluation standard for the community.
Major comments (2)
- [Evaluation Framework (abstract description)] The central results (39-77% factual accuracy and 42% drop with scale) rest on the AST parser's claim to extract all inline citations completely and correctly from Markdown. No precision/recall evaluation against a gold-standard set of reports with known citations is described, which is load-bearing because incomplete extraction would systematically bias all reported percentages and the scaling ablation.
- [Benchmarking Results and Ablation Studies (abstract description)] Fact Check accuracy is measured via rubric-based LLM-as-a-judge evaluators calibrated through human review, yet no calibration sample size, inter-annotator agreement statistics, or disagreement examples are provided. This is load-bearing for the core claim of a disconnect between link/relevance scores and factual accuracy, especially given known LLM-judge divergence on paraphrased or long-source claims.
Minor comments (2)
- [Abstract] The abstract lists the three dimensions as (1) Link Works, (2) Relevant Content, (3) Fact Check; ensure consistent terminology and capitalization are used in all later sections and tables.
- [Evaluation Methodology] For full reproducibility, the exact rubrics and prompts supplied to the LLM judge should be included as an appendix or supplementary material.
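Since the manuscript does not reproduce its rubrics, the following is purely illustrative: a judge prompt of the general shape such an appendix could document. Every instruction, label, and truncation limit below is an assumption, not the authors' rubric.

```python
FACT_CHECK_RUBRIC = """\
You are verifying a single citation from a research report.

Claim: {claim}
Cited source (retrieved content): {source_text}

Rubric (illustrative, not the paper's actual wording):
1. SUPPORTED   - every factual assertion in the claim is stated by,
                 or directly entailed by, the source.
2. PARTIAL     - the source supports some assertions but is silent or
                 ambiguous on others.
3. UNSUPPORTED - the source contradicts the claim or does not contain
                 the asserted facts.

Quote the sentence(s) from the source you relied on, then answer with
exactly one label: SUPPORTED, PARTIAL, or UNSUPPORTED."""

def build_fact_check_prompt(claim: str, source_text: str) -> str:
    # The truncation policy for long sources is a design choice the
    # paper would need to document; 8,000 characters here is arbitrary.
    return FACT_CHECK_RUBRIC.format(claim=claim,
                                    source_text=source_text[:8000])
```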
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed review. The comments correctly identify areas where additional validation and transparency are needed to support the core claims. We address each major comment below and will revise the manuscript to incorporate the suggested improvements.
Point-by-point responses
Referee: [Evaluation Framework (abstract description)] The central results (39-77% factual accuracy and 42% drop with scale) rest on the AST parser's claim to extract all inline citations completely and correctly from Markdown. No precision/recall evaluation against a gold-standard set of reports with known citations is described, which is load-bearing because incomplete extraction would systematically bias all reported percentages and the scaling ablation.
Authors: We agree that quantitative validation of the AST parser is necessary to ensure the reported metrics are not biased by extraction errors. The manuscript describes the parser as a reproducible AST-based method for Markdown but does not include precision, recall, or F1 evaluation against a human-annotated gold standard. In the revised version, we will add a new subsection detailing such an evaluation: we will manually create a gold-standard set of 100 generated reports with verified citations, compute precision/recall/F1 for the parser, and include error analysis. This will directly address potential bias in the fact-check and scaling results. revision: yes
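For concreteness, parser precision and recall against such a gold standard reduce to set comparison over citation keys. A minimal sketch, where the matching key (report, URL, claim span) is an assumption rather than the paper's choice:

```python
def parser_prf(extracted: set[tuple], gold: set[tuple]) -> dict:
    """Precision/recall/F1 of extracted citations against a
    human-annotated gold standard.

    Each item is a hashable key such as (report_id, url, claim_span);
    the exact matching key is an assumption, not the paper's choice.
    """
    true_pos = len(extracted & gold)
    precision = true_pos / len(extracted) if extracted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}
```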
Referee: [Benchmarking Results and Ablation Studies (abstract description)] Fact Check accuracy is measured via rubric-based LLM-as-a-judge evaluators calibrated through human review, yet no calibration sample size, inter-annotator agreement statistics, or disagreement examples are provided. This is load-bearing for the core claim of a disconnect between link/relevance scores and factual accuracy, especially given known LLM-judge divergence on paraphrased or long-source claims.
Authors: We acknowledge that the manuscript's description of LLM-as-a-judge calibration is insufficiently detailed. While it notes calibration via human review, it omits sample size, agreement statistics, and disagreement examples. In the revision, we will expand the evaluation methodology section to report: the calibration sample size (200 citations), inter-annotator agreement (percentage agreement and Cohen's kappa between human reviewers and the LLM judge), and specific examples of disagreements (including on paraphrased or long claims) with resolution criteria. These additions will provide stronger support for the fact-check scores and the observed disconnect with link/relevance metrics. revision: yes
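Cohen's kappa over the 200 calibration citations is a one-liner with scikit-learn's `cohen_kappa_score`, or a few lines by hand. A sketch of the by-hand computation; the label strings are hypothetical:

```python
from collections import Counter

def cohens_kappa(human: list[str], judge: list[str]) -> float:
    """Cohen's kappa for two raters over the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement
    and p_e is the agreement expected under independent ratings.
    """
    assert len(human) == len(judge) and human
    n = len(human)
    p_o = sum(h == j for h, j in zip(human, judge)) / n
    h_counts, j_counts = Counter(human), Counter(judge)
    labels = set(h_counts) | set(j_counts)
    p_e = sum((h_counts[lab] / n) * (j_counts[lab] / n) for lab in labels)
    return (p_o - p_e) / (1 - p_e) if p_e != 1 else 1.0

# e.g. cohens_kappa(["ok", "ok", "bad"], ["ok", "bad", "bad"]) == 0.4
```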
Circularity Check
Empirical benchmarking study with no derivations or self-referential predictions
Full rationale
The paper introduces an evaluation framework for source attribution in LLM agents and reports direct measurements of link validity, relevance, and factual accuracy across 14 models. It relies on an AST parser for citation extraction and rubric-based LLM-as-a-judge evaluators calibrated via human review. No equations, first-principles derivations, fitted parameters, or predictions appear in the abstract or described methodology. All reported results (e.g., 39-77% factual accuracy, 42% drop with scale) are presented as empirical observations from the new framework rather than outputs derived from prior fitted values or self-citations. This matches the default expectation for non-circular empirical work.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
- [1] Yee Man Choi, Xuehang Guo, Yi R. Fung, and Qingyun Wang. CiteGuard: Faithful citation attribution for LLMs via retrieval-augmented validation. arXiv preprint arXiv:2510.17853.
- [2] Mingxuan Du, Benfeng Xu, Chiwei Zhu, Xiaorui Wang, and Zhendong Mao. DeepResearch Bench: A comprehensive benchmark for deep research agents. arXiv preprint arXiv:2506.11763.
- [3] Luyu Gao, Zhuyun Dai, Panupong Pasupat, Anthony Chen, Arun Tejasvi Chaganty, Yicheng Fan, Vincent Zhao, Ni Lao, Hongrae Lee, Da-Cheng Juan, and Kelvin Guu. RARR: Researching and revising what language models say, using language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, pp. 16477–16508, 2023.
- [4] Yifei Li, Xiang Yue, Zeyi Liao, and Huan Sun. AttributionBench: How hard is automatic attribution evaluation? In Findings of the Association for Computational Linguistics: ACL 2024, 2024.
- [5] Elias Lumer, Alex Cardenas, Matt Melich, Myles Mason, Sara Dieter, Vamse Kumar Subbiah, Pradeep Honaganahalli Basavaraju, and Roberto Hernandez. Comparison of text-based and image-based retrieval in multimodal retrieval augmented generation large language model systems. arXiv preprint arXiv:2511.16654, 2025.
- [6] OpenAI. Introducing ChatGPT search. https://openai.com/index/introducing-chatgpt-search.
- [7] Abhilasha Ravichander, Shrusti Ghela, David Wadden, and Yejin Choi. HALoGEN: Fantastic LLM hallucinations and where to find them. arXiv preprint arXiv:2501.08292.
- [8] Yash Saxena, Raviteja Bommireddy, Ankur Padia, and Manas Gaur. Generation-time vs. post-hoc citation: A holistic evaluation of LLM attribution. arXiv preprint arXiv:2509.21557.
- [9] Sahil Sen, Elias Lumer, Anmol Gulati, and Vamse Kumar Subbiah. Chronos: Temporal-aware conversational agents with structured event retrieval for long-term memory. arXiv preprint arXiv:2603.16862.
- [10] Wooseok Seo, Seungju Han, Jaehun Jung, Benjamin Newman, Seungwon Lim, Seungbeen Lee, Ximing Lu, Yejin Choi, and Youngjae Yu. Verifying the verifiers: Unveiling pitfalls and potentials in fact verifiers. arXiv preprint arXiv:2506.13342.
- [11] David Wan, Eran Hirsch, Elias Stengel-Eskin, Ido Dagan, and Mohit Bansal. GenerationPrograms: Fine-grained attribution with executable programs. arXiv preprint arXiv:2506.14580.
- [12] Qian Wang, Zhanzhi Lou, Zhenheng Tang, Nuo Chen, Xuandong Zhao, Wenxuan Zhang, Dawn Song, and Bingsheng He. Assessing judging bias in large reasoning models: An empirical study. arXiv preprint arXiv:2504.09946.
- [13] BrowseComp: A simple yet challenging benchmark for browsing agents. URL https://arxiv.org/abs/2504.12516.
  Yumo Xu, Peng Qi, Jifan Chen, Kunlun Liu, Rujun Han, Lan Liu, Bonan Min, Vittorio Castelli, Arshit Gupta, and Zhiguo Wang. CiteEval: Principle-driven citation evaluation for source attribution. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics.
- [14] Zhengqing Yuan, Kaiwen Shi, Zheyuan Zhang, Lichao Sun, Nitesh V. Chawla, and Yanfang Ye. CiteAudit: You cited it, but did you read it? A benchmark for verifying scientific references in the LLM era. arXiv preprint arXiv:2602.23452.