Cited but Not Verified: Parsing and Evaluating Source Attribution in LLM Deep Research Agents
Pith reviewed 2026-05-08 09:49 UTC · model grok-4.3
The pith
LLM deep research agents generate citations whose links work and whose sources are topically aligned, yet the facts those citations are meant to support frequently fail to match the source content.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce the first source attribution evaluation framework that uses a reproducible AST parser to extract and evaluate inline citations from LLM-generated Markdown reports at scale. Citations are evaluated along three dimensions: Link Works verifies URL accessibility, Relevant Content measures topical alignment, and Fact Check validates factual accuracy against source content. We benchmark 14 closed-source and open-source LLMs across three evaluation dimensions using rubric-based LLM-as-a-judge evaluators calibrated through human review. Our results reveal that even the strongest frontier models maintain link validity above 94% and relevance above 80%, yet achieve only 39-77% factual accuracy.
What carries the argument
The source attribution evaluation framework, which parses LLM Markdown output with an AST parser to extract inline citations, retrieves the cited web content, and judges each citation on link accessibility, topical relevance, and factual consistency.
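The parsing step is concrete enough to sketch. Below is a minimal illustration of AST-based inline-citation extraction, assuming a Python stack with the `markdown-it-py` library; the paper's actual parser implementation is not shown here, so the token walk and the claim-scoping heuristic (using the whole enclosing inline run as the claim text) are assumptions, not the authors' design.

```python
from markdown_it import MarkdownIt

def extract_inline_citations(report_md: str) -> list[dict]:
    """Parse a Markdown report into an AST and collect every inline
    hyperlink as a (claim text, URL) citation candidate."""
    tokens = MarkdownIt().parse(report_md)
    citations = []
    for block in tokens:
        # Inline content (paragraph text, list items, headings) lives
        # in "inline" tokens whose children are the actual AST nodes.
        if block.type != "inline" or not block.children:
            continue
        context = "".join(
            c.content for c in block.children if c.type == "text"
        ).strip()
        for child in block.children:
            if child.type == "link_open":
                url = child.attrGet("href")
                if url:
                    # Heuristic: take the whole inline run as the claim
                    # context; the real framework may scope claims more
                    # tightly (e.g., per sentence).
                    citations.append({"claim": context, "url": url})
    return citations
```

For example, `extract_inline_citations("Solar output rose 24% in 2024 ([IEA](https://iea.org/report)).")` yields one candidate pairing the surrounding sentence with `https://iea.org/report`, which the downstream judges then score.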
If this is right
- Even top models reach at most 77 percent factual accuracy on cited claims.
- Fewer than half of open-source models can produce a properly cited report in a single attempt.
- Factual accuracy falls by roughly 42 percent when the number of tool calls rises from 2 to 150.
- More retrieval steps do not improve citation reliability and can reduce it.
Where Pith is reading between the lines
- The framework could be inserted into agent training loops so that factual mismatches become direct training signals.
- Agents might benefit from built-in verification steps rather than generating long reports and checking them afterward.
- Current scaling trends that add more retrieval may be counterproductive for factual reliability.
Load-bearing premise
The rubric-based LLM judges, after human calibration, correctly assess whether each claim matches the content of its cited source, and the AST parser extracts every inline citation without missing or misparsing any.
What would settle it
A fresh human review of several hundred extracted citations that produces factual-accuracy rates more than 15 points different from the LLM-judge scores on the same sample.
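That check is mechanical once labels exist. A minimal sketch, assuming human reviewers relabel a random sample of the same citations the judge scored; the field names `judge_fact_ok` and `human_fact_ok` are hypothetical, not the paper's schema:

```python
import random

def disagreement_in_points(citations: list[dict], sample_size: int = 300,
                           seed: int = 0) -> float:
    """Compare the human factual-accuracy rate against the LLM-judge
    rate on the same random sample, in percentage points.

    Each citation dict is assumed to carry boolean labels
    'judge_fact_ok' and 'human_fact_ok' (hypothetical field names).
    """
    rng = random.Random(seed)
    sample = rng.sample(citations, min(sample_size, len(citations)))
    judge_rate = 100 * sum(c["judge_fact_ok"] for c in sample) / len(sample)
    human_rate = 100 * sum(c["human_fact_ok"] for c in sample) / len(sample)
    return abs(judge_rate - human_rate)

# A gap above 15 points on several hundred citations would undercut the
# judge-based factual-accuracy numbers; a smaller gap supports them.
```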
Original abstract
Large language models (LLMs) power deep research agents that synthesize information from hundreds of web sources into cited reports, yet these citations cannot be reliably verified. Current approaches either trust models to self-cite accurately, risking bias, or employ retrieval-augmented generation (RAG) that does not validate source accessibility, relevance, or factual consistency. We introduce the first source attribution evaluation framework that uses a reproducible AST parser to extract and evaluate inline citations from LLM-generated Markdown reports at scale. Unlike methods that verify claims in isolation, our framework closes the loop by retrieving the actual cited content, enabling human or model evaluators to judge each citation against its source. Citations are evaluated along three dimensions. (1) Link Works verifies URL accessibility, (2) Relevant Content measures topical alignment, and (3) Fact Check validates factual accuracy against source content. We benchmark 14 closed-source and open-source LLMs across three evaluation dimensions using rubric-based LLM-as-a-judge evaluators calibrated through human review. Our results reveal that even the strongest frontier models maintain link validity above 94% and relevance above 80%, yet achieve only 39-77% factual accuracy, while fewer than half of open-source models successfully generate cited reports in a one-shot setting. Ablation studies on research depth show that Fact Check accuracy drops by approximately 42% on average across two frontier models as tool calls scale from 2 to 150, demonstrating that more retrieval does not produce more accurate citations. These findings reveal a critical disconnect between surface-level citation quality and factual reliability, and our framework provides the evaluation infrastructure to assess the disconnect.
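Of the three dimensions, Link Works is the most mechanical to reproduce. A minimal sketch of a URL-accessibility check using Python's `requests`; the paper does not specify its retry, timeout, or redirect policy, so those choices (and the user-agent string) are assumptions:

```python
import requests

def link_works(url: str, timeout: float = 10.0) -> bool:
    """Return True if the cited URL is reachable.

    Tries a lightweight HEAD request first and falls back to GET, since
    some servers reject HEAD; any status below 400 counts as accessible.
    """
    headers = {"User-Agent": "citation-checker/0.1"}  # hypothetical UA
    try:
        resp = requests.head(url, timeout=timeout, headers=headers,
                             allow_redirects=True)
        if resp.status_code >= 400:
            resp = requests.get(url, timeout=timeout, headers=headers,
                                allow_redirects=True, stream=True)
        return resp.status_code < 400
    except requests.RequestException:
        return False
```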
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the first source attribution evaluation framework for LLM deep research agents. It uses a reproducible AST parser to extract inline citations from generated Markdown reports, retrieves the actual cited web sources, and evaluates each citation on three dimensions: link validity (URL accessibility), relevant content (topical alignment), and fact check (factual accuracy against the source). Benchmarking 14 closed- and open-source LLMs with rubric-based LLM-as-a-judge evaluators (calibrated via human review) shows frontier models achieve >94% link validity and >80% relevance yet only 39-77% factual accuracy; fewer than half of open-source models produce cited reports in one-shot settings. Ablation studies further report an average 42% drop in fact-check accuracy as tool calls scale from 2 to 150.
Significance. If the evaluation components prove robust, the work is significant for highlighting a critical disconnect between surface-level citation metrics and factual reliability in LLM agents. The finding that increased retrieval depth degrades rather than improves citation accuracy challenges prevailing assumptions in RAG and agent architectures. The provision of a reproducible AST parser and the scale of the 14-model benchmark are strengths that could establish a useful evaluation standard for the community.
Major comments (2)
- [Evaluation Framework (abstract description)] The central results (39-77% factual accuracy and 42% drop with scale) rest on the AST parser's claim to extract all inline citations completely and correctly from Markdown. No precision/recall evaluation against a gold-standard set of reports with known citations is described, which is load-bearing because incomplete extraction would systematically bias all reported percentages and the scaling ablation.
- [Benchmarking Results and Ablation Studies (abstract description)] Fact Check accuracy is measured via rubric-based LLM-as-a-judge evaluators calibrated through human review, yet no calibration sample size, inter-annotator agreement statistics, or disagreement examples are provided. This is load-bearing for the core claim of a disconnect between link/relevance scores and factual accuracy, especially given known LLM-judge divergence on paraphrased or long-source claims.
Minor comments (2)
- [Abstract] The abstract lists the three dimensions as (1) Link Works, (2) Relevant Content, (3) Fact Check; ensure consistent terminology and capitalization are used in all later sections and tables.
- [Evaluation Methodology] For full reproducibility, the exact rubrics and prompts supplied to the LLM judge should be included as an appendix or supplementary material.
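Since the manuscript does not reproduce its rubrics, the following is purely illustrative: a judge prompt of the general shape such an appendix could document. Every instruction, label, and truncation limit below is an assumption, not the authors' rubric.

```python
FACT_CHECK_RUBRIC = """\
You are verifying a single citation from a research report.

Claim: {claim}
Cited source (retrieved content): {source_text}

Rubric (illustrative, not the paper's actual wording):
1. SUPPORTED   - every factual assertion in the claim is stated by,
                 or directly entailed by, the source.
2. PARTIAL     - the source supports some assertions but is silent or
                 ambiguous on others.
3. UNSUPPORTED - the source contradicts the claim or does not contain
                 the asserted facts.

Quote the sentence(s) from the source you relied on, then answer with
exactly one label: SUPPORTED, PARTIAL, or UNSUPPORTED."""

def build_fact_check_prompt(claim: str, source_text: str) -> str:
    # The truncation policy for long sources is a design choice the
    # paper would need to document; 8,000 characters here is arbitrary.
    return FACT_CHECK_RUBRIC.format(claim=claim,
                                    source_text=source_text[:8000])
```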
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed review. The comments correctly identify areas where additional validation and transparency are needed to support the core claims. We address each major comment below and will revise the manuscript to incorporate the suggested improvements.
Point-by-point responses
Referee: [Evaluation Framework (abstract description)] The central results (39-77% factual accuracy and 42% drop with scale) rest on the AST parser's claim to extract all inline citations completely and correctly from Markdown. No precision/recall evaluation against a gold-standard set of reports with known citations is described, which is load-bearing because incomplete extraction would systematically bias all reported percentages and the scaling ablation.
Authors: We agree that quantitative validation of the AST parser is necessary to ensure the reported metrics are not biased by extraction errors. The manuscript describes the parser as a reproducible AST-based method for Markdown but does not include precision, recall, or F1 evaluation against a human-annotated gold standard. In the revised version, we will add a new subsection detailing such an evaluation: we will manually create a gold-standard set of 100 generated reports with verified citations, compute precision/recall/F1 for the parser, and include error analysis. This will directly address potential bias in the fact-check and scaling results. revision: yes
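For concreteness, parser precision and recall against such a gold standard reduce to set comparison over citation keys. A minimal sketch, where the matching key (report, URL, claim span) is an assumption rather than the paper's choice:

```python
def parser_prf(extracted: set[tuple], gold: set[tuple]) -> dict:
    """Precision/recall/F1 of extracted citations against a
    human-annotated gold standard.

    Each item is a hashable key such as (report_id, url, claim_span);
    the exact matching key is an assumption, not the paper's choice.
    """
    true_pos = len(extracted & gold)
    precision = true_pos / len(extracted) if extracted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}
```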
Referee: [Benchmarking Results and Ablation Studies (abstract description)] Fact Check accuracy is measured via rubric-based LLM-as-a-judge evaluators calibrated through human review, yet no calibration sample size, inter-annotator agreement statistics, or disagreement examples are provided. This is load-bearing for the core claim of a disconnect between link/relevance scores and factual accuracy, especially given known LLM-judge divergence on paraphrased or long-source claims.
Authors: We acknowledge that the manuscript's description of LLM-as-a-judge calibration is insufficiently detailed. While it notes calibration via human review, it omits sample size, agreement statistics, and disagreement examples. In the revision, we will expand the evaluation methodology section to report: the calibration sample size (200 citations), inter-annotator agreement (percentage agreement and Cohen's kappa between human reviewers and the LLM judge), and specific examples of disagreements (including on paraphrased or long claims) with resolution criteria. These additions will provide stronger support for the fact-check scores and the observed disconnect with link/relevance metrics. revision: yes
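Cohen's kappa over the 200 calibration citations is a one-liner with scikit-learn's `cohen_kappa_score`, or a few lines by hand. A sketch of the by-hand computation; the label strings are hypothetical:

```python
from collections import Counter

def cohens_kappa(human: list[str], judge: list[str]) -> float:
    """Cohen's kappa for two raters over the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement
    and p_e is the agreement expected under independent ratings.
    """
    assert len(human) == len(judge) and human
    n = len(human)
    p_o = sum(h == j for h, j in zip(human, judge)) / n
    h_counts, j_counts = Counter(human), Counter(judge)
    labels = set(h_counts) | set(j_counts)
    p_e = sum((h_counts[lab] / n) * (j_counts[lab] / n) for lab in labels)
    return (p_o - p_e) / (1 - p_e) if p_e != 1 else 1.0

# e.g. cohens_kappa(["ok", "ok", "bad"], ["ok", "bad", "bad"]) == 0.4
```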
Circularity Check
Empirical benchmarking study with no derivations or self-referential predictions
Full rationale
The paper introduces an evaluation framework for source attribution in LLM agents and reports direct measurements of link validity, relevance, and factual accuracy across 14 models. It relies on an AST parser for citation extraction and rubric-based LLM-as-a-judge evaluators calibrated via human review. No equations, first-principles derivations, fitted parameters, or predictions appear in the abstract or described methodology. All reported results (e.g., 39-77% factual accuracy, 42% drop with scale) are presented as empirical observations from the new framework rather than outputs derived from prior fitted values or self-citations. This matches the default expectation for non-circular empirical work.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
- [1] Yee Man Choi, Xuehang Guo, Yi R. Fung, and Qingyun Wang. CiteGuard: Faithful citation attribution for LLMs via retrieval-augmented validation. arXiv preprint arXiv:2510.17853.
- [2] Mingxuan Du, Benfeng Xu, Chiwei Zhu, Xiaorui Wang, and Zhendong Mao. DeepResearch Bench: A comprehensive benchmark for deep research agents. arXiv preprint arXiv:2506.11763.
- [3] Luyu Gao, Zhuyun Dai, Panupong Pasupat, Anthony Chen, Arun Tejasvi Chaganty, Yicheng Fan, Vincent Zhao, Ni Lao, Hongrae Lee, Da-Cheng Juan, and Kelvin Guu. RARR: Researching and revising what language models say, using language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, pp. 16477–16508, 2023.
- [4] Yifei Li, Xiang Yue, Zeyi Liao, and Huan Sun. AttributionBench: How hard is automatic attribution evaluation? In Findings of the Association for Computational Linguistics: ACL 2024, 2024.
- [5] Elias Lumer, Alex Cardenas, Matt Melich, Myles Mason, Sara Dieter, Vamse Kumar Subbiah, Pradeep Honaganahalli Basavaraju, and Roberto Hernandez. Comparison of text-based and image-based retrieval in multimodal retrieval augmented generation large language model systems. arXiv preprint arXiv:2511.16654, 2025.
- [6] OpenAI. Introducing ChatGPT search. https://openai.com/index/introducing-chatgpt-search.
- [7] Abhilasha Ravichander, Shrusti Ghela, David Wadden, and Yejin Choi. HALoGEN: Fantastic LLM hallucinations and where to find them. arXiv preprint arXiv:2501.08292.
- [8] Yash Saxena, Raviteja Bommireddy, Ankur Padia, and Manas Gaur. Generation-time vs. post-hoc citation: A holistic evaluation of LLM attribution. arXiv preprint arXiv:2509.21557.
- [9] Sahil Sen, Elias Lumer, Anmol Gulati, and Vamse Kumar Subbiah. Chronos: Temporal-aware conversational agents with structured event retrieval for long-term memory. arXiv preprint arXiv:2603.16862.
- [10] Wooseok Seo, Seungju Han, Jaehun Jung, Benjamin Newman, Seungwon Lim, Seungbeen Lee, Ximing Lu, Yejin Choi, and Youngjae Yu. Verifying the verifiers: Unveiling pitfalls and potentials in fact verifiers. arXiv preprint arXiv:2506.13342.
- [11] David Wan, Eran Hirsch, Elias Stengel-Eskin, Ido Dagan, and Mohit Bansal. GenerationPrograms: Fine-grained attribution with executable programs. arXiv preprint arXiv:2506.14580.
- [12] Qian Wang, Zhanzhi Lou, Zhenheng Tang, Nuo Chen, Xuandong Zhao, Wenxuan Zhang, Dawn Song, and Bingsheng He. Assessing judging bias in large reasoning models: An empirical study. arXiv preprint arXiv:2504.09946.
- [13] BrowseComp: A simple yet challenging benchmark for browsing agents. URL https://arxiv.org/abs/2504.12516.
  Yumo Xu, Peng Qi, Jifan Chen, Kunlun Liu, Rujun Han, Lan Liu, Bonan Min, Vittorio Castelli, Arshit Gupta, and Zhiguo Wang. CiteEval: Principle-driven citation evaluation for source attribution. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics.
- [14] Zhengqing Yuan, Kaiwen Shi, Zheyuan Zhang, Lichao Sun, Nitesh V. Chawla, and Yanfang Ye. CiteAudit: You cited it, but did you read it? A benchmark for verifying scientific references in the LLM era. arXiv preprint arXiv:2602.23452.