pith. machine review for the scientific record.

arxiv: 2604.03173 · v1 · submitted 2026-04-03 · 💻 cs.CL

Detecting and Correcting Reference Hallucinations in Commercial LLMs and Deep Research Agents

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 20:02 UTC · model grok-4.3

classification 💻 cs.CL
keywords: citation hallucination · LLM reliability · URL validation · deep research agents · reference accuracy · Wayback Machine · self-correction · hallucination detection

The pith

LLMs and deep research agents hallucinate 3-13% of their citation URLs, links with no web archive record, but a new tool cuts non-resolving links by up to 79 times.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper measures citation URL validity for ten models and agents on DRBench (53,090 URLs) and for three models on ExpertQA (168,021 URLs across 32 academic fields). It establishes that 3-13% of supplied URLs have no Wayback Machine record and likely never existed, while overall non-resolving rates reach 5-18%. Deep research agents produce more citations than search-augmented LLMs yet hallucinate at higher rates, with clear differences across domains such as business and theology. The authors release urlhealth, an open-source checker that distinguishes hallucinated links from link rot using historical archives. Models that apply the tool for self-correction reduce broken citations by factors of 6 to 79, bringing rates below 1%.

Core claim

Citation URL validity is measurable at scale and correctable in practice: 3-13% of URLs generated by LLMs and agents have no historical record in the Wayback Machine and are therefore hallucinated, while 5-18% fail to resolve; deep research agents show higher hallucination rates than search-augmented models, with domain and model-specific patterns; urlhealth enables self-correction that reduces non-resolving URLs by 6-79 times to under 1%.

What carries the argument

urlhealth, an open-source tool that checks URL liveness and classifies stale versus hallucinated citations using the Wayback Machine as a historical proxy.
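The paper's page does not reproduce urlhealth's internals, but the decision rule it describes (resolves now: fine; dead but archived: link rot; dead and never archived: likely hallucinated) can be sketched against the Internet Archive's public Wayback availability endpoint. This is an illustrative sketch, not the released tool; `check_url` and its helpers are hypothetical names.

```python
import json
import urllib.error
import urllib.parse
import urllib.request

WAYBACK_API = "https://archive.org/wayback/available?url="

def has_wayback_record(url: str, timeout: float = 10.0) -> bool:
    """True if the Wayback Machine has any snapshot of `url`.
    The API returns an empty `archived_snapshots` object when none exist."""
    query = WAYBACK_API + urllib.parse.quote(url, safe="")
    with urllib.request.urlopen(query, timeout=timeout) as resp:
        data = json.load(resp)
    return bool(data.get("archived_snapshots"))

def is_live(url: str, timeout: float = 10.0) -> bool:
    """True if `url` currently resolves (urlopen follows redirects;
    4xx/5xx raise HTTPError, a subclass of URLError)."""
    try:
        req = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(req, timeout=timeout):
            return True
    except (urllib.error.URLError, ValueError):
        return False

def classify(live: bool, archived: bool) -> str:
    """Stale-vs-hallucinated rule described in the review:
    live -> ok; dead but archived -> stale (link rot);
    dead and never archived -> hallucinated (likely never existed)."""
    if live:
        return "ok"
    return "stale" if archived else "hallucinated"

def check_url(url: str) -> str:
    return classify(is_live(url), has_wayback_record(url))
```

The classification itself is a pure function of two booleans, so the network checks can be swapped out or cached without touching the rule.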

If this is right

  • Deep research agents produce substantially more citations per query but hallucinate URLs at higher rates than search-augmented LLMs.
  • Non-resolving citation rates vary by domain, ranging from 5.4% in business to 11.4% in theology.
  • Models equipped with urlhealth reduce non-resolving citation URLs by 6-79 times to under 1%.
  • Some models fabricate every non-resolving URL while others show substantial fractions attributable to link rot.
  • The urlhealth tool and all evaluation data are released publicly for reuse.
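The self-correction setup behind the 6-79x reduction (generate citations, check them, regenerate the flagged ones, repeat) can be sketched as a simple loop. `check` stands in for a urlhealth-style classifier and `regenerate` for a model call; both are hypothetical stand-ins, not the paper's released interface.

```python
from typing import Callable

def self_correct(
    citations: list[str],
    check: Callable[[str], str],  # returns "ok" / "stale" / "hallucinated"
    regenerate: Callable[[list[str]], list[str]],  # model call replacing bad URLs
    max_rounds: int = 3,
) -> tuple[list[str], int]:
    """Iteratively replace non-resolving citations until all pass
    or the round budget is exhausted. Returns (citations, rounds_used)."""
    rounds = 0
    for _ in range(max_rounds):
        bad = [u for u in citations if check(u) != "ok"]
        if not bad:
            break
        rounds += 1
        replacements = regenerate(bad)
        fixes = dict(zip(bad, replacements))
        # Preserve citation order; swap only the flagged URLs.
        citations = [fixes.get(u, u) for u in citations]
    return citations, rounds
```

As the abstract notes, effectiveness depends on the model's tool-use competence: the loop only converges if `regenerate` actually produces resolvable replacements.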

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Citation validation tools like urlhealth could be integrated directly into generation pipelines to improve reliability of AI research outputs.
  • The same archival-checking approach might apply to other reference types such as DOIs, dataset identifiers, or code repositories.
  • Current retrieval-augmented systems may need explicit post-generation URL validation steps to prevent downstream propagation of broken links.
  • Widespread adoption could shift evaluation benchmarks for LLMs toward measuring citation integrity rather than only factual accuracy.

Load-bearing premise

The absence of a Wayback Machine record is treated as proof that a URL never existed at the time the model generated it.

What would settle it

A controlled test in which models generate citations for queries with known live URLs at generation time, followed by checking whether any such URLs lack Wayback records or whether urlhealth still flags them as hallucinated.
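The arithmetic of that test is simple: given a set of URLs known to be live at generation time, the fraction lacking Wayback records is a direct estimate of how often the "no record, therefore hallucinated" rule misfires. A minimal sketch, assuming the archive lookups have already been done:

```python
def false_hallucination_rate(known_live_archived: list[bool]) -> float:
    """For URLs known to be live at generation time, `known_live_archived[i]`
    is whether the Wayback Machine has a record of URL i. Returns the
    fraction the 'no record => hallucinated' rule would wrongly flag,
    an empirical estimate of its false-positive rate."""
    if not known_live_archived:
        raise ValueError("need at least one URL")
    wrongly_flagged = sum(1 for archived in known_live_archived if not archived)
    return wrongly_flagged / len(known_live_archived)
```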

Figures

Figures reproduced from arXiv: 2604.03173 by Chris Callison-Burch, Delip Rao, Eric Wong.

Figure 1. Non-resolving URL rates for DRBench models, grouped by provider. Each bar …
Figure 2. Non-resolving URL rates by academic field and model for ExpertQA. Fields are …
Figure 3. Non-resolving URL rates by subfield within Healthcare/Medicine. The top 15 …
Figure 4. Distribution of urlhealth correction rounds per question (435 questions each, 3 models). The three models exhibit distinct self-correction profiles. Gemini 2.5 Pro (green) completes in 1–2 rounds every time: its two-phase architecture (Google Search grounding followed by a single verification turn) caps it at two rounds, and 44% of questions need only one. GPT-5.1 (orange) clusters at 2 rounds (61%), with …
read the original abstract

Large language models and deep research agents supply citation URLs to support their claims, yet the reliability of these citations has not been systematically measured. We address six research questions about citation URL validity using 10 models and agents on DRBench (53,090 URLs) and 3 models on ExpertQA (168,021 URLs across 32 academic fields). We find that 3–13% of citation URLs are hallucinated – they have no record in the Wayback Machine and likely never existed – while 5–18% are non-resolving overall. Deep research agents generate substantially more citations per query than search-augmented LLMs but hallucinate URLs at higher rates. Domain effects are pronounced: non-resolving rates range from 5.4% (Business) to 11.4% (Theology), with per-model effects even larger. Decomposing failures reveals that some models fabricate every non-resolving URL, while others show substantial link-rot fractions indicating genuine retrieval. As a solution, we release urlhealth, an open-source tool for URL liveness checking and stale-vs-hallucinated classification using the Wayback Machine. In agentic self-correction experiments, models equipped with urlhealth reduce non-resolving citation URLs by 6–79× to under 1%, though effectiveness depends on the model's tool-use competence. The tool and all data are publicly available. Our characterization findings, failure taxonomy, and open-source tooling establish that citation URL validity is both measurable at scale and correctable in practice.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper measures citation URL validity across 10 LLMs and deep research agents on DRBench (53,090 URLs) and ExpertQA (168,021 URLs). It reports 3-13% of URLs as hallucinated (no Wayback Machine record, presumed never existed) and 5-18% as non-resolving overall, with agents producing more citations but higher hallucination rates. Domain and model effects are analyzed, a failure taxonomy is proposed, and the open-source urlhealth tool is introduced; self-correction experiments show 6-79x reductions in non-resolving URLs to under 1%.

Significance. If the measurements and taxonomy hold, the work supplies the first large-scale empirical baseline on citation URL hallucination versus link rot, a reproducible open-source classification tool, and evidence that tool-augmented self-correction is practical. These contributions directly support reliability improvements in LLM research agents and are likely to be cited in follow-on studies of grounded generation.

major comments (2)
  1. [Methods] Methods section on URL classification: defining hallucinated URLs strictly as those with no Wayback Machine record is load-bearing for the headline 3-13% rate and the claim that some models fabricate every non-resolving URL. The archive's incomplete coverage of low-traffic, recent, or dynamically generated pages means a non-negligible fraction of 'no-record' URLs may have existed at generation time, which would inflate hallucination estimates and weaken the hallucination-versus-link-rot taxonomy.
  2. [Results] Results and evaluation sections: no confidence intervals are reported on the per-model or per-domain hallucination and non-resolving rates, and the query sampling procedure for DRBench and ExpertQA is not detailed. These omissions limit assessment of statistical reliability and generalizability of the reported ranges.
minor comments (2)
  1. [Abstract] Abstract and Table 1: the reported ranges (3-13%, 5-18%) would be clearer if accompanied by exact per-model or per-benchmark breakdowns rather than aggregated intervals.
  2. [Tool description] urlhealth tool description: the precise Wayback Machine query parameters and handling of redirects or paywalled pages should be specified to allow exact reproduction.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and recommendation for minor revision. We address each major comment below with clarifications and revisions to improve the manuscript.

read point-by-point responses
  1. Referee: [Methods] Methods section on URL classification: defining hallucinated URLs strictly as those with no Wayback Machine record is load-bearing for the headline 3-13% rate and the claim that some models fabricate every non-resolving URL. The archive's incomplete coverage of low-traffic, recent, or dynamically generated pages means a non-negligible fraction of 'no-record' URLs may have existed at generation time, which would inflate hallucination estimates and weaken the hallucination-versus-link-rot taxonomy.

    Authors: We agree that the Wayback Machine provides incomplete coverage, particularly for recent, low-traffic, or dynamically generated pages, and that this introduces uncertainty into the hallucination classification. Our definition treats absence of any record as evidence of likely fabrication (a conservative proxy), while any record classifies the URL as non-hallucinated regardless of current liveness. This choice prioritizes avoiding false negatives on hallucination but can overestimate fabrication rates. In revision we will add explicit discussion of this limitation in the Methods and Limitations sections, qualify the reported rates as upper bounds under the chosen definition, and note implications for the taxonomy without altering the core measurement approach. revision: partial

  2. Referee: [Results] Results and evaluation sections: no confidence intervals are reported on the per-model or per-domain hallucination and non-resolving rates, and the query sampling procedure for DRBench and ExpertQA is not detailed. These omissions limit assessment of statistical reliability and generalizability of the reported ranges.

    Authors: We appreciate this observation. The query sampling for DRBench and ExpertQA follows the construction protocols of the source datasets (random sampling stratified by domain for ExpertQA; full query set for DRBench); we will now describe these procedures in detail in the revised Methods section, including sample sizes and any filtering steps. We will also add 95% bootstrap confidence intervals for all per-model and per-domain rates in the Results tables and figures to quantify statistical reliability. revision: yes
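The 95% bootstrap intervals the authors promise are straightforward for a per-model rate: resample the URL-level outcomes with replacement and take the 2.5th and 97.5th percentiles of the resampled rates. A minimal percentile-bootstrap sketch (not the authors' code):

```python
import random

def bootstrap_ci(
    outcomes: list[int],
    n_boot: int = 10_000,
    alpha: float = 0.05,
    seed: int = 0,
) -> tuple[float, float, float]:
    """Percentile bootstrap CI for a rate.
    `outcomes` holds 1 for each non-resolving (or hallucinated) URL, 0 otherwise.
    Returns (point_estimate, lower, upper)."""
    rng = random.Random(seed)
    n = len(outcomes)
    point = sum(outcomes) / n
    # Resample n URLs with replacement, n_boot times; sort the resampled rates.
    rates = sorted(sum(rng.choices(outcomes, k=n)) / n for _ in range(n_boot))
    lo = rates[int((alpha / 2) * n_boot)]
    hi = rates[int((1 - alpha / 2) * n_boot) - 1]
    return point, lo, hi
```

For example, a model with 13 bad URLs out of 100 would report its 13% rate together with the interval `bootstrap_ci([1] * 13 + [0] * 87)` produces.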

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper's measurements of hallucinated URLs (3-13%) and non-resolving rates (5-18%) are computed directly from external Wayback Machine record checks on the DRBench and ExpertQA datasets, with no fitted parameters or self-referential definitions. The urlhealth tool applies the same external proxy for classification, and the reported 6-79x reductions are simple before/after empirical counts on identical queries. No self-citation chains, uniqueness theorems, or ansatzes are invoked as load-bearing premises. The derivation chain remains independent of its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work rests on the assumption that Wayback Machine snapshots are a sufficient ground truth for URL existence and that non-resolving status cleanly distinguishes hallucination from genuine link rot. No free parameters or invented entities are introduced.

axioms (1)
  • domain assumption Wayback Machine record is a reliable proxy for whether a URL existed at generation time
    Used to classify URLs as hallucinated when no record exists.

pith-pipeline@v0.9.0 · 5583 in / 1326 out tokens · 17347 ms · 2026-05-13T20:02:22.912412+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

4 extracted references · 4 canonical work pages

  4. [4]

    Deepresearch bench: A comprehensive benchmark for deep research agents. arXiv preprint arXiv:2506.11763, 2025.