CiteAudit: You Cited It, But Did You Read It? A Benchmark for Verifying Scientific References in the LLM Era

2026 · cs.CL · arXiv 2602.23452

5 Pith papers cite this work. Polarity classification is still indexing.

abstract

Scientific research relies on citation integrity, yet large language models (LLMs) have introduced a critical risk: fabricated references that appear plausible but correspond to no real publications. As manual verification becomes infeasible and existing automated tools remain fragile, we introduce CiteAudit, a comprehensive benchmark and detection framework for hallucinated citations. We design a multi-agent verification pipeline that decomposes citation checking into metadata extraction, memory lookup, web-based retrieval, and final judgment. To evaluate this, we construct a large-scale, human-validated dataset spanning diverse domains and hallucination types. Experiments demonstrate that our framework achieves superior verification performance over state-of-the-art LLMs and commercial baselines. Our work provides the necessary infrastructure to audit citations at scale and safeguard the trustworthiness of scholarly discourse. Code is available at https://github.com/shiiiikw/CiteAudit.
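The four stages named in the abstract (metadata extraction, memory lookup, web-based retrieval, final judgment) can be illustrated with a minimal sketch. This is not the CiteAudit implementation: the `Reference` record, the fuzzy title matching, the 0.9 threshold, the `web_search` callable, and the three verdict labels are all assumptions made for illustration, and the real system uses LLM agents rather than these toy heuristics.

```python
import re
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Reference:
    """Hypothetical record type; the paper's actual schema is not given here."""
    title: str
    year: Optional[int]


def extract_metadata(raw: str) -> Reference:
    """Stage 1: pull a title and year out of a raw citation string (toy heuristic)."""
    year_match = re.search(r"\b(?:19|20)\d{2}\b", raw)
    year = int(year_match.group(0)) if year_match else None
    # Assumption: the title is the first double-quoted span, else the whole string.
    title_match = re.search(r'"([^"]+)"', raw)
    title = title_match.group(1) if title_match else raw
    return Reference(title=title.strip(), year=year)


def similarity(a: str, b: str) -> float:
    """Case-insensitive fuzzy similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def lookup(ref: Reference, records: List[Reference], threshold: float) -> Optional[Reference]:
    """Return the best-matching record if its title similarity clears the threshold."""
    best = max(records, key=lambda r: similarity(ref.title, r.title), default=None)
    if best is not None and similarity(ref.title, best.title) >= threshold:
        return best
    return None


def judge(ref: Reference, match: Optional[Reference]) -> str:
    """Stage 4: turn the evidence into a verdict (labels are illustrative)."""
    if match is None:
        return "unverified"
    if ref.year is not None and match.year is not None and ref.year != match.year:
        return "metadata_mismatch"
    return "verified"


def verify(
    raw: str,
    memory: List[Reference],
    web_search: Callable[[str], List[Reference]],
    threshold: float = 0.9,
) -> str:
    """Run the stages in order: extract, memory lookup, web fallback, judgment."""
    ref = extract_metadata(raw)
    match = lookup(ref, memory, threshold)           # Stage 2: local memory
    if match is None:                                # Stage 3: only on a memory miss
        match = lookup(ref, web_search(ref.title), threshold)
    return judge(ref, match)
```

A call might look like `verify('Vaswani et al. "Attention Is All You Need". 2017.', memory, web_search)`, where `memory` is a list of known `Reference` records and `web_search` is any function that maps a title to candidate records. Checking memory before the web keeps the cheap lookup first and reserves retrieval for misses.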

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Citation Grounding: Detecting and Reducing LLM Citation Hallucinations via Legal Citation Graphs

cs.CL · 2026-05-30 · unverdicted · novelty 7.0

Citation Grounding metric and CG-DPO training method detect and reduce hallucinations in LLM-generated legal citations using a graph from 100.8 million court decisions.

Source or It Didn't Happen: A Multi-Agent Framework for Citation Hallucination Detection

cs.CL · 2026-05-09 · accept · novelty 7.0

CiteTracer detects citation hallucinations at 97.1% accuracy on synthetic and real-world benchmarks by combining structured extraction, multi-source retrieval, deterministic matching, and class-specialist agents.

Cited but Not Verified: Parsing and Evaluating Source Attribution in LLM Deep Research Agents

cs.CL · 2026-05-07 · unverdicted · novelty 7.0

A new framework parses and evaluates citations in LLM deep research reports across link validity, relevance, and factuality, finding 94%+ link success but only 39-77% factual accuracy.

BibTeX Citation Hallucinations in Scientific Publishing Agents: Evaluation and Mitigation

cs.DL · 2026-04-03 · conditional · novelty 7.0

Frontier LLMs generate BibTeX entries at 83.6% field accuracy but only 50.9% fully correct; two-stage clibib revision raises accuracy to 91.5% and fully correct entries to 78.3% with 0.8% regression.

sciwrite-lint: Verification Infrastructure for the Age of Science Vibe-Writing

cs.DL · 2026-04-09

citing papers explorer

Showing 5 of 5 citing papers.

Citation Grounding: Detecting and Reducing LLM Citation Hallucinations via Legal Citation Graphs
cs.CL · 2026-05-30 · unverdicted · none · ref 21 · internal anchor
Source or It Didn't Happen: A Multi-Agent Framework for Citation Hallucination Detection
cs.CL · 2026-05-09 · accept · none · ref 33 · internal anchor
Cited but Not Verified: Parsing and Evaluating Source Attribution in LLM Deep Research Agents
cs.CL · 2026-05-07 · unverdicted · none · ref 14 · internal anchor
BibTeX Citation Hallucinations in Scientific Publishing Agents: Evaluation and Mitigation
cs.DL · 2026-04-03 · conditional · none · ref 22 · internal anchor
sciwrite-lint: Verification Infrastructure for the Age of Science Vibe-Writing
cs.DL · 2026-04-09 · unreviewed · ref 48 · internal anchor
