Source or It Didn't Happen: A Multi-Agent Framework for Citation Hallucination Detection
Pith reviewed 2026-05-12 01:14 UTC · model grok-4.3
The pith
CiteTracer detects citation hallucinations at 97.1 percent accuracy by retrieving evidence across sources and adjudicating each citation field against a three-class taxonomy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CiteTracer is a cascading multi-agent detector built on a 12-code taxonomy spanning Real, Potential, and Hallucinated citations. It extracts structured citations from PDF and BibTeX, retrieves evidence through cache lookup, URL fetch, scholar connectors, and web search, applies deterministic field matching, and routes ambiguous cases to class-specialist judgers. On the synthetic benchmark it reaches 97.1 percent accuracy with class-level F1 scores of 97.0, 95.8, and 98.5; on the real-world set it detects 97.1 percent of fabrications without abstaining.
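The deterministic field-matching stage lends itself to a compact sketch. Everything below is an illustrative assumption rather than CiteTracer's published rules: the field set, the 0.9 similarity threshold, and the routing condition are hypothetical, chosen only to show how field-level adjudication differs from a binary found/not-found check.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Optional

@dataclass
class Citation:
    title: str
    authors: str
    year: str
    doi: str

def _sim(a: str, b: str) -> float:
    """Similarity of lowercased, whitespace-collapsed strings (hypothetical normalization)."""
    norm = lambda s: " ".join(s.lower().split())
    return SequenceMatcher(None, norm(a), norm(b)).ratio()

def adjudicate(claimed: Citation, evidence: Optional[Citation]) -> str:
    """Deterministic first pass; cases the rules cannot settle go to an LLM judger."""
    if evidence is None:
        return "Hallucinated"       # no retrieval channel produced a source
    scores = {
        "title": _sim(claimed.title, evidence.title),
        "authors": _sim(claimed.authors, evidence.authors),
        "year": 1.0 if claimed.year == evidence.year else 0.0,
        "doi": 1.0 if claimed.doi.lower() == evidence.doi.lower() else 0.0,
    }
    if all(v >= 0.9 for v in scores.values()):
        return "Real"               # every field agrees with the evidence
    if scores["title"] >= 0.9:
        return "Potential"          # right work, but some metadata fields drift
    return "route-to-judger"        # ambiguous: hand off to a class-specialist judger
```

The per-field `scores` dict is the field-level signal the pith emphasizes: an auditor sees which field failed, not just a found/not-found bit.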
What carries the argument
CiteTracer, the cascading multi-agent detector that performs taxonomy-aligned field-level adjudication after multi-source evidence retrieval and deterministic matching.
If this is right
- Auditors receive field-level signals rather than simple binary verification outcomes.
- The detector identifies 97.1 percent of fabrications in real submissions drawn from ICLR 2026 and an anonymous conference's desk-rejected papers.
- Class-level F1 scores exceed 95 percent across Real, Potential, and Hallucinated categories on synthetic data.
- The released benchmark of mutated real seeds paired with actual fabrications supports further detector development.
Where Pith is reading between the lines
- Integration into writing assistants could flag suspicious citations during drafting instead of after submission.
- Performance may drop for citations from low-indexed or non-English sources where web retrieval is incomplete.
- Extending the taxonomy or adding domain-specific retrieval agents could address edge cases the current pipeline misses.
Load-bearing premise
The retrieval pipeline of cache lookup, URL fetch, scholar connectors, and web search plus deterministic field matching will produce sufficient evidence for most cases, and the controlled LLM mutations in the synthetic benchmark adequately represent real-world citation fabrications.
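This premise can be made concrete as a short fallback loop. The stage names and toy cache below are assumptions for illustration, not the paper's implementation; the point is that the premise fails exactly when every stage returns nothing.

```python
from typing import Callable, Optional

def cascade(citation: dict,
            stages: list[Callable[[dict], Optional[dict]]]) -> Optional[dict]:
    """Try each evidence source in order; first hit wins, None means no evidence."""
    for stage in stages:
        evidence = stage(citation)
        if evidence is not None:
            return evidence
    return None  # the no-evidence regime where the accuracy claims are untested

# Toy stand-ins for cache lookup, URL fetch, scholar connectors, and web search.
_cache = {"10.1000/abc": {"title": "Known paper"}}
lookup_cache = lambda c: _cache.get(c.get("doi"))
fetch_url    = lambda c: None  # network stages elided in this sketch
web_search   = lambda c: None
```

A citation that no stage can resolve comes back as `None`, which is precisely the case the "What would settle it" scenario below probes.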
What would settle it
A set of real-world fabricated citations where the retrieval pipeline returns no usable evidence for any field, causing the system to miss the fabrications or abstain, would show that the accuracy claims do not hold outside the tested conditions.
Figures
Original abstract
Large language models are increasingly used in scientific writing, yet they can fabricate citation-shaped references that appear plausible but fail bibliographic verification. Existing detectors often reduce verification to binary found/not-found decisions and rely on brittle parsing or incomplete retrieval, offering little field-level signal to auditors. We reframe citation hallucination detection as taxonomy-aligned field-level adjudication and introduce a 12-code taxonomy spanning Real, Potential, and Hallucinated citations. Based on this taxonomy, we build CiteTracer, a cascading multi-agent detector that extracts structured citations from PDF and BibTeX, retrieves evidence through cache lookup, URL fetch, scholar connectors, and web search, applies deterministic field matching, and routes ambiguous cases to class-specialist judgers. We release a benchmark of 2,450 synthetic citations built from real seeds with controlled LLM mutations, paired with 957 real-world fabricated citations drawn from ICLR 2026 and an anonymous conference desk-rejected submissions. CiteTracer reaches 97.1% accuracy on the synthetic benchmark, with class-level F1 scores of 97.0, 95.8, and 98.5 for Real, Potential, and Hallucinated, respectively, and detects 97.1% of fabrications on the real-world set without abstaining. Code: https://github.com/aaFrostnova/CiteTracer.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CiteTracer, a cascading multi-agent detector for citation hallucinations that uses a 12-code taxonomy spanning Real, Potential, and Hallucinated citations. Citations are extracted from PDF/BibTeX, evidence is retrieved via cache/URL/scholar/web search, deterministic field matching is applied, and ambiguous cases are routed to specialist judgers. It reports 97.1% accuracy (with per-class F1 of 97.0/95.8/98.5) on a synthetic benchmark of 2,450 controlled LLM-mutated citations and 97.1% detection of 957 real-world fabricated citations from ICLR 2026 and desk-rejected submissions, with code released.
Significance. If the results hold, the work offers a practical, taxonomy-aligned alternative to binary found/not-found detectors for a timely problem in LLM-assisted scientific writing. The release of code, the synthetic benchmark built from real seeds, and the real-world set constitute concrete contributions that could support follow-on research and auditing tools.
major comments (1)
- [Abstract] Abstract: the 957 real-world fabricated citations are described only as 'drawn from ICLR 2026 and an anonymous conference desk-rejected submissions' with no account of identification, verification, labeling, inclusion criteria, or how many candidates were screened. This detail is load-bearing for the 97.1% detection claim, because the reported rate could be conditioned on an easier subset (e.g., obvious non-existent DOIs or failures already caught by retrieval pipelines similar to CiteTracer's) rather than the full distribution of citation hallucinations.
minor comments (2)
- [Evaluation] No error bars, ablation results on the retrieval components, or failure-case analysis are mentioned, which would help assess robustness beyond the headline numbers.
- [Abstract] The abstract refers to 'an anonymous conference' without further clarification; if possible, more detail on the source distribution would aid reproducibility.
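One cheap way to supply the missing error bars is a percentile bootstrap over per-citation correctness. The outcome vector below is an illustrative reconstruction from the headline numbers (2,450 citations at 97.1 percent accuracy), not the paper's released data.

```python
import random

def bootstrap_ci(correct: list[int], n_boot: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap confidence interval for accuracy over 0/1 indicators."""
    rng = random.Random(seed)
    n = len(correct)
    stats = sorted(sum(rng.choices(correct, k=n)) / n for _ in range(n_boot))
    return stats[int(alpha / 2 * n_boot)], stats[int((1 - alpha / 2) * n_boot) - 1]

# 2,379 of 2,450 correct reproduces the reported 97.1% accuracy.
outcomes = [1] * 2379 + [0] * 71
low, high = bootstrap_ci(outcomes)
```

At this sample size the interval is tight (under a percentage point on either side of 97.1), which is why ablations and failure-case analysis would add more information than the headline digit alone.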
Simulated Author's Rebuttal
We thank the referee for their careful reading of the manuscript and for highlighting an important point about transparency in the real-world evaluation. We address the major comment below and will incorporate revisions to strengthen the paper.
Point-by-point responses
- Referee: [Abstract] Abstract: the 957 real-world fabricated citations are described only as 'drawn from ICLR 2026 and an anonymous conference desk-rejected submissions' with no account of identification, verification, labeling, inclusion criteria, or how many candidates were screened. This detail is load-bearing for the 97.1% detection claim, because the reported rate could be conditioned on an easier subset (e.g., obvious non-existent DOIs or failures already caught by retrieval pipelines similar to CiteTracer's) rather than the full distribution of citation hallucinations.
Authors: We agree that the current description of the real-world dataset is insufficiently detailed and that this information is necessary to evaluate the 97.1% detection rate and rule out selection bias. In the revised manuscript we will expand the Experiments section (and update the abstract accordingly) with a full account of dataset construction. This will include: the identification process (initial flagging of suspicious citations in ICLR 2026 submissions and desk-rejected papers via automated DOI/URL checks combined with reviewer or organizer reports); verification steps (multi-source retrieval attempts confirming absence or mismatch, followed by author adjudication); labeling procedure (application of the 12-code taxonomy by multiple annotators, with reported inter-annotator agreement); explicit inclusion criteria (citations that were fabricated yet presented in a form that could plausibly pass casual inspection); and screening statistics (total candidates examined and the fraction retained as the final 957). We will also add a short discussion of the distribution of hallucination types in this set to demonstrate that it is not limited to trivial cases already caught by basic retrieval. These changes will make the evaluation transparent while respecting the anonymity constraints of the source conference.
Revision: yes
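The inter-annotator agreement the authors promise is conventionally reported as Cohen's kappa for two annotators over the 12-code labels. The sketch below uses toy labels, not the paper's annotations; a panel of more than two annotators would instead call for Fleiss' kappa.

```python
from collections import Counter

def cohens_kappa(a: list[str], b: list[str]) -> float:
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n               # observed agreement
    ca, cb = Counter(a), Counter(b)
    pe = sum(ca[k] * cb[k] for k in set(a) | set(b)) / n**2  # expected by chance
    return (po - pe) / (1 - pe)

# Toy two-annotator labeling over the three top-level classes.
ann1 = ["Real", "Real", "Hallucinated", "Potential", "Real"]
ann2 = ["Real", "Real", "Hallucinated", "Real", "Real"]
kappa = cohens_kappa(ann1, ann2)
```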
Circularity Check
No circularity: performance claims rest on externally constructed benchmarks independent of the detector definition.
Full rationale
The paper introduces a taxonomy, retrieval pipeline, and multi-agent adjudication framework whose definitions and components are specified prior to and independently of the reported accuracy numbers. The synthetic benchmark is generated from real citation seeds via controlled external LLM mutations, and the real-world set is drawn from conference submissions; neither is defined in terms of the detector's outputs or fitted parameters. No equations, self-referential predictions, or load-bearing self-citations appear in the provided text that would reduce the 97.1% accuracy or F1 scores to tautological inputs by construction. The evaluation is therefore self-contained against the stated benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: LLMs can generate plausible but non-existent citations that require external verification.
invented entities (2)
- 12-code taxonomy for Real, Potential, and Hallucinated citations (no independent evidence)
- CiteTracer cascading multi-agent detector (no independent evidence)
Reference graph
Works this paper leans on
- [1] ACM CCS 2026 Program Committee. Transparency report on AI-generated citations in ACM CCS 2026 submissions. https://github.com/ACM-CCS-2026/Transparency-Report, 2026.
- [2] Anthropic. Claude (Opus 4.7 version) [large language model], 2026.
- [3] Sam Anzaroot and Andrew McCallum. UMass citation field extraction dataset. http://www.iesl.cs.umass.edu/data/data-umasscitationfield, 2013.
- [4] Sam Anzaroot, Alexandre Passos, David Belanger, and Andrew McCallum. Learning soft linear constraints with application to citation field extraction. arXiv preprint arXiv:1403.1349, 2014.
- [5] Shuai Bai, Yuxuan Cai, Ruizhe Chen, et al. Qwen3-VL technical report. arXiv preprint arXiv:2511.21631, 2025.
- [6] Mikaël Chelli, Jules Descamps, Vincent Lavoué, Christophe Trojani, Michel Azar, Marcel Deckert, Jean-Luc Raynier, Gilles Clowez, Pascal Boileau, Caroline Ruetsch-Chelli, et al. Hallucination rates and reference accuracy of ChatGPT and Bard for systematic reviews: comparative analysis. Journal of Medical Internet Research, 26(1):e53164, 2024.
- [7] Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav. Mem0: Building production-ready AI agents with scalable long-term memory. arXiv preprint arXiv:2504.19413, 2025.
- [8] CiteCheck. CiteCheck: AI-powered citation verification. https://citecheck.ai/, 2024. Accessed: 2026-04.
- [9] Citely. Citely: AI citation assistant. https://citely.ai/, 2024. Accessed: 2026-04.
- [10] Google. Gemini (3.1 Pro version) [large language model], 2026.
- [11] GPTZero. GPTZero finds over 50 hallucinations in ICLR 2026 submissions. https://gptzero.me/news/iclr-2026, 2025.
- [12] GPTZero. GPTZero flags fabricated citations in NeurIPS submissions. https://gptzero.me/news/neurips/, 2025. Accessed: 2026-05.
- [13] GPTZero Team. GPTZero: Detecting AI-generated text, 2023.
- [14] Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, et al. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Transactions on Information Systems, 43(2):1–55, 2025.
- [15] Kimi Team, Tongtong Bai, Yifan Bai, et al. Kimi K2.5: Visual agentic intelligence, 2026.
- [16] Tianyi Ma, Yiyue Qian, Zheyuan Zhang, Zehong Wang, Xiaoye Qian, Feifan Bai, Yifan Ding, Xuwei Luo, Shinan Zhang, Keerthiram Murugesan, et al. AutoData: A multi-agent system for open web data collection. arXiv preprint arXiv:2505.15859, 2025.
- [17] Potsawee Manakul, Adian Liusie, and Mark J. F. Gales. SelfCheckGPT: Zero-resource black-box hallucination detection for generative large language models. In Proceedings of EMNLP, 2023.
- [18] OpenAI. ChatGPT (5.5 version) [large language model], 2026.
- [19] Subhey Sadi Rahman, Md Adnanul Islam, Md Mahbub Alam, Musarrat Zeba, Md Abdur Rahman, Sadia Sultana Chowa, Mohaimenul Azam Khan Raiaan, and Sami Azam. Hallucination to truth: a review of fact-checking and factuality evaluation in large language models. Artificial Intelligence Review, 2026.
- [20] RefCheck-AI. RefCheck-AI. https://github.com/HuaHenry/RefCheck_ai, 2024. Accessed: 2026-04.
- [21] Ole Bjørn Rekdal. Academic urban legends. Social Studies of Science, 44(4):638–654, 2014.
- [22] Y. Sakai, H. Kamigaito, and T. Watanabe. HalluCitation matters: Revealing the impact of hallucinated references with 300 hallucinated papers in ACL conferences. https://arxiv.org/abs/2601.18724, 2026.
- [23] Maria Janina Sarol, Shufan Ming, Shruthan Radhakrishna, Jodi Schneider, and Halil Kilicoglu. Assessing citation integrity in biomedical publications: corpus annotation and NLP models. Bioinformatics, 40(7):btae420, 2024.
- [24] Gianluca Sbardella. Hallucinator: A citation hallucination checker. https://github.com/gianlucasb/hallucinator, 2024.
- [25] SwanRef. SwanRef: Reference verification platform. https://www.swanref.org/, 2024. Accessed: 2026-04.
- [26] The Register. AI conference's papers contaminated by AI hallucinations. https://www.theregister.com/2026/01/22/neurips_papers_contaiminated_ai_hallucinations/, 2026.
- [27] Jacob-Junqi Tian, Hao Yu, Yury Orlovskiy, Tyler Vergho, Mauricio Rivera, Mayank Goel, Zachary Yang, Jean-Francois Godbout, Reihaneh Rabbany, and Kellin Pelrine. Web retrieval agents for evidence-based misinformation detection. arXiv preprint arXiv:2409.00009, 2024.
- [28]
- [29] L. J. Janse van Rensburg. AI-powered citation auditing: A zero-assumption protocol for systematic reference verification in academic research, 2025.
- [30] William H. Walters and Esther Isabelle Wilder. Fabrication and errors in the bibliographic citations generated by ChatGPT. Scientific Reports, 13(1):14045, 2023.
- [31] Ludo Waltman. A review of the literature on citation impact indicators. Journal of Informetrics, 10(2):365–391, 2016.
- [32] Haoran Wei, Yaofeng Sun, and Yukun Li. DeepSeek-OCR 2: Visual causal flow. arXiv preprint arXiv:2601.20552, 2026.
- [33] Zhengqing Yuan, Kaiwen Shi, Zheyuan Zhang, Lichao Sun, Nitesh V Chawla, and Yanfang Ye. CiteAudit: You cited it, but did you read it? A benchmark for verifying scientific references in the LLM era. arXiv preprint arXiv:2602.23452, 2026.