
arxiv: 2606.13104 · v1 · pith:2AEZTJSCnew · submitted 2026-06-11 · 💻 cs.LG

Authority, Truth, and Citation Bias: A Large-Scale Multi-Domain Benchmark for Studying Epistemic Susceptibility in Large Language Models

Pith reviewed 2026-06-27 07:23 UTC · model grok-4.3

classification 💻 cs.LG
keywords: LLM hallucination, citation bias, epistemic susceptibility, AuthorityBench, multi-domain benchmark, claim veracity, fabricated citations, authority signals

The pith

Citations, whether real or fabricated, raise hallucination rates in large language models compared to prompts with no citations at all.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds a balanced benchmark to isolate how citation authority signals affect whether models generate accurate answers or invent content. It crosses true and false claims with real, fake, and absent citations across thousands of examples in four domains. Models hallucinate more often whenever citations appear, and the increase is largest when invented citations are attached to correct facts. The pattern appears in general knowledge, science, and medicine but is weaker for legal statements. Neither the prestige of the cited source nor the apparent origin of the authors changes the effect much.

Core claim

AuthorityBench applies a fully crossed 2x2 design of claim veracity by citation veracity to 220,564 prompts and shows that any citation presence elevates hallucination rates over a no-citation baseline, with fabricated citations paired to true claims producing the largest rise of 3 to 22 percentage points and peak rates of 35 to 77 percent in the general-knowledge domain.

What carries the argument

AuthorityBench, the 220,564-prompt benchmark, carries the argument: it uses a 2x2 factorial crossing of claim veracity with citation veracity while varying prompt templates, venue prestige tiers, and country-coded author names in a controlled way.
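
As a rough illustration of that crossing, the sketch below enumerates the four claim-by-citation cells plus a no-citation baseline for each claim type. The TC/FC labels follow the figure captions (claim first, citation second, as in "TC×FC: true claim, fabricated citation"). The `Condition` class, the `build_conditions` function, and the choice to model one baseline per claim type are assumptions made for illustration; they are not taken from the AuthorityBench code.

```python
from dataclasses import dataclass
from itertools import product
from typing import List, Optional


@dataclass(frozen=True)
class Condition:
    claim_is_true: bool
    # "real", "fabricated", or None for the no-citation baseline (hypothetical encoding)
    citation: Optional[str]

    @property
    def label(self) -> str:
        claim = "TC" if self.claim_is_true else "FC"
        if self.citation is None:
            return f"{claim} (no citation)"
        cite = "TC" if self.citation == "real" else "FC"
        return f"{claim}×{cite}"


def build_conditions() -> List[Condition]:
    # Fully crossed 2x2 cells: claim veracity x citation veracity.
    cells = [Condition(c, k) for c, k in product([True, False], ["real", "fabricated"])]
    # No-citation baselines, one per claim type (an assumption of this sketch).
    baselines = [Condition(c, None) for c in [True, False]]
    return cells + baselines


labels = [c.label for c in build_conditions()]
print(labels)
```

Keeping the baseline outside the 2x2 cells mirrors how the summary reports every effect as a change relative to prompts with no citation.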

If this is right

  • Citation presence alone drives higher hallucination rates than a no-citation baseline across domains.
  • The largest hallucination increases occur when fabricated citations accompany true claims.
  • Legal-domain claims show smaller susceptibility to citation authority than general-knowledge, science, or medical claims.
  • Venue prestige levels and author country signals produce negligible differences in hallucination rates.
  • The observed effect operates independently of whether the underlying claim is factually correct.
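
The percentage-point comparisons in these bullets can be read as a per-condition hallucination rate minus the no-citation baseline rate. The sketch below shows that arithmetic on invented toy data; the record format, the function names, and the `"baseline"` key are hypothetical, and the numbers are not results from the paper.

```python
from collections import defaultdict


def hallucination_rates(records):
    """records: iterable of (condition, hallucinated) pairs.
    Returns the hallucination rate per condition, in percent."""
    counts = defaultdict(lambda: [0, 0])  # condition -> [hallucinated, total]
    for condition, hallucinated in records:
        counts[condition][0] += int(hallucinated)
        counts[condition][1] += 1
    return {c: 100.0 * h / n for c, (h, n) in counts.items()}


def lift_pp(rates, baseline="baseline"):
    """Percentage-point lift of each condition over the baseline condition."""
    return {c: r - rates[baseline] for c, r in rates.items() if c != baseline}


# Invented toy data: 3/10 hallucinations at baseline, 5/10 under TC×FC.
toy = (
    [("baseline", True)] * 3 + [("baseline", False)] * 7
    + [("TC×FC", True)] * 5 + [("TC×FC", False)] * 5
)
rates = hallucination_rates(toy)
lifts = lift_pp(rates)
print(rates, lifts)
```

A positive lift means the condition hallucinated more often than the baseline; a negative lift would correspond to the suppression region described in the Figure 5 caption.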

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training procedures that down-weight citation tokens when judging claim truth could lower the observed hallucination increases.
  • Retrieval-augmented systems may inherit the same citation-driven errors unless citation handling is explicitly decoupled from fact checking.
  • Legal-domain applications of language models may need fewer safeguards against citation bias than general-purpose chat systems.
  • Simple prompt variants that suppress citation fields offer a low-cost way to test mitigation outside the benchmark setting.
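
One way to picture the prompt-variant idea in the last bullet is a renderer that can drop the citation field while leaving the claim untouched. The sketch below is a minimal version under stated assumptions: the template wording, the `render_prompt` function, and the placeholder claim and citation are all hypothetical and do not come from the benchmark's 40 templates.

```python
from typing import Optional


def render_prompt(claim: str, citation: Optional[str] = None,
                  suppress_citation: bool = False) -> str:
    # Hypothetical template; the benchmark's real templates are not shown in this summary.
    question = f'Is the following statement accurate? "{claim}"'
    if citation is None or suppress_citation:
        return question
    return f"{question} (Source: {citation})"


# Placeholder inputs for illustration only.
claim = "Water boils at 100 degrees Celsius at sea level."
citation = "Doe et al., Hypothetical Journal, 2020"

with_citation = render_prompt(claim, citation)
suppressed = render_prompt(claim, citation, suppress_citation=True)
baseline = render_prompt(claim)
print(with_citation)
print(suppressed)
```

Because the suppressed variant is identical to the no-citation prompt, any difference in model behavior between the two settings would have to come from something other than the prompt text.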

Load-bearing premise

The factorial design and its controls for prompts, venues, and author names succeed in separating citation authority signals from the actual truth value of the claims being presented.

What would settle it

A new run of the same claims and citation conditions that records no measurable rise in hallucination rates when citations are added versus the no-citation baseline.

Figures

Figures reproduced from arXiv: 2606.13104 by Aravind Ramana RN, Aryan Khurana, Dhruv Kumar.

Figure 1: THE 2x2 FACTORIAL DESIGN. Note: The 35–77% figure is domain-specific (general knowledge) and per-model full results appear in …
Figure 2: Dataset Construction Pipeline.
… puts vary systematically with perceived demographic attributes of names in prompts, motivating our inclusion of a country-coded author name variable to extend demographic bias analysis to epistemic authority contexts. FalseCite. The most directly relevant prior work is FalseCite (Mao et al., 2025), which introduced 82,000 prompts pairing false claims from FEVER and SciQ w…
Figure 3: Secondary results (15K subset). (a) Domain effects under TC×FC across models. (b) Template structure effects, showing average TC×FC lift by citation format. (c) Venue prestige null result across four tiers. (d) Author demographics: hallucination rate by surname region. (e) Prestige × domain interaction (elite vs. low-tier lift). (f) Domain alignment effect: same-domain vs. cross-domain citation hallucinati…
Figure 4: Hallucination rates across all five citation conditions for all seven models. The TC×FC column (true claim, fabricated citation) is universally the worst-performing condition across every model tested. Baseline, Base TC, and Base FC columns show full-dataset values (F) for Gemma 3 4B, Llama 3.1 8B, and Phi-4 Mini Instruct; 15K-subset values (S) for all others.
Figure 5: False-claim condition lifts by model. Bars show hallucination rate lift (pp) relative to the no-citation baseline for the FC×FC (fabricated citation) and FC×TC (real citation) conditions. Models in the suppression region (left) show citation-induced reduction in hallucination on false claims; models in the amplification region (right) show the opposite. DeepSeek V3.2 shows a near-zero effect.
Original abstract

Large language models are increasingly deployed in citation-augmented settings, yet the effect of citation presence on model behavior independent of factual content remains poorly understood. We introduce AuthorityBench, a 220,564-prompt multi-domain benchmark that isolates how citation-based authority signals influence epistemic behavior in LLMs. The benchmark uses a fully balanced 2x2 factorial design crossing claim veracity with citation veracity, the first to do so, across four domains (general knowledge, science, law, and medicine), with controlled variation over 40 prompt templates, four venue prestige tiers, and a country-coded author name dataset. Evaluating seven models on 12 structured research questions, we find that citation presence, whether real or fabricated, consistently increases hallucination rates relative to a no-citation baseline. The effect is strongest when fabricated citations accompany true claims, raising hallucination rates by 3 to 22 percentage points and reaching 35 to 77% in the general knowledge domain, while legal claims are comparatively robust and venue prestige and author demographics show negligible impact. All datasets and evaluation code are available at: https://github.com/floating-reeds/AuthorityBench

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces AuthorityBench, a 220,564-prompt multi-domain benchmark using a balanced 2x2 factorial design that crosses claim veracity with citation veracity, compared against a no-citation baseline. The benchmark spans general knowledge, science, law, and medicine, with controlled variation over 40 prompt templates, four venue prestige tiers, and country-coded author names. Evaluating seven LLMs, the paper's central claim is that citation presence (real or fabricated) consistently raises hallucination rates relative to the no-citation baseline. The largest effects occur when fabricated citations accompany true claims: increases of 3–22 percentage points, with hallucination rates reaching 35–77% in general knowledge. Legal claims are more robust, while venue prestige and author demographics show negligible impact. Datasets and evaluation code are released publicly.

Significance. If the effects hold under the stated controls, the work supplies a large-scale, open benchmark for isolating citation-based authority signals from factual content in LLM epistemic behavior. The factorial design and public release of data plus code constitute clear strengths for reproducibility and extension by the community.

major comments (2)
  1. [Methods] Methods/Evaluation section: The operational definition of hallucination used to label model outputs and compute all reported rates is not provided in the manuscript (only referenced via the GitHub repository). This definition is load-bearing for the central claims about 3–22 percentage-point increases.
  2. [Results] Results section: The manuscript reports percentage-point changes without statistical tests, confidence intervals, or analysis of variance across the 40 prompt templates and seven models. This omission makes it difficult to assess whether the reported effect sizes are reliable or generalizable.
minor comments (1)
  1. [Abstract] The abstract is lengthy; condensing the description of controls while preserving the key numerical findings would improve readability.
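
To make the second major comment concrete, the sketch below shows one simple test that could accompany a reported percentage-point difference: a pooled two-proportion z-test comparing a citation condition against the no-citation baseline. This is an illustrative choice, not the paper's analysis; the function name and the counts are hypothetical, and the test treats prompts as independent, so it ignores the clustering by template and model that the referee raises.

```python
import math


def two_proportion_z(h1: int, n1: int, h0: int, n0: int):
    """Pooled two-proportion z-test.
    h1/n1: hallucinations/prompts in the citation condition,
    h0/n0: hallucinations/prompts at the no-citation baseline.
    Returns (z, two-sided p-value)."""
    p1, p0 = h1 / n1, h0 / n0
    pooled = (h1 + h0) / (n1 + n0)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n0))
    z = (p1 - p0) / se
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Invented counts for illustration only.
z, p = two_proportion_z(500, 1000, 300, 1000)
print(f"z = {z:.2f}, p = {p:.3g}")
```

With benchmark-scale sample sizes almost any nonzero difference will be significant under this test, which is one reason interval estimates and template-level variance are the more informative additions.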

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight important aspects of clarity and statistical rigor. We address each major comment below and will incorporate the suggested changes into the revised manuscript.

point-by-point responses
  1. Referee: [Methods] Methods/Evaluation section: The operational definition of hallucination used to label model outputs and compute all reported rates is not provided in the manuscript (only referenced via the GitHub repository). This definition is load-bearing for the central claims about 3–22 percentage-point increases.

    Authors: We agree that the operational definition of hallucination is essential for interpreting the results and should appear in the main text rather than solely in the repository. In the revised manuscript we will expand the Methods/Evaluation section to include a complete description of the labeling procedure, including the exact criteria used to classify outputs as hallucinations.
    Revision: yes

  2. Referee: [Results] Results section: The manuscript reports percentage-point changes without statistical tests, confidence intervals, or analysis of variance across the 40 prompt templates and seven models. This omission makes it difficult to assess whether the reported effect sizes are reliable or generalizable.

    Authors: We concur that formal statistical support would strengthen the presentation of effect sizes. In the revision we will add bootstrap confidence intervals for the reported percentage-point differences and include mixed-effects or ANOVA-style analyses that account for variation across the 40 prompt templates and seven models.
    Revision: yes
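
The bootstrap intervals mentioned in this response could take several forms. The sketch below shows one possibility, a percentile bootstrap that resamples whole prompt templates so that template-to-template variation widens the interval. The per-template tuple layout, the function name, and the toy counts are assumptions for illustration, not the authors' planned procedure.

```python
import random


def cluster_bootstrap_lift(per_template, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the percentage-point lift over baseline.
    per_template: list of (h_cond, n_cond, h_base, n_base) tuples, one per template.
    Templates are resampled with replacement as whole clusters."""
    rng = random.Random(seed)
    lifts = []
    for _ in range(n_boot):
        sample = [rng.choice(per_template) for _ in per_template]
        h1 = sum(t[0] for t in sample)
        n1 = sum(t[1] for t in sample)
        h0 = sum(t[2] for t in sample)
        n0 = sum(t[3] for t in sample)
        lifts.append(100.0 * (h1 / n1 - h0 / n0))
    lifts.sort()
    lower = lifts[int((alpha / 2) * n_boot)]
    upper = lifts[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Invented per-template counts for illustration only.
toy_templates = [(50, 100, 30, 100), (45, 100, 35, 100), (60, 100, 30, 100), (40, 100, 38, 100)]
ci_low, ci_high = cluster_bootstrap_lift(toy_templates)
print(f"95% CI for lift: [{ci_low:.1f}, {ci_high:.1f}] pp")
```

Resampling templates rather than individual prompts is a design choice: it answers whether the lift would persist under a different draw of templates, which is closer to the generalizability question the referee asks.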

Circularity Check

0 steps flagged

No significant circularity; purely empirical benchmark with independent verification path

full rationale

The paper describes an empirical benchmark (AuthorityBench) constructed via a balanced 2x2 factorial design crossing claim veracity with citation veracity, plus controlled prompt templates, venue tiers, and author names. Hallucination rates are measured directly from model outputs on the released dataset; no equations, fitted parameters, derivations, or predictions reduce any reported quantity to prior self-referential inputs. The open GitHub release of datasets and evaluation code supplies an external verification route independent of the paper itself. No self-citation chains, ansatzes, or uniqueness theorems appear as load-bearing elements. This is a standard empirical study whose central claims are falsifiable against the released artifacts.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper contributes an empirical benchmark rather than a derivation; it relies on the standard LLM-evaluation assumptions that hallucination can be measured via structured prompts and that the factorial crossing isolates authority signals.

axioms (1)
  • domain assumption: The 2x2 factorial design crossing claim veracity with citation veracity isolates citation-based authority signals independent of factual content.
    Explicitly stated as the benchmark purpose in the abstract.

pith-pipeline@v0.9.1-grok · 5752 in / 1337 out tokens · 30319 ms · 2026-06-27T07:23:56.668490+00:00 · methodology


Reference graph

Works this paper leans on

25 extracted references · 12 canonical work pages · 1 internal anchor

  1. [1]

    The 14th International Joint Conference on Natural Language Processing and The 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics , year =

    Nathan Mao and Varun Kaushik and Shreya Shivkumar and Parham Sharafoleslami and Kevin Zhu and Sunishchal Dev , title =. The 14th International Joint Conference on Natural Language Processing and The 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics , year =

  2. [2]

    2601.05866 , archivePrefix =

    Maxime Dassen and Rebecca Kotula and Kenton Murray and Andrew Yates and Dawn Lawrie and Efsun Kayi and James Mayfield and Kevin Duh , year =. 2601.05866 , archivePrefix =

  3. [3]

    and Henderson, Peter and Ho, Daniel E

    Lucia Zheng and Neel Guha and Brandon R. Anderson and Peter Henderson and Daniel E. Ho , title =. Proceedings of the Eighteenth International Conference on Artificial Intelligence and Law , series =. 2021 , pages =. doi:10.1145/3462757.3466088 , url =

  4. [4]

    Proceedings of the Conference on Health, Inference, and Learning , series =

    Ankit Pal and Logesh Kumar Umapathi and Malaikannan Sankarasubbu , title =. Proceedings of the Conference on Health, Inference, and Learning , series =. 2022 , publisher =

  5. [5]

    2025 , eprint =

    Qwen3 Technical Report , author =. 2025 , eprint =

  6. [6]

    2024 , howpublished =

  7. [7]

    2024 , howpublished =

    Washington &. 2024 , howpublished =

  8. [8]

    TruthfulQA: Measuring how models mimic human falsehoods

    Lin, Stephanie and Hilton, Jacob and Evans, Owain. TruthfulQA: Measuring How Models Mimic Human Falsehoods. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2022. doi:10.18653/v1/2022.acl-long.229

  9. [9]

    Halueval: A large-scale hallucination evaluation benchmark for large language models

    Li, Junyi and Cheng, Xiaoxue and Zhao, Xin and Nie, Jian-Yun and Wen, Ji-Rong. HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023. doi:10.18653/v1/2023.emnlp-main.397

  10. [10]

    A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions.ACM Trans

    Huang, Lei and Yu, Weijiang and Ma, Weitao and Zhong, Weihong and Feng, Zhangyin and Wang, Haotian and Chen, Qianglong and Peng, Weihua and Feng, Xiaocheng and Qin, Bing and Liu, Ting , title =. 2025 , month = jan, issn =. doi:10.1145/3703155 , url =

  11. [11]

    Enabling Large Language Models to Generate Text with Citations

    Gao, Tianyu and Yen, Howard and Yu, Jiatong and Chen, Danqi. Enabling Large Language Models to Generate Text with Citations. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023. doi:10.18653/v1/2023.emnlp-main.398

  12. [12]

    doi: 10.18653/v1/2023.emnlp-main.741

    Min, Sewon and Krishna, Kalpesh and Lyu, Xinxi and Lewis, Mike and Yih, Wen-tau and Koh, Pang and Iyyer, Mohit and Zettlemoyer, Luke and Hajishirzi, Hannaneh. FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023. doi:10.1...

  13. [13]

    Whose Facts Win?

    Jakob Schuster and Vagrant Gautam and Katja Markert , year =. Whose Facts Win?. 2601.03746 , archivePrefix =

  14. [14]

    2024 , howpublished =

    Hughes Hallucination Evaluation Model (. 2024 , howpublished =

  15. [15]

    2023 , note =

    Boothe, Andy , title =. 2023 , note =

  16. [16]

    URL https://doi.org/10.18653/v1/D19-1259

    Jin, Qiao and Dhingra, Bhuwan and Liu, Zhengping and Cohen, William and Lu, Xinghua. PubMedQA: A Dataset for Biomedical Research Question Answering. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019. doi:10.18653/v1/D19-1259

  17. [17]

    Fact or Fiction: Verifying Scientific Claims

    Wadden, David and Lin, Shanchuan and Lo, Kyle and Wang, Lucy Lu and van Zuylen, Madeleine and Cohan, Arman and Hajishirzi, Hannaneh. Fact or Fiction: Verifying Scientific Claims. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2020. doi:10.18653/v1/2020.emnlp-main.609

  18. [18]

    and Gardner, Matt

    Welbl, Johannes and Liu, Nelson F. and Gardner, Matt. Crowdsourcing Multiple Choice Science Questions. Proceedings of the 3rd Workshop on Noisy User-generated Text. 2017. doi:10.18653/v1/W17-4413

  19. [19]

    Knowledge conflicts for LLMs: A survey

    Xu, Rongwu and Qi, Zehan and Guo, Zhijiang and Wang, Cunxiang and Wang, Hongru and Zhang, Yue and Xu, Wei. Knowledge Conflicts for LLMs: A Survey. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.486

  20. [20]

    FEVER: a large-scale dataset for Fact Extraction and VERification

    Thorne, James and Vlachos, Andreas and Christodoulopoulos, Christos and Mittal, Arpit. FEVER: a Large-scale Dataset for Fact Extraction and VERification. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). 2018. doi:10.18653/v1/N18-1074

  21. [21]

    Algorithmic Inheritance: Surname Bias in

    Pat Pataranutaporn and Nattavudh Powdthavee and Pattie Maes , year =. Algorithmic Inheritance: Surname Bias in. 2501.19407 , archivePrefix =

  22. [22]

    2024 , eprint =

    Hallucination is Inevitable: An Innate Limitation of Large Language Models , author =. 2024 , eprint =

  23. [23]

    The Twelfth International Conference on Learning Representations , year =

    Adaptive Chameleon or Stubborn Sloth: Revealing the Behavior of Large Language Models in Knowledge Conflicts , author =. The Twelfth International Conference on Learning Representations , year =

  24. [24]

    Gender bias and stereotypes in Large Language Models , url=

    Kotek, Hadas and Dockum, Rikker and Sun, David , title =. Proceedings of The. 2023 , isbn =. doi:10.1145/3582269.3615599 , url =

  25. [25]

    Proceedings of the 2024

    Wilson, Kyra and Caliskan, Aylin , title =. Proceedings of the 2024. 2025 , publisher =