Authority, Truth, and Citation Bias: A Large-Scale Multi-Domain Benchmark for Studying Epistemic Susceptibility in Large Language Models

Aravind Ramana RN; Aryan Khurana; Dhruv Kumar

arxiv: 2606.13104 · v1 · pith:2AEZTJSCnew · submitted 2026-06-11 · 💻 cs.LG


Pith reviewed 2026-06-27 07:23 UTC · model grok-4.3

classification 💻 cs.LG

keywords LLM hallucination · citation bias · epistemic susceptibility · AuthorityBench · multi-domain benchmark · claim veracity · fabricated citations · authority signals


The pith

Citations, whether real or fabricated, raise hallucination rates in large language models compared to prompts with no citations at all.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds a balanced benchmark to isolate how citation authority signals affect whether models generate accurate answers or invent content. It crosses true and false claims with real, fake, and absent citations across thousands of examples in four domains. Models hallucinate more often whenever citations appear, and the increase is largest when invented citations are attached to correct facts. The pattern appears in general knowledge, science, and medicine but is weaker for legal statements. Neither the prestige of the cited source nor the apparent origin of the authors changes the effect much.

Core claim

AuthorityBench applies a fully crossed 2x2 design of claim veracity by citation veracity to 220,564 prompts and shows that any citation presence elevates hallucination rates over a no-citation baseline, with fabricated citations paired to true claims producing the largest rise of 3 to 22 percentage points and peak rates of 35 to 77 percent in the general-knowledge domain.

What carries the argument

AuthorityBench, the 220,564-prompt benchmark that uses a 2x2 factorial crossing of claim veracity with citation veracity while holding prompt templates, venue prestige tiers, and country-coded author names fixed.
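As a rough illustration of the crossing described above, the sketch below enumerates the four cells of a 2x2 design and checks whether a set of prompts is balanced across them. The level names and the helpers `factorial_cells` and `is_balanced` are hypothetical and are not taken from the AuthorityBench repository; the no-citation baseline is left out for brevity.

```python
from collections import Counter
from itertools import product

# Hypothetical level names; the benchmark's own labels may differ.
CLAIM_LEVELS = ("true_claim", "false_claim")
CITATION_LEVELS = ("real_citation", "fabricated_citation")


def factorial_cells():
    """Return the four (claim, citation) cells of the 2x2 crossing."""
    return list(product(CLAIM_LEVELS, CITATION_LEVELS))


def is_balanced(prompt_cells):
    """True if every cell appears, and all cells appear equally often."""
    counts = Counter(prompt_cells)
    sizes = {counts.get(cell, 0) for cell in factorial_cells()}
    return len(sizes) == 1 and 0 not in sizes


# Toy data: three prompts per cell (invented for illustration).
toy_prompts = factorial_cells() * 3
print(is_balanced(toy_prompts))       # True
print(is_balanced(toy_prompts[:-1]))  # False
```

A balance check of this kind is what lets a difference between cells be read as an effect of the crossed factors rather than of unequal sampling.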

If this is right

Citation presence alone drives higher hallucination rates than a no-citation baseline across domains.
The largest hallucination increases occur when fabricated citations accompany true claims.
Legal-domain claims show smaller susceptibility to citation authority than general-knowledge, science, or medical claims.
Venue prestige levels and author country signals produce negligible differences in hallucination rates.
The observed effect operates independently of whether the underlying claim is factually correct.
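The "rise" in these statements is a difference in hallucination rates measured in percentage points against the no-citation baseline. A minimal sketch of that arithmetic follows, assuming each model output has already been labelled 1 (hallucinated) or 0 (not); the function names and the toy labels are invented for illustration and do not reproduce the paper's scoring rule.

```python
def hallucination_rate(labels):
    """Share of outputs flagged as hallucinated, in percent."""
    if not labels:
        raise ValueError("need at least one labelled output")
    return 100.0 * sum(labels) / len(labels)


def lift_pp(condition_labels, baseline_labels):
    """Percentage-point lift of a citation condition over the baseline."""
    return hallucination_rate(condition_labels) - hallucination_rate(baseline_labels)


# Toy labels (invented): 20% hallucination at baseline, 35% with a citation.
baseline = [1] * 20 + [0] * 80
with_citation = [1] * 35 + [0] * 65
print(lift_pp(with_citation, baseline))  # 15.0
```

A positive value means the citation condition hallucinates more often than the baseline; a negative value would correspond to the suppression pattern mentioned for some models on false claims.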

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Training procedures that down-weight citation tokens when judging claim truth could lower the observed hallucination increases.
Retrieval-augmented systems may inherit the same citation-driven errors unless citation handling is explicitly decoupled from fact checking.
Legal-domain applications of language models may need fewer safeguards against citation bias than general-purpose chat systems.
Simple prompt variants that suppress citation fields offer a low-cost way to test mitigation outside the benchmark setting.

Load-bearing premise

The factorial design and its controls for prompts, venues, and author names succeed in separating citation authority signals from the actual truth value of the claims being presented.

What would settle it

A new run of the same claims and citation conditions that records no measurable rise in hallucination rates when citations are added versus the no-citation baseline.
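One simple way to judge whether a replication shows "no measurable rise" is a two-proportion z-test on hallucination counts with and without citations. The sketch below is an assumption about how such a check could look, not the paper's analysis; it treats outputs as independent and so ignores clustering by prompt template or claim, and the counts are invented.

```python
import math


def two_proportion_z(h_cond, n_cond, h_base, n_base):
    """Two-sided z-test for a difference between two hallucination rates.

    h_* are counts of hallucinated outputs, n_* are total outputs.
    Returns (z, p_value).
    """
    p_cond = h_cond / n_cond
    p_base = h_base / n_base
    pooled = (h_cond + h_base) / (n_cond + n_base)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_cond + 1.0 / n_base))
    if se == 0.0:
        raise ValueError("rates are all 0 or all 1; the test is undefined")
    z = (p_cond - p_base) / se
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Toy counts (invented): 350/1000 with citations vs 200/1000 without.
z, p = two_proportion_z(350, 1000, 200, 1000)
print(f"z = {z:.2f}, p = {p:.3g}")
```

A large p-value in a well-powered rerun would be evidence against the claimed rise; a small one would be consistent with it.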

Figures

Figures reproduced from arXiv: 2606.13104 by Aravind Ramana RN, Aryan Khurana, Dhruv Kumar.

**Figure 1.** The 2x2 factorial design. Note: the 35–77% figure is domain-specific (general knowledge), and per-model full results appear in …

**Figure 2.** Dataset construction pipeline.

**Figure 3.** Secondary results (15K subset). (a) Domain effects under TC×FC across models. (b) Template structure effects, showing average TC×FC lift by citation format. (c) Venue prestige null result across four tiers. (d) Author demographics: hallucination rate by surname region. (e) Prestige × domain interaction (elite vs. low-tier lift). (f) Domain alignment effect: same-domain vs. cross-domain citation hallucinati…

**Figure 4.** Hallucination rates across all five citation conditions for all seven models. The TC×FC column (true claim, fabricated citation) is universally the worst-performing condition across every model tested. Baseline, Base TC, and Base FC columns show full-dataset values (F) for Gemma 3 4B, Llama 3.1 8B, and Phi-4 Mini Instruct; 15K-subset values (S) for all others.

**Figure 5.** False-claim condition lifts by model. Bars show hallucination rate lift (pp) relative to the no-citation baseline for the FC×FC (fabricated citation) and FC×TC (real citation) conditions. Models in the suppression region (left) show citation-induced reduction in hallucination on false claims; models in the amplification region (right) show the opposite. DeepSeek V3.2 shows a near-zero effect.

Original abstract

Large language models are increasingly deployed in citation-augmented settings, yet the effect of citation presence on model behavior independent of factual content remains poorly understood. We introduce AuthorityBench, a 220,564-prompt multi-domain benchmark that isolates how citation-based authority signals influence epistemic behavior in LLMs. The benchmark uses a fully balanced 2x2 factorial design crossing claim veracity with citation veracity, the first to do so, across four domains (general knowledge, science, law, and medicine), with controlled variation over 40 prompt templates, four venue prestige tiers, and a country-coded author name dataset. Evaluating seven models on 12 structured research questions, we find that citation presence, whether real or fabricated, consistently increases hallucination rates relative to a no-citation baseline. The effect is strongest when fabricated citations accompany true claims, raising hallucination rates by 3 to 22 percentage points and reaching 35 to 77% in the general knowledge domain, while legal claims are comparatively robust and venue prestige and author demographics show negligible impact. All datasets and evaluation code are available at: https://github.com/floating-reeds/AuthorityBench

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

AuthorityBench gives a large, controlled benchmark that isolates citation effects on LLM hallucinations via a clean 2x2 design across domains.


Colleague,

The main takeaway is that this paper delivers AuthorityBench, a 220k-prompt dataset built on a balanced 2x2 factorial crossing claim truth with citation truth. It runs across general knowledge, science, law, and medicine, with 40 prompt templates, four venue tiers, and country-coded author names as controls. They test seven models and report that citation presence raises hallucination rates over a no-citation baseline, with the largest lifts when fabricated citations sit next to true claims.

The construction is the strongest part. The scale and the explicit factorial isolation let them separate authority signals from content. Releasing the datasets and evaluation code is useful for anyone who wants to rerun or extend the tests. The domain differences, especially legal robustness versus higher rates in general knowledge, and the near-zero impact from venue or demographics, are clear secondary results that follow from the design.

The softer spots are limited. The abstract skips the exact hallucination scoring rule and any statistical tests behind the 3-to-22-point shifts, so the numbers need the full methods to confirm they survive variance checks or prompt artifacts. Nothing in the reported structure looks circular or internally inconsistent, though.

This is for researchers working on LLM evaluation, citation-augmented systems, or epistemic robustness. Anyone building benchmarks or testing models in knowledge domains will find the resource practical. It deserves serious referee time because the new controlled tool and the open verification path are substantive enough to warrant external input.

Referee Report

2 major / 1 minor

Summary. The paper introduces AuthorityBench, a 220,564-prompt multi-domain benchmark using a balanced 2x2 factorial design that crosses claim veracity (true or false) with citation veracity (real or fabricated) and compares each cell against a no-citation baseline. The benchmark spans general knowledge, science, law, and medicine, with controls for 40 prompt templates, four venue prestige tiers, and country-coded author names. Evaluating seven LLMs, the central claim is that citation presence (real or fabricated) consistently raises hallucination rates relative to the no-citation baseline. The largest effects occur when fabricated citations accompany true claims: a rise of 3–22 percentage points, with hallucination rates reaching 35–77% in general knowledge. Legal claims are more robust, while venue prestige and author demographics show negligible impact. Datasets and evaluation code are released publicly.

Significance. If the effects hold under the stated controls, the work supplies a large-scale, open benchmark for isolating citation-based authority signals from factual content in LLM epistemic behavior. The factorial design and public release of data plus code constitute clear strengths for reproducibility and extension by the community.

major comments (2)

[Methods] Methods/Evaluation section: The operational definition of hallucination used to label model outputs and compute all reported rates is not provided in the manuscript (only referenced via the GitHub repository). This definition is load-bearing for the central claims about 3–22 percentage-point increases.
[Results] Results section: The manuscript reports percentage-point changes without statistical tests, confidence intervals, or analysis of variance across the 40 prompt templates and seven models. This omission makes it difficult to assess whether the reported effect sizes are reliable or generalizable.

minor comments (1)

[Abstract] The abstract is lengthy; condensing the description of controls while preserving the key numerical findings would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight important aspects of clarity and statistical rigor. We address each major comment below and will incorporate the suggested changes into the revised manuscript.


Referee: [Methods] Methods/Evaluation section: The operational definition of hallucination used to label model outputs and compute all reported rates is not provided in the manuscript (only referenced via the GitHub repository). This definition is load-bearing for the central claims about 3–22 percentage-point increases.

Authors: We agree that the operational definition of hallucination is essential for interpreting the results and should appear in the main text rather than solely in the repository. In the revised manuscript we will expand the Methods/Evaluation section to include a complete description of the labeling procedure, including the exact criteria used to classify outputs as hallucinations.
revision: yes

Referee: [Results] Results section: The manuscript reports percentage-point changes without statistical tests, confidence intervals, or analysis of variance across the 40 prompt templates and seven models. This omission makes it difficult to assess whether the reported effect sizes are reliable or generalizable.

Authors: We concur that formal statistical support would strengthen the presentation of effect sizes. In the revision we will add bootstrap confidence intervals for the reported percentage-point differences and include mixed-effects or ANOVA-style analyses that account for variation across the 40 prompt templates and seven models.
revision: yes
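For readers unfamiliar with the bootstrap intervals mentioned in this response, the sketch below shows a percentile bootstrap for a percentage-point difference between two sets of 0/1 hallucination labels. It is an illustration under simplifying assumptions, not the authors' procedure: it resamples individual outputs and ignores grouping by template or model, which the mixed-effects analysis is meant to handle. The function name, defaults, and toy labels are hypothetical.

```python
import random


def bootstrap_lift_ci(cond, base, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the lift (in pp) of `cond` over `base`.

    `cond` and `base` are lists of 0/1 labels (1 = hallucinated).
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        # Resample each group with replacement at its original size.
        c = rng.choices(cond, k=len(cond))
        b = rng.choices(base, k=len(base))
        diffs.append(100.0 * (sum(c) / len(c) - sum(b) / len(b)))
    diffs.sort()
    lo_idx = max(0, int((alpha / 2) * n_boot))
    hi_idx = min(n_boot - 1, int((1 - alpha / 2) * n_boot) - 1)
    return diffs[lo_idx], diffs[hi_idx]


# Toy labels (invented): 35% vs 20% hallucination, 100 outputs each.
cond = [1] * 35 + [0] * 65
base = [1] * 20 + [0] * 80
low, high = bootstrap_lift_ci(cond, base)
print(f"95% CI for lift: [{low:.1f}, {high:.1f}] pp")
```

An interval that excludes zero suggests the lift is unlikely to be resampling noise alone, subject to the independence caveat above.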

Circularity Check

0 steps flagged

No significant circularity; purely empirical benchmark with independent verification path


The paper describes an empirical benchmark (AuthorityBench) constructed via a balanced 2x2 factorial design crossing claim veracity with citation veracity, plus controlled prompt templates, venue tiers, and author names. Hallucination rates are measured directly from model outputs on the released dataset; no equations, fitted parameters, derivations, or predictions reduce any reported quantity to prior self-referential inputs. The open GitHub release of datasets and evaluation code supplies an external verification route independent of the paper itself. No self-citation chains, ansatzes, or uniqueness theorems appear as load-bearing elements. This is a standard empirical study whose central claims are falsifiable against the released artifacts.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper contributes an empirical benchmark rather than a derivation; relies on standard assumptions in LLM evaluation that hallucination can be measured via structured prompts and that the factorial crossing isolates authority signals.

axioms (1)

Domain assumption: The 2x2 factorial design crossing claim veracity with citation veracity isolates citation-based authority signals independent of factual content.
Explicitly stated as the benchmark purpose in the abstract.



Reference graph

Works this paper leans on

25 extracted references · 12 canonical work pages · 1 internal anchor

[1] Nathan Mao, Varun Kaushik, Shreya Shivkumar, Parham Sharafoleslami, Kevin Zhu, and Sunishchal Dev. The 14th International Joint Conference on Natural Language Processing and The 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics.

[2] Maxime Dassen, Rebecca Kotula, Kenton Murray, Andrew Yates, Dawn Lawrie, Efsun Kayi, James Mayfield, and Kevin Duh. arXiv:2601.05866.

[3] Lucia Zheng, Neel Guha, Brandon R. Anderson, Peter Henderson, and Daniel E. Ho. Proceedings of the Eighteenth International Conference on Artificial Intelligence and Law. 2021. doi:10.1145/3462757.3466088

[4] Ankit Pal, Logesh Kumar Umapathi, and Malaikannan Sankarasubbu. Proceedings of the Conference on Health, Inference, and Learning. 2022.

[5] Qwen3 Technical Report. 2025.

[6] 2024.

[7] Washington &. 2024.

[8] Stephanie Lin, Jacob Hilton, and Owain Evans. TruthfulQA: Measuring How Models Mimic Human Falsehoods. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2022. doi:10.18653/v1/2022.acl-long.229

[9] Junyi Li, Xiaoxue Cheng, Xin Zhao, Jian-Yun Nie, and Ji-Rong Wen. HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023. doi:10.18653/v1/2023.emnlp-main.397

[10] Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, and Ting Liu. A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions. ACM Trans. January 2025. doi:10.1145/3703155

[11] Tianyu Gao, Howard Yen, Jiatong Yu, and Danqi Chen. Enabling Large Language Models to Generate Text with Citations. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023. doi:10.18653/v1/2023.emnlp-main.398

[12] Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023. doi:10.18653/v1/2023.emnlp-main.741

[13] Jakob Schuster, Vagrant Gautam, and Katja Markert. Whose Facts Win? arXiv:2601.03746.

[14] Hughes Hallucination Evaluation Model. 2024.

[15] Andy Boothe. 2023.

[16] Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William Cohen, and Xinghua Lu. PubMedQA: A Dataset for Biomedical Research Question Answering. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019. doi:10.18653/v1/D19-1259

[17] David Wadden, Shanchuan Lin, Kyle Lo, Lucy Lu Wang, Madeleine van Zuylen, Arman Cohan, and Hannaneh Hajishirzi. Fact or Fiction: Verifying Scientific Claims. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2020. doi:10.18653/v1/2020.emnlp-main.609

[18] Johannes Welbl, Nelson F. Liu, and Matt Gardner. Crowdsourcing Multiple Choice Science Questions. Proceedings of the 3rd Workshop on Noisy User-generated Text. 2017. doi:10.18653/v1/W17-4413

[19] Rongwu Xu, Zehan Qi, Zhijiang Guo, Cunxiang Wang, Hongru Wang, Yue Zhang, and Wei Xu. Knowledge Conflicts for LLMs: A Survey. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.486

[20] James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. FEVER: a Large-scale Dataset for Fact Extraction and VERification. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). 2018. doi:10.18653/v1/N18-1074

[21] Pat Pataranutaporn, Nattavudh Powdthavee, and Pattie Maes. Algorithmic Inheritance: Surname Bias in. arXiv:2501.19407.

[22] Hallucination is Inevitable: An Innate Limitation of Large Language Models. 2024.

[23] Adaptive Chameleon or Stubborn Sloth: Revealing the Behavior of Large Language Models in Knowledge Conflicts. The Twelfth International Conference on Learning Representations.

[24] Hadas Kotek, Rikker Dockum, and David Sun. Gender bias and stereotypes in Large Language Models. Proceedings of The. 2023. doi:10.1145/3582269.3615599

[25] Kyra Wilson and Aylin Caliskan. Proceedings of the 2024. 2025.
