PhantomBench: Benchmarking the Non-existential Threat of Language Models

Haeji Jung; Hila Gonen

arxiv: 2606.11105 · v1 · pith:KDVPM7QGnew · submitted 2026-06-09 · 💻 cs.CL · cs.AI

Pith reviewed 2026-06-27 12:55 UTC · model grok-4.3

keywords: hallucination · language models · benchmark · non-existent concepts · abstention · knowledge limits · frontier models · phantom terms

The pith

Language models answer questions about non-existent concepts instead of abstaining, with average hallucination rates as high as 86.7 percent in some settings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PhantomBench, a dataset of more than 60,000 non-existent terms and entities built from real concepts across domains, to measure whether language models can detect the limits of their knowledge. Evaluation across 21 models shows consistently high hallucination rates, meaning that models produce answers rather than refusing. The problem persists even in frontier systems, especially when questions presuppose that the concept exists. The benchmark also functions as a proxy for how models behave on rare real concepts that trigger similar errors. A pipeline is supplied for generating additional non-existent terms tailored to particular research needs.

Core claim

PhantomBench shows that language models of various types and sizes produce factually ungrounded responses about non-existent terms and entities at high rates, with averages reaching 86.7 percent in some cases, and that even frontier models fail to abstain reliably when the input presumes the concept is real.

What carries the argument

PhantomBench, a benchmark of over 60K non-existent terms and entities derived from real concepts to test hallucination versus abstention.

Load-bearing premise

The constructed non-existent terms are verifiably non-existent and the evaluation protocol correctly distinguishes hallucination from legitimate responses without systematic bias.

What would settle it

A model that abstains correctly on the majority of PhantomBench queries while still answering accurately on comparable real concepts would challenge the reported hallucination rates.
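The settling test above can be sketched as a small scoring routine. This is a minimal illustration, not the paper's evaluation code: it assumes every response has already been labeled (abstained or not on phantom items, correct or not on real items), it treats any non-abstention on a phantom item as a hallucination, and the function names and the 0.5 "majority" threshold are hypothetical choices.

```python
def abstention_rate(abstained: list[bool]) -> float:
    """Fraction of responses labeled as abstentions (True = abstained)."""
    if not abstained:
        raise ValueError("no responses to score")
    return sum(abstained) / len(abstained)


def hallucination_rate(abstained: list[bool]) -> float:
    """Assumption: on non-existent concepts, every non-abstention is a hallucination."""
    return 1.0 - abstention_rate(abstained)


def accuracy(correct: list[bool]) -> float:
    """Fraction of responses on real concepts labeled as correct."""
    if not correct:
        raise ValueError("no responses to score")
    return sum(correct) / len(correct)


def passes_settling_test(
    phantom_abstained: list[bool],
    real_correct: list[bool],
    threshold: float = 0.5,  # hypothetical cut-off for "the majority"
) -> bool:
    """True if the model abstains on most phantom items and answers most real items correctly."""
    return (
        abstention_rate(phantom_abstained) > threshold
        and accuracy(real_correct) > threshold
    )


# Toy labels for one model (illustrative only, not data from the paper).
phantom_abstained = [True, False, False, False]
real_correct = [True, True, True, False]

print(hallucination_rate(phantom_abstained))
print(passes_settling_test(phantom_abstained, real_correct))
```

Requiring both conditions at once matters: a model that refuses everything would score a perfect abstention rate on phantom items while being useless on real ones.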

Figures

Figures reproduced from arXiv: 2606.11105 by Haeji Jung, Hila Gonen.

**Figure 1.** The pipeline to construct PHANTOMBENCH. Existing concepts from seed terms and entities are decomposed into smaller components (words and n-grams) which are then recombined to form new concepts (§2.1). Frequency filter discards concepts found in a large corpus, considering concepts with zero matches as non-existent (§2.2). The resulting concepts are queried through diverse prompts targeting different attrib…
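The decompose, recombine, and filter steps named in the Figure 1 caption can be sketched in a few lines. This is a toy approximation under stated assumptions, not the authors' pipeline: components are whole words only, recombination pairs the first word of one seed with the last word of another, and the "large corpus" is a two-sentence list searched by substring. All function names, seed terms, and corpus sentences are made up for illustration.

```python
from itertools import product


def decompose(term: str) -> list[str]:
    """Split a seed term into word-level components."""
    return term.split()


def recombine(seed_terms: list[str]) -> set[str]:
    """Pair first words with last words across seeds to form new two-word candidates."""
    heads = {decompose(t)[0] for t in seed_terms}
    tails = {decompose(t)[-1] for t in seed_terms}
    seeds = set(seed_terms)
    return {
        f"{head} {tail}"
        for head, tail in product(heads, tails)
        if head != tail and f"{head} {tail}" not in seeds
    }


def count_matches(candidate: str, corpus: list[str]) -> int:
    """Toy stand-in for a corpus frequency lookup: documents containing the phrase."""
    needle = candidate.lower()
    return sum(needle in doc.lower() for doc in corpus)


def frequency_filter(candidates: set[str], corpus: list[str]) -> list[str]:
    """Keep only candidates with zero corpus matches, treated here as non-existent."""
    return sorted(c for c in candidates if count_matches(c, corpus) == 0)


# Illustrative inputs only.
seed_terms = ["atrial fibrillation", "renal stenosis"]
corpus = [
    "Atrial stenosis is discussed in chapter 3.",
    "Renal stenosis narrows the artery.",
]

candidates = recombine(seed_terms)
phantoms = frequency_filter(candidates, corpus)
print(phantoms)
```

A zero-match filter is only as strong as the corpus behind it, which is why the toy corpus here deliberately contains one recombined phrase: that candidate is discarded, while the other survives.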

**Figure 2.** Results of core models on the full benchmark; all models struggle to abstain. Axes: prompt type (Existence, Meaning, Date, Place) against Hallucination Rate (0% to 100%). Models: gemma-2-9b, gemma-3-12b, llama-3.1-8b, mistral-7b, qwen-2.5-7b, qwen-3-8b, with the model mean and per-dataset values marked. Source footnote 4: models with the most downloads within their model family as of Apr 2026 on Hugging Face (https://huggingface.co/).

**Figure 3.** Hallucination Rates (HR) on PHANTOM-T across different models and prompt types. Accompanying table (truncated in the source):

| Core Models | PHANTOM-T (terms): E | M | D | P | PHANTOM-E (entities): E | M | D | P |
|---|---|---|---|---|---|---|---|---|
| Llama 3.1 8B | 14.34 | 26.42 | 0.94 | 3.58 | 7.41 | 7.92 | 1.33 | 3.67 |
| Mistral 7B | 30.47 | 54.62 | 32.08 | 43.68 | 17.09 | 47.83 | 20.08 | 35.16 |
| Qwen 2.5 7B | 5.57 | 10.94 | 5.57 | 6.51 | 4.08 | 8.33 | 9.67 | 12.25 |
| Qwen 3 8B | 4.91 | 9.06 | 3.87 | 3.21 | 4.00 | 15.92 | 11.75 | 18.59 |
| Gemma 2 9B | 7.45 | 27.64 | 4.53 | 12.83 | 7… | | | |

**Figure 4.** Abstention rates on non-existent and common concepts, averaged across prompt types. Transparent dots indicate abstention rates for individual subsets under each prompt type. Accompanying text: … concepts are selected similarly from the lowest frequency concepts, having no more than 15 matches. Among the term datasets, legal terms did not contain any terms satisfying our criteria for rare, so we excluded them from this analysis…

**Figure 5.** Proportion of each fine-grained category of abstention type out of all abstention responses, per …

**Figure 6.** Prompt template for the LLM judge to make binary decisions on abstention.

**Figure 7.** Prompt template for LLM judge to classify responses into fine-grained categories. We consider …

Original abstract

Hallucinations, where language models (LMs) generate factually ungrounded responses, pose serious risks, as users tend to blindly rely on them. This is particularly concerning in high-stakes domains, where consequences of such model behavior can lead to significant harms. Despite notable progress in understanding hallucinations, it remains unclear how reliably these models can recognize the limits of their knowledge. We introduce PhantomBench, the first large-scale benchmark of its kind, comprising more than 60K non-existent terms and entities derived from real concepts across diverse domains. Using our benchmark, we evaluate a total of 21 models of various types and sizes. We show staggering hallucination rates across the board (with average rates as high as 86.7% in some cases), and note that even frontier models surprisingly fail to abstain on non-existent concepts, especially when the input presumes their existence. We then show that PhantomBench can serve as a proxy for studying model behavior on rare concepts for which models are more prone to hallucinate. We also provide a pipeline to construct PhantomBench, enabling scalable generation of non-existent concepts tailored to the specific needs of researchers and practitioners.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

PhantomBench scales up a test for model abstention on made-up terms but the high hallucination numbers rest on unverified claims that none of the 60K items actually exist.

This paper's core offering is PhantomBench, a collection of more than 60K non-existent terms built from real concepts, used to measure how often 21 models hallucinate instead of abstaining. The authors report average hallucination rates reaching 86.7% and note that even frontier models keep answering when the prompt treats the concept as real.

The scale and the supplied generation pipeline are the clearest strengths. Being able to produce domain-specific phantom items at this volume could be handy for people who want to test models on knowledge boundaries without relying on existing rare-entity lists. Tying the benchmark to behavior on low-frequency real concepts is also a sensible move that connects to prior work on long-tail hallucinations.

The soft spot is exactly the one flagged in the stress-test note: confirming the terms are verifiably non-existent. The abstract describes derivation from real concepts but gives no concrete steps for exhaustive checking against databases, web sources, or human review. If even a modest fraction of items have obscure real referents, the reported rates would be inflated and the claim about non-existential threats would weaken. The summary also omits error bars, exact scoring rules, and controls for question phrasing, so it is difficult to judge how stable the numbers are.
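The inflation argument can be made concrete with a back-of-the-envelope correction. The sketch below is an illustration, not an analysis from the paper: it assumes that a fraction `real_fraction` of benchmark items are actually real, that models answer those items at rate `real_answer_rate` (every such answer being wrongly scored as a hallucination), and that the observed rate is the mixture of the two groups. The function name and both parameters are hypothetical.

```python
def corrected_hallucination_rate(
    observed_rate: float,
    real_fraction: float,
    real_answer_rate: float = 1.0,
) -> float:
    """Solve observed = (1 - c) * true + c * a for the true rate on genuinely non-existent items.

    c is the assumed fraction of items that are real; a is the assumed rate at
    which models answer those real items.
    """
    if not 0.0 <= real_fraction < 1.0:
        raise ValueError("real_fraction must be in [0, 1)")
    if not 0.0 <= observed_rate <= 1.0 or not 0.0 <= real_answer_rate <= 1.0:
        raise ValueError("rates must be in [0, 1]")
    corrected = (observed_rate - real_fraction * real_answer_rate) / (1.0 - real_fraction)
    # Clamp in case the assumed contamination exceeds what the observed rate allows.
    return min(1.0, max(0.0, corrected))


# Sweep a few hypothetical contamination levels against the headline 86.7% figure.
for fraction in (0.0, 0.05, 0.20):
    rate = corrected_hallucination_rate(0.867, fraction)
    print(f"assumed real fraction {fraction:.2f}: corrected rate {rate:.3f}")
```

Under these assumptions the size of the correction depends entirely on the contamination level, which cannot be estimated from the abstract alone; that is the quantity an existence audit would need to supply.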

This is aimed at researchers working on hallucination detection, model uncertainty, and safety evaluation. A reader who needs ideas for new benchmark construction would find the pipeline useful; anyone planning to cite the specific percentages would want the full methods first.

It deserves peer review because the idea is direct, the model coverage is broad, and the scale is ambitious. The verification gap is real but fixable with additional checks.

Referee Report

2 major / 2 minor

Summary. The paper introduces PhantomBench, a benchmark of more than 60K non-existent terms and entities derived from real concepts across domains. It evaluates 21 models of varying types and sizes, reporting average hallucination rates as high as 86.7%, with frontier models failing to abstain especially when inputs presume existence. PhantomBench is also positioned as a proxy for rare-concept behavior, and a construction pipeline is supplied to enable scalable generation of such terms.

Significance. If the non-existence of all terms is rigorously verified and the evaluation protocol is free of systematic bias in phrasing or scoring, the benchmark would provide a useful large-scale resource for quantifying and mitigating models' inability to recognize knowledge boundaries, with direct relevance to hallucination risks in high-stakes settings.

major comments (2)

[Benchmark construction] Benchmark construction section: the pipeline derives terms from real concepts but supplies no description of an exhaustive, automated existence-verification procedure (e.g., multi-source KB lookup, web search, or human audit) that would confirm none of the 60K+ items have obscure real-world referents. This verification is load-bearing for the headline 86.7% hallucination rates; any non-negligible fraction of real entities would cause legitimate model responses to be scored as hallucinations, inflating the reported figures.
[Evaluation and results] Evaluation and results sections: quantitative claims on 21 models are presented without accompanying details on prompting templates, exact criteria distinguishing hallucination from abstention or partial knowledge, inter-annotator agreement (if human scoring is used), or controls for phrasing bias that presumes existence. These omissions prevent assessment of whether the protocol systematically overestimates non-abstention.
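For the agreement statistics requested in the second major comment, one common choice is Cohen's kappa between the automatic judge and a human annotator on the same binary abstention labels. The sketch below is a generic implementation of that statistic, offered as an assumption about what such a check could look like; the paper's own validation procedure is not described in the material reviewed here, and the label lists are invented.

```python
def cohens_kappa(labels_a: list[bool], labels_b: list[bool]) -> float:
    """Cohen's kappa for two raters assigning binary labels to the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items where both raters gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each rater's marginal rate of positive labels.
    p_a = sum(labels_a) / n
    p_b = sum(labels_b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if expected == 1.0:
        # Both raters used a single identical label throughout; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1 - expected)


# Illustrative labels (True = "response abstained"); not data from the paper.
judge_labels = [True, True, False, False, True, False, True, False]
human_labels = [True, True, False, False, True, False, False, False]

kappa = cohens_kappa(judge_labels, human_labels)
print(round(kappa, 3))
```

Kappa is preferable to raw agreement here because abstention labels can be heavily imbalanced, and raw agreement looks high whenever one label dominates.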

minor comments (2)

[Abstract] The abstract states results on 21 models but omits any methods summary, error bars, or validation statistics; adding a concise methods summary to the abstract would improve readability.
[Proxy analysis] The claim that PhantomBench serves as a 'proxy for studying model behavior on rare concepts' requires explicit correlation analysis or ablation showing that performance on phantom items predicts performance on verified rare real items; this link is asserted but not demonstrated in the provided summary.
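The correlation analysis asked for in the proxy comment could take the form of a rank correlation across models between abstention on phantom items and abstention on verified rare real items. The sketch below implements Spearman's rho with the standard library only; the per-model rates are invented for illustration and are not results from the paper, and the helper names are hypothetical.

```python
def average_ranks(values: list[float]) -> list[float]:
    """Rank values from 1..n, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return covariance / (var_x * var_y) ** 0.5


def spearman(x: list[float], y: list[float]) -> float:
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-model abstention rates (one entry per model).
phantom_abstention = [0.10, 0.40, 0.25, 0.70]
rare_real_abstention = [0.05, 0.30, 0.35, 0.60]

rho = spearman(phantom_abstention, rare_real_abstention)
print(round(rho, 3))
```

A rank correlation suits this question because the claim is about whether models are ordered the same way on both item types, not about whether the absolute rates match.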

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments. We address each major comment below and will revise the manuscript to address the identified gaps in documentation.

Referee: [Benchmark construction] Benchmark construction section: the pipeline derives terms from real concepts but supplies no description of an exhaustive, automated existence-verification procedure (e.g., multi-source KB lookup, web search, or human audit) that would confirm none of the 60K+ items have obscure real-world referents. This verification is load-bearing for the headline 86.7% hallucination rates; any non-negligible fraction of real entities would cause legitimate model responses to be scored as hallucinations, inflating the reported figures.

Authors: We agree that the manuscript does not currently describe the existence-verification procedure in sufficient detail. We will expand the Benchmark construction section to document the full verification pipeline, including the specific multi-source KB lookups, web searches, and any human audit steps used to confirm non-existence of all terms.

revision: yes

Referee: [Evaluation and results] Evaluation and results sections: quantitative claims on 21 models are presented without accompanying details on prompting templates, exact criteria distinguishing hallucination from abstention or partial knowledge, inter-annotator agreement (if human scoring is used), or controls for phrasing bias that presumes existence. These omissions prevent assessment of whether the protocol systematically overestimates non-abstention.

Authors: We concur that the evaluation protocol requires additional documentation for reproducibility and bias assessment. In the revision we will include the complete prompting templates, precise classification criteria for hallucination versus abstention or partial knowledge, inter-annotator agreement statistics where human scoring was involved, and any controls applied for phrasing bias.

revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark construction and direct model evaluation

The paper introduces a benchmark of >60K non-existent terms and reports hallucination rates from evaluating 21 models. No equations, fitted parameters, or derivations are present. The headline rates are direct empirical measurements on the constructed dataset rather than quantities that reduce to prior fits or self-citations by construction. The pipeline for term generation is described as a construction method, not a self-referential prediction. This is a standard empirical benchmark paper with no load-bearing self-citation chains or definitional loops.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The central claim rests on the validity of the phantom-term construction process and the assumption that model responses to presumed-existence prompts constitute hallucinations when the terms do not exist.

invented entities (1)

non-existent terms and entities (no independent evidence)
purpose: Test items that look like real concepts but do not exist, used to probe hallucination
Derived from real concepts but altered to be non-existent; no independent evidence provided that they are verifiably absent from all knowledge sources.

pith-pipeline@v0.9.1-grok · 5732 in / 1149 out tokens · 16781 ms · 2026-06-27T12:55:32.530335+00:00 · methodology


[1] [1]

arXiv preprint arXiv:2605.01428 , year=

Hallucinations undermine trust; metacognition is a way forward , author=. arXiv preprint arXiv:2605.01428 , year=

Pith/arXiv arXiv

[2] [2]

Attention is All you Need , url =

Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N and Kaiser, ukasz and Polosukhin, Illia , booktitle =. Attention is All you Need , url =

[3] [3]

Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages =

Soldaini, Luca and Kinney, Rodney and Bhagia, Akshita and Schwenk, Dustin and Atkinson, David and Authur, Russell and Bogin, Ben and Chandu, Khyathi and Dumas, Jennifer and Elazar, Yanai and Hofmann, Valentin and Jha, Ananya and Kumar, Sachin and Lucy, Li and Lyu, Xinxi and Lambert, Nathan and Magnusson, Ian and Morrison, Jacob and Muennighoff, Niklas and...

work page doi:10.18653/v1/2024.acl-long.840 2024

[4] [4]

First Conference on Language Modeling , year=

Infini-gram: Scaling Unbounded n-gram Language Models to a Trillion Tokens , author=. First Conference on Language Modeling , year=

[5] [5]

When Not to Trust Language Models: Investigating Effectiveness of Parametric and Non-Parametric Memories

Mallen, Alex and Asai, Akari and Zhong, Victor and Das, Rajarshi and Khashabi, Daniel and Hajishirzi, Hannaneh. When Not to Trust Language Models: Investigating Effectiveness of Parametric and Non-Parametric Memories. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023. doi:10.18653/v1/2023...

work page doi:10.18653/v1/2023.acl-long.546 2023

[6] [6]

The Alternative Annotator Test for LLM -as-a-Judge: How to Statistically Justify Replacing Human Annotators with LLM s

Calderon, Nitay and Reichart, Roi and Dror, Rotem. The Alternative Annotator Test for LLM -as-a-Judge: How to Statistically Justify Replacing Human Annotators with LLM s. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025. doi:10.18653/v1/2025.acl-long.782

work page doi:10.18653/v1/2025.acl-long.782 2025

[7] [7]

M ed INST : Meta Dataset of Biomedical Instructions

Han, Wenhan and Fang, Meng and Zhang, Zihan and Yin, Yu and Song, Zirui and Chen, Ling and Pechenizkiy, Mykola and Chen, Qingyu. M ed INST : Meta Dataset of Biomedical Instructions. Findings of the Association for Computational Linguistics: EMNLP 2024. 2024. doi:10.18653/v1/2024.findings-emnlp.482

work page doi:10.18653/v1/2024.findings-emnlp.482 2024

[8] [8]

Miao Xiong and Zhiyuan Hu and Xinyang Lu and YIFEI LI and Jie Fu and Junxian He and Bryan Hooi , booktitle=. Can. 2024 , url=

2024

[9] [9]

2025 , eprint=

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning , author=. 2025 , eprint=

2025

[10] [10]

2024 , eprint=

The Llama 3 Herd of Models , author=. 2024 , eprint=

2024

[11] Gemma 3 Technical Report. 2025.

[12] Gemma 2: Improving Open Language Models at a Practical Size. 2024.

[13] Mistral 7B. 2023.

[14] gpt-oss-120b & gpt-oss-20b Model Card. 2025.

[15] Labrak, Yanis and Bazoge, Adrien and Morin, Emmanuel and Gourraud, Pierre-Antoine and Rouvier, Mickael and Dufour, Richard. BioMistral: A Collection of Open-Source Pretrained Large Language Models for Medical Domains. Findings of the Association for Computational Linguistics: ACL 2024. 2024. doi:10.18653/v1/2024.findings-acl.348

[16] MedGemma Technical Report. 2025.

[17] MEDITRON-70B: Scaling Medical Pretraining for Large Language Models. 2023.

[18] SaulLM-7B: A pioneering Large Language Model for Law. 2024.

[19] SaulLM-54B & SaulLM-141B: Scaling Up Domain Adaptation for the Legal Domain. 2024.

[20] Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities. 2025.

[21] Qwen2.5 Technical Report. 2025.

[22] Qwen3 Technical Report. 2025.

[23] Röttger et al. XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.naacl-long.301

[24] Wen, Bingbing and Yao, Jihan and Feng, Shangbin and Xu, Chenjun and Tsvetkov, Yulia and Howe, Bill and Wang, Lucy Lu. Know Your Limits: A Survey of Abstention in Large Language Models. Transactions of the Association for Computational Linguistics, 13:529–556. 2025. doi:10.1162/tacl_a_00754

[25] Thinking to Recall: How Reasoning Unlocks Parametric Knowledge in LLMs. 2026.

[26] Groeneveld, Dirk and Beltagy, Iz and Walsh, Evan and Bhagia, Akshita and Kinney, Rodney and Tafjord, Oyvind and Jha, Ananya and Ivison, Hamish and Magnusson, Ian and Wang, Yizhong and Arora, Shane and Atkinson, David and Authur, Russell and Chandu, Khyathi and Cohan, Arman and Dumas, Jennifer and Elazar, Yanai and Gu, Yuling and Hessel, Jack and Khot, Tus... Proceedings of the 62nd ... 2024. doi:10.18653/v1/2024.acl-long.841

[27] Reasoning Models Hallucinate More: Factuality-Aware Reinforcement Learning for Large Reasoning Models. The Thirty-ninth Annual Conference on Neural Information Processing Systems.

[28] Are Reasoning Models More Prone to Hallucination? 2025.

[29] Huang, Lei and Yu, Weijiang and Ma, Weitao and Zhong, Weihong and Feng, Zhangyin and Wang, Haotian and Chen, Qianglong and Peng, Weihua and Feng, Xiaocheng and Qin, Bing and Liu, Ting. A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions. ACM Trans. Inf. Syst. January 2025. doi:10.1145/3703155

[30] A Unified Definition of Hallucination: It's The World Model, Stupid! 2026.

[31] Why Language Models Hallucinate. 2025.

[32] Yiyou Sun and Yu Gai and Lijie Chen and Abhilasha Ravichander and Yejin Choi and Nouha Dziri and Dawn Song. Why and How ... 2026.

[33] Large Language Models Struggle to Learn Long-Tail Knowledge. Proceedings of the 40th International Conference on Machine Learning. 2023.

[34] Lin, Stephanie and Hilton, Jacob and Evans, Owain. TruthfulQA: Measuring How Models Mimic Human Falsehoods. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2022. doi:10.18653/v1/2022.acl-long.229

[35] Li, Junyi and Cheng, Xiaoxue and Zhao, Xin and Nie, Jian-Yun and Wen, Ji-Rong. HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023. doi:10.18653/v1/2023.emnlp-main.397

[36] Min, Sewon and Krishna, Kalpesh and Lyu, Xinxi and Lewis, Mike and Yih, Wen-tau and Koh, Pang and Iyyer, Mohit and Zettlemoyer, Luke and Hajishirzi, Hannaneh. FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023. doi:10.18653/v1/2023.emnlp-main.741

[37] Bang, Yejin and Ji, Ziwei and Schelten, Alan and Hartshorn, Anthony and Fowler, Tara and Zhang, Cheng and Cancedda, Nicola and Fung, Pascale. HalluLens: LLM Hallucination Benchmark. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025. doi:10.18653/v1/2025.acl-long.1176

[38] Brahman, Faeze and Kumar, Sachin and Balachandran, Vidhisha and Dasigi, Pradeep and Pyatkin, Valentina and Ravichander, Abhilasha and Wiegreffe, Sarah and Dziri, Nouha and Chandu, Khyathi and Hessel, Jack and Tsvetkov, Yulia and Smith, Noah A. and Choi, Yejin and Hajishirzi, Hannaneh. The Art of Saying No: Contextual Noncompliance in Language Mo... 2024. doi:10.52202/079017-1573

[39] Rejection Improves Reliability: Training LLMs to Refuse Unknown Questions Using RL from Knowledge Feedback. 2024.

[40] Polina Kirichenko and Mark Ibrahim and Kamalika Chaudhuri and Samuel Bell. AbstentionBench: Reasoning ... 2026.

[41] Yin, Zhangyue and Sun, Qiushi and Guo, Qipeng and Wu, Jiawen and Qiu, Xipeng and Huang, Xuanjing. Do Large Language Models Know What They Don't Know? Findings of the Association for Computational Linguistics: ACL 2023. 2023. doi:10.18653/v1/2023.findings-acl.551

[42] Amayuelas, Alfonso and Wong, Kyle and Pan, Liangming and Chen, Wenhu and Wang, William Yang. Knowledge of Knowledge: Exploring Known-Unknowns Uncertainty with Large Language Models. Findings of the Association for Computational Linguistics: ACL 2024. 2024. doi:10.18653/v1/2024.findings-acl.383

[43] Li, Moxin and Zhao, Yong and Zhang, Wenxuan and Li, Shuaiyi and Xie, Wenya and Ng, See-Kiong and Chua, Tat-Seng and Deng, Yang. Knowledge Boundary of Large Language Models: A Survey. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025. doi:10.18653/v1/2025.acl-long.256

[44] Examining LLMs' Uncertainty Expression Towards Questions Outside Parametric Knowledge. 2024.

[45] Yin, Xunjian and Huang, Baizhou and Wan, Xiaojun. ALCUNA: Large Language Models Meet New Knowledge. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023. doi:10.18653/v1/2023.emnlp-main.87

[46] Uluoglakci, Cem and Temizel, Tugba. HypoTermQA: Hypothetical Terms Dataset for Benchmarking Hallucination Tendency of LLMs. Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop. 2024. doi:10.18653/v1/2024.eacl-srw.9

[47] Zipf, George K.

[48] Steven T. Piantadosi. Zipf’s word frequency law in natural language: A critical review and future directions. 2014.

[49] Bybee, Joan. Language, Usage and Cognition.

[50] Lewis, Patrick and Perez, Ethan and Piktus, Aleksandra and Petroni, Fabio and Karpukhin, Vladimir and Goyal, Naman et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. Proceedings of the 34th International Conference on Neural Information Processing Systems.

[51] Chain of Thought Prompting Elicits Reasoning in Large Language Models. Advances in Neural Information Processing Systems. 2022.

[52] Zhang, Hanning and Diao, Shizhe and Lin, Yong and Fung, Yi and Lian, Qing and Wang, Xingyao and Chen, Yangyi and Ji, Heng and Zhang, Tong. R-Tuning: Instructing Large Language Models to Say 'I Don't Know'. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Vol... 2024. doi:10.18653/v1/2024.naacl-long.394

[53] Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models. The Thirteenth International Conference on Learning Representations.

[54] WildHallucinations: Evaluating Long-form Factuality in LLMs with Real-World Entity Queries. 2024.

[55] Large Legal Fictions: Profiling Legal Hallucinations in Large Language Models. Journal of Legal Analysis. January 2024. doi:10.1093/jla/laae003

[56] Weidinger, Laura and Uesato, Jonathan and Rauh, Maribeth and Griffin, Conor and Huang, Po-Sen and Mellor, John and Glaese, Amelia and Cheng, Myra and Balle, Borja and Kasirzadeh, Atoosa and Biles, Courtney and Brown, Sasha and Kenton, Zac and Hawkins, Will and Stepleton, Tom and Birhane, Abeba and Hendricks, Lisa Anne and Rimell, Laura and Isaac, William ... 2022. doi:10.1145/3531146.3533088