From Signals to Transfer: A Factorised Study of Probe-Based Uncertainty Estimation in Large Language Models

Anh Tuan Luu; Cong-Duy Nguyen; Duc Anh Vu; Ponhvoan Srey; Quang Minh Nguyen; Xiaobao Wu

arxiv: 2606.27679 · v1 · pith:HTDYHJ64new · submitted 2026-06-26 · 💻 cs.CL · cs.AI

Ponhvoan Srey, Xiaobao Wu, Cong-Duy Nguyen, Quang Minh Nguyen, Duc Anh Vu, Anh Tuan Luu

Pith reviewed 2026-06-29 05:02 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords probe-based uncertainty estimation · hallucination detection · large language models · distribution shift · factorised study · pretrained probes · internal model signals

The pith

A factorised study shows that raw hidden states are hard to beat in-domain for LLM uncertainty probes, while structured features are more robust under distribution shift.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper performs a controlled factorised analysis of probe-based uncertainty estimation for hallucination detection in large language models. It varies feature design, training data construction, and evaluation settings one at a time under matched conditions to isolate what drives results. Raw hidden states and attention features prove difficult to outperform when test data matches the training distribution. Structured and compressed features, however, deliver better robustness once the input distribution shifts. Prompting choices and label construction also exert strong effects on probe behaviour, and the authors use the resulting best practices to train pretrained probes that transfer to open-ended factual generation.
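To make the probe setup concrete, here is a minimal sketch of a linear probe scored by AUROC. It is not the authors' implementation: the features are synthetic Gaussian vectors standing in for hidden states, and the helper names (`make_synthetic`, `train_linear_probe`, `auroc`) and all hyperparameters are assumptions chosen for illustration.

```python
import math
import random


def auroc(scores, labels):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def probe_score(w, b, x):
    """Linear probe logit for one feature vector."""
    return b + sum(wi * xi for wi, xi in zip(w, x))


def train_linear_probe(X, y, lr=0.5, epochs=100):
    """Logistic-regression probe fitted with plain batch gradient descent."""
    dim, n = len(X[0]), len(X)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, target in zip(X, y):
            z = max(-30.0, min(30.0, probe_score(w, b, x)))  # clamp to avoid overflow
            err = 1.0 / (1.0 + math.exp(-z)) - target
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def make_synthetic(n, dim, signal, rng):
    """Hypothetical stand-in for hidden states: label-1 (hallucinated) rows are offset on dim 0."""
    X, y = [], []
    for _ in range(n):
        label = rng.randint(0, 1)
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        x[0] += signal * label
        X.append(x)
        y.append(label)
    return X, y


rng = random.Random(0)
X_train, y_train = make_synthetic(200, 8, 2.0, rng)
X_test, y_test = make_synthetic(200, 8, 2.0, rng)
w, b = train_linear_probe(X_train, y_train)
test_auroc = auroc([probe_score(w, b, x) for x in X_test], y_test)
print(f"in-domain AUROC on synthetic features: {test_auroc:.3f}")
```

In a real pipeline the synthetic vectors would be replaced by features extracted from the model, and the same `auroc` call would be repeated on a shifted test set to measure robustness.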

Core claim

Under matched conditions that isolate feature design, data construction, and evaluation setting, raw hidden states and attention features are difficult to outperform in-domain, yet structured and compressed features prove more robust under distribution shift; prompting and label construction significantly affect probe behaviour; and benchmark-based pretrained probes transfer reasonably well to open-ended factual generation.

What carries the argument

The factorised study design, which holds all but one variable fixed across experiments to isolate the separate contributions of feature choice, training data, and evaluation setting.
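One way to picture a one-factor-at-a-time design is as a baseline configuration plus sweeps that each change a single field. The sketch below is illustrative only: the factor names and levels (`hidden_state`, `chain_of_thought`, `llm_judge`, and so on) are hypothetical placeholders, not the paper's actual experimental grid.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class ProbeConfig:
    # All default values are hypothetical baseline choices.
    feature: str = "hidden_state"
    prompt: str = "direct"
    labeler: str = "exact_match"
    eval_setting: str = "in_domain"


# Hypothetical levels for each factor; the paper's real options may differ.
FACTOR_LEVELS = {
    "feature": ["hidden_state", "attention", "compressed"],
    "prompt": ["direct", "chain_of_thought"],
    "labeler": ["exact_match", "llm_judge"],
    "eval_setting": ["in_domain", "out_of_domain"],
}


def one_factor_sweeps(baseline):
    """Yield (factor, config) pairs that differ from the baseline in exactly one field."""
    for factor, levels in FACTOR_LEVELS.items():
        for level in levels:
            if level != getattr(baseline, factor):
                yield factor, replace(baseline, **{factor: level})


def n_differences(a, b):
    """Count the fields on which two configurations disagree."""
    return sum(getattr(a, f.name) != getattr(b, f.name) for f in fields(a))


baseline = ProbeConfig()
sweeps = list(one_factor_sweeps(baseline))
for factor, cfg in sweeps:
    print(factor, cfg)
```

Because every swept configuration differs from the baseline in one field, any change in a metric can be attributed to that field, provided the factors do not interact.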

If this is right

In-domain performance alone is insufficient to measure progress in probe-based uncertainty estimation.
Structured and compressed features should be preferred when probes must operate under distribution shift.
Prompting and label construction must be treated as first-class design choices rather than afterthoughts.
Benchmark-based pretrained probes can serve as stable off-the-shelf baselines for open-ended factual generation tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Future evaluations of uncertainty estimators should routinely include explicit distribution-shift tests rather than relying on in-domain metrics.
The transfer results suggest that pretrained probes could reduce the need for task-specific retraining across different generation settings.
Extending the same factorised design to other uncertainty methods might reveal whether the in-domain versus shift pattern holds beyond probes.

Load-bearing premise

The chosen matched conditions and distribution-shift setups isolate the contributions of feature design, training data construction, and evaluation setting without residual confounding from model choice or prompt formatting details.

What would settle it

A replication under the same matched conditions that finds raw hidden states still outperform structured features even after distribution shift would falsify the robustness claim.
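The falsification test above can be written as a small check over out-of-distribution AUROC values. This is a sketch under stated assumptions: comparing the best feature in each family is one possible reading of the criterion, the function name is hypothetical, and the numbers are toy placeholders rather than results from the paper.

```python
def robustness_claim_falsified(ood_auroc, raw_features, structured_features):
    """Return True if the best raw feature still beats the best structured one under shift."""
    best_raw = max(ood_auroc[name] for name in raw_features)
    best_structured = max(ood_auroc[name] for name in structured_features)
    return best_raw > best_structured


# Toy placeholder values, NOT taken from the paper.
toy_ood_auroc = {"hidden_state": 0.70, "attention": 0.68, "compressed": 0.74}

falsified = robustness_claim_falsified(
    toy_ood_auroc,
    raw_features=["hidden_state", "attention"],
    structured_features=["compressed"],
)
print("robustness claim falsified:", falsified)
```

A real replication would also need several seeds and a significance test before treating a small gap as decisive.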

Figures

Figures reproduced from arXiv: 2606.27679 by Anh Tuan Luu, Cong-Duy Nguyen, Duc Anh Vu, Ponhvoan Srey, Quang Minh Nguyen, Xiaobao Wu.

**Figure 2.** Effect of probe architecture on AUROC. … that reflects attention allocation between input and generated tokens, namely Recent-token Attention (Vazhentsev et al., 2025) and Lookback Ratio (Chuang et al., 2024). Finally, combined features concatenate complementary signals into Layer Topm Prob (He et al., 2024), Attention + MSP, Internal Variance, and SATMD + MSP. We provide more details on feature representa…

**Figure 3.** Dependence on number of training examples.

**Figure 5.** Impact of various automated annotation op…

**Figure 6.** Benchmark-transfer performance. Average AUROC across In-domain, Out-of-domain (same task), and…

**Figure 7.** AUROC and ECE of Llama-3.1-8B-Instruct and Qwen-3-4B-Instruct.

**Figure 9.** Effect of probe architecture on ECE.

**Figure 10.** Benchmark-transfer performance. Average ECE across In-domain, Out-of-domain (same task), and…

**Figure 11.** In-domain AUROC against OOD setting for individual benchmarks with Qwen-3-8B.

**Figure 12.** In-domain ECE against OOD setting for individual benchmarks with Qwen-3-8B.

**Figure 13.** AUROC for individual dataset with varying…
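Several figures report ECE (expected calibration error) alongside AUROC. As a reminder of what that metric measures, the sketch below computes a common equal-width-bin variant for binary probabilities: within each bin, it compares the mean predicted probability with the observed positive rate and weights the gap by bin size. The binning scheme and bin count are assumptions; the paper may use a different variant.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Equal-width-bin ECE for binary probabilities (one common variant)."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))

    total = len(probs)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_prob = sum(p for p, _ in bucket) / len(bucket)
        positive_rate = sum(y for _, y in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(mean_prob - positive_rate)
    return ece


# A perfectly calibrated toy case and an overconfident one.
calibrated = expected_calibration_error([0.0, 0.0, 1.0, 1.0], [0, 0, 1, 1])
overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 0, 0])
print(calibrated, overconfident)
```

A probe can rank examples well (high AUROC) and still be poorly calibrated (high ECE), which is why the two metrics are reported separately.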

Original abstract

Probe-based uncertainty estimation (UE) has emerged as a prominent approach to detect hallucinations in Large Language Models (LLMs) by learning uncertainty from internal model signals. Yet, recent methods vary simultaneously across feature design, training data construction, and evaluation setting, obscuring what actually drives performance. To address this issue, we propose a factorised study of probe-based UE under matched conditions. Our results show that raw hidden states and attention features are difficult to outperform in-domain. However, under distribution shift, structured and compressed features are more robust, suggesting that in-domain performance alone is insufficient to measure progress. Furthermore, prompting and label construction significantly affect probe behaviour. Building on these best-practice findings, we train benchmark-based pretrained probes that transfer reasonably well to open-ended factual generation, providing a stable off-the-shelf baseline. Our work encourages more deployment-oriented evaluation of probe-based uncertainty estimators. The code repository is available at https://github.com/ponhvoan/ProbeUE.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper runs matched ablations on probe-based UE and finds raw features hold up in-domain while structured ones are more robust under shift, plus some transfer results, but the isolation of factors may have gaps.

This paper runs a factorised ablation on probe-based uncertainty estimation and finds that raw hidden states are tough to beat in-domain, but structured features hold up better when the distribution shifts. Prompting and how labels are built also matter a lot for how the probes behave. They use the best practices from that to train some pretrained probes that transfer okay to open-ended factual generation.

The new part is doing all this under matched conditions so they can attribute differences to specific choices in features, data, and setting. That produces some patterns not seen together before. It does well at showing why in-domain tests aren't enough on their own and at providing an off-the-shelf baseline with public code.

The soft spot is whether the matching really keeps everything else constant. The stress-test raises a fair point that prompt or model interactions with feature type could still be confounding the robustness results, so the claim that in-domain performance is insufficient might not follow as directly. If the full paper has the details on how they controlled for that, it would help. The evidence is all empirical with no circularity issues.

This is for people working on making uncertainty estimators reliable in real deployments rather than just lab settings. A reader who wants to see a careful comparison of design choices would find it useful.

I'd recommend sending it for peer review. It's a useful incremental empirical piece that deserves checking.

Referee Report

2 major / 1 minor

Summary. The manuscript presents a factorised empirical study of probe-based uncertainty estimation (UE) for hallucination detection in LLMs. It varies feature design (raw hidden states/attention vs. structured/compressed), training data construction, and evaluation setting under matched conditions, reporting that raw features are difficult to outperform in-domain while structured features are more robust under distribution shift. Prompting and label construction are found to significantly affect probe behaviour, and benchmark-based pretrained probes are shown to transfer reasonably well to open-ended factual generation, supporting a call for deployment-oriented evaluation.

Significance. If the matched-conditions factorisation holds without residual confounding, the results would provide actionable best-practice guidance on feature choice and evaluation for probe-based UE, highlighting that in-domain performance is an incomplete proxy for progress and supplying a stable off-the-shelf baseline for factual-generation settings.

major comments (2)

[Abstract] Abstract (factorised study design): The attribution of robustness advantages under distribution shift to structured/compressed features requires that prompt formatting, label construction, and base-model choice are demonstrably orthogonal to feature type; the manuscript provides no explicit verification (e.g., interaction tables or ablation on prompt-feature crosses) that these factors do not interact, which is load-bearing for the claim that in-domain performance alone is insufficient.
[Abstract] Transfer results paragraph: The statement that benchmark-based pretrained probes 'transfer reasonably well' to open-ended factual generation is presented without quantitative metrics, error bars, or explicit baselines, preventing assessment of whether the transfer performance supports the deployment-oriented recommendation.
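The interaction check requested in the first major comment could take the form of a feature-by-prompt table whose cells are compared against a purely additive fit. The sketch below computes those residuals (cell minus row mean minus column mean plus grand mean). Residuals near zero suggest the two factors act independently. The function name and the toy scores are assumptions for illustration, not values from the manuscript.

```python
def interaction_residuals(table):
    """Residuals of table[feature][prompt] from an additive row-plus-column model."""
    features = list(table)
    prompts = list(next(iter(table.values())))
    n_cells = len(features) * len(prompts)

    grand = sum(table[f][p] for f in features for p in prompts) / n_cells
    row_mean = {f: sum(table[f][p] for p in prompts) / len(prompts) for f in features}
    col_mean = {p: sum(table[f][p] for f in features) / len(features) for p in prompts}

    return {
        f: {p: table[f][p] - row_mean[f] - col_mean[p] + grand for p in prompts}
        for f in features
    }


def max_abs_residual(residuals):
    """Largest absolute departure from additivity across all cells."""
    return max(abs(v) for row in residuals.values() for v in row.values())


# Toy placeholder AUROC values, NOT taken from the paper.
additive = {"raw": {"direct": 0.70, "cot": 0.80}, "structured": {"direct": 0.60, "cot": 0.70}}
interacting = {"raw": {"direct": 0.80, "cot": 0.80}, "structured": {"direct": 0.80, "cot": 0.60}}

print(max_abs_residual(interaction_residuals(additive)))
print(max_abs_residual(interaction_residuals(interacting)))
```

In practice each cell would be an average over seeds, and the residuals would be judged against the seed-to-seed standard deviation rather than against zero.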

minor comments (1)

[Abstract] The abstract states that code is available at the cited GitHub link but does not indicate whether the repository contains the exact experimental configurations, random seeds, and data splits used for the reported factorised comparisons.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our factorised study. Below we respond point-by-point to the major comments. We have revised the manuscript where the comments identify gaps that can be addressed without altering the core claims.

Referee: [Abstract] Abstract (factorised study design): The attribution of robustness advantages under distribution shift to structured/compressed features requires that prompt formatting, label construction, and base-model choice are demonstrably orthogonal to feature type; the manuscript provides no explicit verification (e.g., interaction tables or ablation on prompt-feature crosses) that these factors do not interact, which is load-bearing for the claim that in-domain performance alone is insufficient.

Authors: Our experimental design holds prompt formatting, label construction, and base-model choice fixed while varying only the feature type, which by construction isolates the contribution of feature design under matched conditions. This matching was applied uniformly across all in-domain and distribution-shift evaluations. We acknowledge that the manuscript does not include explicit interaction tables or prompt-feature cross-ablation results. To further substantiate the orthogonality claim, we will add a supplementary table summarising performance across a small set of prompt variations crossed with feature types.

revision: partial

Referee: [Abstract] Transfer results paragraph: The statement that benchmark-based pretrained probes 'transfer reasonably well' to open-ended factual generation is presented without quantitative metrics, error bars, or explicit baselines, preventing assessment of whether the transfer performance supports the deployment-oriented recommendation.

Authors: We agree that the abstract phrasing is qualitative and would benefit from supporting numbers to allow readers to evaluate the transfer claim. The full paper reports concrete metrics (including means and standard deviations across seeds) and comparisons against simple baselines in the transfer experiments. We will revise the abstract to include the key quantitative results and error bars from those experiments so that the statement is self-contained and evidence-based.

revision: yes

Circularity Check

0 steps flagged

Empirical ablation study with no derivation chain

This is an empirical factorised study reporting performance comparisons across feature designs, data constructions, and evaluation settings under matched conditions. No equations, derivations, or predictions are presented that reduce reported outcomes to quantities defined by the same fit or by self-citation chains. Central claims rest on experimental metrics rather than any self-definitional or fitted-input reductions, making the work self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical machine-learning study with no mathematical derivations, new physical entities, or parameter-free theoretical claims. No free parameters, axioms, or invented entities are introduced beyond standard supervised probe training.

pith-pipeline@v0.9.1-grok · 5719 in / 1193 out tokens · 26513 ms · 2026-06-29T05:02:50.786458+00:00 · methodology

Reference graph

Works this paper leans on

55 extracted references · 29 canonical work pages

[1] Azaria, Amos and Mitchell, Tom. The Internal State of an LLM Knows When It's Lying. Findings of the Association for Computational Linguistics: EMNLP 2023. 2023. doi:10.18653/v1/2023.findings-emnlp.68

[2] Zhu, Derui and Chen, Dingfan and Li, Qing and Chen, Zongxiong and Ma, Lei and Grossklags, Jens and Fritz, Mario. PoLLMgraph: Unraveling Hallucinations in Large Language Models via State Transition Dynamics. Findings of the Association for Computational Linguistics: NAACL 2024. 2024. doi:10.18653/v1/2024.findings-naacl.294

[3] Learning Uncertainty from Sequential Internal Dispersion in Large Language Models. arXiv preprint arXiv:2604.15741.

[4] Li, Rui and Long, Jing and Qi, Muge and Xia, Heming and Sha, Lei and Wang, Peiyi and Sui, Zhifang. Towards Harmonized Uncertainty Estimation for Large Language Models. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025. doi:10.18653/v1/2025.acl-long.1118

[5] Towards Reliable Truth-Aligned Uncertainty Estimation in Large Language Models. arXiv preprint arXiv:2604.00445.

[6] Zhang, Fujie and Yu, Peiqi and Yi, Biao and Zhang, Baolei and Li, Tong and Liu, Zheli. Prompt-Guided Internal States for Hallucination Detection of Large Language Models. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025. doi:10.18653/v1/2025.acl-long.1058

[7] The geometry of truth: Emergent linear structure in large language model representations of true/false datasets. arXiv preprint arXiv:2310.06824.

[8] Discovering latent knowledge in language models without supervision. arXiv preprint arXiv:2212.03827.

[9] Sahoo, Pranab and Meharia, Prabhash and Ghosh, Akash and Saha, Sriparna and Jain, Vinija and Chadha, Aman. A Comprehensive Survey of Hallucination in Large Language, Image, Video and Audio Foundation Models. Findings of the Association for Computational Linguistics: EMNLP 2024. 2024. doi:10.18653/v1/2024.findings-emnlp.685

[10] Huang, Lei and Yu, Weijiang and Ma, Weitao and Zhong, Weihong and Feng, Zhangyin and Wang, Haotian and Chen, Qianglong and Peng, Weihua and Feng, Xiaocheng and Qin, Bing and Liu, Ting. 2025. doi:10.1145/3703155

[11] Srey, Ponhvoan and Wu, Xiaobao and Luu, Anh Tuan. Unsupervised Hallucination Detection by Inspecting Reasoning Processes. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 2025. doi:10.18653/v1/2025.emnlp-main.1124

[12] Zhang, Yue and Li, Yafu and Cui, Leyang and Cai, Deng and Liu, Lemao and Fu, Tingchen and Huang, Xinting and Zhao, Enbo and Zhang, Yu and Chen, Yulong and Wang, Longyue and Luu, Anh Tuan and Bi, Wei and Shi, Freda and Shi, Shuming. Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models. Computational Linguistics. 2025. doi:10.1162/coli.a.16

[13] Vashurin, Roman and Fadeeva, Ekaterina and Vazhentsev, Artem and Rvanova, Lyudmila and Vasilev, Daniil and Tsvigun, Akim and Petrakov, Sergey and Xing, Rui and Sadallah, Abdelrahman and Grishchenkov, Kirill and Panchenko, Alexander and Baldwin, Timothy and Nakov, Preslav and Panov, Maxim and Shelmanov, Artem. Benchmarking Uncertainty Quantification Methods for Large Language Models with LM-Polygraph. 2025. doi:10.1162/tacl_a_00737

[14] Vazhentsev, Artem and Rvanova, Lyudmila and Lazichny, Ivan and Panchenko, Alexander and Panov, Maxim and Baldwin, Timothy and Shelmanov, Artem. Token-Level Density-Based Uncertainty Quantification Methods for Eliciting Truthfulness of Large Language Models. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Co... 2025. doi:10.18653/v1/2025.naacl-long.113

[15] Shelmanov, Artem and Fadeeva, Ekaterina and Tsvigun, Akim and Tsvigun, Ivan and Xie, Zhuohan and Kiselev, Igor and Daheim, Nico and Zhang, Caiqi and Vazhentsev, Artem and Sachan, Mrinmaya and Nakov, Preslav and Baldwin, Timothy. A Head to Predict and a Head to Question: Pre-trained Uncertainty Quantification Heads for Hallucination Detection in LLM Outputs. 2025. doi:10.18653/v1/2025.emnlp-main.1809

[16] Fadeeva, Ekaterina and Vashurin, Roman and Tsvigun, Akim and Vazhentsev, Artem and Petrakov, Sergey and Fedyanin, Kirill and Vasilev, Daniil and Goncharova, Elizaveta and Panchenko, Alexander and Panov, Maxim and Baldwin, Timothy and Shelmanov, Artem. LM-Polygraph: Uncertainty Estimation for Language Models. Proceedings of the 2023 Conference on Empirica... 2023. doi:10.18653/v1/2023.emnlp-demo.41

[17] He, Jinwen and Gong, Yujia and Lin, Zijin and Wei, Cheng'an and Zhao, Yue and Chen, Kai. LLM Factoscope: Uncovering LLMs' Factual Discernment through Measuring Inner States. Findings of the Association for Computational Linguistics: ACL 2024. 2024. doi:10.18653/v1/2024.findings-acl.608

[18] Chuang, Yung-Sung and Qiu, Linlu and Hsieh, Cheng-Yu and Krishna, Ranjay and Kim, Yoon and Glass, James R. Lookback Lens: Detecting and Mitigating Contextual Hallucinations in Large Language Models Using Only Attention Maps. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.84

[19] CH-Wang, Sky and Van Durme, Benjamin and Eisner, Jason and Kedzie, Chris. Do Androids Know They're Only Dreaming of Electric Sheep? Findings of the Association for Computational Linguistics: ACL 2024. 2024. doi:10.18653/v1/2024.findings-acl.260

[20] Tan, Hexiang and Sun, Fei and Liu, Sha and Su, Du and Cao, Qi and Chen, Xin and Wang, Jingang and Cai, Xunliang and Wang, Yuanzhuo and Shen, Huawei and Cheng, Xueqi. Too Consistent to Detect: A Study of Self-Consistent Errors in LLMs. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 2025. doi:10.18653/v1/2025.emnlp-main.238

[21] Uqlm: A python package for uncertainty quantification in large language models. Journal of Machine Learning Research.

[22] Ye, Fanghua and Yang, Mingming and Pang, Jianhui and Wang, Longyue and Wong, Derek F. and Yilmaz, Emine and Shi, Shuming and Tu, Zhaopeng. Proceedings of the 38th International Conference on Neural Information Processing Systems. 2024.

[23] Mahaut, Mat. Factual Confidence of LLMs: on Reliability and Robustness of Current Estimators. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.acl-long.250

[24] Measuring Aleatoric and Epistemic Uncertainty in LLMs: Empirical Evaluation on ID and OOD QA Tasks. arXiv preprint arXiv:2511.03166.

[25] Janiak, Denis and Binkowski, Jakub and Sawczyn, Albert and Gabrys, Bogdan and Shwartz-Ziv, Ravid and Kajdanowicz, Tomasz Jan. The Illusion of Progress: Re-evaluating Hallucination Detection in LLMs. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 2025. doi:10.18653/v1/2025.emnlp-main.1761

[26] Reasoning models better express their confidence. Advances in Neural Information Processing Systems.

[27] Multiple choice questions: Reasoning makes large language models (llms) more self-confident even when they are wrong. arXiv preprint arXiv:2501.09775.

[28] Mei, Zhiting and Zhang, Christina and Yin, Tenny and Lidard, Justin and Sho, Ola and Majumdar, Anirudha. Reasoning about Uncertainty: Do Reasoning Models Know When They Don't Know? Findings of the Association for Computational Linguistics: EACL 2026. 2026. doi:10.18653/v1/2026.findings-eacl.178

[29] Han, Jiatong and Band, Neil and Razzak, Muhammed and Kossen, Jannik and Rudner, Tim G. J. and Gal, Yarin. Simple Factuality Probes Detect Hallucinations in Long-Form Natural Language Generation. Findings of the Association for Computational Linguistics: EMNLP 2025. 2025. doi:10.18653/v1/2025.findings-emnlp.880

[30] Talmor, Alon and Herzig, Jonathan and Lourie, Nicholas and Berant, Jonathan. CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 2019. doi:10.18653/v1/n19-1421

[31] Joshi, Mandar and Choi, Eunsol and Weld, Daniel and Zettlemoyer, Luke. TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2017. doi:10.18653/v1/P17-1147

[32] Welbl, Johannes and Liu, Nelson F. and Gardner, Matt. Crowdsourcing Multiple Choice Science Questions. Proceedings of the 3rd Workshop on Noisy User-generated Text. 2017. doi:10.18653/v1/W17-4413

[33] Mallen, Alex and Asai, Akari and Zhong, Victor and Das, Rajarshi and Khashabi, Daniel and Hajishirzi, Hannaneh. When Not to Trust Language Models: Investigating Effectiveness of Parametric and Non-Parametric Memories. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023. doi:10.18653/v1/2023.acl-long.546

[34] Geva, Mor and Khashabi, Daniel and Segal, Elad and Khot, Tushar and Roth, Dan and Berant, Jonathan. Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit Reasoning Strategies. Transactions of the Association for Computational Linguistics. 2021. doi:10.1162/tacl_a_00370

[35] Clark, Christopher and Lee, Kenton and Chang, Ming-Wei and Kwiatkowski, Tom and Collins, Michael and Toutanova, Kristina. BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long ... 2019. doi:10.18653/v1/n19-1300

[36] Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457.

[37] Gemini 3.1 Flash-Lite. 2026.

[38] Introducing GPT-5.4 mini and nano. 2026.

[39] The llama 3 herd of models. arXiv preprint arXiv:2407.21783.

[40] Qwen3 technical report. arXiv preprint arXiv:2505.09388.

[41] Gemma 3 technical report. arXiv preprint arXiv:2503.19786. 2025.

[42] The relationship between Precision-Recall and ROC curves. Proceedings of the 23rd international conference on Machine learning. 2006.

[43] On calibration of modern neural networks. International conference on machine learning. 2017.

[44] Latent space chain-of-embedding enables output-free llm self-evaluation. International Conference on Learning Representations. 2025.

[45] Look before you leap: An exploratory study of uncertainty analysis for large language models. IEEE Transactions on Software Engineering. 2025.

[46] Energy-based out-of-distribution detection. Advances in neural information processing systems. 2020.

[47] Su, Weihang and Wang, Changyue and Ai, Qingyao and Hu, Yiran and Wu, Zhijing and Zhou, Yujia and Liu, Yiqun. Unsupervised Real-Time Hallucination Detection based on the Internal States of Large Language Models. Findings of the Association for Computational Linguistics: ACL 2024. 2024. doi:10.18653/v1/2024.findings-acl.854

[48] Lin, Chin-Yew. ROUGE: A Package for Automatic Evaluation of Summaries. Text Summarization Branches Out. 2004.

[49] Zha, Yuheng and Yang, Yichi and Li, Ruichen and Hu, Zhiting. AlignScore: Evaluating Factual Consistency with A Unified Alignment Function. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023. doi:10.18653/v1/2023.acl-long.634

[50] Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems. 2022.

[51]

arXiv preprint arXiv:2509.03531 , year=

Real-time detection of hallucinated entities in long-form generation , author=. arXiv preprint arXiv:2509.03531 , year=

arXiv
[52]

Shifting Attention to Relevance: Towards the Predictive Uncertainty Quantification of Free-Form Large Language Models

Duan, Jinhao and Cheng, Hao and Wang, Shiqi and Zavalny, Alex and Wang, Chenan and Xu, Renjing and Kailkhura, Bhavya and Xu, Kaidi. Shifting Attention to Relevance: Towards the Predictive Uncertainty Quantification of Free-Form Large Language Models. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Pa...

work page doi:10.18653/v1/2024.acl-long.276 2024
[53]

Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation

Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation. arXiv preprint arXiv:2302.09664.

Pith/arXiv arXiv
[54]

FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation

Min, Sewon and Krishna, Kalpesh and Lyu, Xinxi and Lewis, Mike and Yih, Wen-tau and Koh, Pang and Iyyer, Mohit and Zettlemoyer, Luke and Hajishirzi, Hannaneh. FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023

2023
[55]

Chain of natural language inference for reducing large language model ungrounded hallucinations

Chain of natural language inference for reducing large language model ungrounded hallucinations. arXiv preprint arXiv:2310.03951.

arXiv
