pith. sign in

arxiv: 2605.27016 · v1 · pith:PRQFANIQnew · submitted 2026-05-26 · 💻 cs.CL · cs.AI· cs.LG· stat.ML

Evaluating the Relevance of Uncertainty Estimators for LLM Hallucination

Pith reviewed 2026-06-29 18:28 UTC · model grok-4.3

classification 💻 cs.CL cs.AIcs.LGstat.ML
keywords LLM hallucinationsuncertainty estimationintrinsic hallucinationsextrinsic hallucinationsRAGTruthHalluLensempirical evaluation
0
0 comments X

The pith

Uncertainty estimators associate only weakly and variably with LLM hallucinations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper performs a direct empirical evaluation of multiple uncertainty estimation methods against labeled hallucinations in large language models. It distinguishes intrinsic hallucinations that violate input faithfulness from extrinsic ones unsupported by training data, and applies the tests across four benchmarks. The central finding is that the link between uncertainty scores and hallucination presence is inconsistent and frequently weak, varying by hallucination category and by which LLM is examined. This undercuts the common practice of treating uncertainty as an automatic indicator of when an LLM output is unreliable.

Core claim

The association between uncertainty estimators and hallucinations in LLMs is highly variable and often weak, depending on the hallucination type and the LLM under evaluation.

What carries the argument

Systematic comparison of information-theoretic, sampling-based, and reflexive uncertainty estimators against hallucination annotations on RAGTruth, HalluLens, and two additional benchmarks for both intrinsic and extrinsic cases.

If this is right

  • Uncertainty scores cannot be used as a reliable standalone signal that an output is hallucinated.
  • Any practical hallucination mitigation strategy must incorporate information beyond uncertainty, such as type-specific checks.
  • Performance of uncertainty methods should be reported separately for intrinsic versus extrinsic hallucinations rather than in aggregate.
  • Model-specific tuning of uncertainty thresholds is required rather than assuming a universal relationship.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Deployment pipelines that currently gate outputs on uncertainty alone may need additional verification layers for different hallucination modes.
  • Future work could test whether combining uncertainty with external retrieval or fact-checking steps restores a usable signal where uncertainty alone fails.

Load-bearing premise

The chosen benchmarks and the split between intrinsic and extrinsic hallucinations give a representative sample of actual hallucination behavior.

What would settle it

Finding a single uncertainty estimator that produces consistently high correlation with hallucination labels across all four benchmarks and multiple LLMs would falsify the claim of weak and variable association.

Figures

Figures reproduced from arXiv: 2605.27016 by Annabelle Blangero, Anna Korba, Karteek Alahari, Nicolas Chesneau, Yedidia Agnimo.

Figure 1
Figure 1. Figure 1: Mean AUROC against rank variability across the 12 panels. Each point represents [PITH_FULL_IMAGE:figures/full_fig_p008_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Family-level ROC aggregates per task, averaged across the three generators. For each [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Spearman correlation between the top uncertainty estimators across panels. The second cluster contains the two Eccentric￾ity variants, based respectively on contradic￾tion (Ecc-c) and entailment (Ecc-e) relations. Both are black-box graph-based estimators: they build a semantic-relation graph over sampled responses and quantify dispersion from the re￾sulting graph spectral representation. Their moderate co… view at source ↗
Figure 4
Figure 4. Figure 4: Performance–stability profiles for uncertainty estimators within each task, aggregated [PITH_FULL_IMAGE:figures/full_fig_p028_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Pairwise Kendall’s τ agreement between estimator rankings induced by ROC-AUC across models, shown separately for each task. Higher values indicate that the same estimators tend to rank similarly across models for a given hallucination type [PITH_FULL_IMAGE:figures/full_fig_p033_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Pairwise Kendall’s τ agreement between estimator rankings induced by ROC-AUC across tasks, shown separately for each model. Higher values indicate that the same estimators tend to rank similarly across hallucination types for a given model [PITH_FULL_IMAGE:figures/full_fig_p034_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Family-level ROC aggregates per task, averaged across the three models. For each task– [PITH_FULL_IMAGE:figures/full_fig_p035_7.png] view at source ↗
read the original abstract

Large language models (LLMs) are prone to hallucinations, i.e., statements unsupported by the input or training data, hindering reliable deployment. In parallel, numerous uncertainty estimation (UE) methods have been proposed to quantify model confidence and are often implicitly treated as proxies for model failure. However, the relationship between uncertainty and hallucinations remains insufficiently characterized. We present a systematic empirical study of the association between uncertainty estimators and hallucinations in LLMs. Rather than assuming this association, we evaluate directly when and to what extent it holds. We consider a diverse set of uncertainty estimators, including information-theoretic, sampling-based, and reflexive estimators, and examine their behavior across hallucination settings. Our experiments cover both intrinsic hallucinations (violations of input faithfulness) and extrinsic hallucinations (unsupported claims relative to training data), using four complementary benchmarks, including RAGTruth and HalluLens. We find that the association is highly variable and often weak, depending on the hallucination type and the LLM under evaluation. These results challenge the use of uncertainty as a direct signal of hallucination and clarify when it provides actionable information.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents a systematic empirical study of the relationship between uncertainty estimation (UE) methods—including information-theoretic, sampling-based, and reflexive estimators—and hallucinations in LLMs. It distinguishes intrinsic hallucinations (input faithfulness violations) from extrinsic ones (unsupported claims relative to training data), evaluates across four benchmarks (RAGTruth, HalluLens, and two others), and concludes that the association between UE and hallucinations is highly variable and often weak, depending on hallucination type and the specific LLM. This challenges the implicit treatment of UE as a direct proxy for hallucination detection.

Significance. If the empirical findings are robust, the work provides a valuable caution against over-reliance on uncertainty estimates for hallucination mitigation in LLM deployment. By directly testing the association rather than assuming it, and covering both intrinsic and extrinsic settings, the results clarify the limited actionable information provided by current UE methods and could steer research toward more targeted detection strategies.

major comments (2)
  1. [§4 (Benchmarks and Hallucination Definitions)] §4 (Benchmarks and Hallucination Definitions): The central claim that the UE-hallucination association is 'often weak' depends on the four chosen benchmarks and the intrinsic/extrinsic distinction constituting an unbiased probe. The manuscript provides no analysis of annotation protocols in RAGTruth or HalluLens (or the other two benchmarks) to show they do not systematically under-sample hallucinations that correlate with uncertainty or introduce labeling artifacts via the intrinsic/extrinsic split; without this, the variability finding risks being an artifact of the evaluation settings.
  2. [§5 (Results)] §5 (Results): The reported associations lack accompanying statistical significance tests, confidence intervals, or error bars on the correlation or AUC metrics across models and hallucination types. This makes it difficult to determine whether observed 'weak' associations are reliably distinguishable from noise or from stronger associations in other settings, undermining the strength of the 'highly variable and often weak' conclusion.
minor comments (2)
  1. The abstract and introduction would benefit from explicitly naming all four benchmarks and the exact LLMs evaluated, rather than referring to 'two others.'
  2. [§3] Notation for UE methods (e.g., how reflexive estimators are formalized) could be made more consistent between §3 and the experimental tables.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful comments, which highlight important aspects of our evaluation methodology and statistical reporting. We provide point-by-point responses below and indicate where revisions will be made.

read point-by-point responses
  1. Referee: [§4 (Benchmarks and Hallucination Definitions)] §4 (Benchmarks and Hallucination Definitions): The central claim that the UE-hallucination association is 'often weak' depends on the four chosen benchmarks and the intrinsic/extrinsic distinction constituting an unbiased probe. The manuscript provides no analysis of annotation protocols in RAGTruth or HalluLens (or the other two benchmarks) to show they do not systematically under-sample hallucinations that correlate with uncertainty or introduce labeling artifacts via the intrinsic/extrinsic split; without this, the variability finding risks being an artifact of the evaluation settings.

    Authors: We appreciate this concern regarding potential biases in the benchmarks. Our selection of RAGTruth, HalluLens, and the additional benchmarks was based on their established use in the literature for distinguishing intrinsic and extrinsic hallucinations, with labels provided by human annotators following documented protocols. While we did not conduct an independent audit of the annotation processes, the consistency of our findings across multiple independent benchmarks supports that the observed variability is not an artifact of a single evaluation setting. In the revision, we will add a subsection discussing the limitations of these benchmarks and the potential for annotation artifacts, including references to the original papers' annotation guidelines. revision: partial

  2. Referee: [§5 (Results)] §5 (Results): The reported associations lack accompanying statistical significance tests, confidence intervals, or error bars on the correlation or AUC metrics across models and hallucination types. This makes it difficult to determine whether observed 'weak' associations are reliably distinguishable from noise or from stronger associations in other settings, undermining the strength of the 'highly variable and often weak' conclusion.

    Authors: We agree that including statistical significance tests and confidence intervals would strengthen the presentation of our results. In the revised manuscript, we will recompute the correlations and AUC values with bootstrap confidence intervals and report p-values for the key comparisons to assess whether the weak associations are statistically distinguishable from stronger ones. revision: yes

Circularity Check

0 steps flagged

No significant circularity: purely empirical evaluation

full rationale

The paper performs direct empirical comparisons of uncertainty estimators against hallucination labels on four benchmarks (RAGTruth, HalluLens and two others), distinguishing intrinsic vs. extrinsic cases. No equations, fitted parameters, predictions derived from inputs, or self-citation chains appear in the derivation chain. The headline finding (variable and often weak association) is a statistical observation from the data, not a reduction to any fitted quantity or prior self-result. This matches the default case of a self-contained empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The work rests on standard definitions of hallucinations and uncertainty estimators drawn from prior literature.

pith-pipeline@v0.9.1-grok · 5741 in / 937 out tokens · 24704 ms · 2026-06-29T18:28:24.637686+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

62 extracted references · 13 canonical work pages · 4 internal anchors

  1. [1]

    Abbasi-Yadkori, Y ., Kuzborskij, I., György, A., and Szepesvari, C. (2024). To Believe or Not to Believe Your LLM: Iterative Prompting for Estimating Epistemic Uncertainty. InNeurIPS

  2. [2]

    L., Corney, D., DiResta, R., Ferrara, E., Hale, S., Halevy, A., Hovy, E., Ji, H., Menczer, F., Miguez, R., Nakov, P., Scheufele, D., Sharma, S., and Zagni, G

    Augenstein, I., Baldwin, T., Cha, M., Chakraborty, T., Ciampaglia, G. L., Corney, D., DiResta, R., Ferrara, E., Hale, S., Halevy, A., Hovy, E., Ji, H., Menczer, F., Miguez, R., Nakov, P., Scheufele, D., Sharma, S., and Zagni, G. (2024). Factuality challenges in the era of large language models and opportunities for fact-checking.Nature Machine Intelligenc...

  3. [3]

    F., Yaldiz, D

    Bakman, Y . F., Yaldiz, D. N., Kang, S., Zhang, T., Buyukates, B., Avestimehr, S., and Karim- ireddy, S. P. (2025). Reconsidering LLM uncertainty estimation methods in the wild. InPro- ceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 29531–29556

  4. [4]

    Bang, Y ., Ji, Z., Schelten, A., Hartshorn, A., Fowler, T., Zhang, C., Cancedda, N., and Fung, P. (2025). HalluLens: LLM hallucination benchmark. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 24128–24156

  5. [5]

    Y ., Okolo, C

    Bengio, Y ., Mindermann, S., Privitera, D., Besiroglu, T., Bommasani, R., Casper, S., Choi, Y ., Fox, P., Garfinkel, B., Goldfarb, D., Heidari, H., Ho, A., Kapoor, S., Khalatbari, L., Longpre, S., Manning, S., Mavroudis, V ., Mazeika, M., Michael, J., Newman, J., Ng, K. Y ., Okolo, C. T., Raji, D., Sastry, G., Seger, E., Skeadas, T., South, T., Strubell, ...

  6. [6]

    S., Skarbrevik, D., Ra, H.-K., Bajaj, V ., and Ahmad, Z

    Bouchard, D., Chauhan, M. S., Skarbrevik, D., Ra, H.-K., Bajaj, V ., and Ahmad, Z. (2026). UQLM: A Python Package for Uncertainty Quantification in Large Language Models.Journal of Machine Learning Research, 27(13):1–10

  7. [7]

    Chen, C., Liu, K., Chen, Z., Gu, Y ., Wu, Y ., Tao, M., Fu, Z., and Ye, J. (2023). INSIDE: LLMs’ Internal States Retain the Power of Hallucination Detection. InInternational Conference on Learning Representations

  8. [8]

    d., Suchanek, F

    Chen, L., Melo, G. d., Suchanek, F. M., and Varoquaux, G. (2026). Query-Level Uncertainty in Large Language Models. InICLR

  9. [9]

    Cover, T. M. and Thomas, J. A. (2001).Elements of information theory. Wiley-Interscience, Hoboken, NJ

  10. [10]

    F., Hardt, M., and Mendler-Dünner, C

    Cruz, A. F., Hardt, M., and Mendler-Dünner, C. (2024). Evaluating language models as risk scores. InProceedings of the 38th International Conference on Neural Information Processing Systems, volume 37, pages 97378–97407

  11. [11]

    Darrin, M., Piantanida, P., and Colombo, P. (2023). Rainproof: An umbrella to shield text generator from out-of-distribution data. InProceedings of the Conference on Empirical Methods in Natural Language Processing, pages 5831–5857. 10

  12. [12]

    Devic, S., Srinivasan, T., Thomason, J., Neiswanger, W., and Sharan, V . (2025). From Cal- ibration to Collaboration: LLM Uncertainty Quantification Should Be More Human-Centered. arXiv:2506.07461 [cs] version: 1

  13. [13]

    Duan, J., Cheng, H., Wang, S., Zavalny, A., Wang, C., Xu, R., Kailkhura, B., and Xu, K. (2024). Shifting Attention to Relevance: Towards the Predictive Uncertainty Quantification of Free- Form Large Language Models. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5050–5063. Associati...

  14. [14]

    Fadeeva, E., Rubashevskii, A., Shelmanov, A., Petrakov, S., Li, H., Mubarak, H., Tsymbalov, E., Kuzmin, G., Panchenko, A., Baldwin, T., Nakov, P., and Panov, M. (2024). Fact-Checking the Output of Large Language Models via Token-Level Uncertainty Quantification. InFindings of the Association for Computational Linguistics, pages 9367–9385

  15. [15]

    Fadeeva, E., Vashurin, R., Tsvigun, A., Vazhentsev, A., Petrakov, S., Fedyanin, K., Vasilev, D., Goncharova, E., Panchenko, A., Panov, M., et al. (2023). Lm-polygraph: Uncertainty estimation for language models.Proceedings of the Conference on Empirical Methods in Natural Language Processing: System Demonstrations

  16. [16]

    Fan, D., Delsad, S., Flammarion, N., and Andriushchenko, M. (2026). HalluHard: A Hard Multi-Turn Hallucination Benchmark. arXiv:2602.01031 [cs]

  17. [17]

    Farquhar, S., Kossen, J., Kuhn, L., and Gal, Y . (2024). Detecting hallucinations in large language models using semantic entropy.Nature, 630(8017):625–630

  18. [18]

    Fomicheva, M., Sun, S., Yankovskaya, L., Blain, F., Guzmán, F., Fishel, M., Aletras, N., Chaudhary, V ., and Specia, L. (2020). Unsupervised quality estimation for neural machine translation.Transactions of the Association for Computational Linguistics, 8:539–555

  19. [19]

    He, P., Liu, X., Gao, J., and Chen, W. (2021). DeBERTa: Decoding-enhanced BERT with disentangled attention. InInternational Conference on Learning Representations

  20. [20]

    Huang, L., Yu, W., Ma, W., Zhong, W., Feng, Z., Wang, H., Chen, Q., Peng, W., Feng, X., Qin, B., and Liu, T. (2024a). A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions.ACM Transactions on Information Systems

  21. [21]

    Huang, X., Li, S., Yu, M., Sesia, M., Hassani, H., Lee, I., Bastani, O., and Dobriban, E. (2024b). Uncertainty in Language Models: Assessment through Rank-Calibration. InProceedings of the Conference on Empirical Methods in Natural Language Processing, pages 284–312

  22. [22]

    Ielanskyi, M., Schweighofer, K., Aichberger, L., and Hochreiter, S. (2025). Addressing Pitfalls in the Evaluation of Uncertainty Estimation Methods for Natural Language Generation. InICLR Workshop: Quantify Uncertainty and Hallucination in Foundation Models: The Next Frontier in Reliable AI

  23. [23]

    S., Madotto, A., and Fung, P

    Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y ., Ishii, E., Bang, Y ., Chen, D., Dai, W., Chan, H. S., Madotto, A., and Fung, P. (2023). Survey of Hallucination in Natural Language Generation. ACM Computing Surveys, 55(12):1–38

  24. [24]

    Joshi, M., Choi, E., Weld, D., and Zettlemoyer, L. (2017). TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. InProceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601– 1611

  25. [25]

    Kadavath, S., Conerly, T., Askell, A., Henighan, T., Drain, D., Perez, E., Schiefer, N., Hatfield- Dodds, Z., DasSarma, N., Tran-Johnson, E., et al. (2022). Language models (mostly) know what they know.arXiv preprint arXiv:2207.05221

  26. [26]

    F., Yaldiz, D

    Kang, S., Bakman, Y . F., Yaldiz, D. N., Buyukates, B., and Avestimehr, S. (2025). Uncertainty quantification for hallucination detection in large language models: Foundations, methodology, and future directions.arXiv 2510.12040. 11

  27. [27]

    Kuhn, L., Gal, Y ., and Farquhar, S. (2023). Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation.International Conference on Learning Representations

  28. [28]

    M., Uszkoreit, J., Le, Q., and Petrov, S

    Kwiatkowski, T., Palomaki, J., Redfield, O., Collins, M., Parikh, A., Alberti, C., Epstein, D., Polosukhin, I., Devlin, J., Lee, K., Toutanova, K., Jones, L., Kelcey, M., Chang, M.-W., Dai, A. M., Uszkoreit, J., Le, Q., and Petrov, S. (2019). Natural Questions: A Benchmark for Question Answering Research.Transactions of the Association for Computational L...

  29. [29]

    Lee, K., Lee, K., Lee, H., and Shin, J. (2018). A simple unified framework for detecting out-of-distribution samples and adversarial attacks.Advances in neural information processing systems, 31

  30. [30]

    Lin, C.-Y . (2004). ROUGE: A package for automatic evaluation of summaries. InText Summarization Branches Out, pages 74–81. Association for Computational Linguistics

  31. [31]

    Lin, Z., Trivedi, S., and Sun, J. (2023). Generating with confidence: Uncertainty quantification for black-box large language models.Findings of ACL

  32. [32]

    Lin, Z., Trivedi, S., and Sun, J. (2024a). Contextualized Sequence Likelihood: Enhanced Confidence Scores for Natural Language Generation. InProceedings of the Conference on Empirical Methods in Natural Language Processing, pages 10351–10368

  33. [33]

    Lin, Z., Trivedi, S., and Sun, J. (2024b). Generating with Confidence: Uncertainty Quantification for Black-box Large Language Models.Transactions on Machine Learning Research

  34. [34]

    and Gales, M

    Malinin, A. and Gales, M. (2021). Uncertainty estimation in autoregressive structured prediction. International Conference on Learning Representations

  35. [35]

    Moskvoretskii, V ., Marina, M., Salnikov, M., Ivanov, N., Pletenev, S., Galimzianova, D., Krayko, N., Konovalov, V ., Nikishina, I., and Panchenko, A. (2025). Adaptive Retrieval Without Self- Knowledge? Bringing Uncertainty Back Home. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6355–6384

  36. [36]

    V ., Kossen, J., Gal, Y ., and Marttinen, P

    Nikitin, A. V ., Kossen, J., Gal, Y ., and Marttinen, P. (2024). Kernel Language Entropy: Fine- grained Uncertainty Quantification for LLMs from Semantic Similarities. InAdvances in Neural Information Processing Systems

  37. [37]

    Niu, C., Wu, Y ., Zhu, J., Xu, S., Shum, K., Zhong, R., Song, J., and Zhang, T. (2024). RAGTruth: A hallucination corpus for developing trustworthy retrieval-augmented language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 10862–10878

  38. [38]

    Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. (2002). BLEU: a method for automatic evaluation of machine translation. InProceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318

  39. [39]

    and Miikkulainen, R

    Qiu, X. and Miikkulainen, R. (2024). Semantic Density: Uncertainty Quantification for Large Language Models through Confidence Measurement in Semantic Space. InAdvances in Neural Information Processing Systems

  40. [40]

    G., Padhy, S., and Lakshminarayanan, B

    Ren, J., Fort, S., Liu, J., Roy, A. G., Padhy, S., and Lakshminarayanan, B. (2021). A Simple Fix to Mahalanobis Distance for Improving Near-OOD Detection. arXiv:2106.09022 [cs]

  41. [41]

    Sahoo, P., Meharia, P., Ghosh, A., Saha, S., Jain, V ., and Chadha, A. (2024). A Comprehensive Survey of Hallucination in Large Language, Image, Video and Audio Foundation Models. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 11709–11724

  42. [42]

    Santilli, A., Golinski, A., Kirchhof, M., Danieli, F., Blaas, A., Xiong, M., Zappella, L., and Williamson, S. (2025). Revisiting Uncertainty Quantification Evaluation in Language Models: Spurious Interactions with Response Length Bias Results. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers...

  43. [43]

    S., Saha, S., Kattakinda, P., and Feizi, S

    Sriramanan, G., Bharti, S., Sadasivan, V . S., Saha, S., Kattakinda, P., and Feizi, S. (2024). LLM-Check: Investigating Detection of Hallucinations in Large Language Models. InNeurIPS

  44. [44]

    Tian, K., Mitchell, E., Zhou, A., Sharma, A., Rafailov, R., Yao, H., Finn, C., and Manning, C. (2023). Just Ask for Calibration: Strategies for Eliciting Calibrated Confidence Scores from Language Models Fine-Tuned with Human Feedback. InProceedings of the Conference on Empirical Methods in Natural Language Processing, pages 5433–5442

  45. [45]

    Tomov, T., Fuchsgruber, D., Wollschläger, T., and Günnemann, S. (2026). The Illusion of Certainty: Uncertainty Quantification for LLMs Fails under Ambiguity. arXiv:2511.04418 [cs]

  46. [46]

    van der Poel, L., Cotterell, R., and Meister, C. (2022). Mutual Information Alleviates Hallucina- tions in Abstractive Summarization. InProceedings of the Conference on Empirical Methods in Natural Language Processing, pages 5956–5965

  47. [47]

    Vashurin, R., Fadeeva, E., Vazhentsev, A., Rvanova, L., Vasilev, D., Tsvigun, A., Petrakov, S., Xing, R., Sadallah, A., Grishchenkov, K., Panchenko, A., Baldwin, T., Nakov, P., Panov, M., and Shelmanov, A. (2025a). Benchmarking uncertainty quantification methods for large language models with LM-polygraph.Transactions of the Association for Computational ...

  48. [48]

    Vashurin, R., Goloburda, M., Ilina, A., Rubashevskii, A., Nakov, P., Shelmanov, A., and Panov, M. (2025b). CoCoA: A Minimum Bayes Risk Framework Bridging Confidence and Consistency for Uncertainty Quantification in LLMs. InNeurIPS

  49. [49]

    Vazhentsev, A., Kuzmin, G., Tsvigun, A., Panchenko, A., Panov, M., Burtsev, M., and Shel- manov, A. (2023). Hybrid Uncertainty Quantification for Selective Text Classification in Am- biguous Tasks. InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11659–11681

  50. [50]

    Vazhentsev, A., Rvanova, L., Kuzmin, G., Fadeeva, E., Lazichny, I., Panchenko, A., Panov, M., Baldwin, T., Sachan, M., Nakov, P., and Shelmanov, A. (2025). Uncertainty-Aware Attention Heads: Efficient Unsupervised Uncertainty Quantification for LLMs.arXiv 2505.20045

  51. [51]

    Wang, X., Zhang, Z., Chen, G., Li, Q., Luo, B., Han, Z., Wang, H., Li, Z., Gao, H., and Hu, M. (2025). UBench: Benchmarking Uncertainty in Large Language Models with Multiple Choice Questions. InFindings of the Association for Computational Linguistics, pages 8076–8107

  52. [52]

    Measuring short-form factuality in large language models

    Wei, J., Karina, N., Chung, H. W., Jiao, Y . J., Papay, S., Glaese, A., Schulman, J., and Fedus, W. (2024). Measuring short-form factuality in large language models. arXiv:2411.04368 [cs]

  53. [53]

    Williams, R. J. and Zipser, D. (1989). A Learning Algorithm for Continually Running Fully Recurrent Neural Networks.Neural Computation, 1(2):270–280

  54. [54]

    Wu, X., Li, X., Quan, L., and Hu, Q. (2025). UncertaintyZoo: A Unified Toolkit for Quantifying Predictive Uncertainty in Deep Learning Systems. arXiv:2512.06406 [cs]

  55. [55]

    Yang, Q., Ravikumar, S., Schmitt-Ulms, F., Lolla, S., Demir, E., Elistratov, I., Lavaee, A., Lolla, S., Ahmadi, E., Rus, D., Amini, A., and Perez, A. (2023). Uncertainty-aware Language Modeling for Selective Question Answering. arXiv:2311.15451 [cs]

  56. [56]

    Yao, Y ., Wu, H., Guo, Z., Biyan, Z., Gao, J., Luo, S., Hou, H., Fu, X., and Song, L. (2024). Learning From Correctness Without Prompting Makes LLM Efficient Reasoner. InCOLM

  57. [57]

    Yoo, K., Kim, J., Jang, J., and Kwak, N. (2022). Detection of Adversarial Examples in Text Classification: Benchmark and Baseline via Robust Density Estimation. InFindings of the Association for Computational Linguistics, pages 3656–3672

  58. [58]

    Zha, Y ., Yang, Y ., Li, R., and Hu, Z. (2023). Alignscore: Evaluating factual consistency with a unified alignment function.arXiv preprint arXiv:2305.16739

  59. [59]

    Zhang, C., Liu, F., Basaldella, M., and Collier, N. (2024). LUQ: Long-text Uncertainty Quantification for LLMs. In Al-Onaizan, Y ., Bansal, M., and Chen, Y .-N., editors,Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 5244–5262. 13

  60. [60]

    Zhao, W. X., Zhou, K., Li, J., Tang, T., Wang, X., Hou, Y ., Min, Y ., Zhang, B., Zhang, J., Dong, Z., Du, Y ., Yang, C., Chen, Y ., Chen, Z., Jiang, J., Ren, R., Li, Y ., Tang, X., Liu, Z., Liu, P., Nie, J.-Y ., and Wen, J.-R. (2025). A Survey of Large Language Models. arXiv:2303.18223 [cs]

  61. [61]

    A., and Roy, S

    Zhou, H., Wan, X., Proleev, L., Mincu, D., Chen, J., Heller, K. A., and Roy, S. (2023). Batch Calibration: Rethinking Calibration for In-Context Learning and Prompt Engineering. InICLR. A Related works A.1 Evaluating uncertainty estimation methods Work on evaluating uncertainty estimation (UE) methods for LLMs has addressed three concerns: building infras...

  62. [62]

    method A outperforms method B on benchmark X

    implements a comprehensive estimator suite behind a single pipeline that computes greedy responses, token probabilities, sampled responses, and semantic relation matrices once and reuses them across estimators, controlling for implementation differences. Adjacent toolkits – UQLM [6], UNCERTAINTYZOO[ 54], and UBENCH[ 51] – provide complementary coverage or...