Evaluating the Relevance of Uncertainty Estimators for LLM Hallucination

Annabelle Blangero; Anna Korba; Karteek Alahari; Nicolas Chesneau; Yedidia Agnimo

arxiv: 2605.27016 · v1 · pith:PRQFANIQnew · submitted 2026-05-26 · 💻 cs.CL · cs.AI · cs.LG · stat.ML

Pith reviewed 2026-06-29 18:28 UTC · model grok-4.3

keywords LLM hallucinations · uncertainty estimation · intrinsic hallucinations · extrinsic hallucinations · RAGTruth · HalluLens · empirical evaluation

The pith

Uncertainty estimators associate only weakly and variably with LLM hallucinations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper performs a direct empirical evaluation of multiple uncertainty estimation methods against labeled hallucinations in large language models. It distinguishes intrinsic hallucinations that violate input faithfulness from extrinsic ones unsupported by training data, and applies the tests across four benchmarks. The central finding is that the link between uncertainty scores and hallucination presence is inconsistent and frequently weak, varying by hallucination category and by which LLM is examined. This undercuts the common practice of treating uncertainty as an automatic indicator of when an LLM output is unreliable.

Core claim

The association between uncertainty estimators and hallucinations in LLMs is highly variable and often weak, depending on the hallucination type and the LLM under evaluation.

What carries the argument

Systematic comparison of information-theoretic, sampling-based, and reflexive uncertainty estimators against hallucination annotations on RAGTruth, HalluLens, and two additional benchmarks for both intrinsic and extrinsic cases.
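
As a rough illustration of what comparing an estimator against hallucination annotations involves, the sketch below scores one estimator with ROC-AUC, the metric named in the figure captions. It assumes binary labels (1 = hallucinated, 0 = faithful) and one uncertainty score per response; the function name `roc_auc` and the toy data are hypothetical and are not taken from the paper.

```python
def roc_auc(scores, labels):
    """ROC-AUC via the rank-sum (Mann-Whitney U) identity.

    scores: uncertainty scores, higher = more uncertain.
    labels: 1 for hallucinated responses, 0 for faithful ones.
    Tied scores share their average rank.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one example of each class")

    # Assign 1-based ranks in ascending score order, averaging over ties.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    u_stat = rank_sum_pos - n_pos * (n_pos + 1) / 2
    return u_stat / (n_pos * n_neg)


# Hypothetical toy data: four responses, two of them hallucinated.
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_labels = [0, 0, 1, 1]
toy_auc = roc_auc(toy_scores, toy_labels)
```

An AUC near 0.5 means the estimator ranks hallucinated and faithful responses no better than chance, which is one way to read a "weak" association.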

If this is right

Uncertainty scores cannot be used as a reliable standalone signal that an output is hallucinated.
Any practical hallucination mitigation strategy must incorporate information beyond uncertainty, such as type-specific checks.
Performance of uncertainty methods should be reported separately for intrinsic versus extrinsic hallucinations rather than in aggregate.
Model-specific tuning of uncertainty thresholds is required rather than assuming a universal relationship.
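
The last point, model-specific thresholds, can be made concrete with a small sketch. It assumes each model has its own set of (uncertainty score, hallucination label) pairs and picks the cut-off that maximizes Youden's J (true-positive rate minus false-positive rate). Youden's J is a choice made here for illustration, not a criterion the paper is known to use, and the model names and values are hypothetical.

```python
def best_threshold(scores, labels):
    """Return (threshold, J) maximizing Youden's J for the rule `score >= threshold`.

    labels: 1 for hallucinated responses, 0 for faithful ones.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one example of each class")

    best = (None, -1.0)
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        j = tp / n_pos - fp / n_neg
        if j > best[1]:
            best = (t, j)
    return best


# Hypothetical per-model data: the same estimator, two different generators.
runs = {
    "model_a": ([0.1, 0.2, 0.7, 0.9], [0, 0, 1, 1]),
    "model_b": ([0.5, 0.6, 0.65, 0.95], [0, 0, 0, 1]),
}
thresholds = {name: best_threshold(s, y) for name, (s, y) in runs.items()}
```

In this toy case the selected cut-offs differ (0.7 versus 0.95), so a threshold tuned on one model would misfire on the other.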

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Deployment pipelines that currently gate outputs on uncertainty alone may need additional verification layers for different hallucination modes.
Future work could test whether combining uncertainty with external retrieval or fact-checking steps restores a usable signal where uncertainty alone fails.

Load-bearing premise

The chosen benchmarks and the split between intrinsic and extrinsic hallucinations give a representative sample of actual hallucination behavior.

What would settle it

Finding a single uncertainty estimator that produces consistently high correlation with hallucination labels across all four benchmarks and multiple LLMs would falsify the claim of weak and variable association.
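
One way to operationalize this test, as a sketch only: collect each estimator's ROC-AUC for every (benchmark, model) cell and keep the estimators that clear a bar in all of them. The 0.8 cut-off, the estimator names, and the AUC values are hypothetical placeholders, and ROC-AUC stands in for whichever association measure is preferred.

```python
def consistently_strong(auc_table, min_auc=0.8):
    """Return estimators whose ROC-AUC is at least `min_auc` in every cell.

    auc_table maps estimator name -> {(benchmark, model): auc}.
    """
    return sorted(
        name
        for name, cells in auc_table.items()
        if cells and min(cells.values()) >= min_auc
    )


# Hypothetical results for two estimators on two benchmark/model cells.
auc_table = {
    "estimator_a": {("bench_1", "model_x"): 0.86, ("bench_2", "model_x"): 0.58},
    "estimator_b": {("bench_1", "model_x"): 0.83, ("bench_2", "model_x"): 0.81},
}
survivors = consistently_strong(auc_table)
```

A non-empty result on the real evaluation grid would count against the paper's claim; an empty one is consistent with it.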

Figures

Figures reproduced from arXiv: 2605.27016 by Annabelle Blangero, Anna Korba, Karteek Alahari, Nicolas Chesneau, Yedidia Agnimo.

**Figure 2.** Family-level ROC aggregates per task, averaged across the three generators. For each …

**Figure 3.** Spearman correlation between the top uncertainty estimators across panels. The second cluster contains the two Eccentricity variants, based respectively on contradiction (Ecc-c) and entailment (Ecc-e) relations. Both are black-box graph-based estimators: they build a semantic-relation graph over sampled responses and quantify dispersion from the resulting graph spectral representation. Their moderate co…

**Figure 4.** Performance–stability profiles for uncertainty estimators within each task, aggregated …

**Figure 5.** Pairwise Kendall’s τ agreement between estimator rankings induced by ROC-AUC across models, shown separately for each task. Higher values indicate that the same estimators tend to rank similarly across models for a given hallucination type.

**Figure 6.** Pairwise Kendall’s τ agreement between estimator rankings induced by ROC-AUC across tasks, shown separately for each model. Higher values indicate that the same estimators tend to rank similarly across hallucination types for a given model.

**Figure 7.** Family-level ROC aggregates per task, averaged across the three models. For each task– …
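
The ranking-agreement analysis described in the Figure 5 and Figure 6 captions can be sketched as follows. The code computes Kendall's τ between two lists of ROC-AUC values for the same estimators, so that τ reflects how similarly the two settings rank them. It uses the simple tau-a variant (no tie correction), which is an assumption; the captions do not say which variant the paper uses, and the AUC values are hypothetical.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / total pairs.

    x[i] and y[i] are the ROC-AUC of estimator i in two settings
    (for example two models on the same task). Tied pairs contribute zero.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long lists with at least 2 items")
    s = 0
    for i, j in combinations(range(len(x)), 2):
        prod = (x[i] - x[j]) * (y[i] - y[j])
        s += (prod > 0) - (prod < 0)
    n_pairs = len(x) * (len(x) - 1) // 2
    return s / n_pairs


# Hypothetical ROC-AUC values for the same four estimators on two models.
auc_model_a = [0.71, 0.64, 0.58, 0.52]
auc_model_b = [0.55, 0.69, 0.60, 0.51]
tau = kendall_tau_a(auc_model_a, auc_model_b)
```

Values near 1 mean the two settings agree on which estimators are best; values near 0 mean the ranking from one setting says little about the other.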

Abstract

Large language models (LLMs) are prone to hallucinations, i.e., statements unsupported by the input or training data, hindering reliable deployment. In parallel, numerous uncertainty estimation (UE) methods have been proposed to quantify model confidence and are often implicitly treated as proxies for model failure. However, the relationship between uncertainty and hallucinations remains insufficiently characterized. We present a systematic empirical study of the association between uncertainty estimators and hallucinations in LLMs. Rather than assuming this association, we evaluate directly when and to what extent it holds. We consider a diverse set of uncertainty estimators, including information-theoretic, sampling-based, and reflexive estimators, and examine their behavior across hallucination settings. Our experiments cover both intrinsic hallucinations (violations of input faithfulness) and extrinsic hallucinations (unsupported claims relative to training data), using four complementary benchmarks, including RAGTruth and HalluLens. We find that the association is highly variable and often weak, depending on the hallucination type and the LLM under evaluation. These results challenge the use of uncertainty as a direct signal of hallucination and clarify when it provides actionable information.
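
To make the estimator families more tangible, here is a minimal sketch of two generic textbook scores: a likelihood-based one (length-normalized negative log-likelihood of the generated tokens) and a sampling-based one (entropy over repeated sampled answers, grouped by exact match after lowercasing). These are illustrations chosen here, not the specific estimators evaluated in the paper. Reflexive estimators are not sketched because the abstract does not describe how they are formalized.

```python
import math
from collections import Counter


def mean_token_nll(token_logprobs):
    """Length-normalized negative log-likelihood of one generated sequence.

    token_logprobs: natural-log probabilities the model assigned to each
    generated token. Higher output = less confident generation.
    """
    if not token_logprobs:
        raise ValueError("empty sequence")
    return -sum(token_logprobs) / len(token_logprobs)


def answer_entropy(sampled_answers):
    """Shannon entropy (nats) of the empirical distribution over sampled answers.

    Answers are grouped by exact match after stripping and lowercasing,
    a crude stand-in for semantic clustering.
    """
    if not sampled_answers:
        raise ValueError("no samples")
    counts = Counter(a.strip().lower() for a in sampled_answers)
    n = len(sampled_answers)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


# Hypothetical inputs: token log-probs for one response, and five resampled answers.
nll = mean_token_nll([math.log(0.9), math.log(0.5), math.log(0.8)])
dispersion = answer_entropy(["Paris", "paris", "Lyon", "Paris", "Marseille"])
```

Either score can then be compared against hallucination labels; the paper's finding is that such comparisons often show only a weak link.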

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper's main finding is that uncertainty estimators link only weakly and variably to hallucinations, depending on type and model.

The paper's core result is that uncertainty estimators do not track hallucinations reliably. The association is highly variable and often weak once you break it down by hallucination category and by the LLM being tested.

The authors run a direct comparison of information-theoretic, sampling-based, and reflexive estimators on intrinsic and extrinsic hallucinations. The setup uses four benchmarks, including RAGTruth and HalluLens, and it tests the link rather than assuming it exists. The concrete observation is that the strength of the association shifts substantially across settings. The new contribution is a systematic check across UE families and hallucination splits on established data.

The work is useful because it supplies evidence against the common shortcut of treating uncertainty scores as hallucination detectors. The design is empirical and avoids presenting fitted quantities as predictions, so there is little circularity.

The soft spot is the dependence on the chosen benchmarks and the intrinsic/extrinsic labeling. If those datasets under-sample hallucinations that actually correlate with uncertainty, or if the split introduces artifacts, the "often weak" pattern could be narrower than it appears. The paper mitigates this by using complementary benchmarks, but the general claim still rests on how representative those resources are.

This is for people who build or evaluate production LLM pipelines and need data on where uncertainty methods fall short. It shows clear engagement with the literature by testing an assumption instead of repeating it. The empirical scope is broad enough that it deserves a serious referee to check the splits, statistical reporting, and whether the variability holds under alternative labelings.

Referee Report

2 major / 2 minor

Summary. The manuscript presents a systematic empirical study of the relationship between uncertainty estimation (UE) methods—including information-theoretic, sampling-based, and reflexive estimators—and hallucinations in LLMs. It distinguishes intrinsic hallucinations (input faithfulness violations) from extrinsic ones (unsupported claims relative to training data), evaluates across four benchmarks (RAGTruth, HalluLens, and two others), and concludes that the association between UE and hallucinations is highly variable and often weak, depending on hallucination type and the specific LLM. This challenges the implicit treatment of UE as a direct proxy for hallucination detection.

Significance. If the empirical findings are robust, the work provides a valuable caution against over-reliance on uncertainty estimates for hallucination mitigation in LLM deployment. By directly testing the association rather than assuming it, and covering both intrinsic and extrinsic settings, the results clarify the limited actionable information provided by current UE methods and could steer research toward more targeted detection strategies.

major comments (2)

§4 (Benchmarks and Hallucination Definitions): The central claim that the UE-hallucination association is 'often weak' depends on the four chosen benchmarks and the intrinsic/extrinsic distinction constituting an unbiased probe. The manuscript provides no analysis of annotation protocols in RAGTruth or HalluLens (or the other two benchmarks) to show they do not systematically under-sample hallucinations that correlate with uncertainty or introduce labeling artifacts via the intrinsic/extrinsic split; without this, the variability finding risks being an artifact of the evaluation settings.
§5 (Results): The reported associations lack accompanying statistical significance tests, confidence intervals, or error bars on the correlation or AUC metrics across models and hallucination types. This makes it difficult to determine whether observed 'weak' associations are reliably distinguishable from noise or from stronger associations in other settings, undermining the strength of the 'highly variable and often weak' conclusion.

minor comments (2)

The abstract and introduction would benefit from explicitly naming all four benchmarks and the exact LLMs evaluated, rather than referring to 'two others.'
[§3] Notation for UE methods (e.g., how reflexive estimators are formalized) could be made more consistent between §3 and the experimental tables.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful comments, which highlight important aspects of our evaluation methodology and statistical reporting. We provide point-by-point responses below and indicate where revisions will be made.

Referee: §4 (Benchmarks and Hallucination Definitions): The central claim that the UE-hallucination association is 'often weak' depends on the four chosen benchmarks and the intrinsic/extrinsic distinction constituting an unbiased probe. The manuscript provides no analysis of annotation protocols in RAGTruth or HalluLens (or the other two benchmarks) to show they do not systematically under-sample hallucinations that correlate with uncertainty or introduce labeling artifacts via the intrinsic/extrinsic split; without this, the variability finding risks being an artifact of the evaluation settings.

Authors: We appreciate this concern regarding potential biases in the benchmarks. Our selection of RAGTruth, HalluLens, and the additional benchmarks was based on their established use in the literature for distinguishing intrinsic and extrinsic hallucinations, with labels provided by human annotators following documented protocols. While we did not conduct an independent audit of the annotation processes, the consistency of our findings across multiple independent benchmarks supports that the observed variability is not an artifact of a single evaluation setting. In the revision, we will add a subsection discussing the limitations of these benchmarks and the potential for annotation artifacts, including references to the original papers' annotation guidelines.

revision: partial

Referee: §5 (Results): The reported associations lack accompanying statistical significance tests, confidence intervals, or error bars on the correlation or AUC metrics across models and hallucination types. This makes it difficult to determine whether observed 'weak' associations are reliably distinguishable from noise or from stronger associations in other settings, undermining the strength of the 'highly variable and often weak' conclusion.

Authors: We agree that including statistical significance tests and confidence intervals would strengthen the presentation of our results. In the revised manuscript, we will recompute the correlations and AUC values with bootstrap confidence intervals and report p-values for the key comparisons to assess whether the weak associations are statistically distinguishable from stronger ones.

revision: yes
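
The bootstrap confidence intervals mentioned in this response could look roughly like the sketch below. It assumes a percentile bootstrap over responses, resampling (score, label) pairs with replacement; the simulated rebuttal does not say which bootstrap variant would be used. The helper names, the default of 2000 resamples, and the toy data are hypothetical.

```python
import random


def roc_auc(scores, labels):
    """ROC-AUC as P(score_pos > score_neg) + 0.5 * P(tie), by pairwise comparison."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(scores, labels, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for ROC-AUC; returns (point_estimate, low, high)."""
    if not 0 < sum(labels) < len(labels):
        raise ValueError("need at least one example of each class")
    rng = random.Random(seed)
    n = len(scores)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that contain a single class
            stats.append(roc_auc([scores[i] for i in idx], ys))
    stats.sort()
    low = stats[int((alpha / 2) * n_boot)]
    high = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return roc_auc(scores, labels), low, high


# Hypothetical toy data: eight responses, four of them hallucinated.
toy_scores = [0.1, 0.4, 0.35, 0.8, 0.6, 0.2, 0.9, 0.3]
toy_labels = [0, 0, 1, 1, 1, 0, 1, 0]
point, low, high = bootstrap_auc_ci(toy_scores, toy_labels, n_boot=500)
```

If the interval for a given estimator and setting includes 0.5, the data cannot distinguish that association from chance.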

Circularity Check

0 steps flagged

No significant circularity: purely empirical evaluation

The paper performs direct empirical comparisons of uncertainty estimators against hallucination labels on four benchmarks (RAGTruth, HalluLens and two others), distinguishing intrinsic vs. extrinsic cases. No equations, fitted parameters, predictions derived from inputs, or self-citation chains appear in the derivation chain. The headline finding (variable and often weak association) is a statistical observation from the data, not a reduction to any fitted quantity or prior self-result. This matches the default case of a self-contained empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The work rests on standard definitions of hallucinations and uncertainty estimators drawn from prior literature.

pith-pipeline@v0.9.1-grok · 5741 in / 937 out tokens · 24704 ms · 2026-06-29T18:28:24.637686+00:00 · methodology

Reference graph

Works this paper leans on

62 extracted references · 13 canonical work pages · 4 internal anchors

[1]

Abbasi-Yadkori, Y ., Kuzborskij, I., György, A., and Szepesvari, C. (2024). To Believe or Not to Believe Your LLM: Iterative Prompting for Estimating Epistemic Uncertainty. InNeurIPS

2024
[2]

L., Corney, D., DiResta, R., Ferrara, E., Hale, S., Halevy, A., Hovy, E., Ji, H., Menczer, F., Miguez, R., Nakov, P., Scheufele, D., Sharma, S., and Zagni, G

Augenstein, I., Baldwin, T., Cha, M., Chakraborty, T., Ciampaglia, G. L., Corney, D., DiResta, R., Ferrara, E., Hale, S., Halevy, A., Hovy, E., Ji, H., Menczer, F., Miguez, R., Nakov, P., Scheufele, D., Sharma, S., and Zagni, G. (2024). Factuality challenges in the era of large language models and opportunities for fact-checking.Nature Machine Intelligenc...

2024
[3]

F., Yaldiz, D

Bakman, Y . F., Yaldiz, D. N., Kang, S., Zhang, T., Buyukates, B., Avestimehr, S., and Karim- ireddy, S. P. (2025). Reconsidering LLM uncertainty estimation methods in the wild. InPro- ceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 29531–29556

2025
[4]

Bang, Y ., Ji, Z., Schelten, A., Hartshorn, A., Fowler, T., Zhang, C., Cancedda, N., and Fung, P. (2025). HalluLens: LLM hallucination benchmark. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 24128–24156

2025
[5]

Y ., Okolo, C

Bengio, Y ., Mindermann, S., Privitera, D., Besiroglu, T., Bommasani, R., Casper, S., Choi, Y ., Fox, P., Garfinkel, B., Goldfarb, D., Heidari, H., Ho, A., Kapoor, S., Khalatbari, L., Longpre, S., Manning, S., Mavroudis, V ., Mazeika, M., Michael, J., Newman, J., Ng, K. Y ., Okolo, C. T., Raji, D., Sastry, G., Seger, E., Skeadas, T., South, T., Strubell, ...

work page arXiv 2025
[6]

S., Skarbrevik, D., Ra, H.-K., Bajaj, V ., and Ahmad, Z

Bouchard, D., Chauhan, M. S., Skarbrevik, D., Ra, H.-K., Bajaj, V ., and Ahmad, Z. (2026). UQLM: A Python Package for Uncertainty Quantification in Large Language Models.Journal of Machine Learning Research, 27(13):1–10

2026
[7]

Chen, C., Liu, K., Chen, Z., Gu, Y ., Wu, Y ., Tao, M., Fu, Z., and Ye, J. (2023). INSIDE: LLMs’ Internal States Retain the Power of Hallucination Detection. InInternational Conference on Learning Representations

2023
[8]

d., Suchanek, F

Chen, L., Melo, G. d., Suchanek, F. M., and Varoquaux, G. (2026). Query-Level Uncertainty in Large Language Models. InICLR

2026
[9]

Cover, T. M. and Thomas, J. A. (2001).Elements of information theory. Wiley-Interscience, Hoboken, NJ

2001
[10]

F., Hardt, M., and Mendler-Dünner, C

Cruz, A. F., Hardt, M., and Mendler-Dünner, C. (2024). Evaluating language models as risk scores. InProceedings of the 38th International Conference on Neural Information Processing Systems, volume 37, pages 97378–97407

2024
[11]

Darrin, M., Piantanida, P., and Colombo, P. (2023). Rainproof: An umbrella to shield text generator from out-of-distribution data. InProceedings of the Conference on Empirical Methods in Natural Language Processing, pages 5831–5857. 10

2023
[12]

Devic, S., Srinivasan, T., Thomason, J., Neiswanger, W., and Sharan, V . (2025). From Cal- ibration to Collaboration: LLM Uncertainty Quantification Should Be More Human-Centered. arXiv:2506.07461 [cs] version: 1

work page arXiv 2025
[13]

Duan, J., Cheng, H., Wang, S., Zavalny, A., Wang, C., Xu, R., Kailkhura, B., and Xu, K. (2024). Shifting Attention to Relevance: Towards the Predictive Uncertainty Quantification of Free- Form Large Language Models. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5050–5063. Associati...

2024
[14]

Fadeeva, E., Rubashevskii, A., Shelmanov, A., Petrakov, S., Li, H., Mubarak, H., Tsymbalov, E., Kuzmin, G., Panchenko, A., Baldwin, T., Nakov, P., and Panov, M. (2024). Fact-Checking the Output of Large Language Models via Token-Level Uncertainty Quantification. InFindings of the Association for Computational Linguistics, pages 9367–9385

2024
[15]

Fadeeva, E., Vashurin, R., Tsvigun, A., Vazhentsev, A., Petrakov, S., Fedyanin, K., Vasilev, D., Goncharova, E., Panchenko, A., Panov, M., et al. (2023). Lm-polygraph: Uncertainty estimation for language models.Proceedings of the Conference on Empirical Methods in Natural Language Processing: System Demonstrations

2023
[16]

Fan, D., Delsad, S., Flammarion, N., and Andriushchenko, M. (2026). HalluHard: A Hard Multi-Turn Hallucination Benchmark. arXiv:2602.01031 [cs]

work page arXiv 2026
[17]

Farquhar, S., Kossen, J., Kuhn, L., and Gal, Y . (2024). Detecting hallucinations in large language models using semantic entropy.Nature, 630(8017):625–630

2024
[18]

Fomicheva, M., Sun, S., Yankovskaya, L., Blain, F., Guzmán, F., Fishel, M., Aletras, N., Chaudhary, V ., and Specia, L. (2020). Unsupervised quality estimation for neural machine translation.Transactions of the Association for Computational Linguistics, 8:539–555

2020
[19]

He, P., Liu, X., Gao, J., and Chen, W. (2021). DeBERTa: Decoding-enhanced BERT with disentangled attention. InInternational Conference on Learning Representations

2021
[20]

Huang, L., Yu, W., Ma, W., Zhong, W., Feng, Z., Wang, H., Chen, Q., Peng, W., Feng, X., Qin, B., and Liu, T. (2024a). A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions.ACM Transactions on Information Systems
[21]

Huang, X., Li, S., Yu, M., Sesia, M., Hassani, H., Lee, I., Bastani, O., and Dobriban, E. (2024b). Uncertainty in Language Models: Assessment through Rank-Calibration. InProceedings of the Conference on Empirical Methods in Natural Language Processing, pages 284–312
[22]

Ielanskyi, M., Schweighofer, K., Aichberger, L., and Hochreiter, S. (2025). Addressing Pitfalls in the Evaluation of Uncertainty Estimation Methods for Natural Language Generation. InICLR Workshop: Quantify Uncertainty and Hallucination in Foundation Models: The Next Frontier in Reliable AI

2025
[23]

S., Madotto, A., and Fung, P

Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y ., Ishii, E., Bang, Y ., Chen, D., Dai, W., Chan, H. S., Madotto, A., and Fung, P. (2023). Survey of Hallucination in Natural Language Generation. ACM Computing Surveys, 55(12):1–38

2023
[24]

Joshi, M., Choi, E., Weld, D., and Zettlemoyer, L. (2017). TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. InProceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601– 1611

2017
[25]

Kadavath, S., Conerly, T., Askell, A., Henighan, T., Drain, D., Perez, E., Schiefer, N., Hatfield- Dodds, Z., DasSarma, N., Tran-Johnson, E., et al. (2022). Language models (mostly) know what they know.arXiv preprint arXiv:2207.05221

work page internal anchor Pith review Pith/arXiv arXiv 2022
[26]

F., Yaldiz, D

Kang, S., Bakman, Y . F., Yaldiz, D. N., Buyukates, B., and Avestimehr, S. (2025). Uncertainty quantification for hallucination detection in large language models: Foundations, methodology, and future directions.arXiv 2510.12040. 11

work page arXiv 2025
[27]

Kuhn, L., Gal, Y ., and Farquhar, S. (2023). Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation.International Conference on Learning Representations

2023
[28]

M., Uszkoreit, J., Le, Q., and Petrov, S

Kwiatkowski, T., Palomaki, J., Redfield, O., Collins, M., Parikh, A., Alberti, C., Epstein, D., Polosukhin, I., Devlin, J., Lee, K., Toutanova, K., Jones, L., Kelcey, M., Chang, M.-W., Dai, A. M., Uszkoreit, J., Le, Q., and Petrov, S. (2019). Natural Questions: A Benchmark for Question Answering Research.Transactions of the Association for Computational L...

2019
[29]

Lee, K., Lee, K., Lee, H., and Shin, J. (2018). A simple unified framework for detecting out-of-distribution samples and adversarial attacks.Advances in neural information processing systems, 31

2018
[30]

Lin, C.-Y . (2004). ROUGE: A package for automatic evaluation of summaries. InText Summarization Branches Out, pages 74–81. Association for Computational Linguistics

2004
[31]

Lin, Z., Trivedi, S., and Sun, J. (2023). Generating with confidence: Uncertainty quantification for black-box large language models.Findings of ACL

2023
[32]

Lin, Z., Trivedi, S., and Sun, J. (2024a). Contextualized Sequence Likelihood: Enhanced Confidence Scores for Natural Language Generation. InProceedings of the Conference on Empirical Methods in Natural Language Processing, pages 10351–10368
[33]

Lin, Z., Trivedi, S., and Sun, J. (2024b). Generating with Confidence: Uncertainty Quantification for Black-box Large Language Models.Transactions on Machine Learning Research
[34]

and Gales, M

Malinin, A. and Gales, M. (2021). Uncertainty estimation in autoregressive structured prediction. International Conference on Learning Representations

2021
[35]

Moskvoretskii, V ., Marina, M., Salnikov, M., Ivanov, N., Pletenev, S., Galimzianova, D., Krayko, N., Konovalov, V ., Nikishina, I., and Panchenko, A. (2025). Adaptive Retrieval Without Self- Knowledge? Bringing Uncertainty Back Home. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6355–6384

2025
[36]

V ., Kossen, J., Gal, Y ., and Marttinen, P

Nikitin, A. V ., Kossen, J., Gal, Y ., and Marttinen, P. (2024). Kernel Language Entropy: Fine- grained Uncertainty Quantification for LLMs from Semantic Similarities. InAdvances in Neural Information Processing Systems

2024
[37]

Niu, C., Wu, Y ., Zhu, J., Xu, S., Shum, K., Zhong, R., Song, J., and Zhang, T. (2024). RAGTruth: A hallucination corpus for developing trustworthy retrieval-augmented language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 10862–10878

2024
[38]

Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. (2002). BLEU: a method for automatic evaluation of machine translation. InProceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318

2002
[39]

and Miikkulainen, R

Qiu, X. and Miikkulainen, R. (2024). Semantic Density: Uncertainty Quantification for Large Language Models through Confidence Measurement in Semantic Space. InAdvances in Neural Information Processing Systems

2024
[40]

G., Padhy, S., and Lakshminarayanan, B

Ren, J., Fort, S., Liu, J., Roy, A. G., Padhy, S., and Lakshminarayanan, B. (2021). A Simple Fix to Mahalanobis Distance for Improving Near-OOD Detection. arXiv:2106.09022 [cs]

work page arXiv 2021
[41]

Sahoo, P., Meharia, P., Ghosh, A., Saha, S., Jain, V ., and Chadha, A. (2024). A Comprehensive Survey of Hallucination in Large Language, Image, Video and Audio Foundation Models. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 11709–11724

2024
[42]

Santilli, A., Golinski, A., Kirchhof, M., Danieli, F., Blaas, A., Xiong, M., Zappella, L., and Williamson, S. (2025). Revisiting Uncertainty Quantification Evaluation in Language Models: Spurious Interactions with Response Length Bias Results. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers...

2025
[43]

S., Saha, S., Kattakinda, P., and Feizi, S

Sriramanan, G., Bharti, S., Sadasivan, V . S., Saha, S., Kattakinda, P., and Feizi, S. (2024). LLM-Check: Investigating Detection of Hallucinations in Large Language Models. InNeurIPS

2024
[44]

Tian, K., Mitchell, E., Zhou, A., Sharma, A., Rafailov, R., Yao, H., Finn, C., and Manning, C. (2023). Just Ask for Calibration: Strategies for Eliciting Calibrated Confidence Scores from Language Models Fine-Tuned with Human Feedback. InProceedings of the Conference on Empirical Methods in Natural Language Processing, pages 5433–5442

2023
[45]

Tomov, T., Fuchsgruber, D., Wollschläger, T., and Günnemann, S. (2026). The Illusion of Certainty: Uncertainty Quantification for LLMs Fails under Ambiguity. arXiv:2511.04418 [cs]

work page arXiv 2026
[46]

van der Poel, L., Cotterell, R., and Meister, C. (2022). Mutual Information Alleviates Hallucina- tions in Abstractive Summarization. InProceedings of the Conference on Empirical Methods in Natural Language Processing, pages 5956–5965

2022
[47]

Vashurin, R., Fadeeva, E., Vazhentsev, A., Rvanova, L., Vasilev, D., Tsvigun, A., Petrakov, S., Xing, R., Sadallah, A., Grishchenkov, K., Panchenko, A., Baldwin, T., Nakov, P., Panov, M., and Shelmanov, A. (2025a). Benchmarking uncertainty quantification methods for large language models with LM-polygraph.Transactions of the Association for Computational ...
[48]

Vashurin, R., Goloburda, M., Ilina, A., Rubashevskii, A., Nakov, P., Shelmanov, A., and Panov, M. (2025b). CoCoA: A Minimum Bayes Risk Framework Bridging Confidence and Consistency for Uncertainty Quantification in LLMs. InNeurIPS
[49]

Vazhentsev, A., Kuzmin, G., Tsvigun, A., Panchenko, A., Panov, M., Burtsev, M., and Shel- manov, A. (2023). Hybrid Uncertainty Quantification for Selective Text Classification in Am- biguous Tasks. InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11659–11681

2023
[50]

Vazhentsev, A., Rvanova, L., Kuzmin, G., Fadeeva, E., Lazichny, I., Panchenko, A., Panov, M., Baldwin, T., Sachan, M., Nakov, P., and Shelmanov, A. (2025). Uncertainty-Aware Attention Heads: Efficient Unsupervised Uncertainty Quantification for LLMs.arXiv 2505.20045

work page internal anchor Pith review Pith/arXiv arXiv 2025
[51]

Wang, X., Zhang, Z., Chen, G., Li, Q., Luo, B., Han, Z., Wang, H., Li, Z., Gao, H., and Hu, M. (2025). UBench: Benchmarking Uncertainty in Large Language Models with Multiple Choice Questions. InFindings of the Association for Computational Linguistics, pages 8076–8107

2025
[52]

Measuring short-form factuality in large language models

Wei, J., Karina, N., Chung, H. W., Jiao, Y . J., Papay, S., Glaese, A., Schulman, J., and Fedus, W. (2024). Measuring short-form factuality in large language models. arXiv:2411.04368 [cs]

work page internal anchor Pith review Pith/arXiv arXiv 2024
[53]

Williams, R. J. and Zipser, D. (1989). A Learning Algorithm for Continually Running Fully Recurrent Neural Networks.Neural Computation, 1(2):270–280

1989
[54]

Wu, X., Li, X., Quan, L., and Hu, Q. (2025). UncertaintyZoo: A Unified Toolkit for Quantifying Predictive Uncertainty in Deep Learning Systems. arXiv:2512.06406 [cs]

work page arXiv 2025
[55]

Yang, Q., Ravikumar, S., Schmitt-Ulms, F., Lolla, S., Demir, E., Elistratov, I., Lavaee, A., Lolla, S., Ahmadi, E., Rus, D., Amini, A., and Perez, A. (2023). Uncertainty-aware Language Modeling for Selective Question Answering. arXiv:2311.15451 [cs]

work page arXiv 2023
[56]

Yao, Y ., Wu, H., Guo, Z., Biyan, Z., Gao, J., Luo, S., Hou, H., Fu, X., and Song, L. (2024). Learning From Correctness Without Prompting Makes LLM Efficient Reasoner. InCOLM

2024
[57]

Yoo, K., Kim, J., Jang, J., and Kwak, N. (2022). Detection of Adversarial Examples in Text Classification: Benchmark and Baseline via Robust Density Estimation. InFindings of the Association for Computational Linguistics, pages 3656–3672

2022
[58]

Zha, Y ., Yang, Y ., Li, R., and Hu, Z. (2023). Alignscore: Evaluating factual consistency with a unified alignment function.arXiv preprint arXiv:2305.16739

work page arXiv 2023
[59]

Zhang, C., Liu, F., Basaldella, M., and Collier, N. (2024). LUQ: Long-text Uncertainty Quantification for LLMs. In Al-Onaizan, Y ., Bansal, M., and Chen, Y .-N., editors,Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 5244–5262. 13

2024
[60]

Zhao, W. X., Zhou, K., Li, J., Tang, T., Wang, X., Hou, Y ., Min, Y ., Zhang, B., Zhang, J., Dong, Z., Du, Y ., Yang, C., Chen, Y ., Chen, Z., Jiang, J., Ren, R., Li, Y ., Tang, X., Liu, Z., Liu, P., Nie, J.-Y ., and Wen, J.-R. (2025). A Survey of Large Language Models. arXiv:2303.18223 [cs]

work page internal anchor Pith review Pith/arXiv arXiv 2025
[61]

A., and Roy, S

Zhou, H., Wan, X., Proleev, L., Mincu, D., Chen, J., Heller, K. A., and Roy, S. (2023). Batch Calibration: Rethinking Calibration for In-Context Learning and Prompt Engineering. InICLR. A Related works A.1 Evaluating uncertainty estimation methods Work on evaluating uncertainty estimation (UE) methods for LLMs has addressed three concerns: building infras...

2023
[62]

method A outperforms method B on benchmark X

implements a comprehensive estimator suite behind a single pipeline that computes greedy responses, token probabilities, sampled responses, and semantic relation matrices once and reuses them across estimators, controlling for implementation differences. Adjacent toolkits – UQLM [6], UNCERTAINTYZOO[ 54], and UBENCH[ 51] – provide complementary coverage or...

1996

[1]

Abbasi-Yadkori, Y., Kuzborskij, I., György, A., and Szepesvari, C. (2024). To Believe or Not to Believe Your LLM: Iterative Prompting for Estimating Epistemic Uncertainty. In NeurIPS.

[2]

Augenstein, I., Baldwin, T., Cha, M., Chakraborty, T., Ciampaglia, G. L., Corney, D., DiResta, R., Ferrara, E., Hale, S., Halevy, A., Hovy, E., Ji, H., Menczer, F., Miguez, R., Nakov, P., Scheufele, D., Sharma, S., and Zagni, G. (2024). Factuality challenges in the era of large language models and opportunities for fact-checking. Nature Machine Intelligenc...

[3]

Bakman, Y. F., Yaldiz, D. N., Kang, S., Zhang, T., Buyukates, B., Avestimehr, S., and Karimireddy, S. P. (2025). Reconsidering LLM uncertainty estimation methods in the wild. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 29531–29556.

[4]

Bang, Y., Ji, Z., Schelten, A., Hartshorn, A., Fowler, T., Zhang, C., Cancedda, N., and Fung, P. (2025). HalluLens: LLM hallucination benchmark. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 24128–24156.

[5]

Bengio, Y., Mindermann, S., Privitera, D., Besiroglu, T., Bommasani, R., Casper, S., Choi, Y., Fox, P., Garfinkel, B., Goldfarb, D., Heidari, H., Ho, A., Kapoor, S., Khalatbari, L., Longpre, S., Manning, S., Mavroudis, V., Mazeika, M., Michael, J., Newman, J., Ng, K. Y., Okolo, C. T., Raji, D., Sastry, G., Seger, E., Skeadas, T., South, T., Strubell, ...

[6]

Bouchard, D., Chauhan, M. S., Skarbrevik, D., Ra, H.-K., Bajaj, V., and Ahmad, Z. (2026). UQLM: A Python Package for Uncertainty Quantification in Large Language Models. Journal of Machine Learning Research, 27(13):1–10.

[7]

Chen, C., Liu, K., Chen, Z., Gu, Y., Wu, Y., Tao, M., Fu, Z., and Ye, J. (2023). INSIDE: LLMs’ Internal States Retain the Power of Hallucination Detection. In International Conference on Learning Representations.

[8]

Chen, L., Melo, G. d., Suchanek, F. M., and Varoquaux, G. (2026). Query-Level Uncertainty in Large Language Models. In ICLR.

[9]

Cover, T. M. and Thomas, J. A. (2001). Elements of information theory. Wiley-Interscience, Hoboken, NJ.

[10]

Cruz, A. F., Hardt, M., and Mendler-Dünner, C. (2024). Evaluating language models as risk scores. In Proceedings of the 38th International Conference on Neural Information Processing Systems, volume 37, pages 97378–97407.

[11]

Darrin, M., Piantanida, P., and Colombo, P. (2023). Rainproof: An umbrella to shield text generator from out-of-distribution data. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 5831–5857.

[12]

Devic, S., Srinivasan, T., Thomason, J., Neiswanger, W., and Sharan, V. (2025). From Calibration to Collaboration: LLM Uncertainty Quantification Should Be More Human-Centered. arXiv:2506.07461 [cs] version: 1.

[13]

Duan, J., Cheng, H., Wang, S., Zavalny, A., Wang, C., Xu, R., Kailkhura, B., and Xu, K. (2024). Shifting Attention to Relevance: Towards the Predictive Uncertainty Quantification of Free-Form Large Language Models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5050–5063. Associati...

[14]

Fadeeva, E., Rubashevskii, A., Shelmanov, A., Petrakov, S., Li, H., Mubarak, H., Tsymbalov, E., Kuzmin, G., Panchenko, A., Baldwin, T., Nakov, P., and Panov, M. (2024). Fact-Checking the Output of Large Language Models via Token-Level Uncertainty Quantification. In Findings of the Association for Computational Linguistics, pages 9367–9385.

[15]

Fadeeva, E., Vashurin, R., Tsvigun, A., Vazhentsev, A., Petrakov, S., Fedyanin, K., Vasilev, D., Goncharova, E., Panchenko, A., Panov, M., et al. (2023). Lm-polygraph: Uncertainty estimation for language models. Proceedings of the Conference on Empirical Methods in Natural Language Processing: System Demonstrations.

[16]

Fan, D., Delsad, S., Flammarion, N., and Andriushchenko, M. (2026). HalluHard: A Hard Multi-Turn Hallucination Benchmark. arXiv:2602.01031 [cs].

[17]

Farquhar, S., Kossen, J., Kuhn, L., and Gal, Y. (2024). Detecting hallucinations in large language models using semantic entropy. Nature, 630(8017):625–630.

[18]

Fomicheva, M., Sun, S., Yankovskaya, L., Blain, F., Guzmán, F., Fishel, M., Aletras, N., Chaudhary, V., and Specia, L. (2020). Unsupervised quality estimation for neural machine translation. Transactions of the Association for Computational Linguistics, 8:539–555.

[19]

He, P., Liu, X., Gao, J., and Chen, W. (2021). DeBERTa: Decoding-enhanced BERT with disentangled attention. In International Conference on Learning Representations.

[20]

Huang, L., Yu, W., Ma, W., Zhong, W., Feng, Z., Wang, H., Chen, Q., Peng, W., Feng, X., Qin, B., and Liu, T. (2024a). A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions. ACM Transactions on Information Systems.

[21]

Huang, X., Li, S., Yu, M., Sesia, M., Hassani, H., Lee, I., Bastani, O., and Dobriban, E. (2024b). Uncertainty in Language Models: Assessment through Rank-Calibration. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 284–312.

[22]

Ielanskyi, M., Schweighofer, K., Aichberger, L., and Hochreiter, S. (2025). Addressing Pitfalls in the Evaluation of Uncertainty Estimation Methods for Natural Language Generation. In ICLR Workshop: Quantify Uncertainty and Hallucination in Foundation Models: The Next Frontier in Reliable AI.

[23]

Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y., Chen, D., Dai, W., Chan, H. S., Madotto, A., and Fung, P. (2023). Survey of Hallucination in Natural Language Generation. ACM Computing Surveys, 55(12):1–38.

[24]

Joshi, M., Choi, E., Weld, D., and Zettlemoyer, L. (2017). TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601–1611.

[25]

Kadavath, S., Conerly, T., Askell, A., Henighan, T., Drain, D., Perez, E., Schiefer, N., Hatfield-Dodds, Z., DasSarma, N., Tran-Johnson, E., et al. (2022). Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221.

[26]

Kang, S., Bakman, Y. F., Yaldiz, D. N., Buyukates, B., and Avestimehr, S. (2025). Uncertainty quantification for hallucination detection in large language models: Foundations, methodology, and future directions. arXiv:2510.12040.

[27]

Kuhn, L., Gal, Y., and Farquhar, S. (2023). Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation. International Conference on Learning Representations.

[28]

Kwiatkowski, T., Palomaki, J., Redfield, O., Collins, M., Parikh, A., Alberti, C., Epstein, D., Polosukhin, I., Devlin, J., Lee, K., Toutanova, K., Jones, L., Kelcey, M., Chang, M.-W., Dai, A. M., Uszkoreit, J., Le, Q., and Petrov, S. (2019). Natural Questions: A Benchmark for Question Answering Research. Transactions of the Association for Computational L...

[29]

Lee, K., Lee, K., Lee, H., and Shin, J. (2018). A simple unified framework for detecting out-of-distribution samples and adversarial attacks. Advances in Neural Information Processing Systems, 31.

[30]

Lin, C.-Y. (2004). ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81. Association for Computational Linguistics.

[31]

Lin, Z., Trivedi, S., and Sun, J. (2023). Generating with confidence: Uncertainty quantification for black-box large language models. Findings of ACL.

[32]

Lin, Z., Trivedi, S., and Sun, J. (2024a). Contextualized Sequence Likelihood: Enhanced Confidence Scores for Natural Language Generation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 10351–10368.

[33]

Lin, Z., Trivedi, S., and Sun, J. (2024b). Generating with Confidence: Uncertainty Quantification for Black-box Large Language Models. Transactions on Machine Learning Research.

[34]

Malinin, A. and Gales, M. (2021). Uncertainty estimation in autoregressive structured prediction. International Conference on Learning Representations.

[35]

Moskvoretskii, V., Marina, M., Salnikov, M., Ivanov, N., Pletenev, S., Galimzianova, D., Krayko, N., Konovalov, V., Nikishina, I., and Panchenko, A. (2025). Adaptive Retrieval Without Self-Knowledge? Bringing Uncertainty Back Home. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6355–6384.

[36]

Nikitin, A. V., Kossen, J., Gal, Y., and Marttinen, P. (2024). Kernel Language Entropy: Fine-grained Uncertainty Quantification for LLMs from Semantic Similarities. In Advances in Neural Information Processing Systems.

[37]

Niu, C., Wu, Y., Zhu, J., Xu, S., Shum, K., Zhong, R., Song, J., and Zhang, T. (2024). RAGTruth: A hallucination corpus for developing trustworthy retrieval-augmented language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 10862–10878.

[38]

Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. (2002). BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318.

[39]

Qiu, X. and Miikkulainen, R. (2024). Semantic Density: Uncertainty Quantification for Large Language Models through Confidence Measurement in Semantic Space. In Advances in Neural Information Processing Systems.

[40]

Ren, J., Fort, S., Liu, J., Roy, A. G., Padhy, S., and Lakshminarayanan, B. (2021). A Simple Fix to Mahalanobis Distance for Improving Near-OOD Detection. arXiv:2106.09022 [cs].

[41]

Sahoo, P., Meharia, P., Ghosh, A., Saha, S., Jain, V., and Chadha, A. (2024). A Comprehensive Survey of Hallucination in Large Language, Image, Video and Audio Foundation Models. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 11709–11724.

[42]

Santilli, A., Golinski, A., Kirchhof, M., Danieli, F., Blaas, A., Xiong, M., Zappella, L., and Williamson, S. (2025). Revisiting Uncertainty Quantification Evaluation in Language Models: Spurious Interactions with Response Length Bias Results. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers...

[43]

Sriramanan, G., Bharti, S., Sadasivan, V. S., Saha, S., Kattakinda, P., and Feizi, S. (2024). LLM-Check: Investigating Detection of Hallucinations in Large Language Models. In NeurIPS.

[44]

Tian, K., Mitchell, E., Zhou, A., Sharma, A., Rafailov, R., Yao, H., Finn, C., and Manning, C. (2023). Just Ask for Calibration: Strategies for Eliciting Calibrated Confidence Scores from Language Models Fine-Tuned with Human Feedback. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 5433–5442.

[45]

Tomov, T., Fuchsgruber, D., Wollschläger, T., and Günnemann, S. (2026). The Illusion of Certainty: Uncertainty Quantification for LLMs Fails under Ambiguity. arXiv:2511.04418 [cs].

[46]

van der Poel, L., Cotterell, R., and Meister, C. (2022). Mutual Information Alleviates Hallucinations in Abstractive Summarization. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 5956–5965.

[47]

Vashurin, R., Fadeeva, E., Vazhentsev, A., Rvanova, L., Vasilev, D., Tsvigun, A., Petrakov, S., Xing, R., Sadallah, A., Grishchenkov, K., Panchenko, A., Baldwin, T., Nakov, P., Panov, M., and Shelmanov, A. (2025a). Benchmarking uncertainty quantification methods for large language models with LM-polygraph. Transactions of the Association for Computational ...

[48]

Vashurin, R., Goloburda, M., Ilina, A., Rubashevskii, A., Nakov, P., Shelmanov, A., and Panov, M. (2025b). CoCoA: A Minimum Bayes Risk Framework Bridging Confidence and Consistency for Uncertainty Quantification in LLMs. In NeurIPS.

[49]

Vazhentsev, A., Kuzmin, G., Tsvigun, A., Panchenko, A., Panov, M., Burtsev, M., and Shelmanov, A. (2023). Hybrid Uncertainty Quantification for Selective Text Classification in Ambiguous Tasks. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11659–11681.

[50]

Vazhentsev, A., Rvanova, L., Kuzmin, G., Fadeeva, E., Lazichny, I., Panchenko, A., Panov, M., Baldwin, T., Sachan, M., Nakov, P., and Shelmanov, A. (2025). Uncertainty-Aware Attention Heads: Efficient Unsupervised Uncertainty Quantification for LLMs. arXiv:2505.20045.

[51]

Wang, X., Zhang, Z., Chen, G., Li, Q., Luo, B., Han, Z., Wang, H., Li, Z., Gao, H., and Hu, M. (2025). UBench: Benchmarking Uncertainty in Large Language Models with Multiple Choice Questions. In Findings of the Association for Computational Linguistics, pages 8076–8107.

[52]

Wei, J., Karina, N., Chung, H. W., Jiao, Y. J., Papay, S., Glaese, A., Schulman, J., and Fedus, W. (2024). Measuring short-form factuality in large language models. arXiv:2411.04368 [cs].

[53]

Williams, R. J. and Zipser, D. (1989). A Learning Algorithm for Continually Running Fully Recurrent Neural Networks. Neural Computation, 1(2):270–280.

[54]

Wu, X., Li, X., Quan, L., and Hu, Q. (2025). UncertaintyZoo: A Unified Toolkit for Quantifying Predictive Uncertainty in Deep Learning Systems. arXiv:2512.06406 [cs].

[55]

Yang, Q., Ravikumar, S., Schmitt-Ulms, F., Lolla, S., Demir, E., Elistratov, I., Lavaee, A., Lolla, S., Ahmadi, E., Rus, D., Amini, A., and Perez, A. (2023). Uncertainty-aware Language Modeling for Selective Question Answering. arXiv:2311.15451 [cs].

[56]

Yao, Y., Wu, H., Guo, Z., Biyan, Z., Gao, J., Luo, S., Hou, H., Fu, X., and Song, L. (2024). Learning From Correctness Without Prompting Makes LLM Efficient Reasoner. In COLM.

[57]

Yoo, K., Kim, J., Jang, J., and Kwak, N. (2022). Detection of Adversarial Examples in Text Classification: Benchmark and Baseline via Robust Density Estimation. In Findings of the Association for Computational Linguistics, pages 3656–3672.

[58]

Zha, Y., Yang, Y., Li, R., and Hu, Z. (2023). Alignscore: Evaluating factual consistency with a unified alignment function. arXiv preprint arXiv:2305.16739.

[59]

Zhang, C., Liu, F., Basaldella, M., and Collier, N. (2024). LUQ: Long-text Uncertainty Quantification for LLMs. In Al-Onaizan, Y., Bansal, M., and Chen, Y.-N., editors, Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 5244–5262.

[60]

Zhao, W. X., Zhou, K., Li, J., Tang, T., Wang, X., Hou, Y., Min, Y., Zhang, B., Zhang, J., Dong, Z., Du, Y., Yang, C., Chen, Y., Chen, Z., Jiang, J., Ren, R., Li, Y., Tang, X., Liu, Z., Liu, P., Nie, J.-Y., and Wen, J.-R. (2025). A Survey of Large Language Models. arXiv:2303.18223 [cs].

[61]

Zhou, H., Wan, X., Proleev, L., Mincu, D., Chen, J., Heller, K. A., and Roy, S. (2023). Batch Calibration: Rethinking Calibration for In-Context Learning and Prompt Engineering. In ICLR.

A Related works

A.1 Evaluating uncertainty estimation methods

Work on evaluating uncertainty estimation (UE) methods for LLMs has addressed three concerns: building infras...

[62]

method A outperforms method B on benchmark X

implements a comprehensive estimator suite behind a single pipeline that computes greedy responses, token probabilities, sampled responses, and semantic relation matrices once and reuses them across estimators, controlling for implementation differences. Adjacent toolkits – UQLM [6], UNCERTAINTYZOO [54], and UBENCH [51] – provide complementary coverage or...