Estimating the Black-box LLM Uncertainty with Distribution-Aligned Adversarial Distillation
Pith reviewed 2026-05-08 10:57 UTC · model grok-4.3
The pith
A lightweight proxy model can quantify uncertainty for black-box LLMs by learning their high-quality output regions through adversarial distillation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Distribution-Aligned Adversarial Distillation uses a generation-discrimination architecture to steer a proxy model toward the high-quality regions of a black-box LLM's output distribution; the proxy then reproduces the LLM's specific responses and estimates uncertainty through evidence learning, with experiments showing that even a proxy at one percent of the target model's size delivers reliable quantification.
What carries the argument
The generation-discrimination architecture in Distribution-Aligned Adversarial Distillation, which aligns the proxy to high-quality output regions so it can reproduce responses and estimate uncertainty via evidence learning.
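The abstract does not spell out the training objective, but "generation-discrimination" conventionally denotes a GAN-style minimax game. A minimal sketch of what such an objective could look like, assuming a standard adversarial distillation formulation; the symbols (proxy generator G_phi, discriminator D_psi) are ours, not the paper's:

```latex
\min_{\phi}\max_{\psi}\;
\mathbb{E}_{y \sim p_{\mathrm{LLM}}(\cdot \mid x)}\bigl[\log D_{\psi}(x, y)\bigr]
+ \mathbb{E}_{\hat{y} \sim G_{\phi}(\cdot \mid x)}\bigl[\log\bigl(1 - D_{\psi}(x, \hat{y})\bigr)\bigr]
```

The discriminator learns to separate black-box responses from proxy responses while the proxy learns to fool it; the "distribution-aligned" qualifier presumably adds a term restricting alignment to high-quality responses, which this sketch omits.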
If this is right
- Uncertainty can be computed in real time for any API-only LLM without internal access or repeated sampling.
- Small proxy models become sufficient to flag when the large model is likely to produce incorrect or fabricated output.
- Commercial systems gain a practical way to add reliability checks before presenting LLM answers to users.
- Resource use drops sharply compared with sampling-based uncertainty methods while retaining comparable detection power.
Where Pith is reading between the lines
- The same distillation pattern could be tested on other black-box generators such as image or code models to see whether uncertainty transfer holds beyond text.
- Combining the proxy's uncertainty signal with downstream verification steps might reduce error propagation in chained reasoning pipelines.
- If the proxy's learned distribution remains stable across model updates, it could serve as a lightweight monitor that stays useful even when the underlying LLM is retrained or replaced.
Load-bearing premise
The adversarial training successfully confines the proxy to high-quality regions of the black-box LLM's output distribution instead of its full, noisy behavior.
What would settle it
If uncertainty scores from the trained proxy show no correlation with actual error rates or hallucination frequency on held-out queries answered by the original LLM, the method fails to deliver reliable estimates.
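This criterion is directly testable once one has per-query uncertainty scores from the proxy and correctness judgments for the original LLM's answers. A minimal sketch of such a check in Python, assuming both arrays are available; the function name and interface are illustrative, not from the paper:

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import roc_auc_score

def falsification_check(proxy_uncertainty, llm_correct):
    """Test whether proxy uncertainty tracks the black-box LLM's actual errors.

    proxy_uncertainty: uncertainty score per held-out query, from the proxy.
    llm_correct:       1 if the LLM's answer was judged correct, else 0.
    """
    proxy_uncertainty = np.asarray(proxy_uncertainty, dtype=float)
    llm_error = 1 - np.asarray(llm_correct, dtype=int)
    # Rank correlation: higher uncertainty should accompany more errors.
    rho, p_value = spearmanr(proxy_uncertainty, llm_error)
    # AUROC: does uncertainty rank wrong answers above correct ones?
    auroc = roc_auc_score(llm_error, proxy_uncertainty)
    return {"spearman_rho": rho, "p_value": p_value, "auroc": auroc}
```

A Spearman rho near zero and an AUROC near 0.5 on held-out queries would be exactly the failure mode described above.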
Original abstract
Large language models (LLMs) have progressed rapidly in complex reasoning and question answering, yet LLM hallucination remains a central bottleneck that hinders practical deployment, especially for commercial black-box LLMs accessible only via APIs. Existing uncertainty quantification methods typically depend on computationally expensive multiple sampling or internal parameters, which prevents real-time estimation and fails to capture information implicit in the black-box reasoning process. To address this issue, we propose Distribution-Aligned Adversarial Distillation (DisAAD), which introduces a generation-discrimination architecture to guide a lightweight proxy model to learn the high-quality regions of the output distribution of the black-box LLM, thus effectively endowing it with the ability to know whether the black-box LLM knows or not. Subsequently, we use the proxy model to reproduce the specific responses of the black-box LLM and estimate the corresponding uncertainty based on evidence learning. Extensive experiments have verified the effectiveness and promise of our proposed method, indicating that a proxy model, even one that accounts for only 1% of the target LLM's size, can achieve reliable uncertainty quantification.
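"Evidence learning" here most plausibly refers to evidential deep learning in the style of Sensoy et al. (2018), where a head outputs non-negative evidence parameterizing a Dirichlet distribution and uncertainty has a closed form. A minimal sketch of that standard formulation; how DisAAD adapts it to free-form generation is not specified in the abstract:

```python
import numpy as np

def dirichlet_uncertainty(evidence):
    """Closed-form uncertainty from evidential deep learning (Sensoy et al., 2018).

    evidence: non-negative evidence over K outcomes, e.g. from a
              softplus/ReLU head of the proxy model.
    Returns (per-outcome belief masses, vacuity uncertainty in (0, 1]).
    """
    evidence = np.asarray(evidence, dtype=float)
    K = evidence.size
    alpha = evidence + 1.0   # Dirichlet concentration parameters
    S = alpha.sum()          # total Dirichlet strength
    belief = evidence / S    # belief mass per outcome
    uncertainty = K / S      # shrinks as accumulated evidence grows
    return belief, uncertainty

# Little evidence anywhere -> uncertainty near 1: the proxy signals
# that the black-box LLM likely "does not know".
belief, u = dirichlet_uncertainty([0.2, 0.1, 0.1])
```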
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Distribution-Aligned Adversarial Distillation (DisAAD), which uses a generation-discrimination architecture to train a lightweight proxy model (1% the size of the target) to learn high-quality regions of a black-box LLM's output distribution. The proxy then reproduces LLM responses and estimates uncertainty via evidence learning, with the central claim that this yields reliable uncertainty quantification for black-box LLMs without multiple sampling or internal access.
Significance. If the transfer of uncertainty properties holds, the approach would enable efficient, real-time uncertainty estimation for commercial black-box LLMs, addressing hallucination risks in a scalable way that existing sampling-based or white-box methods cannot. This has clear practical value for deployment.
Major comments (2)
- Abstract: The claim that 'extensive experiments have verified the effectiveness' and that a 1%-sized proxy achieves 'reliable uncertainty quantification' is load-bearing, yet the abstract (and by extension the reported results) provides no datasets, baselines, metrics (e.g., calibration curves, token-level KL divergence, or correlation between proxy evidence scores and LLM sampling entropy), or controls. Without these, it is impossible to confirm that the proxy's uncertainty matches the black-box LLM's epistemic uncertainty rather than reflecting the proxy's own training artifacts. (A sketch of one such calibration check follows this list.)
- Method description: The generation-discrimination architecture is asserted to 'guide the lightweight proxy to learn the high-quality regions' and thereby endow it with the ability to 'know whether the black-box LLM knows or not,' but no direct evidence is supplied that surface response matching plus discrimination transfers the relevant distributional properties for uncertainty. This assumption is central to the claim that the proxy faithfully estimates the target LLM's uncertainty.
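To make the first major comment concrete: expected calibration error over confidence bins is one standard form of the calibration check the referee names. A minimal sketch in Python; none of the names or thresholds come from the paper:

```python
import numpy as np

def expected_calibration_error(confidence, correct, n_bins=10):
    """Expected calibration error (ECE) over equal-width confidence bins.

    confidence: proxy-derived confidence (e.g. 1 - uncertainty) per query.
    correct:    1 if the black-box LLM answered correctly, else 0.
    """
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidence > lo) & (confidence <= hi)
        if mask.any():
            # Gap between realized accuracy and claimed confidence in the bin.
            gap = abs(correct[mask].mean() - confidence[mask].mean())
            ece += mask.mean() * gap   # weight by bin occupancy
    return ece
```

A well-aligned proxy should yield a small ECE; a proxy reflecting only its own training artifacts has no reason to.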
Minor comments (1)
- Abstract: The acronym 'DisAAD' is used before its expansion; ensure the full name appears on first use.
Simulated Author's Rebuttal
We thank the referee for the positive assessment of the practical significance of our work and for the constructive major comments. We address each point below and have prepared revisions to improve the clarity and evidentiary support in the manuscript.
Point-by-point responses
- Referee: Abstract: The claim that 'extensive experiments have verified the effectiveness' and that a 1%-sized proxy achieves 'reliable uncertainty quantification' is load-bearing, yet the abstract (and by extension the reported results) provides no datasets, baselines, metrics (e.g., calibration curves, token-level KL divergence, or correlation between proxy evidence scores and LLM sampling entropy), or controls. Without these, it is impossible to confirm that the proxy's uncertainty matches the black-box LLM's epistemic uncertainty rather than reflecting the proxy's own training artifacts.
  Authors: We agree that the abstract, due to space constraints, omits specific experimental details. The full paper details evaluations on multiple QA benchmarks, comparisons to baselines including temperature sampling and other uncertainty methods, and reports metrics such as uncertainty calibration error and correlation coefficients between proxy evidence and LLM output entropy. To make the abstract self-contained, we will revise it to include a brief mention of the evaluation protocol and key quantitative results demonstrating the proxy's alignment with the target LLM's uncertainty. Revision: yes.
- Referee: Method description: The generation-discrimination architecture is asserted to 'guide the lightweight proxy to learn the high-quality regions' and thereby endow it with the ability to 'know whether the black-box LLM knows or not,' but no direct evidence is supplied that surface response matching plus discrimination transfers the relevant distributional properties for uncertainty. This assumption is central to the claim that the proxy faithfully estimates the target LLM's uncertainty.
  Authors: The paper supports this through the adversarial training objective, which explicitly aligns distributions beyond surface matching. Evidence is provided via empirical results where the proxy reproduces LLM responses with high fidelity and its uncertainty estimates (via evidence learning) show strong agreement with direct sampling from the black-box model. We acknowledge that more direct distributional comparisons could be beneficial. In the revision, we will include additional analysis, such as KL divergence measurements on held-out responses (one computable form is sketched below) and ablations isolating the discrimination component's contribution to uncertainty transfer. Revision: partial.
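The promised KL measurements face an obstacle worth noting: a black-box API yields samples but not log-probabilities, so KL(p_LLM || p_proxy) is not directly computable. The usual computable surrogate is a Monte Carlo cross-entropy over sampled responses, which differs from that KL only by the LLM's fixed entropy. A minimal sketch under that assumption; `proxy_logprob` is a hypothetical interface to the proxy's scoring function, not an API from the paper:

```python
def per_token_cross_entropy(llm_samples, proxy_logprob):
    """Monte Carlo surrogate for KL(p_LLM || p_proxy) on held-out prompts.

    llm_samples:   (prompt, response) pairs sampled from the black-box LLM.
    proxy_logprob: hypothetical callable returning (total log-probability
                   of `response` given `prompt` under the proxy, token count).
    Cross-entropy = KL + H(p_LLM); the entropy term is constant across
    proxies, so this still ranks proxies and tracks alignment over training.
    """
    total_nll, total_tokens = 0.0, 0
    for prompt, response in llm_samples:
        logprob, n_tokens = proxy_logprob(prompt, response)
        total_nll -= logprob
        total_tokens += n_tokens
    return total_nll / max(total_tokens, 1)   # nats per token
```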
Circularity Check
No significant circularity; the proposal is a new empirical method without self-referential reduction.
Full rationale
The paper proposes Distribution-Aligned Adversarial Distillation (DisAAD) as a new architecture that trains a lightweight proxy via generation-discrimination to approximate high-quality regions of a black-box LLM's output distribution, followed by evidence-based uncertainty estimation on reproduced responses. This is presented as an empirical method whose effectiveness is verified by experiments rather than a closed mathematical derivation. No equations or steps are shown that reduce the proxy uncertainty scores to the training inputs by construction, no load-bearing self-citations are invoked to justify uniqueness or ansatzes, and no fitted parameters are relabeled as independent predictions. The central result (reliable uncertainty from a 1% proxy) is therefore not tautological with the method's own definitions or prior self-citations; it remains an externally testable claim.
Axiom & Free-Parameter Ledger
Axioms (1)
- Standard math: standard machine learning assumptions on distribution alignment and adversarial training convergence.
Invented entities (1)
- DisAAD generation-discrimination architecture (no independent evidence).