pith. machine review for the scientific record.

arxiv: 2604.19162 · v1 · submitted 2026-04-21 · 💻 cs.CL · stat.AP

Recognition: unknown

Mind the Unseen Mass: Unmasking LLM Hallucinations via Soft-Hybrid Alphabet Estimation

Hongxing Pan, Jiashi Lu, Wenqing Kuang, Yingying Guo


Pith reviewed 2026-05-10 02:28 UTC · model grok-4.3

classification 💻 cs.CL stat.AP
keywords LLM uncertainty quantification · semantic alphabet size · Good-Turing coverage · entailment graph · hallucination detection · black-box sampling · semantic entropy · graph spectral estimation

The pith

A hybrid estimator fuses coverage statistics with graph spectral traces to better count distinct meanings in small samples from language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes SHADE (Soft-Hybrid Alphabet Dynamic Estimator) to estimate the effective number of distinct semantic meanings in LLM responses when only a few samples can be drawn per query. It builds an entailment-weighted graph from the responses and combines a generalized Good-Turing coverage signal with the heat-kernel trace of the normalized Laplacian on that graph. The coverage level selects the fusion rule, using a convex combination when coverage is high and LogSumExp when low to emphasize weakly observed modes, followed by a finite-sample correction. The resulting alphabet-size estimate converts to a coverage-adjusted semantic entropy score for uncertainty quantification. Experiments indicate the largest gains in alphabet-size accuracy and QA incorrectness detection occur in the most sample-limited settings.
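For concreteness, here is a minimal sketch of the classical Good-Turing coverage estimate on clustered responses. The paper's generalized variant may differ in detail, and the clustering step (grouping responses by meaning, e.g. via entailment) is assumed to have already been done.

```python
from collections import Counter

def good_turing_coverage(cluster_labels):
    """Classical Good-Turing coverage: 1 - (# singleton clusters) / n.

    cluster_labels assigns each sampled response to a semantic cluster.
    The paper's 'Generalized' Good-Turing variant may differ; this is the
    textbook form, shown only to fix intuition.
    """
    n = len(cluster_labels)
    singletons = sum(1 for c in Counter(cluster_labels).values() if c == 1)
    return 1.0 - singletons / n

# toy example: 6 responses falling into 3 meaning clusters, one seen only once
print(good_turing_coverage(["a", "a", "b", "b", "b", "c"]))  # 1 - 1/6 ≈ 0.83
```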

Core claim

SHADE estimates semantic alphabet size by adaptively fusing Generalized Good-Turing coverage with the heat-kernel trace of the normalized Laplacian on an entailment-weighted graph over sampled responses. High coverage triggers a convex combination of the two signals; low coverage applies LogSumExp fusion to emphasize missing modes. A finite-sample correction stabilizes the cardinality estimate, which then yields a coverage-adjusted semantic entropy score for black-box uncertainty quantification.
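The abstract does not give the fusion rule in closed form, so the following is only a plausible reading of the coverage-gated switch. The threshold, mixing weight, and temperature below are illustrative placeholders, and the finite-sample correction is omitted because its formula is not stated in the material above.

```python
import numpy as np

def fuse_alphabet_estimates(n_coverage, n_spectral, coverage,
                            tau=0.7, lam=0.5, beta=1.0):
    """Hedged sketch of the coverage-gated fusion described in the core claim.

    n_coverage : alphabet-size estimate from the (generalized) Good-Turing side
    n_spectral : estimate derived from the heat-kernel trace of the Laplacian
    coverage   : estimated sample coverage in [0, 1]
    tau, lam, beta are illustrative hyperparameters, not values from the paper.
    """
    if coverage >= tau:
        # high coverage: convex combination of the two signals
        return lam * n_coverage + (1.0 - lam) * n_spectral
    # low coverage: LogSumExp sits at or above the larger estimate,
    # emphasizing missing or weakly observed semantic modes
    return float(np.logaddexp(beta * n_coverage, beta * n_spectral) / beta)
```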

What carries the argument

The entailment-weighted graph over sampled responses, whose normalized Laplacian supplies a heat-kernel trace that is fused with Generalized Good-Turing coverage; the coverage value itself selects between convex-combination and LogSumExp fusion rules.
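A minimal sketch of the spectral side, assuming a symmetric matrix of pairwise entailment scores is already available; the paper's graph-construction details and diffusion-time scaling are not specified here.

```python
import numpy as np

def heat_kernel_trace(entailment_weights, t=1.0):
    """tr(exp(-t * L_sym)) for the symmetric normalized Laplacian (sketch).

    entailment_weights : (n, n) symmetric matrix of pairwise entailment scores
                         between sampled responses; self-loops are ignored.
    t                  : diffusion time; the paper's scaling is not given.
    The trace softly counts near-zero Laplacian eigenvalues, so it acts as a
    smooth proxy for the number of semantic clusters in the response graph.
    """
    W = np.array(entailment_weights, dtype=float)
    np.fill_diagonal(W, 0.0)
    deg = W.sum(axis=1)
    d_inv_sqrt = np.zeros_like(deg)
    d_inv_sqrt[deg > 0] = deg[deg > 0] ** -0.5
    L = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    eigvals = np.linalg.eigvalsh((L + L.T) / 2.0)  # symmetrize for stability
    return float(np.exp(-t * eigvals).sum())
```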

If this is right

  • Uncertainty scores derived from SHADE improve detection of incorrect QA answers most strongly when sampling budgets are tight.
  • The hybrid fusion reduces undercounting of rare semantic modes compared with pure frequency or pure spectral estimators.
  • As sample size grows, the advantage over simpler estimators shrinks, consistent with the method targeting the low-sample regime.
  • The coverage-adjusted semantic entropy provides a practical proxy for downstream risk under black-box access constraints.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same coverage-triggered fusion logic could be tested on other generative tasks such as code or image captioning where semantic modes are costly to sample exhaustively.
  • Adaptive sampling strategies might stop early once estimated coverage exceeds a threshold, reducing query cost while preserving estimate quality (a sketch of such a stopping rule follows this list).
  • If the entailment graph construction generalizes across model families, the estimator could serve as a lightweight post-hoc check on consistency without retraining.
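The early-stopping idea in the second bullet can be made concrete with a small loop. Everything here (the callables, the threshold, the budget) is hypothetical and not taken from the paper.

```python
from collections import Counter

def sample_until_covered(draw_response, assign_cluster, max_samples=20,
                         coverage_threshold=0.9, min_samples=3):
    """Hypothetical stopping rule: keep sampling until estimated coverage is high.

    draw_response  : callable returning one fresh model response (assumed)
    assign_cluster : callable mapping a response to a semantic cluster id (assumed)
    Uses the classical Good-Turing coverage estimate as the stopping signal.
    """
    labels = []
    for _ in range(max_samples):
        labels.append(assign_cluster(draw_response()))
        if len(labels) >= min_samples:
            singletons = sum(1 for c in Counter(labels).values() if c == 1)
            if 1.0 - singletons / len(labels) >= coverage_threshold:
                break
    return labels
```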

Load-bearing premise

The entailment-weighted graph built from the sampled responses correctly identifies distinct semantic modes, and the coverage estimate accurately chooses the right fusion rule without biasing the final count of meanings.

What would settle it

Collect a very large reference set of responses for the same queries to establish ground-truth semantic mode counts, then check whether SHADE's low-sample estimates deviate systematically from those counts in the small-sample regime.
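One way to run that check, sketched under the assumption that a very large clustered reference sample per query is available as ground truth; the estimator argument stands in for any SHADE-style alphabet-size estimator.

```python
import numpy as np

def small_sample_bias(reference_labels, estimator, n_small=5, trials=200, seed=0):
    """Sketch of the settling experiment: measure systematic deviation at small n.

    reference_labels : cluster labels for a very large reference sample of
                       responses to one query; the number of distinct labels
                       is treated as the ground-truth semantic mode count.
    estimator        : callable mapping a list of labels to an alphabet-size
                       estimate (e.g., a SHADE-style estimator).
    Returns (mean signed error at sample size n_small, ground-truth count).
    """
    rng = np.random.default_rng(seed)
    truth = len(set(reference_labels))
    errors = []
    for _ in range(trials):
        subsample = rng.choice(reference_labels, size=n_small, replace=False)
        errors.append(estimator(list(subsample)) - truth)
    return float(np.mean(errors)), truth
```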

Figures

Figures reproduced from arXiv: 2604.19162 by Hongxing Pan, Jiashi Lu, Wenqing Kuang, Yingying Guo.

Figure 1. Black-box sampling observes only a few responses per query; clustering yields …
original abstract

This paper studies uncertainty quantification for large language models (LLMs) under black-box access, where only a small number of responses can be sampled for each query. In this setting, estimating the effective semantic alphabet size--that is, the number of distinct meanings expressed in the sampled responses--provides a useful proxy for downstream risk. However, frequency-based estimators tend to undercount rare semantic modes when the sample size is small, while graph-spectral quantities alone are not designed to estimate semantic occupancy accurately. To address this issue, we propose SHADE (Soft-Hybrid Alphabet Dynamic Estimator), a simple and interpretable estimator that combines Generalized Good-Turing coverage with a heat-kernel trace of the normalized Laplacian constructed from an entailment-weighted graph over sampled responses. The estimated coverage adaptively determines the fusion rule: under high coverage, SHADE uses a convex combination of the two signals, while under low coverage it applies a LogSumExp fusion to emphasize missing or weakly observed semantic modes. A finite-sample correction is then introduced to stabilize the resulting cardinality estimate before converting it into a coverage-adjusted semantic entropy score. Experiments on pooled semantic alphabet-size estimation against large-sample references and on QA incorrectness detection show that SHADE achieves the strongest improvements in the most sample-limited regime, while the performance gap narrows as the number of samples increases. These results suggest that hybrid semantic occupancy estimation is particularly beneficial when black-box uncertainty quantification must operate under tight sampling budgets.
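The abstract does not spell out the coverage-adjusted entropy formula. As a point of reference only, a standard unseen-mass adjustment of plug-in entropy (the Chao-Shen estimator) looks like this; the paper's score may well differ.

```python
import numpy as np
from collections import Counter

def coverage_adjusted_entropy(cluster_labels):
    """Chao-Shen coverage-adjusted entropy over semantic clusters (illustrative).

    A standard unseen-mass correction, shown only to illustrate the kind of
    adjustment the abstract describes; not the paper's exact formula.
    """
    n = len(cluster_labels)
    counts = np.array(list(Counter(cluster_labels).values()), dtype=float)
    singletons = float((counts == 1).sum())
    if singletons == n:
        singletons = n - 1.0            # common safeguard so coverage stays positive
    coverage = 1.0 - singletons / n     # Good-Turing coverage
    p_adj = coverage * counts / n       # shrink observed probabilities
    inclusion = 1.0 - (1.0 - p_adj) ** n  # Horvitz-Thompson inclusion probability
    return float(-np.sum(p_adj * np.log(p_adj) / inclusion))
```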

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, circularity check, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper proposes SHADE (Soft-Hybrid Alphabet Dynamic Estimator), a hybrid method for estimating the effective semantic alphabet size (number of distinct meanings) in LLM responses under black-box access with small sample sizes. It fuses a Generalized Good-Turing coverage estimate with the heat-kernel trace of the normalized Laplacian on an entailment-weighted graph over sampled responses; coverage level selects between convex-combination fusion (high coverage) and LogSumExp fusion (low coverage), followed by a finite-sample correction to yield a cardinality estimate that is converted into a coverage-adjusted semantic entropy score. Experiments on pooled alphabet-size estimation against large-sample references and on QA incorrectness detection report the largest gains in the most sample-limited regimes, with the gap narrowing as sample size grows.

Significance. If the central claims hold, the work addresses a practically important gap in black-box LLM uncertainty quantification by providing an interpretable estimator that targets unseen semantic modes when sampling budgets are tight; this could improve hallucination proxies via better semantic-occupancy estimates. The adaptive fusion of frequency-based and graph-spectral signals is a reasonable idea for small-n settings, and the empirical focus on sample-limited regimes is well-motivated. However, the absence of explicit formulas, derivations, or stability analysis for the coverage-driven fusion and finite-sample correction makes it difficult to assess whether the reported gains are robust or artifactual.

major comments (3)
  1. [Abstract] Abstract: the finite-sample correction is described only at a high level with no explicit formula, derivation, or bias analysis; because this correction is applied after the coverage-dependent fusion and is claimed to stabilize the cardinality estimate precisely in the small-n regime where gains are largest, its absence prevents verification that the estimator does not reduce to a fitted or self-referential quantity.
  2. [Abstract] Abstract and §3 (method): the coverage estimate obtained via Generalized Good-Turing on the same small sample used to build the entailment graph is used to select between convex and LogSumExp fusion; yet coverage estimation variance is highest precisely when n is smallest, so mis-selection can systematically bias the final cardinality in the direction opposite to the intended correction for unseen modes. No analytic bound, threshold-stability analysis, or ablation for n ≤ 10 is provided, undermining the central claim that SHADE achieves its strongest improvements in the sample-limited regime. (A coverage-variance simulation sketch illustrating this point follows the major comments.)
  3. [Experiments] Experiments section: the pooled semantic alphabet-size results and QA incorrectness detection both rely on the entailment-weighted graph accurately capturing distinct semantic modes and on the coverage scalar reliably indicating the fusion rule without introducing bias; the weakest assumption noted in the reader report is therefore load-bearing, but no sensitivity analysis or alternative graph-construction ablations are reported to confirm that the observed gains survive perturbations to the entailment threshold or graph construction.
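To make the variance concern in major comment 2 tangible, a quick Monte Carlo over an assumed true mode distribution shows how noisy the coverage estimate is at small n; the distribution and budgets below are illustrative only, not drawn from the paper.

```python
import numpy as np

def coverage_estimate_spread(mode_probs, n, trials=2000, seed=0):
    """Monte Carlo spread of the Good-Turing coverage estimate at budget n.

    mode_probs : assumed true probabilities of the semantic modes (illustrative)
    n          : per-query sampling budget
    Returns (mean, std) of the coverage estimate; a large std near the fusion
    threshold means the convex/LogSumExp switch can flip from run to run.
    """
    rng = np.random.default_rng(seed)
    estimates = []
    for _ in range(trials):
        sample = rng.choice(len(mode_probs), size=n, p=mode_probs)
        _, counts = np.unique(sample, return_counts=True)
        estimates.append(1.0 - (counts == 1).sum() / n)
    return float(np.mean(estimates)), float(np.std(estimates))

# e.g. compare coverage_estimate_spread([0.5, 0.3, 0.1, 0.05, 0.05], n=5)
#      against the same call with n=20
```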
minor comments (3)
  1. [Abstract] Abstract: the acronym SHADE is introduced without an initial expansion in the title or first sentence.
  2. Notation: the precise definition of the heat-kernel trace and the normalized Laplacian on the entailment graph should be given explicitly (including any temperature or scaling parameters) rather than left at the level of 'heat-kernel trace of the normalized Laplacian'. (The standard definitions are sketched after these comments for reference.)
  3. Missing references: prior work on Good-Turing estimators for unseen species and on graph-based semantic clustering for LLM responses should be cited to clarify the incremental contribution.
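For reference on minor comment 2, the standard definitions of the quantities in question are below; the paper may use a different normalization or temperature convention.

```latex
% Standard definitions only; the paper's exact scaling or normalization may differ.
% W is the entailment-weighted adjacency over sampled responses, D = diag(W \mathbf{1}).
\[
  L_{\mathrm{sym}} \,=\, I - D^{-1/2} W D^{-1/2},
  \qquad
  \operatorname{tr}\!\left(e^{-t L_{\mathrm{sym}}}\right)
  \,=\, \sum_{i=1}^{n} e^{-t \lambda_i},
\]
% where \lambda_1 \le \dots \le \lambda_n are the eigenvalues of L_{\mathrm{sym}}
% and t > 0 is the diffusion (temperature) parameter.
```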

Simulated Authors' Rebuttal

3 responses · 1 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below and have revised the manuscript to improve clarity and add supporting analyses where feasible.

point-by-point responses
  1. Referee: [Abstract] Abstract: the finite-sample correction is described only at a high level with no explicit formula, derivation, or bias analysis; because this correction is applied after the coverage-dependent fusion and is claimed to stabilize the cardinality estimate precisely in the small-n regime where gains are largest, its absence prevents verification that the estimator does not reduce to a fitted or self-referential quantity.

    Authors: We agree that the finite-sample correction was presented only at a high level in the original abstract. In the revised manuscript we have added the explicit formula, its derivation based on the expected unseen mass, and a short bias analysis to Section 3. The correction is a post-fusion multiplicative adjustment that depends only on the coverage scalar and is therefore not self-referential. revision: yes

  2. Referee: [Abstract] Abstract and §3 (method): the coverage estimate obtained via Generalized Good-Turing on the same small sample used to build the entailment graph is used to select between convex and LogSumExp fusion; yet coverage estimation variance is highest precisely when n is smallest, so mis-selection can systematically bias the final cardinality in the direction opposite to the intended correction for unseen modes. No analytic bound, threshold-stability analysis, or ablation for n ≤ 10 is provided, undermining the central claim that SHADE achieves its strongest improvements in the sample-limited regime.

    Authors: We acknowledge the risk of mis-selection arising from variance in the coverage estimate at small n. We have added an ablation study restricted to n ≤ 10 that compares the adaptive rule against fixed convex and fixed LogSumExp fusions and reports the observed selection frequencies. These results show that SHADE retains its advantage even under noisy coverage estimates. A full analytic bound on selection stability is non-trivial and is noted as future work. revision: partial

  3. Referee: [Experiments] Experiments section: the pooled semantic alphabet-size results and QA incorrectness detection both rely on the entailment-weighted graph accurately capturing distinct semantic modes and on the coverage scalar reliably indicating the fusion rule without introducing bias; the weakest assumption noted in the reader report is therefore load-bearing, but no sensitivity analysis or alternative graph-construction ablations are reported to confirm that the observed gains survive perturbations to the entailment threshold or graph construction.

    Authors: We agree that sensitivity to graph construction merits explicit verification. The revised Experiments section now includes ablations that vary the entailment threshold over [0.6, 0.95] and substitute an embedding-similarity graph for the entailment graph. The performance gains of SHADE remain consistent across these variants. revision: yes

standing simulated objections not resolved
  • Full analytic bound or threshold-stability analysis for coverage-driven fusion selection at n ≤ 10

Circularity Check

0 steps flagged

No significant circularity in SHADE derivation chain

full rationale

The abstract and described method define SHADE as an explicit combination of Generalized Good-Turing coverage (estimated from samples) with heat-kernel trace on an entailment graph, using the coverage scalar to select between convex combination and LogSumExp fusion before a finite-sample correction. No equation or step reduces the final cardinality estimate to a fitted parameter, self-referential quantity, or prior self-citation by construction. The fusion rule is data-driven but defined externally to the target estimate, and no load-bearing uniqueness theorem or ansatz is imported via self-citation. The derivation remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach rests on standard assumptions from coverage estimation and graph Laplacians plus two paper-specific choices (adaptive fusion rule and finite-sample correction) whose justification is not independently validated in the provided abstract.

axioms (2)
  • domain assumption: An entailment-weighted graph over sampled responses meaningfully represents semantic occupancy.
    Invoked to construct the normalized Laplacian whose heat-kernel trace is fused with coverage.
  • ad hoc to paper: Coverage level is a reliable indicator for choosing between convex and LogSumExp fusion without introducing systematic bias.
    Central to the soft-hybrid rule described in the abstract.

pith-pipeline@v0.9.0 · 5570 in / 1373 out tokens · 80427 ms · 2026-05-10T02:28:19.316412+00:00 · methodology

