A Semantic-Sampling Framework for Evaluating Calibration in Open-Ended Question Answering
Pith reviewed 2026-05-12 01:05 UTC · model grok-4.3
The pith
Semantic sampling of model answers creates an asymptotically unbiased calibration metric for open-ended QA without needing internal logits or verbalized confidence.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Sem-ECE samples answers from the model, clusters them by semantic equivalence, and treats the empirical frequency of each cluster as the model's confidence; Sem1-ECE computes the same-sample self-consistency score while Sem2-ECE uses a held-out sample for frequency estimation. Both are asymptotically unbiased estimators of the true expected calibration error. They coincide on easy questions but separate on hard ones, with Sem2 achieving lower error; the gap itself diagnoses question difficulty. On three open-ended QA benchmarks the method matches theoretical predictions and beats verbalized confidence and existing sampling baselines.
What carries the argument
Sem-ECE (Semantic-Sampling Expected Calibration Error), which draws multiple model responses, partitions them into semantic equivalence classes, and derives confidence directly from class frequencies.
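The frequency-as-confidence step can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: a normalized exact-string match stands in for the semantic clustering step (which in practice would use embeddings or an LLM judge), and the function name is ours.

```python
from collections import Counter

def semantic_confidence(samples, equivalent=None):
    """Map sampled answers to confidence scores via semantic-class frequencies."""
    if equivalent is None:
        # Normalized exact match stands in for real semantic clustering.
        equivalent = lambda a: a.strip().lower()
    counts = Counter(equivalent(a) for a in samples)
    n = len(samples)
    # The empirical frequency of each semantic class is the confidence.
    return {cls: c / n for cls, c in counts.items()}

# Four sampled answers to one question; the "Paris" variants collapse together.
confs = semantic_confidence(["Paris", "paris ", "Lyon", "Paris"])
# confs == {"paris": 0.75, "lyon": 0.25}
```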
If this is right
- Sem-ECE supplies a calibration score for open-ended QA that works without access to model logits or token probabilities.
- The gap between Sem1-ECE and Sem2-ECE serves as an automatic indicator of question difficulty.
- Both estimators converge to the true expected calibration error as the number of samples increases.
- Sem-ECE complements logit-based metrics on models where internal probabilities are unavailable or unreliable.
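The same-sample versus held-out distinction behind the Sem1/Sem2 gap can be made concrete. A hedged sketch under simplifying assumptions: exact matching stands in for semantic clustering, and an even split stands in for whatever held-out scheme the paper actually uses.

```python
from collections import Counter

def sem1_confidence(samples):
    # Same-sample (self-consistency): the majority class is both
    # selected and scored on the same draws.
    top, count = Counter(samples).most_common(1)[0]
    return top, count / len(samples)

def sem2_confidence(samples):
    # Held-out: select the answer on the first half, then estimate its
    # frequency on the second half, decoupling selection from scoring.
    half = len(samples) // 2
    select, evaluate = samples[:half], samples[half:]
    top, _ = Counter(select).most_common(1)[0]
    return top, Counter(evaluate)[top] / len(evaluate)

samples = ["a", "a", "b", "a", "b", "b", "a", "a"]
gap = sem1_confidence(samples)[1] - sem2_confidence(samples)[1]
# A positive gap is the paper's proposed question-difficulty signal.
```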
Where Pith is reading between the lines
- The same sampling-plus-clustering approach could be applied to measure calibration on other open-ended generative tasks such as summarization or code generation.
- If semantic classes prove stable across different clustering algorithms, the framework may reduce the need for hand-crafted extraction rules that currently limit sampling-based calibration methods.
- The divergence between same-sample and held-out estimators on difficult questions suggests that future work could use this gap to decide when to collect additional samples or to trigger human review.
Load-bearing premise
That semantic clustering of the sampled answers yields stable, meaningful classes whose frequencies accurately reflect the model's underlying predictive distribution without the clustering procedure itself adding systematic bias.
What would settle it
Run the estimator on a dataset where each question has a known ground-truth answer and a large number of independent model samples; if the reported Sem-ECE value fails to decrease toward zero as sample size grows, or if different clustering algorithms produce materially different frequency distributions for the same questions, the central claim is falsified.
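The proposed check can be approximated in simulation. Under the simplifying assumption that answers to one question follow a fixed two-class distribution (so clustering is trivially exact), the frequency-based confidence should concentrate on the true answer probability as the sample size grows:

```python
import random

def empirical_frequency(p_correct, n, rng):
    # Draw n answers from a two-class answer distribution and return the
    # empirical frequency of the correct class (the Sem-ECE confidence).
    return sum(rng.random() < p_correct for _ in range(n)) / n

def mean_abs_error(p_correct, n, trials, rng):
    # Average |estimated frequency - true probability| over repeated runs.
    return sum(abs(empirical_frequency(p_correct, n, rng) - p_correct)
               for _ in range(trials)) / trials

rng = random.Random(0)
small_n_error = mean_abs_error(0.7, 10, 200, rng)
large_n_error = mean_abs_error(0.7, 1000, 200, rng)
# Per the claim, the estimation error should shrink roughly like 1/sqrt(n);
# a Sem-ECE that failed to shrink here would falsify the unbiasedness story.
```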
Original abstract
Calibration measures whether a model's predicted confidence aligns with its empirical accuracy, and is central to the reliable deployment of large language models (LLMs) in high-stakes domains such as medicine and law. While much recent work focuses on improving LLM calibration, the equally important question of how to evaluate it in realistic settings remains underdeveloped. Open-ended question answering (QA), the most common deployment setting for modern LLMs, is where existing evaluation methods fall short: logit-based metrics need restricted output formats and internal probabilities; verbalized confidence is self-reported and often overconfident; and sampling-based methods rely on task-specific extraction rules without a clear finite-sample target. We introduce Sem-ECE (Semantic-Sampling Expected Calibration Error), a calibration evaluation framework for open-ended QA that samples answers from the model, groups them into semantic classes, and uses the resulting frequencies as confidence. We study two estimators within this framework: Sem$_1$-ECE, the same-sample self-consistency score, and Sem$_2$-ECE, a held-out variant that separates answer selection from confidence evaluation. We prove both are asymptotically unbiased, and further show that they agree on easy questions but diverge on hard ones with Sem$_2$ achieving strictly smaller calibration error, so their gap also serves as a diagnostic for question difficulty. Experiments on three open-ended QA benchmarks across five leading commercial LLMs match our theoretical predictions and show that Sem-ECE outperforms verbalized confidence and existing sampling-based methods, while complementing logit-based evaluation when internal probabilities are unavailable.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Sem-ECE, a calibration evaluation framework for open-ended QA. It samples answers from an LLM, groups them into semantic classes via clustering, and uses the resulting empirical class frequencies as confidence scores. Two estimators are defined: Sem1-ECE (same-sample self-consistency) and Sem2-ECE (held-out variant). The authors prove both are asymptotically unbiased, show theoretically that the estimators agree on easy questions but diverge on hard ones (with Sem2 yielding lower calibration error), and report experiments on three benchmarks across five LLMs where Sem-ECE outperforms verbalized confidence and prior sampling methods while complementing logit-based metrics.
Significance. If the central claims hold, the work supplies a theoretically grounded, sampling-only method for calibration assessment in the open-ended setting that avoids reliance on internal logits or verbalized self-reports. The proofs of asymptotic unbiasedness for both estimators and the use of their divergence as a question-difficulty diagnostic are notable strengths. The empirical results across commercial LLMs provide concrete evidence that the framework is practical and improves upon existing baselines.
major comments (2)
- [Theoretical analysis / proofs of asymptotic unbiasedness] The proofs of asymptotic unbiasedness for Sem1-ECE and Sem2-ECE (referenced in the abstract and presumably detailed in the theoretical section) treat the semantic partition as given and assume that empirical frequencies after clustering converge to the model's true distribution over semantic equivalence classes. No analysis is provided showing that the clustering error (from embeddings, LLM-as-judge, or similar) vanishes with sample size; if clustering mistakes persist on high-diversity answers, the estimators remain biased even in the large-sample limit. This assumption is load-bearing for the unbiasedness claim.
- [Experiments] The experimental section reports that results match the theoretical predictions and that Sem-ECE outperforms baselines, but provides no details on the concrete clustering procedure, stability checks across random seeds, or how semantic classes are validated against human judgments. Without these, it is impossible to confirm that the observed calibration improvements are not artifacts of the particular clustering implementation.
minor comments (2)
- [Abstract] The abstract states that experiments use 'three open-ended QA benchmarks' but does not name them; adding the dataset names would improve clarity.
- [Introduction / framework definition] Notation for the two estimators (Sem1-ECE vs. Sem2-ECE) is introduced without an explicit equation reference in the abstract; a short definitional equation early in the paper would help readers.
Simulated Author's Rebuttal
We thank the referee for their constructive and insightful comments, which help clarify the scope and limitations of our theoretical and empirical contributions. We address each major comment below and indicate the revisions we will make to the manuscript.
Point-by-point responses
Referee: [Theoretical analysis / proofs of asymptotic unbiasedness] The proofs of asymptotic unbiasedness for Sem1-ECE and Sem2-ECE (referenced in the abstract and presumably detailed in the theoretical section) treat the semantic partition as given and assume that empirical frequencies after clustering converge to the model's true distribution over semantic equivalence classes. No analysis is provided showing that the clustering error (from embeddings, LLM-as-judge, or similar) vanishes with sample size; if clustering mistakes persist on high-diversity answers, the estimators remain biased even in the large-sample limit. This assumption is load-bearing for the unbiasedness claim.
Authors: We appreciate the referee for identifying this key assumption in our theoretical analysis. The proofs of asymptotic unbiasedness for both Sem1-ECE and Sem2-ECE are derived conditional on a fixed semantic partition, under which the empirical class frequencies converge to the model's true distribution over those classes. We do not provide a formal analysis demonstrating that clustering error (whether from embeddings or an LLM judge) necessarily vanishes with increasing sample size, particularly for high-diversity answers. This is a valid point, and the unbiasedness claim holds only under the additional assumption that the clustering procedure is consistent. In the revised manuscript, we will explicitly state this conditioning assumption in the theoretical section, add a discussion of the conditions required for clustering consistency, and note the potential for persistent bias in high-diversity settings as a limitation of the framework. revision: partial
Referee: [Experiments] The experimental section reports that results match the theoretical predictions and that Sem-ECE outperforms baselines, but provides no details on the concrete clustering procedure, stability checks across random seeds, or how semantic classes are validated against human judgments. Without these, it is impossible to confirm that the observed calibration improvements are not artifacts of the particular clustering implementation.
Authors: We agree that the current experimental section lacks sufficient detail on the clustering implementation, which is necessary to ensure reproducibility and to substantiate that the reported improvements are not implementation-specific. In the revised manuscript, we will expand the experimental section to include a complete description of the clustering procedure (embedding model, algorithm, and hyperparameters), quantitative stability results across multiple random seeds for both sampling and clustering, and a human validation study on a representative subset of questions comparing the derived semantic classes to independent human annotations. These additions will directly address the concern and strengthen the empirical claims. revision: yes
Circularity Check
No significant circularity; asymptotic unbiasedness follows from standard LLN on fixed partitions
Full rationale
The paper defines Sem-ECE by sampling answers, applying an external semantic clustering step to form classes, and setting confidence to the resulting empirical frequencies. It then claims to prove that both Sem1-ECE (same-sample) and Sem2-ECE (held-out) are asymptotically unbiased. This unbiasedness is the direct consequence of the law of large numbers applied to empirical class frequencies converging to the population probabilities under whatever fixed partition the clustering produces; it does not reduce to the framework definition by construction, nor does it rely on self-citation, fitted parameters renamed as predictions, or an ansatz smuggled from prior work. The clustering procedure itself is treated as an input whose correctness is a modeling assumption, not derived inside the proof. No load-bearing step equates the claimed result to its own inputs.
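In symbols, the law-of-large-numbers step the rationale appeals to reads as follows (notation chosen here for exposition; the paper's own symbols may differ). Assuming a fixed partition $\{S_k\}$ and i.i.d. sampled answers $a_1, \dots, a_n$ from the model's predictive distribution:

```latex
\hat{p}_k \;=\; \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}\!\left[a_i \in S_k\right]
\;\xrightarrow{\text{a.s.}}\; p_k \;=\; \Pr\!\left(a \in S_k\right),
\qquad
\mathbb{E}\!\left[\hat{p}_k\right] = p_k \quad \text{for every } n.
```

Any continuous functional of the class frequencies, in particular the plug-in calibration error, then converges to its population value; the conclusion does not feed back into the framework's own definition.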
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Semantic classes over generated answers can be reliably identified and grouped without task-specific extraction rules.
- standard math The sampling process produces answers whose empirical distribution converges to the model's predictive distribution.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relevance unclear. Matched text: "We introduce Sem-ECE ... samples answers from the model, groups them into semantic classes, and uses the resulting frequencies as confidence. We prove both are asymptotically unbiased."
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · relevance unclear. Matched text: "Bias expansion ... E[ĉ1 − c*_q | q] = √(p_q / n) J(λ̃_q) + o(n^{-1/2}) ... g_A(λ̃) = φ(2λ̃)"
Reference graph
Works this paper leans on
- [1] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1321–1330. PMLR, 2017.
- [2] Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, Scott Johnston, Sheer El-Showk, Andy Jones, Nelson Elhage, Tristan Hume, Anna Chen, Yuntao Bai, Sam Bowman, Stanislav Fort, Deep Ganguli, Danny Hernandez, Josh Jacobson, Jackson Kernion, Shauna Kravec, et al. Language models (mostly) know what they know, 2022.
- [3] Katherine Tian, Eric Mitchell, Allan Zhou, Archit Sharma, Rafael Rafailov, Huaxiu Yao, Chelsea Finn, and Christopher D. Manning. Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023.
- [4] John C. Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In Advances in Large Margin Classifiers, pages 61–74. MIT Press, 1999.
- [5] Meelis Kull, Miquel Perello-Nieto, Markus Kängsepp, Telmo Silva Filho, Hao Song, and Peter Flach. Beyond temperature scaling: Obtaining well-calibrated multiclass probabilities with Dirichlet calibration. In Advances in Neural Information Processing Systems, volume 32, 2019.
- [6] Ananya Kumar, Percy Liang, and Tengyu Ma. Verified uncertainty calibration. In Advances in Neural Information Processing Systems, volume 32, 2019.
- [7] Jiancong Xiao, Bojian Hou, Zhanliang Wang, Ruochen Jin, Qi Long, Weijie J. Su, and Li Shen. Restoring calibration for aligned large language models: A calibration-aware fine-tuning approach. In Proceedings of the 42nd International Conference on Machine Learning, volume 267 of Proceedings of Machine Learning Research, pages 68364–68390. PMLR, 2025.
- [8] Glenn W. Brier. Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78(1):1–3, 1950.
- [9] Mahdi Pakdaman Naeini, Gregory F. Cooper, and Milos Hauskrecht. Obtaining well calibrated probabilities using Bayesian binning. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, pages 2901–2907, 2015.
- [10] Stephanie C. Lin, Jacob Hilton, and Owain Evans. Teaching models to express their uncertainty in words. Transactions on Machine Learning Research, 2022. URL https://openreview.net/forum?id=8s8K2UZGTZ.
- [11] Sabrina J. Mielke, Arthur Szlam, Emily Dinan, and Y-Lan Boureau. Reducing conversational agents’ overconfidence through linguistic calibration. Transactions of the Association for Computational Linguistics, 10:857–872, 2022. doi: 10.1162/tacl_a_00494.
- [12] Jason Wei, Karina Nguyen, Hyung Won Chung, Yunxin Joy Jiao, Spencer Papay, Amelia Glaese, John Schulman, and William Fedus. Measuring short-form factuality in large language models, 2024.
- [13] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V. Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. In International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=1PL1NIMMrw.
- [14] Qing Lyu, Kumar Shridhar, Chaitanya Malaviya, Li Zhang, Yanai Elazar, Niket Tandon, Marianna Apidianaki, Mrinmaya Sachan, and Chris Callison-Burch. Calibrating large language models with sample consistency. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 19260–19268, 2025. doi: 10.1609/aaai.v39i18.34120.
- [15] Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar. Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation. In International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=VD-AYtP0dve.
- [16] Sebastian Farquhar, Jannik Kossen, Lorenz Kuhn, and Yarin Gal. Detecting hallucinations in large language models using semantic entropy. Nature, 630:625–630, 2024. doi: 10.1038/s41586-024-07421-0.
- [17] John H. Kagel and Dan Levin. Common Value Auctions and the Winner’s Curse. Princeton University Press, Princeton, NJ, 2002.
- [18] Long Phan, Alice Gatti, Ziwen Han, Nathaniel Li, Josephina Hu, Hugh Zhang, Sean Shi, et al. Humanity’s last exam, 2025. URL https://arxiv.org/abs/2501.14249.
- [19] Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Daniel Khashabi, and Hannaneh Hajishirzi. When not to trust language models: Investigating effectiveness of parametric and non-parametric memories. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9802–9822, Toronto, Canada, 2023.
- [20] OpenAI. GPT-5 models. https://platform.openai.com/docs/models, 2025. Accessed: 2026-01.
- [21] Anthropic. Claude models. https://docs.claude.com/en/docs/about-claude/models, 2025. Accessed: 2026-01.
- [22] Google DeepMind. Gemini API models. https://ai.google.dev/gemini-api/docs/models, 2025. Accessed: 2026-01.
- [23] xAI. Grok models. https://docs.x.ai/docs/models, 2025. Accessed: 2026-01.
- [24] Mistral AI. Mistral Large. https://docs.mistral.ai/getting-started/models/models_overview/, 2024. Accessed: 2026-01.
- [25] Miao Xiong, Zhiyuan Hu, Xinyang Lu, Yifei Li, Jie Fu, Junxian He, and Bryan Hooi. Can LLMs express their uncertainty? An empirical evaluation of confidence elicitation in LLMs. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=gjeQKFxFpZ.