pith · machine review for the scientific record

arxiv: 2605.08432 · v1 · submitted 2026-05-08 · 💻 cs.CL · cs.AI · stat.ML

Recognition: 2 Lean theorem links

A Semantic-Sampling Framework for Evaluating Calibration in Open-Ended Question Answering

Bojian Hou, Jiancong Xiao, Li Shen, Ruochen Jin, Shu Yang, Zhanliang Wang

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 01:05 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · stat.ML
keywords: calibration evaluation · open-ended question answering · semantic sampling · expected calibration error · LLM confidence estimation · self-consistency · asymptotically unbiased estimator

The pith

Semantic sampling of model answers creates an asymptotically unbiased calibration metric for open-ended QA without needing internal logits or verbalized confidence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Sem-ECE, a framework that samples multiple answers from a language model, groups them into semantic classes, and uses the resulting class frequencies as estimates of the model's confidence for each answer. It defines two estimators: Sem1-ECE, which uses the same samples for both answer selection and confidence, and Sem2-ECE, which holds out samples to separate those steps. The authors prove both estimators are asymptotically unbiased and show that their values agree on easy questions but diverge on hard ones, with Sem2 producing strictly smaller calibration error. Experiments across three benchmarks and five commercial LLMs confirm that Sem-ECE outperforms verbalized confidence and prior sampling methods while remaining usable when internal probabilities are unavailable. This matters because open-ended QA is the primary deployment setting for LLMs, yet existing calibration checks either demand restricted output formats or rely on self-reported scores that tend to be overconfident.

Core claim

Sem-ECE samples answers from the model, clusters them by semantic equivalence, and treats the empirical frequency of each cluster as the model's confidence; Sem1-ECE computes the same-sample self-consistency score while Sem2-ECE uses a held-out sample for frequency estimation. Both are asymptotically unbiased estimators of the true expected calibration error. They coincide on easy questions but separate on hard ones, with Sem2 achieving lower error; the gap itself diagnoses question difficulty. On three open-ended QA benchmarks the method matches theoretical predictions and beats verbalized confidence and existing sampling baselines.
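As a concrete sketch of the mechanics described above (not the authors' code), the per-question answer and confidence for the two estimators could be computed as follows; `same_class` stands in for whatever semantic-equivalence check is used (embedding similarity, an LLM judge) and is a placeholder here:

```python
from collections import Counter

def semantic_classes(answers, same_class):
    """Greedily partition sampled answers into semantic equivalence
    classes using a black-box pairwise check `same_class(a, b)`."""
    reps, labels = [], []
    for a in answers:
        for i, r in enumerate(reps):
            if same_class(a, r):
                labels.append(i)
                break
        else:
            reps.append(a)
            labels.append(len(reps) - 1)
    return labels

def sem1_confidence(answers, same_class):
    """Sem1: the same samples select the modal answer and score it."""
    labels = semantic_classes(answers, same_class)
    top, count = Counter(labels).most_common(1)[0]
    return answers[labels.index(top)], count / len(answers)

def sem2_confidence(answers, same_class):
    """Sem2: the first half of the samples selects the answer; the
    held-out second half supplies the frequency used as confidence."""
    labels = semantic_classes(answers, same_class)
    half = len(answers) // 2
    selection, held_out = labels[:half], labels[half:]
    top = Counter(selection).most_common(1)[0][0]
    confidence = sum(l == top for l in held_out) / len(held_out)
    return answers[labels.index(top)], confidence
```

Pooling |confidence − accuracy| over questions (binned, as in standard ECE) would then yield the Sem1-ECE and Sem2-ECE scores.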

What carries the argument

Sem-ECE (Semantic-Sampling Expected Calibration Error), which draws multiple model responses, partitions them into semantic equivalence classes, and derives confidence directly from class frequencies.
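For orientation, the target quantity is the standard binned expected calibration error (generic textbook form, not this paper's exact notation); Sem-ECE plugs the semantic-class frequency in as the confidence term:

```latex
\mathrm{ECE} \;=\; \sum_{b=1}^{B} \frac{|\mathcal{B}_b|}{N}
\,\Bigl|\,\mathrm{acc}(\mathcal{B}_b) - \mathrm{conf}(\mathcal{B}_b)\,\Bigr|
```

where $\mathcal{B}_b$ is the set of questions whose confidence falls in bin $b$ and $N$ is the total number of questions.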

If this is right

  • Sem-ECE supplies a calibration score for open-ended QA that works without access to model logits or token probabilities.
  • The gap between Sem1-ECE and Sem2-ECE serves as an automatic indicator of question difficulty.
  • Both estimators converge to the true expected calibration error as the number of samples increases.
  • Sem-ECE complements logit-based metrics on models where internal probabilities are unavailable or unreliable.
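The convergence bullet can be illustrated with a toy Monte Carlo (not from the paper): for a binary question whose true modal probability is max(p, 1 − p), the same-sample modal frequency overshoots at small n, and the overshoot decays as n grows; this is the mechanism behind the Sem1/Sem2 gap on hard questions.

```python
import random

def mean_modal_frequency(p, n, trials=4000, seed=0):
    """Average same-sample frequency of the modal class for a toy
    binary question with true class probabilities (p, 1 - p).
    For small n this overshoots max(p, 1 - p); the overshoot decays
    as n grows, mirroring the same-sample bias on hard questions."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        hits = sum(rng.random() < p for _ in range(n))
        total += max(hits, n - hits) / n
    return total / trials
```

On a hard question (p = 0.55), the n = 5 estimate sits well above 0.55, while the n = 200 estimate is close to it.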

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same sampling-plus-clustering approach could be applied to measure calibration on other open-ended generative tasks such as summarization or code generation.
  • If semantic classes prove stable across different clustering algorithms, the framework may reduce the need for hand-crafted extraction rules that currently limit sampling-based calibration methods.
  • The divergence between same-sample and held-out estimators on difficult questions suggests that future work could use this gap to decide when to collect additional samples or to trigger human review.

Load-bearing premise

That semantic clustering of the sampled answers yields stable, meaningful classes whose frequencies accurately reflect the model's underlying predictive distribution without the clustering procedure itself adding systematic bias.

What would settle it

Run the estimator on a dataset where each question has a known ground-truth answer and a large number of independent model samples; if the reported Sem-ECE value fails to decrease toward zero as sample size grows, or if different clustering algorithms produce materially different frequency distributions for the same questions, the central claim is falsified.
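The clustering-sensitivity half of that test could be operationalized with a simple stability check; the clustering functions below are placeholders, not the paper's procedure:

```python
from collections import Counter

def frequency_profile(answers, cluster_fn):
    """Sorted semantic-class frequencies induced by a clustering function
    (any callable mapping a list of answers to a list of labels)."""
    counts = Counter(cluster_fn(answers))
    n = len(answers)
    return sorted((k / n for k in counts.values()), reverse=True)

def profile_distance(answers, cluster_a, cluster_b):
    """Total-variation-style distance between the frequency profiles two
    clustering procedures assign to the same sampled answers; values
    near 0 suggest the confidence estimates are robust to the clusterer."""
    pa = frequency_profile(answers, cluster_a)
    pb = frequency_profile(answers, cluster_b)
    m = max(len(pa), len(pb))
    pa = pa + [0.0] * (m - len(pa))
    pb = pb + [0.0] * (m - len(pb))
    return 0.5 * sum(abs(x - y) for x, y in zip(pa, pb))
```

A materially nonzero distance between, say, an embedding-based and a judge-based clusterer on the same questions would indicate the clustering step is adding its own bias.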

Figures

Figures reproduced from arXiv: 2605.08432 by Bojian Hou, Jiancong Xiao, Li Shen, Ruochen Jin, Shu Yang, Zhanliang Wang.

Figure 1
Figure 1: (a) Regime diagram on the (m̃q, Kq) plane, partitioned by the JDR boundary m̃q = 2λ̃⋆ (Theorem 5.7, dashed) and the crossover m̃q = √(log Kq) (Theorem 5.1, solid). In JDR (green), Sem2 wins on both raw ECE and oracle distance; in the intermediate band (yellow), Sem2 has smaller raw ECE but is farther from the oracle; in the large-margin region (gray), the two estimators are asymptotically indistinguishab… view at source ↗
Figure 2
Figure 2: Pooled Sem1-ECE (orange) and Sem2-ECE (blue) as functions of the per-question budget n ∈ [10, 50]. The two curves converge to a common limit on every benchmark from opposite sides — the empirical signature of Theorems 5.1 and 5.2 via (9). view at source ↗
Figure 3
Figure 3: Pooled Sem1-ECE and Sem2-ECE stratified by per-question margin ∆q on SimpleQA (left), HLE (middle), PopQA (right). The dashed red line marks the JDR boundary ∆q = 2λ̃⋆/√n and the solid brown line the low/large boundary ∆q = √(log Kq/n), partitioning each panel into the three regions. view at source ↗
Figure 4
Figure 4: Reliability diagrams pooled across models on SimpleQA (left), HLE (middle), PopQA (right). view at source ↗
Figure 5
Figure 5: Margin-stratified ECE curves for OpenAI on SimpleQA, HLE, and PopQA. view at source ↗
Figure 6
Figure 6: Margin-stratified ECE curves for Anthropic on SimpleQA, HLE, and PopQA. view at source ↗
Figure 7
Figure 7: Margin-stratified ECE curves for Gemini on SimpleQA, HLE, and PopQA. view at source ↗
Figure 8
Figure 8: Margin-stratified ECE curves for xAI on SimpleQA, HLE, and PopQA. view at source ↗
Figure 9
Figure 9: Margin-stratified ECE curves for Mistral on SimpleQA, HLE, and PopQA. view at source ↗
Figure 10
Figure 10: Direct ECE gap Sem1-ECE − Sem2-ECE on the low-margin sub-population {q : ∆q < √(log Kq/n)} on a log-log scale, threshold re-evaluated at each n. Fitted slopes −0.58, −0.58, −0.56 are within 0.08 of Theorem 5.6's prediction −0.50. view at source ↗
Figure 11
Figure 11: Reliability diagrams for OpenAI on SimpleQA, HLE, and PopQA. view at source ↗
Figure 12
Figure 12: Reliability diagrams for Anthropic on SimpleQA, HLE, and PopQA. view at source ↗
Figure 13
Figure 13: Reliability diagrams for Gemini on SimpleQA, HLE, and PopQA. view at source ↗
Figure 14
Figure 14: Reliability diagrams for xAI on SimpleQA, HLE, and PopQA. view at source ↗
Figure 15
Figure 15: Reliability diagrams for Mistral on SimpleQA, HLE, and PopQA. view at source ↗
Original abstract

Calibration measures whether a model's predicted confidence aligns with its empirical accuracy, and is central to the reliable deployment of large language models (LLMs) in high-stakes domains such as medicine and law. While much recent work focuses on improving LLM calibration, the equally important question of how to evaluate it in realistic settings remains underdeveloped. Open-ended question answering (QA), the most common deployment setting for modern LLMs, is where existing evaluation methods fall short: logit-based metrics need restricted output formats and internal probabilities; verbalized confidence is self-reported and often overconfident; and sampling-based methods rely on task-specific extraction rules without a clear finite-sample target. We introduce Sem-ECE (Semantic-Sampling Expected Calibration Error), a calibration evaluation framework for open-ended QA that samples answers from the model, groups them into semantic classes, and uses the resulting frequencies as confidence. We study two estimators within this framework: Sem1-ECE, the same-sample self-consistency score, and Sem2-ECE, a held-out variant that separates answer selection from confidence evaluation. We prove both are asymptotically unbiased, and further show that they agree on easy questions but diverge on hard ones with Sem2 achieving strictly smaller calibration error, so their gap also serves as a diagnostic for question difficulty. Experiments on three open-ended QA benchmarks across five leading commercial LLMs match our theoretical predictions and show that Sem-ECE outperforms verbalized confidence and existing sampling-based methods, while complementing logit-based evaluation when internal probabilities are unavailable.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Sem-ECE, a calibration evaluation framework for open-ended QA. It samples answers from an LLM, groups them into semantic classes via clustering, and uses the resulting empirical class frequencies as confidence scores. Two estimators are defined: Sem1-ECE (same-sample self-consistency) and Sem2-ECE (held-out variant). The authors prove both are asymptotically unbiased, show theoretically that the estimators agree on easy questions but diverge on hard ones (with Sem2 yielding lower calibration error), and report experiments on three benchmarks across five LLMs where Sem-ECE outperforms verbalized confidence and prior sampling methods while complementing logit-based metrics.

Significance. If the central claims hold, the work supplies a theoretically grounded, sampling-only method for calibration assessment in the open-ended setting that avoids reliance on internal logits or verbalized self-reports. The proofs of asymptotic unbiasedness for both estimators and the use of their divergence as a question-difficulty diagnostic are notable strengths. The empirical results across commercial LLMs provide concrete evidence that the framework is practical and improves upon existing baselines.

major comments (2)
  1. [Theoretical analysis / proofs of asymptotic unbiasedness] The proofs of asymptotic unbiasedness for Sem1-ECE and Sem2-ECE (referenced in the abstract and presumably detailed in the theoretical section) treat the semantic partition as given and assume that empirical frequencies after clustering converge to the model's true distribution over semantic equivalence classes. No analysis is provided showing that the clustering error (from embeddings, LLM-as-judge, or similar) vanishes with sample size; if clustering mistakes persist on high-diversity answers, the estimators remain biased even in the large-sample limit. This assumption is load-bearing for the unbiasedness claim.
  2. [Experiments] The experimental section reports that results match the theoretical predictions and that Sem-ECE outperforms baselines, but provides no details on the concrete clustering procedure, stability checks across random seeds, or how semantic classes are validated against human judgments. Without these, it is impossible to confirm that the observed calibration improvements are not artifacts of the particular clustering implementation.
minor comments (2)
  1. [Abstract] The abstract states that experiments use 'three open-ended QA benchmarks' but does not name them; adding the dataset names would improve clarity.
  2. [Introduction / framework definition] Notation for the two estimators (Sem1-ECE vs. Sem2-ECE) is introduced without an explicit equation reference in the abstract; a short definitional equation early in the paper would help readers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and insightful comments, which help clarify the scope and limitations of our theoretical and empirical contributions. We address each major comment below and indicate the revisions we will make to the manuscript.

Point-by-point responses
  1. Referee: [Theoretical analysis / proofs of asymptotic unbiasedness] The proofs of asymptotic unbiasedness for Sem1-ECE and Sem2-ECE (referenced in the abstract and presumably detailed in the theoretical section) treat the semantic partition as given and assume that empirical frequencies after clustering converge to the model's true distribution over semantic equivalence classes. No analysis is provided showing that the clustering error (from embeddings, LLM-as-judge, or similar) vanishes with sample size; if clustering mistakes persist on high-diversity answers, the estimators remain biased even in the large-sample limit. This assumption is load-bearing for the unbiasedness claim.

    Authors: We appreciate the referee for identifying this key assumption in our theoretical analysis. The proofs of asymptotic unbiasedness for both Sem1-ECE and Sem2-ECE are derived conditional on a fixed semantic partition, under which the empirical class frequencies converge to the model's true distribution over those classes. We do not provide a formal analysis demonstrating that clustering error (whether from embeddings or an LLM judge) necessarily vanishes with increasing sample size, particularly for high-diversity answers. This is a valid point, and the unbiasedness claim holds only under the additional assumption that the clustering procedure is consistent. In the revised manuscript, we will explicitly state this conditioning assumption in the theoretical section, add a discussion of the conditions required for clustering consistency, and note the potential for persistent bias in high-diversity settings as a limitation of the framework. revision: partial

  2. Referee: [Experiments] The experimental section reports that results match the theoretical predictions and that Sem-ECE outperforms baselines, but provides no details on the concrete clustering procedure, stability checks across random seeds, or how semantic classes are validated against human judgments. Without these, it is impossible to confirm that the observed calibration improvements are not artifacts of the particular clustering implementation.

    Authors: We agree that the current experimental section lacks sufficient detail on the clustering implementation, which is necessary to ensure reproducibility and to substantiate that the reported improvements are not implementation-specific. In the revised manuscript, we will expand the experimental section to include a complete description of the clustering procedure (embedding model, algorithm, and hyperparameters), quantitative stability results across multiple random seeds for both sampling and clustering, and a human validation study on a representative subset of questions comparing the derived semantic classes to independent human annotations. These additions will directly address the concern and strengthen the empirical claims. revision: yes

Circularity Check

0 steps flagged

No significant circularity; asymptotic unbiasedness follows from standard LLN on fixed partitions

full rationale

The paper defines Sem-ECE by sampling answers, applying an external semantic clustering step to form classes, and setting confidence to the resulting empirical frequencies. It then claims to prove that both Sem1-ECE (same-sample) and Sem2-ECE (held-out) are asymptotically unbiased. This unbiasedness is the direct consequence of the law of large numbers applied to empirical class frequencies converging to the population probabilities under whatever fixed partition the clustering produces; it does not reduce to the framework definition by construction, nor does it rely on self-citation, fitted parameters renamed as predictions, or an ansatz smuggled from prior work. The clustering procedure itself is treated as an input whose correctness is a modeling assumption, not derived inside the proof. No load-bearing step equates the claimed result to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the existence of a well-defined semantic equivalence relation over answers and on the assumption that finite samples from the model approximate the true distribution sufficiently for the frequency-based confidence to be meaningful. No explicit free parameters or new physical entities are introduced.

axioms (2)
  • domain assumption Semantic classes over generated answers can be reliably identified and grouped without task-specific extraction rules.
    Invoked when the framework replaces logit or verbalized confidence with class frequencies.
  • standard math The sampling process produces answers whose empirical distribution converges to the model's predictive distribution.
    Required for the asymptotic unbiasedness proof of both estimators.

pith-pipeline@v0.9.0 · 5589 in / 1417 out tokens · 45236 ms · 2026-05-12T01:05:49.631429+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

25 extracted references · 25 canonical work pages · 1 internal anchor

  1. [1]

On calibration of modern neural networks

    Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1321–1330. PMLR, 2017

  2. [2]

    Language models (mostly) know what they know, 2022

    Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, Scott Johnston, Sheer El-Showk, Andy Jones, Nelson Elhage, Tristan Hume, Anna Chen, Yuntao Bai, Sam Bowman, Stanislav Fort, Deep Ganguli, Danny Hernandez, Josh Jacobson, Jackson Kernion, Shauna Kravec,...

  3. [3]

    Katherine Tian, Eric Mitchell, Allan Zhou, Archit Sharma, Rafael Rafailov, Huaxiu Yao, Chelsea Finn, and Christopher D. Manning. Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages...

  4. [4]

    John C. Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In Advances in Large Margin Classifiers, pages 61–74. MIT Press, 1999

  5. [5]

    Beyond temperature scaling: Obtaining well-calibrated multiclass probabilities with dirichlet calibration

    Meelis Kull, Miquel Perello-Nieto, Markus Kängsepp, Telmo Silva Filho, Hao Song, and Peter Flach. Beyond temperature scaling: Obtaining well-calibrated multiclass probabilities with dirichlet calibration. In Advances in Neural Information Processing Systems, volume 32, 2019

  6. [6]

    Verified uncertainty calibration

    Ananya Kumar, Percy Liang, and Tengyu Ma. Verified uncertainty calibration. In Advances in Neural Information Processing Systems, volume 32, 2019

  7. [7]

Restoring calibration for aligned large language models: A calibration-aware fine-tuning approach

    Jiancong Xiao, Bojian Hou, Zhanliang Wang, Ruochen Jin, Qi Long, Weijie J. Su, and Li Shen. Restoring calibration for aligned large language models: A calibration-aware fine-tuning approach. In Proceedings of the 42nd International Conference on Machine Learning, volume 267 of Proceedings of Machine Learning Research, pages 68364–68390. PMLR, 2025. URL http...

  8. [8]

    Glenn W. Brier. Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78(1):1–3, 1950

  9. [9]

Obtaining well calibrated probabilities using bayesian binning

    Mahdi Pakdaman Naeini, Gregory F. Cooper, and Milos Hauskrecht. Obtaining well calibrated probabilities using bayesian binning. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, pages 2901–2907, 2015

  10. [10]

Teaching models to express their uncertainty in words

    Stephanie C. Lin, Jacob Hilton, and Owain Evans. Teaching models to express their uncertainty in words. Transactions on Machine Learning Research, 2022. URL https://openreview.net/forum?id=8s8K2UZGTZ

  11. [11]

Reducing conversational agents' overconfidence through linguistic calibration

    Sabrina J. Mielke, Arthur Szlam, Emily Dinan, and Y-Lan Boureau. Reducing conversational agents' overconfidence through linguistic calibration. Transactions of the Association for Computational Linguistics, 10:857–872, 2022. doi: 10.1162/tacl_a_00494

  12. [12]

    Measuring short-form factuality in large language models, 2024

    Jason Wei, Karina Nguyen, Hyung Won Chung, Yunxin Joy Jiao, Spencer Papay, Amelia Glaese, John Schulman, and William Fedus. Measuring short-form factuality in large language models, 2024

  13. [13]

Self-consistency improves chain of thought reasoning in language models

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V. Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. In International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=1PL1NIMMrw

  14. [14]

Calibrating large language models with sample consistency

    Qing Lyu, Kumar Shridhar, Chaitanya Malaviya, Li Zhang, Yanai Elazar, Niket Tandon, Marianna Apidianaki, Mrinmaya Sachan, and Chris Callison-Burch. Calibrating large language models with sample consistency. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 19260–19268, 2025. doi: 10.1609/aaai.v39i18.34120

  15. [15]

    Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation

    Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar. Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation. In International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=VD-AYtP0dve

  16. [16]

Detecting hallucinations in large language models using semantic entropy

    Sebastian Farquhar, Jannik Kossen, Lorenz Kuhn, and Yarin Gal. Detecting hallucinations in large language models using semantic entropy. Nature, 630:625–630, 2024. doi: 10.1038/s41586-024-07421-0

  17. [17]

Common Value Auctions and the Winner's Curse

    John H. Kagel and Dan Levin. Common Value Auctions and the Winner's Curse. Princeton University Press, Princeton, NJ, 2002

  18. [18]

    Humanity's Last Exam

    Long Phan, Alice Gatti, Ziwen Han, Nathaniel Li, Josephina Hu, Hugh Zhang, Sean Shi, et al. Humanity's last exam, 2025. URL https://arxiv.org/abs/2501.14249

  19. [19]

    When not to trust language models: Investigating effectiveness of parametric and non-parametric memories

    Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Daniel Khashabi, and Hannaneh Hajishirzi. When not to trust language models: Investigating effectiveness of parametric and non-parametric memories. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9802–9822, Toronto, Canada, 2023....

  20. [20]

    GPT-5 models

    OpenAI. GPT-5 models. https://platform.openai.com/docs/models, 2025. Accessed: 2026-01

  21. [21]

    Claude models

    Anthropic. Claude models. https://docs.claude.com/en/docs/about-claude/models, 2025. Accessed: 2026-01

  22. [22]

Gemini API models

    Google DeepMind. Gemini API models. https://ai.google.dev/gemini-api/docs/models, 2025. Accessed: 2026-01

  23. [23]

Grok models

    xAI. Grok models. https://docs.x.ai/docs/models, 2025. Accessed: 2026-01

  24. [24]

    Mistral Large

    Mistral AI. Mistral Large. https://docs.mistral.ai/getting-started/models/models_overview/, 2024. Accessed: 2026-01

  25. [25]

    Can LLMs express their uncertainty? An empirical evaluation of confidence elicitation in LLMs

    Miao Xiong, Zhiyuan Hu, Xinyang Lu, Yifei Li, Jie Fu, Junxian He, and Bryan Hooi. Can LLMs express their uncertainty? An empirical evaluation of confidence elicitation in LLMs. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=gjeQKFxFpZ