pith. machine review for the scientific record. sign in

arxiv: 2604.19444 · v1 · submitted 2026-04-21 · 💻 cs.LG

Recognition: unknown

Unsupervised Confidence Calibration for Reasoning LLMs from a Single Generation

Authors on Pith no claims yet

Pith reviewed 2026-05-10 03:10 UTC · model grok-4.3

classification 💻 cs.LG
keywords unsupervised calibrationconfidence estimationreasoning LLMsself-consistencysingle generationdistribution shiftselective prediction
0
0 comments X

The pith

Reasoning LLMs can be calibrated for confidence from a single generation by distilling self-consistency signals learned offline from unlabeled data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a method to produce reliable confidence estimates for reasoning language models without requiring labeled data or multiple samples during actual use. It generates multiple answers offline on unlabeled examples to build a self-consistency score that acts as a stand-in for correctness, then trains a small separate model to predict that score from any one answer. This matters for practical deployment because models that know when they are likely wrong can be used more safely in selective answering or combined with other systems. Evaluations across math and question-answering tasks with nine different models show the approach beats standard baselines and continues to work when the test inputs differ from the offline data. The gains also translate into better results on tasks that depend on knowing how much to trust each output.

Core claim

We introduce a method for unsupervised confidence calibration of reasoning LLMs when only a single generation is available at inference time. Our approach uses offline sampling on unlabeled data to derive a self-consistency-based proxy target, then distills this signal into a lightweight deployment-time confidence predictor. In a broad evaluation across 5 math and question-answering tasks using 9 reasoning models, our method substantially outperforms baselines, including under distribution shift, and improves downstream performance in selective prediction and simulated downstream decision-making.

What carries the argument

A lightweight confidence predictor trained on self-consistency scores computed from multiple offline generations on unlabeled data.

If this is right

  • The calibrated confidence scores enable higher accuracy in selective prediction by rejecting low-confidence answers.
  • Downstream simulated decision tasks that rely on uncertainty estimates show measurable performance gains.
  • The improvements hold when the distribution of inputs at test time differs from the unlabeled data used for training.
  • The method outperforms both supervised calibration approaches and other unsupervised baselines across multiple models and tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The technique could support calibration in settings where labels are unavailable due to cost or privacy constraints.
  • Self-consistency signals might be combined with other unsupervised cues such as output entropy to strengthen the proxy target.
  • The same offline-to-single-generation distillation pattern could apply to non-reasoning generative tasks like code or dialogue.
  • Wider adoption might reduce reliance on repeated sampling at inference time, lowering compute costs in production systems.

Load-bearing premise

The self-consistency signal derived from offline multi-sample generations on unlabeled data serves as a valid proxy target for true calibration that transfers to single-generation inference.

What would settle it

On a held-out set with ground-truth labels, the distilled confidence scores show zero or negative correlation with actual answer correctness, or selective prediction using these scores fails to raise accuracy at fixed coverage levels compared with uncalibrated baselines.

Figures

Figures reproduced from arXiv: 2604.19444 by Jimmy Wang, Richard Zemel, Thomas Zollo.

Figure 1
Figure 1. Figure 1: We study calibration for reasoning LLMs in a practical deployment regime where labeled [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Average (top) and worst-case (bottom) calibration results for our method and baselines [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Reliability diagrams are used to visualize calibration. For each method, the blue line shows [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Examining the relationship between accuracy and calibration error for various methods. A [PITH_FULL_IMAGE:figures/full_fig_p010_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: On the left, results for selective prediction using our method and baselines. The X [PITH_FULL_IMAGE:figures/full_fig_p012_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Selective prediction results by model family. Results include an ablation of our method [PITH_FULL_IMAGE:figures/full_fig_p013_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Linguistic calibration results when a stronger decision-maker LLM (GPT-4o-mini) is given [PITH_FULL_IMAGE:figures/full_fig_p014_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Reliability diagrams for Qwen3-1.7B. 32 [PITH_FULL_IMAGE:figures/full_fig_p032_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Reliability diagrams for Qwen3-14B. 33 [PITH_FULL_IMAGE:figures/full_fig_p033_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Reliability diagrams for OpenReasoning-Nemotron-7B. [PITH_FULL_IMAGE:figures/full_fig_p034_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Temperature ablations for Qwen3-1.7B [PITH_FULL_IMAGE:figures/full_fig_p036_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Temperature ablations for Qwen3-4B-Thinking-2507. [PITH_FULL_IMAGE:figures/full_fig_p036_12.png] view at source ↗
read the original abstract

Reasoning language models can solve increasingly complex tasks, but struggle to produce the calibrated confidence estimates necessary for reliable deployment. Existing calibration methods usually depend on labels or repeated sampling at inference time, making them impractical in many settings. We introduce a method for unsupervised confidence calibration of reasoning LLMs when only a single generation is available at inference time. Our approach uses offline sampling on unlabeled data to derive a self-consistency-based proxy target, then distills this signal into a lightweight deployment-time confidence predictor. In a broad evaluation across 5 math and question-answering tasks using 9 reasoning models, our method substantially outperforms baselines, including under distribution shift, and improves downstream performance in selective prediction and simulated downstream decision-making.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper introduces an unsupervised method for confidence calibration in reasoning LLMs that requires only a single generation at inference time. Offline multi-sample generations on unlabeled data are used to compute self-consistency scores as a proxy target for correctness; these scores are then distilled into a lightweight predictor that produces calibrated confidence estimates from a single forward pass. The approach is evaluated across 5 math and QA tasks and 9 models, with claims of substantial outperformance versus baselines (including under distribution shift) and downstream gains in selective prediction and simulated decision-making tasks.

Significance. If the central claims hold, the work would be significant for practical deployment of reasoning LLMs in label-free or single-sample settings, where existing calibration techniques are often infeasible. The distillation step that transfers the offline self-consistency signal to single-generation inference addresses a clear usability gap, and the reported robustness under distribution shift plus downstream task improvements would strengthen the case for unsupervised calibration in real-world applications.

major comments (3)
  1. [§3] §3 (Method): The self-consistency proxy derived from offline multi-sample generations is treated as a reliable surrogate for P(correct), but the manuscript does not report direct evidence (e.g., correlation with ground-truth accuracy or ECE on held-out labeled data) that this proxy tracks true correctness when models produce consistently incorrect answers across samples. This is load-bearing for the transfer claim to single-generation inference.
  2. [§4] §4 (Experiments): Standard calibration diagnostics such as Expected Calibration Error (ECE), Brier score, or reliability diagrams against ground-truth labels are not presented; only downstream selective-prediction and decision-making metrics are emphasized. Without these, it is unclear whether the distilled scores are calibrated or merely correlate with the proxy in ways that improve task-specific metrics.
  3. [§4.3] §4.3 (Distribution shift): The outperformance under distribution shift is claimed, but the evaluation does not include an ablation isolating whether the proxy remains valid when the offline unlabeled data and test distribution differ in ways that increase consistent errors (e.g., shared hallucinations).
minor comments (3)
  1. [Abstract] The abstract states outperformance but provides no quantitative metrics, baseline names, or effect sizes; these should be added for clarity.
  2. [§3] Notation for the distilled predictor and self-consistency computation should be unified across §3 and the appendix to avoid ambiguity in the distillation loss.
  3. [Figures] Figure 2 (or equivalent reliability plot) would benefit from error bars or multiple random seeds to show stability of the reported gains.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which highlight important aspects of validating the self-consistency proxy and calibration metrics. We address each major comment point-by-point below and outline targeted revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: [§3] §3 (Method): The self-consistency proxy derived from offline multi-sample generations is treated as a reliable surrogate for P(correct), but the manuscript does not report direct evidence (e.g., correlation with ground-truth accuracy or ECE on held-out labeled data) that this proxy tracks true correctness when models produce consistently incorrect answers across samples. This is load-bearing for the transfer claim to single-generation inference.

    Authors: We agree that explicit validation of the proxy's alignment with ground-truth correctness strengthens the transfer argument. Self-consistency is an established proxy for reasoning correctness (Wang et al., 2023), and our evaluations use labeled data for benchmarking. We will add a new analysis (e.g., scatter plots and correlation coefficients) showing the relationship between offline self-consistency scores and ground-truth accuracy on held-out labeled data, including subsets where models exhibit consistent errors. This will directly support the distillation claim. revision: yes

  2. Referee: [§4] §4 (Experiments): Standard calibration diagnostics such as Expected Calibration Error (ECE), Brier score, or reliability diagrams against ground-truth labels are not presented; only downstream selective-prediction and decision-making metrics are emphasized. Without these, it is unclear whether the distilled scores are calibrated or merely correlate with the proxy in ways that improve task-specific metrics.

    Authors: The referee correctly notes that standard calibration metrics provide complementary evidence beyond downstream utility. While our primary emphasis is on label-free practical deployment and selective prediction gains, we will incorporate ECE, Brier scores, and reliability diagrams computed against ground-truth labels for our method versus baselines in the revised experiments section. These additions will clarify the calibration quality of the distilled single-generation predictor. revision: yes

  3. Referee: [§4.3] §4.3 (Distribution shift): The outperformance under distribution shift is claimed, but the evaluation does not include an ablation isolating whether the proxy remains valid when the offline unlabeled data and test distribution differ in ways that increase consistent errors (e.g., shared hallucinations).

    Authors: This is a substantive point about proxy robustness under shifts that amplify consistent errors. Our reported results demonstrate gains under the evaluated distribution shifts, but we did not include a dedicated ablation for extreme cases of shared hallucinations. We will add a limitations discussion and a targeted analysis (using existing data splits where possible) examining proxy validity in such scenarios, along with any observed degradation in the distilled predictor. revision: partial

Circularity Check

0 steps flagged

No significant circularity in the derivation chain.

full rationale

The paper derives a self-consistency proxy from independent offline multi-sample generations on unlabeled data, then trains a separate lightweight predictor to regress that proxy from single-generation features. This distillation does not reduce by construction to its own inputs, nor does it rename a fitted quantity as a prediction; the proxy target is computed externally via sampling and remains distinct from the deployed single-generation model. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work are present in the provided description, and the central claim rests on empirical outperformance against baselines rather than definitional equivalence.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract provides insufficient detail to enumerate specific free parameters or invented entities; the method implicitly relies on the domain assumption that self-consistency correlates with accuracy.

axioms (1)
  • domain assumption Self-consistency across multiple generations serves as a reliable proxy for true answer correctness or calibration target.
    Central to creating the offline proxy target described in the abstract.

pith-pipeline@v0.9.0 · 5413 in / 1097 out tokens · 34063 ms · 2026-05-10T03:10:52.715777+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

200 extracted references · 44 canonical work pages · 6 internal anchors

  1. [1]

    2023 , eprint=

    Uncertainty in Natural Language Generation: From Theory to Applications , author=. 2023 , eprint=

  2. [2]

    Glushkova, Taisiya and Zerva, Chrysoula and Rei, Ricardo and Martins, André F. T. , year=. Uncertainty-Aware Machine Translation Evaluation , url=. doi:10.18653/v1/2021.findings-emnlp.330 , booktitle=

  3. [3]

    2022 , eprint=

    Disentangling Uncertainty in Machine Translation Evaluation , author=. 2022 , eprint=

  4. [4]

    2020 , eprint=

    Unsupervised Quality Estimation for Neural Machine Translation , author=. 2020 , eprint=

  5. [5]

    2023 , eprint=

    Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation , author=. 2023 , eprint=

  6. [6]

    2021 , eprint=

    Uncertainty Estimation in Autoregressive Structured Prediction , author=. 2021 , eprint=

  7. [7]

    2023 , eprint=

    Decomposing Uncertainty for Large Language Models through Input Clarification Ensembling , author=. 2023 , eprint=

  8. [8]

    2023 , eprint=

    DEUP: Direct Epistemic Uncertainty Prediction , author=. 2023 , eprint=

  9. [9]

    2023 , eprint=

    Generating with Confidence: Uncertainty Quantification for Black-box Large Language Models , author=. 2023 , eprint=

  10. [10]

    2023 , eprint=

    Prompting GPT-3 To Be Reliable , author=. 2023 , eprint=

  11. [11]

    2023 , eprint=

    Towards Reliable Misinformation Mitigation: Generalization, Uncertainty, and GPT-4 , author=. 2023 , eprint=

  12. [12]

    Hüllermeier, W

    Hüllermeier, Eyke and Waegeman, Willem , year=. Aleatoric and epistemic uncertainty in machine learning: an introduction to concepts and methods , volume=. Machine Learning , publisher=. doi:10.1007/s10994-021-05946-3 , number=

  13. [13]

    2023 , eprint=

    Cold-Start Data Selection for Few-shot Language Model Fine-tuning: A Prompt-Based Uncertainty Propagation Approach , author=. 2023 , eprint=

  14. [14]

    2023 , eprint=

    Fine-Tuning Language Models via Epistemic Neural Networks , author=. 2023 , eprint=

  15. [15]

    2023 , eprint=

    Do Large Language Models Know What They Don't Know? , author=. 2023 , eprint=

  16. [16]

    2023 , eprint=

    Quantifying Uncertainty in Answers from any Language Model and Enhancing their Trustworthiness , author=. 2023 , eprint=

  17. [17]

    2023 , eprint=

    Do Language Models Know When They're Hallucinating References? , author=. 2023 , eprint=

  18. [18]

    2022 , eprint=

    Language Models (Mostly) Know What They Know , author=. 2022 , eprint=

  19. [19]

    2022 , eprint=

    Teaching Models to Express Their Uncertainty in Words , author=. 2022 , eprint=

  20. [20]

    2023 , eprint=

    Conformal Autoregressive Generation: Beam Search with Coverage Guarantees , author=. 2023 , eprint=

  21. [21]

    2022 , eprint=

    Confident Adaptive Language Modeling , author=. 2022 , eprint=

  22. [22]

    2023 , eprint=

    Conformal Language Modeling , author=. 2023 , eprint=

  23. [23]

    2023 , eprint=

    Prompt Risk Control: A Rigorous Framework for Responsible Deployment of Large Language Models , author=. 2023 , eprint=

  24. [24]

    A Gentle Introduction to Conformal Prediction and Distribution-Free Uncertainty Quantification

    A gentle introduction to conformal prediction and distribution-free uncertainty quantification , author=. arXiv:2107.07511 , year=

  25. [25]

    2022 , eprint=

    Quantile Risk Control: A Flexible Framework for Bounding the Probability of High-Loss Predictions , author=. 2022 , eprint=

  26. [26]

    2023 , eprint=

    Distribution-Free Statistical Dispersion Control for Societal Applications , author=. 2023 , eprint=

  27. [27]

    2023 , eprint=

    Conformal Prediction with Large Language Models for Multi-Choice Question Answering , author=. 2023 , eprint=

  28. [28]

    2023 , eprint=

    Conformal Risk Control , author=. 2023 , eprint=

  29. [29]

    Journal of Machine Learning Research , author =

    A tutorial on conformal prediction , volume =. Journal of Machine Learning Research , author =. 2008 , pages =

  30. [30]

    Defensive Forecasting for Linear Protocols , booktitle =

    Vovk, Vladimir and Takemura, Akimichi and Shafer, Glenn , year =. Defensive Forecasting for Linear Protocols , booktitle =

  31. [31]

    and Bates, Stephen and Cand

    Angelopoulos, Anastasios N. and Bates, Stephen and Cand. Learn Then. arXiv:2110.01052 , year=

  32. [32]

    2017 , eprint=

    Selective Classification for Deep Neural Networks , author=. 2017 , eprint=

  33. [33]

    2019 , eprint=

    SelectiveNet: A Deep Neural Network with an Integrated Reject Option , author=. 2019 , eprint=

  34. [34]

    Calibrated Selective Classification , publisher =

    Fisch, Adam and Jaakkola, Tommi and Barzilay, Regina , keywords =. Calibrated Selective Classification , publisher =. 2022 , copyright =. doi:10.48550/ARXIV.2208.12084 , url =

  35. [35]

    Journal of Machine Learning Research , year =

    Ran El-Yaniv and Yair Wiener , title =. Journal of Machine Learning Research , year =

  36. [36]

    2020 , eprint=

    Selective Question Answering under Domain Shift , author=. 2020 , eprint=

  37. [37]

    2024 , eprint=

    Semantic Entropy Probes: Robust and Cheap Hallucination Detection in LLMs , author=. 2024 , eprint=

  38. [38]

    Proceedings of the 40th annual meeting of the Association for Computational Linguistics , pages=

    Bleu: a method for automatic evaluation of machine translation , author=. Proceedings of the 40th annual meeting of the Association for Computational Linguistics , pages=

  39. [39]

    Transactions of the Association for Computational Linguistics , volume =

    Jiang, Zhengbao and Araki, Jun and Ding, Haibo and Neubig, Graham. How Can We Know When Language Models Know? On the Calibration of Language Models for Question Answering. Transactions of the Association for Computational Linguistics. 2021. doi:10.1162/tacl_a_00407

  40. [40]

    2023 , eprint=

    We're Afraid Language Models Aren't Modeling Ambiguity , author=. 2023 , eprint=

  41. [41]

    2023 , eprint=

    Selectively Answering Ambiguous Questions , author=. 2023 , eprint=

  42. [42]

    2023 , eprint=

    Tree of Clarifications: Answering Ambiguous Questions with Retrieval-Augmented Large Language Models , author=. 2023 , eprint=

  43. [43]

    2022 , eprint=

    Task Ambiguity in Humans and Language Models , author=. 2022 , eprint=

  44. [44]

    2023 , eprint=

    CLAM: Selective Clarification for Ambiguous Questions with Generative Language Models , author=. 2023 , eprint=

  45. [45]

    2023 , eprint=

    Why Did the Chicken Cross the Road? Rephrasing and Analyzing Ambiguous Questions in VQA , author=. 2023 , eprint=

  46. [46]

    2022 , eprint=

    Uncertainty Quantification with Pre-trained Language Models: A Large-Scale Empirical Analysis , author=. 2022 , eprint=

  47. [47]

    The Eleventh International Conference on Learning Representations , year=

    On Compositional Uncertainty Quantification for Seq2seq Graph Parsing , author=. The Eleventh International Conference on Learning Representations , year=

  48. [48]

    2023 , eprint=

    Neural-Symbolic Inference for Robust Autoregressive Graph Parsing via Compositional Uncertainty Quantification , author=. 2023 , eprint=

  49. [49]

    Proceedings of the 35th International Conference on Machine Learning , pages =

    Analyzing Uncertainty in Neural Machine Translation , author =. Proceedings of the 35th International Conference on Machine Learning , pages =. 2018 , editor =

  50. [50]

    2020 , eprint=

    Calibration of Pre-trained Transformers , author=. 2020 , eprint=

  51. [51]

    2023 , eprint=

    Just Ask for Calibration: Strategies for Eliciting Calibrated Confidence Scores from Language Models Fine-Tuned with Human Feedback , author=. 2023 , eprint=

  52. [52]

    2023 , eprint=

    Calibrated Interpretation: Confidence Estimation in Semantic Parsing , author=. 2023 , eprint=

  53. [53]

    2022 , eprint=

    Calibrating Sequence likelihood Improves Conditional Language Generation , author=. 2022 , eprint=

  54. [54]

    2022 , eprint=

    Reducing conversational agents' overconfidence through linguistic calibration , author=. 2022 , eprint=

  55. [55]

    Towards Collaborative Neural-Symbolic Graph Semantic Parsing via Uncertainty

    Lin, Zi and Liu, Jeremiah Zhe and Shang, Jingbo. Towards Collaborative Neural-Symbolic Graph Semantic Parsing via Uncertainty. Findings of the Association for Computational Linguistics: ACL 2022. 2022. doi:10.18653/v1/2022.findings-acl.328

  56. [56]

    2023 , eprint=

    Navigating the Grey Area: How Expressions of Uncertainty and Overconfidence Affect Language Models , author=. 2023 , eprint=

  57. [57]

    2021 , eprint=

    Calibrate Before Use: Improving Few-Shot Performance of Language Models , author=. 2021 , eprint=

  58. [58]

    2023 , eprint=

    The Confidence-Competence Gap in Large Language Models: A Cognitive Study , author=. 2023 , eprint=

  59. [59]

    2017 , eprint=

    On Calibration of Modern Neural Networks , author=. 2017 , eprint=

  60. [60]

    2021 , eprint=

    On Hallucination and Predictive Uncertainty in Conditional Language Generation , author=. 2021 , eprint=

  61. [61]

    2023 , eprint=

    Fine-tuning Language Models for Factuality , author=. 2023 , eprint=

  62. [62]

    Revisiting the

    Minderer, Matthias and Djolonga, Josip and Romijnders, Rob and Hubis, Frances and Zhai, Xiaohua and Houlsby, Neil and Tran, Dustin and Lucic, Mario , keywords =. Revisiting the Calibration of Modern Neural Networks , publisher =. 2021 , copyright =. doi:10.48550/ARXIV.2106.07998 , url =

  63. [63]

    Rethinking Calibration of Deep Neural Networks: Do Not Be Afraid of Overconfidence , url =

    Wang, Deng-Bao and Feng, Lei and Zhang, Min-Ling , booktitle =. Rethinking Calibration of Deep Neural Networks: Do Not Be Afraid of Overconfidence , url =

  64. [64]

    , biburl =

    Platt, J. , biburl =. Advances in Large Margin Classifiers , keywords =

  65. [65]

    Obtaining Calibrated Probability Estimates from Decision Trees and Naive Bayesian Classifiers , volume =

    Zadrozny, Bianca and Elkan, Charles , year =. Obtaining Calibrated Probability Estimates from Decision Trees and Naive Bayesian Classifiers , volume =

  66. [66]

    Journal of the American Statistical Association , volume =

    Tilmann Gneiting and Adrian E Raftery , title =. Journal of the American Statistical Association , volume =. 2007 , publisher =

  67. [68]

    2025 , eprint=

    NVIDIA Nemotron 3: Efficient and Open Intelligence , author=. 2025 , eprint=

  68. [69]

    2025 , eprint=

    AIMO-2 Winning Solution: Building State-of-the-Art Mathematical Reasoning Models with OpenMathReasoning dataset , author=. 2025 , eprint=

  69. [70]

    2025 , eprint=

    Qwen3 Technical Report , author=. 2025 , eprint=

  70. [71]

    Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems , pages=

    To rely or not to rely? evaluating interventions for appropriate reliance on large language models , author=. Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems , pages=

  71. [73]

    2021 , eprint=

    Training Verifiers to Solve Math Word Problems , author=. 2021 , eprint=

  72. [74]

    2024 , eprint=

    Calibrating Large Language Models Using Their Generations Only , author=. 2024 , eprint=

  73. [75]

    Proceedings of the AAAI Conference on Artificial Intelligence , year=

    Calibrating large language models with sample consistency , author=. Proceedings of the AAAI Conference on Artificial Intelligence , year=

  74. [76]

    2023 , eprint=

    Self-Consistency Improves Chain of Thought Reasoning in Language Models , author=. 2023 , eprint=

  75. [77]

    Obtaining Well Calibrated Probabilities Using Bayesian Binning , booktitle =

    Naeini, Mahdi Pakdaman and Cooper, Gregory and Hauskrecht, Milos , year =. Obtaining Well Calibrated Probabilities Using Bayesian Binning , booktitle =

  76. [78]

    2020 , eprint=

    Verified Uncertainty Calibration , author=. 2020 , eprint=

  77. [79]

    1950 , journal =

    Verification of Forecasts Expressed in Terms of Probability , author =. 1950 , journal =

  78. [80]

    2023 , eprint=

    Active Prompting with Chain-of-Thought for Large Language Models , author=. 2023 , eprint=

  79. [81]

    2016 , eprint=

    Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning , author=. 2016 , eprint=

  80. [82]

    2017 , eprint=

    Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles , author=. 2017 , eprint=

Showing first 80 references.