Recognition: 2 theorem links · Lean Theorem
Task-Aware Calibration: Provably Optimal Decoding in LLMs
Pith reviewed 2026-05-12 02:52 UTC · model grok-4.3
The pith
Task calibration in latent output spaces makes Minimum Bayes Risk decoding provably optimal for LLM beliefs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Task calibration adjusts the model's predictive probabilities in the task-induced latent space so that Minimum Bayes Risk (MBR) decoding on the resulting distribution is the optimal decoding rule with respect to the model's latent beliefs; the optimality follows by direct application of a known decision-theoretic result.
What carries the argument
Task calibration, which maps free-form LLM outputs to a semantically meaningful latent structure (discrete labels, integers, or sets) and calibrates the predictive distribution there, allowing MBR decoding to achieve optimality on model beliefs.
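The decoding rule the review describes can be sketched concretely. The following is a minimal illustration of MBR decoding over a discrete latent space, assuming the calibrated latent distribution is already given; the labels, losses, and probabilities are illustrative assumptions, not the paper's setup.

```python
def mbr_decode(latent_probs, loss):
    """Return the action minimizing expected loss under the latent beliefs.

    latent_probs: dict mapping latent outcomes to probabilities.
    loss: loss(action, outcome) -> float. Actions are restricted to the
    support of latent_probs for simplicity.
    """
    def risk(action):
        return sum(p * loss(action, z) for z, p in latent_probs.items())
    return min(latent_probs, key=risk)

# With 0-1 loss, MBR reduces to picking the mode of the distribution.
probs = {"yes": 0.5, "no": 0.3, "unsure": 0.2}
print(mbr_decode(probs, lambda a, z: float(a != z)))  # -> yes

# With squared loss on integer latents, MBR can differ from the mode:
# action 2 hedges between the likely answer 1 and the distant outlier 10.
int_probs = {1: 0.5, 2: 0.3, 10: 0.2}
print(mbr_decode(int_probs, lambda a, z: (a - z) ** 2))  # -> 2
```

The second example is the point of task awareness: the optimal decision depends on the task loss, not just on which latent value is most probable.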
If this is right
- Generation quality improves consistently across tasks and existing decoding baselines when MBR is applied to the task-calibrated latent distribution.
- Task Calibration Error provides a direct, application-aware measure of how much loss is attributable to miscalibration.
- Model decisions become more reliable on tasks whose outputs admit a discrete or set-based latent representation without requiring retraining.
- The optimality guarantee holds relative to the model's latent beliefs once calibration in that space is achieved.
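For tasks whose outputs admit a discrete latent representation, the calibration step above needs a latent distribution to work with. A sketch of how one might estimate it, assuming a hand-written parser maps free-form outputs to latents (the parser, samples, and task are hypothetical, not taken from the paper):

```python
import re
from collections import Counter

def latent_marginal(outputs, to_latent):
    """Monte Carlo estimate of the push-forward of the model's output
    distribution through a deterministic output-to-latent mapping."""
    latents = [to_latent(o) for o in outputs]
    return {z: c / len(latents) for z, c in Counter(latents).items()}

# Hypothetical parser for an arithmetic task: the latent is the first
# integer mentioned in the free-form answer.
def parse_int(text):
    match = re.search(r"-?\d+", text)
    return int(match.group()) if match else None

samples = ["The answer is 42.", "42", "I think it's 41.", "42!"]
print(latent_marginal(samples, parse_int))  # -> {42: 0.75, 41: 0.25}
```

Surface variants ("42", "42!", "The answer is 42.") collapse to one latent value, which is what makes calibration well-posed in the latent space even though it is ill-posed over raw strings.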
Where Pith is reading between the lines
- The same latent-calibration step could be tested on tasks whose outputs are naturally continuous or structured, such as code snippets or ranked lists, to check whether the optimality result extends.
- If the latent mapping itself can be learned from data rather than hand-specified, the method might apply to a wider range of open-ended generation problems.
- The framework suggests a general post-processing route for extracting calibrated decisions from any generative model whose outputs admit a task-relevant latent encoding.
Load-bearing premise
Free-form LLM outputs can be reliably interpreted through a semantically meaningful latent structure in which calibration is well-posed and the decision-theoretic optimality result applies without further qualification.
What would settle it
A controlled experiment on a task with an unambiguous latent structure (for example, integer answers to arithmetic questions) in which MBR decoding on the task-calibrated distribution fails to outperform standard decoding or other calibration baselines.
Original abstract
LLM decoding often relies on the model's predictive distribution to generate an output. Consequently, misalignment with respect to the true generating distribution leads to suboptimal decisions in practice. While a natural solution is to calibrate the model's output distribution, for LLMs, this is ill-posed at the combinatorially vast level of free-form language. We address this by building on the insight that in many tasks, these free-form outputs can be interpreted in a semantically meaningful latent structure, for example, discrete class labels, integers, or sets. We introduce task calibration as a paradigm to calibrate the model's predictive distribution in the task-induced latent space. We apply a decision-theoretic result to show that Minimum Bayes Risk (MBR) decoding on the task-calibrated latent distribution is the optimal decoding strategy on latent model beliefs. Empirically, it consistently improves generation quality across different tasks and baselines. We also introduce Task Calibration Error (TCE), an application-aware calibration metric that quantifies the excess loss due to miscalibration. Our work demonstrates that task calibration enables more reliable model decisions across various tasks and applications.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes task calibration as a method to calibrate LLM predictive distributions in a task-induced latent space (e.g., discrete labels or sets) rather than the full token space. It applies a decision-theoretic result to argue that Minimum Bayes Risk (MBR) decoding on the task-calibrated latent distribution is provably optimal with respect to latent model beliefs. The authors introduce Task Calibration Error (TCE) as an application-aware metric for excess loss due to miscalibration and report empirical gains in generation quality across tasks and baselines.
Significance. If the optimality result is established without unaccounted approximation error in the latent mapping, the work provides a principled bridge between decision theory and LLM decoding for structured tasks. This could improve reliability in applications where outputs map to semantically meaningful latents. The TCE metric offers a practical tool for evaluating calibration beyond standard probability metrics. Empirical improvements are noted but their strength depends on controls for the latent mapping and baselines.
major comments (2)
- [theoretical development / optimality proof] The central optimality claim (abstract and theoretical development) applies a standard decision-theoretic fact (MBR optimality under a known distribution) to the task-calibrated latent distribution. However, this requires the latent distribution to be exactly the push-forward measure induced by a deterministic, surjective mapping from token sequences to latents. The manuscript should explicitly define this mapping and the induced marginal in the relevant theoretical section (likely §3 or §4) and confirm it is obtained without sampling approximations or top-k truncation; otherwise an error term must be derived to preserve the 'provably optimal' statement.
- [task calibration definition] Definition and computation of the task-calibrated distribution: if the calibration step itself relies on an estimator (e.g., sampling or auxiliary model) rather than the exact marginal, the optimality guarantee does not transfer directly. The paper must state whether the calibrated distribution is the exact push-forward or an approximation, and quantify any resulting sub-optimality gap.
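The distinction both major comments turn on is between the exact push-forward measure and a sampled estimate of it. For a small, enumerable sequence-level distribution the push-forward can be computed exactly; the toy distribution and mapping below are assumptions for illustration only.

```python
def pushforward(seq_dist, to_latent):
    """Exact latent marginal: sum sequence probabilities per latent value,
    for a deterministic mapping from sequences to latents."""
    latent = {}
    for seq, p in seq_dist.items():
        z = to_latent(seq)
        latent[z] = latent.get(z, 0.0) + p
    return latent

# Surface variants collapse to the same latent label.
seq_dist = {"Yes.": 0.35, "yes": 0.25, "No": 0.40}
normalize = lambda s: s.strip(".!").lower()
print(pushforward(seq_dist, normalize))  # -> {'yes': 0.6, 'no': 0.4}
```

The optimality guarantee applies to this exact marginal; once `seq_dist` is only accessible through samples or a truncated beam, an estimation-error term enters, which is exactly what the referee asks the authors to quantify.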
minor comments (2)
- [notation / preliminaries] Clarify notation for the latent space and the mapping function; ensure consistent use of symbols for the token-level measure versus the induced latent measure.
- [experiments] In the experimental section, provide more detail on how the latent structures are extracted from free-form outputs for each task (e.g., parsing rules for sets or integers) to allow reproducibility.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive comments, which highlight important aspects of the theoretical development that we will clarify in the revision. We address each major comment below.
Point-by-point responses
- Referee: The central optimality claim (abstract and theoretical development) applies a standard decision-theoretic fact (MBR optimality under a known distribution) to the task-calibrated latent distribution. However, this requires the latent distribution to be exactly the push-forward measure induced by a deterministic, surjective mapping from token sequences to latents. The manuscript should explicitly define this mapping and the induced marginal in the relevant theoretical section (likely §3 or §4) and confirm it is obtained without sampling approximations or top-k truncation; otherwise an error term must be derived to preserve the 'provably optimal' statement.
Authors: We agree that the optimality result is stated with respect to the exact push-forward measure. In §3 we define the latent mapping as a deterministic, surjective function from token sequences to the task-induced latent space (e.g., label extraction for classification or set parsing for structured outputs). The MBR optimality then holds exactly for the induced marginal over latents, which encodes the model's calibrated beliefs. We will add an explicit statement of this definition and the induced measure in the revised theoretical section. For the empirical results we will note that Monte Carlo sampling is used to estimate the marginal and will include a brief error analysis (via concentration bounds) in the appendix to quantify the gap from the exact case, thereby preserving the 'provably optimal' claim for the idealized distribution. revision: yes
- Referee: Definition and computation of the task-calibrated distribution: if the calibration step itself relies on an estimator (e.g., sampling or auxiliary model) rather than the exact marginal, the optimality guarantee does not transfer directly. The paper must state whether the calibrated distribution is the exact push-forward or an approximation, and quantify any resulting sub-optimality gap.
Authors: Task calibration is defined as the exact marginal obtained by pushing the model's token-level predictive distribution forward through the deterministic latent mapping. The optimality guarantee therefore applies directly to this exact marginal. In practice the marginal is estimated by sampling; we will revise the definition paragraph to distinguish the exact quantity from its estimator and will derive a simple bound on the excess risk incurred by finite-sample estimation (using standard concentration inequalities). This makes the sub-optimality gap explicit while keeping the core claim intact for the exact calibrated distribution. revision: yes
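One way the concentration-bound analysis promised in the rebuttal could look, assuming i.i.d. samples and a bounded loss; the inequality used is the classic L1 concentration bound for empirical distributions, and all constants and sample sizes below are illustrative, not taken from the paper.

```python
import math

def excess_risk_bound(n, k, l_max, delta=0.05):
    """High-probability bound on the extra risk from running MBR on an
    n-sample estimate of a k-outcome latent distribution.

    Uses P(||p_hat - p||_1 >= eps) <= 2**k * exp(-n * eps**2 / 2), plus the
    fact that expected losses under p and p_hat differ by at most
    l_max * ||p - p_hat||_1, so the MBR action chosen under p_hat is at
    most 2 * l_max * eps suboptimal under p, with probability 1 - delta.
    """
    eps = math.sqrt(2.0 * (k * math.log(2.0) + math.log(1.0 / delta)) / n)
    return 2.0 * l_max * eps

# Illustrative numbers: 3 latent classes, 0-1 loss, 10,000 samples.
print(round(excess_risk_bound(n=10_000, k=3, l_max=1.0), 3))  # -> 0.064
```

The bound shrinks as O(1/sqrt(n)), which is why the authors can plausibly claim the gap is negligible for the idealized distribution while remaining explicit about it.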
Circularity Check
No circularity: optimality imported from external decision theory
Full rationale
The paper defines task calibration on a latent structure induced by free-form outputs and then invokes a standard decision-theoretic result (MBR optimality under a known distribution and loss) to conclude that MBR on the calibrated latent distribution is optimal. This step does not reduce by construction to any fitted parameter, self-defined quantity, or prior self-citation within the paper; the optimality statement is an application of an independent external theorem to the newly defined object. No equations or claims in the abstract or description exhibit self-definition, renaming of known results, or load-bearing self-citation chains. The derivation remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- [standard math] A decision-theoretic result establishing optimality of MBR decoding under calibrated beliefs
invented entities (1)
- Task Calibration Error (TCE): no independent evidence
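The single axiom in the ledger is the textbook Bayes-optimality of MBR decoding. Stated in standard notation (this formulation is a paraphrase written for reference here, not copied from the paper): for beliefs $q$ over latent outcomes $z \in \mathcal{Z}$ and task loss $\ell$,

```latex
% MBR picks the action minimizing expected loss under the belief q.
\hat{y}_{\mathrm{MBR}}
  \;=\; \operatorname*{arg\,min}_{a \in \mathcal{A}}\;
        \mathbb{E}_{z \sim q}\!\left[\ell(a, z)\right]
  \;=\; \operatorname*{arg\,min}_{a \in \mathcal{A}}\;
        \sum_{z \in \mathcal{Z}} q(z)\,\ell(a, z).
% By definition of the arg min, no other action achieves lower expected
% loss under q; this is the external optimality result the paper imports.
```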
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation · washburn_uniqueness_aczel (tagged unclear)
  Unclear: the relation between the paper passage and the cited Recognition theorem.
  Paper passage: "We apply a decision-theoretic result to show that Minimum Bayes Risk (MBR) decoding on the task-calibrated latent distribution is the optimal decoding strategy on latent model beliefs."
- IndisputableMonolith/Foundation/RealityFromDistinction · reality_from_one_distinction (tagged unclear)
  Unclear: the relation between the paper passage and the cited Recognition theorem.
  Paper passage: "Definition 3.1. An LLM’s latent push-forward distribution p̂ is (distributionally) task-calibrated … E_{X,Y}[g_T(Y) | p̂(X) = q] = q."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Neil Band, Xuechen Li, Tengyu Ma, and Tatsunori Hashimoto. Linguistic calibration of long-form generations, 2024. URL https://arxiv.org/abs/2404.00474
- [2] Daniel Cer, Mona Diab, Eneko Agirre, Inigo Lopez-Gazpio, and Lucia Specia. SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 1–14, 2017.
- [3] Prateek Chhikara. Mind the confidence gap: Overconfidence, calibration, and distractor effects in large language models. arXiv preprint arXiv:2502.11028, 2025.
- [4] Morris H DeGroot and Stephen E Fienberg. The comparison and evaluation of forecasters. Journal of the Royal Statistical Society: Series D (The Statistician), 32(1-2):12–22, 1983.
- [5] Yi Dong, Zhilin Wang, Makesh Narsimhan Sreedhar, Xianchao Wu, and Oleksii Kuchaiev. SteerLM: Attribute conditioned SFT as an (user-steerable) alternative to RLHF, 2023.
- [6] Cynthia Dwork, Michael P Kim, Omer Reingold, Guy N Rothblum, and Gal Yona. Outcome indistinguishability. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, 2021.
- [7] Bryan Eikema and Wilker Aziz. Sampling-based approximations to minimum Bayes risk decoding for neural machine translation. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang, editors, Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 10978–10993, Abu Dhabi, United Arab Emirates, December 2022. Association for C...
- [8] Angela Fan, Mike Lewis, and Yann Dauphin. Hierarchical neural story generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 889–898, 2018.
- [9] Markus Freitag and Yaser Al-Onaizan. Beam search strategies for neural machine translation. In Proceedings of the First Workshop on Neural Machine Translation, pages 56–60, 2017.
- [10] Isabel O Gallegos, Ryan A Rossi, Joe Barrow, Md Mehrab Tanjim, Sungchul Kim, Franck Dernoncourt, Tong Yu, Ruiyi Zhang, and Nesreen K Ahmed. Bias and fairness in large language models: A survey. Computational Linguistics, 50(3):1097–1179, 2024.
- [11] Gemma Team. Gemma 3 technical report, 2025. URL https://arxiv.org/abs/2503.19786
- [12] Peter Grünwald. Safe probability. Journal of Statistical Planning and Inference, 2018.
- [13] Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, Saizhuo Wang, Kun Zhang, Yuanzhuo Wang, Wen Gao, Lionel Ni, and Jian Guo. A survey on LLM-as-a-judge, 2025. URL https://arxiv.org/abs/2411.15594
- [14] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. In International Conference on Machine Learning, pages 1321–1330. PMLR, 2017.
- [15] Lukas Haas, Gal Yona, Giovanni D’Antonio, Sasha Goldshtein, and Dipanjan Das. SimpleQA Verified: A reliable factuality benchmark to measure parametric knowledge, 2026. URL https://arxiv.org/abs/2509.07968
- [16] Wataru Hashimoto, Hidetaka Kamigaito, and Taro Watanabe. Decoding uncertainty: The impact of decoding strategies for uncertainty estimation in large language models. In Findings of the Association for Computational Linguistics, 2025.
- [17] Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. DeBERTa: Decoding-enhanced BERT with disentangled attention. In International Conference on Learning Representations.
- [18] URL https://openreview.net/forum?id=XPZIaotutsD
- [19] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding, 2021. URL https://arxiv.org/abs/2009.03300
- [20]
- [21] Yukun Huang, Yixin Liu, Raghuveer Thirukovalluru, Arman Cohan, and Bhuwan Dhingra. Calibrating long-form generations from large language models, 2024. URL https://arxiv.org/abs/2402.06544
- [22] Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension, 2017. URL https://arxiv.org/abs/1705.03551
- [23] Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, Scott Johnston, Sheer El-Showk, Andy Jones, Nelson Elhage, Tristan Hume, Anna Chen, Yuntao Bai, Sam Bowman, Stanislav Fort, Deep Ganguli, Danny Hernandez, Josh Jacobson, Jackson Kernion, Shauna Kravec, ... Language models (mostly) know what they know, 2022.
- [24] Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar. Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation, 2023. URL https://arxiv.org/abs/2302.09664
- [25] Meelis Kull and Peter Flach. Novel decompositions of proper scoring rules for classification: Score adjustment as precursor to calibration. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 68–85. Springer, 2015.
- [26] Meelis Kull, Miquel Perello Nieto, Markus Kängsepp, Telmo Silva Filho, Hao Song, and Peter Flach. Beyond temperature scaling: Obtaining well-calibrated multi-class probabilities with Dirichlet calibration. Advances in Neural Information Processing Systems, 32, 2019.
- [27] Chungpa Lee, Thomas Zeng, Jongwon Jeong, Jy yong Sohn, and Kangwook Lee. How to correctly report LLM-as-a-judge evaluations, 2026. URL https://arxiv.org/abs/2511.21140
- [28] Chengzu Li, Han Zhou, Goran Glavaš, Anna Korhonen, and Ivan Vulić. Large language models are miscalibrated in-context learners. In Findings of the Association for Computational Linguistics: ACL 2025, pages 11575–11596, 2025.
- [29] Xiang Lisa Li, Ari Holtzman, Daniel Fried, Percy Liang, Jason Eisner, Tatsunori B Hashimoto, Luke Zettlemoyer, and Mike Lewis. Contrastive decoding: Open-ended text generation as optimization. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023.
- [30] Yibo Li, Miao Xiong, Jiaying Wu, and Bryan Hooi. ConfTuner: Training large language models to express their confidence verbally, 2025. URL https://arxiv.org/abs/2508.18847
- [31] Michal Lukasik, Harikrishna Narasimhan, Aditya Krishna Menon, Felix Yu, and Sanjiv Kumar. Regression-aware inference with LLMs, 2024. URL https://arxiv.org/abs/2403.04182
- [32] Matthias Minderer, Josip Djolonga, Rob Romijnders, Frances Hubis, Xiaohua Zhai, Neil Houlsby, Dustin Tran, and Mario Lucic. Revisiting the calibration of modern neural networks. Advances in Neural Information Processing Systems, 34:15682–15694, 2021.
- [33] Mathias Müller and Rico Sennrich. Understanding the properties of minimum Bayes risk decoding in neural machine translation. In Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli, editors, Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (...
- [34] Preetum Nakkiran, Arwen Bradley, Adam Goliński, Eugene Ndiaye, Michael Kirchhof, and Sinead Williamson. Trained on tokens, calibrated on concepts: The emergence of semantic calibration in LLMs, 2025. URL https://arxiv.org/abs/2511.04869
- [35] Jeremy Nixon, Michael W Dusenberry, Linchuan Zhang, Ghassen Jerfel, and Dustin Tran. Measuring calibration in deep learning. In CVPR Workshops, volume 2, 2019.
- [36] Georgy Noarov and Aaron Roth. Calibration for decision making: A principled approach to trustworthy ML, 2024. URL https://www.let-all.com/blog/2024/03/13/calibration-for-decision-making-a-principled-approach-to-trustworthy-ml/
- [37] Maja Pavlovic. Understanding model calibration: A gentle introduction and visual exploration of calibration and the expected calibration error (ECE). In ICLR Blogposts 2025, 2025. URL https://iclr-blogposts.github.io/2025/blog/calibration/
- [38] John Platt et al. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers, 10(3):61–74, 1999.
- [39] Benjamin Plaut, Nguyen X. Khanh, and Tu Trinh. Probabilities of chat LLMs are miscalibrated but still predict correctness on multiple-choice Q&A, 2025. URL https://arxiv.org/abs/2402.13213
- [40]
- [41] Qwen Team. Qwen3 technical report, 2025. URL https://arxiv.org/abs/2505.09388
- [42] Hayley Ross, Ameya Sunil Mahabaleshwarkar, and Yoshi Suhara. When2Call: When (not) to call tools. In Luis Chiruzzo, Alan Ritter, and Lu Wang, editors, Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 3391–3409, Albuquerque, ...
- [43] Chufan Shi, Haoran Yang, Deng Cai, Zhisong Zhang, Yifan Wang, Yujiu Yang, and Wai Lam. A thorough examination of decoding methods in the era of LLMs. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 8601–8629, 2024.
- [44] Lin Shi, Chiyu Ma, Wenhua Liang, Xingjian Diao, Weicheng Ma, and Soroush Vosoughi. Judging the judges: A systematic study of position bias in LLM-as-a-judge. In Kentaro Inui, Sakriani Sakti, Haofen Wang, Derek F. Wong, Pushpak Bhattacharyya, Biplab Banerjee, Asif Ekbal, Tanmoy Chakraborty, and Dhirendra Pratap Singh, editors, Proceedings of the 14th Inter...
- [45] Terry M. Therneau and Elizabeth Atkinson. Concordance. Vignette of the survival R package, December 2024. URL https://cran.r-project.org/web/packages/survival/vignettes/concordance.pdf. Accessed: 2025-08-29.
- [46] Tim Tomov, Dominik Fuchsgruber, and Stephan Günnemann. Task-awareness improves LLM generations and uncertainty, 2026. URL https://arxiv.org/abs/2601.21500
- [47] Juozas Vaicenavicius, David Widmann, Carl Andersson, Fredrik Lindsten, Jacob Roll, and Thomas Schön. Evaluating model calibration in classification. In International Conference on Artificial Intelligence and Statistics, 2019.
- [48] Zhilin Wang, Yi Dong, Jiaqi Zeng, Virginia Adams, Makesh Narsimhan Sreedhar, Daniel Egert, Olivier Delalleau, Jane Polak Scowcroft, Neel Kant, Aidan Swope, and Oleksii Kuchaiev. HelpSteer: Multi-attribute helpfulness dataset for SteerLM, 2023.
- [49] Sean Wu, Fredrik K. Gustafsson, Edward Phillips, Boyan Gao, Anshul Thakur, and David A. Clifton. BAS: A decision-theoretic approach to evaluating large language model confidence,
- [50] URL https://arxiv.org/abs/2604.03216
- [51] Guoxuan Xia. What (and what not) are calibrated probabilities actually useful for? In ICLR Blogposts 2026, 2026. URL https://iclr-blogposts.github.io/2026/blog/2026/useful-calibrated-uncertainties/
- [52] Johnathan Xie, Annie S Chen, Yoonho Lee, Eric Mitchell, and Chelsea Finn. Calibrating language models with adaptive temperature scaling. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 18128–18138, Miami, Florida, USA, November 2024. Association for...
- [53] Yongjin Yang, Haneul Yoo, and Hwaran Lee. MAQA: Evaluating uncertainty quantification in LLMs regarding data uncertainty, 2025. URL https://arxiv.org/abs/2408.06816
- [54] Caiqi Zhang, Ruihan Yang, Zhisong Zhang, Xinting Huang, Sen Yang, Dong Yu, and Nigel Collier. Atomic calibration of LLMs in long-form generations, 2025. URL https://arxiv.org/abs/2410.13246
- [55] Shengjia Zhao, Michael P. Kim, Roshni Sahoo, Tengyu Ma, and Stefano Ermon. Calibrating predictions to decisions: A novel approach to multi-class calibration, 2021. URL https://arxiv.org/abs/2107.05719
- [56] (excerpt) "The value of the resulting score can be interpreted analogously to the traditional Area-under-the-Precision-Recall-Curve (AUC-ROC) metrics, with 0.5 corresponding to random chance and 1 to perfect ranking ability. In our case, we evaluate the Bayes risk (Equation (2)) like Tomov et al. [45] as a proxy for per-instance uncertainty and measure how well it pr...
discussion (0)