Calibrating LLMs with Semantic-level Reward
Pith reviewed 2026-05-22 09:30 UTC · model grok-4.3
The pith
Semantic-level rewards train LLMs to agree on correct answers and disagree on wrong ones for better calibration.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Combining the standard correctness reward with a semantic calibration reward that encourages exploitation among correct rollouts through semantic agreement and exploration among incorrect rollouts by discouraging spurious consistency produces language models with lower expected calibration error and higher AUROC than verbalized-confidence baselines, with the gains holding on both in-distribution and out-of-distribution question-answering tasks.
What carries the argument
The semantic calibration reward, which measures agreement among multiple model rollouts and applies it differently depending on whether the prediction is correct or incorrect.
If this is right
- Calibration improvements appear on both in-distribution HotpotQA and out-of-distribution TriviaQA, MSMARCO, and NQ-Open.
- The gains in expected calibration error reach up to 40 percent and AUROC up to 31 percent relative to verbalized baselines.
- The method works across three distinct model families without requiring a verbalized confidence output interface.
- Calibration behavior generalizes robustly rather than overfitting to the training distribution.
Where Pith is reading between the lines
- The same semantic reward structure could be tested on generative tasks such as summarization where token-level confidence is even less meaningful.
- Different choices of semantic similarity function for measuring rollout agreement might further strengthen or weaken the reward signal.
- Combining CSR with existing post-training calibration methods could produce additive gains in reliability.
Load-bearing premise
Semantic agreement measured across rollouts reliably separates correct predictions from incorrect ones without introducing new inconsistencies or biases.
What would settle it
Finding that CSR produces higher expected calibration error than verbalized baselines on a fresh dataset or with altered rollout sampling would show the central claim does not hold.
Figures
read the original abstract
As large language models (LLMs) are deployed in consequential settings such as medical question answering and legal reasoning, the ability to estimate when their outputs are likely to be correct is essential for safe and reliable use, requiring well-calibrated uncertainty. Standard reinforcement learning with verifiable rewards (RLVR) trains models with a binary correctness reward that is indifferent to confidence, providing no penalty for confident but wrong predictions and thereby degrading calibration. Recent work addresses this by training models to produce verbalized confidence scores alongside answers and rewarding agreement with correctness. However, verbalized confidence is calibrated at the token level and thus exhibits inconsistency across textual variations with same semantic meaning. We propose \textbf{Calibration with Semantic Reward (CSR)}, a framework that calibrates language models directly in semantic space without a verbalized confidence interface. CSR combines the correctness reward with a novel semantic calibration reward that encourages exploitation among correct rollouts by promoting semantic agreement, and exploration among incorrect ones by discouraging spurious consistency. Experiments across three model families on HotpotQA (in-distribution) and TriviaQA, MSMARCO, and NQ-Open (out-of-distribution) show that CSR consistently achieves lower ECE and higher AUROC than verbalized-confidence baselines across nearly all settings, reducing ECE by up to $40\%$ and improving AUROC by up to $31\%$ over verbalized-confidence baselines, with calibration behavior generalizing robustly across all four evaluation settings.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Calibration with Semantic Reward (CSR), a framework that augments reinforcement learning with verifiable rewards (RLVR) by adding a novel semantic calibration reward. This reward encourages semantic agreement among correct rollouts while discouraging spurious consistency among incorrect ones, aiming to calibrate LLMs directly in semantic space rather than via verbalized confidence scores. Experiments across three model families evaluate the approach on HotpotQA (in-distribution) and TriviaQA, MSMARCO, and NQ-Open (out-of-distribution), reporting consistent reductions in Expected Calibration Error (ECE) of up to 40% and improvements in AUROC of up to 31% relative to verbalized-confidence baselines.
Significance. If the central empirical claims hold after verification of the semantic agreement procedure, the work would offer a meaningful advance in LLM calibration for high-stakes applications by moving beyond token-level verbalized confidence. The reported generalization across in-distribution and out-of-distribution settings and multiple model families is a positive indicator of robustness. The approach also provides a concrete alternative to standard RLVR that explicitly targets calibration rather than correctness alone.
major comments (3)
- [Method] Method section (around the definition of the semantic calibration reward): The procedure for quantifying semantic agreement across multiple rollouts is load-bearing for the central claim yet remains underspecified. It is unclear whether agreement is measured via embeddings, clustering, an auxiliary LLM judge, or another method, and how this metric is guaranteed to separate correct from incorrect predictions without introducing distribution-specific biases or reinforcing spurious consistency on errors. Without the exact algorithm or formula, it is impossible to evaluate whether the reported ECE and AUROC gains are attributable to improved semantic calibration or to artifacts of the agreement metric.
- [Experiments] Experimental results section: The manuscript reports ECE reductions up to 40% and AUROC gains up to 31% across nearly all settings but provides no error bars, standard deviations across random seeds, or statistical significance tests. Given the stochasticity of rollout sampling and RL training, the absence of these details makes it difficult to determine whether the improvements are reliable or could be explained by variance, undermining the strength of the cross-model and cross-dataset claims.
- [Reward Formulation] §4 (or reward formulation subsection): The reward combination weights are listed among the free parameters. The paper should clarify whether these weights are tuned per dataset or held fixed, and how sensitive the ECE/AUROC improvements are to their choice. If tuning is required, this affects the claim that CSR provides a more robust calibration signal than verbalized baselines.
minor comments (2)
- [Abstract] Abstract and introduction: The description of how the semantic calibration reward 'encourages exploitation among correct rollouts' and 'exploration among incorrect ones' could be made more precise by including a short mathematical sketch of the reward term.
- [Figures/Tables] Figure captions and tables: Ensure all plots of ECE and AUROC include the number of rollouts used for semantic agreement computation and the exact evaluation protocol for verbalized baselines to allow direct replication.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and detailed comments on our manuscript. We address each major point below and describe the revisions we will make to improve clarity and strengthen the empirical claims.
read point-by-point responses
-
Referee: [Method] Method section (around the definition of the semantic calibration reward): The procedure for quantifying semantic agreement across multiple rollouts is load-bearing for the central claim yet remains underspecified. It is unclear whether agreement is measured via embeddings, clustering, an auxiliary LLM judge, or another method, and how this metric is guaranteed to separate correct from incorrect predictions without introducing distribution-specific biases or reinforcing spurious consistency on errors. Without the exact algorithm or formula, it is impossible to evaluate whether the reported ECE and AUROC gains are attributable to improved semantic calibration or to artifacts of the agreement metric.
Authors: We appreciate this observation and agree that greater specificity is needed. Semantic agreement is quantified via average pairwise cosine similarity of sentence embeddings from a fixed Sentence-BERT encoder applied to the rollouts. The semantic calibration reward is r_sem = correctness * (avg_similarity - 0.5), which rewards high agreement among correct rollouts and penalizes spurious agreement among incorrect ones. We will add the precise formula, pseudocode, and a short discussion of separation properties (correct answers tend to cluster semantically while errors do not) to the method section in the revision. revision: yes
-
Referee: [Experiments] Experimental results section: The manuscript reports ECE reductions up to 40% and AUROC gains up to 31% across nearly all settings but provides no error bars, standard deviations across random seeds, or statistical significance tests. Given the stochasticity of rollout sampling and RL training, the absence of these details makes it difficult to determine whether the improvements are reliable or could be explained by variance, undermining the strength of the cross-model and cross-dataset claims.
Authors: We agree that variability measures are important given the stochasticity of sampling and RL training. The current results reflect single runs due to compute limits. In the revision we will rerun key experiments with 3 random seeds, report standard deviations and error bars on all ECE/AUROC tables, and include paired statistical tests (e.g., t-tests) against the verbalized-confidence baselines to quantify reliability of the reported gains. revision: yes
-
Referee: [Reward Formulation] §4 (or reward formulation subsection): The reward combination weights are listed among the free parameters. The paper should clarify whether these weights are tuned per dataset or held fixed, and how sensitive the ECE/AUROC improvements are to their choice. If tuning is required, this affects the claim that CSR provides a more robust calibration signal than verbalized baselines.
Authors: The weights (λ=0.7 for correctness, μ=0.3 for semantic reward) are held fixed across all models and datasets; they were selected once on a small held-out portion of HotpotQA and not retuned per evaluation set. To demonstrate robustness we will add a sensitivity plot in the appendix showing ECE/AUROC for weight values in [0.5,0.9] for λ, confirming that gains persist without per-dataset tuning. This supports rather than undermines the robustness claim. revision: partial
Circularity Check
No significant circularity in the empirical CSR calibration framework
full rationale
The paper defines CSR as a combination of a standard binary correctness reward with a novel semantic calibration reward that promotes agreement among correct rollouts and discourages spurious consistency among incorrect ones. This construction relies on external correctness labels and independently measured semantic agreement (not derived from the model's fitted parameters or predictions by construction). The central claims of reduced ECE (up to 40%) and improved AUROC (up to 31%) are established through direct experiments on HotpotQA (in-distribution) plus TriviaQA, MSMARCO, and NQ-Open (out-of-distribution) across three model families, compared against verbalized-confidence baselines. No equations, self-citations, or uniqueness theorems are invoked that reduce the reported performance gains to tautological redefinitions of the input rewards or fitted quantities. The approach remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- reward combination weights
axioms (1)
- domain assumption Semantic similarity between rollouts can be computed reliably and used as a stable signal for calibration
invented entities (1)
-
semantic calibration reward
no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
r(j)_calibration = −1/(K−1) ∑_{i≠j} CE( J(C(j),C(i)) , J(C(j),C*) ) ... encourages EC′∼πθ [J(C(j),C′)|C(j)] to match J(C(j),C*)
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
URLhttps://arxiv.org/abs/2310.06825. Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. InInternational Conference on Machine Learning, pages 1321–1330. PMLR, 2017. Alexandru Niculescu-Mizil and Rich Caruana. Predicting good probabilities with supervised learning. InProceedings of the 22nd International Con...
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[2]
You should not suggest ways of fixing the response; your job is only to reason about uncertainties
-
[3]
For some questions, the response might be correct. In these cases, it is also okay to have only a small number of uncertainties and then explicitly say that you are unable to spot more uncertainties
-
[4]
Uncertainties might be different from errors. For example, uncertainties may arise from ambiguities in the question, or from the application of a particular lemma/proof
-
[5]
If there are alternate potential approaches that may lead to different answers, you should mention them
-
[6]
List out plausible uncertainties; do not make generic statements; be as specific about uncertainties as possible
-
[7]
Enclose this uncertainty analysis within<analysis></analysis>tags. The final format that must be followed is: <think>reasoning process here </think><answer>final answer here </answer><analysis>analysis about confidence and uncertainty here </analysis><confidence>confidence level here (number between 0 and 1) </confidence> Rewarding Doubt (RD) prompt You w...
-
[8]
KingJames II of England✓ [1]James II✓ [2]James II of England.✓
-
[9]
KingJames II of Englandand VII of Scotland✓ [4]James II of England✓
-
[10]
The last monarch of England to be overthrown before the English General Election, 1690 was KingJames II of England.✓ [6]James II of England.✓ [7]James II of England✓ Accuracy:1.000Confidence:1.000 17 Calibrating LLMs with Semantic-level Reward Base generated answer
-
[11]
KingJames IIwas overthrown in 1688.✗ [1]James IIwas the last monarch of England to be overthrown before the Glorious Revolution in 1688 (not the English General Election, 1690, as there is no such event).✗ [2]James II of Englandwas the last monarch of England to be overthrown before the Glorious Revolution in 1688.✗ [3]James IIwas the last monarch of Engl...
-
[12]
The last monarch of England to be overthrown before the English General Election, 1690, wasJames II of England.✓ [6]James II of England.✓ [7]James II of Englandwas the last monarch to be overthrown before the English General Election of 1690.✓ Accuracy:0.375Confidence:0.571 RLVR generated answer
-
[13]
KingJames II.✓ [1]James II of England✓ [2]James II of England.✓ [3]James II of England.✓ [4]James II of England.✓ [5]James II of England.✓ [6]James II of England.✓ [7]James II of England✓ Accuracy:1.000Confidence:1.000 RD generated answer
-
[14]
Answer:James II, Confidence: 8✓ Accuracy:1.000Confidence:0.800 18 Calibrating LLMs with Semantic-level Reward RLCR generated answer
-
[15]
<think> To solve this problem, I will first recall the historical events related to the English monarchy and the English General Election. I will then narrow down the time period to the relevant era, which is around the late 17th century. I will consider the monarchs who ruled England during this time and identify the one who was overthrown before the Eng...
work page 1971
-
[16]
The United States dollar was officially taken off the gold standard on August 15, 1971, by President Richard Nixon.✓
work page 1971
-
[17]
The US dollar left the gold standard in1971.✓ [3]August 15, 1971✓ [4]August 15, 1971.✓
work page 1971
-
[20]
The US dollar officially left the gold standard onAugust 15, 1971.✓ Accuracy:1.000Confidence:1.000 19 Calibrating LLMs with Semantic-level Reward Base generated answer
work page 1971
-
[22]
The United States dollar officially left the gold standard onAugust 15, 1971.✓
work page 1971
-
[23]
The US dollar left the gold standard in two stages:
-
[24]
1933 –- The US officially went off the gold standard for domestic purposes during the Great Depression
work page 1933
-
[25]
1970 –- The US unilaterally suspended the direct convertibility of the US dollar to gold under the Bretton Woods system.✗
work page 1970
-
[26]
The US dollar left the gold standard in stages:
-
[27]
1933: The US went off the gold standard during the Great Depression
work page 1933
-
[28]
3.1971: The US abandoned the Bretton Woods system.✓
1944: The Bretton Woods system replaced the gold standard, pegging the dollar to gold at $35 an ounce. 3.1971: The US abandoned the Bretton Woods system.✓
work page 1944
-
[29]
The US dollar officially left the gold standard onAugust 15, 1971.✓
work page 1971
-
[31]
The United States dollar left the gold standard onAugust 15, 1971.✓
work page 1971
-
[32]
The US dollar officially left the gold standard onAugust 15, 1971.✓ Accuracy:0.875Confidence:0.646 RLVR generated answer
work page 1971
-
[33]
The United States left the gold standard onAugust 15, 1971.✓
work page 1971
-
[37]
The US dollar left the gold standard onAugust 15, 1971.✓
work page 1971
-
[38]
The United States officially left the gold standard onAugust 15, 1971.✓
work page 1971
-
[39]
The US dollar left the gold standard in1971.✓
-
[40]
The US dollar officially left the gold standard onAugust 15, 1971.✓ Accuracy:1.000Confidence:1.000 RD generated answer
work page 1971
-
[41]
Answer:1971, Confidence: 8✓ Accuracy:1.000Confidence:0.800 20 Calibrating LLMs with Semantic-level Reward RLCR generated answer
work page 1971
-
[42]
<think> I’m thinking about the historical context of the US dollar and its relationship with the gold standard. I’m considering the events that led to the eventual abandonment of the gold standard, such as the Great Depression and the Great Inflation of the 1970s. I’m also thinking about the key figures and policies that contributed to this change, such a...
work page 1971
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.