BaseCal: Unsupervised Confidence Calibration via Base Model Signals
Pith reviewed 2026-05-16 17:12 UTC · model grok-4.3
The pith
Post-trained LLMs regain calibration by mapping their hidden states to those of their base models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Base LLMs stay well-calibrated while post-trained versions become severely overconfident. BaseCal-ReEval obtains confidence by re-evaluating PoLLM responses with the base model and averaging token probabilities. BaseCal-Proj learns a lightweight projection that maps PoLLM final hidden states into the base LLM space, then routes those states through the base output layer to produce calibrated scores. The method runs without human labels or any modification to the post-trained model itself.
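The re-evaluation step described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the base model's next-token logits for the PoLLM response are already available as an array, it reads "averaging token probabilities" as an arithmetic mean, and the names `reeval_confidence`, `base_logits`, and `response_ids` are hypothetical.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def reeval_confidence(base_logits, response_ids):
    """Mean probability the base model assigns to the response tokens.

    base_logits:  (T, V) array; row t holds the base model's logits at the
                  position that predicts response token t (assumed given).
    response_ids: (T,) integer token ids of the PoLLM response.
    """
    log_probs = log_softmax(np.asarray(base_logits, dtype=np.float64))
    token_log_probs = log_probs[np.arange(len(response_ids)), response_ids]
    # Assumption: arithmetic mean of per-token probabilities.
    return float(np.exp(token_log_probs).mean())

# Toy example: a two-token response over a three-word vocabulary.
base_logits = np.array([[2.0, 0.0, 0.0],
                        [0.0, 0.0, 3.0]])
response_ids = np.array([0, 2])
confidence = reeval_confidence(base_logits, response_ids)
```

In practice the logits would come from one extra forward pass of the base model over the prompt plus the PoLLM response, which is the inference overhead the projection variant is meant to avoid.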
What carries the argument
A lightweight linear projection that maps the final-layer hidden states of a post-trained LLM onto the corresponding hidden states of its base model, allowing the base model's output layer to compute calibrated probabilities for the post-trained responses.
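A toy sketch of this mechanism follows. It is written under stated assumptions rather than taken from the paper: the hidden states are synthetic, the projection is fit by ordinary least squares (the paper's actual training loss is not given here), and `H_post`, `H_base`, `W_out`, and `projected_probs` are hypothetical names.

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab, n = 8, 5, 200  # toy hidden size, vocabulary size, sample count

# Synthetic stand-ins for paired final-layer hidden states.
H_base = rng.normal(size=(n, d))                       # base-model states
shift = np.eye(d) + 0.1 * rng.normal(size=(d, d))      # assumed mild post-training shift
H_post = H_base @ shift + 0.01 * rng.normal(size=(n, d))
W_out = rng.normal(size=(d, vocab))                    # stand-in for the base output layer

# Fit the projection W by least squares: minimize ||H_post @ W - H_base||^2.
# Assumption: the paper may use a different objective or optimizer.
W, *_ = np.linalg.lstsq(H_post, H_base, rcond=None)

def projected_probs(h_post):
    # Map post-trained states into base space, then apply the base output layer.
    logits = (h_post @ W) @ W_out
    logits = logits - logits.max(axis=-1, keepdims=True)  # stable softmax
    p = np.exp(logits)
    return p / p.sum(axis=-1, keepdims=True)

probs = projected_probs(H_post)
fit_error = float(np.abs(H_post @ W - H_base).mean())
```

At inference time the only added cost in this sketch is the `h_post @ W` product, which matches the claim that the projection variant adds a single small matrix multiplication.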
If this is right
- Calibration of post-trained models becomes possible without any ground-truth labels.
- The projection variant adds only a single small matrix multiplication at inference time.
- The same base-model reference works across multiple model families and task domains.
- Deployed PoLLMs can produce uncertainty estimates that better match observed accuracy.
Where Pith is reading between the lines
- Post-training appears to shift the model's internal representations in ways that specifically harm calibration more than other capabilities.
- The success of a simple projection implies that base and post-trained hidden-state spaces remain close enough for linear alignment to transfer calibration.
- The technique could be tested on other adaptation methods such as continued pre-training or instruction tuning to check whether calibration recovery is general.
- In high-stakes settings the method supplies a practical route to more trustworthy uncertainty without requiring new data collection.
Load-bearing premise
Base LLM probabilities or projected hidden states supply a reliable label-free target for the true correctness probability of post-trained outputs.
What would settle it
On any dataset where the projected or re-evaluated base probabilities show higher Expected Calibration Error than the original post-trained outputs or standard unsupervised baselines, the central claim would be falsified.
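For reference, the comparison above depends on how Expected Calibration Error is computed. The sketch below uses the standard equal-width binned estimator; the bin count and the function name `expected_calibration_error` are assumptions, since the page does not state which binning the paper uses.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: sample-weighted mean of |accuracy - mean confidence| per bin."""
    confidences = np.asarray(confidences, dtype=np.float64)
    correct = np.asarray(correct, dtype=np.float64)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Right-closed equal-width bins; the interior edges define the bin index.
    bin_ids = np.clip(np.digitize(confidences, edges[1:-1], right=True), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight by the fraction of samples in the bin
    return float(ece)

# Toy data: half of the answers are correct in both cases.
correct = [1, 1, 0, 0]
ece_overconfident = expected_calibration_error([0.95, 0.95, 0.95, 0.95], correct)
ece_calibrated = expected_calibration_error([0.5, 0.5, 0.5, 0.5], correct)
```

In the toy data the overconfident scores sit 0.45 above the observed accuracy of 0.5, while the scores of 0.5 match it exactly, which is the kind of gap the falsification test would compare across methods.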
Reliable confidence is essential for trusting the outputs of LLMs, yet widely deployed post-trained LLMs (PoLLMs) typically compromise this trust with severe overconfidence. In contrast, we observe that their corresponding base LLMs often remain well-calibrated. This naturally motivates us to calibrate PoLLM confidence using the base LLM as a reference. This work proposes two ways to achieve this. A straightforward solution, BaseCal-ReEval, evaluates PoLLM's responses by feeding them into the base LLM to get average probabilities as confidence. While effective, this approach introduces additional inference overhead. To address this, we propose BaseCal-Proj, which trains a lightweight projection to map the final-layer hidden states of PoLLMs back to those of their base LLMs. These projected states are then processed by the base LLM's output layer to derive base-calibrated confidence for PoLLM's responses. Notably, BaseCal is an unsupervised, plug-and-play solution that operates without human labels or LLM modifications. Experiments across five datasets and three LLM families demonstrate the effectiveness of BaseCal, reducing Expected Calibration Error (ECE) by an average of 42.90% compared to the best unsupervised baselines.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes BaseCal, an unsupervised method to calibrate confidence in post-trained LLMs (PoLLMs) by leveraging signals from their base LLMs. BaseCal-ReEval re-evaluates PoLLM responses directly with the base model to obtain average token probabilities as confidence scores. BaseCal-Proj instead trains a lightweight projection to map PoLLM final-layer hidden states onto base LLM states, then applies the base output layer. Experiments across five datasets and three LLM families report an average 42.90% reduction in Expected Calibration Error relative to the strongest unsupervised baselines.
Significance. If the central empirical claim holds after addressing the noted gaps, the work supplies a practical, label-free, plug-and-play calibration technique that exploits the observed calibration advantage of base models without retraining or human annotations. The projection variant keeps inference overhead low, which could make reliable uncertainty estimates more accessible in deployed LLM systems.
major comments (2)
- [§3] §3 (method description): The core assumption that base-LLM probabilities or projected states remain reliable calibration targets when applied to PoLLM-generated text is not directly verified. The paper establishes that base LLMs are well-calibrated on their native output distributions, yet BaseCal-ReEval and BaseCal-Proj feed them PoLLM responses that arise after post-training (SFT/RLHF) and therefore occupy a shifted distribution; no results report base-model ECE on PoLLM outputs to confirm the reference signal does not itself become miscalibrated.
- [§4] §4 (experiments): The reported average ECE reduction of 42.90% is presented without per-dataset breakdowns, standard deviations across runs, dataset statistics, or explicit controls for generation parameters; these omissions make it impossible to assess whether the improvement is robust or confounded by response-length or token-distribution differences between base and PoLLM outputs.
minor comments (2)
- [§3.2] Notation for projected hidden states and the exact form of the projection loss should be introduced with an equation in §3.2 for clarity.
- [Table 2] Table captions should explicitly state the number of evaluation samples per dataset and whether ECE is computed with 10 or 15 bins.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below and will revise the manuscript to incorporate the requested clarifications and additional results.
- Referee: [§3] §3 (method description): The core assumption that base-LLM probabilities or projected states remain reliable calibration targets when applied to PoLLM-generated text is not directly verified. The paper establishes that base LLMs are well-calibrated on their native output distributions, yet BaseCal-ReEval and BaseCal-Proj feed them PoLLM responses that arise after post-training (SFT/RLHF) and therefore occupy a shifted distribution; no results report base-model ECE on PoLLM outputs to confirm the reference signal does not itself become miscalibrated.
Authors: We acknowledge that directly reporting the base LLM's ECE on PoLLM-generated outputs would provide stronger support for the core assumption. While the main experiments demonstrate that BaseCal yields large ECE reductions, we agree this verification step is missing. In the revised manuscript we will add a new table (or subsection in §4) that computes and reports the ECE of each base model when evaluated directly on the PoLLM responses across all five datasets and three model families. This will show directly whether the reference signal remains well-calibrated on the shifted distribution. revision: yes
- Referee: [§4] §4 (experiments): The reported average ECE reduction of 42.90% is presented without per-dataset breakdowns, standard deviations across runs, dataset statistics, or explicit controls for generation parameters; these omissions make it impossible to assess whether the improvement is robust or confounded by response-length or token-distribution differences between base and PoLLM outputs.
Authors: We agree that greater experimental transparency is needed. In the revised §4 we will include: (i) full per-dataset ECE tables for every method and baseline, (ii) standard deviations computed over multiple runs (with random seeds reported), (iii) dataset statistics such as average response length, token count, and vocabulary overlap between base and PoLLM outputs, and (iv) explicit generation hyperparameters (temperature, top-p, max length, etc.) used for all models. These additions will allow readers to evaluate robustness and rule out length or distribution confounds. revision: yes
Circularity Check
No significant circularity; calibration derives from external base-model reference
The derivation chain relies on the empirical observation that base LLMs remain well-calibrated on their native outputs, then applies those outputs (or projected hidden states fed to the base output layer) as an unsupervised reference signal for PoLLM responses. BaseCal-ReEval directly re-evaluates PoLLM text with the base model; BaseCal-Proj learns a projection whose target is the base model's hidden states and output layer. Neither step reduces a claimed prediction to a fitted parameter by construction, nor invokes a self-citation chain or uniqueness theorem to force the method. The projection training uses the base model as an external anchor rather than internal PoLLM statistics, so the central claim does not collapse to a self-defined quantity. Distribution-shift concerns between base and PoLLM outputs affect validity but do not create circularity in the reported derivation.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Base LLMs are well-calibrated, while their post-trained counterparts are overconfident.
Forward citations
Cited by 2 Pith papers
- VERDI: Single-Call Confidence Estimation for Verification-Based LLM Judges via Decomposed Inference
VERDI derives three structural confidence signals from decomposed LLM verification traces and calibrates them with Platt-scaled logistic regression to achieve AUROC 0.72-0.91 on GPT models and 0.56-0.70 on Qwen models...
- Unsupervised Confidence Calibration for Reasoning LLMs from a Single Generation
Unsupervised single-generation confidence calibration for reasoning LLMs via offline self-consistency proxy distillation outperforms baselines on math and QA tasks and improves selective prediction.