arxiv: 2601.03042 · v4 · submitted 2026-01-06 · 💻 cs.CL

BaseCal: Unsupervised Confidence Calibration via Base Model Signals

Hexiang Tan, Wanli Yang, Junwei Zhang, Xin Chen, Rui Tang, Du Su, Jingang Wang, Yuanzhuo Wang, Fei Sun, Xueqi Cheng

Pith reviewed 2026-05-16 17:12 UTC · model grok-4.3

classification 💻 cs.CL

keywords: LLM calibration, unsupervised calibration, base model, post-trained LLM, confidence estimation, Expected Calibration Error, hidden state projection

The pith

Post-trained LLMs regain calibration by mapping their hidden states to those of their base models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Post-trained large language models (PoLLMs) commonly produce overconfident probability estimates that reduce trust in their answers. Their original base models tend to remain better calibrated on the same inputs. BaseCal recovers this calibration in two ways: either by feeding PoLLM outputs back into the base model and averaging the resulting probabilities, or by training a small projection that aligns PoLLM final-layer hidden states with base-model states and then applies the base output layer. Both routes are fully unsupervised and require no labels or changes to the deployed model. Experiments across five datasets and three LLM families show an average 42.9 percent drop in Expected Calibration Error relative to the best prior unsupervised baselines.

Core claim

Base LLMs stay well-calibrated while post-trained versions become severely overconfident. BaseCal-ReEval obtains confidence by re-evaluating PoLLM responses with the base model and averaging token probabilities. BaseCal-Proj learns a lightweight projection that maps PoLLM final hidden states into the base LLM space, then routes those states through the base output layer to produce calibrated scores. The method runs without human labels or any modification to the post-trained model itself.
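The re-evaluation route can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the base model has already scored each token of the PoLLM response and returned per-token log-probabilities, and it assumes "averaging" means the arithmetic mean of token probabilities (the source does not say whether an arithmetic or geometric mean is used). The function name `reeval_confidence` and the example values are hypothetical.

```python
import math


def reeval_confidence(token_logprobs):
    """Average per-token probability of a response under the base model.

    token_logprobs: natural-log probabilities the base model assigns to
    each token of the PoLLM response (assumed to be computed elsewhere).
    Assumption: confidence is the arithmetic mean of token probabilities.
    """
    if not token_logprobs:
        raise ValueError("response has no tokens")
    probs = [math.exp(lp) for lp in token_logprobs]
    return sum(probs) / len(probs)


# Hypothetical base-model scores for a three-token response.
example_logprobs = [math.log(0.9), math.log(0.6), math.log(0.3)]
confidence = reeval_confidence(example_logprobs)
print(round(confidence, 3))
```

Obtaining the log-probabilities requires a full forward pass of the base model over the response, which is the extra inference cost that motivates the projection variant.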

What carries the argument

A lightweight linear projection that maps the final-layer hidden states of a post-trained LLM onto the corresponding hidden states of its base model, allowing the base model's output layer to compute calibrated probabilities for the post-trained responses.
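The inference path described here can be sketched as two matrix-vector products followed by a softmax. This is a toy illustration under stated assumptions: the projection is taken to be a plain linear map without bias, the base output layer is taken to be a plain unembedding matrix, and all names (`projection`, `base_unembed`, `projected_token_prob`) and values are hypothetical. How the projection is trained is not shown, because the source does not give the loss.

```python
import math


def matvec(matrix, vector):
    """Multiply a row-major matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def projected_token_prob(h_post, projection, base_unembed, token_id):
    """Probability of token_id after routing a PoLLM state through the base head."""
    h_proj = matvec(projection, h_post)    # map PoLLM state into base space
    logits = matvec(base_unembed, h_proj)  # apply the base output layer
    return softmax(logits)[token_id]


# Toy sizes: hidden dimension 2, vocabulary of 3 tokens.
projection = [[1.0, 0.0], [0.0, 1.0]]                # identity, for illustration
base_unembed = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]  # one row per vocabulary token
h_post = [2.0, 0.0]
p = projected_token_prob(h_post, projection, base_unembed, token_id=0)
print(p)
```

In this sketch the only extra work on top of the PoLLM forward pass is the projection and the base output layer, so the base model's transformer layers are never run.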

If this is right

Calibration of post-trained models becomes possible without any ground-truth labels.
The projection variant adds only a single small matrix multiplication at inference time.
The same base-model reference works across multiple model families and task domains.
Deployed PoLLMs can produce uncertainty estimates that better match observed accuracy.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Post-training appears to shift the model's internal representations in ways that specifically harm calibration more than other capabilities.
The success of a simple projection implies that base and post-trained hidden-state spaces remain close enough for linear alignment to transfer calibration.
The technique could be tested on other adaptation methods such as continued pre-training or instruction tuning to check whether calibration recovery is general.
In high-stakes settings the method supplies a practical route to more trustworthy uncertainty without requiring new data collection.

Load-bearing premise

Base LLM probabilities or projected hidden states supply a reliable label-free target for the true correctness probability of post-trained outputs.

What would settle it

On any dataset where the projected or re-evaluated base probabilities show higher Expected Calibration Error than the original post-trained outputs or standard unsupervised baselines, the central claim would be falsified.

Original abstract

Reliable confidence is essential for trusting the outputs of LLMs, yet widely deployed post-trained LLMs (PoLLMs) typically compromise this trust with severe overconfidence. In contrast, we observe that their corresponding base LLMs often remain well-calibrated. This naturally motivates us to calibrate PoLLM confidence using the base LLM as a reference. This work proposes two ways to achieve this. A straightforward solution, BaseCal-ReEval, evaluates PoLLM's responses by feeding them into the base LLM to get average probabilities as confidence. While effective, this approach introduces additional inference overhead. To address this, we propose BaseCal-Proj, which trains a lightweight projection to map the final-layer hidden states of PoLLMs back to those of their base LLMs. These projected states are then processed by the base LLM's output layer to derive base-calibrated confidence for PoLLM's responses. Notably, BaseCal is an unsupervised, plug-and-play solution that operates without human labels or LLM modifications. Experiments across five datasets and three LLM families demonstrate the effectiveness of BaseCal, reducing Expected Calibration Error (ECE) by an average of 42.90% compared to the best unsupervised baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

BaseCal gives a practical unsupervised way to borrow calibration from base LLMs for their post-trained versions via re-evaluation or hidden-state projection, but the distribution-shift assumption on PoLLM outputs needs direct checks.

The paper starts from the real observation that base models often stay better calibrated than their RLHF or SFT descendants and turns that into two concrete methods. BaseCal-ReEval simply feeds PoLLM responses back into the base model for probabilities. BaseCal-Proj adds a lightweight mapper on final-layer states so the base output layer can be reused without full re-inference. Both are label-free and require no changes to the deployed model, which matches production constraints. The reported average ECE drop of 42.9% across five datasets and three model families is large enough to notice in practice and beats the unsupervised baselines they compare against. That combination of simplicity and measured gain is the useful part.

The soft spot is the distribution-shift assumption. The base model is known to be calibrated on its own output distribution, yet both methods apply it to text that has already passed through post-training. No control in the abstract shows that base-model ECE stays low on those shifted inputs. If the reference signal itself degrades, the whole calibration target weakens. The full paper needs either an explicit measurement of base calibration on PoLLM text or a clear argument why the shift does not matter. Minor gaps include missing variance numbers and dataset statistics in the summary, but those are routine to add.

This work is aimed at people who deploy LLMs and need better confidence scores without new labels or retraining. A reader looking for plug-and-play calibration tricks will get something concrete to test. It deserves peer review because the core engineering move is clear and the empirical claim is testable once the shift question is addressed.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes BaseCal, an unsupervised method to calibrate confidence in post-trained LLMs (PoLLMs) by leveraging signals from their base LLMs. BaseCal-ReEval re-evaluates PoLLM responses directly with the base model to obtain average token probabilities as confidence scores. BaseCal-Proj instead trains a lightweight projection to map PoLLM final-layer hidden states onto base LLM states, then applies the base output layer. Experiments across five datasets and three LLM families report an average 42.90% reduction in Expected Calibration Error relative to the strongest unsupervised baselines.

Significance. If the central empirical claim holds after addressing the noted gaps, the work supplies a practical, label-free, plug-and-play calibration technique that exploits the observed calibration advantage of base models without retraining or human annotations. The projection variant keeps inference overhead low, which could make reliable uncertainty estimates more accessible in deployed LLM systems.

major comments (2)

[§3] §3 (method description): The core assumption that base-LLM probabilities or projected states remain reliable calibration targets when applied to PoLLM-generated text is not directly verified. The paper establishes that base LLMs are well-calibrated on their native output distributions, yet BaseCal-ReEval and BaseCal-Proj feed them PoLLM responses that arise after post-training (SFT/RLHF) and therefore occupy a shifted distribution; no results report base-model ECE on PoLLM outputs to confirm the reference signal does not itself become miscalibrated.
[§4] §4 (experiments): The reported average ECE reduction of 42.90% is presented without per-dataset breakdowns, standard deviations across runs, dataset statistics, or explicit controls for generation parameters; these omissions make it impossible to assess whether the improvement is robust or confounded by response-length or token-distribution differences between base and PoLLM outputs.

minor comments (2)

[§3.2] Notation for projected hidden states and the exact form of the projection loss should be introduced with an equation in §3.2 for clarity.
[Table 2] Table captions should explicitly state the number of evaluation samples per dataset and whether ECE is computed with 10 or 15 bins.
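The bin count matters because ECE is a binned statistic: predictions are grouped by confidence, and the gap between mean confidence and accuracy in each bin is weighted by the bin's share of samples. A minimal sketch follows, assuming equal-width bins that are closed on the left (a common convention, not necessarily the one used in the paper); the function name and the toy inputs are hypothetical and are not results from the paper.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width confidence bins over [0, 1].

    confidences: predicted confidence per sample, each in [0, 1].
    correct: 1 if the corresponding answer was right, else 0.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one sample")
    bins = [[] for _ in range(n_bins)]
    for conf, label in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, label))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


# Toy inputs: the same two predictions scored with 10 and with 15 bins.
toy_conf = [0.62, 0.68]
toy_correct = [1, 0]
ece_10 = expected_calibration_error(toy_conf, toy_correct, n_bins=10)
ece_15 = expected_calibration_error(toy_conf, toy_correct, n_bins=15)
print(ece_10, ece_15)
```

With 10 bins both toy predictions share a bin and their errors partly cancel; with 15 bins they land in separate bins and do not. The two settings therefore give different values on identical predictions, which is why the caption should state which one was used.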

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will revise the manuscript to incorporate the requested clarifications and additional results.

Referee: [§3] §3 (method description): The core assumption that base-LLM probabilities or projected states remain reliable calibration targets when applied to PoLLM-generated text is not directly verified. The paper establishes that base LLMs are well-calibrated on their native output distributions, yet BaseCal-ReEval and BaseCal-Proj feed them PoLLM responses that arise after post-training (SFT/RLHF) and therefore occupy a shifted distribution; no results report base-model ECE on PoLLM outputs to confirm the reference signal does not itself become miscalibrated.

Authors: We acknowledge that directly reporting the base LLM's ECE on PoLLM-generated outputs would provide stronger support for the core assumption. While the main experiments demonstrate that BaseCal yields large ECE reductions, we agree this verification step is missing. In the revised manuscript we will add a new table (or subsection in §4) that computes and reports the ECE of each base model when evaluated directly on the PoLLM responses across all five datasets and three model families. This will explicitly confirm that the reference signal remains well-calibrated on the shifted distribution.
revision: yes

Referee: [§4] §4 (experiments): The reported average ECE reduction of 42.90% is presented without per-dataset breakdowns, standard deviations across runs, dataset statistics, or explicit controls for generation parameters; these omissions make it impossible to assess whether the improvement is robust or confounded by response-length or token-distribution differences between base and PoLLM outputs.

Authors: We agree that greater experimental transparency is needed. In the revised §4 we will include: (i) full per-dataset ECE tables for every method and baseline, (ii) standard deviations computed over multiple runs (with random seeds reported), (iii) dataset statistics such as average response length, token count, and vocabulary overlap between base and PoLLM outputs, and (iv) explicit generation hyperparameters (temperature, top-p, max length, etc.) used for all models. These additions will allow readers to evaluate robustness and rule out length or distribution confounds.
revision: yes
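The per-dataset reporting promised in (i) and (ii) amounts to a mean and a sample standard deviation over seeds for each dataset. A small sketch of that aggregation is below; the function name, dataset labels, and ECE values are placeholders invented for illustration and are not taken from the paper.

```python
import statistics


def summarize_runs(ece_by_dataset):
    """Return {dataset: (mean ECE, sample std)} across seeded runs."""
    summary = {}
    for name, runs in ece_by_dataset.items():
        mean = statistics.mean(runs)
        # Sample standard deviation needs at least two runs.
        std = statistics.stdev(runs) if len(runs) > 1 else 0.0
        summary[name] = (mean, std)
    return summary


# Placeholder values only: three seeds per hypothetical dataset.
toy_runs = {
    "dataset_a": [0.10, 0.12, 0.14],
    "dataset_b": [0.20, 0.20, 0.20],
}
summary = summarize_runs(toy_runs)
for name, (mean, std) in summary.items():
    print(f"{name}: {mean:.3f} +/- {std:.3f}")
```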

Circularity Check

0 steps flagged

No significant circularity; calibration derives from external base-model reference

The derivation chain relies on the empirical observation that base LLMs remain well-calibrated on their native outputs, then applies those outputs (or projected hidden states fed to the base output layer) as an unsupervised reference signal for PoLLM responses. BaseCal-ReEval directly re-evaluates PoLLM text with the base model; BaseCal-Proj learns a projection whose target is the base model's hidden states and output layer. Neither step reduces a claimed prediction to a fitted parameter by construction, nor invokes a self-citation chain or uniqueness theorem to force the method. The projection training uses the base model as an external anchor rather than internal PoLLM statistics, so the central claim does not collapse to a self-defined quantity. Distribution-shift concerns between base and PoLLM outputs affect validity but do not create circularity in the reported derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the empirical observation that base LLMs remain well-calibrated while their post-trained counterparts become overconfident; no free parameters are introduced in the abstract description, and no new entities are postulated.

axioms (1)

Domain assumption: Base LLMs are well-calibrated while their post-trained counterparts are overconfident.
Stated as an observation motivating the method; no derivation or proof supplied.

pith-pipeline@v0.9.0 · 5534 in / 1125 out tokens · 31690 ms · 2026-05-16T17:12:07.134897+00:00 · methodology

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

VERDI: Single-Call Confidence Estimation for Verification-Based LLM Judges via Decomposed Inference
cs.LG 2026-05 unverdicted novelty 6.0

VERDI derives three structural confidence signals from decomposed LLM verification traces and calibrates them with Platt-scaled logistic regression to achieve AUROC 0.72-0.91 on GPT models and 0.56-0.70 on Qwen models...

Unsupervised Confidence Calibration for Reasoning LLMs from a Single Generation
cs.LG 2026-04 unverdicted novelty 6.0

Unsupervised single-generation confidence calibration for reasoning LLMs via offline self-consistency proxy distillation outperforms baselines on math and QA tasks and improves selective prediction.