
arxiv: 2604.09624 · v1 · submitted 2026-03-18 · 💻 cs.CL · cs.LG

Self-Calibrating Language Models via Test-Time Discriminative Distillation

Pith reviewed 2026-05-15 09:52 UTC · model grok-4.3

keywords language model calibration · test-time training · self-supervised adaptation · expected calibration error · distribution shift · confidence estimation · discriminative distillation

The pith

Language models can self-calibrate at test time by distilling their own internal P(True) signal into updated predictions without labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models routinely express high certainty on answers they get wrong. The paper argues that they already hold a better-calibrated signal internally: the token probability on 'True' when the model is asked whether its own answer is correct. This signal serves as free self-supervision for a test-time training procedure that updates the model on only a small slice of incoming questions, and only when the input distribution shifts. The resulting SECL method cuts Expected Calibration Error by 56 to 78 percent across four small models and four domains, at lower cost than the baseline it distills from. The work matters because reliable uncertainty estimates are essential for deploying models in changing real-world settings where labeled validation data is unavailable.

Core claim

SECL is a test-time training pipeline that turns the gap between verbalized confidence and the model's P(True) probability into label-free self-supervision. By performing discriminative distillation on this signal, the method adapts model parameters selectively on 6 to 26 percent of a shifting question stream. Across four small language models from three families and four domains, SECL lowers Expected Calibration Error by 56 to 78 percent, beats its own supervision source, and matches or exceeds recent inference-time calibration techniques. Seven ablations confirm that signal choice, gating, loss design, and layer selection are each necessary for the gains.
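Because the headline numbers are stated in Expected Calibration Error, a minimal sketch of the standard equal-width binned ECE may help. The bin count of 10 and the function names are assumptions for illustration, not details taken from the paper.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[int],
    n_bins: int = 10,  # assumed bin count; the paper's setting is not stated here
) -> float:
    """Equal-width binned ECE: sum over bins of weight * |accuracy - mean confidence|."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("need two non-empty sequences of equal length")
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, hit))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece


def relative_reduction(before: float, after: float) -> float:
    """Fractional ECE reduction, the form in which a 56 to 78 percent range is expressed."""
    return (before - after) / before


# Overconfident toy model: says 0.9 every time but is right half the time.
toy_ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 0, 0])

# Arithmetic check on the Figure 5 caption values (raw P(True) vs. NormPTrue),
# which work out to roughly a 0.60 relative reduction.
fig5_reduction = relative_reduction(0.161, 0.065)
```

A relative reduction hides the absolute starting point, so the same percentage can mean very different final ECE values across models.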

What carries the argument

SECL test-time training pipeline, which performs discriminative distillation by training the model to match its own P(True) token probabilities as soft labels.
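To make "match its own P(True) as soft labels" concrete, here is a toy sketch on a single scalar logit. It assumes a binary cross-entropy loss against the soft label and plain gradient descent; the paper's actual loss design is one of its ablated components and is not specified on this page, and `soft_bce`, `distill_step`, and the learning rate are hypothetical names and values.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def soft_bce(confidence: float, target: float, eps: float = 1e-7) -> float:
    """Binary cross-entropy between a predicted confidence and a soft label in [0, 1]."""
    c = min(max(confidence, eps), 1.0 - eps)
    return -(target * math.log(c) + (1.0 - target) * math.log(1.0 - c))


def distill_step(logit: float, target: float, lr: float = 0.5) -> float:
    """One gradient step on the logit; d/dz BCE(sigmoid(z), target) = sigmoid(z) - target."""
    return logit - lr * (sigmoid(logit) - target)


# Toy setting (assumed numbers): the model verbalizes ~0.95 confidence,
# while its own P(True) readout says 0.6.
logit = 3.0
p_true_target = 0.6
loss_before = soft_bce(sigmoid(logit), p_true_target)
for _ in range(200):
    logit = distill_step(logit, p_true_target)
loss_after = soft_bce(sigmoid(logit), p_true_target)
adapted_confidence = sigmoid(logit)
```

In this toy the confidence converges to the target exactly, so it cannot show how the adapted model might end up better calibrated than its supervision signal; that claim depends on parameter sharing across questions in the real model.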

If this is right

  • Models update only on detected shifts and only on a small fraction of the input stream.
  • Final calibration exceeds what the raw P(True) signal itself provides.
  • The method matches or beats recent inference-time calibration baselines at lower cost.
  • Each design component (gating, loss, layer choice) contributes measurably to the result.
  • Adaptation occurs without labeled data; the referee report notes that post-adaptation task accuracy is not reported, so preserved task performance remains to be confirmed.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Self-calibration of this form could let models remain trustworthy in streaming applications where data keeps changing.
  • The same internal signal might support test-time adaptation on tasks other than calibration.
  • If the approach scales to larger models, it could reduce reliance on periodic full retraining for reliability.
  • Combining this distillation with other cheap self-supervision signals might yield broader on-the-fly robustness.

Load-bearing premise

The P(True) token probability stays a reliably better-calibrated signal than verbalized confidence even when the input distribution shifts.

What would settle it

Apply SECL to a new domain shift and observe that post-update Expected Calibration Error is no lower than the unadapted model or that downstream answer accuracy falls.

Figures

Figures reproduced from arXiv: 2604.09624 by Chris Biemann, Jan Strich, Martin Semmann, Mohamed Rissal Hedna.

Figure 1. Calibration error vs. inference cost for …
Figure 2. Overview of SECL. (a) Test-Time Inference. An entropy-based change detector (Section 3.1) monitors the input stream. If no shift is detected, the adapted model θ′_t is used directly; otherwise, a calibration burst updates it to θ′_{t+1}. (b) Calibration Burst. For each of B=50 questions: the frozen model generates an answer with confidence c_t and distractors, computes NormPTrue (Section 3.2), and applies a…
Figure 3. Reliability diagrams for Llama 3.2-3B (2,000 …
Figure 5. Raw P(True) (left) vs. distractor-normalized NormPTrue (right) on Llama 3.2-3B. Normalization suppresses suggestibility bias, producing a better-calibrated supervision signal (ECE: 0.161 → 0.065).
    N Negative Control: Qwen 2.5-3B. We evaluate Qwen 2.5-3B-Instruct (Qwen Team et al., 2025) as a negative control. Unlike the other four models, Qwen's P(True) Norm baseline (best τ=1.0, ECE=0.257) is worse than it…
Figure 6. Raw P(True) (left) vs. NormPTrue (right) for the three non-8B models. Distractor normalization consistently reduces calibration error across model families.
Figure 7. Reliability diagrams: Verbalized baseline ( …
Figure 8. Reliability diagrams comparing candidate training targets on Llama 3.2-3B. Self-Consistency ( …
Figure 9. Confidence score distributions for correct (green) and incorrect (red) predictions on Llama 3.2-3B.
Original abstract

Large language models (LLMs) are systematically overconfident: they routinely express high certainty on questions they often answer incorrectly. Existing calibration methods either require labeled validation data, degrade under distribution shifts, or incur substantial inference costs. Recent work has shown that LLMs already contain a better-calibrated signal than the one they verbalize: the token probability of "True" when the model is asked "Is this answer correct?" ($P(\text{True})$) consistently outperforms their stated confidence, a gap that is theoretically grounded as generative error is lower-bounded by roughly twice the corresponding discriminative error. We introduce $\textbf{SECL}$ ($\textbf{SE}$lf-$\textbf{C}$alibrating $\textbf{L}$anguage Models), a test-time training (TTT) pipeline that exploits this gap as label-free self-supervision, requiring no labeled data or human supervision. SECL adapts only when the input distribution shifts, training on just 6--26% of the question stream at lower cost than the baseline it distills from. Across four small language models from three model families and four diverse domains, SECL reduces Expected Calibration Error (ECE) by 56--78%, outperforming its own supervision signal and matching or outperforming recent inference-time methods. SECL is the first method to apply TTT to calibration; seven ablations covering signal quality, gating strategy, weight accumulation, loss design, domain ordering, hyperparameter sensitivity, and layer selection confirm that each component is crucial and robust across configurations. Code: https://anonymous.4open.science/r/secl-emnlp26-submission-C890

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces SECL, a test-time training (TTT) pipeline that exploits the P(True) token probability (when querying 'Is this answer correct?') as label-free self-supervision to calibrate LLMs. It claims 56-78% reductions in Expected Calibration Error (ECE) across four small models from three families and four domains, outperforming the P(True) supervision signal itself while adapting only on detected shifts and training on 6-26% of the stream. The approach is grounded in prior bounds on generative vs. discriminative error; seven ablations on signal quality, gating, loss design, and layer selection are reported to confirm robustness.

Significance. If the results hold, this is a meaningful contribution as the first application of TTT to calibration. It offers a label-free, low-cost alternative to existing methods that require validation data or incur high inference overhead, with the code release and ablation coverage as clear strengths. The theoretical grounding and efficiency claims (adaptation on a small fraction of the stream) would make it relevant for practical deployment under distribution shifts.

major comments (3)
  1. [Abstract and Results] Abstract and Results sections: The headline claim that SECL outperforms its own P(True) supervision signal requires explicit confirmation that P(True) probabilities were re-measured after the test-time updates on the same stream. The reported ablations on signal quality and layer selection do not address this re-evaluation, leaving open the possibility that apparent gains arise from the distillation process rather than genuine calibration improvement.
  2. [Experimental Setup and Ablations] Experimental Setup and Ablations: No post-adaptation measurements of task accuracy or other capabilities are reported. This is load-bearing for the central claim, as the skeptic note correctly identifies that TTT updates could close the generative-discriminative gap or degrade performance; the gating and domain-ordering ablations do not substitute for direct accuracy tracking.
  3. [Method] Method section on gating: The shift-detection mechanism (ablated under 'gating strategy') needs concrete specification of the detection criterion, threshold, or statistic used. Without this, it is difficult to assess the risk that misfired gating could amplify noise in the self-supervision loop rather than correct calibration.
minor comments (2)
  1. [Method] Notation: Provide an explicit equation or pseudocode definition of how P(True) is extracted from the forward pass (e.g., exact prompt template and token indexing) to aid reproducibility.
  2. [Results] Figures: Ensure calibration plots and tables clearly label pre-SECL vs. post-SECL ECE values and include confidence intervals or statistical tests for the 56-78% reduction range.
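Minor comment 1 asks for an explicit definition of how P(True) is read off the forward pass. One possible definition is sketched below on a toy logit vector. The prompt layout, the token ids, and the optional renormalization over the 'True' and 'False' tokens are all assumptions for illustration; only the question "Is this answer correct?" comes from the abstract.

```python
import math
from typing import Optional, Sequence

# Hypothetical prompt layout; only the question wording is taken from the abstract.
PROMPT_TEMPLATE = (
    "Question: {question}\n"
    "Proposed answer: {answer}\n"
    "Is this answer correct? Reply True or False.\n"
)


def softmax(logits: Sequence[float]) -> list:
    """Numerically stable softmax over a full next-token logit vector."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def p_true(
    next_token_logits: Sequence[float],
    true_id: int,
    false_id: Optional[int] = None,
) -> float:
    """Probability mass on the 'True' token at the first answer position.

    With false_id given, the mass is renormalized over {True, False} only,
    which is an assumed variant rather than the paper's stated definition.
    """
    probs = softmax(next_token_logits)
    if false_id is None:
        return probs[true_id]
    return probs[true_id] / (probs[true_id] + probs[false_id])


# Toy vocabulary of three tokens: id 0 = "True", id 1 = "False", id 2 = other.
toy_logits = [2.0, 2.0, 0.0]
raw = p_true(toy_logits, true_id=0)
renormalized = p_true(toy_logits, true_id=0, false_id=1)
```

The two variants differ whenever probability mass leaks to tokens other than 'True' and 'False', which is one reason the exact indexing matters for reproducibility.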

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below with clarifications and commit to revisions that strengthen the presentation of our results and method.

point-by-point responses
  1. Referee: [Abstract and Results] Abstract and Results sections: The headline claim that SECL outperforms its own P(True) supervision signal requires explicit confirmation that P(True) probabilities were re-measured after the test-time updates on the same stream. The reported ablations on signal quality and layer selection do not address this re-evaluation, leaving open the possibility that apparent gains arise from the distillation process rather than genuine calibration improvement.

    Authors: We confirm that P(True) was re-measured after test-time adaptation on the updated model for all reported SECL results; the comparison is therefore between the original P(True) baseline and the post-distillation model on the same adapted stream. To make this explicit and rule out the alternative interpretation, we will add a clarifying sentence in both the abstract and results sections plus a short table or plot showing P(True) ECE before versus after adaptation. revision: yes

  2. Referee: [Experimental Setup and Ablations] Experimental Setup and Ablations: No post-adaptation measurements of task accuracy or other capabilities are reported. This is load-bearing for the central claim, as the skeptic note correctly identifies that TTT updates could close the generative-discriminative gap or degrade performance; the gating and domain-ordering ablations do not substitute for direct accuracy tracking.

    Authors: We agree that direct post-adaptation accuracy tracking is necessary. We will add these measurements to the Experimental Setup and Ablations sections, reporting accuracy on the identical question streams before and after SECL adaptation for every model and domain to confirm that calibration gains do not come at the expense of task performance. revision: yes

  3. Referee: [Method] Method section on gating: The shift-detection mechanism (ablated under 'gating strategy') needs concrete specification of the detection criterion, threshold, or statistic used. Without this, it is difficult to assess the risk that misfired gating could amplify noise in the self-supervision loop rather than correct calibration.

    Authors: We will revise the Method section to give the precise shift-detection criterion (change in P(True) variance exceeding a fixed threshold), the exact threshold value, and the statistic employed, together with pseudocode for the gating decision. This addition will allow readers to evaluate the robustness of the gating component directly. revision: yes
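The gating rule described in this response can be sketched as a sliding-window detector. This is a hedged illustration only: the class name, window size, and threshold are hypothetical, and the Figure 2 caption describes an entropy-based change detector, so the variance statistic here follows the simulated rebuttal rather than the paper.

```python
from collections import deque
from statistics import pvariance


class VarianceShiftGate:
    """Flags a shift when the windowed variance of P(True) moves by more than a threshold.

    All defaults are placeholders, not values from the paper.
    """

    def __init__(self, window: int = 50, threshold: float = 0.02) -> None:
        self.window = window
        self.threshold = threshold
        self.reference = None  # variance of the first full window
        self.recent = deque(maxlen=window)

    def update(self, p_true: float) -> bool:
        """Consume one P(True) value; return True if a calibration burst should fire."""
        self.recent.append(p_true)
        if len(self.recent) < self.window:
            return False  # not enough history yet
        current = pvariance(self.recent)
        if self.reference is None:
            self.reference = current
            return False
        if abs(current - self.reference) > self.threshold:
            self.reference = current  # re-anchor so one shift triggers one burst
            return True
        return False


# Toy stream: stable P(True) values followed by an outlier that widens the spread.
gate = VarianceShiftGate(window=4, threshold=0.01)
flags = [gate.update(p) for p in [0.5, 0.5, 0.5, 0.5, 0.5, 0.9]]
```

Swapping `pvariance` for an entropy statistic would leave the control flow unchanged, which is why the detection statistic and threshold are the details the referee asks to have pinned down.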

Circularity Check

0 steps flagged

No significant circularity; empirical adaptation uses external theoretical grounding


The paper's derivation chain consists of an empirical test-time training procedure that extracts P(True) token probabilities from the model's own forward pass and uses them as label-free supervision to update model parameters via distillation. This process is not self-definitional or a fitted input renamed as prediction: the ECE reduction is measured post-adaptation on evaluation streams and is not forced by construction. The theoretical claim that generative error is lower-bounded by roughly twice the discriminative error is presented as a citation to prior work rather than derived within the paper. No self-citation chain is load-bearing for the central result, no uniqueness theorem is imported from the authors' prior work, and no ansatz is smuggled via citation. The method includes explicit gating on distribution shifts and selective updates on only 6-26% of the stream, making outcomes data-dependent rather than tautological. All reported gains are externally verifiable via the provided code and benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the existence of a superior internal discriminative signal (P(True)) and the ability of lightweight TTT to exploit it without labeled data or performance degradation.

free parameters (1)
  • training fraction
    Portion of the question stream (6-26%) selected for adaptation; chosen via shift detection rather than fixed in advance.
axioms (1)
  • domain assumption: Generative error is lower-bounded by roughly twice the corresponding discriminative error
    Invoked to justify why P(True) outperforms verbalized confidence; treated as established prior result.

pith-pipeline@v0.9.0 · 5604 in / 1289 out tokens · 37279 ms · 2026-05-15T09:52:25.333940+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Tag legend

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
