pith. sign in

arxiv: 2605.21552 · v1 · pith:EYHCYR56new · submitted 2026-05-20 · 💻 cs.LG · stat.ML

Expectation Consistency Loss: Rethink Confidence Calibration under Covariate Shift

Pith reviewed 2026-05-22 00:42 UTC · model grok-4.3

classification 💻 cs.LG stat.ML
keywords confidence calibrationcovariate shiftexpectation consistencyunsupervised domain adaptationmodel calibration
0
0 comments X

The pith

The expectation consistency condition is necessary and sufficient for confidence calibration under covariate shifts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Models often lose calibration when test data follows a different distribution from training data, a situation called covariate shift. Standard fixes either assume no shift occurs or depend on unstable importance weights when distributions differ sharply. This paper derives a necessary and sufficient condition for calibration to survive such shifts, called the expectation consistency condition, which only equates average predicted probabilities to average accuracy in the target domain. The condition is weaker than requiring the full covariate distributions to match. From it the authors build an unsupervised loss that enforces the condition on unlabeled target samples and works for canonical, class-wise, and top-label calibration.

Core claim

The expectation consistency condition is necessary and sufficient for confidence calibration under covariate shifts. It reveals that covariate shifts do not necessarily produce uncalibrated models and supplies a weaker requirement than global alignment of the covariate distributions. The expectation consistency loss then uses only unlabeled target samples to enforce the condition while preserving compatibility with multiple calibration definitions.

What carries the argument

Expectation consistency condition, requiring that the expected value of the model's predicted probability for the true label equals the expected value of the correctness indicator under the target distribution.

If this is right

  • Calibration can hold under covariate shift without requiring the training and test covariate distributions to be identical.
  • The expectation consistency loss supports canonical calibration, class-wise calibration, and top-label calibration.
  • The sample complexity of computing the expectation consistency loss equals that of computing expected calibration error.
  • A mini-batch scheme for training with the expectation consistency loss is supported by matching theoretical guarantees.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Enforcing only expectation consistency may let practitioners avoid unstable density-ratio estimates that break down under large shifts.
  • The same consistency idea could be adapted to create unsupervised calibration methods for other shift types such as label shift.
  • Linking calibration directly to domain-adaptation objectives may improve reliability when models are deployed in changing real-world environments.

Load-bearing premise

The derivation assumes the covariate shift preserves the conditional label distribution so that expectations over the target domain can be estimated from unlabeled samples without density-ratio bounds or support restrictions.

What would settle it

A concrete dataset or simulation in which expectation consistency holds yet calibration error remains high, or in which consistency fails yet calibration holds, under a covariate shift that leaves the conditional label distribution unchanged.

Figures

Figures reproduced from arXiv: 2605.21552 by Bo Yang, Jinzong Dong, Zhaohui Jiang.

Figure 1
Figure 1. Figure 1: A binary classification example where covariate shift occurs but calibration error remains unchanged, where P(Y |X) = (P(Y1|X), P(Y2|X)) and S = (S1, S2). P(Y2|X) = 1 − P(Y1|X) and S2 = 1 − S1. Ps(X) = √ 2π −1 e −0.5(X+0.5)2 , Pt(X) = √ 2π −1 e −0.5(X−0.5)2 , S1 = −0.25X 2 + 1, and P(Y1 = 1|X) = −0.5|X| + 1. 1|X)] holds for ∀S1 ∈ [0, 1]. Moreover, such examples are infinite because they include but are n… view at source ↗
Figure 2
Figure 2. Figure 2: Calibration effect display. Figures (c) to (k) show the calibration effect on the simulated covariate shift dataset (see Figures (a) and (b)), and Figures (l) to (t) show the calibration effect on the real-world covariate shift dataset PACS (three classes). NLL represents cross-entropy loss, Soft-ECE represents softened differentiable ECE loss, CwECE represents class-wise ECE, and CaECE represents canonica… view at source ↗
Figure 3
Figure 3. Figure 3: The calibration results are presented using simulated data under a uniformly distributed covariate shift. From the calibration metric on the target domain and the reliability diagram of the calibrated classifier, ECL achieves the smallest calibration error. (or P(Y ∗ = Yˆ |X) for top-label calibration) on the source domain using Soft-ECE loss. This classification head has the same network structure as the … view at source ↗
read the original abstract

Confidence calibration for classification models is vital in safety-critical decision-making scenarios and has received extensive attention. General confidence calibration methods assume training and test data are independent and identically distributed, limiting their effectiveness under covariate shifts. Previous calibration methods under covariate shift struggle with class-wise or canonical calibrations and often rely on unstable importance weighting when density ratios are large or unbounded. Given the above limitations, this paper rethinks confidence calibration under covariate shifts. First, we derive a necessary and sufficient condition for confidence calibration under covariate shifts, named Expectation consistency condition, which reveals covariate shifts do not necessarily lead to uncalibrated confidence and provides a weaker condition for confidence calibration than global covariate distribution alignment. Then, utilizing Expectation consistency condition, this paper proposes an unsupervised domain adaptation loss to calibrate confidence of the target domain, named Expectation consistency loss (ECL), which is compatible with canonical calibration, class-wise calibration, and top-label calibration. Third, we prove that computing ECL loss has the same sample complexity as Expected Calibration Error (ECE) and provide a theoretically grounded mini-batch trainable scheme for ECL loss. Finally, we validate the effectiveness of our method on both simulated and real-world covariate shift datasets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper derives a necessary and sufficient 'Expectation Consistency' condition for confidence calibration under covariate shift (weaker than global distribution alignment), proposes an unsupervised Expectation Consistency Loss (ECL) compatible with canonical, class-wise, and top-label calibration, proves that ECL has the same sample complexity as ECE, and validates the approach on simulated and real-world covariate-shift datasets.

Significance. If the necessity and sufficiency derivation holds under the stated assumptions, the work supplies a practical unsupervised calibration method that avoids unstable importance weighting and provides a strictly weaker condition than full covariate alignment; the matching sample-complexity result and mini-batch scheme are additional strengths.

major comments (2)
  1. [§3] §3 (derivation of necessity): the necessity direction equates an expectation over the target marginal to a source quantity using only P_s(Y|X)=P_t(Y|X); when target support is not contained in source support, the equality requires an explicit Radon-Nikodym factor or indicator on the overlap set that is omitted in the stated condition, so necessity does not follow from unlabeled target samples alone.
  2. [§4] Theorem on sample complexity (presumably §4): the claim that ECL has identical sample complexity to ECE is stated without the precise concentration inequalities or bounded-ratio assumptions needed to control the density-ratio-free estimator; the proof sketch should make these explicit.
minor comments (2)
  1. [§3] Notation for the expectation-consistency condition should be introduced with an explicit equation number and contrasted directly with the standard calibration definition E[1{Y=y}|f(X)=p]=p.
  2. [Experiments] The abstract states compatibility with canonical, class-wise, and top-label calibration, but the experimental section should report separate metrics for each rather than a single aggregate.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive review. The comments highlight important points for rigor in the necessity derivation and the sample-complexity analysis. We address each major comment below and will revise the manuscript to incorporate the suggested clarifications.

read point-by-point responses
  1. Referee: [§3] §3 (derivation of necessity): the necessity direction equates an expectation over the target marginal to a source quantity using only P_s(Y|X)=P_t(Y|X); when target support is not contained in source support, the equality requires an explicit Radon-Nikodym factor or indicator on the overlap set that is omitted in the stated condition, so necessity does not follow from unlabeled target samples alone.

    Authors: We appreciate this observation on the support-overlap issue. The original derivation in §3 implicitly restricts attention to the common support of source and target distributions under the covariate-shift model. To make the necessity direction fully rigorous, we will revise §3 to explicitly include an indicator on the overlap set and note the role of the Radon-Nikodym derivative when the densities are unbounded. This clarification preserves the result that the expectation-consistency condition is necessary and sufficient on the relevant support while acknowledging that unlabeled target samples alone do not suffice outside the overlap. revision: yes

  2. Referee: [§4] Theorem on sample complexity (presumably §4): the claim that ECL has identical sample complexity to ECE is stated without the precise concentration inequalities or bounded-ratio assumptions needed to control the density-ratio-free estimator; the proof sketch should make these explicit.

    Authors: We agree that the sample-complexity argument in §4 requires more explicit technical detail. In the revised manuscript we will expand the proof to state the precise concentration inequalities (e.g., Hoeffding or Bernstein bounds) employed and to list the bounded-ratio or bounded-variance assumptions needed for the density-ratio-free estimator. These additions will render the claim that ECL matches the sample complexity of ECE fully rigorous. revision: yes

Circularity Check

0 steps flagged

Derivation of expectation consistency condition is self-contained from calibration definitions and covariate shift assumptions

full rationale

The paper derives the expectation consistency condition directly from the definition of confidence calibration (E[1{Y=y}|f(X)=p] = p) combined with the standard covariate shift assumption P_s(Y|X)=P_t(Y|X). This produces a mathematical equivalence that is not tautological or fitted; the resulting ECL loss is then defined from that condition rather than reverse-engineered to match data. No self-citation chain, ansatz smuggling, or renaming of known results is load-bearing for the central necessity/sufficiency claim. The derivation remains independent of the target result and does not reduce to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review limits visibility into explicit free parameters or axioms; the central claim rests on the derivation of the consistency condition and the assumption that it can be estimated unsupervised.

pith-pipeline@v0.9.0 · 5742 in / 1052 out tokens · 39076 ms · 2026-05-22T00:42:59.608901+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

16 extracted references · 16 canonical work pages

  1. [1]

    cc/paper_files/paper/2010/file/ 59c33016884a62116be975a9bb8257e3- Paper.pdf

    URL https://proceedings.neurips. cc/paper_files/paper/2010/file/ 59c33016884a62116be975a9bb8257e3- Paper.pdf. Dong, J., Jiang, Z., Pan, D., Chen, Z., Guan, Q., Zhang, H., Gui, G., and Gui, W. A survey on confidence calibration of deep learning-based classification models under class imbalance data.IEEE Transactions on Neural Networks and Learning Systems,...

  2. [2]

    Gawlikowski, C

    doi: 10.1007/s10462-023-10562-9. URL https: //doi.org/10.1007/s10462-023-10562-9. Grathwohl, W., Wang, K.-C., Jacobsen, J.-H., Duvenaud, D., Norouzi, M., and Swersky, K. Your classifier is secretly an energy based model and you should treat it like one. In International Conference on Learning Representations,

  3. [3]

    Guo, C., Pleiss, G., Sun, Y ., and Weinberger, K

    URL https://openreview.net/forum? id=Hkxzx0NtDB. Guo, C., Pleiss, G., Sun, Y ., and Weinberger, K. Q. On cali- bration of modern neural networks. In Precup, D. and Teh, Y . W. (eds.),Proceedings of the 34th International Confer- ence on Machine Learning, volume 70 ofProceedings of Machine Learning Research, pp. 1321–1330. PMLR, 06– 11 Aug 2017. URL https:...

  4. [4]

    URL https://www.sciencedirect.com/ science/article/pii/S0031320324001365

    doi: https://doi.org/10.1016/j.patcog.2024.110385. URL https://www.sciencedirect.com/ science/article/pii/S0031320324001365. He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learn- ing for image recognition. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016. Hebbalaguppe, R., Prakash, J., Madan, N., and...

  5. [5]

    cc/paper_files/paper/2021/file/ f8905bd3df64ace64a68e154ba72f24c- Paper.pdf

    URL https://proceedings.neurips. cc/paper_files/paper/2021/file/ f8905bd3df64ace64a68e154ba72f24c- Paper.pdf. Kimura, M. and Hino, H. A short survey on impor- tance weighting for machine learning.Transactions on Machine Learning Research, 2024. ISSN 2835-

  6. [6]

    Survey Certification

    URL https://openreview.net/forum? id=IhXM3g2gxg. Survey Certification. Kull, M., Perello Nieto, M., K ¨angsepp, M., Silva Filho, T., Song, H., and Flach, P. Beyond temperature scaling: Obtaining well-calibrated multi-class probabilities with dirichlet calibration. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alch ´e-Buc, F., Fox, E., and Garnett, R....

  7. [7]

    Lecun, L

    URL https://proceedings.neurips. cc/paper_files/paper/2019/file/ 8ca01ea920679a0fe3728441494041b9- Paper.pdf. Lecun, Y ., Bottou, L., Bengio, Y ., and Haffner, P. Gradient- based learning applied to document recognition.Pro- ceedings of the IEEE, 86(11):2278–2324, 1998. doi: 10.1109/5.726791. LeCun, Y ., Bengio, Y ., and Hinton, G. Deep learning.Na- ture,...

  8. [8]

    findings-acl.393/

    URL https://aclanthology.org/2023. findings-acl.393/. Liu, B., Rony, J., Galdran, A., Dolz, J., and Ben Ayed, I. Class adaptive network calibration. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 16070–16079, June 2023. M¨uller, R., Kornblith, S., and Hinton, G. E. When does label smoothing help? In Wallach...

  9. [9]

    cc/paper_files/paper/2019/file/ f1748d6b0fd9d439f71450117eba2725- Paper.pdf

    URL https://proceedings.neurips. cc/paper_files/paper/2019/file/ f1748d6b0fd9d439f71450117eba2725- Paper.pdf. Munir, M. A., Khan, S. H., Khan, M. H., Ali, M., and Shahbaz Khan, F. Cal-detr: Calibrated detection transformer. In Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., and Levine, S. (eds.), Advances in Neural Information Processing System...

  10. [10]

    cc/paper_files/paper/2023/file/ e271e30de7a2e462ca1f85cefa816380- Paper-Conference.pdf

    URL https://proceedings.neurips. cc/paper_files/paper/2023/file/ e271e30de7a2e462ca1f85cefa816380- Paper-Conference.pdf. Netzer, Y ., Wang, T., Coates, A., Bissacco, A., Wu, B., Ng, A. Y ., et al. Reading digits in natural images with unsupervised feature learning. InNIPS workshop on deep learning and unsupervised feature learning, volume 2011, pp. 7. Gra...

  11. [11]

    Popordanoska, T., Sayer, R., and Blaschko, M

    URL https://proceedings.mlr.press/ v108/park20b.html. Popordanoska, T., Sayer, R., and Blaschko, M. A consistent and differentiable lp canonical calibration error estimator. In Koyejo, S., Mohamed, S., Agar- wal, A., Belgrave, D., Cho, K., and Oh, A. (eds.), Advances in Neural Information Processing Systems, volume 35, pp. 7933–7946. Curran Associates, In...

  12. [12]

    cc/paper_files/paper/2022/file/ 33d6e648ee4fb24acec3a4bbcd4f001e- Paper-Conference.pdf

    URL https://proceedings.neurips. cc/paper_files/paper/2022/file/ 33d6e648ee4fb24acec3a4bbcd4f001e- Paper-Conference.pdf. Rahimi, A., Shaban, A., Cheng, C.-A., Hartley, R., and Boots, B. Intra order-preserving functions for calibration of multi-class neural networks. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in N...

  13. [13]

    cc/paper_files/paper/2020/file/ 9bc99c590be3511b8d53741684ef574c- Paper.pdf

    URL https://proceedings.neurips. cc/paper_files/paper/2020/file/ 9bc99c590be3511b8d53741684ef574c- Paper.pdf. Wang, H., Ge, S., Lipton, Z., and Xing, E. P. Learning robust global representations by penalizing local predictive power. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alch´e-Buc, F., Fox, E., and Garnett, R. (eds.),Advances in Neural Inform...

  14. [14]

    cc/paper_files/paper/2019/file/ 3eefceb8087e964f89c2d59e8a249915- Paper.pdf

    URL https://proceedings.neurips. cc/paper_files/paper/2019/file/ 3eefceb8087e964f89c2d59e8a249915- Paper.pdf. Wang, H., Yu, Z., Yue, Y ., Anandkumar, A., Liu, A., and Yan, J. Learning calibrated uncertainties for domain shift: a distributionally robust learning approach. In Proceedings of the Thirty-Second International Joint Conference on Artificial Inte...

  15. [15]

    2023/120

    URL https://doi.org/10.24963/ijcai. 2023/162. Wang, X., Long, M., Wang, J., and Jordan, M. Trans- ferable calibration with lower bias and variance in domain adaptation. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.),Ad- vances in Neural Information Processing Systems, volume 33, pp. 19212–19223. Curran Associates, Inc.,

  16. [16]

    cc/paper_files/paper/2020/file/ df12ecd077efc8c23881028604dbb8cc- Paper.pdf

    URL https://proceedings.neurips. cc/paper_files/paper/2020/file/ df12ecd077efc8c23881028604dbb8cc- Paper.pdf. Yang, X. and Ji, S. Jem++: Improved techniques for train- ing jem. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 6494–6503, October 2021. Zagoruyko, S. and Komodakis, N. Wide residual networks. InProcedings ...