pith. machine review for the scientific record.

arxiv: 2605.07808 · v1 · submitted 2026-05-08 · 💻 cs.LG

Recognition: 1 theorem link · Lean Theorem

The Minimax Rate of Second-Order Calibration

Banafsheh Rafiee, Kamil Ciosek, Nicolò Felicioni, Sina Ghiassian

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 02:06 UTC · model grok-4.3

classification 💻 cs.LG
keywords second-order calibration · minimax rate · polynomial regression · sech kernel · calibration error · Platt scaling · epistemic uncertainty · binary classification

The pith

Polynomial regression estimates second-order calibration error at the optimal rate of Õ(1/√n).

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper characterizes the minimax rate for estimating second-order calibration error, which measures whether a higher-order predictor's uncertainty estimate matches the conditional label variance on its level sets. The central observation is that the sech perturbation kernel renders the relevant calibration functions analytic in a strip, so that polynomial regression can recover the error at rate Õ(1/√n) with explicit constants. This rate is qualitatively faster than the O(n^{-1/4}) achievable by bucketing or kernel smoothing, and a matching lower bound establishes near-optimality. As a direct consequence the work supplies the first finite-sample guarantee for a post-hoc recalibration procedure that adjusts both the mean prediction and the epistemic-variance estimate of any given predictor.

Core claim

The second-order calibration error for binary classification can be estimated at the minimax rate Õ(1/√n) by polynomial regression once the sech perturbation kernel is applied; the kernel makes the calibration functions analytic in a strip of half-width hπ/2. This rate improves on the slower O(n^{-1/4}) rate of bucketing or kernel smoothing, is matched by an Ω(1/√n) lower bound up to logarithmic factors, and yields the first finite-sample guarantee for second-order Platt scaling as a post-hoc recalibration method. A bucket-free definition of second-order calibration is also related quantitatively to the earlier bucketed formulation.
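In symbols, the claimed result can be stated schematically as a minimax risk bound (the notation below is this review's paraphrase, not lifted from the paper):

```latex
% CE_2 = second-order calibration error of a higher-order predictor after
% sech-perturbation with bandwidth h; the infimum runs over estimators built
% from n i.i.d. samples, the supremum over admissible data distributions P.
\inf_{\widehat{\mathrm{CE}}_2} \; \sup_{P} \;
\mathbb{E}_P \left| \widehat{\mathrm{CE}}_2 - \mathrm{CE}_2 \right|
\;=\; \tilde{\Theta}\!\left( \frac{1}{\sqrt{n}} \right),
\qquad \text{vs. } O\!\left(n^{-1/4}\right) \text{ for bucketing or kernel smoothing.}
```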

What carries the argument

The sech perturbation kernel, which makes calibration functions analytic in a strip of half-width hπ/2 and thereby allows polynomial regression to achieve the fast estimation rate.
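This is the classical mechanism behind the rate: a function analytic in a Bernstein-ellipse neighbourhood of [−1, 1] has geometrically decaying polynomial approximation error, whereas a merely Lipschitz function converges only polynomially, and that gap is what ultimately separates Õ(1/√n) from O(n^{-1/4}). A minimal sketch (the two test functions and h = 1/2 are this review's choices, not the paper's):

```python
import numpy as np
from numpy.polynomial.chebyshev import Chebyshev

def sup_error(f, deg):
    """Sup-norm error on [-1, 1] of Chebyshev interpolation of f at deg+1 points."""
    p = Chebyshev.interpolate(f, deg)
    grid = np.linspace(-1.0, 1.0, 4001)
    return float(np.max(np.abs(f(grid) - p(grid))))

h = 0.5
analytic = lambda x: 1.0 / np.cosh(x / h)  # analytic in the strip |Im z| < h*pi/2
kink = np.abs                              # Lipschitz only; no strip of analyticity

for deg in (10, 20, 40):
    print(deg, sup_error(analytic, deg), sup_error(kink, deg))
```

The analytic example is sech(x/h) itself, whose nearest complex poles sit at ±ihπ/2, matching the strip half-width quoted above; its interpolation error collapses geometrically in the degree while |x| improves only like 1/degree.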

If this is right

  • Polynomial regression supplies explicit constants alongside the Õ(1/√n) upper bound.
  • Second-order Platt scaling obtains the first finite-sample guarantee for jointly recalibrating mean predictions and epistemic-variance estimates.
  • A bucket-free definition of second-order calibration is shown to be quantitatively close to the earlier bucketed version.
  • Empirical checks confirm both the predicted convergence rate and the quality of the recalibrated uncertainties.
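For context, classical first-order Platt scaling, the baseline that the second-order procedure extends, fits an affine map in logit space by maximizing the Bernoulli likelihood. Below is a minimal numpy sketch of that baseline only; the paper's second-order variant additionally adjusts the epistemic-variance estimate, and its exact parametrization is not reproduced here:

```python
import numpy as np

def platt_scale(p, y, lr=0.5, steps=2000):
    """Fit sigmoid(a * logit(p) + b) to labels y by gradient ascent on the
    Bernoulli log-likelihood (classical first-order Platt scaling)."""
    z = np.log(p) - np.log1p(-p)              # logits of the raw predictions
    a, b = 1.0, 0.0
    for _ in range(steps):
        q = 1.0 / (1.0 + np.exp(-(a * z + b)))
        g = y - q                             # gradient of log-lik w.r.t. the logit
        a += lr * np.mean(g * z)
        b += lr * np.mean(g)
    return a, b

# Toy check: raw scores are systematically under-confident (logits shrunk by half),
# so the fitted slope should land near a = 2 and the intercept near b = 0.
rng = np.random.default_rng(0)
z_true = rng.normal(0.0, 2.0, size=20_000)
eta = 1.0 / (1.0 + np.exp(-z_true))           # true P(Y = 1 | x)
p_raw = 1.0 / (1.0 + np.exp(-0.5 * z_true))   # under-confident predictor
y = rng.binomial(1, eta).astype(float)
a, b = platt_scale(p_raw, y)
```

The point of the paper's corollary is that an analogous post-hoc fit, applied jointly to the mean and variance channels, comes with a finite-sample guarantee rather than only the empirical success the first-order recipe enjoys.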

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same analyticity property could be exploited to obtain fast rates for related smoothing tasks that currently rely on kernels or binning.
  • The post-hoc recalibration procedure might be combined with existing mean-calibration methods to improve uncertainty estimates in deployed classifiers without retraining.
  • If analogous analyticity can be arranged in multi-class or regression settings, the minimax rate result could extend beyond binary classification.

Load-bearing premise

The sech perturbation kernel makes the calibration functions analytic in a strip of half-width hπ/2 without further restrictions on the function class or data distribution.
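A sketch of why a premise of this shape is plausible. Suppose the perturbation acts by convolution with a normalized sech kernel (the normalization below is this review's assumption, chosen so that the kernel's nearest poles sit at ±ihπ/2):

```latex
K_h(x) = \frac{1}{\pi h}\,\operatorname{sech}\!\left(\frac{x}{h}\right),
\qquad
(f * K_h)(z) = \int_{\mathbb{R}} f(t)\, K_h(z - t)\, \mathrm{d}t .
```

Since sech(w) is analytic except at w = i(π/2 + kπ), the translate K_h(z − t) is, for every real t, analytic and uniformly bounded on each substrip |Im z| ≤ h(π/2 − δ); Morera's theorem plus dominated convergence then makes f * K_h analytic on the open strip |Im z| < hπ/2 for any bounded f, with no further condition on the function class or the data distribution.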

What would settle it

A concrete numerical check: does the estimation error of the second-order calibration quantity decay at rate 1/√n (up to logs) when polynomial regression is applied to data generated from functions smoothed by the sech kernel? Failure to observe this decay, or observation of a strictly slower rate, would refute the upper bound.
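A minimal Monte Carlo version of that check, simplified to a first-order squared calibration error against a fixed predictor; the functional, the analytic η, the cross-fit trick, and all constants are this sketch's assumptions, not the paper's construction:

```python
import numpy as np
from numpy.polynomial import chebyshev as C

rng = np.random.default_rng(1)
eta = lambda s: 0.5 + 0.3 * np.sin(2.0 * s)       # analytic true P(Y=1 | S=s)
M = 0.5                                            # fixed miscalibrated predictor
# Ground truth CE = E[(eta(S) - M)^2] for S ~ Uniform[-1, 1], in closed form:
ce_true = 0.09 * 0.5 * (1.0 - np.sin(4.0) / 4.0)

def estimate_ce(n, deg=5):
    """Cross-fit plug-in: two independent Chebyshev regressions of Y on S;
    multiplying them cancels the O(deg/n) own-noise bias of a naive plug-in."""
    s = rng.uniform(-1.0, 1.0, n)
    y = rng.binomial(1, eta(s)).astype(float)
    half = n // 2
    cA = C.chebfit(s[:half], y[:half], deg)
    cB = C.chebfit(s[half:], y[half:], deg)
    return np.mean((C.chebval(s, cA) - M) * (C.chebval(s, cB) - M))

ns = np.array([200, 800, 3200, 12800])
err = np.array([np.mean([abs(estimate_ce(n) - ce_true) for _ in range(200)])
                for n in ns])
slope = np.polyfit(np.log(ns), np.log(err), 1)[0]  # should sit near -1/2
```

A fitted log–log slope near −1/2 in this toy model is consistent with the claimed rate; a stable slope near −1/4 would instead favour the bucketing/kernel regime.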

Figures

Figures reproduced from arXiv: 2605.07808 by Banafsheh Rafiee, Kamil Ciosek, Nicolò Felicioni, Sina Ghiassian.

Figure 1. Why second-order calibration is useful: both sets have the same mean prediction.
Figure 2. Experiment 1 (rate). |CE_2^c − CE_2^pert| vs. n at h=1/16 (left) and h=1/64 (right); mean ± Student-t 90% CI across 20 seeds, log–log slopes in legend.
Figure 3. Experiment 2 (recalibration). Reported σ² (left) and corrected (σ²)′ (right) versus the ground-truth conditional variance on a held-out split, at h=1/64. Dashed line is y=x.
Figure 4. Experiment 3 (decision utility). Realised gain per 100 borderline patients vs. referral threshold τ (mean ±1.96 SEM, 200 repeats). 2-D second-order Platt tracks the oracle; the raw (m, σ²) plug-in collapses near τ≈0.06.
Figure 5. Experiment 4: full audit-yield curves. Yield (expected fresh-worker disagreements found per 100 audited items) vs. audit budget on the Weather Sentiment AMT cohort, over 1000 evaluation repeats. 2-D second-order Platt is the only non-oracle method that exploits the joint (m, s) structure required by the decision problem; all 1-D and raw baselines collapse onto a common curve because the 1-D marginals are …
read the original abstract

We characterize the minimax rate of estimating the second-order calibration error for binary classification, which quantifies whether a higher-order predictor's epistemic-uncertainty estimate matches the conditional variance of the label probability on its level sets. Our key observation is that the sech perturbation kernel, previously used only to enforce smoothness of calibration functions, in fact makes them analytic in a strip of half-width $h\pi/2$. Polynomial regression then estimates the calibration error at rate $\tilde{O}(1/\sqrt{n})$, with explicit constants, a qualitative improvement over the $O(n^{-1/4})$ rate achievable by bucketing or kernel smoothing. A matching $\Omega(1/\sqrt{n})$ lower bound establishes minimax optimality up to logarithmic factors. As a corollary, we give the first finite-sample guarantee for second-order Platt scaling, yielding a post-hoc procedure that recalibrates both the mean prediction and the epistemic-variance estimate of any higher-order predictor. Along the way, we provide a bucket-free definition of second-order calibration and relate it quantitatively to the bucketed formulation of Ahdritz et al. [2025]. Our experiments confirm the predicted rate and the quality of the recalibrated uncertainties.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper characterizes the minimax rate of estimating second-order calibration error for binary classification, which measures whether a higher-order predictor's epistemic-uncertainty estimate matches the conditional variance of the label probability on its level sets. The central claim is that the sech perturbation kernel renders the relevant calibration functions analytic in a strip of half-width hπ/2, enabling polynomial regression to achieve an Õ(1/√n) rate with explicit constants (a qualitative improvement over the O(n^{-1/4}) rate from bucketing or kernel smoothing). A matching Ω(1/√n) lower bound establishes minimax optimality up to log factors. As a corollary, the paper provides the first finite-sample guarantee for second-order Platt scaling and introduces a bucket-free definition of second-order calibration, relating it quantitatively to the bucketed formulation of Ahdritz et al. [2025]. Experiments are said to confirm the predicted rate.

Significance. If the analyticity property and rate derivations hold for the defined function class, this establishes a meaningful theoretical advance in calibration and uncertainty quantification by delivering the first minimax-optimal rate with explicit constants for second-order calibration error. The finite-sample guarantee for post-hoc recalibration of both mean predictions and epistemic-variance estimates is a practical strength, and the bucket-free definition clarifies the relationship to prior work. The explicit constants and matching lower bound (if verified) would be notable contributions.

major comments (2)
  1. [§4] §4 (Upper Bound): The claim that the sech perturbation kernel makes the calibration functions analytic in a strip of half-width hπ/2 (allowing polynomial regression to attain the Õ(1/√n) rate) is load-bearing for the central minimax result. The manuscript must explicitly verify that this analyticity holds uniformly for the function class arising from the bucket-free second-order calibration definition (relating conditional variance on level sets), without imposing extra restrictions on the predictor or data distribution. The abstract presents this as a key observation, but the proof details on how the perturbation is applied to the calibration map and whether the strip width remains uniform are needed to support the rate improvement over bucketing/kernel methods.
  2. [Theorem 5.1] Theorem 5.1 (Lower Bound): The matching Ω(1/√n) lower bound is asserted to establish minimax optimality up to logarithmic factors. It is necessary to confirm that the lower-bound construction operates over precisely the same function class as the upper bound (i.e., the analytic functions induced by the sech perturbation in the second-order setting), so that the bounds are comparable and the optimality claim is not circular with respect to the function class.
minor comments (2)
  1. [Introduction / §3] The quantitative relation between the bucket-free definition and the bucketed formulation of Ahdritz et al. [2025] is mentioned in the abstract and introduction; including a short corollary or table that states the precise approximation error between the two would strengthen the presentation.
  2. [Experiments] Experimental details (data exclusion rules, choice of polynomial degree, and hyperparameter selection for the regression estimator) should be expanded in the experiments section to support full reproducibility of the reported rate confirmation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our manuscript. We have addressed each major point by adding explicit verifications and clarifications in the revised version, as detailed below.

read point-by-point responses
  1. Referee: [§4] §4 (Upper Bound): The claim that the sech perturbation kernel makes the calibration functions analytic in a strip of half-width hπ/2 (allowing polynomial regression to attain the Õ(1/√n) rate) is load-bearing for the central minimax result. The manuscript must explicitly verify that this analyticity holds uniformly for the function class arising from the bucket-free second-order calibration definition (relating conditional variance on level sets), without imposing extra restrictions on the predictor or data distribution. The abstract presents this as a key observation, but the proof details on how the perturbation is applied to the calibration map and whether the strip width remains uniform are needed to support the rate improvement over bucketing/kernel methods.

    Authors: We agree that explicit verification of uniform analyticity is essential for the bucket-free definition. In the revised manuscript, we have added Lemma 4.3 in Section 4, which proves that the sech perturbation applied pointwise to the second-order calibration map (defined via conditional variance on level sets) yields analyticity in a strip of half-width hπ/2 uniformly, without extra restrictions on the predictor or data distribution. The proof details, including preservation of the strip width, are now in new Appendix C. This directly supports the polynomial regression rate. revision: yes

  2. Referee: [Theorem 5.1] Theorem 5.1 (Lower Bound): The matching Ω(1/√n) lower bound is asserted to establish minimax optimality up to logarithmic factors. It is necessary to confirm that the lower-bound construction operates over precisely the same function class as the upper bound (i.e., the analytic functions induced by the sech perturbation in the second-order setting), so that the bounds are comparable and the optimality claim is not circular with respect to the function class.

    Authors: The lower-bound construction in Theorem 5.1 uses hard instances that are explicitly within the same analytic function class induced by the sech perturbation for second-order calibration maps. We have revised the theorem statement and discussion to clarify that these instances (perturbations of constant functions) lie in the strip of width hπ/2, making the upper and lower bounds directly comparable and the minimax optimality claim non-circular. revision: yes

Circularity Check

0 steps flagged

Derivation self-contained via standard approximation theory for analytic functions

full rationale

The paper's central minimax claim rests on a new observation that the sech kernel renders calibration functions analytic in a strip (allowing polynomial regression to achieve Õ(1/√n) rates via classical results on analytic approximation), together with a matching lower bound and a quantitative relation to prior bucketed definitions. No step reduces a claimed prediction or rate to a fitted parameter, self-referential definition, or unverified self-citation chain; the analyticity property is presented as an independent mathematical fact derived from the kernel, and the bucket-free definition is explicitly related to external work without circular dependence. The derivation chain is therefore externally grounded in approximation theory and does not collapse to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the analyticity property induced by the sech kernel together with standard results from complex analysis and polynomial approximation theory; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption The sech perturbation kernel makes calibration functions analytic in a strip of half-width hπ/2
    This is presented as the key observation that enables polynomial regression to achieve the faster rate.

pith-pipeline@v0.9.0 · 5516 in / 1388 out tokens · 43108 ms · 2026-05-11T02:06:19.276401+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages

  1. [1] Provable Uncertainty Decomposition via Higher-Order Calibration. The Thirteenth International Conference on Learning Representations.
  2. [2] Theory of classification: A survey of some recent advances. ESAIM: Probability and Statistics, 2005.
  3. [3] Measuring Uncertainty Calibration. The Fourteenth International Conference on Learning Representations.
  4. [4] Decomposition of uncertainty in Bayesian deep learning for efficient and risk-sensitive learning. International Conference on Machine Learning, 2018.
  5. [5] On calibration of modern neural networks. International Conference on Machine Learning, 2017.
  6. [6] Quantifying Aleatoric and Epistemic Uncertainty with Proper Scoring Rules. 2024.
  7. [7] Bayesian active learning for classification and preference learning. arXiv preprint arXiv:1112.5745.
  8. [8] Aleatoric and epistemic uncertainty in machine learning: an introduction to concepts and methods. Machine Learning.
  9. [9] What uncertainties do we need in Bayesian deep learning for computer vision? Advances in Neural Information Processing Systems.
  10. [10] Simple and scalable predictive uncertainty estimation using deep ensembles. Advances in Neural Information Processing Systems.
  11. [11] Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers, 1999.
  12. [12] Second-order uncertainty quantification: A distance-based approach. arXiv preprint arXiv:2312.00995.
  13. [13] Quantifying aleatoric and epistemic uncertainty in machine learning: Are conditional entropy and mutual information appropriate measures? Uncertainty in Artificial Intelligence, 2023.
  14. [14] Fawzi, H. 2023.
  15. [15] Bartlett, Peter.
  16. [16] Rebeschini, Patrick. 2021.
  17. [17] Aleatoric and epistemic uncertainty in machine learning: An introduction to concepts and methods. Machine Learning, 2021.
  18. [18] Obtaining well calibrated probabilities using Bayesian binning. Proceedings of the AAAI Conference on Artificial Intelligence.
  19. [19] Evaluating model calibration in classification. The 22nd International Conference on Artificial Intelligence and Statistics, 2019.
  20. [20] Verified uncertainty calibration. Advances in Neural Information Processing Systems.
  21. [21] Mitigating bias in calibration error estimation. International Conference on Artificial Intelligence and Statistics, 2022.
  22. [22] Calibration tests in multi-class classification: A unifying framework. Advances in Neural Information Processing Systems.
  23. [23] Transforming classifier scores into accurate multiclass probability estimates. Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
  24. [24] Beta calibration: a well-founded and easily implemented improvement on logistic calibration for binary classifiers. Artificial Intelligence and Statistics, 2017.
  25. [25] Predictive uncertainty estimation via prior networks. Advances in Neural Information Processing Systems.
  26. [26] Evidential deep learning to quantify classification uncertainty. Advances in Neural Information Processing Systems.
  27. [27] Algorithmic learning in a random world. 2005.
  28. [28] Conformal prediction: A gentle introduction. Foundations and Trends in Machine Learning, 2023.
  29. [29] Aleatoric and Epistemic Uncertainty in Conformal Prediction. 2025.
  30. [30] Polyanskiy, Yury and Wu, Yihong. Information Theory: From Coding to Learning.
  31. [31] A unifying theory of distance from calibration. Proceedings of the 55th Annual ACM Symposium on Theory of Computing.
  32. [32] Smooth ECE: Principled reliability diagrams via kernel smoothing. arXiv preprint arXiv:2309.12236.
  33. [33] Distribution-free binary classification: prediction sets, confidence intervals and calibration. Advances in Neural Information Processing Systems.
  34. [34] T-cal: An optimal test for the calibration of predictive models. Journal of Machine Learning Research.
  35. [35] Multicalibration: Calibration for the (computationally-identifiable) masses. International Conference on Machine Learning, 2018.
  36. [36] Epistemic neural networks. Advances in Neural Information Processing Systems.