pith. sign in

arxiv: 2605.24983 · v1 · pith:I6L6Q4GFnew · submitted 2026-05-24 · 💻 cs.LG

Benchmarking non-conformity score functions in conformal prediction

Pith reviewed 2026-06-30 11:37 UTC · model grok-4.3

classification 💻 cs.LG
keywords conformal predictionnon-conformity scoreprediction setsclass-conditionalimbalanced classesbenchmarkingmachine learning classification
0
0 comments X

The pith

A new method for measuring prediction set sizes enables direct comparison of non-conformity score functions in conformal prediction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces an original method to evaluate the sizes of prediction sets generated by conformal predictors. This method is then applied to benchmark various non-conformity score functions drawn from the literature along with some original modifications. The same approach is used to test how different scores perform in class-conditional conformal prediction when classes are imbalanced. A sympathetic reader would care because the size of the prediction sets determines how informative the conformal output is in practice. If the evaluation method works, it supplies a concrete basis for choosing one score function over another.

Core claim

We introduce an original method of evaluating the prediction set sizes of conformal predictors and use it to provide a comparison between non-conformity score functions. We also examine efficacy of different non-conformity score functions for class-conditional conformal prediction in a setting with imbalanced classes.

What carries the argument

An original method of evaluating the prediction set sizes of conformal predictors that is used to benchmark non-conformity score functions.

If this is right

  • Non-conformity score functions produce measurably different prediction set sizes when assessed with the new evaluation procedure.
  • Original modifications to existing score functions can be ranked against published ones using the same size metric.
  • In imbalanced class settings the relative performance of score functions changes under class-conditional conformal prediction.
  • The method supplies a uniform yardstick that was previously absent for comparing score functions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • A practitioner facing a new dataset could apply the method to pick the score that keeps sets smallest while maintaining coverage.
  • The evaluation procedure might be extended to regression tasks or to settings with multiple models to check consistency of rankings.
  • If the method reveals that certain scores systematically enlarge sets on imbalanced data, new score designs could target that failure mode.

Load-bearing premise

The introduced original method of evaluating prediction set sizes yields a fair and informative comparison of non-conformity score functions.

What would settle it

Running the new evaluation method on a collection of standard non-conformity scores and finding that the resulting size rankings contradict those obtained from direct measurement of average set size on the same data would undermine the method.

read the original abstract

Conformal prediction is a useful and versatile alternative to model calibration in machine learning classification. It replaces single-class prediction with prediction sets, guaranteeing that the \textit{a priori} probability of the prediction sets containing the true class is larger than or equal to a pre-specified rate. The size and usefulness of the prediction sets relies heavily on the choice of the non-conformity score function. The scientific literature contains many examples of non-conformity score functions but there is an absence of studies examining their properties and effectiveness. In this paper, we give an overview of properties of non-conformity score functions. We give examples of non-conformity score functions in the existing literature and introduce original modifications. We introduce an original method of evaluating the prediction set sizes of conformal predictors and use it to provide a comparison between non-conformity score functions. We also examine efficacy of different non-conformity score functions for class-conditional conformal prediction in a setting with imbalanced classes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript surveys properties of non-conformity score functions for conformal prediction, presents examples from the literature along with original modifications, introduces a novel method for evaluating prediction set sizes, applies this method to benchmark multiple non-conformity scores, and studies their performance under class-conditional conformal prediction with imbalanced classes.

Significance. A well-validated benchmarking study of non-conformity scores, particularly one that addresses class imbalance, would help practitioners select scores that produce smaller, more informative prediction sets while preserving coverage guarantees.

major comments (2)
  1. [Abstract] Abstract: The central contribution is an 'original method of evaluating the prediction set sizes' used to benchmark non-conformity scores, yet the manuscript supplies no formulation, theoretical justification, or comparison to standard efficiency metrics (e.g., average set size at fixed marginal coverage). Without evidence that the method avoids artifacts or produces unbiased rankings, the reported comparisons cannot be attributed to the scores themselves.
  2. [Abstract] Abstract: No empirical results, tables, error bars, or ablation studies are described, so it is impossible to evaluate whether observed differences between scores (including in the imbalanced class-conditional setting) are statistically meaningful or reproducible.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their review and recommendation. We agree that the abstract is too high-level and will revise the manuscript to include the requested formulation, justification, comparisons, and empirical details.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The central contribution is an 'original method of evaluating the prediction set sizes' used to benchmark non-conformity scores, yet the manuscript supplies no formulation, theoretical justification, or comparison to standard efficiency metrics (e.g., average set size at fixed marginal coverage). Without evidence that the method avoids artifacts or produces unbiased rankings, the reported comparisons cannot be attributed to the scores themselves.

    Authors: We agree the current abstract provides no formulation or justification. In revision we will add a concise formulation of the original evaluation method to the abstract, include its theoretical motivation, and add explicit comparisons to standard metrics such as average set size at fixed marginal coverage. We will also insert supporting analysis (e.g., controlled simulations) demonstrating that the method produces consistent rankings without obvious artifacts. revision: yes

  2. Referee: [Abstract] Abstract: No empirical results, tables, error bars, or ablation studies are described, so it is impossible to evaluate whether observed differences between scores (including in the imbalanced class-conditional setting) are statistically meaningful or reproducible.

    Authors: We agree the abstract does not describe the empirical results. The revised version will update the abstract to summarize the key findings, including performance differences across scores in both marginal and class-conditional imbalanced settings. The main text will be augmented with tables, error bars, and ablation studies that allow assessment of statistical significance and reproducibility. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmarking with new evaluation method has no self-referential derivations

full rationale

The paper is an empirical benchmarking study that overviews non-conformity scores, introduces modifications, and proposes an original evaluation method for prediction set sizes to compare them (including class-conditional on imbalanced data). No equations, fitted parameters, or derivation chains appear in the provided text. The central contribution is the new evaluation procedure itself, presented as original rather than derived from prior results by the same authors. No self-citations are load-bearing, no ansatzes are smuggled, and no predictions reduce to inputs by construction. The work is self-contained against external benchmarks as a comparative study.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract; the described work is an empirical benchmarking study that appears to rest on standard assumptions of conformal prediction and supervised classification without introducing new free parameters, axioms, or invented entities.

pith-pipeline@v0.9.1-grok · 5682 in / 1057 out tokens · 39168 ms · 2026-06-30T11:37:43.624508+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

18 extracted references · 4 canonical work pages · 3 internal anchors

  1. [1]

    Angelopoulos and Stephen Bates

    Anastasios N. Angelopoulos and Stephen Bates. A Gentle Introduction to Confor- mal Prediction and Distribution-Free Uncertainty Quantification, 2021

  2. [2]

    A tutorial on conformal prediction

    Glenn Shafer and Vladimir Vovk. A tutorial on conformal prediction. 2007. doi: 10.48550/ARXIV.0706.3188

  3. [3]

    Conditional validity of inductive conformal predictors

    Vladimir Vovk. Conditional validity of inductive conformal predictors, 2012. URL https://arxiv.org/abs/1209.2673. Version Number: 2

  4. [4]

    Least Ambiguous Set-Valued Classifiers with Bounded Error Levels

    Mauricio Sadinle, Jing Lei, and Larry Wasserman. Least Ambiguous Set-Valued Classifiers with Bounded Error Levels. 2016. doi: 10.48550/ARXIV.1609.00451. 16

  5. [5]

    Bias re- duction through conditional conformal prediction.Intelligent Data Analysis, 19 (6):1355–1375, November 2015

    Tuve L ¨ofstr¨om, Henrik Bostr ¨om, Henrik Linusson, and Ulf Johansson. Bias re- duction through conditional conformal prediction.Intelligent Data Analysis, 19 (6):1355–1375, November 2015. ISSN 1088467X, 15714128. doi: 10.3233/ IDA-150786

  6. [6]

    Cand `es

    Yaniv Romano, Matteo Sesia, and Emmanuel J. Cand `es. Classification with Valid and Adaptive Coverage. 2020. doi: 10.48550/ARXIV.2006.02544

  7. [7]

    Anastasios Angelopoulos, Stephen Bates, Jitendra Malik, and Michael I. Jordan. Uncertainty Sets for Image Classifiers using Conformal Prediction, 2020

  8. [8]

    Conformal Prediction for Deep Classifier via Label Ranking, 2023

    Jianguo Huang, Huajun Xi, Linjun Zhang, Huaxiu Yao, Yue Qiu, and Hongxin Wei. Conformal Prediction for Deep Classifier via Label Ranking, 2023

  9. [9]

    Predictive Inference with Feature Conformal Prediction, 2022

    Jiaye Teng, Chuan Wen, Dinghuai Zhang, Yoshua Bengio, Yang Gao, and Yang Yuan. Predictive Inference with Feature Conformal Prediction, 2022

  10. [10]

    Predictive Inference With Fast Feature Conformal Prediction, 2024

    Zihao Tang, Boyuan Wang, Chuan Wen, and Jiaye Teng. Predictive Inference With Fast Feature Conformal Prediction, 2024

  11. [11]

    Deep Residual Learning for Image Recognition, 2015

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition, 2015

  12. [12]

    Mingxing Tan and Quoc V. Le. EfficientNet: Rethinking Model Scaling for Convo- lutional Neural Networks, September 2020. 17 Appendices A Complete version of table 1 Dataset Model accuracy (top-1 / top-5) Score function Conformal domain Distance metricI 1 ImageNet 76.15% / 92.87% Label Distance Probability space Euclidean distance 137.7 Label Distance Prob...

  13. [13]

    Denote by ˆπi the ordered list of probabilities for the i-th example

    Given a predictive classification model and a calibration data setCnot seen during training, denote byX i the i-th data point andy i its class (i.e., its label). Denote by ˆπi the ordered list of probabilities for the i-th example

  14. [14]

    Sample a uniform random variableU i ∼Uniform(0,1) for each example

  15. [15]

    Define the (non-conformity) score function s(yi, Ui, ˆπi) = min{τ:y i ∈ Si(Ui, ˆπi, τ)},(24) Si(Ui, ˆπi, τ) =  {y(j) i :j <min{j ′ | Pj′ k=1 ˆπ(k) i ≥τ}}ifU i ≤V i( ˆπi, τ) {y(j) i :j≤min{j ′ | Pj′ k=1 ˆπ(k) i ≥τ}}ifU i > Vi( ˆπi, τ) ,(25) 19 Vi( ˆπi, τ) = 1 ˆπ(c) i ( X ˆπ(j) ≥ ˆπ(c) ˆπ(j) −τ).(26) Here,y i is the label of the i-th example,y (j) i ...

  16. [16]

    Calibrate the conformal predictor by calculatings i for all the examples inC

  17. [17]

    Choose an allowed error rate (sometimes called significance)α

  18. [18]

    Construct the prediction set for a new test example asS i+1(Ui+1, ˆπi+1, τc) withτ c the ⌈(|C|+1)(1−α)⌉ |C| -th quantile ofs i onC. A na¨ıve implementation of the above algorithm in pseudo-code could look like def calibration vars label , sorted_indices , sorted_scores c = cumsum ( sorted_scores ) while calibrating : L = index( sorted_indices == label) U ...