Evaluating Black-Box Classifiers via Stable Adaptive Two-Sample Inference
Pith reviewed 2026-05-10 19:23 UTC · model grok-4.3
The pith
An auxiliary stable distinguisher reduces black-box multi-class classifier evaluation to an asymptotically valid two-sample test.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Given holdout data and a black-box estimate $\hat\eta$, sample synthetic labels $Y' \sim \text{Multinom}(\hat\eta(X))$ on the same features $X$, and train an auxiliary binary distinguisher to classify real pairs $(X,Y)$ versus synthetic pairs $(X,Y')$. The rank-sum statistic computed from the distinguisher's scores on a held-out portion is asymptotically standard normal under the null that the distance $\rho$ between the true and predicted conditional distributions is at most $\delta$, provided the distinguisher satisfies stability conditions strong enough to validate the cross-validation central limit theorem.
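The reduction can be sketched end-to-end on toy data. Everything below is illustrative, not the paper's exact protocol: the true $\eta$, the choice of logistic regression as the distinguisher, the single train/score split, and drawing a fresh $X'$ from $P_X$ for the synthetic sample are all assumptions of the sketch.

```python
import numpy as np
from scipy.stats import mannwhitneyu
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy setup: M = 3 classes, one feature; eta(x) via a softmax link.
def eta(x):
    logits = np.stack([0.0 * x, 0.5 * x, -0.5 * x], axis=1)
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

n = 2000
X = rng.normal(size=n)
Y = np.array([rng.choice(3, p=p) for p in eta(X)])        # real labels
hat_eta = eta                                             # stand-in "black box" (null holds)
Xp = rng.normal(size=n)                                   # fresh X' ~ P_X
Yp = np.array([rng.choice(3, p=p) for p in hat_eta(Xp)])  # synthetic labels

# Two-sample reduction: label real pairs 1, synthetic pairs 0.
Z = np.vstack([np.column_stack([X, Y]), np.column_stack([Xp, Yp])])
W = np.concatenate([np.ones(n), np.zeros(n)])

# Split: train the distinguisher on one half, score the other half.
idx = rng.permutation(2 * n)
tr, te = idx[:n], idx[n:]
clf = LogisticRegression(max_iter=1000).fit(Z[tr], W[tr])
s = clf.predict_proba(Z[te])[:, 1]                        # distinguisher scores

# Rank-sum (Mann-Whitney) statistic on the held-out scores;
# under the null the p-value is approximately Uniform(0, 1).
u, p_value = mannwhitneyu(s[W[te] == 1], s[W[te] == 0], alternative="two-sided")
print(f"p-value = {p_value:.3f}")
```

The paper's actual procedure replaces the single split with cross-validation and justifies the limit via CV-CLT arguments; the sketch only shows the shape of the reduction.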
What carries the argument
A stable auxiliary distinguisher: a binary classifier trained to separate real labeled pairs from pairs whose labels are drawn from the black-box predictor. Its rank-sum statistic is analyzed through cross-validation central limit theorems.
If this is right
- The resulting p-values are asymptotically valid for the goodness-of-fit hypothesis under the stated stability conditions on the distinguisher.
- No assumptions are imposed on the training algorithm or functional form of the original black-box classifier.
- The test statistic directly reflects the distinguishability between the true conditional distribution and the one implied by the classifier.
- Ideas from algorithmic fairness, Neyman-Pearson testing, and conformal p-values are combined to produce the procedure.
Where Pith is reading between the lines
- The same two-sample reduction could be applied to regression or other supervised tasks by defining an appropriate synthetic response generator and distinguisher.
- In practice, simple stable models such as logistic regression or shallow trees may be preferred for the distinguisher to meet the stability requirement.
- Sequential or streaming versions might be feasible by updating the distinguisher incrementally as new holdout batches arrive.
Load-bearing premise
The auxiliary distinguisher must satisfy stability conditions that let the cross-validation central limit theorem produce valid asymptotic p-values.
What would settle it
Simulate data from a known true $\eta$, set $\hat\eta = \eta$, apply the procedure with a stable distinguisher, and verify that the rejection rate at nominal level $\alpha$ converges to $\alpha$ as the sample size increases.
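A minimal Monte Carlo version of that check might look as follows. This is a sketch under simplifying assumptions (binary labels, a logistic-regression distinguisher, a fresh $X'$ for the synthetic sample, a single train/score split per replication), not the paper's procedure:

```python
import numpy as np
from scipy.stats import mannwhitneyu
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

def one_test(n=400):
    """One run of the two-sample reduction with hat_eta = eta (the null holds)."""
    X = rng.normal(size=n)
    Y = (rng.random(n) < 1 / (1 + np.exp(-X))).astype(float)    # real labels
    Xp = rng.normal(size=n)                                      # fresh X' ~ P_X
    Yp = (rng.random(n) < 1 / (1 + np.exp(-Xp))).astype(float)   # synthetic labels
    Z = np.vstack([np.column_stack([X, Y]), np.column_stack([Xp, Yp])])
    W = np.concatenate([np.ones(n), np.zeros(n)])
    idx = rng.permutation(2 * n)
    tr, te = idx[:n], idx[n:]
    clf = LogisticRegression(max_iter=1000).fit(Z[tr], W[tr])
    s = clf.predict_proba(Z[te])[:, 1]
    _, p = mannwhitneyu(s[W[te] == 1], s[W[te] == 0], alternative="two-sided")
    return p

alpha, reps = 0.05, 200
pvals = np.array([one_test() for _ in range(reps)])
rate = float((pvals <= alpha).mean())
print(f"rejection rate at nominal alpha={alpha}: {rate:.3f}")  # should hover near alpha
```

With 200 replications the Monte Carlo standard error at $\alpha = 0.05$ is about $0.015$, so a well-calibrated test should land within a few points of the nominal level.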
Original abstract
We consider the problem of evaluating black-box multi-class classifiers. In the standard setup, we observe class labels $Y\in \{0,1,\ldots,M-1\}$ generated according to the conditional distribution $Y|X \sim \text{Multinom}\big(\eta(X)\big)$, where $X$ denotes the features and $\eta$ maps from the feature space to the $(M-1)$-dimensional simplex. A black-box classifier is an estimate $\hat{\eta}$ for which we make no assumptions about the training algorithm. Given holdout data, our goal is to evaluate the performance of the classifier $\hat{\eta}$. Recent work suggests treating this as a goodness-of-fit problem by testing the hypothesis $H_0: \rho((X,Y),(X',Y')) \le \delta$, where $\rho$ is some metric between two distributions, and $(X',Y')\sim P_X\times \text{Multinom}(\hat\eta(X))$. Combining ideas from algorithmic fairness, the Neyman-Pearson lemma, and conformal p-values, we propose a new methodology for this testing problem. The key idea is to generate a second sample $(X',Y') \sim P_X \times \text{Multinom}\big(\hat\eta(X)\big)$ allowing us to reduce the task to two-sample conditional distribution testing. Using part of the data, we train an auxiliary binary classifier called a distinguisher to attempt to distinguish between the two samples. The distinguisher's ability to differentiate samples, measured using a rank-sum statistic, is then used to assess the difference between $\hat{\eta}$ and $\eta$. Using techniques from cross-validation central limit theorems, we derive an asymptotically rigorous test under suitable stability conditions of the distinguisher.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a method for evaluating black-box multi-class classifiers by reducing the problem to a two-sample test: synthetic labels are generated from the classifier's predicted conditional distribution, an auxiliary binary distinguisher is trained on holdout data to separate real from synthetic samples, and a rank-sum statistic on the distinguisher outputs is used to test the null that the distributions are close. Asymptotic validity of the resulting p-value is claimed via cross-validation central limit theorems, conditional on suitable stability conditions holding for the (data-dependent) distinguisher.
Significance. If the stability conditions can be rigorously verified for the adaptive distinguisher and the CV-CLT applies, the approach would provide a statistically grounded, assumption-light procedure for black-box classifier evaluation that combines ideas from conformal inference and two-sample testing; this could be a useful contribution to statistical methodology for model assessment.
major comments (2)
- [§3 and §4] §3 (Method) and §4 (Theory): the stability conditions required for the CV-CLT to justify the asymptotic distribution of the rank-sum statistic are stated but no explicit bound or verification is given showing that these conditions hold when the distinguisher itself is trained on the same hold-out data used to compute the test statistic. Because the distinguisher is data-dependent, standard uniform-stability or Lipschitz arguments do not apply automatically; this is load-bearing for the central claim of asymptotic rigor.
- [§4] §4 (Theoretical Results): the manuscript derives the test under the stated stability conditions but provides no discussion of finite-sample behavior, no explicit error-bar construction, and no simulation evidence that the asymptotic p-values are reliable when the stability conditions are only approximately satisfied. This weakens the practical utility of the proposed procedure.
minor comments (1)
- [§3] Notation for the distinguisher and the rank-sum statistic is introduced without a clear summary table or diagram, making it harder to follow the reduction from the original goodness-of-fit problem to the two-sample test.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback on our manuscript. The comments identify important aspects of the stability assumptions and practical validation that merit clarification and expansion. We respond to each major comment below and describe the revisions we will incorporate.
Point-by-point responses
Referee: [§3 and §4] §3 (Method) and §4 (Theory): the stability conditions required for the CV-CLT to justify the asymptotic distribution of the rank-sum statistic are stated but no explicit bound or verification is given showing that these conditions hold when the distinguisher itself is trained on the same hold-out data used to compute the test statistic. Because the distinguisher is data-dependent, standard uniform-stability or Lipschitz arguments do not apply automatically; this is load-bearing for the central claim of asymptotic rigor.
Authors: We agree that the stability conditions are load-bearing for the asymptotic claim and that the data-dependent nature of the distinguisher precludes automatic application of generic uniform-stability bounds. The manuscript derives the limiting distribution of the rank-sum statistic conditional on these conditions holding for the trained distinguisher. In the revision we will add a dedicated paragraph in §4 that supplies sufficient conditions (e.g., bounded Rademacher complexity of the distinguisher class together with a Lipschitz loss and a fixed regularization parameter) under which the required stability can be verified for common training procedures such as regularized logistic regression or early-stopped neural networks. We will also state explicitly that, for completely arbitrary black-box distinguishers without such restrictions, the conditions remain an assumption rather than a derived guarantee.
Revision: partial
Referee: [§4] §4 (Theoretical Results): the manuscript derives the test under the stated stability conditions but provides no discussion of finite-sample behavior, no explicit error-bar construction, and no simulation evidence that the asymptotic p-values are reliable when the stability conditions are only approximately satisfied. This weakens the practical utility of the proposed procedure.
Authors: The present version emphasizes the asymptotic theory. To improve practical guidance we will revise §4 to include (i) a brief derivation of a consistent estimator for the asymptotic variance that can be used to form approximate finite-sample error bars around the test statistic, and (ii) a new simulation subsection that examines coverage and type-I error of the asymptotic p-values for moderate sample sizes when the stability conditions hold only approximately (e.g., under mild misspecification of the distinguisher). These additions will be placed after the main theoretical results and will not alter the asymptotic claims.
Revision: yes
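As a point of reference for (i), the classical normal approximation for the Wilcoxon rank-sum statistic already yields simple error bars. The sketch below is the textbook formula (no tie correction), offered as context; it is not the CV-CLT variance estimator the authors promise in the revision:

```python
import numpy as np

def ranksum_z(scores_real, scores_synth):
    """Normal approximation for the rank-sum statistic.

    Under the null (scores from the two samples share one distribution),
    the standardized statistic is approximately N(0, 1); +/- 1.96 gives a
    95% error bar. Assumes no heavy tie structure among the scores.
    """
    n1, n2 = len(scores_real), len(scores_synth)
    pooled = np.concatenate([scores_real, scores_synth])
    ranks = pooled.argsort().argsort() + 1.0   # ranks 1..n1+n2 (no tie handling)
    w = ranks[:n1].sum()                       # rank sum of the real-sample scores
    mean = n1 * (n1 + n2 + 1) / 2.0
    var = n1 * n2 * (n1 + n2 + 1) / 12.0
    return (w - mean) / np.sqrt(var)

rng = np.random.default_rng(2)
z = ranksum_z(rng.random(300), rng.random(300))  # same distribution: z is near N(0,1)
print(f"z = {z:.2f}")
```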
- Not promised in revision: a universal, assumption-free explicit bound on the stability of an arbitrary data-dependent distinguisher trained on the same hold-out sample.
Circularity Check
No significant circularity: the derivation applies an external CV-CLT to a constructed statistic under stated assumptions.
Full rationale
The paper reduces the evaluation task to a two-sample test by generating synthetic samples from the black-box classifier and training an auxiliary distinguisher on a data split, then applies a rank-sum statistic whose asymptotic distribution is justified by citing cross-validation central limit theorem techniques under explicit stability conditions on the distinguisher. No equation or step shows the final test statistic or p-value reducing by construction to a fitted parameter, self-defined quantity, or input data summary; the CLT is treated as an external result whose applicability is conditioned on stability rather than proven internally via the target hypothesis. The distinguisher is data-dependent by design, but this is an assumption for the asymptotic justification, not a self-referential loop. No self-citations are load-bearing for the core derivation, and the approach remains independent of the specific fitted values of the classifier under test.
Axiom & Free-Parameter Ledger
axioms (2)
- Standard math: cross-validation central limit theorems apply to the rank-sum statistic computed from the distinguisher.
- Domain assumption: the auxiliary distinguisher satisfies suitable stability conditions.
invented entities (1)
- Auxiliary binary distinguisher (no independent evidence)
Reference graph
Works this paper leans on
- Yuchen Chen and Jing Lei. De-biased two-sample U-statistics with application to conditional distribution testing. Machine Learning, 114(2):33, 2025.
- Cynthia Dwork, Michael P. Kim, Omer Reingold, Guy N. Rothblum, and Gal Yona. Outcome indistinguishability. Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, 2021.
- Patrik Róbert Gerber, Yanjun Han, and Yury Polyanskiy. Minimax optimal testing by classification. June 2023. arXiv:2306.11085.
- Xiaoyu Hu and Jing Lei. A two-sample conditional distribution test using conformal prediction and weighted rank sum. Journal of the American Statistical Association.
- Ilmun Kim and Aaditya Ramdas. Dimension-agnostic inference using cross U-statistics. Bernoulli, 30(1), February 2024. doi: 10.3150/23-BEJ1613.
- Ilmun Kim, Aaditya Ramdas, Aarti Singh, and Larry Wasserman. Classification accuracy as a proxy for two-sample testing. The Annals of Statistics.