Evaluating Black-Box Classifiers via Stable Adaptive Two-Sample Inference
Pith reviewed 2026-05-10 19:23 UTC · model grok-4.3
The pith
An auxiliary stable distinguisher reduces black-box multi-class classifier evaluation to an asymptotically valid two-sample test.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Given holdout data and a black-box estimate $\hat\eta$, sample synthetic labels $Y' \sim \text{Multinom}(\hat\eta(X))$ on the same features $X$, and train an auxiliary binary distinguisher to classify real pairs $(X,Y)$ versus synthetic pairs $(X,Y')$. The rank-sum statistic computed from the distinguisher's scores on a held-out portion is asymptotically standard normal under the null that the distance $\rho$ between the true and predicted conditional distributions is at most $\delta$, provided the distinguisher satisfies stability conditions strong enough to validate the cross-validation central limit theorem.
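The reduction can be sketched end-to-end on toy data. Everything below is illustrative, not the paper's exact protocol: the true $\eta$, the choice of logistic regression as the distinguisher, the single train/score split, and drawing a fresh $X'$ from $P_X$ for the synthetic sample are all assumptions of the sketch.

```python
import numpy as np
from scipy.stats import mannwhitneyu
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy setup: M = 3 classes, one feature; eta(x) via a softmax link.
def eta(x):
    logits = np.stack([0.0 * x, 0.5 * x, -0.5 * x], axis=1)
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

n = 2000
X = rng.normal(size=n)
Y = np.array([rng.choice(3, p=p) for p in eta(X)])        # real labels
hat_eta = eta                                             # stand-in "black box" (null holds)
Xp = rng.normal(size=n)                                   # fresh X' ~ P_X
Yp = np.array([rng.choice(3, p=p) for p in hat_eta(Xp)])  # synthetic labels

# Two-sample reduction: label real pairs 1, synthetic pairs 0.
Z = np.vstack([np.column_stack([X, Y]), np.column_stack([Xp, Yp])])
W = np.concatenate([np.ones(n), np.zeros(n)])

# Split: train the distinguisher on one half, score the other half.
idx = rng.permutation(2 * n)
tr, te = idx[:n], idx[n:]
clf = LogisticRegression(max_iter=1000).fit(Z[tr], W[tr])
s = clf.predict_proba(Z[te])[:, 1]                        # distinguisher scores

# Rank-sum (Mann-Whitney) statistic on the held-out scores;
# under the null the p-value is approximately Uniform(0, 1).
u, p_value = mannwhitneyu(s[W[te] == 1], s[W[te] == 0], alternative="two-sided")
print(f"p-value = {p_value:.3f}")
```

The paper's actual procedure replaces the single split with cross-validation and justifies the limit via CV-CLT arguments; the sketch only shows the shape of the reduction.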
What carries the argument
A stable auxiliary distinguisher: a binary classifier trained to separate real labeled pairs from pairs whose labels are drawn from the black-box predictor. Its rank-sum statistic is analyzed through cross-validation central limit theorems.
If this is right
- The resulting p-values are asymptotically valid for the goodness-of-fit hypothesis under the stated stability conditions on the distinguisher.
- No assumptions are imposed on the training algorithm or functional form of the original black-box classifier.
- The test statistic directly reflects the distinguishability between the true conditional distribution and the one implied by the classifier.
- Ideas from algorithmic fairness, Neyman-Pearson testing, and conformal p-values are combined to produce the procedure.
Where Pith is reading between the lines
- The same two-sample reduction could be applied to regression or other supervised tasks by defining an appropriate synthetic response generator and distinguisher.
- In practice, simple stable models such as logistic regression or shallow trees may be preferred for the distinguisher to meet the stability requirement.
- Sequential or streaming versions might be feasible by updating the distinguisher incrementally as new holdout batches arrive.
Load-bearing premise
The auxiliary distinguisher must satisfy stability conditions that let the cross-validation central limit theorem produce valid asymptotic p-values.
What would settle it
Simulate data from a known true $\eta$, set $\hat\eta = \eta$, apply the procedure with a stable distinguisher, and verify that the rejection rate at nominal level $\alpha$ converges to $\alpha$ as the sample size increases.
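A minimal Monte Carlo version of that check might look as follows. This is a sketch under simplifying assumptions (binary labels, a logistic-regression distinguisher, a fresh $X'$ for the synthetic sample, a single train/score split per replication), not the paper's procedure:

```python
import numpy as np
from scipy.stats import mannwhitneyu
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

def one_test(n=400):
    """One run of the two-sample reduction with hat_eta = eta (the null holds)."""
    X = rng.normal(size=n)
    Y = (rng.random(n) < 1 / (1 + np.exp(-X))).astype(float)    # real labels
    Xp = rng.normal(size=n)                                      # fresh X' ~ P_X
    Yp = (rng.random(n) < 1 / (1 + np.exp(-Xp))).astype(float)   # synthetic labels
    Z = np.vstack([np.column_stack([X, Y]), np.column_stack([Xp, Yp])])
    W = np.concatenate([np.ones(n), np.zeros(n)])
    idx = rng.permutation(2 * n)
    tr, te = idx[:n], idx[n:]
    clf = LogisticRegression(max_iter=1000).fit(Z[tr], W[tr])
    s = clf.predict_proba(Z[te])[:, 1]
    _, p = mannwhitneyu(s[W[te] == 1], s[W[te] == 0], alternative="two-sided")
    return p

alpha, reps = 0.05, 200
pvals = np.array([one_test() for _ in range(reps)])
rate = float((pvals <= alpha).mean())
print(f"rejection rate at nominal alpha={alpha}: {rate:.3f}")  # should hover near alpha
```

With 200 replications the Monte Carlo standard error at $\alpha = 0.05$ is about $0.015$, so a well-calibrated test should land within a few points of the nominal level.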
Original abstract
We consider the problem of evaluating black-box multi-class classifiers. In the standard setup, we observe class labels $Y\in \{0,1,\ldots,M-1\}$ generated according to the conditional distribution $Y|X \sim \text{Multinom}\big(\eta(X)\big)$, where $X$ denotes the features and $\eta$ maps from the feature space to the $(M-1)$-dimensional simplex. A black-box classifier is an estimate $\hat{\eta}$ for which we make no assumptions about the training algorithm. Given holdout data, our goal is to evaluate the performance of the classifier $\hat{\eta}$. Recent work suggests treating this as a goodness-of-fit problem by testing the hypothesis $H_0: \rho((X,Y),(X',Y')) \le \delta$, where $\rho$ is some metric between two distributions, and $(X',Y')\sim P_X\times \text{Multinom}(\hat\eta(X))$. Combining ideas from algorithmic fairness, the Neyman-Pearson lemma, and conformal p-values, we propose a new methodology for this testing problem. The key idea is to generate a second sample $(X',Y') \sim P_X \times \text{Multinom}\big(\hat\eta(X)\big)$ allowing us to reduce the task to two-sample conditional distribution testing. Using part of the data, we train an auxiliary binary classifier called a distinguisher to attempt to distinguish between the two samples. The distinguisher's ability to differentiate samples, measured using a rank-sum statistic, is then used to assess the difference between $\hat{\eta}$ and $\eta$. Using techniques from cross-validation central limit theorems, we derive an asymptotically rigorous test under suitable stability conditions of the distinguisher.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a method for evaluating black-box multi-class classifiers by reducing the problem to a two-sample test: synthetic labels are generated from the classifier's predicted conditional distribution, an auxiliary binary distinguisher is trained on holdout data to separate real from synthetic samples, and a rank-sum statistic on the distinguisher outputs is used to test the null that the distributions are close. Asymptotic validity of the resulting p-value is claimed via cross-validation central limit theorems, conditional on suitable stability conditions holding for the (data-dependent) distinguisher.
Significance. If the stability conditions can be rigorously verified for the adaptive distinguisher and the CV-CLT applies, the approach would provide a statistically grounded, assumption-light procedure for black-box classifier evaluation that combines ideas from conformal inference and two-sample testing; this could be a useful contribution to statistical methodology for model assessment.
major comments (2)
- [§3 and §4] §3 (Method) and §4 (Theory): the stability conditions required for the CV-CLT to justify the asymptotic distribution of the rank-sum statistic are stated but no explicit bound or verification is given showing that these conditions hold when the distinguisher itself is trained on the same hold-out data used to compute the test statistic. Because the distinguisher is data-dependent, standard uniform-stability or Lipschitz arguments do not apply automatically; this is load-bearing for the central claim of asymptotic rigor.
- [§4] §4 (Theoretical Results): the manuscript derives the test under the stated stability conditions but provides no discussion of finite-sample behavior, no explicit error-bar construction, and no simulation evidence that the asymptotic p-values are reliable when the stability conditions are only approximately satisfied. This weakens the practical utility of the proposed procedure.
minor comments (1)
- [§3] Notation for the distinguisher and the rank-sum statistic is introduced without a clear summary table or diagram, making it harder to follow the reduction from the original goodness-of-fit problem to the two-sample test.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback on our manuscript. The comments identify important aspects of the stability assumptions and practical validation that merit clarification and expansion. We respond to each major comment below and describe the revisions we will incorporate.
Point-by-point responses
Referee: [§3 and §4] §3 (Method) and §4 (Theory): the stability conditions required for the CV-CLT to justify the asymptotic distribution of the rank-sum statistic are stated but no explicit bound or verification is given showing that these conditions hold when the distinguisher itself is trained on the same hold-out data used to compute the test statistic. Because the distinguisher is data-dependent, standard uniform-stability or Lipschitz arguments do not apply automatically; this is load-bearing for the central claim of asymptotic rigor.
Authors: We agree that the stability conditions are load-bearing for the asymptotic claim and that the data-dependent nature of the distinguisher precludes automatic application of generic uniform-stability bounds. The manuscript derives the limiting distribution of the rank-sum statistic conditional on these conditions holding for the trained distinguisher. In the revision we will add a dedicated paragraph in §4 that supplies sufficient conditions (e.g., bounded Rademacher complexity of the distinguisher class together with a Lipschitz loss and a fixed regularization parameter) under which the required stability can be verified for common training procedures such as regularized logistic regression or early-stopped neural networks. We will also state explicitly that, for completely arbitrary black-box distinguishers without such restrictions, the conditions remain an assumption rather than a derived guarantee.
Revision: partial
Referee: [§4] §4 (Theoretical Results): the manuscript derives the test under the stated stability conditions but provides no discussion of finite-sample behavior, no explicit error-bar construction, and no simulation evidence that the asymptotic p-values are reliable when the stability conditions are only approximately satisfied. This weakens the practical utility of the proposed procedure.
Authors: The present version emphasizes the asymptotic theory. To improve practical guidance we will revise §4 to include (i) a brief derivation of a consistent estimator for the asymptotic variance that can be used to form approximate finite-sample error bars around the test statistic, and (ii) a new simulation subsection that examines coverage and type-I error of the asymptotic p-values for moderate sample sizes when the stability conditions hold only approximately (e.g., under mild misspecification of the distinguisher). These additions will be placed after the main theoretical results and will not alter the asymptotic claims.
Revision: yes
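As a point of reference for (i), the classical normal approximation for the Wilcoxon rank-sum statistic already yields simple error bars. The sketch below is the textbook formula (no tie correction), offered as context; it is not the CV-CLT variance estimator the authors promise in the revision:

```python
import numpy as np

def ranksum_z(scores_real, scores_synth):
    """Normal approximation for the rank-sum statistic.

    Under the null (scores from the two samples share one distribution),
    the standardized statistic is approximately N(0, 1); +/- 1.96 gives a
    95% error bar. Assumes no heavy tie structure among the scores.
    """
    n1, n2 = len(scores_real), len(scores_synth)
    pooled = np.concatenate([scores_real, scores_synth])
    ranks = pooled.argsort().argsort() + 1.0   # ranks 1..n1+n2 (no tie handling)
    w = ranks[:n1].sum()                       # rank sum of the real-sample scores
    mean = n1 * (n1 + n2 + 1) / 2.0
    var = n1 * n2 * (n1 + n2 + 1) / 12.0
    return (w - mean) / np.sqrt(var)

rng = np.random.default_rng(2)
z = ranksum_z(rng.random(300), rng.random(300))  # same distribution: z is near N(0,1)
print(f"z = {z:.2f}")
```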
- Not promised in revision: a universal, assumption-free explicit bound on the stability of an arbitrary data-dependent distinguisher trained on the same hold-out sample.
Circularity Check
No significant circularity: the derivation applies an external CV-CLT to a constructed statistic under stated assumptions.
Full rationale
The paper reduces the evaluation task to a two-sample test by generating synthetic samples from the black-box classifier and training an auxiliary distinguisher on a data split, then applies a rank-sum statistic whose asymptotic distribution is justified by citing cross-validation central limit theorem techniques under explicit stability conditions on the distinguisher. No equation or step shows the final test statistic or p-value reducing by construction to a fitted parameter, self-defined quantity, or input data summary; the CLT is treated as an external result whose applicability is conditioned on stability rather than proven internally via the target hypothesis. The distinguisher is data-dependent by design, but this is an assumption for the asymptotic justification, not a self-referential loop. No self-citations are load-bearing for the core derivation, and the approach remains independent of the specific fitted values of the classifier under test.
Axiom & Free-Parameter Ledger
axioms (2)
- Standard math: cross-validation central limit theorems apply to the rank-sum statistic computed from the distinguisher.
- Domain assumption: the auxiliary distinguisher satisfies suitable stability conditions.
invented entities (1)
- Auxiliary binary distinguisher (no independent evidence)
Reference graph
Works this paper leans on
- Yuchen Chen and Jing Lei. De-biased two-sample U-statistics with application to conditional distribution testing. Machine Learning, 114(2):33, 2025.
- Cynthia Dwork, Michael P. Kim, Omer Reingold, Guy N. Rothblum, and Gal Yona. Outcome indistinguishability. Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, 2021.
- Patrik Róbert Gerber, Yanjun Han, and Yury Polyanskiy. Minimax optimal testing by classification. June 2023. arXiv:2306.11085.
- Xiaoyu Hu and Jing Lei. A two-sample conditional distribution test using conformal prediction and weighted rank sum. Journal of the American Statistical Association.
- Ilmun Kim and Aaditya Ramdas. Dimension-agnostic inference using cross U-statistics. Bernoulli, 30(1), February 2024. doi: 10.3150/23-BEJ1613.
- Ilmun Kim, Aaditya Ramdas, Aarti Singh, and Larry Wasserman. Classification accuracy as a proxy for two-sample testing. The Annals of Statistics.