Beyond Point Estimates: Reliable Evaluation of Prediction Performance Metrics under Clustered Data

Daeyoung Lim; Taekwon Hong; Woojung Bae

arxiv: 2606.03656 · v2 · pith:5IBZELKMnew · submitted 2026-06-02 · 📊 stat.ME

Beyond Point Estimates: Reliable Evaluation of Prediction Performance Metrics under Clustered Data

Taekwon Hong , Daeyoung Lim , Woojung Bae This is my paper

Pith reviewed 2026-06-28 09:07 UTC · model grok-4.3

classification 📊 stat.ME

keywords clustered dataperformance metricssandwich estimatorconfidence intervalsconfusion matrixmodel evaluationasymptotic inferenceuncertainty quantification

0 comments

The pith

Representing prediction performance metrics as smooth functionals of confusion-matrix probabilities enables cluster-robust inference under dependent data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a framework that connects common performance metrics like accuracy and F1 score to smooth functions of the probabilities in a confusion matrix. This link makes it possible to apply a cluster-robust sandwich variance estimator, which produces asymptotically valid confidence intervals, hypothesis tests, and comparisons between models even when observations are clustered or dependent. The approach also supplies formulas for power and sample-size calculations based on pilot data. Simulations confirm that the method maintains proper coverage where simpler approaches fall short, and a real-data example shows that ignoring clustering can alter substantive conclusions about model performance.

Core claim

A broad class of prediction performance metrics can be expressed as smooth functionals of the cell probabilities in the confusion matrix; this representation justifies the use of the cluster-robust sandwich estimator to obtain valid asymptotic inference for both binary and multiclass classification under clustered sampling.

What carries the argument

Smooth functional representation of metrics in terms of confusion-matrix probabilities, which licenses the cluster-robust sandwich variance estimator.

Load-bearing premise

A broad class of performance metrics admit representation as smooth functionals of confusion-matrix probabilities.

What would settle it

Empirical coverage probabilities of the resulting intervals falling well below the nominal level in Monte Carlo experiments with clustered data for a metric outside the smooth-functional class.

Figures

Figures reproduced from arXiv: 2606.03656 by Daeyoung Lim, Taekwon Hong, Woojung Bae.

read the original abstract

Prediction performance metrics such as accuracy and the F1 score are typically reported as single numbers, with no measure of uncertainty. The omission has been tolerable in exploratory settings, where model evaluation is used for informal comparison rather than formal decision-making. But as machine learning is deployed in real-world applications, evaluation results are increasingly used to support binary decisions -- whether a model meets a required standard or not -- making uncertainty quantification essential. The problem is compounded when data are dependent, as in repeated measurements, clustered subjects, or time series, where variability is harder to assess and easy to underestimate. We develop a unified framework that links a broad class of performance metrics through their representation as smooth functionals of confusion-matrix probabilities. This representation allows the use of the cluster-robust sandwich variance estimator to obtain asymptotically valid confidence intervals, hypothesis tests, and paired model comparisons for both binary and multiclass problems under clustered data. We also provide power and sample size approximations based on pilot data, enabling principled study design for model evaluation. Simulations show that the proposed methods achieve near-nominal coverage across a range of dependence structures, while naive methods underestimate variability. A real-data application further illustrates how accounting for clustering can materially change conclusions. These results offer a practical foundation for uncertainty quantification and study design in prediction performance evaluation, in settings where decisions should be justified under dependent and clustered data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives a direct way to get cluster-robust CIs and tests for metrics like accuracy and F1 by writing them as smooth functionals of confusion-matrix cell probabilities and plugging into the sandwich estimator.

read the letter

The paper's main move is to represent a range of performance metrics as smooth functions of the confusion-matrix cell probabilities, estimate those probabilities from cluster averages, and then apply the cluster-robust sandwich variance. The delta method supplies the asymptotic variance for the metric itself. This covers binary and multiclass cases and adds power and sample-size formulas based on pilot data.

What is new is the unified functional representation that lets the same sandwich machinery handle many metrics at once under independent clusters. The simulations check coverage across dependence structures and show the naive approach underestimates variability, which matches the theory. The real-data example illustrates that the adjustment can change substantive conclusions.

The argument rests on standard M-estimation results, so the derivations are straightforward once the functional representation is granted. The smoothness assumption holds for the usual metrics the paper targets; discontinuous or non-differentiable cases are outside the scope and would need separate treatment. No circularity or unsupported regularity condition appears.

This is aimed at applied statisticians and ML practitioners who evaluate models on clustered or longitudinal data and need defensible uncertainty for deployment decisions. Readers working on study design or formal model comparisons will get immediate use from the power approximations and paired tests.

I would send it to referees. The contribution is focused, the simulations are reassuring, and the gap it fills is real.

Referee Report

2 major / 3 minor

Summary. The paper develops a unified framework representing a broad class of prediction performance metrics (accuracy, F1, etc.) as smooth functionals g(θ) of confusion-matrix cell probabilities θ. Cluster-level averages yield θ̂ whose asymptotic covariance is estimated by the cluster-robust sandwich; the delta method then supplies the asymptotic variance of g(θ̂). This yields CIs, tests, and paired comparisons under clustered data for binary and multiclass settings, plus power/sample-size approximations from pilot data. Simulations show near-nominal coverage; a real-data example illustrates changed conclusions when clustering is accounted for.

Significance. If the representation and regularity conditions hold, the work supplies a practical, asymptotically justified tool for uncertainty quantification in ML evaluation under dependence—an increasingly relevant need as metrics support binary decisions. It applies standard M-estimation/sandwich theory to this domain without circularity. Credit is due for the explicit simulation confirmation of coverage across dependence structures and for including pilot-data-based study-design formulas.

major comments (2)

[Abstract and functional-representation section] Abstract and the section introducing the functional representation: the claim that a broad class of metrics (including F1) admit smooth functional representations g(θ) requires explicit statement of the differentiability conditions and handling of points where denominators vanish; without these, the delta-method justification for asymptotic validity under clustering is not fully load-bearing.
[Simulation section] Simulation results (coverage tables): the reported near-nominal coverage is consistent with theory, but the manuscript must specify the exact cluster sizes, intra-cluster correlation values, and multiclass zero-cell handling rules used, because these directly test the weakest assumption that the functional representation remains valid at the simulated operating points.

minor comments (3)

Introduce the notation θ and g(θ) with a short display equation in the introduction rather than deferring it.
[Real-data application] In the real-data application, state the number of clusters and average cluster size explicitly.
[Method section] Add a short remark on whether the sandwich estimator is the usual one or a finite-sample variant.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and positive recommendation. We address each major comment below.

read point-by-point responses

Referee: [Abstract and functional-representation section] Abstract and the section introducing the functional representation: the claim that a broad class of metrics (including F1) admit smooth functional representations g(θ) requires explicit statement of the differentiability conditions and handling of points where denominators vanish; without these, the delta-method justification for asymptotic validity under clustering is not fully load-bearing.

Authors: We agree that the differentiability conditions and treatment of vanishing denominators should be stated explicitly. The revised manuscript will add a paragraph in the functional-representation section specifying that g is continuously differentiable in a neighborhood of the true θ (with the gradient bounded away from zero) and will note that the probability of denominator zero events vanishes asymptotically under the maintained assumption that class probabilities are bounded away from zero; boundary cases are handled by continuity of the metric. This makes the delta-method argument fully rigorous. revision: yes
Referee: [Simulation section] Simulation results (coverage tables): the reported near-nominal coverage is consistent with theory, but the manuscript must specify the exact cluster sizes, intra-cluster correlation values, and multiclass zero-cell handling rules used, because these directly test the weakest assumption that the functional representation remains valid at the simulated operating points.

Authors: We accept the request for greater transparency. The revised simulation section will include an explicit table (or expanded text) listing the precise cluster sizes, intra-cluster correlation values, and the zero-cell handling rule employed for multiclass problems. This addition will allow direct verification that the functional representation remains valid at the simulated points. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper's core contribution is a direct application of M-estimation and the delta method: performance metrics are expressed as smooth functionals g(θ) of confusion-matrix cell probabilities θ (estimated via cluster averages), with the cluster-robust sandwich providing the asymptotic covariance of θ̂ and the delta method supplying the variance of g(θ̂). This is standard estimating-equation theory under independent clusters and does not reduce any target quantity to a parameter fitted from the same data by construction, nor does it rely on self-citations, uniqueness theorems, or smuggled ansatzes. Simulations and the real-data example serve as external checks rather than internal definitions. The derivation chain is therefore self-contained against external statistical benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that metrics are smooth functionals of confusion-matrix probabilities and on the applicability of the cluster-robust sandwich estimator to clustered dependence structures; no free parameters or invented entities are introduced in the abstract.

axioms (1)

domain assumption Performance metrics can be represented as smooth functionals of confusion-matrix probabilities
This representation is invoked to link the metrics to the sandwich variance estimator.

pith-pipeline@v0.9.1-grok · 5774 in / 1235 out tokens · 34908 ms · 2026-06-28T09:07:46.100258+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

27 extracted references · 26 canonical work pages · 1 internal anchor

[1]

doi:https://doi.org/10.1093/qje/qjac038

When should you adjust standard errors for clustering? The Quarterly Journal of Eco- nomics 138, 1–35. doi:https://doi.org/10.1093/qje/qjac038. Abnoosian, K., Farnoosh, R., Behzadi, M.H.,

work page doi:10.1093/qje/qjac038
[2]

Journal of medical Internet research 27, e72216

At-home evaluation of both wearable and touchless digital health tech- nologies for measuring nocturnal scratching in atopic dermatitis: Ana- lytical validation study. Journal of medical Internet research 27, e72216. doi:https://doi.org/10.2196/72216. Bartko, J.J.,

work page doi:10.2196/72216
[3]

Psychological reports 19, 3–11

The intraclass correlation coefficient as a measure of re- liability. Psychological reports 19, 3–11. doi:https://doi.org/10.2466/ pr0.1966.19.1.3. Berk, R., Brown, L., Buja, A., Zhang, K., Zhao, L.,

1966
[4]

The Annals of Statistics , 802–837doi:http://doi

Valid post- selection inference. The Annals of Statistics , 802–837doi:http://doi. org/10.1214/12-AOS1077. Billingsley, P.,

work page doi:10.1214/12-aos1077
[5]

Convergence of

Convergence of probability measures. John Wiley & Sons, New York. doi:https://doi.org/10.1002/9780470316962. Cameron, A.C., Gelbach, J.B., Miller, D.L.,

work page doi:10.1002/9780470316962
[6]

Chan, I.S., Zhang, Z.,

doi:https://doi.org/10.1198/jbes.2010.07136. Chan, I.S., Zhang, Z.,

work page doi:10.1198/jbes.2010.07136 2010
[7]

Biometrics 55, 1202–1209

Test-based exact confidence intervals for the difference of two binomial proportions. Biometrics 55, 1202–1209. doi:https://doi.org/10.1111/j.0006-341X.1999.01202.x. 48 Chekroud, A.M., Hawrilenko, M., Loho, H., Bondar, J., Gueorguieva, R., Hasan, A., Kambeitz, J., Corlett, P.R., Koutsouleris, N., Krumholz, H.M., et al.,

work page doi:10.1111/j.0006-341x.1999.01202.x 1999
[8]

Science 383, 164–167

Illusory generalizability of clinical prediction models. Science 383, 164–167. doi:https://doi.org/10.1126/science.adg8538. Chicco, D., Jurman, G.,

work page doi:10.1126/science.adg8538
[9]

Biometrika 26, 404–413

The use of confidence or fiducial limits illustrated in the case of the binomial. Biometrika 26, 404–413. doi:https: //doi.org/10.2307/2331986. Colin Cameron, A., Miller, D.L.,

work page doi:10.2307/2331986
[10]

Journal of human resources 50, 317–372

A practitioner’s guide to cluster- robust inference. Journal of human resources 50, 317–372. doi:https: //doi.org/10.3368/jhr.50.2.317. Esteva, A., Kuprel, B., Novoa, R.A., Ko, J., Swetter, S.M., Blau, H.M., Thrun, S.,

work page doi:10.3368/jhr.50.2.317
[11]

Japanese journal of radiol- ogy 37, 466–472

Distinction between benign and malignant breast masses at breast ultrasound using deep learn- ing method with convolutional neural network. Japanese journal of radiol- ogy 37, 466–472. doi:https://doi.org/10.1007/s11604-019-00831-5. Hastie, T., Tibshirani, R., Friedman, J.,

work page doi:10.1007/s11604-019-00831-5
[12]

2009.The Elements of Statistical Learning: Data Mining, Inference, and Prediction(2 ed.)

The elements of statistical learning: data mining, inference, and prediction. 2nd ed., Springer, New York. doi:https://doi.org/10.1007/978-0-387-84858-7. Itaya, Y., Tamura, J., Hayashi, K., Yamamoto, K.,

work page doi:10.1007/978-0-387-84858-7
[13]

Statistics in Medicine 44, e10303

Asymptotic proper- ties of matthews correlation coefficient. Statistics in Medicine 44, e10303. doi:https://doi.org/10.1002/sim.10303. Joe, H.,

work page doi:10.1002/sim.10303
[14]

1st ed., Chapman and Hall/CRC, New York

Dependence modeling with copulas. 1st ed., Chapman and Hall/CRC, New York. doi:https://doi.org/10.1201/b17116. 49 Liang, K.Y., Zeger, S.L.,

work page doi:10.1201/b17116
[15]

Journal of Applied Econometrics 32, 233–254

Wild bootstrap inference for wildly different cluster sizes. Journal of Applied Econometrics 32, 233–254. doi:https://doi.org/10.1002/jae.2508. Maleki, F., Ovens, K., Gupta, R., Reinhold, C., Spatz, A., Forghani, R.,

work page doi:10.1002/jae.2508
[16]

Radiology: Artificial Intelligence 5, e220028

Generalizability of machine learning models: quantitative evaluation of three methodological pitfalls. Radiology: Artificial Intelligence 5, e220028. doi:https://doi.org/10.1148/ryai.220028. Manning, C.D., Raghavan, P., Schütze, H.,

work page doi:10.1148/ryai.220028
[17]

Cambridge University Press, Cambridge

Introduction to information retrieval. Cambridge University Press, Cambridge. doi:https://doi.org/ 10.1017/CBO9780511809071. Matthews, B.W.,

work page doi:10.1017/cbo9780511809071
[18]

Müller, R., Büttner, P.,

doi:https: //doi.org/10.1038/s41746-023-00841-8. Müller, R., Büttner, P.,

work page doi:10.1038/s41746-023-00841-8
[19]

Statistics in medicine 13, 2465–2476

A critical discussion of intraclass correlation coefficients. Statistics in medicine 13, 2465–2476. doi:https://doi.org/ 10.1002/sim.4780132310. Oehlert, G.W.,

work page doi:10.1002/sim.4780132310
[20]

HTTP Mailbox - Asynchronous RESTful Communication

A note on the delta method. The American Statistician 46, 27–29. doi:https://doi.org/10.1080/00031305.1992.10475842. Sathyanarayana, A., Joty, S., Fernandez-Luque, L., Ofli, F., Srivastava, J., Elmagarmid, A., Arora, T., Taheri, S.,

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1080/00031305.1992.10475842 1992
[21]

JMIR mHealth and uHealth 4, e6562

Sleep quality prediction from wearable data using deep learning. JMIR mHealth and uHealth 4, e6562. doi:https://doi.org/10.2196/mhealth.6562. Stone, M.,

work page doi:10.2196/mhealth.6562
[22]

Journal of the royal statistical society: Series B (Methodolog- ical) 36, 111–133

Cross-validatory choice and assessment of statistical pre- dictions. Journal of the royal statistical society: Series B (Methodolog- ical) 36, 111–133. doi:https://doi.org/10.1111/j.2517-6161.1974. tb00994.x. 50 Takahashi, K., Yamamoto, K., Kuchiba, A., Koyama, T.,

work page doi:10.1111/j.2517-6161.1974 1974
[23]

Statistics in medicine 42, 4177–4192

Hypothesis testing procedure for binary and multi-class f1-scores in the paired design. Statistics in medicine 42, 4177–4192. doi:https://doi. org/10.1002/sim.9853. Van der Vaart, A.W.,

work page doi:10.1002/sim.9853
[24]

Cambridge university press, Cambridge

Asymptotic statistics. Cambridge university press, Cambridge. doi:https://doi.org/10.1017/CBO9780511802256. Walker, E., Nowacki, A.S.,

work page doi:10.1017/cbo9780511802256
[25]

Journal of general internal medicine 26, 192–196

Understanding equivalence and non- inferiority testing. Journal of general internal medicine 26, 192–196. doi:https://doi.org/10.1007/s11606-010-1513-8. Windmeijer, F.,

work page doi:10.1007/s11606-010-1513-8
[26]

Journal of econometrics 126, 25–51

A finite sample correction for the variance of linear efficient two-step gmm estimators. Journal of econometrics 126, 25–51. doi:https://doi.org/10.1016/j.jeconom.2004.02.005. Zwack, C.C., Haghani, M., Hollings, M., Zhang, L., Gauci, S., Gallagher, R., Redfern, J.,

work page doi:10.1016/j.jeconom.2004.02.005 2004
[27]

doi:https: //doi.org/10.1038/s41746-022-00734-2. 51

work page doi:10.1038/s41746-022-00734-2

[1] [1]

doi:https://doi.org/10.1093/qje/qjac038

When should you adjust standard errors for clustering? The Quarterly Journal of Eco- nomics 138, 1–35. doi:https://doi.org/10.1093/qje/qjac038. Abnoosian, K., Farnoosh, R., Behzadi, M.H.,

work page doi:10.1093/qje/qjac038

[2] [2]

Journal of medical Internet research 27, e72216

At-home evaluation of both wearable and touchless digital health tech- nologies for measuring nocturnal scratching in atopic dermatitis: Ana- lytical validation study. Journal of medical Internet research 27, e72216. doi:https://doi.org/10.2196/72216. Bartko, J.J.,

work page doi:10.2196/72216

[3] [3]

Psychological reports 19, 3–11

The intraclass correlation coefficient as a measure of re- liability. Psychological reports 19, 3–11. doi:https://doi.org/10.2466/ pr0.1966.19.1.3. Berk, R., Brown, L., Buja, A., Zhang, K., Zhao, L.,

1966

[4] [4]

The Annals of Statistics , 802–837doi:http://doi

Valid post- selection inference. The Annals of Statistics , 802–837doi:http://doi. org/10.1214/12-AOS1077. Billingsley, P.,

work page doi:10.1214/12-aos1077

[5] [5]

Convergence of

Convergence of probability measures. John Wiley & Sons, New York. doi:https://doi.org/10.1002/9780470316962. Cameron, A.C., Gelbach, J.B., Miller, D.L.,

work page doi:10.1002/9780470316962

[6] [6]

Chan, I.S., Zhang, Z.,

doi:https://doi.org/10.1198/jbes.2010.07136. Chan, I.S., Zhang, Z.,

work page doi:10.1198/jbes.2010.07136 2010

[7] [7]

Biometrics 55, 1202–1209

Test-based exact confidence intervals for the difference of two binomial proportions. Biometrics 55, 1202–1209. doi:https://doi.org/10.1111/j.0006-341X.1999.01202.x. 48 Chekroud, A.M., Hawrilenko, M., Loho, H., Bondar, J., Gueorguieva, R., Hasan, A., Kambeitz, J., Corlett, P.R., Koutsouleris, N., Krumholz, H.M., et al.,

work page doi:10.1111/j.0006-341x.1999.01202.x 1999

[8] [8]

Science 383, 164–167

Illusory generalizability of clinical prediction models. Science 383, 164–167. doi:https://doi.org/10.1126/science.adg8538. Chicco, D., Jurman, G.,

work page doi:10.1126/science.adg8538

[9] [9]

Biometrika 26, 404–413

The use of confidence or fiducial limits illustrated in the case of the binomial. Biometrika 26, 404–413. doi:https: //doi.org/10.2307/2331986. Colin Cameron, A., Miller, D.L.,

work page doi:10.2307/2331986

[10] [10]

Journal of human resources 50, 317–372

A practitioner’s guide to cluster- robust inference. Journal of human resources 50, 317–372. doi:https: //doi.org/10.3368/jhr.50.2.317. Esteva, A., Kuprel, B., Novoa, R.A., Ko, J., Swetter, S.M., Blau, H.M., Thrun, S.,

work page doi:10.3368/jhr.50.2.317

[11] [11]

Japanese journal of radiol- ogy 37, 466–472

Distinction between benign and malignant breast masses at breast ultrasound using deep learn- ing method with convolutional neural network. Japanese journal of radiol- ogy 37, 466–472. doi:https://doi.org/10.1007/s11604-019-00831-5. Hastie, T., Tibshirani, R., Friedman, J.,

work page doi:10.1007/s11604-019-00831-5

[12] [12]

2009.The Elements of Statistical Learning: Data Mining, Inference, and Prediction(2 ed.)

The elements of statistical learning: data mining, inference, and prediction. 2nd ed., Springer, New York. doi:https://doi.org/10.1007/978-0-387-84858-7. Itaya, Y., Tamura, J., Hayashi, K., Yamamoto, K.,

work page doi:10.1007/978-0-387-84858-7

[13] [13]

Statistics in Medicine 44, e10303

Asymptotic proper- ties of matthews correlation coefficient. Statistics in Medicine 44, e10303. doi:https://doi.org/10.1002/sim.10303. Joe, H.,

work page doi:10.1002/sim.10303

[14] [14]

1st ed., Chapman and Hall/CRC, New York

Dependence modeling with copulas. 1st ed., Chapman and Hall/CRC, New York. doi:https://doi.org/10.1201/b17116. 49 Liang, K.Y., Zeger, S.L.,

work page doi:10.1201/b17116

[15] [15]

Journal of Applied Econometrics 32, 233–254

Wild bootstrap inference for wildly different cluster sizes. Journal of Applied Econometrics 32, 233–254. doi:https://doi.org/10.1002/jae.2508. Maleki, F., Ovens, K., Gupta, R., Reinhold, C., Spatz, A., Forghani, R.,

work page doi:10.1002/jae.2508

[16] [16]

Radiology: Artificial Intelligence 5, e220028

Generalizability of machine learning models: quantitative evaluation of three methodological pitfalls. Radiology: Artificial Intelligence 5, e220028. doi:https://doi.org/10.1148/ryai.220028. Manning, C.D., Raghavan, P., Schütze, H.,

work page doi:10.1148/ryai.220028

[17] [17]

Cambridge University Press, Cambridge

Introduction to information retrieval. Cambridge University Press, Cambridge. doi:https://doi.org/ 10.1017/CBO9780511809071. Matthews, B.W.,

work page doi:10.1017/cbo9780511809071

[18] [18]

Müller, R., Büttner, P.,

doi:https: //doi.org/10.1038/s41746-023-00841-8. Müller, R., Büttner, P.,

work page doi:10.1038/s41746-023-00841-8

[19] [19]

Statistics in medicine 13, 2465–2476

A critical discussion of intraclass correlation coefficients. Statistics in medicine 13, 2465–2476. doi:https://doi.org/ 10.1002/sim.4780132310. Oehlert, G.W.,

work page doi:10.1002/sim.4780132310

[20] [20]

HTTP Mailbox - Asynchronous RESTful Communication

A note on the delta method. The American Statistician 46, 27–29. doi:https://doi.org/10.1080/00031305.1992.10475842. Sathyanarayana, A., Joty, S., Fernandez-Luque, L., Ofli, F., Srivastava, J., Elmagarmid, A., Arora, T., Taheri, S.,

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1080/00031305.1992.10475842 1992

[21] [21]

JMIR mHealth and uHealth 4, e6562

Sleep quality prediction from wearable data using deep learning. JMIR mHealth and uHealth 4, e6562. doi:https://doi.org/10.2196/mhealth.6562. Stone, M.,

work page doi:10.2196/mhealth.6562

[22] [22]

Journal of the royal statistical society: Series B (Methodolog- ical) 36, 111–133

Cross-validatory choice and assessment of statistical pre- dictions. Journal of the royal statistical society: Series B (Methodolog- ical) 36, 111–133. doi:https://doi.org/10.1111/j.2517-6161.1974. tb00994.x. 50 Takahashi, K., Yamamoto, K., Kuchiba, A., Koyama, T.,

work page doi:10.1111/j.2517-6161.1974 1974

[23] [23]

Statistics in medicine 42, 4177–4192

Hypothesis testing procedure for binary and multi-class f1-scores in the paired design. Statistics in medicine 42, 4177–4192. doi:https://doi. org/10.1002/sim.9853. Van der Vaart, A.W.,

work page doi:10.1002/sim.9853

[24] [24]

Cambridge university press, Cambridge

Asymptotic statistics. Cambridge university press, Cambridge. doi:https://doi.org/10.1017/CBO9780511802256. Walker, E., Nowacki, A.S.,

work page doi:10.1017/cbo9780511802256

[25] [25]

Journal of general internal medicine 26, 192–196

Understanding equivalence and non- inferiority testing. Journal of general internal medicine 26, 192–196. doi:https://doi.org/10.1007/s11606-010-1513-8. Windmeijer, F.,

work page doi:10.1007/s11606-010-1513-8

[26] [26]

Journal of econometrics 126, 25–51

A finite sample correction for the variance of linear efficient two-step gmm estimators. Journal of econometrics 126, 25–51. doi:https://doi.org/10.1016/j.jeconom.2004.02.005. Zwack, C.C., Haghani, M., Hollings, M., Zhang, L., Gauci, S., Gallagher, R., Redfern, J.,

work page doi:10.1016/j.jeconom.2004.02.005 2004

[27] [27]

doi:https: //doi.org/10.1038/s41746-022-00734-2. 51

work page doi:10.1038/s41746-022-00734-2