arxiv: 2605.11284 · v1 · submitted 2026-05-11 · 📊 stat.ME · cs.AI· cs.LG

Recognition: 2 theorem links

· Lean Theorem

Rethinking external validation for the target population: Capturing patient-level similarity with a generative model

Ameen Abu-Hanna, Giovanni Cin\`a (on behalf of the NHR THI registration committee), Marije M. Vis, Mohammad Azizmalayeri, Saskia Houterman

Pith reviewed 2026-05-13 01:26 UTC · model grok-4.3

classification 📊 stat.ME cs.AIcs.LG

keywords external validationgenerative modelsautoencoderscase-mixpredictive modelstransportabilitysimilarity measuremortality prediction

0 comments

The pith

Autoencoder similarity scores separate case-mix effects from model deficiencies during external validation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a framework that measures how similar each patient in an external dataset is to the original development population using generative autoencoders. Performance is then assessed within subgroups that share different degrees of alignment with the development data. This approach shows that overall validation metrics can hide large differences: some subgroups perform as expected from internal testing while others do not. The method works without sharing the original training data. Readers would care because it supplies a concrete way to decide which new patients a model can be used on safely rather than treating the entire external population as uniform.

Core claim

By training autoencoders on development data to produce a patient-level similarity score, the framework evaluates predictive model performance separately in external subgroups ordered by their alignment to the development distribution; this distinguishes true model shortcomings from differences in patient characteristics and shows that conventional aggregate metrics can either understate or overstate transportability.

What carries the argument

The autoencoder-derived similarity score, which quantifies each external patient's alignment with the development population distribution without requiring data sharing.

Load-bearing premise

The similarity score derived from the autoencoder actually tracks the patient characteristics that drive changes in model accuracy.

What would settle it

In synthetic data with injected case-mix shifts known to alter performance, the similarity subgroups show no systematic difference in model metrics.

read the original abstract

Background: External validation is essential for assessing the transportability of predictive models. However, its interpretation is often confounded by differences between external and development populations. This study introduces a framework to distinguish model deficiencies from case-mix effects. Method: We propose a framework that quantifies each external patient's similarity to the development data and measures performance in subgroups with varying levels of alignment to the development distribution. We use generative models, specifically autoencoders, to estimate similarity, offering a more flexible alternative to traditional linear approaches and enabling validation without sharing the original development data. The utility of autoencoder-based similarity measure is demonstrated using synthetic data, and the framework's application is illustrated using data from the Netherlands Heart Registration (NHR) to predict mortality after transcatheter aortic valve implantation. Results: Our framework revealed substantial variation in model performance across similarity-defined subgroups, differences that remain hidden under conventional external validation yet can meaningfully alter conclusions. In several settings, conventional external validation suggested poor overall performance. However, after accounting for differences in patient characteristics, for some sub-groups, the model performance was consistent with internal validation results. Conversely, apparently acceptable overall performance could mask clinically relevant performance deficits in specific subgroups. Conclusion: The proposed framework enhances the interpretability of external validation by linking model performance to population alignment with the development data. This provides a more principled basis for deciding whether a model is transportable and to which patients it can be safely applied.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives a concrete way to use autoencoders for patient similarity scores that split external validation results by alignment to the development data, which helps separate case-mix effects from true transportability problems.

read the letter

The main contribution is a framework that trains an autoencoder only on the development sample, then scores external patients by how well they reconstruct in that latent space, and finally reports model performance within similarity strata. This avoids sharing raw development records while showing that overall external metrics can mask both good and bad subgroup performance. On synthetic data they control the shifts, and on the NHR TAVI mortality data they illustrate the point that some low-similarity groups drag down the aggregate numbers while higher-similarity ones track internal validation results. That is a practical step beyond standard external validation or simple propensity matching. The no-data-sharing feature and the use of a generative model rather than linear summaries are the clearest advances. The write-up is clear on the motivation and the workflow steps. The soft spots are in the supporting evidence. The abstract and results give no numeric values for the performance differences, no confidence intervals, and no sensitivity checks on autoencoder architecture or the similarity cut-points. There is also no direct test that the latent similarity ranks patients by the covariates or interactions that actually move calibration or discrimination; if the embedding picks up unrelated variation, the subgroups could be misleading. The application to real data is illustrative rather than confirmatory. This is for methodologists and applied statisticians working on clinical prediction models who already run external validations and want a diagnostic layer. A reader who needs tools for transportability questions will find the example useful even if the method still needs tighter validation. It should go to peer review because the core idea is workable and addresses a genuine interpretation gap, though referees will likely ask for quantitative results and checks on whether the similarity measure tracks the relevant distributional shifts.

Referee Report

3 major / 2 minor

Summary. The paper proposes a framework for external validation of predictive models that uses autoencoders trained on development data to compute patient-level similarity scores for external cases, then stratifies performance evaluation by these similarity levels to separate case-mix effects from model deficiencies. It demonstrates the approach on synthetic data and applies it to Netherlands Heart Registration (NHR) data for post-TAVI mortality prediction, claiming that conventional overall metrics mask important subgroup variation that can change transportability conclusions.

Significance. If the autoencoder similarity reliably isolates case-mix dimensions that drive changes in calibration or discrimination, the framework offers a practical way to interpret external validation results without sharing raw development data and could improve decisions on model applicability. The generative-model approach is a clear strength for privacy-preserving validation and addresses a real gap in standard practice.

major comments (3)

[Abstract and Results] Abstract/Results: the claim that the framework 'revealed substantial variation in model performance across similarity-defined subgroups' and that this 'can meaningfully alter conclusions' is presented without any quantitative metrics, confidence intervals, p-values, or details on how similarity thresholds or subgroups were defined; this absence makes it impossible to judge the magnitude or robustness of the reported gradients.
[Methods] Methods: the autoencoder is trained solely on development data and then used to score external patients, yet the manuscript provides no explicit test (e.g., controlled feature perturbations known to degrade performance or correlation analysis between similarity scores and outcome-relevant shifts) showing that the learned latent representation captures the dimensions that actually affect the target model's discrimination or calibration rather than orthogonal variance.
[Synthetic data experiment] Synthetic-data experiment: while the abstract states that synthetic data demonstrate the framework's ability to reveal hidden performance variation, no description is given of how the synthetic case-mix shifts were constructed, what performance metrics were tracked, or how the autoencoder hyperparameters were chosen, leaving the demonstration non-reproducible and the free parameters unexamined.

minor comments (2)

[Methods] The manuscript would benefit from an explicit equation defining the similarity score (e.g., reconstruction error or latent distance) and from a sensitivity analysis showing stability of the performance gradients to autoencoder architecture choices.
[Results] Figure captions and table legends should include the exact similarity thresholds or quantile cut-points used to define subgroups so readers can replicate the stratification.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments highlight opportunities to strengthen the quantitative presentation and reproducibility of the framework. We respond to each major comment below and will incorporate revisions in the next version of the manuscript.

read point-by-point responses

Referee: [Abstract and Results] Abstract/Results: the claim that the framework 'revealed substantial variation in model performance across similarity-defined subgroups' and that this 'can meaningfully alter conclusions' is presented without any quantitative metrics, confidence intervals, p-values, or details on how similarity thresholds or subgroups were defined; this absence makes it impossible to judge the magnitude or robustness of the reported gradients.

Authors: The abstract is a high-level summary. The Results section reports specific metrics (AUC, calibration slope) stratified by similarity quantiles in both synthetic and NHR analyses, showing gradients that alter transportability conclusions in several cases. We agree that confidence intervals, formal tests for trend, and explicit quantile thresholds are needed for robustness assessment. We will revise the Results and abstract to include these elements along with clearer subgroup definitions. revision: yes
Referee: [Methods] Methods: the autoencoder is trained solely on development data and then used to score external patients, yet the manuscript provides no explicit test (e.g., controlled feature perturbations known to degrade performance or correlation analysis between similarity scores and outcome-relevant shifts) showing that the learned latent representation captures the dimensions that actually affect the target model's discrimination or calibration rather than orthogonal variance.

Authors: The autoencoder is trained to reconstruct the joint distribution of development predictors; similarity is defined in the latent space to capture multivariate alignment. The synthetic experiments already show systematic performance degradation with decreasing similarity when shifts are introduced in outcome-relevant features. We will add an explicit validation subsection (Methods/Results) that includes feature-perturbation experiments and correlation analyses between similarity scores and shifts in calibration/discrimination to directly demonstrate capture of relevant dimensions. revision: partial
Referee: [Synthetic data experiment] Synthetic-data experiment: while the abstract states that synthetic data demonstrate the framework's ability to reveal hidden performance variation, no description is given of how the synthetic case-mix shifts were constructed, what performance metrics were tracked, or how the autoencoder hyperparameters were chosen, leaving the demonstration non-reproducible and the free parameters unexamined.

Authors: The Methods section describes synthetic data generation via controlled shifts in feature means, variances, and correlations to simulate case-mix differences, with performance tracked via AUC and calibration metrics. Hyperparameters were chosen by reconstruction error on held-out development samples. We will expand this section with precise shift magnitudes, the full list of tracked metrics, and the hyperparameter search procedure to ensure reproducibility. revision: yes

Circularity Check

0 steps flagged

No circularity: similarity scores and performance metrics remain independent

full rationale

The paper trains an autoencoder solely on development data to produce a latent representation, then computes similarity scores for external patients and evaluates the predictive model's performance (discrimination/calibration) on those patients' actual outcomes within similarity-defined subgroups. No equation or procedure reduces the reported subgroup performance to a fitted parameter of the autoencoder or to the similarity score itself. The external outcomes supply an independent test set, and conventional external validation is contrasted without any self-referential closure. No load-bearing self-citation or uniqueness theorem is invoked to force the result.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim depends on the generative model successfully encoding the relevant aspects of the development distribution. Because only the abstract is available, the ledger is limited to elements explicitly invoked in the summary.

free parameters (1)

Autoencoder architecture and training hyperparameters
Latent dimension, network depth, regularization, and reconstruction loss weighting must be chosen; these choices affect the similarity scores that define the subgroups.

axioms (1)

domain assumption An autoencoder trained on development data produces a similarity measure that aligns with the dimensions of case-mix that affect predictive performance.
Invoked when the authors state that the generative model offers a flexible alternative to linear approaches for capturing alignment.

pith-pipeline@v0.9.0 · 5594 in / 1342 out tokens · 52353 ms · 2026-05-13T01:26:03.556097+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear
We use generative models, specifically autoencoders, to estimate similarity... performance across similarity-defined subgroups
IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking unclear
stratify the external dataset into ID-like and OOD instances

Reference graph

Works this paper leans on

34 extracted references · 34 canonical work pages

[1]

Briefings in bioinformatics 2018, 19(6):1236–1246

Miotto R, Wang F , Wang S, Jiang X, Dudley JT: Deep learning for healthcare: review, opportunities and challenges. Briefings in bioinformatics 2018, 19(6):1236–1246

work page 2018
[2]

NPJ digital medicine 2022, 5(1):2

de Hond AA, Leeuwenberg AM, Hooft L, Kant IM, Nijman SW, van Os HJ, Aardoom JJ, Debray TP , Schuit E, van Smeden M: Guidelines and quality criteria for artificial intelligence-based prediction models in healthcare: a scoping review. NPJ digital medicine 2022, 5(1):2

work page 2022
[3]

In: 2019 global conference for advancement in technology (GCAT): 2019: IEEE; 2019: 1–6

Nair NG, Satpathy P , Christopher J: Covariate shift: A review and analysis on classifiers. In: 2019 global conference for advancement in technology (GCAT): 2019: IEEE; 2019: 1–6

work page 2019
[4]

Scientific reports 2022, 12(1):2726

Guo LL, Pfohl SR, Fries J, Johnson AE, Posada J, Aftandilian C, Shah N, Sung L: Evaluation of domain generalization and adaptation on improving model robustness to temporal dataset shift in clinical medicine. Scientific reports 2022, 12(1):2726

work page 2022
[5]

Ramspek CL, Jager KJ, Dekker FW, Zoccali C, van Diepen M: External validation of prognostic models: what, why, how, when and where? Clinical kidney journal 2021, 14(1):49–58

work page 2021
[6]

Bmj 2024, 384

Collins GS, Dhiman P , Ma J, Schlussel MM, Archer L, Van Calster B, Harrell FE, Martin GP , Moons KG, Van Smeden M: Evaluation of clinical prediction models (part 1): from development to external validation. Bmj 2024, 384

work page 2024
[7]

Diagnostic and prognostic research 2022, 6(1):24

Sperrin M, Riley RD, Collins GS, Martin GP: Targeted validation: validating clinical prediction models in their intended population and setting. Diagnostic and prognostic research 2022, 6(1):24

work page 2022
[8]

American journal of epidemiology 2010, 172(8):971–980

Vergouwe Y , Moons KG, Steyerberg EW: External validity of risk models: use of benchmark values to disentangle a case-mix effect from incorrect coefficients. American journal of epidemiology 2010, 172(8):971–980

work page 2010
[9]

Journal of clinical epidemiology 2015, 68(3):279–289

Debray TP , Vergouwe Y , Koffijberg H, Nieboer D, Steyerberg EW, Moons KG: A new framework to enhance the interpretation of external validation studies of clinical prediction models. Journal of clinical epidemiology 2015, 68(3):279–289

work page 2015
[10]

Statistics in Medicine 2023, 42(19):3508–3528

de Jong VM, Hoogland J, Moons KG, Riley RD, Nguyen TL, Debray TP: Propensity‐based standardization to enhance the validation and interpretation of prediction model discrimination for a target population. Statistics in Medicine 2023, 42(19):3508–3528

work page 2023
[11]

Statistics in medicine 2022, 41(24):4756– 4780

Pfeiffer RM, Chen Y , Gail MH, Ankerst DP: Accommodating population differences when validating risk prediction models. Statistics in medicine 2022, 41(24):4756– 4780

work page 2022
[12]

International Journal of Medical Informatics 2025, 195:105762

Azizmalayeri M, Abu-Hanna A, Ciná G: Unmasking the chameleons: A benchmark for out-of-distribution detection in medical tabular data. International Journal of Medical Informatics 2025, 195:105762

work page 2025
[13]

BMC medicine 2023, 21(1):70

Van Calster B, Steyerberg EW, Wynants L, Van Smeden M: There is no such thing as a validated prediction model. BMC medicine 2023, 21(1):70

work page 2023
[14]

Journal of Clinical Epidemiology 2024, 172:111387

la Roi-Teeuw HM, van Royen FS, de Hond A, Zahra A, de Vries S, Bartels R, Carriero AJ, van Doorn S, Dunias ZS, Kant I: Don't be misled: 3 misconceptions about external validation of clinical prediction models. Journal of Clinical Epidemiology 2024, 172:111387

work page 2024
[15]

Journal of the American Medical Directors Association 2023, 24(12):1996–2001

van de Loo B, Heymans MW, Medlock S, Boyé ND, van der Cammen TJ, Hartholt KA, Emmelot-Vonk MH, Mattace-Raso FU, Abu-Hanna A, van der Velde N: Validation of the ADFICE_IT models for predicting falls and recurrent falls in geriatric outpatients. Journal of the American Medical Directors Association 2023, 24(12):1996–2001

work page 2023
[16]

Revolutionizing the Healthcare Sector with AI 2024:79–90

Helen D, Suresh N: Generative AI in healthcare: Opportunities, challenges, and future perspectives. Revolutionizing the Healthcare Sector with AI 2024:79–90

work page 2024
[17]

BMC Medical Research Methodology 2025, 25(1):1–17

Elayan H, Sperrin M, Martin GP , Peek N, Braunschweig F , Faxén J, Alfredsson J, Jenkins DA: Correcting for case-mix shift when developing clinical prediction models. BMC Medical Research Methodology 2025, 25(1):1–17

work page 2025
[18]

Machine Learning in Clinical Neuroscience: Foundations and Applications 2021:15–21

Kernbach JM, Staartjes VE: Foundations of machine learning-based clinical prediction modeling: Part II—Generalization and overfitting. Machine Learning in Clinical Neuroscience: Foundations and Applications 2021:15–21

work page 2021
[19]

NPJ digital medicine 2022, 5(1):66

Feng J, Phillips RV , Malenica I, Bishara A, Hubbard AE, Celi LA, Pirracchio R: Clinical artificial intelligence quality improvement: towards continual monitoring and updating of AI algorithms in healthcare. NPJ digital medicine 2022, 5(1):66

work page 2022
[20]

Advances in neural information processing systems 2018, 31

Lee K, Lee K, Lee H, Shin J: A simple unified framework for detecting out-of- distribution samples and adversarial attacks. Advances in neural information processing systems 2018, 31

work page 2018
[21]

In: Multimodal AI in healthcare: A paradigm shift in health intelligence

Zadorozhny K, Thoral P , Elbers P , Cinà G: Out-of-distribution detection for medical applications: Guidelines for practical evaluation. In: Multimodal AI in healthcare: A paradigm shift in health intelligence. Springer; 2022: 137–153

work page 2022
[22]

In: Proceedings of the Fortieth Conference on Uncertainty in Artificial Intelligence: 2024; 2024: 203–224

Azizmalayeri M, Abu-Hanna A, Cinà G: Mitigating overconfidence in out-of- distribution detection by capturing extreme activations. In: Proceedings of the Fortieth Conference on Uncertainty in Artificial Intelligence: 2024; 2024: 203–224

work page 2024
[23]

arXiv preprint arXiv:210312628 2021

Kaur R, Jha S, Roy A, Sokolsky O, Lee I: Are all outliers alike? on understanding the diversity of outliers for detecting oods. arXiv preprint arXiv:210312628 2021

work page 2021
[24]

Journal of Biomedical Informatics 2022, 127:103996

Nicora G, Rios M, Abu-Hanna A, Bellazzi R: Evaluating pointwise reliability of machine learning prediction. Journal of Biomedical Informatics 2022, 127:103996

work page 2022
[25]

IEEe Access 2024

Bengesi S, El-Sayed H, Sarker MK, Houkpati Y , Irungu J, Oladunni T: Advancements in Generative AI: A Comprehensive Review of GANs, GPT, Autoencoders, Diffusion Model, and Transformers. IEEe Access 2024

work page 2024
[26]

In: Machine Learning for Health: 2020: PMLR; 2020: 341–354

Ulmer D, Meijerink L, Cinà G: Trust issues: Uncertainty estimation does not enable reliable ood detection on medical tabular data. In: Machine Learning for Health: 2020: PMLR; 2020: 341–354

work page 2020
[27]

IEEE journal of biomedical and health informatics 2018, 23(1):103– 111

Zhou C, Jia Y , Motani M: Optimizing autoencoders for learning deep representations from health data. IEEE journal of biomedical and health informatics 2018, 23(1):103– 111

work page 2018
[28]

In: Machine learning

Pinaya WHL, Vieira S, Garcia-Dias R, Mechelli A: Autoencoders. In: Machine learning. Elsevier; 2020: 193–208

work page 2020
[29]

Netherlands Heart Journal 2022, 30(12):546–556

Timmermans MJ, Houterman S, Daeter ED, Danse PW, Li WW, Lipsic E, Roefs MM, van Veghel D, Registration PRCotNH, Registration tCSRCotNH: Using real-world data to monitor and improve quality of care in coronary artery disease: results from the Netherlands Heart Registration. Netherlands Heart Journal 2022, 30(12):546–556

work page 2022
[30]

Olsthoorn J, Heuts S, Houterman S, Roefs M, Maessen J, Nia P: Cardiothoracic Surgery Registration Committee of the Netherlands Heart Registration. Does concomitant tricuspid valve surgery increase the risks of minimally invasive mitral valve surgery? A multicentre comparison based on data from The Netherlands Heart Registration. J Card Surg 2022, 37:4362–4370

work page 2022
[31]

Netherlands Heart Journal 2024, 32(6):228–237

Derks L, Medendorp NM, Houterman S, Umans VA, Maessen JG, van Veghel D, Registration aRCotNH: Building a patient-centred nationwide integrated cardiac care registry: intermediate results from the Netherlands. Netherlands Heart Journal 2024, 32(6):228–237

work page 2024
[32]

Journal of Clinical Epidemiology 2023, 157:13–21

Yordanov TR, Lopes RR, Ravelli AC, Vis M, Houterman S, Marquering H, Abu-Hanna A: An integrated approach to geographic validation helped scrutinize prediction model performance and its variability. Journal of Clinical Epidemiology 2023, 157:13–21

work page 2023
[33]

Biostatistics 2020, 21(2):345–352

Subbaswamy A, Saria S: From development to deployment: dataset shift, causality, and shift-stable models in health AI. Biostatistics 2020, 21(2):345–352

work page 2020
[34]

similar” and “dissimilar

Taheri T, Farahani A, Liu Z-Q, Ceballos EG, Harroud A, Dagher A, Misic B: Spatial organization of AQP4 channels in the human brain: links with perfusion, edema, and disease vulnerability. bioRxiv 2026:2026.2002. 2018.706679. Appendix Appendix A. Propensity Score Weighting for Distribution Matching When evaluating model performance on external datasets, di...

work page arXiv 2026