pith. machine review for the scientific record.

arxiv: 2605.07312 · v1 · submitted 2026-05-08 · 📊 stat.ME

Recognition: no theorem link

Incorporating Missing Data Considerations into Sample Size Calculations for Developing Clinical Prediction Models

Gary S. Collins, Glen P. Martin, Molly Wells, Rebecca Whittle, Richard D. Riley, Sian Bladon

Pith reviewed 2026-05-11 00:51 UTC · model grok-4.3

classification 📊 stat.ME
keywords missing data · sample size calculation · clinical prediction models · imputation · overfitting · calibration · posterior distributions · simulation study

The pith

Missing predictor data requires larger sample sizes than current methods assume to develop stable clinical prediction models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper examines how missing values in predictor variables affect the sample sizes needed for developing clinical prediction models that avoid overfitting and maintain good calibration. Current sample size guidelines assume complete data for all participants, but missing data is common and degrades model performance even when minimum sizes are met. By adapting a framework based on anticipated posterior distributions, the authors show how to incorporate missing data assumptions and imputation strategies directly into calculations. Simulations demonstrate that in some cases up to twice the usual minimum sample size is needed to achieve comparable performance to fully observed data. This provides a practical way to plan studies that account for expected missingness upfront.

Core claim

The paper claims that missing predictor data increases minimum sample size requirements for developing stable and well-calibrated clinical prediction models. In simulations on datasets that meet current sample size criteria, expected calibration slopes often fell below 0.9 because of missingness. Adapting posterior sampling sample size calculations to include missing data assumptions and handling methods allows the inflated sample size needed to restore performance, sometimes double the minimum, to be determined. Two applied examples illustrate the approach.

What carries the argument

The adaptation of the general sample size framework based on anticipated posterior (sampling) distributions to incorporate missing data assumptions and different imputation methods.
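
The framework itself is not reproduced here, but its logic can be sketched: for a candidate sample size, repeatedly simulate development data from an assumed outcome model, impose the anticipated missingness, impute, fit the model, and summarise the anticipated (sampling) distribution of the calibration slope; the required size is the smallest one meeting the performance target. The following Python sketch illustrates that loop under assumed values for the data-generating model, a single MAR predictor, mean imputation, and a 0.9 to 1.1 slope target; none of these choices are taken from the paper.

```python
# Illustrative sketch only: an assurance-style search over candidate sample sizes,
# in the spirit of the anticipated posterior (sampling) distribution framework the
# review describes. The data-generating model, MAR mechanism, mean imputation,
# and slope target below are assumptions, not values from the paper.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)

def simulate(n, beta=(-1.0, 0.8, 0.5, -0.4)):
    """Draw predictors and a binary outcome from an assumed logistic model."""
    X = rng.normal(size=(n, 3))
    lp = beta[0] + X @ np.asarray(beta[1:])
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-lp)))
    return X, y

def impose_mar(X, y, base_rate=0.25):
    """Make one predictor missing at random, depending on the outcome and another predictor."""
    logit = np.log(base_rate / (1 - base_rate)) + 0.5 * y + 0.5 * X[:, 0]
    X = X.copy()
    X[rng.uniform(size=len(y)) < 1.0 / (1.0 + np.exp(-logit)), 1] = np.nan
    return X

def mean_impute(X):
    """Single (mean) imputation; the paper also considers multiple imputation."""
    X = X.copy()
    means = np.nanmean(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = means[cols]
    return X

def calibration_slope(coefs, X_val, y_val):
    """Slope of the developed model's linear predictor, recalibrated in fully observed data."""
    lp = sm.add_constant(X_val) @ coefs
    return sm.Logit(y_val, sm.add_constant(lp)).fit(disp=0).params[1]

def assurance(n, n_sim=200, target=(0.9, 1.1)):
    """Estimate Pr(calibration slope within target) for development sample size n."""
    X_val, y_val = simulate(50_000)          # large, fully observed validation set
    hits = 0
    for _ in range(n_sim):
        X, y = simulate(n)
        fit = sm.Logit(y, sm.add_constant(mean_impute(impose_mar(X, y)))).fit(disp=0)
        hits += target[0] <= calibration_slope(fit.params, X_val, y_val) <= target[1]
    return hits / n_sim

n_min = 400                                   # stand-in for a closed-form minimum size
for factor in (1.0, 1.25, 1.5, 2.0):          # candidate inflation factors
    print(factor, assurance(int(factor * n_min)))
```

In a search of this kind, the inflation factor the paper reports (up to roughly twice the minimum) would correspond to the point at which performance under missingness becomes comparable to that of models developed on fully observed data.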

Load-bearing premise

The simulation scenarios and imputation methods used accurately reflect the patterns of missing data and handling strategies encountered in real clinical prediction model development.

What would settle it

Observing in a new simulation or real dataset that models developed with the adjusted larger sample size still show calibration slopes below 0.9 would challenge the claim that the method sufficiently accounts for missing data effects.

Original abstract

Clinical prediction models must be developed using sufficiently large datasets to minimise overfitting and ensure robust predictive performance. Existing sample size calculations assume complete predictor data for all included participants, yet missing values are common and may increase required sample sizes. This study aimed to quantify how missing predictor data and different imputation methods affect overfitting and model degradation, within datasets that adhere to current sample size criteria. We also aimed to explore how a general sample size framework based on anticipated posterior (sampling) distributions can be adapted to incorporate missing data assumptions and handling strategies. Using a simulation study, we found that in development data meeting current minimum sample size requirements, missing data reduced predictive performance, with expected calibration slopes frequently falling below the targeted value of 0.9. Increasing the required sample size to account for missing data reduced overfitting concerns, but the necessary inflation factor was context specific. In some scenarios, up to twice the minimum sample size was needed to achieve performance comparable to models developed using fully observed data. Expected value of perfect information calculations allowed quantification of the expected loss due to finite samples and missingness. Through two applied examples, we illustrate how embedding missing data assumptions and handling within the posterior sampling approach provides a principled way to determine required minimum sample sizes under missing data. Overall, missing predictor data increases minimum sample size requirements to develop stable and well-calibrated models. Our adaptations to recent posterior (sampling) sample size calculations offer a practical approach for incorporating missing data directly into sample size calculations.
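
The abstract invokes expected value of perfect information without restating it. In value-of-information analyses for risk prediction (a plausible but assumed reading of the paper's formulation), EVPI is the gap between deciding with full knowledge of the parameters and deciding under the anticipated posterior:

```latex
% Generic value-of-information form (an assumed formulation, not quoted from the
% paper): NB is net benefit at a chosen risk threshold, d ranges over decision
% strategies (use the developed model, treat all, treat none), and theta over the
% anticipated posterior (sampling) distribution of the model parameters.
\mathrm{EVPI}
  = \mathbb{E}_{\theta}\!\left[ \max_{d} \mathrm{NB}(d, \theta) \right]
  - \max_{d}\, \mathbb{E}_{\theta}\!\left[ \mathrm{NB}(d, \theta) \right]
```

Read this way, finite development samples and missing predictor data widen the anticipated distribution of the parameters, so EVPI grows, which is consistent with the abstract's use of EVPI to quantify the expected loss due to finite samples and missingness.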

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper uses a simulation study to demonstrate that missing predictor data in development datasets meeting current sample size criteria leads to degraded calibration (e.g., slopes frequently below 0.9) and increased overfitting risk. It quantifies context-specific sample size inflation factors (up to 2x in some scenarios) and adapts a posterior sampling framework for sample size calculations to incorporate missing data assumptions and imputation strategies, with illustrations via expected value of perfect information and two applied examples.

Significance. If the simulation results prove robust, this provides a timely practical extension to existing sample size guidelines for clinical prediction models, where missing data is common. The adaptation of the posterior sampling approach and use of EVPI calculations to quantify losses from finite samples and missingness represent clear strengths, offering a principled method beyond ad-hoc adjustments.

major comments (2)
  1. Simulation study section: The manuscript does not provide sufficient detail on the number of scenarios examined, the specific missingness rates and mechanisms (MAR/MNAR), the imputation methods (single vs. multiple), or how performance metrics such as calibration slope were aggregated across replications. This is load-bearing for the central claims about performance degradation and the derived inflation factors, as the skeptic note highlights potential mismatch with real-world clinical data patterns.
  2. Adaptation of posterior sampling framework (methods/results): It is unclear how missing data and imputation variability are formally embedded into the anticipated posterior distributions used for sample size determination (e.g., whether imputation uncertainty is propagated through the sampling or treated separately). Without this, the practical recommendations for adjusted sample sizes lack a fully specified derivation.
minor comments (1)
  1. Abstract: The reference to 'posterior (sampling) sample size calculations' assumes familiarity with the base method; a brief parenthetical citation or one-sentence description would improve accessibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback and positive assessment of the significance of our work. We address each major comment below, providing clarifications and outlining revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: Simulation study section: The manuscript does not provide sufficient detail on the number of scenarios examined, the specific missingness rates and mechanisms (MAR/MNAR), the imputation methods (single vs. multiple), or how performance metrics such as calibration slope were aggregated across replications. This is load-bearing for the central claims about performance degradation and the derived inflation factors, as the skeptic note highlights potential mismatch with real-world clinical data patterns.

    Authors: We agree that greater detail on the simulation design is required to substantiate our central claims. In the revised manuscript, we will expand the Simulation study section to explicitly state the number of scenarios examined (covering variations in sample size, predictor count, event prevalence, and missing data proportions), the missingness rates (5% to 40%) and mechanisms (MAR and MNAR), the imputation methods (single imputation by mean/mode versus multiple imputation by chained equations with 10 imputations), and the aggregation of metrics (median and interquartile range of calibration slope across 5000 replications per scenario). We will also add a paragraph discussing the alignment of our simulated patterns with typical clinical datasets to address generalizability concerns. revision: yes

  2. Referee: Adaptation of posterior sampling framework (methods/results): It is unclear how missing data and imputation variability are formally embedded into the anticipated posterior distributions used for sample size determination (e.g., whether imputation uncertainty is propagated through the sampling or treated separately). Without this, the practical recommendations for adjusted sample sizes lack a fully specified derivation.

    Authors: We acknowledge that the formal integration of missing data and imputation variability into the anticipated posterior distributions requires more explicit description. The current adaptation modifies the posterior by incorporating the expected reduction in information from missing predictors and by averaging over imputation variability within the sampling process. In the revised manuscript, we will add a dedicated subsection in Methods with the full mathematical derivation, specifying that imputation uncertainty is propagated by integrating over multiple imputed datasets when constructing the anticipated posterior (rather than handled separately). This will provide the complete, step-by-step specification supporting the adjusted sample size recommendations and EVPI calculations. revision: yes
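
The response above says imputation uncertainty is propagated by integrating over multiple imputed datasets when constructing the anticipated posterior. A minimal sketch of one standard way to do this, pooling per-imputation fits with Rubin's rules so the pooled mean and total variance can parameterise a normal approximation of the anticipated posterior, is below; the helper names, the use of statsmodels' MICEData, and the chained-equations setup are illustrative assumptions, not the authors' code.

```python
# Illustrative sketch (not the authors' implementation): propagate multiple-imputation
# uncertainty into an approximate anticipated posterior for one coefficient by pooling
# per-imputation fits with Rubin's rules.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.imputation.mice import MICEData

def pooled_coefficient(df: pd.DataFrame, outcome: str, predictor: str, m: int = 10):
    """Fit a logistic model on each of m imputed datasets and pool one coefficient."""
    imp = MICEData(df)                          # chained-equations imputation
    estimates, within = [], []
    for _ in range(m):
        imp.update_all()                        # draw a fresh imputed dataset
        d = imp.data
        X = sm.add_constant(d.drop(columns=[outcome]))
        fit = sm.Logit(d[outcome], X).fit(disp=0)
        estimates.append(fit.params[predictor])
        within.append(fit.bse[predictor] ** 2)
    q_bar = np.mean(estimates)                  # pooled point estimate
    w_bar = np.mean(within)                     # average within-imputation variance
    b = np.var(estimates, ddof=1)               # between-imputation variance
    total_var = w_bar + (1 + 1 / m) * b         # Rubin's rules total variance
    return q_bar, total_var

# The pooled (q_bar, total_var) pair can then feed a normal approximation to the
# coefficient's anticipated posterior, so a sample size search reflects both
# finite-sample and imputation uncertainty rather than treating them separately.
```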

Circularity Check

0 steps flagged

No circularity: claims rest on new simulation results and applied examples

Full rationale

The paper derives its conclusions from a dedicated simulation study that empirically measures performance degradation under missing data scenarios and then illustrates adaptations of an existing posterior sampling framework via two applied examples. No load-bearing step reduces a claimed prediction or minimum sample size formula to a fitted parameter or self-citation by construction; the inflation factors and EVPI calculations are outputs of the simulations rather than tautological rearrangements of the inputs. Self-citations to prior sample-size work are present but non-load-bearing, as the novel contribution is the missing-data extension demonstrated through independent simulation evidence.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim relies on simulation-based evidence and two applied examples rather than new axioms or free parameters; the sample size inflation is determined empirically from simulations under standard missing data assumptions.

axioms (1)
  • domain assumption: Missing data mechanisms are missing at random or can be appropriately handled by standard imputation methods such as multiple imputation.
    The paper explores different imputation methods and their impact, invoking standard statistical assumptions for missing data handling.

pith-pipeline@v0.9.0 · 5577 in / 1389 out tokens · 87390 ms · 2026-05-11T00:51:58.830144+00:00 · methodology


Reference graph

Works this paper leans on

17 extracted references · 16 canonical work pages · 1 internal anchor

  1. [1]

    Figure 3: Assurance probability of the calibration slope being between 0.9 and 1.1 for each model across different sample sizes relative to the closed-form minimum sample size calculation, and different levels of missing data. Panel (A) shows the results estimated within the fully observed target population and Panel (B) shows the results estimated within t...

  2. [2]

    D.; Ensor, J.; Snell, K

    (3) Riley, R. D.; Ensor, J.; Snell, K. I. E.; Archer, L.; Whittle, R.; Dhiman, P.; Alderman, J.; Liu, X.; Kirton, L.; Manson-Whitton, J.; van Smeden, M.; Moons, K. G.; Nirantharakumar, K.; Cazier, J.-B.; Denniston, A. K.; Van Calster, B.; Collins, G. S. Importance of Sample Size on the Quality and Utility of AI-Based Prediction Models for Healthcare. Lan...

  3. [3]

    (5) van Smeden, M.; Moons, K

    https://doi.org/10.1186/s12874-016-0267-3. (5) van Smeden, M.; Moons, K. G.; de Groot, J. A.; Collins, G. S.; Altman, D. G.; Eijkemans, M. J.; Reitsma, J. B. Sample Size for Binary Logistic Prediction Models: Beyond Events per Variable Criteria. Stat Methods Med Res 2019, 28 (8), 2455–2474. https://doi.org/10.1177/0962280218784726. (6) Riley, R. D.; Snell...

  4. [4]

    (12) Riley, R

    https://doi.org/10.1186/s12874-026-02856-7. (12) Riley, R. D.; Collins, G. S.; Whittle, R.; Archer, L.; Snell, K. I. E.; Dhiman, P.; Kirton, L.; Legha, A.; Liu, X.; Denniston, A. K.; Harrell, F. E.; Wynants, L.; Martin, G. P.; Ensor, J. A Decomposition of Fisher’s Information to Inform Sample Size for Developing or Updating Fair and Precise Clinical...

  5. [5]

    (13) Riley, R

    https://doi.org/10.1186/s41512-025-00193-9. (13) Riley, R. D.; Collins, G. S.; Archer, L.; Whittle, R.; Legha, A.; Kirton, L.; Dhiman, P.; Sadatsafavi, M.; Adderley, N. J.; Alderman, J.; Martin, G. P.; Ensor, J. A Decomposition of Fisher’s Information to Inform Sample Size for Developing or Updating Fair and Precise Clinical Prediction Models - Part 2: ...

  6. [6]

    (14) Whittle, R.; Riley, R

    https://doi.org/10.1186/s41512-025-00204-9. (14) Whittle, R.; Riley, R. D.; Archer, L.; Collins, G. S.; Legha, A.; Snell, K. I.; Ensor, J. A Decomposition of Fisher’s Information to Inform Sample Size for Developing or Updating Fair and Precise Clinical Prediction Models – Part 3: Continuous Outcomes. Diagn. Progn. Res. 2026, 10 (1),

  7. [7]

    (15) Pate, A.; Riley, R

    https://doi.org/10.1186/s41512-026-00228-9. (15) Pate, A.; Riley, R. D.; Collins, G. S.; van Smeden, M.; Van Calster, B.; Ensor, J.; Martin, G. P. Minimum Sample Size for Developing a Multivariable Prediction Model Using Multinomial Logistic Regression. Stat Methods Med Res 2023, 32 (3), 555–571. https://doi.org/10.1177/09622802231151220. (16) Pavlou, M....

  8. [9]

    (18) Archer, L.; Snell, K

    https://doi.org/10.1186/s12874-024-02268-5. (18) Archer, L.; Snell, K. I. E.; Ensor, J.; Hudda, M. T.; Collins, G. S.; Riley, R. D. Minimum Sample Size for External Validation of a Clinical Prediction Model with a Continuous Outcome. Stat Med 2021, 40 (1), 133–146. https://doi.org/10.1002/sim.8766. (19) Riley, R. D.; Debray, T. P. A.; Collins, G. S.; Arc...

  9. [10]

    Language Model Cascades: Token-Level Uncertainty and Beyond

    https://doi.org/10.48550/ARXIV.2504.06799. (28) Hoogland, J.; van Barreveld, M.; Debray, T. P. A.; Reitsma, J. B.; Verstraelen, T. E.; Dijkgraaf, M. G. W.; Zwinderman, A. H. Handling Missing Predictor Values When Validating and Applying a Prediction Model to New Patients. Stat Med 2020, 39 (25), 3591–3607. https://doi.org/10.1002/sim.8682. (29) Janssen,...

  10. [11]

    (34) Albu, E.; Gao, S.; Wynants, L.; Van Calster, B

    https://doi.org/10.1186/1471-2288-9-57. (34) Albu, E.; Gao, S.; Wynants, L.; Van Calster, B. missForestPredict—Missing Data Imputation for Prediction Settings. PLOS One 2025, 20 (11), e0334125. https://doi.org/10.1371/journal.pone.0334125. (35) Nijman, S. W. J.; Groenhof, T. K. J.; Hoogland, J.; Bots, M. L.; Brandjes, M.; Jacobs, J. J. L.; Asselbergs, F....

  11. [12]

    (39) Reilly, B

    https://doi.org/10.1136/bmj-2023-074820. (39) Reilly, B. M.; Evans, A. T. Translating Clinical Research into Clinical Practice: Impact of Using Prediction Rules To Make Decisions. Ann. Intern. Med. 2006, 144 (3), 201–209. https://doi.org/10.7326/0003-4819-144-3-200602070-00009. (40) Localio, A. R.; Goodman, S. Beyond the Usual Prediction Accuracy Metrics:...

  12. [13]

    (42) Vickers, A

    https://doi.org/10.1186/s41512-019-0064-7. (42) Vickers, A. J.; Elkin, E. B. Decision Curve Analysis: A Novel Method for Evaluating Prediction Models. Med. Decis. Making 2006, 26 (6), 565–574. https://doi.org/10.1177/0272989X06295361. (43) Sadatsafavi, M.; Yoon Lee, T.; Gustafson, P. Uncertainty and the Value of Information in Risk Prediction Modeling. M...

  13. [14]

    (46) Altman, D

    https://doi.org/10.1186/s41512-020-00082-3. (46) Altman, D. G.; Andersen, P. K. Bootstrap Investigation of the Stability of a Cox Regression Model. Stat. Med. 1989, 8 (7), 771–783. https://doi.org/10.1002/sim.4780080702. (47) Thoemmes, F.; Mohan, K. Graphical Representation of Missing Data Problems. Struct. Equ. Model. Multidiscip. J. 2015, 22 (4), 631–...

  14. [15]

    The “Why” behind Including “Y” in Your Imputation Model

    (54) D’Agostino McGowan, L.; Lotspeich, S. C.; Hepler, S. A. The “Why” behind Including “Y” in Your Imputation Model. Stat. Methods Med. Res. 2024, 33 (6), 996–1020. https://doi.org/10.1177/09622802241244608. (55) Sadatsafavi, M.; Lee, T. Y.; Wynants, L.; Vickers, A. J.; Gustafson, P. Value-of-Information Analysis for External Validation of Risk Predict...

  15. [16]

    (57) Friedman, J.; Hastie, T.; Tibshirani, R

    https://doi.org/10.21105/joss.01686. (57) Friedman, J.; Hastie, T.; Tibshirani, R. Regularization Paths for Generalized Linear Models via Coordinate Descent. J. Stat. Softw. 2010, 33 (1). https://doi.org/10.18637/jss.v033.i01. (58) Ensor, J. Pmsampsize: Sample Size for Development of a Prediction Model, 2019, 1.1.3. https://doi.org/10.32614/CRAN.package.p...

  16. [17]

    (63) Pate, A.; Martin, G

    https://doi.org/10.1186/s41512-021-00096-5. (63) Pate, A.; Martin, G. P.; Riley, R. D. Agreement between Heuristic Shrinkage Factor and Optimal Shrinkage Factors in Logistic Regression for Risk Prediction: A Simulation Study across Different Sample Sizes and Settings. BMC Diagn. Progn. Res. In Press. Supplementary Material: Incorporating Missing Data C...

  17. [18]

    P.; Riley, R

    Supplementary References (1) Martin, G. P.; Riley, R. D.; Collins, G. S.; Sperrin, M. Developing Clinical Prediction Models When Adhering to Minimum Sample Size Recommendations: The Importance of Quantifying Bootstrap Variability in Tuning Parameters and Predictive Performance. Stat Methods Med Res 2021, 30 (12), 2545–2561. https://doi.org/10.1177/096...