Influence of continuous predictor modelling methods on prediction stability in clinical prediction model development: an empirical comparison using real clinical data
Pith reviewed 2026-06-27 21:17 UTC · model grok-4.3
The pith
The choice of modeling method for continuous predictors affects how stable clinical predictions remain across repeated samples, with linear terms reaching stability at smaller sizes than quadratic, fractional polynomial or boosting approach
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Continuous predictor modelling methods appeared to influence prediction stability. LIN achieved stable predictions from the base sample size onwards, whereas QUA, MFP, and XGB required larger samples. Although XGB showed high discrimination, calibration concerns persisted.
What carries the argument
Bootstrap-based stability criterion requiring at least 90 percent of individual predictions to have MAPE of 5 percent or less, applied across six modeling methods and five sample-size scenarios.
If this is right
- LIN yields stable predictions even in the smallest tested samples where other methods fail.
- QUA reaches the stability criterion once the sample reaches approximately 1,748 patients.
- MFP and XGB require samples of at least 3,496 patients before most individual predictions meet the stability threshold.
- All six methods produce stable predictions once the sample reaches 8,739 patients.
- XGB maintains the highest discrimination throughout but shows persistent miscalibration.
Where Pith is reading between the lines
- Developers working with limited data may favor linear terms to obtain reliable individual predictions without waiting for larger samples.
- The observed stability differences imply that sample-size calculations for new clinical models should account for the intended modeling approach.
- The calibration problems seen with XGB suggest that discrimination alone is insufficient for choosing a method when individual predictions matter.
Load-bearing premise
The 90 percent MAPE threshold of 5 percent or less within the bootstrap framework supplies a clinically meaningful and unbiased way to compare stability across modeling methods.
What would settle it
A replication on similar clinical data in which quadratic, MFP or XGB methods reach the 90 percent MAPE stability threshold at the same small sample size as linear terms would falsify the claim that modeling choices differentially affect stability.
read the original abstract
Background and objective: Prediction stability is increasingly recognised as important for reliable clinical prediction model development, but the effect of continuous predictor modelling choices is unclear. This study examined how approaches to modelling continuous predictors influence prediction stability. Methods: We used a real clinical dataset of 19,418 emergency department patients to create five sample size scenarios ranging from 437 to 8,739 patients. Six methods were compared: dichotomisation at the median (DIC), tertile categorisation (TER), linear terms (LIN), quadratic terms (QUA), multivariable fractional polynomials (MFP), and extreme gradient boosting (XGB). Prediction stability was evaluated using a bootstrap-based framework. Optimism-corrected AUC and calibration were estimated through internal validation. A method was considered stable when at least 90% of individual predictions had a mean absolute prediction error (MAPE) <=5%. Results: Stability increased with sample size and varied by method. At n = 437, no method met the stability criterion; LIN was the most stable, followed by DIC. At n = 874, DIC and LIN achieved stable predictions with similar calibration, although DIC had lower AUC. At n = 1,748, QUA achieved stability, whereas MFP and XGB did not. At n = 3,496 and n = 8,739, all methods achieved stability. LIN, QUA, MFP, and XGB generally had higher AUCs than DIC and TER, while XGB showed the highest AUC but persistent miscalibration. Conclusion: Continuous predictor modelling methods appeared to influence prediction stability. LIN achieved stable predictions from the base sample size onwards, whereas QUA, MFP, and XGB required larger samples. Although XGB showed high discrimination, calibration concerns persisted. These findings suggest that, in smaller datasets, simpler approaches, particularly LIN, may provide more stable predictions.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript reports an empirical comparison of six continuous-predictor modelling approaches (dichotomisation, tertiles, linear terms, quadratic terms, MFP, XGB) on prediction stability using a single 19,418-patient ED cohort subsampled into five sizes (437–8,739). Stability is defined via bootstrap as the point at which ≥90 % of individual predictions satisfy MAPE ≤5 %; secondary outcomes are optimism-corrected AUC and calibration. The central claim is that linear terms reach the stability threshold at the smallest n, while quadratic, MFP and XGB require larger samples, and that XGB exhibits the highest AUC but persistent miscalibration.
Significance. If the ordering of methods is robust to the stability definition, the work supplies practical, data-driven guidance for choosing modelling strategies in smaller clinical datasets where stability matters. Strengths include the use of real clinical data, a range of sample sizes, and bootstrap-based internal validation; these elements make the comparison more informative than purely simulated studies.
major comments (2)
- [Methods] Methods (stability definition paragraph): the threshold “at least 90 % of individual predictions with MAPE ≤5 %” is introduced without sensitivity checks on either the 90 % or 5 % cut-off, without comparison to alternative metrics (Brier score, absolute probability error), and without clinical justification. Because all five sample-size scenarios are subsamples of the identical 19,418-patient cohort, any change in the rule directly alters which methods cross the stability threshold at which n, rendering the reported ordering (LIN earliest, then QUA/MFP/XGB) dependent on an unvalidated operational choice.
- [Results] Results (sample-size scenario comparisons): the bootstrap procedure used to compute MAPE for each individual prediction is not described in sufficient detail (e.g., whether the same bootstrap samples are used for model fitting and for stability assessment, how out-of-bag predictions are handled, and whether the 90 % rule is applied per bootstrap replicate or aggregated). This detail is load-bearing for the claim that LIN is stable from n = 874 onward while the other flexible methods are not.
minor comments (2)
- [Abstract] The abstract states that “at n = 874, DIC and LIN achieved stable predictions” but does not report the exact percentages or confidence intervals around the 90 % criterion; adding these numbers would improve transparency.
- [Tables/Figures] Table or figure legends should explicitly state the number of bootstrap replicates and the exact definition of MAPE (mean absolute prediction error on the probability scale) to allow readers to reproduce the stability calculations.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments, which highlight important aspects of our stability definition and bootstrap implementation. We address each major comment below and indicate planned revisions to strengthen the manuscript.
read point-by-point responses
-
Referee: [Methods] Methods (stability definition paragraph): the threshold “at least 90 % of individual predictions with MAPE ≤5 %” is introduced without sensitivity checks on either the 90 % or 5 % cut-off, without comparison to alternative metrics (Brier score, absolute probability error), and without clinical justification. Because all five sample-size scenarios are subsamples of the identical 19,418-patient cohort, any change in the rule directly alters which methods cross the stability threshold at which n, rendering the reported ordering (LIN earliest, then QUA/MFP/XGB) dependent on an unvalidated operational choice.
Authors: We agree that the specific 90% and 5% thresholds represent an operational choice that could influence the reported ordering, and that additional justification and robustness checks are warranted. The thresholds were selected to reflect a stringent requirement for the majority of predictions to exhibit low error, consistent with aims for reliable clinical use, but we acknowledge the absence of sensitivity analyses in the original submission. In the revised manuscript we will add a dedicated sensitivity analysis subsection in Methods and Results. This will include: (i) re-running the stability assessment across a grid of thresholds (proportion: 80%, 85%, 90%, 95%; MAPE: 3%, 5%, 7%, 10%); (ii) alternative stability definitions based on Brier score and mean absolute probability error; and (iii) explicit reporting of whether the relative ordering of methods (LIN earliest, followed by QUA/MFP/XGB) remains consistent. We will also add a short clinical rationale paragraph referencing literature on acceptable prediction error in emergency-department risk models. These additions will directly address the dependence on the chosen rule. revision: yes
-
Referee: [Results] Results (sample-size scenario comparisons): the bootstrap procedure used to compute MAPE for each individual prediction is not described in sufficient detail (e.g., whether the same bootstrap samples are used for model fitting and for stability assessment, how out-of-bag predictions are handled, and whether the 90 % rule is applied per bootstrap replicate or aggregated). This detail is load-bearing for the claim that LIN is stable from n = 874 onward while the other flexible methods are not.
Authors: We apologise for the insufficient detail on the bootstrap workflow. The procedure separated model fitting from stability evaluation: for each subsample size, 1,000 bootstrap replicates were drawn to fit each modelling method; an independent set of 1,000 bootstrap replicates was then used to generate predictions on the original subsample, employing out-of-bag observations for the stability metric. The 90% rule was applied after aggregation across all stability-assessment replicates (i.e., the proportion of individuals whose average MAPE across replicates met the threshold). We will expand the Methods section with a new subsection titled “Bootstrap-based stability assessment” that includes a step-by-step description, a schematic diagram, and explicit statements on the separation of fitting and evaluation samples, handling of out-of-bag predictions, and aggregation of the 90% criterion. This clarification will be cross-referenced in the Results when presenting the n = 874 findings. revision: yes
Circularity Check
No circularity: purely empirical comparison with operational metric applied to data
full rationale
The paper performs an empirical bootstrap-based comparison of modeling methods on subsamples from one fixed clinical cohort. Stability is defined once in Methods as an operational threshold (≥90% of predictions with MAPE≤5%) and then measured directly; the ordering of methods by sample size at which the threshold is crossed is an observed outcome, not a quantity that reduces to the definition by construction. No equations, fitted parameters renamed as predictions, self-citation chains, or uniqueness theorems appear. The study is self-contained against external benchmarks and receives the default non-finding.
Axiom & Free-Parameter Ledger
free parameters (2)
- stability criterion (90% of predictions with MAPE <=5%)
- sample size scenarios (437, 874, 1748, 3496, 8739)
axioms (2)
- domain assumption The bootstrap resampling framework accurately estimates prediction stability for individual patients.
- domain assumption The real clinical dataset of 19,418 emergency patients is representative for testing modeling choices in clinical prediction.
Reference graph
Works this paper leans on
-
[1]
Prognosis Research in Health Care: Concepts, Methods, and Impact
Riley RD, Van Der Windt D, Croft P, Moons KGM, editors. Prognosis Research in Health Care: Concepts, Methods, and Impact. 1st edition. Oxford University Press; 2019. https://doi.org/10.1093/med/9780198796619.001.0001
-
[2]
Prognosis Research Strategy (PROGRESS) 3: Prognostic Model Research
Steyerberg EW, Moons KGM, van der Windt DA, Hayden JA, Perel P, Schroter S, et al. Prognosis Research Strategy (PROGRESS) 3: Prognostic Model Research. PLoS Med. 2013;10:e1001381. https://doi.org/10.1371/journal.pmed.1001381
-
[3]
Collins GS, Omar O, Shanyinde M, Yu L-M. A systematic review finds prediction models for chronic kidney disease were poorly reported and often developed using inappropriate methods. J Clin Epidemiol. 2013;66:268–77. https://doi.org/10.1016/j.jclinepi.2012.06.020
-
[4]
Helmrich IRAR, Mikolić A, Kent DM, Lingsma HF, Wynants L, Steyerberg EW, et al. Does poor methodological quality of prediction modeling studies translate to poor model performance? An illustration in traumatic brain injury. Diagn Progn Res. 2022;6:8. https://doi.org/10.1186/s41512-022-00122-0
-
[5]
Andaur Navarro CL, Damen JAA, Takada T, Nijman SWJ, Dhiman P, Ma J, et al. Systematic review finds “spin” practices and poor reporting standards in studies on machine learning-based prediction models. J Clin Epidemiol. 2023;158:99–110. https://doi.org/10.1016/j.jclinepi.2023.03.024
-
[6]
Clinical Prediction Models: A Practical Approach to Development, Validation, and Updating
Steyerberg EW. Clinical Prediction Models: A Practical Approach to Development, Validation, and Updating. Springer Science & Business Media; 2008
2008
-
[7]
Clinical prediction models and the multiverse of madness
Riley RD, Pate A, Dhiman P, Archer L, Martin GP, Collins GS. Clinical prediction models and the multiverse of madness. BMC Med. 2023;21:502. https://doi.org/10.1186/s12916-023-03212- y
-
[8]
Pate A, Emsley R, Sperrin M, Martin GP, Van Staa T. Impact of sample size on the stability of risk scores from clinical prediction models: a case study in cardiovascular disease. Diagn Progn Res. 2020;4:14. https://doi.org/10.1186/s41512-020-00082-3
-
[9]
Stability of clinical prediction models developed using statistical or machine learning methods
Riley RD, Collins GS. Stability of clinical prediction models developed using statistical or machine learning methods. Biom J Biom Z. 2023;65:e2200302. https://doi.org/10.1002/bimj.202200302
-
[10]
Bootstrap investigation of the stability of a Cox regression model
Altman DG, Andersen PK. Bootstrap investigation of the stability of a Cox regression model. Stat Med. 1989;8:771–83. https://doi.org/10.1002/sim.4780080702
-
[11]
A bootstrap resampling procedure for model building: Application to the cox regression model
Sauerbrei W, Schumacher M. A bootstrap resampling procedure for model building: Application to the cox regression model. Stat Med. 1992;11:2093–109. https://doi.org/10.1002/sim.4780111607
-
[12]
On stability issues in deriving multivariable regression models
Sauerbrei W, Buchholz A, Boulesteix A-L, Binder H. On stability issues in deriving multivariable regression models. Biom J Biom Z. 2015;57:531–55. https://doi.org/10.1002/bimj.201300222
-
[13]
Royston P, Sauerbrei W. Stability of multivariable fractional polynomial models with selection of variables and transformations: a bootstrap investigation. Stat Med. 2003;22:639–59. https://doi.org/10.1002/sim.1310
-
[14]
Sauerbrei W, Boulesteix A-L, Binder H. Stability Investigations of Multivariable Regression Models Derived from Low- and High-Dimensional Data. J Biopharm Stat. 2011;21:1206–31. https://doi.org/10.1080/10543406.2011.629890
-
[15]
Leeuwenberg AM, Van Smeden M, Langendijk JA, Van Der Schaaf A, Mauer ME, Moons KGM, et al. Performance of binary prediction models in high-correlation low-dimensional settings: a comparison of methods. Diagn Progn Res. 2022;6:1. https://doi.org/10.1186/s41512- 021-00115-5
-
[16]
To tune or not to tune, a case study of ridge logistic regression in small or sparse datasets
Šinkovec H, Heinze G, Blagus R, Geroldinger A. To tune or not to tune, a case study of ridge logistic regression in small or sparse datasets. BMC Med Res Methodol. 2021;21:199. https://doi.org/10.1186/s12874-021-01374-y
-
[17]
Ma J, Dhiman P, Qi C, Bullock G, van Smeden M, Riley RD, et al. Poor handling of continuous predictors in clinical prediction models using logistic regression: a systematic review. J Clin Epidemiol. 2023;161:140–51. https://doi.org/10.1016/j.jclinepi.2023.07.017
-
[18]
Dichotomizing continuous predictors in multiple regression: a bad idea
Royston P, Altman DG, Sauerbrei W. Dichotomizing continuous predictors in multiple regression: a bad idea. Stat Med. 2006;25:127–41. https://doi.org/10.1002/sim.2331
-
[19]
Multivariable Model - Building: A Pragmatic Approach to Regression Anaylsis based on Fractional Polynomials for Modelling Continuous Variables
Royston P, Sauerbrei W. Multivariable Model - Building: A Pragmatic Approach to Regression Anaylsis based on Fractional Polynomials for Modelling Continuous Variables. John Wiley & Sons; 2008
2008
-
[20]
Predicting Hospital Admission of Patients at Triage in the Emergency Department at Lampang Hospital
Seesuwan N, Lokeskrawee T, Lawanaskol S, Patumanond J. Predicting Hospital Admission of Patients at Triage in the Emergency Department at Lampang Hospital. Biomed Sci Clin Med. 2025;64:23–32
2025
-
[21]
Riley RD, Snell KI, Ensor J, Burke DL, Jr FEH, Moons KG, et al. Minimum sample size for developing a multivariable prediction model: P ART II - binary and time-to-event outcomes. Stat Med. 2019;38:1276–96. https://doi.org/10.1002/sim.7992
-
[22]
Snell KIE, Archer L, Ensor J, Bonnett LJ, Debray TPA, Phillips B, et al. External validation of clinical prediction models: simulation-based sample size calculations were more reliable than rules-of-thumb. J Clin Epidemiol. 2021;135:79–89. https://doi.org/10.1016/j.jclinepi.2021.02.011
-
[23]
Legha A, Ensor J, Whittle R, Archer L, Van Calster B, Christodoulou E, et al. Sequential sample size calculations and learning curves safeguard the robust development of a clinical prediction model for individuals. J Clin Epidemiol. 2026;191:112117. https://doi.org/10.1016/j.jclinepi.2025.112117
-
[24]
Ragland DR. Dichotomizing continuous outcome variables: dependence of the magnitude of association and statistical power on the cutpoint. Epidemiol Camb Mass. 1992;3:434–40. https://doi.org/10.1097/00001648-199209000-00009
-
[25]
Hu Y , Zhang X, Slavin V , Belsti Y , Tiruneh SA, Callander E, et al. Beyond Comparing Machine Learning and Logistic Regression in Clinical Prediction Modelling: Shifting from Model Debate to Data Quality. J Med Internet Res. 2025;27:e77721. https://doi.org/10.2196/77721
-
[26]
Ploeg T, Austin PC, Steyerberg EW. Modern modelling techniques are data hungry: a simulation study for pr edicting dichotomous endpoints. BMC Med Res Methodol. 2014;14:137. https://doi.org/10.1186/1471-2288-14-137
-
[27]
Austin PC, Lee DS, Wang B. The relative data hungriness of unpenalized and penalized logistic regression and ensemble-based machine learning methods: the case of calibration. Diagn Progn Res. 2024;8:15. https://doi.org/10.1186/s41512-024-00179-z
-
[28]
Austin PC, Harrell FE, Steyerberg EW. Predictive performance of machine and statistical learning methods: Impact of data-generating processes on external validity in the “large N, small p” setting. Stat Methods Med Res. 2021;30:1465–83. https://doi.org/10.1177/09622802211002867
-
[29]
Sample size for binary logistic prediction models: Beyond events per variable criteria
van Smeden M, Moons KG, de Groot JA, Collins GS, Altman DG, Eijkemans MJ, et al. Sample size for binary logistic prediction models: Beyond events per variable criteria. Stat Methods Med Res. 2019;28:2455–74. https://doi.org/10.1177/0962280218784726
-
[30]
No rationale for 1 variable per 10 events criterion for binary logistic regression analysis
van Smeden M, de Groot JAH, Moons KGM, Collins GS, Altman DG, Eijkemans MJC, et al. No rationale for 1 variable per 10 events criterion for binary logistic regression analysis. BMC Med Res Methodol. 2016;16:163. https://doi.org/10.1186/s12874-016-0267-3. Acknowledgements This study was partially supported by Chiang Mai University and the Faculty of Medici...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.