Influence of continuous predictor modelling methods on prediction stability in clinical prediction model development: an empirical comparison using real clinical data

Natthanaphop Isaradech; Noppadon Seesuwan; Noraworn Jirattikanwong; Pakpoom Wongyikul; Phichayut Phinyo; Suppachai Lawanaskol; Wachiranun Sirikul; Wuttipat Kiratipaisarl

arxiv: 2606.07052 · v1 · pith:I4TKZRVVnew · submitted 2026-06-05 · 📊 stat.ME

Phichayut Phinyo, Pakpoom Wongyikul, Noraworn Jirattikanwong, Natthanaphop Isaradech, Wuttipat Kiratipaisarl, Suppachai Lawanaskol, Noppadon Seesuwan, Wachiranun Sirikul

Pith reviewed 2026-06-27 21:17 UTC · model grok-4.3

classification 📊 stat.ME

keywords: prediction stability · continuous predictors · clinical prediction models · linear terms · multivariable fractional polynomials · extreme gradient boosting · sample size · bootstrap validation

The pith

The choice of modeling method for continuous predictors affects how stable clinical predictions remain across repeated samples, with linear terms reaching stability at smaller sample sizes than quadratic, fractional polynomial, or boosting approaches.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The study tested six approaches to handling continuous predictors on real data from 19,418 emergency department patients, creating five sample-size scenarios from 437 to 8,739 patients. Stability was measured by a bootstrap procedure: a method counted as stable when at least 90 percent of individual predictions had a mean absolute prediction error (MAPE) of 5 percent or less. No method met this criterion at n = 437. Linear terms and median dichotomization met it from n = 874, quadratic terms from n = 1,748, and all six methods, including multivariable fractional polynomials and extreme gradient boosting, at n = 3,496 and n = 8,739. Dichotomization stabilized as early as linear terms, but dichotomization and tertiles both produced lower discrimination than the other four methods. Extreme gradient boosting delivered the highest AUC values yet showed persistent miscalibration.

Core claim

Continuous predictor modelling methods appeared to influence prediction stability. LIN achieved stable predictions from the base sample size onwards, whereas QUA, MFP, and XGB required larger samples. Although XGB showed high discrimination, calibration concerns persisted.

What carries the argument

Bootstrap-based stability criterion requiring at least 90 percent of individual predictions to have MAPE of 5 percent or less, applied across six modeling methods and five sample-size scenarios.
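The criterion above can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes MAPE for an individual is the mean absolute difference between the original model's predicted risk and the risks predicted by models refitted on bootstrap samples, and it reads "5 percent" as 0.05 on the probability scale. The function names and toy numbers are hypothetical.

```python
from statistics import mean


def individual_mape(original_preds, bootstrap_preds):
    """Per-individual mean absolute difference between the original model's
    predicted risk and the risk predicted by each bootstrap model.

    original_preds: list of risks, one per individual.
    bootstrap_preds: list of lists, one inner list of risks per bootstrap model.
    """
    n = len(original_preds)
    return [
        mean([abs(boot[i] - original_preds[i]) for boot in bootstrap_preds])
        for i in range(n)
    ]


def stability(original_preds, bootstrap_preds, mape_cutoff=0.05, min_share=0.90):
    """Return (share of individuals with MAPE <= cutoff, whether the rule is met)."""
    mapes = individual_mape(original_preds, bootstrap_preds)
    share = sum(m <= mape_cutoff for m in mapes) / len(mapes)
    return share, share >= min_share


# Toy numbers for illustration only (not from the paper).
original = [0.10, 0.40, 0.80, 0.25]
boots = [
    [0.12, 0.38, 0.79, 0.35],
    [0.09, 0.44, 0.82, 0.15],
    [0.11, 0.41, 0.78, 0.30],
]
share, stable = stability(original, boots)
print(share, stable)
```

In the toy data three of four individuals have MAPE below 0.05, so the share is 0.75 and the 90 percent rule is not met.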

If this is right

LIN meets the stability criterion from n = 874 onward, earlier than QUA, MFP, and XGB; at n = 437 no method meets it.
QUA meets the stability criterion at n = 1,748.
MFP and XGB do not meet the criterion at n = 1,748 and first meet it at n = 3,496.
All six methods meet the stability criterion at n = 3,496 and n = 8,739.
XGB maintains the highest discrimination throughout but shows persistent miscalibration.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Developers working with limited data may favor linear terms to obtain reliable individual predictions without waiting for larger samples.
The observed stability differences imply that sample-size calculations for new clinical models should account for the intended modeling approach.
The calibration problems seen with XGB suggest that discrimination alone is insufficient for choosing a method when individual predictions matter.
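To make the last point concrete, here is a small sketch of two simple calibration checks. The source does not say which calibration measures the paper used, so the observed/expected (O/E) ratio and the grouped comparison below are assumptions chosen for illustration, and the data are invented.

```python
def observed_expected_ratio(outcomes, preds):
    """O/E ratio: number of observed events divided by the sum of predicted risks."""
    return sum(outcomes) / sum(preds)


def calibration_by_group(outcomes, preds, n_groups=2):
    """Mean predicted risk and observed event rate within risk-ordered groups."""
    order = sorted(range(len(preds)), key=lambda i: preds[i])
    size = len(order) // n_groups
    rows = []
    for g in range(n_groups):
        # The last group absorbs any remainder.
        idx = order[g * size:] if g == n_groups - 1 else order[g * size:(g + 1) * size]
        mean_pred = sum(preds[i] for i in idx) / len(idx)
        event_rate = sum(outcomes[i] for i in idx) / len(idx)
        rows.append((mean_pred, event_rate))
    return rows


# Invented example: predictions that rank well but are too extreme.
outcomes = [0, 0, 0, 1, 0, 1, 1, 1]
preds = [0.02, 0.05, 0.10, 0.30, 0.60, 0.90, 0.95, 0.98]

oe = observed_expected_ratio(outcomes, preds)
rows = calibration_by_group(outcomes, preds)
print(round(oe, 3), rows)
```

Here the O/E ratio is close to 1, yet the low-risk group has a higher event rate than predicted and the high-risk group a lower one, which is the kind of pattern a discrimination measure alone would not reveal.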

Load-bearing premise

The rule that at least 90 percent of individual predictions must have MAPE of 5 percent or less, applied within the bootstrap framework, supplies a clinically meaningful and unbiased way to compare stability across modeling methods.

What would settle it

A replication on similar clinical data in which quadratic, MFP or XGB methods reach the 90 percent MAPE stability threshold at the same small sample size as linear terms would falsify the claim that modeling choices differentially affect stability.

Original abstract

Background and objective: Prediction stability is increasingly recognised as important for reliable clinical prediction model development, but the effect of continuous predictor modelling choices is unclear. This study examined how approaches to modelling continuous predictors influence prediction stability.

Methods: We used a real clinical dataset of 19,418 emergency department patients to create five sample size scenarios ranging from 437 to 8,739 patients. Six methods were compared: dichotomisation at the median (DIC), tertile categorisation (TER), linear terms (LIN), quadratic terms (QUA), multivariable fractional polynomials (MFP), and extreme gradient boosting (XGB). Prediction stability was evaluated using a bootstrap-based framework. Optimism-corrected AUC and calibration were estimated through internal validation. A method was considered stable when at least 90% of individual predictions had a mean absolute prediction error (MAPE) ≤5%.

Results: Stability increased with sample size and varied by method. At n = 437, no method met the stability criterion; LIN was the most stable, followed by DIC. At n = 874, DIC and LIN achieved stable predictions with similar calibration, although DIC had lower AUC. At n = 1,748, QUA achieved stability, whereas MFP and XGB did not. At n = 3,496 and n = 8,739, all methods achieved stability. LIN, QUA, MFP, and XGB generally had higher AUCs than DIC and TER, while XGB showed the highest AUC but persistent miscalibration.

Conclusion: Continuous predictor modelling methods appeared to influence prediction stability. LIN achieved stable predictions from the base sample size onwards, whereas QUA, MFP, and XGB required larger samples. Although XGB showed high discrimination, calibration concerns persisted. These findings suggest that, in smaller datasets, simpler approaches, particularly LIN, may provide more stable predictions.
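The abstract mentions optimism-corrected AUC without saying how the correction was done. The sketch below shows one widely used scheme, the bootstrap optimism correction, as an assumption rather than the paper's method: each bootstrap model is scored on its own bootstrap sample and on the original data, and the average gap is subtracted from the apparent AUC. All numbers are invented for illustration.

```python
from statistics import mean


def auc(outcomes, preds):
    """Probability that a randomly chosen event has a higher predicted risk
    than a randomly chosen non-event (ties count one half)."""
    pos = [p for y, p in zip(outcomes, preds) if y == 1]
    neg = [p for y, p in zip(outcomes, preds) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def optimism_corrected(apparent, boot_on_boot, boot_on_original):
    """Apparent performance minus the mean bootstrap optimism.

    boot_on_boot[b]: AUC of bootstrap model b on its own bootstrap sample.
    boot_on_original[b]: AUC of bootstrap model b on the original sample.
    """
    optimism = mean([b - o for b, o in zip(boot_on_boot, boot_on_original)])
    return apparent - optimism


# Invented values, not results from the paper.
toy_auc = auc([0, 0, 1, 1], [0.10, 0.40, 0.35, 0.80])
corrected = optimism_corrected(0.80, [0.82, 0.81, 0.83], [0.78, 0.79, 0.77])
print(toy_auc, round(corrected, 2))
```

The pairwise AUC is quadratic in sample size, which is fine for a sketch; a rank-based implementation would be preferable at the sample sizes discussed here.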

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Linear terms for continuous predictors gave stable predictions at smaller sample sizes than QUA, MFP or XGB in this bootstrap analysis of ED data.

The paper compares six standard ways of handling continuous predictors on real emergency department data from 19,418 patients. The authors create five sample-size scenarios by subsampling (437 to 8,739 patients), run a bootstrap procedure to measure stability, and also track optimism-corrected AUC and calibration. The concrete result is that linear terms met the stability rule at a smaller sample size than the flexible methods, while XGB needed a larger n despite the best discrimination and still showed calibration problems. That pattern is useful to see on actual clinical numbers rather than simulations.

The design is straightforward and the question matters for people who build models on moderate-sized datasets. Reporting both stability and the usual performance metrics is the right move.

The soft spot is the stability rule itself. Defining stability as at least 90% of predictions with MAPE at or below 5% is presented without any sensitivity checks or comparison to other cut-offs or metrics. Because every smaller sample comes from the same original cohort, changing the 90% or 5% numbers would likely move the point at which each method crosses the threshold. The abstract gives no clinical justification for those exact values either.

This is the sort of targeted empirical check that clinical prediction modelers can use when choosing how to code predictors. It does not claim broad theory, just shows behavior on one dataset with clear methods.

I would bring it to a reading group to talk about the stability definition. It deserves peer review because the practical question is real and the evidence is direct, even if the threshold choice needs more defense in revision.

Referee Report

2 major / 2 minor

Summary. The manuscript reports an empirical comparison of six continuous-predictor modelling approaches (dichotomisation, tertiles, linear terms, quadratic terms, MFP, XGB) on prediction stability using a single 19,418-patient ED cohort subsampled into five sizes (437–8,739). Stability is defined via bootstrap as the point at which ≥90 % of individual predictions satisfy MAPE ≤5 %; secondary outcomes are optimism-corrected AUC and calibration. The central claim is that linear terms reach the stability threshold at the smallest n, while quadratic, MFP and XGB require larger samples, and that XGB exhibits the highest AUC but persistent miscalibration.
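For readers less familiar with the codings being compared, the sketch below shows how four of the six approaches turn one continuous predictor into model inputs. It is illustrative only: the cut-point conventions (strictly above the median, tertile boundaries from `statistics.quantiles`) are assumptions, the `ages` values are invented, and MFP and XGB are omitted because they select or learn the functional form during fitting rather than fixing it in advance.

```python
from statistics import median, quantiles


def dichotomise(values):
    """DIC: 1 if above the sample median, else 0."""
    cut = median(values)
    return [int(v > cut) for v in values]


def tertiles(values):
    """TER: 0, 1, or 2 according to the sample tertile boundaries."""
    low, high = quantiles(values, n=3)
    return [0 if v <= low else 1 if v <= high else 2 for v in values]


def linear(values):
    """LIN: the predictor enters the model unchanged."""
    return [[v] for v in values]


def quadratic(values):
    """QUA: the predictor and its square both enter the model."""
    return [[v, v * v] for v in values]


ages = [22, 35, 41, 47, 53, 58, 64, 71, 79]  # hypothetical predictor values
print(dichotomise(ages))
print(tertiles(ages))
print(quadratic(ages)[:2])
```

The categorical codings discard within-group variation (a 22-year-old and a 41-year-old receive the same tertile code), which is consistent with the lower AUCs reported for DIC and TER.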

Significance. If the ordering of methods is robust to the stability definition, the work supplies practical, data-driven guidance for choosing modelling strategies in smaller clinical datasets where stability matters. Strengths include the use of real clinical data, a range of sample sizes, and bootstrap-based internal validation; these elements make the comparison more informative than purely simulated studies.

major comments (2)

[Methods] Methods (stability definition paragraph): the threshold “at least 90 % of individual predictions with MAPE ≤5 %” is introduced without sensitivity checks on either the 90 % or 5 % cut-off, without comparison to alternative metrics (Brier score, absolute probability error), and without clinical justification. Because all five sample-size scenarios are subsamples of the identical 19,418-patient cohort, any change in the rule directly alters which methods cross the stability threshold at which n, rendering the reported ordering (LIN earliest, then QUA/MFP/XGB) dependent on an unvalidated operational choice.
[Results] Results (sample-size scenario comparisons): the bootstrap procedure used to compute MAPE for each individual prediction is not described in sufficient detail (e.g., whether the same bootstrap samples are used for model fitting and for stability assessment, how out-of-bag predictions are handled, and whether the 90 % rule is applied per bootstrap replicate or aggregated). This detail is load-bearing for the claim that LIN is stable from n = 874 onward while the other flexible methods are not.

minor comments (2)

[Abstract] The abstract states that “at n = 874, DIC and LIN achieved stable predictions” but does not report the exact percentages or confidence intervals around the 90 % criterion; adding these numbers would improve transparency.
[Tables/Figures] Table or figure legends should explicitly state the number of bootstrap replicates and the exact definition of MAPE (mean absolute prediction error on the probability scale) to allow readers to reproduce the stability calculations.
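
The definition the referee asks for could be written as follows. This is one common formulation offered as an assumption, since the source gives no formula: $\hat{p}_i$ is the risk predicted for individual $i$ by the model fitted to the development sample, $\hat{p}_i^{(b)}$ is the risk predicted for the same individual by the model refitted in bootstrap replicate $b$, $B$ is the number of replicates, and $n$ is the number of individuals. All symbols are introduced here for illustration, and reading "5%" as 0.05 on the probability scale is an interpretation.

```latex
\mathrm{MAPE}_i = \frac{1}{B}\sum_{b=1}^{B}\left|\hat{p}_i^{(b)} - \hat{p}_i\right|,
\qquad
\text{stable} \iff \frac{1}{n}\sum_{i=1}^{n}\mathbf{1}\!\left[\mathrm{MAPE}_i \le 0.05\right] \ge 0.90 .
```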

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which highlight important aspects of our stability definition and bootstrap implementation. We address each major comment below and indicate planned revisions to strengthen the manuscript.

Referee: [Methods] Methods (stability definition paragraph): the threshold “at least 90 % of individual predictions with MAPE ≤5 %” is introduced without sensitivity checks on either the 90 % or 5 % cut-off, without comparison to alternative metrics (Brier score, absolute probability error), and without clinical justification. Because all five sample-size scenarios are subsamples of the identical 19,418-patient cohort, any change in the rule directly alters which methods cross the stability threshold at which n, rendering the reported ordering (LIN earliest, then QUA/MFP/XGB) dependent on an unvalidated operational choice.

Authors: We agree that the specific 90% and 5% thresholds represent an operational choice that could influence the reported ordering, and that additional justification and robustness checks are warranted. The thresholds were selected to reflect a stringent requirement for the majority of predictions to exhibit low error, consistent with aims for reliable clinical use, but we acknowledge the absence of sensitivity analyses in the original submission. In the revised manuscript we will add a dedicated sensitivity analysis subsection in Methods and Results. This will include: (i) re-running the stability assessment across a grid of thresholds (proportion: 80%, 85%, 90%, 95%; MAPE: 3%, 5%, 7%, 10%); (ii) alternative stability definitions based on Brier score and mean absolute probability error; and (iii) explicit reporting of whether the relative ordering of methods (LIN earliest, followed by QUA/MFP/XGB) remains consistent. We will also add a short clinical rationale paragraph referencing literature on acceptable prediction error in emergency-department risk models. These additions will directly address the dependence on the chosen rule.
Revision: yes

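The threshold grid proposed in this response could be evaluated along the following lines. This is a hedged sketch: it assumes per-individual MAPE values are already available on the probability scale, the function names are hypothetical, and the `mapes` list is invented to show how the verdict moves with the thresholds.

```python
def share_within(mapes, cutoff):
    """Share of individuals whose MAPE is at or below the cutoff."""
    return sum(m <= cutoff for m in mapes) / len(mapes)


def sensitivity_grid(mapes,
                     shares=(0.80, 0.85, 0.90, 0.95),
                     cutoffs=(0.03, 0.05, 0.07, 0.10)):
    """Map each (required share, MAPE cutoff) pair to a stable / not stable verdict."""
    grid = {}
    for cutoff in cutoffs:
        achieved = share_within(mapes, cutoff)
        for share in shares:
            grid[(share, cutoff)] = achieved >= share
    return grid


# Invented per-individual MAPE values for one method at one sample size.
mapes = [0.01, 0.02, 0.02, 0.03, 0.03, 0.04, 0.04, 0.05, 0.06, 0.08]
grid = sensitivity_grid(mapes)

for (share, cutoff), verdict in sorted(grid.items()):
    print(f"share >= {share:.2f}, MAPE <= {cutoff:.2f}: {'stable' if verdict else 'not stable'}")
```

With these invented values the method fails the 90%/5% rule but passes both 80%/5% and 90%/7%, which illustrates why the referee regards the reported ordering as dependent on the chosen cut-offs.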
Referee: [Results] Results (sample-size scenario comparisons): the bootstrap procedure used to compute MAPE for each individual prediction is not described in sufficient detail (e.g., whether the same bootstrap samples are used for model fitting and for stability assessment, how out-of-bag predictions are handled, and whether the 90 % rule is applied per bootstrap replicate or aggregated). This detail is load-bearing for the claim that LIN is stable from n = 874 onward while the other flexible methods are not.

Authors: We apologise for the insufficient detail on the bootstrap workflow. The procedure separated model fitting from stability evaluation: for each subsample size, 1,000 bootstrap replicates were drawn to fit each modelling method; an independent set of 1,000 bootstrap replicates was then used to generate predictions on the original subsample, employing out-of-bag observations for the stability metric. The 90% rule was applied after aggregation across all stability-assessment replicates (i.e., the proportion of individuals whose average MAPE across replicates met the threshold). We will expand the Methods section with a new subsection titled “Bootstrap-based stability assessment” that includes a step-by-step description, a schematic diagram, and explicit statements on the separation of fitting and evaluation samples, handling of out-of-bag predictions, and aggregation of the 90% criterion. This clarification will be cross-referenced in the Results when presenting the n = 874 findings.
Revision: yes
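
A generic bootstrap stability loop can be sketched as below. It is deliberately simpler than the workflow described in this response and should not be read as the authors' procedure: each bootstrap model is refitted on a resample and used to predict for every individual in the original subsample, with no out-of-bag handling and no separate set of replicates. The toy "model" (event rate below and above the median, a DIC-style coding), the synthetic data, and the 200 replicates are all assumptions made to keep the example self-contained.

```python
import random
from statistics import mean, median


def fit_dic(xs, ys):
    """Toy model: event rate below and above the median of x."""
    cut = median(xs)
    overall = mean(ys)
    rates = {}
    for side in (0, 1):
        group = [y for x, y in zip(xs, ys) if int(x > cut) == side]
        rates[side] = mean(group) if group else overall  # guard against an empty group
    return cut, rates


def predict(model, xs):
    cut, rates = model
    return [rates[int(x > cut)] for x in xs]


def bootstrap_stability(xs, ys, n_boot=200, mape_cutoff=0.05, seed=1):
    """Share of individuals whose MAPE across bootstrap models is <= mape_cutoff."""
    rng = random.Random(seed)
    n = len(xs)
    original = predict(fit_dic(xs, ys), xs)
    abs_err = [0.0] * n
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]        # resample with replacement
        model = fit_dic([xs[i] for i in idx], [ys[i] for i in idx])
        for i, p in enumerate(predict(model, xs)):         # predict for the original individuals
            abs_err[i] += abs(p - original[i])
    mapes = [e / n_boot for e in abs_err]
    return sum(m <= mape_cutoff for m in mapes) / n


# Synthetic data for illustration only.
data_rng = random.Random(0)
xs = [data_rng.gauss(50, 15) for _ in range(400)]
ys = [int(data_rng.random() < (0.2 if x < 50 else 0.5)) for x in xs]

share = bootstrap_stability(xs, ys)
print(f"share of individuals with MAPE <= 0.05: {share:.3f}")
```

The aggregation here matches the one stated in the response: MAPE is averaged over replicates for each individual first, and the 90% rule would then be applied once to the resulting share.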

Circularity Check

0 steps flagged

No circularity: purely empirical comparison with operational metric applied to data

The paper performs an empirical bootstrap-based comparison of modeling methods on subsamples from one fixed clinical cohort. Stability is defined once in Methods as an operational threshold (≥90% of predictions with MAPE≤5%) and then measured directly; the ordering of methods by sample size at which the threshold is crossed is an observed outcome, not a quantity that reduces to the definition by construction. No equations, fitted parameters renamed as predictions, self-citation chains, or uniqueness theorems appear. The study is self-contained against external benchmarks and receives the default non-finding.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The study is empirical and relies on domain assumptions about data and validation methods rather than new postulates; free parameters are limited to the stability definition and chosen sample sizes.

free parameters (2)

stability criterion (90% of predictions with MAPE ≤5%): arbitrary threshold defining when a method is considered stable; directly determines which methods qualify at each sample size.
sample size scenarios (437; 874; 1,748; 3,496; 8,739): specific values extracted from the full dataset to create comparison scenarios.

axioms (2)

domain assumption: The bootstrap resampling framework accurately estimates prediction stability for individual patients. Central to the evaluation method described in the abstract.
domain assumption: The real clinical dataset of 19,418 emergency patients is representative for testing modeling choices in clinical prediction. Used as the source for all sample size scenarios.

Reference graph

Works this paper leans on

30 extracted references · 27 canonical work pages

[1]

Prognosis Research in Health Care: Concepts, Methods, and Impact

Riley RD, Van Der Windt D, Croft P, Moons KGM, editors. Prognosis Research in Health Care: Concepts, Methods, and Impact. 1st edition. Oxford University Press; 2019. https://doi.org/10.1093/med/9780198796619.001.0001

work page doi:10.1093/med/9780198796619.001.0001 2019
[2]

Prognosis Research Strategy (PROGRESS) 3: Prognostic Model Research

Steyerberg EW, Moons KGM, van der Windt DA, Hayden JA, Perel P, Schroter S, et al. Prognosis Research Strategy (PROGRESS) 3: Prognostic Model Research. PLoS Med. 2013;10:e1001381. https://doi.org/10.1371/journal.pmed.1001381

work page doi:10.1371/journal.pmed.1001381 2013
[3]

A systematic review finds prediction models for chronic kidney disease were poorly reported and often developed using inappropriate methods

Collins GS, Omar O, Shanyinde M, Yu L-M. A systematic review finds prediction models for chronic kidney disease were poorly reported and often developed using inappropriate methods. J Clin Epidemiol. 2013;66:268–77. https://doi.org/10.1016/j.jclinepi.2012.06.020

work page doi:10.1016/j.jclinepi.2012.06.020 2013
[4]

Does poor methodological quality of prediction modeling studies translate to poor model performance? An illustration in traumatic brain injury

Helmrich IRAR, Mikolić A, Kent DM, Lingsma HF, Wynants L, Steyerberg EW, et al. Does poor methodological quality of prediction modeling studies translate to poor model performance? An illustration in traumatic brain injury. Diagn Progn Res. 2022;6:8. https://doi.org/10.1186/s41512-022-00122-0

work page doi:10.1186/s41512-022-00122-0 2022
[5]

Systematic review finds “spin” practices and poor reporting standards in studies on machine learning-based prediction models

Andaur Navarro CL, Damen JAA, Takada T, Nijman SWJ, Dhiman P, Ma J, et al. Systematic review finds “spin” practices and poor reporting standards in studies on machine learning-based prediction models. J Clin Epidemiol. 2023;158:99–110. https://doi.org/10.1016/j.jclinepi.2023.03.024

work page doi:10.1016/j.jclinepi.2023.03.024 2023
[6]

Clinical Prediction Models: A Practical Approach to Development, Validation, and Updating

Steyerberg EW. Clinical Prediction Models: A Practical Approach to Development, Validation, and Updating. Springer Science & Business Media; 2008

2008
[7]

Clinical prediction models and the multiverse of madness

Riley RD, Pate A, Dhiman P, Archer L, Martin GP, Collins GS. Clinical prediction models and the multiverse of madness. BMC Med. 2023;21:502. https://doi.org/10.1186/s12916-023-03212-y

work page doi:10.1186/s12916-023-03212-y 2023
[8]

Impact of sample size on the stability of risk scores from clinical prediction models: a case study in cardiovascular disease

Pate A, Emsley R, Sperrin M, Martin GP, Van Staa T. Impact of sample size on the stability of risk scores from clinical prediction models: a case study in cardiovascular disease. Diagn Progn Res. 2020;4:14. https://doi.org/10.1186/s41512-020-00082-3

work page doi:10.1186/s41512-020-00082-3 2020
[9]

Stability of clinical prediction models developed using statistical or machine learning methods

Riley RD, Collins GS. Stability of clinical prediction models developed using statistical or machine learning methods. Biom J Biom Z. 2023;65:e2200302. https://doi.org/10.1002/bimj.202200302

work page doi:10.1002/bimj.202200302 2023
[10]

Bootstrap investigation of the stability of a Cox regression model

Altman DG, Andersen PK. Bootstrap investigation of the stability of a Cox regression model. Stat Med. 1989;8:771–83. https://doi.org/10.1002/sim.4780080702

work page doi:10.1002/sim.4780080702 1989
[11]

A bootstrap resampling procedure for model building: Application to the Cox regression model

Sauerbrei W, Schumacher M. A bootstrap resampling procedure for model building: Application to the Cox regression model. Stat Med. 1992;11:2093–109. https://doi.org/10.1002/sim.4780111607

work page doi:10.1002/sim.4780111607 1992
[12]

On stability issues in deriving multivariable regression models

Sauerbrei W, Buchholz A, Boulesteix A-L, Binder H. On stability issues in deriving multivariable regression models. Biom J Biom Z. 2015;57:531–55. https://doi.org/10.1002/bimj.201300222

work page doi:10.1002/bimj.201300222 2015
[13]

Stability of multivariable fractional polynomial models with selection of variables and transformations: a bootstrap investigation

Royston P, Sauerbrei W. Stability of multivariable fractional polynomial models with selection of variables and transformations: a bootstrap investigation. Stat Med. 2003;22:639–59. https://doi.org/10.1002/sim.1310

work page doi:10.1002/sim.1310 2003
[14]

Stability Investigations of Multivariable Regression Models Derived from Low- and High-Dimensional Data

Sauerbrei W, Boulesteix A-L, Binder H. Stability Investigations of Multivariable Regression Models Derived from Low- and High-Dimensional Data. J Biopharm Stat. 2011;21:1206–31. https://doi.org/10.1080/10543406.2011.629890

work page doi:10.1080/10543406.2011.629890 2011
[15]

Performance of binary prediction models in high-correlation low-dimensional settings: a comparison of methods

Leeuwenberg AM, Van Smeden M, Langendijk JA, Van Der Schaaf A, Mauer ME, Moons KGM, et al. Performance of binary prediction models in high-correlation low-dimensional settings: a comparison of methods. Diagn Progn Res. 2022;6:1. https://doi.org/10.1186/s41512-021-00115-5

work page doi:10.1186/s41512-021-00115-5 2022
[16]

To tune or not to tune, a case study of ridge logistic regression in small or sparse datasets

Šinkovec H, Heinze G, Blagus R, Geroldinger A. To tune or not to tune, a case study of ridge logistic regression in small or sparse datasets. BMC Med Res Methodol. 2021;21:199. https://doi.org/10.1186/s12874-021-01374-y

work page doi:10.1186/s12874-021-01374-y 2021
[17]

Poor handling of continuous predictors in clinical prediction models using logistic regression: a systematic review

Ma J, Dhiman P, Qi C, Bullock G, van Smeden M, Riley RD, et al. Poor handling of continuous predictors in clinical prediction models using logistic regression: a systematic review. J Clin Epidemiol. 2023;161:140–51. https://doi.org/10.1016/j.jclinepi.2023.07.017

work page doi:10.1016/j.jclinepi.2023.07.017 2023
[18]

Dichotomizing continuous predictors in multiple regression: a bad idea

Royston P, Altman DG, Sauerbrei W. Dichotomizing continuous predictors in multiple regression: a bad idea. Stat Med. 2006;25:127–41. https://doi.org/10.1002/sim.2331

work page doi:10.1002/sim.2331 2006
[19]

Multivariable Model-Building: A Pragmatic Approach to Regression Analysis Based on Fractional Polynomials for Modelling Continuous Variables

Royston P, Sauerbrei W. Multivariable Model-Building: A Pragmatic Approach to Regression Analysis Based on Fractional Polynomials for Modelling Continuous Variables. John Wiley & Sons; 2008

2008
[20]

Predicting Hospital Admission of Patients at Triage in the Emergency Department at Lampang Hospital

Seesuwan N, Lokeskrawee T, Lawanaskol S, Patumanond J. Predicting Hospital Admission of Patients at Triage in the Emergency Department at Lampang Hospital. Biomed Sci Clin Med. 2025;64:23–32

2025
[21]

Minimum sample size for developing a multivariable prediction model: PART II - binary and time-to-event outcomes

Riley RD, Snell KI, Ensor J, Burke DL, Harrell FE Jr, Moons KG, et al. Minimum sample size for developing a multivariable prediction model: PART II - binary and time-to-event outcomes. Stat Med. 2019;38:1276–96. https://doi.org/10.1002/sim.7992

work page doi:10.1002/sim.7992 2019
[22]

External validation of clinical prediction models: simulation-based sample size calculations were more reliable than rules-of-thumb

Snell KIE, Archer L, Ensor J, Bonnett LJ, Debray TPA, Phillips B, et al. External validation of clinical prediction models: simulation-based sample size calculations were more reliable than rules-of-thumb. J Clin Epidemiol. 2021;135:79–89. https://doi.org/10.1016/j.jclinepi.2021.02.011

work page doi:10.1016/j.jclinepi.2021.02.011 2021
[23]

Sequential sample size calculations and learning curves safeguard the robust development of a clinical prediction model for individuals

Legha A, Ensor J, Whittle R, Archer L, Van Calster B, Christodoulou E, et al. Sequential sample size calculations and learning curves safeguard the robust development of a clinical prediction model for individuals. J Clin Epidemiol. 2026;191:112117. https://doi.org/10.1016/j.jclinepi.2025.112117

work page doi:10.1016/j.jclinepi.2025.112117 2026
[24]

Dichotomizing continuous outcome variables: dependence of the magnitude of association and statistical power on the cutpoint

Ragland DR. Dichotomizing continuous outcome variables: dependence of the magnitude of association and statistical power on the cutpoint. Epidemiol Camb Mass. 1992;3:434–40. https://doi.org/10.1097/00001648-199209000-00009

work page doi:10.1097/00001648-199209000-00009 1992
[25]

Beyond Comparing Machine Learning and Logistic Regression in Clinical Prediction Modelling: Shifting from Model Debate to Data Quality

Hu Y, Zhang X, Slavin V, Belsti Y, Tiruneh SA, Callander E, et al. Beyond Comparing Machine Learning and Logistic Regression in Clinical Prediction Modelling: Shifting from Model Debate to Data Quality. J Med Internet Res. 2025;27:e77721. https://doi.org/10.2196/77721

work page doi:10.2196/77721 2025
[26]

Modern modelling techniques are data hungry: a simulation study for predicting dichotomous endpoints

Ploeg T, Austin PC, Steyerberg EW. Modern modelling techniques are data hungry: a simulation study for predicting dichotomous endpoints. BMC Med Res Methodol. 2014;14:137. https://doi.org/10.1186/1471-2288-14-137

work page doi:10.1186/1471-2288-14-137 2014
[27]

The relative data hungriness of unpenalized and penalized logistic regression and ensemble-based machine learning methods: the case of calibration

Austin PC, Lee DS, Wang B. The relative data hungriness of unpenalized and penalized logistic regression and ensemble-based machine learning methods: the case of calibration. Diagn Progn Res. 2024;8:15. https://doi.org/10.1186/s41512-024-00179-z

work page doi:10.1186/s41512-024-00179-z 2024
[28]

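Predictive performance of machine and statistical learning methods: Impact of data-generating processes on external validity in the “large N, small p” setting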

Austin PC, Harrell FE, Steyerberg EW. Predictive performance of machine and statistical learning methods: Impact of data-generating processes on external validity in the “large N, small p” setting. Stat Methods Med Res. 2021;30:1465–83. https://doi.org/10.1177/09622802211002867

work page doi:10.1177/09622802211002867 2021
[29]

Sample size for binary logistic prediction models: Beyond events per variable criteria

van Smeden M, Moons KG, de Groot JA, Collins GS, Altman DG, Eijkemans MJ, et al. Sample size for binary logistic prediction models: Beyond events per variable criteria. Stat Methods Med Res. 2019;28:2455–74. https://doi.org/10.1177/0962280218784726

work page doi:10.1177/0962280218784726 2019
[30]

No rationale for 1 variable per 10 events criterion for binary logistic regression analysis

van Smeden M, de Groot JAH, Moons KGM, Collins GS, Altman DG, Eijkemans MJC, et al. No rationale for 1 variable per 10 events criterion for binary logistic regression analysis. BMC Med Res Methodol. 2016;16:163. https://doi.org/10.1186/s12874-016-0267-3

work page doi:10.1186/s12874-016-0267-3 2016

Acknowledgements: This study was partially supported by Chiang Mai University and the Faculty of Medici...


Šinkovec H, Heinze G, Blagus R, Geroldinger A. To tune or not to tune, a case study of ridge logistic regression in small or sparse datasets. BMC Med Res Methodol. 2021;21:199. https://doi.org/10.1186/s12874-021-01374-y

work page doi:10.1186/s12874-021-01374-y 2021

[17] [17]

Poor handling of continuous predictors in clinical prediction models using logistic regression: a systematic review

Ma J, Dhiman P, Qi C, Bullock G, van Smeden M, Riley RD, et al. Poor handling of continuous predictors in clinical prediction models using logistic regression: a systematic review. J Clin Epidemiol. 2023;161:140–51. https://doi.org/10.1016/j.jclinepi.2023.07.017

work page doi:10.1016/j.jclinepi.2023.07.017 2023

[18] [18]

Dichotomizing continuous predictors in multiple regression: a bad idea

Royston P, Altman DG, Sauerbrei W. Dichotomizing continuous predictors in multiple regression: a bad idea. Stat Med. 2006;25:127–41. https://doi.org/10.1002/sim.2331

work page doi:10.1002/sim.2331 2006

[19] [19]

Multivariable Model - Building: A Pragmatic Approach to Regression Anaylsis based on Fractional Polynomials for Modelling Continuous Variables

Royston P, Sauerbrei W. Multivariable Model - Building: A Pragmatic Approach to Regression Anaylsis based on Fractional Polynomials for Modelling Continuous Variables. John Wiley & Sons; 2008

2008

[20] [20]

Predicting Hospital Admission of Patients at Triage in the Emergency Department at Lampang Hospital

Seesuwan N, Lokeskrawee T, Lawanaskol S, Patumanond J. Predicting Hospital Admission of Patients at Triage in the Emergency Department at Lampang Hospital. Biomed Sci Clin Med. 2025;64:23–32

2025

[21] [21]

Minimum sample size for developing a multivariable prediction model: P ART II - binary and time-to-event outcomes

Riley RD, Snell KI, Ensor J, Burke DL, Jr FEH, Moons KG, et al. Minimum sample size for developing a multivariable prediction model: P ART II - binary and time-to-event outcomes. Stat Med. 2019;38:1276–96. https://doi.org/10.1002/sim.7992

work page doi:10.1002/sim.7992 2019

[22] [22]

External validation of clinical prediction models: simulation-based sample size calculations were more reliable than rules-of-thumb

Snell KIE, Archer L, Ensor J, Bonnett LJ, Debray TPA, Phillips B, et al. External validation of clinical prediction models: simulation-based sample size calculations were more reliable than rules-of-thumb. J Clin Epidemiol. 2021;135:79–89. https://doi.org/10.1016/j.jclinepi.2021.02.011

work page doi:10.1016/j.jclinepi.2021.02.011 2021

[23] [23]

Sequential sample size calculations and learning curves safeguard the robust development of a clinical prediction model for individuals

Legha A, Ensor J, Whittle R, Archer L, Van Calster B, Christodoulou E, et al. Sequential sample size calculations and learning curves safeguard the robust development of a clinical prediction model for individuals. J Clin Epidemiol. 2026;191:112117. https://doi.org/10.1016/j.jclinepi.2025.112117

work page doi:10.1016/j.jclinepi.2025.112117 2026

[24] [24]

Dichotomizing continuous outcome variables: dependence of the magnitude of association and statistical power on the cutpoint

Ragland DR. Dichotomizing continuous outcome variables: dependence of the magnitude of association and statistical power on the cutpoint. Epidemiol Camb Mass. 1992;3:434–40. https://doi.org/10.1097/00001648-199209000-00009

work page doi:10.1097/00001648-199209000-00009 1992

[25] [25]

Beyond Comparing Machine Learning and Logistic Regression in Clinical Prediction Modelling: Shifting from Model Debate to Data Quality

Hu Y , Zhang X, Slavin V , Belsti Y , Tiruneh SA, Callander E, et al. Beyond Comparing Machine Learning and Logistic Regression in Clinical Prediction Modelling: Shifting from Model Debate to Data Quality. J Med Internet Res. 2025;27:e77721. https://doi.org/10.2196/77721

work page doi:10.2196/77721 2025

[26] [26]

Modern modelling techniques are data hungry: a simulation study for pr edicting dichotomous endpoints

Ploeg T, Austin PC, Steyerberg EW. Modern modelling techniques are data hungry: a simulation study for pr edicting dichotomous endpoints. BMC Med Res Methodol. 2014;14:137. https://doi.org/10.1186/1471-2288-14-137

work page doi:10.1186/1471-2288-14-137 2014

[27] [27]

The relative data hungriness of unpenalized and penalized logistic regression and ensemble-based machine learning methods: the case of calibration

Austin PC, Lee DS, Wang B. The relative data hungriness of unpenalized and penalized logistic regression and ensemble-based machine learning methods: the case of calibration. Diagn Progn Res. 2024;8:15. https://doi.org/10.1186/s41512-024-00179-z

work page doi:10.1186/s41512-024-00179-z 2024

[28] [28]

large N, small p

Austin PC, Harrell FE, Steyerberg EW. Predictive performance of machine and statistical learning methods: Impact of data-generating processes on external validity in the “large N, small p” setting. Stat Methods Med Res. 2021;30:1465–83. https://doi.org/10.1177/09622802211002867

work page doi:10.1177/09622802211002867 2021

[29] [29]

Sample size for binary logistic prediction models: Beyond events per variable criteria

van Smeden M, Moons KG, de Groot JA, Collins GS, Altman DG, Eijkemans MJ, et al. Sample size for binary logistic prediction models: Beyond events per variable criteria. Stat Methods Med Res. 2019;28:2455–74. https://doi.org/10.1177/0962280218784726

work page doi:10.1177/0962280218784726 2019

[30] [30]

No rationale for 1 variable per 10 events criterion for binary logistic regression analysis

van Smeden M, de Groot JAH, Moons KGM, Collins GS, Altman DG, Eijkemans MJC, et al. No rationale for 1 variable per 10 events criterion for binary logistic regression analysis. BMC Med Res Methodol. 2016;16:163. https://doi.org/10.1186/s12874-016-0267-3. Acknowledgements This study was partially supported by Chiang Mai University and the Faculty of Medici...

work page doi:10.1186/s12874-016-0267-3 2016