
arxiv: 2606.07052 · v1 · pith:I4TKZRVVnew · submitted 2026-06-05 · 📊 stat.ME

Influence of continuous predictor modelling methods on prediction stability in clinical prediction model development: an empirical comparison using real clinical data

Pith reviewed 2026-06-27 21:17 UTC · model grok-4.3

classification 📊 stat.ME
keywords prediction stability · continuous predictors · clinical prediction models · linear terms · multivariable fractional polynomials · extreme gradient boosting · sample size · bootstrap validation

The pith

The choice of modeling method for continuous predictors affects how stable clinical predictions remain across repeated samples: linear terms reach stability at smaller sample sizes than quadratic, fractional polynomial, or boosting approaches.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The study tested six approaches to handling continuous predictors on real data from 19,418 emergency department patients, creating five sample-size scenarios from 437 to 8,739 cases. Stability was measured by a bootstrap procedure that required at least 90 percent of individual predictions to show mean absolute prediction error of 5 percent or less. No method met this criterion at n = 437. Linear terms met it from n = 874 onward, quadratic terms stabilized at 1,748 patients, and multivariable fractional polynomials and extreme gradient boosting did so only at 3,496. Dichotomization stabilized as early as linear terms, but dichotomization and tertiles generally produced lower discrimination than the other methods. Extreme gradient boosting delivered the highest AUC values yet continued to show miscalibration even in the largest samples.
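To make the bootstrap procedure concrete, the following minimal Python sketch refits a model on resamples of a development set and records each refit's predictions for the original individuals. It is an illustration under stated assumptions, not the paper's implementation: the data are simulated, a single-predictor median-split model stands in for DIC, and the names `fit_dic` and `bootstrap_predictions`, the seeds, and the 200 replicates are all hypothetical choices.

```python
import math
import random
from statistics import median


def fit_dic(xs, ys):
    """Fit a dichotomised model: event rate below vs. at/above the median of x."""
    cut = median(xs)
    low = [y for x, y in zip(xs, ys) if x < cut]
    high = [y for x, y in zip(xs, ys) if x >= cut]
    overall = sum(ys) / len(ys)
    # Fall back to the overall event rate if one group happens to be empty.
    rate_low = sum(low) / len(low) if low else overall
    rate_high = sum(high) / len(high) if high else overall
    return lambda x: rate_high if x >= cut else rate_low


def bootstrap_predictions(xs, ys, n_boot=200, seed=1):
    """Refit on n_boot resamples; return each refit's predictions for the original xs."""
    rng = random.Random(seed)
    n = len(xs)
    preds = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        model = fit_dic([xs[i] for i in idx], [ys[i] for i in idx])
        preds.append([model(x) for x in xs])
    return preds


# Simulated development data (assumption): one predictor, one binary outcome.
data_rng = random.Random(0)
xs = [data_rng.gauss(0.0, 1.0) for _ in range(437)]
ys = [1 if data_rng.random() < 1 / (1 + math.exp(-x)) else 0 for x in xs]

preds = bootstrap_predictions(xs, ys)
print(len(preds), len(preds[0]))  # number of replicates, number of individuals
```

Each row of `preds` is one refitted model's view of the same individuals; the spread down each column is what a stability measure summarises.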

Core claim

Continuous predictor modelling methods appeared to influence prediction stability. LIN achieved stable predictions from the base sample size onwards, whereas QUA, MFP, and XGB required larger samples. Although XGB showed high discrimination, calibration concerns persisted.

What carries the argument

Bootstrap-based stability criterion requiring at least 90 percent of individual predictions to have MAPE of 5 percent or less, applied across six modeling methods and five sample-size scenarios.
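A minimal sketch of how this criterion could be evaluated, assuming that MAPE for an individual is the mean absolute difference, on the probability scale, between the original model's prediction and each bootstrap model's prediction for that individual. The function names and the toy numbers are hypothetical, and the paper's exact computation may differ.

```python
from statistics import mean


def individual_mape(original_preds, bootstrap_preds):
    """Per-individual mean absolute difference between bootstrap and original predictions."""
    return [
        mean(abs(boot[i] - p) for boot in bootstrap_preds)
        for i, p in enumerate(original_preds)
    ]


def stability_share(mape_values, mape_limit=0.05, min_share=0.90):
    """Return the share of individuals within the MAPE limit and the stable/unstable verdict."""
    share = sum(m <= mape_limit for m in mape_values) / len(mape_values)
    return share, share >= min_share


# Toy input (made up): 4 individuals, 2 bootstrap models.
original = [0.10, 0.30, 0.50, 0.80]
boot = [
    [0.12, 0.28, 0.40, 0.79],
    [0.08, 0.33, 0.62, 0.81],
]

mape = individual_mape(original, boot)
share, stable = stability_share(mape)
print(share, stable)
```

In this toy input three of the four individuals fall within the 5 percent limit, so the share is 0.75 and the 90 percent requirement is not met.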

If this is right

  • LIN yields stable predictions from n = 874 onward, a size at which QUA, MFP, and XGB do not yet meet the criterion; no method is stable at n = 437.
  • QUA reaches the stability criterion once the sample reaches approximately 1,748 patients.
  • MFP and XGB require samples of at least 3,496 patients before most individual predictions meet the stability threshold.
  • All six methods produce stable predictions once the sample reaches 3,496 patients, and remain stable at 8,739.
  • XGB maintains the highest discrimination throughout but shows persistent miscalibration.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Developers working with limited data may favor linear terms to obtain reliable individual predictions without waiting for larger samples.
  • The observed stability differences imply that sample-size calculations for new clinical models should account for the intended modeling approach.
  • The calibration problems seen with XGB suggest that discrimination alone is insufficient for choosing a method when individual predictions matter.

Load-bearing premise

The 90 percent MAPE threshold of 5 percent or less within the bootstrap framework supplies a clinically meaningful and unbiased way to compare stability across modeling methods.

What would settle it

A replication on similar clinical data in which quadratic, MFP or XGB methods reach the 90 percent MAPE stability threshold at the same small sample size as linear terms would falsify the claim that modeling choices differentially affect stability.

Original abstract

Background and objective: Prediction stability is increasingly recognised as important for reliable clinical prediction model development, but the effect of continuous predictor modelling choices is unclear. This study examined how approaches to modelling continuous predictors influence prediction stability.

Methods: We used a real clinical dataset of 19,418 emergency department patients to create five sample size scenarios ranging from 437 to 8,739 patients. Six methods were compared: dichotomisation at the median (DIC), tertile categorisation (TER), linear terms (LIN), quadratic terms (QUA), multivariable fractional polynomials (MFP), and extreme gradient boosting (XGB). Prediction stability was evaluated using a bootstrap-based framework. Optimism-corrected AUC and calibration were estimated through internal validation. A method was considered stable when at least 90% of individual predictions had a mean absolute prediction error (MAPE) <=5%.

Results: Stability increased with sample size and varied by method. At n = 437, no method met the stability criterion; LIN was the most stable, followed by DIC. At n = 874, DIC and LIN achieved stable predictions with similar calibration, although DIC had lower AUC. At n = 1,748, QUA achieved stability, whereas MFP and XGB did not. At n = 3,496 and n = 8,739, all methods achieved stability. LIN, QUA, MFP, and XGB generally had higher AUCs than DIC and TER, while XGB showed the highest AUC but persistent miscalibration.

Conclusion: Continuous predictor modelling methods appeared to influence prediction stability. LIN achieved stable predictions from the base sample size onwards, whereas QUA, MFP, and XGB required larger samples. Although XGB showed high discrimination, calibration concerns persisted. These findings suggest that, in smaller datasets, simpler approaches, particularly LIN, may provide more stable predictions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript reports an empirical comparison of six continuous-predictor modelling approaches (dichotomisation, tertiles, linear terms, quadratic terms, MFP, XGB) on prediction stability using a single 19,418-patient ED cohort subsampled into five sizes (437–8,739). Stability is defined via bootstrap as the point at which ≥90 % of individual predictions satisfy MAPE ≤5 %; secondary outcomes are optimism-corrected AUC and calibration. The central claim is that linear terms reach the stability threshold at the smallest n, while quadratic, MFP and XGB require larger samples, and that XGB exhibits the highest AUC but persistent miscalibration.

Significance. If the ordering of methods is robust to the stability definition, the work supplies practical, data-driven guidance for choosing modelling strategies in smaller clinical datasets where stability matters. Strengths include the use of real clinical data, a range of sample sizes, and bootstrap-based internal validation; these elements make the comparison more informative than purely simulated studies.

major comments (2)
  1. [Methods] Methods (stability definition paragraph): the threshold “at least 90 % of individual predictions with MAPE ≤5 %” is introduced without sensitivity checks on either the 90 % or 5 % cut-off, without comparison to alternative metrics (Brier score, absolute probability error), and without clinical justification. Because all five sample-size scenarios are subsamples of the identical 19,418-patient cohort, any change in the rule directly alters which methods cross the stability threshold at which n, rendering the reported ordering (LIN earliest, then QUA/MFP/XGB) dependent on an unvalidated operational choice.
  2. [Results] Results (sample-size scenario comparisons): the bootstrap procedure used to compute MAPE for each individual prediction is not described in sufficient detail (e.g., whether the same bootstrap samples are used for model fitting and for stability assessment, how out-of-bag predictions are handled, and whether the 90 % rule is applied per bootstrap replicate or aggregated). This detail is load-bearing for the claim that LIN is stable from n = 874 onward while the other flexible methods are not.
minor comments (2)
  1. [Abstract] The abstract states that “at n = 874, DIC and LIN achieved stable predictions” but does not report the exact percentages or confidence intervals around the 90 % criterion; adding these numbers would improve transparency.
  2. [Tables/Figures] Table or figure legends should explicitly state the number of bootstrap replicates and the exact definition of MAPE (mean absolute prediction error on the probability scale) to allow readers to reproduce the stability calculations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which highlight important aspects of our stability definition and bootstrap implementation. We address each major comment below and indicate planned revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Methods] Methods (stability definition paragraph): the threshold “at least 90 % of individual predictions with MAPE ≤5 %” is introduced without sensitivity checks on either the 90 % or 5 % cut-off, without comparison to alternative metrics (Brier score, absolute probability error), and without clinical justification. Because all five sample-size scenarios are subsamples of the identical 19,418-patient cohort, any change in the rule directly alters which methods cross the stability threshold at which n, rendering the reported ordering (LIN earliest, then QUA/MFP/XGB) dependent on an unvalidated operational choice.

    Authors: We agree that the specific 90% and 5% thresholds represent an operational choice that could influence the reported ordering, and that additional justification and robustness checks are warranted. The thresholds were selected to reflect a stringent requirement for the majority of predictions to exhibit low error, consistent with aims for reliable clinical use, but we acknowledge the absence of sensitivity analyses in the original submission. In the revised manuscript we will add a dedicated sensitivity analysis subsection in Methods and Results. This will include: (i) re-running the stability assessment across a grid of thresholds (proportion: 80%, 85%, 90%, 95%; MAPE: 3%, 5%, 7%, 10%); (ii) alternative stability definitions based on Brier score and mean absolute probability error; and (iii) explicit reporting of whether the relative ordering of methods (LIN earliest, followed by QUA/MFP/XGB) remains consistent. We will also add a short clinical rationale paragraph referencing literature on acceptable prediction error in emergency-department risk models. These additions will directly address the dependence on the chosen rule.
    revision: yes

  2. Referee: [Results] Results (sample-size scenario comparisons): the bootstrap procedure used to compute MAPE for each individual prediction is not described in sufficient detail (e.g., whether the same bootstrap samples are used for model fitting and for stability assessment, how out-of-bag predictions are handled, and whether the 90 % rule is applied per bootstrap replicate or aggregated). This detail is load-bearing for the claim that LIN is stable from n = 874 onward while the other flexible methods are not.

    Authors: We apologise for the insufficient detail on the bootstrap workflow. The procedure separated model fitting from stability evaluation: for each subsample size, 1,000 bootstrap replicates were drawn to fit each modelling method; an independent set of 1,000 bootstrap replicates was then used to generate predictions on the original subsample, employing out-of-bag observations for the stability metric. The 90% rule was applied after aggregation across all stability-assessment replicates (i.e., the proportion of individuals whose average MAPE across replicates met the threshold). We will expand the Methods section with a new subsection titled “Bootstrap-based stability assessment” that includes a step-by-step description, a schematic diagram, and explicit statements on the separation of fitting and evaluation samples, handling of out-of-bag predictions, and aggregation of the 90% criterion. This clarification will be cross-referenced in the Results when presenting the n = 874 findings.
    revision: yes
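The threshold grid proposed in the first response can be sketched as follows. This is only an illustration: `stability_grid` is a hypothetical helper, the per-individual MAPE values are made up, and the grid values are the ones listed in the authors' response.

```python
def stability_grid(
    mape_values,
    mape_limits=(0.03, 0.05, 0.07, 0.10),
    min_shares=(0.80, 0.85, 0.90, 0.95),
):
    """Map each (MAPE limit, required share) pair to a stable/unstable verdict."""
    n = len(mape_values)
    grid = {}
    for limit in mape_limits:
        # Share of individuals whose MAPE is within this limit.
        share = sum(m <= limit for m in mape_values) / n
        for min_share in min_shares:
            grid[(limit, min_share)] = share >= min_share
    return grid


# Made-up per-individual MAPE values for one method at one sample size.
toy_mape = [0.01, 0.02, 0.02, 0.03, 0.03, 0.04, 0.04, 0.06, 0.06, 0.12]

grid = stability_grid(toy_mape)
for (limit, min_share), verdict in sorted(grid.items()):
    print(f"MAPE <= {limit:.2f}, share >= {min_share:.2f}: {verdict}")
```

With this toy input the verdict under the 90 percent requirement flips between the 5 percent and 7 percent limits, which is the kind of dependence on the rule that the referee's first comment raises.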

Circularity Check

0 steps flagged

No circularity: purely empirical comparison with operational metric applied to data

Full rationale

The paper performs an empirical bootstrap-based comparison of modeling methods on subsamples from one fixed clinical cohort. Stability is defined once in Methods as an operational threshold (≥90% of predictions with MAPE≤5%) and then measured directly; the ordering of methods by sample size at which the threshold is crossed is an observed outcome, not a quantity that reduces to the definition by construction. No equations, fitted parameters renamed as predictions, self-citation chains, or uniqueness theorems appear. The study is self-contained against external benchmarks and receives the default non-finding.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The study is empirical and relies on domain assumptions about data and validation methods rather than new postulates; free parameters are limited to the stability definition and chosen sample sizes.

free parameters (2)
  • stability criterion (90% of predictions with MAPE <=5%)
    Arbitrary threshold defining when a method is considered stable; directly determines which methods qualify at each sample size.
  • sample size scenarios (437, 874, 1748, 3496, 8739)
    Specific values extracted from the full dataset to create comparison scenarios.
axioms (2)
  • domain assumption: The bootstrap resampling framework accurately estimates prediction stability for individual patients.
    Central to the evaluation method described in the abstract.
  • domain assumption: The real clinical dataset of 19,418 emergency patients is representative for testing modeling choices in clinical prediction.
    Used as the source for all sample size scenarios.

pith-pipeline@v0.9.1-grok · 5938 in / 1512 out tokens · 28729 ms · 2026-06-27T21:17:21.882332+00:00 · methodology


Acknowledgements: This study was partially supported by Chiang Mai University and the Faculty of Medici...