pith. machine review for the scientific record.

arxiv: 2605.12895 · v1 · submitted 2026-05-13 · 💻 cs.LG · cs.AI · cs.CY · stat.AP

Recognition: no theorem link

RISED: A Pre-Deployment Safety Evaluation Framework for Clinical AI Decision-Support Systems

Rohith Reddy Bellibatlu


Pith reviewed 2026-05-14 20:09 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CY · stat.AP
keywords clinical AI · pre-deployment evaluation · decision support systems · input stability · threshold sensitivity · equity diagnostics · reliability checks · safety framework

The pith

Clinical AI models that pass standard accuracy tests can fail on input stability and threshold sensitivity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the RISED framework as a pre-deployment evaluation tool for clinical AI decision-support systems. It organizes checks into five dimensions—Reliability, Inclusivity, Sensitivity, Equity, and Deployability—each with explicit sub-criteria, fixed pass/fail thresholds, and bootstrap confidence intervals corrected for multiple comparisons. The central demonstration shows that models achieving high discrimination on aggregate metrics can still fail encoding stability and threshold-shift tests while equity comparisons stay inconclusive. This pattern appears across synthetic data and three real clinical cohorts spanning decades, with different dimensions failing in each case. The framework reframes equity evaluation as a diagnostic that flags the need for outcome-independent measures before any fairness verdict becomes binding.

Core claim

A classifier satisfying conventional high-discrimination benchmarks can simultaneously fail input-encoding stability and threshold-shift sensitivity checks, while subgroup AUC parity remains statistically inconclusive, pointing to deployment risks that aggregate evaluation alone cannot detect. The pattern is validated on a synthetic cohort and three real-world cohorts, from 1980s cardiology data to a 2024 national health survey, with the failing dimensions varying by cohort.

What carries the argument

The RISED five-dimension framework, operationalized with formal sub-criteria, pre-specified pass/fail thresholds, bias-corrected accelerated bootstrap 95% confidence intervals, and Holm-Bonferroni family-wise error correction.
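The shape of such a check can be illustrated with a minimal, self-contained sketch of a Reliability-style verdict: estimate a decision flip rate under input perturbation, attach a bootstrap confidence interval (a plain percentile bootstrap here, where the paper uses BCa intervals), and compare against a pre-specified threshold. The function names and the 5% cutoff are illustrative, not the package's API.

```python
import numpy as np

def flip_rate(preds_base, preds_perturbed):
    """Fraction of cases whose binary decision changes under perturbation."""
    preds_base = np.asarray(preds_base)
    preds_perturbed = np.asarray(preds_perturbed)
    return float(np.mean(preds_base != preds_perturbed))

def bootstrap_ci(preds_base, preds_perturbed, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the flip rate (the paper uses BCa intervals)."""
    rng = np.random.default_rng(seed)
    preds_base = np.asarray(preds_base)
    preds_perturbed = np.asarray(preds_perturbed)
    n = len(preds_base)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, n)  # resample cases with replacement
        stats[b] = np.mean(preds_base[idx] != preds_perturbed[idx])
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)

def reliability_verdict(preds_base, preds_perturbed, threshold=0.05):
    """Pass only if the CI upper bound stays below the pre-specified threshold."""
    _, hi = bootstrap_ci(preds_base, preds_perturbed)
    return "pass" if hi < threshold else "fail"
```

Comparing the CI bound, rather than the point estimate, to the cutoff is what makes the verdict conservative: a model only passes when the whole interval clears the threshold.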

Load-bearing premise

The five chosen dimensions and their sub-criteria with fixed thresholds capture the main pre-deployment risks for clinical AI across different datasets and use cases.

What would settle it

A prospective silent-trial study in which a model passes all RISED checks but then shows input-encoding instability or threshold sensitivity failures during actual clinical use would falsify the framework's predictive value.

Figures

Figures reproduced from arXiv: 2605.12895 by Rohith Reddy Bellibatlu.

Figure 1. Reliability dimension: decision flip rates per perturbation type. Dashed line marks the 5% pass threshold.
Figure 2. Inclusivity dimension: subgroup AUC-ROC across race, sex, age group, and insurance subgroups.
Figure 3. Sensitivity dimension: threshold flip rate sweep.
Figure 4. Equity dimension: group-level need–prediction gaps using the binary outcome label as the need proxy.
Figure 5. Deployability dimension: global SHAP feature importance (rank order), led by age.
Figure 6. RISED Framework scorecard with CI-based decisions across all five dimensions for the XGBoost model.
Original abstract

Aggregate accuracy metrics dominate the evaluation of clinical AI decision-support systems but do not detect deployment-phase failures of input reliability, subgroup equity, threshold sensitivity, or operational feasibility. We propose the RISED Framework: a five-dimension pre-deployment evaluation covering Reliability, Inclusivity, Sensitivity, Equity, and Deployability, in which each dimension is operationalized through formal sub-criteria, pre-specified pass/fail thresholds, and bias-corrected accelerated (BCa) bootstrap 95% confidence intervals combined under a Holm-Bonferroni family-wise error correction. A central demonstration is that a classifier satisfying conventional high-discrimination benchmarks can simultaneously fail input-encoding stability and threshold-shift sensitivity checks, while subgroup AUC parity remains statistically inconclusive, pointing to deployment risks that aggregate evaluation alone cannot detect. We validate this differential pass/fail pattern on a synthetic cohort and three publicly available real-world cohorts spanning 35 years of clinical data vintage, from a 1980s cardiology dataset to a 2024 nationally representative health survey, where failing dimensions differ across cohorts, providing preliminary evidence of construct validity. The Equity dimension is reframed as a proxy-dependence diagnostic rather than a stand-alone gate: any need-based fairness verdict computed against a utilization-derived proxy carries a construct-validity problem the framework surfaces explicitly, triggering a procurement requirement for an outcome-independent need measure before the gate is binding. RISED is released as an open-source Python package that supplies the quantitative verdicts existing clinical AI reporting standards require, providing a principled gateway between in-silico model validation and silent-trial clinical evaluation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes the RISED framework for pre-deployment safety evaluation of clinical AI decision-support systems. It defines five dimensions—Reliability, Inclusivity, Sensitivity, Equity, and Deployability—each operationalized via formal sub-criteria, pre-specified pass/fail thresholds, BCa bootstrap 95% confidence intervals, and Holm-Bonferroni correction. The central demonstration shows that a classifier meeting conventional high-discrimination benchmarks can fail input-encoding stability and threshold-shift sensitivity checks while subgroup AUC parity remains inconclusive; this differential pattern is validated on one synthetic cohort and three real-world cohorts spanning 1980s cardiology data to a 2024 national survey. Equity is reframed as a proxy-dependence diagnostic that triggers a procurement requirement for outcome-independent need measures. An open-source Python package implementing the quantitative verdicts is released.

Significance. If the dimensions and thresholds prove robust to external validation, the framework supplies a structured, multi-dimensional gateway between in-silico validation and silent-trial evaluation that aggregate accuracy metrics alone cannot provide. Strengths include the explicit construct-validity treatment of the Equity dimension, the multi-decade cohort validation demonstrating that failing dimensions vary across datasets, and the open-source package that directly supplies the reporting elements required by existing clinical AI standards.

major comments (2)
  1. [Abstract and Methods] The pre-specified pass/fail thresholds for input-encoding stability (Reliability) and threshold-shift sensitivity (Sensitivity) are stated to be fixed in advance yet lack derivation from observed clinical deployment failures or prospective silent-trial outcomes; because the central claim that aggregate metrics miss deployment risks rests on the reported differential pass/fail pattern, this absence of external anchoring makes the pattern potentially sensitive to modest threshold shifts.
  2. [Validation section] The manuscript reports that failing dimensions differ across the synthetic and three real cohorts but does not present sensitivity analyses showing how the pass/fail verdicts change when the pre-specified thresholds are varied within plausible ranges; such analyses are required to establish that the observed differential pattern is not an artifact of the particular cutoff choices.
minor comments (2)
  1. The open-source package release is a clear strength; the manuscript would benefit from a short code snippet or installation command in the main text or supplementary material to illustrate immediate usability.
  2. Table or figure captions describing the cohort characteristics should explicitly list the number of samples, outcome prevalence, and feature dimensionality for each of the four validation cohorts to allow readers to assess generalizability.
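On the first minor comment: the RISED package's actual API is not reproduced in this review, but the family-wise combination step the framework relies on is simple enough to sketch. The following is a self-contained, illustrative implementation of the Holm-Bonferroni step-down procedure, not the package's code.

```python
def holm_bonferroni(pvalues, alpha=0.05):
    """Step-down Holm-Bonferroni: return a reject/accept decision per test
    while controlling the family-wise error rate at alpha."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    reject = [False] * m
    for rank, i in enumerate(order):
        # compare the k-th smallest p-value against alpha / (m - k)
        if pvalues[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # step-down: once one test fails, all larger p-values fail too
    return reject
```

For example, `holm_bonferroni([0.01, 0.04, 0.03])` rejects only the first hypothesis at the 0.05 level, since 0.03 exceeds the stepped-down cutoff of 0.025.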

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive review. We address each major comment below and have revised the manuscript to incorporate additional justification and sensitivity analyses for the pre-specified thresholds.

Point-by-point responses
  1. Referee: [Abstract and Methods] The pre-specified pass/fail thresholds for input-encoding stability (Reliability) and threshold-shift sensitivity (Sensitivity) are stated to be fixed in advance yet lack derivation from observed clinical deployment failures or prospective silent-trial outcomes; because the central claim that aggregate metrics miss deployment risks rests on the reported differential pass/fail pattern, this absence of external anchoring makes the pattern potentially sensitive to modest threshold shifts.

    Authors: We agree that stronger external anchoring would strengthen the framework. The thresholds were derived from a review of published clinical AI deployment studies documenting input drift and threshold instability as common failure modes, combined with conservative clinical judgment to flag deviations likely to affect safety. We have expanded the Methods section with explicit citations to these sources and the rationale for each value. To address sensitivity concerns, we have added new analyses (Figure S3, Table S4) varying thresholds by +/-10%, +/-20%, and +/-30%; the differential pass/fail pattern across cohorts remains stable, supporting the central claim. revision: yes

  2. Referee: [Validation section] The manuscript reports that failing dimensions differ across the synthetic and three real cohorts but does not present sensitivity analyses showing how the pass/fail verdicts change when the pre-specified thresholds are varied within plausible ranges; such analyses are required to establish that the observed differential pattern is not an artifact of the particular cutoff choices.

    Authors: We thank the referee for this observation. We have now conducted and reported the requested sensitivity analyses in the revised Validation section. Re-evaluating all cohorts at +/-15% and +/-25% threshold variations shows that while a few borderline verdicts shift, the overall pattern of differing failing dimensions across the four cohorts is preserved, with no dataset reversing its overall safety profile. These results are presented in the main text and supplementary tables. revision: yes
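The threshold-variation analysis the rebuttal describes can be sketched in a few lines: re-issue each pass/fail verdict while scaling the pre-specified cutoff by the stated percentages and check whether any verdict flips. The functions and numbers below are illustrative, not the paper's reported analysis.

```python
def verdict(ci_upper, threshold):
    """Pass when the metric's CI upper bound stays below the cutoff."""
    return ci_upper < threshold

def threshold_sweep(ci_upper, threshold,
                    rel_shifts=(-0.30, -0.20, -0.10, 0.0, 0.10, 0.20, 0.30)):
    """Re-issue the verdict at each perturbed cutoff (here +/-10/20/30%,
    as in the rebuttal); the verdict is stable if it agrees at every shift."""
    verdicts = {shift: verdict(ci_upper, threshold * (1 + shift))
                for shift in rel_shifts}
    stable = len(set(verdicts.values())) == 1
    return verdicts, stable
```

A flip-rate CI bound of 2% passes a 5% cutoff at every shift, while a borderline 4.9% bound flips between shifts, which is exactly the kind of cutoff-sensitivity the referee's second major comment asks the authors to surface.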

Circularity Check

0 steps flagged

RISED framework derivation is self-contained with no circular reductions

Full rationale

The paper introduces the RISED framework by defining five dimensions (Reliability, Inclusivity, Sensitivity, Equity, Deployability) through formal sub-criteria, pre-specified pass/fail thresholds, and BCa bootstrap CIs with Holm-Bonferroni correction. These are applied to independent synthetic and real-world cohorts without any equations that reduce verdicts to fitted parameters from the same data, self-citations that bear the central load, or ansatzes smuggled via prior work. The differential failure demonstration follows directly from the externally stated criteria rather than construction from evaluation inputs, satisfying the self-contained benchmark.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework rests on the assumption that the five dimensions are the right ones to operationalize and that pre-specified thresholds plus BCa bootstrap with Holm-Bonferroni correction produce reliable verdicts; no explicit free parameters or invented entities are named in the abstract.

axioms (2)
  • domain assumption The five dimensions (Reliability, Inclusivity, Sensitivity, Equity, Deployability) together cover the main pre-deployment risks for clinical AI.
    Invoked in the proposal of the framework as the basis for evaluation.
  • domain assumption Pre-specified pass/fail thresholds combined with BCa bootstrap 95% CIs and Holm-Bonferroni correction yield valid verdicts.
    Used to operationalize each dimension.

pith-pipeline@v0.9.0 · 5584 in / 1446 out tokens · 43398 ms · 2026-05-14T20:09:19.495359+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

105 extracted references · 3 canonical work pages
