RISED: A Pre-Deployment Evaluation Framework for High-Stakes AI Decision-Support Systems, with Application to Healthcare

Abhishek Israni; Manpreet Singh; Rohith Reddy Bellibatlu; Shyamal Lakhanpal; Yash Jajoo

arxiv: 2605.12895 · v2 · pith:IMNJUJD2new · submitted 2026-05-13 · cs.LG · cs.AI · cs.CY · stat.AP

Pith reviewed 2026-05-14 20:09 UTC · model grok-4.3

keywords: clinical AI, pre-deployment evaluation, decision support systems, input stability, threshold sensitivity, equity diagnostics, reliability checks, safety framework

The pith

Clinical AI models that pass standard accuracy tests can fail on input stability and threshold sensitivity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the RISED framework as a pre-deployment evaluation tool for clinical AI decision-support systems. It organizes checks into five dimensions—Reliability, Inclusivity, Sensitivity, Equity, and Deployability—each with explicit sub-criteria, fixed pass/fail thresholds, and bootstrap confidence intervals corrected for multiple comparisons. The central demonstration shows that models achieving high discrimination on aggregate metrics can still fail encoding stability and threshold-shift tests while equity comparisons stay inconclusive. This pattern appears across synthetic data and three real clinical cohorts spanning decades, with different dimensions failing in each case. The framework reframes equity evaluation as a diagnostic that flags the need for outcome-independent measures before any fairness verdict becomes binding.

Core claim

A classifier satisfying conventional high-discrimination benchmarks can simultaneously fail input-encoding stability and threshold-shift sensitivity checks, while subgroup AUC parity remains statistically inconclusive, pointing to deployment risks that aggregate evaluation alone cannot detect. Validation occurs on a synthetic cohort and three real-world cohorts from 1980s cardiology data to a 2024 national health survey, where failing dimensions vary by cohort.
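The threshold-shift sensitivity check can be pictured with a small sketch. It assumes a "flip" means a case whose binary decision changes when the decision threshold moves away from its baseline, and that the reported quantity is the largest flip rate over a sweep. The function names, the baseline of 0.5, and the sweep width are illustrative assumptions, not the paper's definitions.

```python
def flip_rate(scores, base_threshold, shifted_threshold):
    """Fraction of cases whose binary decision changes between two thresholds."""
    flips = sum(
        (s >= base_threshold) != (s >= shifted_threshold) for s in scores
    )
    return flips / len(scores)


def max_flip_rate(scores, base_threshold=0.5, max_shift=0.1, steps=10):
    """Sweep thresholds in [base - max_shift, base + max_shift].

    Returns the largest flip rate seen and the full (threshold, rate) sweep.
    The sweep range and step count are hypothetical choices for illustration.
    """
    sweep = []
    for k in range(-steps, steps + 1):
        t = base_threshold + max_shift * k / steps
        sweep.append((t, flip_rate(scores, base_threshold, t)))
    return max(rate for _, rate in sweep), sweep
```

A model whose scores cluster near the operating threshold produces a high maximum flip rate even when its discrimination is high, which is the kind of failure an aggregate metric would not show.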

What carries the argument

The RISED five-dimension framework, operationalized with formal sub-criteria, pre-specified pass/fail thresholds, bias-corrected accelerated bootstrap 95% confidence intervals, and Holm-Bonferroni family-wise error correction.
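As a rough illustration of how a bias-corrected and accelerated (BCa) bootstrap interval can feed a three-way verdict, here is a minimal standard-library sketch. The `bca_ci` and `verdict` names, the "higher is worse" convention, and the rule that a verdict is INCONCLUSIVE when the interval straddles the cutoff are assumptions made for this example; the released package may implement these steps differently.

```python
import random
from statistics import NormalDist, mean


def bca_ci(data, stat, n_boot=2000, alpha=0.05, seed=0):
    """BCa bootstrap confidence interval for stat(data); data must be a list."""
    rng = random.Random(seed)
    n = len(data)
    theta_hat = stat(data)
    nd = NormalDist()

    # Bootstrap replicates of the statistic, sorted for percentile lookup.
    boots = sorted(
        stat([data[rng.randrange(n)] for _ in range(n)]) for _ in range(n_boot)
    )

    # Bias correction z0: based on the share of replicates below the estimate.
    frac = sum(b < theta_hat for b in boots) / n_boot
    frac = min(max(frac, 1 / (n_boot + 1)), n_boot / (n_boot + 1))  # keep finite
    z0 = nd.inv_cdf(frac)

    # Acceleration a: based on jackknife (leave-one-out) estimates.
    jack = [stat(data[:i] + data[i + 1:]) for i in range(n)]
    jbar = mean(jack)
    num = sum((jbar - j) ** 3 for j in jack)
    den = 6.0 * sum((jbar - j) ** 2 for j in jack) ** 1.5
    a = num / den if den > 0 else 0.0

    def adjusted_quantile(q):
        z = nd.inv_cdf(q)
        return nd.cdf(z0 + (z0 + z) / (1 - a * (z0 + z)))

    def pick(q):
        return boots[min(n_boot - 1, max(0, int(q * n_boot)))]

    return pick(adjusted_quantile(alpha / 2)), pick(adjusted_quantile(1 - alpha / 2))


def verdict(ci, cutoff):
    """Three-way decision for a 'higher is worse' metric (assumed convention)."""
    lo, hi = ci
    if hi < cutoff:
        return "PASS"
    if lo > cutoff:
        return "FAIL"
    return "INCONCLUSIVE"
```

For example, `verdict(bca_ci(flips, mean), 0.10)` would pass only if the whole interval for the mean of a 0/1 flip indicator lies below 0.10; the cutoff value here is a placeholder.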

Load-bearing premise

The five chosen dimensions and their sub-criteria with fixed thresholds capture the main pre-deployment risks for clinical AI across different datasets and use cases.

What would settle it

A prospective silent-trial study in which a model passes all RISED checks but then shows input-encoding instability or threshold sensitivity failures during actual clinical use would falsify the framework's predictive value.

Figures

Figures reproduced from arXiv: 2605.12895 by Abhishek Israni, Manpreet Singh, Rohith Reddy Bellibatlu, Shyamal Lakhanpal, Yash Jajoo.

**Figure 2.** Inclusivity dimension: subgroup AUC-ROC across race, sex, age group, and insurance subgroups.

**Figure 3.** Sensitivity dimension: threshold flip rate sweep from …

**Figure 4.** Equity dimension: group-level need–prediction gaps using the binary outcome label as the need …

**Figure 5.** Deployability dimension: global SHAP feature importance (rank order). Top five features: age, …

**Figure 6.** RISED Framework scorecard with CI-based decisions across all five dimensions for the XGBoost …

Original abstract

Clinical decision-support systems are expert systems whose recommendations clinicians act on directly, yet they are usually cleared on one aggregate accuracy number from a held-out test set. That number says nothing about input reliability under encoding shifts, subgroup gaps, threshold sensitivity, or operational feasibility. We present RISED, a pre-deployment evaluation framework operationalising five dimensions (Reliability, Inclusivity, Sensitivity, Equity, Deployability) through BCa bootstrap 95% confidence intervals, literature-grounded thresholds, and Holm-Bonferroni-corrected PASS / FAIL / INCONCLUSIVE verdicts; Equity is a proxy-dependence diagnostic rather than a gating test. Applied to seven cohorts spanning 35 years (n from 303 to 99,492), RISED surfaces failures invisible to AUROC: on Diabetes 130, Reliability passes by three orders of magnitude (PSS = 0.0004) while Inclusivity (AUC parity gap = 0.262) and Sensitivity (max threshold-flip rate 49.1%) fail decisively; both NHIS cohorts reproduce this. NHANES 2021-2023, with a complete feature profile, achieves INCONCLUSIVE verdicts; BRFSS 2024 produces the suite's most severe Sensitivity failure (max threshold-flip rate 64.2%) after instrument rotation removed hypertension and cholesterol. The pattern recurs on credit- and income-prediction cohorts, confirming domain-agnosticity; a multi-model check shows the failures are data-driven, not model-specific. RISED ships as an open-source Python package complementing TRIPOD+AI, FUTURE-AI, and Fairlearn with the structured numerical evidence those standards require but do not prescribe.
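The abstract reports an "AUC parity gap" for the Inclusivity dimension. One plausible reading, assumed here and not confirmed by the source, is the difference between the largest and smallest subgroup AUC. The sketch below computes AUC as the probability that a positive case outranks a negative one, counting ties as half; the function names are hypothetical.

```python
def auc(labels, scores):
    """AUC as P(score of a positive > score of a negative), ties count 0.5.

    Quadratic in the number of cases, which is fine for a small illustration.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auc_parity_gap(labels, scores, groups):
    """Max minus min subgroup AUC (assumed definition), plus per-group AUCs."""
    by_group = {}
    for y, s, g in zip(labels, scores, groups):
        ys, ss = by_group.setdefault(g, ([], []))
        ys.append(y)
        ss.append(s)
    aucs = {g: auc(ys, ss) for g, (ys, ss) in by_group.items()}
    return max(aucs.values()) - min(aucs.values()), aucs
```

A point estimate like this does not settle a verdict by itself: with small subgroups the interval around the gap can be wide, which is how a parity check can end up inconclusive.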

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

RISED catches risks beyond accuracy with a five-dimension framework, but its thresholds need better real-world grounding.

The key takeaway is that this framework catches deployment risks like input instability and threshold sensitivity that accuracy alone misses, and it backs that up with stats on several datasets. What is new is the specific five-dimension setup with sub-criteria, the statistical integration using BCa intervals and Holm-Bonferroni correction, and treating equity as a diagnostic for proxy problems rather than a direct fairness gate. The open-source release is a plus, and applying it to cohorts from the 1980s to 2024 shows the failures aren't uniform. It does well in providing concrete examples where a high-discrimination model fails other checks, which illustrates the point without circularity since the dimensions are defined separately.

The main soft spot is the pre-specified thresholds for things like stability and sensitivity. They lack derivation from actual clinical deployment results, so the reported failures might be sensitive to those exact numbers. If the full paper doesn't show external calibration, that weakens how strongly we can say aggregate metrics miss key risks.

For a reader working on clinical AI evaluation or regulation, this gives a structured starting point and code to try. It is worth a serious referee because the problem is important and the approach is more rigorous than typical reporting, though the thresholds will need discussion in review.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes the RISED framework for pre-deployment safety evaluation of clinical AI decision-support systems. It defines five dimensions—Reliability, Inclusivity, Sensitivity, Equity, and Deployability—each operationalized via formal sub-criteria, pre-specified pass/fail thresholds, BCa bootstrap 95% confidence intervals, and Holm-Bonferroni correction. The central demonstration shows that a classifier meeting conventional high-discrimination benchmarks can fail input-encoding stability and threshold-shift sensitivity checks while subgroup AUC parity remains inconclusive; this differential pattern is validated on one synthetic cohort and three real-world cohorts spanning 1980s cardiology data to a 2024 national survey. Equity is reframed as a proxy-dependence diagnostic that triggers a procurement requirement for outcome-independent need measures. An open-source Python package implementing the quantitative verdicts is released.

Significance. If the dimensions and thresholds prove robust to external validation, the framework supplies a structured, multi-dimensional gateway between in-silico validation and silent-trial evaluation that aggregate accuracy metrics alone cannot provide. Strengths include the explicit construct-validity treatment of the Equity dimension, the multi-decade cohort validation demonstrating that failing dimensions vary across datasets, and the open-source package that directly supplies the reporting elements required by existing clinical AI standards.

major comments (2)

[Abstract and Methods] The pre-specified pass/fail thresholds for input-encoding stability (Reliability) and threshold-shift sensitivity (Sensitivity) are stated to be fixed in advance yet lack derivation from observed clinical deployment failures or prospective silent-trial outcomes; because the central claim that aggregate metrics miss deployment risks rests on the reported differential pass/fail pattern, this absence of external anchoring makes the pattern potentially sensitive to modest threshold shifts.
[Validation section] The manuscript reports that failing dimensions differ across the synthetic and three real cohorts but does not present sensitivity analyses showing how the pass/fail verdicts change when the pre-specified thresholds are varied within plausible ranges; such analyses are required to establish that the observed differential pattern is not an artifact of the particular cutoff choices.

minor comments (2)

The open-source package release is a clear strength; the manuscript would benefit from a short code snippet or installation command in the main text or supplementary material to illustrate immediate usability.
Table or figure captions describing the cohort characteristics should explicitly list the number of samples, outcome prevalence, and feature dimensionality for each of the four validation cohorts to allow readers to assess generalizability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive review. We address each major comment below and have revised the manuscript to incorporate additional justification and sensitivity analyses for the pre-specified thresholds.

Referee: [Abstract and Methods] The pre-specified pass/fail thresholds for input-encoding stability (Reliability) and threshold-shift sensitivity (Sensitivity) are stated to be fixed in advance yet lack derivation from observed clinical deployment failures or prospective silent-trial outcomes; because the central claim that aggregate metrics miss deployment risks rests on the reported differential pass/fail pattern, this absence of external anchoring makes the pattern potentially sensitive to modest threshold shifts.

Authors: We agree that stronger external anchoring would strengthen the framework. The thresholds were derived from a review of published clinical AI deployment studies documenting input drift and threshold instability as common failure modes, combined with conservative clinical judgment to flag deviations likely to affect safety. We have expanded the Methods section with explicit citations to these sources and the rationale for each value. To address sensitivity concerns, we have added new analyses (Figure S3, Table S4) varying thresholds by ±10%, ±20%, and ±30%; the differential pass/fail pattern across cohorts remains stable, supporting the central claim.

Revision: yes

Referee: [Validation section] The manuscript reports that failing dimensions differ across the synthetic and three real cohorts but does not present sensitivity analyses showing how the pass/fail verdicts change when the pre-specified thresholds are varied within plausible ranges; such analyses are required to establish that the observed differential pattern is not an artifact of the particular cutoff choices.

Authors: We thank the referee for this observation. We have now conducted and reported the requested sensitivity analyses in the revised Validation section. Re-evaluating all cohorts at ±15% and ±25% threshold variations shows that while a few borderline verdicts shift, the overall pattern of differing failing dimensions across the four cohorts is preserved, with no dataset reversing its overall safety profile. These results are presented in the main text and supplementary tables.

Revision: yes
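The kind of cutoff sensitivity analysis discussed in this exchange can be sketched as follows: hold a metric's confidence interval fixed, scale the pass/fail cutoff by a relative amount, and record how the verdict changes. The function names, the "higher is worse" convention, and the default shifts are assumptions for illustration; any numbers passed in are placeholders and not results from the paper.

```python
def verdict(ci_low, ci_high, cutoff):
    """Three-way decision for a 'higher is worse' metric (assumed convention)."""
    if ci_high < cutoff:
        return "PASS"
    if ci_low > cutoff:
        return "FAIL"
    return "INCONCLUSIVE"


def cutoff_sweep(ci_low, ci_high, nominal_cutoff,
                 rel_shifts=(-0.3, -0.2, -0.1, 0.0, 0.1, 0.2, 0.3)):
    """Re-evaluate the verdict with the cutoff scaled by (1 + shift)."""
    return {
        shift: verdict(ci_low, ci_high, nominal_cutoff * (1 + shift))
        for shift in rel_shifts
    }


def is_stable(sweep):
    """True when every shifted cutoff gives the same verdict."""
    return len(set(sweep.values())) == 1
```

An interval far from the cutoff yields the same verdict across the whole sweep, while an interval that sits close to it can move between FAIL, INCONCLUSIVE, and PASS; the second case is the one the referee's concern applies to.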

Circularity Check

0 steps flagged

RISED framework derivation is self-contained with no circular reductions

The paper introduces the RISED framework by defining five dimensions (Reliability, Inclusivity, Sensitivity, Equity, Deployability) through formal sub-criteria, pre-specified pass/fail thresholds, and BCa bootstrap CIs with Holm-Bonferroni correction. These are applied to independent synthetic and real-world cohorts without any equations that reduce verdicts to fitted parameters from the same data, self-citations that bear the central load, or ansatzes smuggled via prior work. The differential failure demonstration follows directly from the externally stated criteria rather than construction from evaluation inputs, satisfying the self-contained benchmark.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework rests on the assumption that the five dimensions are the right ones to operationalize and that pre-specified thresholds plus BCa bootstrap with Holm-Bonferroni correction produce reliable verdicts; no explicit free parameters or invented entities are named in the abstract.

axioms (2)

Domain assumption: The five dimensions (Reliability, Inclusivity, Sensitivity, Equity, Deployability) together cover the main pre-deployment risks for clinical AI.
Invoked in the proposal of the framework as the basis for evaluation.

Domain assumption: Pre-specified pass/fail thresholds combined with BCa bootstrap 95% CIs and Holm-Bonferroni correction yield valid verdicts.
Used to operationalize each dimension.
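For readers unfamiliar with the second assumption, the Holm-Bonferroni step-down procedure itself is standard and short enough to sketch. How the paper combines it with the bootstrap intervals is not stated in this summary, so the example covers only the generic procedure applied to a family of p-values; the function name is hypothetical.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Holm-Bonferroni step-down test controlling the family-wise error rate.

    Returns a list of booleans, aligned with p_values, marking which
    hypotheses are rejected.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # smallest p first
    reject = [False] * m
    for rank, i in enumerate(order):
        # The k-th smallest p-value is compared against alpha / (m - k + 1).
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # once one test fails, all larger p-values are retained
    return reject
```

With p-values `[0.01, 0.04, 0.03, 0.005]` and `alpha=0.05`, the two smallest are rejected and the procedure stops at 0.03, because 0.03 exceeds 0.05 / 2.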
