pith. machine review for the scientific record.

arxiv: 2604.03478 · v1 · submitted 2026-04-03 · 💻 cs.LG

Recognition: no theorem link

Investigating Data Interventions for Subgroup Fairness: An ICU Case Study

Authors on Pith no claims yet

Pith reviewed 2026-05-13 19:30 UTC · model grok-4.3

classification 💻 cs.LG
keywords subgroup fairness · data addition · distribution shifts · ICU models · post-hoc calibration · algorithmic bias · electronic health records · healthcare machine learning

The pith

Adding data from different ICU sources can both improve and degrade subgroup fairness in clinical models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how combining electronic health record data from multiple hospitals affects fairness for patient subgroups in machine learning models for ICU outcomes. It shows that adding data often introduces distribution shifts that offset the gains from larger sample sizes, making results volatile. Intuitive choices about which data to include frequently fail to deliver consistent fairness benefits. A mix of adding data and applying post-hoc calibration to the trained model produces better subgroup performance than either method alone. This matters for healthcare because pooled datasets are common, yet they can amplify biases if interventions are not evaluated together.

Core claim

Across the eICU Collaborative Research Database and the MIMIC-IV dataset, data addition can both help and hurt model fairness and performance, and many intuitive data-selection strategies are unreliable. Combining model-based post-hoc calibration with data-centric addition strategies improves subgroup performance more reliably than either approach alone. The work challenges the traditional dogma that more data overcomes fairness challenges in clinical models.

What carries the argument

The interaction of data addition strategies from multiple sources with post-hoc model calibration to address subgroup performance gaps caused by distribution shifts.
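To make that interaction concrete, a minimal sketch follows, assuming scikit-learn-style arrays, a held-out target split for calibration, and Platt scaling via `CalibratedClassifierCV`; the helper names (`subgroup_accuracy`, `data_addition_with_calibration`) are hypothetical, not the authors' implementation.

```python
# Minimal sketch of the combined intervention: pool data from a candidate
# source, then calibrate post hoc before scoring subgroups. Assumptions
# (not from the paper): sklearn-style arrays, sigmoid (Platt) calibration
# on a held-out target split, accuracy at the default threshold.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.calibration import CalibratedClassifierCV

def subgroup_accuracy(model, X, y, groups):
    """Accuracy per subgroup; the spread across groups is the fairness signal."""
    return {g: model.score(X[groups == g], y[groups == g])
            for g in np.unique(groups)}

def data_addition_with_calibration(X_tgt, y_tgt, X_src, y_src,
                                   X_cal, y_cal, X_te, y_te, g_te):
    # Data-centric step: add the candidate source to the target pool.
    X_pool = np.vstack([X_tgt, X_src])
    y_pool = np.concatenate([y_tgt, y_src])
    base = LogisticRegression(max_iter=1000).fit(X_pool, y_pool)
    # Model-centric step: post-hoc calibration on held-out target data.
    calib = CalibratedClassifierCV(base, method="sigmoid", cv="prefit")
    calib.fit(X_cal, y_cal)
    return subgroup_accuracy(calib, X_te, y_te, g_te)
```

Comparing this combined pipeline against its pool-only and calibrate-only variants is the comparison the paper's headline claim rests on.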

If this is right

  • Combining records from different hospitals does not reliably improve fairness or performance.
  • Common intuitive rules for selecting which data to add often produce unreliable results.
  • Post-hoc calibration works better when paired with data interventions than when used in isolation.
  • The assumption that more data automatically reduces fairness problems does not hold in these clinical settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Practitioners in other high-stakes areas may see similar volatility when merging datasets from separate institutions.
  • Developers should test multiple data combinations explicitly rather than defaulting to larger pools.
  • Better metrics that tie directly to downstream clinical decisions could reduce reliance on current subgroup definitions.

Load-bearing premise

The chosen subgroup definitions and fairness metrics on these two ICU datasets reflect the relevant real-world harms from biased predictions.

What would settle it

A consistent improvement in fairness metrics for all subgroups when adding data from a new hospital source, without any post-hoc calibration and across repeated trials, would falsify the volatility finding.
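A hypothetical harness for that test is sketched below, assuming a `train_eval` callable that trains on a given index set and returns subgroup test accuracy; the falsifying pattern would be a strictly positive delta for every subgroup in every trial.

```python
# Hypothetical harness for the falsification test: repeat whole-source
# data addition over many seeds and check whether every subgroup improves
# every time. train_eval(train_idx, group) -> subgroup test accuracy is
# an assumed callable, not part of the paper's released code.
import numpy as np

def volatility_check(train_eval, tgt_idx, src_idx, groups, n_trials=20):
    deltas = {g: [] for g in groups}
    for seed in range(n_trials):
        rng = np.random.default_rng(seed)
        base = rng.choice(tgt_idx, size=len(tgt_idx), replace=True)
        added = np.concatenate([base, rng.choice(src_idx, size=len(src_idx))])
        for g in groups:
            deltas[g].append(train_eval(added, g) - train_eval(base, g))
    # Falsifying pattern: strictly positive deltas for all groups, all trials.
    return all(min(d) > 0 for d in deltas.values()), deltas
```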

Figures

Figures reproduced from arXiv: 2604.03478 by Erin Tan, Irene Y. Chen, Judy Hanwen Shen.

Figure 1. Change in overall and subgroup-level accuracy after WHOLE-SOURCE data addition (logistic regression). The change in overall performance (a) is not reflected equally across subgroups: adding data from any source to Target Hospital 458 improves overall accuracy, and while this change is reflected in the White (b) and Black (c) subgroups, the Other (d) …
Figure 2. Change in subgroup ratio vs. change in subgroup test accuracy after WHOLE-SOURCE data addition on the eICU dataset.
Figure 3. Change in subgroup accuracy as a function of samples added in SUBGROUP-LEVEL data addition (see Section 4 of the paper for details). Across nearly all combinations of subgroups and test hospitals, adding more samples does not necessarily lead to larger performance gains; naive subgroup balancing is an uninformative data-selection heuristic.
Figure 4. Subgroup similarity score (using features and labels) vs. change in subgroup test accuracy for the White (left) and Black (right) subgroups (eICU). Scores are computed using only the patients from the target subgroup in each source, and the performance changes result from SUBGROUP-LEVEL data addition. No statistically significant correlations are observed in any test hospital for either subgroup.
Figure 5. Mean consistency and subgroup performance: (a) subgroup mean discrepancy (Eq. 6) vs. subgroup test accuracy (Eq. 1); (b) change in subgroup mean discrepancy vs. change in subgroup test accuracy. Strong negative correlations are observed across all subgroups in (a) and in all minority subgroups in (b).
Figure 6. Difference in best-case subgroup performance (without calibration) and worst-case subgroup performance (with calibration) after WHOLE-SOURCE data addition. The overwhelming majority of subgroups show positive differences, supporting the idea that making informed choices for data addition is more important than calibrating post hoc.
Figure 7. Change in overall and subgroup-level accuracy after WHOLE-SOURCE data addition. All results from …
Figure 8. Change in overall test accuracy vs. change in subgroup test accuracy. The Pareto frontier …
Figure 9. Change in subgroup ratio vs. change in subgroup test AUC on the eICU dataset. Same …
Figure 10. Number of subgroup samples added vs. change in subgroup test AUC on the eICU …
Figure 11. Subgroup similarity score vs. change in subgroup test AUC on the eICU dataset. Same …
Figure 12. Difference in best-case subgroup AUC (without calibration) and worst-case subgroup …
Figure 13. Difference in best-case subgroup AUC (with calibration) and worst-case subgroup AUC …
Figure 14. Change in performance after WHOLE-SOURCE data addition using the Light Gradient Boosting Machine (LGBM) classifier on the eICU dataset.
Figure 15. Change in subgroup rate vs. change in subgroup accuracy after W…
Figure 16. Number of subgroup samples added vs. change in subgroup accuracy after S…
Figure 17. Subgroup similarity score between the test subgroup and added subgroup vs. change in …
Figure 18. Difference in best-case subgroup performance (without calibration) and worst-case sub…
Figure 19. Difference in best-case subgroup performance (with calibration) and worst-case subgroup …
Figure 20. Change in overall and subgroup accuracy after W…
Figure 21. Change in subgroup rate vs. change in subgroup accuracy after W…
Figure 22. Number of subgroup samples added vs. change in subgroup accuracy after S…
Figure 23. Subgroup similarity score between the test subgroup and added subgroup vs. change in …
Figure 24. Change in overall and subgroup AUC by ethnicity group after W…
Original abstract

In high-stakes settings where machine learning models are used to automate decision-making about individuals, the presence of algorithmic bias can exacerbate systemic harm to certain subgroups of people. These biases often stem from the underlying training data. In practice, interventions to "fix the data" depend on the actual additional data sources available -- where many are less than ideal. In these cases, the effects of data scaling on subgroup performance become volatile, as the improvements from increased sample size are counteracted by the introduction of distribution shifts in the training set. In this paper, we investigate the limitations of combining data sources to improve subgroup performance within the context of healthcare. Clinical models are commonly trained on datasets comprised of patient electronic health record (EHR) data from different hospitals or admission departments. Across two such datasets, the eICU Collaborative Research Database and the MIMIC-IV dataset, we find that data addition can both help and hurt model fairness and performance, and many intuitive strategies for data selection are unreliable. We compare model-based post-hoc calibration and data-centric addition strategies to find that the combination of both is important to improve subgroup performance. Our work questions the traditional dogma of "better data" for overcoming fairness challenges by comparing and combining data- and model-based approaches.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper investigates data addition interventions for improving subgroup fairness in ICU clinical prediction models trained on the eICU Collaborative Research Database and MIMIC-IV. It reports that combining data sources can both help and hurt fairness and performance metrics because sample-size gains are counteracted by distribution shifts, that many intuitive data-selection heuristics are unreliable, and that hybrid use of post-hoc model calibration together with data-centric addition is required to achieve reliable subgroup improvements. The work challenges the assumption that more or 'better' data alone resolves fairness issues in high-stakes healthcare settings.

Significance. If the empirical patterns hold after appropriate controls, the result would be significant for fair ML in healthcare: it supplies concrete evidence that data scaling is volatile in real multi-source EHR settings and that purely data-centric or purely model-centric fixes are each insufficient. The comparison of calibration versus addition strategies offers a practical takeaway for practitioners who must work with imperfect additional data sources.

major comments (3)
  1. [§4 and §5] §4 (Experiments) and §5 (Results): the central attribution of volatility to distribution shifts is not isolated from changes in subgroup prevalence or label balance. No subsampling to fixed total N, reweighting to preserve subgroup proportions, or explicit reporting of pre/post-addition subgroup sizes and positive rates is described; without these controls the observed help/hurt patterns could be driven by incidental shifts in empirical risk rather than the claimed mechanism.
  2. [Abstract and §3] Abstract and §3 (Methods): subgroup definitions, fairness metrics (e.g., which disparity measure is primary), and statistical tests are not specified with sufficient detail to allow verification of the directional claims. The abstract states findings without error bars, confidence intervals, or p-values; the full results section must supply these to support the conclusion that intuitive strategies are 'unreliable'.
  3. [§5.3] §5.3 (Comparison of calibration and addition): the claim that 'the combination of both is important' requires an ablation that holds total training size fixed while varying only the source composition or calibration step. If the hybrid improvement disappears under such a control, the recommendation for combined interventions would need qualification. (A minimal sketch of such a fixed-size control follows the minor comments below.)
minor comments (2)
  1. [Figures] Figure captions and axis labels should explicitly state the fairness metric plotted and the subgroup definitions used; current presentation leaves some visual comparisons ambiguous.
  2. [Discussion] The paper should add a short limitations paragraph discussing generalizability beyond the two ICU datasets and the specific clinical tasks examined.
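
For concreteness, here is a minimal sketch of the fixed-size control requested in major comments 1 and 3; `fixed_n_mixture` is a hypothetical helper over assumed array inputs, not code from the paper.

```python
# Sketch of the fixed-total-N control: vary source composition while
# holding training-set size constant, so composition (distribution shift)
# is not confounded with scale. Array inputs are assumed.
import numpy as np

def fixed_n_mixture(X_tgt, y_tgt, X_src, y_src, frac_src, n_total, rng):
    """Draw an n_total-sample training set with a given source fraction."""
    n_src = int(round(frac_src * n_total))
    n_tgt = n_total - n_src
    i_tgt = rng.choice(len(X_tgt), size=n_tgt, replace=False)
    i_src = rng.choice(len(X_src), size=n_src, replace=False)
    X = np.vstack([X_tgt[i_tgt], X_src[i_src]])
    y = np.concatenate([y_tgt[i_tgt], y_src[i_src]])
    return X, y

# Sweeping frac_src over {0.0, 0.25, 0.5, 0.75} at fixed n_total isolates
# the composition effect that major comment 1 says is currently entangled.
```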

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which help clarify the presentation and strengthen the empirical claims. We address each major point below and commit to revisions that improve the isolation of mechanisms, statistical rigor, and ablation controls.

Point-by-point responses
  1. Referee: [§4 and §5] §4 (Experiments) and §5 (Results): the central attribution of volatility to distribution shifts is not isolated from changes in subgroup prevalence or label balance. No subsampling to fixed total N, reweighting to preserve subgroup proportions, or explicit reporting of pre/post-addition subgroup sizes and positive rates is described; without these controls the observed help/hurt patterns could be driven by incidental shifts in empirical risk rather than the claimed mechanism.

    Authors: We agree that the current experiments do not fully isolate distribution shift from changes in subgroup prevalence and label balance. In the revised manuscript we will add (i) subsampling experiments that hold total training N fixed while varying source composition, (ii) explicit tables reporting pre- and post-addition subgroup sizes and positive rates for every condition, and (iii) reweighting analyses that preserve original subgroup proportions. These controls will allow readers to assess whether the observed volatility is attributable to distribution shift as claimed. revision: yes

  2. Referee: [Abstract and §3] Abstract and §3 (Methods): subgroup definitions, fairness metrics (e.g., which disparity measure is primary), and statistical tests are not specified with sufficient detail to allow verification of the directional claims. The abstract states findings without error bars, confidence intervals, or p-values; the full results section must supply these to support the conclusion that intuitive strategies are 'unreliable'.

    Authors: We acknowledge the lack of detail. In revision we will expand §3 to (a) list the exact subgroup definitions used (age, sex, ethnicity, admission type), (b) designate the primary fairness metric (equalized-odds difference) and secondary metrics, and (c) describe the statistical procedures (bootstrap confidence intervals and paired permutation tests with p-values). The abstract will be updated to reference the presence of error bars and statistical support; all figures and tables in §5 will include 95% CIs and p-values for the key comparisons that underpin the "unreliable" claim. (A sketch of the metric and interval computation follows these responses.) revision: yes

  3. Referee: [§5.3] §5.3 (Comparison of calibration and addition): the claim that 'the combination of both is important' requires an ablation that holds total training size fixed while varying only the source composition or calibration step. If the hybrid improvement disappears under such a control, the recommendation for combined interventions would need qualification.

    Authors: We accept the need for this control. We will add an ablation in the revised §5.3 that holds total training size constant (by subsampling the pooled dataset to match the size of the single-source baselines) and reports performance with and without post-hoc calibration. The results of this ablation will be presented alongside the original experiments; if the hybrid benefit is not robust, we will qualify the recommendation in both the results and discussion sections. revision: yes
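
A sketch of the statistics promised in response 2, assuming binary labels and hard predictions, and taking the equalized-odds difference as the larger of the TPR and FPR gaps between two groups; illustrative only, not the authors' code.

```python
# Sketch of the promised statistics: equalized-odds difference (the larger
# of the TPR and FPR gaps between two groups) with a bootstrap 95% CI.
# Assumes binary y_true / y_pred arrays; not the authors' code.
import numpy as np

def _rate(y_true, y_pred, label):
    mask = y_true == label          # TPR when label=1, FPR when label=0
    return y_pred[mask].mean() if mask.any() else np.nan

def eq_odds_diff(y_true, y_pred, g, a, b):
    tpr_gap = abs(_rate(y_true[g == a], y_pred[g == a], 1)
                  - _rate(y_true[g == b], y_pred[g == b], 1))
    fpr_gap = abs(_rate(y_true[g == a], y_pred[g == a], 0)
                  - _rate(y_true[g == b], y_pred[g == b], 0))
    return max(tpr_gap, fpr_gap)

def bootstrap_ci(y_true, y_pred, g, a, b, n_boot=2000, seed=0):
    rng = np.random.default_rng(seed)
    stats = []
    for _ in range(n_boot):
        i = rng.integers(0, len(y_true), size=len(y_true))  # resample rows
        stats.append(eq_odds_diff(y_true[i], y_pred[i], g[i], a, b))
    return np.nanpercentile(stats, [2.5, 97.5])
```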

Circularity Check

0 steps flagged

No circularity: purely empirical evaluation on public datasets

Full rationale

The paper performs an empirical study comparing data-addition strategies and post-hoc calibration on eICU and MIMIC-IV datasets. It reports observed effects on fairness and performance metrics without any derivation chain, first-principles predictions, fitted parameters renamed as predictions, or self-citation load-bearing steps. All claims rest on experimental outcomes from standard training and evaluation procedures, with no equations or definitions that reduce to their own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Empirical study that relies on standard ML fairness evaluation practices and public datasets without introducing new parameters or entities.

axioms (1)
  • domain assumption Subgroup fairness metrics (e.g., equalized odds or demographic parity) are appropriate proxies for real-world harm in ICU settings.
    Invoked when interpreting performance differences across subgroups as actionable bias.

pith-pipeline@v0.9.0 · 5521 in / 1136 out tokens · 39465 ms · 2026-05-13T19:30:45.895586+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

6 extracted references · 6 canonical work pages · 1 internal anchor

  1. [1] Anmol Arora, Joseph Alderman, Joanne Palmer, Shaswath Ganapathi, Elinor Laws, Melissa McCradden, Lauren Oakden-Rayner, Stephen Pfohl, Marzyeh Ghassemi, Francis McKay, Darren Treanor, Negar Rostamzadeh, Bilal Mateen, Jacqui Gath, Adewole Adebajo, Stephanie Kuku, Rubeta Matin, Katherine Heller, Elizabeth Sapey, and Xia…

  2. [2] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer. SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16:321–357, June 2002. doi: 10.1613/jair.953.

  3. [3] Sam Corbett-Davies, Emma Pierson, Avi Feller, Sharad Goel, and Aziz Huq. Algorithmic decision making and the cost of fairness. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '17), pp. 797–806. Association for Computing Machinery, New York, NY, USA, 2017.

  4. [4] Alistair E. W. Johnson, Lucas Bulgarelli, Lu Shen, Alvin Gayles, Ayad Shammout, Steven Horng, Tom J. Pollard, Sicheng Hao, Benjamin Moody, Brian Gow, Li-wei H. Lehman, Leo A. Celi, and Roger G. Mark. MIMIC-IV, a freely accessible electronic health record dataset. Scientific Data, 10(1):1, January 2023.

  5. [5] Tom J. Pollard, Alistair E. W. Johnson, Jesse D. Raffa, Leo A. Celi, Roger G. Mark, and Omar Badawi. The eICU Collaborative Research Database, a freely available multi-center database for critical care research. Scientific Data…

  6. [6] Internal anchor (Appendix D.1): "When sample sizes from minority groups are especially small, Zhioua et al. recommend using metrics which consider the tradeoff between sensitivity and specificity, such as AUC." Figure 24 shows the change in AUC after data addition compared to the base results; the plots show a similar phenomenon to Figure 1, where the p…