Correcting heterogeneous diagnostic bias when developing clinical prediction models using causal hidden Markov models
Pith reviewed 2026-05-08 03:42 UTC · model grok-4.3
The pith
A causal hidden Markov model estimates the diagnosis probability an individual would have if tested at the same rate as a reference group, removing the label error that arises from uneven diagnostic testing.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By defining the target estimand as the counterfactual diagnosis probability under a reference testing regime and embedding that estimand inside a hidden Markov model whose emissions are confirmatory test results, the method removes systematic prediction bias caused by heterogeneous diagnostic rates and restores calibration-in-the-large for under-tested subgroups.
What carries the argument
A hidden Markov model of longitudinal disease stages in which confirmatory test results are treated as emissions from a latent progressive state, combined with causal adjustment that targets the diagnosis probability under a uniform reference testing rate.
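The machinery described above can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: all parameter values (transition matrix, emission probabilities, testing rates) are assumptions chosen for the example. It simulates latent disease stages progressing as a Markov chain, treats confirmatory test results as emissions observed only when a test is ordered, and contrasts the diagnosis probability under an under-tested regime with the counterfactual probability under a reference testing rate.

```python
import numpy as np

# Minimal sketch (assumed numbers, not the paper's fitted parameters):
# latent stages 0=healthy, 1=early, 2=diagnosable progress via a first-order
# Markov chain; a confirmatory test result is an emission seen only when a
# test is actually ordered, at a rate that differs between groups.
rng = np.random.default_rng(0)

P = np.array([[0.90, 0.08, 0.02],    # transition matrix between latent stages
              [0.00, 0.85, 0.15],
              [0.00, 0.00, 1.00]])   # stage 2 is absorbing
emit_pos = np.array([0.02, 0.30, 0.95])  # P(test positive | latent stage)

def diagnosis_prob(test_rate, horizon=10, n=50_000):
    """Monte-Carlo P(diagnosed within horizon) under a given testing rate."""
    stage = np.zeros(n, dtype=int)
    diagnosed = np.zeros(n, dtype=bool)
    for _ in range(horizon):
        # vectorised categorical draw of the next latent stage
        u = rng.random(n)
        stage = (u[:, None] < P[stage].cumsum(axis=1)).argmax(axis=1)
        tested = rng.random(n) < test_rate          # who gets a test ordered
        positive = rng.random(n) < emit_pos[stage]  # emission if tested
        diagnosed |= tested & positive
    return diagnosed.mean()

# Observed regime (under-tested group) vs. counterfactual reference regime:
p_obs = diagnosis_prob(test_rate=0.05)
p_ref = diagnosis_prob(test_rate=0.40)  # hypothetical reference testing rate
```

The gap between `p_obs` and `p_ref` is exactly the label error the paper targets: training on diagnoses recorded under the low testing rate systematically understates risk for that group.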
If this is right
- Prediction models developed with the correction show observed-to-expected ratios near 1.0 for under-diagnosed subgroups in both simulation and electronic health record data.
- The method identifies observable drivers of testing frequency, such as diabetes strongly increasing the rate of urine albumin-creatinine ratio tests.
- Even when the hidden Markov assumptions are violated, the corrected model remains better calibrated than an uncorrected standard model.
- The approach directly targets calibration-in-the-large rather than relying on post-hoc recalibration of biased labels.
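Calibration-in-the-large, as scored throughout the paper, reduces to the Observed:Expected ratio: observed events divided by the sum of predicted probabilities. A short sketch with simulated (assumed) risks shows why labels that are biased downward push O:E above 1.

```python
import numpy as np

# The Observed:Expected (O:E) ratio for calibration-in-the-large.
# O:E > 1 means the model under-predicts risk overall, the signature of a
# model trained on under-diagnosed labels. Data below are simulated.
def oe_ratio(y_observed, p_predicted):
    return np.sum(y_observed) / np.sum(p_predicted)

rng = np.random.default_rng(1)
p_true = rng.uniform(0.05, 0.4, size=10_000)    # hypothetical true risks
y = (rng.random(10_000) < p_true).astype(int)   # realized outcomes

print(oe_ratio(y, p_true))        # close to 1.0: well-calibrated predictions
print(oe_ratio(y, 0.7 * p_true))  # roughly 1/0.7: deflated predictions, O:E > 1
```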
Where Pith is reading between the lines
- The same structure could be applied to other conditions with delayed or selective diagnosis, such as certain cancers or cardiovascular events.
- Validation protocols for electronic health record models may need to report counterfactual calibration metrics alongside standard ones.
- Integrating this correction with existing fairness adjustments for protected attributes could address multiple sources of label bias simultaneously.
Load-bearing premise
The true disease process can be represented as a sequence of hidden stages whose emissions match the observed tests, and the causal model accurately recovers what the diagnosis rate would have been if testing frequency had matched the reference group.
What would settle it
A controlled dataset containing both the true underlying disease stages and the actual test results under deliberately varied testing rates, in which the method fails to bring the observed-to-expected ratio of the resulting prediction model to one for the under-tested group.
Original abstract
In routine care, individuals identified a priori as high-risk are usually tested for conditions more frequently. Protected attributes, such as sex or ethnicity may also determine testing frequency. Such heterogeneous detection rates across a population induce label error. This causes systematic model error for specific groups and biases performance metrics during validation. This paper proposes a method to correct for such bias in prediction models due to differential diagnostic delay. We use a causal inference framework to define our target estimand: an individual's diagnosis probability in a counterfactual scenario where their diagnosis rate matches that of a reference group. We model the longitudinal process as a hidden Markov model, in which confirmatory test results are emissions from a latent progressive disease stage. We validate our approach in simulated data and apply it to a case study of chronic kidney disease prediction using electronic health records. In simulations, our method reduces prediction bias and improves calibration-in-the-large, correcting the Observed:Expected ratio in the underdiagnosed group from 1.34 (standard deviation: 0.09) in a model developed without any correction for underdiagnosis bias to 1.02 (0.09). Violations of assumptions in the simulation affected the estimation of model parameters, but the proposed approach nonetheless remained better calibrated than the standard model. In the clinical case study, we identify diabetes as the main driver of observability, with an odds ratio of 10.36 (95% confidence interval, 9.80 - 11.02) in 6-month urine albumin-creatinine ratio testing rate. Using our approach to predict the counterfactual diagnostic rate in patients without diabetes, we improved the Observed:Expected ratio of a developed clinical prediction model from 1.55 (1.51 - 1.59) to 1.01 (0.98 - 1.04).
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that heterogeneous diagnostic bias due to differential testing rates (e.g., by diabetes status) can be corrected in clinical prediction models by defining a counterfactual diagnosis probability under a reference-group testing rate and estimating it via a causal hidden Markov model in which confirmatory tests are emissions from a latent progressive disease stage. Simulations demonstrate correction of the Observed:Expected ratio in the underdiagnosed group from 1.34 (SD 0.09) to 1.02 (0.09), with the method remaining better calibrated than the uncorrected model even under some assumption violations; the CKD EHR case study identifies diabetes as the primary observability driver (OR 10.36) and improves the ratio from 1.55 (1.51-1.59) to 1.01 (0.98-1.04).
Significance. If the conditional independence and HMM representation assumptions hold, the approach addresses a practically important source of label error and miscalibration in EHR-based models, with clear simulation evidence of improved calibration-in-the-large and a real-world case study showing actionable gains. The use of simulations against known ground truth and the explicit causal estimand are strengths that support potential adoption in fairness-aware clinical modeling.
major comments (3)
- [§3 (Causal framework and HMM specification)] The target counterfactual diagnosis probability is identified only under the assumption that testing decisions are independent of the latent disease stage conditional on observed covariates (e.g., diabetes); this is load-bearing for the central claim, yet the manuscript provides no direct test or quantitative sensitivity analysis for residual dependence (such as unmeasured frailty), and the case study reports no check that the assumption holds after conditioning on diabetes and other variables.
- [Simulation results and §4 (validation)] While the reported O:E corrections are encouraging, the abstract and results give limited detail on how specific violations of the HMM transition or emission probabilities propagate through to the final prediction bias and counterfactual rates; an expanded table or figure quantifying this propagation (beyond the statement that the method 'remained better calibrated') would be needed to support the robustness conclusion.
- [Case study application] The counterfactual rates are obtained by plugging fitted HMM parameters (estimated from the same EHR data) into the causal formula, creating dependence on model-specific quantities rather than external benchmarks; the simulation validation supplies an independent ground-truth check, but the CKD analysis lacks an analogous diagnostic for whether the estimated counterfactuals are reliable when the true latent process is unknown.
minor comments (2)
- [Abstract] The abstract could briefly state the key identifying assumption and note that simulations examined some violations, to give readers an immediate sense of the scope of the claims.
- [Results] Table and figure captions in the results section would benefit from explicit mention of the reference group used for the counterfactual and the exact definition of the Observed:Expected ratio to improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback and positive assessment of the paper's significance. We address each major comment point by point below.
Point-by-point responses
- Referee [§3 (Causal framework and HMM specification)]: The target counterfactual diagnosis probability is identified only under the assumption that testing decisions are independent of the latent disease stage conditional on observed covariates (e.g., diabetes); this is load-bearing for the central claim, yet the manuscript provides no direct test or quantitative sensitivity analysis for residual dependence (such as unmeasured frailty), and the case study reports no check that the assumption holds after conditioning on diabetes and other variables.
  Authors: We agree that the conditional independence assumption (testing independent of latent stage given covariates) is central to identification. The manuscript justifies it from domain knowledge for CKD, where diabetes is the dominant observed driver, but we did not include a quantitative sensitivity analysis for residual unmeasured dependence. We will add a dedicated sensitivity-analysis subsection that introduces simulated unmeasured frailty and reports its effects on counterfactual estimates and calibration. Revision: yes.
- Referee [Simulation results and §4 (validation)]: While the reported O:E corrections are encouraging, the abstract and results give limited detail on how specific violations of the HMM transition or emission probabilities propagate through to the final prediction bias and counterfactual rates; an expanded table or figure quantifying this propagation (beyond the statement that the method 'remained better calibrated') would be needed to support the robustness conclusion.
  Authors: We appreciate the request for more granular robustness reporting. While our simulations included assumption violations, we agree that explicit propagation details would strengthen the claims. We will expand the simulation results with a new table quantifying bias in counterfactual rates, O:E ratios, and calibration metrics under controlled mild-to-severe violations of the transition probabilities (e.g., altered progression rates) and the emission probabilities (e.g., misspecified test sensitivity/specificity). Revision: yes.
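The propagation analysis the referee asks for can be illustrated in miniature. The sketch below uses an assumed three-stage transition matrix (not the paper's fitted one) and perturbs it by increasing amounts, tracking how far the implied 10-step probability of reaching the diagnosable stage drifts from the truth; this is the kind of quantity a robustness table would report.

```python
import numpy as np

# Illustrative propagation of transition-matrix misspecification (all numbers
# assumed): perturbing the stay-probabilities upward slows modelled
# progression, biasing the implied probability of the diagnosable stage.
P_true = np.array([[0.90, 0.08, 0.02],
                   [0.00, 0.85, 0.15],
                   [0.00, 0.00, 1.00]])

def p_stage2(P, steps=10):
    """P(latent stage = diagnosable after `steps` transitions from healthy)."""
    return np.linalg.matrix_power(P, steps)[0, 2]

truth = p_stage2(P_true)
biases = []
for eps in (0.00, 0.01, 0.02, 0.05):          # mild-to-severe misspecification
    P_mis = P_true + np.array([[eps, -eps, 0.0],   # rows still sum to 1
                               [0.0,  eps, -eps],
                               [0.0,  0.0,  0.0]])
    biases.append(p_stage2(P_mis) - truth)
    print(f"eps={eps:.2f}  bias in P(diagnosable stage) = {biases[-1]:+.4f}")
```

Here the bias in the target quantity grows smoothly with the size of the violation; the paper's claim is that even under such drift the corrected model stays better calibrated than the uncorrected one.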
- Referee [Case study application]: The counterfactual rates are obtained by plugging fitted HMM parameters (estimated from the same EHR data) into the causal formula, creating dependence on model-specific quantities rather than external benchmarks; the simulation validation supplies an independent ground-truth check, but the CKD analysis lacks an analogous diagnostic for whether the estimated counterfactuals are reliable when the true latent process is unknown.
  Authors: We acknowledge that, absent ground truth for the latent process in real EHR data, a direct analogue of the simulation validation is not feasible; the simulations provide the primary evidence of method performance. In revision we will add indirect diagnostics to the case study, including goodness-of-fit checks on observed testing patterns, comparison of estimated emission probabilities to external CKD literature, and an expanded discussion of this inherent limitation of observational applications. Revision: partial.
Circularity Check
No circularity: causal estimand estimated via fitted HMM with external simulation validation
Full rationale
The paper defines the target estimand as an individual's counterfactual diagnosis probability under a reference testing rate using a causal framework, then represents the process via an HMM whose parameters are estimated from observed data to compute that quantity. This is standard model-based causal estimation rather than a reduction by construction. Simulations supply independent ground-truth checks showing calibration improvement (O:E from 1.34 to 1.02), and the CKD application reports empirical improvement (1.55 to 1.01) without the result being tautological to the inputs. No quoted step equates a claimed prediction to a fitted parameter or self-citation chain; the derivation remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (3)
- HMM transition probabilities between latent disease stages
- Emission probabilities linking latent states to observed test results
- Odds ratio of diabetes on 6-month urine albumin-creatinine ratio testing rate
axioms (3)
- domain assumption Disease progression follows a first-order Markov process with unobserved latent stages
- domain assumption Confirmatory test results are emissions generated by the current latent disease state
- ad hoc to paper The causal model identifies the counterfactual diagnosis probability under a reference-group testing rate
invented entities (1)
- Latent progressive disease stage (no independent evidence)
Reference graph
Works this paper leans on
- [1] A logistic regression ‘naïve’ model, using covariates and the observability attributes to predict observed 5-year incidence d_{n,10}
- [2] A logistic regression ‘blind’ model, excluding ethnicity, sex, and Townsend score when predicting d_{n,10}
- [3] A logistic regression ‘imputed’ model, using covariates and the observability attributes to predict d_{n,10}^{cf}. Model performance was first evaluated on the training data. To correct for optimism, we applied a bootstrap pipeline. For each iteration:
- [4] A sample of equal size to the original dataset was drawn with replacement
- [5] The HMM was re-fitted in the bootstrapped sample. To improve computational efficiency, the parameters are initialised to those of the fit in the overall data
- [6] Counterfactual outcomes are re-imputed, and the three logistic regression models re-trained in the bootstrap sample
- [7] Performance metrics are calculated in the bootstrapped sample (optimistic) and the original data (realistic) and recorded. This process was repeated 100 times, and the average difference between bootstrap and original performance was used to correct the apparent performance measures obtained from the full dataset.
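The optimism-corrected bootstrap in steps [4]-[7] can be sketched as follows. To keep the pipeline itself in focus, the "model" here is a deliberately trivial stand-in (predict the training event rate) and the data are simulated; in the paper each iteration additionally re-fits the HMM and re-imputes counterfactual outcomes before re-training the regressions.

```python
import numpy as np

# Sketch of Harrell-style optimism correction: apparent performance on the
# full data, minus the average (bootstrap-sample minus original-data)
# performance gap across resamples. Data and "model" are stand-ins.
rng = np.random.default_rng(2)
y = (rng.random(2_000) < 0.2).astype(int)       # hypothetical outcomes

def fit(y_train):                               # stand-in for model fitting
    return y_train.mean()

def oe(y_eval, p):                              # Observed:Expected ratio
    return y_eval.sum() / (p * len(y_eval))

apparent = oe(y, fit(y))                        # performance on full data
optimism = []
for _ in range(100):                            # bootstrap iterations
    idx = rng.integers(0, len(y), len(y))       # resample with replacement
    p_boot = fit(y[idx])                        # re-fit in bootstrap sample
    # optimistic (in-sample) minus realistic (original-data) performance
    optimism.append(oe(y[idx], p_boot) - oe(y, p_boot))
corrected = apparent - np.mean(optimism)        # optimism-corrected O:E
```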