Correcting heterogeneous diagnostic bias when developing clinical prediction models using causal hidden Markov models
Pith reviewed 2026-05-08 03:42 UTC · model grok-4.3
The pith
A causal hidden Markov model estimates the diagnosis probability an individual would have if tested at the same rate as a reference group, removing the label error that arises from uneven diagnostic testing.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By defining the target estimand as the counterfactual diagnosis probability under a reference testing regime and embedding that estimand inside a hidden Markov model whose emissions are confirmatory test results, the method removes systematic prediction bias caused by heterogeneous diagnostic rates and restores calibration-in-the-large for under-tested subgroups.
What carries the argument
A hidden Markov model of longitudinal disease stages in which confirmatory test results are treated as emissions from a latent progressive state, combined with causal adjustment that targets the diagnosis probability under a uniform reference testing rate.
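The machinery described above can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: all parameter values (transition matrix, emission probabilities, testing rates) are assumptions chosen for the example. It simulates latent disease stages progressing as a Markov chain, treats confirmatory test results as emissions observed only when a test is ordered, and contrasts the diagnosis probability under an under-tested regime with the counterfactual probability under a reference testing rate.

```python
import numpy as np

# Minimal sketch (assumed numbers, not the paper's fitted parameters):
# latent stages 0=healthy, 1=early, 2=diagnosable progress via a first-order
# Markov chain; a confirmatory test result is an emission seen only when a
# test is actually ordered, at a rate that differs between groups.
rng = np.random.default_rng(0)

P = np.array([[0.90, 0.08, 0.02],    # transition matrix between latent stages
              [0.00, 0.85, 0.15],
              [0.00, 0.00, 1.00]])   # stage 2 is absorbing
emit_pos = np.array([0.02, 0.30, 0.95])  # P(test positive | latent stage)

def diagnosis_prob(test_rate, horizon=10, n=50_000):
    """Monte-Carlo P(diagnosed within horizon) under a given testing rate."""
    stage = np.zeros(n, dtype=int)
    diagnosed = np.zeros(n, dtype=bool)
    for _ in range(horizon):
        # vectorised categorical draw of the next latent stage
        u = rng.random(n)
        stage = (u[:, None] < P[stage].cumsum(axis=1)).argmax(axis=1)
        tested = rng.random(n) < test_rate          # who gets a test ordered
        positive = rng.random(n) < emit_pos[stage]  # emission if tested
        diagnosed |= tested & positive
    return diagnosed.mean()

# Observed regime (under-tested group) vs. counterfactual reference regime:
p_obs = diagnosis_prob(test_rate=0.05)
p_ref = diagnosis_prob(test_rate=0.40)  # hypothetical reference testing rate
```

The gap between `p_obs` and `p_ref` is exactly the label error the paper targets: training on diagnoses recorded under the low testing rate systematically understates risk for that group.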
If this is right
- Prediction models developed with the correction show observed-to-expected ratios near 1.0 for under-diagnosed subgroups in both simulation and electronic health record data.
- The method identifies observable drivers of testing frequency, such as diabetes strongly increasing the rate of urine albumin-creatinine ratio tests.
- Even when the hidden Markov assumptions are violated, the corrected model remains better calibrated than an uncorrected standard model.
- The approach directly targets calibration-in-the-large rather than relying on post-hoc recalibration of biased labels.
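Calibration-in-the-large, as scored throughout the paper, reduces to the Observed:Expected ratio: observed events divided by the sum of predicted probabilities. A short sketch with simulated (assumed) risks shows why labels that are biased downward push O:E above 1.

```python
import numpy as np

# The Observed:Expected (O:E) ratio for calibration-in-the-large.
# O:E > 1 means the model under-predicts risk overall, the signature of a
# model trained on under-diagnosed labels. Data below are simulated.
def oe_ratio(y_observed, p_predicted):
    return np.sum(y_observed) / np.sum(p_predicted)

rng = np.random.default_rng(1)
p_true = rng.uniform(0.05, 0.4, size=10_000)    # hypothetical true risks
y = (rng.random(10_000) < p_true).astype(int)   # realized outcomes

print(oe_ratio(y, p_true))        # close to 1.0: well-calibrated predictions
print(oe_ratio(y, 0.7 * p_true))  # roughly 1/0.7: deflated predictions, O:E > 1
```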
Where Pith is reading between the lines
- The same structure could be applied to other conditions with delayed or selective diagnosis, such as certain cancers or cardiovascular events.
- Validation protocols for electronic health record models may need to report counterfactual calibration metrics alongside standard ones.
- Integrating this correction with existing fairness adjustments for protected attributes could address multiple sources of label bias simultaneously.
Load-bearing premise
The true disease process can be represented as a sequence of hidden stages whose emissions match the observed tests, and the causal model accurately recovers what the diagnosis rate would have been if testing frequency had matched the reference group.
What would settle it
A controlled dataset containing both the true underlying disease stages and the actual test results under deliberately varied testing rates, in which the method fails to bring the observed-to-expected ratio of the resulting prediction model to one for the under-tested group.
Original abstract
In routine care, individuals identified a priori as high-risk are usually tested for conditions more frequently. Protected attributes, such as sex or ethnicity may also determine testing frequency. Such heterogeneous detection rates across a population induce label error. This causes systematic model error for specific groups and biases performance metrics during validation. This paper proposes a method to correct for such bias in prediction models due to differential diagnostic delay. We use a causal inference framework to define our target estimand: an individual's diagnosis probability in a counterfactual scenario where their diagnosis rate matches that of a reference group. We model the longitudinal process as a hidden Markov model, in which confirmatory test results are emissions from a latent progressive disease stage. We validate our approach in simulated data and apply it to a case study of chronic kidney disease prediction using electronic health records. In simulations, our method reduces prediction bias and improves calibration-in-the-large, correcting the Observed:Expected ratio in the underdiagnosed group from 1.34 (standard deviation: 0.09) in a model developed without any correction for underdiagnosis bias to 1.02 (0.09). Violations of assumptions in the simulation affected the estimation of model parameters, but the proposed approach nonetheless remained better calibrated than the standard model. In the clinical case study, we identify diabetes as the main driver of observability, with an odds ratio of 10.36 (95% confidence interval, 9.80 - 11.02) in 6-month urine albumin-creatinine ratio testing rate. Using our approach to predict the counterfactual diagnostic rate in patients without diabetes, we improved the Observed:Expected ratio of a developed clinical prediction model from 1.55 (1.51 - 1.59) to 1.01 (0.98 - 1.04).
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that heterogeneous diagnostic bias due to differential testing rates (e.g., by diabetes status) can be corrected in clinical prediction models by defining a counterfactual diagnosis probability under a reference-group testing rate and estimating it via a causal hidden Markov model in which confirmatory tests are emissions from a latent progressive disease stage. Simulations demonstrate correction of the Observed:Expected ratio in the underdiagnosed group from 1.34 (SD 0.09) to 1.02 (0.09), with the method remaining better calibrated than the uncorrected model even under some assumption violations; the CKD EHR case study identifies diabetes as the primary observability driver (OR 10.36) and improves the ratio from 1.55 (1.51-1.59) to 1.01 (0.98-1.04).
Significance. If the conditional independence and HMM representation assumptions hold, the approach addresses a practically important source of label error and miscalibration in EHR-based models, with clear simulation evidence of improved calibration-in-the-large and a real-world case study showing actionable gains. The use of simulations against known ground truth and the explicit causal estimand are strengths that support potential adoption in fairness-aware clinical modeling.
major comments (3)
- [§3 (Causal framework and HMM specification)] The target counterfactual diagnosis probability is identified only under the assumption that testing decisions are independent of the latent disease stage conditional on observed covariates (e.g., diabetes); this is load-bearing for the central claim, yet the manuscript provides no direct test or quantitative sensitivity analysis for residual dependence (such as unmeasured frailty), and the case study reports no check that the assumption holds after conditioning on diabetes and other variables.
- [Simulation results and §4 (validation)] While the reported O:E corrections are encouraging, the abstract and results give limited detail on how specific violations of the HMM transition or emission probabilities propagate through to the final prediction bias and counterfactual rates; an expanded table or figure quantifying this propagation (beyond the statement that the method 'remained better calibrated') would be needed to support the robustness conclusion.
- [Case study application] The counterfactual rates are obtained by plugging fitted HMM parameters (estimated from the same EHR data) into the causal formula, creating dependence on model-specific quantities rather than external benchmarks; the simulation validation supplies an independent ground-truth check, but the CKD analysis lacks an analogous diagnostic for whether the estimated counterfactuals are reliable when the true latent process is unknown.
minor comments (2)
- [Abstract] The abstract could briefly state the key identifying assumption and note that simulations examined some violations, to give readers an immediate sense of the scope of the claims.
- [Results] Table and figure captions in the results section would benefit from explicit mention of the reference group used for the counterfactual and the exact definition of the Observed:Expected ratio to improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback and positive assessment of the paper's significance. We address each major comment point by point below.
Point-by-point responses
- Referee [§3 (Causal framework and HMM specification)]: The target counterfactual diagnosis probability is identified only under the assumption that testing decisions are independent of the latent disease stage conditional on observed covariates (e.g., diabetes); this is load-bearing for the central claim, yet the manuscript provides no direct test or quantitative sensitivity analysis for residual dependence (such as unmeasured frailty), and the case study reports no check that the assumption holds after conditioning on diabetes and other variables.
  Authors: We agree that the conditional independence assumption (testing independent of latent stage given covariates) is central to identification. The manuscript justifies it from domain knowledge for CKD, where diabetes is the dominant observed driver, but we did not include a quantitative sensitivity analysis for residual unmeasured dependence. We will add a dedicated sensitivity-analysis subsection that introduces simulated unmeasured frailty and reports its effects on counterfactual estimates and calibration. Revision: yes.
- Referee [Simulation results and §4 (validation)]: While the reported O:E corrections are encouraging, the abstract and results give limited detail on how specific violations of the HMM transition or emission probabilities propagate through to the final prediction bias and counterfactual rates; an expanded table or figure quantifying this propagation (beyond the statement that the method 'remained better calibrated') would be needed to support the robustness conclusion.
  Authors: We appreciate the request for more granular robustness reporting. While our simulations included assumption violations, we agree that explicit propagation details would strengthen the claims. We will expand the simulation results with a new table quantifying bias in counterfactual rates, O:E ratios, and calibration metrics under controlled mild-to-severe violations of the transition probabilities (e.g., altered progression rates) and the emission probabilities (e.g., misspecified test sensitivity/specificity). Revision: yes.
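The propagation analysis the referee asks for can be illustrated in miniature. The sketch below uses an assumed three-stage transition matrix (not the paper's fitted one) and perturbs it by increasing amounts, tracking how far the implied 10-step probability of reaching the diagnosable stage drifts from the truth; this is the kind of quantity a robustness table would report.

```python
import numpy as np

# Illustrative propagation of transition-matrix misspecification (all numbers
# assumed): perturbing the stay-probabilities upward slows modelled
# progression, biasing the implied probability of the diagnosable stage.
P_true = np.array([[0.90, 0.08, 0.02],
                   [0.00, 0.85, 0.15],
                   [0.00, 0.00, 1.00]])

def p_stage2(P, steps=10):
    """P(latent stage = diagnosable after `steps` transitions from healthy)."""
    return np.linalg.matrix_power(P, steps)[0, 2]

truth = p_stage2(P_true)
biases = []
for eps in (0.00, 0.01, 0.02, 0.05):          # mild-to-severe misspecification
    P_mis = P_true + np.array([[eps, -eps, 0.0],   # rows still sum to 1
                               [0.0,  eps, -eps],
                               [0.0,  0.0,  0.0]])
    biases.append(p_stage2(P_mis) - truth)
    print(f"eps={eps:.2f}  bias in P(diagnosable stage) = {biases[-1]:+.4f}")
```

Here the bias in the target quantity grows smoothly with the size of the violation; the paper's claim is that even under such drift the corrected model stays better calibrated than the uncorrected one.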
- Referee [Case study application]: The counterfactual rates are obtained by plugging fitted HMM parameters (estimated from the same EHR data) into the causal formula, creating dependence on model-specific quantities rather than external benchmarks; the simulation validation supplies an independent ground-truth check, but the CKD analysis lacks an analogous diagnostic for whether the estimated counterfactuals are reliable when the true latent process is unknown.
  Authors: We acknowledge that, absent ground truth for the latent process in real EHR data, a direct analogue of the simulation validation is not feasible; the simulations provide the primary evidence of method performance. In revision we will add indirect diagnostics to the case study, including goodness-of-fit checks on observed testing patterns, comparison of estimated emission probabilities to external CKD literature, and an expanded discussion of this inherent limitation of observational applications. Revision: partial.
Circularity Check
No circularity: causal estimand estimated via fitted HMM with external simulation validation
Full rationale
The paper defines the target estimand as an individual's counterfactual diagnosis probability under a reference testing rate using a causal framework, then represents the process via an HMM whose parameters are estimated from observed data to compute that quantity. This is standard model-based causal estimation rather than a reduction by construction. Simulations supply independent ground-truth checks showing calibration improvement (O:E from 1.34 to 1.02), and the CKD application reports empirical improvement (1.55 to 1.01) without the result being tautological to the inputs. No quoted step equates a claimed prediction to a fitted parameter or self-citation chain; the derivation remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (3)
- HMM transition probabilities between latent disease stages
- Emission probabilities linking latent states to observed test results
- Odds ratio of diabetes on 6-month urine albumin-creatinine ratio testing rate
axioms (3)
- domain assumption Disease progression follows a first-order Markov process with unobserved latent stages
- domain assumption Confirmatory test results are emissions generated by the current latent disease state
- ad hoc to paper The causal model identifies the counterfactual diagnosis probability under a reference-group testing rate
invented entities (1)
- Latent progressive disease stage (no independent evidence)
Reference graph
Works this paper leans on
- [1] A logistic regression ‘naïve’ model, using covariates and the observability attributes to predict observed 5-year incidence d_{n,10}
- [2] A logistic regression ‘blind’ model, excluding ethnicity, sex, and Townsend score when predicting d_{n,10}
- [3] A logistic regression ‘imputed’ model, using covariates and the observability attributes to predict d_{n,10}^{cf}. Model performance was first evaluated on the training data. To correct for optimism, we applied a bootstrap pipeline. For each iteration:
- [4] A sample of equal size to the original dataset was drawn with replacement
- [5] The HMM was re-fitted in the bootstrapped sample. To improve computational efficiency, the parameters are initialised to those of the fit in the overall data
- [6] Counterfactual outcomes are re-imputed, and the three logistic regression models re-trained in the bootstrap sample
- [7] Performance metrics are calculated in the bootstrapped sample (optimistic) and the original data (realistic) and recorded. This process was repeated 100 times, and the average difference between bootstrap and original performance was used to correct the apparent performance measures obtained from the full dataset.
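The optimism-corrected bootstrap in steps [4]-[7] can be sketched as follows. To keep the pipeline itself in focus, the "model" here is a deliberately trivial stand-in (predict the training event rate) and the data are simulated; in the paper each iteration additionally re-fits the HMM and re-imputes counterfactual outcomes before re-training the regressions.

```python
import numpy as np

# Sketch of Harrell-style optimism correction: apparent performance on the
# full data, minus the average (bootstrap-sample minus original-data)
# performance gap across resamples. Data and "model" are stand-ins.
rng = np.random.default_rng(2)
y = (rng.random(2_000) < 0.2).astype(int)       # hypothetical outcomes

def fit(y_train):                               # stand-in for model fitting
    return y_train.mean()

def oe(y_eval, p):                              # Observed:Expected ratio
    return y_eval.sum() / (p * len(y_eval))

apparent = oe(y, fit(y))                        # performance on full data
optimism = []
for _ in range(100):                            # bootstrap iterations
    idx = rng.integers(0, len(y), len(y))       # resample with replacement
    p_boot = fit(y[idx])                        # re-fit in bootstrap sample
    # optimistic (in-sample) minus realistic (original-data) performance
    optimism.append(oe(y[idx], p_boot) - oe(y, p_boot))
corrected = apparent - np.mean(optimism)        # optimism-corrected O:E
```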