Semi-supervised Method for Risk Prediction with Doubly Censored EHR Data
Pith reviewed 2026-05-11 02:18 UTC · model grok-4.3
The pith
The semi-supervised estimator for risk prediction under double censoring is consistent and more efficient than supervised methods that use only labeled data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We develop a novel SSL framework for risk prediction that combines a small set of gold-standard labels with large-scale surrogate information under double censoring. We establish the theoretical validity of the proposed estimator. Through extensive simulation studies, we show that our method substantially improves estimation efficiency relative to existing supervised estimators based on the labeled data alone. We demonstrate its practical value by applying it to study risk factors for type 2 diabetes using EHR data.
What carries the argument
The semi-supervised estimator that fuses gold-standard event times for a small subset with surrogate event times for the full cohort while correctly handling both left and right censoring.
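For readers new to the term, double censoring can be written in the standard way below. The symbols $T$, $L$, $R$, $X$, and $\delta$ are notation assumed here for illustration; they are not taken from the paper, whose own definitions (including its use of $L_i$ and $R_i$) may differ.

```latex
% Standard doubly censored observation scheme (notation assumed, not the paper's).
% T: true event time; L <= R: left and right censoring times.
X = \max\{L, \min(T, R)\}, \qquad
\delta =
\begin{cases}
1, & L \le T \le R \quad \text{(event time observed exactly)},\\
2, & T > R \quad \text{(right-censored: only } T > R \text{ is known)},\\
3, & T < L \quad \text{(left-censored: only } T < L \text{ is known)}.
\end{cases}
```

In an EHR setting, left censoring would correspond to an event that occurred before the patient entered the recording health system, and right censoring to an event that had not yet occurred when follow-up ended.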
If this is right
- The estimator remains asymptotically valid when only a small fraction of records have gold-standard labels.
- Efficiency gains grow with the volume of unlabeled surrogate data while bias stays controlled.
- The framework applies directly to other time-to-event risk prediction tasks in EHR settings that exhibit double censoring.
- In the type 2 diabetes application, the method produces more stable estimates of risk factor associations than label-only methods.
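The efficiency bullet above can be illustrated on a much simpler problem than the paper's: estimating a mean when the outcome is labeled for only the first `n` of `N` subjects but a surrogate is available for everyone. The sketch below is a control-variate-style toy, not the paper's estimator, and it involves no censoring. All function names, distributions, and parameter values are assumptions chosen for illustration.

```python
import random


def mean(xs):
    return sum(xs) / len(xs)


def one_replicate(rng, n, N, informative):
    """Return (supervised, semi_supervised) estimates of E[Y], which is 0 here."""
    y = [rng.gauss(0.0, 1.0) for _ in range(N)]
    if informative:
        s = [yi + rng.gauss(0.0, 0.5) for yi in y]   # surrogate tracks Y
    else:
        s = [rng.gauss(0.0, 1.0) for _ in range(N)]  # surrogate unrelated to Y
    y_lab, s_lab = y[:n], s[:n]  # only the first n outcomes count as labeled
    y_bar, s_bar = mean(y_lab), mean(s_lab)
    # Slope of Y on S, estimated from the labeled pairs only.
    cov = sum((a - y_bar) * (b - s_bar) for a, b in zip(y_lab, s_lab))
    var = sum((b - s_bar) ** 2 for b in s_lab)
    slope = cov / var
    # Shift the labeled mean by how far the labeled surrogate mean sits
    # from the full-cohort surrogate mean.
    semi = y_bar - slope * (s_bar - mean(s))
    return y_bar, semi


def mse_reduction(n=50, N=1000, reps=200, informative=True, seed=0):
    """Monte Carlo estimate of 1 - MSE(semi-supervised) / MSE(supervised)."""
    rng = random.Random(seed)
    sup_sq, semi_sq = 0.0, 0.0
    for _ in range(reps):
        sup, semi = one_replicate(rng, n, N, informative)
        sup_sq += sup ** 2    # the true mean is 0, so squared error = estimate^2
        semi_sq += semi ** 2
    return 1.0 - semi_sq / sup_sq


if __name__ == "__main__":
    print("informative surrogate:", round(mse_reduction(informative=True), 2))
    print("uninformative surrogate:", round(mse_reduction(informative=False), 2))
```

In this toy, an informative surrogate should give a clearly positive MSE reduction, while an uninformative one should give a reduction near zero. Whether the same pattern carries over to the doubly censored setting is what the paper's theory and simulations are meant to establish.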
Where Pith is reading between the lines
- Health systems could reduce the number of manual chart reviews needed for accurate risk modeling if surrogates are routinely available.
- The same integration strategy may transfer to other data sources that pair expensive verified outcomes with cheap proxies, such as insurance claims.
- Further gains are possible by replacing the current base model with flexible machine-learning predictors while retaining the double-censoring correction.
Load-bearing premise
The surrogate outcomes can be combined with the gold-standard labels under double censoring without introducing bias that the model does not account for.
What would settle it
A simulation in which surrogate outcomes are generated independently of the true event times, after which the semi-supervised estimator exhibits bias or higher variance than the supervised estimator that ignores the surrogates.
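A minimal sketch of the data-generating side of such a check follows. It simulates doubly censored event times together with a surrogate time, and a switch controls whether the surrogate carries any information about the true time. Every distribution, constant, and name is a hypothetical choice made for illustration, and the surrogate is left uncensored for brevity. The sketch does not implement the paper's estimator; it only produces data on which the two estimators could be compared.

```python
import math
import random


def simulate_cohort(n, seed=0, informative_surrogate=True):
    """Simulate doubly censored event times plus a surrogate time.

    All distributions and constants are illustrative assumptions.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)                    # a single covariate
        t = rng.expovariate(math.exp(0.5 * z))     # true event time
        left = rng.uniform(0.0, 0.3)               # start of observation window
        right = left + rng.uniform(0.5, 3.0)       # end of observation window
        if t < left:
            x, status = left, "left"               # event predates the window
        elif t > right:
            x, status = right, "right"             # event falls after the window
        else:
            x, status = t, "exact"                 # event observed exactly
        if informative_surrogate:
            s = t * math.exp(rng.gauss(0.0, 0.3))  # noisy copy of the true time
        else:
            s = rng.expovariate(1.0)               # drawn independently of t
        rows.append({"z": z, "t": t, "x": x, "status": status, "s": s,
                     "left": left, "right": right})
    return rows


def log_correlation(rows):
    """Pearson correlation between log true time and log surrogate time."""
    a = [math.log(r["t"]) for r in rows]
    b = [math.log(r["s"]) for r in rows]
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / math.sqrt(va * vb)
```

Calling `log_correlation(simulate_cohort(5000, informative_surrogate=False))` should return a value near zero, confirming that the surrogate is uninformative before any estimator is run on the simulated cohort.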
Original abstract
The rapid expansion of large-scale electronic health record (EHR) data offers unique opportunities to improve the accuracy and efficiency of clinical risk estimation. Yet, because clinical events may occur outside the recording health system, clinical event outcomes are frequently subject to double censoring (both left and right). Besides, gold-standard event times can often only be ascertained through labor-intensive manual chart reviews, yielding labels for only a small subset of patients. Reliance on this limited labeled set alone is limited in efficiency, whereas widely available surrogate outcomes such as the time to first diagnostic code or first disease mention are error-prone and can yield biased estimates if used directly. Semi-supervised learning (SSL) methods provide a principled way to integrate labeled and unlabeled data, and prior work has demonstrated their advantages in settings with binary or right-censored outcomes. However, existing approaches do not accommodate double censoring for risk prediction, which poses additional methodological challenges. To address this gap, we develop a novel SSL framework for risk prediction that combines a small set of gold-standard labels with large-scale surrogate information under double censoring. We establish the theoretical validity of the proposed estimator. Through extensive simulation studies, we show that our method substantially improves estimation efficiency relative to existing supervised estimators (based on the labeled data). Finally, we demonstrate its practical value by applying it to study risk factors for type 2 diabetes (T2D) using EHR data from a health system in the US.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper develops a semi-supervised learning (SSL) framework for risk prediction under double censoring in EHR data. It integrates a small set of gold-standard labels (from chart review) with large-scale surrogate outcomes while explicitly handling left and right censoring. The authors claim to establish theoretical validity of the proposed estimator, demonstrate substantial efficiency gains over supervised estimators in extensive simulations, and apply the method to identify risk factors for type 2 diabetes using real EHR data from a US health system.
Significance. If the theoretical validity and efficiency claims hold under the stated assumptions, this addresses a practical gap in semi-supervised survival analysis for doubly censored data, which is common in EHR settings. The approach could improve statistical efficiency for clinical risk models without requiring extensive additional labeling, building on prior SSL work for binary or right-censored outcomes. Strengths include the focus on a real-world data challenge and simulation-based evidence of gains relative to labeled-data-only estimators.
major comments (2)
- [§3.2] §3.2, the estimating equation for the semi-supervised estimator: the derivation of asymptotic normality appears to rely on the surrogate outcomes being conditionally independent of the censoring mechanism given the covariates; this assumption is not explicitly verified in the simulation design (Table 1) and could be violated in EHR settings where diagnostic codes correlate with visit frequency.
- [§4.1] §4.1, simulation results for n=500 labeled and N=5000 unlabeled: the reported efficiency gain (e.g., 40-60% reduction in MSE for the coefficient of the primary risk factor) is shown only under correct specification of the surrogate model; no results are provided for misspecified surrogates, which undermines the claim of robustness for real EHR applications.
minor comments (3)
- [§2] The notation for the double-censoring indicators (L_i, R_i) is introduced in §2 but used inconsistently with the likelihood contribution in Eq. (3); clarify the definition of the observed data for unlabeled subjects.
- [Figure 2] Figure 2 (simulation boxplots) lacks axis labels for the y-scale in the efficiency ratio panel and does not indicate the number of Monte Carlo replicates (stated as 500 in text but not in caption).
- [§5] The application section (§5) reports hazard ratios but does not provide confidence intervals or p-values adjusted for the semi-supervised variance estimator; add these for interpretability.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help clarify the scope and limitations of our work. We address each major comment below.
- Referee: [§3.2] §3.2, the estimating equation for the semi-supervised estimator: the derivation of asymptotic normality appears to rely on the surrogate outcomes being conditionally independent of the censoring mechanism given the covariates; this assumption is not explicitly verified in the simulation design (Table 1) and could be violated in EHR settings where diagnostic codes correlate with visit frequency.
  Authors: The asymptotic normality derivation in §3.2 is obtained under Assumption 3, which requires conditional independence of the surrogate outcomes and the censoring mechanism given the covariates. This assumption is satisfied by construction in the data-generating process of our simulations (Section 4.1). We agree that an explicit statement and verification would improve clarity, and that potential violations are plausible in EHR data due to correlations between diagnostic codes and visit patterns. In the revision we will add an explicit verification paragraph in §4.1 confirming that the assumption holds in the reported simulations and include a small sensitivity study that introduces mild dependence between surrogates and censoring to illustrate robustness.
  revision: partial
- Referee: [§4.1] §4.1, simulation results for n=500 labeled and N=5000 unlabeled: the reported efficiency gain (e.g., 40-60% reduction in MSE for the coefficient of the primary risk factor) is shown only under correct specification of the surrogate model; no results are provided for misspecified surrogates, which undermines the claim of robustness for real EHR applications.
  Authors: The simulations in §4.1 were designed under correct surrogate-model specification to isolate the efficiency gains attainable when the working model is well-specified. We acknowledge that results under misspecification would better support claims of practical utility in EHR settings. We will therefore add a new set of simulation scenarios in the revised §4.1 that deliberately misspecify the surrogate model (e.g., omitted covariates or incorrect link function) and report the resulting efficiency gains relative to the supervised estimator.
  revision: yes
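The sensitivity study promised in the first response could introduce dependence along the lines sketched here. An unmeasured visit intensity drives both the surrogate's timing error and the length of follow-up, and a hypothetical knob `rho` sets the strength of the link. Covariates are omitted, so the marginal correlation stands in for the conditional dependence that Assumption 3 rules out. The mechanism, the constants, and the function name are all assumptions for illustration and are not taken from the paper.

```python
import math
import random


def surrogate_censoring_correlation(n=5000, rho=0.0, seed=0):
    """Correlation between surrogate timing error and log follow-up length.

    rho is a hypothetical knob: 0 gives independence, larger values let an
    unmeasured visit intensity drive both quantities.
    """
    rng = random.Random(seed)
    errors, follow_up = [], []
    for _ in range(n):
        visits = rng.gauss(0.0, 1.0)  # latent visit intensity, not a covariate
        # Log ratio of surrogate time to true time: with rho > 0,
        # frequent visitors receive their first code earlier.
        errors.append(rng.gauss(0.0, 0.3) - rho * visits)
        # Log right-censoring time: frequent visitors are followed longer.
        follow_up.append(0.5 + 0.5 * visits + rng.gauss(0.0, 0.3))
    me, mf = sum(errors) / n, sum(follow_up) / n
    cov = sum((e - me) * (f - mf) for e, f in zip(errors, follow_up))
    ve = sum((e - me) ** 2 for e in errors)
    vf = sum((f - mf) ** 2 for f in follow_up)
    return cov / math.sqrt(ve * vf)


if __name__ == "__main__":
    for rho in (0.0, 0.25, 0.5):
        print(rho, round(surrogate_censoring_correlation(rho=rho), 3))
```

Sweeping `rho` upward from zero and re-running both estimators at each value would show how quickly any bias appears as the independence assumption is relaxed.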
Circularity Check
No significant circularity; derivation self-contained from first principles
The paper proposes a novel SSL estimator for risk prediction under double censoring by integrating a small labeled set with surrogate outcomes. The abstract states the framework is developed to address the gap in existing methods, with theoretical validity established and efficiency gains shown via simulations. No load-bearing steps reduce by construction to fitted parameters, self-citations, or renamed inputs; the central claim rests on independent modeling assumptions and external validation rather than tautological definitions or prior self-referential results.