Proper Scoring Rules for Right-Censored Survival Data

Glenn Van Wallendael; Jef Jonkers; Luc Duchateau; Sofie Van Hoecke

arxiv: 2606.06393 · v1 · pith:I7C6DC6Xnew · submitted 2026-06-04 · 💻 cs.LG

Pith reviewed 2026-06-28 02:40 UTC · model grok-4.3

classification 💻 cs.LG

keywords proper scoring rules · right-censored data · survival analysis · conditional independent censoring · CRPS · Brier score · IPCW · engression

The pith

Mapping the predictive distribution through the censoring mechanism produces proper scoring rules for right-censored survival data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a general construction for proper scoring rules when survival times are only partially observed due to right censoring. It first transforms the forecast distribution according to the censoring process to induce a distribution over the observed data, then applies the original proper score to that induced law. This yields localized scores for fixed censoring times and marginalized scores for random or partially observed censoring. The marginalized score is proper under conditional independent censoring and strictly proper where the distribution remains identifiable. The same construction recovers the right-censored likelihood and IPCW criteria while defining censored versions of the CRPS, pinball loss, Brier score, and energy score, and supports a sample-based objective called censored engression.

Core claim

The central claim is that proper scoring rules for right-censored survival outcomes are obtained by composing the predictive distribution with the censoring mechanism to obtain the induced observed-data law and then applying the base proper score to that law. The resulting marginalized score is proper under conditional independent censoring and strictly proper on the identifiable region. This recovers the right-censored likelihood and IPCW-type criteria within one framework and extends to right-censored CRPS, pinball loss, Brier score, and energy score. It also produces censored engression as a sample-based learning objective for multivariate right-censored survival modeling.
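The mapping step can be illustrated with a minimal sample-based sketch for the simplest case, a fixed and known censoring time. It is not the paper's implementation: the function names, the use of the energy form of the CRPS, and the choice to score only the observed time min(T, c) are assumptions made for illustration. Draws from the event-time forecast are pushed through the censoring map, and the ordinary CRPS is then evaluated on the induced sample.

```python
import numpy as np

def censor_samples(t_samples, c):
    """Push draws from the event-time forecast through a fixed censoring time c.

    Assumption (illustrative): c is known, so the induced observed time is min(T, c).
    """
    return np.minimum(np.asarray(t_samples, dtype=float), c)

def crps_from_samples(samples, y):
    """Energy-form CRPS of the empirical law of `samples` at observation y.

    CRPS = E|Y' - y| - 0.5 * E|Y' - Y''|, lower is better.
    """
    s = np.asarray(samples, dtype=float)
    term_obs = np.mean(np.abs(s - y))
    term_spread = np.mean(np.abs(s[:, None] - s[None, :]))
    return term_obs - 0.5 * term_spread

def localized_censored_crps(t_samples, y_obs, c):
    """Hypothetical helper: score the induced law of min(T, c) at the observed time."""
    return crps_from_samples(censor_samples(t_samples, c), y_obs)

# A censored observation at c = 2.5: forecast mass beyond c is collapsed onto c,
# so how far the forecast extends past c does not change the score.
print(localized_censored_crps([1.0, 2.0, 3.0, 4.0], y_obs=2.5, c=2.5))
print(localized_censored_crps([1.0, 2.0, 100.0, 7.0], y_obs=2.5, c=2.5))
```

Both calls return the same value, because the forecast is only compared with the data on the region that censoring leaves observable.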

What carries the argument

The mapping of the predictive distribution through the censoring mechanism to induce an observed-data law, on which the base proper score is then evaluated.

If this is right

The marginalized score is proper under conditional independent censoring.
It is strictly proper on the identifiable region.
The construction recovers right-censored likelihood and IPCW criteria.
Right-censored versions of CRPS, Brier score, pinball loss, and energy score are obtained.
Censored engression improves training over naive use of censored outcomes and the scores correctly rank the oracle forecast across regimes.
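The oracle-ranking consequence can be checked on a toy problem. The sketch below is an assumption-laden illustration rather than a reproduction of the paper's experiments: event times are exponential with rate 1, censoring times are uniform on [0.5, 2], independent of the event time and assumed known for every subject, and the score is the CRPS of the forecast censored at c, written in closed form for an exponential forecast. All names are hypothetical.

```python
import numpy as np

def censored_crps_exponential(rate, y, c):
    """CRPS of an Exp(rate) forecast censored at c, evaluated at y = min(t, c).

    Closed form of  int_0^y F(s)^2 ds + int_y^c (1 - F(s))^2 ds
    with F(s) = 1 - exp(-rate * s).
    """
    y = np.minimum(y, c)
    below = (y
             - 2.0 * (1.0 - np.exp(-rate * y)) / rate
             + (1.0 - np.exp(-2.0 * rate * y)) / (2.0 * rate))
    above = (np.exp(-2.0 * rate * y) - np.exp(-2.0 * rate * c)) / (2.0 * rate)
    return below + above

def mean_scores(rates, n=20000, seed=0):
    """Average censored CRPS of several candidate forecasts on simulated data."""
    rng = np.random.default_rng(seed)
    t = rng.exponential(scale=1.0, size=n)   # true event times, rate 1
    c = rng.uniform(0.5, 2.0, size=n)        # censoring independent of t (assumption)
    y = np.minimum(t, c)                     # observed time
    return {r: float(np.mean(censored_crps_exponential(r, y, c))) for r in rates}

scores = mean_scores([0.3, 1.0, 3.0])
for rate, value in scores.items():
    print(f"forecast rate {rate}: mean censored CRPS {value:.4f}")
```

Under these assumptions the forecast with the true rate should obtain the lowest average score. Making the censoring time depend on the event time in this simulation is one way to probe the load-bearing premise.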

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same mapping principle could be tested on interval-censored or other partially observed data types.
In medical survival modeling the scores would allow direct comparison of probabilistic forecasts without ranking reversals from plug-in weights.
The identifiable-region strict propriety implies that evaluation can focus on observable parts of the distribution without losing theoretical guarantees.

Load-bearing premise

Censoring time is independent of event time given the covariates.

What would settle it

A simulation in which censoring depends on the event time and the marginalized score assigns a higher value to a misspecified forecast than to the true distribution.

Original abstract

Proper scoring rules provide a rigorous theoretical basis for the training and evaluation of probabilistic forecasts. However, in the presence of right censoring, the event time is only partially observed, rendering conventional scoring rules inapplicable in their standard form. We propose a framework for proper scoring of right-censored survival outcomes based on a simple idea: first, map the predictive distribution through the censoring mechanism, then apply the underlying proper score on the induced observed-data law. This yields localized scores for fixed censoring times and marginalized scores when the censoring time is random or only partially observed. The resulting construction recovers familiar right-censored likelihood and IPCW-type criteria within a coherent framework, while also yielding right-censored versions of the CRPS, pinball loss, Brier score, and energy score. We show that the marginalized score is proper under conditional independent censoring and strictly proper on the identifiable region. The same principle also leads to censored engression, a sample-based learning objective for multivariate right-censored survival modeling. In experiments, our scores correctly rank the oracle forecast across several censoring regimes, whereas forecast-dependent plug-in weighted scores can exhibit ranking reversals. Censored engression likewise substantially improves over naive training on censored outcomes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper's core move, pushing the predictive law through the censoring mechanism before scoring, looks like a direct and useful way to get proper scores for right-censored data.

The main new piece is the mapping construction itself. You take a predictive distribution over event times, apply the censoring mechanism to induce a law on the observed (min(T,C), delta) pair, then score that induced distribution with any proper score. This produces both fixed-censoring localized scores and marginalized scores when censoring is random. It recovers the usual right-censored likelihood and IPCW criteria as instances, and supplies right-censored versions of CRPS, pinball, Brier, and energy score. The paper states that the marginalized score is proper under conditional independent censoring and strictly proper on the identifiable region.
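As a worked illustration of how the likelihood is recovered, consider the logarithmic score in the simplest setting. The notation below is an assumption for this sketch and is not taken from the paper's text: a fixed censoring time c, a forecast CDF F with density f, and a continuous event time T. The induced law of (Y, Δ) has density f on the uncensored region and a point mass 1 − F(c) at the censoring time, so its negative log density is the familiar right-censored negative log-likelihood.

```latex
% Assumptions (illustrative): fixed censoring time c, forecast CDF F with density f,
% continuous event time T.
\begin{align*}
Y &= \min(T, c), \qquad \Delta = \mathbf{1}\{T \le c\}, \\
p_{F,c}(y, \delta) &= f(y)^{\delta}\,\bigl(1 - F(c)\bigr)^{1-\delta},
  \qquad y < c \text{ if } \delta = 1,\quad y = c \text{ if } \delta = 0, \\
-\log p_{F,c}(Y, \Delta) &= -\Delta \log f(Y) - (1 - \Delta)\,\log\bigl(1 - F(Y)\bigr).
\end{align*}
```

The last line uses Y = c whenever Δ = 0, which is why the censored term can be written with F(Y).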

What works is the unification and the downstream claim about ranking. The experiments are said to show that these scores rank the oracle correctly across censoring regimes while some plug-in weighted scores reverse rankings. Censored engression is presented as a sample-based training objective that improves over naive handling of censored outcomes. The construction stays within standard assumptions and does not appear to add circularity or self-referential fitting.

The soft spot is evidentiary. The abstract asserts propriety and correct oracle ranking but gives no derivation steps or quantitative results, so the practical size of the improvement and the tightness of the strict-propriety claim on the identifiable region are not visible here. If the full proofs are clean and the experiments include reasonable baselines and multiple censoring rates, that gap closes; otherwise the empirical support stays thin.

This is for people who train or evaluate probabilistic survival models and want scoring rules that respect censoring without ad-hoc weighting. A reader already working on proper scores or censored likelihoods will see the value in the recovered criteria and the new variants. It is coherent enough on its own terms to merit referee time rather than a desk reject.

Referee Report

0 major / 3 minor

Summary. The paper proposes a framework for proper scoring rules on right-censored survival data: map the predictive distribution through the censoring mechanism to obtain the induced observed-data law, then apply a standard proper score to that law. This produces localized scores (fixed censoring time) and marginalized scores (random or partially observed censoring). The construction recovers right-censored likelihood and IPCW criteria, extends to right-censored versions of CRPS, pinball loss, Brier score and energy score, and yields a sample-based objective called censored engression. The manuscript asserts that the marginalized score is proper under conditional independent censoring and strictly proper on the identifiable region; experiments are said to show that the new scores correctly rank the oracle while forecast-dependent plug-in weighted scores can reverse rankings.

Significance. If the propriety result holds, the work supplies a coherent, assumption-explicit unification of scoring rules for censored data that recovers familiar methods while generating new ones. The explicit conditioning on conditional independent censoring (a standard assumption) and the experimental check on oracle ranking are strengths; the latter directly addresses a practical failure mode of existing plug-in scores.

minor comments (3)

[Abstract] The claim that 'experiments show correct oracle ranking' is stated without any numerical results, tables, or description of the censoring regimes and metrics used; adding a short quantitative summary or reference to a results table would make the empirical support verifiable from the abstract.
[Abstract] The manuscript states that the marginalized score is proper and strictly proper on the identifiable region, but the abstract supplies no theorem number, section reference, or derivation outline; readers must locate the proof without guidance.
[Abstract] The description of censored engression is introduced as 'a sample-based learning objective' but the abstract gives no explicit loss expression or algorithmic detail; a one-line definition or equation reference would clarify its relation to the scoring framework.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their accurate summary of the manuscript, for highlighting its strengths, and for recommending minor revision. No specific major comments were provided in the report.

Circularity Check

0 steps flagged

No significant circularity identified

The derivation applies the standard definition of proper scoring rules to an induced observed-data distribution obtained by mapping the predictive law through the censoring mechanism. Propriety of the marginalized score is shown under the explicit, standard assumption of conditional independent censoring rather than being asserted unconditionally or derived from fitted parameters. The construction recovers known right-censored likelihood and IPCW criteria as special cases but does not reduce any central claim to a self-definition, fitted-input prediction, or self-citation chain. No load-bearing step equates a result to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central propriety claim rests on the domain assumption of conditional independent censoring; no free parameters or invented entities are introduced in the abstract.

axioms (1)

domain assumption: Censoring is conditionally independent of the event time given covariates.
Explicitly required for the marginalized score to be proper.

pith-pipeline@v0.9.1-grok · 5753 in / 1265 out tokens · 13885 ms · 2026-06-28T02:40:25.813298+00:00 · methodology
