pith. sign in

arxiv: 2606.01239 · v1 · pith:BKTBXRQAnew · submitted 2026-05-31 · 📊 stat.ME · stat.ML

Functional Clustering of Survival Data via Smoothed Log-Hazard Trajectories: A Risk-Dynamics Perspective

Pith reviewed 2026-06-28 16:26 UTC · model grok-4.3

classification 📊 stat.ME stat.ML
keywords functional clusteringsurvival analysishazard functionfunctional principal component analysisB-spline smoothinglog-hazard trajectoriesrisk dynamicsclustering
0
0 comments X

The pith

Clustering survival data on smoothed log-hazard trajectories identifies groups by distinct temporal risk patterns.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shifts analysis of survival data clustering away from cumulative survival probabilities toward the hazard function that tracks instantaneous risk at each time point. It smooths individual log-hazard trajectories with B-splines, extracts functional principal component scores that retain 95 percent of variance, and performs clustering directly on those unstandardized scores. Simulation studies examine performance under crossing hazards, cohort imbalance, heterogeneous profiles, and outlier contamination, while applications to breast cancer and biliary cirrhosis data illustrate the resulting groups. The log-hazard focus produces clusters that reflect relative changes in risk over time rather than overall survival levels.

Core claim

By representing survival data as smoothed log-hazard trajectories and clustering the FPCA scores obtained from them, the method supplies an interpretable grouping based on how risk evolves temporally for each subject, with explicit comparisons showing competitive cohesion and robustness relative to cumulative-survival benchmarks.

What carries the argument

B-spline smoothed log-hazard trajectories reduced by functional principal component analysis, with clustering performed on the resulting FPCA scores after 95 percent variance selection.

If this is right

  • The framework distinguishes patient groups whose hazard functions cross or overlap in simulations.
  • It supplies built-in diagnostics for robustness against outliers and cohort imbalance.
  • Clusters on real clinical data reflect interpretable differences in when risk rises or falls.
  • Performance remains competitive in internal cohesion metrics against cumulative-survival alternatives.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The clusters could support timing-specific interventions, such as intensified monitoring during periods when a group's hazard trajectory peaks.
  • Extending the smoothing step to include time-varying covariates might allow direct incorporation of treatment effects into the functional objects.
  • Direct comparison of hazard-based versus survival-based partitions on the same data could quantify how much additional stratification the temporal dynamics provide.

Load-bearing premise

B-spline smoothing of the log-hazard trajectories captures the relevant temporal risk patterns well enough that the derived FPCA scores remain meaningful for clustering even with right-censoring present.

What would settle it

A simulation study in which true hazard trajectories contain sharp local features that the chosen B-spline basis fails to recover, producing clusters with markedly lower recovery accuracy than the cumulative-survival benchmark methods.

Figures

Figures reproduced from arXiv: 2606.01239 by Anna De Magistris, Elvira Romano, Fabrizio Maturo.

Figure 1
Figure 1. Figure 1: On the left, a visualization of the individual-level clustering process in [5]; on the right, an example of curve-level clustering outcome in [6]. Curve-level clustering is particularly valuable in contexts where aggregated survival out￾comes provide clearer insights. In multicenter studies, for instance, it helps compare survival across hospitals following a common treatment protocol, identifying patterns… view at source ↗
Figure 2
Figure 2. Figure 2: Clustering recovery over increasing log-hazard overlap. The figure reports the mean [PITH_FULL_IMAGE:figures/full_fig_p012_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Simulation sensitivity analysis under alternative FPCA truncation thresholds. The [PITH_FULL_IMAGE:figures/full_fig_p012_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Simulated smoothed log-hazard tra￾jectories clustered in the FPCA score space. The three colors identify the clusters recovered by the proposed functional pipeline [PITH_FULL_IMAGE:figures/full_fig_p014_4.png] view at source ↗
Figure 6
Figure 6. Figure 6: Estimated log-hazard centroids for the three simulated functional clusters. The shaded [PITH_FULL_IMAGE:figures/full_fig_p014_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Retained standard FPCA eigenfunc￾tions extracted from the smoothed log-hazard trajectories in the simulation study. The leg￾end reports the proportion of functional vari￾ance associated with each component [PITH_FULL_IMAGE:figures/full_fig_p015_7.png] view at source ↗
Figure 9
Figure 9. Figure 9: GBCS clustering comparison between the proposed functional log-hazard represen [PITH_FULL_IMAGE:figures/full_fig_p017_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: GBCS log-hazard centroids with within-cluster standard deviation bands. The [PITH_FULL_IMAGE:figures/full_fig_p018_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Post-hoc FPCA interpretation for the GBCS application. Panel (a) reports the [PITH_FULL_IMAGE:figures/full_fig_p019_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: PBC clustering comparison between the proposed functional log-hazard represen [PITH_FULL_IMAGE:figures/full_fig_p020_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: PBC log-hazard centroids with within-cluster standard deviation bands. The shaded [PITH_FULL_IMAGE:figures/full_fig_p021_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Post-hoc FPCA interpretation for the PBC application. Panel (a) reports the [PITH_FULL_IMAGE:figures/full_fig_p021_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: Selected number of clusters across FPCA truncation thresholds in the GBCS and [PITH_FULL_IMAGE:figures/full_fig_p022_15.png] view at source ↗
Figure 16
Figure 16. Figure 16: Concordance between alternative FPCA-threshold partitions and the 95% reference [PITH_FULL_IMAGE:figures/full_fig_p023_16.png] view at source ↗
Figure 17
Figure 17. Figure 17: Diagnostic comparison between raw and smoothed log-hazard trajectories for the [PITH_FULL_IMAGE:figures/full_fig_p028_17.png] view at source ↗
read the original abstract

This paper investigates clustering in survival data by shifting the analytical focus from cumulative survival probabilities to instantaneous risk, as characterized by the hazard function. We model smoothed log-hazard trajectories as functional objects that capture the temporal evolution of risk and propose a clustering framework based on Functional Principal Component Analysis applied to B-spline smoothed log-hazard trajectories. The number of retained functional principal components is selected before clustering using a 95% cumulative explained-variance rule, and clustering is then performed on the unstandardized FPCA scores. The proposed methodology is evaluated through simulation studies covering progressively complex scenarios, including overlapping and crossing hazard functions, cohort imbalance, heterogeneous risk profiles, and outlier contamination. The framework is further illustrated on two real-world clinical datasets, the German Breast Cancer Study and the Primary Biliary Cirrhosis dataset. Results show that the proposed log-hazard-based functional clustering framework provides an interpretable representation of relative temporal risk dynamics, with competitive internal cohesion and explicit robustness diagnostics when compared with cumulative-survival-based benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes clustering survival data by modeling B-spline smoothed log-hazard trajectories as functional objects, applying FPCA with a 95% cumulative explained-variance threshold to select components, and performing clustering on the resulting unstandardized scores. The approach is evaluated in simulation studies that include overlapping and crossing hazards, cohort imbalance, heterogeneous profiles, and outlier contamination, and is illustrated on the German Breast Cancer Study and Primary Biliary Cirrhosis datasets, with claims of interpretable relative temporal risk dynamics and competitive internal cohesion versus cumulative-survival benchmarks.

Significance. If supported by quantitative evidence, the shift to log-hazard trajectories could provide a useful complementary perspective in survival analysis for identifying subgroups defined by dynamic risk patterns rather than cumulative survival. The simulation design covering progressively complex censoring and hazard scenarios, together with real-data applications and benchmark comparisons, would strengthen the contribution if the reported cohesion metrics and robustness diagnostics are clearly quantified.

major comments (2)
  1. [Abstract / Results] Abstract and Results sections: the claims of 'competitive internal cohesion' and 'explicit robustness diagnostics' are stated without any accompanying numerical values (e.g., silhouette widths, adjusted Rand indices, or mean squared errors), tables, or figures showing performance metrics, standard errors, or direct comparisons; this prevents evaluation of the central competitiveness assertion.
  2. [Methodology] Methodology (B-spline + FPCA pipeline): the paper relies on the assumption that B-spline smoothing of log-hazards followed by unstandardized FPCA scores remains stable and informative under right-censoring and crossing hazards, yet no sensitivity analysis or diagnostic for the impact of censoring on the retained components is described; this is load-bearing for the claim that the scores suffice for meaningful clustering.
minor comments (2)
  1. [Methodology] The 95% cumulative explained-variance rule for PC retention is presented without discussion of its sensitivity or comparison to alternative criteria such as cross-validation or scree plots.
  2. [Methodology] Notation for the functional objects and the precise definition of the log-hazard estimator should be clarified with an explicit equation early in the methods to aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the presentation of our results and strengthen the methodological justification. We address each major comment below.

read point-by-point responses
  1. Referee: [Abstract / Results] Abstract and Results sections: the claims of 'competitive internal cohesion' and 'explicit robustness diagnostics' are stated without any accompanying numerical values (e.g., silhouette widths, adjusted Rand indices, or mean squared errors), tables, or figures showing performance metrics, standard errors, or direct comparisons; this prevents evaluation of the central competitiveness assertion.

    Authors: We agree that the abstract and the summary of results do not include explicit numerical values or tables to support the claims of competitive internal cohesion and robustness. Although the simulation and real-data sections contain quantitative comparisons, these are not presented in summarized tabular form. In the revised manuscript we will add a table reporting mean silhouette widths (with standard errors), adjusted Rand indices where applicable, and direct benchmark comparisons against the cumulative-survival approach. revision: yes

  2. Referee: [Methodology] Methodology (B-spline + FPCA pipeline): the paper relies on the assumption that B-spline smoothing of log-hazards followed by unstandardized FPCA scores remains stable and informative under right-censoring and crossing hazards, yet no sensitivity analysis or diagnostic for the impact of censoring on the retained components is described; this is load-bearing for the claim that the scores suffice for meaningful clustering.

    Authors: The simulation design already includes right-censoring at multiple rates and crossing-hazard scenarios, and the clustering performance remains stable across these conditions. However, we acknowledge that a dedicated sensitivity analysis examining how censoring affects the number of retained FPCA components (under the 95% variance threshold) is not provided. We will add such diagnostics, including plots of cumulative explained variance as a function of censoring proportion, to the revised methodology section. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper introduces a methodological proposal for functional clustering of survival data via B-spline smoothed log-hazard trajectories followed by FPCA, with the number of components chosen by a 95% variance rule and clustering on unstandardized scores. Evaluation occurs on independent simulation scenarios (overlapping hazards, censoring, imbalance) and external clinical datasets (German Breast Cancer, PBC). No equations, fitted parameters, or self-citations are shown to reduce the claimed performance or interpretability to quantities defined by the authors' own inputs; the derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The work rests on standard survival-analysis assumptions and a conventional variance threshold; no new entities are postulated and no free parameters are fitted beyond the common 95% rule.

free parameters (1)
  • 95% cumulative explained-variance threshold
    Used to choose the number of retained FPCA components; a standard heuristic applied by the authors.
axioms (1)
  • domain assumption Hazard functions can be estimated from right-censored survival data
    Required to construct the log-hazard trajectories that form the input to the functional analysis.

pith-pipeline@v0.9.1-grok · 5712 in / 1301 out tokens · 44469 ms · 2026-06-28T16:26:39.338745+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

16 extracted references · 9 canonical work pages

  1. [1]

    Chandra Mouli, Leonardo Teixeira, Jennifer Neville, and Bruno Ribeiro

    S. Chandra Mouli, Leonardo Teixeira, Jennifer Neville, and Bruno Ribeiro. Deep lifetime clustering.arXiv preprint arXiv:1910.00547, 2019. URLhttps://arxiv.org/abs/1910. 00547

  2. [2]

    Massi, Thomas Weikert, Alexander Sauter, Verena Gotta, Timothy Müller, Flavio Vasella, Marian C

    Laura Manduchi, Ričards Marcinkevičs, Michela C. Massi, Thomas Weikert, Alexander Sauter, Verena Gotta, Timothy Müller, Flavio Vasella, Marian C. Neidert, Marc Pfister, Bram Stieltjes, and Julia E. Vogt. A deep variational approach to clustering survival data. arXiv preprint arXiv:2106.05763, 2021. URLhttps://arxiv.org/abs/2106.05763

  3. [3]

    Clustering survival data using a mixture of non-parametric experts.arXiv preprint arXiv:2405.15934, 2024

    Gabriel Buginga and Edmundo de Souza e Silva. Clustering survival data using a mixture of non-parametric experts.arXiv preprint arXiv:2405.15934, 2024. URLhttps://arxiv. org/abs/2405.15934

  4. [4]

    Survival cluster analysis

    Paidamoyo Chapfuwa, Chunyuan Li, Nikhil Mehta, Lawrence Carin, and Ricardo Henao. Survival cluster analysis. InProceedings of the ACM Conference on Health, Inference, and Learning, pages 60–68. ACM, 2020. doi: 10.1145/3368555.3384457

  5. [5]

    Survival hierarchical agglomerative clus- tering: Asemi-supervisedclusteringmethodincorporatingsurvivaldata

    Alexander Lacki and Antonio Martinez-Millana. Survival hierarchical agglomerative clus- tering: Asemi-supervisedclusteringmethodincorporatingsurvivaldata. InJoseM.Juarez, Mar Marcos, Gregor Stiglic, and Allan Tucker, editors,Artificial Intelligence in Medicine, pages 3–12, Cham, 2023. Springer Nature Switzerland. ISBN 978-3-031-34344-5

  6. [6]

    Villanueva, Marta Sestelo, Luis Meira-Machado, and Javier Roca-Pardiñas

    Nora M. Villanueva, Marta Sestelo, Luis Meira-Machado, and Javier Roca-Pardiñas. clustcurv: An r package for determining groups in multiple curves.The R Journal, 13(1):164–183, 2021. doi: 10.32614/RJ-2021-032. URLhttps://doi.org/10.32614/ RJ-2021-032

  7. [7]

    What can we learn from the functional clustering of mortality data? an application to the human mortality database.Demographic Research, 45:803–834, 2021

    Anna-Erica Léger and Stefano Mazzuco. What can we learn from the functional clustering of mortality data? an application to the human mortality database.Demographic Research, 45:803–834, 2021. doi: 10.4054/DemRes.2021.45.28. URLhttps://doi.org/10.4054/ DemRes.2021.45.28

  8. [8]

    Villanueva, Marta Sestelo, and Luis Meira-Machado

    Nora M. Villanueva, Marta Sestelo, and Luis Meira-Machado. Efficient and scalable clustering of survival curves.arXiv preprint arXiv:2512.16481, 2025. URLhttps: //arxiv.org/abs/2512.16481

  9. [9]

    Ramsay and Bernard W

    James O. Ramsay and Bernard W. Silverman.Functional Data Analysis. Springer, New York, NY, 2nd edition, 2005. ISBN 9780387400808. doi: 10.1007/b98888

  10. [10]

    Jan Gertheiss, David Rügamer, Bernard X. W. Liew, and Sonja Greven. Functional data analysis: Anintroductionandrecentdevelopments.arXiv preprint arXiv:2312.05523, 2023. URLhttps://arxiv.org/abs/2312.05523. 29

  11. [11]

    FabrizioMaturoandRosannaVerde. Poolingrandomforestandfunctionaldataanalysisfor biomedical signals supervised classification: Theory and application to electrocardiogram data.Statistics in Medicine, 41(12):2247–2275, 2022. doi: 10.1002/sim.9353. URLhttps: //doi.org/10.1002/sim.9353

  12. [12]

    Random survival forest for censored functional data.Statistics in Medicine, 44(5):e10344, 2025

    Giuseppe Loffredo, Elvira Romano, and Fabrizio Maturo. Random survival forest for censored functional data.Statistics in Medicine, 44(5):e10344, 2025. doi: 10.1002/sim. 10344. URLhttps://doi.org/10.1002/sim.10344

  13. [13]

    Über lineare methoden in der wahrscheinlichkeitsrechnung.Annales Academiae Scientiarum Fennicae

    Kari Karhunen. Über lineare methoden in der wahrscheinlichkeitsrechnung.Annales Academiae Scientiarum Fennicae. Series A. I. Mathematica-Physica, 37:1–79, 1947

  14. [14]

    Functional data clustering: A survey.Advances in Data Analysis and Classification, 8(3):231–255, 2014

    Julien Jacques and Cristian Preda. Functional data clustering: A survey.Advances in Data Analysis and Classification, 8(3):231–255, 2014. doi: 10.1007/s11634-013-0158-y. URLhttps://doi.org/10.1007/s11634-013-0158-y

  15. [15]

    Journal of Computational and Applied Mathematics , author =

    Peter J. Rousseeuw. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis.Journal of Computational and Applied Mathematics, 20:53–65, 1987. doi: 10.1016/0377-0427(87)90125-7

  16. [16]

    Therneau and Patricia M

    Terry M. Therneau and Patricia M. Grambsch.Modeling Survival Data: Extending the Cox Model. Springer, New York, NY, 2000. doi: 10.1007/978-1-4757-3294-8. 30