pith. machine review for the scientific record.

arxiv: 2605.03895 · v2 · submitted 2026-05-05 · 💻 cs.LG · cs.SE

Recognition: no theorem link

From Data Lifting to Continuous Risk Estimation: A Process-Aware Pipeline for Predictive Monitoring of Clinical Pathways


Pith reviewed 2026-05-13 07:19 UTC · model grok-4.3

classification 💻 cs.LG cs.SE
keywords clinical pathways · predictive monitoring · process mining · data lifting · prefix analysis · COVID-19 · ICU admission · logistic regression

The pith

A process-aware pipeline using data lifting and prefix representations supports continuous risk estimation for clinical pathways, with accuracy increasing as patient data accumulates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper establishes a pipeline that transforms raw clinical data into prefix-based event sequences for predicting outcomes in ongoing pathways. This matters because risk estimates can be updated continuously rather than only after the fact, which could support timely decisions in patient care. On COVID-19 data, AUC rises from 0.642 at early pathway stages to 0.942 at late stages, with logistic regression achieving the best overall performance (AUC 0.906).

Core claim

The central discovery is that a pipeline from data lifting through temporal reconstruction and prefix construction enables predictive models to perform continuous risk estimation on clinical pathways. Using ICU admission in COVID-19 as the target, the models demonstrate increasing accuracy with pathway progression, highlighting that predictive signals strengthen over time in evolving trajectories.

What carries the argument

The prefix-based representation derived from lifted event logs, which encodes the sequence of events up to the current point in a patient's clinical pathway for use in predictive models.
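For concreteness, the kind of prefix construction described above can be sketched as follows. This is a minimal illustration, not the paper's actual lifting code; the event schema and activity names are assumed.

```python
from dataclasses import dataclass

# Hypothetical lifted-event record; the paper's actual schema is not shown here.
@dataclass
class Event:
    case_id: str
    activity: str
    timestamp: float  # e.g. seconds since admission

def build_prefixes(events, min_len=1):
    """Group events by case, sort each trace by time, and emit every
    prefix (the first k events) of each case's trace."""
    traces = {}
    for e in events:
        traces.setdefault(e.case_id, []).append(e)
    prefixes = []
    for case_id, trace in traces.items():
        trace.sort(key=lambda e: e.timestamp)
        for k in range(min_len, len(trace) + 1):
            prefixes.append((case_id, trace[:k]))
    return prefixes

# Example: one 3-event case yields 3 prefixes of lengths 1, 2, 3,
# each encoding the pathway "as seen so far".
events = [
    Event("p1", "admission", 0.0),
    Event("p1", "lab_test", 3600.0),
    Event("p1", "oxygen_therapy", 7200.0),
]
print([len(p) for _, p in build_prefixes(events)])  # [1, 2, 3]
```

Each prefix is then featurized and scored by a predictive model, which is what makes the risk estimate updatable at every new event.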

Load-bearing premise

The case-level split and data lifting process produce unbiased prefix representations that accurately reflect real evolving trajectories without leakage or systematic missing data in the COVID-19 dataset.

What would settle it

A failure to observe increasing AUC as prefixes lengthen in additional patient cohorts, or evidence that the reported performance is inflated by leakage in the split, would falsify the claim.

Figures

Figures reproduced from arXiv: 2605.03895 by Mario Luca Bernardi, Marta Cimitile, Pasquale Ardimento, Samuele Latorre.

Figure 1. Proposed pipeline for predictive monitoring of clinical pathways.
Original abstract

This paper presents a reproducible and process-aware pipeline for predictive monitoring of clinical pathways. The approach integrates data lifting, temporal reconstruction, event log construction, prefix-based representations, and predictive modeling to support continuous reasoning on partially observed patient trajectories, overcoming the limitations of traditional retrospective process mining. The framework is evaluated on COVID-19 clinical pathways using ICU admission as the prediction target, considering 4,479 patient cases and 46,804 prefixes. Predictive models are trained and evaluated using a case-level split, with 896 patients in the test set. Logistic Regression achieves the best performance (AUC 0.906, F1-score 0.835). A detailed prefix-based analysis shows that predictive performance improves progressively as new clinical events become available, with AUC increasing from 0.642 at early stages to 0.942 at later stages of the pathway. The results highlight two key findings: predictive signals emerge progressively along clinical pathways, and process-aware representations enable effective early risk estimation from evolving patient trajectories. Overall, the findings suggest that predictive monitoring in healthcare is best conceived as a continuous, dynamically aware process, in which risk estimates are progressively refined as the patient journey evolves.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript presents a reproducible and process-aware pipeline for predictive monitoring of clinical pathways, integrating data lifting, temporal reconstruction, event log construction, prefix-based representations, and predictive modeling. Evaluated on a COVID-19 dataset with 4,479 patient cases, 46,804 prefixes, and ICU admission as the target, using case-level splitting with 896 patients in the test set, it reports that Logistic Regression achieves the best performance (AUC 0.906, F1-score 0.835). The central empirical finding is that predictive performance improves progressively as new events become available, with AUC increasing from 0.642 at early stages to 0.942 at later stages of the pathway.

Significance. If the no-leakage assumption in prefix construction holds, the work provides concrete evidence that process-aware representations support continuous risk estimation, with performance scaling as trajectories evolve. Strengths include the large dataset size, explicit case-level splitting to prevent cross-patient leakage, and the progressive prefix analysis that directly tests the dynamic nature of the claims. This offers a practical, reproducible framework with potential impact on real-time clinical decision support systems.

major comments (1)
  1. [Pipeline Description (Data Lifting and Prefix Construction)] The description of data lifting, temporal reconstruction, and prefix-based representations does not explicitly state that all features, aggregates, and embeddings are recomputed exclusively from the strict prefix available at each time point (i.e., without using future events or case-level global statistics). This verification is load-bearing for the headline result that AUC rises from 0.642 early to 0.942 late, because any non-local computation would introduce leakage and invalidate the progressive-performance claim.
minor comments (2)
  1. [Abstract] The abstract introduces 'data lifting' without a one-sentence definition or pointer to its role in ensuring prefix locality, which reduces immediate accessibility for readers outside process mining.
  2. [Evaluation] The evaluation section would benefit from a brief statement on the exact train/test split ratio and any stratification criteria beyond case-level independence.
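The case-level independence the report refers to can be made concrete with scikit-learn's `GroupShuffleSplit`, which keeps every prefix of a given patient on one side of the split. The 20% test fraction below is an assumption inferred from the reported 896-of-4,479 test patients, not a confirmed detail of the paper; features and labels are synthetic.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
n_prefixes = 1000
X = rng.normal(size=(n_prefixes, 5))           # toy prefix features
y = rng.integers(0, 2, size=n_prefixes)        # toy ICU-admission labels
cases = rng.integers(0, 100, size=n_prefixes)  # patient id for each prefix

# Case-level split: all prefixes of a patient fall in exactly one side,
# preventing cross-patient leakage between train and test.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(X, y, groups=cases))

assert set(cases[train_idx]).isdisjoint(set(cases[test_idx]))
```

A random prefix-level split would instead scatter one patient's prefixes across both sides, which is exactly the leakage the case-level split is designed to rule out.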

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive and detailed feedback, which helps strengthen the clarity of our pipeline description. We address the major comment point by point below.

Point-by-point responses
  1. Referee: The description of data lifting, temporal reconstruction, and prefix-based representations does not explicitly state that all features, aggregates, and embeddings are recomputed exclusively from the strict prefix available at each time point (i.e., without using future events or case-level global statistics). This verification is load-bearing for the headline result that AUC rises from 0.642 early to 0.942 late, because any non-local computation would introduce leakage and invalidate the progressive-performance claim.

    Authors: We agree that an explicit statement on this point is essential for validating the no-leakage assumption underlying the progressive AUC results. In the implemented pipeline, all features, aggregates, and embeddings are strictly recomputed from the events present in each prefix only, with no access to future events or case-level global statistics; this is enforced by constructing independent prefix logs during temporal reconstruction and by limiting all computations (e.g., frequency counts, duration aggregates, and embeddings) to the prefix snapshot at each step. The case-level split further ensures no cross-patient information leakage. To address the referee's concern, we will revise the manuscript by adding a dedicated paragraph (and accompanying pseudocode) in the Methods section that explicitly describes this prefix-only recomputation process and its enforcement. This clarification will directly reinforce the validity of the reported performance progression without changing any empirical results. revision: yes
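A minimal sketch of the prefix-only recomputation the rebuttal describes: every feature is derived solely from the events inside the prefix, with no access to later events or whole-case statistics. The feature names are hypothetical; the paper's exact feature set is not shown here.

```python
def prefix_features(prefix_events):
    """Compute features strictly from the events in the prefix.
    No later events or case-level global statistics are touched,
    which is the no-leakage property the rebuttal asserts."""
    activities = [e["activity"] for e in prefix_events]
    times = [e["timestamp"] for e in prefix_events]
    return {
        "n_events": len(prefix_events),
        "n_distinct_activities": len(set(activities)),
        "elapsed": times[-1] - times[0] if times else 0.0,
        "last_activity": activities[-1] if activities else None,
    }

trace = [
    {"activity": "admission", "timestamp": 0.0},
    {"activity": "lab_test", "timestamp": 3600.0},
    {"activity": "icu_transfer", "timestamp": 9000.0},
]
# The length-2 prefix must not see the later icu_transfer event.
f = prefix_features(trace[:2])
print(f["n_events"], f["elapsed"])  # 2 3600.0
```

The referee's concern is exactly that such functions receive `trace[:k]` rather than `trace`; any aggregate computed over the full trace would leak future information into early prefixes.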

Circularity Check

0 steps flagged

No circularity: standard held-out evaluation on independent cases

full rationale

The paper reports empirical results from a standard ML pipeline (data lifting to prefixes, case-level split into 896 test patients, Logistic Regression training) with AUC/F1 computed on held-out prefixes. No equations, derivations, or self-citations reduce any reported prediction or performance number to a quantity fitted on the same data used for evaluation. The progressive AUC claim (0.642 early to 0.942 late) is obtained by direct evaluation on temporally ordered prefixes from the test set, not by construction or renaming of inputs. The derivation chain is self-contained against external benchmarks with no load-bearing self-referential steps.
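The per-stage evaluation described here amounts to computing AUC separately for each prefix length on the held-out set. A sketch on synthetic data, where signal strength grows with prefix length by construction (the numbers are made up, not the paper's):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Synthetic held-out prefixes: the predictive signal strengthens with
# prefix length, mimicking the progressive-performance setting.
rows = []
for length in range(1, 6):
    y = rng.integers(0, 2, size=200)
    score = y * (0.2 * length) + rng.normal(size=200)
    rows.append((length, y, score))

# AUC computed per prefix length on the test set only — no refitting,
# no reuse of training data, hence no circularity.
auc_by_length = {L: roc_auc_score(y, s) for L, y, s in rows}
for L, auc in sorted(auc_by_length.items()):
    print(f"prefix length {L}: AUC {auc:.3f}")
```

Because each length bucket is scored independently against held-out labels, a rising AUC curve is an empirical observation rather than an artifact of the evaluation design.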

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on standard machine learning evaluation assumptions and the validity of the described data lifting and prefix construction steps; no new free parameters, axioms, or invented entities are introduced.

pith-pipeline@v0.9.0 · 5521 in / 996 out tokens · 55466 ms · 2026-05-13T07:19:49.859661+00:00 · methodology

