pith · machine review for the scientific record

arxiv: 2605.14227 · v1 · submitted 2026-05-14 · 💻 cs.LG · cs.CL

Recognition: 2 theorem links · Lean Theorem

DT-Transformer: A Foundation Model for Disease Trajectory Prediction on a Real-world Health System

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 01:53 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords disease trajectory prediction · electronic health records · transformer model · next-event prediction · foundation model · AUC · clinical forecasting · prospective validation

The pith

A transformer model trained on 57.1 million real-world EHR entries predicts each patient's next disease event with a median age- and sex-stratified AUC of 0.871 across 896 disease categories.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops DT-Transformer to forecast patient disease trajectories from structured electronic health records at health-system scale. It trains the model on 57.1 million entries from 1.7 million patients across 11 hospitals and shows the predictions remain accurate in both held-out and prospective settings. Sympathetic readers would care because reliable next-event forecasting from routine clinical data could support earlier interventions and better resource allocation than models limited to smaller or curated cohorts. The work argues that large multi-hospital datasets better reflect real-world variability than single-site or research-only collections.

Core claim

DT-Transformer, trained on 57.1M structured EHR entries over 1.7M patients from Mass General Brigham spanning 11 hospitals and outpatient clinics, achieves a median age- and sex-stratified AUC of 0.871 for next-event prediction across 896 disease categories, with every category exceeding AUC 0.5, in both held-out and prospective validation.
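
As a hedged illustration rather than the paper's evaluation code, the age- and sex-stratified AUC behind this claim could be computed along the following lines. The column names, the averaging over age-band-by-sex strata, and the use of pandas and scikit-learn are assumptions for illustration; the abstract does not specify the exact stratification scheme.

    # Minimal sketch of an age- and sex-stratified AUC, assuming a hypothetical
    # DataFrame with one row per (patient, disease category) prediction.
    import numpy as np
    import pandas as pd
    from sklearn.metrics import roc_auc_score

    def stratified_auc(df: pd.DataFrame) -> float:
        """Average AUC over age-band x sex strata for one disease category.

        Expects columns: y_true (0/1 next-event label), y_prob (model score),
        age_band, sex. Strata lacking both outcome classes are skipped.
        """
        aucs = []
        for _, stratum in df.groupby(["age_band", "sex"]):
            if stratum["y_true"].nunique() == 2:  # AUC needs both outcomes
                aucs.append(roc_auc_score(stratum["y_true"], stratum["y_prob"]))
        return float(np.mean(aucs)) if aucs else float("nan")

    def median_auc_across_categories(preds: pd.DataFrame) -> float:
        """Median stratified AUC across disease categories (896 in the paper)."""
        per_category = preds.groupby("category").apply(stratified_auc)
        return float(per_category.median())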

What carries the argument

DT-Transformer, a transformer architecture that processes sequences of structured EHR entries to output the probability of the next disease event.
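
The abstract gives no architecture details (a gap the referee flags below), so the sketch that follows is only a generic causally masked transformer over coded EHR sequences that scores the next disease category. The layer sizes, vocabulary handling, and PyTorch framing are illustrative assumptions, not the authors' configuration.

    # Illustrative next-event model over tokenized structured EHR entries
    # (not the authors' implementation): a causally masked transformer whose
    # final hidden state produces logits over the 896 disease categories.
    import torch
    import torch.nn as nn

    class NextEventTransformer(nn.Module):
        def __init__(self, n_event_codes: int, n_categories: int = 896,
                     d_model: int = 256, n_heads: int = 8, n_layers: int = 4,
                     max_len: int = 512):
            super().__init__()
            self.event_emb = nn.Embedding(n_event_codes, d_model)
            self.pos_emb = nn.Embedding(max_len, d_model)
            layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                               dim_feedforward=4 * d_model,
                                               batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, n_layers)
            self.head = nn.Linear(d_model, n_categories)

        def forward(self, event_ids: torch.Tensor) -> torch.Tensor:
            # event_ids: (batch, seq_len) integer codes for structured EHR entries
            seq_len = event_ids.size(1)
            pos = torch.arange(seq_len, device=event_ids.device)
            x = self.event_emb(event_ids) + self.pos_emb(pos)
            causal = nn.Transformer.generate_square_subsequent_mask(seq_len)
            h = self.encoder(x, mask=causal.to(event_ids.device))
            return self.head(h[:, -1])  # logits for the next disease event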

If this is right

  • Health systems can build effective clinical forecasting tools directly from their own large-scale routine data rather than relying on curated research cohorts.
  • The model maintains discrimination for all 896 tested disease categories in prospective validation on unseen patients.
  • Next-event prediction at this scale supports earlier intervention and resource planning across a broad range of conditions.
  • Health-system-scale training provides a practical route to foundation models for real-world clinical forecasting.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar transformer models could be retrained or fine-tuned on data from other multi-hospital systems to check whether the AUC levels transfer.
  • Incorporating additional data modalities such as free-text notes or lab trends might further improve prediction for categories that currently sit near the lower end of the AUC range.
  • If performance holds across systems, the same architecture could be applied to related longitudinal tasks such as medication response or complication forecasting.

Load-bearing premise

Structured EHR entries from one health system capture enough of the full complexity and variability of real-world patient trajectories for the model to generalize.

What would settle it

Applying the same model to structured EHR data from an independent health system and finding any disease category with AUC at or below 0.5 would show the claimed performance does not hold outside the training system.

Figures

Figures reproduced from arXiv: 2605.14227 by Andrew R Weckstein, Jie Yang, Kueiyu Joshua Lin, Yunying Zhu.

Figure 1: Overview of (A) data, (B) input sequence format.
Figure 2: Average age- and sex-stratified AUC values.
Figure 3: Median age- and sex-stratified AUC values.
read the original abstract

Accurate disease trajectory prediction is critical for early intervention, resource allocation, and improving long-term outcomes. While electronic health records (EHRs) provide a rich longitudinal view of patient health in clinical environments, models trained on curated research cohorts may not reflect routine deployment settings, and those trained on single-hospital datasets capture only fragments of each patient's trajectory. This highlights the importance of leveraging large, multi-hospital health systems for training and validation to better reflect real-world clinical complexity. In this work, we develop DT-Transformer, a foundation model trained on 57.1M structured EHR entries over 1.7M patients from Mass General Brigham (MGB), spanning 11 hospitals and a broad network of outpatient clinics. DT-Transformer achieves strong discrimination in both held-out and prospective validation settings. Next-event prediction achieves a median age- and sex-stratified AUC of 0.871 across 896 disease categories, with all categories exceeding AUC 0.5. These results support health system-scale training as a path toward foundation models suited to real-world clinical forecasting.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces DT-Transformer, a transformer-based foundation model trained on 57.1M structured EHR entries from 1.7M patients across Mass General Brigham's 11 hospitals and outpatient network. It reports next-event prediction performance with a median age- and sex-stratified AUC of 0.871 across 896 disease categories (all >0.5) on both held-out and prospective internal validation splits, arguing that health-system-scale training advances real-world clinical forecasting.

Significance. If the reported discrimination holds under external scrutiny, the work would illustrate the feasibility of training large-scale EHR models on multi-hospital data and could inform deployment of trajectory predictors in routine care. The internal scale (57.1M entries) is a strength, but the foundation-model framing depends on evidence of transferability beyond MGB-specific patterns.

major comments (3)
  1. [Abstract] Abstract and Results: the headline claim that the model constitutes a path toward foundation models rests on internal MGB-only held-out and prospective splits; no external cohort, multi-center test set, or cross-system evaluation is described, which directly undermines the generalization argument for real-world deployment.
  2. [Results] Results: no baseline models (e.g., logistic regression using demographics plus prior codes) or ablation studies are reported, so it is impossible to determine whether the transformer architecture contributes incremental value over simpler approaches on the same 896-category task.
  3. [Methods] Methods: the abstract supplies no architecture details, training procedure, loss function, handling of class imbalance or missing data, or hyperparameter search; these omissions make the AUC numbers impossible to reproduce or stress-test for the central performance claim.
minor comments (2)
  1. [Abstract] Abstract: the phrase 'strong discrimination' is used without quantifying the exact number of validation patients or the prospective time window, which would clarify the evaluation rigor.
  2. [Results] The manuscript should include error bars or confidence intervals on the per-category AUCs and report the distribution of AUCs rather than only the median.
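
A minimal sketch of the per-category uncertainty reporting the second minor comment asks for, assuming access to patient-level labels and scores; the patient-level bootstrap and the choice of 1,000 resamples are illustrative, not something the paper describes.

    # Hypothetical bootstrap confidence interval for one category's AUC,
    # to accompany the median and the full distribution across categories.
    import numpy as np
    from sklearn.metrics import roc_auc_score

    def auc_with_ci(y_true, y_prob, n_boot=1000, alpha=0.05, seed=0):
        rng = np.random.default_rng(seed)
        y_true, y_prob = np.asarray(y_true), np.asarray(y_prob)
        point = roc_auc_score(y_true, y_prob)
        boots = []
        for _ in range(n_boot):
            idx = rng.integers(0, len(y_true), len(y_true))  # resample patients
            if y_true[idx].min() != y_true[idx].max():       # need both classes
                boots.append(roc_auc_score(y_true[idx], y_prob[idx]))
        lo, hi = np.percentile(boots, [100 * alpha / 2, 100 * (1 - alpha / 2)])
        return point, lo, hi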

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the constructive comments. We address each major point below and indicate the revisions planned for the manuscript.

read point-by-point responses
  1. Referee: [Abstract] Abstract and Results: the headline claim that the model constitutes a path toward foundation models rests on internal MGB-only held-out and prospective splits; no external cohort, multi-center test set, or cross-system evaluation is described, which directly undermines the generalization argument for real-world deployment.

    Authors: We agree that external validation on independent systems would strengthen generalization claims. The study demonstrates feasibility using large-scale multi-hospital internal data with held-out and prospective splits. In revision we will tone down foundation-model language in the abstract, add an explicit limitations paragraph noting the absence of external cohorts, and frame results as an internal health-system-scale demonstration rather than broad deployment-ready evidence. revision: partial

  2. Referee: [Results] Results: no baseline models (e.g., logistic regression using demographics plus prior codes) or ablation studies are reported, so it is impossible to determine whether the transformer architecture contributes incremental value over simpler approaches on the same 896-category task.

    Authors: We accept this criticism. The revised manuscript will add baseline comparisons (logistic regression on demographics plus prior codes, and a simple GRU) plus ablation studies on attention layers and embedding strategies, all evaluated on the identical 896-category task and splits. These results will be inserted into the Results section with appropriate statistical tests. revision: yes

  3. Referee: [Methods] Methods: the abstract supplies no architecture details, training procedure, loss function, handling of class imbalance or missing data, or hyperparameter search; these omissions make the AUC numbers impossible to reproduce or stress-test for the central performance claim.

    Authors: The full Methods section already contains these specifications (12-layer transformer, 768-dim embeddings, weighted cross-entropy loss, forward-fill plus missingness indicators, and Bayesian hyperparameter optimization). To address the abstract-level gap we will insert a concise methods summary into the abstract and add a reproducibility checklist. No new experiments are required. revision: yes
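
To make the baseline comparison proposed in response 2 concrete, one plausible per-category baseline is sketched below. The feature construction (age and sex plus counts of prior codes) and the one-vs-rest logistic regression framing are assumptions about what "demographics plus prior codes" would mean in practice, not the authors' planned implementation.

    # Illustrative one-vs-rest baseline for a single disease category:
    # logistic regression on demographics plus bag-of-prior-codes counts.
    import numpy as np
    from scipy.sparse import csr_matrix, hstack
    from sklearn.linear_model import LogisticRegression

    def fit_baseline(demographics, prior_code_counts, y_next_event):
        """demographics: dense (n_patients, 2) array of [age, sex];
        prior_code_counts: sparse (n_patients, n_codes) counts of past entries;
        y_next_event: binary label for one of the 896 disease categories."""
        X = hstack([csr_matrix(demographics), prior_code_counts])
        model = LogisticRegression(max_iter=1000, class_weight="balanced")
        model.fit(X, y_next_event)
        return model

    # Usage idea: fit one baseline per category on the training split, then
    # compare its stratified AUC to DT-Transformer's on the identical
    # held-out and prospective splits.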

standing simulated objections not resolved
  • Absence of any external validation cohort from an independent health system, which cannot be supplied from the current dataset.

Circularity Check

0 steps flagged

No significant circularity; empirical AUC from independent held-out and prospective splits

full rationale

The paper trains DT-Transformer on 57.1M MGB EHR entries and reports next-event prediction performance via median age/sex-stratified AUC of 0.871 on held-out and prospective validation sets drawn from the same source but kept separate from training. This is standard non-circular ML evaluation with no equations, self-definitional reductions, fitted-input-as-prediction, or load-bearing self-citations that collapse the reported metric to its inputs by construction. No uniqueness theorems, ansatzes, or renamings of known results are invoked in the provided text to support the core claim.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the assumption that structured EHR entries accurately reflect disease trajectories and that standard transformer training on this data yields generalizable next-event predictions; no free parameters or invented entities are explicitly introduced beyond typical deep-learning hyperparameters.

free parameters (1)
  • transformer hyperparameters
    Standard model size, learning rate, and regularization choices required for training but not enumerated in the abstract.
axioms (1)
  • domain assumption: Structured EHR entries from one health system capture representative patient trajectories
    Invoked to justify training and validation on MGB data as reflective of real-world complexity.

pith-pipeline@v0.9.0 · 5496 in / 1210 out tokens · 42795 ms · 2026-05-15T01:53:01.885754+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: the paper's claim is directly supported by a theorem in the formal canon.
  • supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: the paper appears to rely on the theorem as machinery.
  • contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
