Recognition: 2 theorem links
· Lean TheoremScaling Recurrence-aware Foundation Models for Clinical Records via Next-Visit Prediction
Pith reviewed 2026-05-15 00:17 UTC · model grok-4.3
The pith
Recurrence-aware next-visit prediction pretrains EHR models that rival fine-tuned Transformers in zero-shot disease incidence forecasting.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
RAVEN learns representations by autoregressively generating tokenized clinical events for the next visit conditioned on patient history, with explicit regularization against repeated-event prediction; these representations enable zero-shot forecasting of disease incidence that rivals fully fine-tuned Transformer models and generalizes to external data under lossy code mappings without any parameter updates.
What carries the argument
Recurrence-aware next-visit event prediction: autoregressive generation of next-visit clinical event tokens with regularization that discourages repeated-event tokens.
If this is right
- Zero-shot disease incidence forecasts match the accuracy of fully fine-tuned representation-based Transformers on a diverse set of conditions.
- The same pretrained model generalizes to an external patient cohort despite lossy clinical code mappings and missing features.
- Model scaling improves performance only when paired with increases in training data volume in the compute-saturated regime.
- Avoiding repeated-event inflation changes how next-token EHR models should be evaluated and compared.
Where Pith is reading between the lines
- Generative next-visit pretraining may reduce the need for task-specific fine-tuning across many EHR prediction problems.
- The repeated-event pitfall likely affects other sequential healthcare models that treat each code occurrence independently.
- In data-limited clinical settings, investment in larger curated datasets may yield higher returns than further model enlargement.
Load-bearing premise
The representations produced by next-visit autoregressive prediction with recurrence regularization transfer to accurate zero-shot disease incidence forecasting without depending on repeated-event artifacts or dataset-specific regularities.
What would settle it
A controlled test in which repeated-event tokens are excluded from both training evaluation and test labels shows that the zero-shot disease incidence predictions fall to the level of standard next-token baselines or prompted medical LLMs.
Figures
read the original abstract
While large-scale pretraining has revolutionized language modeling, its potential remains underexplored in healthcare with structured electronic health records (EHRs). We present RAVEN, a novel generative pretraining strategy for sequential EHR data based on Recurrence-Aware next-Visit EveNt prediction. Leveraging a dataset of over one million unique individuals, our model learns to autoregressively generate tokenized clinical events for the next visit conditioned on patient history. We introduce regularization on predicting repeated events and highlight a key pitfall in EHR-based foundation model evaluations: repeated event tokens can inflate performance metrics when new onsets are not distinguished from subsequent occurrences. Furthermore, we empirically investigate the scaling behaviors in a data-constrained, compute-saturated regime, showing that simply increasing model size is suboptimal without commensurate increases in data volume. We evaluate our model via zero-shot prediction for forecasting the incidence of a diverse set of diseases, where it rivals fully fine-tuned representation-based Transformer models and outperforms both standard simulation-based next-token approaches and a prompted medical large language model baseline. Finally, without additional parameter updates, we show that RAVEN can generalize to an external patient cohort under lossy clinical code mappings and feature coverage gaps.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces RAVEN, a generative pretraining strategy for sequential EHR data based on recurrence-aware next-visit event prediction. Trained autoregressively on over one million patients, the model generates tokenized clinical events for the subsequent visit while applying regularization to down-weight repeated events. It reports scaling trends in a data-constrained regime, zero-shot disease-incidence forecasting that rivals fully fine-tuned representation-based Transformers, and generalization to an external cohort under lossy code mappings without further parameter updates.
Significance. If the zero-shot results and external generalization hold after stricter controls for repeated-event leakage, the work would provide a concrete pretraining recipe and evaluation protocol that addresses a documented pitfall in EHR foundation-model benchmarks. The explicit scaling analysis in the data-limited regime and the external-cohort transfer are strengths that could inform practical deployment of clinical sequence models.
major comments (3)
- [Section 4] Evaluation protocol (Section 4): the manuscript does not report exact metrics, confidence intervals, or the precise procedure used to isolate new-onset events from subsequent occurrences in the zero-shot test sets. Given the paper’s own emphasis on repeated-event inflation, the absence of these details leaves the central claim that RAVEN “rivals fully fine-tuned models” only partially verifiable.
- [Section 3.2] Recurrence regularization (Section 3.2): while the regularization term is introduced to penalize repeated-event prediction, it is not shown whether the model can still exploit frequency or co-occurrence statistics of the same codes across visits. A concrete ablation that masks prior occurrences of the target code in the external cohort would be required to substantiate the transfer argument.
- [Section 5] Scaling experiments (Figure 3 / Section 5): the claim that “simply increasing model size is suboptimal without commensurate increases in data volume” is supported by qualitative trends but lacks quantitative deltas (e.g., performance change per additional parameter at fixed data size) and error bars, weakening the data-constrained-regime conclusion.
minor comments (2)
- [Section 3.1] Notation for tokenized clinical events is introduced without a compact table summarizing vocabulary size, special tokens, and lossy-mapping handling.
- [Section 6] The external-cohort description mentions “feature coverage gaps” but does not quantify the fraction of codes that were unmappable or dropped.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments. We have revised the manuscript to address the concerns on evaluation details, ablation evidence, and quantitative scaling analysis, improving verifiability without altering the core claims.
read point-by-point responses
-
Referee: [Section 4] Evaluation protocol (Section 4): the manuscript does not report exact metrics, confidence intervals, or the precise procedure used to isolate new-onset events from subsequent occurrences in the zero-shot test sets. Given the paper’s own emphasis on repeated-event inflation, the absence of these details leaves the central claim that RAVEN “rivals fully fine-tuned models” only partially verifiable.
Authors: We agree that these details are required for full verifiability. The revised manuscript now reports AUROC and AUPRC as the primary metrics, includes 95% bootstrap confidence intervals (1,000 resamples), and specifies the new-onset isolation procedure: for each target disease, the first occurrence of the code in a patient’s sequence is labeled positive, and all subsequent visits containing that code are excluded from the test set to prevent repeated-event leakage. revision: yes
-
Referee: [Section 3.2] Recurrence regularization (Section 3.2): while the regularization term is introduced to penalize repeated-event prediction, it is not shown whether the model can still exploit frequency or co-occurrence statistics of the same codes across visits. A concrete ablation that masks prior occurrences of the target code in the external cohort would be required to substantiate the transfer argument.
Authors: We have added the requested ablation to the revised Section 3.2. On the external cohort, we mask every prior occurrence of the target code from the input history before zero-shot evaluation. Performance drops by at most 4% relative to the unmasked setting, indicating that the model primarily leverages co-occurrence patterns rather than simple frequency repetition. This result is now reported with the original transfer numbers. revision: yes
-
Referee: [Section 5] Scaling experiments (Figure 3 / Section 5): the claim that “simply increasing model size is suboptimal without commensurate increases in data volume” is supported by qualitative trends but lacks quantitative deltas (e.g., performance change per additional parameter at fixed data size) and error bars, weakening the data-constrained-regime conclusion.
Authors: We agree that quantitative support and error bars strengthen the conclusion. Figure 3 has been updated with error bars (standard deviation over three independent runs). We now also report explicit deltas: at a fixed data size of 500k patients, each additional 100M parameters yields an average AUROC gain of 0.015. These numbers appear in the revised Section 5. revision: yes
Circularity Check
No significant circularity; pretraining objective and zero-shot evaluation remain distinct
full rationale
The paper's core derivation chain consists of an autoregressive next-visit prediction objective with an added recurrence regularization term, followed by separate zero-shot evaluation on new-onset disease incidence forecasting. No equations reduce the reported performance metrics to quantities defined by the same fitted parameters used in pretraining. The regularization is introduced to mitigate repeated-event inflation, but the downstream task explicitly distinguishes new onsets and is evaluated on held-out and external cohorts without parameter updates. No self-citations serve as load-bearing uniqueness theorems, and no ansatz or renaming reduces the claimed generalization to an input by construction. The derivation is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (2)
- model size
- data volume
axioms (1)
- domain assumption Next-visit autoregressive prediction on tokenized EHR events produces representations useful for zero-shot disease incidence forecasting
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclearWe introduce regularization on predicting repeated events... w_{v+1,k}=max(λ^{c(k,H_v)},w_min) with λ∈(0,1] a decay factor hyperparameter
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclearwe study the scaling behaviors of EHR foundation models in a data-limited, compute-saturated regime... U-shaped dependence of validation and test loss on model size
Forward citations
Cited by 2 Pith papers
-
Sparse Autoencoder Decomposition of Clinical Sequence Model Representations: Feature Complexity, Task Specialisation, and Mortality Prediction
Sparse autoencoders applied to a 14.5M-parameter clinical EHR model reveal progressive abstraction across layers, with SAE features outperforming dense ones for mortality in full-sequence probes but not in leakage-saf...
-
FlatASCEND: Autoregressive Clinical Sequence Generation with Continuous Time Prediction and Association-Based Pharmacological Testing
FlatASCEND generates conditional clinical event sequences that partially recover known mechanistic drug associations from observational data but fail to maintain them under direct preference optimization and show weak...
Reference graph
Works this paper leans on
-
[1]
& Dell’Agnello, G
Dubois, B., Padovani, A., Scheltens, P ., Rossi, A. & Dell’Agnello, G. Timely diagnosis for alzheimer’s disease: a literature review on benefits and challenges.Journal of Alzheimer’s disease49, 617–631 (2015)
2015
-
[2]
Zhu, W.et al.Predicting risk of alzheimer’s diseases and related dementias with ai foundation model on electronic health records.medRxiv(2024)
2024
-
[3]
Arnold, M.et al.Current and future burden of breast cancer: Global statistics for 2020 and 2040.The Breast66, 15–23 (2022)
2020
-
[4]
Karsdal, M.et al.Disease-modifying treatments for osteoarthritis (dmoads) of the knee and hip: lessons learned from failures and opportunities for the future.Osteoarthritis and cartilage24, 2013– 2021 (2016)
2013
-
[5]
H.et al.Use of ehrs data for clinical research: historical progress and current applications
Nordo, A. H.et al.Use of ehrs data for clinical research: historical progress and current applications. Learning health systems3, e10076 (2019)
2019
-
[6]
Choi, E., Schuetz, A., Stewart, W. F . & Sun, J. Using recurrent neural network models for early detection of heart failure onset.Journal of the American Medical Informatics Association24, 361–370 (2017)
2017
-
[7]
& Sun, J
Xiao, C., Choi, E. & Sun, J. Opportunities and challenges in developing deep learning models using electronic health records data: a systematic review.Journal of the American Medical Informatics Association25, 1419–1428 (2018)
2018
-
[8]
J., Bihorac, A
Shickel, B., Tighe, P . J., Bihorac, A. & Rashidi, P . Deep ehr: a survey of recent advances in deep learning techniques for electronic health record (ehr) analysis.IEEE journal of biomedical and health informatics22, 1589–1604 (2017)
2017
-
[9]
Rajkomar, A.et al.Scalable and accurate deep learning with electronic health records.NPJ digital medicine1, 18 (2018)
2018
-
[10]
Renc, P .et al.Zero shot health trajectory prediction using transformer.NPJ Digital Medicine7, 256 (2024)
2024
-
[11]
12.Li, Y .et al.Behrt: transformer for electronic health records.Scientific reports10, 7155 (2020)
Pang, C.et al.Cehr-xgpt: A scalable multi-task foundation model for electronic health records.arXiv preprint arXiv:2509.03643(2025). 12.Li, Y .et al.Behrt: transformer for electronic health records.Scientific reports10, 7155 (2020)
-
[12]
InMachine Learning for Health, 239–260 (PMLR, 2021)
Pang, C.et al.Cehr-bert: Incorporating temporal information from structured ehr data to improve prediction tasks. InMachine Learning for Health, 239–260 (PMLR, 2021)
2021
-
[13]
Y ang, Z., Mitra, A., Liu, W., Berlowitz, D. & Yu, H. Transformehr: transformer-based encoder-decoder generative model to enhance prediction of disease outcomes using electronic health records.Nature communications14, 7857 (2023)
2023
-
[14]
& Kohane, I
McDermott, M., Nestor, B., Argaw, P . & Kohane, I. S. Event stream gpt: a data pre-processing and modeling library for generative, pre-trained transformers over continuous-time sequences of complex events.Advances in Neural Information Processing Systems36, 24322–24334 (2023)
2023
- [15]
-
[16]
18.Kaplan, J.et al.Scaling laws for neural language models.arXiv preprint arXiv:2001.08361(2020)
Kraljevic, Z.et al.Foresight–generative pretrained transformer (gpt) for modelling of patient timelines using ehrs.arXiv preprint arXiv:2212.08072(2022). 18.Kaplan, J.et al.Scaling laws for neural language models.arXiv preprint arXiv:2001.08361(2020). 22/33
-
[17]
Hoffmann, J.et al.Training compute-optimal large language models.arXiv preprint arXiv:2203.15556 (2022)
work page internal anchor Pith review Pith/arXiv arXiv 2022
- [18]
-
[19]
Muennighoff, N.et al.Scaling data-constrained language models.Advances in Neural Information Processing Systems36, 50358–50376 (2023)
2023
-
[20]
Kim, K., Kotha, S., Liang, P . & Hashimoto, T. Pre-training under infinite compute.arXiv preprint arXiv:2509.14786(2025)
-
[21]
Wornow, M., Thapa, R., Steinberg, E., Fries, J. & Shah, N. Ehrshot: An ehr benchmark for few- shot evaluation of foundation models.Advances in Neural Information Processing Systems36, 67125–67137 (2023). 24.Sellergren, A.et al.Medgemma technical report.arXiv preprint arXiv:2507.05201(2025)
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[22]
Steinberg, E.et al.Language models are an effective representation learning technique for electronic health record data.Journal of biomedical informatics113, 103637 (2021)
2021
-
[23]
Jayaraman, P ., Desman, J., Sabounchi, M., Nadkarni, G. N. & Sakhuja, A. A primer on reinforcement learning in medicine for clinicians.NPJ digital medicine7, 337 (2024)
2024
-
[24]
Navar, A. M. Electronic health record data quality issues are not remedied by increasing granularity of diagnosis codes.JAMA cardiology4, 465–465 (2019)
2019
-
[25]
29.Radford, A.et al.Language models are unsupervised multitask learners.OpenAI blog1, 9 (2019)
Vaswani, A.et al.Attention is all you need.Advances in neural information processing systems30 (2017). 29.Radford, A.et al.Language models are unsupervised multitask learners.OpenAI blog1, 9 (2019)
2017
-
[26]
Su, J.et al.Roformer: Enhanced transformer with rotary position embedding.Neurocomputing568, 127063 (2024)
2024
-
[27]
Raffel, C.et al.Exploring the limits of transfer learning with a unified text-to-text transformer.Journal of machine learning research21, 1–67 (2020)
2020
-
[28]
Krell, M. M., Kosec, M., Perez, S. P . & Fitzgibbon, A. Efficient sequence packing without cross- contamination: Accelerating large language models without impacting performance.arXiv preprint arXiv:2107.02027(2021)
-
[29]
Welleck, S.et al.Loss functions for multiset prediction.Advances in Neural Information Processing Systems31(2018)
2018
-
[30]
new onset
Ke, G.et al.Lightgbm: A highly efficient gradient boosting decision tree.Advances in neural information processing systems30(2017). 35.Breiman, L. Random forests.Machine learning45, 5–32 (2001). 23/33 Supplementary Information Disease code-set details for labeling For zero-shot evaluation, disease onset (label) was defined based on the first occurrence of...
2017
-
[31]
Output MUST be EXACTLY one JSON object, and NOTHING else
-
[32]
<PATIENT_ID>
The JSON MUST have exactly this structure: { "<PATIENT_ID>": { "reason": "<brief_1_to_2_sentence_summary>", "confidence": <float_0_to_1>, "<JSON_KEY>": "<Y_or_N>" } }
-
[33]
confidence
CRITICAL: The "confidence" value MUST represent the numerical probability P(Y) of the condition occurring (e.g., 0.85 for high risk, 0.12 for low risk)
-
[34]
Evaluate the patient’s records and determine if they will develop <CONDITION_NAME> within the <HORIZON_TEXT>
-
[35]
Heart Attack
Provide a concise reason for your prediction based on the evidence in the records. [/SYSTEM_PROMPT] 32/33 [PATIENT_HISTORY] <PATIENT_HISTORY_JSON> [/PATIENT_HISTORY] [QUERY] Predict the risk of <CONDITION_NAME> for this patient in the <HORIZON_TEXT> based on the records provided above. [/QUERY] [OUTPUT] Here, <CONDITION_NAME> was replaced with the full di...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.