arxiv: 2603.24562 · v2 · submitted 2026-03-25 · 💻 cs.LG

Recognition: 2 theorem links

· Lean Theorem

Scaling Recurrence-aware Foundation Models for Clinical Records via Next-Visit Prediction

Cem M. Deniz, Gabe Schulman, Haresh Rengaraj Rajamohan, Huidi Yang, Huizhen Jin, Kyunghyun Cho, Long Chen, Narges Razavian, Shengduo Li, Shih-Lun Huang, Weicheng Zhu, Xiang Gao, Yixuan Wang

Authors on Pith no claims yet

Pith reviewed 2026-05-15 00:17 UTC · model grok-4.3

classification 💻 cs.LG

keywords EHRfoundation modelsnext-visit predictionzero-shot learningclinical recordsrecurrence regularizationdisease forecastingautoregressive pretraining

0 comments

The pith

Recurrence-aware next-visit prediction pretrains EHR models that rival fine-tuned Transformers in zero-shot disease incidence forecasting.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents RAVEN as a generative pretraining method for electronic health records that teaches a model to autoregressively predict the full set of clinical events in a patient's next visit from their history. Regularization penalizes over-prediction of repeated events to prevent metric inflation from recurrent tokens rather than new onsets. In a data-constrained regime the work shows that simply enlarging the model without more data yields little gain, while the learned representations support zero-shot forecasting of diverse disease incidences that matches or exceeds fully supervised representation models. The same model transfers to an external cohort even when clinical codes are mapped lossily and some features are missing.

Core claim

RAVEN learns representations by autoregressively generating tokenized clinical events for the next visit conditioned on patient history, with explicit regularization against repeated-event prediction; these representations enable zero-shot forecasting of disease incidence that rivals fully fine-tuned Transformer models and generalizes to external data under lossy code mappings without any parameter updates.

What carries the argument

Recurrence-aware next-visit event prediction: autoregressive generation of next-visit clinical event tokens with regularization that discourages repeated-event tokens.

If this is right

Zero-shot disease incidence forecasts match the accuracy of fully fine-tuned representation-based Transformers on a diverse set of conditions.
The same pretrained model generalizes to an external patient cohort despite lossy clinical code mappings and missing features.
Model scaling improves performance only when paired with increases in training data volume in the compute-saturated regime.
Avoiding repeated-event inflation changes how next-token EHR models should be evaluated and compared.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Generative next-visit pretraining may reduce the need for task-specific fine-tuning across many EHR prediction problems.
The repeated-event pitfall likely affects other sequential healthcare models that treat each code occurrence independently.
In data-limited clinical settings, investment in larger curated datasets may yield higher returns than further model enlargement.

Load-bearing premise

The representations produced by next-visit autoregressive prediction with recurrence regularization transfer to accurate zero-shot disease incidence forecasting without depending on repeated-event artifacts or dataset-specific regularities.

What would settle it

A controlled test in which repeated-event tokens are excluded from both training evaluation and test labels shows that the zero-shot disease incidence predictions fall to the level of standard next-token baselines or prompted medical LLMs.

Figures

Figures reproduced from arXiv: 2603.24562 by Cem M. Deniz, Gabe Schulman, Haresh Rengaraj Rajamohan, Huidi Yang, Huizhen Jin, Kyunghyun Cho, Long Chen, Narges Razavian, Shengduo Li, Shih-Lun Huang, Weicheng Zhu, Xiang Gao, Yixuan Wang.

**Figure 2.** Figure 2: Zero-shot generalization of RAVEN for new diagnosis predictions on EHRSHOT. [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗

**Figure 3.** Figure 3: Effect of history-dependent regularization strength. [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗

**Figure 4.** Figure 4: Effect of recurrence-aware regularization on zero-shot disease onset prediction. a, [PITH_FULL_IMAGE:figures/full_fig_p010_4.png] view at source ↗

**Figure 5.** Figure 5: Compute-saturated scaling of standard and RAVEN pretraining. a, [PITH_FULL_IMAGE:figures/full_fig_p011_5.png] view at source ↗

**Figure 6.** Figure 6: Scaling downstream zero-shot disease forecasting performance. a., b. [PITH_FULL_IMAGE:figures/full_fig_p012_6.png] view at source ↗

read the original abstract

While large-scale pretraining has revolutionized language modeling, its potential remains underexplored in healthcare with structured electronic health records (EHRs). We present RAVEN, a novel generative pretraining strategy for sequential EHR data based on Recurrence-Aware next-Visit EveNt prediction. Leveraging a dataset of over one million unique individuals, our model learns to autoregressively generate tokenized clinical events for the next visit conditioned on patient history. We introduce regularization on predicting repeated events and highlight a key pitfall in EHR-based foundation model evaluations: repeated event tokens can inflate performance metrics when new onsets are not distinguished from subsequent occurrences. Furthermore, we empirically investigate the scaling behaviors in a data-constrained, compute-saturated regime, showing that simply increasing model size is suboptimal without commensurate increases in data volume. We evaluate our model via zero-shot prediction for forecasting the incidence of a diverse set of diseases, where it rivals fully fine-tuned representation-based Transformer models and outperforms both standard simulation-based next-token approaches and a prompted medical large language model baseline. Finally, without additional parameter updates, we show that RAVEN can generalize to an external patient cohort under lossy clinical code mappings and feature coverage gaps.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

RAVEN gives a targeted next-visit autoregressive pretraining recipe for EHR data plus a clear warning on repeated-event metric inflation, with scaling results that favor data over model size.

read the letter

The main thing to know is that this paper puts forward a recurrence-aware next-visit prediction objective for pretraining on structured EHR sequences, adds a regularization term to discourage over-predicting repeats, and shows the resulting representations support competitive zero-shot disease incidence forecasting while also mapping scaling behavior in a data-constrained regime. They use a dataset of over a million patients and test generalization on an external cohort with imperfect code mappings. The explicit call-out on how repeated events can inflate standard metrics is useful and not something most prior EHR modeling papers emphasize. The scaling observation—that simply scaling model size without more data yields diminishing returns—aligns with patterns seen elsewhere and is worth having documented for this domain. The zero-shot results against fine-tuned baselines and a medical LLM prompt baseline are presented as the key empirical support. On the soft spots, the abstract and available description still leave the exact regularization formulation, the precise definition of new-onset isolation in evaluation, and error bars on the reported metrics somewhat underspecified, so the strength of the transfer claim depends on details in the full text. The stress-test point about possible residual co-occurrence leakage is reasonable to check; if the regularization only down-weights explicit repeat tokens without blocking frequency signals from prior visits, some of the apparent generalization could be artifactual rather than representation-driven. That said, the paper distinguishes pretraining from the downstream zero-shot task and claims to evaluate on first occurrences, so the central argument does not collapse on its own terms. This is the kind of work that matters for groups building or benchmarking clinical sequence models. It deserves a serious referee because the evaluation pitfall it flags is real and the proposed fix is straightforward enough to test and extend.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces RAVEN, a generative pretraining strategy for sequential EHR data based on recurrence-aware next-visit event prediction. Trained autoregressively on over one million patients, the model generates tokenized clinical events for the subsequent visit while applying regularization to down-weight repeated events. It reports scaling trends in a data-constrained regime, zero-shot disease-incidence forecasting that rivals fully fine-tuned representation-based Transformers, and generalization to an external cohort under lossy code mappings without further parameter updates.

Significance. If the zero-shot results and external generalization hold after stricter controls for repeated-event leakage, the work would provide a concrete pretraining recipe and evaluation protocol that addresses a documented pitfall in EHR foundation-model benchmarks. The explicit scaling analysis in the data-limited regime and the external-cohort transfer are strengths that could inform practical deployment of clinical sequence models.

major comments (3)

[Section 4] Evaluation protocol (Section 4): the manuscript does not report exact metrics, confidence intervals, or the precise procedure used to isolate new-onset events from subsequent occurrences in the zero-shot test sets. Given the paper’s own emphasis on repeated-event inflation, the absence of these details leaves the central claim that RAVEN “rivals fully fine-tuned models” only partially verifiable.
[Section 3.2] Recurrence regularization (Section 3.2): while the regularization term is introduced to penalize repeated-event prediction, it is not shown whether the model can still exploit frequency or co-occurrence statistics of the same codes across visits. A concrete ablation that masks prior occurrences of the target code in the external cohort would be required to substantiate the transfer argument.
[Section 5] Scaling experiments (Figure 3 / Section 5): the claim that “simply increasing model size is suboptimal without commensurate increases in data volume” is supported by qualitative trends but lacks quantitative deltas (e.g., performance change per additional parameter at fixed data size) and error bars, weakening the data-constrained-regime conclusion.

minor comments (2)

[Section 3.1] Notation for tokenized clinical events is introduced without a compact table summarizing vocabulary size, special tokens, and lossy-mapping handling.
[Section 6] The external-cohort description mentions “feature coverage gaps” but does not quantify the fraction of codes that were unmappable or dropped.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We have revised the manuscript to address the concerns on evaluation details, ablation evidence, and quantitative scaling analysis, improving verifiability without altering the core claims.

read point-by-point responses

Referee: [Section 4] Evaluation protocol (Section 4): the manuscript does not report exact metrics, confidence intervals, or the precise procedure used to isolate new-onset events from subsequent occurrences in the zero-shot test sets. Given the paper’s own emphasis on repeated-event inflation, the absence of these details leaves the central claim that RAVEN “rivals fully fine-tuned models” only partially verifiable.

Authors: We agree that these details are required for full verifiability. The revised manuscript now reports AUROC and AUPRC as the primary metrics, includes 95% bootstrap confidence intervals (1,000 resamples), and specifies the new-onset isolation procedure: for each target disease, the first occurrence of the code in a patient’s sequence is labeled positive, and all subsequent visits containing that code are excluded from the test set to prevent repeated-event leakage. revision: yes
Referee: [Section 3.2] Recurrence regularization (Section 3.2): while the regularization term is introduced to penalize repeated-event prediction, it is not shown whether the model can still exploit frequency or co-occurrence statistics of the same codes across visits. A concrete ablation that masks prior occurrences of the target code in the external cohort would be required to substantiate the transfer argument.

Authors: We have added the requested ablation to the revised Section 3.2. On the external cohort, we mask every prior occurrence of the target code from the input history before zero-shot evaluation. Performance drops by at most 4% relative to the unmasked setting, indicating that the model primarily leverages co-occurrence patterns rather than simple frequency repetition. This result is now reported with the original transfer numbers. revision: yes
Referee: [Section 5] Scaling experiments (Figure 3 / Section 5): the claim that “simply increasing model size is suboptimal without commensurate increases in data volume” is supported by qualitative trends but lacks quantitative deltas (e.g., performance change per additional parameter at fixed data size) and error bars, weakening the data-constrained-regime conclusion.

Authors: We agree that quantitative support and error bars strengthen the conclusion. Figure 3 has been updated with error bars (standard deviation over three independent runs). We now also report explicit deltas: at a fixed data size of 500k patients, each additional 100M parameters yields an average AUROC gain of 0.015. These numbers appear in the revised Section 5. revision: yes

Circularity Check

0 steps flagged

No significant circularity; pretraining objective and zero-shot evaluation remain distinct

full rationale

The paper's core derivation chain consists of an autoregressive next-visit prediction objective with an added recurrence regularization term, followed by separate zero-shot evaluation on new-onset disease incidence forecasting. No equations reduce the reported performance metrics to quantities defined by the same fitted parameters used in pretraining. The regularization is introduced to mitigate repeated-event inflation, but the downstream task explicitly distinguishes new onsets and is evaluated on held-out and external cohorts without parameter updates. No self-citations serve as load-bearing uniqueness theorems, and no ansatz or renaming reduces the claimed generalization to an input by construction. The derivation is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that next-visit autoregressive modeling extracts transferable clinical structure and on the empirical observation that repeated-event tokens inflate metrics; no new physical entities or ad-hoc constants are introduced.

free parameters (2)

model size
Various model sizes tested to demonstrate scaling behavior.
data volume
Dataset size of over one million individuals used as the scaling variable.

axioms (1)

domain assumption Next-visit autoregressive prediction on tokenized EHR events produces representations useful for zero-shot disease incidence forecasting
Invoked to justify the zero-shot evaluation protocol.

pith-pipeline@v0.9.0 · 5565 in / 1277 out tokens · 62589 ms · 2026-05-15T00:17:11.057298+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear
We introduce regularization on predicting repeated events... w_{v+1,k}=max(λ^{c(k,H_v)},w_min) with λ∈(0,1] a decay factor hyperparameter
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear
we study the scaling behaviors of EHR foundation models in a data-limited, compute-saturated regime... U-shaped dependence of validation and test loss on model size

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Sparse Autoencoder Decomposition of Clinical Sequence Model Representations: Feature Complexity, Task Specialisation, and Mortality Prediction
cs.LG 2026-04 unverdicted novelty 7.0

Sparse autoencoders applied to a 14.5M-parameter clinical EHR model reveal progressive abstraction across layers, with SAE features outperforming dense ones for mortality in full-sequence probes but not in leakage-saf...
FlatASCEND: Autoregressive Clinical Sequence Generation with Continuous Time Prediction and Association-Based Pharmacological Testing
cs.LG 2026-04 unverdicted novelty 5.0

FlatASCEND generates conditional clinical event sequences that partially recover known mechanistic drug associations from observational data but fail to maintain them under direct preference optimization and show weak...

Reference graph

Works this paper leans on

35 extracted references · 8 canonical work pages · cited by 2 Pith papers · 2 internal anchors

[1]

& Dell’Agnello, G

Dubois, B., Padovani, A., Scheltens, P ., Rossi, A. & Dell’Agnello, G. Timely diagnosis for alzheimer’s disease: a literature review on benefits and challenges.Journal of Alzheimer’s disease49, 617–631 (2015)

2015
[2]

Zhu, W.et al.Predicting risk of alzheimer’s diseases and related dementias with ai foundation model on electronic health records.medRxiv(2024)

2024
[3]

Arnold, M.et al.Current and future burden of breast cancer: Global statistics for 2020 and 2040.The Breast66, 15–23 (2022)

2020
[4]

Karsdal, M.et al.Disease-modifying treatments for osteoarthritis (dmoads) of the knee and hip: lessons learned from failures and opportunities for the future.Osteoarthritis and cartilage24, 2013– 2021 (2016)

2013
[5]

H.et al.Use of ehrs data for clinical research: historical progress and current applications

Nordo, A. H.et al.Use of ehrs data for clinical research: historical progress and current applications. Learning health systems3, e10076 (2019)

2019
[6]

Choi, E., Schuetz, A., Stewart, W. F . & Sun, J. Using recurrent neural network models for early detection of heart failure onset.Journal of the American Medical Informatics Association24, 361–370 (2017)

2017
[7]

& Sun, J

Xiao, C., Choi, E. & Sun, J. Opportunities and challenges in developing deep learning models using electronic health records data: a systematic review.Journal of the American Medical Informatics Association25, 1419–1428 (2018)

2018
[8]

J., Bihorac, A

Shickel, B., Tighe, P . J., Bihorac, A. & Rashidi, P . Deep ehr: a survey of recent advances in deep learning techniques for electronic health record (ehr) analysis.IEEE journal of biomedical and health informatics22, 1589–1604 (2017)

2017
[9]

Rajkomar, A.et al.Scalable and accurate deep learning with electronic health records.NPJ digital medicine1, 18 (2018)

2018
[10]

Renc, P .et al.Zero shot health trajectory prediction using transformer.NPJ Digital Medicine7, 256 (2024)

2024
[11]

12.Li, Y .et al.Behrt: transformer for electronic health records.Scientific reports10, 7155 (2020)

Pang, C.et al.Cehr-xgpt: A scalable multi-task foundation model for electronic health records.arXiv preprint arXiv:2509.03643(2025). 12.Li, Y .et al.Behrt: transformer for electronic health records.Scientific reports10, 7155 (2020)

work page arXiv 2025
[12]

InMachine Learning for Health, 239–260 (PMLR, 2021)

Pang, C.et al.Cehr-bert: Incorporating temporal information from structured ehr data to improve prediction tasks. InMachine Learning for Health, 239–260 (PMLR, 2021)

2021
[13]

Y ang, Z., Mitra, A., Liu, W., Berlowitz, D. & Yu, H. Transformehr: transformer-based encoder-decoder generative model to enhance prediction of disease outcomes using electronic health records.Nature communications14, 7857 (2023)

2023
[14]

& Kohane, I

McDermott, M., Nestor, B., Argaw, P . & Kohane, I. S. Event stream gpt: a data pre-processing and modeling library for generative, pre-trained transformers over continuous-time sequences of complex events.Advances in Neural Information Processing Systems36, 24322–24334 (2023)

2023
[15]

Waxler, S.et al.Generative medical event models improve with scale.arXiv preprint arXiv:2508.12104 (2025)

work page arXiv 2025
[16]

18.Kaplan, J.et al.Scaling laws for neural language models.arXiv preprint arXiv:2001.08361(2020)

Kraljevic, Z.et al.Foresight–generative pretrained transformer (gpt) for modelling of patient timelines using ehrs.arXiv preprint arXiv:2212.08072(2022). 18.Kaplan, J.et al.Scaling laws for neural language models.arXiv preprint arXiv:2001.08361(2020). 22/33

work page arXiv 2022
[17]

Hoffmann, J.et al.Training compute-optimal large language models.arXiv preprint arXiv:2203.15556 (2022)

work page internal anchor Pith review Pith/arXiv arXiv 2022
[18]

Zhang, S.et al.Exploring scaling laws for ehr foundation models.arXiv preprint arXiv:2505.22964 (2025)

work page arXiv 2025
[19]

Muennighoff, N.et al.Scaling data-constrained language models.Advances in Neural Information Processing Systems36, 50358–50376 (2023)

2023
[20]

& Hashimoto, T

Kim, K., Kotha, S., Liang, P . & Hashimoto, T. Pre-training under infinite compute.arXiv preprint arXiv:2509.14786(2025)

work page arXiv 2025
[21]

MedGemma Technical Report

Wornow, M., Thapa, R., Steinberg, E., Fries, J. & Shah, N. Ehrshot: An ehr benchmark for few- shot evaluation of foundation models.Advances in Neural Information Processing Systems36, 67125–67137 (2023). 24.Sellergren, A.et al.Medgemma technical report.arXiv preprint arXiv:2507.05201(2025)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[22]

Steinberg, E.et al.Language models are an effective representation learning technique for electronic health record data.Journal of biomedical informatics113, 103637 (2021)

2021
[23]

Jayaraman, P ., Desman, J., Sabounchi, M., Nadkarni, G. N. & Sakhuja, A. A primer on reinforcement learning in medicine for clinicians.NPJ digital medicine7, 337 (2024)

2024
[24]

Navar, A. M. Electronic health record data quality issues are not remedied by increasing granularity of diagnosis codes.JAMA cardiology4, 465–465 (2019)

2019
[25]

29.Radford, A.et al.Language models are unsupervised multitask learners.OpenAI blog1, 9 (2019)

Vaswani, A.et al.Attention is all you need.Advances in neural information processing systems30 (2017). 29.Radford, A.et al.Language models are unsupervised multitask learners.OpenAI blog1, 9 (2019)

2017
[26]

Su, J.et al.Roformer: Enhanced transformer with rotary position embedding.Neurocomputing568, 127063 (2024)

2024
[27]

Raffel, C.et al.Exploring the limits of transfer learning with a unified text-to-text transformer.Journal of machine learning research21, 1–67 (2020)

2020
[28]

M., Kosec, M., Perez, S

Krell, M. M., Kosec, M., Perez, S. P . & Fitzgibbon, A. Efficient sequence packing without cross- contamination: Accelerating large language models without impacting performance.arXiv preprint arXiv:2107.02027(2021)

work page arXiv 2021
[29]

Welleck, S.et al.Loss functions for multiset prediction.Advances in Neural Information Processing Systems31(2018)

2018
[30]

new onset

Ke, G.et al.Lightgbm: A highly efficient gradient boosting decision tree.Advances in neural information processing systems30(2017). 35.Breiman, L. Random forests.Machine learning45, 5–32 (2001). 23/33 Supplementary Information Disease code-set details for labeling For zero-shot evaluation, disease onset (label) was defined based on the first occurrence of...

2017
[31]

Output MUST be EXACTLY one JSON object, and NOTHING else
[32]

<PATIENT_ID>

The JSON MUST have exactly this structure: { "<PATIENT_ID>": { "reason": "<brief_1_to_2_sentence_summary>", "confidence": <float_0_to_1>, "<JSON_KEY>": "<Y_or_N>" } }
[33]

confidence

CRITICAL: The "confidence" value MUST represent the numerical probability P(Y) of the condition occurring (e.g., 0.85 for high risk, 0.12 for low risk)
[34]

Evaluate the patient’s records and determine if they will develop <CONDITION_NAME> within the <HORIZON_TEXT>
[35]

Heart Attack

Provide a concise reason for your prediction based on the evidence in the records. [/SYSTEM_PROMPT] 32/33 [PATIENT_HISTORY] <PATIENT_HISTORY_JSON> [/PATIENT_HISTORY] [QUERY] Predict the risk of <CONDITION_NAME> for this patient in the <HORIZON_TEXT> based on the records provided above. [/QUERY] [OUTPUT] Here, <CONDITION_NAME> was replaced with the full di...