Large Language Models are Powerful Electronic Health Record Encoders
Pith reviewed 2026-05-23 01:26 UTC · model grok-4.3
The pith
General-purpose LLMs match specialized EHR models by converting medical codes to plain text for embeddings.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Replacing medical codes in EHR data with natural-language descriptions turns each record into plain text, which lets general-purpose Large Language Models produce high-dimensional embeddings for downstream prediction tasks without access to private medical training data. These LLM-based embeddings perform on par with a specialized EHR foundation model, CLMBR-T-Base, across 15 clinical tasks from the EHRSHOT benchmark. In an external validation on the UK Biobank, an LLM-based model shows statistically significant improvements on some tasks, which the authors attribute to higher vocabulary coverage and slightly better generalization.
What carries the argument
Conversion of EHR medical codes to natural-language text descriptions, enabling general LLMs to extract embeddings without domain-specific pretraining or private data.
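A minimal sketch of the code-to-text step, assuming each record is a list of dated codes and that a lookup table of descriptions is available. The `CODE_DESCRIPTIONS` table, the `serialize_record` helper, the `LOCAL/12345` code, and the output layout are all hypothetical and are not taken from the paper; the resulting text would then be passed to an LLM embedding model, which is not shown here.

```python
from datetime import date

# Hypothetical lookup table. A real pipeline would draw on complete
# terminologies; only two ICD-10 entries are listed for illustration.
CODE_DESCRIPTIONS = {
    "ICD10/E11.9": "Type 2 diabetes mellitus without complications",
    "ICD10/I10": "Essential (primary) hypertension",
}


def serialize_record(events, descriptions):
    """Render (date, code) events as plain text, oldest event first.

    Codes without a known description are kept verbatim, so no event
    is silently dropped from the text.
    """
    lines = []
    for when, code in sorted(events):
        text = descriptions.get(code, code)
        lines.append(f"{when.isoformat()}: {text}")
    return "\n".join(lines)


# Illustrative events only; "LOCAL/12345" stands in for a site-specific code.
events = [
    (date(2021, 3, 2), "ICD10/I10"),
    (date(2020, 7, 15), "ICD10/E11.9"),
    (date(2021, 5, 9), "LOCAL/12345"),
]
record_text = serialize_record(events, CODE_DESCRIPTIONS)
print(record_text)
```

Keeping unmapped codes verbatim is one possible design choice; a pipeline could equally drop them or flag them for review.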
If this is right
- LLM embeddings achieve performance parity with specialized EHR models on 15 clinical prediction tasks.
- External validation reveals advantages for some tasks due to wider vocabulary coverage.
- The approach removes the requirement for private medical data in developing predictive models.
- A trade-off exists between computational efficiency of specialized models and the data independence of LLM embeddings.
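The vocabulary-coverage point in the list above can be made concrete with a small sketch. It assumes coverage means the share of code occurrences in a dataset that a model's vocabulary recognizes; the paper's own definition may differ, and the `vocabulary_coverage` helper and the sample codes are hypothetical.

```python
from collections import Counter


def vocabulary_coverage(observed_codes, known_codes):
    """Fraction of code occurrences that appear in a model's vocabulary.

    Assumed definition for illustration; not taken from the paper.
    """
    counts = Counter(observed_codes)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    covered = sum(n for code, n in counts.items() if code in known_codes)
    return covered / total


# Illustrative codes only: "C" is absent from the vocabulary.
observed = ["A", "A", "B", "C"]
vocabulary = {"A", "B"}
coverage = vocabulary_coverage(observed, vocabulary)
print(coverage)
```

Under this definition, a text-based encoder that can read any code with a description would tend toward higher coverage than a model tied to one site's code vocabulary.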
Where Pith is reading between the lines
- Hospitals lacking large private datasets could adopt this method to build prediction tools more readily.
- The technique may extend to other code-heavy domains like insurance claims or lab results for similar encoding benefits.
- Combining these embeddings with additional data types such as notes or images could further boost task performance.
Load-bearing premise
Converting EHR data into plain text by replacing medical codes with natural-language descriptions preserves sufficient clinical signal for general-purpose LLMs to produce embeddings that support accurate downstream prediction without any domain-specific pretraining or private medical data.
What would settle it
A new set of clinical tasks where LLM embeddings after text conversion underperform CLMBR-T-Base by a large margin on average would falsify the on-par performance claim.
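One way to operationalize this check is to average the per-task score difference and compare it with a margin. The sketch below assumes one score per task for each model; the `mean_gap` helper, the `MARGIN` threshold, and all numbers are hypothetical placeholders, not results from the paper.

```python
from statistics import mean


def mean_gap(llm_scores, baseline_scores):
    """Average per-task difference (LLM minus baseline) over shared tasks."""
    shared = sorted(set(llm_scores) & set(baseline_scores))
    if not shared:
        raise ValueError("no tasks in common")
    return mean(llm_scores[t] - baseline_scores[t] for t in shared)


# Placeholder scores for illustration only; NOT results from the paper.
llm = {"task_a": 0.80, "task_b": 0.70}
baseline = {"task_a": 0.90, "task_b": 0.80}

MARGIN = 0.05  # hypothetical cutoff for "a large margin"
gap = mean_gap(llm, baseline)
underperforms = gap < -MARGIN
print(gap, underperforms)
```

A fuller analysis would also report uncertainty, for example with a paired bootstrap over patients, rather than rely on a single point estimate.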
Original abstract
Electronic Health Records (EHRs) offer considerable potential for clinical prediction, but their complexity and heterogeneity challenge traditional machine learning. Domain-specific EHR foundation models trained on unlabeled EHR data have shown improved predictive accuracy and generalization. However, their development is constrained by limited data access and site-specific vocabularies. We convert EHR data into plain text by replacing medical codes with natural-language descriptions, enabling general-purpose Large Language Models (LLMs) to produce high-dimensional embeddings for downstream prediction tasks without access to private medical training data. LLM-based embeddings perform on par with a specialized EHR foundation model, CLMBR-T-Base, across 15 clinical tasks from the EHRSHOT benchmark. In an external validation using the UK Biobank, an LLM-based model shows statistically significant improvements for some tasks, which we attribute to higher vocabulary coverage and slightly better generalization. Overall, we reveal a trade-off between the computational efficiency of specialized EHR models and the portability and data independence of LLM-based embeddings.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that converting EHR data into plain text by replacing medical codes with natural-language descriptions enables general-purpose LLMs to produce embeddings that achieve performance parity with the specialized EHR foundation model CLMBR-T-Base across 15 clinical tasks from the EHRSHOT benchmark. It further claims statistically significant improvements for some tasks in external validation on the UK Biobank, attributed to higher vocabulary coverage and better generalization, while noting a trade-off between computational efficiency of domain-specific models and the portability/data independence of LLM-based embeddings.
Significance. If the empirical results hold after full verification, the work would be significant for demonstrating that general-purpose LLMs can serve as effective EHR encoders without domain-specific pretraining or access to private medical data, potentially increasing accessibility and reducing barriers posed by limited data access and site-specific vocabularies. The direct comparison to an established baseline (CLMBR-T-Base) and the external validation add potential value, though the absence of the full manuscript prevents assessment of robustness or reproducibility.
major comments (2)
- [Abstract] The central claims of performance parity with CLMBR-T-Base on 15 EHRSHOT tasks and statistically significant improvements on UK Biobank are stated without any methods details, error bars, statistical tests, data-handling descriptions, or quantitative results, making it impossible to verify whether the data support the claims.
- [Abstract] The key assumption that replacing medical codes with natural-language descriptions preserves sufficient clinical signal for accurate downstream prediction is presented without any supporting evidence, ablation, or discussion, and this assumption is load-bearing for the portability claim.
Simulated Author's Rebuttal
We thank the referee for their comments. We address each major comment below. The full manuscript (arXiv:2502.17403) contains the detailed methods, results, and statistical analyses referenced in the abstract.
Point-by-point responses
- Referee: [Abstract] The central claims of performance parity with CLMBR-T-Base on 15 EHRSHOT tasks and statistically significant improvements on UK Biobank are stated without any methods details, error bars, statistical tests, data-handling descriptions, or quantitative results, making it impossible to verify whether the data support the claims.
  Authors: Abstracts are intentionally concise and omit detailed methods, error bars, statistical tests, and quantitative results due to length constraints. The full manuscript provides all of these: the specific statistical tests used to establish significance on UK Biobank tasks, error bars from repeated evaluations, data-handling procedures for the 15 EHRSHOT tasks, and the exact performance numbers supporting parity with CLMBR-T-Base. The abstract serves only as a high-level summary of those findings.
  Revision: no
- Referee: [Abstract] The key assumption that replacing medical codes with natural-language descriptions preserves sufficient clinical signal for accurate downstream prediction is presented without any supporting evidence, ablation, or discussion, and this assumption is load-bearing for the portability claim.
  Authors: The reported performance parity with CLMBR-T-Base across all 15 EHRSHOT tasks constitutes direct empirical evidence that the conversion preserves clinical signal. The statistically significant gains on UK Biobank tasks, attributed to superior vocabulary coverage, provide additional support for the portability claim. The full manuscript contains the relevant ablations and discussion; we can expand the introduction to explicitly link these results to the assumption if the referee recommends it.
  Revision: partial
Circularity Check
No significant circularity
The provided abstract contains no equations, parameters, or derivation chain of any kind. The central claims are framed as direct empirical comparisons (LLM embeddings vs. CLMBR-T-Base on EHRSHOT tasks, plus UK Biobank external validation) with no fitted inputs renamed as predictions, no self-citations invoked as uniqueness theorems, and no ansatz or renaming steps. The argument is therefore self-contained against external benchmarks and receives the default non-finding.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 3 Pith papers
- Hidden in the Multiplicative Interaction: Uncovering Fragility in Multimodal Contrastive Learning
  Multimodal contrastive learning using multilinear products is fragile to single bad modalities, and a gated version improves top-1 retrieval accuracy on synthetic and real trimodal data.
- Fusion or Confusion? Multimodal Complexity Is Not All You Need
  Complex multimodal architectures do not reliably outperform unimodal baselines or a simple multimodal baseline under standardized evaluation.
- Representation Before Training: A Fixed-Budget Benchmark for Generative Medical Event Models
  Fused code-value tokenization improves mortality AUROC from 0.891 to 0.915 and other clinical outcome predictions, while certain temporal encodings like event order match or exceed time tokens with shorter sequences.