Domain-Adapted Fine-Tuning of ECG Foundation Models for Multi-Label Structural Heart Disease Screening
Pith reviewed 2026-05-08 08:17 UTC · model grok-4.3
The pith
Adapting open ECG foundation models to echocardiography data through self-supervised pre-adaptation and selective fine-tuning provides the strongest performance for screening multiple structural heart diseases from ECG alone.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In-domain self-supervised adaptation of an ECG foundation model on the target waveforms, followed by selective supervised fine-tuning, produces the highest overall performance for multi-label detection of six echo-confirmed structural heart diseases, outperforming engineered features, scratch training, and other adaptation strategies including LoRA and mixtures.
What carries the argument
The central mechanism is self-supervised adaptation of the pretrained ECG foundation model on target-domain waveforms combined with selective supervised fine-tuning, which enables effective domain transfer while controlling computational cost.
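The "selective" part of this strategy means that after self-supervised adaptation, only a subset of the backbone's parameters receives supervised gradient updates. A minimal sketch of how such a subset might be chosen, assuming a hypothetical backbone with twelve encoder blocks plus a task head (the layer names are illustrative, not taken from the paper):

```python
# Sketch of selective supervised fine-tuning: train only the task head and the
# top-most encoder blocks of a pretrained backbone, keeping the rest frozen.
# Parameter naming scheme is hypothetical, not the paper's.

def select_trainable(param_names, n_top_blocks=2, n_blocks=12):
    """Return the subset of parameter names that will receive gradient updates."""
    trainable = set()
    for name in param_names:
        if name.startswith("head."):
            trainable.add(name)                       # task head: always trained
        elif name.startswith("encoder.block"):
            block_idx = int(name.split(".")[1].removeprefix("block"))
            if block_idx >= n_blocks - n_top_blocks:  # only the top blocks
                trainable.add(name)
    return trainable

params = [f"encoder.block{i}.weight" for i in range(12)] + ["head.weight", "head.bias"]
trainable = select_trainable(params)
# head.* plus encoder.block10/11 are updated; the other ten blocks stay frozen,
# which is where the adaptation-cost savings come from.
```

In a real training loop the frozen parameters would simply have gradients disabled; the selection logic above is the whole idea.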
If this is right
- Adapted models reach a peak macro-AUROC of 0.8509 and macro-AUPRC of 0.4297 for the six conditions.
- A parameter-efficient fine-tuning variant preserves nearly identical AUROC at 0.8501 while achieving the highest fixed-threshold macro-F1 of 0.3691.
- Late fusion of the model outputs with clinical covariates does not enhance threshold-independent discrimination metrics.
- Evaluated alternatives such as LoRA adaptation, different backbone choices, and mixture-of-foundations approaches fail to exceed the performance of the best single adapted backbone.
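The macro metrics quoted above are unweighted means of per-label scores over the six conditions. A minimal pure-Python sketch of how macro-AUROC and fixed-threshold macro-F1 are defined, using toy labels and scores rather than the paper's data:

```python
# Macro metrics = per-label metric averaged over labels with equal weight.
# Toy two-label example; the paper averages over six conditions.

def auroc(y_true, y_score):
    """AUROC via the rank (Mann-Whitney U) formulation: fraction of
    positive/negative pairs ranked correctly, ties counted as half."""
    pos = [s for s, y in zip(y_score, y_true) if y == 1]
    neg = [s for s, y in zip(y_score, y_true) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def f1_at_threshold(y_true, y_score, thr=0.5):
    """F1 at a fixed decision threshold (the 'fixed-threshold' in macro-F1)."""
    tp = sum(1 for y, s in zip(y_true, y_score) if y == 1 and s >= thr)
    fp = sum(1 for y, s in zip(y_true, y_score) if y == 0 and s >= thr)
    fn = sum(1 for y, s in zip(y_true, y_score) if y == 1 and s < thr)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

# One (labels, scores) pair per condition; macro = unweighted mean over labels.
per_label = [([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]),
             ([0, 1, 0, 1], [0.2, 0.7, 0.3, 0.6])]
macro_auroc = sum(auroc(y, s) for y, s in per_label) / len(per_label)
macro_f1 = sum(f1_at_threshold(y, s) for y, s in per_label) / len(per_label)
```

Because each label contributes equally, a rare condition such as aortic stenosis weighs as much in the macro average as a common one, which is why macro-AUPRC and macro-F1 can sit far below macro-AUROC.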
Where Pith is reading between the lines
- If the performance holds, ECG-based screening could serve as a low-cost first step to reduce the volume of echocardiograms needed in clinical workflows.
- The strategy of combining self-supervised domain adaptation with selective updates may apply to other medical signal tasks where labeled data is scarce.
- Testing the approach on data from additional hospitals would reveal whether site-specific adaptation remains necessary for consistent results.
Load-bearing premise
The performance differences observed on the benchmark dataset will generalize to new patient populations, clinical sites, and recording equipment.
What would settle it
A validation study on ECG recordings from a different hospital or scanner fleet where the adapted model's macro-AUROC falls below 0.80 would indicate the method does not transfer as claimed.
Original abstract
Transthoracic echocardiography is the reference standard for confirming structural heart disease (SHD), but first-line screening is limited by cost, workflow burden, and specialist availability. We evaluated whether open pretrained electrocardiogram (ECG) foundation models can support echo-confirmed multi-label SHD detection using the public EchoNext Mini-Model benchmark. Six echocardiography-derived abnormalities were targeted: reduced left ventricular ejection fraction, increased left ventricular wall thickness, aortic stenosis, mitral regurgitation, tricuspid regurgitation, and right ventricular systolic dysfunction. Under a common pipeline, we compared engineered ECG features with gradient boosting, end-to-end waveform learning from scratch, and transfer from open ECG foundation models. We then applied in-domain self-supervised adaptation of an ECG foundation model (ECG-FM) on EchoNext waveforms followed by selective supervised fine-tuning, and evaluated trade-offs between discrimination and adaptation cost. Adapted ECG-FM models achieved the best overall performance: peak macro-AUROC 0.8509 and macro-AUPRC 0.4297, while a parameter-efficient operating point preserved AUROC (0.8501) and attained the highest fixed-threshold macro-F1 0.3691. Late fusion with covariates did not improve threshold-independent discrimination, and evaluated LoRA, alternative backbones, and mixture-of-foundations strategies did not surpass the best adapted single-backbone models. These results indicate that for ECG-based case finding and echocardiography triage, combining target-domain self-supervised adaptation with selective supervised updating of a pretrained ECG backbone is the most effective transfer strategy.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript evaluates transfer strategies for multi-label structural heart disease (SHD) detection from ECGs on the public EchoNext Mini-Model benchmark. It compares engineered features with gradient boosting, end-to-end training from scratch, and transfer from open ECG foundation models (ECG-FMs). The central finding is that in-domain self-supervised adaptation of an ECG-FM followed by selective supervised fine-tuning yields the strongest results: peak macro-AUROC of 0.8509 and macro-AUPRC of 0.4297, with a parameter-efficient operating point preserving AUROC at 0.8501 while achieving the highest fixed-threshold macro-F1 of 0.3691. Late fusion with covariates and alternative strategies (LoRA, other backbones, mixtures) did not improve performance.
Significance. If the reported ordering generalizes, the work demonstrates a practical route to improve ECG-based SHD case-finding and echocardiography triage using existing open foundation models and modest adaptation compute. Strengths include the systematic comparison of transfer approaches under a common pipeline, explicit reporting of both threshold-independent (AUROC/AUPRC) and fixed-threshold (F1) metrics, and evaluation of parameter-efficient fine-tuning trade-offs. These elements make the empirical benchmark results a useful reference point for ECG foundation-model research.
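For context on the parameter-efficient variants compared here: a LoRA-style adapter keeps the pretrained weight frozen and trains only a low-rank correction, W_eff = W + (alpha/r) · B·A. A toy pure-Python sketch (dimensions and values illustrative, not the paper's configuration):

```python
# LoRA sketch: the effective weight is the frozen pretrained matrix plus a
# scaled rank-r product of two small trainable matrices. Only A and B train,
# so trainable parameters drop from d_out*d_in to r*(d_out + d_in).

def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def lora_effective_weight(W, A, B, alpha, r):
    scale = alpha / r
    delta = matmul(B, A)  # rank-r update, shape (d_out x d_in)
    return [[w + scale * d for w, d in zip(wr, dr)] for wr, dr in zip(W, delta)]

d_out, d_in, r, alpha = 4, 6, 2, 4
W = [[0.0] * d_in for _ in range(d_out)]  # frozen pretrained weight
A = [[1.0] * d_in for _ in range(r)]      # trainable, (r x d_in)
B = [[0.5] * r for _ in range(d_out)]     # trainable, (d_out x r)
W_eff = lora_effective_weight(W, A, B, alpha, r)

full_params = d_out * d_in       # 24 if W were fine-tuned directly
lora_params = r * (d_out + d_in) # 20 trainable parameters in this toy case
```

At realistic transformer dimensions (e.g. d in the hundreds, r of 4-16) the reduction is orders of magnitude, which is what makes LoRA a natural baseline against the selective fine-tuning the paper prefers.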
Major comments (2)
- [Results] Results section (performance tables and text reporting macro-AUROC 0.8509 / macro-F1 0.3691): All headline comparisons rest on a single public benchmark (EchoNext Mini-Model) with no external validation cohort, multi-center split, or scanner-vendor hold-out described. Because echo-derived labels are known to shift with acquisition protocol, operator, and population demographics, the observed superiority of domain-adapted ECG-FM models over baselines could reverse under distribution shift; this single-dataset limitation directly weakens the claim that the adaptation strategy is the most effective transfer approach for clinical use.
- [Methods] Methods (data splits and evaluation protocol): No details are provided on patient-level versus waveform-level splitting, handling of multiple ECGs per patient, or statistical testing (e.g., confidence intervals or paired tests) for the reported AUROC/AUPRC/F1 differences. Without these, it is impossible to determine whether the small numerical gaps (e.g., 0.8509 vs. 0.8501 AUROC) are robust or merely within noise.
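The patient-level splitting the referee asks about can be made concrete with a short sketch: partition patient IDs, not individual ECGs, so that no patient's recordings straddle the train/test boundary (the record format below is hypothetical):

```python
# Patient-level split: shuffle patient IDs and assign each patient's ECGs
# wholesale to one split, avoiding intra-patient leakage.
import random

def patient_level_split(records, test_frac=0.2, seed=0):
    """records: list of (patient_id, ecg_id). Returns (train, test) lists."""
    patients = sorted({pid for pid, _ in records})
    rng = random.Random(seed)
    rng.shuffle(patients)
    n_test = max(1, int(len(patients) * test_frac))
    test_ids = set(patients[:n_test])
    train = [r for r in records if r[0] not in test_ids]
    test = [r for r in records if r[0] in test_ids]
    return train, test

records = [("p1", "e1"), ("p1", "e2"), ("p2", "e3"), ("p3", "e4"), ("p4", "e5")]
train, test = patient_level_split(records)
overlap = {pid for pid, _ in train} & {pid for pid, _ in test}
assert not overlap  # no patient appears in both splits
```

A waveform-level split of the same records could place p1's two ECGs on opposite sides of the boundary, inflating test metrics; that is exactly the leakage the referee wants ruled out.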
Minor comments (2)
- [Abstract] Abstract and Methods: The phrase 'EchoNext Mini-Model benchmark' should include a brief description of dataset size, label prevalence, and train/validation/test splits on first use to allow readers to assess the scale of the evaluation.
- [Results] Results: The statement that 'late fusion with covariates did not improve threshold-independent discrimination' would be clearer if the exact covariates and fusion method were named in the main text rather than deferred to supplementary material.
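A minimal sketch of what "late fusion with covariates" typically denotes: combining the ECG model's per-label probability with a covariate-only model's probability, here by weighted averaging in logit space (the weights and inputs are placeholders, not the paper's method):

```python
# Late fusion sketch: fuse two models' probabilities for the same label
# by averaging their logits, then mapping back through the sigmoid.
import math

def logit(p):
    return math.log(p / (1 - p))

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def late_fuse(p_ecg, p_cov, w=0.5):
    """Weighted logit average; w=1 recovers the ECG-only model exactly."""
    return sigmoid(w * logit(p_ecg) + (1 - w) * logit(p_cov))

fused = late_fuse(0.8, 0.6)  # lies between the two input probabilities
```

This also clarifies the null result: AUROC depends only on how patients are ranked, so fusion can improve threshold-independent discrimination only if the covariates reorder patients relative to the ECG scores, not merely shift them.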
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our evaluation of domain-adapted ECG foundation models for multi-label SHD detection. The comments highlight important aspects of generalizability and methodological transparency. We address each major comment below and have revised the manuscript to incorporate clarifications and expanded discussion where appropriate.
Point-by-point responses
-
Referee: [Results] Results section (performance tables and text reporting macro-AUROC 0.8509 / macro-F1 0.3691): All headline comparisons rest on a single public benchmark (EchoNext Mini-Model) with no external validation cohort, multi-center split, or scanner-vendor hold-out described. Because echo-derived labels are known to shift with acquisition protocol, operator, and population demographics, the observed superiority of domain-adapted ECG-FM models over baselines could reverse under distribution shift; this single-dataset limitation directly weakens the claim that the adaptation strategy is the most effective transfer approach for clinical use.
Authors: We agree that reliance on a single public benchmark constitutes a genuine limitation, as echo label distributions can shift across sites, operators, and demographics, potentially altering the relative performance of adaptation strategies. In the revised manuscript we have added an explicit paragraph in the Discussion section acknowledging this constraint, clarifying that the reported ordering (including the 0.8509 macro-AUROC) is benchmark-specific rather than a universal clinical claim, and outlining the need for future multi-center validation. The systematic head-to-head comparison on the standardized EchoNext Mini-Model benchmark nonetheless remains a useful reference point for the community, even if external cohorts would be required to confirm broader applicability. revision: yes
-
Referee: [Methods] Methods (data splits and evaluation protocol): No details are provided on patient-level versus waveform-level splitting, handling of multiple ECGs per patient, or statistical testing (e.g., confidence intervals or paired tests) for the reported AUROC/AUPRC/F1 differences. Without these, it is impossible to determine whether the small numerical gaps (e.g., 0.8509 vs. 0.8501 AUROC) are robust or merely within noise.
Authors: We acknowledge that the original Methods section was insufficiently explicit on these points. In the revised version we have expanded the Data Splits and Evaluation Protocol subsections to state that all partitions were performed at the patient level (no patient overlap across train/validation/test), with a single representative ECG selected per patient to avoid intra-patient leakage. We now also report bootstrap-derived 95% confidence intervals for all headline metrics and include paired DeLong tests for AUROC comparisons between the leading models and baselines. These additions show that the primary performance gains over non-adapted baselines are statistically distinguishable, while the small gap between the full and parameter-efficient adapted variants is within the reported confidence intervals, consistent with the operating-point trade-off we emphasize. revision: yes
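The bootstrap confidence intervals the authors describe can be sketched in a few lines: resample the test set with replacement, recompute AUROC each time, and take the 2.5th/97.5th percentiles (toy data below; the paper's exact resampling protocol may differ):

```python
# Percentile-bootstrap 95% CI for AUROC on a held-out test set.
import random

def auroc(y_true, y_score):
    """AUROC via the rank (Mann-Whitney U) formulation."""
    pos = [s for s, y in zip(y_score, y_true) if y == 1]
    neg = [s for s, y in zip(y_score, y_true) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_ci(y_true, y_score, n_boot=1000, seed=0):
    rng = random.Random(seed)
    stats, n = [], len(y_true)
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        yb = [y_true[i] for i in idx]
        if len(set(yb)) < 2:  # resample must contain both classes
            continue
        stats.append(auroc(yb, [y_score[i] for i in idx]))
    stats.sort()
    return stats[int(0.025 * n_boot)], stats[int(0.975 * n_boot)]

y = [0, 0, 0, 1, 1, 0, 1, 0, 1, 1]
s = [0.1, 0.3, 0.2, 0.8, 0.6, 0.4, 0.7, 0.5, 0.9, 0.55]
lo, hi = bootstrap_ci(y, s)
```

If two models' bootstrap intervals for a metric difference overlap zero, as the authors report for the 0.8509 vs. 0.8501 gap, the difference cannot be distinguished from resampling noise.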
Circularity Check
No circularity: purely empirical benchmark results with no derivation chain
Full rationale
The paper reports standard ML training and evaluation outcomes (AUROC, AUPRC, F1) on the public EchoNext benchmark for six echo-derived SHD labels. No equations, closed-form derivations, or parameter-fitting steps are described that would reduce the headline performance numbers to quantities defined by the same paper's inputs. All comparisons (engineered features, from-scratch training, foundation-model transfer, LoRA, late fusion) follow conventional supervised learning pipelines whose metrics are independently computable from held-out test data. Self-citations, if present, are not load-bearing for any claimed result.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: The EchoNext Mini-Model benchmark and its echo-derived labels constitute a valid and representative testbed for multi-label SHD screening.
Forward citations
Cited by 1 Pith paper
-
Pretraining Strategies and Scaling for ECG Foundation Models: A Systematic Study
Contrastive predictive coding pretraining combined with structured state space models yields the strongest ECG foundation models, with continued gains from scaling data to 11 million samples.
Reference graph
Works this paper leans on
- [1] K. C. Siontis, P. A. Noseworthy, Z. I. Attia, and P. A. Friedman. "Artificial intelligence-enhanced electrocardiography in cardiovascular disease management". Nature Reviews Cardiology 18.7 (2021), pp. 465–478. doi: 10.1038/s41569-020-00503-2.
- [2] A. E. Ulloa-Cerna et al. "rECHOmmend: An ECG-Based Machine Learning Approach for Identifying Patients at High Risk of Undiagnosed Structural Heart Disease Detectable by Echocardiography". Circulation 146.1 (2022), pp. 36–47. doi: 10.1161/CIRCULATIONAHA.121.057869.
- [3] T. J. Poterucha, L. Jing, R. P. Ricart, et al. "Detecting structural heart disease from electrocardiograms using AI". Nature 644.8075 (2025), pp. 221–230. doi: 10.1038/s41586-025-09227-0.
- [4] A. Y. Hannun, P. Rajpurkar, M. Haghpanahi, G. H. Tison, C. Bourn, M. P. Turakhia, and A. Y. Ng. "Cardiologist-level arrhythmia detection and classification in ambulatory electrocardiograms using a deep neural network". Nature Medicine 25.1 (2019), pp. 65–69. doi: 10.1038/s41591-018-0268-3.
- [5] A. H. Ribeiro, M. H. Ribeiro, G. M. M. Paixão, et al. "Automatic diagnosis of the 12-lead ECG using a deep neural network". Nature Communications 11 (2020), p. 1760. doi: 10.1038/s41467-020-15432-4.
- [6] P. Wagner, N. Strodthoff, R.-D. Bousseljot, D. Kreiseler, F. I. Lunze, W. Samek, T. Schaeffter, et al. "PTB-XL, a large publicly available electrocardiography dataset". Scientific Data 7 (2020), p. 154. doi: 10.1038/s41597-020-0495-6.
- [7] N. Strodthoff, P. Wagner, T. Schaeffter, and W. Samek. "Deep Learning for ECG Analysis: Benchmarks and Insights from PTB-XL". IEEE Journal of Biomedical and Health Informatics 25.5 (2021), pp. 1519–1528. doi: 10.1109/JBHI.2020.3022989.
- [8] Z. I. Attia, S. Kapa, F. Lopez-Jimenez, P. M. McKie, et al. "Screening for cardiac contractile dysfunction using an artificial intelligence-enabled electrocardiogram". Nature Medicine 25.1 (2019), pp. 70–74. doi: 10.1038/s41591-018-0240-2.
- [9] P. Elias et al. "Deep learning electrocardiographic analysis for detection of left-sided valvular heart disease". Journal of the American College of Cardiology 80.6 (2022), pp. 613–626. doi: 10.1016/j.jacc.2022.05.029.
- [10] P. Elias and J. Finer. "EchoNext: A Dataset for Detecting Echocardiogram-Confirmed Structural Heart Disease from ECGs". PhysioNet (Sept. 2025). Version 1.1.0. doi: 10.13026/3ykd-bf14.
- [11] A. L. Goldberger, L. A. N. Amaral, L. Glass, J. M. Hausdorff, P. C. Ivanov, R. G. Mark, J. E. Mietus, G. B. Moody, C.-K. Peng, and H. E. Stanley. "PhysioBank, PhysioToolkit, and PhysioNet: Components of a New Research Resource for Complex Physiologic Signals". Circulation 101.23 (2000), e215–e220. doi: 10.1161/01.CIR.101.23.e215.
- [12] D. Kiyasseh, T. Zhu, and D. A. Clifton. "CLOCS: Contrastive Learning of Cardiac Signals Across Space, Time, and Patients". Proceedings of the 38th International Conference on Machine Learning, PMLR 139 (2021), pp. 5606–5615. url: https://proceedings.mlr.press/v139/kiyasseh21a.html
- [13] K. McKeen, S. Masood, A. Toma, B. Rubin, and B. Wang. "ECG-FM: an open electrocardiogram foundation model". JAMIA Open 8.5 (2025), ooaf122. doi: 10.1093/jamiaopen/ooaf122.
- [14] J. Li, A. D. Aguirre, V. Moura Junior, J. Jin, C. Liu, L. Zhong, C. Sun, G. Clifford, M. B. Westover, and S. Hong. "An Electrocardiogram Foundation Model Built on over 10 Million Recordings with External Evaluation across Multiple Domains". NEJM AI 2.7 (2025), AIoa2401033. doi: 10.1056/AIoa2401033.
- [15] E. Coppola, M. Savardi, M. Massussi, M. Adamo, M. Metra, and A. Signoroni. "HuBERT-ECG: A Self-Supervised Foundation Model for Broad and Scalable Cardiac Applications". medRxiv (2024). doi: 10.1101/2024.11.14.24317328.
- [16] D. Makowski, T. Pham, Z. J. Lau, J. C. Brammer, F. Lespinasse, H. Pham, C. Schölzel, and S. H. A. Chen. "NeuroKit2: A Python Toolbox for Neurophysiological Signal Processing". Behavior Research Methods 53.4 (2021), pp. 1689–1696. doi: 10.3758/s13428-020-01516-y.
- [17] T. Akiba, S. Sano, T. Yanase, T. Ohta, and M. Koyama. "Optuna: A Next-generation Hyperparameter Optimization Framework". Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2019), pp. 2623–2631. doi: 10.1145/3292500.3330701.
- [18] T. Chen and C. Guestrin. "XGBoost: A Scalable Tree Boosting System". Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2016), pp. 785–794. doi: 10.1145/2939672.2939785.
- [19] K. He, X. Zhang, S. Ren, and J. Sun. "Deep Residual Learning for Image Recognition". Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016), pp. 770–778. doi: 10.1109/CVPR.2016.90.
- [20] A. Baevski, H. Zhou, A. Mohamed, and M. Auli. "wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations". Advances in Neural Information Processing Systems 33 (2020), pp. 12449–12460. doi: 10.48550/arXiv.2006.11477.
- [21] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen. "LoRA: Low-Rank Adaptation of Large Language Models" (2021). arXiv: 2106.09685. doi: 10.48550/arXiv.2106.09685.