pith. machine review for the scientific record.

arxiv: 2604.23385 · v1 · submitted 2026-04-25 · 💻 cs.LG

Recognition: unknown

Domain-Adapted Fine-Tuning of ECG Foundation Models for Multi-Label Structural Heart Disease Screening

Dang Nguyen, Duc N. Do, Ethan Philip Lowder, Hoang Le, Hung N. Huynh, Jacques Kpodonu, Khanh T.Q. Le, Khoa D. Pham, Minh H.N. Le, Minh N. Do, Perisa Ashar, Phat K. Huynh, Phat V.H. Nguyen, Phi Pham-Van-Hoang, Quan K. Huynh, Quan Le, Ramez M. Odat

Pith reviewed 2026-05-08 08:17 UTC · model grok-4.3

classification 💻 cs.LG
keywords ECG · foundation models · structural heart disease · domain adaptation · multi-label classification · screening · echocardiography · transfer learning

The pith

Adapting open ECG foundation models to echocardiography data through self-supervised pre-adaptation and selective fine-tuning provides the strongest performance for screening multiple structural heart diseases from ECG alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors investigate whether pretrained electrocardiogram foundation models can be adapted to detect structural heart diseases that are normally confirmed by echocardiography. They target six specific abnormalities and compare several strategies: engineered features with gradient boosting, training waveform models from scratch, and various transfer-learning approaches built on foundation models. The key finding is that self-supervised adaptation on the target waveforms followed by selective supervised fine-tuning yields the best discrimination, with a macro area under the ROC curve of 0.8509. The approach also admits parameter-efficient updates that preserve AUROC while attaining the highest fixed-threshold macro-F1. The results point to a practical way to use the ECG for initial case finding and to triage patients toward confirmatory imaging.

Core claim

In-domain self-supervised adaptation of an ECG foundation model on the target waveforms, followed by selective supervised fine-tuning, produces the highest overall performance for multi-label detection of six echo-confirmed structural heart diseases, outperforming engineered features, scratch training, and other adaptation strategies including LoRA and mixtures.

What carries the argument

The central mechanism is self-supervised adaptation of the pretrained ECG foundation model on target-domain waveforms combined with selective supervised fine-tuning, which enables effective domain transfer while controlling computational cost.
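
A minimal sketch of what "selective" fine-tuning means in practice: most of the pretrained backbone is frozen, and only a chosen slice of modules plus the task head stay trainable. The module names and parameter counts below are hypothetical, for illustration only; they are not taken from ECG-FM.

```python
# Toy parameter table for a pretrained backbone: module name -> parameter count.
# All names and sizes here are invented for illustration.
params = {
    "conv_frontend": 1_200_000,
    "encoder.block_00": 7_000_000,
    "encoder.block_01": 7_000_000,
    "encoder.block_02": 7_000_000,
    "encoder.block_03": 7_000_000,
    "classifier_head": 50_000,   # six sigmoid outputs, one per SHD label
}

def select_trainable(params, unfrozen_prefixes):
    """Return the modules left trainable under selective fine-tuning."""
    return {
        name: n for name, n in params.items()
        if any(name.startswith(p) for p in unfrozen_prefixes)
    }

# Unfreeze only the top encoder block and the classification head.
trainable = select_trainable(params, ["encoder.block_03", "classifier_head"])
n_trainable = sum(trainable.values())
n_total = sum(params.values())
print(f"trainable: {n_trainable:,} / {n_total:,} "
      f"({100 * n_trainable / n_total:.1f}%)")
# → trainable: 7,050,000 / 29,250,000 (24.1%)
```

The same selection logic is what makes a parameter-efficient operating point cheap: the optimizer only ever sees the unfrozen subset.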

If this is right

  • Adapted models reach a peak macro-AUROC of 0.8509 and macro-AUPRC of 0.4297 for the six conditions.
  • A parameter-efficient fine-tuning variant preserves nearly identical AUROC at 0.8501 while achieving the highest fixed-threshold macro-F1 of 0.3691.
  • Late fusion of the model outputs with clinical covariates does not enhance threshold-independent discrimination metrics.
  • Evaluated alternatives such as LoRA adaptation, different backbone choices, and mixture-of-foundations approaches fail to exceed the performance of the best single adapted backbone.
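
For readers who want to reproduce the shape of these numbers, here is a hedged, self-contained sketch of the two headline metrics on toy data: macro-AUROC computed per label from the Mann-Whitney rank statistic, and macro-F1 at a fixed 0.5 threshold. The arrays `Y` and `P` are invented examples, not the paper's data.

```python
import numpy as np

def auroc(y_true, scores):
    """AUROC via the Mann-Whitney U statistic (average ranks handle ties)."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    order = scores.argsort()
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    for s in np.unique(scores):          # average ranks for tied scores
        tie = scores == s
        ranks[tie] = ranks[tie].mean()
    n_pos = y_true.sum()
    n_neg = len(y_true) - n_pos
    u = ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)

def f1(y_true, y_pred):
    """Binary F1; returns 0 when there are no true positives."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

# Toy multi-label setup: 4 records, 2 labels, sigmoid-style probabilities.
Y = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])
P = np.array([[0.9, 0.2], [0.6, 0.8], [0.4, 0.6], [0.2, 0.1]])

macro_auroc = np.mean([auroc(Y[:, k], P[:, k]) for k in range(Y.shape[1])])
macro_f1 = np.mean([f1(Y[:, k], (P[:, k] >= 0.5).astype(int))
                    for k in range(Y.shape[1])])
print(macro_auroc, macro_f1)  # → 0.875 0.75
```

Macro averaging treats each of the six conditions equally regardless of prevalence, which is why the macro-AUPRC (0.4297) sits far below the macro-AUROC: rare labels weigh on precision-based metrics.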

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If the performance holds, ECG-based screening could serve as a low-cost first step to reduce the volume of echocardiograms needed in clinical workflows.
  • The strategy of combining self-supervised domain adaptation with selective updates may apply to other medical signal tasks where labeled data is scarce.
  • Testing the approach on data from additional hospitals would reveal whether site-specific adaptation remains necessary for consistent results.

Load-bearing premise

The performance differences observed on the benchmark dataset will generalize to new patient populations, clinical sites, and recording equipment.

What would settle it

A validation study on ECG recordings from a different hospital or scanner fleet where the adapted model's macro-AUROC falls below 0.80 would indicate the method does not transfer as claimed.

Figures

Figures reproduced from arXiv: 2604.23385 by Dang Nguyen, Duc N. Do, Ethan Philip Lowder, Hoang Le, Hung N. Huynh, Jacques Kpodonu, Khanh T.Q. Le, Khoa D. Pham, Minh H.N. Le, Minh N. Do, Perisa Ashar, Phat K. Huynh, Phat V.H. Nguyen, Phi Pham-Van-Hoang, Quan K. Huynh, Quan Le, Ramez M. Odat.

Figure 1
Figure 1: Overall methodological workflow for multi-label SHD detection from 12-lead ECGs. The framework begins with the EchoNext Mini-Model cohort and released waveform and covariate inputs, proceeds through construction of six moderate-or-greater echocardiography-confirmed endpoints, and compares engineered-feature, from-scratch waveform, and pretrained FM baselines. The primary transfer setting applies continued …
Figure 2
Figure 2: Performance–efficiency operating points for the main benchmark models and ECG-FM adaptation settings. The x-axis shows trainable parameters on a log scale. (A) AUROC vs. trainable parameters. (B) AUPRC vs. trainable parameters.

| Method | Trainable Params | AUROC | AUPRC | Acc | F1 |
| --- | --- | --- | --- | --- | --- |
| Adapted ECG-FM (b=9, waveform-only) | 69.81 M | 0.8509 | 0.4261 | 0.9089 | 0.3064 |
| Cross-attention fusion | 69.81 M | 0.8487 | 0.4167 | 0.9093 | 0.3140 |
| Concat … | | | | | |
Figure 3
Figure 3: Engineered-feature selection and importance for Baseline A. (A) Mean validation AUROC after selecting the top-k features from the global ranking. Performance improved rapidly with the first features, plateaued after roughly 60 features, and was highest when using all 166 features (AUROC = 0.8049). (B) Top 10 features by global importance. The highest-ranked features include morphology, multi-lead, signal-s…
Figure 4
Figure 4: Qualitative t-SNE visualization of pooled embeddings from the original pretrained ECG foundation-model backbones. In Panels A–C, PTB-XL samples are shown in gray and EchoNext records with at least one modeled endpoint are colored by supervised partition (train/val/test). In Panels D–I, the pretrained ECG-FM latent space is reused to display false positives (gold) and false negatives (red) on the EchoNext t…
Original abstract

Transthoracic echocardiography is the reference standard for confirming structural heart disease (SHD), but first-line screening is limited by cost, workflow burden, and specialist availability. We evaluated whether open pretrained electrocardiogram (ECG) foundation models can support echo-confirmed multi-label SHD detection using the public EchoNext Mini-Model benchmark. Six echocardiography-derived abnormalities were targeted: reduced left ventricular ejection fraction, increased left ventricular wall thickness, aortic stenosis, mitral regurgitation, tricuspid regurgitation, and right ventricular systolic dysfunction. Under a common pipeline, we compared engineered ECG features with gradient boosting, end-to-end waveform learning from scratch, and transfer from open ECG foundation models. We then applied in-domain self-supervised adaptation of an ECG foundation model (ECG-FM) on EchoNext waveforms followed by selective supervised fine-tuning, and evaluated trade-offs between discrimination and adaptation cost. Adapted ECG-FM models achieved the best overall performance: peak macro-AUROC 0.8509 and macro-AUPRC 0.4297, while a parameter-efficient operating point preserved AUROC (0.8501) and attained the highest fixed-threshold macro-F1 0.3691. Late fusion with covariates did not improve threshold-independent discrimination, and evaluated LoRA, alternative backbones, and mixture-of-foundations strategies did not surpass the best adapted single-backbone models. These results indicate that for ECG-based case finding and echocardiography triage, combining target-domain self-supervised adaptation with selective supervised updating of a pretrained ECG backbone is the most effective transfer strategy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript evaluates transfer strategies for multi-label structural heart disease (SHD) detection from ECGs on the public EchoNext Mini-Model benchmark. It compares engineered features with gradient boosting, end-to-end training from scratch, and transfer from open ECG foundation models (ECG-FMs). The central finding is that in-domain self-supervised adaptation of an ECG-FM followed by selective supervised fine-tuning yields the strongest results: peak macro-AUROC of 0.8509 and macro-AUPRC of 0.4297, with a parameter-efficient operating point preserving AUROC at 0.8501 while achieving the highest fixed-threshold macro-F1 of 0.3691. Late fusion with covariates and alternative strategies (LoRA, other backbones, mixtures) did not improve performance.

Significance. If the reported ordering generalizes, the work demonstrates a practical route to improve ECG-based SHD case-finding and echocardiography triage using existing open foundation models and modest adaptation compute. Strengths include the systematic comparison of transfer approaches under a common pipeline, explicit reporting of both threshold-independent (AUROC/AUPRC) and fixed-threshold (F1) metrics, and evaluation of parameter-efficient fine-tuning trade-offs. These elements make the empirical benchmark results a useful reference point for ECG foundation-model research.

major comments (2)
  1. [Results] Results section (performance tables and text reporting macro-AUROC 0.8509 / macro-F1 0.3691): All headline comparisons rest on a single public benchmark (EchoNext Mini-Model) with no external validation cohort, multi-center split, or scanner-vendor hold-out described. Because echo-derived labels are known to shift with acquisition protocol, operator, and population demographics, the observed superiority of domain-adapted ECG-FM models over baselines could reverse under distribution shift; this single-dataset limitation directly weakens the claim that the adaptation strategy is the most effective transfer approach for clinical use.
  2. [Methods] Methods (data splits and evaluation protocol): No details are provided on patient-level versus waveform-level splitting, handling of multiple ECGs per patient, or statistical testing (e.g., confidence intervals or paired tests) for the reported AUROC/AUPRC/F1 differences. Without these, it is impossible to determine whether the small numerical gaps (e.g., 0.8509 vs. 0.8501 AUROC) are robust or merely within noise.
minor comments (2)
  1. [Abstract] Abstract and Methods: The phrase 'EchoNext Mini-Model benchmark' should include a brief description of dataset size, label prevalence, and train/validation/test splits on first use to allow readers to assess the scale of the evaluation.
  2. [Results] Results: The statement that 'late fusion with covariates did not improve threshold-independent discrimination' would be clearer if the exact covariates and fusion method were named in the main text rather than deferred to supplementary material.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our evaluation of domain-adapted ECG foundation models for multi-label SHD detection. The comments highlight important aspects of generalizability and methodological transparency. We address each major comment below and have revised the manuscript to incorporate clarifications and expanded discussion where appropriate.

Point-by-point responses
  1. Referee: [Results] Results section (performance tables and text reporting macro-AUROC 0.8509 / macro-F1 0.3691): All headline comparisons rest on a single public benchmark (EchoNext Mini-Model) with no external validation cohort, multi-center split, or scanner-vendor hold-out described. Because echo-derived labels are known to shift with acquisition protocol, operator, and population demographics, the observed superiority of domain-adapted ECG-FM models over baselines could reverse under distribution shift; this single-dataset limitation directly weakens the claim that the adaptation strategy is the most effective transfer approach for clinical use.

    Authors: We agree that reliance on a single public benchmark constitutes a genuine limitation, as echo label distributions can shift across sites, operators, and demographics, potentially altering the relative performance of adaptation strategies. In the revised manuscript we have added an explicit paragraph in the Discussion section acknowledging this constraint, clarifying that the reported ordering (including the 0.8509 macro-AUROC) is benchmark-specific rather than a universal clinical claim, and outlining the need for future multi-center validation. The systematic head-to-head comparison on the standardized EchoNext Mini-Model benchmark nonetheless remains a useful reference point for the community, even if external cohorts would be required to confirm broader applicability. revision: yes

  2. Referee: [Methods] Methods (data splits and evaluation protocol): No details are provided on patient-level versus waveform-level splitting, handling of multiple ECGs per patient, or statistical testing (e.g., confidence intervals or paired tests) for the reported AUROC/AUPRC/F1 differences. Without these, it is impossible to determine whether the small numerical gaps (e.g., 0.8509 vs. 0.8501 AUROC) are robust or merely within noise.

    Authors: We acknowledge that the original Methods section was insufficiently explicit on these points. In the revised version we have expanded the Data Splits and Evaluation Protocol subsections to state that all partitions were performed at the patient level (no patient overlap across train/validation/test), with a single representative ECG selected per patient to avoid intra-patient leakage. We now also report bootstrap-derived 95% confidence intervals for all headline metrics and include paired DeLong tests for AUROC comparisons between the leading models and baselines. These additions show that the primary performance gains over non-adapted baselines are statistically distinguishable, while the small gap between the full and parameter-efficient adapted variants is within the reported confidence intervals, consistent with the operating-point trade-off we emphasize. revision: yes
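
The bootstrap interval this (simulated) response describes can be sketched generically: resample the test set with replacement, recompute the metric, and take empirical percentiles. Everything below, including the accuracy metric and the sample size, is an illustrative assumption rather than the authors' protocol.

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_ci(y_true, y_pred, metric, n_boot=2000, alpha=0.05):
    """Percentile-bootstrap confidence interval for a held-out metric."""
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)   # resample records with replacement
        stats.append(metric(y_true[idx], y_pred[idx]))
    lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi

# Toy test set: binary labels with ~80%-accurate predictions.
acc = lambda y, p: np.mean(y == p)
y = rng.integers(0, 2, size=200)
p = np.where(rng.random(200) < 0.8, y, 1 - y)

point = acc(y, p)
lo, hi = bootstrap_ci(y, p, acc)
print(f"accuracy {point:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

For AUROC comparisons, the resampling would be done at the patient level, matching the patient-level splits the response describes, and paired tests such as DeLong's would be computed on the same resamples.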

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark results with no derivation chain

full rationale

The paper reports standard ML training and evaluation outcomes (AUROC, AUPRC, F1) on the public EchoNext benchmark for six echo-derived SHD labels. No equations, closed-form derivations, or parameter-fitting steps are described that would reduce the headline performance numbers to quantities defined by the same paper's inputs. All comparisons (engineered features, from-scratch training, foundation-model transfer, LoRA, late fusion) follow conventional supervised learning pipelines whose metrics are independently computable from held-out test data. Self-citations, if present, are not load-bearing for any claimed result.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the representativeness of the EchoNext benchmark and on standard machine-learning assumptions about transfer learning and self-supervised adaptation. No new physical entities are postulated.

axioms (1)
  • domain assumption: The EchoNext Mini-Model benchmark and its echo-derived labels constitute a valid and representative testbed for multi-label SHD screening.
    The paper treats this public dataset as the evaluation standard without additional external validation or discussion of selection biases.

pith-pipeline@v0.9.0 · 5663 in / 1433 out tokens · 90901 ms · 2026-05-08T08:17:47.625825+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Pretraining Strategies and Scaling for ECG Foundation Models: A Systematic Study

    eess.SP 2026-05 unverdicted novelty 7.0

    Contrastive predictive coding pretraining combined with structured state space models yields the strongest ECG foundation models, with continued gains from scaling data to 11 million samples.

Reference graph

Works this paper leans on

21 extracted references · 19 canonical work pages · cited by 1 Pith paper · 1 internal anchor

  1. [1]

    Artificial intelligence-enhanced electrocardiography in cardiovascular disease management

    K. C. Siontis, P. A. Noseworthy, Z. I. Attia, and P. A. Friedman. “Artificial intelligence-enhanced electrocardiography in cardiovascular disease management”. In: Nature Reviews Cardiology 18.7 (2021), pp. 465–478. doi: 10.1038/s41569-020-00503-2. url: https://www.nature.com/articles/s41569-020-00503-2

  2. [2]

    rECHOmmend: An ECG-Based Machine Learning Approach for Identifying Patients at High Risk of Undiagnosed Structural Heart Disease Detectable by Echocardiography

    A. E. Ulloa-Cerna et al. “rECHOmmend: An ECG-Based Machine Learning Approach for Identifying Patients at High Risk of Undiagnosed Structural Heart Disease Detectable by Echocardiography”. In: Circulation 146.1 (2022), pp. 36–47. doi: 10.1161/CIRCULATIONAHA.121.057869. url: https://www.ahajournals.org/doi/10.1161/CIRCULATIONAHA.121.057869

  3. [3]

    Detecting structural heart disease from electrocardiograms using AI

    T. J. Poterucha, L. Jing, R. P. Ricart, et al. “Detecting structural heart disease from electrocardiograms using AI”. In: Nature 644.8075 (2025), pp. 221–230. doi: 10.1038/s41586-025-09227-0. url: https://www.nature.com/articles/s41586-025-09227-0

  4. [4]

    Cardiologist-level arrhythmia detection and classification in ambulatory electrocardiograms using a deep neural network

    A. Y. Hannun, P. Rajpurkar, M. Haghpanahi, G. H. Tison, C. Bourn, M. P. Turakhia, and A. Y. Ng. “Cardiologist-level arrhythmia detection and classification in ambulatory electrocardiograms using a deep neural network”. In: Nature Medicine 25.1 (2019), pp. 65–69. doi: 10.1038/s41591-018-0268-3. url: https://www.nature.com/articles/s41591-018-0268-3

  5. [5]

    Automatic diagnosis of the 12-lead ECG using a deep neural network

    A. H. Ribeiro, M. H. Ribeiro, G. M. M. Paixão, et al. “Automatic diagnosis of the 12-lead ECG using a deep neural network”. In: Nature Communications 11 (2020), p. 1760. doi: 10.1038/s41467-020-15432-4. url: https://www.nature.com/articles/s41467-020-15432-4

  6. [6]

    PTB-XL, a large publicly available electrocardiography dataset

    P. Wagner, N. Strodthoff, R.-D. Bousseljot, D. Kreiseler, F. I. Lunze, W. Samek, T. Schaeffter, et al. “PTB-XL, a large publicly available electrocardiography dataset”. In: Scientific Data 7 (2020), p. 154. doi: 10.1038/s41597-020-0495-6. url: https://www.nature.com/articles/s41597-020-0495-6

  7. [7]

    Deep Learning for ECG Analysis: Benchmarks and Insights from PTB-XL

    N. Strodthoff, P. Wagner, T. Schaeffter, and W. Samek. “Deep Learning for ECG Analysis: Benchmarks and Insights from PTB-XL”. In: IEEE Journal of Biomedical and Health Informatics 25.5 (2021), pp. 1519–1528. doi: 10.1109/JBHI.2020.3022989. url: https://pubmed.ncbi.nlm.nih.gov/32903191/

  8. [8]

    Screening for cardiac contractile dysfunction using an artificial intelligence-enabled electrocardiogram

    Z. I. Attia, S. Kapa, F. Lopez-Jimenez, P. M. McKie, et al. “Screening for cardiac contractile dysfunction using an artificial intelligence-enabled electrocardiogram”. In: Nature Medicine 25.1 (2019), pp. 70–74. doi: 10.1038/s41591-018-0240-2. url: https://www.nature.com/articles/s41591-018-0240-2

  9. [9]

    Deep learning electrocardiographic analysis for detection of left-sided valvular heart disease

    P. Elias et al. “Deep learning electrocardiographic analysis for detection of left-sided valvular heart disease”. In: Journal of the American College of Cardiology 80.6 (2022), pp. 613–626. doi: 10.1016/j.jacc.2022.05.029. url: https://www.sciencedirect.com/science/article/pii/S0735109722052251

  10. [10]

    EchoNext: A Dataset for Detecting Echocardiogram-Confirmed Structural Heart Disease from ECGs

    P. Elias and J. Finer. “EchoNext: A Dataset for Detecting Echocardiogram-Confirmed Structural Heart Disease from ECGs”. In: PhysioNet (Sept. 2025). Version 1.1.0. doi: 10.13026/3ykd-bf14. url: https://doi.org/10.13026/3ykd-bf14

  11. [11]

    PhysioBank, PhysioToolkit, and PhysioNet: Components of a New Research Resource for Complex Physiologic Signals

    A. L. Goldberger, L. A. N. Amaral, L. Glass, J. M. Hausdorff, P. C. Ivanov, R. G. Mark, J. E. Mietus, G. B. Moody, C.-K. Peng, and H. E. Stanley. “PhysioBank, PhysioToolkit, and PhysioNet: Components of a New Research Resource for Complex Physiologic Signals”. In: Circulation 101.23 (2000), e215–e220. doi: 10.1161/01.CIR.101.23.e215. url: https://pubmed.ncbi.n...

  12. [12]

    CLOCS: Contrastive Learning of Cardiac Signals Across Space, Time, and Patients

    D. Kiyasseh, T. Zhu, and D. A. Clifton. “CLOCS: Contrastive Learning of Cardiac Signals Across Space, Time, and Patients”. In: Proceedings of the 38th International Conference on Machine Learning. Vol. 139. Proceedings of Machine Learning Research. PMLR, 2021, pp. 5606–5615. url: https://proceedings.mlr.press/v139/kiyasseh21a.html

  13. [13]

    ECG-FM: an open electrocardiogram foundation model

    K. McKeen, S. Masood, A. Toma, B. Rubin, and B. Wang. “ECG-FM: an open electrocardiogram foundation model”. In: JAMIA Open 8.5 (2025), ooaf122. doi: 10.1093/jamiaopen/ooaf122. url: https://academic.oup.com/jamiaopen/article/doi/10.1093/jamiaopen/ooaf122/8287827

  14. [14]

    An Electrocardiogram Foundation Model Built on over 10 Million Recordings with External Evaluation across Multiple Domains

    J. Li, A. D. Aguirre, V. Moura Junior, J. Jin, C. Liu, L. Zhong, C. Sun, G. Clifford, M. B. Westover, and S. Hong. “An Electrocardiogram Foundation Model Built on over 10 Million Recordings with External Evaluation across Multiple Domains”. In: NEJM AI 2.7 (2025), AIoa2401033. doi: 10.1056/AIoa2401033. url: https://pubmed.ncbi.nlm.nih.gov/40771651/

  15. [15]

    HuBERT-ECG: A Self-Supervised Foundation Model for Broad and Scalable Cardiac Applications

    E. Coppola, M. Savardi, M. Massussi, M. Adamo, M. Metra, and A. Signoroni. “HuBERT-ECG: A Self-Supervised Foundation Model for Broad and Scalable Cardiac Applications”. In: medRxiv (2024). doi: 10.1101/2024.11.14.24317328. url: https://www.medrxiv.org/content/10.1101/2024.11.14.24317328

  16. [16]

    NeuroKit2: A Python Toolbox for Neurophysiological Signal Processing

    D. Makowski, T. Pham, Z. J. Lau, J. C. Brammer, F. Lespinasse, H. Pham, C. Schölzel, and S. H. A. Chen. “NeuroKit2: A Python Toolbox for Neurophysiological Signal Processing”. In: Behavior Research Methods 53.4 (2021), pp. 1689–1696. doi: 10.3758/s13428-020-01516-y. url: https://pubmed.ncbi.nlm.nih.gov/33528817/

  17. [17]

    Optuna: A Next-generation Hyperparameter Optimization Framework

    T. Akiba, S. Sano, T. Yanase, T. Ohta, and M. Koyama. “Optuna: A Next-generation Hyperparameter Optimization Framework”. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2019, pp. 2623–2631. doi: 10.1145/3292500.3330701. url: https://dl.acm.org/doi/10.1145/3292500.3330701

  18. [18]

    XGBoost: A Scalable Tree Boosting System

    T. Chen and C. Guestrin. “XGBoost: A Scalable Tree Boosting System”. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016, pp. 785–794. doi: 10.1145/2939672.2939785. url: https://dl.acm.org/doi/10.1145/2939672.2939785

  19. [19]

    Deep Residual Learning for Image Recognition

    K. He, X. Zhang, S. Ren, and J. Sun. “Deep Residual Learning for Image Recognition”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016, pp. 770–778. doi: 10.1109/CVPR.2016.90. url: https://doi.org/10.1109/CVPR.2016.90

  20. [20]

    wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations

    A. Baevski, H. Zhou, A. Mohamed, and M. Auli. “wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations”. In: Advances in Neural Information Processing Systems. Vol. 33. 2020, pp. 12449–12460. doi: 10.48550/arXiv.2006.11477. url: https://arxiv.org/abs/2006.11477

  21. [21]

    LoRA: Low-Rank Adaptation of Large Language Models

    E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen. LoRA: Low-Rank Adaptation of Large Language Models. 2021. doi: 10.48550/arXiv.2106.09685. arXiv: 2106.09685 [cs.CL]. url: https://arxiv.org/abs/2106.09685