arxiv: 2605.12241 · v1 · submitted 2026-05-12 · 📡 eess.SP · cs.AI· cs.LG

Recognition: no theorem link

Pretraining Strategies and Scaling for ECG Foundation Models: A Systematic Study

M A Al-Masud, Nils Strodthoff

Pith reviewed 2026-05-13 03:52 UTC · model grok-4.3

classification 📡 eess.SP cs.AIcs.LG

keywords ECGfoundation modelsself-supervised learningstate space modelspretraining strategiesphysiological signalscontrastive predictive coding

0 comments

The pith

Structured state space models outperform transformers and CNNs for ECG foundation models because of their inductive biases rather than pretraining scale alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper systematically tests five self-supervised pretraining objectives on ECG data, scaling the pretraining set from small sizes up to 11 million samples drawn from public sources. It measures how well the resulting representations transfer to a range of downstream clinical tasks and directly compares three model families: structured state space models, transformers, and CNNs. Pretraining strategy matters, with contrastive predictive coding slightly ahead of other approaches, and larger datasets continue to help most methods. The clearest and most consistent finding is that structured state space models produce better representations than the alternatives across every pretraining objective tested. If this holds, it points to architecture choice, specifically the built-in assumptions about sequential structure, as the dominant factor in effective ECG representation learning.

Core claim

The authors establish that structured state space models deliver superior transferable representations for ECG signals compared with transformers and CNNs when pretrained with the same contrastive or non-contrastive objectives, and that this advantage persists and even strengthens as the pretraining corpus grows to 11 million samples. They conclude that the strong inductive biases of structured state space models, rather than pretraining scale or objective alone, are the primary driver of effective representation learning in this domain.

What carries the argument

Structured state space models, which embed explicit assumptions about the structure of sequential physiological signals to enable efficient modeling of long-range dependencies.

If this is right

Contrastive predictive coding produces the most transferable representations across the tested clinical tasks.
Scaling the pretraining dataset to 11 million samples yields continued gains for most pretraining objectives.
Architecture choice exerts a larger and more consistent effect on downstream performance than the choice of pretraining objective.
The superiority of structured state space models holds across all five pretraining strategies examined.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same inductive-bias advantage may favor structured state space models for other physiological time series such as EEG or PPG.
Future foundation-model work in this area could test whether even larger state space models continue to improve without requiring proportionally larger datasets.
Resource-constrained settings may achieve strong results by prioritizing state space architectures over simply collecting more unlabeled ECG data.

Load-bearing premise

That performance differences between architectures and pretraining methods are caused primarily by inductive biases rather than unmeasured differences in hyperparameters, preprocessing pipelines, or dataset-specific artifacts.

What would settle it

A controlled re-run in which transformers and CNNs are trained with identical hyperparameters, identical preprocessing, and the same random seeds as the state space models and still underperform on the same downstream tasks.

Figures

Figures reproduced from arXiv: 2605.12241 by M A Al-Masud, Nils Strodthoff.

**Figure 2.** Figure 2: Intra-model layer-wise representational similarity for JEPA [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 3.** Figure 3: Visualization of performance rankings across seven downstream task categories for the five [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Scaling Analyses for the CPC model investigating the scaling of the pretraining loss with [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

**Figure 5.** Figure 5: Layer-wise representation similarity within each pretraining objective, measured by CKA [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗

**Figure 6.** Figure 6: EchoNext label efficiency plot tracing downstream performance in dependence of the number of training samples. Label efficiency In [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗

**Figure 7.** Figure 7: Intra-model layer-wise representational similarity for data2vec [PITH_FULL_IMAGE:figures/full_fig_p024_7.png] view at source ↗

**Figure 8.** Figure 8: Intra-model layer-wise representational similarity for DinoSR [PITH_FULL_IMAGE:figures/full_fig_p024_8.png] view at source ↗

**Figure 9.** Figure 9: Intra-model layer-wise representational similarity for JEPA [PITH_FULL_IMAGE:figures/full_fig_p025_9.png] view at source ↗

**Figure 10.** Figure 10: Intra-model layer-wise representational similarity for CPC [PITH_FULL_IMAGE:figures/full_fig_p025_10.png] view at source ↗

**Figure 11.** Figure 11: Intra-model layer-wise representational similarity for HuBERT++ [PITH_FULL_IMAGE:figures/full_fig_p025_11.png] view at source ↗

**Figure 12.** Figure 12: Layer-wise representation similarity within each pretraining objective, measured by CKA [PITH_FULL_IMAGE:figures/full_fig_p047_12.png] view at source ↗

**Figure 13.** Figure 13: Inter-model representational similarity across network depths. CKA heatmaps comparing [PITH_FULL_IMAGE:figures/full_fig_p047_13.png] view at source ↗

**Figure 14.** Figure 14: Validation loss as a function of pretraining dataset size for each model. Dashed lines show [PITH_FULL_IMAGE:figures/full_fig_p048_14.png] view at source ↗

**Figure 15.** Figure 15: Scaling analysis for adult ECG interpretation task category datasets. [PITH_FULL_IMAGE:figures/full_fig_p049_15.png] view at source ↗

**Figure 16.** Figure 16: Correlation between pre-training validation loss and downstream classification error [PITH_FULL_IMAGE:figures/full_fig_p052_16.png] view at source ↗

read the original abstract

Specialized foundation models are beginning to emerge in various medical subdomains, but pretraining methodologies and parametric scaling with the size of the pretraining dataset are rarely assessed systematically and in a like-for-like manner. This work focuses on foundation models for electrocardiography (ECG) data, one of the most widely captured physiological time series world-wide. We present a comprehensive assessment of pretraining methodologies, covering five different contrastive and non-contrastive self-supervised learning objectives for ECG foundation models, and investigate their scaling behavior with pretraining dataset sizes up to 11M input samples, exclusively from publicly available sources. Pretraining strategy has a meaningful and consistent impact on downstream performance, with contrastive predictive coding (slightly ahead of JEPA) yielding the most transferable representations across diverse clinical tasks. Scaling pretraining data continues to yield meaningful improvements up to 11M samples for most objectives. We also compare model architectures across all pretraining methodologies and find evidence for a clear superiority of structured state space models compared to transformers and CNN models. We hypothesize that the strong inductive biases of structured state space models, rather than pretraining scale alone, are the primary driver of effective ECG representation learning, with important implications for future foundation model development in this and potentially other physiological signal domains.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper delivers a controlled benchmark of pretraining objectives and scaling for ECG models but the SSM inductive bias claim needs explicit multi-scale architecture curves to hold up.

read the letter

This paper runs five self-supervised objectives on public ECG data, tracks performance as pretraining data grows to 11M samples, and compares architectures across those objectives. Contrastive predictive coding edges out the others on downstream tasks, scaling keeps helping for most methods, and structured state space models beat transformers and CNNs in the architecture tests. That combination of like-for-like controls and public data is what makes the work new for this area. Earlier ECG foundation model papers usually tested one method or one scale, so the systematic layout here gives actual numbers people can use when choosing pretraining setups. The scaling curves are the clearest addition. The soft spot sits in the main hypothesis. The authors argue that SSM inductive biases drive the gains more than scale alone. The abstract describes the architecture comparison as running across all pretraining methods but does not say the architectures were re-run at multiple data sizes. If the SSM advantage was measured only at the 11M scale, the gap could still be tied to that particular regime rather than a scale-independent property of the state-space structure. The stress-test note flags exactly this, and the abstract leaves it open. This is for labs working on time-series foundation models or clinical ECG tools who need concrete comparisons rather than another single-method claim. Readers who want to know which pretraining choices transfer and how data volume affects them will find usable takeaways. It deserves a serious referee because the scale and public-data design are solid enough to be worth checking, even if the architecture scaling controls need tightening. Send it for review but ask for the per-architecture scaling details.

Referee Report

2 major / 2 minor

Summary. The manuscript conducts a systematic empirical study of five self-supervised pretraining objectives (contrastive and non-contrastive) for ECG foundation models, using exclusively public data up to 11M samples. It reports scaling curves showing continued gains for most objectives, identifies contrastive predictive coding (slightly ahead of JEPA) as yielding the most transferable representations on downstream clinical tasks, and compares architectures (structured state space models vs. transformers vs. CNNs) across pretraining methods, finding clear SSM superiority. The central hypothesis is that SSM inductive biases, rather than pretraining scale alone, are the primary driver of effective ECG representation learning.

Significance. If the results hold under controlled conditions, the work supplies a valuable like-for-like benchmark for ECG pretraining strategies and scaling behavior that is currently rare in the domain. The architecture comparison, if shown to be scale-independent, would provide concrete evidence favoring strong inductive biases over pure scale in physiological time-series foundation models, with direct implications for model design choices in related medical signal domains.

major comments (2)

[Architecture comparison results] The architecture comparison (reported as showing 'clear superiority' of SSMs across all pretraining methodologies) appears to have been performed at a single fixed scale rather than with per-architecture scaling curves at multiple data regimes. This design choice leaves the central hypothesis—that inductive biases rather than scale are primary—unsupported by direct evidence, as the performance gap could be an artifact of the particular (largest) scale chosen.
[Downstream evaluation setup] The downstream clinical tasks used to evaluate transferability are not accompanied by an explicit justification or sensitivity analysis showing they are representative of real-world ECG use cases; without this, differences attributed to inductive biases could be confounded by unmeasured factors such as task-specific signal characteristics or preprocessing artifacts.

minor comments (2)

[Abstract] The abstract states that five objectives were studied but does not name them; an explicit list would improve readability.
[Figures] Figure captions and legends should consistently report the number of random seeds or runs used to compute means and error bars.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our systematic study of ECG pretraining strategies and scaling. We address each major comment below with clarifications on our experimental choices and proposed revisions to strengthen the manuscript. We believe these changes will better support the central claims while acknowledging limitations in the current design.

read point-by-point responses

Referee: [Architecture comparison results] The architecture comparison (reported as showing 'clear superiority' of SSMs across all pretraining methodologies) appears to have been performed at a single fixed scale rather than with per-architecture scaling curves at multiple data regimes. This design choice leaves the central hypothesis—that inductive biases rather than scale are primary—unsupported by direct evidence, as the performance gap could be an artifact of the particular (largest) scale chosen.

Authors: We agree that performing architecture comparisons at multiple data regimes with dedicated scaling curves would provide stronger direct support for the hypothesis that SSM inductive biases are the primary driver independent of scale. Our current design compared architectures at the largest scale (11M samples) after observing continued gains from scaling the pretraining objectives, with the goal of evaluating representations under conditions where data scale has been maximized. The consistent SSM advantage across all five pretraining methods at this scale suggests the gap is not an artifact of a single objective. We will revise the manuscript to explicitly state the fixed scale used for architecture comparisons, add a dedicated limitations paragraph discussing the absence of per-architecture scaling curves, and include a forward-looking statement on the value of such experiments in future work. No new experiments are feasible within the revision timeline due to computational requirements. revision: partial
Referee: [Downstream evaluation setup] The downstream clinical tasks used to evaluate transferability are not accompanied by an explicit justification or sensitivity analysis showing they are representative of real-world ECG use cases; without this, differences attributed to inductive biases could be confounded by unmeasured factors such as task-specific signal characteristics or preprocessing artifacts.

Authors: We appreciate this observation. The downstream tasks were chosen to span a range of clinically relevant ECG applications drawn from established benchmarks in the literature (e.g., arrhythmia classification, myocardial infarction detection, and rhythm analysis on datasets such as PTB-XL and others). We will add a new subsection in the experimental setup that provides explicit justification for each task, including references to prior ECG foundation model evaluations and clinical guidelines. We will also incorporate a sensitivity analysis (e.g., results under alternative preprocessing pipelines and task subsets) to demonstrate robustness. These additions will be included in the revised manuscript. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical comparisons with no derivations or self-referential reductions

full rationale

This is an empirical study that evaluates five SSL objectives, scaling curves up to 11M samples, and architecture ablations (SSM vs transformer vs CNN) on held-out clinical downstream tasks using public ECG data. No equations, derivations, or 'predictions' are presented that could reduce to fitted inputs by construction. The hypothesis about SSM inductive biases is framed as an interpretation of experimental results rather than a mathematical claim. No self-citation chains are load-bearing for any central result, and the work does not rename known patterns or smuggle ansatzes. The design is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on standard machine-learning assumptions about representation transfer and the sufficiency of public ECG corpora for clinical generalization. No free parameters, invented entities, or non-standard axioms are introduced in the abstract.

axioms (1)

domain assumption Self-supervised learning objectives produce representations that transfer to downstream clinical tasks.
This is the core premise enabling all pretraining comparisons described.

pith-pipeline@v0.9.0 · 5527 in / 1431 out tokens · 131178 ms · 2026-05-13T03:52:39.422200+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

62 extracted references · 62 canonical work pages · 5 internal anchors

[1]

On the Opportunities and Risks of Foundation Models

Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the Opportunities and Risks of Foundation Models.arXiv preprint arXiv:2108.07258, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[2]

Towards a General-Purpose Foundation Model for Computational Pathology.Nature medicine, 30(3):850–862, 2024

Richard J Chen, Tong Ding, Ming Y Lu, Drew FK Williamson, Guillaume Jaume, Andrew H Song, Bowen Chen, Andrew Zhang, Daniel Shao, Muhammad Shaban, et al. Towards a General-Purpose Foundation Model for Computational Pathology.Nature medicine, 30(3):850–862, 2024

work page 2024
[3]

A foundation model for generalizable disease detection from retinal images.Nature, 622(7981):156–163, 2023

Yukun Zhou, Mark A Chia, Siegfried K Wagner, Murat S Ayhan, Dominic J Williamson, Robbert R Struyven, Timing Liu, Moucheng Xu, Mateo G Lozano, Peter Woodward-Court, et al. A foundation model for generalizable disease detection from retinal images.Nature, 622(7981):156–163, 2023

work page 2023
[4]

Artificial intelligence- enhanced electrocardiography in cardiovascular disease management.Nature Reviews Cardiology, 18(7): 465–478, 2021

Konstantinos C Siontis, Peter A Noseworthy, Zachi I Attia, and Paul A Friedman. Artificial intelligence- enhanced electrocardiography in cardiovascular disease management.Nature Reviews Cardiology, 18(7): 465–478, 2021

work page 2021
[5]

Deep learning for ECG analysis: Benchmarks and insights from PTB-XL.IEEE journal of biomedical and health informatics, 25(5): 1519–1528, 2020

Nils Strodthoff, Patrick Wagner, Tobias Schaeffter, and Wojciech Samek. Deep learning for ECG analysis: Benchmarks and insights from PTB-XL.IEEE journal of biomedical and health informatics, 25(5): 1519–1528, 2020

work page 2020
[6]

Screening for cardiovascular disease risk with electrocardiography.JAMA Internal Medicine, 178(9):1163–1164, 2018

R Sacha Bhatia and Paul Dorian. Screening for cardiovascular disease risk with electrocardiography.JAMA Internal Medicine, 178(9):1163–1164, 2018

work page 2018
[7]

Ivan C Rokos, William J French, Amal Mattu, Graham Nichol, Michael E Farkouh, James Reiffel, and Gregg W Stone. Appropriate cardiac cath lab activation: optimizing electrocardiogram interpretation and clinical decision-making for acute ST-elevation myocardial infarction.American heart journal, 160(6): 995–1003, 2010

work page 2010
[8]

An Electrocardiogram Foundation Model Built on over 10 Million Recordings.NEJM AI, 2(7):AIoa2401033, 2025

Jun Li, Aaron D Aguirre, Valdery Moura Junior, Jiarui Jin, Che Liu, Lanhai Zhong, Chenxi Sun, Gari Clifford, M Brandon Westover, and Shenda Hong. An Electrocardiogram Foundation Model Built on over 10 Million Recordings.NEJM AI, 2(7):AIoa2401033, 2025

work page 2025
[9]

Learning General Representation of 12-Lead Electrocardiogram with a Joint-Embedding Predictive Architecture

Sehun Kim. Learning general representation of 12-lead electrocardiogram with a joint-embedding predic- tive architecture.arXiv preprint arXiv:2410.08559, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[10]

Guiding Masked Representation Learning to Capture Spatio-Temporal Relationship of Electrocardiogram

Yeongyeon Na, Minje Park, Yunwon Tae, and Sunghoon Joo. Guiding Masked Representation Learning to Capture Spatio-Temporal Relationship of Electrocardiogram. InThe Twelfth International Conference on Learning Representations, 2024

work page 2024
[11]

Zero-shot ECG classification with multimodal learning and test-time clinical knowledge enhancement

Che Liu, Zhongwei Wan, Cheng Ouyang, Anand Shah, Wenjia Bai, and Rossella Arcucci. Zero-shot ECG classification with multimodal learning and test-time clinical knowledge enhancement. InProceedings of the 41st International Conference on Machine Learning, pages 31949–31963, 2024

work page 2024
[12]

Foundation model of ECG diagnosis: Diagnostics and explanations of any form and rhythm on ECG.Cell Reports Medicine, 5(12), 2024

Yuanyuan Tian, Zhiyuan Li, Yanrui Jin, Mengxiao Wang, Xiaoyang Wei, Liqun Zhao, Yunqing Liu, Jinlei Liu, and Chengliang Liu. Foundation model of ECG diagnosis: Diagnostics and explanations of any form and rhythm on ECG.Cell Reports Medicine, 5(12), 2024

work page 2024
[13]

HuBERT-ECG as a self-supervised foundation model for broad and scalable cardiac applications.medRxiv, pages 2024–11, 2024

Edoardo Coppola, Mattia Savardi, Mauro Massussi, Marianna Adamo, Marco Metra, and Alberto Signoroni. HuBERT-ECG as a self-supervised foundation model for broad and scalable cardiac applications.medRxiv, pages 2024–11, 2024

work page 2024
[14]

Ecg-fm: An open electrocardiogram foundation model.Jamia Open, 8(5):ooaf122, 2025

Kaden McKeen, Sameer Masood, Augustin Toma, Barry Rubin, and Bo Wang. Ecg-fm: An open electrocardiogram foundation model.Jamia Open, 8(5):ooaf122, 2025

work page 2025
[15]

Boosting Masked ECG-Text Auto-Encoders as Discrimi- native Learners

Manh Pham Hung, Aaqib Saeed, and Dong Ma. Boosting Masked ECG-Text Auto-Encoders as Discrimi- native Learners. InProceedings of the 42nd International Conference on Machine Learning, 2025

work page 2025
[16]

BenchECG and xECG: a benchmark and baseline for ECG foundation models.arXiv preprint arXiv:2509.10151, 2025

Riccardo Lunelli, Angus Nicolson, Samuel Martin Pröll, Sebastian Johannes Reinstadler, Axel Bauer, and Clemens Dlaska. BenchECG and xECG: a benchmark and baseline for ECG foundation models.arXiv preprint arXiv:2509.10151, 2025. 10

work page arXiv 2025
[17]

Foundation models for electrocardiogram interpretation: clinical implications.European Heart Journal, page ehaf1119, 2026

Alexis Nolin-Lapalme, Achille Sowa, Jacques Delfrate, Olivier Tastet, Denis Corbin, Merve Kulbay, Derman Ozdemir, Marie-Jeanne Noël, François-Christophe Marois-Blanchet, François Harvey, et al. Foundation models for electrocardiogram interpretation: clinical implications.European Heart Journal, page ehaf1119, 2026

work page 2026
[18]

ECG semantic integrator (ESI): A foundation ECG model pretrained with LLM-enhanced cardiological text.Transactions on Machine Learning Research, 2024

Han Yu, Peikun Guo, and Akane Sano. ECG semantic integrator (ESI): A foundation ECG model pretrained with LLM-enhanced cardiological text.Transactions on Machine Learning Research, 2024. ISSN 2835-8856

work page 2024
[19]

PhysioWave: A Multi-Scale Wavelet-Transformer for Physiological Signal Representation

Yanlong Chen, Mattia Orlandi, Pierangelo Maria Rapa, Simone Benatti, Luca Benini, and Yawei Li. PhysioWave: A Multi-Scale Wavelet-Transformer for Physiological Signal Representation. InThe Thirty- ninth Annual Conference on Neural Information Processing Systems, 2025

work page 2025
[20]

Benchmarking ECG FMs: A Reality Check Across Clinical Tasks

M A Al-Masud, Juan Lopez Alcaraz, and Nils Strodthoff. Benchmarking ECG FMs: A Reality Check Across Clinical Tasks. InThe Fourteenth International Conference on Learning Representations, 2026

work page 2026
[21]

OpenECG: Benchmarking ECG Foundation Models with Public 1.2 Million Records.arXiv preprint arXiv:2503.00711, 2025

Zhijiang Wan, Qianhao Yu, Jia Mao, Wenfeng Duan, and Cheng Ding. OpenECG: Benchmarking ECG Foundation Models with Public 1.2 Million Records.arXiv preprint arXiv:2503.00711, 2025. URL https://arxiv.org/abs/2503.00711

work page arXiv 2025
[22]

Representation Learning with Contrastive Predictive Coding

Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding.arXiv preprint arXiv:1807.03748, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[23]

Hubert: Self-supervised speech representation learning by masked prediction of hidden units.IEEE/ACM transactions on audio, speech, and language processing, 29:3451–3460, 2021

Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. Hubert: Self-supervised speech representation learning by masked prediction of hidden units.IEEE/ACM transactions on audio, speech, and language processing, 29:3451–3460, 2021

work page 2021
[24]

Self-supervised learning from images with a joint-embedding predictive architecture

Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, and Nicolas Ballas. Self-supervised learning from images with a joint-embedding predictive architecture. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15619–15629, 2023

work page 2023
[25]

Attention is all you need.Advances in neural information processing systems, 30, 2017

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in neural information processing systems, 30, 2017

work page 2017
[26]

Efficiently Modeling Long Sequences with Structured State Spaces

Albert Gu, Karan Goel, and Christopher Re. Efficiently Modeling Long Sequences with Structured State Spaces. InInternational Conference on Learning Representations, 2022

work page 2022
[27]

Temesgen Mehari and Nils Strodthoff. Towards quantitative precision for ECG analysis: Leveraging state space models, self-supervision and patient metadata.IEEE journal of biomedical and health informatics, 27(11):5326–5334, 2023

work page 2023
[28]

Nils Strodthoff, Juan Miguel Lopez Alcaraz, and Wilhelm Haverkamp. Prospects for artificial intelligence- enhanced electrocardiogram as a unified screening tool for cardiac and non-cardiac conditions: an explo- rative study in emergency care.European Heart Journal - Digital Health, 5(4):454–460, 07 2024. ISSN 2634-3916. doi: 10.1093/ehjdh/ztae039. URLhttp...

work page doi:10.1093/ehjdh/ztae039 2024
[29]

Training compute-optimal large language models

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. InProceedings of the 36th International Conference on Neural Information Processing Systems, pages 30016–30030, 2022

work page 2022
[30]

Scaling Laws for Robust Comparison of Open Foundation Language-Vision Models and Datasets

Marianna Nezhurina, Tomer Porian, Giovanni Puccetti, Tommie Kerssies, Romain Beaumont, Mehdi Cherti, and Jenia Jitsev. Scaling Laws for Robust Comparison of Open Foundation Language-Vision Models and Datasets. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

work page 2025
[31]

Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024

Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024

work page 2024
[32]

Gaussian Error Linear Units (GELUs)

Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (gelus).arXiv preprint arXiv:1606.08415, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[33]

Data2vec: A general framework for self-supervised learning in speech, vision and language

Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, and Michael Auli. Data2vec: A general framework for self-supervised learning in speech, vision and language. InInternational conference on machine learning, pages 1298–1312. PMLR, 2022. 11

work page 2022
[34]

Dinosr: Self-distillation and online clustering for self-supervised speech representation learning.Advances in Neural Information Processing Systems, 36:58346–58362, 2023

Alexander H Liu, Heng-Jui Chang, Michael Auli, Wei-Ning Hsu, and Jim Glass. Dinosr: Self-distillation and online clustering for self-supervised speech representation learning.Advances in Neural Information Processing Systems, 36:58346–58362, 2023

work page 2023
[35]

Sinkhorn distances: Lightspeed computation of optimal transport.Advances in neural information processing systems, 26, 2013

Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport.Advances in neural information processing systems, 26, 2013

work page 2013
[36]

Harvard- Emory ECG Database (version 5.0)

Zuzana Koscova, Valdery Moura Junior, Matthew Reyna, Shenda Hong, Aditya Gupta, Manohar Ghanta, Reza Sameni, Aaron Aguirre, Qiao Li, Sahar Zafar, Gari Clifford, and M Brandon Westover. Harvard- Emory ECG Database (version 5.0). Brain Data Science Platform, 2026

work page 2026
[37]

The harvard-emory ecg database

Zuzana Koscova, Qiao Li, Chad Robichaux, Valdery Moura Junior, Manohar Ghanta, Aditya Gupta, Jonathan Rosand, Aaron D Aguirre, Erik Reinertsen, Steven Song, et al. The harvard-emory ecg database. Scientific Data, 2026

work page 2026
[38]

Ribeiro, Gabriela M.M

Antônio H. Ribeiro, Gabriela M.M. Paixao, Emilly M. Lima, Manoel Horta Ribeiro, Marcelo M. Pinto Filho, Paulo R. Gomes, Derick M. Oliveira, Wagner Meira Jr, Thömas B Schon, and Antonio Luiz P. Ribeiro. CODE-15%: a large scale annotated dataset of 12-lead ECGs , June 2021

work page 2021
[39]

MIMIC-IV-ECG: Diagnostic Electrocardiogram Matched Subset.PhysioNet, September 2023

Brian Gow, Tom Pollard, Larry A Nathanson, Alistair Johnson, Benjamin Moody, Chrystinne Fernandes, Nathaniel Greenbaum, Jonathan W Waks, Parastou Eslami, Tanner Carbonati, Ashish Chaudhari, Elizabeth Herbst, Dana Moukheiber, Seth Berkowitz, Roger Mark, and Steven Horng. MIMIC-IV-ECG: Diagnostic Electrocardiogram Matched Subset.PhysioNet, September 2023. V...

work page 2023
[40]

V-JEPA: Latent Video Prediction for Visual Representation Learning

Antoine Bardes, Mehdi Mirza, Boris Oreshkin, Michael Auli, Ishan Misra, and Yann LeCun. V-JEPA: Latent Video Prediction for Visual Representation Learning. InInternational Conference on Learning Representations (ICLR), 2024

work page 2024
[41]

A closer look at AUROC and AUPRC under class imbalance.Advances in Neural Information Processing Systems, 37:44102–44163, 2024

Matthew B McDermott, Haoran Zhang, Lasse H Hansen, Giovanni Angelotti, and Jack Gallifant. A closer look at AUROC and AUPRC under class imbalance.Advances in Neural Information Processing Systems, 37:44102–44163, 2024

work page 2024
[42]

Universal language model fine-tuning for text classification

Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 328–339, 2018

work page 2018
[44]

URLhttps://arxiv.org/abs/2604.23385

work page internal anchor Pith review Pith/arXiv arXiv
[45]

Emerging properties in self-supervised vision transformers.Proceedings of the IEEE/CVF international conference on computer vision, pages 9650–9660, 2021

Mathilde Caron, Hugo Touvron, Ishan Misra, Hervè Jègou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers.Proceedings of the IEEE/CVF international conference on computer vision, pages 9650–9660, 2021

work page 2021
[46]

Cluster and Predict Latents Patches for Improved Masked Image Modeling.Transactions on Machine Learning Research, 2025

Timothée Darcet, Federico Baldassarre, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. Cluster and Predict Latents Patches for Improved Masked Image Modeling.Transactions on Machine Learning Research, 2025. ISSN 2835-8856. URLhttps://openreview.net/forum?id=Ycmz7qJxUQ

work page 2025
[47]

M. A. Reyna, N. Sadr, A. Gu, E. A. Perez Alday, C. Liu, S. Seyedi, A. Shah, and G. D. Clifford. Will Two Do? Varying Dimensions in Electrocardiography: The PhysioNet/Computing in Cardiology Challenge 2021 (version 1.0.3). https://doi.org/10.13026/34va-7q14, 2022. PhysioNet. RRID:SCR_007345

work page doi:10.13026/34va-7q14 2021
[48]

M. A. Reyna, N. Sadr, E. A. Perez Alday, A. Gu, A. J. Shah, C. Robichaux, A. B. Rad, A. Elola, S. Seyedi, S. Ansari, H. Ghanbari, Q. Li, A. Sharma, and G. D. Clifford. Will Two Do? Varying Dimensions in Electrocardiography: The PhysioNet/Computing in Cardiology Challenge 2021. In2021 Computing in Cardiology (CinC), pages 1–4, Brno, Czech Republic, 2021. d...

work page doi:10.23919/cinc53138.2021.9662687 2021
[49]

A 12-lead electrocardiogram database for arrhythmia research covering more than 10, 000 patients.Scientific Data, 7 (1), February 2020

Jianwei Zheng, Jianming Zhang, Sidy Danioko, Hai Yao, Hangyuan Guo, and Cyril Rakovski. A 12-lead electrocardiogram database for arrhythmia research covering more than 10, 000 patients.Scientific Data, 7 (1), February 2020. ISSN 2052-4463. doi: 10.1038/s41597-020-0386-x. URL http://dx.doi.org/10. 1038/s41597-020-0386-x

work page doi:10.1038/s41597-020-0386-x 2020
[50]

A large-scale multi-label 12-lead electrocardiogram database with standardized diagnostic statements

Hui Liu, Dan Chen, Da Chen, Xiyu Zhang, Huijie Li, Lipan Bian, Minglei Shu, and Yinglong Wang. A large-scale multi-label 12-lead electrocardiogram database with standardized diagnostic statements. Scientific data, 9(1):272, 2022. doi: doi.org/10.1038/s41597-022-01403-5. URL https://doi.org/10. 1038/s41597-022-01403-5. 12

work page doi:10.1038/s41597-022-01403-5 2022
[51]

A large-scale multi-label 12-lead electrocardiogram database with standardized diagnostic statements, 2022

Hui Liu, Yinglong Wang, Da Chen, Xiyu Zhang, Huijie Li, Lipan Bian, Minglei Shu, and Dan Chen. A large-scale multi-label 12-lead electrocardiogram database with standardized diagnostic statements, 2022

work page 2022
[52]

PTB-XL, a large publicly available electrocardiography dataset (version 1.0.3)

Philipp Wagner, Nils Strodthoff, Ralf Bousseljot, Wojciech Samek, and Tobias Schaeffter. PTB-XL, a large publicly available electrocardiography dataset (version 1.0.3). https://doi.org/10.13026/ kfzx-aw45, 2022. PhysioNet. RRID:SCR_007345

work page 2022
[53]

PTB-XL, a large publicly available electrocardiography dataset.Scientific data, 7 (1):1–15, 2020

Patrick Wagner, Nils Strodthoff, Ralf-Dieter Bousseljot, Dieter Kreiseler, Fatima I Lunze, Wojciech Samek, and Tobias Schaeffter. PTB-XL, a large publicly available electrocardiography dataset.Scientific data, 7 (1):1–15, 2020. doi: 10.1038/s41597-020-0495-6

work page doi:10.1038/s41597-020-0495-6 2020
[54]

A pediatric ECG database with disease diagnosis covering 11643 children.Scientific Data, 12(1):867, 2025

Jian Tan, Haoyi Fan, Jiawei Luo, Yanjie Zhou, Ning Wang, Xizheng Wang, Guizhi Liu, Chengyu Liu, and Zongmin Wang. A pediatric ECG database with disease diagnosis covering 11643 children.Scientific Data, 12(1):867, 2025. doi: 10.1038/s41597-025-05225-z

work page doi:10.1038/s41597-025-05225-z 2025
[55]

A pediatric ECG database with disease diagnosis covering 11643 children, 5 2025

Tan Jian, Haoyi Fan, Jiawei Luo, Yanjie Zhou, Ning Wang, Xizheng Wang, Guizhi Liu, Chengyu Liu, and Zongmin Wang. A pediatric ECG database with disease diagnosis covering 11643 children, 5 2025. URL https://doi.org/10.6084/m9.figshare.27078763.v1

work page doi:10.6084/m9.figshare.27078763.v1 2025
[56]

EchoNext: A Dataset for Detecting Echocardiogram-Confirmed Structural Heart Disease from ECGs

Pierre Elias and Joshua Finer. EchoNext: A Dataset for Detecting Echocardiogram-Confirmed Structural Heart Disease from ECGs. PhysioNet, 2025. URLhttps://doi.org/10.13026/r9pp-3y42

work page doi:10.13026/r9pp-3y42 2025
[57]

EchoNext-Mini: A Dataset and Baseline AI Model for Detecting Structural Heart Disease from Electrocardiograms.NEJM AI, 3(5), April 2026

John Weston Hughes, Linyuan Jing, Joshua Finer, Dustin Hartzel, Christopher Kelsey, Aaron Long, Daniel Rocha, Jeffrey Ruhl, Timothy Poterucha, and Pierre Elias. EchoNext-Mini: A Dataset and Baseline AI Model for Detecting Structural Heart Disease from Electrocardiograms.NEJM AI, 3(5), April 2026. ISSN 2836-9386. doi: 10.1056/aidbp2500516. URLhttp://dx.doi...

work page doi:10.1056/aidbp2500516 2026
[58]

We use exponential moving averages of a student network as prediction target (CAPI Figure

work page
[59]

like I-JEPA, HuBERT, CAPI

work page
[60]

To avoiding backpropagation through the clustering, we therefore need another non-SGD cluster update

We use clustering as loss (CAPI Figure 4) like CAPI. To avoiding backpropagation through the clustering, we therefore need another non-SGD cluster update

work page
[61]

We use Sinkhorn-Knopp like DINO and unlike DinoSR, which encourages equiparticipation and prevents collapse

work page
[62]

We use granular, soft prediction targets (unlike DinoSR), which would also allow us to use different temperatures (as in DINO)

work page
[63]

"" 4Get soft targets from EMA path using Sinkhorn-Knopp. 5 6Args: 7x: input images [B, C, T] 8 9Returns: 10targets: [B, K] soft assignment probabilities 11

We use cluster assignmenta using Sinkhorn-Knopp optimal transport (unlike CAPI, which uses a quite ad-hoc procedure). This has the nice side effect that prediction target computa- tion and cluster updates happen consistently S.2 Pseudo-code We provide pseudo-code for the three most crucial components of the algorithm. 1@torch.no_grad() 2def get_ema_target...

work page 1987