Synergistic Foundation Models for Semi-Supervised Fetal Cardiac Ultrasound Analysis: SAM-Med2D Boundary Refinement and DINOv3 Semantic Enhancement

Tonghao Zhuang (1), Shanglong Hu (1), Yongsheng Luo (1), Zhiqi Zhang (1), Yu Li (1) ((1) Zhuhai College of Science and Technology, Zhuhai, China)

arXiv: 2605.19799 · v1 · pith:E32QJUT6 · submitted 2026-05-19 · 💻 cs.CV · cs.AI



Pith reviewed 2026-05-20 06:31 UTC · model grok-4.3

classification: 💻 cs.CV, cs.AI

keywords: semi-supervised learning, fetal cardiac ultrasound, image segmentation, image classification, SAM-Med2D, DINOv3, congenital heart disease


The pith

A semi-supervised framework integrates SAM-Med2D boundary refinement and DINOv3 semantic enhancement to improve joint segmentation and classification of fetal cardiac ultrasound images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a semi-supervised approach for fetal cardiac ultrasound that jointly segments heart structures and classifies images or views. It starts from an EchoCare multi-task model, adds SAM-Med2D to sharpen segmentation boundaries, and uses DINOv3 to raise the quality of pseudo-labels generated from unlabeled scans. View-specific hard masking is applied during training, and a two-stage optimization first uses an exponential moving average (EMA) phase to consolidate segmentation, then freezes the segmentation parameters, resets the classification head, and retrains only that head. On the FETUS 2026 leaderboard the method reports a Dice of 79.99%, an NSD of 61.62%, and an F1-score of 41.20%, which the authors present as evidence of practical value for prenatal congenital heart disease screening; no baseline is given against which to measure the gain.
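The review does not say how view-specific hard masking works. One plausible reading, sketched here purely as an assumption, is that each standard view permits only a subset of segmentation classes, and the logits of every other class are forced to -inf before the per-pixel argmax, so structures that cannot appear in that view are never predicted. The view names, class ids, and the `VIEW_ALLOWED_CLASSES` table are hypothetical.

```python
import math

# Hypothetical view -> allowed-class table. Class ids, view names and
# memberships are illustrative assumptions, not taken from the paper.
VIEW_ALLOWED_CLASSES = {
    "4CH": {0, 1, 2, 3, 4},
    "LVOT": {0, 2, 5},
}


def hard_mask_logits(logits, view, allowed=VIEW_ALLOWED_CLASSES):
    """Return a copy of per-pixel logits with disallowed classes set to -inf.

    `logits[i][c]` is the score of class `c` at pixel `i`; the input is not modified.
    """
    keep = allowed[view]
    return [
        [score if c in keep else -math.inf for c, score in enumerate(pixel)]
        for pixel in logits
    ]


def argmax_labels(logits):
    """Per-pixel argmax; ties go to the lowest class id."""
    return [max(range(len(pixel)), key=pixel.__getitem__) for pixel in logits]


if __name__ == "__main__":
    # Two pixels, six classes (0 = background).
    logits = [[0.1, 0.2, 0.0, 0.0, 0.0, 0.9],
              [0.5, 0.1, 0.0, 0.0, 0.0, 0.2]]
    print(argmax_labels(logits))                           # [5, 0]
    print(argmax_labels(hard_mask_logits(logits, "4CH")))  # [1, 0]
```

Masking logits rather than zeroing probabilities after a softmax preserves the ranking among the allowed classes. Whether the authors mask logits, losses, or pseudo-labels is not stated.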

Core claim

The central claim is that a semi-supervised joint segmentation and classification pipeline, built on the EchoCare backbone and augmented with SAM-Med2D for boundary refinement plus DINOv3 for better pseudo-labels, succeeds when equipped with view-specific hard masking and a two-stage optimization that first consolidates segmentation via EMA and then performs classification fine-tuning with segmentation parameters frozen.
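The core claim credits DINOv3 with better pseudo-labels but does not say how. For context, the sketch below shows the generic confidence filter used by FixMatch-style methods such as UniMatch, on which the framework is built: a teacher's per-pixel class probabilities become pseudo-labels only where the top probability clears a threshold. The optional `aux_probs` averaging is a hypothetical stand-in for an auxiliary model's contribution, not the authors' DINOv3 mechanism; the 0.95 threshold and the ignore label 255 are assumptions.

```python
import math

IGNORE_INDEX = 255  # label for pixels excluded from the unsupervised loss (assumed convention)


def softmax(scores):
    """Numerically stable softmax over one pixel's class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def confident_pseudo_labels(teacher_logits, threshold=0.95, aux_probs=None, aux_weight=0.5):
    """FixMatch-style confidence filtering of per-pixel pseudo-labels.

    If `aux_probs` is given (hypothetical: class probabilities from an auxiliary
    model), they are averaged with the teacher's probabilities before thresholding.
    """
    labels = []
    for i, pixel in enumerate(teacher_logits):
        probs = softmax(pixel)
        if aux_probs is not None:
            probs = [(1 - aux_weight) * p + aux_weight * q
                     for p, q in zip(probs, aux_probs[i])]
        best = max(range(len(probs)), key=probs.__getitem__)
        labels.append(best if probs[best] >= threshold else IGNORE_INDEX)
    return labels


if __name__ == "__main__":
    teacher = [[10.0, 0.0, 0.0],   # confident pixel
               [0.2, 0.1, 0.0]]    # uncertain pixel
    print(confident_pseudo_labels(teacher))  # [0, 255]
```

With the averaging option, disagreement between the teacher and the auxiliary source pulls the top probability below the threshold, so the pixel is dropped rather than assigned a possibly wrong label.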

What carries the argument

The two-stage optimization strategy of EMA consolidation for segmentation followed by classification fine-tuning with frozen segmentation parameters and a reset classification head, together with view-specific hard masking.
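A minimal sketch of the two-stage schedule, assuming plain SGD on a flat name-to-value parameter table. The `encoder.`, `seg_decoder.` and `cls_head.` prefixes, the momentum, and the re-initialization range are hypothetical. Stage 1 keeps an exponential moving average of the student weights; stage 2 starts from the EMA copy, re-initializes the classification head, and updates only entries under the head prefix, so everything treated as a segmentation parameter (here including the shared encoder, itself an assumption) stays fixed.

```python
import random


def ema_update(ema_params, student_params, momentum=0.99):
    """Stage 1 (sketch): in-place EMA, ema <- momentum * ema + (1 - momentum) * student."""
    for name, value in student_params.items():
        ema_params[name] = momentum * ema_params[name] + (1.0 - momentum) * value


def start_stage_two(ema_params, head_prefix="cls_head.", seed=0):
    """Stage 2 setup: copy the EMA weights and re-initialize the classification head."""
    rng = random.Random(seed)
    params = dict(ema_params)  # the EMA table itself is left untouched
    for name in params:
        if name.startswith(head_prefix):
            params[name] = rng.uniform(-0.01, 0.01)
    return params


def sgd_step(params, grads, lr=0.1, trainable_prefix="cls_head."):
    """Stage 2 update: plain SGD that only touches classification-head entries."""
    for name, grad in grads.items():
        if name.startswith(trainable_prefix):
            params[name] -= lr * grad


if __name__ == "__main__":
    ema = {"encoder.w": 0.0, "seg_decoder.w": 0.0, "cls_head.w": 0.0}
    ema_update(ema, {name: 1.0 for name in ema}, momentum=0.9)
    params = start_stage_two(ema)
    sgd_step(params, {name: 1.0 for name in params})
    print(params)  # encoder/seg_decoder keep their EMA values; only cls_head.w moved
```

In a PyTorch implementation, freezing via `requires_grad_(False)` or leaving parameters out of the optimizer stops gradient updates, but BatchNorm layers in `train()` mode still update their running mean and variance, which changes segmentation outputs. Keeping the frozen modules in `eval()` mode during stage 2 avoids this; the review does not say which the authors did.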

Load-bearing premise

The two-stage process of EMA segmentation consolidation followed by frozen-segmentation classification fine-tuning will recover classification performance without degrading the segmentation gains from SAM-Med2D and DINOv3.

What would settle it

An ablation experiment on the FETUS 2026 data that directly compares the two-stage method against a single-stage joint training baseline and shows whether both the Dice score and F1-score can be maintained simultaneously or if one metric drops when the other is optimized.

Figures

Figures reproduced from arXiv: 2605.19799 by Tonghao Zhuang, Shanglong Hu, Yongsheng Luo, Zhiqi Zhang, and Yu Li (Zhuhai College of Science and Technology, Zhuhai, China).

**Figure 2.** Comparison between Ground Truth and Method e)

Section 5 (Conclusion): This paper presents a semi-supervised framework for joint fetal cardiac ultrasound image segmentation and classification, built upon the EchoCare multi-task backbone and UniMatch training paradigm. By integrating SAM-Med2D for boundary refinement and DINOv3 for semantic enhancement of pseudo-labels, our method achieves significant improvements o…

Original abstract

We present a semi-supervised framework for joint segmentation and classification of fetal cardiac ultrasound images. Built upon the EchoCare multi-task backbone, our method integrates SAM-Med2D for boundary refinement and leverages DINOv3 to enhance pseudo-label quality. We introduce view-specific hard masking along with a two-stage optimization strategy: an EMA phase to consolidate segmentation capabilities, followed by a Classification Fine-Tuning phase that freezes segmentation parameters and resets the classification head to recover classification performance without compromising segmentation gains. Evaluated on the FETUS 2026 leaderboard, our method achieves a Dice Similarity Coefficient at 79.99%, Normalized Surface Distance at 61.62%, and F1-score at 41.20%, validating the effectiveness of our approach for prenatal congenital heart disease screening. Source code is publicly available at: https://github.com/2826056177/zcst_fetus2026.
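For readers checking the reported numbers, the sketch below computes the three metrics on toy data. Masks are sets of `(row, col)` pixels; NSD follows the usual surface-Dice definition (the fraction of both boundaries lying within a tolerance `tau` of the other boundary); F1 is macro-averaged over classes. The tolerance in pixels, the 4-neighbour boundary, the empty-mask conventions, and macro averaging are all assumptions; this is not the FETUS 2026 evaluation code.

```python
import math


def dice(pred, gt):
    """Dice = 2|P & G| / (|P| + |G|); two empty masks count as a perfect match (assumption)."""
    if not pred and not gt:
        return 1.0
    return 2 * len(pred & gt) / (len(pred) + len(gt))


def boundary(mask):
    """Foreground pixels with at least one 4-neighbour outside the mask."""
    return {
        (r, c) for (r, c) in mask
        if any(n not in mask for n in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)))
    }


def nsd(pred, gt, tau=1.0):
    """Normalized surface Dice: share of both boundaries within distance tau of the other."""
    bp, bg = boundary(pred), boundary(gt)
    if not bp and not bg:
        return 1.0
    if not bp or not bg:
        return 0.0

    def within(points, targets):
        return sum(1 for p in points if min(math.dist(p, t) for t in targets) <= tau)

    return (within(bp, bg) + within(bg, bp)) / (len(bp) + len(bg))


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 = 2TP / (2TP + FP + FN)."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for k in classes:
        tp = sum(t == k and p == k for t, p in zip(y_true, y_pred))
        fp = sum(t != k and p == k for t, p in zip(y_true, y_pred))
        fn = sum(t == k and p != k for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    gt = {(r, c) for r in range(3) for c in range(3)}        # 3x3 square
    pred = {(r, c) for r in range(3) for c in range(1, 4)}   # shifted one pixel right
    print(dice(pred, gt), nsd(pred, gt, tau=1.0), nsd(pred, gt, tau=0.5))
```

Shifting a 3x3 square one pixel to the right gives a Dice of 2/3, an NSD of 1.0 at `tau=1.0`, and an NSD of 0.5 at `tau=0.5`, which illustrates how strongly NSD depends on the chosen tolerance.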

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Desk Editor's Note (private letter to a colleague)

This is a straightforward application of SAM-Med2D and DINOv3 to fetal cardiac ultrasound on the EchoCare backbone, with a two-stage training schedule whose key assumption lacks any ablation support.


The paper applies two existing foundation models to a semi-supervised segmentation-plus-classification task on fetal heart ultrasound. It starts from the EchoCare multi-task network, adds SAM-Med2D for boundary refinement and DINOv3 for better pseudo-labels, then runs an EMA consolidation stage followed by a frozen-segmentation classification fine-tune. The main concrete output is the leaderboard numbers on FETUS 2026: 79.99% Dice, 61.62% NSD, and 41.20% F1, plus a public GitHub repo. That combination is a legitimate domain-specific engineering effort rather than a new method or theoretical result. The public code is the clearest positive here; anyone working on similar prenatal imaging pipelines can at least inspect the implementation.

The central weakness is exactly the one the stress-test flagged. The abstract presents the final metrics as the result of the full two-stage procedure, yet gives no before-and-after segmentation scores after the classification fine-tuning step. Without that ablation it is impossible to tell whether the reported Dice and NSD already reflect some degradation or whether they come from an intermediate checkpoint. The text also omits baseline comparisons, statistical tests, error bars, and any description of how the test set was kept separate from hyperparameter choices. These gaps make the performance claims hard to evaluate on their own terms.

The work will mainly interest groups already running semi-supervised medical ultrasound experiments who want a ready reference implementation for this specific organ and modality. It is not broad enough or novel enough to change the field, but the applied setting is practically relevant and the code release lowers the barrier to checking the details. I would send it to peer review rather than desk-reject, with the explicit request that the authors add the missing segmentation ablations and basic statistical reporting before acceptance.

Referee Report

2 major / 0 minor

Summary. The paper introduces a semi-supervised framework for joint segmentation and classification of fetal cardiac ultrasound images using the EchoCare backbone enhanced with SAM-Med2D for boundary refinement and DINOv3 for semantic pseudo-label improvement. It employs view-specific hard masking and a two-stage optimization consisting of an EMA phase for segmentation consolidation and a subsequent classification fine-tuning phase with frozen segmentation parameters. On the FETUS 2026 leaderboard, it reports a Dice Similarity Coefficient of 79.99%, Normalized Surface Distance of 61.62%, and F1-score of 41.20%.

Significance. Should the results prove robust upon verification of baselines and ablations, this work has the potential to advance semi-supervised techniques in prenatal ultrasound analysis for congenital heart disease. The public availability of the source code at the provided GitHub repository is a notable strength supporting reproducibility.

major comments (2)

[Abstract] The abstract reports specific performance metrics (DSC at 79.99%, NSD at 61.62%, F1 at 41.20%) but omits baseline comparisons, statistical significance, error bars, data split details, and measurement of pseudo-label quality. Without these, the metrics cannot be fully verified as supporting the effectiveness claim.
[Two-stage optimization] The two-stage optimization strategy is outlined, but no ablation study is presented to confirm that the segmentation gains from SAM-Med2D and DINOv3 are maintained after the classification fine-tuning phase. This absence is load-bearing for the central claim that the full pipeline achieves the reported segmentation metrics without degradation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful review and constructive suggestions. We address each of the major comments below and outline the revisions we will make to the manuscript.


Referee: [Abstract] The abstract reports specific performance metrics (DSC at 79.99%, NSD at 61.62%, F1 at 41.20%) but omits baseline comparisons, statistical significance, error bars, data split details, and measurement of pseudo-label quality. Without these, the metrics cannot be fully verified as supporting the effectiveness claim.

Authors: We agree that including baseline comparisons and additional details in the abstract would enhance clarity and verifiability. In the revised version, we will expand the abstract to briefly mention comparisons against standard semi-supervised baselines (e.g., Mean Teacher and FixMatch), note that results are reported as mean ± std over multiple runs, and reference the data splits and pseudo-label evaluation sections in the main text. This will better contextualize the reported metrics. revision: yes
Referee: [Two-stage optimization] The two-stage optimization strategy is outlined, but no ablation study is presented to confirm that the segmentation gains from SAM-Med2D and DINOv3 are maintained after the classification fine-tuning phase. This absence is load-bearing for the central claim that the full pipeline achieves the reported segmentation metrics without degradation.

Authors: We recognize that an ablation study is necessary to substantiate that the classification fine-tuning phase does not compromise the segmentation performance achieved in the EMA phase. Although the manuscript specifies that segmentation parameters are frozen during this phase, we will add a dedicated ablation experiment in the revised manuscript. This will compare segmentation metrics (Dice, NSD) immediately after the EMA phase versus after the full two-stage process to confirm no degradation occurs. revision: yes
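The promised ablation could be paired with a cheap check that the freeze actually held. The sketch below compares two snapshots of named weights and buffers, one taken after the EMA phase and one after classification fine-tuning, reports any entry under a frozen prefix that moved, and computes metric deltas between the two checkpoints. Entry names and prefixes are hypothetical; including buffers such as BatchNorm running statistics in the snapshot matters because they can change even when gradients are disabled.

```python
def changed_frozen_entries(before, after, frozen_prefixes=("encoder.", "seg_decoder."), atol=0.0):
    """Names of frozen entries whose values differ between two snapshots by more than atol."""
    return sorted(
        name for name, old in before.items()
        if name.startswith(frozen_prefixes) and abs(after[name] - old) > atol
    )


def metric_deltas(after_ema, after_finetune):
    """Per-metric change from the EMA checkpoint to the fine-tuned checkpoint."""
    return {name: after_finetune[name] - after_ema[name] for name in after_ema}


if __name__ == "__main__":
    # Hypothetical snapshots: the weight stayed put, but a BatchNorm buffer drifted.
    before = {"encoder.conv.w": 1.0, "encoder.bn.running_mean": 0.0, "cls_head.w": 0.0}
    after = {"encoder.conv.w": 1.0, "encoder.bn.running_mean": 0.3, "cls_head.w": 0.5}
    print(changed_frozen_entries(before, after))  # ['encoder.bn.running_mean']
```

An empty list together with zero Dice and NSD deltas would support the rebuttal's claim of no degradation; a non-empty list would point to where any drop comes from.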

Circularity Check

0 steps flagged

No significant circularity in method description or results reporting

full rationale

The paper describes an empirical semi-supervised pipeline that applies existing foundation models (SAM-Med2D for boundary refinement, DINOv3 for pseudo-label enhancement) plus a two-stage optimization (EMA consolidation followed by frozen-segmentation classification fine-tuning). No mathematical derivation chain, equations, or first-principles results are present in the abstract or described text. Performance numbers are reported directly against the external FETUS 2026 leaderboard benchmark rather than being derived from internal fits or self-referential definitions. The two-stage strategy is presented as a procedural choice without any reduction of final metrics to the inputs by construction, and no load-bearing self-citations or uniqueness theorems are invoked. This is a standard application paper whose central claims remain independent of the listed circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Only the abstract is available, so the ledger is necessarily incomplete; the method rests on the unexamined transferability of SAM-Med2D and DINOv3 features to fetal ultrasound and on the unproven stability of the two-stage optimization.

axioms (1)

domain assumption: Pre-trained foundation models SAM-Med2D and DINOv3 produce useful features and pseudo-labels when applied to fetal cardiac ultrasound without extensive domain-specific retraining.
The abstract invokes these models for boundary refinement and semantic enhancement but supplies no ablation or domain-adaptation analysis.

pith-pipeline@v0.9.0 · 5731 in / 1521 out tokens · 64001 ms · 2026-05-20T06:31:54.466841+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear


unclear
Relation between the paper passage and the cited Recognition theorem.

two-stage optimization strategy: an EMA phase to consolidate segmentation capabilities, followed by a Classification Fine-Tuning phase that freezes segmentation parameters

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

18 extracted references · 18 canonical work pages · 1 internal anchor

[1]

INTRODUCTION Congenital heart disease (CHD) represents the most prevalent structural anomaly in fetuses and remains a leading cause of morbidity and mortality in newborns [2]. Prenatal screening for CHD relies heavily on the expert assessment of ultrasound images from standard cardiac views, including the four-chamber view (4CH), left ventricular outflow...

[2]

Medical Image Segmentation U-Net [9] and its variants have achieved remarkable success in medical imaging

RELATED WORKS 2.1. Medical Image Segmentation U-Net [9] and its variants have achieved remarkable success in medical imaging. Transformer-based models like UNETR

[3]

and Swin-Transformer [13] have further advanced the field through self-attention mechanisms capturing long-range dependencies. The Segment Anything Model (SAM) [5] represents a foundation model breakthrough, with SAM-Med2D specifically optimized for medical images, demonstrating superior adaptability to medical imaging characteristics. 2.2. Semi-Super...

[4]

EXPERIMENTAL DESIGN AND RESULT ANALYSIS 4.1. Dataset and Task Description The FETUS 2026 Challenge dataset comprises 5,000 standard-view B-mode fetal cardiac ultrasound images, among which 2,800 are allocated for training (partially sourced from the FOCUS dataset [11]). Collected across multiple centers and devices, the dataset exhibits real-world chara...

[5]

ACKNOWLEDGMENTS This work was supported by the Guangdong Key Disciplines Project under grant number 2024ZDJS137

[6]

UNETR: Transformers for 3D Medical Image Segmentation

Ali Hatamizadeh, Dong Yang, Holger R. Roth, and Daguang Xu, “UNETR: Transformers for 3D Medical Image Segmentation,” 2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2021, pp. 1748-1758

[7]

Prenatal diagnosis of congenital heart disease: A review of current knowledge

Bravo-Valenzuela, Nathalie Jeanne Magioli, et al., “Prenatal diagnosis of congenital heart disease: A review of current knowledge,” Indian Heart Journal, vol. 70, no. 1, pp. 150-164, 2018

[8]

Artificial intelligence-enabled prenatal ultrasound for the detection of fetal cardiac abnormalities: a systematic review and meta-analysis

Elena D'Alberti, Olga Patey, Carolyn Smith, et al., “Artificial intelligence-enabled prenatal ultrasound for the detection of fetal cardiac abnormalities: a systematic review and meta-analysis,” eClinicalMedicine, vol. 84, May 2025

[9]

A Fully Open and Generalizable Foundation Model for Ultrasound Clinical Applications

Hongyuan Zhang, Yuheng Wu, Mingyang Zhao, et al., “A Fully Open and Generalizable Foundation Model for Ultrasound Clinical Applications.” arXiv preprint arXiv:2509.11752, 2025

[10]

SAM-Med2D

Junlong Cheng, Jin Ye, Zhongying Deng, et al., “SAM-Med2D,” arXiv preprint arXiv:2308.16184, 2023

[11]

FixMatch: Simplifying Semi-Supervised Learning with Consistency and Confidence

Kihyuk Sohn, David Berthelot, Nicholas Carlini, et al., “FixMatch: Simplifying Semi-Supervised Learning with Consistency and Confidence,” Advances in Neural Information Processing Systems (NeurIPS), 2020, vol. 33, pp. 596-608

[12]

Revisiting Weak-to-Strong Consistency in Semi-Supervised Semantic Segmentation

Lihe Yang, Lei Qi, Litong Feng, Wayne Zhang, Yinghuan Shi, “Revisiting Weak-to-Strong Consistency in Semi-Supervised Semantic Segmentation,” 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 2023, pp. 7236-7246

[13]

Prenatal screening for congenital heart disease with four‐chamber and outflow‐tract views: a multicenter study

Ogge G, Gaglioti P, Maccanti S, et al., “Prenatal screening for congenital heart disease with four‐chamber and outflow‐tract views: a multicenter study,” Ultrasound in Obstetrics and Gynecology: The Official Journal of the International Society of Ultrasound in Obstetrics and Gynecology, vol. 28, no. 6, pp. 779-784, 2006

[14]

U-net: Convolutional networks for biomedical image segmentation

Olaf Ronneberger, Philipp Fischer, and Thomas Brox, “U-net: Convolutional networks for biomedical image segmentation,” Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, Nassir Navab, Joachim Hornegger, William M. Wells, and Alejandro F. Frangi, Eds., Cham, 2015, pp. 234–241

[15]

DINOv3

Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, et al., “DINOv3,” arXiv preprint arXiv:2508.10104, 2025

[16]

FOCUS: Four-chamber Ultrasound Image Dataset for Fetal Cardiac Biometric Measurement (1.0) [Data set]

Songxiong Wu, Hongyuan Zhang, Tingting Ye, et al., “FOCUS: Four-chamber Ultrasound Image Dataset for Fetal Cardiac Biometric Measurement (1.0) [Data set],” Zenodo, https://doi.org/10.5281/zenodo.14597550

[17]

CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features

Sangdoo Yun, Dongyoon Han, Seong Joon Oh, et al., “CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features,” 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 6022-6031, 2019

[18]

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

Ze Liu, Yutong Lin, Yue Cao, et al., “Swin Transformer: Hierarchical Vision Transformer using Shifted Windows,” 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 9992-10002

