Focus on What Matters: Two-Stage ROI-Aware Refinement for Anatomy-Preserving Fetal Ultrasound Reconstruction

Ines Abbes; Khalid Alyafei; Mahmood Alzubaidi; Marco Agus; Mowafa Househ; Samir Brahim Belhaouari

arXiv: 2604.23839 · v1 · submitted 2026-04-26 · cs.CV · cs.AI

Pith reviewed 2026-05-08 06:23 UTC · model grok-4.3

keywords fetal ultrasound · ROI-aware reconstruction · nuchal translucency · domain shift · convolutional autoencoder · anatomy preservation · multi-hospital evaluation · measurement accuracy

The pith

Focusing refinement on the nuchal translucency region of interest in a two-stage autoencoder improves reconstruction quality for fetal ultrasound images from multiple hospitals.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that global reconstruction quality alone does not guarantee fidelity in the small anatomical zones that drive clinical measurements in fetal ultrasound. It introduces a two-stage convolutional autoencoder that first learns an overall 128-dimensional latent representation and then applies targeted intensity and edge constraints only to the nuchal translucency area. Loss weights for the heterogeneous objectives are set automatically from gradient magnitudes rather than by hand tuning. A sympathetic reader would care because prenatal screening decisions rest on precise measurements within that narrow region, and cross-hospital domain shift often degrades performance exactly where it matters most.

Core claim

A two-phase convolutional autoencoder first learns a globally faithful 128-D latent code via MS-SSIM, then refines the NT ROI using intensity L1 and normalized Sobel-edge constraints. Loss weights are initialized via gradient-based calibration from per-term gradient magnitudes. Under hospital-wise evaluation with one site held out, this ROI refinement improves both global and measurement-relevant quality while supporting stronger generalization signals in latent probes.
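
The calibration step can be illustrated with a minimal sketch. It assumes inverse-norm scaling: each term's weight is chosen so that its weighted gradient norm matches that of a reference term. The exact rule used in the paper is not stated on this page, so the function `calibrate_weights`, the term names, and the gradient values below are all hypothetical.

```python
import math

def grad_norm(grads):
    """L2 norm of a flat list of gradient values."""
    return math.sqrt(sum(g * g for g in grads))

def calibrate_weights(per_term_grads, reference="ms_ssim", eps=1e-12):
    """Rescale every loss term so its gradient norm matches the reference term.

    ASSUMPTION: inverse-norm scaling, w_k = ||g_ref|| / ||g_k||.
    This is an illustrative rule, not the paper's confirmed formula.
    """
    norms = {name: grad_norm(g) for name, g in per_term_grads.items()}
    ref = norms[reference]
    return {name: ref / (n + eps) for name, n in norms.items()}

# Hypothetical per-term gradients with respect to shared parameters,
# as they might be measured on one calibration batch.
grads = {
    "ms_ssim": [0.3, -0.4],    # norm 0.5
    "roi_l1": [3.0, 4.0],      # norm 5.0, so it is scaled down
    "roi_edge": [0.03, 0.04],  # norm 0.05, so it is scaled up
}
weights = calibrate_weights(grads)
print(weights)
```

Under this rule a term with large raw gradients is scaled down and a term with small raw gradients is scaled up, so no single objective dominates the first Phase-2 updates.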

What carries the argument

The two-stage convolutional autoencoder that first builds a global latent code and then applies L1 intensity plus normalized Sobel-edge constraints to a pre-localized nuchal translucency region of interest, with gradient-based loss calibration.
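
To make the two ROI terms concrete, here is a small dependency-free sketch of an L1 intensity error and a normalized Sobel-edge error restricted to a box. Several details are assumptions rather than the paper's specification: the `(y0, x0, y1, x1)` box convention, normalization by the map's maximum, and applying the Sobel filter to the cropped ROI (which leaves a one-pixel zero border). A real implementation would more likely use NumPy or PyTorch tensors.

```python
def sobel_magnitude(img):
    """Sobel gradient magnitude of a 2-D list-of-lists image (border stays 0)."""
    h, w = len(img), len(img[0])
    mag = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = (img[y - 1][x + 1] + 2 * img[y][x + 1] + img[y + 1][x + 1]
                  - img[y - 1][x - 1] - 2 * img[y][x - 1] - img[y + 1][x - 1])
            gy = (img[y + 1][x - 1] + 2 * img[y + 1][x] + img[y + 1][x + 1]
                  - img[y - 1][x - 1] - 2 * img[y - 1][x] - img[y - 1][x + 1])
            mag[y][x] = (gx * gx + gy * gy) ** 0.5
    return mag

def normalize(edge_map, eps=1e-8):
    """Divide by the map's maximum. ASSUMPTION: max-normalization."""
    peak = max(max(row) for row in edge_map)
    return [[v / (peak + eps) for v in row] for row in edge_map]

def crop(img, box):
    """Cut out the ROI; box is (y0, x0, y1, x1), end-exclusive (hypothetical)."""
    y0, x0, y1, x1 = box
    return [row[x0:x1] for row in img[y0:y1]]

def mean_abs_diff(a, b):
    """Mean absolute difference between two equally sized 2-D lists."""
    n = sum(len(row) for row in a)
    return sum(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb)) / n

def roi_losses(target, recon, box):
    """Return (ROI L1 error, ROI normalized Sobel-edge error)."""
    t, r = crop(target, box), crop(recon, box)
    l1 = mean_abs_diff(t, r)
    edge = mean_abs_diff(normalize(sobel_magnitude(t)),
                         normalize(sobel_magnitude(r)))
    return l1, edge

# Toy example: a sharp vertical edge reconstructed as a flat gray patch.
target = [[0.0] * 3 + [1.0] * 3 for _ in range(6)]
recon = [[0.5] * 6 for _ in range(6)]
roi_box = (0, 0, 6, 6)
l1, edge = roi_losses(target, recon, roi_box)
print(l1, edge)
```

The toy reconstruction has the right mean intensity but no boundary at all, which is the failure case the edge term is meant to penalize separately from the intensity term.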

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same two-stage structure could be adapted to other small-feature tasks such as lesion boundary reconstruction in cross-site MRI.
Gradient-magnitude calibration may reduce the hyperparameter burden when combining reconstruction and task-specific losses in broader medical imaging pipelines.
Direct validation against automated or manual clinical measurements on larger multi-center cohorts would test whether the image-level gains translate to better screening decisions.

Load-bearing premise

The nuchal translucency region can be accurately localized beforehand and adding L1 plus edge constraints to it will preserve anatomy without creating new artifacts or measurement biases elsewhere.

What would settle it

A controlled experiment on held-out hospital data in which the second-stage refinement fails to reduce ROI measurement error (or increases it), or visibly degrades anatomy outside the ROI, would falsify the benefit of the refinement step.

Figures

Figures reproduced from arXiv: 2604.23839 by Ines Abbes, Khalid Alyafei, Mahmood Alzubaidi, Marco Agus, Mowafa Househ, Samir Brahim Belhaouari.

**Figure 1.** Overview of the Two-Phase Training Strategy. Phase 1 warms up the encoder with global structural similarity. Phase 2 freezes the global structure and refines the ROI using localized intensity (L1) and gradient (Sobel) losses, balanced via gradient calibration.

**Figure 2.** Representative samples from the three clinical sites showing the NT region (red box). Note the significant domain shift in contrast, speckle texture, and field-of-view across hospitals (Hosp-1, Hosp-2, Hosp-3).

**Figure 3.** The proposed CAE. The encoder maps the input to a 128-D latent vector z via global average pooling and a linear head. This compact z serves as the shared representation for reconstruction and downstream tasks.

**Figure 4.** Qualitative example on the held-out Hospital-3 test set. Left: original ultrasound image. Right: Phase-2 reconstruction. The NT region (green box) shows improved boundary preservation, consistent with the ROI Edge-MAE reductions reported in …

**Figure 5.** Distribution of maximum softmax confidence for linear probe predictions on unseen hospitals. Phase-2 (orange) shifts left, indicating reduced site identifiability.

**Figure 7.** Latent space generation using the Phase-2 model. Perturbation over the latent representations produces anatomically coherent images, indicating a well-structured and stable latent manifold.

**Figure 8.** Latent space visualizations for the Hospital-1/2 → Hospital-3 split (frozen encoder), with points colored by hospital site. (a) PCA projections for Phase-1 (left) and Phase-2 (right) show strong overlap across hospitals and no major change in global latent geometry after refinement. (b) UMAP projections for Phase-1 (left) and Phase-2 (right) reveal site-dependent manifold structure, with held-out Hospital-…

Original abstract

Measurement-critical ultrasound tasks often depend on a small anatomical region, making global reconstruction metrics an unreliable proxy for clinical fidelity. We propose an ROI-aware representation learning framework and instantiate it for first-trimester nuchal translucency (NT) screening under multi-hospital domain shift. A two-phase convolutional autoencoder (CAE) first learns a globally faithful 128-D latent code via MS-SSIM, then refines the NT ROI using intensity (L1) and normalized Sobel-edge constraints. To combine these heterogeneous objectives without manual tuning, we initialize loss weights via gradient-based calibration from per-term gradient magnitudes. Under strict hospital-wise evaluation with one hospital held out, ROI refinement improves both global and measurement-relevant quality: on the standard dev split it increases PSNR by +0.27 dB (val) and +0.29 dB (held-out test), reduces ROI MAE by 8.87% (val) and 6.43% (held-out test), and reduces ROI Edge-MAE by 11.10% on source hospitals and 4.90% on the unseen hospital. Beyond reconstruction, frozen-latent probes provide additional evidence of generalization: hospital provenance becomes less confidently predictable on the unseen site (0.556 to 0.541 max-softmax; 0.684 to 0.688 entropy) while OOD detection remains strong across site-held-out protocols (Mahalanobis AUROC up to 0.9956, with modest KNN gains in challenging splits). The same ROI-aware refinement principle is anatomy-agnostic and can be adopted for other fetal biometry targets (e.g., crown-rump length (CRL), nasal bone (NB)) and broader medical imaging settings where small ROIs dominate clinical decisions.
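
The two probe statistics quoted in the abstract, max-softmax confidence and predictive entropy, can be computed from classifier logits as in the sketch below. It assumes entropy is measured in nats and that the provenance probe has two classes; the helper names and the example logits are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def max_softmax(probs):
    """Confidence of the predicted class."""
    return max(probs)

def entropy(probs):
    """Shannon entropy in nats (ASSUMPTION: natural log, as a convention)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Hypothetical logits from a two-class provenance probe.
undecided = softmax([0.0, 0.0])   # probe cannot tell the sites apart
confident = softmax([4.0, 0.0])   # probe strongly prefers one site
print(max_softmax(undecided), entropy(undecided))
print(max_softmax(confident), entropy(confident))
```

Under those assumptions a two-class probe has a confidence floor of 0.5 and an entropy ceiling of ln 2 ≈ 0.693, so the reported values already sit close to the "cannot tell" end of the scale in both phases.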

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a two-stage convolutional autoencoder for fetal ultrasound reconstruction focused on nuchal translucency (NT) screening. It first learns a global 128-D latent code via MS-SSIM loss, then refines a pre-provided NT ROI using L1 intensity and normalized Sobel-edge losses whose weights are initialized by gradient-magnitude calibration. Under strict hospital-wise hold-out, ROI refinement is reported to improve PSNR by 0.27–0.29 dB, reduce ROI MAE by 6–9%, and reduce ROI Edge-MAE by 5–11%, while frozen-latent probes show modest gains in generalization and OOD detection.

Significance. If the improvements prove robust, the two-stage ROI-aware principle offers a practical way to prioritize clinically critical anatomy in reconstruction without manual loss tuning, and the anatomy-agnostic framing could transfer to other fetal biometry tasks. The hospital-wise splits and dual evaluation (reconstruction plus probe metrics) strengthen the generalization assessment.

major comments (2)

[§3] §3 (ROI Refinement Stage): the central claim that ROI refinement “preserves anatomy without creating new artifacts” rests on the untested precondition that the NT ROI is localized accurately beforehand. No ablation on ROI jitter, automatic-detector error, or manual vs. ground-truth localization is reported; even a few-pixel offset would misalign the L1 + Sobel constraints and could erase the modest reported gains or introduce edge artifacts.
[§4] §4 (Experimental Results): the headline improvements (+0.27 dB PSNR, 6–9 % ROI MAE reduction) are small and no statistical significance tests, confidence intervals, or error bars are provided. Without these, it is impossible to determine whether the gains exceed implementation variance or are sensitive to the unstated choices in baselines and training protocol.

minor comments (2)

[§3.2] Clarify the precise definition of the normalized Sobel loss and the gradient-calibration procedure for loss weights; a short equation or pseudocode would remove ambiguity.
[§4] Add explicit comparison to end-to-end alternatives that jointly learn ROI detection rather than presupposing it.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below, providing clarifications and indicating planned revisions to strengthen the manuscript.

Referee: [§3] §3 (ROI Refinement Stage): the central claim that ROI refinement “preserves anatomy without creating new artifacts” rests on the untested precondition that the NT ROI is localized accurately beforehand. No ablation on ROI jitter, automatic-detector error, or manual vs. ground-truth localization is reported; even a few-pixel offset would misalign the L1 + Sobel constraints and could erase the modest reported gains or introduce edge artifacts.

Authors: We acknowledge the validity of this point. The manuscript explicitly describes the NT ROI as pre-provided to the second stage, allowing the refinement to focus exclusively on intensity and edge preservation within the clinically critical region using gradient-calibrated loss weights. This separation is a deliberate design choice to avoid forcing the global latent code to encode fine local details. We agree that robustness to localization inaccuracies is important and was not ablated. In the revision we will expand §3 with a dedicated paragraph discussing the assumption, potential effects of small offsets on the L1/Sobel terms, and the fact that in practice ROIs are supplied by sonographers or standard detectors. We will also note that the modest reported gains are measured under accurate localization and could diminish under misalignment.

revision: partial

Referee: [§4] §4 (Experimental Results): the headline improvements (+0.27 dB PSNR, 6–9 % ROI MAE reduction) are small and no statistical significance tests, confidence intervals, or error bars are provided. Without these, it is impossible to determine whether the gains exceed implementation variance or are sensitive to the unstated choices in baselines and training protocol.

Authors: We agree that statistical support is needed to substantiate the modest but consistent gains, particularly since global PSNR changes are small while ROI-specific metrics show larger relative improvement. In the revised manuscript we will report error bars from multiple independent runs with different random seeds, include 95% confidence intervals for the key metrics (PSNR, ROI MAE, ROI Edge-MAE), and add paired statistical tests (e.g., Wilcoxon signed-rank) across the hospital-wise splits to confirm significance. We will also expand the experimental protocol section to fully document baseline implementations, hyperparameter choices, and training details.

revision: yes
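
As an illustration of the paired test named in the response, the following sketch computes an exact two-sided Wilcoxon signed-rank p-value by enumerating all sign assignments. The per-run ROI MAE values are invented for the example (in units of 1e-3) and are not results from the paper. The enumeration costs 2^n steps, so it only suits small numbers of pairs.

```python
from itertools import product

def average_ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def wilcoxon_signed_rank_exact(before, after):
    """Return (W+, exact two-sided p-value); zero differences are dropped."""
    diffs = [a - b for b, a in zip(before, after) if a != b]
    ranks = average_ranks([abs(d) for d in diffs])
    total = sum(ranks)
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    observed = abs(w_plus - total / 2)
    hits = 0
    # Under the null hypothesis every sign pattern is equally likely.
    for signs in product((0, 1), repeat=len(ranks)):
        w = sum(r for s, r in zip(signs, ranks) if s)
        if abs(w - total / 2) >= observed - 1e-12:
            hits += 1
    return w_plus, hits / 2 ** len(ranks)

# HYPOTHETICAL paired ROI MAE values (x 1e-3) for six matched runs.
phase1 = [100, 110, 95, 105, 120, 98]
phase2 = [92, 100, 90, 94, 108, 97]
w_plus, p_value = wilcoxon_signed_rank_exact(phase1, phase2)
print(w_plus, p_value)
```

One practical consequence of the exact test: with n pairs the smallest attainable two-sided p-value is 2 / 2^n, so three hospital-wise splits alone cannot go below 0.25. Pairing across seeds or across individual test images would be needed for the test to have any power.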

Circularity Check

0 steps flagged

No circularity: empirical two-stage training and hospital hold-out evaluation are independent of claimed gains

The paper describes a two-phase convolutional autoencoder trained first with MS-SSIM for a global latent code, then refined on a pre-provided NT ROI using L1 and normalized Sobel losses whose weights are set by gradient calibration. All reported improvements (+0.27 dB PSNR, ROI MAE reductions, etc.) are measured on strict hospital-wise validation and held-out test splits, including an unseen hospital. No equation, loss term, or result is shown to equal its own inputs by construction, and no self-citation supplies a load-bearing uniqueness theorem or ansatz. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Relies on standard assumptions from representation learning and image processing; no new entities postulated and no free parameters explicitly fitted beyond architecture choices.

axioms (2)

domain assumption Convolutional autoencoders with MS-SSIM can produce globally faithful latent codes for ultrasound images
Invoked for the first phase of the two-stage framework.
domain assumption L1 intensity and normalized Sobel-edge losses on the NT ROI will improve clinical measurement fidelity without harming global reconstruction
Central to the refinement phase and the claim that ROI focus is superior to global-only optimization.

pith-pipeline@v0.9.0 · 5655 in / 1396 out tokens · 48469 ms · 2026-05-08T06:23:50.834720+00:00 · methodology

Reference graph

Works this paper leans on

10 extracted references · 10 canonical work pages

[1]

Çiçek, O., Abdulkadir, A., Lienkamp, S

doi: 10.1038/s41598-025-91808-0. Çiçek, O., Abdulkadir, A., Lienkamp, S. S., Brox, T., and Ronneberger, O. 3D U-Net: Learning dense volumetric segmentation from sparse annotation. In Medical Image Computing and Computer-Assisted Intervention (MICCAI), pp. 424–432, 2016. doi: 10.1007/978-3-319-46723-8_49. Chen, J., Lu, Y., Yu, Q., Luo, X., Adeli...

work page doi:10.1038/s41598-025-91808-0 2016
[2]

Nolden, G

ISSN 1361-8415. doi: https://doi.org/10.1016/j.media.2022.102479. Cipolla, R., Gal, Y., and Kendall, A. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7482–7491, 2018. doi: 10.1109/CVPR.2018.00781. D’Alton, M. E. and Cleary-Goldman, J. First...

work page doi:10.1016/j 2022
[3]

Johnson, J., Alahi, A., and Fei-Fei, L

doi: 10.1109/CIBCB48159.2020.9277638. Johnson, J., Alahi, A., and Fei-Fei, L. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision (ECCV), pp. 694–711,

work page doi:10.1109/cibcb48159.2020.9277638 2020
[4]

In: Leibe, B., Matas, J., Sebe, N., Welling, M

doi: 10.1007/978-3-319-46475-6_43. Karimi, D. and Salcudean, S. E. Reducing the Hausdorff distance in medical image segmentation with convolutional neural networks. IEEE Transactions on Medical Imaging, 39(2):499–513, 2020. doi: 10.1109/TMI.2019.2930068. Kasera, B. et al. Deep-learning computer vision can identify increased nuchal translucency in the f...

work page doi:10.1007/978-3-319-46475-6 2020
[5]

doi: 10.1109/ICCV.2017.324. Liu, L. et al. Intelligent quality assessment of ultrasound images for fetal nuchal translucency measurement during the first trimester of pregnancy based on deep learning models. BMC Pregnancy and Childbirth, 2025. doi: 10.1186/s12884-025-07863-y. Liu, S., Wang, H., Li, Y., Li, X., Cao, G., and Cao, W. Ahu-multinet: Adaptive ...

work page doi:10.1109/iccv 2017
[6]

Sener, O

doi: 10.1038/s41598-019-52737-x. Sener, O. and Koltun, V. Multi-task learning as multi-objective optimization. In Advances in Neural Information Processing Systems (NeurIPS), volume 31, pp. 525–536, 2018. Shi, P. et al. Centerline boundary dice loss for vascular segmentation. In Medical Image Computing and Computer Assisted Intervention – MICCAI 2024....

work page doi:10.1038/s41598-019-52737-x 2018
[7]

Letterbox

extends this to unpaired settings and has been used for medical domain translation and adaptation (Sandfort et al., 2019). For ultrasound, CycleGAN-style enhancement has been combined with perceptual objectives to improve visual quality under unpaired data (Athreya et al., 2024). More recently, diffusion probabilistic models have emerged as powerful prior...

work page 2019
[8]

• Input: Frozen latent vector z ∈ ℝ^128

Linear Probe (Provenance Classification). We train a linear classifier to predict the source hospital from the latent vector z. • Input: Frozen latent vector z ∈ ℝ^128. • Model: Single linear layer nn.Linear(128, num_classes). • Classes: Hospital-1 and Hospital-2 (Seen domains). • Training: 100 epochs, Adam optimizer, learning rate 1×10⁻², weight decay 1×10⁻⁴. • Evaluat...

work page
[9]

• Mahalanobis Distance: We fit a multivariate Gaussian distribution (µ, Σ) to the training set latents

OOD Detection (Mahalanobis & KNN). We evaluate the ability to detect Out-of-Distribution (OOD) samples (Hospital-3) using the latent statistics of In-Distribution (ID) samples (Hospital-1 & 2). • Mahalanobis Distance: We fit a multivariate Gaussian distribution (µ, Σ) to the training set latents. The anomaly score for a test sample z is the Mahalanobis dista...

work page
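
Whatever anomaly score is used (Mahalanobis or KNN distance), the AUROC quoted for OOD detection reduces to a pairwise comparison of scores. The sketch below shows that computation on invented scores. The function name and the numbers are hypothetical, and the quadratic loop is chosen for clarity rather than speed.

```python
def auroc(id_scores, ood_scores):
    """Probability that a random OOD sample scores above a random ID sample.

    Ties count as half a win, which matches the usual rank-based AUROC.
    """
    wins = 0.0
    for o in ood_scores:
        for i in id_scores:
            if o > i:
                wins += 1.0
            elif o == i:
                wins += 0.5
    return wins / (len(id_scores) * len(ood_scores))

# HYPOTHETICAL anomaly scores: higher means "looks less like the seen hospitals".
id_scores = [0.1, 0.4, 0.35, 0.8]    # stand-ins for Hospital-1/2 test samples
ood_scores = [0.9, 0.7, 0.8, 0.5]    # stand-ins for Hospital-3 samples
print(auroc(id_scores, ood_scores))
```

An AUROC of 0.5 means the scores carry no information about site membership, and 1.0 means every held-out sample scores above every in-distribution sample.
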
[10]

ROI Edge Error

Quality Control (QC) Probe. We investigate if the latent space captures information about the reconstruction quality of the critical NT region. • Target: The “ROI Edge Error”, defined as the Mean Absolute Error (MAE) between the normalized Sobel magnitude maps of the original and reconstructed NT regions. Table 9. Phase-2 loss compon...

work page arXiv
