Defining Robust Ultrasound Quality Metrics via an Ultrasound Foundation Model
Pith reviewed 2026-05-10 01:07 UTC · model grok-4.3
The pith
A TinyUSFM foundation model supplies ultrasound quality metrics that align with clinical task performance and expert preference where PSNR and VGG-LPIPS do not.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The work trains TinyUSFM on ultrasound data and derives two metrics from it: TinyUSFM-uLPIPS, built on multi-layer token relations, and TinyUSFM-NRQ, built on clean-manifold modeling with worst-region aggregation. It claims four advantages for these metrics: better calibration to semantic task damage such as Dice-score drops, stable scoring across anatomical sites and domain shifts, consistency with PSNR in the absence of ground truth, and an improvement in expert-preference prediction from 47.2% to 72.8% accuracy. On this basis it proposes a modality-aligned standard that links algorithmic output to diagnostic value.
What carries the argument
TinyUSFM, a compact ultrasound foundation model whose learned feature space supplies distances for the full-reference metric TinyUSFM-uLPIPS and manifold deviations for the no-reference metric TinyUSFM-NRQ.
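To make "distances from multi-layer token relations" concrete, here is a minimal sketch of one way such a full-reference distance could be computed. It is an assumption-laden illustration, not the paper's TinyUSFM-uLPIPS formula, which the abstract does not specify. The function names `token_relation_matrix` and `token_relation_distance` are hypothetical, cosine similarity is an assumed choice of relation, and random arrays stand in for encoder token features.

```python
import numpy as np


def token_relation_matrix(tokens: np.ndarray) -> np.ndarray:
    """Cosine similarity between every pair of tokens in one layer.

    tokens has shape (num_tokens, dim); the result is (num_tokens, num_tokens).
    """
    norms = np.linalg.norm(tokens, axis=1, keepdims=True)
    unit = tokens / np.maximum(norms, 1e-8)  # guard against zero-length tokens
    return unit @ unit.T


def token_relation_distance(ref_layers, test_layers) -> float:
    """Mean absolute difference of relation matrices, averaged over layers.

    Assumed stand-in for a token-relation perceptual distance; the actual
    metric in the paper may weight or normalize layers differently.
    """
    per_layer = [
        np.abs(token_relation_matrix(r) - token_relation_matrix(t)).mean()
        for r, t in zip(ref_layers, test_layers)
    ]
    return float(np.mean(per_layer))


# Random arrays stand in for encoder outputs: 3 layers, 16 tokens, 32 dims.
rng = np.random.default_rng(0)
ref = [rng.normal(size=(16, 32)) for _ in range(3)]
noisy = [layer + 0.5 * rng.normal(size=layer.shape) for layer in ref]

d_same = token_relation_distance(ref, ref)     # identical inputs give zero
d_noisy = token_relation_distance(ref, noisy)  # perturbed inputs give > 0
```

Comparing relations between tokens, rather than the tokens themselves, makes the distance depend on how image regions relate to one another, which is one plausible reading of why such a metric might track structural damage better than pixel error.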
If this is right
- Image reconstructions can be optimized in a closed loop using the new metrics to maximize downstream task performance rather than pixel similarity.
- No-reference quality scoring becomes feasible while remaining consistent with traditional fidelity measures such as PSNR.
- Quality rankings stay comparable and stable when the same metric is applied to images from different anatomical sites or acquisition devices.
- Automated selection or enhancement of ultrasound images can achieve higher agreement with sonographer judgment.
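For reference, the pixel-fidelity baseline named in these implications has a standard definition: PSNR = 10 log10(MAX^2 / MSE). The sketch below computes it with NumPy. The function name `psnr` and the 8-bit default peak value of 255 are choices made here for illustration, and the toy arrays are not ultrasound data.

```python
import numpy as np


def psnr(reference, test, max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two same-shaped images."""
    reference = np.asarray(reference, dtype=np.float64)
    test = np.asarray(test, dtype=np.float64)
    mse = np.mean((reference - test) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))


# Toy example: a constant offset of 5 gray levels gives MSE = 25.
ref_image = np.full((4, 4), 100.0)
shifted = ref_image + 5.0
value = psnr(ref_image, shifted)  # 10 * log10(255^2 / 25), about 34.15 dB
```

PSNR needs the reference image, which is why a no-reference score that stays consistent with it would be useful at deployment time.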
Where Pith is reading between the lines
- The same foundation-model approach could be repeated for other modalities once comparable small foundation models exist, allowing cross-modality quality standards.
- Deploying TinyUSFM-NRQ on real-time scanners could provide immediate feedback on image adequacy before diagnosis.
- Further validation on larger multi-center datasets would test whether the observed gains in expert-preference prediction generalize beyond the reported experiments.
Load-bearing premise
The TinyUSFM model has learned representations whose distances and deviations in feature space correspond to clinically relevant quality differences in ultrasound images across organs and scanners.
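The "manifold deviation" half of this premise can be illustrated with a small sketch. It assumes a single Gaussian fitted to clean patch features as the clean-manifold model, Mahalanobis distance as the deviation, and the mean of the worst 10% of patches as the aggregation. All three are assumptions made here; the abstract does not say how TinyUSFM-NRQ implements them, and the helper names are hypothetical.

```python
import numpy as np


def fit_clean_manifold(clean_tokens: np.ndarray):
    """Fit a Gaussian (mean, inverse covariance) to clean patch features."""
    mean = clean_tokens.mean(axis=0)
    cov = np.cov(clean_tokens, rowvar=False)
    cov += 1e-3 * np.eye(cov.shape[0])  # small ridge keeps the inverse stable
    return mean, np.linalg.inv(cov)


def patch_deviations(tokens: np.ndarray, mean, inv_cov) -> np.ndarray:
    """Mahalanobis distance of each patch feature from the clean model."""
    centered = tokens - mean
    return np.sqrt(np.einsum("nd,de,ne->n", centered, inv_cov, centered))


def worst_region_score(deviations: np.ndarray, top_fraction: float = 0.1) -> float:
    """Average deviation over the worst `top_fraction` of patches."""
    k = max(1, int(np.ceil(top_fraction * deviations.size)))
    return float(np.sort(deviations)[-k:].mean())


# Synthetic features stand in for encoder patch tokens (8 dims each).
rng = np.random.default_rng(0)
mean, inv_cov = fit_clean_manifold(rng.normal(size=(500, 8)))

image_ok = rng.normal(size=(64, 8))
image_bad = image_ok.copy()
image_bad[:4] += 6.0  # localized artifact affecting 4 of 64 patches

dev_ok = patch_deviations(image_ok, mean, inv_cov)
dev_bad = patch_deviations(image_bad, mean, inv_cov)
score_ok, score_bad = worst_region_score(dev_ok), worst_region_score(dev_bad)
mean_ok, mean_bad = float(dev_ok.mean()), float(dev_bad.mean())
```

In this toy setup the worst-region score rises much more than the global mean when only a few patches are corrupted, which matches the stated motivation of detecting localized harmful artifacts.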
What would settle it
A dataset of ultrasound reconstructions with varied organs, scanners, and expert ratings where TinyUSFM-uLPIPS fails to correlate more strongly with Dice-score changes than VGG-LPIPS, or where TinyUSFM-NRQ rankings diverge from both PSNR and sonographer preference.
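The test described here can be sketched as a rank-correlation comparison. The example assumes Spearman correlation as the measure of agreement, which is a choice made here and not confirmed by the source, and it uses made-up toy numbers, not results from the paper. `dice_score` and `spearman_rho` are illustrative helpers.

```python
import numpy as np


def dice_score(pred, target) -> float:
    """Dice overlap between two binary masks (1.0 when both are empty)."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    denom = pred.sum() + target.sum()
    if denom == 0:
        return 1.0
    return float(2.0 * np.logical_and(pred, target).sum() / denom)


def spearman_rho(x, y) -> float:
    """Spearman rank correlation; this simple version assumes no ties."""
    rank_x = np.argsort(np.argsort(x)).astype(float)
    rank_y = np.argsort(np.argsort(y)).astype(float)
    return float(np.corrcoef(rank_x, rank_y)[0, 1])


# Toy values only: two candidate metrics scored on five degraded images,
# plus the Dice drop a segmenter suffers on each image.
metric_a = [0.05, 0.10, 0.20, 0.35, 0.50]
metric_b = [0.30, 0.10, 0.40, 0.20, 0.25]
dice_drop = [0.01, 0.03, 0.08, 0.15, 0.30]

rho_a = spearman_rho(metric_a, dice_drop)  # ranks agree perfectly
rho_b = spearman_rho(metric_b, dice_drop)  # ranks mostly disagree

overlap = dice_score([[1, 1], [0, 0]], [[1, 0], [0, 0]])  # 2*1 / (2+1)
```

Under this reading, the claim would be in trouble if, on real data, the VGG-based metric reached a correlation with Dice drops at least as high as the TinyUSFM-based one.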
Original abstract
Clinicians lack a principled framework to quantify diagnostic utility in ultrasound reconstructions. Existing standards like PSNR and VGG-LPIPS are inadequate, failing to account for modality-specific physics or the structural nuances of acoustic imaging. We close this gap with a TinyUSFM-based evaluation framework featuring two distinct metrics: TinyUSFM-uLPIPS, a full-reference perceptual distance based on multi-layer token relations, and TinyUSFM-NRQ, a deployable no-reference quality score utilizing clean-manifold modeling and worst-region aggregation to detect localized harmful artifacts. We demonstrate that the presented metrics have four unique advantages: 1) Task-linked quality, where TinyUSFM-uLPIPS achieves superior calibration with semantic task damage, accurately reflecting Dice-score drops in segmentation where VGG-based metrics fail; 2) Cross-organ comparability, maintaining stable scoring scales and consistent severity rankings across diverse anatomical sites and domain-shifted data; 3) PSNR-consistent sensitivity, with TinyUSFM-NRQ providing a reliable quality score without ground-truth images that remains consistent with traditional fidelity benchmarks (i.e., PSNR); and 4) Clinical utility, improving the prediction of expert preference from 47.2% to 72.8% accuracy and producing super-resolution reconstructions preferred by sonographers. By integrating these advantages into a unified assessment and optimization loop, this work establishes a modality-aligned standard that finally bridges the gap between algorithmic performance and diagnostic utility.
https://github.com/sextant-fable/US-Metrics
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to define robust ultrasound quality metrics using an Ultrasound Foundation Model (TinyUSFM). It introduces TinyUSFM-uLPIPS as a full-reference perceptual distance metric based on multi-layer token relations and TinyUSFM-NRQ as a no-reference quality score using clean-manifold modeling and worst-region aggregation. The metrics are said to offer task-linked quality by better correlating with Dice score drops in segmentation tasks, cross-organ comparability with stable scoring across anatomical sites, PSNR-consistent sensitivity for no-reference use, and clinical utility by improving expert preference prediction accuracy to 72.8%. The work positions these as a modality-aligned standard bridging algorithmic performance and diagnostic utility, with code available on GitHub.
Significance. If the central assumption holds, namely that the TinyUSFM feature space faithfully encodes ultrasound-specific physics and clinical quality nuances, the proposed metrics could provide a valuable, unified framework for assessing and optimizing ultrasound reconstructions and super-resolution methods. This would address limitations of generic metrics like PSNR and VGG-LPIPS. The emphasis on clinical validation through expert preferences is a strength, as is the open-source release. However, the overall significance is currently limited by two things: the abstract offers insufficient evidence for the claims, and the metrics may reflect model training artifacts rather than true modality alignment.
major comments (3)
- [Abstract] The claim that TinyUSFM-uLPIPS 'achieves superior calibration with semantic task damage, accurately reflecting Dice-score drops in segmentation where VGG-based metrics fail' is central to the task-linked quality advantage but is presented without supporting equations, experimental details, or quantitative results (e.g., correlation coefficients); this undermines the ability to evaluate if the metric is load-bearing for the paper's conclusions.
- [Abstract] All four advantages presuppose that multi-layer token relations and clean-manifold deviations in TinyUSFM encode acoustic imaging phenomena (speckle, attenuation, reverberation) and clinical quality rather than generic image statistics; no tests for this (e.g., against physics-based degradations or unseen scanners) are described, posing a correctness risk to the cross-organ and modality-alignment claims.
- [Abstract] The clinical utility is quantified as improving expert preference prediction from 47.2% to 72.8% accuracy, but without details on the evaluation protocol, sample size, or statistical significance, it is difficult to assess the robustness of this result which is key to the paper's closing claim.
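The third comment asks for sample size and statistical significance behind the preference-prediction accuracy. A minimal sketch of what such a check could look like follows, assuming pairwise comparisons in which the metric "predicts" the image with the lower score and the expert picks one image per pair. The protocol, the helper names, and the four toy pairs are all assumptions; the source gives no sample size, so nothing here evaluates the reported 72.8% figure.

```python
from math import comb


def preference_accuracy(score_pairs, expert_choices) -> float:
    """Fraction of pairs where the lower-scored image is the expert's pick.

    score_pairs: list of (score_a, score_b), lower meaning better quality.
    expert_choices: list of "a" or "b", one per pair.
    """
    correct = 0
    for (score_a, score_b), choice in zip(score_pairs, expert_choices):
        predicted = "a" if score_a < score_b else "b"
        correct += predicted == choice
    return correct / len(expert_choices)


def binomial_two_sided_p(successes: int, trials: int, chance: float = 0.5) -> float:
    """Exact two-sided binomial p-value against a chance-level predictor.

    Sums the probability of every outcome no more likely than the observed one.
    """
    pmf = [
        comb(trials, k) * chance ** k * (1 - chance) ** (trials - k)
        for k in range(trials + 1)
    ]
    observed = pmf[successes]
    return min(1.0, sum(p for p in pmf if p <= observed * (1 + 1e-9)))


# Toy data only: four image pairs with metric scores and expert picks.
pairs = [(0.1, 0.4), (0.3, 0.2), (0.5, 0.6), (0.7, 0.2)]
choices = ["a", "b", "b", "b"]

accuracy = preference_accuracy(pairs, choices)  # 3 of 4 pairs agree
p_value = binomial_two_sided_p(3, 4)            # far from significant at n = 4
```

The toy case shows why the referee's request matters: a high-looking accuracy on few pairs is compatible with chance, so the number of rated pairs has to accompany the headline percentage.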
minor comments (1)
- [Abstract] Consider expanding the acronym TinyUSFM upon first use for clarity.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which correctly note that the abstract would be strengthened by including more quantitative details and experimental context for the central claims. We will revise the abstract accordingly to address these points. Below we respond point-by-point to the major comments.
- Referee: [Abstract] The claim that TinyUSFM-uLPIPS 'achieves superior calibration with semantic task damage, accurately reflecting Dice-score drops in segmentation where VGG-based metrics fail' is central to the task-linked quality advantage but is presented without supporting equations, experimental details, or quantitative results (e.g., correlation coefficients); this undermines the ability to evaluate if the metric is load-bearing for the paper's conclusions.
  Authors: The full manuscript provides the experimental protocol, segmentation task details, and quantitative correlation results (including direct comparisons to VGG-LPIPS) in the results section. We agree the abstract would benefit from including the key correlation coefficients to make this claim more self-contained. We will revise the abstract to incorporate these quantitative results and a brief reference to the segmentation evaluation.
  Revision: yes
- Referee: [Abstract] All four advantages presuppose that multi-layer token relations and clean-manifold deviations in TinyUSFM encode acoustic imaging phenomena (speckle, attenuation, reverberation) and clinical quality rather than generic image statistics; no tests for this (e.g., against physics-based degradations or unseen scanners) are described, posing a correctness risk to the cross-organ and modality-alignment claims.
  Authors: The cross-organ comparability and domain-shift robustness (including unseen scanners) are validated through experiments on multiple anatomical sites and domain-shifted ultrasound datasets, as reported in Sections 4.2 and 4.3. We will revise the abstract to briefly note the use of diverse, domain-shifted data supporting these claims.
  Revision: yes
- Referee: [Abstract] The clinical utility is quantified as improving expert preference prediction from 47.2% to 72.8% accuracy, but without details on the evaluation protocol, sample size, or statistical significance, it is difficult to assess the robustness of this result which is key to the paper's closing claim.
  Authors: The expert preference study protocol, sample size, and statistical analysis are detailed in Section 5. We will revise the abstract to include the sample size and note the statistical significance of the accuracy improvement to 72.8%.
  Revision: yes
Circularity Check
No significant circularity; metrics empirically validated against external clinical and task benchmarks
The paper defines TinyUSFM-uLPIPS and TinyUSFM-NRQ using features from the TinyUSFM foundation model, then demonstrates four advantages through direct comparisons to independent external measures: calibration against Dice-score drops in segmentation, stable rankings across organs and domain shifts, consistency with PSNR, and improved expert preference prediction (47.2% to 72.8%). These validations rely on task performance, fidelity benchmarks, and human judgments that are not derived from the model's internal distances or manifold by construction. No load-bearing self-citations, self-definitional reductions, or fitted inputs renamed as predictions appear in the derivation chain; the central claims rest on observable correlations rather than tautological equivalence to the model inputs.