Understanding Task Aggregation for Generalizable Ultrasound Foundation Models
Pith reviewed 2026-05-25 06:37 UTC · model grok-4.3
The pith
Task aggregation in ultrasound models must weigh data scale and task type over clinical groupings.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Aggregation effectiveness depends strongly on training data scale. While clinically-grouped training can improve performance in data-rich settings, it may induce substantial negative transfer in low-data settings. In contrast, all-task unified training exhibits more consistent performance across clinical groups. Task sensitivity varies by task type, with segmentation showing the largest performance drops compared with regression and classification.
What carries the argument
M2DINO, a multi-organ, multi-task framework built on DINOv3 that inserts task-conditioned Mixture-of-Experts blocks to allocate capacity adaptively across tasks.
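As a rough illustration of what a task-conditioned Mixture-of-Experts block could look like, the sketch below mixes a few linear experts using gate weights computed from a task embedding. This page does not give the paper's routing rule, expert count, or dimensions, so everything here (the `task_moe` function, top-k gating, the residual connection, the toy sizes) is an assumption made for illustration, written in pure Python.

```python
import math
import random

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def task_moe(token, task_emb, gate_W, experts, top_k=2):
    """Hypothetical task-conditioned MoE layer (not the paper's implementation).

    Gate logits depend only on the task embedding, so every token of a
    given task is routed to the same top_k experts.
    """
    logits = matvec(gate_W, task_emb)  # one logit per expert
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    weights = softmax([logits[i] for i in top])  # renormalise over chosen experts
    mixed = [0.0] * len(token)
    for w, i in zip(weights, top):
        y = matvec(experts[i], token)  # each expert is a linear map here
        mixed = [m + w * yi for m, yi in zip(mixed, y)]
    out = [t + m for t, m in zip(token, mixed)]  # residual connection (assumed)
    return out, dict(zip(top, weights))

# Toy sizes, chosen arbitrarily for the example.
rng = random.Random(0)
dim, task_dim, n_experts = 4, 3, 4
gate_W = [[rng.uniform(-1, 1) for _ in range(task_dim)] for _ in range(n_experts)]
experts = [[[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(dim)]
           for _ in range(n_experts)]
token = [1.0, 0.5, -0.5, 2.0]
out, used = task_moe(token, [1.0, 0.0, 0.0], gate_W, experts)
```

Conditioning the gate on the task rather than on the token is only one possible design; a learned router could equally take both as input.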
If this is right
- Clinically-grouped training improves results only when training data is abundant for each group.
- All-task unified training yields more consistent outcomes across different clinical groups and data regimes.
- Segmentation tasks suffer larger performance drops from suboptimal aggregation than regression or classification tasks.
- Aggregation decisions should jointly factor in data availability and task characteristics instead of clinical taxonomy alone.
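The points above rest on the notion of negative transfer: an aggregated model scoring below the task-specific baseline on the same task. A minimal sketch of how that gap could be tabulated per task type follows. The scores are made-up placeholders, not results from the paper, the field names are hypothetical, and every metric is assumed to be oriented so that higher is better.

```python
from collections import defaultdict
from statistics import mean

# Placeholder scores (NOT from the paper); higher is assumed to be better.
results = [
    {"task": "seg_a", "type": "segmentation", "specific": 0.80, "aggregated": 0.70},
    {"task": "seg_b", "type": "segmentation", "specific": 0.75, "aggregated": 0.73},
    {"task": "cls_a", "type": "classification", "specific": 0.90, "aggregated": 0.91},
    {"task": "reg_a", "type": "regression", "specific": 0.60, "aggregated": 0.58},
]

def transfer_gap(row):
    """Negative values mean the aggregated model lost ground (negative transfer)."""
    return row["aggregated"] - row["specific"]

def mean_gap_by_type(rows):
    """Average the transfer gap over all tasks sharing a task type."""
    by_type = defaultdict(list)
    for row in rows:
        by_type[row["type"]].append(transfer_gap(row))
    return {task_type: mean(gaps) for task_type, gaps in by_type.items()}

gaps = mean_gap_by_type(results)
most_sensitive = min(gaps, key=gaps.get)  # task type with the largest average drop
```

Repeating the same table at each data scale would show whether the gap changes sign between low-data and data-rich settings.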
Where Pith is reading between the lines
- Data-scarce medical imaging domains may favor unified training over expert clinical groupings.
- The observed interaction between data scale and aggregation strategy could inform adapter design in other imaging modalities.
- Testing whether the same scale-dependent pattern appears in CT or MRI foundation models would clarify generality.
Load-bearing premise
The reported performance differences arise primarily from the choice of task aggregation strategy and its interaction with data scale rather than from unmentioned differences in data preprocessing, hyperparameter tuning, or the Mixture-of-Experts implementation.
What would settle it
Retraining the same models under identical preprocessing and hyperparameter settings while varying only the aggregation strategy and checking whether the performance gaps between grouped and unified training disappear or reverse.
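A run grid for such a retraining could be generated from one shared configuration so that only the aggregation field varies between matched runs. The sketch below is illustrative only: the setting names and values are hypothetical placeholders, not the paper's hyperparameters.

```python
from itertools import product

# Settings held fixed across all runs (hypothetical names and values).
SHARED = {
    "preprocessing": "pipeline_v1",
    "optimizer": "adamw",
    "lr_search": (1e-5, 1e-4, 1e-3),
    "moe_experts": 4,
    "moe_top_k": 2,
}
PARADIGMS = ("task_specific", "clinically_grouped", "all_task_unified")
DATA_FRACTIONS = (0.1, 1.0)  # stand-ins for low-data and data-rich regimes
SEEDS = (0, 1, 2)

def build_runs():
    """One run per (paradigm, data fraction, seed), all sharing SHARED."""
    return [
        {**SHARED, "aggregation": p, "data_fraction": f, "seed": s}
        for p, f, s in product(PARADIGMS, DATA_FRACTIONS, SEEDS)
    ]

def differing_keys(a, b):
    """Keys whose values differ between two run configs with identical keys."""
    return {k for k in a if a[k] != b[k]}

def is_matched(runs):
    """True if runs with the same data fraction and seed differ only in aggregation."""
    for a in runs:
        for b in runs:
            same_cell = (a["data_fraction"], a["seed"]) == (b["data_fraction"], b["seed"])
            if same_cell and not differing_keys(a, b) <= {"aggregation"}:
                return False
    return True

runs = build_runs()
```

Logging the output of `differing_keys` alongside each comparison would make any accidental mismatch in preprocessing or tuning visible.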
Original abstract
Foundation models promise to unify multiple clinical tasks within a single framework, but recent ultrasound studies report that unified models can underperform task-specific baselines. We hypothesize that this degradation arises not from model capacity limitations, but from task aggregation strategies that ignore interactions between task heterogeneity and available training data scale. In this work, we systematically analyze when heterogeneous ultrasound tasks can be jointly learned without performance loss, establishing practical criteria for task aggregation in unified clinical imaging models. We introduce M2DINO, a multi-organ, multi-task framework built on DINOv3 with task-conditioned Mixture-of-Experts blocks for adaptive capacity allocation. We systematically evaluate 27 ultrasound tasks spanning segmentation, classification, detection, and regression under three paradigms: task-specific, clinically-grouped, and all-task unified training. Our results show that aggregation effectiveness depends strongly on training data scale. While clinically-grouped training can improve performance in data-rich settings, it may induce substantial negative transfer in low-data settings. In contrast, all-task unified training exhibits more consistent performance across clinical groups. We further observe that task sensitivity varies by task type in our experiments: segmentation shows the largest performance drops compared with regression and classification. These findings provide practical guidance for ultrasound foundation models, emphasizing that aggregation strategies should jointly consider training data availability and task characteristics rather than relying on clinical taxonomy alone.
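The three training paradigms named in the abstract differ only in how tasks are partitioned into jointly trained sets. A small sketch of that partitioning follows, assuming each task carries a clinical-group label. The task names and groups are invented for illustration and are not the paper's 27 tasks.

```python
from collections import defaultdict

# (task name, clinical group): hypothetical examples only.
TASKS = [
    ("fetal_head_seg", "obstetric"),
    ("fetal_plane_cls", "obstetric"),
    ("lv_seg", "cardiac"),
    ("ef_regression", "cardiac"),
    ("thyroid_nodule_det", "thyroid"),
]

def partition(tasks, paradigm):
    """Return the list of task sets that would each train one model."""
    names = [name for name, _ in tasks]
    if paradigm == "task_specific":
        return [[name] for name in names]  # one model per task
    if paradigm == "clinically_grouped":
        groups = defaultdict(list)
        for name, group in tasks:
            groups[group].append(name)
        return list(groups.values())  # one model per clinical group
    if paradigm == "all_task_unified":
        return [names]  # a single model for everything
    raise ValueError(f"unknown paradigm: {paradigm}")

for paradigm in ("task_specific", "clinically_grouped", "all_task_unified"):
    print(paradigm, partition(TASKS, paradigm))
```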
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces M2DINO, a multi-organ, multi-task ultrasound foundation model based on DINOv3 augmented with task-conditioned Mixture-of-Experts blocks. It evaluates 27 tasks (segmentation, classification, detection, regression) under three aggregation paradigms (task-specific, clinically-grouped, and all-task unified training) and concludes that aggregation effectiveness depends strongly on training data scale. Clinically-grouped training can improve performance in data-rich regimes but may induce substantial negative transfer in low-data regimes, while all-task unified training yields more consistent results across clinical groups. Segmentation shows the largest performance drops, compared with regression and classification.
Significance. If the reported performance differences are shown to arise from the aggregation strategies themselves rather than confounding factors, the work supplies actionable criteria for designing unified ultrasound models by jointly considering data scale and task type. The M2DINO architecture with adaptive MoE capacity allocation represents a concrete technical contribution that could be adopted in future multi-task imaging frameworks.
Major comments (2)
- [Abstract / Methods] Abstract and Methods: the central claim that performance differences arise from the choice of task aggregation strategy and its interaction with data scale is not supported by any quantitative results, error bars, statistical tests, or controls in the abstract; the experimental description supplies no explicit statement that data preprocessing pipelines, optimizer schedules, learning-rate searches, or MoE routing/expert allocation were held identical across the three paradigms.
- [Experiments] Experimental evaluation: without matched controls on preprocessing, hyperparameter tuning, and MoE implementation details, the observed interaction between clinically-grouped training and data scale (positive in data-rich, negative transfer in low-data) cannot be attributed to aggregation strategy rather than unequal optimization effort; this directly undermines the strongest claim.
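The statistical test the referee asks for could be as simple as a paired sign-flip permutation test over per-task score differences between two paradigms. The sketch below is an assumption about what such a check might look like, not a procedure taken from the paper, and the example differences are placeholders rather than reported results.

```python
import random
from statistics import mean

def sign_flip_pvalue(diffs, n_perm=10000, seed=0):
    """Two-sided paired permutation test on the mean per-task difference.

    Under the null hypothesis that the two paradigms are exchangeable,
    the sign of each paired difference is equally likely to be + or -.
    """
    rng = random.Random(seed)
    observed = abs(mean(diffs))
    hits = 0
    for _ in range(n_perm):
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(mean(flipped)) >= observed - 1e-12:  # tolerance for float ties
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one correction keeps p above zero

# Placeholder differences (grouped minus task-specific), one entry per task.
diffs = [-0.10, -0.02, 0.01, -0.02, -0.05, -0.03, -0.04, 0.00]
p = sign_flip_pvalue(diffs)
```

Running the test separately within each data regime would address the claimed interaction with data scale rather than only the overall difference.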
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the major comments below regarding support for claims and experimental controls.
Point-by-point responses
- Referee: [Abstract / Methods] Abstract and Methods: the central claim that performance differences arise from the choice of task aggregation strategy and its interaction with data scale is not supported by any quantitative results, error bars, statistical tests, or controls in the abstract; the experimental description supplies no explicit statement that data preprocessing pipelines, optimizer schedules, learning-rate searches, or MoE routing/expert allocation were held identical across the three paradigms.
  Authors: The abstract is a concise summary and does not include detailed quantitative results or statistical tests, which is standard practice. The full manuscript reports performance metrics for all 27 tasks under the three paradigms. We agree an explicit statement on controls is missing from the experimental description and will add it to the Methods section, confirming identical preprocessing pipelines, optimizer schedules, learning-rate searches, and MoE routing/expert allocation across paradigms.
  Revision: yes
- Referee: [Experiments] Experimental evaluation: without matched controls on preprocessing, hyperparameter tuning, and MoE implementation details, the observed interaction between clinically-grouped training and data scale (positive in data-rich, negative transfer in low-data) cannot be attributed to aggregation strategy rather than unequal optimization effort; this directly undermines the strongest claim.
  Authors: Matched controls were used throughout: identical data preprocessing, hyperparameter tuning procedures, and MoE implementation details were applied to all three training paradigms to isolate the effect of aggregation strategy. We will add an explicit statement documenting these controls in the revised Methods section.
  Revision: yes
Circularity Check
No circularity: experimental comparisons rest on independent benchmarks, not self-referential definitions or fitted predictions.
Full rationale
The paper reports empirical results from training and evaluating M2DINO on 27 ultrasound tasks under three paradigms (task-specific, clinically-grouped, all-task unified). No equations, fitted parameters, or derivations are presented that reduce to their own inputs. Performance differences are attributed to data scale and task aggregation via direct experimental comparison; the abstract and reader's summary show no self-definitional constructs, no load-bearing self-citations for uniqueness theorems, and no renaming of known results as novel derivations. The central claims remain falsifiable against external benchmarks and do not hold merely by construction.