arxiv: 2603.18123 · v3 · pith:N7MQOJTQnew · submitted 2026-03-18 · 📡 eess.IV · cs.AI

Understanding Task Aggregation for Generalizable Ultrasound Foundation Models

Pith reviewed 2026-05-25 06:37 UTC · model grok-4.3

classification 📡 eess.IV cs.AI
keywords: ultrasound, foundation models, task aggregation, multi-task learning, mixture of experts, medical imaging, segmentation, classification

The pith

Task aggregation in ultrasound models must weigh data scale and task type over clinical groupings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines when multiple ultrasound tasks can be trained together in one model without performance loss. It compares task-specific models, clinically grouped training, and all-task unified training across 27 tasks using a new framework called M2DINO. Results show that clinically grouped training boosts results only when data is plentiful but causes clear negative transfer when data is scarce. Unified training across all tasks delivers more stable performance regardless of clinical group. Segmentation tasks prove more sensitive to these choices than regression or classification tasks.

Core claim

Aggregation effectiveness depends strongly on training data scale. While clinically-grouped training can improve performance in data-rich settings, it may induce substantial negative transfer in low-data settings. In contrast, all-task unified training exhibits more consistent performance across clinical groups. Task sensitivity varies by task type, with segmentation showing the largest performance drops compared with regression and classification.

What carries the argument

M2DINO, a multi-organ multi-task framework on DINOv3 that inserts task-conditioned Mixture-of-Experts blocks to allocate capacity adaptively across tasks.
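The summary does not say how the task-conditioned Mixture-of-Experts blocks route capacity, so the following is only a minimal sketch under stated assumptions: a dense mixture of linear experts whose gate weights come from a per-task logit vector rather than from the token. The class name `TaskConditionedMoE`, the dense (non-sparse) routing, and the residual connection are all illustrative choices, not details taken from the paper.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class TaskConditionedMoE:
    """Toy dense MoE block whose gate depends on the task id (assumption)."""

    def __init__(self, dim, num_experts, num_tasks, seed=0):
        rng = random.Random(seed)

        def rand_matrix(rows, cols):
            return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

        # One linear expert per slot; real experts would typically be small MLPs.
        self.experts = [rand_matrix(dim, dim) for _ in range(num_experts)]
        # One gate-logit vector per task: this is the "task conditioning".
        self.task_gate_logits = rand_matrix(num_tasks, num_experts)

    def gate(self, task_id):
        """Mixture weights over experts for the given task."""
        return softmax(self.task_gate_logits[task_id])

    def forward(self, token, task_id):
        weights = self.gate(task_id)
        outputs = [matvec(expert, token) for expert in self.experts]
        mixed = [
            sum(w * out[i] for w, out in zip(weights, outputs))
            for i in range(len(token))
        ]
        # Residual connection around the expert mixture.
        return [t + m for t, m in zip(token, mixed)]


# Usage: the same token is mixed differently depending on the task id.
block = TaskConditionedMoE(dim=4, num_experts=3, num_tasks=2)
token = [1.0, 0.0, -1.0, 0.5]
out = block.forward(token, task_id=1)
```

Conditioning the gate on the task instead of the token means every token of a given task shares one expert mixture, which is one simple way to read "allocate capacity adaptively across tasks".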

If this is right

  • Clinically-grouped training improves results only when training data is abundant for each group.
  • All-task unified training yields more consistent outcomes across different clinical groups and data regimes.
  • Segmentation tasks suffer larger performance drops from suboptimal aggregation than regression or classification tasks.
  • Aggregation decisions should jointly factor in data availability and task characteristics instead of clinical taxonomy alone.
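To make the three paradigms concrete, here is a small sketch of how tasks might be partitioned into jointly trained models under each one. The task names, clinical groups, and the `build_training_groups` helper are hypothetical; the paper's actual 27-task registry is not reproduced here.

```python
from collections import defaultdict

# Hypothetical task registry: (task name, clinical group, task type).
TASKS = [
    ("breast_lesion_seg", "breast", "segmentation"),
    ("breast_malignancy_cls", "breast", "classification"),
    ("lung_pattern_cls", "lung", "classification"),
    ("fetal_head_reg", "fetal", "regression"),
]


def build_training_groups(tasks, paradigm):
    """Return {model_name: [task names trained jointly in that model]}."""
    if paradigm == "task_specific":
        # One model per task.
        return {name: [name] for name, _, _ in tasks}
    if paradigm == "clinically_grouped":
        # One model per clinical group.
        groups = defaultdict(list)
        for name, clinical_group, _ in tasks:
            groups[clinical_group].append(name)
        return dict(groups)
    if paradigm == "all_task_unified":
        # A single model trained on every task.
        return {"unified": [name for name, _, _ in tasks]}
    raise ValueError(f"unknown paradigm: {paradigm}")


for paradigm in ("task_specific", "clinically_grouped", "all_task_unified"):
    print(paradigm, build_training_groups(TASKS, paradigm))
```

Under this framing the architecture stays the same and only the set of tasks sharing one set of weights changes, which is the variable the paper's comparison turns on.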

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Data-scarce medical imaging domains may favor unified training over expert clinical groupings.
  • The observed interaction between data scale and aggregation strategy could inform adapter design in other imaging modalities.
  • Testing whether the same scale-dependent pattern appears in CT or MRI foundation models would clarify generality.

Load-bearing premise

The reported performance differences arise primarily from the choice of task aggregation strategy and its interaction with data scale rather than from unmentioned differences in data preprocessing, hyperparameter tuning, or the Mixture-of-Experts implementation.

What would settle it

Retraining the same models under identical preprocessing and hyperparameter settings while varying only the aggregation strategy and checking whether the performance gaps between grouped and unified training disappear or reverse.
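One way such a controlled rerun could be set up is sketched below: every run shares a single frozen configuration, and a check fails if anything other than the aggregation strategy and the random seed differs between runs. The field names and default values (`preprocessing`, `learning_rate`, `epochs`, `num_experts`) are placeholders chosen for illustration, not settings reported by the authors.

```python
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class RunConfig:
    aggregation: str                              # intended to vary
    preprocessing: str = "resize_224_normalize"   # hypothetical pipeline name
    learning_rate: float = 1e-4                   # placeholder value
    epochs: int = 50                              # placeholder value
    num_experts: int = 4                          # placeholder value
    seed: int = 0                                 # varied only for repeats


def matched_runs(base, aggregations, seeds):
    """Enumerate runs that differ from `base` only in aggregation and seed."""
    return [replace(base, aggregation=a, seed=s) for a in aggregations for s in seeds]


def varying_fields(runs):
    """Names of config fields that take more than one value across runs."""
    dicts = [asdict(r) for r in runs]
    return {key for key in dicts[0] if len({d[key] for d in dicts}) > 1}


runs = matched_runs(
    RunConfig(aggregation="task_specific"),
    aggregations=["task_specific", "clinically_grouped", "all_task_unified"],
    seeds=[0, 1, 2],
)
# Fails loudly if any field other than the intended two drifts between runs.
assert varying_fields(runs) == {"aggregation", "seed"}
```

Logging the output of `varying_fields` alongside results would give readers a direct record that preprocessing and tuning were held fixed.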

Figures

Figures reproduced from arXiv: 2603.18123 by Amelia Jiménez-Sánchez, Fangyijie Wang, Guénolé Silvestre, Jieyun Bai, Karim Lekadir, Kathleen M. Curran, Tanya Akumu, Vien Ngoc Dang.

Figure 1: Overview of our M2DINO framework. (a) Ultrasound images are processed by a shared DINOv3 encoder augmented with task-conditioned MoE blocks. The unified representation is optimized for segmentation, detection, regression, and classification via task-specific prediction heads. Frozen and trainable components are indicated. (b) A conceptual comparison of the three training paradigms. Although the architectur…

Figure 2: Absolute performance of TS, CG, and AU training paradigms across representative tasks: segmentation (DSC ↑), classification (AUC ↑), and regression (MRE ↓). Abd: Abdomen; MO: Multi-organ.
… multi-organ classification, CG and AU yield small, modest gains. However, the Breast and Lung groups exhibit different trends. AU improves lung classification (AUC: 0.396 → 0.525). In contrast, CG shows large performance…

Figure 3: Relative performance change (∆, %) with respect to TS.
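Figure 3 reports relative change with respect to TS. The exact formula is not given on this page, so the sketch below assumes the usual definition, (value − baseline) / |baseline| × 100, with the sign flipped for lower-is-better metrics such as MRE so that positive always means improvement. The function name and the sign convention are assumptions.

```python
def relative_change_pct(baseline, value, higher_is_better=True):
    """Signed % change versus the baseline; positive always means 'better'."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    change = (value - baseline) / abs(baseline) * 100.0
    return change if higher_is_better else -change


# Lung classification AUC quoted in the Figure 2 text: 0.396 -> 0.525 under AU,
# treating 0.396 as the baseline (an assumption about what the arrow denotes).
lung_auc_gain = relative_change_pct(0.396, 0.525)
print(f"{lung_auc_gain:.1f}%")  # prints 32.6%
```

A relative scale makes tasks with different metrics comparable on one axis, but it also inflates changes for tasks whose baseline is low, which is worth keeping in mind when reading large percentage swings.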
Original abstract

Foundation models promise to unify multiple clinical tasks within a single framework, but recent ultrasound studies report that unified models can underperform task-specific baselines. We hypothesize that this degradation arises not from model capacity limitations, but from task aggregation strategies that ignore interactions between task heterogeneity and available training data scale. In this work, we systematically analyze when heterogeneous ultrasound tasks can be jointly learned without performance loss, establishing practical criteria for task aggregation in unified clinical imaging models. We introduce M2DINO, a multi-organ, multi-task framework built on DINOv3 with task-conditioned Mixture-of-Experts blocks for adaptive capacity allocation. We systematically evaluate 27 ultrasound tasks spanning segmentation, classification, detection, and regression under three paradigms: task-specific, clinically-grouped, and all-task unified training. Our results show that aggregation effectiveness depends strongly on training data scale. While clinically-grouped training can improve performance in data-rich settings, it may induce substantial negative transfer in low-data settings. In contrast, all-task unified training exhibits more consistent performance across clinical groups. We further observe that task sensitivity varies by task type in our experiments: segmentation shows the largest performance drops compared with regression and classification. These findings provide practical guidance for ultrasound foundation models, emphasizing that aggregation strategies should jointly consider training data availability and task characteristics rather than relying on clinical taxonomy alone.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces M2DINO, a multi-organ multi-task ultrasound foundation model based on DINOv3 augmented with task-conditioned Mixture-of-Experts blocks. It evaluates 27 tasks (segmentation, classification, detection, regression) across three aggregation paradigms—task-specific, clinically-grouped, and all-task unified training—and concludes that aggregation effectiveness depends strongly on training data scale: clinically-grouped training can improve performance in data-rich regimes but induces negative transfer in low-data regimes, while all-task unified training yields more consistent results; segmentation tasks are most sensitive to aggregation.

Significance. If the reported performance differences are shown to arise from the aggregation strategies themselves rather than confounding factors, the work supplies actionable criteria for designing unified ultrasound models by jointly considering data scale and task type. The M2DINO architecture with adaptive MoE capacity allocation represents a concrete technical contribution that could be adopted in future multi-task imaging frameworks.

major comments (2)
  1. [Abstract / Methods] Abstract and Methods: the central claim that performance differences arise from the choice of task aggregation strategy and its interaction with data scale is not supported by any quantitative results, error bars, statistical tests, or controls in the abstract; the experimental description supplies no explicit statement that data preprocessing pipelines, optimizer schedules, learning-rate searches, or MoE routing/expert allocation were held identical across the three paradigms.
  2. [Experiments] Experimental evaluation: without matched controls on preprocessing, hyperparameter tuning, and MoE implementation details, the observed interaction between clinically-grouped training and data scale (positive in data-rich, negative transfer in low-data) cannot be attributed to aggregation strategy rather than unequal optimization effort; this directly undermines the strongest claim.
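As an illustration of the kind of uncertainty estimate the referee asks for, the sketch below computes a paired bootstrap percentile interval for the mean per-case difference between two paradigms on the same test set. The scores are made up for the example, and the choice of a paired bootstrap (rather than, say, a permutation test or seed-level variance) is an assumption; the paper does not state which test it would use.

```python
import random
import statistics


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile CI for mean(scores_b - scores_a) over paired test cases."""
    if len(scores_a) != len(scores_b):
        raise ValueError("scores must be paired case-by-case")
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    rng = random.Random(seed)
    # Resample the paired differences with replacement and record each mean.
    means = sorted(
        statistics.fmean(rng.choices(diffs, k=len(diffs)))
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.fmean(diffs), lo, hi


# Made-up per-case Dice scores for one task under two paradigms.
ts_scores = [0.81, 0.79, 0.85, 0.80, 0.78, 0.83, 0.82, 0.77]
cg_scores = [0.74, 0.75, 0.80, 0.72, 0.76, 0.78, 0.75, 0.70]
mean_diff, ci_low, ci_high = paired_bootstrap_ci(ts_scores, cg_scores)
print(f"CG - TS: {mean_diff:+.4f} (95% CI {ci_low:+.4f} to {ci_high:+.4f})")
```

An interval that excludes zero would support calling a drop "negative transfer" rather than noise; an interval spanning zero would not.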

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the major comments below regarding support for claims and experimental controls.

Point-by-point responses
  1. Referee: [Abstract / Methods] Abstract and Methods: the central claim that performance differences arise from the choice of task aggregation strategy and its interaction with data scale is not supported by any quantitative results, error bars, statistical tests, or controls in the abstract; the experimental description supplies no explicit statement that data preprocessing pipelines, optimizer schedules, learning-rate searches, or MoE routing/expert allocation were held identical across the three paradigms.

    Authors: The abstract is a concise summary and does not include detailed quantitative results or statistical tests, which is standard practice. The full manuscript reports performance metrics for all 27 tasks under the three paradigms. We agree an explicit statement on controls is missing from the experimental description and will add it to the Methods section, confirming identical preprocessing pipelines, optimizer schedules, learning-rate searches, and MoE routing/expert allocation across paradigms. revision: yes

  2. Referee: [Experiments] Experimental evaluation: without matched controls on preprocessing, hyperparameter tuning, and MoE implementation details, the observed interaction between clinically-grouped training and data scale (positive in data-rich, negative transfer in low-data) cannot be attributed to aggregation strategy rather than unequal optimization effort; this directly undermines the strongest claim.

    Authors: Matched controls were used throughout: identical data preprocessing, hyperparameter tuning procedures, and MoE implementation details were applied to all three training paradigms to isolate the effect of aggregation strategy. We will add an explicit statement documenting these controls in the revised Methods section. revision: yes

Circularity Check

0 steps flagged

No circularity: experimental comparisons rest on independent benchmarks, not self-referential definitions or fitted predictions.

full rationale

The paper reports empirical results from training and evaluating M2DINO on 27 ultrasound tasks under three paradigms (task-specific, clinically-grouped, all-task unified). It presents no equations, fitted parameters, or derivations that reduce to their own inputs. Performance differences are attributed to data scale and task aggregation via direct experimental comparison. The abstract and reader's summary show no self-definitional constructs, no self-citations carrying the weight of uniqueness theorems, and no known results renamed as novel derivations. The central claims remain falsifiable against external benchmarks and do not hold by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no information on free parameters, background axioms, or new entities; full manuscript details unavailable for audit.


