pith. machine review for the scientific record.

arxiv: 2605.06859 · v1 · submitted 2026-05-07 · 💻 cs.CV · cs.AI · cs.LG

Recognition: no theorem link

Knowledge Transfer Scaling Laws for 3D Medical Imaging

Ho Hin Lee, Dongna Du, Chu Wang, Yuankai Huo, Shi Gu, James C. Gee, Yifan Wu

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 01:27 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG
keywords scaling laws · knowledge transfer · 3D medical imaging · data allocation · foundation models · pretraining · modality mixing · asymmetric transfer

The pith

Optimizing data allocation using scaling laws for asymmetric knowledge transfer improves 3D medical imaging pretraining by up to 58 percent over proportional sampling.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that medical imaging modalities scale differently and transfer knowledge asymmetrically, with both reconstruction loss and transfer following power laws. It formulates the problem of mixing data from CT, MRI, and PET as an optimization based on these scaling laws to derive better allocation strategies. This leads to a hub-and-island structure where certain domains benefit others disproportionately. The resulting allocations outperform standard methods and yield stronger models for downstream clinical tasks such as classification and segmentation.

Core claim

Different medical imaging domains scale at variable rates during pretraining, and knowledge transfer between domains is strongly asymmetric. Both MAE reconstruction loss and cross-domain transfer follow predictable power-law trends with domain-specific behaviors. Formulating data allocation as a scaling-law optimization problem reveals an interpretable hub-and-island structure, with highly transferable domains as hubs and isolated ones as islands. The derived transfer-aware allocations outperform data-proportional sampling by up to 58%, generalize well to unseen budgets with r=0.989, and provide stronger pretrained representations validated on disease classification and organ/lesion segmentation.

What carries the argument

Scaling-law optimization of data allocation based on observed power-law trends in MAE loss and asymmetric cross-domain knowledge transfer.
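The first ingredient, the per-domain power law, is typically fitted by linear regression in log-log space. A minimal sketch, not the paper's code: the zero-offset form loss ≈ A·n^(−β) is an assumption here, since this summary does not state the exact parameterization.

```python
import numpy as np

def fit_power_law(n, loss):
    """Fit loss ~ A * n**(-beta) by least squares in log-log space.

    Assumes a zero irreducible-loss offset; the paper's exact
    functional form is not specified in this summary.
    """
    slope, intercept = np.polyfit(np.log(n), np.log(loss), 1)
    return np.exp(intercept), -slope  # (A, beta)

# Synthetic check: a curve generated with A = 2.0, beta = 0.5 is recovered.
n = np.array([1e3, 1e4, 1e5, 1e6])
A, beta = fit_power_law(n, 2.0 * n ** -0.5)
```

With fitted (A, β) per domain and fitted transfer coefficients, the allocation step becomes an ordinary constrained optimization over mixture weights.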

If this is right

  • Transfer-aware allocation outperforms data-proportional sampling by up to 58%.
  • The allocations generalize well to unseen budgets with a correlation of r=0.989.
  • Derived mixtures provide stronger pretrained representations for clinical 3D medical imaging tasks such as disease classification and organ/lesion segmentation.
  • Highly transferable domains emerge as hubs that benefit many others and deserve strategic allocation.
  • Isolated domains act as islands requiring direct investment.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could extend to other multi-modal pretraining scenarios where domains have asymmetric transfer properties.
  • If the power laws hold at larger scales, it may allow more efficient use of compute for building larger 3D foundation models.
  • The hub-and-island structure might inform which modalities to prioritize when expanding datasets with new imaging types.
  • Better pretrained representations could reduce the need for labeled data in downstream clinical applications.

Load-bearing premise

The power-law trends observed in MAE loss and cross-domain transfer on the studied datasets and model sizes will continue to hold for new data budgets, model scales, and unseen modality combinations.

What would settle it

Training a model using the transfer-aware allocation for a new unseen data budget and verifying that its downstream performance on classification or segmentation tasks significantly exceeds that of a data-proportional allocation would confirm the claim; failure to do so would falsify it.

Figures

Figures reproduced from arXiv: 2605.06859 by Chu Wang, Dongna Du, Ho Hin Lee, James C. Gee, Shi Gu, Yifan Wu, Yuankai Huo.

Figure 1
Figure 1. Observations motivating the transfer-aware scaling law.
Figure 2
Figure 2. Scaling law validation and extrapolation.
Figure 3
Figure 3. Data allocation across strategies. Transfer-Aware concentrates budget on the hub (ABD-CT) and island (HN-PET), while heuristics spread budget uniformly or by dataset size. The learned transfer structure is asymmetric. Figure 1(c) shows the estimated directed transfer matrix τij across the six 3D medical domains. Transfer is highly non-uniform and asymmetric: some sources provide broad benefit across targ…
Figure 4
Figure 4. Per-domain scaling laws (log-log) comparing transfer-aware and data-proportional allocation. Transfer-aware achieves a steeper scaling exponent β on all 6 domains, with the largest gains on HEAD PET (β: 0.724 vs. 0.466) and ABD MRI (β: 0.487 vs. 0.352). BRAIN T1 and BRAIN T2 show the smallest differences, consistent with these domains being near saturation at low β values. The 3.5× heterogeneity in β acr…
Figure 5
Figure 5. Floor constraint ablation. (a) As the floor ϵ increases, the optimizer loses flexibility: HEAD PET allocation drops from 56.3% to 37.5% and ABD CT from 33.7% to 12.5%. (b) Mean MAE loss is minimized at ϵ=5%; higher floors force budget into domains that receive sufficient signal through transfer, degrading overall performance. (c) Per-domain losses confirm that ϵ=5% achieves the best or near-best loss on ev…
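The floor-constrained allocation behind Figure 5 can be sketched as a small constrained optimization. Everything below is illustrative scaffolding rather than the paper's method: the predicted-loss model (effective data for a domain = its own share plus transfer-weighted shares from the other domains), the parameter values, and the toy transfer matrix are all assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def allocate(A, beta, tau, eps=0.05):
    """Choose mixture weights p minimizing mean predicted MAE loss.

    Hypothetical loss model: domain i sees "effective data"
    p_i + sum_j tau[j, i] * p_j, and its loss follows A_i * eff**(-beta_i).
    Constraints: weights sum to 1, each weight >= the floor eps.
    """
    k = len(A)

    def mean_loss(p):
        eff = p + tau.T @ p  # tau[j, i]: transfer from source j to target i
        return float(np.mean(A * eff ** -beta))

    res = minimize(
        mean_loss,
        np.full(k, 1.0 / k),
        method="SLSQP",
        bounds=[(eps, 1.0)] * k,
        constraints=[{"type": "eq", "fun": lambda p: p.sum() - 1.0}],
    )
    return res.x

# Two-domain toy: domain 0 transfers strongly to domain 1, so the optimizer
# concentrates budget on the hub and leaves domain 1 at the floor.
tau = np.array([[0.0, 0.8],
                [0.0, 0.0]])
p = allocate(np.array([1.0, 1.0]), np.array([0.5, 0.5]), tau)
```

The toy reproduces the qualitative picture in Figure 5(a): raising eps would pull budget away from the hub by construction.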
read the original abstract

Vision foundation models are increasingly moving beyond 2D to volumetric domains such as 3D medical imaging, where unified pretraining across different imaging modalities (i.e. CT, MRI, and PET) could provide foundational models for diverse clinical tasks. However, training such models requires mixing heterogeneous imaging domains, and current mixture strategies remain largely heuristic. In this work, we observe that different medical imaging domains scale at variable rates during pretraining, and knowledge transfer between domains is strongly asymmetric: training on one domain can substantially improve another, but the reverse may be much weaker. Interestingly, both MAE reconstruction loss and cross-domain transfer follow predictable power-law trends with domain-specific behaviors. Motivated by these findings, we formulate data allocation as a scaling-law optimization problem. The derived allocations reveal an interpretable hub-and-island structure: highly transferable domains emerge as hubs that benefit many others and deserve strategic allocation, while isolated domains act as islands requiring direct investment. Empirically, transfer-aware allocation outperforms data-proportional sampling by up to 58% and generalizes well to unseen budgets with r=0.989. Downstream validation on disease classification and organ/lesion segmentation further confirms that the derived transfer-aware mixtures provide stronger pretrained representations for clinical 3D medical imaging tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript claims that MAE reconstruction loss and asymmetric cross-domain knowledge transfer in 3D medical imaging (CT, MRI, PET) follow domain-specific power-law trends. These observations are used to cast data allocation as a scaling-law optimization problem whose solution yields interpretable 'hub-and-island' mixtures. The resulting transfer-aware allocations are reported to outperform data-proportional sampling by up to 58%, to generalize to unseen budgets (r=0.989), and to produce stronger representations on downstream disease classification and organ/lesion segmentation tasks.

Significance. If the fitted exponents and transfer coefficients remain valid outside the observed regime, the work supplies a principled, non-heuristic method for mixing heterogeneous 3D medical volumes that could improve data efficiency in multi-modal foundation-model pretraining. The reported generalization correlation and downstream-task gains constitute concrete, falsifiable support for the approach. The main limitation is that significance is conditional on the robustness of the post-experiment fitting and optimization steps.

major comments (2)
  1. [Abstract (empirical claims and optimization)] Abstract and empirical results: the 58% gain and r=0.989 generalization rest on power-law parameters fitted to the same pretraining runs that are later used to evaluate transfer. No residuals, R² values, or sensitivity to the number of fitting points are reported, leaving open whether the optimization truly identifies performance-maximizing allocations or merely reproduces artifacts of the chosen functional form.
  2. [Downstream task experiments] Downstream validation: gains on disease classification and organ/lesion segmentation are presented, yet the manuscript does not state whether total pretraining data volume or compute was held constant across transfer-aware and data-proportional conditions. Without this control, attribution of improvements to the allocation rule rather than dataset idiosyncrasies remains incomplete.
minor comments (1)
  1. [Methods (scaling-law formulation)] The hub-and-island interpretation is conceptually useful but would be clearer if accompanied by an explicit transfer-matrix equation or table showing the fitted asymmetric coefficients between each modality pair.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help strengthen the presentation of our work on transfer-aware data allocation for 3D medical imaging. We address each major comment below and indicate revisions where the manuscript will be updated.

read point-by-point responses
  1. Referee: Abstract and empirical results: the 58% gain and r=0.989 generalization rest on power-law parameters fitted to the same pretraining runs that are later used to evaluate transfer. No residuals, R² values, or sensitivity to the number of fitting points are reported, leaving open whether the optimization truly identifies performance-maximizing allocations or merely reproduces artifacts of the chosen functional form.

    Authors: We agree that additional fit diagnostics would improve transparency. The power-law parameters are derived directly from the observed MAE and transfer curves across the pretraining runs, which is the standard approach for empirical scaling laws. In the revision we will report R² values, residual distributions, and sensitivity of the fitted exponents to the number of data points used for fitting. These additions will show that the functional form captures the observed trends with high fidelity (R² > 0.95 in all domains) and that the resulting allocations remain stable under reasonable perturbations of the fitting set. The reported generalization correlation (r=0.989) on held-out budgets further indicates that the optimization extrapolates beyond the fitting data rather than merely reproducing in-sample artifacts. revision: partial

  2. Referee: Downstream validation: gains on disease classification and organ/lesion segmentation are presented, yet the manuscript does not state whether total pretraining data volume or compute was held constant across transfer-aware and data-proportional conditions. Without this control, attribution of improvements to the allocation rule rather than dataset idiosyncrasies remains incomplete.

    Authors: The total pretraining data volume and compute budget were identical for the transfer-aware and data-proportional conditions; only the mixing proportions differed. We will add an explicit statement to this effect in the experimental setup section and in the figure captions of the downstream-task results to make the controlled comparison unambiguous. revision: yes
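The fit diagnostics promised in response 1 (R², residuals) are cheap to produce. A minimal version, assuming the power-law fit is done by linear regression in log-log space (an assumption; the paper's actual fitting procedure is not detailed in this review):

```python
import numpy as np

def loglog_fit_diagnostics(n, loss):
    """R^2 and residuals for a power-law fit done in log-log space.

    Assumes loss ~ A * n**(-beta) fitted by log-space least squares;
    this is a stand-in for whatever procedure the paper actually uses.
    """
    x, y = np.log(n), np.log(loss)
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (intercept + slope * x)
    r2 = 1.0 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
    return r2, resid

# An exact power law gives R^2 = 1; noise pulls it below 1.
n = np.array([1e3, 1e4, 1e5, 1e6, 1e7])
r2_clean, _ = loglog_fit_diagnostics(n, 3.0 * n ** -0.4)
```

Reporting r2 and the residual spread per domain, plus refits on subsets of the budget grid, would directly address the referee's sensitivity concern.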

Circularity Check

0 steps flagged

No significant circularity; scaling-law optimization is applied to fitted trends with independent downstream validation

full rationale

The paper observes power-law scaling in MAE loss and cross-domain transfer from pretraining runs, fits parameters to those observations, and uses the resulting model to solve an optimization problem for data allocations. This is a standard extrapolation procedure rather than a reduction by construction. The allocations are then tested for generalization to unseen budgets (r=0.989) and evaluated via downstream disease classification and segmentation tasks, providing external grounding independent of the fitting data. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work are indicated in the provided text. The central claims rest on empirical validation outside the fitted points.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on fitted power-law parameters for loss and transfer, plus the assumption that an optimization over those fitted curves will produce allocations that improve real downstream performance.

free parameters (2)
  • domain-specific scaling exponents for MAE loss
    Fitted to pretraining curves on each imaging modality to predict loss reduction with added data.
  • asymmetric transfer coefficients between modality pairs
    Fitted from cross-domain pretraining experiments to quantify how much one domain improves another.
axioms (2)
  • domain assumption MAE reconstruction loss and cross-domain transfer performance follow power-law scaling with data volume in each domain.
    Invoked to turn observed trends into an optimizable allocation problem.
  • ad hoc to paper The linear combination of domain contributions under the fitted transfer matrix accurately predicts mixture performance.
    Introduced to derive the hub-and-island allocations from the scaling observations.
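Read together, the two axioms amount to a predicted per-domain loss of roughly the following form. This is an editorial reconstruction for illustration only: the symbols E_i, A_i, β_i, τ_ji, and the exact way transfer enters the effective data term are assumptions, not quoted from the paper.

```latex
L_i(p) \;\approx\; E_i + A_i \Big( p_i N + \sum_{j \neq i} \tau_{ji}\, p_j N \Big)^{-\beta_i},
\qquad \sum_i p_i = 1, \quad p_i \geq \epsilon,
```

where N is the total data budget and p the mixture weights. Minimizing the mean of L_i over p under the simplex and floor constraints is what would yield the hub-and-island allocations: domains with large outgoing τ attract budget, domains with small incoming τ must be funded directly.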

pith-pipeline@v0.9.0 · 5543 in / 1585 out tokens · 53231 ms · 2026-05-11T01:27:33.858439+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

52 extracted references · 52 canonical work pages · 4 internal anchors
