BioFact-MoE: Biologically Factorized Mixture of Experts for Vision-Language Prognostic Modeling in Hepatocellular Carcinoma

Annabella Shewarega; James S. Duncan; Julius Chapiro; Junlin Yang; Lawrence H. Staib; Nicha C. Dvornek; Peiyu Duan; Tian Yu; Yuexi Du

arxiv: 2605.26376 · v1 · pith:OR5TKGAFnew · submitted 2026-05-25 · 💻 cs.CV · cs.AI · cs.LG

Junlin Yang, Tian Yu, Nicha C. Dvornek, Yuexi Du, Peiyu Duan, Annabella Shewarega, Lawrence H. Staib, James S. Duncan, Julius Chapiro

Pith reviewed 2026-06-29 22:18 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG

keywords hepatocellular carcinoma · survival prediction · mixture of experts · vision-language model · multimodal prognosis · biological factorization · MRI radiology reports · phenotype stratification

The pith

A biologically factorized mixture-of-experts model separates liver and tumor factors from MRI and reports to improve hepatocellular carcinoma survival prediction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing vision-language models for HCC prognosis learn one entangled representation that mixes hepatic reserve and tumor factors, which limits both accuracy and the ability to link predictions to distinct biology. BioFact-MoE inserts biologically supervised experts inside a residual mixture-of-experts survival network so that one expert pathway learns liver-related features and another learns tumor-related features. On a cohort of 588 patients, after pretraining on 4,582 3D MRI image-report pairs, the model improves on all baselines and reaches 12-, 18-, and 24-month AUCs of 75.33%, 75.85%, and 73.96%. Its gate weights stratify patients by phenotype, and its embeddings correlate selectively with liver-function or tumor-burden markers without any direct supervision on those markers.

Core claim

BioFact-MoE explicitly decomposes liver and tumor factors via biologically supervised experts within a residual MoE survival architecture; the resulting model improves scalar survival prediction across time horizons and yields gated expert weights and latent embeddings whose selective associations with clinical markers arise from the factorization.

What carries the argument

Biologically factorized Mixture of Experts (MoE) with biologically supervised experts inside a residual MoE survival architecture that decomposes hepatic and tumor latent factors from vision-language inputs.
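
To make the phrase "residual MoE survival architecture" concrete, here is a minimal sketch of one way such a head could combine pathway experts. It is an illustration under assumptions, not the authors' implementation: the additive form `shared_score + sum_k gate_k * expert_score_k`, the softmax gate, and the names `residual_moe_risk`, `shared_score`, `expert_scores`, and `gate_logits` are all hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def residual_moe_risk(shared_score, expert_scores, gate_logits):
    """Return (risk, gate_weights) for one patient.

    Assumed form (hypothetical, not taken from the paper):
        risk = shared_score + sum_k gate_k * expert_score_k
    The shared score is the residual path; the experts add a gated correction.
    """
    gates = softmax(gate_logits)
    correction = sum(g * s for g, s in zip(gates, expert_scores))
    return shared_score + correction, gates


# Two hypothetical experts (liver, tumor); the gate favors the first one.
risk, gates = residual_moe_risk(0.2, [1.0, -0.5], [2.0, 0.0])
print(round(risk, 4), [round(g, 4) for g in gates])
```

In a form like this the gate weights sum to one, which is what would let them be read as a per-patient split between pathways.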

If this is right

Gated expert weights produce phenotype-aware risk groups whose survival curves differ by treatment history.
Hepatic embeddings correlate with liver-function markers and tumor embeddings correlate with tumor-burden markers at p less than 0.05 without explicit supervision on those markers.
The same architecture yields higher AUCs than standard vision-language or non-factorized MoE baselines at every tested horizon.
Pathway-informed gating reveals treatment-associated survival heterogeneity in held-out validation.
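
The first consequence above, phenotype-aware risk groups, can be sketched as grouping patients by whichever pathway receives the largest gate weight. The grouping rule (argmax over gate weights), the pathway names, and the function names below are assumptions made for illustration; the paper may define its subgroups differently.

```python
def dominant_pathway(gate_weights, names=("liver", "tumor")):
    """Name of the pathway with the largest gate weight (first one wins ties)."""
    idx = max(range(len(gate_weights)), key=lambda i: gate_weights[i])
    return names[idx]


def stratify(patients, names=("liver", "tumor")):
    """Group patient IDs by dominant pathway.

    `patients` is a list of (patient_id, gate_weights) pairs.
    """
    groups = {}
    for pid, gates in patients:
        groups.setdefault(dominant_pathway(gates, names), []).append(pid)
    return groups


# Hypothetical gate weights for three patients.
cohort = [("p1", [0.7, 0.3]), ("p2", [0.2, 0.8]), ("p3", [0.6, 0.4])]
print(stratify(cohort))
```

Survival curves would then be compared between the resulting groups, for example within patients who received the same treatment.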

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the factorization holds, the same supervised-expert pattern could be tested on other multimodal cancer datasets where two dominant biological axes drive outcome.
The selective marker associations suggest the model could be used to flag patients whose risk is driven more by liver decompensation than by tumor progression, an angle not directly tested in the paper.
Because the experts remain inside a residual MoE, the architecture may tolerate addition of further supervised pathways for additional biological axes without retraining the entire network.

Load-bearing premise

The biological supervision on the experts is sufficient to force cleanly separable liver and tumor latent factors rather than simply capturing dataset-specific correlations.

What would settle it

Remove the biological supervision from the expert pathways, retrain on the same data, and check whether the selective correlations between embeddings and liver-function versus tumor-burden markers disappear while prediction AUCs fall to baseline levels.
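
The selectivity half of this test can be sketched as a rank correlation between a per-patient pathway score and a clinical marker, computed once with supervision and once without. The abstract does not name the statistical test behind p<0.05, so Spearman correlation here is an assumption, and the example values are made up purely to exercise the code.

```python
import math


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical values: a hepatic pathway score against a liver-function marker.
hepatic_score = [0.1, 0.4, 0.35, 0.8, 0.9]
liver_marker = [1.0, 2.5, 2.0, 3.5, 5.0]
print(round(spearman(hepatic_score, liver_marker), 3))
```

Selectivity would show up as a strong correlation on the matching pair (hepatic score with liver marker) and a weak one on the crossed pair (hepatic score with tumor marker).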

Figures

Figures reproduced from arXiv: 2605.26376 by Annabella Shewarega, James S. Duncan, Julius Chapiro, Junlin Yang, Lawrence H. Staib, Nicha C. Dvornek, Peiyu Duan, Tian Yu, Yuexi Du.

**Figure 1.** BioFact-MoE framework. Stage 1: LLM-guided report decomposition supervises three pathway-specific LoRA adapters via contrastive pretraining with anatomical patch masking. Stage 2: Frozen pathway encoders are integrated by a residual MoE survival head with adaptive gating for Cox-based survival prediction.

**Figure 2.** Phenotype-Aware Stratification: Beyond scalar risk, gate weights stratify patients by dominant biological axis. Among patients receiving the same TACE treatment, liver-driven and tumor-driven subgroups show significantly different survival trajectories, capturing heterogeneity invisible to clinical staging alone.

**Figure 3.** In an exploratory treatment analysis, patients identified as liver low-risk by the hepatic pathway showed the greatest predicted benefit from TACE, consistent with clinical expectations.
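
The Figure 1 caption describes a Cox-based survival head. As a reference point, the standard Cox negative log partial likelihood can be sketched as below. Whether the paper uses exactly this loss, how it handles tied event times, and whether it averages over events are not stated in the material above, so this is a generic textbook form with Breslow-style tie handling rather than the authors' objective.

```python
import math


def cox_neg_log_partial_likelihood(risk, time, event):
    """Average negative log partial likelihood over observed events.

    risk  : predicted log-risk score per patient
    time  : follow-up time per patient
    event : 1 if the event was observed, 0 if censored
    """
    total = 0.0
    n_events = 0
    for i in range(len(risk)):
        if event[i] != 1:
            continue
        # Risk set: everyone still under observation at patient i's event time.
        at_risk = [risk[j] for j in range(len(risk)) if time[j] >= time[i]]
        m = max(at_risk)
        log_denom = m + math.log(sum(math.exp(r - m) for r in at_risk))
        total += risk[i] - log_denom
        n_events += 1
    if n_events == 0:
        raise ValueError("at least one observed event is required")
    return -total / n_events


# Two patients with equal scores: the loss is log(2) / 2.
print(cox_neg_log_partial_likelihood([0.0, 0.0], [1.0, 2.0], [1, 1]))
```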

Original abstract

Hepatocellular carcinoma (HCC) is biologically heterogeneous, shaped by the interplay between hepatic functional reserve and tumor-related oncologic factors; thus, similar survival outcomes may reflect fundamentally different underlying biological processes. Prognostic modeling in HCC is informed by rich multimodal information from multiparametric MRI and radiology reports from routine clinical practice. Existing prognostic vision-language models (VLMs) learn a single entangled latent representation that blends hepatic and tumor-related factors, limiting both accuracy and biological interpretability. We present BioFact-MoE, a biologically factorized Mixture of Experts (MoE) framework that explicitly decomposes liver and tumor factors via biologically supervised experts within a residual MoE survival architecture. On an HCC cohort of N=588 patients (pretrained on 4,582 3D MRI image-report pairs), BioFact-MoE consistently improves survival prediction over all baselines across time horizons, achieving 12-, 18-, and 24-month AUCs of 75.33%, 75.85%, and 73.96%. Beyond scalar risk prediction, gated expert weights enable phenotype-aware risk stratification. Pathway-informed gating uncovers clinically meaningful treatment-associated survival heterogeneity. In held-out validation, hepatic and tumor embeddings show selective associations with liver function and tumor burden markers, respectively (p<0.05), without supervision. The code is available at https://github.com/jy-639/BioFact-MoE.
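
For readers unfamiliar with horizon-specific AUCs such as the 12-, 18-, and 24-month figures quoted in the abstract, the sketch below shows one simple way to compute an AUC at a fixed horizon. It is a simplification under stated assumptions: patients censored before the horizon are dropped rather than reweighted, and the abstract does not say which estimator the authors used. Censoring-aware estimators, for example those using inverse-probability-of-censoring weights, would give different numbers.

```python
def horizon_auc(risk, time, event, horizon):
    """AUC for discriminating events by `horizon` from survivors past it.

    Cases    : event observed at or before the horizon.
    Controls : still under follow-up after the horizon.
    Patients censored at or before the horizon are excluded (a simplification).
    """
    cases = [r for r, t, e in zip(risk, time, event) if e == 1 and t <= horizon]
    controls = [r for r, t in zip(risk, time) if t > horizon]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for c in cases:
        for k in controls:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(cases) * len(controls))


# Hypothetical scores and follow-up times in months.
risk = [0.9, 0.8, 0.1, 0.2, 0.5]
time = [6, 10, 30, 40, 8]
event = [1, 1, 0, 1, 0]
print(horizon_auc(risk, time, event, horizon=12))
```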

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

BioFact-MoE adds supervised experts to separate liver and tumor factors in HCC multimodal survival modeling, but the abstract leaves open whether the factorization or the supervision itself produces the reported selectivity.

The paper introduces BioFact-MoE, a residual MoE that uses biologically supervised experts to decompose hepatic and tumor latent factors from MRI-report pairs for HCC prognosis. It reports AUCs of 75.33%, 75.85%, and 73.96% at 12, 18, and 24 months on N=588 patients, plus selective embedding correlations with liver-function and tumor-burden markers at p<0.05, and releases the code.

What stands out is the concrete attempt to build interpretability into the architecture via gated experts for phenotype-aware stratification, applied to a routine clinical multimodal setting. The numbers are specific and the pretraining on 4,582 pairs is noted.

The soft spots are clear from the abstract alone. No ablations, derivations, or error bars appear there, so it is impossible to tell whether the residual MoE path adds anything beyond the biological supervision. One concern stands out: if the experts are supervised with signals already correlated with the markers that are later tested, the selective associations could arise by construction rather than emerge from the factorization. In a cohort of this size, dataset-specific correlations could also produce the same pattern without any factorization. The claim that the associations arise "without supervision" therefore needs direct verification against the methods.

This is for readers in medical imaging and hepatology who work on survival models and want to test disentangled representations. A serious referee should see the full paper to check the supervision mechanism and controls; the topic and cohort are relevant enough to justify that step even if revisions are needed.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces BioFact-MoE, a residual Mixture-of-Experts architecture for vision-language survival modeling in HCC. Biologically supervised experts are used to explicitly decompose hepatic functional reserve and tumor-related factors from multiparametric MRI and radiology reports. On a cohort of N=588 patients (pretrained on 4,582 image-report pairs), the model reports improved time-to-event AUCs (12-month 75.33%, 18-month 75.85%, 24-month 73.96%) over baselines, phenotype-aware risk stratification via gated expert weights, and selective associations of the resulting hepatic and tumor embeddings with liver-function and tumor-burden markers (p<0.05) in held-out data without direct supervision on those markers. Code is released.

Significance. If the reported factorization is shown to be robust and independent of the supervision signals, the work would supply a concrete template for disentangling biologically distinct pathways inside multimodal prognostic models for heterogeneous cancers. The public code release is a clear strength that supports reproducibility.

major comments (2)

[Abstract and §3 (Model)] The statement that hepatic and tumor embeddings exhibit selective marker associations 'without supervision' is load-bearing for the central factorization claim. The manuscript must specify the exact supervision signals, loss terms, and pathway annotations used to train the liver and tumor experts; without this, it is impossible to rule out that the reported p<0.05 associations arise by construction from supervision signals that already correlate with the tested clinical markers.
[§4 (Experiments) and §5 (Results)] No ablation is presented that removes biological supervision while retaining the same expert count, residual MoE structure, and gating mechanism. Such a control is required to demonstrate that performance gains and embedding selectivity are attributable to the factorization rather than to the supervision itself or to dataset-specific correlations in the N=588 cohort.

minor comments (2)

[Abstract] AUC values are reported to two decimal places without accompanying standard deviations, number of runs, or confidence intervals; adding these would allow readers to assess stability of the reported gains.
[Figure captions and §5] The description of 'pathway-informed gating' should explicitly state which clinical variables or annotations are used as pathway inputs so that the stratification analysis can be reproduced.
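
The first minor comment asks for confidence intervals on the AUCs. One common way to produce them is a percentile bootstrap over patients, sketched below for a binary outcome at a fixed horizon. The resampling scheme, the default of 1,000 resamples, and the function names are illustrative choices, not anything the manuscript reports.

```python
import random


def auc(scores, labels):
    """Probability that a positive outranks a negative (ties count half)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(scores, labels, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the AUC, resampling patients."""
    if not 0 < sum(labels) < len(labels):
        raise ValueError("labels must contain both classes")
    rng = random.Random(seed)
    n = len(scores)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        lab = [labels[i] for i in idx]
        if 0 < sum(lab) < n:  # skip resamples that contain only one class
            stats.append(auc([scores[i] for i in idx], lab))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi


# Hypothetical scores; label 1 means the event occurred by the horizon.
scores = [0.9, 0.7, 0.6, 0.4, 0.65, 0.2, 0.3, 0.8]
labels = [1, 1, 0, 0, 1, 0, 0, 1]
print(auc(scores, labels), bootstrap_auc_ci(scores, labels))
```

Reporting the interval alongside each point estimate would show whether the gains over baselines exceed resampling noise in a cohort of this size.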

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and insightful comments, which highlight key aspects needed to strengthen the central claims of the manuscript. We address each major comment below and will revise the paper to provide the requested clarifications and experiments.

Referee: [Abstract and §3 (Model)] The statement that hepatic and tumor embeddings exhibit selective marker associations 'without supervision' is load-bearing for the central factorization claim. The manuscript must specify the exact supervision signals, loss terms, and pathway annotations used to train the liver and tumor experts; without this, it is impossible to rule out that the reported p<0.05 associations arise by construction from supervision signals that already correlate with the tested clinical markers.

Authors: We agree that explicit specification of the supervision is necessary to support the factorization claim. The biological supervision for the experts relies on pathway annotations automatically extracted from the pretraining radiology reports (4,582 pairs) and aligned to MRI features via dedicated contrastive and reconstruction loss terms: the liver expert uses hepatic functional reserve pathway labels (e.g., fibrosis and cirrhosis indicators), while the tumor expert uses oncologic pathway labels (e.g., vascular invasion and nodule descriptors). These annotations are distinct from the specific held-out clinical markers (liver function tests and tumor burden metrics) used for the post-hoc p<0.05 association tests in the N=588 cohort. In the revision we will expand §3 with a dedicated subsection and table that lists the exact annotations, loss formulations, and gating mechanism, explicitly noting that the tested markers were never part of training supervision. This will demonstrate that the selective associations are emergent.

Revision: yes

Referee: [§4 (Experiments) and §5 (Results)] No ablation is presented that removes biological supervision while retaining the same expert count, residual MoE structure, and gating mechanism. Such a control is required to demonstrate that performance gains and embedding selectivity are attributable to the factorization rather than to the supervision itself or to dataset-specific correlations in the N=588 cohort.

Authors: We concur that an ablation isolating the effect of biological supervision is important. The current baselines include non-MoE VLMs and a standard residual MoE without pathway supervision, but we will add the requested control: a 'non-biological' variant that retains the identical expert count, residual connections, and gating network but replaces the pathway-specific losses with generic reconstruction objectives. We will report the resulting 12-/18-/24-month AUCs and embedding-marker correlations on the same splits. This experiment will be included in the revised §4 and §5 to quantify the incremental benefit of the biological factorization.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained with no reducible steps shown

The abstract and provided text contain no equations, training details, or derivation chain that can be inspected for reduction to inputs. Claims such as 'without supervision' for marker associations and 'biologically supervised experts' are stated at a high level but do not exhibit self-definitional structure, fitted inputs renamed as predictions, or load-bearing self-citations. No specific reduction (e.g., Eq. X = Eq. Y by construction) is quotable. The central performance claims rest on empirical results in an N=588 cohort with held-out validation, which is independent of the factorization description. This is the expected honest non-finding when no load-bearing circular step is exhibited.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Only the abstract is available, so the ledger is populated from stated claims rather than from explicit equations or methods sections. The central claim rests on the existence of separable biological factors that can be supervised into distinct experts and on the representativeness of the N=588 cohort.

axioms (1)

Domain assumption: The HCC patient cohort (N=588, pretrained on 4,582 image-report pairs) is representative of the target clinical population for survival modeling.
Stated cohort size and pretraining scale in abstract; no further justification supplied.

invented entities (1)

Biologically supervised experts (liver and tumor pathways): no independent evidence
purpose: Explicit decomposition of hepatic functional reserve versus tumor-related oncologic factors inside the MoE survival model.
Introduced in abstract as the core architectural novelty; no independent evidence of separability provided beyond the reported embedding correlations.

Reference graph

Works this paper leans on

19 extracted references · 7 canonical work pages · 2 internal anchors

[1] Akinci D’Antonoli, T., Berger, L.K., Indrakanti, A.K., Vishwanathan, N., Weiss, J., Jung, M., Berkarda, Z., Rau, A., Reisert, M., Küstner, T., et al.: TotalSegmentator MRI: robust sequence-independent segmentation of multiple anatomic structures in MRI. Radiology 314(2), e241613 (2025)

[2] Dou, S., Zhou, E., Liu, Y., Gao, S., Shen, W., Xiong, L., Zhou, Y., Wang, X., Xi, Z., Fan, X., et al.: LoRAMoE: Alleviating world knowledge forgetting in large language models via MoE-style plugin. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 1932–1945 (2024)

[3] Du, Y., Onofrey, J.A., Dvornek, N.C.: Multi-view and multi-scale alignment for contrastive language-image pre-training in mammography. In: International Conference on Information Processing in Medical Imaging. pp. 247–262. Springer (2025)

[4] Foglia, B., Turato, C., Cannito, S.: Hepatocellular carcinoma: latest research in pathogenesis, detection and treatment (2023)

[5] Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al.: LoRA: Low-rank adaptation of large language models. ICLR 1(2), 3 (2022)

[6] Huang, S.C., Shen, L., Lungren, M.P., Yeung, S.: GLoRIA: A multimodal global-local representation learning framework for label-efficient medical image recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 3942–3951 (2021)

[7] Isensee, F., Jaeger, P.F., Kohl, S.A., Petersen, J., Maier-Hein, K.H.: nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature Methods 18(2), 203–211 (2021)

[8] Katzman, J.L., Shaham, U., Cloninger, A., Bates, J., Jiang, T., Kluger, Y.: DeepSurv: personalized treatment recommender system using a Cox proportional hazards deep neural network. BMC Medical Research Methodology 18(1), 24 (2018)

[9] Li, D., Ma, Y., Wang, N., Ye, Z., Cheng, Z., Tang, Y., Zhang, Y., Duan, L., Zuo, J., Yang, C., et al.: MixLoRA: Enhancing large language models fine-tuning with LoRA-based mixture of experts. arXiv preprint arXiv:2404.15159 (2024)

[10] Li, Y., Lai, H., Zhou, X., Ming, S., Ma, W., Wei, W., Zhou, S.K.: More performant and scalable: Rethinking contrastive vision-language pre-training of radiology in the LLM era. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 348–357. Springer (2025)

[11] Liu, P.H., Hsu, C.Y., Hsia, C.Y., Lee, Y.H., Su, C.W., Huang, Y.H., Lee, F.Y., Lin, H.C., Huo, T.I.: Prognosis of hepatocellular carcinoma: assessment of eleven staging systems. Journal of Hepatology 64(3), 601–608 (2016)

[12] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning. pp. 8748–8763. PMLR (2021)

[13] Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., Dean, J.: Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538 (2017)

[14] Willis, R., Bakos, J.: Exploring fusion strategies for multimodal vision-language systems. arXiv preprint arXiv:2511.21889 (2025)

[15] Wu, Y., Liu, Y., Yang, Y., Yao, M.S., Yang, W., Shi, X., Yang, L., Li, D., Liu, Y., Yin, S., et al.: A concept-based interpretable model for the diagnosis of choroid neoplasias using multimodal data. Nature Communications 16(1), 3504 (2025)

[16] Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al.: Qwen3 technical report. arXiv preprint arXiv:2505.09388 (2025)

[17] Zadouri, T., Üstün, A., Ahmadian, A., Ermiş, B., Locatelli, A., Hooker, S.: Pushing mixture of experts to the limit: Extremely parameter efficient MoE for instruction tuning. arXiv preprint arXiv:2309.05444 (2023)

[18] Zhang, M., Wu, X., Luo, H., Wang, F., Lv, Y.: Medground: Bridging the evidence gap in medical vision-language models with verified grounding data. arXiv preprint arXiv:2601.06847 (2026)

[19] Zhang, S., Xu, Y., Usuyama, N., Xu, H., Bagga, J., Tinn, R., Preston, S., Rao, R., Wei, M., Valluri, N., Wong, C., Tupini, A., Wang, Y., Mazzola, M., Shukla, S., Liden, L., Gao, J., Crabtree, A., Piening, B., Bifulco, C., Lungren, M.P., Naumann, T., Wang, S., Poon, H.: A multimodal biomedical foundation model trained from fifteen million image–text pairs.... NEJM AI (2024). doi:10.1056/aioa2400640
