
arxiv: 2606.06696 · v1 · pith:A5L5SDQUnew · submitted 2026-06-04 · 💻 cs.CV · cs.AI

MMBU: A Massive Multi-modal Biomedical Understanding Benchmark to Probe the Perception Capabilities of Vision-Language Models

Pith reviewed 2026-06-28 01:37 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords biomedical imaging · vision-language models · benchmark · visual perception · multimodal AI · medical imaging · domain generalization · object detection

The pith

A new benchmark for biomedical vision-language models shows that reported high accuracies often conceal deficiencies in visual perception and domain generalization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents the Massive Multimodal Biomedical Understanding benchmark as a way to test whether vision-language models can accurately interpret subtle visual details across many types of biomedical images. It covers 35 submodalities and uses both open-ended and closed tasks for classification and object detection to check performance at different scales and settings. Testing 17 models reveals that medical adaptation helps in some cases but does not eliminate gaps in perception or the ability to handle new contexts. This matters because models that score well on narrower tests may still fail when faced with real variation in medical imaging data.

Core claim

The MMBU benchmark demonstrates that high accuracy on established biomedical VLM tests frequently masks underlying weaknesses in visual perception and the capacity to generalize across diverse modalities, scales, and clinical contexts, even after medical adaptation.

What carries the argument

The MMBU benchmark, a dataset spanning 35 submodalities with structured metadata that enables parallel evaluation of ungrounded and grounded classification plus object detection.

If this is right

  • Medical adaptation improves some model scores but leaves measurable perception gaps across modalities.
  • Model performance differs markedly depending on biological scale and imaging type.
  • Systematic testing across open and closed task formats exposes where generalization fails.
  • Both open-weight and closed frontier models exhibit domain-specific perception limits.
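
The head-to-head comparison between medical and base models (described in the Figure 6 caption as the proportion of questions where the medical model wins, ties, or loses) can be sketched roughly as follows. This is a minimal illustration, not the authors' evaluation code: the function name `head_to_head` and the per-question boolean inputs are assumptions, and the sample data is hypothetical.

```python
from collections import Counter


def head_to_head(base_correct, medical_correct):
    """Share of questions the medical model wins, ties, or loses.

    Both arguments are equal-length lists of booleans, one entry per
    question, marking whether that model answered correctly.
    (Hypothetical helper; not from the paper.)
    """
    if len(base_correct) != len(medical_correct):
        raise ValueError("both models must be scored on the same questions")
    if not base_correct:
        raise ValueError("need at least one question")

    counts = Counter()
    for base, medical in zip(base_correct, medical_correct):
        if medical and not base:
            counts["win"] += 1      # only the medical model is correct
        elif base and not medical:
            counts["loss"] += 1     # only the base model is correct
        else:
            counts["tie"] += 1      # both correct or both wrong

    total = len(base_correct)
    return {key: counts[key] / total for key in ("win", "tie", "loss")}


# Hypothetical per-question results, not taken from the paper.
base = [True, False, True, False, True]
medical = [True, True, False, False, True]
result = head_to_head(base, medical)
print(result)
```

Counting ties separately keeps questions that both models get right (or both get wrong) from inflating either side of the comparison.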

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training approaches may need to target fine visual feature extraction more directly rather than leaning on associated text.
  • Routine use of perception-focused benchmarks like MMBU could become part of validation before clinical deployment.
  • The same evaluation structure might help identify perception shortfalls in vision-language models applied to other technical image domains.

Load-bearing premise

The benchmark's selection of tasks and submodalities succeeds in isolating visual perception from language cues or dataset biases.

What would settle it

If models that score highly on prior benchmarks also score highly on MMBU with no measurable perception or generalization shortfalls, the claim that existing tests mask deficiencies would not hold.
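
One way to make this test concrete is to compare each model's accuracy on a prior benchmark with its accuracy on MMBU and flag models whose score drops by more than some tolerance. The sketch below assumes accuracies in the range 0 to 1; the function name `flag_masked_deficiencies`, the tolerance value, and the model names and scores are all hypothetical and do not come from the paper.

```python
def flag_masked_deficiencies(scores, tolerance=0.05):
    """Return models whose prior-benchmark accuracy exceeds their MMBU
    accuracy by more than `tolerance`.

    `scores` maps a model name to a (prior_accuracy, mmbu_accuracy) pair.
    The tolerance is an illustrative assumption, not a value from the paper.
    """
    flagged = []
    for model, (prior_acc, mmbu_acc) in scores.items():
        if prior_acc - mmbu_acc > tolerance:
            flagged.append(model)
    return sorted(flagged)


# Hypothetical scores for illustration only.
scores = {
    "model_a": (0.90, 0.60),  # large drop on the new benchmark
    "model_b": (0.80, 0.78),  # roughly consistent across benchmarks
}
flagged = flag_masked_deficiencies(scores)
print(flagged)
```

If the flagged list were empty for every strong model, that would count against the claim that existing tests mask deficiencies.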

Figures

Figures reproduced from arXiv: 2606.06696 by Alejandro Lozano, Daniel Vela Jarquin, James Burgess, Jeffrey J. Nirschl, Jin Ye, Josiah Aklilu, Junjun He, Ming Hu, Min Woo Sun, Paola Avila, Robayo, Robert Tibshirani, Ryan D'Cunha, Ryan Nayebi, Serena Yeung-Levy, Xiaoxiao Sun, Xin Chen, Yue Yao, Yuhui Zhang, Zhongying Deng.

Figure 1: The data landscape of MMBU. Current biomedical VLM evaluation relies on roughly 20 commonly used datasets. However, as the training data for large models expands, this evaluation becomes inadequate due to issues such as data pollution and a lack of diversity. We introduce MMBU to address this issue.
Figure 2: Multi-task visual examples and metadata-driven question construction in MMBU. Top: Data collection metadata extraction and standardization. Middle: Representative samples from diverse medical domains and modalities across three task types, including classification, detection, and segmentation. Bottom: Example of question construction in MMBU using newly collected metadata. An example benchmark question is…
Figure 3: Overview of MMBU dataset composition across modalities, submodalities, medical domains, and body parts. (a) Distribution of top-level imaging modalities. (b) Distribution of imaging submodalities, shown in two panels for readability, with counts on a log2 scale. (c) Distribution of medical domains grouped into clinical and laboratory categories. (d) Top body parts (15 out of 95 total body-part categorie…
Figure 4: Aggregate performance on MMBU. Performance of a representative set of VLMs on the classification and detection tasks in the benchmark. Solid outlines denote open-format results, while boxes without outlines denote closed-format results. Models are ranked by their closed-format performance. Dashes indicate models adapted to medical data. Similar colors indicate models of the same family.
Figure 5: Comparison of model performance across aggregated biomedical do…
Figure 6: Medical models vs. base models. a) Head-to-head comparison of general-purpose and medical vision–language models on MMBU, showing the proportion of questions where the medical model wins, ties, or loses. b) Relationship between medical model training data size (in millions of examples) and win rate on MMBU.
Figure 7: Performance comparison across MMBU subsets (x-axis) and legacy…
Figure 8: Modality and submodality within MMBU.
Figure 9: F1 scores for VLMs on MMBU organized by 11 unique modalities
Figure 10: F1 scores for VLMs for MMBU’s 35 unique submodalities.
Figure 11: Dumbbell plot of open vs closed VQA scores across SOTA models
Figure 12: Dumbbell plot of open vs closed VQA scores across SOTA models
Figure 13: Dumbbell plot of open vs closed VQA scores across SOTA models
Figure 14: Dumbbell plot of open vs closed VQA scores across SOTA models
Figure 15: Correctness heatmap of all models tested on MMBU
Figure 16: Radial plot comparing base and medical models on MMBU domains
Figure 17: Radial plot comparing base and medical models on MMBU domains
Figure 18: Radial plot comparing base and medical models on MMBU domains
Figure 19: Radial plot comparing accuracy on the different question types by…
Figure 20: Radial plot comparing accuracy on the different question types by…
Figure 21: Comparing MMBU questions evaluated with and without the im…
Figure 22: UMAP comparing MMBU’s Radiology subset against popular radi…
Figure 23: UMAP comparing MMBU’s Radiology subset against popular radi…
Figure 24: UMAP comparing MMBU’s Pathology subset against a popular…
Figure 25: UMAP comparing MMBU’s Pathology subset against a popular…
Figure 26: Model Performance on GMAI-MMBench vs MMBU (GMAI Sub…
Original abstract

Vision and language models (VLMs) hold immense promise to transform biomedical imaging workflows, from detecting lesions in chest X-rays to profiling cellular features in microscopy. Realizing this potential, however, requires robust and fine-grained visual perception. Models need to correctly interpret subtle features in images, and they must do so across diverse biomedical modalities, scales, and contexts. Nevertheless, current benchmarks remain limited. To address these gaps, we introduce the Massive Multimodal Biomedical Understanding (MMBU) benchmark. It is the largest biomedical vision and language benchmark to date, covering 35 submodalities with rich structured metadata. It includes both open and closed versions of ungrounded classification, grounded classification, and object detection, enabling systematic evaluation of model performance across biological scales, clinical settings, and imaging modalities. Evaluating 15 open-weight and 2 frontier VLMs, we find that while medical adaptation provides measurable gains for some models, the high accuracy often reported on established benchmarks can mask deficiencies in visual perception and domain generalization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript introduces the Massive Multimodal Biomedical Understanding (MMBU) benchmark, described as the largest biomedical vision-language benchmark to date. It covers 35 submodalities with rich structured metadata and supports open/closed variants of ungrounded classification, grounded classification, and object detection tasks across biological scales and imaging modalities. Evaluation of 15 open-weight and 2 frontier VLMs shows that medical adaptation yields some gains, yet the authors conclude that high accuracy on established benchmarks can mask deficiencies in visual perception and domain generalization.

Significance. If MMBU's task suite and construction genuinely isolate visual perception capabilities from language priors and dataset biases, the benchmark would constitute a useful addition to the field by enabling more targeted diagnosis of VLM limitations in biomedical settings and supporting development of models with better domain generalization.

major comments (1)
  1. [Abstract] The central claim, that established benchmarks mask deficiencies in visual perception and domain generalization, requires evidence that MMBU tasks measure visual content rather than textual cues or statistical shortcuts. The abstract provides no description of benchmark construction details, data sources, or controls such as vision-ablated baselines, adversarial text variants, or annotation-leakage checks, leaving the claim unsupported by the presented information.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their review and for highlighting the need for the abstract to better substantiate the central claim. We address the comment below.

Point-by-point responses
  1. Referee: [Abstract] The central claim, that established benchmarks mask deficiencies in visual perception and domain generalization, requires evidence that MMBU tasks measure visual content rather than textual cues or statistical shortcuts. The abstract provides no description of benchmark construction details, data sources, or controls such as vision-ablated baselines, adversarial text variants, or annotation-leakage checks, leaving the claim unsupported by the presented information.

    Authors: We agree that the abstract, constrained by length, omits key supporting details present in the full manuscript. Section 3 describes the benchmark construction from 35 public biomedical datasets with structured metadata across scales and modalities; Section 4 details the task variants (ungrounded/grounded classification and detection); and Section 5 reports controls including vision-ablated baselines (showing sharp performance drops without images) and comparisons that isolate visual perception from language priors. We will revise the abstract to concisely reference the construction process, data sources, and the use of such controls to support the claim.

    Revision: yes
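
A vision-ablated baseline of the kind mentioned in the rebuttal can be summarized as the accuracy difference between answering with the image and answering from the question text alone. The sketch below is only an illustration under that assumption: the helper names `accuracy` and `vision_ablation_gap` and the sample answers are hypothetical, and the actual protocol in the paper may differ.

```python
def accuracy(predictions, answers):
    """Fraction of predictions that exactly match the reference answers."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("need at least one question")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def vision_ablation_gap(with_image, without_image, answers):
    """Accuracy with the image minus accuracy without it.

    A gap near zero suggests the questions can be answered from text
    cues alone; a large positive gap suggests the image is needed.
    """
    return accuracy(with_image, answers) - accuracy(without_image, answers)


# Hypothetical multiple-choice outputs, not taken from the paper.
answers = ["A", "B", "C", "D"]
with_image = ["A", "B", "C", "A"]      # 3 of 4 correct
without_image = ["A", "C", "A", "A"]   # 1 of 4 correct
gap = vision_ablation_gap(with_image, without_image, answers)
print(gap)
```

For closed-format questions, the text-only accuracy is best read against the chance rate for the number of answer options, since a text-only model can still guess correctly.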

Circularity Check

0 steps flagged

No circularity: empirical benchmark with no derivations or fitted predictions

full rationale

The paper introduces and evaluates the MMBU benchmark as an empirical dataset covering 35 submodalities with tasks in classification and detection. No mathematical derivations, parameter fitting, or predictive claims appear in the provided text. The central claim about masking deficiencies in existing benchmarks rests on direct model evaluations rather than any self-referential construction or self-citation chain. This matches the default expectation for benchmark papers, which are self-contained against external model runs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Benchmark paper introduces no free parameters, axioms, or invented entities; it is an empirical contribution.

pith-pipeline@v0.9.1-grok · 5786 in / 967 out tokens · 49301 ms · 2026-06-28T01:37:10.123870+00:00 · methodology

