MMBU: A Massive Multi-modal Biomedical Understanding Benchmark to Probe the Perception Capabilities of Vision-Language Models

Alejandro Lozano; Daniel Vela Jarquin; James Burgess; Jeffrey J. Nirschl; Jin Ye; Josiah Aklilu; Junjun He; Ming Hu; Min Woo Sun; Paola Avila

arxiv: 2606.06696 · v1 · pith:A5L5SDQUnew · submitted 2026-06-04 · 💻 cs.CV · cs.AI


Ryan D'Cunha, Alejandro Lozano, Xiaoxiao Sun, Daniel Vela Jarquin, Min Woo Sun, Josiah Aklilu, James Burgess, Yuhui Zhang, Ryan Nayebi, Paola Avila Robayo, Jin Ye, Ming Hu, Zhongying Deng, Junjun He, Xin Chen, Yue Yao, Robert Tibshirani, Jeffrey J. Nirschl, Serena Yeung-Levy


Pith reviewed 2026-06-28 01:37 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords biomedical imaging · vision-language models · benchmark · visual perception · multimodal AI · medical imaging · domain generalization · object detection


The pith

A new benchmark for biomedical vision-language models shows that reported high accuracies often conceal deficiencies in visual perception and domain generalization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents the Massive Multimodal Biomedical Understanding benchmark as a way to test whether vision-language models can accurately interpret subtle visual details across many types of biomedical images. It covers 35 submodalities and uses both open-ended and closed tasks for classification and object detection to check performance at different scales and settings. Testing 17 models reveals that medical adaptation helps in some cases but does not eliminate gaps in perception or the ability to handle new contexts. This matters because models that score well on narrower tests may still fail when faced with real variation in medical imaging data.

Core claim

The MMBU benchmark demonstrates that high accuracy on established biomedical VLM tests frequently masks underlying weaknesses in visual perception and the capacity to generalize across diverse modalities, scales, and clinical contexts, even after medical adaptation.

What carries the argument

The MMBU benchmark, a dataset spanning 35 submodalities with structured metadata that enables parallel evaluation of ungrounded and grounded classification plus object detection.
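As a rough illustration of how structured metadata could support this kind of parallel evaluation, the sketch below groups per-question results by metadata fields and reports accuracy per group. The record fields (`submodality`, `task_format`, `correct`) and the toy values are hypothetical; they are not taken from the MMBU release or the authors' evaluation code.

```python
from collections import defaultdict


def accuracy_by_group(records, keys):
    """Return {group_tuple: accuracy} for a list of per-question result dicts."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for rec in records:
        group = tuple(rec[k] for k in keys)
        totals[group] += 1
        hits[group] += int(rec["correct"])
    return {group: hits[group] / totals[group] for group in totals}


# Hypothetical toy records; the field names are assumptions for illustration.
records = [
    {"submodality": "chest_xray", "task_format": "closed", "correct": True},
    {"submodality": "chest_xray", "task_format": "closed", "correct": False},
    {"submodality": "chest_xray", "task_format": "open", "correct": False},
    {"submodality": "fluorescence", "task_format": "closed", "correct": True},
]

by_group = accuracy_by_group(records, ["submodality", "task_format"])
for group, acc in sorted(by_group.items()):
    print(group, f"{acc:.2f}")
```

Passing a different key list (for example a single hypothetical `modality` field) would give coarser or finer breakdowns from the same records.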

If this is right

Medical adaptation improves some model scores but leaves measurable perception gaps across modalities.
Model performance differs markedly depending on biological scale and imaging type.
Systematic testing across open and closed task formats exposes where generalization fails.
Both open-weight and closed frontier models exhibit domain-specific perception limits.
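One way to make the first consequence concrete is a per-question head-to-head between a medically adapted model and its base model, in the spirit of the win/tie/loss comparison described in the Figure 6 caption. The sketch below is an assumption about how such a tally might be computed, not the authors' procedure; the function name and toy inputs are hypothetical.

```python
def head_to_head(medical_correct, base_correct):
    """Return win/tie/loss fractions for the medical model over paired questions.

    Both arguments are equal-length lists of booleans, one entry per question.
    A "win" means the medical model is correct and the base model is not.
    """
    if len(medical_correct) != len(base_correct):
        raise ValueError("both models must be scored on the same questions")
    n = len(medical_correct)
    if n == 0:
        raise ValueError("need at least one question")
    wins = sum(1 for m, b in zip(medical_correct, base_correct) if m and not b)
    losses = sum(1 for m, b in zip(medical_correct, base_correct) if b and not m)
    ties = n - wins - losses
    return {"win": wins / n, "tie": ties / n, "loss": losses / n}


# Hypothetical per-question correctness for one medical/base model pair.
result = head_to_head(
    medical_correct=[True, True, False, False],
    base_correct=[True, False, True, False],
)
print(result)
```

A tie here covers both "both correct" and "both wrong", so a high tie rate alone does not say whether either model perceives the image well.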

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Training approaches may need to target fine visual feature extraction more directly rather than leaning on associated text.
Routine use of perception-focused benchmarks like MMBU could become part of validation before clinical deployment.
The same evaluation structure might help identify perception shortfalls in vision-language models applied to other technical image domains.

Load-bearing premise

The benchmark's selection of tasks and submodalities succeeds in isolating visual perception from language cues or dataset biases.
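A common way to probe a premise like this is an image-ablation check: score the same questions with and without the image and look at the difference. The sketch below is a minimal version under stated assumptions; the function name and the toy correctness lists are hypothetical and do not reproduce any result from the paper.

```python
def ablation_gap(with_image, without_image):
    """Accuracy with the image minus accuracy when the image is withheld.

    Both arguments are lists of booleans over the same questions. A gap near
    zero would suggest the questions can be answered from text cues alone.
    """
    if len(with_image) != len(without_image) or not with_image:
        raise ValueError("need two non-empty lists over the same questions")
    acc_with = sum(with_image) / len(with_image)
    acc_without = sum(without_image) / len(without_image)
    return acc_with - acc_without


# Hypothetical per-question correctness for one model.
gap = ablation_gap(
    with_image=[True, True, True, False],
    without_image=[True, False, False, False],
)
print(f"image-ablation gap: {gap:.2f}")
```

For closed-format questions, the text-only accuracy is also worth comparing against the chance rate implied by the number of answer options, since a text-only score well above chance points to language priors or answer-distribution shortcuts.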

What would settle it

If models that score highly on prior benchmarks also score highly on MMBU with no measurable perception or generalization shortfalls, the claim that existing tests mask deficiencies would not hold.

Figures

Figures reproduced from arXiv: 2606.06696 by Alejandro Lozano, Daniel Vela Jarquin, James Burgess, Jeffrey J. Nirschl, Jin Ye, Josiah Aklilu, Junjun He, Ming Hu, Min Woo Sun, Paola Avila Robayo, Robert Tibshirani, Ryan D'Cunha, Ryan Nayebi, Serena Yeung-Levy, Xiaoxiao Sun, Xin Chen, Yue Yao, Yuhui Zhang, Zhongying Deng.

**Figure 1.** The data landscape of MMBU. Current biomedical VLM evaluation relies on roughly 20 commonly used datasets. However, as the training data for large models expands, this evaluation becomes inadequate due to issues such as data pollution and a lack of diversity. We introduce MMBU to address this issue.

**Figure 2.** Multi-task visual examples and metadata-driven question construction in MMBU. Top: Data collection, metadata extraction, and standardization. Middle: Representative samples from diverse medical domains and modalities across three task types, including classification, detection, and segmentation. Bottom: Example of question construction in MMBU using newly collected metadata. An example benchmark question is…

**Figure 3.** Overview of MMBU dataset composition across modalities, submodalities, medical domains, and body parts. (a) Distribution of top-level imaging modalities. (b) Distribution of imaging submodalities, shown in two panels for readability, with counts on a log2 scale. (c) Distribution of medical domains grouped into clinical and laboratory categories. (d) Top body parts (15 out of 95 total body-part categorie…

**Figure 4.** Aggregate performance on MMBU. Performance of a representative set of VLMs on the classification and detection tasks in the benchmark. Solid outlines denote open-format results, while boxes without outlines denote closed-format results. Models are ranked by their closed-format performance. Dashes indicate models adapted to medical data. Similar colors indicate models of the same family.

**Figure 5.** Comparison of model performance across aggregated biomedical do…

**Figure 6.** Medical models vs. base models. (a) Head-to-head comparison of general-purpose and medical vision-language models on MMBU, showing the proportion of questions where the medical model wins, ties, or loses. (b) Relationship between medical model training data size (in millions of examples) and win rate on MMBU.

**Figure 7.** Performance comparison across MMBU subsets (x-axis) and legacy…

**Figure 8.** Modality and submodality within MMBU.

**Figure 9.** F1 scores for VLMs on MMBU organized by 11 unique modalities

**Figure 10.** F1 scores for VLMs for MMBU's 35 unique submodalities.

**Figure 11.** Dumbbell plot of open vs closed VQA scores across SOTA models

**Figure 12.** Dumbbell plot of open vs closed VQA scores across SOTA models

**Figure 13.** Dumbbell plot of open vs closed VQA scores across SOTA models

**Figure 14.** Dumbbell plot of open vs closed VQA scores across SOTA models

**Figure 15.** Correctness heatmap of all models tested on MMBU

**Figure 16.** Radial plot comparing base and medical models on MMBU domains

**Figure 17.** Radial plot comparing base and medical models on MMBU domains

**Figure 18.** Radial plot comparing base and medical models on MMBU domains

**Figure 19.** Radial plot comparing accuracy on the different question types by…

**Figure 20.** Radial plot comparing accuracy on the different question types by…

**Figure 21.** Comparing MMBU questions evaluated with and without the im…

**Figure 22.** UMAP comparing MMBU's Radiology subset against popular radi…

**Figure 23.** UMAP comparing MMBU's Radiology subset against popular radi…

**Figure 24.** UMAP comparing MMBU's Pathology subset against a popular…

**Figure 25.** UMAP comparing MMBU's Pathology subset against a popular…

**Figure 26.** Model Performance on GMAI-MMBench vs MMBU (GMAI Sub…

Original abstract

Vision and language models (VLMs) hold immense promise to transform biomedical imaging workflows, from detecting lesions in chest X-rays to profiling cellular features in microscopy. Realizing this potential, however, requires robust and fine-grained visual perception. Models need to correctly interpret subtle features in images, and they must do so across diverse biomedical modalities, scales, and contexts. Nevertheless, current benchmarks remain limited. To address these gaps, we introduce the Massive Multimodal Biomedical Understanding (MMBU) benchmark. It is the largest biomedical vision and language benchmark to date, covering 35 submodalities with rich structured metadata. It includes both open and closed versions of ungrounded classification, grounded classification, and object detection, enabling systematic evaluation of model performance across biological scales, clinical settings, and imaging modalities. Evaluating 15 open-weight and 2 frontier VLMs, we find that while medical adaptation provides measurable gains for some models, the high accuracy often reported on established benchmarks can mask deficiencies in visual perception and domain generalization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

MMBU brings useful scale to biomedical VLM testing with 35 submodalities, but the abstract leaves the key claim about isolating visual perception thin on supporting details.


The paper introduces MMBU as the largest biomedical VLM benchmark so far, spanning 35 submodalities with structured metadata and three task families in both open and closed forms. The authors evaluate 17 models (15 open-weight and 2 frontier) and report that medical adaptation gives some gains while standard benchmarks still hide perception and generalization shortfalls.

The scale and breadth are the clear addition. Covering biological scales, clinical settings, and imaging types in one suite lets people compare models more systematically than the usual narrow sets. The model results section shows concrete differences across variants, which is the kind of data that can guide follow-up work.

The soft spot is construction transparency. The central point—that high scores on old benchmarks mask real visual weaknesses—only holds if the new tasks actually force models to use image content rather than language priors or metadata shortcuts. The abstract gives no numbers on vision-ablated controls, adversarial text tests, or leakage checks, so it is difficult to tell how much of the reported gap is genuine perception failure versus benchmark artifact. That gap matters because the claim is load-bearing.

This is for groups building or auditing biomedical VLMs who need broader test coverage. A methods-focused reader will want the full construction appendix before adopting it as a standard.

It should go to peer review. The scale is real and the evaluation setup is worth referee scrutiny even if the perception-isolation argument needs more evidence to land cleanly.

Referee Report

1 major / 0 minor

Summary. The manuscript introduces the Massive Multimodal Biomedical Understanding (MMBU) benchmark, described as the largest biomedical vision-language benchmark to date. It covers 35 submodalities with rich structured metadata and supports open/closed variants of ungrounded classification, grounded classification, and object detection tasks across biological scales and imaging modalities. Evaluation of 15 open-weight and 2 frontier VLMs shows that medical adaptation yields some gains, yet the authors conclude that high accuracy on established benchmarks can mask deficiencies in visual perception and domain generalization.

Significance. If MMBU's task suite and construction genuinely isolate visual perception capabilities from language priors and dataset biases, the benchmark would constitute a useful addition to the field by enabling more targeted diagnosis of VLM limitations in biomedical settings and supporting development of models with better domain generalization.

major comments (1)

[Abstract] The central claim, that established benchmarks mask deficiencies in visual perception and domain generalization, requires evidence that MMBU tasks measure visual content rather than textual cues or statistical shortcuts. The abstract provides no description of benchmark construction details, data sources, or controls such as vision-ablated baselines, adversarial text variants, or annotation-leakage checks, leaving the claim unsupported by the presented information.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their review and for highlighting the need for the abstract to better substantiate the central claim. We address the comment below.


Referee: [Abstract] The central claim, that established benchmarks mask deficiencies in visual perception and domain generalization, requires evidence that MMBU tasks measure visual content rather than textual cues or statistical shortcuts. The abstract provides no description of benchmark construction details, data sources, or controls such as vision-ablated baselines, adversarial text variants, or annotation-leakage checks, leaving the claim unsupported by the presented information.

Authors: We agree that the abstract, constrained by length, omits key supporting details present in the full manuscript. Section 3 describes the benchmark construction from 35 public biomedical datasets with structured metadata across scales and modalities; Section 4 details the task variants (ungrounded/grounded classification and detection); and Section 5 reports controls including vision-ablated baselines (showing sharp performance drops without images) and comparisons that isolate visual perception from language priors. We will revise the abstract to concisely reference the construction process, data sources, and the use of such controls to support the claim.

revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark with no derivations or fitted predictions


The paper introduces and evaluates the MMBU benchmark as an empirical dataset covering 35 submodalities with tasks in classification and detection. No mathematical derivations, parameter fitting, or predictive claims appear in the provided text. The central claim about masking deficiencies in existing benchmarks rests on direct model evaluations rather than any self-referential construction or self-citation chain. This matches the default expectation for benchmark papers, which are self-contained against external model runs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Benchmark paper introduces no free parameters, axioms, or invented entities; it is an empirical contribution.

pith-pipeline@v0.9.1-grok · 5786 in / 967 out tokens · 49301 ms · 2026-06-28T01:37:10.123870+00:00 · methodology


Reference graph

Works this paper leans on

33 extracted references · 2 canonical work pages

[1] Burgess, J., Nirschl, J.J., Bravo-Sánchez, L., Lozano, A., Gupte, S.R., Galaz-Montoya, J.G., Zhang, Y., Su, Y., Bhowmik, D., Coman, Z., Hasan, S.M., Johannesson, A., Leineweber, W.D., Nair, M.G., Yarlagadda, R., Zuraski, C., Chiu, W., Cohen, S., Hansen, J.N., Leonetti, M.D., Liu, C., Lundberg, E., Yeung-Levy, S.: Microvqa: A multimodal reasoning benchmark for microscopy-based scientific research (2025), https://arxiv.org/abs/2503.13399

[2] Chen, P., Ye, J., Wang, G., Li, Y., Deng, Z., Li, W., Li, T., Duan, H., Huang, Z., Su, Y., Wang, B., Zhang, S., Fu, B., Cai, J., Zhuang, B., Seibel, E.J., He, J., Qiao, Y.: Gmai-mmbench: A comprehensive multimodal evaluation benchmark towards general medical ai (2024), https://arxiv.org/abs/2408.03361

[3] Gu, Y., Fu, J., Liu, X., Valanarasu, J.M.J., Codella, N., Tan, R., Liu, Q., Jin, Y., Zhang, S., Wang, J., et al.: The illusion of readiness: Stress testing large frontier models on multimodal medical benchmarks. arXiv preprint arXiv:2509.18234 (2025)

[4] He, X., Zhang, Y., Mou, L., Xing, E., Xie, P.: Pathvqa: 30000+ questions for medical visual question answering (2020), https://arxiv.org/abs/2003.10286

[5] Hu, Y., Li, T., Lu, Q., Shao, W., He, J., Qiao, Y., Luo, P.: Omnimedvqa: A new large-scale comprehensive evaluation benchmark for medical lvlm (2024), https://arxiv.org/abs/2402.09181

[6] Jeong, D.P., Mani, P., Garg, S., Lipton, Z.C., Oberst, M.: The limited impact of medical adaptation of large language and vision-language models. arXiv preprint arXiv:2411.08870 (2024)

[7] Lau, J.J., Gayen, S., Ben Abacha, A., Demner-Fushman, D.: A dataset of clinically generated visual questions and answers about radiology images. Scientific Data 5 (2018). https://doi.org/10.1038/sdata.2018.251, https://www.nature.com/articles/sdata2018251

[8] Le, A., Liu, H., Wang, Y., Liu, Z., Zhu, R., Weng, T., Yu, J., Wang, B., Wu, Y., Yan, K., Sun, Q., Jiang, M., Pei, J., Liu, S., Zheng, H., Li, Z., Noble, A., Souquet, J., Guo, X., Lin, M., Guo, H.: U2-bench: Benchmarking large vision-language models on ultrasound understanding (2025), https://arxiv.org/abs/2505.17779

[9] Li, C., Wong, C., Zhang, S., Usuyama, N., Liu, H., Yang, J., Naumann, T., Poon, H., Gao, J.: Llava-med: Training a large language-and-vision assistant for biomedicine in one day. Advances in Neural Information Processing Systems 36, 28541–28564 (2023)

[10] Li, C., Wong, C., Zhang, S., Usuyama, N., Liu, H., Yang, J., Naumann, T., Poon, H., Gao, J.: Llava-med: Training a large language-and-vision assistant for biomedicine in one day. Advances in Neural Information Processing Systems 36 (2024)

[11] Liu, B., Zhan, L.M., Xu, L., Ma, L., Yang, Y., Wu, X.M.: Slake: A semantically-labeled knowledge-enhanced dataset for medical visual question answering (2021), https://arxiv.org/abs/2102.09542

[12] Liu, H., Li, C., Li, Y., Lee, Y.J.: Improved baselines with visual instruction tuning (2024), https://arxiv.org/abs/2310.03744

[13] Lozano, A., Nirschl, J., Burgess, J., Gupte, S.R., Zhang, Y., Unell, A., Yeung-Levy, S.: µ-bench: A vision-language benchmark for microscopy understanding. arXiv preprint arXiv:2407.01791 (2024)

[14] Lozano, A., Nirschl, J., Burgess, J., Gupte, S.R., Zhang, Y., Unell, A., Yeung-Levy, S.: Micro-bench: a vision-language benchmark for microscopy understanding. In: Proceedings of the 38th International Conference on Neural Information Processing Systems. pp. 30670–30685 (2024)

[15] Maruf, M., Daw, A., Mehrab, K.S., Manogaran, H.B., Neog, A., Sawhney, M., Khurana, M., Balhoff, J.P., Bakış, Y., Altintas, B., et al.: Vlm4bio: A benchmark dataset to evaluate pretrained vision-language models for trait discovery from biological images. Advances in Neural Information Processing Systems 37, 131035–131071 (2024)

[16] OpenAI: GPT-4.1 mini. https://developers.openai.com/api/docs/models/gpt-4.1-mini (2025), OpenAI API model documentation. Model ID: gpt-4.1-mini; snapshot: gpt-4.1-mini-2025-04-14. Accessed: 2026-05-27

[17] OpenAI: GPT-5.4 mini. https://developers.openai.com/api/docs/models/gpt-5.4-mini (2026), OpenAI API model documentation. Model ID: gpt-5.4-mini; snapshot: gpt-5.4-mini-2026-03-17. Accessed: 2026-05-27

[18] Ossowski, T., Zhang, S., Liu, Q., Qin, G., Tan, R., Naumann, T., Hu, J., Poon, H.: Octomed: Data recipes for state-of-the-art multimodal medical reasoning. arXiv preprint arXiv:2511.23269 (2025)

[19] Qwen, :, Yang, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Li, C., Liu, D., Huang, F., Wei, H., Lin, H., Yang, J., Tu, J., Zhang, J., Yang, J., Yang, J., Zhou, J., Lin, J., Dang, K., Lu, K., Bao, K., Yang, K., Yu, L., Li, M., Xue, M., Zhang, P., Zhu, Q., Men, R., Lin, R., Li, T., Tang, T., Xia, T., Ren, X., Ren, X., Fan, Y., Su, Y., Zhang, Y., Wa...

[20] Saab, K., Tu, T., Weng, W.H., Tanno, R., Stutz, D., Wulczyn, E., Zhang, F., Strother, T., Park, C., Vedadi, E., et al.: Capabilities of gemini models in medicine. arXiv preprint arXiv:2404.18416 (2024)

[21] Sellergren, A., Kazemzadeh, S., Jaroensri, T., Kiraly, A., Traverse, M., Kohlberger, T., Xu, S., Jamil, F., Hughes, C., Lau, C., et al.: Medgemma technical report. arXiv preprint arXiv:2507.05201 (2025)

[22] Shi, L., Ma, C., Liang, W., Diao, X., Ma, W., Vosoughi, S.: Judging the judges: A systematic study of position bias in llm-as-a-judge. In: Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics. pp. 292–314 (2025)

[23] Team, G.: Gemma 3 technical report (2025), https://arxiv.org/abs/2503.19786

[24] Varoquaux, G., Cheplygina, V.: Machine learning for medical imaging: methodological failures and recommendations for the future. NPJ digital medicine 5(1), 48 (2022)

[25] Wang, W., Gao, Z., Gu, L., Pu, H., Cui, L., Wei, X., Liu, Z., Jing, L., Ye, S., Shao, J., Wang, Z., Chen, Z., Zhang, H., Yang, G., Wang, H., Wei, Q., Yin, J., Li, W., Cui, E., Chen, G., Ding, Z., Tian, C., Wu, Z., Xie, J., Li, Z., Yang, B., Duan, Y., Wang, X., Hou, Z., Hao, H., Zhang, T., Li, S., Zhao, X., Duan, H., Deng, N., Fu, B., He, Y., Wang, Y., He,...

[26] Wright, C., Reeves, P.: Radbench: benchmarking image interpretation skills. Radiography 22(2), e131–e136 (2016)

[27] Xu, W., Chan, H.P., Li, L., Aljunied, M., Yuan, R., Wang, J., Xiao, C., Chen, G., Liu, C., Li, Z., et al.: Lingshu: A generalist foundation model for unified multimodal medical understanding and reasoning. arXiv preprint arXiv:2506.07044 (2025)

[28] Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., Zheng, C., Liu, D., Zhou, F., Huang, F., Hu, F., Ge, H., Wei, H., Lin, H., Tang, J., Yang, J., Tu, J., Zhang, J., Yang, J., Yang, J., Zhou, J., Zhou, J., Lin, J., Dang, K., Bao, K., Yang, K., Yu, L., Deng, L., Li, M., Xue, M., Li, M., Zhang, P....

[29] Yang, Y., Zhang, H., Gichoya, J.W., Katabi, D., Ghassemi, M.: The limits of fair medical imaging ai in real-world generalization. Nature medicine 30(10), 2838–2848 (2024)

[30] Yue, X., Ni, Y., Zhang, K., Zheng, T., Liu, R., Zhang, G., Stevens, S., Jiang, D., Ren, W., Sun, Y., Wei, C., Yu, B., Yuan, R., Sun, R., Yin, M., Zheng, B., Yang, Z., Liu, Y., Huang, W., Sun, H., Su, Y., Chen, W.: Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi (2024), https://arxiv.org/abs/2311.16502

[31] Zhang, S., Xu, Y., Usuyama, N., Xu, H., Bagga, J., Tinn, R., Preston, S., Rao, R., Wei, M., Valluri, N., et al.: Biomedclip: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs. arXiv preprint arXiv:2303.00915 (2023)

[32] Zhou, J., Li, H., Chen, S., Chen, Z., Han, Z., Gao, X.: Large language models in biomedicine and healthcare. npj Artificial Intelligence 1(1), 44 (2025)

[33] Zhou, Y., Ong, H., Kennedy, P., Wu, C.C., Kazam, J., Hentel, K., Flanders, A., Shih, G., Peng, Y.: Evaluating gpt-4v (gpt-4 with vision) on detection of radiologic findings on chest radiographs. Radiology 311(2) (2024). https://doi.org/10.1148/radiol.233270, https://pubs.rsna.org/doi/full/10.1148/radiol.233270, PMID: 38712869

[1] [1]

S.: Microvqa: A multimodal reasoning benchmark for microscopy-based scientific research (2025),https://arxiv.org/abs/2503.13399

Burgess, J., Nirschl, J.J., Bravo-Sánchez, L., Lozano, A., Gupte, S.R., Galaz- Montoya, J.G., Zhang, Y., Su, Y., Bhowmik, D., Coman, Z., Hasan, S.M., Jo- hannesson, A., Leineweber, W.D., Nair, M.G., Yarlagadda, R., Zuraski, C., Chiu, W., Cohen, S., Hansen, J.N., Leonetti, M.D., Liu, C., Lundberg, E., Yeung-Levy, 16 D’Cunha et al. S.: Microvqa: A multimoda...

arXiv 2025

[2] [2]

Chen, P., Ye, J., Wang, G., Li, Y., Deng, Z., Li, W., Li, T., Duan, H., Huang, Z., Su, Y., Wang, B., Zhang, S., Fu, B., Cai, J., Zhuang, B., Seibel, E.J., He, J., Qiao, Y.: Gmai-mmbench: A comprehensive multimodal evaluation benchmark towards general medical ai (2024),https://arxiv.org/abs/2408.03361

arXiv 2024

[3] [3]

arXiv preprint arXiv:2509.18234 (2025)

Gu, Y., Fu, J., Liu, X., Valanarasu, J.M.J., Codella, N., Tan, R., Liu, Q., Jin, Y., Zhang, S., Wang, J., et al.: The illusion of readiness: Stress testing large fron- tier models on multimodal medical benchmarks. arXiv preprint arXiv:2509.18234 (2025)

arXiv 2025

[4] [4]

He, X., Zhang, Y., Mou, L., Xing, E., Xie, P.: Pathvqa: 30000+ questions for medical visual question answering (2020),https://arxiv.org/abs/2003.10286

Pith/arXiv arXiv 2020

[5] [5]

Hu, Y., Li, T., Lu, Q., Shao, W., He, J., Qiao, Y., Luo, P.: Omnimedvqa: A new large-scale comprehensive evaluation benchmark for medical lvlm (2024),https: //arxiv.org/abs/2402.09181

arXiv 2024

[6] [6]

arXiv preprint arXiv:2411.08870 (2024)

Jeong, D.P., Mani, P., Garg, S., Lipton, Z.C., Oberst, M.: The limited impact of medical adaptation of large language and vision-language models. arXiv preprint arXiv:2411.08870 (2024)

arXiv 2024

[7] [7]

Scientific Data5 (2018).https://doi.org/10.1038/sdata.2018.251,https://www.nature.com/ articles/sdata2018251

Lau, J.J., Gayen, S., Ben Abacha, A., Demner-Fushman, D.: A dataset of clinically generated visual questions and answers about radiology images. Scientific Data5 (2018).https://doi.org/10.1038/sdata.2018.251,https://www.nature.com/ articles/sdata2018251

work page doi:10.1038/sdata.2018.251 2018

[8] [8]

Le, A., Liu, H., Wang, Y., Liu, Z., Zhu, R., Weng, T., Yu, J., Wang, B., Wu, Y., Yan,K.,Sun,Q.,Jiang,M.,Pei,J.,Liu,S.,Zheng,H.,Li,Z.,Noble,A.,Souquet,J., Guo, X., Lin, M., Guo, H.: U2-bench: Benchmarking large vision-language models on ultrasound understanding (2025),https://arxiv.org/abs/2505.17779

arXiv 2025

[9] [9]

Advances in Neural Information Processing Systems36, 28541–28564 (2023)

Li, C., Wong, C., Zhang, S., Usuyama, N., Liu, H., Yang, J., Naumann, T., Poon, H., Gao, J.: Llava-med: Training a large language-and-vision assistant for biomedicine in one day. Advances in Neural Information Processing Systems36, 28541–28564 (2023)

2023

[10] [10]

Advances in Neural Information Processing Systems36 (2024)

Li, C., Wong, C., Zhang, S., Usuyama, N., Liu, H., Yang, J., Naumann, T., Poon, H., Gao, J.: Llava-med: Training a large language-and-vision assistant for biomedicine in one day. Advances in Neural Information Processing Systems36 (2024)

2024

[11] [11]

Liu, B., Zhan, L.M., Xu, L., Ma, L., Yang, Y., Wu, X.M.: Slake: A semantically- labeled knowledge-enhanced dataset for medical visual question answering (2021), https://arxiv.org/abs/2102.09542

arXiv 2021

[12] [12]

Liu, H., Li, C., Li, Y., Lee, Y.J.: Improved baselines with visual instruction tuning (2024),https://arxiv.org/abs/2310.03744

Pith/arXiv arXiv 2024

[13] [13]

arXiv preprint arXiv:2407.01791 (2024)

Lozano, A., Nirschl, J., Burgess, J., Gupte, S.R., Zhang, Y., Unell, A., Yeung- Levy, S.:{\mu}-bench: A vision-language benchmark for microscopy understand- ing. arXiv preprint arXiv:2407.01791 (2024)

arXiv 2024

[14] [14]

In: Proceedingsofthe38thInternationalConferenceonNeuralInformationProcessing Systems

Lozano, A., Nirschl, J., Burgess, J., Gupte, S.R., Zhang, Y., Unell, A., Yeung-Levy, S.: Micro-bench: a vision-language benchmark for microscopy understanding. In: Proceedingsofthe38thInternationalConferenceonNeuralInformationProcessing Systems. pp. 30670–30685 (2024)

2024

[15] [15]

Advances in Neural Information Processing Systems37, 131035– 131071 (2024) MMBU: Massive Multimodal Biomedical Understanding 17

Maruf, M., Daw, A., Mehrab, K.S., Manogaran, H.B., Neog, A., Sawhney, M., Khurana, M., Balhoff, J.P., Bakış, Y., Altintas, B., et al.: Vlm4bio: A benchmark dataset to evaluate pretrained vision-language models for trait discovery from bi- ological images. Advances in Neural Information Processing Systems37, 131035– 131071 (2024) MMBU: Massive Multimodal B...

2024

[16] [16]

Model ID: gpt-4.1-mini; snapshot: gpt-4.1-mini-2025-04-14

OpenAI: GPT-4.1 mini.https://developers.openai.com/api/docs/models/ gpt-4.1-mini(2025), openAI API model documentation. Model ID: gpt-4.1-mini; snapshot: gpt-4.1-mini-2025-04-14. Accessed: 2026-05-27

2025

[17] [17]

Model ID: gpt-5.4-mini; snapshot: gpt-5.4-mini-2026-03-17

OpenAI: GPT-5.4 mini.https://developers.openai.com/api/docs/models/ gpt-5.4-mini(2026), openAI API model documentation. Model ID: gpt-5.4-mini; snapshot: gpt-5.4-mini-2026-03-17. Accessed: 2026-05-27

[18]

Ossowski, T., Zhang, S., Liu, Q., Qin, G., Tan, R., Naumann, T., Hu, J., Poon, H.: Octomed: Data recipes for state-of-the-art multimodal medical reasoning. arXiv preprint arXiv:2511.23269 (2025)

[19]

Qwen, :, Yang, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Li, C., Liu, D., Huang, F., Wei, H., Lin, H., Yang, J., Tu, J., Zhang, J., Yang, J., Yang, J., Zhou, J., Lin, J., Dang, K., Lu, K., Bao, K., Yang, K., Yu, L., Li, M., Xue, M., Zhang, P., Zhu, Q., Men, R., Lin, R., Li, T., Tang, T., Xia, T., Ren, X., Ren, X., Fan, Y., Su, Y., Zhang, Y., Wa...

[20]

Saab, K., Tu, T., Weng, W.H., Tanno, R., Stutz, D., Wulczyn, E., Zhang, F., Strother, T., Park, C., Vedadi, E., et al.: Capabilities of gemini models in medicine. arXiv preprint arXiv:2404.18416 (2024)

[21]

Sellergren, A., Kazemzadeh, S., Jaroensri, T., Kiraly, A., Traverse, M., Kohlberger, T., Xu, S., Jamil, F., Hughes, C., Lau, C., et al.: Medgemma technical report. arXiv preprint arXiv:2507.05201 (2025)

[22]

Shi, L., Ma, C., Liang, W., Diao, X., Ma, W., Vosoughi, S.: Judging the judges: A systematic study of position bias in llm-as-a-judge. In: Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics. pp. 292–314 (2025)

[23]

Team, G.: Gemma 3 technical report (2025), https://arxiv.org/abs/2503.19786

[24]

Varoquaux, G., Cheplygina, V.: Machine learning for medical imaging: methodological failures and recommendations for the future. NPJ digital medicine 5(1), 48 (2022)

[25]

Wang, W., Gao, Z., Gu, L., Pu, H., Cui, L., Wei, X., Liu, Z., Jing, L., Ye, S., Shao, J., Wang, Z., Chen, Z., Zhang, H., Yang, G., Wang, H., Wei, Q., Yin, J., Li, W., Cui, E., Chen, G., Ding, Z., Tian, C., Wu, Z., Xie, J., Li, Z., Yang, B., Duan, Y., Wang, X., Hou, Z., Hao, H., Zhang, T., Li, S., Zhao, X., Duan, H., Deng, N., Fu, B., He, Y., Wang, Y., He,...

[26]

Wright, C., Reeves, P.: Radbench: benchmarking image interpretation skills. Radiography 22(2), e131–e136 (2016)

[27]

Xu, W., Chan, H.P., Li, L., Aljunied, M., Yuan, R., Wang, J., Xiao, C., Chen, G., Liu, C., Li, Z., et al.: Lingshu: A generalist foundation model for unified multimodal medical understanding and reasoning. arXiv preprint arXiv:2506.07044 (2025)

[28]

Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., Zheng, C., Liu, D., Zhou, F., Huang, F., Hu, F., Ge, H., Wei, H., Lin, H., Tang, J., Yang, J., Tu, J., Zhang, J., Yang, J., Yang, J., Zhou, J., Zhou, J., Lin, J., Dang, K., Bao, K., Yang, K., Yu, L., Deng, L., Li, M., Xue, M., Li, M., Zhang, P....

[29]

Yang, Y., Zhang, H., Gichoya, J.W., Katabi, D., Ghassemi, M.: The limits of fair medical imaging ai in real-world generalization. Nature medicine 30(10), 2838–2848 (2024)

[30]

Yue, X., Ni, Y., Zhang, K., Zheng, T., Liu, R., Zhang, G., Stevens, S., Jiang, D., Ren, W., Sun, Y., Wei, C., Yu, B., Yuan, R., Sun, R., Yin, M., Zheng, B., Yang, Z., Liu, Y., Huang, W., Sun, H., Su, Y., Chen, W.: Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi (2024), https://arxiv.org/abs/2311.16502

[31]

Zhang, S., Xu, Y., Usuyama, N., Xu, H., Bagga, J., Tinn, R., Preston, S., Rao, R., Wei, M., Valluri, N., et al.: Biomedclip: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs. arXiv preprint arXiv:2303.00915 (2023)

[32]

Zhou, J., Li, H., Chen, S., Chen, Z., Han, Z., Gao, X.: Large language models in biomedicine and healthcare. npj Artificial Intelligence 1(1), 44 (2025)

[33]

Zhou, Y., Ong, H., Kennedy, P., Wu, C.C., Kazam, J., Hentel, K., Flanders, A., Shih, G., Peng, Y.: Evaluating gpt-4v (gpt-4 with vision) on detection of radiologic findings on chest radiographs. Radiology 311(2) (2024). https://doi.org/10.1148/radiol.233270, https://pubs.rsna.org/doi/full/10.1148/radiol.233270, PMID: 38712869 In the supplementary materia...
