MMBU: A Massive Multi-modal Biomedical Understanding Benchmark to Probe the Perception Capabilities of Vision-Language Models
Pith reviewed 2026-06-28 01:37 UTC · model grok-4.3
The pith
A new benchmark for biomedical vision-language models shows that reported high accuracies often conceal deficiencies in visual perception and domain generalization.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The MMBU benchmark demonstrates that high accuracy on established biomedical VLM tests frequently masks underlying weaknesses in visual perception and the capacity to generalize across diverse modalities, scales, and clinical contexts, even after medical adaptation.
What carries the argument
The MMBU benchmark, a dataset spanning 35 submodalities with structured metadata that enables parallel evaluation of ungrounded and grounded classification plus object detection.
If this is right
- Medical adaptation improves some model scores but leaves measurable perception gaps across modalities.
- Model performance differs markedly depending on biological scale and imaging type.
- Systematic testing across open and closed task formats exposes where generalization fails.
- Both open-weight and closed frontier models exhibit domain-specific perception limits.
Where Pith is reading between the lines
- Training approaches may need to target fine visual feature extraction more directly rather than leaning on associated text.
- Routine use of perception-focused benchmarks like MMBU could become part of validation before clinical deployment.
- The same evaluation structure might help identify perception shortfalls in vision-language models applied to other technical image domains.
Load-bearing premise
The benchmark's selection of tasks and submodalities succeeds in isolating visual perception from language cues or dataset biases.
What would settle it
If models that score highly on prior benchmarks also score highly on MMBU with no measurable perception or generalization shortfalls, the claim that existing tests mask deficiencies would not hold.
Figures
read the original abstract
Vision and language models (VLMs) hold immense promise to transform biomedical imaging workflows, from detecting lesions in chest X-rays to profiling cellular features in microscopy. Realizing this potential, however, requires robust and fine-grained visual perception. Models need to correctly interpret subtle features in images, and they must do so across diverse biomedical modalities, scales, and contexts. Nevertheless, current benchmarks remain limited. To address these gaps, we introduce the Massive Multimodal Biomedical Understanding (MMBU) benchmark. It is the largest biomedical vision and language benchmark to date, covering 35 submodalities with rich structured metadata. It includes both open and closed versions of ungrounded classification, grounded classification, and object detection, enabling systematic evaluation of model performance across biological scales, clinical settings, and imaging modalities. Evaluating 15 open-weight and 2 frontier VLMs, we find that while medical adaptation provides measurable gains for some models, the high accuracy often reported on established benchmarks can mask deficiencies in visual perception and domain generalization.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the Massive Multimodal Biomedical Understanding (MMBU) benchmark, described as the largest biomedical vision-language benchmark to date. It covers 35 submodalities with rich structured metadata and supports open/closed variants of ungrounded classification, grounded classification, and object detection tasks across biological scales and imaging modalities. Evaluation of 15 open-weight and 2 frontier VLMs shows that medical adaptation yields some gains, yet the authors conclude that high accuracy on established benchmarks can mask deficiencies in visual perception and domain generalization.
Significance. If MMBU's task suite and construction genuinely isolate visual perception capabilities from language priors and dataset biases, the benchmark would constitute a useful addition to the field by enabling more targeted diagnosis of VLM limitations in biomedical settings and supporting development of models with better domain generalization.
major comments (1)
- [Abstract] Abstract: The central claim—that established benchmarks mask deficiencies in visual perception and domain generalization—requires evidence that MMBU tasks measure visual content rather than textual cues or statistical shortcuts. The abstract provides no description of benchmark construction details, data sources, or controls such as vision-ablated baselines, adversarial text variants, or annotation-leakage checks, leaving the claim unsupported by the presented information.
Simulated Author's Rebuttal
We thank the referee for their review and for highlighting the need for the abstract to better substantiate the central claim. We address the comment below.
read point-by-point responses
-
Referee: [Abstract] Abstract: The central claim—that established benchmarks mask deficiencies in visual perception and domain generalization—requires evidence that MMBU tasks measure visual content rather than textual cues or statistical shortcuts. The abstract provides no description of benchmark construction details, data sources, or controls such as vision-ablated baselines, adversarial text variants, or annotation-leakage checks, leaving the claim unsupported by the presented information.
Authors: We agree that the abstract, constrained by length, omits key supporting details present in the full manuscript. Section 3 describes the benchmark construction from 35 public biomedical datasets with structured metadata across scales and modalities; Section 4 details the task variants (ungrounded/grounded classification and detection); and Section 5 reports controls including vision-ablated baselines (showing sharp performance drops without images) and comparisons that isolate visual perception from language priors. We will revise the abstract to concisely reference the construction process, data sources, and the use of such controls to support the claim. revision: yes
Circularity Check
No circularity: empirical benchmark with no derivations or fitted predictions
full rationale
The paper introduces and evaluates the MMBU benchmark as an empirical dataset covering 35 submodalities with tasks in classification and detection. No mathematical derivations, parameter fitting, or predictive claims appear in the provided text. The central claim about masking deficiencies in existing benchmarks rests on direct model evaluations rather than any self-referential construction or self-citation chain. This matches the default expectation for benchmark papers, which are self-contained against external model runs.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Burgess, J., Nirschl, J.J., Bravo-Sánchez, L., Lozano, A., Gupte, S.R., Galaz- Montoya, J.G., Zhang, Y., Su, Y., Bhowmik, D., Coman, Z., Hasan, S.M., Jo- hannesson, A., Leineweber, W.D., Nair, M.G., Yarlagadda, R., Zuraski, C., Chiu, W., Cohen, S., Hansen, J.N., Leonetti, M.D., Liu, C., Lundberg, E., Yeung-Levy, 16 D’Cunha et al. S.: Microvqa: A multimoda...
arXiv 2025
-
[2]
Chen, P., Ye, J., Wang, G., Li, Y., Deng, Z., Li, W., Li, T., Duan, H., Huang, Z., Su, Y., Wang, B., Zhang, S., Fu, B., Cai, J., Zhuang, B., Seibel, E.J., He, J., Qiao, Y.: Gmai-mmbench: A comprehensive multimodal evaluation benchmark towards general medical ai (2024),https://arxiv.org/abs/2408.03361
arXiv 2024
-
[3]
arXiv preprint arXiv:2509.18234 (2025)
Gu, Y., Fu, J., Liu, X., Valanarasu, J.M.J., Codella, N., Tan, R., Liu, Q., Jin, Y., Zhang, S., Wang, J., et al.: The illusion of readiness: Stress testing large fron- tier models on multimodal medical benchmarks. arXiv preprint arXiv:2509.18234 (2025)
arXiv 2025
-
[4]
He, X., Zhang, Y., Mou, L., Xing, E., Xie, P.: Pathvqa: 30000+ questions for medical visual question answering (2020),https://arxiv.org/abs/2003.10286
Pith/arXiv arXiv 2020
-
[5]
Hu, Y., Li, T., Lu, Q., Shao, W., He, J., Qiao, Y., Luo, P.: Omnimedvqa: A new large-scale comprehensive evaluation benchmark for medical lvlm (2024),https: //arxiv.org/abs/2402.09181
arXiv 2024
-
[6]
arXiv preprint arXiv:2411.08870 (2024)
Jeong, D.P., Mani, P., Garg, S., Lipton, Z.C., Oberst, M.: The limited impact of medical adaptation of large language and vision-language models. arXiv preprint arXiv:2411.08870 (2024)
arXiv 2024
-
[7]
Lau, J.J., Gayen, S., Ben Abacha, A., Demner-Fushman, D.: A dataset of clinically generated visual questions and answers about radiology images. Scientific Data5 (2018).https://doi.org/10.1038/sdata.2018.251,https://www.nature.com/ articles/sdata2018251
-
[8]
Le, A., Liu, H., Wang, Y., Liu, Z., Zhu, R., Weng, T., Yu, J., Wang, B., Wu, Y., Yan,K.,Sun,Q.,Jiang,M.,Pei,J.,Liu,S.,Zheng,H.,Li,Z.,Noble,A.,Souquet,J., Guo, X., Lin, M., Guo, H.: U2-bench: Benchmarking large vision-language models on ultrasound understanding (2025),https://arxiv.org/abs/2505.17779
arXiv 2025
-
[9]
Advances in Neural Information Processing Systems36, 28541–28564 (2023)
Li, C., Wong, C., Zhang, S., Usuyama, N., Liu, H., Yang, J., Naumann, T., Poon, H., Gao, J.: Llava-med: Training a large language-and-vision assistant for biomedicine in one day. Advances in Neural Information Processing Systems36, 28541–28564 (2023)
2023
-
[10]
Advances in Neural Information Processing Systems36 (2024)
Li, C., Wong, C., Zhang, S., Usuyama, N., Liu, H., Yang, J., Naumann, T., Poon, H., Gao, J.: Llava-med: Training a large language-and-vision assistant for biomedicine in one day. Advances in Neural Information Processing Systems36 (2024)
2024
-
[11]
Liu, B., Zhan, L.M., Xu, L., Ma, L., Yang, Y., Wu, X.M.: Slake: A semantically- labeled knowledge-enhanced dataset for medical visual question answering (2021), https://arxiv.org/abs/2102.09542
arXiv 2021
-
[12]
Liu, H., Li, C., Li, Y., Lee, Y.J.: Improved baselines with visual instruction tuning (2024),https://arxiv.org/abs/2310.03744
Pith/arXiv arXiv 2024
-
[13]
arXiv preprint arXiv:2407.01791 (2024)
Lozano, A., Nirschl, J., Burgess, J., Gupte, S.R., Zhang, Y., Unell, A., Yeung- Levy, S.:{\mu}-bench: A vision-language benchmark for microscopy understand- ing. arXiv preprint arXiv:2407.01791 (2024)
arXiv 2024
-
[14]
In: Proceedingsofthe38thInternationalConferenceonNeuralInformationProcessing Systems
Lozano, A., Nirschl, J., Burgess, J., Gupte, S.R., Zhang, Y., Unell, A., Yeung-Levy, S.: Micro-bench: a vision-language benchmark for microscopy understanding. In: Proceedingsofthe38thInternationalConferenceonNeuralInformationProcessing Systems. pp. 30670–30685 (2024)
2024
-
[15]
Advances in Neural Information Processing Systems37, 131035– 131071 (2024) MMBU: Massive Multimodal Biomedical Understanding 17
Maruf, M., Daw, A., Mehrab, K.S., Manogaran, H.B., Neog, A., Sawhney, M., Khurana, M., Balhoff, J.P., Bakış, Y., Altintas, B., et al.: Vlm4bio: A benchmark dataset to evaluate pretrained vision-language models for trait discovery from bi- ological images. Advances in Neural Information Processing Systems37, 131035– 131071 (2024) MMBU: Massive Multimodal B...
2024
-
[16]
Model ID: gpt-4.1-mini; snapshot: gpt-4.1-mini-2025-04-14
OpenAI: GPT-4.1 mini.https://developers.openai.com/api/docs/models/ gpt-4.1-mini(2025), openAI API model documentation. Model ID: gpt-4.1-mini; snapshot: gpt-4.1-mini-2025-04-14. Accessed: 2026-05-27
2025
-
[17]
Model ID: gpt-5.4-mini; snapshot: gpt-5.4-mini-2026-03-17
OpenAI: GPT-5.4 mini.https://developers.openai.com/api/docs/models/ gpt-5.4-mini(2026), openAI API model documentation. Model ID: gpt-5.4-mini; snapshot: gpt-5.4-mini-2026-03-17. Accessed: 2026-05-27
2026
-
[18]
arXiv preprint arXiv:2511.23269 (2025)
Ossowski, T., Zhang, S., Liu, Q., Qin, G., Tan, R., Naumann, T., Hu, J., Poon, H.: Octomed: Data recipes for state-of-the-art multimodal medical reasoning. arXiv preprint arXiv:2511.23269 (2025)
arXiv 2025
-
[19]
Qwen, :, Yang, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Li, C., Liu, D., Huang, F., Wei, H., Lin, H., Yang, J., Tu, J., Zhang, J., Yang, J., Yang, J., Zhou, J., Lin, J., Dang, K., Lu, K., Bao, K., Yang, K., Yu, L., Li, M., Xue, M., Zhang, P., Zhu, Q., Men, R., Lin, R., Li, T., Tang, T., Xia, T., Ren, X., Ren, X., Fan, Y., Su, Y., Zhang, Y., Wa...
Pith/arXiv arXiv 2025
-
[20]
arXiv preprint arXiv:2404.18416 (2024)
Saab, K., Tu, T., Weng, W.H., Tanno, R., Stutz, D., Wulczyn, E., Zhang, F., Strother, T., Park, C., Vedadi, E., et al.: Capabilities of gemini models in medicine. arXiv preprint arXiv:2404.18416 (2024)
Pith/arXiv arXiv 2024
-
[21]
arXiv preprint arXiv:2507.05201 (2025)
Sellergren, A., Kazemzadeh, S., Jaroensri, T., Kiraly, A., Traverse, M., Kohlberger, T., Xu, S., Jamil, F., Hughes, C., Lau, C., et al.: Medgemma technical report. arXiv preprint arXiv:2507.05201 (2025)
Pith/arXiv arXiv 2025
-
[22]
Shi, L., Ma, C., Liang, W., Diao, X., Ma, W., Vosoughi, S.: Judging the judges: A systematic study of position bias in llm-as-a-judge. In: Proceedings of the 14th In- ternational Joint Conference on Natural Language Processing and the 4th Confer- ence of the Asia-Pacific Chapter of the Association for Computational Linguistics. pp. 292–314 (2025)
2025
-
[23]
Team, G.: Gemma 3 technical report (2025),https://arxiv.org/abs/2503.19786
Pith/arXiv arXiv 2025
-
[24]
NPJ digital medicine5(1), 48 (2022)
Varoquaux, G., Cheplygina, V.: Machine learning for medical imaging: method- ological failures and recommendations for the future. NPJ digital medicine5(1), 48 (2022)
2022
-
[25]
Wang, W., Gao, Z., Gu, L., Pu, H., Cui, L., Wei, X., Liu, Z., Jing, L., Ye, S., Shao, J., Wang, Z., Chen, Z., Zhang, H., Yang, G., Wang, H., Wei, Q., Yin, J., Li, W., Cui, E., Chen, G., Ding, Z., Tian, C., Wu, Z., Xie, J., Li, Z., Yang, B., Duan, Y., Wang, X., Hou, Z., Hao, H., Zhang, T., Li, S., Zhao, X., Duan, H., Deng, N., Fu, B., He, Y., Wang, Y., He,...
Pith/arXiv arXiv 2025
-
[26]
Ra- diography22(2), e131–e136 (2016)
Wright, C., Reeves, P.: Radbench: benchmarking image interpretation skills. Ra- diography22(2), e131–e136 (2016)
2016
-
[27]
arXiv preprint arXiv:2506.07044 (2025)
Xu, W., Chan, H.P., Li, L., Aljunied, M., Yuan, R., Wang, J., Xiao, C., Chen, G., Liu,C.,Li,Z.,etal.:Lingshu:Ageneralistfoundationmodelforunifiedmultimodal medical understanding and reasoning. arXiv preprint arXiv:2506.07044 (2025)
Pith/arXiv arXiv 2025
-
[28]
Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., Zheng, C., Liu, D., Zhou, F., Huang, F., Hu, F., Ge, H., Wei, H., Lin, H., Tang, J., Yang, J., Tu, J., Zhang, J., Yang, J., Yang, J., Zhou, J., Zhou, J., Lin, J., Dang, K., Bao, K., Yang, K., Yu, L., Deng, L., Li, M., Xue, M., Li, M., Zhang, 18 D’Cunha et al. P....
Pith/arXiv arXiv 2025
-
[29]
Yang, Y., Zhang, H., Gichoya, J.W., Katabi, D., Ghassemi, M.: The limits of fair medicalimagingaiinreal-worldgeneralization.Naturemedicine30(10),2838–2848 (2024)
2024
-
[30]
Yue, X., Ni, Y., Zhang, K., Zheng, T., Liu, R., Zhang, G., Stevens, S., Jiang, D., Ren, W., Sun, Y., Wei, C., Yu, B., Yuan, R., Sun, R., Yin, M., Zheng, B., Yang, Z., Liu, Y., Huang, W., Sun, H., Su, Y., Chen, W.: Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi (2024),https://arxiv.org/abs/2311.16502
Pith/arXiv arXiv 2024
-
[31]
arXiv preprint arXiv:2303.00915 (2023)
Zhang, S., Xu, Y., Usuyama, N., Xu, H., Bagga, J., Tinn, R., Preston, S., Rao, R., Wei, M., Valluri, N., et al.: Biomedclip: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs. arXiv preprint arXiv:2303.00915 (2023)
Pith/arXiv arXiv 2023
-
[32]
npj Artificial Intelligence1(1), 44 (2025)
Zhou, J., Li, H., Chen, S., Chen, Z., Han, Z., Gao, X.: Large language models in biomedicine and healthcare. npj Artificial Intelligence1(1), 44 (2025)
2025
-
[33]
Zhou, Y., Ong, H., Kennedy, P., Wu, C.C., Kazam, J., Hentel, K., Flanders, A., Shih, G., Peng, Y.: Evaluating gpt-4v (gpt-4 with vision) on detection of radio- logic findings on chest radiographs. Radiology311(2) (2024).https://doi.org/ 10.1148/radiol.233270,https://pubs.rsna.org/doi/full/10.1148/radiol. 233270, pMID: 38712869 In the supplementary materia...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.