
arxiv: 2606.12953 · v1 · pith:4PSDTXTQnew · submitted 2026-06-11 · 💻 cs.AI · cs.CV · cs.LG · eess.IV

OpenMedQ: Broad Open Pretraining for Medical Vision-Language Models

Pith reviewed 2026-06-27 06:54 UTC · model grok-4.3

classification 💻 cs.AI · cs.CV · cs.LG · eess.IV
keywords medical vision-language model · open pretraining · PathVQA · VQA-MED · vision encoder · macro-F1 · BLEU-1 · medical classification

The pith

OpenMedQ pretrained on 3.35 million open medical samples reaches 75.9 BLEU-1 on PathVQA, exceeding models up to 562 billion parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents OpenMedQ as a vision-language model trained on the broadest fully open medical dataset collection to date, consisting of 14 public datasets and roughly 3.35 million samples that cover pathology, radiology, microscopy, and clinical question answering. This pretraining produces state-of-the-art results on PathVQA while matching top reported scores on VQA-MED, and the learned vision encoder leads on transfer to eight separate medical classification tasks when all models use the identical downstream procedure. A reader would care because the results indicate that accessible public data alone can produce competitive or superior performance compared with far larger proprietary models.

Core claim

OpenMedQ is pretrained on 14 fully open datasets spanning pathology, radiology, microscopy, and text-only clinical QA for a total of approximately 3.35 million samples. It reaches a BLEU-1 score of 75.9 on PathVQA, exceeding Med-PaLM M models up to 562 billion parameters, and achieves 64.5 on VQA-MED to match the best reported result. The vision encoder, when applied to eight unseen medical classification benchmarks using the same downstream procedure, yields the highest average macro-F1 of 0.757 compared to prior models including BiomedCLIP, PMC-CLIP, and PubMedCLIP.
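For readers unfamiliar with the headline metric, BLEU-1 is the clipped unigram precision of a generated answer against a reference, multiplied by a brevity penalty. The sketch below is a minimal illustration, assuming lowercased whitespace tokenization and a single reference per answer; the function name `bleu1` is hypothetical, and the tokenization and aggregation the paper actually uses are not described in this summary. It returns a value in [0, 1], whereas the scores quoted above are on a 0 to 100 scale.

```python
import math
from collections import Counter


def bleu1(candidate: str, reference: str) -> float:
    """Sentence-level BLEU-1: clipped unigram precision times brevity penalty.

    Assumes lowercased whitespace tokenization (an illustrative choice,
    not necessarily the one used in the paper).
    """
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand:
        return 0.0
    ref_counts = Counter(ref)
    cand_counts = Counter(cand)
    # Clip each candidate token count by its count in the reference.
    overlap = sum(min(n, ref_counts[tok]) for tok, n in cand_counts.items())
    precision = overlap / len(cand)
    # The brevity penalty discourages answers shorter than the reference.
    if len(cand) >= len(ref):
        penalty = 1.0
    else:
        penalty = math.exp(1.0 - len(ref) / len(cand))
    return precision * penalty
```

Because many PathVQA and VQA-MED answers are short, a unigram metric like this can be sensitive to tokenization and casing, which is one reason matched evaluation settings matter when comparing across papers.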

What carries the argument

The broad open pretraining mix of 14 datasets totaling 3.35 million samples that supplies diverse medical image-text pairs for learning general representations.

Load-bearing premise

The observed performance gains result from the breadth and openness of the 3.35 million sample pretraining mix rather than from differences in architecture, training hyperparameters, or evaluation settings.

What would settle it

A model trained with the same architecture and hyperparameters, either on a narrower subset of the same public data or on a different open mix, that still matches or exceeds OpenMedQ on PathVQA and the eight classification benchmarks.
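The controlled comparison described here can be made concrete as a pair of run configurations that differ in exactly one field. The sketch below is an illustration only: the class `PretrainRun`, its field names, and every default value are hypothetical placeholders, not settings taken from the paper.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class PretrainRun:
    # All field names and values here are hypothetical placeholders.
    corpus: tuple[str, ...]
    architecture: str = "vision-encoder-placeholder"
    optimizer: str = "adamw"
    learning_rate: float = 1e-4
    epochs: int = 10
    augmentation: str = "default"
    seed: int = 0


def differs_only_in_corpus(a: PretrainRun, b: PretrainRun) -> bool:
    """True when two runs share every setting except the pretraining corpus."""
    return a.corpus != b.corpus and replace(a, corpus=b.corpus) == b


# A broad mix and a narrower subset, with everything else held fixed.
full_mix = PretrainRun(corpus=("pathology", "radiology", "microscopy", "clinical_qa"))
narrow_mix = replace(full_mix, corpus=("radiology",))

# A confounded run: the corpus and the learning rate both changed.
confounded = replace(narrow_mix, learning_rate=3e-4)
```

Only a pair for which `differs_only_in_corpus` holds would isolate the effect of the data mix; a pair like `full_mix` and `confounded` leaves the learning rate as an alternative explanation.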

Figures

Figures reproduced from arXiv: 2606.12953 by Ibrahim Gulluk, Max Van Puyvelde, Olivier Gevaert.

Figure 1
Figure 1: (a) Macro-F1 across 8 unseen medical classification benchmarks: all bars share an identical downstream recipe and differ only in the pretrained vision encoder. OpenMedQ attains the highest Mean (0.757). (b) OpenMedQ’s pretraining mix: 14 fully-open datasets (∼3.35M pairs), colored by modality group.
… et al., 2024) (72.27) despite using only 7B parameters. On VQA-MED, OpenMedQ reaches 64.5, just above the 20…

Abstract

We present OpenMedQ, a medical vision-language model pretrained on the broadest fully-open medical mix to date: 14 datasets totaling ~3.35M pretraining samples spanning pathology, radiology, microscopy, and text-only clinical QA. OpenMedQ reaches state-of-the-art BLEU-1 on PathVQA (75.9), beating Med-PaLM M variants up to 562B parameters (~80x larger), and matches the best reported VQA-MED BLEU-1 (64.5). Its vision encoder, transferred to 8 unseen medical classification benchmarks under an identical downstream recipe, obtains the highest average macro-F1 (0.757) among BiomedCLIP (0.745), PMC-CLIP (0.745), PubMedCLIP (0.746), and a from-scratch baseline (0.616). We release our code and an interactive demo is publicly available as a reproducible baseline for the community.
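The average macro-F1 quoted in the abstract combines two levels of averaging: per-class F1 within each benchmark, then an unweighted mean across benchmarks. The sketch below illustrates that structure under stated assumptions: the function names are hypothetical, classes are taken from the ground-truth labels only, and each benchmark is weighted equally. Whether the paper's evaluation code makes the same choices is not stated in this summary.

```python
from statistics import mean


def macro_f1(y_true: list[int], y_pred: list[int]) -> float:
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    scores = []
    for cls in sorted(set(y_true)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == cls and p == cls for t, p in pairs)
        fp = sum(t != cls and p == cls for t, p in pairs)
        fn = sum(t == cls and p != cls for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return mean(scores)


def mean_macro_f1(per_benchmark: dict[str, tuple[list[int], list[int]]]) -> float:
    """Average macro-F1 across benchmarks, each benchmark weighted equally."""
    return mean(macro_f1(t, p) for t, p in per_benchmark.values())
```

Macro averaging gives rare classes the same weight as common ones, which suits imbalanced medical datasets but also means a few misclassified minority samples can move the score noticeably.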

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents OpenMedQ, a medical vision-language model pretrained on the broadest fully open mix to date (14 datasets, ~3.35M samples spanning pathology, radiology, microscopy, and clinical QA). It reports state-of-the-art BLEU-1 of 75.9 on PathVQA (surpassing Med-PaLM M variants up to 562B parameters) and matching the best VQA-MED BLEU-1 of 64.5; the vision encoder transferred under an identical downstream recipe achieves the highest average macro-F1 of 0.757 across 8 unseen medical classification benchmarks, outperforming BiomedCLIP (0.745), PMC-CLIP (0.745), PubMedCLIP (0.746), and a from-scratch baseline (0.616). Code and a demo are released for reproducibility.

Significance. If the reported gains can be attributed to the breadth of the open pretraining corpus, the work would supply a useful, reproducible open baseline for medical VLMs and support community progress on transparent data mixes. The code release is a concrete strength that aids verification.

major comments (2)
  1. [Experiments] Experiments section: The claim that performance gains on the 8 classification benchmarks (average macro-F1 0.757) are due to the 3.35M-sample pretraining mix requires controlled ablations that fix model architecture, optimizer, learning-rate schedule, epochs, and augmentation pipeline while varying only the pretraining data; no such ablations are described, leaving open the possibility that differences in training procedure or implementation explain the results versus BiomedCLIP/PMC-CLIP/PubMedCLIP.
  2. [Experiments] Downstream evaluation protocol: The assertion of an 'identical downstream recipe' for vision-encoder transfer is stated but lacks explicit confirmation that every baseline was re-trained from its published checkpoint under precisely the same code, hyperparameters, and random seeds; without this, the comparison to the 0.745–0.746 macro-F1 scores is not load-bearing for the central claim.
minor comments (2)
  1. The abstract and methods would benefit from a table summarizing per-dataset sample counts, modalities, and any weighting used in the 3.35M pretraining mix.
  2. Reported metrics lack error bars or standard deviations across runs, which would help assess whether the 0.757 vs. 0.745–0.746 differences are statistically meaningful.
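The second minor comment can be illustrated with a rough spread check over repeated runs. The sketch below is not a formal significance test, and the helper names and the threshold `k` are assumptions made for illustration; a paired test or bootstrap over seeds would be more rigorous. The per-seed values in the example are made up and do not come from the paper.

```python
from statistics import mean, stdev


def summarize(scores: list[float]) -> tuple[float, float]:
    """Mean and sample standard deviation of per-seed scores."""
    return mean(scores), stdev(scores)


def gap_exceeds_spread(a: list[float], b: list[float], k: float = 2.0) -> bool:
    """Crude check: is the gap in means larger than k times the larger std?"""
    mean_a, std_a = summarize(a)
    mean_b, std_b = summarize(b)
    return abs(mean_a - mean_b) > k * max(std_a, std_b)


# Made-up per-seed macro-F1 values, for illustration only.
stable_high = [0.80, 0.81, 0.79]
stable_low = [0.70, 0.71, 0.69]
noisy = [0.80, 0.70, 0.75]

clear_gap = gap_exceeds_spread(stable_high, stable_low)
unclear_gap = gap_exceeds_spread(noisy, stable_high)
```

With only a single reported number per model, a check like this cannot be run at all, which is the substance of the referee's request for standard deviations.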

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the experimental rigor of our manuscript. We address each major comment below and will revise the paper to incorporate the suggested clarifications and additional analyses.

point-by-point responses
  1. Referee: [Experiments] Experiments section: The claim that performance gains on the 8 classification benchmarks (average macro-F1 0.757) are due to the 3.35M-sample pretraining mix requires controlled ablations that fix model architecture, optimizer, learning-rate schedule, epochs, and augmentation pipeline while varying only the pretraining data; no such ablations are described, leaving open the possibility that differences in training procedure or implementation explain the results versus BiomedCLIP/PMC-CLIP/PubMedCLIP.

    Authors: We agree that the manuscript would be strengthened by controlled ablations that isolate the contribution of the pretraining data mix. The current version relies on comparisons to published baselines plus a from-scratch control but does not include the requested ablations. In the revised manuscript we will add these experiments, retraining the vision encoder under identical architecture, optimizer, learning-rate schedule, epochs, and augmentation settings while varying only the pretraining corpus, and report the results in a new table. revision: yes

  2. Referee: [Experiments] Downstream evaluation protocol: The assertion of an 'identical downstream recipe' for vision-encoder transfer is stated but lacks explicit confirmation that every baseline was re-trained from its published checkpoint under precisely the same code, hyperparameters, and random seeds; without this, the comparison to the 0.745–0.746 macro-F1 scores is not load-bearing for the central claim.

    Authors: We acknowledge that the manuscript would benefit from more explicit documentation of the downstream protocol. Although our released code enables verification of our procedure, the text does not confirm that all baselines were re-trained under identical conditions. In the revision we will add a dedicated paragraph in the experimental setup section stating that every baseline was re-trained from its published checkpoint using the exact same code, hyperparameters, and random seeds, thereby making the reported comparisons fully reproducible and load-bearing. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark reporting with no derivation chain

full rationale

The paper reports empirical pretraining results on 3.35M samples and downstream benchmark scores (BLEU-1, macro-F1) without any mathematical derivation, first-principles prediction, or load-bearing self-citation chain. Claims rest on direct experimental outcomes and comparisons to external baselines; these are falsifiable via reproduction and do not reduce to fitted inputs or self-referential definitions by construction. No steps match the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract was available; no model architecture equations, training objectives, or dataset construction details are provided, so no free parameters, axioms, or invented entities can be identified.

pith-pipeline@v0.9.1-grok · 5705 in / 1448 out tokens · 19329 ms · 2026-06-27T06:54:21.688164+00:00

