pith. machine review for the scientific record.

arxiv: 2305.10415 · v6 · submitted 2023-05-17 · 💻 cs.CV

Recognition: 2 Lean theorem links

PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 23:04 UTC · model grok-4.3

classification 💻 cs.CV
keywords medical visual question answering · visual instruction tuning · generative model · large language model · medical imaging · dataset construction · fine-tuning

The pith

A generative model trained on a 227k-pair medical VQA dataset from literature outperforms prior systems on clinical benchmarks after fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds a dataset of 227,000 question-answer pairs drawn from nearly 150,000 medical images across different modalities and diseases. It then trains a model that feeds visual features from a pre-trained encoder into a large language model so the system can generate natural-language answers to questions about the images. After initial training on the new collection, the model is fine-tuned on existing public benchmarks and produces more accurate free-form responses than earlier MedVQA approaches. The authors also release a manually checked, harder test set to provide a stricter measure of progress in this generative setting.

Core claim

By constructing the PMC-VQA dataset containing 227k VQA pairs from 149k images and training a model that aligns a pre-trained vision encoder with a large language model, the approach achieves significantly better performance than prior MedVQA models in generating relevant and accurate free-form answers on benchmarks such as VQA-RAD, SLAKE, and Image-Clef-2019 after fine-tuning.

What carries the argument

Alignment between outputs of a pre-trained vision encoder and a large language model to support generation of free-form answers to medical visual questions.
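As a rough illustration, the alignment module can be thought of as a learned projection that maps vision-encoder features into the language model's embedding space, so that image patches become extra soft tokens in the LLM's input sequence. The sketch below is illustrative only: the dimensions and the single linear projection are assumptions for this page, not MedVInT's actual architecture.

```python
import numpy as np

# Hypothetical dimensions chosen for illustration; the paper's real
# encoder/LLM sizes and alignment module may differ.
VISION_DIM, LLM_DIM, N_PATCHES, N_TEXT = 1024, 4096, 4, 6

rng = np.random.default_rng(0)
# The "alignment parameters": a learned linear map from vision-feature
# space into the LLM's token-embedding space.
W_proj = rng.standard_normal((VISION_DIM, LLM_DIM)) * 0.01

def align_and_prepend(patch_features, text_embeddings):
    """Project vision-encoder patch features into the LLM embedding space
    and prepend them to the text token embeddings as a soft visual prefix."""
    visual_tokens = patch_features @ W_proj          # (N_PATCHES, LLM_DIM)
    return np.concatenate([visual_tokens, text_embeddings], axis=0)

patch_features = rng.standard_normal((N_PATCHES, VISION_DIM))
text_embeddings = rng.standard_normal((N_TEXT, LLM_DIM))
sequence = align_and_prepend(patch_features, text_embeddings)
print(sequence.shape)  # (10, 4096): visual prefix followed by text tokens
```

During instruction tuning, only a module like `W_proj` (plus whatever is unfrozen in the two backbones) needs to learn; the LLM then generates the free-form answer conditioned on the combined sequence.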

If this is right

  • Models trained this way generate more relevant and accurate free-form answers on public MedVQA benchmarks.
  • The new dataset supports training across a wide range of medical image modalities and diseases.
  • A manually verified test set offers a stricter benchmark for evaluating generative MedVQA methods.
  • Centralized leaderboards help track improvements in the field.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Scaling medical visual instruction tuning could reduce reliance on task-specific annotated data for new clinical applications.
  • Generative MedVQA systems may integrate more smoothly into doctor-AI conversations than classification-based ones.
  • Literature-derived datasets might capture rare conditions better than small curated clinical sets if publication bias is limited.
  • The fine-tuning gains suggest that the initial large-scale training builds medical visual representations that transfer to other tasks.

Load-bearing premise

Images and questions taken from published medical papers represent the variety and difficulty of questions that arise in actual clinical practice.

What would settle it

Running the trained model on the manually verified test set and finding that its answers are no more accurate or relevant than those of previous MedVQA models would indicate the dataset and training approach do not deliver the claimed improvement.

read the original abstract

Medical Visual Question Answering (MedVQA) presents a significant opportunity to enhance diagnostic accuracy and healthcare delivery by leveraging artificial intelligence to interpret and answer questions based on medical images. In this study, we reframe the problem of MedVQA as a generation task that naturally follows the human-machine interaction and propose a generative-based model for medical visual understanding by aligning visual information from a pre-trained vision encoder with a large language model. We establish a scalable pipeline to construct a large-scale medical visual question-answering dataset, named PMC-VQA, which contains 227k VQA pairs of 149k images that cover various modalities or diseases. We train the proposed model on PMC-VQA and then fine-tune it on multiple public benchmarks, e.g., VQA-RAD, SLAKE, and Image-Clef-2019, significantly outperforming existing MedVQA models in generating relevant, accurate free-form answers. In addition, we propose a test set that has undergone manual verification, which is significantly more challenging, serving to better monitor the development of generative MedVQA methods. To facilitate comprehensive evaluation and comparison, we have maintained a leaderboard at https://paperswithcode.com/paper/pmc-vqa-visual-instruction-tuning-for-medical, offering a centralized resource for tracking progress and benchmarking state-of-the-art approaches. The PMC-VQA dataset emerges as a vital resource for the field of research, and the MedVInT presents a significant breakthrough in the area of MedVQA.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces PMC-VQA, a large-scale dataset of 227k VQA pairs from 149k medical images extracted from PubMed Central literature across modalities and diseases. It proposes MedVInT, a generative model that aligns a pretrained vision encoder with an LLM for medical visual question answering. The model is pretrained on PMC-VQA then fine-tuned on public benchmarks (VQA-RAD, SLAKE, Image-Clef-2019), with claims of significant outperformance over prior MedVQA methods in free-form answer generation. A manually verified, more challenging test set is introduced along with a public leaderboard.

Significance. If the empirical gains hold under detailed scrutiny, the work would provide a valuable scalable pretraining resource for MedVQA and demonstrate the utility of generative alignment pipelines. The manual verification step and maintained leaderboard are constructive contributions that could help standardize evaluation. However, the central transfer claims depend on the unproven assumption that literature-derived pairs generalize to clinical distributions.

major comments (3)
  1. [§3] PMC-VQA construction: The dataset is built from PMC articles, which systematically favor clear, publishable findings; the manuscript provides no quantitative analysis of question ambiguity, diversity metrics, or comparison against real clinical query distributions. This directly affects the claim that fine-tuning gains on VQA-RAD/SLAKE/Image-Clef-2019 reflect genuine medical understanding rather than source artifacts.
  2. [§5] Experiments: The abstract and main claims assert significant outperformance, yet the manuscript supplies no ablation tables isolating the contribution of PMC-VQA pretraining versus standard fine-tuning, no error analysis on failure modes, and no statistical significance tests on the reported gains. These omissions make the central empirical result difficult to evaluate.
  3. [§4.2] Hard test set: The manually verified test set is described as significantly more challenging, but the paper does not specify the verification protocol, inter-annotator agreement, or how its distribution differs from the training split in terms of image quality, question complexity, or answer length.
minor comments (2)
  1. [Abstract] The abstract states quantitative superiority without any numbers; move at least the key accuracy or BLEU scores into the abstract for immediate readability.
  2. [§4] Notation for the vision-language alignment loss is introduced without an explicit equation number; add Eq. (X) and reference it consistently in §4.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will incorporate revisions to strengthen the empirical support and transparency of the manuscript.

read point-by-point responses
  1. Referee: [§3] PMC-VQA construction: The dataset is built from PMC articles, which systematically favor clear, publishable findings; the manuscript provides no quantitative analysis of question ambiguity, diversity metrics, or comparison against real clinical query distributions. This directly affects the claim that fine-tuning gains on VQA-RAD/SLAKE/Image-Clef-2019 reflect genuine medical understanding rather than source artifacts.

    Authors: We agree that further characterization of PMC-VQA is warranted. In the revision we will add quantitative analyses of question diversity (modality/disease/question-type distributions), lexical and semantic ambiguity indicators, and direct statistical comparisons of question/answer distributions against the target clinical benchmarks (VQA-RAD, SLAKE, Image-Clef-2019). We will also explicitly discuss the literature-to-clinical domain gap as a limitation and future-work item. revision: yes

  2. Referee: [§5] Experiments: The abstract and main claims assert significant outperformance, yet the manuscript supplies no ablation tables isolating the contribution of PMC-VQA pretraining versus standard fine-tuning, no error analysis on failure modes, and no statistical significance tests on the reported gains. These omissions make the central empirical result difficult to evaluate.

    Authors: We accept this criticism. The revised manuscript will include (i) ablation tables that isolate the effect of PMC-VQA pretraining, (ii) a dedicated error-analysis section with representative failure cases, and (iii) statistical significance testing (bootstrap resampling and McNemar tests) on all reported gains. These additions will be placed in §5 and the supplementary material. revision: yes

  3. Referee: [§4.2] Hard test set: The manually verified test set is described as significantly more challenging, but the paper does not specify the verification protocol, inter-annotator agreement, or how its distribution differs from the training split in terms of image quality, question complexity, or answer length.

    Authors: We will expand §4.2 with a full description of the verification protocol (annotator background, number of reviewers per sample, resolution procedure for disagreements), report inter-annotator agreement (Cohen’s κ), and provide comparative statistics (image-quality scores, question-length and complexity distributions, answer-length histograms) between the hard test set and the training split. revision: yes
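One way the distribution comparison promised in response 1 could be quantified is a divergence between question-type frequency distributions in the two corpora. The sketch below uses Jensen-Shannon divergence over category counts; the category names and counts are invented purely for illustration, not taken from the paper.

```python
import math
from collections import Counter

def js_divergence(p_counts, q_counts):
    """Jensen-Shannon divergence (in bits, so bounded in [0, 1]) between two
    category-count distributions, e.g. question-type frequencies in a
    literature-derived corpus vs. a clinical benchmark."""
    keys = set(p_counts) | set(q_counts)
    pn, qn = sum(p_counts.values()), sum(q_counts.values())
    # Mixture distribution M = (P + Q) / 2
    m = {k: 0.5 * (p_counts.get(k, 0) / pn + q_counts.get(k, 0) / qn)
         for k in keys}
    def kl(counts, total):
        # KL(P || M), skipping zero-probability categories
        return sum((counts.get(k, 0) / total)
                   * math.log2((counts.get(k, 0) / total) / m[k])
                   for k in keys if counts.get(k, 0) > 0)
    return 0.5 * kl(p_counts, pn) + 0.5 * kl(q_counts, qn)

# Hypothetical question-type counts for illustration only.
pmc    = Counter(modality=50, finding=30, anatomy=20)
clinic = Counter(modality=20, finding=50, anatomy=30)
d = js_divergence(pmc, clinic)   # 0 = identical distributions, 1 = disjoint
```

Reporting a number like `d` alongside per-category histograms would make the literature-to-clinical gap concrete rather than anecdotal.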
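The bootstrap resampling promised in response 2 can be sketched as a paired bootstrap over per-item correctness scores. The scores below are toy data, not the paper's results; 1 means the model answered that test item correctly.

```python
import random

def bootstrap_accuracy_gap(correct_a, correct_b, n_boot=2000, seed=0):
    """Paired bootstrap for the accuracy gap between two models scored on the
    same test items (1 = correct, 0 = wrong). Returns the observed gap and a
    95% percentile confidence interval; an interval excluding 0 suggests the
    gain is not resampling noise."""
    rng = random.Random(seed)
    n = len(correct_a)
    observed = sum(correct_a[i] - correct_b[i] for i in range(n)) / n
    gaps = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]   # resample items with replacement
        gaps.append(sum(correct_a[i] - correct_b[i] for i in idx) / n)
    gaps.sort()
    return observed, (gaps[int(0.025 * n_boot)], gaps[int(0.975 * n_boot)])

# Toy scores: model A correct on 70/100 items, model B on 55/100.
a = [1] * 70 + [0] * 30
b = [1] * 55 + [0] * 45
gap, (lo, hi) = bootstrap_accuracy_gap(a, b)
```

McNemar's test, the other method the authors name, would instead count the discordant pairs (A right / B wrong vs. A wrong / B right) from the same per-item scores.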
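Cohen's κ, the agreement statistic committed to in response 3, can be computed directly from two annotators' label sequences; the labels below are toy data for illustration.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement between two annotators, corrected
    for the agreement expected by chance from their marginal label rates."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    p_obs = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    ca, cb = Counter(labels_a), Counter(labels_b)
    p_exp = sum(ca[k] * cb[k] for k in set(ca) | set(cb)) / (n * n)
    return (p_obs - p_exp) / (1 - p_exp)

# Toy verification labels from two hypothetical annotators.
ann1 = ["yes", "yes", "no", "yes", "no", "no"]
ann2 = ["yes", "no", "no", "yes", "no", "yes"]
kappa = cohens_kappa(ann1, ann2)   # 1 = perfect agreement, 0 = chance level
```

Values above roughly 0.6 are conventionally read as substantial agreement; reporting κ per question type would show where the hard test set's verification is least reliable.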

Circularity Check

0 steps flagged

Standard empirical pipeline on external benchmarks; no load-bearing self-referential derivations or fitted predictions.

full rationale

The paper constructs the PMC-VQA dataset from PubMed Central literature, aligns a pre-trained vision encoder with an LLM to form MedVInT, trains on the 227k pairs, and fine-tunes on independent public benchmarks (VQA-RAD, SLAKE, Image-Clef-2019). Reported gains are measured on those external test sets plus a manually verified subset. No equations, uniqueness theorems, or ansatzes are invoked that reduce the performance claims to quantities defined by the authors' own fitted parameters or prior self-citations. Self-citations, if present, are not load-bearing for the central empirical result. This matches the default expectation of a non-circular ML paper whose claims rest on reproducible external evaluation.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central claim rests on the transferability of general vision-language alignment techniques to the medical domain and on the quality of literature-derived image-question pairs as training data.

free parameters (1)
  • Vision-language alignment parameters
    Learned parameters that connect the vision encoder output to the language model input during instruction tuning.
axioms (2)
  • domain assumption Pre-trained vision encoders extract features sufficient for medical image understanding when aligned with language models
    Invoked in the description of aligning a pre-trained vision encoder with an LLM.
  • domain assumption Literature-sourced image-question pairs form a representative training distribution for clinical MedVQA
    Implicit in the construction of PMC-VQA from PubMed Central articles.

pith-pipeline@v0.9.0 · 5586 in / 1449 out tokens · 35022 ms · 2026-05-15T23:04:29.314082+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Cost.FunctionalEquation washburn_uniqueness_aczel — unclear

    Relation between the paper passage and the cited Recognition theorem.

    We train the proposed model on PMC-VQA and then fine-tune it on multiple public benchmarks, e.g., VQA-RAD, SLAKE, and Image-Clef-2019, significantly outperforming existing MedVQA models in generating relevant, accurate free-form answers.

  • Foundation.LawOfExistence defect_zero_iff_one — unclear

    Relation between the paper passage and the cited Recognition theorem.

    We establish a scalable pipeline to construct a large-scale medical visual question-answering dataset, named PMC-VQA, which contains 227k VQA pairs of 149k images that cover various modalities or diseases.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. DeepTumorVQA: A Hierarchical 3D CT Benchmark for Stage-Wise Evaluation of Medical VLMs and Tool-Augmented Agents

    cs.CV 2026-05 accept novelty 8.0

    DeepTumorVQA is a new stage-wise 3D CT VQA benchmark showing that quantitative measurement is the main failure point for current medical VLMs and that tool augmentation substantially improves later reasoning stages.

  2. MedOpenClaw and MedFlowBench: Auditing Medical Agents in Full-Study Workflows

    cs.CV 2026-03 conditional novelty 8.0

    MedFlowBench evaluates VLM agents on full radiology and pathology studies by requiring both task answers and verifiable evidence like key slices and regions of interest, revealing that answer-only scores overestimate ...

  3. CXR-ContraBench: Benchmarking Negated-Option Attraction in Medical VLMs

    cs.CV 2026-05 conditional novelty 7.0

    Medical VLMs frequently select negated options that contradict visible chest X-ray findings, achieving only ~30% accuracy on direct presence probes, but a post-hoc consistency verifier raises accuracy above 95%.

  4. CheXthought: A global multimodal dataset of clinical chain-of-thought reasoning and visual attention for chest X-ray interpretation

    cs.CV 2026-04 unverdicted novelty 7.0

    CheXthought supplies large-scale expert chain-of-thought reasoning and synchronized visual attention data for chest X-rays to train more accurate and interpretable clinical vision-language models.

  5. X-PCR: A Benchmark for Cross-modality Progressive Clinical Reasoning in Ophthalmic Diagnosis

    cs.CV 2026-04 unverdicted novelty 7.0

    X-PCR is a new benchmark of 26,415 images and 177,868 expert VQA pairs that evaluates MLLMs on six-stage progressive reasoning and cross-modality integration in ophthalmology.

  6. Verification Mirage: Mapping the Reliability Boundary of Self-Verification in Medical VQA

    cs.CV 2026-05 unverdicted novelty 6.0

    Self-verification in medical VQA creates a verification mirage where verifiers exhibit high error and agreement bias on wrong answers, with reliability strongly conditioned on task type.

  7. RadThinking: A Dataset for Longitudinal Clinical Reasoning in Radiology

    cs.CV 2026-05 unverdicted novelty 6.0

    RadThinking releases a large longitudinal CT VQA dataset stratified into foundation perception questions, single-rule reasoning questions, and compositional multi-step chains grounded in clinical reporting standards f...

  8. MedVIGIL: Evaluating Trustworthy Medical VLMs Under Broken Visual Evidence

    cs.CV 2026-05 unverdicted novelty 6.0

    MedVIGIL introduces a clinician-supervised benchmark showing medical VLMs frequently give fluent answers on broken visual evidence, with top models 14 points below human radiologists on the composite score.

  9. Replacing Parameters with Preferences: Federated Alignment of Heterogeneous Vision-Language Models

    cs.AI 2026-05 unverdicted novelty 6.0

    MoR lets clients train local reward models on private preferences and uses a learned Mixture-of-Rewards with GRPO on the server to align a shared base VLM without exchanging parameters, architectures, or raw data.

  10. MedSynapse-V: Bridging Visual Perception and Clinical Intuition via Latent Memory Evolution

    cs.CV 2026-04 unverdicted novelty 6.0

    MedSynapse-V evolves latent diagnostic memories via meta queries, causal counterfactual refinement with RL, and dual-branch memory transition to outperform prior medical VLM methods in diagnostic accuracy.

  11. Dual Causal Inference: Integrating Backdoor Adjustment and Instrumental Variable Learning for Medical VQA

    cs.CV 2026-04 unverdicted novelty 6.0

    DCI unifies backdoor adjustment and instrumental variable learning in MedVQA to extract deconfounded representations, yielding better out-of-distribution performance on SLAKE, VQA-RAD and similar benchmarks.

  12. MedRCube: A Multidimensional Framework for Fine-Grained and In-Depth Evaluation of MLLMs in Medical Imaging

    cs.CL 2026-04 unverdicted novelty 6.0

    MedRCube is a new fine-grained evaluation framework that benchmarks 33 MLLMs on medical imaging, ranks Lingshu-32B highest, and finds a significant positive link between shortcut behaviors and diagnostic performance.

  13. Enhancing Fine-Grained Spatial Grounding in 3D CT Report Generation via Discriminative Guidance

    cs.CV 2026-04 unverdicted novelty 6.0

    DCP-PD improves macro F1 scores on CT report generation benchmarks and introduces a hierarchical location-aware evaluation protocol that reveals ongoing challenges in pathology spatial grounding.

  14. Improving Medical VQA through Trajectory-Aware Process Supervision

    cs.LG 2026-04 conditional novelty 6.0

    A trajectory-aware process reward using DTW on sentence embeddings, combined with exact-match in GRPO after SFT, raises mean medical VQA accuracy from 0.598 to 0.689 across six benchmarks.

  15. Better Eyes, Better Thoughts: Why Vision Chain-of-Thought Fails in Medicine

    cs.CV 2026-03 conditional novelty 6.0

    Chain-of-thought underperforms direct answering in medical VQA due to a perception bottleneck, but ROI cues and textual grounding interventions can improve results and reverse the gap.

  16. LiteMedCoT-VL: Parameter-Efficient Adaptation for Medical Visual Question Answering

    cs.CV 2026-05 unverdicted novelty 5.0

    LiteMedCoT-VL distills chain-of-thought from a 235B model to 2B VLMs via LoRA, reaching 64.9% accuracy on PMC-VQA and beating a 4B zero-shot baseline by 11 points.

  17. Learning from Medical Entity Trees: An Entity-Centric Medical Data Engineering Framework for MLLMs

    cs.CL 2026-04 unverdicted novelty 5.0

    A Medical Entity Tree organizes medical knowledge to engineer higher-quality training data that boosts general MLLMs on medical benchmarks.

  18. MedLVR: Latent Visual Reasoning for Reliable Medical Visual Question Answering

    cs.CV 2026-04 unverdicted novelty 5.0

    MedLVR interleaves latent visual reasoning segments in autoregressive decoding and uses two-stage training to raise average medical VQA accuracy from 48.3% to 53.4% over a Qwen2.5-VL-7B backbone on OmniMedVQA and five...

  19. MedXIAOHE: A Comprehensive Recipe for Building Medical MLLMs

    cs.CL 2026-02 unverdicted novelty 4.0

    MedXIAOHE is a medical MLLM that claims state-of-the-art benchmark performance through specialized pretraining to cover long-tail diseases and RL-based reasoning training.

Reference graph

Works this paper leans on

64 extracted references · 64 canonical work pages · cited by 19 Pith papers · 9 internal anchors

  1. [1]

    Flamingo: a visual language model for few-shot learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022

  2. [2]

    Flamingo: a visual language model for few-shot learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andy Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski, R...

  3. [3]

    The medical segmentation decathlon

    Michela Antonelli, Annika Reinke, Spyridon Bakas, Keyvan Farahani, Annette Kopp-Schneider, Bennett A Landman, Geert Litjens, Bjoern Menze, Olaf Ronneberger, Ronald M Summers, et al. The medical segmentation decathlon. Nature Communications, 13(1):4128, 2022

  4. [4]

    Openflamingo, 2023

    Anas Awadalla, Irena Gao, Joshua Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Jenia Jitsev, et al. Openflamingo, 2023

  5. [5]

    Artificial intelligence in healthcare: transforming the practice of medicine

    Junaid Bajwa, Usman Munir, Aditya Nori, and Bryan Williams. Artificial intelligence in healthcare: transforming the practice of medicine. Future healthcare journal, 8(2):e188–e194, 2021

  6. [6]

    Vqa-med: Overview of the medical visual question answering task at imageclef 2019

    Asma Ben Abacha, Sadid A Hasan, Vivek V Datla, Dina Demner-Fushman, and Henning Müller. Vqa-med: Overview of the medical visual question answering task at imageclef 2019. In Proceedings of CLEF (Conference and Labs of the Evaluation Forum) 2019 Working Notes. 9-12 September 2019, 2019

  7. [7]

    Overview of the vqa-med task at imageclef 2021: Visual question answering and generation in the medical domain

    Asma Ben Abacha, Mourad Sarrouti, Dina Demner-Fushman, Sadid A Hasan, and Henning Müller. Overview of the vqa-med task at imageclef 2021: Visual question answering and generation in the medical domain. In Proceedings of the CLEF 2021 Conference and Labs of the Evaluation Forum-working notes. 21-24 September 2021, 2021

  8. [8]

    Medpix™ receives patent, 2006

    Md BETHESDA. Medpix™ receives patent, 2006

  9. [9]

    Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts

    Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3558–3568, 2021

  10. [10]

    Chatffa: Interactive visual question answering on fundus fluorescein angiography image using chatgpt

    Xiaolan Chen, Pusheng Xu, Yao Li, Weiyi Zhang, Fan Song, Ying-Feng Zheng, Danli Shi, and Mingguang He. Chatffa: Interactive visual question answering on fundus fluorescein angiography image using chatgpt. Available at SSRN 4578568

  11. [11]

    Multi-modal masked autoencoders for medical vision-and-language pre-training

    Zhihong Chen, Yuhao Du, Jinpeng Hu, Yang Liu, Guanbin Li, Xiang Wan, and Tsung-Hui Chang. Multi-modal masked autoencoders for medical vision-and-language pre-training. In Medical Image Computing and Computer Assisted Intervention, pages 679–689. Springer, 2022

  12. [12]

    Chexagent: Towards a foundation model for chest x-ray interpretation

    Zhihong Chen, Maya Varma, Jean-Benoit Delbrouck, Magdalini Paschali, Louis Blankemeier, Dave Van Veen, Jeya Maria Jose Valanarasu, Alaa Youssef, Joseph Paul Cohen, Eduardo Pontes Reis, et al. Chexagent: Towards a foundation model for chest x-ray interpretation. arXiv preprint arXiv:2401.12208, 2024

  13. [13]

    Dwt-cv: Dense weight transfer-based cross validation strategy for model selection in biomedical data analysis

    Jianhong Cheng, Hulin Kuang, Qichang Zhao, Yahui Wang, Lei Xu, Jin Liu, and Jianxin Wang. Dwt-cv: Dense weight transfer-based cross validation strategy for model selection in biomedical data analysis. Future Generation Computer Systems, 135:20–29, 2022

  14. [14]

    Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality

    Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023

  15. [15]

    The future landscape of large language models in medicine

    Jan Clusmann, Fiona R Kolbinger, Hannah Sophie Muti, Zunamys I Carrero, Jan-Niklas Eckardt, Narmin Ghaffari Laleh, Chiara Maria Lavinia Löffler, Sophie-Caroline Schwarzkopf, Michaela Unger, Gregory P Veldhuizen, et al. The future landscape of large language models in medicine. Communications medicine, 3(1):141, 2023

  16. [16]

    Survey of multimodal medical question answering

    Hilmi Demirhan and Wlodek Zadrozny. Survey of multimodal medical question answering. BioMedInformatics, 4(1):50–74, 2023

  17. [17]

    Optimal gradient checkpoint search for arbitrary computation graphs

    Jianwei Feng and Dong Huang. Optimal gradient checkpoint search for arbitrary computation graphs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11433–11442, 2021

  18. [18]

    The Pile: An 800GB Dataset of Diverse Text for Language Modeling

    Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. The pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020

  19. [19]

    A self-adaptive discriminative autoencoder for medical applications

    Xiaolong Ge, Yanpeng Qu, Changjing Shang, Longzhi Yang, and Qiang Shen. A self-adaptive discriminative autoencoder for medical applications. IEEE Transactions on Circuits and Systems for Video Technology, 32(12):8875–8886, 2022

  20. [20]

    Domain-specific language model pretraining for biomedical natural language processing

    Yu Gu, Robert Tinn, Hao Cheng, Michael Lucas, Naoto Usuyama, Xiaodong Liu, Tristan Naumann, Jianfeng Gao, and Hoifung Poon. Domain-specific language model pretraining for biomedical natural language processing. ACM Transactions on Computing for Healthcare (HEALTH), 3(1):1–23, 2021

  21. [21]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016

  22. [22]

    Towards visual question answering on pathology images

    Xuehai He, Yichen Zhang, Luntian Mou, Eric Xing, and Pengtao Xie. Towards visual question answering on pathology images. pages 708–718, 2020

  23. [23]

    Omnimedvqa: A new large-scale comprehensive evaluation benchmark for medical lvlm

    Yutao Hu, Tianbin Li, Quanfeng Lu, Wenqi Shao, Junjun He, Yu Qiao, and Ping Luo. Omnimedvqa: A new large-scale comprehensive evaluation benchmark for medical lvlm. arXiv preprint arXiv:2402.09181, 2024

  24. [24]

    What disease does this patient have? a large-scale open domain question answering dataset from medical exams

    Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. What disease does this patient have? a large-scale open domain question answering dataset from medical exams. Applied Sciences, 11(14):6421, 2021

  25. [25]

    Peir digital library: Online resources and authoring system

    Kristopher N Jones, Dwain E Woode, Kristina Panizzi, and Peter G Anderson. Peir digital library: Online resources and authoring system. In Proceedings of the AMIA Symposium, page 1075. American Medical Informatics Association, 2001

  26. [26]

    Chaos challenge-combined (ct-mr) healthy abdominal organ segmentation

    A Emre Kavur, N Sinem Gezer, Mustafa Barış, Sinem Aslan, Pierre-Henri Conze, Vladimir Groza, Duc Duy Pham, Soumick Chatterjee, Philipp Ernst, Savaş Özkan, et al. Chaos challenge-combined (ct-mr) healthy abdominal organ segmentation. Medical Image Analysis, 69:101950, 2021

  27. [27]

    Performance of chatgpt on usmle: Potential for ai-assisted medical education using large language models

    Tiffany H Kung, Morgan Cheatham, Arielle Medenilla, Czarina Sillos, Lorie De Leon, Camille Elepaño, Maria Madriaga, Rimel Aggabao, Giezel Diaz-Candido, James Maningo, et al. Performance of chatgpt on usmle: Potential for ai-assisted medical education using large language models. PLoS digital health, 2(2):e0000198, 2023

  28. [28]

    A dataset of clinically generated visual questions and answers about radiology images

    Jason J Lau, Soumya Gayen, Asma Ben Abacha, and Dina Demner-Fushman. A dataset of clinically generated visual questions and answers about radiology images. Scientific data, 5(1):1–10, 2018

  29. [29]

    Llava-med: Training a large language-and-vision assistant for biomedicine in one day

    Chunyuan Li, Cliff Wong, Sheng Zhang, Naoto Usuyama, Haotian Liu, Jianwei Yang, Tristan Naumann, Hoifung Poon, and Jianfeng Gao. Llava-med: Training a large language-and-vision assistant for biomedicine in one day. Advances in Neural Information Processing Systems, 36, 2024

  30. [30]

    BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023

  31. [31]

    Silkie: Preference distillation for large visual language models

    Lei Li, Zhihui Xie, Mukai Li, Shunian Chen, Peiyi Wang, Liang Chen, Yazheng Yang, Benyou Wang, and Lingpeng Kong. Silkie: Preference distillation for large visual language models. arXiv preprint arXiv:2312.10665, 2023

  32. [32] Weixiong Lin, Ziheng Zhao, Xiaoman Zhang, Chaoyi Wu, Ya Zhang, Yanfeng Wang, and Weidi Xie. PMC-CLIP: Contrastive language-image pre-training using biomedical documents. 2023.

  33. [33] Zhihong Lin, Donghao Zhang, Qingyi Tao, Danli Shi, Gholamreza Haffari, Qi Wu, Mingguang He, and Zongyuan Ge. Medical visual question answering: A survey. arXiv preprint arXiv:2111.10056, 2022.

  34. [34] Bo Liu, Li-Ming Zhan, and Xiao-Ming Wu. Contrastive pre-training and representation distillation for medical visual question answering based on radiology images. In Medical Image Computing and Computer Assisted Intervention, pages 210–220. Springer, 2021.

  35. [35] Bo Liu, Li-Ming Zhan, Li Xu, Lin Ma, Yan Yang, and Xiao-Ming Wu. SLAKE: A semantically-labeled knowledge-enhanced dataset for medical visual question answering. In 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI), pages 1650–1654. IEEE, 2021.

  36. [36] Junling Liu, Ziming Wang, Qichen Ye, Dading Chong, Peilin Zhou, and Yining Hua. Qilin-Med-VL: Towards Chinese large vision-language model for general healthcare. arXiv preprint arXiv:2310.17956, 2023.

  37. [37] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.

  38. [38] Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. MathVista: Evaluating mathematical reasoning of foundation models in visual contexts. arXiv preprint arXiv:2310.02255, 2023.

  39. [39] Michael Moor, Qian Huang, Shirley Wu, Michihiro Yasunaga, Yash Dalmia, Jure Leskovec, Cyril Zakka, Eduardo Pontes Reis, and Pranav Rajpurkar. Med-Flamingo: A multimodal medical few-shot learner. In Machine Learning for Health (ML4H), pages 353–367. PMLR, 2023.

  40. [40] Binh D Nguyen, Thanh-Toan Do, Binh X Nguyen, Tuong Do, Erman Tjiputra, and Quang D Tran. Overcoming data limitation in medical visual question answering. In Medical Image Computing and Computer Assisted Intervention, pages 522–530. Springer, 2019.

  41. [41] Aaron Nicolson, Jason Dowling, and Bevan Koopman. A concise model for medical image captioning. In CLEF (Working Notes), pages 1611–1619, 2023.

  42. [42] Harsha Nori, Nicholas King, Scott Mayer McKinney, Dean Carignan, and Eric Horvitz. Capabilities of GPT-4 on medical challenge problems. arXiv preprint arXiv:2303.13375, 2023.

  43. [43] OpenAI. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.

  44. [44] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, 2002.

  45. [45] Jiwoo Park, Kangrok Oh, Kyunghwa Han, and Young Han Lee. Patient-centered radiology reports with generative artificial intelligence: Adding value to radiology reporting. Scientific Reports, 14(1):13218, 2024.

  46. [46] Obioma Pelka, Sven Koitka, Johannes Rückert, Felix Nensa, and Christoph M Friedrich. Radiology Objects in COntext (ROCO): A multimodal image dataset. In MICCAI Workshop on Large-scale Annotation of Biomedical Data and Expert Label Synthesis (LABELS) 2018, pages 180–189. Springer, 2018.

  47. [47] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.

  48. [48] Richard J Roberts. PubMed Central: The GenBank of the published literature. Proceedings of the National Academy of Sciences, 98:381–382, 2001.

  49. [49] Conrad W Safranek, Anne Elizabeth Sidamon-Eristoff, Aidan Gilson, and David Chartash. The role of large language models in medical education: Applications and implications, 2023.

  50. [50] Mehmet Saygin Seyfioglu, Wisdom O Ikezogwo, Fatemeh Ghezloo, Ranjay Krishna, and Linda Shapiro. Quilt-LLaVA: Visual instruction tuning by extracting localized narratives from open-source histopathology videos. arXiv preprint arXiv:2312.04746, 2023.

  51. [51] Karan Singhal, Shekoofeh Azizi, Tao Tu, S Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, et al. Large language models encode clinical knowledge. arXiv preprint arXiv:2212.13138, 2022.

  52. [52] Sanjay Subramanian et al. MedICaT: A dataset of medical images, captions, and textual references. In Findings of EMNLP, 2020.

  53. [53] Arun James Thirunavukarasu, Darren Shu Jeng Ting, Kabilan Elangovan, Laura Gutierrez, Ting Fang Tan, and Daniel Shu Wei Ting. Large language models in medicine. Nature Medicine, 29(8):1930–1940, 2023.

  54. [54] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.

  55. [55] Sheng Wang, Zihao Zhao, Xi Ouyang, Qian Wang, and Dinggang Shen. ChatCAD: Interactive computer-aided diagnosis on medical image using large language models. arXiv preprint arXiv:2302.07257, 2023.

  56. [56] Xiaosong Wang, Yifan Peng, Le Lu, Zhiyong Lu, Mohammadhadi Bagheri, and Ronald M Summers. ChestX-ray8: Hospital-scale chest X-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2097–2106, 2017.

  57. [57] Chaoyi Wu, Xiaoman Zhang, Ya Zhang, Yanfeng Wang, and Weidi Xie. PMC-LLaMA: Towards building open-source language models for medicine. arXiv preprint arXiv:2304.14454, 2023.

  58. [58] Chaoyi Wu, Xiaoman Zhang, Ya Zhang, Yanfeng Wang, and Weidi Xie. Towards generalist foundation model for radiology. arXiv preprint arXiv:2308.02463, 2023.

  59. [59] Jinge Wu, Yunsoo Kim, and Honghan Wu. Hallucination benchmark in medical visual question answering. arXiv preprint arXiv:2401.05827, 2024.

  60. [60] Jiancheng Yang, Hongwei Bran Li, and Donglai Wei. The impact of ChatGPT and LLMs on medical imaging stakeholders: Perspectives and use cases. Meta-Radiology, page 100007, 2023.

  61. [61] Jiancheng Yang, Rui Shi, and Bingbing Ni. MedMNIST classification decathlon: A lightweight AutoML benchmark for medical image analysis. In 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI), pages 191–195. IEEE, 2021.

  62. [62] Chenlu Zhan, Yufei Zhang, Yu Lin, Gaoang Wang, and Hongwei Wang. UniDCP: Unifying multiple medical vision-language tasks via dynamic cross-modal learnable prompts. arXiv preprint arXiv:2312.11171, 2023.

  63. [63] Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer. OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022.

  64. [64] Sheng Zhang, Yanbo Xu, Naoto Usuyama, Hanwen Xu, Jaspreet Bagga, Robert Tinn, Sam Preston, Rajesh Rao, Mu Wei, Naveen Valluri, et al. BiomedCLIP: A multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs. arXiv preprint arXiv:2303.00915, 2023.

A Supplemental Materials

A.1 Data Analysis

Fig. 6 shows the percent...