pith. machine review for the scientific record.

arxiv: 2305.10415 · v6 · submitted 2023-05-17 · 💻 cs.CV

Recognition: 2 Lean theorem links

PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 23:04 UTC · model grok-4.3

classification 💻 cs.CV
keywords medical visual question answering · visual instruction tuning · generative model · large language model · medical imaging · dataset construction · fine-tuning

The pith

A generative model trained on a 227k-pair medical VQA dataset from literature outperforms prior systems on clinical benchmarks after fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds a dataset of 227,000 question-answer pairs drawn from nearly 150,000 medical images across different modalities and diseases. It then trains a model that feeds visual features from a pre-trained encoder into a large language model so the system can generate natural-language answers to questions about the images. After initial training on the new collection, the model is fine-tuned on existing public benchmarks and produces more accurate free-form responses than earlier MedVQA approaches. The authors also release a manually checked, harder test set to provide a stricter measure of progress in this generative setting.

Core claim

By constructing the PMC-VQA dataset containing 227k VQA pairs from 149k images and training a model that aligns a pre-trained vision encoder with a large language model, the approach achieves significantly better performance than prior MedVQA models in generating relevant and accurate free-form answers on benchmarks such as VQA-RAD, SLAKE, and Image-Clef-2019 after fine-tuning.

What carries the argument

Alignment between outputs of a pre-trained vision encoder and a large language model to support generation of free-form answers to medical visual questions.
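As a rough illustration, the alignment module can be thought of as a learned projection that maps vision-encoder features into the language model's embedding space, so that image patches become extra soft tokens in the LLM's input sequence. The sketch below is illustrative only: the dimensions and the single linear projection are assumptions for this page, not MedVInT's actual architecture.

```python
import numpy as np

# Hypothetical dimensions chosen for illustration; the paper's real
# encoder/LLM sizes and alignment module may differ.
VISION_DIM, LLM_DIM, N_PATCHES, N_TEXT = 1024, 4096, 4, 6

rng = np.random.default_rng(0)
# The "alignment parameters": a learned linear map from vision-feature
# space into the LLM's token-embedding space.
W_proj = rng.standard_normal((VISION_DIM, LLM_DIM)) * 0.01

def align_and_prepend(patch_features, text_embeddings):
    """Project vision-encoder patch features into the LLM embedding space
    and prepend them to the text token embeddings as a soft visual prefix."""
    visual_tokens = patch_features @ W_proj          # (N_PATCHES, LLM_DIM)
    return np.concatenate([visual_tokens, text_embeddings], axis=0)

patch_features = rng.standard_normal((N_PATCHES, VISION_DIM))
text_embeddings = rng.standard_normal((N_TEXT, LLM_DIM))
sequence = align_and_prepend(patch_features, text_embeddings)
print(sequence.shape)  # (10, 4096): visual prefix followed by text tokens
```

During instruction tuning, only a module like `W_proj` (plus whatever is unfrozen in the two backbones) needs to learn; the LLM then generates the free-form answer conditioned on the combined sequence.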

If this is right

  • Models trained this way generate more relevant and accurate free-form answers on public MedVQA benchmarks.
  • The new dataset supports training across a wide range of medical image modalities and diseases.
  • A manually verified test set offers a stricter benchmark for evaluating generative MedVQA methods.
  • Centralized leaderboards help track improvements in the field.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Scaling medical visual instruction tuning could reduce reliance on task-specific annotated data for new clinical applications.
  • Generative MedVQA systems may integrate more smoothly into doctor-AI conversations than classification-based ones.
  • Literature-derived datasets might capture rare conditions better than small curated clinical sets if publication bias is limited.
  • The fine-tuning gains suggest that the initial large-scale training builds medical visual representations that transfer to other tasks.

Load-bearing premise

Images and questions taken from published medical papers represent the variety and difficulty of questions that arise in actual clinical practice.

What would settle it

Running the trained model on the manually verified test set and finding that its answers are no more accurate or relevant than those of previous MedVQA models would indicate the dataset and training approach do not deliver the claimed improvement.

read the original abstract

Medical Visual Question Answering (MedVQA) presents a significant opportunity to enhance diagnostic accuracy and healthcare delivery by leveraging artificial intelligence to interpret and answer questions based on medical images. In this study, we reframe the problem of MedVQA as a generation task that naturally follows the human-machine interaction and propose a generative-based model for medical visual understanding by aligning visual information from a pre-trained vision encoder with a large language model. We establish a scalable pipeline to construct a large-scale medical visual question-answering dataset, named PMC-VQA, which contains 227k VQA pairs of 149k images that cover various modalities or diseases. We train the proposed model on PMC-VQA and then fine-tune it on multiple public benchmarks, e.g., VQA-RAD, SLAKE, and Image-Clef-2019, significantly outperforming existing MedVQA models in generating relevant, accurate free-form answers. In addition, we propose a test set that has undergone manual verification, which is significantly more challenging, serving to better monitor the development of generative MedVQA methods. To facilitate comprehensive evaluation and comparison, we have maintained a leaderboard at https://paperswithcode.com/paper/pmc-vqa-visual-instruction-tuning-for-medical, offering a centralized resource for tracking progress and benchmarking state-of-the-art approaches. The PMC-VQA dataset emerges as a vital resource for the field of research, and the MedVInT presents a significant breakthrough in the area of MedVQA.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces PMC-VQA, a large-scale dataset of 227k VQA pairs from 149k medical images extracted from PubMed Central literature across modalities and diseases. It proposes MedVInT, a generative model that aligns a pretrained vision encoder with an LLM for medical visual question answering. The model is pretrained on PMC-VQA then fine-tuned on public benchmarks (VQA-RAD, SLAKE, Image-Clef-2019), with claims of significant outperformance over prior MedVQA methods in free-form answer generation. A manually verified, more challenging test set is introduced along with a public leaderboard.

Significance. If the empirical gains hold under detailed scrutiny, the work would provide a valuable scalable pretraining resource for MedVQA and demonstrate the utility of generative alignment pipelines. The manual verification step and maintained leaderboard are constructive contributions that could help standardize evaluation. However, the central transfer claims depend on the unproven assumption that literature-derived pairs generalize to clinical distributions.

major comments (3)
  1. [§3] PMC-VQA construction: The dataset is built from PMC articles, which systematically favor clear, publishable findings; the manuscript provides no quantitative analysis of question ambiguity, diversity metrics, or comparison against real clinical query distributions. This directly affects the claim that fine-tuning gains on VQA-RAD/SLAKE/Image-Clef-2019 reflect genuine medical understanding rather than source artifacts.
  2. [§5] Experiments: The abstract and main claims assert significant outperformance, yet the manuscript supplies no ablation tables isolating the contribution of PMC-VQA pretraining versus standard fine-tuning, no error analysis on failure modes, and no statistical significance tests on the reported gains. These omissions make the central empirical result difficult to evaluate.
  3. [§4.2] Hard test set: The manually verified test set is described as significantly more challenging, but the paper does not specify the verification protocol, inter-annotator agreement, or how its distribution differs from the training split in terms of image quality, question complexity, or answer length.
minor comments (2)
  1. [Abstract] The abstract states quantitative superiority without any numbers; move at least the key accuracy or BLEU scores into the abstract for immediate readability.
  2. [§4] Notation for the vision-language alignment loss is introduced without an explicit equation number; add Eq. (X) and reference it consistently in §4.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will incorporate revisions to strengthen the empirical support and transparency of the manuscript.

read point-by-point responses
  1. Referee: [§3] PMC-VQA construction: The dataset is built from PMC articles, which systematically favor clear, publishable findings; the manuscript provides no quantitative analysis of question ambiguity, diversity metrics, or comparison against real clinical query distributions. This directly affects the claim that fine-tuning gains on VQA-RAD/SLAKE/Image-Clef-2019 reflect genuine medical understanding rather than source artifacts.

    Authors: We agree that further characterization of PMC-VQA is warranted. In the revision we will add quantitative analyses of question diversity (modality/disease/question-type distributions), lexical and semantic ambiguity indicators, and direct statistical comparisons of question/answer distributions against the target clinical benchmarks (VQA-RAD, SLAKE, Image-Clef-2019). We will also explicitly discuss the literature-to-clinical domain gap as a limitation and future-work item. revision: yes

  2. Referee: [§5] Experiments: The abstract and main claims assert significant outperformance, yet the manuscript supplies no ablation tables isolating the contribution of PMC-VQA pretraining versus standard fine-tuning, no error analysis on failure modes, and no statistical significance tests on the reported gains. These omissions make the central empirical result difficult to evaluate.

    Authors: We accept this criticism. The revised manuscript will include (i) ablation tables that isolate the effect of PMC-VQA pretraining, (ii) a dedicated error-analysis section with representative failure cases, and (iii) statistical significance testing (bootstrap resampling and McNemar tests) on all reported gains. These additions will be placed in §5 and the supplementary material. revision: yes

  3. Referee: [§4.2] Hard test set: The manually verified test set is described as significantly more challenging, but the paper does not specify the verification protocol, inter-annotator agreement, or how its distribution differs from the training split in terms of image quality, question complexity, or answer length.

    Authors: We will expand §4.2 with a full description of the verification protocol (annotator background, number of reviewers per sample, resolution procedure for disagreements), report inter-annotator agreement (Cohen’s κ), and provide comparative statistics (image-quality scores, question-length and complexity distributions, answer-length histograms) between the hard test set and the training split. revision: yes
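One way the distribution comparison promised in response 1 could be quantified is a divergence between question-type frequency distributions in the two corpora. The sketch below uses Jensen-Shannon divergence over category counts; the category names and counts are invented purely for illustration, not taken from the paper.

```python
import math
from collections import Counter

def js_divergence(p_counts, q_counts):
    """Jensen-Shannon divergence (in bits, so bounded in [0, 1]) between two
    category-count distributions, e.g. question-type frequencies in a
    literature-derived corpus vs. a clinical benchmark."""
    keys = set(p_counts) | set(q_counts)
    pn, qn = sum(p_counts.values()), sum(q_counts.values())
    # Mixture distribution M = (P + Q) / 2
    m = {k: 0.5 * (p_counts.get(k, 0) / pn + q_counts.get(k, 0) / qn)
         for k in keys}
    def kl(counts, total):
        # KL(P || M), skipping zero-probability categories
        return sum((counts.get(k, 0) / total)
                   * math.log2((counts.get(k, 0) / total) / m[k])
                   for k in keys if counts.get(k, 0) > 0)
    return 0.5 * kl(p_counts, pn) + 0.5 * kl(q_counts, qn)

# Hypothetical question-type counts for illustration only.
pmc    = Counter(modality=50, finding=30, anatomy=20)
clinic = Counter(modality=20, finding=50, anatomy=30)
d = js_divergence(pmc, clinic)   # 0 = identical distributions, 1 = disjoint
```

Reporting a number like `d` alongside per-category histograms would make the literature-to-clinical gap concrete rather than anecdotal.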
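The bootstrap resampling promised in response 2 can be sketched as a paired bootstrap over per-item correctness scores. The scores below are toy data, not the paper's results; 1 means the model answered that test item correctly.

```python
import random

def bootstrap_accuracy_gap(correct_a, correct_b, n_boot=2000, seed=0):
    """Paired bootstrap for the accuracy gap between two models scored on the
    same test items (1 = correct, 0 = wrong). Returns the observed gap and a
    95% percentile confidence interval; an interval excluding 0 suggests the
    gain is not resampling noise."""
    rng = random.Random(seed)
    n = len(correct_a)
    observed = sum(correct_a[i] - correct_b[i] for i in range(n)) / n
    gaps = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]   # resample items with replacement
        gaps.append(sum(correct_a[i] - correct_b[i] for i in idx) / n)
    gaps.sort()
    return observed, (gaps[int(0.025 * n_boot)], gaps[int(0.975 * n_boot)])

# Toy scores: model A correct on 70/100 items, model B on 55/100.
a = [1] * 70 + [0] * 30
b = [1] * 55 + [0] * 45
gap, (lo, hi) = bootstrap_accuracy_gap(a, b)
```

McNemar's test, the other method the authors name, would instead count the discordant pairs (A right / B wrong vs. A wrong / B right) from the same per-item scores.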
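Cohen's κ, the agreement statistic committed to in response 3, can be computed directly from two annotators' label sequences; the labels below are toy data for illustration.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement between two annotators, corrected
    for the agreement expected by chance from their marginal label rates."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    p_obs = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    ca, cb = Counter(labels_a), Counter(labels_b)
    p_exp = sum(ca[k] * cb[k] for k in set(ca) | set(cb)) / (n * n)
    return (p_obs - p_exp) / (1 - p_exp)

# Toy verification labels from two hypothetical annotators.
ann1 = ["yes", "yes", "no", "yes", "no", "no"]
ann2 = ["yes", "no", "no", "yes", "no", "yes"]
kappa = cohens_kappa(ann1, ann2)   # 1 = perfect agreement, 0 = chance level
```

Values above roughly 0.6 are conventionally read as substantial agreement; reporting κ per question type would show where the hard test set's verification is least reliable.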

Circularity Check

0 steps flagged

Standard empirical pipeline on external benchmarks; no load-bearing self-referential derivations or fitted predictions.

full rationale

The paper constructs the PMC-VQA dataset from PubMed Central literature, aligns a pre-trained vision encoder with an LLM to form MedVInT, trains on the 227k pairs, and fine-tunes on independent public benchmarks (VQA-RAD, SLAKE, Image-Clef-2019). Reported gains are measured on those external test sets plus a manually verified subset. No equations, uniqueness theorems, or ansatzes are invoked that reduce the performance claims to quantities defined by the authors' own fitted parameters or prior self-citations. Self-citations, if present, are not load-bearing for the central empirical result. This matches the default expectation of a non-circular ML paper whose claims rest on reproducible external evaluation.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central claim rests on the transferability of general vision-language alignment techniques to the medical domain and on the quality of literature-derived image-question pairs as training data.

free parameters (1)
  • Vision-language alignment parameters
    Learned parameters that connect the vision encoder output to the language model input during instruction tuning.
axioms (2)
  • domain assumption Pre-trained vision encoders extract features sufficient for medical image understanding when aligned with language models
    Invoked in the description of aligning a pre-trained vision encoder with an LLM.
  • domain assumption Literature-sourced image-question pairs form a representative training distribution for clinical MedVQA
    Implicit in the construction of PMC-VQA from PubMed Central articles.

pith-pipeline@v0.9.0 · 5586 in / 1449 out tokens · 35022 ms · 2026-05-15T23:04:29.314082+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Cost.FunctionalEquation washburn_uniqueness_aczel — unclear

    Relation between the paper passage and the cited Recognition theorem.

    We train the proposed model on PMC-VQA and then fine-tune it on multiple public benchmarks, e.g., VQA-RAD, SLAKE, and Image-Clef-2019, significantly outperforming existing MedVQA models in generating relevant, accurate free-form answers.

  • Foundation.LawOfExistence defect_zero_iff_one — unclear

    Relation between the paper passage and the cited Recognition theorem.

    We establish a scalable pipeline to construct a large-scale medical visual question-answering dataset, named PMC-VQA, which contains 227k VQA pairs of 149k images that cover various modalities or diseases.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. DeepTumorVQA: A Hierarchical 3D CT Benchmark for Stage-Wise Evaluation of Medical VLMs and Tool-Augmented Agents

    cs.CV 2026-05 accept novelty 8.0

    DeepTumorVQA is a new stage-wise 3D CT VQA benchmark showing that quantitative measurement is the main failure point for current medical VLMs and that tool augmentation substantially improves later reasoning stages.

  2. MedOpenClaw and MedFlowBench: Auditing Medical Agents in Full-Study Workflows

    cs.CV 2026-03 conditional novelty 8.0

    MedFlowBench evaluates VLM agents on full radiology and pathology studies by requiring both task answers and verifiable evidence like key slices and regions of interest, revealing that answer-only scores overestimate ...

  3. CXR-ContraBench: Benchmarking Negated-Option Attraction in Medical VLMs

    cs.CV 2026-05 conditional novelty 7.0

    Medical VLMs frequently select negated options that contradict visible chest X-ray findings, achieving only ~30% accuracy on direct presence probes, but a post-hoc consistency verifier raises accuracy above 95%.

  4. CheXthought: A global multimodal dataset of clinical chain-of-thought reasoning and visual attention for chest X-ray interpretation

    cs.CV 2026-04 unverdicted novelty 7.0

    CheXthought supplies large-scale expert chain-of-thought reasoning and synchronized visual attention data for chest X-rays to train more accurate and interpretable clinical vision-language models.

  5. X-PCR: A Benchmark for Cross-modality Progressive Clinical Reasoning in Ophthalmic Diagnosis

    cs.CV 2026-04 unverdicted novelty 7.0

    X-PCR is a new benchmark of 26,415 images and 177,868 expert VQA pairs that evaluates MLLMs on six-stage progressive reasoning and cross-modality integration in ophthalmology.

  6. Verification Mirage: Mapping the Reliability Boundary of Self-Verification in Medical VQA

    cs.CV 2026-05 unverdicted novelty 6.0

    Self-verification in medical VQA creates a verification mirage where verifiers exhibit high error and agreement bias on wrong answers, with reliability strongly conditioned on task type.

  7. RadThinking: A Dataset for Longitudinal Clinical Reasoning in Radiology

    cs.CV 2026-05 unverdicted novelty 6.0

    RadThinking releases a large longitudinal CT VQA dataset stratified into foundation perception questions, single-rule reasoning questions, and compositional multi-step chains grounded in clinical reporting standards f...

  8. MedVIGIL: Evaluating Trustworthy Medical VLMs Under Broken Visual Evidence

    cs.CV 2026-05 unverdicted novelty 6.0

    MedVIGIL introduces a clinician-supervised benchmark showing medical VLMs frequently give fluent answers on broken visual evidence, with top models 14 points below human radiologists on the composite score.

  9. Replacing Parameters with Preferences: Federated Alignment of Heterogeneous Vision-Language Models

    cs.AI 2026-05 unverdicted novelty 6.0

    MoR lets clients train local reward models on private preferences and uses a learned Mixture-of-Rewards with GRPO on the server to align a shared base VLM without exchanging parameters, architectures, or raw data.

  10. MedSynapse-V: Bridging Visual Perception and Clinical Intuition via Latent Memory Evolution

    cs.CV 2026-04 unverdicted novelty 6.0

    MedSynapse-V evolves latent diagnostic memories via meta queries, causal counterfactual refinement with RL, and dual-branch memory transition to outperform prior medical VLM methods in diagnostic accuracy.

  11. Dual Causal Inference: Integrating Backdoor Adjustment and Instrumental Variable Learning for Medical VQA

    cs.CV 2026-04 unverdicted novelty 6.0

    DCI unifies backdoor adjustment and instrumental variable learning in MedVQA to extract deconfounded representations, yielding better out-of-distribution performance on SLAKE, VQA-RAD and similar benchmarks.

  12. MedRCube: A Multidimensional Framework for Fine-Grained and In-Depth Evaluation of MLLMs in Medical Imaging

    cs.CL 2026-04 unverdicted novelty 6.0

    MedRCube is a new fine-grained evaluation framework that benchmarks 33 MLLMs on medical imaging, ranks Lingshu-32B highest, and finds a significant positive link between shortcut behaviors and diagnostic performance.

  13. Enhancing Fine-Grained Spatial Grounding in 3D CT Report Generation via Discriminative Guidance

    cs.CV 2026-04 unverdicted novelty 6.0

    DCP-PD improves macro F1 scores on CT report generation benchmarks and introduces a hierarchical location-aware evaluation protocol that reveals ongoing challenges in pathology spatial grounding.

  14. Improving Medical VQA through Trajectory-Aware Process Supervision

    cs.LG 2026-04 conditional novelty 6.0

    A trajectory-aware process reward using DTW on sentence embeddings, combined with exact-match in GRPO after SFT, raises mean medical VQA accuracy from 0.598 to 0.689 across six benchmarks.

  15. Better Eyes, Better Thoughts: Why Vision Chain-of-Thought Fails in Medicine

    cs.CV 2026-03 conditional novelty 6.0

    Chain-of-thought underperforms direct answering in medical VQA due to a perception bottleneck, but ROI cues and textual grounding interventions can improve results and reverse the gap.

  16. LiteMedCoT-VL: Parameter-Efficient Adaptation for Medical Visual Question Answering

    cs.CV 2026-05 unverdicted novelty 5.0

    LiteMedCoT-VL distills chain-of-thought from a 235B model to 2B VLMs via LoRA, reaching 64.9% accuracy on PMC-VQA and beating a 4B zero-shot baseline by 11 points.

  17. Learning from Medical Entity Trees: An Entity-Centric Medical Data Engineering Framework for MLLMs

    cs.CL 2026-04 unverdicted novelty 5.0

    A Medical Entity Tree organizes medical knowledge to engineer higher-quality training data that boosts general MLLMs on medical benchmarks.

  18. MedLVR: Latent Visual Reasoning for Reliable Medical Visual Question Answering

    cs.CV 2026-04 unverdicted novelty 5.0

    MedLVR interleaves latent visual reasoning segments in autoregressive decoding and uses two-stage training to raise average medical VQA accuracy from 48.3% to 53.4% over a Qwen2.5-VL-7B backbone on OmniMedVQA and five...

  19. MedXIAOHE: A Comprehensive Recipe for Building Medical MLLMs

    cs.CL 2026-02 unverdicted novelty 4.0

    MedXIAOHE is a medical MLLM that claims state-of-the-art benchmark performance through specialized pretraining to cover long-tail diseases and RL-based reasoning training.

Reference graph

Works this paper leans on

64 extracted references · 64 canonical work pages · cited by 19 Pith papers · 9 internal anchors

  1. [1]

    Flamingo: a visual language model for few-shot learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022

  2. [2]

    Flamingo: a visual language model for few-shot learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andy Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski, R...

  3. [3]

    The medical segmentation decathlon

    Michela Antonelli, Annika Reinke, Spyridon Bakas, Keyvan Farahani, Annette Kopp-Schneider, Bennett A Landman, Geert Litjens, Bjoern Menze, Olaf Ronneberger, Ronald M Summers, et al. The medical segmentation decathlon. Nature Communications, 13(1):4128, 2022

  4. [4]

    Openflamingo, 2023

    Anas Awadalla, Irena Gao, Joshua Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Jenia Jitsev, et al. Openflamingo, 2023

  5. [5]

    Artificial intelligence in healthcare: transforming the practice of medicine

    Junaid Bajwa, Usman Munir, Aditya Nori, and Bryan Williams. Artificial intelligence in healthcare: transforming the practice of medicine. Future healthcare journal, 8(2):e188–e194, 2021

  6. [6]

    Vqa-med: Overview of the medical visual question answering task at imageclef 2019

    Asma Ben Abacha, Sadid A Hasan, Vivek V Datla, Dina Demner-Fushman, and Henning Müller. Vqa-med: Overview of the medical visual question answering task at imageclef 2019. In Proceedings of CLEF (Conference and Labs of the Evaluation Forum) 2019 Working Notes. 9-12 September 2019, 2019

  7. [7]

    Overview of the vqa-med task at imageclef 2021: Visual question answering and generation in the medical domain

    Asma Ben Abacha, Mourad Sarrouti, Dina Demner-Fushman, Sadid A Hasan, and Henning Müller. Overview of the vqa-med task at imageclef 2021: Visual question answering and generation in the medical domain. In Proceedings of the CLEF 2021 Conference and Labs of the Evaluation Forum-working notes. 21-24 September 2021, 2021

  8. [8]

    Medpix™ receives patent, 2006

    Md BETHESDA. Medpix™ receives patent, 2006

  9. [9]

    Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts

    Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3558–3568, 2021

  10. [10]

    Chatffa: Interactive visual question answering on fundus fluorescein angiography image using chatgpt

    Xiaolan Chen, Pusheng Xu, Yao Li, Weiyi Zhang, Fan Song, Ying-Feng Zheng, Danli Shi, and Mingguang He. Chatffa: Interactive visual question answering on fundus fluorescein angiography image using chatgpt. Available at SSRN 4578568

  11. [11]

    Multi-modal masked autoencoders for medical vision-and-language pre-training

    Zhihong Chen, Yuhao Du, Jinpeng Hu, Yang Liu, Guanbin Li, Xiang Wan, and Tsung-Hui Chang. Multi-modal masked autoencoders for medical vision-and-language pre-training. In Medical Image Computing and Computer Assisted Intervention, pages 679–689. Springer, 2022

  12. [12]

    Chexagent: Towards a foundation model for chest x-ray interpretation

    Zhihong Chen, Maya Varma, Jean-Benoit Delbrouck, Magdalini Paschali, Louis Blankemeier, Dave Van Veen, Jeya Maria Jose Valanarasu, Alaa Youssef, Joseph Paul Cohen, Eduardo Pontes Reis, et al. Chexagent: Towards a foundation model for chest x-ray interpretation. arXiv preprint arXiv:2401.12208, 2024

  13. [13]

    Dwt-cv: Dense weight transfer-based cross validation strategy for model selection in biomedical data analysis

    Jianhong Cheng, Hulin Kuang, Qichang Zhao, Yahui Wang, Lei Xu, Jin Liu, and Jianxin Wang. Dwt-cv: Dense weight transfer-based cross validation strategy for model selection in biomedical data analysis. Future Generation Computer Systems, 135:20–29, 2022

  14. [14]

    Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality

    Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023

  15. [15]

    The future landscape of large language models in medicine

    Jan Clusmann, Fiona R Kolbinger, Hannah Sophie Muti, Zunamys I Carrero, Jan-Niklas Eckardt, Narmin Ghaffari Laleh, Chiara Maria Lavinia Löffler, Sophie-Caroline Schwarzkopf, Michaela Unger, Gregory P Veldhuizen, et al. The future landscape of large language models in medicine. Communications medicine, 3(1):141, 2023

  16. [16]

    Survey of multimodal medical question answering

    Hilmi Demirhan and Wlodek Zadrozny. Survey of multimodal medical question answering. BioMedInformatics, 4(1):50–74, 2023

  17. [17]

    Optimal gradient checkpoint search for arbitrary computation graphs

    Jianwei Feng and Dong Huang. Optimal gradient checkpoint search for arbitrary computation graphs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11433–11442, 2021

  18. [18]

    The Pile: An 800GB Dataset of Diverse Text for Language Modeling

    Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. The pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020

  19. [19]

    A self-adaptive discriminative autoencoder for medical applications

    Xiaolong Ge, Yanpeng Qu, Changjing Shang, Longzhi Yang, and Qiang Shen. A self-adaptive discriminative autoencoder for medical applications. IEEE Transactions on Circuits and Systems for Video Technology, 32(12):8875–8886, 2022

  20. [20]

    Domain-specific language model pretraining for biomedical natural language processing

    Yu Gu, Robert Tinn, Hao Cheng, Michael Lucas, Naoto Usuyama, Xiaodong Liu, Tristan Naumann, Jianfeng Gao, and Hoifung Poon. Domain-specific language model pretraining for biomedical natural language processing. ACM Transactions on Computing for Healthcare (HEALTH), 3(1):1–23, 2021

  21. [21]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016

  22. [22]

    Towards visual question answering on pathology images

    Xuehai He, Yichen Zhang, Luntian Mou, Eric Xing, and Pengtao Xie. Towards visual question answering on pathology images. pages 708–718, 2020

  23. [23]

    Omnimedvqa: A new large-scale comprehensive evaluation benchmark for medical lvlm

    Yutao Hu, Tianbin Li, Quanfeng Lu, Wenqi Shao, Junjun He, Yu Qiao, and Ping Luo. Omnimedvqa: A new large-scale comprehensive evaluation benchmark for medical lvlm. arXiv preprint arXiv:2402.09181, 2024

  24. [24]

    What disease does this patient have? a large-scale open domain question answering dataset from medical exams

    Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. What disease does this patient have? a large-scale open domain question answering dataset from medical exams. Applied Sciences, 11(14):6421, 2021

  25. [25]

    Peir digital library: Online resources and authoring system

    Kristopher N Jones, Dwain E Woode, Kristina Panizzi, and Peter G Anderson. Peir digital library: Online resources and authoring system. In Proceedings of the AMIA Symposium, page 1075. American Medical Informatics Association, 2001

  26. [26]

    Chaos challenge-combined (ct-mr) healthy abdominal organ segmentation

    A Emre Kavur, N Sinem Gezer, Mustafa Barış, Sinem Aslan, Pierre-Henri Conze, Vladimir Groza, Duc Duy Pham, Soumick Chatterjee, Philipp Ernst, Savaş Özkan, et al. Chaos challenge-combined (ct-mr) healthy abdominal organ segmentation. Medical Image Analysis, 69:101950, 2021

  27. [27]

    Performance of chatgpt on usmle: Potential for ai-assisted medical education using large language models

    Tiffany H Kung, Morgan Cheatham, Arielle Medenilla, Czarina Sillos, Lorie De Leon, Camille Elepaño, Maria Madriaga, Rimel Aggabao, Giezel Diaz-Candido, James Maningo, et al. Performance of chatgpt on usmle: Potential for ai-assisted medical education using large language models. PLoS digital health, 2(2):e0000198, 2023

  28. [28]

    A dataset of clinically generated visual questions and answers about radiology images

    Jason J Lau, Soumya Gayen, Asma Ben Abacha, and Dina Demner-Fushman. A dataset of clinically generated visual questions and answers about radiology images. Scientific data, 5(1):1–10, 2018

  29. [29]

    Llava-med: Training a large language-and-vision assistant for biomedicine in one day

    Chunyuan Li, Cliff Wong, Sheng Zhang, Naoto Usuyama, Haotian Liu, Jianwei Yang, Tristan Naumann, Hoifung Poon, and Jianfeng Gao. Llava-med: Training a large language-and-vision assistant for biomedicine in one day. Advances in Neural Information Processing Systems, 36, 2024

  30. [30]

    BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023

  31. [31]

    Silkie: Preference distillation for large visual language models

    Lei Li, Zhihui Xie, Mukai Li, Shunian Chen, Peiyi Wang, Liang Chen, Yazheng Yang, Benyou Wang, and Lingpeng Kong. Silkie: Preference distillation for large visual language models. arXiv preprint arXiv:2312.10665, 2023

  32. [32] Weixiong Lin, Ziheng Zhao, Xiaoman Zhang, Chaoyi Wu, Ya Zhang, Yanfeng Wang, and Weidi Xie. PMC-CLIP: Contrastive language-image pre-training using biomedical documents. 2023.

  33. [33] Zhihong Lin, Donghao Zhang, Qingyi Tao, Danli Shi, Gholamreza Haffari, Qi Wu, Mingguang He, and Zongyuan Ge. Medical visual question answering: A survey. arXiv preprint arXiv:2111.10056, 2022.

  34. [34] Bo Liu, Li-Ming Zhan, and Xiao-Ming Wu. Contrastive pre-training and representation distillation for medical visual question answering based on radiology images. In Medical Image Computing and Computer Assisted Intervention, pages 210–220. Springer, 2021.

  35. [35] Bo Liu, Li-Ming Zhan, Li Xu, Lin Ma, Yan Yang, and Xiao-Ming Wu. SLAKE: A semantically-labeled knowledge-enhanced dataset for medical visual question answering. In 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI), pages 1650–1654. IEEE, 2021.

  36. [36] Junling Liu, Ziming Wang, Qichen Ye, Dading Chong, Peilin Zhou, and Yining Hua. Qilin-Med-VL: Towards Chinese large vision-language model for general healthcare. arXiv preprint arXiv:2310.17956, 2023.

  37. [37] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.

  38. [38] Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. MathVista: Evaluating mathematical reasoning of foundation models in visual contexts. arXiv preprint arXiv:2310.02255, 2023.

  39. [39] Michael Moor, Qian Huang, Shirley Wu, Michihiro Yasunaga, Yash Dalmia, Jure Leskovec, Cyril Zakka, Eduardo Pontes Reis, and Pranav Rajpurkar. Med-Flamingo: A multimodal medical few-shot learner. In Machine Learning for Health (ML4H), pages 353–367. PMLR, 2023.

  40. [40] Binh D Nguyen, Thanh-Toan Do, Binh X Nguyen, Tuong Do, Erman Tjiputra, and Quang D Tran. Overcoming data limitation in medical visual question answering. In Medical Image Computing and Computer Assisted Intervention, pages 522–530. Springer, 2019.

  41. [41] Aaron Nicolson, Jason Dowling, and Bevan Koopman. A concise model for medical image captioning. In CLEF (Working Notes), pages 1611–1619, 2023.

  42. [42] Harsha Nori, Nicholas King, Scott Mayer McKinney, Dean Carignan, and Eric Horvitz. Capabilities of GPT-4 on medical challenge problems. arXiv preprint arXiv:2303.13375, 2023.

  43. [43] OpenAI. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.

  44. [44] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, 2002.

  45. [45] Jiwoo Park, Kangrok Oh, Kyunghwa Han, and Young Han Lee. Patient-centered radiology reports with generative artificial intelligence: Adding value to radiology reporting. Scientific Reports, 14(1):13218, 2024.

  46. [46] Obioma Pelka, Sven Koitka, Johannes Rückert, Felix Nensa, and Christoph M Friedrich. Radiology Objects in COntext (ROCO): A multimodal image dataset. In MICCAI Workshop on Large-scale Annotation of Biomedical Data and Expert Label Synthesis (LABELS) 2018, pages 180–189. Springer, 2018.

  47. [47] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.

  48. [48] Richard J Roberts. PubMed Central: The GenBank of the published literature. Proceedings of the National Academy of Sciences, 98:381–382, 2001.

  49. [49] Conrad W Safranek, Anne Elizabeth Sidamon-Eristoff, Aidan Gilson, and David Chartash. The role of large language models in medical education: Applications and implications, 2023.

  50. [50] Mehmet Saygin Seyfioglu, Wisdom O Ikezogwo, Fatemeh Ghezloo, Ranjay Krishna, and Linda Shapiro. Quilt-LLaVA: Visual instruction tuning by extracting localized narratives from open-source histopathology videos. arXiv preprint arXiv:2312.04746, 2023.

  51. [51] Karan Singhal, Shekoofeh Azizi, Tao Tu, S Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, et al. Large language models encode clinical knowledge. arXiv preprint arXiv:2212.13138, 2022.

  52. [52] Sanjay Subramanian et al. MedICaT: A dataset of medical images, captions, and textual references. In Findings of EMNLP, 2020.

  53. [53] Arun James Thirunavukarasu, Darren Shu Jeng Ting, Kabilan Elangovan, Laura Gutierrez, Ting Fang Tan, and Daniel Shu Wei Ting. Large language models in medicine. Nature Medicine, 29(8):1930–1940, 2023.

  54. [54] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.

  55. [55] Sheng Wang, Zihao Zhao, Xi Ouyang, Qian Wang, and Dinggang Shen. ChatCAD: Interactive computer-aided diagnosis on medical image using large language models. arXiv preprint arXiv:2302.07257, 2023.

  56. [56] Xiaosong Wang, Yifan Peng, Le Lu, Zhiyong Lu, Mohammadhadi Bagheri, and Ronald M Summers. ChestX-ray8: Hospital-scale chest X-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2097–2106, 2017.

  57. [57] Chaoyi Wu, Xiaoman Zhang, Ya Zhang, Yanfeng Wang, and Weidi Xie. PMC-LLaMA: Towards building open-source language models for medicine. arXiv preprint arXiv:2304.14454, 2023.

  58. [58] Chaoyi Wu, Xiaoman Zhang, Ya Zhang, Yanfeng Wang, and Weidi Xie. Towards generalist foundation model for radiology. arXiv preprint arXiv:2308.02463, 2023.

  59. [59] Jinge Wu, Yunsoo Kim, and Honghan Wu. Hallucination benchmark in medical visual question answering. arXiv preprint arXiv:2401.05827, 2024.

  60. [60] Jiancheng Yang, Hongwei Bran Li, and Donglai Wei. The impact of ChatGPT and LLMs on medical imaging stakeholders: Perspectives and use cases. Meta-Radiology, page 100007, 2023.

  61. [61] Jiancheng Yang, Rui Shi, and Bingbing Ni. MedMNIST classification decathlon: A lightweight AutoML benchmark for medical image analysis. In 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI), pages 191–195. IEEE, 2021.

  62. [62] Chenlu Zhan, Yufei Zhang, Yu Lin, Gaoang Wang, and Hongwei Wang. UniDCP: Unifying multiple medical vision-language tasks via dynamic cross-modal learnable prompts. arXiv preprint arXiv:2312.11171, 2023.

  63. [63] Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer. OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022.

  64. [64] Sheng Zhang, Yanbo Xu, Naoto Usuyama, Hanwen Xu, Jaspreet Bagga, Robert Tinn, Sam Preston, Rajesh Rao, Mu Wei, Naveen Valluri, et al. BiomedCLIP: A multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs. arXiv preprint arXiv:2303.00915, 2023.

A Supplemental Materials

A.1 Data Analysis

Fig. 6 shows the percent...