Recognition: 2 theorem links
PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering
Pith reviewed 2026-05-15 23:04 UTC · model grok-4.3
The pith
A generative model trained on a 227k-pair medical VQA dataset drawn from the biomedical literature outperforms prior MedVQA systems on clinical benchmarks after fine-tuning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By constructing the PMC-VQA dataset containing 227k VQA pairs from 149k images and training a model that aligns a pre-trained vision encoder with a large language model, the approach achieves significantly better performance than prior MedVQA models in generating relevant and accurate free-form answers on benchmarks such as VQA-RAD, SLAKE, and Image-Clef-2019 after fine-tuning.
What carries the argument
Alignment of the outputs of a pre-trained vision encoder with the embedding space of a large language model, so that the combined model can generate free-form answers to medical visual questions.
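As a rough illustration of what this alignment step can look like in code, here is a minimal sketch: a learned projection maps frozen vision-encoder features into the language model's embedding space, and the projected visual tokens are prepended to the question embeddings before decoding. The dimensions, token count, and the simple linear projection are illustrative assumptions, not the paper's reported MedVInT configuration.

```python
import torch
import torch.nn as nn

class VisionLanguageAligner(nn.Module):
    """Sketch of a projection-based vision-language alignment module."""

    def __init__(self, vision_dim: int = 768, llm_dim: int = 4096, num_visual_tokens: int = 32):
        super().__init__()
        # Learnable map from vision-encoder feature width to LLM embedding width.
        self.proj = nn.Linear(vision_dim, llm_dim)
        self.num_visual_tokens = num_visual_tokens

    def forward(self, patch_features: torch.Tensor, question_embeds: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim) from a frozen vision encoder.
        # question_embeds: (batch, seq_len, llm_dim) token embeddings of the question.
        visual_tokens = self.proj(patch_features[:, : self.num_visual_tokens, :])
        # The LLM then decodes the answer conditioned on [visual tokens; question tokens].
        return torch.cat([visual_tokens, question_embeds], dim=1)

# Toy usage with random tensors standing in for real encoder and LLM outputs.
aligner = VisionLanguageAligner()
fused = aligner(torch.randn(2, 196, 768), torch.randn(2, 16, 4096))
print(fused.shape)  # torch.Size([2, 48, 4096])
```

Only the learned projection into the LLM's input space is doing new work here; everything downstream is ordinary autoregressive decoding.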
If this is right
- Models trained this way generate more relevant and accurate free-form answers on public MedVQA benchmarks.
- The new dataset supports training across a wide range of medical image modalities and diseases.
- A manually verified test set offers a stricter benchmark for evaluating generative MedVQA methods.
- Centralized leaderboards help track improvements in the field.
Where Pith is reading between the lines
- Scaling medical visual instruction tuning could reduce reliance on task-specific annotated data for new clinical applications.
- Generative MedVQA systems may integrate more smoothly into doctor-AI conversations than classification-based ones.
- Literature-derived datasets might capture rare conditions better than small curated clinical sets if publication bias is limited.
- The success after fine-tuning suggests the initial large-scale training builds useful medical visual representations that transfer to other tasks.
Load-bearing premise
Images and questions taken from published medical papers represent the variety and difficulty of questions that arise in actual clinical practice.
What would settle it
If the trained model's answers on the manually verified test set turned out to be no more accurate or relevant than those of previous MedVQA models, that would indicate the dataset and training approach do not deliver the claimed improvement.
read the original abstract
Medical Visual Question Answering (MedVQA) presents a significant opportunity to enhance diagnostic accuracy and healthcare delivery by leveraging artificial intelligence to interpret and answer questions based on medical images. In this study, we reframe the problem of MedVQA as a generation task that naturally follows the human-machine interaction and propose a generative-based model for medical visual understanding by aligning visual information from a pre-trained vision encoder with a large language model. We establish a scalable pipeline to construct a large-scale medical visual question-answering dataset, named PMC-VQA, which contains 227k VQA pairs of 149k images that cover various modalities or diseases. We train the proposed model on PMC-VQA and then fine-tune it on multiple public benchmarks, e.g., VQA-RAD, SLAKE, and Image-Clef-2019, significantly outperforming existing MedVQA models in generating relevant, accurate free-form answers. In addition, we propose a test set that has undergone manual verification, which is significantly more challenging, serving to better monitor the development of generative MedVQA methods. To facilitate comprehensive evaluation and comparison, we have maintained a leaderboard at https://paperswithcode.com/paper/pmc-vqa-visual-instruction-tuning-for-medical, offering a centralized resource for tracking progress and benchmarking state-of-the-art approaches. The PMC-VQA dataset emerges as a vital resource for the field of research, and the MedVInT presents a significant breakthrough in the area of MedVQA.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces PMC-VQA, a large-scale dataset of 227k VQA pairs from 149k medical images extracted from PubMed Central literature across modalities and diseases. It proposes MedVInT, a generative model that aligns a pretrained vision encoder with an LLM for medical visual question answering. The model is pretrained on PMC-VQA then fine-tuned on public benchmarks (VQA-RAD, SLAKE, Image-Clef-2019), with claims of significant outperformance over prior MedVQA methods in free-form answer generation. A manually verified, more challenging test set is introduced along with a public leaderboard.
Significance. If the empirical gains hold under detailed scrutiny, the work would provide a valuable scalable pretraining resource for MedVQA and demonstrate the utility of generative alignment pipelines. The manual verification step and maintained leaderboard are constructive contributions that could help standardize evaluation. However, the central transfer claims depend on the unproven assumption that literature-derived pairs generalize to clinical distributions.
major comments (3)
- [§3] PMC-VQA construction: The dataset is built from PMC articles, which systematically favor clear, publishable findings; the manuscript provides no quantitative analysis of question ambiguity, diversity metrics, or comparison against real clinical query distributions. This directly affects the claim that fine-tuning gains on VQA-RAD/SLAKE/Image-Clef-2019 reflect genuine medical understanding rather than source artifacts.
- [§5] Experiments: The abstract and main claims assert significant outperformance, yet the manuscript supplies no ablation tables isolating the contribution of PMC-VQA pretraining versus standard fine-tuning, no error analysis on failure modes, and no statistical significance tests on the reported gains. These omissions make the central empirical result difficult to evaluate.
- [§4.2] Hard test set: The manually verified test set is described as significantly more challenging, but the paper does not specify the verification protocol, inter-annotator agreement, or how its distribution differs from the training split in terms of image quality, question complexity, or answer length.
minor comments (2)
- [Abstract] The abstract states quantitative superiority without any numbers; move at least the key accuracy or BLEU scores into the abstract for immediate readability.
- [§4] Notation for the vision-language alignment loss is introduced without an explicit equation number; add Eq. (X) and reference it consistently in §4 (a generic form of such an objective is sketched below).
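For orientation, a generic form that such a generative alignment objective often takes is sketched below. This is an assumed shape for illustration, not the paper's actual equation, with v the projected visual tokens, q the question tokens, and a_1, ..., a_T the answer tokens.

```latex
% Assumed generic objective (not the paper's Eq. (X)): negative log-likelihood
% of the answer tokens, conditioned on visual tokens v and question q.
\mathcal{L}_{\mathrm{gen}}(\theta) = -\sum_{t=1}^{T} \log p_{\theta}\!\left(a_t \mid v, q, a_{<t}\right)
```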
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and will incorporate revisions to strengthen the empirical support and transparency of the manuscript.
read point-by-point responses
-
Referee: [§3] PMC-VQA construction: The dataset is built from PMC articles, which systematically favor clear, publishable findings; the manuscript provides no quantitative analysis of question ambiguity, diversity metrics, or comparison against real clinical query distributions. This directly affects the claim that fine-tuning gains on VQA-RAD/SLAKE/Image-Clef-2019 reflect genuine medical understanding rather than source artifacts.
Authors: We agree that further characterization of PMC-VQA is warranted. In the revision we will add quantitative analyses of question diversity (modality/disease/question-type distributions), lexical and semantic ambiguity indicators, and direct statistical comparisons of question/answer distributions against the target clinical benchmarks (VQA-RAD, SLAKE, Image-Clef-2019). We will also explicitly discuss the literature-to-clinical domain gap as a limitation and future-work item. revision: yes
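A minimal sketch of two such checks follows; the metric choices, question-type categories, and counts are illustrative assumptions rather than analyses taken from the paper: a distinct-n lexical diversity score over questions, and a Jensen-Shannon divergence between question-type distributions of PMC-VQA and a target benchmark such as VQA-RAD.

```python
from collections import Counter
import math

def distinct_n(questions, n=2):
    """Fraction of unique n-grams among all n-grams across the questions."""
    ngrams = []
    for q in questions:
        toks = q.lower().split()
        ngrams.extend(zip(*[toks[i:] for i in range(n)]))
    return len(set(ngrams)) / max(len(ngrams), 1)

def js_divergence(counts_a, counts_b):
    """Jensen-Shannon divergence (base 2) between two count distributions."""
    keys = set(counts_a) | set(counts_b)
    total_a, total_b = sum(counts_a.values()), sum(counts_b.values())
    pa = [counts_a.get(k, 0) / total_a for k in keys]
    pb = [counts_b.get(k, 0) / total_b for k in keys]
    m = [(x + y) / 2 for x, y in zip(pa, pb)]
    kl = lambda p, q: sum(x * math.log2(x / y) for x, y in zip(p, q) if x > 0)
    return 0.5 * kl(pa, m) + 0.5 * kl(pb, m)

# Toy question-type counts (hypothetical, for illustration only).
pmc_types = Counter({"modality": 40, "abnormality": 30, "location": 20, "other": 10})
rad_types = Counter({"modality": 10, "abnormality": 50, "location": 30, "other": 10})
print(distinct_n(["what modality is shown", "what abnormality is present"]))
print(js_divergence(pmc_types, rad_types))
```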
-
Referee: [§5] Experiments: The abstract and main claims assert significant outperformance, yet the manuscript supplies no ablation tables isolating the contribution of PMC-VQA pretraining versus standard fine-tuning, no error analysis on failure modes, and no statistical significance tests on the reported gains. These omissions make the central empirical result difficult to evaluate.
Authors: We accept this criticism. The revised manuscript will include (i) ablation tables that isolate the effect of PMC-VQA pretraining, (ii) a dedicated error-analysis section with representative failure cases, and (iii) statistical significance testing (bootstrap resampling and McNemar tests) on all reported gains. These additions will be placed in §5 and the supplementary material. revision: yes
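A minimal sketch of the kind of paired bootstrap test promised here; the protocol details and the toy correctness lists are assumptions, not the authors' evaluation code:

```python
import random

def paired_bootstrap(correct_new, correct_base, n_boot=10_000, seed=0):
    """Approximate one-sided p-value: how often the baseline matches or beats
    the PMC-VQA-pretrained model when questions are resampled with replacement."""
    rng = random.Random(seed)
    n = len(correct_new)
    not_better = 0
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(correct_new[i] for i in idx) <= sum(correct_base[i] for i in idx):
            not_better += 1
    return not_better / n_boot

# Toy per-question correctness (1 = correct) for 200 questions, paired by index.
new_model = [1] * 130 + [0] * 70
baseline = [1] * 115 + [0] * 85
print(paired_bootstrap(new_model, baseline))
```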
-
Referee: [§4.2] Hard test set: The manually verified test set is described as significantly more challenging, but the paper does not specify the verification protocol, inter-annotator agreement, or how its distribution differs from the training split in terms of image quality, question complexity, or answer length.
Authors: We will expand §4.2 with a full description of the verification protocol (annotator background, number of reviewers per sample, resolution procedure for disagreements), report inter-annotator agreement (Cohen’s κ), and provide comparative statistics (image-quality scores, question-length and complexity distributions, answer-length histograms) between the hard test set and the training split. revision: yes
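For reference, the inter-annotator agreement statistic mentioned here (Cohen's κ) can be computed as in the sketch below; the keep/reject label scheme and the toy annotations are assumptions for illustration, not the paper's verification protocol:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators' equal-length label sequences."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b.get(c, 0) for c in counts_a) / (n * n)
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)

# Toy example: two annotators deciding whether each candidate QA pair is kept.
ann_a = ["keep", "keep", "reject", "keep", "reject", "keep", "keep", "reject"]
ann_b = ["keep", "keep", "reject", "reject", "reject", "keep", "keep", "keep"]
print(round(cohens_kappa(ann_a, ann_b), 3))  # ≈ 0.467
```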
Circularity Check
Standard empirical pipeline on external benchmarks; no load-bearing self-referential derivations or fitted predictions
full rationale
The paper constructs the PMC-VQA dataset from PubMed Central literature, aligns a pre-trained vision encoder with an LLM to form MedVInT, trains on the 227k pairs, and fine-tunes on independent public benchmarks (VQA-RAD, SLAKE, Image-Clef-2019). Reported gains are measured on those external test sets plus a manually verified subset. No equations, uniqueness theorems, or ansatzes are invoked that reduce the performance claims to quantities defined by the authors' own fitted parameters or prior self-citations. Self-citations, if present, are not load-bearing for the central empirical result. This matches the default expectation of a non-circular ML paper whose claims rest on reproducible external evaluation.
Axiom & Free-Parameter Ledger
free parameters (1)
- Vision-language alignment parameters
axioms (2)
- domain assumption: Pre-trained vision encoders extract features sufficient for medical image understanding when aligned with language models
- domain assumption: Literature-sourced image-question pairs form a representative training distribution for clinical MedVQA
Lean theorems connected to this paper
-
Cost.FunctionalEquation.washburn_uniqueness_aczel (unclear: the relation between the paper passage and the cited Recognition theorem is ambiguous)
Passage: "We train the proposed model on PMC-VQA and then fine-tune it on multiple public benchmarks, e.g., VQA-RAD, SLAKE, and Image-Clef-2019, significantly outperforming existing MedVQA models in generating relevant, accurate free-form answers."
-
Foundation.LawOfExistence.defect_zero_iff_one (unclear: the relation between the paper passage and the cited Recognition theorem is ambiguous)
Passage: "We establish a scalable pipeline to construct a large-scale medical visual question-answering dataset, named PMC-VQA, which contains 227k VQA pairs of 149k images that cover various modalities or diseases."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 19 Pith papers
-
DeepTumorVQA: A Hierarchical 3D CT Benchmark for Stage-Wise Evaluation of Medical VLMs and Tool-Augmented Agents
DeepTumorVQA is a new stage-wise 3D CT VQA benchmark showing that quantitative measurement is the main failure point for current medical VLMs and that tool augmentation substantially improves later reasoning stages.
-
MedOpenClaw and MedFlowBench: Auditing Medical Agents in Full-Study Workflows
MedFlowBench evaluates VLM agents on full radiology and pathology studies by requiring both task answers and verifiable evidence like key slices and regions of interest, revealing that answer-only scores overestimate ...
-
CXR-ContraBench: Benchmarking Negated-Option Attraction in Medical VLMs
Medical VLMs frequently select negated options that contradict visible chest X-ray findings, achieving only ~30% accuracy on direct presence probes, but a post-hoc consistency verifier raises accuracy above 95%.
-
CheXthought: A global multimodal dataset of clinical chain-of-thought reasoning and visual attention for chest X-ray interpretation
CheXthought supplies large-scale expert chain-of-thought reasoning and synchronized visual attention data for chest X-rays to train more accurate and interpretable clinical vision-language models.
-
X-PCR: A Benchmark for Cross-modality Progressive Clinical Reasoning in Ophthalmic Diagnosis
X-PCR is a new benchmark of 26,415 images and 177,868 expert VQA pairs that evaluates MLLMs on six-stage progressive reasoning and cross-modality integration in ophthalmology.
-
Verification Mirage: Mapping the Reliability Boundary of Self-Verification in Medical VQA
Self-verification in medical VQA creates a verification mirage where verifiers exhibit high error and agreement bias on wrong answers, with reliability strongly conditioned on task type.
-
RadThinking: A Dataset for Longitudinal Clinical Reasoning in Radiology
RadThinking releases a large longitudinal CT VQA dataset stratified into foundation perception questions, single-rule reasoning questions, and compositional multi-step chains grounded in clinical reporting standards f...
-
MedVIGIL: Evaluating Trustworthy Medical VLMs Under Broken Visual Evidence
MedVIGIL introduces a clinician-supervised benchmark showing medical VLMs frequently give fluent answers on broken visual evidence, with top models 14 points below human radiologists on the composite score.
-
Replacing Parameters with Preferences: Federated Alignment of Heterogeneous Vision-Language Models
MoR lets clients train local reward models on private preferences and uses a learned Mixture-of-Rewards with GRPO on the server to align a shared base VLM without exchanging parameters, architectures, or raw data.
-
MedSynapse-V: Bridging Visual Perception and Clinical Intuition via Latent Memory Evolution
MedSynapse-V evolves latent diagnostic memories via meta queries, causal counterfactual refinement with RL, and dual-branch memory transition to outperform prior medical VLM methods in diagnostic accuracy.
-
Dual Causal Inference: Integrating Backdoor Adjustment and Instrumental Variable Learning for Medical VQA
DCI unifies backdoor adjustment and instrumental variable learning in MedVQA to extract deconfounded representations, yielding better out-of-distribution performance on SLAKE, VQA-RAD and similar benchmarks.
-
MedRCube: A Multidimensional Framework for Fine-Grained and In-Depth Evaluation of MLLMs in Medical Imaging
MedRCube is a new fine-grained evaluation framework that benchmarks 33 MLLMs on medical imaging, ranks Lingshu-32B highest, and finds a significant positive link between shortcut behaviors and diagnostic performance.
-
Enhancing Fine-Grained Spatial Grounding in 3D CT Report Generation via Discriminative Guidance
DCP-PD improves macro F1 scores on CT report generation benchmarks and introduces a hierarchical location-aware evaluation protocol that reveals ongoing challenges in pathology spatial grounding.
-
Improving Medical VQA through Trajectory-Aware Process Supervision
A trajectory-aware process reward using DTW on sentence embeddings, combined with exact-match in GRPO after SFT, raises mean medical VQA accuracy from 0.598 to 0.689 across six benchmarks.
-
Better Eyes, Better Thoughts: Why Vision Chain-of-Thought Fails in Medicine
Chain-of-thought underperforms direct answering in medical VQA due to a perception bottleneck, but ROI cues and textual grounding interventions can improve results and reverse the gap.
-
LiteMedCoT-VL: Parameter-Efficient Adaptation for Medical Visual Question Answering
LiteMedCoT-VL distills chain-of-thought from a 235B model to 2B VLMs via LoRA, reaching 64.9% accuracy on PMC-VQA and beating a 4B zero-shot baseline by 11 points.
-
Learning from Medical Entity Trees: An Entity-Centric Medical Data Engineering Framework for MLLMs
A Medical Entity Tree organizes medical knowledge to engineer higher-quality training data that boosts general MLLMs on medical benchmarks.
-
MedLVR: Latent Visual Reasoning for Reliable Medical Visual Question Answering
MedLVR interleaves latent visual reasoning segments in autoregressive decoding and uses two-stage training to raise average medical VQA accuracy from 48.3% to 53.4% over a Qwen2.5-VL-7B backbone on OmniMedVQA and five...
-
MedXIAOHE: A Comprehensive Recipe for Building Medical MLLMs
MedXIAOHE is a medical MLLM that claims state-of-the-art benchmark performance through specialized pretraining to cover long-tail diseases and RL-based reasoning training.
Reference graph
Works this paper leans on
-
[1]
Flamingo: a visual language model for few-shot learning
Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022
work page 2022
-
[2]
Flamingo: a visual language model for few-shot learning
Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andy Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski, R...
work page 2022
-
[3]
The medical segmentation decathlon.Nature Communications, 13(1):4128, 2022
Michela Antonelli, Annika Reinke, Spyridon Bakas, Keyvan Farahani, Annette Kopp-Schneider, Bennett A Landman, Geert Litjens, Bjoern Menze, Olaf Ronneberger, Ronald M Summers, et al. The medical segmentation decathlon.Nature Communications, 13(1):4128, 2022
work page 2022
-
[4]
Anas Awadalla, Irena Gao, Joshua Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Jenia Jitsev, et al. Openflamingo, 2023
work page 2023
-
[5]
Junaid Bajwa, Usman Munir, Aditya Nori, and Bryan Williams. Artificial intelligence in healthcare: transforming the practice of medicine.Future healthcare journal, 8(2):e188–e194, 2021
work page 2021
-
[6]
Vqa- med: Overview of the medical visual question answering task at imageclef 2019
Asma Ben Abacha, Sadid A Hasan, Vivek V Datla, Dina Demner-Fushman, and Henning Müller. Vqa- med: Overview of the medical visual question answering task at imageclef 2019. InProceedings of CLEF (Conference and Labs of the Evaluation Forum) 2019 Working Notes. 9-12 September 2019, 2019
work page 2019
-
[7]
Asma Ben Abacha, Mourad Sarrouti, Dina Demner-Fushman, Sadid A Hasan, and Henning Müller. Overview of the vqa-med task at imageclef 2021: Visual question answering and generation in the medical domain. In Proceedings of the CLEF 2021 Conference and Labs of the Evaluation Forum-working notes. 21-24 September 2021, 2021
work page 2021
- [8]
-
[9]
Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts
Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3558–3568, 2021
work page 2021
-
[10]
Chatffa: Interactive visual question answering on fundus fluorescein angiography image using chatgpt
Xiaolan Chen, Pusheng Xu, Yao Li, Weiyi Zhang, Fan Song, Ying-Feng Zheng, Danli Shi, and Mingguang He. Chatffa: Interactive visual question answering on fundus fluorescein angiography image using chatgpt. Available at SSRN 4578568
-
[11]
Multi- modal masked autoencoders for medical vision-and-language pre-training
Zhihong Chen, Yuhao Du, Jinpeng Hu, Yang Liu, Guanbin Li, Xiang Wan, and Tsung-Hui Chang. Multi- modal masked autoencoders for medical vision-and-language pre-training. InMedical Image Computing and Computer Assisted Intervention, pages 679–689. Springer, 2022
work page 2022
-
[12]
Zhihong Chen, Maya Varma, Jean-Benoit Delbrouck, Magdalini Paschali, Louis Blankemeier, Dave Van Veen, Jeya Maria Jose Valanarasu, Alaa Youssef, Joseph Paul Cohen, Eduardo Pontes Reis, et al. Chexagent: Towards a foundation model for chest x-ray interpretation.arXiv preprint arXiv:2401.12208, 2024
-
[13]
Jianhong Cheng, Hulin Kuang, Qichang Zhao, Yahui Wang, Lei Xu, Jin Liu, and Jianxin Wang. Dwt-cv: Dense weight transfer-based cross validation strategy for model selection in biomedical data analysis. Future Generation Computer Systems, 135:20–29, 2022
work page 2022
-
[14]
Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality
Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023
work page 2023
-
[15]
The future landscape of large language models in medicine.Communications medicine, 3(1):141, 2023
Jan Clusmann, Fiona R Kolbinger, Hannah Sophie Muti, Zunamys I Carrero, Jan-Niklas Eckardt, Narmin Ghaffari Laleh, Chiara Maria Lavinia Löffler, Sophie-Caroline Schwarzkopf, Michaela Unger, Gregory P Veldhuizen, et al. The future landscape of large language models in medicine.Communications medicine, 3(1):141, 2023
work page 2023
-
[16]
Survey of multimodal medical question answering. BioMedInformatics, 4(1):50–74, 2023
Hilmi Demirhan and Wlodek Zadrozny. Survey of multimodal medical question answering. BioMedInformatics, 4(1):50–74, 2023
work page 2023
-
[17]
Optimal gradient checkpoint search for arbitrary computation graphs
Jianwei Feng and Dong Huang. Optimal gradient checkpoint search for arbitrary computation graphs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11433–11442, 2021
work page 2021
-
[18]
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. The pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020
work page 2020
-
[19]
Xiaolong Ge, Yanpeng Qu, Changjing Shang, Longzhi Yang, and Qiang Shen. A self-adaptive discrim- inative autoencoder for medical applications.IEEE Transactions on Circuits and Systems for Video Technology, 32(12):8875–8886, 2022
work page 2022
-
[20]
Domain-specific language model pretraining for biomedical natural language processing
Yu Gu, Robert Tinn, Hao Cheng, Michael Lucas, Naoto Usuyama, Xiaodong Liu, Tristan Naumann, Jianfeng Gao, and Hoifung Poon. Domain-specific language model pretraining for biomedical natural language processing. ACM Transactions on Computing for Healthcare (HEALTH), 3(1):1–23, 2021
work page 2021
-
[21]
Deep residual learning for image recognition
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016
work page 2016
-
[22]
Towards visual question answering on pathology images
Xuehai He, Yichen Zhang, Luntian Mou, Eric Xing, and Pengtao Xie. Towards visual question answering on pathology images. pages 708–718, 2020
work page 2020
-
[23]
Yutao Hu, Tianbin Li, Quanfeng Lu, Wenqi Shao, Junjun He, Yu Qiao, and Ping Luo. Omnimedvqa: A new large-scale comprehensive evaluation benchmark for medical lvlm.arXiv preprint arXiv:2402.09181, 2024
-
[24]
Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. What disease does this patient have? a large-scale open domain question answering dataset from medical exams. Applied Sciences, 11(14):6421, 2021
work page 2021
-
[25]
Peir digital library: Online resources and authoring system
Kristopher N Jones, Dwain E Woode, Kristina Panizzi, and Peter G Anderson. Peir digital library: Online resources and authoring system. InProceedings of the AMIA Symposium, page 1075. American Medical Informatics Association, 2001
work page 2001
-
[26]
A Emre Kavur, N Sinem Gezer, Mustafa Barış, Sinem Aslan, Pierre-Henri Conze, Vladimir Groza, Duc Duy Pham, Soumick Chatterjee, Philipp Ernst, Savaş Özkan, et al. Chaos challenge-combined (ct-mr) healthy abdominal organ segmentation.Medical Image Analysis, 69:101950, 2021
work page 2021
-
[27]
Tiffany H Kung, Morgan Cheatham, Arielle Medenilla, Czarina Sillos, Lorie De Leon, Camille Elepaño, Maria Madriaga, Rimel Aggabao, Giezel Diaz-Candido, James Maningo, et al. Performance of chatgpt on usmle: Potential for ai-assisted medical education using large language models.PLoS digital health, 2(2):e0000198, 2023
work page 2023
-
[28]
Jason J Lau, Soumya Gayen, Asma Ben Abacha, and Dina Demner-Fushman. A dataset of clinically generated visual questions and answers about radiology images.Scientific data, 5(1):1–10, 2018
work page 2018
-
[29]
Chunyuan Li, Cliff Wong, Sheng Zhang, Naoto Usuyama, Haotian Liu, Jianwei Yang, Tristan Nau- mann, Hoifung Poon, and Jianfeng Gao. Llava-med: Training a large language-and-vision assistant for biomedicine in one day.Advances in Neural Information Processing Systems, 36, 2024
work page 2024
-
[30]
Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models.arXiv preprint arXiv:2301.12597, 2023
work page 2023
-
[31]
Lei Li, Zhihui Xie, Mukai Li, Shunian Chen, Peiyi Wang, Liang Chen, Yazheng Yang, Benyou Wang, and Lingpeng Kong. Silkie: Preference distillation for large visual language models.arXiv preprint arXiv:2312.10665, 2023
-
[32]
Pmc-clip: Contrastive language-image pre-training using biomedical documents
Weixiong Lin, Ziheng Zhao, Xiaoman Zhang, Chaoyi Wu, Ya Zhang, Yanfeng Wang, and Weidi Xie. Pmc-clip: Contrastive language-image pre-training using biomedical documents. 2023
work page 2023
-
[33]
Medical visual question answering: A survey.arXiv preprint arXiv:2111.10056, 2022
Zhihong Lin, Donghao Zhang, Qingyi Tac, Danli Shi, Gholamreza Haffari, Qi Wu, Mingguang He, and Zongyuan Ge. Medical visual question answering: A survey.arXiv preprint arXiv:2111.10056, 2022
-
[34]
Bo Liu, Li-Ming Zhan, and Xiao-Ming Wu. Contrastive pre-training and representation distillation for medical visual question answering based on radiology images. In Medical Image Computing and Computer Assisted Intervention, pages 210–220. Springer, 2021
work page 2021
-
[35]
Slake: A semantically-labeled knowledge-enhanced dataset for medical visual question answering
Bo Liu, Li-Ming Zhan, Li Xu, Lin Ma, Yan Yang, and Xiao-Ming Wu. Slake: A semantically-labeled knowledge-enhanced dataset for medical visual question answering. In2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI), pages 1650–1654. IEEE, 2021
work page 2021
-
[36]
Junling Liu, Ziming Wang, Qichen Ye, Dading Chong, Peilin Zhou, and Yining Hua. Qilin-med-vl: Towards chinese large vision-language model for general healthcare.arXiv preprint arXiv:2310.17956, 2023
-
[37]
Decoupled Weight Decay Regularization
Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017
work page 2017
-
[38]
MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts
Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts.arXiv preprint arXiv:2310.02255, 2023
work page 2023
-
[39]
Med-flamingo: a multimodal medical few-shot learner
Michael Moor, Qian Huang, Shirley Wu, Michihiro Yasunaga, Yash Dalmia, Jure Leskovec, Cyril Zakka, Eduardo Pontes Reis, and Pranav Rajpurkar. Med-flamingo: a multimodal medical few-shot learner. In Machine Learning for Health (ML4H), pages 353–367. PMLR, 2023
work page 2023
-
[40]
Overcoming data limitation in medical visual question answering
Binh D Nguyen, Thanh-Toan Do, Binh X Nguyen, Tuong Do, Erman Tjiputra, and Quang D Tran. Overcoming data limitation in medical visual question answering. InMedical Image Computing and Computer Assisted Intervention, pages 522–530. Springer, 2019
work page 2019
-
[41]
A concise model for medical image captioning
Aaron Nicolson, Jason Dowling, and Bevan Koopman. A concise model for medical image captioning. In CLEF (Working Notes), pages 1611–1619, 2023
work page 2023
-
[42]
Capabilities of GPT-4 on Medical Challenge Problems
Harsha Nori, Nicholas King, Scott Mayer McKinney, Dean Carignan, and Eric Horvitz. Capabilities of gpt-4 on medical challenge problems.arXiv preprint arXiv:2303.13375, 2023
work page 2023
-
[43]
OpenAI. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023
work page 2023
-
[44]
Bleu: a method for automatic evaluation of machine translation
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. InProceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318, 2002
work page 2002
-
[45]
Jiwoo Park, Kangrok Oh, Kyunghwa Han, and Young Han Lee. Patient-centered radiology reports with generative artificial intelligence: adding value to radiology reporting.Scientific Reports, 14(1):13218, 2024
work page 2024
-
[46]
Radiology objects in context (roco): a multimodal image dataset
Obioma Pelka, Sven Koitka, Johannes Rückert, Felix Nensa, and Christoph M Friedrich. Radiology objects in context (roco): a multimodal image dataset. InMICCAI Workshop on Large-scale Annotation of Biomedical Data and Expert Label Synthesis (LABELS) 2018, pages 180–189. Springer, 2018
work page 2018
-
[47]
Learning transferable visual models from natural language supervision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, pages 8748–8763. PMLR, 2021
work page 2021
-
[48]
Pubmed central: The genbank of the published literature
Richard J Roberts. Pubmed central: The genbank of the published literature. volume 98, pages 381–382. National Acad Sciences, 2001
work page 2001
-
[49]
The role of large language models in medical education: applications and implications, 2023
Conrad W Safranek, Anne Elizabeth Sidamon-Eristoff, Aidan Gilson, and David Chartash. The role of large language models in medical education: applications and implications, 2023
work page 2023
-
[50]
Mehmet Saygin Seyfioglu, Wisdom O Ikezogwo, Fatemeh Ghezloo, Ranjay Krishna, and Linda Shapiro. Quilt-llava: Visual instruction tuning by extracting localized narratives from open-source histopathology videos. arXiv preprint arXiv:2312.04746, 2023
-
[51]
Large language models encode clinical knowledge
Karan Singhal, Shekoofeh Azizi, Tao Tu, S Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, et al. Large language models encode clinical knowledge. arXiv preprint arXiv:2212.13138, 2022
-
[52]
Medicat: A dataset of medical images, captions, and textual references
Sanjay Subramanian et al. Medicat: A dataset of medical images, captions, and textual references. In Findings of EMNLP, 2020
work page 2020
-
[53]
Large language models in medicine.Nature medicine, 29(8):1930–1940, 2023
Arun James Thirunavukarasu, Darren Shu Jeng Ting, Kabilan Elangovan, Laura Gutierrez, Ting Fang Tan, and Daniel Shu Wei Ting. Large language models in medicine. Nature medicine, 29(8):1930–1940, 2023
work page 2023
-
[54]
LLaMA: Open and Efficient Foundation Language Models
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models.arXiv preprint arXiv:2302.13971, 2023
work page 2023
-
[55]
Sheng Wang, Zihao Zhao, Xi Ouyang, Qian Wang, and Dinggang Shen. Chatcad: Interactive computer- aided diagnosis on medical image using large language models.arXiv preprint arXiv:2302.07257, 2023
-
[56]
Xiaosong Wang, Yifan Peng, Le Lu, Zhiyong Lu, Mohammadhadi Bagheri, and Ronald M Summers. Chestx-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 2097–2106, 2017
work page 2017
-
[57]
Chaoyi Wu, Xiaoman Zhang, Ya Zhang, Yanfeng Wang, and Weidi Xie. Pmc-llama: Towards building open-source language models for medicine.arXiv preprint arXiv:2304.14454, 2023
-
[58]
Towards generalist foundation model for radiology.arXiv preprint arXiv:2308.02463, 2023
Chaoyi Wu, Xiaoman Zhang, Ya Zhang, Yanfeng Wang, and Weidi Xie. Towards generalist foundation model for radiology.arXiv preprint arXiv:2308.02463, 2023
-
[59]
Hallucination benchmark in medical visual question answering
Jinge Wu, Yunsoo Kim, and Honghan Wu. Hallucination benchmark in medical visual question answering. arXiv preprint arXiv:2401.05827, 2024
-
[60]
Jiancheng Yang, Hongwei Bran Li, and Donglai Wei. The impact of chatgpt and llms on medical imaging stakeholders: perspectives and use cases.Meta-Radiology, page 100007, 2023
work page 2023
-
[61]
Medmnist classification decathlon: A lightweight automl benchmark for medical image analysis
Jiancheng Yang, Rui Shi, and Bingbing Ni. Medmnist classification decathlon: A lightweight automl benchmark for medical image analysis. In2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI), pages 191–195. IEEE, 2021
work page 2021
-
[62]
Chenlu Zhan, Yufei Zhang, Yu Lin, Gaoang Wang, and Hongwei Wang. Unidcp: Unifying multiple medical vision-language tasks via dynamic cross-modal learnable prompts.arXiv preprint arXiv:2312.11171, 2023
-
[63]
OPT: Open Pre-trained Transformer Language Models
Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer. Opt: Open pre-trained transformer language models.arXiv preprint arXiv:2205.01068, 2022
work page 2022
-
[64]
Sheng Zhang, Yanbo Xu, Naoto Usuyama, Hanwen Xu, Jaspreet Bagga, Robert Tinn, Sam Preston, Rajesh Rao, Mu Wei, Naveen Valluri, et al. Biomedclip: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs. arXiv preprint arXiv:2303.00915, 2023
work page 2023
discussion (0)