pith. sign in

arxiv: 2306.00890 · v1 · pith:QW3YFI5Bnew · submitted 2023-06-01 · 💻 cs.CV · cs.CL

LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day

Pith reviewed 2026-05-24 08:31 UTC · model grok-4.3

classification 💻 cs.CV cs.CL
keywords LLaVA-Medbiomedical vision-language modelmultimodal conversational AIPubMed figure captionsGPT-4 self-instructioncurriculum learningbiomedical visual question answeringfine-tuning
0
0 comments X

The pith

A general vision-language model can be adapted to answer open-ended questions about biomedical images after two-stage training on PubMed data and GPT-4 instructions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that a large general-domain vision-language model can be turned into a biomedical conversational assistant by first aligning it to figure-caption pairs extracted from PubMed Central and then fine-tuning it on open-ended instruction data created by GPT-4. This curriculum approach lets the model learn biomedical vocabulary and then conversational semantics, completing the entire process in under 15 hours on eight A100 GPUs. A sympathetic reader would care because the result suggests that domain-specific multimodal AI need not require enormous new pretraining runs or hand-labeled datasets. If the approach holds, it opens a route to rapidly building capable assistants that follow natural-language instructions about medical images.

Core claim

By leveraging a broad-coverage biomedical figure-caption dataset from PubMed Central and using GPT-4 to generate instruction-following data from those captions, a general vision-language model can be fine-tuned in two stages: first to align biomedical vocabulary using the raw pairs, then to master open-ended conversational semantics. The resulting LLaVA-Med model follows open-ended instructions to discuss biomedical images and outperforms prior supervised state-of-the-art methods on certain metrics across three standard biomedical visual question answering benchmarks.

What carries the argument

Two-stage curriculum learning that first aligns the model to biomedical figure-caption pairs and then tunes it on GPT-4-generated open-ended instruction data.

If this is right

  • Biomedical visual question answering systems can reach competitive accuracy without task-specific labeled training sets.
  • The same two-stage alignment-plus-instruction process can be applied to other image-heavy scientific domains.
  • Releasing the generated instruction data and the final model lowers the barrier for follow-on biomedical multimodal research.
  • Conversational interfaces for biomedical images become feasible with modest compute rather than industrial-scale resources.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method may generalize to other technical domains where large caption collections exist but conversational labels do not.
  • Success depends on continued access to a strong general-purpose language model like GPT-4 for data generation.
  • If the conversational capability transfers to real clinical workflows, it could accelerate adoption of image-based decision support tools.

Load-bearing premise

The instruction-following examples produced by GPT-4 from figure captions are rich enough to teach genuine open-ended conversational ability rather than surface-level pattern matching.

What would settle it

A set of new biomedical images paired with open-ended questions where expert clinicians rate LLaVA-Med responses as consistently less accurate or less instruction-following than a supervised baseline would falsify the performance claim.

Figures

Figures reproduced from arXiv: 2306.00890 by Chunyuan Li, Cliff Wong, Haotian Liu, Hoifung Poon, Jianfeng Gao, Jianwei Yang, Naoto Usuyama, Sheng Zhang, Tristan Naumann.

Figure 4
Figure 4. Figure 4: Contrast-enhanced CT scan of the chest for patient #1. A [PITH_FULL_IMAGE:figures/full_fig_p004_4.png] view at source ↗
Figure 1
Figure 1. Figure 1: An instance of our GPT-4 generated instruction-following data. Top: The figure and caption [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: The data statistics of biomedical multimodal instruction-following data: (a,b) The root [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: LLaVA-Med was initialized with the general-domain LLaVA and then continuously trained [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: messages we use to prompt GPT-4 to generate medical visual instruction-following data. Manually curated few-shot examples are included in the prompt, where each example has input sample[‘context’] and output sample[‘response’]. Please see [PITH_FULL_IMAGE:figures/full_fig_p015_4.png] view at source ↗
Figure 2
Figure 2. Figure 2: Chest X-ray. Cardiomegaly with diffuse bilateral interstitial infiltrates and a [PITH_FULL_IMAGE:figures/full_fig_p016_2.png] view at source ↗
Figure 5
Figure 5. Figure 5: One of the few-shot examples used in our prompt to construct medical visual instruction [PITH_FULL_IMAGE:figures/full_fig_p016_5.png] view at source ↗
read the original abstract

Conversational generative AI has demonstrated remarkable promise for empowering biomedical practitioners, but current investigations focus on unimodal text. Multimodal conversational AI has seen rapid progress by leveraging billions of image-text pairs from the public web, but such general-domain vision-language models still lack sophistication in understanding and conversing about biomedical images. In this paper, we propose a cost-efficient approach for training a vision-language conversational assistant that can answer open-ended research questions of biomedical images. The key idea is to leverage a large-scale, broad-coverage biomedical figure-caption dataset extracted from PubMed Central, use GPT-4 to self-instruct open-ended instruction-following data from the captions, and then fine-tune a large general-domain vision-language model using a novel curriculum learning method. Specifically, the model first learns to align biomedical vocabulary using the figure-caption pairs as is, then learns to master open-ended conversational semantics using GPT-4 generated instruction-following data, broadly mimicking how a layperson gradually acquires biomedical knowledge. This enables us to train a Large Language and Vision Assistant for BioMedicine (LLaVA-Med) in less than 15 hours (with eight A100s). LLaVA-Med exhibits excellent multimodal conversational capability and can follow open-ended instruction to assist with inquiries about a biomedical image. On three standard biomedical visual question answering datasets, LLaVA-Med outperforms previous supervised state-of-the-art on certain metrics. To facilitate biomedical multimodal research, we will release our instruction-following data and the LLaVA-Med model.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces LLaVA-Med, a biomedical vision-language assistant trained in under 15 hours on eight A100 GPUs. It extracts a large figure-caption dataset from PubMed Central, uses GPT-4 to self-instruct open-ended instruction-following data from the captions, and applies a two-stage curriculum: first aligning biomedical vocabulary on raw caption pairs, then fine-tuning on the GPT-4 data to acquire open-ended conversational semantics. The authors claim the resulting model exhibits excellent multimodal conversational capability for open-ended biomedical image inquiries and outperforms prior supervised state-of-the-art on certain metrics across three standard biomedical VQA datasets. The instruction data and model weights are to be released.

Significance. If the performance claims hold with proper controls and the model demonstrates genuine open-ended conversational ability rather than vocabulary-driven pattern matching, the work would be significant as a low-cost, reproducible recipe for domain-adapting general VLMs to biomedicine via self-instruction. The curriculum-learning framing and planned public release of data and model are concrete strengths that would aid follow-on research.

major comments (2)
  1. [§3.2] §3.2 (Data Generation): No human evaluation, error analysis, or comparison against human-written instructions is provided for the GPT-4-generated instruction-following data. This is load-bearing for the central claim that stage 2 teaches 'open-ended conversational semantics' beyond stage 1, because PubMed captions are short and non-conversational and any VQA gains could arise from vocabulary alignment alone.
  2. [§4.2] §4.2 (Ablation and Evaluation): No ablation isolating the contribution of the GPT-4 self-instruction stage versus caption alignment alone is reported. Without it, the attribution of conversational capability and the reported VQA improvements to the second stage cannot be verified, especially given that the three VQA benchmarks are predominantly closed-ended.
minor comments (1)
  1. [Abstract] Abstract: The statement that LLaVA-Med 'outperforms previous supervised state-of-the-art on certain metrics' is imprecise; the main text should explicitly list the metrics, the magnitude of gains, the exact baselines, and any statistical testing.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight important aspects for strengthening the evidence in our work. We address each major comment below and commit to revisions that directly respond to the concerns raised.

read point-by-point responses
  1. Referee: [§3.2] §3.2 (Data Generation): No human evaluation, error analysis, or comparison against human-written instructions is provided for the GPT-4-generated instruction-following data. This is load-bearing for the central claim that stage 2 teaches 'open-ended conversational semantics' beyond stage 1, because PubMed captions are short and non-conversational and any VQA gains could arise from vocabulary alignment alone.

    Authors: We agree that direct validation of the GPT-4-generated instruction data would provide stronger support for the distinct contribution of stage 2. In the revised manuscript, we will add a human evaluation on a sampled subset of the generated instructions, including error analysis and a comparison to human-written instructions. This addition will help substantiate that stage 2 imparts conversational semantics beyond vocabulary alignment from stage 1. revision: yes

  2. Referee: [§4.2] §4.2 (Ablation and Evaluation): No ablation isolating the contribution of the GPT-4 self-instruction stage versus caption alignment alone is reported. Without it, the attribution of conversational capability and the reported VQA improvements to the second stage cannot be verified, especially given that the three VQA benchmarks are predominantly closed-ended.

    Authors: We concur that an explicit ablation is necessary to isolate the effect of the GPT-4 self-instruction stage. The revised manuscript will include an ablation comparing model performance after stage 1 (caption alignment only) versus after stage 2 (full curriculum) on the three biomedical VQA benchmarks. This will allow direct attribution of any gains to the second stage while acknowledging the closed-ended nature of the benchmarks. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical training procedure with external benchmarks

full rationale

The paper presents an empirical method: extract PubMed figure-caption pairs, use GPT-4 to generate instruction-following data, then apply a two-stage curriculum fine-tune on a base vision-language model. Performance is measured on three external VQA datasets. No equations, fitted parameters renamed as predictions, or derivation chain exist. The central claim (conversational capability after stage 2) is supported by reported metrics rather than reducing to a self-defined quantity or self-citation. Self-citations, if present, are not load-bearing for any mathematical result. This is standard non-circular empirical ML work.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim depends on the assumption that GPT-4 can reliably generate useful open-ended biomedical instruction data from captions and that the two-stage curriculum produces genuine capability gains; no explicit free parameters beyond standard training hyperparameters are mentioned in the abstract, and no new physical or mathematical entities are introduced.

axioms (2)
  • domain assumption GPT-4 outputs from figure captions constitute high-quality, diverse instruction-following examples suitable for fine-tuning conversational behavior.
    Invoked when the authors describe the second training stage as learning 'open-ended conversational semantics' from GPT-4 generated data.
  • domain assumption A general-domain vision-language model can be effectively adapted to biomedicine via staged alignment on figure-caption pairs followed by instruction tuning.
    Core premise of the curriculum learning method described in the abstract.

pith-pipeline@v0.9.0 · 5832 in / 1587 out tokens · 34192 ms · 2026-05-24T08:31:28.310292+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 23 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. DermAgent: A Self-Reflective Agentic System for Dermatological Image Analysis with Multi-Tool Reasoning and Traceable Decision-Making

    cs.CV 2026-05 unverdicted novelty 7.0

    DermAgent orchestrates seven vision-language tools in a Plan-Execute-Reflect loop with dual-modality retrieval from 413k cases and a critic module to outperform GPT-4o by 17.6% in zero-shot dermatological diagnosis accuracy.

  2. ProtoMedAgent: Multimodal Clinical Interpretability via Privacy-Aware Agentic Workflows

    cs.CV 2026-05 unverdicted novelty 7.0

    ProtoMedAgent uses a privacy-aware agentic workflow with neuro-symbolic bottlenecks to achieve 91.2% faithfulness in clinical report generation, significantly outperforming standard RAG methods on a large patient cohort.

  3. CXR-ContraBench: Benchmarking Negated-Option Attraction in Medical VLMs

    cs.CV 2026-05 conditional novelty 7.0

    Medical VLMs frequently select negated options that contradict visible chest X-ray findings, achieving only ~30% accuracy on direct presence probes, but a post-hoc consistency verifier raises accuracy above 95%.

  4. Vision-Language Foundation Models for Comprehensive Automated Pavement Condition Assessment

    cs.CV 2026-04 unverdicted novelty 7.0

    Instruction-tuned vision-language model PaveGPT, trained on a large unified pavement dataset, achieves substantial gains over general models in comprehensive, standard-compliant pavement condition assessment.

  5. R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization

    cs.AI 2025-03 conditional novelty 7.0

    R1-VL uses StepGRPO with rule-based StepRAR and StepRVR rewards to let MLLMs learn step-by-step reasoning beyond imitation of positive paths.

  6. Detecting and Evaluating Medical Hallucinations in Large Vision Language Models

    cs.CV 2024-06 unverdicted novelty 7.0

    Presents Med-HallMark benchmark, MediHall Score metric, and MediHallDetector model for hallucination detection and evaluation in medical LVLMs.

  7. Infection-Reasoner: A Compact Vision-Language Model for Wound Infection Classification with Evidence-Grounded Clinical Reasoning

    cs.CV 2026-04 unverdicted novelty 6.0

    Infection-Reasoner, a 4B VLM, reaches 86.8% accuracy on wound infection classification while producing rationales rated mostly correct by experts, via GPT-5.1 distillation followed by reinforcement learning.

  8. CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning

    cs.CV 2026-04 unverdicted novelty 6.0

    CoME-VL fuses contrastive and self-supervised vision encoders via entropy-guided multi-layer aggregation and RoPE cross-attention to improve vision-language model performance on benchmarks.

  9. An Explainable Vision-Language Model Framework with Adaptive PID-Tversky Loss for Lumbar Spinal Stenosis Diagnosis

    cs.CV 2026-04 unverdicted novelty 6.0

    A VLM framework with spatial patch cross-attention and adaptive PID-Tversky loss reports 90.69% classification accuracy, 0.9512 Dice score, and 92.80 CIDEr for LSS diagnosis plus automated report generation.

  10. MedFM-Robust: Benchmarking Robustness of Medical Foundation Models

    cs.CV 2026-05 unverdicted novelty 5.0

    Introduces MedFM-Robust benchmark to evaluate robustness of medical vision-language and segmentation foundation models under real-world conditions.

  11. RoiMAM: Region-of-Interest Medical Attention Model for Efficient Vision-Language Understanding

    cs.CV 2026-05 unverdicted novelty 5.0

    RoiMAM integrates a training-free ROI Generation Module with Semantic Selective Suppression and a Text Prompt Enhancer to produce a compact VLM that reports 2 percent and 4.6 percent accuracy gains on SLAKE and PMC-VQ...

  12. AgentRx: A Benchmark Study of LLM Agents for Multimodal Clinical Prediction Tasks

    cs.AI 2026-05 unverdicted novelty 5.0

    Single-agent LLM frameworks outperform naive multi-agent systems in multimodal clinical risk prediction tasks and are better calibrated.

  13. RadLite: Multi-Task LoRA Fine-Tuning of Small Language Models for CPU-Deployable Radiology AI

    cs.CL 2026-05 unverdicted novelty 5.0

    LoRA fine-tuning of 3-4B SLMs on 162K multi-task radiology data yields strong performance deployable on consumer CPUs at 4-8 tokens/second.

  14. Medical Image Understanding Improves Survival Prediction via Visual Instruction Tuning

    cs.CV 2026-04 unverdicted novelty 5.0

    A vision-language model pre-trained via instruction tuning on CT-report pairs improves survival prediction accuracy over baselines, especially when clinical data alone is weak, while also producing text answers to cli...

  15. Preserving Knowledge in Large Language Model with Model-Agnostic Self-Decompression

    cs.CL 2024-06 unverdicted novelty 5.0

    Introduces Tree Generation (TG-SFT) to generate synthetic instruction-tuning data from LLMs, reducing catastrophic forgetting when fine-tuning MLLMs on domain-specific or multimodal data.

  16. Aligning Modalities in Vision Large Language Models via Preference Fine-tuning

    cs.LG 2024-02 unverdicted novelty 5.0

    POVID generates AI-created preference data to fine-tune vision-language models with DPO, reducing hallucinations and improving benchmark scores.

  17. MedFM-Robust: Benchmarking Robustness of Medical Foundation Models

    cs.CV 2026-05 unverdicted novelty 4.0

    Proposes MedFM-Robust benchmark to evaluate robustness of medical vision-language and segmentation foundation models for clinical reliability.

  18. MedXIAOHE: A Comprehensive Recipe for Building Medical MLLMs

    cs.CL 2026-02 unverdicted novelty 4.0

    MedXIAOHE is a medical MLLM that claims state-of-the-art benchmark performance through specialized pretraining to cover long-tail diseases and RL-based reasoning training.

  19. Foundation Models in Biomedical Imaging: Turning Hype into Reality

    q-bio.QM 2025-12 unverdicted novelty 4.0

    Foundation models excel at pattern recognition in biomedical imaging but lack causal reasoning, robustness, and safety for real-world use, so they should augment rather than replace clinical expertise according to the...

  20. Agent AI: Surveying the Horizons of Multimodal Interaction

    cs.AI 2024-01 unverdicted novelty 4.0

    The paper defines Agent AI as interactive multimodal systems that perceive grounded data and generate embodied actions, arguing this approach can mitigate hallucinations in foundation models.

  21. Improved Baselines with Visual Instruction Tuning

    cs.CV 2023-10 conditional novelty 4.0

    Simple changes to LLaVA using CLIP-ViT-L-336px, an MLP connector, and academic VQA data yield state-of-the-art results on 11 benchmarks with only 1.2M public examples and one-day training on 8 A100 GPUs.

  22. Data-Centric Foundation Models in Computational Healthcare: A Survey

    cs.LG 2024-01 unverdicted novelty 3.0

    The paper surveys data-centric strategies for foundation models in computational healthcare and supplies a curated list of related models and datasets.

  23. A Survey on Multimodal Large Language Models

    cs.CV 2023-06 accept novelty 3.0

    This survey organizes the architectures, training strategies, data, evaluation methods, extensions, and challenges of Multimodal Large Language Models.

Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages · cited by 22 Pith papers · 10 internal anchors

  1. [1]

    https://wanglab.ml/clinical_camel.html, 2023

    Clinical Camel. https://wanglab.ml/clinical_camel.html, 2023. 2

  2. [2]

    A General Language Assistant as a Laboratory for Alignment

    Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy Jones, Nicholas Joseph, Ben Mann, Nova DasSarma, et al. A general language assistant as a laboratory for alignment. arXiv preprint arXiv:2112.00861, 2021. 1

  3. [3]

    Covid or not covid? a great mimicker behind the smoke screen

    Malek Ayoub, Megan Quamme, Abdul-Rahman K Abdel-Reheem, Poe Lwin, and Megan K Quamme. Covid or not covid? a great mimicker behind the smoke screen. Cureus, 13(11),

  4. [4]

    Chondromyxoid fibroma of the rib: A rare benign tumor with potential for local recurrence

    Bappy Basak, Alexander Haragan, Michael Shackcloth, and Joyce Thekkudan. Chondromyxoid fibroma of the rib: A rare benign tumor with potential for local recurrence. Cureus, 13(10),

  5. [5]

    Vision– language model for visual question answering in medical imagery

    Yakoub Bazi, Mohamad Mahmoud Al Rahhal, Laila Bashmal, and Mansour Zuair. Vision– language model for visual question answering in medical imagery. Bioengineering, 2023. 3, 9

  6. [6]

    Vaping-induced lung injury: An uncharted territory

    Anchit Bharat, Nikita Jain, Belaal Sheikh, Hafiz Jeelani, and Maryna Shayuk. Vaping-induced lung injury: An uncharted territory. Cureus, 12, 07 2020. 7

  7. [7]

    Making the most of text semantics to improve biomedical vision–language processing

    Benedikt Boecking, Naoto Usuyama, Shruthi Bannur, Daniel C Castro, Anton Schwaighofer, Stephanie Hyland, Maria Wetscherek, Tristan Naumann, Aditya Nori, Javier Alvarez-Valle, et al. Making the most of text semantics to improve biomedical vision–language processing. In ECCV. Springer, 2022. 2

  8. [8]

    Sedigheh Eslami, Christoph Meinel, and Gerard De Melo. Pubmedclip: How much does clip benefit visual question answering in the medical domain? In Findings of the Association for Computational Linguistics: EACL 2023, pages 1151–1163, 2023. 2, 3, 9

  9. [9]

    Vision- language pre-training: Basics, recent advances, and future trends

    Zhe Gan, Linjie Li, Chunyuan Li, Lijuan Wang, Zicheng Liu, Jianfeng Gao, et al. Vision- language pre-training: Basics, recent advances, and future trends. Foundations and Trends® in Computer Graphics and Vision, 2022. 1

  10. [10]

    Domain-specific language model pretraining for biomedical natural language processing

    Yu Gu, Robert Tinn, Hao Cheng, Michael Lucas, Naoto Usuyama, Xiaodong Liu, Tristan Naumann, Jianfeng Gao, and Hoifung Poon. Domain-specific language model pretraining for biomedical natural language processing. ACM Transactions on Computing for Healthcare (HEALTH), 3(1):1–23, 2021. 2

  11. [11]

    Don’t stop pretraining: Adapt language models to domains and tasks

    Suchin Gururangan, Ana Marasovi´c, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A Smith. Don’t stop pretraining: Adapt language models to domains and tasks. arXiv preprint arXiv:2004.10964, 2020. 6

  12. [12]

    Medalpaca–an open-source collection of medical conversational ai models and training data

    Tianyu Han, Lisa C Adams, Jens-Michalis Papaioannou, Paul Grundmann, Tom Oberhauser, Alexander Löser, Daniel Truhn, and Keno K Bressem. Medalpaca–an open-source collection of medical conversational ai models and training data. arXiv preprint arXiv:2304.08247, 2023. 2

  13. [13]

    PathVQA: 30000+ Questions for Medical Visual Question Answering

    Xuehai He, Yichen Zhang, Luntian Mou, Eric Xing, and Pengtao Xie. Pathvqa: 30000+ questions for medical visual question answering. arXiv preprint arXiv:2003.10286, 2020. 3, 8

  14. [14]

    ClinicalBERT: Modeling Clinical Notes and Predicting Hospital Readmission

    Kexin Huang, Jaan Altosaar, and Rajesh Ranganath. Clinicalbert: Modeling clinical notes and predicting hospital readmission. arXiv preprint arXiv:1904.05342, 2019. 2

  15. [15]

    Mimic-cxr, a de-identified publicly available database of chest radiographs with free-text reports

    Alistair EW Johnson, Tom J Pollard, Seth J Berkowitz, Nathaniel R Greenbaum, Matthew P Lungren, Chih-ying Deng, Roger G Mark, and Steven Horng. Mimic-cxr, a de-identified publicly available database of chest radiographs with free-text reports. Scientific data, page 317,

  16. [16]

    A dataset of clinically generated visual questions and answers about radiology images

    Jason J Lau, Soumya Gayen, Asma Ben Abacha, and Dina Demner-Fushman. A dataset of clinically generated visual questions and answers about radiology images. Scientific data, 2018. 7

  17. [17]

    Biobert: a pre-trained biomedical language representation model for biomedical text mining

    Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. Biobert: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4):1234–1240, 2020. 2

  18. [18]

    Benefits, limits, and risks of gpt-4 as an ai chatbot for medicine

    Peter Lee, Sebastien Bubeck, and Joseph Petro. Benefits, limits, and risks of gpt-4 as an ai chatbot for medicine. New England Journal of Medicine, 388(13):1233–1239, 2023. 2

  19. [19]

    The ai revolution in medicine: Gpt-4 and beyond

    Peter Lee, Carey Goldberg, and Isaac Kohane. The ai revolution in medicine: Gpt-4 and beyond

  20. [20]

    Retrieval-augmented generation for knowledge-intensive NLP tasks

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. NeurIPS, 2020. 3 11

  21. [21]

    ELEV ATER: A bench- mark and toolkit for evaluating language-augmented visual models

    Chunyuan Li, Haotian Liu, Liunian Harold Li, Pengchuan Zhang, Jyoti Aneja, Jianwei Yang, Ping Jin, Houdong Hu, Zicheng Liu, Yong Jae Lee, and Jianfeng Gao. ELEV ATER: A bench- mark and toolkit for evaluating language-augmented visual models. In NeurIPS Track on Datasets and Benchmarks, 2022. 1

  22. [22]

    Self-supervised vision- language pretraining for medical visual question answering

    Pengfei Li, Gang Liu, Lin Tan, Jinying Liao, and Shenjun Zhong. Self-supervised vision- language pretraining for medical visual question answering. arXiv preprint arXiv:2211.13594,

  23. [23]

    Slake: A semantically- labeled knowledge-enhanced dataset for medical visual question answering

    Bo Liu, Li-Ming Zhan, Li Xu, Lin Ma, Yan Yang, and Xiao-Ming Wu. Slake: A semantically- labeled knowledge-enhanced dataset for medical visual question answering. In International Symposium on Biomedical Imaging (ISBI). IEEE, 2021. 8, 10

  24. [24]

    Visual Instruction Tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023. 1, 2, 4, 6

  25. [25]

    Learning customized visual models with retrieval-augmented knowledge

    Haotian Liu, Kilho Son, Jianwei Yang, Ce Liu, Jianfeng Gao, Yong Jae Lee, and Chunyuan Li. Learning customized visual models with retrieval-augmented knowledge. arXiv preprint arXiv:2301.07094, 2023. 3

  26. [26]

    Q2atransformer: Improving medical vqa via an answer querying decoder

    Yunyi Liu, Zhanyu Wang, Dong Xu, and Luping Zhou. Q2atransformer: Improving medical vqa via an answer querying decoder. arXiv preprint arXiv:2304.01611, 2023. 3, 9

  27. [27]

    Learn to explain: Multimodal reasoning via thought chains for science question answering

    Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems,

  28. [28]

    Biogpt: generative pre-trained transformer for biomedical text generation and mining

    Renqian Luo, Liai Sun, Yingce Xia, Tao Qin, Sheng Zhang, Hoifung Poon, and Tie-Yan Liu. Biogpt: generative pre-trained transformer for biomedical text generation and mining. Briefings in Bioinformatics, 2022. 2, 3

  29. [29]

    Conventional osteosarcoma of the mandible: Report of a rare case

    Hassan Mirmohammad Sadeghi, Abbas Karimi, Samira Derakhshan, Pouyan Aminishakib, and Kiarash Parchami. Conventional osteosarcoma of the mandible: Report of a rare case. Clinical Case Reports, 9(9):e04843, 2021. 5

  30. [30]

    Capabilities of GPT-4 on Medical Challenge Problems

    Harsha Nori, Nicholas King, Scott Mayer McKinney, Dean Carignan, and Eric Horvitz. Ca- pabilities of gpt-4 on medical challenge problems. arXiv preprint arXiv:2303.13375, 2023. 2

  31. [31]

    OpenAI. ChatGPT. https://openai.com/blog/chatgpt/, 2022. 2

  32. [32]

    GPT-4 Technical Report

    OpenAI. GPT-4 technical report. https://arxiv.org/abs/2303.08774, 2023. 1, 2

  33. [33]

    Quadratus femoris partial tear secondary to occult ischiofemoral impingement

    Kyriakos A Papavasiliou, Dimitrios Stamiris, Stavros Stamiris, Antonia Bintoudi, and Elefthe- rios Tsiridis. Quadratus femoris partial tear secondary to occult ischiofemoral impingement. Journal of Orthopaedic Case Reports, 11(9):7, 2021. 5

  34. [34]

    Instruction Tuning with GPT-4

    Baolin Peng, Chunyuan Li, Pengcheng He, Michel Galley, and Jianfeng Gao. Instruction tuning with GPT-4. arXiv preprint arXiv:2304.03277, 2023. 2

  35. [35]

    The appropriate use of radiography in clinical practice: a report of two cases of biomechanical versus malignant spine pain

    Roger Kevin Pringle and Lawrence H Wyatt. The appropriate use of radiography in clinical practice: a report of two cases of biomechanical versus malignant spine pain. Chiropractic & Osteopathy, 14(1):1–8, 2006. 4

  36. [36]

    Learning Transferable Visual Models From Natural Language Supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020, 2021. 9

  37. [37]

    Language models are unsupervised multitask learners

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 2019. 3 12

  38. [38]

    Augmenting the national institutes of health chest radiograph dataset with expert annotations of possible pneumonia

    George Shih, Carol C Wu, Safwan S Halabi, Marc D Kohli, Luciano M Prevedello, Tessa S Cook, Arjun Sharma, Judith K Amorosa, Veronica Arteaga, Maya Galperin-Aizenberg, et al. Augmenting the national institutes of health chest radiograph dataset with expert annotations of possible pneumonia. Radiology: Artificial Intelligence, 2019. 2

  39. [39]

    Visual med-alpaca: A parameter-efficient biomedical llm with visual capabilities

    Chang Shu, Baian Chen, Fangyu Liu, Zihao Fu, Ehsan Shareghi, and Nigel Collier. Visual med-alpaca: A parameter-efficient biomedical llm with visual capabilities. 2023. 2

  40. [40]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timo- thée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023. 3

  41. [41]

    Open-ended medical visual question answering through prefix tuning of language models

    Tom van Sonsbeek, Mohammad Mahdi Derakhshani, Ivona Najdenkoska, Cees GM Snoek, and Marcel Worring. Open-ended medical visual question answering through prefix tuning of language models. arXiv preprint arXiv:2303.05977, 2023. 3, 9

  42. [42]

    BiomedLM: a domain-specific large language model for biomedical text

    A Venigalla, J Frankle, and M Carbin. BiomedLM: a domain-specific large language model for biomedical text. MosaicML. Accessed: Dec, 23, 2022. 3

  43. [43]

    Vicuna: An open-source chatbot impressing GPT-4 with 90%* chatgpt quality

    Vicuna. Vicuna: An open-source chatbot impressing GPT-4 with 90%* chatgpt quality. https: //vicuna.lmsys.org/, 2023. 3, 6

  44. [44]

    Huatuo: Tuning llama model with chinese medical knowledge, 2023

    Haochun Wang, Chi Liu, Nuwa Xi, Zewen Qiang, Sendong Zhao, Bing Qin, and Ting Liu. Huatuo: Tuning llama model with chinese medical knowledge, 2023. 2

  45. [45]

    Pmc-llama: Further finetuning llama on medical papers

    Chaoyi Wu, Xiaoman Zhang, Ya Zhang, Yanfeng Wang, and Weidi Xie. Pmc-llama: Further finetuning llama on medical papers. arXiv preprint arXiv:2304.14454, 2023. 2

  46. [46]

    Doctorglm: Fine-tuning your chinese doctor is not a herculean task

    Honglin Xiong, Sheng Wang, Yitao Zhu, Zihao Zhao, Yuxiao Liu, Qian Wang, and Dinggang Shen. Doctorglm: Fine-tuning your chinese doctor is not a herculean task. arXiv preprint arXiv:2304.01097, 2023. 2

  47. [47]

    Chatdoctor: A medical chat model fine-tuned on llama model using medical domain knowledge

    Li Yunxiang, Li Zihan, Zhang Kai, Dan Ruilong, and Zhang You. Chatdoctor: A medical chat model fine-tuned on llama model using medical domain knowledge. arXiv preprint arXiv:2303.14070, 2023. 2

  48. [48]

    Delayed spontaneous regression of metastatic gastric cancer: A case report of a rare finding

    Mansoor Zafar, Abdul Wahab Paracha, Muteeb Ashraf, Tila Muhammad, Mark Whitehead, Muhammad Toqeer, and Abdul Paracha. Delayed spontaneous regression of metastatic gastric cancer: A case report of a rare finding. Cureus, 13(12), 2021. 5

  49. [49]

    BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs

    Sheng Zhang, Yanbo Xu, Naoto Usuyama, Jaspreet Bagga, Robert Tinn, Sam Preston, Rajesh Rao, Mu Wei, Naveen Valluri, Cliff Wong, et al. Large-scale domain-specific pretraining for biomedical vision-language processing. arXiv preprint arXiv:2303.00915, 2023. 2, 3, 6, 9 13 A Data Instructions for brief image description. The list of instructions used to brie...