LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day
Pith reviewed 2026-05-24 08:31 UTC · model grok-4.3
The pith
A general vision-language model can be adapted to answer open-ended questions about biomedical images after two-stage training on PubMed data and GPT-4 instructions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By leveraging a broad-coverage biomedical figure-caption dataset from PubMed Central and using GPT-4 to generate instruction-following data from those captions, a general vision-language model can be fine-tuned in two stages: first to align biomedical vocabulary using the raw pairs, then to master open-ended conversational semantics. The resulting LLaVA-Med model follows open-ended instructions to discuss biomedical images and outperforms prior supervised state-of-the-art methods on certain metrics across three standard biomedical visual question answering benchmarks.
What carries the argument
Two-stage curriculum learning that first aligns the model to biomedical figure-caption pairs and then tunes it on GPT-4-generated open-ended instruction data.
If this is right
- Biomedical visual question answering systems can reach competitive accuracy without task-specific labeled training sets.
- The same two-stage alignment-plus-instruction process can be applied to other image-heavy scientific domains.
- Releasing the generated instruction data and the final model lowers the barrier for follow-on biomedical multimodal research.
- Conversational interfaces for biomedical images become feasible with modest compute rather than industrial-scale resources.
Where Pith is reading between the lines
- The method may generalize to other technical domains where large caption collections exist but conversational labels do not.
- Success depends on continued access to a strong general-purpose language model like GPT-4 for data generation.
- If the conversational capability transfers to real clinical workflows, it could accelerate adoption of image-based decision support tools.
Load-bearing premise
The instruction-following examples produced by GPT-4 from figure captions are rich enough to teach genuine open-ended conversational ability rather than surface-level pattern matching.
What would settle it
A set of new biomedical images paired with open-ended questions where expert clinicians rate LLaVA-Med responses as consistently less accurate or less instruction-following than a supervised baseline would falsify the performance claim.
Figures
read the original abstract
Conversational generative AI has demonstrated remarkable promise for empowering biomedical practitioners, but current investigations focus on unimodal text. Multimodal conversational AI has seen rapid progress by leveraging billions of image-text pairs from the public web, but such general-domain vision-language models still lack sophistication in understanding and conversing about biomedical images. In this paper, we propose a cost-efficient approach for training a vision-language conversational assistant that can answer open-ended research questions of biomedical images. The key idea is to leverage a large-scale, broad-coverage biomedical figure-caption dataset extracted from PubMed Central, use GPT-4 to self-instruct open-ended instruction-following data from the captions, and then fine-tune a large general-domain vision-language model using a novel curriculum learning method. Specifically, the model first learns to align biomedical vocabulary using the figure-caption pairs as is, then learns to master open-ended conversational semantics using GPT-4 generated instruction-following data, broadly mimicking how a layperson gradually acquires biomedical knowledge. This enables us to train a Large Language and Vision Assistant for BioMedicine (LLaVA-Med) in less than 15 hours (with eight A100s). LLaVA-Med exhibits excellent multimodal conversational capability and can follow open-ended instruction to assist with inquiries about a biomedical image. On three standard biomedical visual question answering datasets, LLaVA-Med outperforms previous supervised state-of-the-art on certain metrics. To facilitate biomedical multimodal research, we will release our instruction-following data and the LLaVA-Med model.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces LLaVA-Med, a biomedical vision-language assistant trained in under 15 hours on eight A100 GPUs. It extracts a large figure-caption dataset from PubMed Central, uses GPT-4 to self-instruct open-ended instruction-following data from the captions, and applies a two-stage curriculum: first aligning biomedical vocabulary on raw caption pairs, then fine-tuning on the GPT-4 data to acquire open-ended conversational semantics. The authors claim the resulting model exhibits excellent multimodal conversational capability for open-ended biomedical image inquiries and outperforms prior supervised state-of-the-art on certain metrics across three standard biomedical VQA datasets. The instruction data and model weights are to be released.
Significance. If the performance claims hold with proper controls and the model demonstrates genuine open-ended conversational ability rather than vocabulary-driven pattern matching, the work would be significant as a low-cost, reproducible recipe for domain-adapting general VLMs to biomedicine via self-instruction. The curriculum-learning framing and planned public release of data and model are concrete strengths that would aid follow-on research.
major comments (2)
- [§3.2] §3.2 (Data Generation): No human evaluation, error analysis, or comparison against human-written instructions is provided for the GPT-4-generated instruction-following data. This is load-bearing for the central claim that stage 2 teaches 'open-ended conversational semantics' beyond stage 1, because PubMed captions are short and non-conversational and any VQA gains could arise from vocabulary alignment alone.
- [§4.2] §4.2 (Ablation and Evaluation): No ablation isolating the contribution of the GPT-4 self-instruction stage versus caption alignment alone is reported. Without it, the attribution of conversational capability and the reported VQA improvements to the second stage cannot be verified, especially given that the three VQA benchmarks are predominantly closed-ended.
minor comments (1)
- [Abstract] Abstract: The statement that LLaVA-Med 'outperforms previous supervised state-of-the-art on certain metrics' is imprecise; the main text should explicitly list the metrics, the magnitude of gains, the exact baselines, and any statistical testing.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which highlight important aspects for strengthening the evidence in our work. We address each major comment below and commit to revisions that directly respond to the concerns raised.
read point-by-point responses
-
Referee: [§3.2] §3.2 (Data Generation): No human evaluation, error analysis, or comparison against human-written instructions is provided for the GPT-4-generated instruction-following data. This is load-bearing for the central claim that stage 2 teaches 'open-ended conversational semantics' beyond stage 1, because PubMed captions are short and non-conversational and any VQA gains could arise from vocabulary alignment alone.
Authors: We agree that direct validation of the GPT-4-generated instruction data would provide stronger support for the distinct contribution of stage 2. In the revised manuscript, we will add a human evaluation on a sampled subset of the generated instructions, including error analysis and a comparison to human-written instructions. This addition will help substantiate that stage 2 imparts conversational semantics beyond vocabulary alignment from stage 1. revision: yes
-
Referee: [§4.2] §4.2 (Ablation and Evaluation): No ablation isolating the contribution of the GPT-4 self-instruction stage versus caption alignment alone is reported. Without it, the attribution of conversational capability and the reported VQA improvements to the second stage cannot be verified, especially given that the three VQA benchmarks are predominantly closed-ended.
Authors: We concur that an explicit ablation is necessary to isolate the effect of the GPT-4 self-instruction stage. The revised manuscript will include an ablation comparing model performance after stage 1 (caption alignment only) versus after stage 2 (full curriculum) on the three biomedical VQA benchmarks. This will allow direct attribution of any gains to the second stage while acknowledging the closed-ended nature of the benchmarks. revision: yes
Circularity Check
No circularity: empirical training procedure with external benchmarks
full rationale
The paper presents an empirical method: extract PubMed figure-caption pairs, use GPT-4 to generate instruction-following data, then apply a two-stage curriculum fine-tune on a base vision-language model. Performance is measured on three external VQA datasets. No equations, fitted parameters renamed as predictions, or derivation chain exist. The central claim (conversational capability after stage 2) is supported by reported metrics rather than reducing to a self-defined quantity or self-citation. Self-citations, if present, are not load-bearing for any mathematical result. This is standard non-circular empirical ML work.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption GPT-4 outputs from figure captions constitute high-quality, diverse instruction-following examples suitable for fine-tuning conversational behavior.
- domain assumption A general-domain vision-language model can be effectively adapted to biomedicine via staged alignment on figure-caption pairs followed by instruction tuning.
Forward citations
Cited by 23 Pith papers
-
DermAgent: A Self-Reflective Agentic System for Dermatological Image Analysis with Multi-Tool Reasoning and Traceable Decision-Making
DermAgent orchestrates seven vision-language tools in a Plan-Execute-Reflect loop with dual-modality retrieval from 413k cases and a critic module to outperform GPT-4o by 17.6% in zero-shot dermatological diagnosis accuracy.
-
ProtoMedAgent: Multimodal Clinical Interpretability via Privacy-Aware Agentic Workflows
ProtoMedAgent uses a privacy-aware agentic workflow with neuro-symbolic bottlenecks to achieve 91.2% faithfulness in clinical report generation, significantly outperforming standard RAG methods on a large patient cohort.
-
CXR-ContraBench: Benchmarking Negated-Option Attraction in Medical VLMs
Medical VLMs frequently select negated options that contradict visible chest X-ray findings, achieving only ~30% accuracy on direct presence probes, but a post-hoc consistency verifier raises accuracy above 95%.
-
Vision-Language Foundation Models for Comprehensive Automated Pavement Condition Assessment
Instruction-tuned vision-language model PaveGPT, trained on a large unified pavement dataset, achieves substantial gains over general models in comprehensive, standard-compliant pavement condition assessment.
-
R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization
R1-VL uses StepGRPO with rule-based StepRAR and StepRVR rewards to let MLLMs learn step-by-step reasoning beyond imitation of positive paths.
-
Detecting and Evaluating Medical Hallucinations in Large Vision Language Models
Presents Med-HallMark benchmark, MediHall Score metric, and MediHallDetector model for hallucination detection and evaluation in medical LVLMs.
-
Infection-Reasoner: A Compact Vision-Language Model for Wound Infection Classification with Evidence-Grounded Clinical Reasoning
Infection-Reasoner, a 4B VLM, reaches 86.8% accuracy on wound infection classification while producing rationales rated mostly correct by experts, via GPT-5.1 distillation followed by reinforcement learning.
-
CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning
CoME-VL fuses contrastive and self-supervised vision encoders via entropy-guided multi-layer aggregation and RoPE cross-attention to improve vision-language model performance on benchmarks.
-
An Explainable Vision-Language Model Framework with Adaptive PID-Tversky Loss for Lumbar Spinal Stenosis Diagnosis
A VLM framework with spatial patch cross-attention and adaptive PID-Tversky loss reports 90.69% classification accuracy, 0.9512 Dice score, and 92.80 CIDEr for LSS diagnosis plus automated report generation.
-
MedFM-Robust: Benchmarking Robustness of Medical Foundation Models
Introduces MedFM-Robust benchmark to evaluate robustness of medical vision-language and segmentation foundation models under real-world conditions.
-
RoiMAM: Region-of-Interest Medical Attention Model for Efficient Vision-Language Understanding
RoiMAM integrates a training-free ROI Generation Module with Semantic Selective Suppression and a Text Prompt Enhancer to produce a compact VLM that reports 2 percent and 4.6 percent accuracy gains on SLAKE and PMC-VQ...
-
AgentRx: A Benchmark Study of LLM Agents for Multimodal Clinical Prediction Tasks
Single-agent LLM frameworks outperform naive multi-agent systems in multimodal clinical risk prediction tasks and are better calibrated.
-
RadLite: Multi-Task LoRA Fine-Tuning of Small Language Models for CPU-Deployable Radiology AI
LoRA fine-tuning of 3-4B SLMs on 162K multi-task radiology data yields strong performance deployable on consumer CPUs at 4-8 tokens/second.
-
Medical Image Understanding Improves Survival Prediction via Visual Instruction Tuning
A vision-language model pre-trained via instruction tuning on CT-report pairs improves survival prediction accuracy over baselines, especially when clinical data alone is weak, while also producing text answers to cli...
-
Preserving Knowledge in Large Language Model with Model-Agnostic Self-Decompression
Introduces Tree Generation (TG-SFT) to generate synthetic instruction-tuning data from LLMs, reducing catastrophic forgetting when fine-tuning MLLMs on domain-specific or multimodal data.
-
Aligning Modalities in Vision Large Language Models via Preference Fine-tuning
POVID generates AI-created preference data to fine-tune vision-language models with DPO, reducing hallucinations and improving benchmark scores.
-
MedFM-Robust: Benchmarking Robustness of Medical Foundation Models
Proposes MedFM-Robust benchmark to evaluate robustness of medical vision-language and segmentation foundation models for clinical reliability.
-
MedXIAOHE: A Comprehensive Recipe for Building Medical MLLMs
MedXIAOHE is a medical MLLM that claims state-of-the-art benchmark performance through specialized pretraining to cover long-tail diseases and RL-based reasoning training.
-
Foundation Models in Biomedical Imaging: Turning Hype into Reality
Foundation models excel at pattern recognition in biomedical imaging but lack causal reasoning, robustness, and safety for real-world use, so they should augment rather than replace clinical expertise according to the...
-
Agent AI: Surveying the Horizons of Multimodal Interaction
The paper defines Agent AI as interactive multimodal systems that perceive grounded data and generate embodied actions, arguing this approach can mitigate hallucinations in foundation models.
-
Improved Baselines with Visual Instruction Tuning
Simple changes to LLaVA using CLIP-ViT-L-336px, an MLP connector, and academic VQA data yield state-of-the-art results on 11 benchmarks with only 1.2M public examples and one-day training on 8 A100 GPUs.
-
Data-Centric Foundation Models in Computational Healthcare: A Survey
The paper surveys data-centric strategies for foundation models in computational healthcare and supplies a curated list of related models and datasets.
-
A Survey on Multimodal Large Language Models
This survey organizes the architectures, training strategies, data, evaluation methods, extensions, and challenges of Multimodal Large Language Models.
Reference graph
Works this paper leans on
-
[1]
https://wanglab.ml/clinical_camel.html, 2023
Clinical Camel. https://wanglab.ml/clinical_camel.html, 2023. 2
work page 2023
-
[2]
A General Language Assistant as a Laboratory for Alignment
Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy Jones, Nicholas Joseph, Ben Mann, Nova DasSarma, et al. A general language assistant as a laboratory for alignment. arXiv preprint arXiv:2112.00861, 2021. 1
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[3]
Covid or not covid? a great mimicker behind the smoke screen
Malek Ayoub, Megan Quamme, Abdul-Rahman K Abdel-Reheem, Poe Lwin, and Megan K Quamme. Covid or not covid? a great mimicker behind the smoke screen. Cureus, 13(11),
-
[4]
Chondromyxoid fibroma of the rib: A rare benign tumor with potential for local recurrence
Bappy Basak, Alexander Haragan, Michael Shackcloth, and Joyce Thekkudan. Chondromyxoid fibroma of the rib: A rare benign tumor with potential for local recurrence. Cureus, 13(10),
-
[5]
Vision– language model for visual question answering in medical imagery
Yakoub Bazi, Mohamad Mahmoud Al Rahhal, Laila Bashmal, and Mansour Zuair. Vision– language model for visual question answering in medical imagery. Bioengineering, 2023. 3, 9
work page 2023
-
[6]
Vaping-induced lung injury: An uncharted territory
Anchit Bharat, Nikita Jain, Belaal Sheikh, Hafiz Jeelani, and Maryna Shayuk. Vaping-induced lung injury: An uncharted territory. Cureus, 12, 07 2020. 7
work page 2020
-
[7]
Making the most of text semantics to improve biomedical vision–language processing
Benedikt Boecking, Naoto Usuyama, Shruthi Bannur, Daniel C Castro, Anton Schwaighofer, Stephanie Hyland, Maria Wetscherek, Tristan Naumann, Aditya Nori, Javier Alvarez-Valle, et al. Making the most of text semantics to improve biomedical vision–language processing. In ECCV. Springer, 2022. 2
work page 2022
-
[8]
Sedigheh Eslami, Christoph Meinel, and Gerard De Melo. Pubmedclip: How much does clip benefit visual question answering in the medical domain? In Findings of the Association for Computational Linguistics: EACL 2023, pages 1151–1163, 2023. 2, 3, 9
work page 2023
-
[9]
Vision- language pre-training: Basics, recent advances, and future trends
Zhe Gan, Linjie Li, Chunyuan Li, Lijuan Wang, Zicheng Liu, Jianfeng Gao, et al. Vision- language pre-training: Basics, recent advances, and future trends. Foundations and Trends® in Computer Graphics and Vision, 2022. 1
work page 2022
-
[10]
Domain-specific language model pretraining for biomedical natural language processing
Yu Gu, Robert Tinn, Hao Cheng, Michael Lucas, Naoto Usuyama, Xiaodong Liu, Tristan Naumann, Jianfeng Gao, and Hoifung Poon. Domain-specific language model pretraining for biomedical natural language processing. ACM Transactions on Computing for Healthcare (HEALTH), 3(1):1–23, 2021. 2
work page 2021
-
[11]
Don’t stop pretraining: Adapt language models to domains and tasks
Suchin Gururangan, Ana Marasovi´c, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A Smith. Don’t stop pretraining: Adapt language models to domains and tasks. arXiv preprint arXiv:2004.10964, 2020. 6
-
[12]
Medalpaca–an open-source collection of medical conversational ai models and training data
Tianyu Han, Lisa C Adams, Jens-Michalis Papaioannou, Paul Grundmann, Tom Oberhauser, Alexander Löser, Daniel Truhn, and Keno K Bressem. Medalpaca–an open-source collection of medical conversational ai models and training data. arXiv preprint arXiv:2304.08247, 2023. 2
-
[13]
PathVQA: 30000+ Questions for Medical Visual Question Answering
Xuehai He, Yichen Zhang, Luntian Mou, Eric Xing, and Pengtao Xie. Pathvqa: 30000+ questions for medical visual question answering. arXiv preprint arXiv:2003.10286, 2020. 3, 8
work page internal anchor Pith review Pith/arXiv arXiv 2003
-
[14]
ClinicalBERT: Modeling Clinical Notes and Predicting Hospital Readmission
Kexin Huang, Jaan Altosaar, and Rajesh Ranganath. Clinicalbert: Modeling clinical notes and predicting hospital readmission. arXiv preprint arXiv:1904.05342, 2019. 2
work page internal anchor Pith review Pith/arXiv arXiv 1904
-
[15]
Mimic-cxr, a de-identified publicly available database of chest radiographs with free-text reports
Alistair EW Johnson, Tom J Pollard, Seth J Berkowitz, Nathaniel R Greenbaum, Matthew P Lungren, Chih-ying Deng, Roger G Mark, and Steven Horng. Mimic-cxr, a de-identified publicly available database of chest radiographs with free-text reports. Scientific data, page 317,
-
[16]
A dataset of clinically generated visual questions and answers about radiology images
Jason J Lau, Soumya Gayen, Asma Ben Abacha, and Dina Demner-Fushman. A dataset of clinically generated visual questions and answers about radiology images. Scientific data, 2018. 7
work page 2018
-
[17]
Biobert: a pre-trained biomedical language representation model for biomedical text mining
Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. Biobert: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4):1234–1240, 2020. 2
work page 2020
-
[18]
Benefits, limits, and risks of gpt-4 as an ai chatbot for medicine
Peter Lee, Sebastien Bubeck, and Joseph Petro. Benefits, limits, and risks of gpt-4 as an ai chatbot for medicine. New England Journal of Medicine, 388(13):1233–1239, 2023. 2
work page 2023
-
[19]
The ai revolution in medicine: Gpt-4 and beyond
Peter Lee, Carey Goldberg, and Isaac Kohane. The ai revolution in medicine: Gpt-4 and beyond
-
[20]
Retrieval-augmented generation for knowledge-intensive NLP tasks
Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. NeurIPS, 2020. 3 11
work page 2020
-
[21]
ELEV ATER: A bench- mark and toolkit for evaluating language-augmented visual models
Chunyuan Li, Haotian Liu, Liunian Harold Li, Pengchuan Zhang, Jyoti Aneja, Jianwei Yang, Ping Jin, Houdong Hu, Zicheng Liu, Yong Jae Lee, and Jianfeng Gao. ELEV ATER: A bench- mark and toolkit for evaluating language-augmented visual models. In NeurIPS Track on Datasets and Benchmarks, 2022. 1
work page 2022
-
[22]
Self-supervised vision- language pretraining for medical visual question answering
Pengfei Li, Gang Liu, Lin Tan, Jinying Liao, and Shenjun Zhong. Self-supervised vision- language pretraining for medical visual question answering. arXiv preprint arXiv:2211.13594,
-
[23]
Slake: A semantically- labeled knowledge-enhanced dataset for medical visual question answering
Bo Liu, Li-Ming Zhan, Li Xu, Lin Ma, Yan Yang, and Xiao-Ming Wu. Slake: A semantically- labeled knowledge-enhanced dataset for medical visual question answering. In International Symposium on Biomedical Imaging (ISBI). IEEE, 2021. 8, 10
work page 2021
-
[24]
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023. 1, 2, 4, 6
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[25]
Learning customized visual models with retrieval-augmented knowledge
Haotian Liu, Kilho Son, Jianwei Yang, Ce Liu, Jianfeng Gao, Yong Jae Lee, and Chunyuan Li. Learning customized visual models with retrieval-augmented knowledge. arXiv preprint arXiv:2301.07094, 2023. 3
-
[26]
Q2atransformer: Improving medical vqa via an answer querying decoder
Yunyi Liu, Zhanyu Wang, Dong Xu, and Luping Zhou. Q2atransformer: Improving medical vqa via an answer querying decoder. arXiv preprint arXiv:2304.01611, 2023. 3, 9
-
[27]
Learn to explain: Multimodal reasoning via thought chains for science question answering
Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems,
-
[28]
Biogpt: generative pre-trained transformer for biomedical text generation and mining
Renqian Luo, Liai Sun, Yingce Xia, Tao Qin, Sheng Zhang, Hoifung Poon, and Tie-Yan Liu. Biogpt: generative pre-trained transformer for biomedical text generation and mining. Briefings in Bioinformatics, 2022. 2, 3
work page 2022
-
[29]
Conventional osteosarcoma of the mandible: Report of a rare case
Hassan Mirmohammad Sadeghi, Abbas Karimi, Samira Derakhshan, Pouyan Aminishakib, and Kiarash Parchami. Conventional osteosarcoma of the mandible: Report of a rare case. Clinical Case Reports, 9(9):e04843, 2021. 5
work page 2021
-
[30]
Capabilities of GPT-4 on Medical Challenge Problems
Harsha Nori, Nicholas King, Scott Mayer McKinney, Dean Carignan, and Eric Horvitz. Ca- pabilities of gpt-4 on medical challenge problems. arXiv preprint arXiv:2303.13375, 2023. 2
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[31]
OpenAI. ChatGPT. https://openai.com/blog/chatgpt/, 2022. 2
work page 2022
-
[32]
OpenAI. GPT-4 technical report. https://arxiv.org/abs/2303.08774, 2023. 1, 2
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[33]
Quadratus femoris partial tear secondary to occult ischiofemoral impingement
Kyriakos A Papavasiliou, Dimitrios Stamiris, Stavros Stamiris, Antonia Bintoudi, and Elefthe- rios Tsiridis. Quadratus femoris partial tear secondary to occult ischiofemoral impingement. Journal of Orthopaedic Case Reports, 11(9):7, 2021. 5
work page 2021
-
[34]
Baolin Peng, Chunyuan Li, Pengcheng He, Michel Galley, and Jianfeng Gao. Instruction tuning with GPT-4. arXiv preprint arXiv:2304.03277, 2023. 2
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[35]
Roger Kevin Pringle and Lawrence H Wyatt. The appropriate use of radiography in clinical practice: a report of two cases of biomechanical versus malignant spine pain. Chiropractic & Osteopathy, 14(1):1–8, 2006. 4
work page 2006
-
[36]
Learning Transferable Visual Models From Natural Language Supervision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020, 2021. 9
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[37]
Language models are unsupervised multitask learners
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 2019. 3 12
work page 2019
-
[38]
George Shih, Carol C Wu, Safwan S Halabi, Marc D Kohli, Luciano M Prevedello, Tessa S Cook, Arjun Sharma, Judith K Amorosa, Veronica Arteaga, Maya Galperin-Aizenberg, et al. Augmenting the national institutes of health chest radiograph dataset with expert annotations of possible pneumonia. Radiology: Artificial Intelligence, 2019. 2
work page 2019
-
[39]
Visual med-alpaca: A parameter-efficient biomedical llm with visual capabilities
Chang Shu, Baian Chen, Fangyu Liu, Zihao Fu, Ehsan Shareghi, and Nigel Collier. Visual med-alpaca: A parameter-efficient biomedical llm with visual capabilities. 2023. 2
work page 2023
-
[40]
LLaMA: Open and Efficient Foundation Language Models
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timo- thée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023. 3
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[41]
Open-ended medical visual question answering through prefix tuning of language models
Tom van Sonsbeek, Mohammad Mahdi Derakhshani, Ivona Najdenkoska, Cees GM Snoek, and Marcel Worring. Open-ended medical visual question answering through prefix tuning of language models. arXiv preprint arXiv:2303.05977, 2023. 3, 9
-
[42]
BiomedLM: a domain-specific large language model for biomedical text
A Venigalla, J Frankle, and M Carbin. BiomedLM: a domain-specific large language model for biomedical text. MosaicML. Accessed: Dec, 23, 2022. 3
work page 2022
-
[43]
Vicuna: An open-source chatbot impressing GPT-4 with 90%* chatgpt quality
Vicuna. Vicuna: An open-source chatbot impressing GPT-4 with 90%* chatgpt quality. https: //vicuna.lmsys.org/, 2023. 3, 6
work page 2023
-
[44]
Huatuo: Tuning llama model with chinese medical knowledge, 2023
Haochun Wang, Chi Liu, Nuwa Xi, Zewen Qiang, Sendong Zhao, Bing Qin, and Ting Liu. Huatuo: Tuning llama model with chinese medical knowledge, 2023. 2
work page 2023
-
[45]
Pmc-llama: Further finetuning llama on medical papers
Chaoyi Wu, Xiaoman Zhang, Ya Zhang, Yanfeng Wang, and Weidi Xie. Pmc-llama: Further finetuning llama on medical papers. arXiv preprint arXiv:2304.14454, 2023. 2
-
[46]
Doctorglm: Fine-tuning your chinese doctor is not a herculean task
Honglin Xiong, Sheng Wang, Yitao Zhu, Zihao Zhao, Yuxiao Liu, Qian Wang, and Dinggang Shen. Doctorglm: Fine-tuning your chinese doctor is not a herculean task. arXiv preprint arXiv:2304.01097, 2023. 2
-
[47]
Chatdoctor: A medical chat model fine-tuned on llama model using medical domain knowledge
Li Yunxiang, Li Zihan, Zhang Kai, Dan Ruilong, and Zhang You. Chatdoctor: A medical chat model fine-tuned on llama model using medical domain knowledge. arXiv preprint arXiv:2303.14070, 2023. 2
-
[48]
Delayed spontaneous regression of metastatic gastric cancer: A case report of a rare finding
Mansoor Zafar, Abdul Wahab Paracha, Muteeb Ashraf, Tila Muhammad, Mark Whitehead, Muhammad Toqeer, and Abdul Paracha. Delayed spontaneous regression of metastatic gastric cancer: A case report of a rare finding. Cureus, 13(12), 2021. 5
work page 2021
-
[49]
Sheng Zhang, Yanbo Xu, Naoto Usuyama, Jaspreet Bagga, Robert Tinn, Sam Preston, Rajesh Rao, Mu Wei, Naveen Valluri, Cliff Wong, et al. Large-scale domain-specific pretraining for biomedical vision-language processing. arXiv preprint arXiv:2303.00915, 2023. 2, 3, 6, 9 13 A Data Instructions for brief image description. The list of instructions used to brie...
work page internal anchor Pith review Pith/arXiv arXiv 2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.