arXiv: 2606.21194 · v1 · pith:3UZ4IVBLnew · submitted 2026-06-19 · 💻 cs.CV · cs.AI · cs.CL

MEDLAYXPLAIN: Benchmarking the Expert-Lay Gap in Medical Vision-Language Models

Pith reviewed 2026-06-26 14:23 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.CL
keywords medical vision-language models · lay language generation · expert-lay gap · benchmark dataset · patient-accessible explanations · UMLS ontology · MedLayXPlain · MedLayEval

The pith

Medical vision-language models show a systematic gap between expert accuracy and patient-accessible descriptions of images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates MedLayXPlain, a dataset of 122,789 region-grounded samples, each a medical image paired with expert and lay captions anchored in a three-level UMLS ontology hierarchy, to test whether current models can generate explanations suitable for patients. It applies the HOVER pipeline to produce the lay versions while preserving meaning, and introduces MedLayEval to score expert-lay alignment on five clinically grounded attributes. Benchmarking 33 models finds that medical VLMs handle expert captions well but degrade when switching to lay language, whereas general-purpose VLMs produce more readable text yet omit or distort clinical details. The result matters because the 21st Century Cures Act now mandates immediate patient access to diagnostic imaging results, which makes such explanations relevant to patient education and shared decision-making.

Core claim

Benchmarking 33 VLMs on MedLayXPlain-122K reveals a systematic Expert-Lay Gap: medical VLMs achieve strong expert captioning but suffer significant lay-register degradation, while general-purpose VLMs produce more accessible language yet lack clinical precision, confirming that neither current paradigm adequately serves patient-facing communication.

What carries the argument

The Hierarchical Ontology-Verified Refinement (HOVER) pipeline, which builds lay captions from expert ones via patient-centric vocabulary mapping, LLM-based constrained rewriting, and cross-model visual verification to maintain semantic equivalence.
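The three HOVER steps can be sketched as a small control loop. This is a minimal illustration, not the authors' implementation: `map_vocabulary`, `hover_sketch`, `LayResult`, the `max_rounds` revision budget, and the toy rewriter and verifier are all hypothetical names and assumptions, and the real pipeline uses an LLM for rewriting and a cross-model visual check rather than string matching.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Verdict labels follow the pass/revise/fail wording in the Figure 2 caption.
PASS, REVISE, FAIL = "pass", "revise", "fail"


@dataclass
class LayResult:
    caption: str
    verdict: str
    rounds: int


def map_vocabulary(expert_caption: str, lay_terms: Dict[str, str]) -> str:
    """Step 1 (sketch): swap known expert terms for patient-friendly ones."""
    out = expert_caption
    for expert_term, lay_term in lay_terms.items():
        out = out.replace(expert_term, lay_term)
    return out


def hover_sketch(
    expert_caption: str,
    lay_terms: Dict[str, str],
    rewrite: Callable[[str], str],
    verify: Callable[[str, str], str],
    max_rounds: int = 2,  # hypothetical revision budget, not from the paper
) -> LayResult:
    draft = map_vocabulary(expert_caption, lay_terms)
    for round_idx in range(1, max_rounds + 1):
        draft = rewrite(draft)                   # Step 2: constrained rewriting
        verdict = verify(expert_caption, draft)  # Step 3: verification
        if verdict != REVISE:
            return LayResult(draft, verdict, round_idx)
    # Still marked "revise" after the budget is spent: treat as a failure.
    return LayResult(draft, FAIL, max_rounds)


# Toy stand-ins for the LLM rewriter and the cross-model verifier.
def toy_rewrite(text: str) -> str:
    return text.replace("is seen", "can be seen")


def toy_verify(expert: str, lay: str) -> str:
    return PASS if "collapsed lung" in lay else REVISE


result = hover_sketch(
    "A right pneumothorax is seen.",
    {"pneumothorax": "collapsed lung"},
    toy_rewrite,
    toy_verify,
)
print(result.verdict, "|", result.caption)
```

Passing the rewriter and verifier in as callables keeps the control flow separate from whichever models fill those roles, which is the part the paper's description leaves open here.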

If this is right

  • Medical VLMs require targeted adaptation to maintain clinical precision while shifting to lay-register language.
  • General-purpose VLMs need added clinical constraints or fine-tuning to reach acceptable medical accuracy.
  • Standard NLG metrics are inadequate for this task because they correlate poorly with clinical judgment.
  • The MedLayXPlain benchmark supplies a standardized testbed for developing models that support patient communication.
  • Closing the gap would directly support patient education and shared decision-making under immediate-access regulations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Hybrid training that exposes models to both expert and lay pairs from the dataset could produce outputs usable by both audiences.
  • The gap may extend beyond images to other medical text generation tasks such as report summarization.
  • Deploying models without addressing the gap risks providing either overly technical or imprecise information to patients.
  • The three-level ontology hierarchy could support automatic generation of explanations at more levels of detail than the two registers (expert and lay) the benchmark currently covers.

Load-bearing premise

The HOVER pipeline combined with the distilled MedLayEval evaluator correctly enforces semantic equivalence and clinical alignment without introducing systematic bias from the LLM rewriting steps.

What would settle it

Independent clinical experts rating a sample of lay captions from both medical and general VLMs find no consistent difference in accessibility or accuracy across the five attributes.

Figures

Figures reproduced from arXiv: 2606.21194 by Chae Young Lim, Han Jang, Heeseong Eum, Hyeonjin Goh, Junhyeok Lee, Kyu Sung Choi, Songsoo Kim.

Figure 1
Figure 1: Motivation and positioning of MEDLAYXPLAIN. (a) Clinical motivation for patient-accessible rewriting of medical reports. (b) Existing resources offer text-only expert-lay pairs (left) or expert-only multimodal data (center). MEDLAYXPLAIN-122K combines all four elements (right).
Medical Lay Language Generation. Translating clinical jargon into patient-accessible language has been studied under Medical Lay L…
Figure 2
Figure 2: Overview of the proposed HOVER pipeline. (a) Raw expert caption with jargon (highlighted). (b) Step 1 maps medical entities to patient-friendly terms via three-level UMLS ontology (C ⊂ T ⊂ G). Step 2 generates a constrained lay draft with full ontology context. Step 3 performs cross-model visual verification (pass/revise/fail). (c) Refined lay caption (highlighted).
3 Method. We present MEDLAYXPLAIN, a fram…
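The three-level hierarchy named in the Figure 2 caption (concept C, semantic type T, semantic group G) can be pictured as a simple lookup record. The sketch below is an assumption-laden illustration: `OntologyEntry`, `ENTRIES`, `lookup`, and `ontology_context` are hypothetical names, the two entries use standard UMLS semantic type and group names, and the lay terms are illustrative rather than taken from the dataset.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class OntologyEntry:
    concept: str         # level C: medical concept
    semantic_type: str   # level T: UMLS semantic type
    semantic_group: str  # level G: UMLS semantic group
    lay_term: str        # illustrative patient-friendly wording


# Two hand-written entries; a real table would be built from UMLS itself.
ENTRIES: Dict[str, OntologyEntry] = {
    "pneumothorax": OntologyEntry(
        "Pneumothorax", "Disease or Syndrome", "Disorders", "collapsed lung"
    ),
    "femur": OntologyEntry(
        "Femur", "Body Part, Organ, or Organ Component", "Anatomy", "thigh bone"
    ),
}


def lookup(term: str) -> Optional[OntologyEntry]:
    """Case-insensitive lookup; returns None for terms outside the table."""
    return ENTRIES.get(term.lower())


def ontology_context(term: str) -> str:
    """Render the C, T, G chain as a context string for a rewriting prompt."""
    entry = lookup(term)
    if entry is None:
        return f"{term}: no ontology entry"
    return (
        f"{entry.concept} < {entry.semantic_type} < {entry.semantic_group}; "
        f"lay term: {entry.lay_term}"
    )


print(ontology_context("Pneumothorax"))
```

A string like the one printed here is one plausible form of the "full ontology context" that Step 2 receives, though the source does not specify the actual prompt format.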
Figure 3
Figure 3: MEDLAYEVAL architecture. (a) Training: the student evaluator is distilled from a teacher verifier via per-attribute weighted MSE. (b) Inference: the evaluator predicts five attribute scores from an image, expert caption, and candidate lay caption. LoRA weights are merged at inference.
Architecture. We build MEDLAYEVAL on Qwen2.5-VL-3B-Instruct [4], adapting it from a generative VLM into a regression-base…
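The per-attribute weighted MSE mentioned in the Figure 3 caption can be written out for a single sample. This is a sketch under stated assumptions: the function name `weighted_mse`, the normalisation by the sum of weights, and the example scores and weights are all hypothetical, since the source gives neither the attribute names nor the weighting scheme.

```python
from typing import Sequence


def weighted_mse(
    student: Sequence[float],
    teacher: Sequence[float],
    weights: Sequence[float],
) -> float:
    """Weighted mean of squared student-teacher errors, one term per attribute.

    Dividing by the sum of weights is an assumption made here so that the
    loss stays on the same scale when the weights change.
    """
    if not (len(student) == len(teacher) == len(weights)):
        raise ValueError("student, teacher, and weights must have equal length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    squared = (w * (s - t) ** 2 for s, t, w in zip(student, teacher, weights))
    return sum(squared) / total_weight


# Made-up scores for five unnamed attributes; only the first one disagrees.
student_scores = [0.8, 0.6, 0.9, 0.7, 0.5]
teacher_scores = [1.0, 0.6, 0.9, 0.7, 0.5]
attribute_weights = [2.0, 1.0, 1.0, 1.0, 1.0]  # first attribute counts double

loss = weighted_mse(student_scores, teacher_scores, attribute_weights)
print(round(loss, 6))
```

Raising the weight on one attribute makes disagreement on that attribute cost more during distillation, which is the usual motivation for weighting attributes individually.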
Figure 4
Figure 4: VLM leaderboard on MEDLAYXPLAIN-122K, ranked by MEDLAYEVAL overall score (lay register, n = 5,000). A 46.7-point spread separates the top- and bottom-ranked models, confirming that expert-to-lay medical image description remains a substantial open challenge.
4 Experiments. 4.1 Experimental Setup. Dataset Statistics. MEDLAYXPLAIN-122K is constructed from 12 publicly available source datasets spanning 8 imagin…
Original abstract

Medical Vision-Language Models (Med-VLMs) achieve strong expert-level performance, yet their ability to generate patient-accessible descriptions remains underexplored. With the 21st Century Cures Act now mandating immediate patient access to diagnostic imaging results, evaluating whether Med-VLMs can bridge this Expert-Lay Gap is both urgent and clinically consequential for patient education and shared decision-making. To this end, we introduce MedLayXPlain, the first large-scale multimodal benchmark and evaluation framework for Medical Lay Language Generation (MLLG). MedLayXPlain-122K provides 122,789 region-grounded samples across 8 imaging modalities from 12 publicly available source datasets, each comprising a medical image with paired expert and lay captions anchored in a three-level Unified Medical Language System (UMLS) ontology hierarchy spanning 7 semantic groups, 43 semantic types, and 2,411 medical concepts. Lay captions are constructed via Hierarchical Ontology-Verified Refinement (HOVER), a three-step pipeline combining patient-centric vocabulary mapping, LLM-based constrained rewriting, and cross-model visual verification to enforce semantic equivalence while preventing hallucination. We further introduce MedLayEval, a lightweight 3B evaluator distilled from a 27B verifier that scores expert-lay alignment across five clinically grounded attributes, addressing the poor correlation between standard NLG metrics and clinical judgment. Benchmarking 33 VLMs on MedLayXPlain-122K reveals a systematic Expert-Lay Gap: medical VLMs achieve strong expert captioning but suffer significant lay-register degradation, while general-purpose VLMs produce more accessible language yet lack clinical precision, confirming that neither current paradigm adequately serves patient-facing communication.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces MedLayXPlain-122K, a 122,789-sample multimodal benchmark for Medical Lay Language Generation (MLLG) spanning 8 imaging modalities and 12 public datasets. Expert-lay caption pairs are generated via the Hierarchical Ontology-Verified Refinement (HOVER) pipeline, which maps to a three-level UMLS ontology (7 semantic groups, 43 types, 2,411 concepts), performs LLM-constrained rewriting, and applies cross-model visual verification. A 3B-parameter MedLayEval evaluator is distilled from a 27B model to score five clinically grounded alignment attributes. Benchmarking 33 VLMs reveals a systematic Expert-Lay Gap: medical VLMs perform well on expert captions but degrade on lay-register output, while general-purpose VLMs produce more accessible language at the cost of clinical precision.

Significance. If the constructed pairs and evaluator are shown to be faithful, the work directly addresses a regulatory need (21st Century Cures Act) for patient-accessible imaging descriptions and supplies the first large-scale, ontology-anchored testbed for evaluating whether current Med-VLMs can support shared decision-making. The scale, public-data sourcing, and explicit separation of expert vs. lay registers are clear strengths; the result would be actionable for model developers and clinicians.

major comments (3)
  1. [Abstract / HOVER pipeline] Abstract and Methods (HOVER pipeline description): the claim that HOVER 'enforces semantic equivalence while preventing hallucination' rests on UMLS mapping + LLM rewriting + cross-model verification, yet no human expert adjudication, inter-rater agreement, or external clinical review of the resulting expert-lay pairs is reported. Because the central Expert-Lay Gap finding is defined by performance differences on these pairs, undetected clinical drift in the rewriting step would render the measured degradation an artifact rather than a model property.
  2. [Abstract / MedLayEval] Abstract (MedLayEval): the 3B evaluator is asserted to address 'poor correlation between standard NLG metrics and clinical judgment,' but no quantitative validation (e.g., correlation coefficients with radiologist ratings, comparison against the 27B teacher on held-out cases, or error analysis) is provided. This is load-bearing for all downstream VLM rankings.
  3. [Benchmarking results] Benchmark results section: the reported gap is presented without error bars, statistical significance tests across the 33 models, or ablation on the five MedLayEval attributes. Without these, it is unclear whether the 'systematic' degradation is robust or driven by a subset of modalities or semantic groups.
minor comments (2)
  1. [Abstract] The abstract states '122,789 region-grounded samples' but does not clarify how region grounding is preserved or verified after the LLM rewriting step.
  2. [Introduction / Methods] Notation for the three-level UMLS hierarchy (groups/types/concepts) should be defined once with an explicit example in the main text rather than only in the abstract.

Simulated Author's Rebuttal

3 responses · 2 unresolved

We thank the referee for the constructive and detailed review. We address each major comment below and indicate where the manuscript will be revised.

Point-by-point responses
  1. Referee: [Abstract / HOVER pipeline] Abstract and Methods (HOVER pipeline description): the claim that HOVER 'enforces semantic equivalence while preventing hallucination' rests on UMLS mapping + LLM rewriting + cross-model verification, yet no human expert adjudication, inter-rater agreement, or external clinical review of the resulting expert-lay pairs is reported. Because the central Expert-Lay Gap finding is defined by performance differences on these pairs, undetected clinical drift in the rewriting step would render the measured degradation an artifact rather than a model property.

    Authors: The HOVER pipeline relies on UMLS ontology mapping to 2,411 concepts, LLM-constrained rewriting, and cross-model visual verification to promote semantic equivalence. We acknowledge that the manuscript reports no human expert adjudication, inter-rater agreement, or external clinical review of the pairs. This is a genuine limitation. We will revise by adding an expanded Limitations section that discusses the automated safeguards and their potential shortcomings, along with any available verification statistics from the pipeline. revision: partial

  2. Referee: [Abstract / MedLayEval] Abstract (MedLayEval): the 3B evaluator is asserted to address 'poor correlation between standard NLG metrics and clinical judgment,' but no quantitative validation (e.g., correlation coefficients with radiologist ratings, comparison against the 27B teacher on held-out cases, or error analysis) is provided. This is load-bearing for all downstream VLM rankings.

    Authors: We agree that quantitative validation of MedLayEval (e.g., correlations with radiologist ratings or error analysis versus the 27B teacher) is important and is not reported in the manuscript. The current work describes the distillation process but lacks these external metrics. As the required human ratings were not collected, we cannot supply them. We will revise the text to detail the internal distillation validation that was performed and to state explicitly in the Limitations section that external clinical correlation remains an open requirement for future work. revision: partial

  3. Referee: [Benchmarking results] Benchmark results section: the reported gap is presented without error bars, statistical significance tests across the 33 models, or ablation on the five MedLayEval attributes. Without these, it is unclear whether the 'systematic' degradation is robust or driven by a subset of modalities or semantic groups.

    Authors: We accept that the benchmarking results would be more robust with error bars, statistical significance tests, and attribute ablations. These analyses can be performed on the existing evaluation outputs. We will revise the results section to add error bars to the reported metrics, include statistical significance testing across the 33 models, and provide ablations on the five MedLayEval attributes to assess consistency across modalities and semantic groups. revision: yes
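The error bars promised in the third response could take the form of a percentile bootstrap interval on the per-model expert-minus-lay score difference. The sketch below assumes that form; `bootstrap_ci`, its parameters, and the gap values are hypothetical and made up for illustration, not results from the paper.

```python
import random
from statistics import mean
from typing import List, Tuple


def bootstrap_ci(
    values: List[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(values)
    resampled_means = sorted(
        mean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lower_idx = int((alpha / 2) * n_resamples)
    upper_idx = int((1 - alpha / 2) * n_resamples) - 1
    return resampled_means[lower_idx], resampled_means[upper_idx]


# Made-up expert-minus-lay score gaps for eight hypothetical models.
gaps = [5.0, 7.5, 6.0, 8.0, 4.5, 6.5, 7.0, 5.5]

low, high = bootstrap_ci(gaps)
print(f"mean gap {mean(gaps):.2f}, 95% CI [{low:.2f}, {high:.2f}]")
```

An interval that excludes zero would support calling the degradation systematic; repeating the calculation per modality or per semantic group would address the referee's concern that a subset drives the effect.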

standing simulated objections not resolved
  • Human expert adjudication, inter-rater agreement, or external clinical review of the HOVER expert-lay pairs
  • Quantitative validation of MedLayEval against radiologist ratings or detailed error analysis versus the 27B teacher
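The second unresolved objection asks for agreement between evaluator scores and clinician ratings. One common way to quantify that is Spearman rank correlation, sketched here with the standard library only. The function names are hypothetical, the example numbers are invented, and no such ratings are reported in the source.

```python
from math import sqrt
from statistics import mean
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; undefined (division by zero) for constant input."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented scores for six captions: evaluator output vs. a clinician's rating.
evaluator_scores = [0.91, 0.40, 0.75, 0.62, 0.88, 0.30]
clinician_ratings = [5, 2, 4, 3, 4, 1]

rho = spearman(evaluator_scores, clinician_ratings)
print(f"Spearman rho = {rho:.3f}")
```

Reporting this coefficient per attribute, alongside the same figure for standard NLG metrics, would make the claimed advantage over those metrics directly checkable.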

Circularity Check

0 steps flagged

No circularity: benchmark constructed from external public datasets with independent evaluation

Full rationale

The paper constructs MedLayXPlain-122K from 12 publicly available external source datasets and introduces HOVER and MedLayEval as new pipelines. No derivation reduces a claimed result to a fitted parameter or self-citation by construction. The central benchmarking result (Expert-Lay Gap) is an empirical measurement on held-out models, not a self-referential prediction. Self-citations, if present, are not load-bearing for the core claims. This matches the default non-circular case for benchmark papers.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Central claims rest on the unverified correctness of the HOVER pipeline and MedLayEval correlation with clinical judgment; these cannot be assessed from the abstract.

axioms (1)
  • domain assumption: The UMLS ontology hierarchy provides an accurate and complete mapping from expert medical concepts to patient-centric lay vocabulary across the covered semantic groups and types.
    Invoked as the foundation for the Hierarchical Ontology-Verified Refinement step that constructs lay captions.
invented entities (1)
  • MedLayEval (no independent evidence)
    purpose: Lightweight 3B evaluator that scores expert-lay alignment on five clinically grounded attributes.
    Distilled from a 27B verifier; no independent evidence of its correlation with clinical judgment is supplied in the abstract.

pith-pipeline@v0.9.1-grok · 5866 in / 1354 out tokens · 32730 ms · 2026-06-26T14:23:28.556409+00:00 · methodology


Reference graph

Works this paper leans on

108 extracted references · 15 linked inside Pith

  1. [1]

    Large language models for simplifying radiology reports: a systematic review and meta-analysis of patient, public, and clinician evaluations. The Lancet Digital Health, 2026

    Samer Alabed, Abigail Anderson, Ahmed Maiter, Anthony Hughes, Niamh McAnenly, Mahan Salehi, Michael Sharkey, Krit Dwivedi, Alireza Hokmabadi, Fares Alahdab, et al. Large language models for simplifying radiology reports: a systematic review and meta-analysis of patient, public, and clinician evaluations. The Lancet Digital Health, 2026

  2. [2]

    Claude models overview, 2025

    Anthropic. Claude models overview, 2025. URL https://docs.anthropic.com/en/docs/about-claude/models

  3. [3]

    Introducing Claude Opus 4.7, April 2026

    Anthropic. Introducing Claude Opus 4.7, April 2026. URL https://www.anthropic.com/news/claude-opus-4-7

  4. [4]

    Qwen3-vl technical report. arXiv preprint arXiv:2511.21631, 2025

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report. arXiv preprint arXiv:2511.21631, 2025

  5. [5]

    Qwen2.5-vl technical report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report. URL https://arxiv.org/abs/2502.13923

  7. [7]

    Meteor: An automatic metric for mt evaluation with improved correlation with human judgments

    Satanjeev Banerjee and Alon Lavie. Meteor: An automatic metric for mt evaluation with improved correlation with human judgments. In Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, pages 65–72, 2005

  8. [8]

    Contemporary trends in reviewing test results through the electronic patient portal among patients with cancer. JAMA oncology, 10(1):139–140, 2024

    Sheena Bhalla, Tanushree Prasad, Donglu Xie, and David E Gerber. Contemporary trends in reviewing test results through the electronic patient portal among patients with cancer. JAMA oncology, 10(1):139–140, 2024

  9. [9]

    The unified medical language system (umls): integrating biomedical terminology. Nucleic acids research, 32(suppl_1):D267–D270, 2004

    Olivier Bodenreider. The unified medical language system (umls): integrating biomedical terminology. Nucleic acids research, 32(suppl_1):D267–D270, 2004

  10. [10]

    Snomed-ct: The advanced terminology and coding system for ehealth

    L Bos and K Donnelly. Snomed-ct: The advanced terminology and coding system for ehealth. Stud Health Technol Inform, 121:279–290, 2006

  11. [11]

    Sam-med2d. arXiv preprint arXiv:2308.16184, 2023

    Junlong Cheng, Jin Ye, Zhongying Deng, Jianpin Chen, Tianbin Li, Haoyu Wang, Yanzhou Su, Ziyan Huang, Jilong Chen, Lei Jiang, et al. Sam-med2d. arXiv preprint arXiv:2308.16184, 2023

  12. [12]

    Machine learning in computational histopathology: Challenges and opportunities. Genes, Chromosomes and Cancer, 62(9):540–556, 2023

    Michael Cooper, Zongliang Ji, and Rahul G Krishnan. Machine learning in computational histopathology: Challenges and opportunities. Genes, Chromosomes and Cancer, 62(9):540–556, 2023

  13. [13]

    Unsolicited patient complaints following the 21st century cures act information-blocking rule

    Robert J Dambrino IV, Henry J Domenico, John A Graves, Melinda JB Buntin, William Martinez, S Trent Rosenbloom, and William O Cooper. Unsolicited patient complaints following the 21st century cures act information-blocking rule. In JAMA Health Forum, volume 4, page e233244, 2023

  14. [14]

    Gemma 3 technical report, 2025

    Google DeepMind. Gemma 3 technical report, 2025. URL https://arxiv.org/abs/2503.19786

  15. [15]

    A multimodal whole-slide foundation model for pathology. Nature medicine, pages 1–13, 2025

    Tong Ding, Sophia J Wagner, Andrew H Song, Richard J Chen, Ming Y Lu, Andrew Zhang, Anurag J Vaidya, Guillaume Jaume, Muhammad Shaban, Ahrong Kim, et al. A multimodal whole-slide foundation model for pathology. Nature medicine, pages 1–13, 2025

  16. [16]

    Sedigheh Eslami, Christoph Meinel, and Gerard De Melo. Pubmedclip: How much does clip benefit visual question answering in the medical domain? In Findings of the Association for Computational Linguistics: EACL 2023, pages 1181–1193, 2023

  17. [17]

    A new readability yardstick. Journal of applied psychology, 32(3):221, 1948

    Rudolph Flesch. A new readability yardstick. Journal of applied psychology, 32(3):221, 1948

  18. [18]

    Making science simple: Corpora for the lay summarisation of scientific literature. arXiv preprint arXiv:2210.09932, 2022

    Tomas Goldsack, Zhihao Zhang, Chenghua Lin, and Carolina Scarton. Making science simple: Corpora for the lay summarisation of scientific literature. arXiv preprint arXiv:2210.09932, 2022

  19. [19]

    Making science simple: Corpora for the lay summarisation of scientific literature

    Tomas Goldsack, Zhihao Zhang, Chenghua Lin, and Carolina Scarton. Making science simple: Corpora for the lay summarisation of scientific literature. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 10589–10604, 2022

  20. [20]

    Overview of the biolaysumm 2024 shared task on the lay summarization of biomedical research articles. arXiv preprint arXiv:2408.08566, 2024

    Tomas Goldsack, Carolina Scarton, Matthew Shardlow, and Chenghua Lin. Overview of the biolaysumm 2024 shared task on the lay summarization of biomedical research articles. arXiv preprint arXiv:2408.08566, 2024

  21. [21]

    Gemini 2.5 flash model card, 2025

    Google DeepMind. Gemini 2.5 flash model card, 2025. URL https://ai.google.dev/gemini-api/docs/models#gemini-2.5-flash

  22. [22]

    Gemini 3 flash model card, 2026

    Google DeepMind. Gemini 3 flash model card, 2026. URL https://ai.google.dev/gemini-api/docs/models#gemini-3-flash-preview

  23. [23]

    The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

  24. [24]

    Pathvqa: 30000+ questions for medical visual question answering. arXiv preprint arXiv:2003.10286, 2020

    Xuehai He, Yichen Zhang, Luntian Mou, Eric Xing, and Pengtao Xie. Pathvqa: 30000+ questions for medical visual question answering. arXiv preprint arXiv:2003.10286, 2020

  25. [25]

    Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415, 2016

    Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415, 2016

  26. [26]

    Lora: Low-rank adaptation of large language models. Iclr, 1(2):3, 2022

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. Iclr, 1(2):3, 2022

  27. [27]

    Medlaybench-v: A large-scale benchmark for expert-lay semantic alignment in medical vision language models. arXiv preprint arXiv:2604.05738, 2026

    Han Jang, Junhyeok Lee, Heeseong Eum, and Kyu Sung Choi. Medlaybench-v: A large-scale benchmark for expert-lay semantic alignment in medical vision language models. arXiv preprint arXiv:2604.05738, 2026

  28. [28]

    Hulu-med: A transparent generalist model towards holistic medical vision-language understanding. arXiv preprint arXiv:2510.08668, 2025

    Songtao Jiang, Yuan Wang, Sibo Song, Tianxiang Hu, Chenyi Zhou, Bin Pu, Yan Zhang, Zhibo Yang, Yang Feng, Joey Tianyi Zhou, et al. Hulu-med: A transparent generalist model towards holistic medical vision-language understanding. arXiv preprint arXiv:2510.08668, 2025

  29. [29]

    Mimic-iii, a freely accessible critical care database. Scientific data, 3(1):1–9, 2016

    Alistair EW Johnson, Tom J Pollard, Lu Shen, Li-wei H Lehman, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G Mark. Mimic-iii, a freely accessible critical care database. Scientific data, 3(1):1–9, 2016

  30. [30]

    Mimic-cxr, a de-identified publicly available database of chest radiographs with free-text reports. Scientific data, 6(1):317, 2019

    Alistair EW Johnson, Tom J Pollard, Seth J Berkowitz, Nathaniel R Greenbaum, Matthew P Lungren, Chih-ying Deng, Roger G Mark, and Steven Horng. Mimic-cxr, a de-identified publicly available database of chest radiographs with free-text reports. Scientific data, 6(1):317, 2019

  31. [31]

    Efficient memory management for large language model serving with pagedattention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th symposium on operating systems principles, pages 611–626, 2023

  32. [32]

    A dataset of clinically generated visual questions and answers about radiology images. Scientific data, 5(1):180251, 2018

    Jason J Lau, Soumya Gayen, Asma Ben Abacha, and Dina Demner-Fushman. A dataset of clinically generated visual questions and answers about radiology images. Scientific data, 5(1):180251, 2018

  33. [33]

    Llava-med: Training a large language-and-vision assistant for biomedicine in one day. Advances in Neural Information Processing Systems, 36:28541–28564, 2023

    Chunyuan Li, Cliff Wong, Sheng Zhang, Naoto Usuyama, Haotian Liu, Jianwei Yang, Tristan Naumann, Hoifung Poon, and Jianfeng Gao. Llava-med: Training a large language-and-vision assistant for biomedicine in one day. Advances in Neural Information Processing Systems, 36:28541–28564, 2023

  34. [34]

    Magical: Medical lay language generation via semantic invariance and layperson-tailored adaptation

    Weibin Liao, Tianlong Wang, Yinghao Zhu, Yasha Wang, Junyi Gao, and Liantao Ma. Magical: Medical lay language generation via semantic invariance and layperson-tailored adaptation. arXiv preprint arXiv:2508.08730, 2025

  35. [35]

    Rouge: A package for automatic evaluation of summaries

    Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pages 74–81, 2004

  36. [36]

    Slake: A semantically-labeled knowledge-enhanced dataset for medical visual question answering

    Bo Liu, Li-Ming Zhan, Li Xu, Lin Ma, Yan Yang, and Xiao-Ming Wu. Slake: A semantically-labeled knowledge-enhanced dataset for medical visual question answering. In 2021 IEEE 18th international symposium on biomedical imaging (ISBI), pages 1650–1654. IEEE, 2021

  37. [37]

    Biomedica: An open biomedical image-caption archive, dataset, and vision-language models derived from scientific literature

    Alejandro Lozano, Min Woo Sun, James Burgess, Liangyu Chen, Jeffrey J Nirschl, Jeffrey Gu, Ivan Lopez, Josiah Aklilu, Anita Rau, Austin Wolfgang Katzer, et al. Biomedica: An open biomedical image-caption archive, dataset, and vision-language models derived from scientific literature. In Proceedings of the Computer Vision and Pattern Recognition Conference,...

  38. [38]

    Automatic organ and pan-cancer segmentation in abdomen ct: the flare 2023 challenge. arXiv preprint arXiv:2408.12534, 2024

    Jun Ma, Yao Zhang, Song Gu, Cheng Ge, Ershuai Wang, Qin Zhou, Ziyan Huang, Pengju Lyu, Jian He, and Bo Wang. Automatic organ and pan-cancer segmentation in abdomen ct: the flare 2023 challenge. arXiv preprint arXiv:2408.12534, 2024

  39. [39]

    The multimodal brain tumor image segmentation benchmark (brats). IEEE transactions on medical imaging, 34(10):1993–2024, 2014

    Bjoern H Menze, Andras Jakab, Stefan Bauer, Jayashree Kalpathy-Cramer, Keyvan Farahani, Justin Kirby, Yuliya Burren, Nicole Porz, Johannes Slotboom, Roland Wiest, et al. The multimodal brain tumor image segmentation benchmark (brats). IEEE transactions on medical imaging, 34(10):1993–2024, 2014

  40. [40]

    The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation

    Meta. The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation. https://ai.meta.com/blog/llama-4-multimodal-intelligence/, April 2025. Accessed: 2026-05-06

  41. [41]

    Foundation models for generalist medical artificial intelligence. Nature, 616(7956):259–265, 2023

    Michael Moor, Oishi Banerjee, Zahra Shakeri Hossein Abad, Harlan M Krumholz, Jure Leskovec, Eric J Topol, and Pranav Rajpurkar. Foundation models for generalist medical artificial intelligence. Nature, 616(7956):259–265, 2023

  42. [42]

    Med-flamingo: a multimodal medical few-shot learner

    Michael Moor, Qian Huang, Shirley Wu, Michihiro Yasunaga, Yash Dalmia, Jure Leskovec, Cyril Zakka, Eduardo Pontes Reis, and Pranav Rajpurkar. Med-flamingo: a multimodal medical few-shot learner. In Machine learning for health (ML4H), pages 353–367. PMLR, 2023

  43. [43]

    PMC open access subset

    National Library of Medicine. PMC open access subset. https://pmc.ncbi.nlm.nih.gov/tools/openftlist/, 2003–. Accessed: 2026-05-06

  44. [44]

    Medical subject headings (MeSH)

    National Library of Medicine. Medical subject headings (MeSH). https://www.nlm.nih.gov/mesh/, 2024

  45. [45]

    Scispacy: fast and robust models for biomedical natural language processing

    Mark Neumann, Daniel King, Iz Beltagy, and Waleed Ammar. Scispacy: fast and robust models for biomedical natural language processing. In Proceedings of the 18th BioNLP workshop and shared task, pages 319–327, 2019

  46. [46]

    GPT-5.4 and GPT-5.4-mini model card, February 2026

    OpenAI. GPT-5.4 and GPT-5.4-mini model card, February 2026. URL https://platform.openai.com/docs/models

  47. [47]

    GPT-5.5 system card, April 2026

    OpenAI. GPT-5.5 system card, April 2026. URL https://openai.com/index/gpt-5-5-system-card/

  48. [48]

    Green: Generative radiology report evaluation and error notation

    Sophie Ostmeier, Justin Xu, Zhihong Chen, Maya Varma, Louis Blankemeier, Christian Bluethgen, Arne Edward Michalson Md, Michael Moseley, Curtis Langlotz, Akshay S Chaudhari, et al. Green: Generative radiology report evaluation and error notation. In Findings of the association for computational linguistics: EMNLP 2024, pages 374–390, 2024

  49. [49]

    Llm evaluators recognize and favor their own generations. Advances in Neural Information Processing Systems, 37:68772–68802, 2024

    Arjun Panickssery, Samuel R Bowman, and Shi Feng. Llm evaluators recognize and favor their own generations. Advances in Neural Information Processing Systems, 37:68772–68802, 2024

  50. [50]

    Bleu: a method for automatic evaluation of machine translation

    Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318, 2002

  51. [51]

    Patient access of their radiology reports before and after implementation of 21st century cures act information-blocking provisions at a large multicampus health system

    Jordan R Pollock, Skye A Buckner Petty, John J Schmitz, Jacob Varner, Allie M Metcalfe, and Nelly Tan. Patient access of their radiology reports before and after implementation of 21st century cures act information-blocking provisions at a large multicampus health system. American Journal of Roentgenology, 222(6):e2330343, 2024

  52. [52]

    Philipp Prucker, Keno K Bressem, Jan Peeken, Mateo Jukic, Alexander W Marka, Maximilian Strenzke, Su Hwan Kim, Christian J Mertens, Dominik Weller, Tristan Lemke, et al. A prospective controlled trial of large language model–based simplification of oncologic ct reports for patients with cancer. Radiology, 317(2):e251844, 2025

  53. [53]

    Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026. URL https://qwen.ai/blog?id=qwen3.5

  54. [54]

    Qwen Team. Qwen3.6-35B-A3B: Agentic coding power, now open to all, April 2026. URL https://qwen.ai/blog?id=qwen3.6-35b-a3b

  55. [55]

    Johannes Rückert, Louise Bloch, Raphael Brüngel, Ahmad Idrissi-Yaghir, Henning Schäfer, Cynthia S Schmidt, Sven Koitka, Obioma Pelka, Asma Ben Abacha, Alba G. Seco de Herrera, et al. Rocov2: Radiology objects in context version 2, an updated multimodal image dataset. Scientific Data, 11(1):688, 2024

  56. [56]

    Andrew Sellergren, Sahar Kazemzadeh, Tiam Jaroensri, Atilla Kiraly, Madeleine Traverse, Timo Kohlberger, Shawn Xu, Fayaz Jamil, Cían Hughes, Charles Lau, et al. Medgemma technical report. arXiv preprint arXiv:2507.05201, 2025

  57. [57]

    Matthew Shardlow and Raheel Nawaz. Neural text simplification of clinical letters with a domain specific phrase table. In Proceedings of the 57th annual meeting of the association for computational linguistics, pages 380–389, 2019

  58. [58]

    Kaiwen Shi, Yifei Li, Binh Ho, Jovian Wang, and Kobe Guo. Universal lesion segmentation challenge 2023: a comparative research of different algorithms. arXiv preprint arXiv:2502.10608, 2025

  59. [59]

    Nicholas Sioutos, Sherri de Coronado, Margaret W Haber, Frank W Hartel, Wen-Ling Shaiu, and Lawrence W Wright. Nci thesaurus: a semantic model integrating cancer-related clinical and molecular information. Journal of biomedical informatics, 40(1):30–43, 2007

  60. [60]

    Bryan D Steitz, Robert W Turer, Chen-Tan Lin, Scott MacDonald, Liz Salmi, Adam Wright, Christoph U Lehmann, Karen Langford, Samuel A McDonald, Thomas J Reese, et al. Perspectives of patients about immediate access to test results through an online patient portal. JAMA Network Open, 6(3):e233572, 2023

  61. [61]

    Pat Verga, Sebastian Hofstatter, Sophia Althammer, Yixuan Su, Aleksandra Piktus, Arkady Arkhangorodsky, Minjie Xu, Naomi White, and Patrick Lewis. Replacing judges with juries: Evaluating llm generations with a panel of diverse models. arXiv preprint arXiv:2404.18796, 2024

  62. [62]

    Nina S Vincoff, Matthew A Barish, and Gregory Grimaldi. The patient-friendly radiology report: history, evolution, challenges and opportunities. Clinical Imaging, 89:128–135, 2022

  63. [63]

    Lucy Lu Wang, Julia Otmakhova, Jay DeYoung, Thinh Hung Truong, Bailey Kuehl, Erin Bransom, and Byron C Wallace. Automated metrics for medical multi-document summarization disagree with human evaluations. InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9871–9889, 2023

  64. [64]

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024

  65. [65]

    Xiaosong Wang, Yifan Peng, Le Lu, Zhiyong Lu, Mohammadhadi Bagheri, and Ronald M Summers. Chestx-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2097–2106, 2017

  66. [66]

    Koki Wataoka, Tsubasa Takahashi, and Ryokan Ri. Self-preference bias in llm-as-a-judge. arXiv preprint arXiv:2410.21819, 2024

  67. [67]

    Chenghao Xiao, Kun Zhao, Xiao Wang, Siwei Wu, Sixing Yan, Tomas Goldsack, Sophia Ananiadou, Noura Al Moubayed, Liang Zhan, William K Cheung, et al. Overview of the biolaysumm 2025 shared task on lay summarization of biomedical research articles and radiology reports. In Proceedings of the 24th Workshop on Biomedical Language Processing, pages 365–377, 2025

  68. [68]

    Yunfei Xie, Ce Zhou, Lang Gao, Juncheng Wu, Xianhang Li, Hong-Yu Zhou, Sheng Liu, Lei Xing, James Zou, Cihang Xie, et al. Medtrinity-25m: A large-scale multimodal dataset with multigranular annotations for medicine. arXiv preprint arXiv:2408.02900, 2024

  69. [69]

    Justin Xu, Xi Zhang, Javid Abderezaei, Julie Bauml, Roger Boodoo, Fatemeh Haghighi, Ali Ganjizadeh, Eric Brattain, Dave Van Veen, Zaiqiao Meng, et al. Radeval: A framework for radiology text evaluation. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 546–557, 2025

  70. [70]

    Wei Xu, Courtney Napoles, Ellie Pavlick, Quanze Chen, and Chris Callison-Burch. Optimizing statistical machine translation for text simplification. Transactions of the Association for Computational Linguistics, 4:401–415, 2016

  71. [71]

    Weiwen Xu, Hou Pong Chan, Long Li, Mahani Aljunied, Ruifeng Yuan, Jianyu Wang, Chenghao Xiao, Guizhen Chen, Chaoqun Liu, Zhaodonghui Li, et al. Lingshu: A generalist foundation model for unified multimodal medical understanding and reasoning. arXiv preprint arXiv:2506.07044, 2025

  72. [72]

    Ke Yan, Xiaosong Wang, Le Lu, and Ronald M Summers. Deeplesion: automated mining of large-scale lesion annotations and universal lesion detection with deep learning. Journal of medical imaging, 5(3):036501–036501, 2018

  73. [73]

    Zonghai Yao, Nandyala Siddharth Kantu, Guanghao Wei, Hieu Tran, Zhangqi Duan, Sunjae Kwon, Zhichao Yang, and Hong Yu. Readme: Bridging medical jargon and lay understanding for patient education through data-centric nlp. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 12609–12629, 2024

  74. [74]

    Zonghai Yao, Benlu Wang, Yifan Zhang, Junda Wang, Iris Xia, Zhipeng Tang, Shuo Han, Feiyun Ouyang, Zhichao Yang, Arman Cohan, et al. Medical thinking with multiple images. arXiv preprint arXiv:2604.16506, 2026

  75. [75]

    Qing T Zeng and Tony Tse. Exploring and developing consumer health vocabularies. Journal of the American Medical Informatics Association, 13(1):24–29, 2006

  76. [76]

    Sheng Zhang, Yanbo Xu, Naoto Usuyama, Hanwen Xu, Jaspreet Bagga, Robert Tinn, Sam Preston, Rajesh Rao, Mu Wei, Naveen Valluri, et al. A multimodal biomedical foundation model trained from fifteen million image–text pairs. NEJM AI, 2(1):AIoa2400640, 2025

  77. [77]

    Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. Bertscore: Evaluating text generation with bert. arXiv preprint arXiv:1904.09675, 2019

  78. [78]

    Xiaoman Zhang, Chaoyi Wu, Ziheng Zhao, Weixiong Lin, Ya Zhang, Yanfeng Wang, and Weidi Xie. Pmc-vqa: Visual instruction tuning for medical visual question answering. arXiv preprint arXiv:2305.10415, 2023

  79. [79]

    Kun Zhao, Chenghao Xiao, Sixing Yan, Haoteng Tang, William K Cheung, Noura Al Moubayed, Liang Zhan, and Chenghua Lin. X-ray made simple: Lay radiology report generation and robust evaluation. arXiv preprint arXiv:2406.17911, 2024

  80. [80]

    Weike Zhao, Chaoyi Wu, Xiaoman Zhang, Ya Zhang, Yanfeng Wang, and Weidi Xie. Ratescore: A metric for radiology report generation. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 15004–15019, 2024
