MEDLAYXPLAIN: Benchmarking the Expert-Lay Gap in Medical Vision-Language Models

Chae Young Lim; Han Jang; Heeseong Eum; Hyeonjin Goh; Junhyeok Lee; Kyu Sung Choi; Songsoo Kim

arXiv: 2606.21194 · v1 · pith:3UZ4IVBLnew · submitted 2026-06-19 · 💻 cs.CV · cs.AI · cs.CL

Han Jang, Junhyeok Lee, Songsoo Kim, Chae Young Lim, Hyeonjin Goh, Heeseong Eum, Kyu Sung Choi

Pith reviewed 2026-06-26 14:23 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.CL

keywords medical vision-language models · lay language generation · expert-lay gap · benchmark dataset · patient-accessible explanations · UMLS ontology · MedLayXPlain · MedLayEval

The pith

Medical vision-language models show a systematic gap between expert accuracy and patient-accessible descriptions of images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates MedLayXPlain, a dataset of 122,789 region-grounded medical images paired with expert and lay captions anchored in a three-level UMLS ontology hierarchy, to test whether current models can generate explanations suitable for patients. It applies the HOVER pipeline to produce the lay versions while preserving meaning and introduces MedLayEval to score expert-lay alignment on five clinically grounded attributes. Benchmarking 33 models finds that medical VLMs handle expert captions well but degrade when switching to lay language, whereas general-purpose VLMs produce more readable text yet lack clinical precision. The result matters because the 21st Century Cures Act mandates immediate patient access to diagnostic imaging results, which patients rely on for education and shared decision-making.

Core claim

Benchmarking 33 VLMs on MedLayXPlain-122K reveals a systematic Expert-Lay Gap: medical VLMs achieve strong expert captioning but suffer significant lay-register degradation, while general-purpose VLMs produce more accessible language yet lack clinical precision, confirming that neither current paradigm adequately serves patient-facing communication.

What carries the argument

The Hierarchical Ontology-Verified Refinement (HOVER) pipeline, which builds lay captions from expert ones via patient-centric vocabulary mapping, LLM-based constrained rewriting, and cross-model visual verification to maintain semantic equivalence.
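The three HOVER steps can be pictured as a small control flow. The sketch below is illustrative only: `LAY_VOCAB`, `rewrite_constrained`, `verify_against_image`, and the `max_rounds` limit are hypothetical stand-ins, since the source does not specify the models, prompts, or retry policy. The rewriter and verifier here are toy string functions, not LLM or VLM calls.

```python
from typing import Callable, Dict, Optional

# Hypothetical patient-friendly vocabulary; the paper derives its mapping from UMLS.
LAY_VOCAB: Dict[str, str] = {
    "pulmonary nodule": "small spot in the lung",
    "hepatic": "liver",
}


def map_vocabulary(expert: str, vocab: Dict[str, str]) -> Dict[str, str]:
    """Step 1: collect the expert terms found in the caption with their lay terms."""
    lowered = expert.lower()
    return {term: lay for term, lay in vocab.items() if term in lowered}


def rewrite_constrained(text: str, mapping: Dict[str, str]) -> str:
    """Step 2 (toy stand-in for an LLM): only substitute the mapped terms."""
    draft = text.lower()
    for term, lay in mapping.items():
        draft = draft.replace(term, lay)
    return draft


def verify_against_image(draft: str, mapping: Dict[str, str]) -> str:
    """Step 3 (toy stand-in for a visual verifier): return pass, revise, or fail."""
    if any(term in draft for term in mapping):
        return "revise"  # expert jargon survived the rewrite
    if not all(lay in draft for lay in mapping.values()):
        return "fail"  # a mapped concept went missing
    return "pass"


def hover(
    expert: str,
    vocab: Dict[str, str] = LAY_VOCAB,
    rewrite: Callable[[str, Dict[str, str]], str] = rewrite_constrained,
    verify: Callable[[str, Dict[str, str]], str] = verify_against_image,
    max_rounds: int = 2,
) -> Optional[str]:
    """Run map -> rewrite -> verify; return the lay caption, or None if rejected."""
    mapping = map_vocabulary(expert, vocab)
    draft = rewrite(expert, mapping)
    for _ in range(max_rounds):
        verdict = verify(draft, mapping)
        if verdict == "pass":
            return draft
        if verdict == "fail":
            return None
        draft = rewrite(draft, mapping)  # any other verdict is treated as "revise"
    return None
```

Passing the rewriter and verifier as arguments keeps the control flow testable without committing to any particular model; a sample that never reaches `pass` within `max_rounds` is dropped rather than kept with an unverified caption.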

If this is right

Medical VLMs require targeted adaptation to maintain clinical precision while shifting to lay-register language.
General-purpose VLMs need added clinical constraints or fine-tuning to reach acceptable medical accuracy.
Standard NLG metrics are inadequate for this task because they correlate poorly with clinical judgment.
The MedLayXPlain benchmark supplies a standardized testbed for developing models that support patient communication.
Closing the gap would directly support patient education and shared decision-making under immediate-access regulations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Hybrid training that exposes models to both expert and lay pairs from the dataset could produce outputs usable by both audiences.
The gap may extend beyond images to other medical text generation tasks such as report summarization.
Deploying models without addressing the gap risks providing either overly technical or imprecise information to patients.
The ontology hierarchy could support automatic generation of explanations at multiple levels of detail beyond the current three.

Load-bearing premise

The HOVER pipeline combined with the distilled MedLayEval evaluator correctly enforces semantic equivalence and clinical alignment without introducing systematic bias from the LLM rewriting steps.

What would settle it

Independent clinical experts rating a sample of lay captions from both medical and general VLMs find no consistent difference in accessibility or accuracy across the five attributes.

Figures

Figures reproduced from arXiv: 2606.21194 by Chae Young Lim, Han Jang, Heeseong Eum, Hyeonjin Goh, Junhyeok Lee, Kyu Sung Choi, Songsoo Kim.

**Figure 1.** Motivation and positioning of MEDLAYXPLAIN. (a) Clinical motivation for patient-accessible rewriting of medical reports. (b) Existing resources offer text-only expert-lay pairs (left) or expert-only multimodal data (center). MEDLAYXPLAIN-122K combines all four elements (right).

**Figure 2.** Overview of the proposed HOVER pipeline. (a) Raw expert caption with jargon (highlighted). (b) Step 1 maps medical entities to patient-friendly terms via a three-level UMLS ontology (C ⊂ T ⊂ G). Step 2 generates a constrained lay draft with full ontology context. Step 3 performs cross-model visual verification (pass/revise/fail). (c) Refined lay caption (highlighted).

**Figure 3.** MEDLAYEVAL architecture. (a) Training: the student evaluator is distilled from a teacher verifier via per-attribute weighted MSE. (b) Inference: the evaluator predicts five attribute scores from an image, expert caption, and candidate lay caption. LoRA weights are merged at inference. Architecture: MEDLAYEVAL is built on Qwen2.5-VL-3B-Instruct [4], adapting it from a generative VLM into a regression-base…

**Figure 4.** VLM leaderboard on MEDLAYXPLAIN-122K, ranked by MEDLAYEVAL overall score (lay register, n = 5,000). A 46.7-point spread separates the top- and bottom-ranked models, confirming that expert-to-lay medical image description remains a substantial open challenge.
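The Figure 3 caption names a per-attribute weighted MSE as the distillation objective. A minimal sketch of such a loss follows, assuming one score per attribute per sample and weights that are applied per attribute and normalized by their sum. The function name, the normalization, and any weight values are assumptions; the paper's exact formulation may differ.

```python
from typing import Sequence


def per_attribute_weighted_mse(
    student: Sequence[Sequence[float]],
    teacher: Sequence[Sequence[float]],
    weights: Sequence[float],
) -> float:
    """Mean over samples of the weight-normalized squared error per attribute.

    student, teacher: one row per sample, one column per attribute.
    weights: one non-negative weight per attribute (hypothetical values).
    """
    if not student or len(student) != len(teacher):
        raise ValueError("student and teacher need the same non-zero number of rows")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")

    loss = 0.0
    for s_row, t_row in zip(student, teacher):
        if len(s_row) != len(weights) or len(t_row) != len(weights):
            raise ValueError("each row needs one score per attribute")
        row_error = sum(w * (s - t) ** 2 for w, s, t in zip(weights, s_row, t_row))
        loss += row_error / total_weight
    return loss / len(student)
```

With two attributes weighted 1 and 3, an error of 1 on the first attribute contributes 0.25 to that sample's loss, while an error of 2 on the second contributes 3.0, so the heavier attribute dominates the gradient signal.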

Original abstract

Medical Vision-Language Models (Med-VLMs) achieve strong expert-level performance, yet their ability to generate patient-accessible descriptions remains underexplored. With the 21st Century Cures Act now mandating immediate patient access to diagnostic imaging results, evaluating whether Med-VLMs can bridge this Expert-Lay Gap is both urgent and clinically consequential for patient education and shared decision-making. To this end, we introduce MedLayXPlain, the first large-scale multimodal benchmark and evaluation framework for Medical Lay Language Generation (MLLG). MedLayXPlain-122K provides 122,789 region-grounded samples across 8 imaging modalities from 12 publicly available source datasets, each comprising a medical image with paired expert and lay captions anchored in a three-level Unified Medical Language System (UMLS) ontology hierarchy spanning 7 semantic groups, 43 semantic types, and 2,411 medical concepts. Lay captions are constructed via Hierarchical Ontology-Verified Refinement (HOVER), a three-step pipeline combining patient-centric vocabulary mapping, LLM-based constrained rewriting, and cross-model visual verification to enforce semantic equivalence while preventing hallucination. We further introduce MedLayEval, a lightweight 3B evaluator distilled from a 27B verifier that scores expert-lay alignment across five clinically grounded attributes, addressing the poor correlation between standard NLG metrics and clinical judgment. Benchmarking 33 VLMs on MedLayXPlain-122K reveals a systematic Expert-Lay Gap: medical VLMs achieve strong expert captioning but suffer significant lay-register degradation, while general-purpose VLMs produce more accessible language yet lack clinical precision, confirming that neither current paradigm adequately serves patient-facing communication.
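One way to picture a benchmark sample and its ontology anchoring is as a small record type. The field names, the bounding-box encoding, and the example captions below are assumptions for illustration. The source states only that each sample pairs a region-grounded image with expert and lay captions tied to concepts, semantic types, and semantic groups.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class OntologyPath:
    """One anchor in the three-level hierarchy (concept within type within group)."""
    concept: str         # finest level (2,411 concepts reported)
    semantic_type: str   # middle level (43 types reported)
    semantic_group: str  # coarsest level (7 groups reported)


@dataclass(frozen=True)
class Sample:
    """Hypothetical layout of one region-grounded expert-lay pair."""
    image_path: str
    region: Tuple[int, int, int, int]  # assumed (x, y, width, height) in pixels
    expert_caption: str
    lay_caption: str
    anchors: Tuple[OntologyPath, ...]


def count_groups(samples: Iterable[Sample]) -> Counter:
    """Count how many anchors fall into each semantic group."""
    return Counter(a.semantic_group for s in samples for a in s.anchors)


# Illustrative record; the image path and captions are made up.
example = Sample(
    image_path="img_001.png",
    region=(10, 20, 64, 64),
    expert_caption="Right apical pneumothorax.",
    lay_caption="Air has leaked into the space around the top of the right lung.",
    anchors=(OntologyPath("Pneumothorax", "Disease or Syndrome", "Disorders"),),
)
```

Grouping counts by `semantic_group` is the kind of breakdown a per-group analysis of the Expert-Lay Gap would start from.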

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MedLayXPlain gives a practical benchmark for the expert-lay gap in medical VLMs, but the HOVER pipeline's LLM rewriting step lacks external validation and could create the measured gap as an artifact.

The paper's core offering is MedLayXPlain-122K, a large set of region-grounded image pairs with expert and lay captions drawn from 12 public datasets across eight modalities. They anchor everything in a UMLS hierarchy and use the HOVER pipeline (ontology mapping, constrained LLM rewrite, cross-model visual check) plus a distilled 3B MedLayEval scorer to measure how well 33 VLMs handle patient-accessible language. The reported pattern is straightforward: medical VLMs do well on expert captions but degrade on lay versions, while general VLMs are more readable yet lose clinical accuracy.

That framing around the Cures Act requirement for patient access is timely, and pulling together this scale of multimodal data with an ontology backbone is a concrete step forward. The idea of a lightweight clinical evaluator also addresses a real weakness in standard NLG metrics.

The soft spot sits in the lay-caption construction. The abstract gives no evidence of human or expert review to confirm that the LLM rewrites preserve the semantic content without oversimplification or drift. If the rewriting step systematically alters concepts in ways the cross-model check misses, then the expert-lay gap and the precision-accessibility trade-off become partly benchmark artifacts rather than model properties. No error bars, statistical tests, or correlation with actual clinician judgment are mentioned either.

This is for groups working on medical VLMs, patient education tools, or regulatory-compliant explainability. A reader who needs a ready dataset and evaluator for lay-language tasks will find usable material here. The work is coherent enough on its own terms to deserve peer review, mainly so the data pipeline can be examined in detail.

Referee Report

3 major / 2 minor

Summary. The paper introduces MedLayXPlain-122K, a 122,789-sample multimodal benchmark for Medical Lay Language Generation (MLLG) spanning 8 imaging modalities and 12 public datasets. Expert-lay caption pairs are generated via the Hierarchical Ontology-Verified Refinement (HOVER) pipeline, which maps to a three-level UMLS ontology (7 semantic groups, 43 types, 2,411 concepts), performs LLM-constrained rewriting, and applies cross-model visual verification. A 3B-parameter MedLayEval evaluator is distilled from a 27B model to score five clinically grounded alignment attributes. Benchmarking 33 VLMs reveals a systematic Expert-Lay Gap: medical VLMs perform well on expert captions but degrade on lay-register output, while general-purpose VLMs produce more accessible language at the cost of clinical precision.

Significance. If the constructed pairs and evaluator are shown to be faithful, the work directly addresses a regulatory need (21st Century Cures Act) for patient-accessible imaging descriptions and supplies the first large-scale, ontology-anchored testbed for evaluating whether current Med-VLMs can support shared decision-making. The scale, public-data sourcing, and explicit separation of expert vs. lay registers are clear strengths; the result would be actionable for model developers and clinicians.

major comments (3)

[Abstract / HOVER pipeline] Abstract and Methods (HOVER pipeline description): the claim that HOVER 'enforces semantic equivalence while preventing hallucination' rests on UMLS mapping + LLM rewriting + cross-model verification, yet no human expert adjudication, inter-rater agreement, or external clinical review of the resulting expert-lay pairs is reported. Because the central Expert-Lay Gap finding is defined by performance differences on these pairs, undetected clinical drift in the rewriting step would render the measured degradation an artifact rather than a model property.
[Abstract / MedLayEval] Abstract (MedLayEval): the 3B evaluator is asserted to address 'poor correlation between standard NLG metrics and clinical judgment,' but no quantitative validation (e.g., correlation coefficients with radiologist ratings, comparison against the 27B teacher on held-out cases, or error analysis) is provided. This is load-bearing for all downstream VLM rankings.
[Benchmarking results] Benchmark results section: the reported gap is presented without error bars, statistical significance tests across the 33 models, or ablation on the five MedLayEval attributes. Without these, it is unclear whether the 'systematic' degradation is robust or driven by a subset of modalities or semantic groups.
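The error bars requested in the third major comment could be obtained with a paired bootstrap over per-sample scores. The sketch below assumes each model has an expert-register and a lay-register score for the same samples, and that the gap is their mean difference. The function name, the resample count, and the percentile interval are illustrative choices, not the paper's procedure.

```python
import random
from statistics import mean
from typing import Sequence, Tuple


def paired_bootstrap_gap(
    expert_scores: Sequence[float],
    lay_scores: Sequence[float],
    n_boot: int = 1000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (mean gap, lower bound, upper bound) for expert minus lay scores."""
    if not expert_scores or len(expert_scores) != len(lay_scores):
        raise ValueError("score lists must be non-empty and the same length")

    # Pair the scores per sample so resampling keeps each pair together.
    diffs = [e - l for e, l in zip(expert_scores, lay_scores)]
    rng = random.Random(seed)
    n = len(diffs)

    # Resample the paired differences with replacement and sort the means.
    boot_means = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_boot))

    # Percentile interval, with indices clamped to the valid range.
    lo_idx = max(0, int((alpha / 2) * n_boot))
    hi_idx = min(n_boot - 1, int((1 - alpha / 2) * n_boot))
    return mean(diffs), boot_means[lo_idx], boot_means[hi_idx]
```

An interval that excludes zero for most of the 33 models, and within each modality, would support calling the degradation systematic; an interval that straddles zero for many models would not.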

minor comments (2)

[Abstract] The abstract states '122,789 region-grounded samples' but does not clarify how region grounding is preserved or verified after the LLM rewriting step.
[Introduction / Methods] Notation for the three-level UMLS hierarchy (groups/types/concepts) should be defined once with an explicit example in the main text rather than only in the abstract.

Simulated Author's Rebuttal

3 responses · 2 unresolved

We thank the referee for the constructive and detailed review. We address each major comment below with honest responses and indicate where revisions will be made to the manuscript.

Referee: [Abstract / HOVER pipeline] Abstract and Methods (HOVER pipeline description): the claim that HOVER 'enforces semantic equivalence while preventing hallucination' rests on UMLS mapping + LLM rewriting + cross-model verification, yet no human expert adjudication, inter-rater agreement, or external clinical review of the resulting expert-lay pairs is reported. Because the central Expert-Lay Gap finding is defined by performance differences on these pairs, undetected clinical drift in the rewriting step would render the measured degradation an artifact rather than a model property.

Authors: The HOVER pipeline relies on UMLS ontology mapping to 2,411 concepts, LLM-constrained rewriting, and cross-model visual verification to promote semantic equivalence. We acknowledge that the manuscript reports no human expert adjudication, inter-rater agreement, or external clinical review of the pairs. This is a genuine limitation. We will revise by adding an expanded Limitations section that discusses the automated safeguards and their potential shortcomings, along with any available verification statistics from the pipeline. revision: partial

Referee: [Abstract / MedLayEval] Abstract (MedLayEval): the 3B evaluator is asserted to address 'poor correlation between standard NLG metrics and clinical judgment,' but no quantitative validation (e.g., correlation coefficients with radiologist ratings, comparison against the 27B teacher on held-out cases, or error analysis) is provided. This is load-bearing for all downstream VLM rankings.

Authors: We agree that quantitative validation of MedLayEval (e.g., correlations with radiologist ratings or error analysis versus the 27B teacher) is important and is not reported in the manuscript. The current work describes the distillation process but lacks these external metrics. As the required human ratings were not collected, we cannot supply them. We will revise the text to detail the internal distillation validation that was performed and to state explicitly in the Limitations section that external clinical correlation remains an open requirement for future work. revision: partial

Referee: [Benchmarking results] Benchmark results section: the reported gap is presented without error bars, statistical significance tests across the 33 models, or ablation on the five MedLayEval attributes. Without these, it is unclear whether the 'systematic' degradation is robust or driven by a subset of modalities or semantic groups.

Authors: We accept that the benchmarking results would be more robust with error bars, statistical significance tests, and attribute ablations. These analyses can be performed on the existing evaluation outputs. We will revise the results section to add error bars to the reported metrics, include statistical significance testing across the 33 models, and provide ablations on the five MedLayEval attributes to assess consistency across modalities and semantic groups. revision: yes

standing simulated objections not resolved

Human expert adjudication, inter-rater agreement, or external clinical review of the HOVER expert-lay pairs
Quantitative validation of MedLayEval against radiologist ratings or detailed error analysis versus the 27B teacher
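The second unresolved objection comes down to a rank correlation between evaluator scores and human ratings. A minimal sketch of Spearman's coefficient follows, computed as the Pearson correlation of average ranks. The function names are illustrative, and no actual radiologist ratings exist in the source, so any inputs would have to be collected separately.

```python
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Assign 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tie_rank = (i + j) / 2 + 1  # average position of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = tie_rank
        i = j + 1
    return ranks


def spearman(evaluator_scores: Sequence[float], human_ratings: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation applied to average ranks."""
    if len(evaluator_scores) != len(human_ratings) or len(evaluator_scores) < 2:
        raise ValueError("need two equal-length lists with at least two items")
    rx = average_ranks(evaluator_scores)
    ry = average_ranks(human_ratings)
    mx = sum(rx) / len(rx)
    my = sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / sqrt(var_x * var_y)
```

Reporting this coefficient per attribute, alongside the same figure for standard NLG metrics, would give the comparison that the claim about poor metric correlation currently lacks.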

Circularity Check

0 steps flagged

No circularity: benchmark constructed from external public datasets with independent evaluation

The paper constructs MedLayXPlain-122K from 12 publicly available external source datasets and introduces HOVER and MedLayEval as new pipelines. No derivation reduces a claimed result to a fitted parameter or self-citation by construction. The central benchmarking result (Expert-Lay Gap) is an empirical measurement on held-out models, not a self-referential prediction. Self-citations, if present, are not load-bearing for the core claims. This matches the default non-circular case for benchmark papers.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Central claims rest on the unverified correctness of the HOVER pipeline and MedLayEval correlation with clinical judgment; these cannot be assessed from the abstract.

axioms (1)

domain assumption: The UMLS ontology hierarchy provides an accurate and complete mapping from expert medical concepts to patient-centric lay vocabulary across the covered semantic groups and types.
Invoked as the foundation for the Hierarchical Ontology-Verified Refinement step that constructs lay captions.

invented entities (1)

MedLayEval (no independent evidence)
purpose: Lightweight 3B evaluator that scores expert-lay alignment on five clinically grounded attributes.
Distilled from a 27B verifier; no independent evidence of its correlation with clinical judgment is supplied in the abstract.

Reference graph

Works this paper leans on

108 extracted references · 15 linked inside Pith

[1]

Samer Alabed, Abigail Anderson, Ahmed Maiter, Anthony Hughes, Niamh McAnenly, Mahan Salehi, Michael Sharkey, Krit Dwivedi, Alireza Hokmabadi, Fares Alahdab, et al. Large language models for simplifying radiology reports: a systematic review and meta-analysis of patient, public, and clinician evaluations. The Lancet Digital Health, 2026
[2]

Anthropic. Claude models overview, 2025. URL https://docs.anthropic.com/en/docs/about-claude/models
[3]

Anthropic. Introducing Claude Opus 4.7, April 2026. URL https://www.anthropic.com/news/claude-opus-4-7
[4]

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report. arXiv preprint arXiv:2511.21631, 2025
[5]

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report, 2025. URL https://arxiv.org/abs/2502.13923
[7]

Satanjeev Banerjee and Alon Lavie. Meteor: An automatic metric for mt evaluation with improved correlation with human judgments. In Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, pages 65–72, 2005
[8]

Sheena Bhalla, Tanushree Prasad, Donglu Xie, and David E Gerber. Contemporary trends in reviewing test results through the electronic patient portal among patients with cancer. JAMA oncology, 10(1):139–140, 2024
[9]

Olivier Bodenreider. The unified medical language system (umls): integrating biomedical terminology. Nucleic acids research, 32(suppl_1):D267–D270, 2004
[10]

L Bos and K Donnelly. Snomed-ct: The advanced terminology and coding system for ehealth. Stud Health Technol Inform, 121:279–290, 2006
[11]

Junlong Cheng, Jin Ye, Zhongying Deng, Jianpin Chen, Tianbin Li, Haoyu Wang, Yanzhou Su, Ziyan Huang, Jilong Chen, Lei Jiang, et al. Sam-med2d. arXiv preprint arXiv:2308.16184, 2023
[12]

Michael Cooper, Zongliang Ji, and Rahul G Krishnan. Machine learning in computational histopathology: Challenges and opportunities. Genes, Chromosomes and Cancer, 62(9):540–556, 2023
[13]

Robert J Dambrino IV, Henry J Domenico, John A Graves, Melinda JB Buntin, William Martinez, S Trent Rosenbloom, and William O Cooper. Unsolicited patient complaints following the 21st century cures act information-blocking rule. In JAMA Health Forum, volume 4, page e233244, 2023
[14]

Google DeepMind. Gemma 3 technical report, 2025. URL https://arxiv.org/abs/2503.19786
[15]

Tong Ding, Sophia J Wagner, Andrew H Song, Richard J Chen, Ming Y Lu, Andrew Zhang, Anurag J Vaidya, Guillaume Jaume, Muhammad Shaban, Ahrong Kim, et al. A multimodal whole-slide foundation model for pathology. Nature medicine, pages 1–13, 2025
[16]

Sedigheh Eslami, Christoph Meinel, and Gerard De Melo. Pubmedclip: How much does clip benefit visual question answering in the medical domain? In Findings of the Association for Computational Linguistics: EACL 2023, pages 1181–1193, 2023
[17]

Rudolph Flesch. A new readability yardstick. Journal of applied psychology, 32(3):221, 1948
[18]

Tomas Goldsack, Zhihao Zhang, Chenghua Lin, and Carolina Scarton. Making science simple: Corpora for the lay summarisation of scientific literature. arXiv preprint arXiv:2210.09932, 2022
[19]

Tomas Goldsack, Zhihao Zhang, Chenghua Lin, and Carolina Scarton. Making science simple: Corpora for the lay summarisation of scientific literature. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 10589–10604, 2022
[20]

Tomas Goldsack, Carolina Scarton, Matthew Shardlow, and Chenghua Lin. Overview of the biolaysumm 2024 shared task on the lay summarization of biomedical research articles. arXiv preprint arXiv:2408.08566, 2024
[21]

Google DeepMind. Gemini 2.5 flash model card, 2025. URL https://ai.google.dev/gemini-api/docs/models#gemini-2.5-flash
[22]

Google DeepMind. Gemini 3 flash model card, 2026. URL https://ai.google.dev/gemini-api/docs/models#gemini-3-flash-preview
[23]

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024
[24]

Xuehai He, Yichen Zhang, Luntian Mou, Eric Xing, and Pengtao Xie. Pathvqa: 30000+ questions for medical visual question answering. arXiv preprint arXiv:2003.10286, 2020
[25]

Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415, 2016
[26]

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. Iclr, 1(2):3, 2022
[27]

Han Jang, Junhyeok Lee, Heeseong Eum, and Kyu Sung Choi. Medlaybench-v: A large-scale benchmark for expert-lay semantic alignment in medical vision language models. arXiv preprint arXiv:2604.05738, 2026
[28]

Songtao Jiang, Yuan Wang, Sibo Song, Tianxiang Hu, Chenyi Zhou, Bin Pu, Yan Zhang, Zhibo Yang, Yang Feng, Joey Tianyi Zhou, et al. Hulu-med: A transparent generalist model towards holistic medical vision-language understanding. arXiv preprint arXiv:2510.08668, 2025
[29]

Alistair EW Johnson, Tom J Pollard, Lu Shen, Li-wei H Lehman, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G Mark. Mimic-iii, a freely accessible critical care database. Scientific data, 3(1):1–9, 2016
[30]

Alistair EW Johnson, Tom J Pollard, Seth J Berkowitz, Nathaniel R Greenbaum, Matthew P Lungren, Chih-ying Deng, Roger G Mark, and Steven Horng. Mimic-cxr, a de-identified publicly available database of chest radiographs with free-text reports. Scientific data, 6(1):317, 2019
[31]

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th symposium on operating systems principles, pages 611–626, 2023
[32]

Jason J Lau, Soumya Gayen, Asma Ben Abacha, and Dina Demner-Fushman. A dataset of clinically generated visual questions and answers about radiology images. Scientific data, 5(1):180251, 2018
[33]

Chunyuan Li, Cliff Wong, Sheng Zhang, Naoto Usuyama, Haotian Liu, Jianwei Yang, Tristan Naumann, Hoifung Poon, and Jianfeng Gao. Llava-med: Training a large language-and-vision assistant for biomedicine in one day. Advances in Neural Information Processing Systems, 36:28541–28564, 2023
[34]

Weibin Liao, Tianlong Wang, Yinghao Zhu, Yasha Wang, Junyi Gao, and Liantao Ma. Magical: Medical lay language generation via semantic invariance and layperson-tailored adaptation. arXiv preprint arXiv:2508.08730, 2025
[35]

Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pages 74–81, 2004
[36]

Bo Liu, Li-Ming Zhan, Li Xu, Lin Ma, Yan Yang, and Xiao-Ming Wu. Slake: A semantically-labeled knowledge-enhanced dataset for medical visual question answering. In 2021 IEEE 18th international symposium on biomedical imaging (ISBI), pages 1650–1654. IEEE, 2021
[37]

Alejandro Lozano, Min Woo Sun, James Burgess, Liangyu Chen, Jeffrey J Nirschl, Jeffrey Gu, Ivan Lopez, Josiah Aklilu, Anita Rau, Austin Wolfgang Katzer, et al. Biomedica: An open biomedical image-caption archive, dataset, and vision-language models derived from scientific literature. In Proceedings of the Computer Vision and Pattern Recognition Conference,...

2025
[38]

Jun Ma, Yao Zhang, Song Gu, Cheng Ge, Ershuai Wang, Qin Zhou, Ziyan Huang, Pengju Lyu, Jian He, and Bo Wang. Automatic organ and pan-cancer segmentation in abdomen ct: the flare 2023 challenge. arXiv preprint arXiv:2408.12534, 2024
[39]

Bjoern H Menze, Andras Jakab, Stefan Bauer, Jayashree Kalpathy-Cramer, Keyvan Farahani, Justin Kirby, Yuliya Burren, Nicole Porz, Johannes Slotboom, Roland Wiest, et al. The multimodal brain tumor image segmentation benchmark (brats). IEEE transactions on medical imaging, 34(10):1993–2024, 2014
[40]

Meta. The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation. https://ai.meta.com/blog/llama-4-multimodal-intelligence/, April 2025. Accessed: 2026-05-06
[41]

Michael Moor, Oishi Banerjee, Zahra Shakeri Hossein Abad, Harlan M Krumholz, Jure Leskovec, Eric J Topol, and Pranav Rajpurkar. Foundation models for generalist medical artificial intelligence. Nature, 616(7956):259–265, 2023
[42]

Michael Moor, Qian Huang, Shirley Wu, Michihiro Yasunaga, Yash Dalmia, Jure Leskovec, Cyril Zakka, Eduardo Pontes Reis, and Pranav Rajpurkar. Med-flamingo: a multimodal medical few-shot learner. In Machine learning for health (ML4H), pages 353–367. PMLR, 2023
[43]

National Library of Medicine. PMC open access subset. https://pmc.ncbi.nlm.nih.gov/tools/openftlist/, 2003–. Accessed: 2026-05-06
[44]

National Library of Medicine. Medical subject headings (MeSH). https://www.nlm.nih.gov/mesh/, 2024
[45]

Mark Neumann, Daniel King, Iz Beltagy, and Waleed Ammar. Scispacy: fast and robust models for biomedical natural language processing. In Proceedings of the 18th BioNLP workshop and shared task, pages 319–327, 2019
[46] OpenAI. GPT-5.4 and GPT-5.4-mini model card, February 2026. URL https://platform.openai.com/docs/models

[47] OpenAI. GPT-5.5 system card, April 2026. URL https://openai.com/index/gpt-5-5-system-card/

[48] Sophie Ostmeier, Justin Xu, Zhihong Chen, Maya Varma, Louis Blankemeier, Christian Bluethgen, Arne Edward Michalson Md, Michael Moseley, Curtis Langlotz, Akshay S Chaudhari, et al. Green: Generative radiology report evaluation and error notation. In Findings of the association for computational linguistics: EMNLP 2024, pages 374–390, 2024.

[49] Arjun Panickssery, Samuel R Bowman, and Shi Feng. Llm evaluators recognize and favor their own generations. Advances in Neural Information Processing Systems, 37:68772–68802, 2024.

[50] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318, 2002.

[51] Jordan R Pollock, Skye A Buckner Petty, John J Schmitz, Jacob Varner, Allie M Metcalfe, and Nelly Tan. Patient access of their radiology reports before and after implementation of 21st century cures act information-blocking provisions at a large multicampus health system. American Journal of Roentgenology, 222(6):e2330343, 2024.

[52] Philipp Prucker, Keno K Bressem, Jan Peeken, Mateo Jukic, Alexander W Marka, Maximilian Strenzke, Su Hwan Kim, Christian J Mertens, Dominik Weller, Tristan Lemke, et al. A prospective controlled trial of large language model–based simplification of oncologic ct reports for patients with cancer. Radiology, 317(2):e251844, 2025.

[53] Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026. URL https://qwen.ai/blog?id=qwen3.5

[54] Qwen Team. Qwen3.6-35B-A3B: Agentic coding power, now open to all, April 2026. URL https://qwen.ai/blog?id=qwen3.6-35b-a3b

[55] Johannes Rückert, Louise Bloch, Raphael Brüngel, Ahmad Idrissi-Yaghir, Henning Schäfer, Cynthia S Schmidt, Sven Koitka, Obioma Pelka, Asma Ben Abacha, Alba G. Seco de Herrera, et al. Rocov2: Radiology objects in context version 2, an updated multimodal image dataset. Scientific Data, 11(1):688, 2024.

[56] Andrew Sellergren, Sahar Kazemzadeh, Tiam Jaroensri, Atilla Kiraly, Madeleine Traverse, Timo Kohlberger, Shawn Xu, Fayaz Jamil, Cían Hughes, Charles Lau, et al. Medgemma technical report. arXiv preprint arXiv:2507.05201, 2025.

[57] Matthew Shardlow and Raheel Nawaz. Neural text simplification of clinical letters with a domain specific phrase table. In Proceedings of the 57th annual meeting of the association for computational linguistics, pages 380–389, 2019.

[58] Kaiwen Shi, Yifei Li, Binh Ho, Jovian Wang, and Kobe Guo. Universal lesion segmentation challenge 2023: a comparative research of different algorithms. arXiv preprint arXiv:2502.10608, 2025.

[59] Nicholas Sioutos, Sherri de Coronado, Margaret W Haber, Frank W Hartel, Wen-Ling Shaiu, and Lawrence W Wright. Nci thesaurus: a semantic model integrating cancer-related clinical and molecular information. Journal of biomedical informatics, 40(1):30–43, 2007.

[60] Bryan D Steitz, Robert W Turer, Chen-Tan Lin, Scott MacDonald, Liz Salmi, Adam Wright, Christoph U Lehmann, Karen Langford, Samuel A McDonald, Thomas J Reese, et al. Perspectives of patients about immediate access to test results through an online patient portal. JAMA Network Open, 6(3):e233572, 2023.

[61] Pat Verga, Sebastian Hofstatter, Sophia Althammer, Yixuan Su, Aleksandra Piktus, Arkady Arkhangorodsky, Minjie Xu, Naomi White, and Patrick Lewis. Replacing judges with juries: Evaluating llm generations with a panel of diverse models. arXiv preprint arXiv:2404.18796, 2024.

[62] Nina S Vincoff, Matthew A Barish, and Gregory Grimaldi. The patient-friendly radiology report: history, evolution, challenges and opportunities. Clinical Imaging, 89:128–135, 2022.

[63] Lucy Lu Wang, Julia Otmakhova, Jay DeYoung, Thinh Hung Truong, Bailey Kuehl, Erin Bransom, and Byron C Wallace. Automated metrics for medical multi-document summarization disagree with human evaluations. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9871–9889, 2023.

[64] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024.

[65] Xiaosong Wang, Yifan Peng, Le Lu, Zhiyong Lu, Mohammadhadi Bagheri, and Ronald M Summers. Chestx-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2097–2106, 2017.

[66] Koki Wataoka, Tsubasa Takahashi, and Ryokan Ri. Self-preference bias in llm-as-a-judge. arXiv preprint arXiv:2410.21819, 2024.

[67] Chenghao Xiao, Kun Zhao, Xiao Wang, Siwei Wu, Sixing Yan, Tomas Goldsack, Sophia Ananiadou, Noura Al Moubayed, Liang Zhan, William K Cheung, et al. Overview of the biolaysumm 2025 shared task on lay summarization of biomedical research articles and radiology reports. In Proceedings of the 24th Workshop on Biomedical Language Processing, pages 365–377, 2025.

[68] Yunfei Xie, Ce Zhou, Lang Gao, Juncheng Wu, Xianhang Li, Hong-Yu Zhou, Sheng Liu, Lei Xing, James Zou, Cihang Xie, et al. Medtrinity-25m: A large-scale multimodal dataset with multigranular annotations for medicine. arXiv preprint arXiv:2408.02900, 2024.

[69] Justin Xu, Xi Zhang, Javid Abderezaei, Julie Bauml, Roger Boodoo, Fatemeh Haghighi, Ali Ganjizadeh, Eric Brattain, Dave Van Veen, Zaiqiao Meng, et al. Radeval: A framework for radiology text evaluation. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 546–557, 2025.

[70] Wei Xu, Courtney Napoles, Ellie Pavlick, Quanze Chen, and Chris Callison-Burch. Optimizing statistical machine translation for text simplification. Transactions of the Association for Computational Linguistics, 4:401–415, 2016.

[71] Weiwen Xu, Hou Pong Chan, Long Li, Mahani Aljunied, Ruifeng Yuan, Jianyu Wang, Chenghao Xiao, Guizhen Chen, Chaoqun Liu, Zhaodonghui Li, et al. Lingshu: A generalist foundation model for unified multimodal medical understanding and reasoning. arXiv preprint arXiv:2506.07044, 2025.

[72] Ke Yan, Xiaosong Wang, Le Lu, and Ronald M Summers. Deeplesion: automated mining of large-scale lesion annotations and universal lesion detection with deep learning. Journal of medical imaging, 5(3):036501–036501, 2018.

[73] Zonghai Yao, Nandyala Siddharth Kantu, Guanghao Wei, Hieu Tran, Zhangqi Duan, Sunjae Kwon, Zhichao Yang, and Hong Yu. Readme: Bridging medical jargon and lay understanding for patient education through data-centric nlp. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 12609–12629, 2024.

[74] Zonghai Yao, Benlu Wang, Yifan Zhang, Junda Wang, Iris Xia, Zhipeng Tang, Shuo Han, Feiyun Ouyang, Zhichao Yang, Arman Cohan, et al. Medical thinking with multiple images. arXiv preprint arXiv:2604.16506, 2026.

[75] Qing T Zeng and Tony Tse. Exploring and developing consumer health vocabularies. Journal of the American Medical Informatics Association, 13(1):24–29, 2006.

[76] Sheng Zhang, Yanbo Xu, Naoto Usuyama, Hanwen Xu, Jaspreet Bagga, Robert Tinn, Sam Preston, Rajesh Rao, Mu Wei, Naveen Valluri, et al. A multimodal biomedical foundation model trained from fifteen million image–text pairs. Nejm Ai, 2(1):AIoa2400640, 2025.

[77] Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. Bertscore: Evaluating text generation with bert. arXiv preprint arXiv:1904.09675, 2019.

[78] Xiaoman Zhang, Chaoyi Wu, Ziheng Zhao, Weixiong Lin, Ya Zhang, Yanfeng Wang, and Weidi Xie. Pmc-vqa: Visual instruction tuning for medical visual question answering. arXiv preprint arXiv:2305.10415, 2023.

[79] Kun Zhao, Chenghao Xiao, Sixing Yan, Haoteng Tang, William K Cheung, Noura Al Moubayed, Liang Zhan, and Chenghua Lin. X-ray made simple: Lay radiology report generation and robust evaluation. arXiv preprint arXiv:2406.17911, 2024.

[80] Weike Zhao, Chaoyi Wu, Xiaoman Zhang, Ya Zhang, Yanfeng Wang, and Weidi Xie. Ratescore: A metric for radiology report generation. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 15004–15019, 2024.

[38] [38]

Automatic organ and pan-cancer segmentation in abdomen ct: the flare 2023 challenge.arXiv preprint arXiv:2408.12534, 2024

Jun Ma, Yao Zhang, Song Gu, Cheng Ge, Ershuai Wang, Qin Zhou, Ziyan Huang, Pengju Lyu, Jian He, and Bo Wang. Automatic organ and pan-cancer segmentation in abdomen ct: the flare 2023 challenge.arXiv preprint arXiv:2408.12534, 2024

arXiv 2023

[39] [39]

The multimodal brain tumor image segmentation benchmark (brats).IEEE transactions on medical imaging, 34(10):1993–2024, 2014

Bjoern H Menze, Andras Jakab, Stefan Bauer, Jayashree Kalpathy-Cramer, Keyvan Farahani, Justin Kirby, Yuliya Burren, Nicole Porz, Johannes Slotboom, Roland Wiest, et al. The multimodal brain tumor image segmentation benchmark (brats).IEEE transactions on medical imaging, 34(10):1993–2024, 2014

1993

[40] [40]

The Llama 4 herd: The beginning of a new era of natively multimodal AI innova- tion

Meta. The Llama 4 herd: The beginning of a new era of natively multimodal AI innova- tion. https://ai.meta.com/blog/llama-4-multimodal-intelligence/ , April 2025. Accessed: 2026-05-06

2025

[41] [41]

Foundation models for generalist medical artificial intelligence.Nature, 616(7956):259–265, 2023

Michael Moor, Oishi Banerjee, Zahra Shakeri Hossein Abad, Harlan M Krumholz, Jure Leskovec, Eric J Topol, and Pranav Rajpurkar. Foundation models for generalist medical artificial intelligence.Nature, 616(7956):259–265, 2023

2023

[42] [42]

Med-flamingo: a multimodal medical few-shot learner

Michael Moor, Qian Huang, Shirley Wu, Michihiro Yasunaga, Yash Dalmia, Jure Leskovec, Cyril Zakka, Eduardo Pontes Reis, and Pranav Rajpurkar. Med-flamingo: a multimodal medical few-shot learner. InMachine learning for health (ML4H), pages 353–367. PMLR, 2023

2023

[43] [43]

PMC open access subset

National Library of Medicine. PMC open access subset. https://pmc.ncbi.nlm.nih.gov/ tools/openftlist/, 2003–. Accessed: 2026-05-06

2003

[44] [44]

Medical subject headings (MeSH)

National Library of Medicine. Medical subject headings (MeSH). https://www.nlm.nih. gov/mesh/, 2024

2024

[45] [45]

Scispacy: fast and robust models for biomedical natural language processing

Mark Neumann, Daniel King, Iz Beltagy, and Waleed Ammar. Scispacy: fast and robust models for biomedical natural language processing. InProceedings of the 18th BioNLP workshop and shared task, pages 319–327, 2019

2019

[46] [46]

GPT-5.4 and GPT-5.4-mini model card, February 2026

OpenAI. GPT-5.4 and GPT-5.4-mini model card, February 2026. URL https://platform. openai.com/docs/models

2026

[47] [47]

GPT-5.5 system card, April 2026

OpenAI. GPT-5.5 system card, April 2026. URL https://openai.com/index/ gpt-5-5-system-card/

2026

[48] [48]

Green: Generative radiology report evaluation and error notation

Sophie Ostmeier, Justin Xu, Zhihong Chen, Maya Varma, Louis Blankemeier, Christian Blueth- gen, Arne Edward Michalson Md, Michael Moseley, Curtis Langlotz, Akshay S Chaudhari, et al. Green: Generative radiology report evaluation and error notation. InFindings of the association for computational linguistics: EMNLP 2024, pages 374–390, 2024

2024

[49] [49]

Llm evaluators recognize and favor their own generations.Advances in Neural Information Processing Systems, 37:68772–68802, 2024

Arjun Panickssery, Samuel R Bowman, and Shi Feng. Llm evaluators recognize and favor their own generations.Advances in Neural Information Processing Systems, 37:68772–68802, 2024

2024

[50] [50]

Bleu: a method for automatic evaluation of machine translation

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. InProceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318, 2002. 12

2002

[51] [51]

Patient access of their radiology reports before and after implementation of 21st century cures act information-blocking provisions at a large multicampus health system

Jordan R Pollock, Skye A Buckner Petty, John J Schmitz, Jacob Varner, Allie M Metcalfe, and Nelly Tan. Patient access of their radiology reports before and after implementation of 21st century cures act information-blocking provisions at a large multicampus health system. American Journal of Roentgenology, 222(6):e2330343, 2024

2024

[52] [52]

A prospective controlled trial of large language model–based simplification of oncologic ct reports for patients with cancer.Radiology, 317(2):e251844, 2025

Philipp Prucker, Keno K Bressem, Jan Peeken, Mateo Jukic, Alexander W Marka, Maximilian Strenzke, Su Hwan Kim, Christian J Mertens, Dominik Weller, Tristan Lemke, et al. A prospective controlled trial of large language model–based simplification of oncologic ct reports for patients with cancer.Radiology, 317(2):e251844, 2025

2025

[53] [53]

Qwen3.5: Towards native multimodal agents, February 2026

Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026. URL https: //qwen.ai/blog?id=qwen3.5

2026

[54] [54]

Qwen3.6-35B-A3B: Agentic coding power, now open to all, April 2026

Qwen Team. Qwen3.6-35B-A3B: Agentic coding power, now open to all, April 2026. URL https://qwen.ai/blog?id=qwen3.6-35b-a3b

2026

[55] [55]

Seco de Herrera, et al

Johannes Rückert, Louise Bloch, Raphael Brüngel, Ahmad Idrissi-Yaghir, Henning Schäfer, Cynthia S Schmidt, Sven Koitka, Obioma Pelka, Asma Ben Abacha, Alba G. Seco de Herrera, et al. Rocov2: Radiology objects in context version 2, an updated multimodal image dataset. Scientific Data, 11(1):688, 2024

2024

[56] [56]

Medgemma technical report.arXiv preprint arXiv:2507.05201, 2025

Andrew Sellergren, Sahar Kazemzadeh, Tiam Jaroensri, Atilla Kiraly, Madeleine Traverse, Timo Kohlberger, Shawn Xu, Fayaz Jamil, Cían Hughes, Charles Lau, et al. Medgemma technical report.arXiv preprint arXiv:2507.05201, 2025

Pith/arXiv arXiv 2025

[57] [57]

Neural text simplification of clinical letters with a domain specific phrase table

Matthew Shardlow and Raheel Nawaz. Neural text simplification of clinical letters with a domain specific phrase table. InProceedings of the 57th annual meeting of the association for computational linguistics, pages 380–389, 2019

2019

[58] [58]

Universal lesion segmentation chal- lenge 2023: a comparative research of different algorithms.arXiv preprint arXiv:2502.10608, 2025

Kaiwen Shi, Yifei Li, Binh Ho, Jovian Wang, and Kobe Guo. Universal lesion segmentation chal- lenge 2023: a comparative research of different algorithms.arXiv preprint arXiv:2502.10608, 2025

arXiv 2023

[59] [59]

Nci thesaurus: a semantic model integrating cancer-related clinical and molecular information.Journal of biomedical informatics, 40(1):30–43, 2007

Nicholas Sioutos, Sherri de Coronado, Margaret W Haber, Frank W Hartel, Wen-Ling Shaiu, and Lawrence W Wright. Nci thesaurus: a semantic model integrating cancer-related clinical and molecular information.Journal of biomedical informatics, 40(1):30–43, 2007

2007

[60] [60]

Perspec- tives of patients about immediate access to test results through an online patient portal.JAMA Network Open, 6(3):e233572, 2023

Bryan D Steitz, Robert W Turer, Chen-Tan Lin, Scott MacDonald, Liz Salmi, Adam Wright, Christoph U Lehmann, Karen Langford, Samuel A McDonald, Thomas J Reese, et al. Perspec- tives of patients about immediate access to test results through an online patient portal.JAMA Network Open, 6(3):e233572, 2023

2023

[61] [61]

Replacing judges with juries: Evaluating llm generations with a panel of diverse models.arXiv preprint arXiv:2404.18796, 2024

Pat Verga, Sebastian Hofstatter, Sophia Althammer, Yixuan Su, Aleksandra Piktus, Arkady Arkhangorodsky, Minjie Xu, Naomi White, and Patrick Lewis. Replacing judges with juries: Evaluating llm generations with a panel of diverse models.arXiv preprint arXiv:2404.18796, 2024

Pith/arXiv arXiv 2024

[62]

Nina S Vincoff, Matthew A Barish, and Gregory Grimaldi. The patient-friendly radiology report: history, evolution, challenges and opportunities. Clinical Imaging, 89:128–135, 2022

[63]

Lucy Lu Wang, Julia Otmakhova, Jay DeYoung, Thinh Hung Truong, Bailey Kuehl, Erin Bransom, and Byron C Wallace. Automated metrics for medical multi-document summarization disagree with human evaluations. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9871–9889, 2023

[64]

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-VL: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024

[65]

Xiaosong Wang, Yifan Peng, Le Lu, Zhiyong Lu, Mohammadhadi Bagheri, and Ronald M Summers. ChestX-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2097–2106, 2017

[66]

Koki Wataoka, Tsubasa Takahashi, and Ryokan Ri. Self-preference bias in LLM-as-a-judge. arXiv preprint arXiv:2410.21819, 2024

[67]

Chenghao Xiao, Kun Zhao, Xiao Wang, Siwei Wu, Sixing Yan, Tomas Goldsack, Sophia Ananiadou, Noura Al Moubayed, Liang Zhan, William K Cheung, et al. Overview of the BioLaySumm 2025 shared task on lay summarization of biomedical research articles and radiology reports. In Proceedings of the 24th Workshop on Biomedical Language Processing, pages 365–377, 2025

[68]

Yunfei Xie, Ce Zhou, Lang Gao, Juncheng Wu, Xianhang Li, Hong-Yu Zhou, Sheng Liu, Lei Xing, James Zou, Cihang Xie, et al. MedTrinity-25M: A large-scale multimodal dataset with multigranular annotations for medicine. arXiv preprint arXiv:2408.02900, 2024

[69]

Justin Xu, Xi Zhang, Javid Abderezaei, Julie Bauml, Roger Boodoo, Fatemeh Haghighi, Ali Ganjizadeh, Eric Brattain, Dave Van Veen, Zaiqiao Meng, et al. RadEval: A framework for radiology text evaluation. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 546–557, 2025

[70]

Wei Xu, Courtney Napoles, Ellie Pavlick, Quanze Chen, and Chris Callison-Burch. Optimizing statistical machine translation for text simplification. Transactions of the Association for Computational Linguistics, 4:401–415, 2016

[71]

Weiwen Xu, Hou Pong Chan, Long Li, Mahani Aljunied, Ruifeng Yuan, Jianyu Wang, Chenghao Xiao, Guizhen Chen, Chaoqun Liu, Zhaodonghui Li, et al. Lingshu: A generalist foundation model for unified multimodal medical understanding and reasoning. arXiv preprint arXiv:2506.07044, 2025

[72]

Ke Yan, Xiaosong Wang, Le Lu, and Ronald M Summers. DeepLesion: automated mining of large-scale lesion annotations and universal lesion detection with deep learning. Journal of Medical Imaging, 5(3):036501–036501, 2018

[73]

Zonghai Yao, Nandyala Siddharth Kantu, Guanghao Wei, Hieu Tran, Zhangqi Duan, Sunjae Kwon, Zhichao Yang, and Hong Yu. README: Bridging medical jargon and lay understanding for patient education through data-centric NLP. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 12609–12629, 2024

[74]

Zonghai Yao, Benlu Wang, Yifan Zhang, Junda Wang, Iris Xia, Zhipeng Tang, Shuo Han, Feiyun Ouyang, Zhichao Yang, Arman Cohan, et al. Medical thinking with multiple images. arXiv preprint arXiv:2604.16506, 2026

[75]

Qing T Zeng and Tony Tse. Exploring and developing consumer health vocabularies. Journal of the American Medical Informatics Association, 13(1):24–29, 2006

[76]

Sheng Zhang, Yanbo Xu, Naoto Usuyama, Hanwen Xu, Jaspreet Bagga, Robert Tinn, Sam Preston, Rajesh Rao, Mu Wei, Naveen Valluri, et al. A multimodal biomedical foundation model trained from fifteen million image–text pairs. NEJM AI, 2(1):AIoa2400640, 2025

[77]

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. BERTScore: Evaluating text generation with BERT. arXiv preprint arXiv:1904.09675, 2019

[78]

Xiaoman Zhang, Chaoyi Wu, Ziheng Zhao, Weixiong Lin, Ya Zhang, Yanfeng Wang, and Weidi Xie. PMC-VQA: Visual instruction tuning for medical visual question answering. arXiv preprint arXiv:2305.10415, 2023

[79]

Kun Zhao, Chenghao Xiao, Sixing Yan, Haoteng Tang, William K Cheung, Noura Al Moubayed, Liang Zhan, and Chenghua Lin. X-ray made simple: Lay radiology report generation and robust evaluation. arXiv preprint arXiv:2406.17911, 2024

[80]

Weike Zhao, Chaoyi Wu, Xiaoman Zhang, Ya Zhang, Yanfeng Wang, and Weidi Xie. RaTEScore: A metric for radiology report generation. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 15004–15019, 2024