MEDLAYXPLAIN: Benchmarking the Expert-Lay Gap in Medical Vision-Language Models
Pith reviewed 2026-06-26 14:23 UTC · model grok-4.3
The pith
Medical vision-language models show a systematic gap between expert accuracy and patient-accessible descriptions of images.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Benchmarking 33 VLMs on MedLayXPlain-122K reveals a systematic Expert-Lay Gap: medical VLMs achieve strong expert captioning but suffer significant lay-register degradation, while general-purpose VLMs produce more accessible language yet lack clinical precision, confirming that neither current paradigm adequately serves patient-facing communication.
What carries the argument
The Hierarchical Ontology-Verified Refinement (HOVER) pipeline, which builds lay captions from expert ones via patient-centric vocabulary mapping, LLM-based constrained rewriting, and cross-model visual verification to maintain semantic equivalence.
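The three HOVER steps named above can be pictured with a minimal sketch. Everything in it is a hypothetical stand-in rather than the authors' implementation: the `LAY_VOCAB` table, the function names, and the text-only verifiers are assumptions made for illustration. The real pipeline uses an LLM for rewriting and cross-model visual verification against the image, neither of which is reproduced here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Hypothetical toy vocabulary (not taken from the paper or from UMLS).
LAY_VOCAB: Dict[str, str] = {
    "pleural effusion": "fluid around the lung",
    "cardiomegaly": "an enlarged heart",
}


@dataclass
class Sample:
    expert_caption: str
    lay_caption: Optional[str] = None


def map_vocabulary(expert_caption: str) -> Dict[str, str]:
    """Step 1: collect expert terms that have a known lay equivalent."""
    text = expert_caption.lower()
    return {term: lay for term, lay in LAY_VOCAB.items() if term in text}


def constrained_rewrite(expert_caption: str, mapping: Dict[str, str]) -> str:
    """Step 2 stand-in: plain substitution where a real system would call an LLM."""
    rewritten = expert_caption.lower()
    for term, lay in mapping.items():
        rewritten = rewritten.replace(term, lay)
    return rewritten


def verify(lay_caption: str, mapping: Dict[str, str],
           verifiers: List[Callable[[str], bool]]) -> bool:
    """Step 3 stand-in: every mapped concept must survive and all verifiers must agree."""
    concepts_kept = all(lay in lay_caption for lay in mapping.values())
    return concepts_kept and all(check(lay_caption) for check in verifiers)


def hover_sketch(sample: Sample,
                 verifiers: List[Callable[[str], bool]]) -> Sample:
    """Run the three steps; keep the lay caption only if verification passes."""
    mapping = map_vocabulary(sample.expert_caption)
    candidate = constrained_rewrite(sample.expert_caption, mapping)
    if verify(candidate, mapping, verifiers):
        sample.lay_caption = candidate
    return sample


# Usage: one hypothetical verifier that rejects leftover jargon.
no_jargon = lambda text: "cardiomegaly" not in text
result = hover_sketch(Sample("Cardiomegaly with small pleural effusion."), [no_jargon])
print(result.lay_caption)
```

Rejecting a candidate when any verifier disagrees mirrors the stated goal of preventing hallucination, at the cost of discarding some acceptable rewrites.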
If this is right
- Medical VLMs require targeted adaptation to maintain clinical precision while shifting to lay-register language.
- General-purpose VLMs need added clinical constraints or fine-tuning to reach acceptable medical accuracy.
- Standard NLG metrics are inadequate for this task because they correlate poorly with clinical judgment.
- The MedLayXPlain benchmark supplies a standardized testbed for developing models that support patient communication.
- Closing the gap would directly support patient education and shared decision-making under immediate-access regulations.
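The claim that standard NLG metrics correlate poorly with clinical judgment is the kind of statement a rank correlation can check. The sketch below is a plain-Python Spearman correlation; the function names are illustrative, and any scores passed to it would have to come from a real rating study, which the paper summary above does not report.

```python
from typing import List, Sequence


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; raises if either input has zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(metric_scores: Sequence[float],
             clinician_ratings: Sequence[float]) -> float:
    """Spearman rho: Pearson correlation computed on the ranks."""
    if len(metric_scores) != len(clinician_ratings) or len(metric_scores) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    return pearson(ranks(metric_scores), ranks(clinician_ratings))
```

A value near zero, or a negative one, on matched caption-level scores would support the inadequacy claim; a value near one would contradict it.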
Where Pith is reading between the lines
- Hybrid training that exposes models to both expert and lay pairs from the dataset could produce outputs usable by both audiences.
- The gap may extend beyond images to other medical text generation tasks such as report summarization.
- Deploying models without addressing the gap risks providing either overly technical or imprecise information to patients.
- The ontology hierarchy could support automatic generation of explanations at multiple levels of detail beyond the current three.
Load-bearing premise
The HOVER pipeline combined with the distilled MedLayEval evaluator correctly enforces semantic equivalence and clinical alignment without introducing systematic bias from the LLM rewriting steps.
What would settle it
The claim would be undermined if independent clinical experts, rating a sample of lay captions from both medical and general-purpose VLMs, found no consistent difference in accessibility or accuracy across the five attributes.
Original abstract
Medical Vision-Language Models (Med-VLMs) achieve strong expert-level performance, yet their ability to generate patient-accessible descriptions remains underexplored. With the 21st Century Cures Act now mandating immediate patient access to diagnostic imaging results, evaluating whether Med-VLMs can bridge this Expert-Lay Gap is both urgent and clinically consequential for patient education and shared decision-making. To this end, we introduce MedLayXPlain, the first large-scale multimodal benchmark and evaluation framework for Medical Lay Language Generation (MLLG). MedLayXPlain-122K provides 122,789 region-grounded samples across 8 imaging modalities from 12 publicly available source datasets, each comprising a medical image with paired expert and lay captions anchored in a three-level Unified Medical Language System (UMLS) ontology hierarchy spanning 7 semantic groups, 43 semantic types, and 2,411 medical concepts. Lay captions are constructed via Hierarchical Ontology-Verified Refinement (HOVER), a three-step pipeline combining patient-centric vocabulary mapping, LLM-based constrained rewriting, and cross-model visual verification to enforce semantic equivalence while preventing hallucination. We further introduce MedLayEval, a lightweight 3B evaluator distilled from a 27B verifier that scores expert-lay alignment across five clinically grounded attributes, addressing the poor correlation between standard NLG metrics and clinical judgment. Benchmarking 33 VLMs on MedLayXPlain-122K reveals a systematic Expert-Lay Gap: medical VLMs achieve strong expert captioning but suffer significant lay-register degradation, while general-purpose VLMs produce more accessible language yet lack clinical precision, confirming that neither current paradigm adequately serves patient-facing communication.
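A benchmark record as the abstract describes it (image, region, paired captions, three-level ontology anchor) might be laid out as in the sketch below. The field names and the bounding-box encoding of the region are assumptions; the abstract does not specify the schema. The example ontology path uses a UMLS semantic group and semantic type that exist in UMLS, but the captions are invented for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class OntologyPath:
    """Three-level anchor: semantic group, then semantic type, then concept."""
    semantic_group: str
    semantic_type: str
    concept: str

    def levels(self) -> Tuple[str, str, str]:
        return (self.semantic_group, self.semantic_type, self.concept)


@dataclass(frozen=True)
class LayPairSample:
    """Hypothetical record layout; the real dataset schema may differ."""
    image_path: str
    region_bbox: Tuple[int, int, int, int]  # assumed (x, y, width, height)
    modality: str
    expert_caption: str
    lay_caption: str
    ontology: OntologyPath


# Illustrative record; the path and captions are made up for this sketch.
sample = LayPairSample(
    image_path="images/example_0001.png",
    region_bbox=(120, 340, 200, 150),
    modality="X-ray",
    expert_caption="Small left pleural effusion.",
    lay_caption="A small amount of fluid around the left lung.",
    ontology=OntologyPath("Disorders", "Disease or Syndrome", "Pleural effusion"),
)
print(" > ".join(sample.ontology.levels()))
```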
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MedLayXPlain-122K, a 122,789-sample multimodal benchmark for Medical Lay Language Generation (MLLG) spanning 8 imaging modalities and 12 public datasets. Expert-lay caption pairs are generated via the Hierarchical Ontology-Verified Refinement (HOVER) pipeline, which maps to a three-level UMLS ontology (7 semantic groups, 43 types, 2,411 concepts), performs LLM-constrained rewriting, and applies cross-model visual verification. A 3B-parameter MedLayEval evaluator is distilled from a 27B model to score five clinically grounded alignment attributes. Benchmarking 33 VLMs reveals a systematic Expert-Lay Gap: medical VLMs perform well on expert captions but degrade on lay-register output, while general-purpose VLMs produce more accessible language at the cost of clinical precision.
Significance. If the constructed pairs and evaluator are shown to be faithful, the work directly addresses a regulatory need (21st Century Cures Act) for patient-accessible imaging descriptions and supplies the first large-scale, ontology-anchored testbed for evaluating whether current Med-VLMs can support shared decision-making. The scale, public-data sourcing, and explicit separation of expert vs. lay registers are clear strengths; the result would be actionable for model developers and clinicians.
major comments (3)
- [Abstract / HOVER pipeline] Abstract and Methods (HOVER pipeline description): the claim that HOVER 'enforces semantic equivalence while preventing hallucination' rests on UMLS mapping + LLM rewriting + cross-model verification, yet no human expert adjudication, inter-rater agreement, or external clinical review of the resulting expert-lay pairs is reported. Because the central Expert-Lay Gap finding is defined by performance differences on these pairs, undetected clinical drift in the rewriting step would render the measured degradation an artifact rather than a model property.
- [Abstract / MedLayEval] Abstract (MedLayEval): the 3B evaluator is asserted to address 'poor correlation between standard NLG metrics and clinical judgment,' but no quantitative validation (e.g., correlation coefficients with radiologist ratings, comparison against the 27B teacher on held-out cases, or error analysis) is provided. This is load-bearing for all downstream VLM rankings.
- [Benchmarking results] Benchmark results section: the reported gap is presented without error bars, statistical significance tests across the 33 models, or ablation on the five MedLayEval attributes. Without these, it is unclear whether the 'systematic' degradation is robust or driven by a subset of modalities or semantic groups.
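The error bars and significance test requested in the third major comment could take the following form. This is a sketch only: it assumes one gap value per model (expert score minus lay score on the same samples), and the function names, resample counts, and seeds are illustrative choices, not anything the paper specifies.

```python
import random
from statistics import mean
from typing import Sequence, Tuple


def bootstrap_ci(gaps: Sequence[float], n_resamples: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean gap."""
    rng = random.Random(seed)
    data = list(gaps)
    means = sorted(mean(rng.choices(data, k=len(data)))
                   for _ in range(n_resamples))
    low_index = max(0, round((alpha / 2) * n_resamples))
    high_index = min(n_resamples - 1, round((1 - alpha / 2) * n_resamples) - 1)
    return means[low_index], means[high_index]


def sign_flip_p_value(gaps: Sequence[float], n_permutations: int = 2000,
                      seed: int = 0) -> float:
    """Two-sided paired permutation test of the hypothesis that the mean gap is zero."""
    rng = random.Random(seed)
    observed = abs(mean(gaps))
    hits = 0
    for _ in range(n_permutations):
        # Under the null, each model's gap is equally likely to have either sign.
        flipped = [g if rng.random() < 0.5 else -g for g in gaps]
        if abs(mean(flipped)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_permutations + 1)
```

Running both per modality or per semantic group, rather than only on the pooled scores, would address the referee's concern that the gap might be driven by a subset of the data.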
minor comments (2)
- [Abstract] The abstract states '122,789 region-grounded samples' but does not clarify how region grounding is preserved or verified after the LLM rewriting step.
- [Introduction / Methods] Notation for the three-level UMLS hierarchy (groups/types/concepts) should be defined once with an explicit example in the main text rather than only in the abstract.
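The inter-rater agreement that the first major comment finds missing is commonly summarized with Cohen's kappa. The sketch below assumes two raters assigning one categorical label per expert-lay pair; the label names in the usage lines are invented, and no such ratings are reported in the paper.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable],
                 rater_b: Sequence[Hashable]) -> float:
    """Chance-corrected agreement between two raters on categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Agreement expected if both raters labeled independently at their own rates.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Usage with hypothetical labels for four caption pairs.
rater_1 = ["equivalent", "equivalent", "drift", "drift"]
rater_2 = ["equivalent", "equivalent", "drift", "equivalent"]
print(cohens_kappa(rater_1, rater_2))
```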
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed review. We address each major comment below with honest responses and indicate where revisions will be made to the manuscript.
- Referee: [Abstract / HOVER pipeline] Abstract and Methods (HOVER pipeline description): the claim that HOVER 'enforces semantic equivalence while preventing hallucination' rests on UMLS mapping + LLM rewriting + cross-model verification, yet no human expert adjudication, inter-rater agreement, or external clinical review of the resulting expert-lay pairs is reported. Because the central Expert-Lay Gap finding is defined by performance differences on these pairs, undetected clinical drift in the rewriting step would render the measured degradation an artifact rather than a model property.
  Authors: The HOVER pipeline relies on UMLS ontology mapping to 2,411 concepts, LLM-constrained rewriting, and cross-model visual verification to promote semantic equivalence. We acknowledge that the manuscript reports no human expert adjudication, inter-rater agreement, or external clinical review of the pairs. This is a genuine limitation. We will revise by adding an expanded Limitations section that discusses the automated safeguards and their potential shortcomings, along with any available verification statistics from the pipeline.
  revision: partial
- Referee: [Abstract / MedLayEval] Abstract (MedLayEval): the 3B evaluator is asserted to address 'poor correlation between standard NLG metrics and clinical judgment,' but no quantitative validation (e.g., correlation coefficients with radiologist ratings, comparison against the 27B teacher on held-out cases, or error analysis) is provided. This is load-bearing for all downstream VLM rankings.
  Authors: We agree that quantitative validation of MedLayEval (e.g., correlations with radiologist ratings or error analysis versus the 27B teacher) is important and is not reported in the manuscript. The current work describes the distillation process but lacks these external metrics. As the required human ratings were not collected, we cannot supply them. We will revise the text to detail the internal distillation validation that was performed and to state explicitly in the Limitations section that external clinical correlation remains an open requirement for future work.
  revision: partial
- Referee: [Benchmarking results] Benchmark results section: the reported gap is presented without error bars, statistical significance tests across the 33 models, or ablation on the five MedLayEval attributes. Without these, it is unclear whether the 'systematic' degradation is robust or driven by a subset of modalities or semantic groups.
  Authors: We accept that the benchmarking results would be more robust with error bars, statistical significance tests, and attribute ablations. These analyses can be performed on the existing evaluation outputs. We will revise the results section to add error bars to the reported metrics, include statistical significance testing across the 33 models, and provide ablations on the five MedLayEval attributes to assess consistency across modalities and semantic groups.
  revision: yes
Objections the planned revisions do not resolve
- Human expert adjudication, inter-rater agreement, or external clinical review of the HOVER expert-lay pairs
- Quantitative validation of MedLayEval against radiologist ratings or detailed error analysis versus the 27B teacher
Circularity Check
No circularity: benchmark constructed from external public datasets with independent evaluation
The paper constructs MedLayXPlain-122K from 12 publicly available external source datasets and introduces HOVER and MedLayEval as new pipelines. No derivation reduces a claimed result to a fitted parameter or self-citation by construction. The central benchmarking result (Expert-Lay Gap) is an empirical measurement on held-out models, not a self-referential prediction. Self-citations, if present, are not load-bearing for the core claims. This matches the default non-circular case for benchmark papers.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The UMLS ontology hierarchy provides an accurate and complete mapping from expert medical concepts to patient-centric lay vocabulary across the covered semantic groups and types.
invented entities (1)
- MedLayEval: no independent evidence
Reference graph
Works this paper leans on
-
[1]
Large language models for simplifying radiology reports: a systematic review and meta-analysis of patient, public, and clinician evaluations.The Lancet Digital Health, 2026
Samer Alabed, Abigail Anderson, Ahmed Maiter, Anthony Hughes, Niamh McAnenly, Mahan Salehi, Michael Sharkey, Krit Dwivedi, Alireza Hokmabadi, Fares Alahdab, et al. Large language models for simplifying radiology reports: a systematic review and meta-analysis of patient, public, and clinician evaluations.The Lancet Digital Health, 2026
2026
-
[2]
Claude models overview, 2025
Anthropic. Claude models overview, 2025. URL https://docs.anthropic.com/en/docs/ about-claude/models
2025
-
[3]
Introducing Claude Opus 4.7, April 2026
Anthropic. Introducing Claude Opus 4.7, April 2026. URL https://www.anthropic.com/ news/claude-opus-4-7
2026
-
[4]
Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025
Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025
Pith/arXiv arXiv 2025
-
[5]
Qwen2.5-vl technical report,
Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report,
-
[6]
URLhttps://arxiv.org/abs/2502.13923
-
[7]
Meteor: An automatic metric for mt evaluation with improved correlation with human judgments
Satanjeev Banerjee and Alon Lavie. Meteor: An automatic metric for mt evaluation with improved correlation with human judgments. InProceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, pages 65–72, 2005
2005
-
[8]
Contemporary trends in reviewing test results through the electronic patient portal among patients with cancer.JAMA oncology, 10(1):139–140, 2024
Sheena Bhalla, Tanushree Prasad, Donglu Xie, and David E Gerber. Contemporary trends in reviewing test results through the electronic patient portal among patients with cancer.JAMA oncology, 10(1):139–140, 2024
2024
-
[9]
The unified medical language system (umls): integrating biomedical terminology.Nucleic acids research, 32(suppl_1):D267–D270, 2004
Olivier Bodenreider. The unified medical language system (umls): integrating biomedical terminology.Nucleic acids research, 32(suppl_1):D267–D270, 2004
2004
-
[10]
Snomed-ct: The advanced terminology and coding system for ehealth
L Bos and K Donnelly. Snomed-ct: The advanced terminology and coding system for ehealth. Stud Health Technol Inform, 121:279–290, 2006
2006
-
[11]
Sam-med2d.arXiv preprint arXiv:2308.16184, 2023
Junlong Cheng, Jin Ye, Zhongying Deng, Jianpin Chen, Tianbin Li, Haoyu Wang, Yanzhou Su, Ziyan Huang, Jilong Chen, Lei Jiang, et al. Sam-med2d.arXiv preprint arXiv:2308.16184, 2023
arXiv 2023
-
[12]
Machine learning in computational histopathology: Challenges and opportunities.Genes, Chromosomes and Cancer, 62(9):540– 556, 2023
Michael Cooper, Zongliang Ji, and Rahul G Krishnan. Machine learning in computational histopathology: Challenges and opportunities.Genes, Chromosomes and Cancer, 62(9):540– 556, 2023
2023
-
[13]
Unsolicited patient complaints following the 21st century cures act information-blocking rule
Robert J Dambrino IV , Henry J Domenico, John A Graves, Melinda JB Buntin, William Martinez, S Trent Rosenbloom, and William O Cooper. Unsolicited patient complaints following the 21st century cures act information-blocking rule. InJAMA Health Forum, volume 4, page e233244, 2023
2023
-
[14]
Gemma 3 technical report, 2025
Google DeepMind. Gemma 3 technical report, 2025. URL https://arxiv.org/abs/2503. 19786
2025
-
[15]
A multimodal whole-slide foundation model for pathology.Nature medicine, pages 1–13, 2025
Tong Ding, Sophia J Wagner, Andrew H Song, Richard J Chen, Ming Y Lu, Andrew Zhang, Anurag J Vaidya, Guillaume Jaume, Muhammad Shaban, Ahrong Kim, et al. A multimodal whole-slide foundation model for pathology.Nature medicine, pages 1–13, 2025
2025
-
[16]
Sedigheh Eslami, Christoph Meinel, and Gerard De Melo. Pubmedclip: How much does clip benefit visual question answering in the medical domain? InFindings of the Association for Computational Linguistics: EACL 2023, pages 1181–1193, 2023
2023
-
[17]
A new readability yardstick.Journal of applied psychology, 32(3):221, 1948
Rudolph Flesch. A new readability yardstick.Journal of applied psychology, 32(3):221, 1948. 10
1948
-
[18]
Tomas Goldsack, Zhihao Zhang, Chenghua Lin, and Carolina Scarton. Making science simple: Corpora for the lay summarisation of scientific literature.arXiv preprint arXiv:2210.09932, 2022
arXiv 2022
-
[19]
Making science simple: Corpora for the lay summarisation of scientific literature
Tomas Goldsack, Zhihao Zhang, Chenghua Lin, and Carolina Scarton. Making science simple: Corpora for the lay summarisation of scientific literature. InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 10589–10604, 2022
2022
-
[20]
Tomas Goldsack, Carolina Scarton, Matthew Shardlow, and Chenghua Lin. Overview of the biolaysumm 2024 shared task on the lay summarization of biomedical research articles.arXiv preprint arXiv:2408.08566, 2024
arXiv 2024
-
[21]
Gemini 2.5 flash model card, 2025
Google DeepMind. Gemini 2.5 flash model card, 2025. URL https://ai.google.dev/ gemini-api/docs/models#gemini-2.5-flash
2025
-
[22]
Gemini 3 flash model card, 2026
Google DeepMind. Gemini 3 flash model card, 2026. URL https://ai.google.dev/ gemini-api/docs/models#gemini-3-flash-preview
2026
-
[23]
The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024
Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024
Pith/arXiv arXiv 2024
-
[24]
Xuehai He, Yichen Zhang, Luntian Mou, Eric Xing, and Pengtao Xie. Pathvqa: 30000+ questions for medical visual question answering.arXiv preprint arXiv:2003.10286, 2020
Pith/arXiv arXiv 2003
-
[25]
Gaussian error linear units (gelus).arXiv preprint arXiv:1606.08415, 2016
Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (gelus).arXiv preprint arXiv:1606.08415, 2016
Pith/arXiv arXiv 2016
-
[26]
Lora: Low-rank adaptation of large language models.Iclr, 1(2):3, 2022
Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models.Iclr, 1(2):3, 2022
2022
-
[27]
Han Jang, Junhyeok Lee, Heeseong Eum, and Kyu Sung Choi. Medlaybench-v: A large-scale benchmark for expert-lay semantic alignment in medical vision language models.arXiv preprint arXiv:2604.05738, 2026
Pith/arXiv arXiv 2026
-
[28]
Songtao Jiang, Yuan Wang, Sibo Song, Tianxiang Hu, Chenyi Zhou, Bin Pu, Yan Zhang, Zhibo Yang, Yang Feng, Joey Tianyi Zhou, et al. Hulu-med: A transparent generalist model towards holistic medical vision-language understanding.arXiv preprint arXiv:2510.08668, 2025
arXiv 2025
-
[29]
Mimic-iii, a freely accessible critical care database.Scientific data, 3(1):1–9, 2016
Alistair EW Johnson, Tom J Pollard, Lu Shen, Li-wei H Lehman, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G Mark. Mimic-iii, a freely accessible critical care database.Scientific data, 3(1):1–9, 2016
2016
-
[30]
Mimic-cxr, a de-identified publicly available database of chest radiographs with free-text reports.Scientific data, 6(1):317, 2019
Alistair EW Johnson, Tom J Pollard, Seth J Berkowitz, Nathaniel R Greenbaum, Matthew P Lungren, Chih-ying Deng, Roger G Mark, and Steven Horng. Mimic-cxr, a de-identified publicly available database of chest radiographs with free-text reports.Scientific data, 6(1):317, 2019
2019
-
[31]
Efficient memory management for large language model serving with pagedattention
Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. InProceedings of the 29th symposium on operating systems principles, pages 611–626, 2023
2023
-
[32]
A dataset of clinically generated visual questions and answers about radiology images.Scientific data, 5(1): 180251, 2018
Jason J Lau, Soumya Gayen, Asma Ben Abacha, and Dina Demner-Fushman. A dataset of clinically generated visual questions and answers about radiology images.Scientific data, 5(1): 180251, 2018
2018
-
[33]
Llava-med: Training a large language-and-vision assistant for biomedicine in one day.Advances in Neural Information Processing Systems, 36: 28541–28564, 2023
Chunyuan Li, Cliff Wong, Sheng Zhang, Naoto Usuyama, Haotian Liu, Jianwei Yang, Tristan Naumann, Hoifung Poon, and Jianfeng Gao. Llava-med: Training a large language-and-vision assistant for biomedicine in one day.Advances in Neural Information Processing Systems, 36: 28541–28564, 2023. 11
2023
-
[34]
Magical: Medical lay language generation via semantic invariance and layperson-tailored adaptation
Weibin Liao, Tianlong Wang, Yinghao Zhu, Yasha Wang, Junyi Gao, and Liantao Ma. Magical: Medical lay language generation via semantic invariance and layperson-tailored adaptation. arXiv preprint arXiv:2508.08730, 2025
arXiv 2025
-
[35]
Rouge: A package for automatic evaluation of summaries
Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. InText summarization branches out, pages 74–81, 2004
2004
-
[36]
Slake: A semantically- labeled knowledge-enhanced dataset for medical visual question answering
Bo Liu, Li-Ming Zhan, Li Xu, Lin Ma, Yan Yang, and Xiao-Ming Wu. Slake: A semantically- labeled knowledge-enhanced dataset for medical visual question answering. In2021 IEEE 18th international symposium on biomedical imaging (ISBI), pages 1650–1654. IEEE, 2021
2021
-
[37]
Biomedica: An open biomedical image-caption archive, dataset, and vision-language models derived from scientific literature
Alejandro Lozano, Min Woo Sun, James Burgess, Liangyu Chen, Jeffrey J Nirschl, Jeffrey Gu, Ivan Lopez, Josiah Aklilu, Anita Rau, Austin Wolfgang Katzer, et al. Biomedica: An open biomedical image-caption archive, dataset, and vision-language models derived from scientific literature. InProceedings of the Computer Vision and Pattern Recognition Conference,...
2025
-
[38]
Jun Ma, Yao Zhang, Song Gu, Cheng Ge, Ershuai Wang, Qin Zhou, Ziyan Huang, Pengju Lyu, Jian He, and Bo Wang. Automatic organ and pan-cancer segmentation in abdomen ct: the flare 2023 challenge.arXiv preprint arXiv:2408.12534, 2024
arXiv 2023
-
[39]
The multimodal brain tumor image segmentation benchmark (brats).IEEE transactions on medical imaging, 34(10):1993–2024, 2014
Bjoern H Menze, Andras Jakab, Stefan Bauer, Jayashree Kalpathy-Cramer, Keyvan Farahani, Justin Kirby, Yuliya Burren, Nicole Porz, Johannes Slotboom, Roland Wiest, et al. The multimodal brain tumor image segmentation benchmark (brats).IEEE transactions on medical imaging, 34(10):1993–2024, 2014
1993
-
[40]
The Llama 4 herd: The beginning of a new era of natively multimodal AI innova- tion
Meta. The Llama 4 herd: The beginning of a new era of natively multimodal AI innova- tion. https://ai.meta.com/blog/llama-4-multimodal-intelligence/ , April 2025. Accessed: 2026-05-06
2025
-
[41]
Foundation models for generalist medical artificial intelligence.Nature, 616(7956):259–265, 2023
Michael Moor, Oishi Banerjee, Zahra Shakeri Hossein Abad, Harlan M Krumholz, Jure Leskovec, Eric J Topol, and Pranav Rajpurkar. Foundation models for generalist medical artificial intelligence.Nature, 616(7956):259–265, 2023
2023
-
[42]
Med-flamingo: a multimodal medical few-shot learner
Michael Moor, Qian Huang, Shirley Wu, Michihiro Yasunaga, Yash Dalmia, Jure Leskovec, Cyril Zakka, Eduardo Pontes Reis, and Pranav Rajpurkar. Med-flamingo: a multimodal medical few-shot learner. InMachine learning for health (ML4H), pages 353–367. PMLR, 2023
2023
-
[43]
PMC open access subset
National Library of Medicine. PMC open access subset. https://pmc.ncbi.nlm.nih.gov/ tools/openftlist/, 2003–. Accessed: 2026-05-06
2003
-
[44]
Medical subject headings (MeSH)
National Library of Medicine. Medical subject headings (MeSH). https://www.nlm.nih. gov/mesh/, 2024
2024
-
[45]
Scispacy: fast and robust models for biomedical natural language processing
Mark Neumann, Daniel King, Iz Beltagy, and Waleed Ammar. Scispacy: fast and robust models for biomedical natural language processing. InProceedings of the 18th BioNLP workshop and shared task, pages 319–327, 2019
2019
-
[46]
GPT-5.4 and GPT-5.4-mini model card, February 2026
OpenAI. GPT-5.4 and GPT-5.4-mini model card, February 2026. URL https://platform. openai.com/docs/models
2026
-
[47]
GPT-5.5 system card, April 2026
OpenAI. GPT-5.5 system card, April 2026. URL https://openai.com/index/ gpt-5-5-system-card/
2026
-
[48]
Green: Generative radiology report evaluation and error notation
Sophie Ostmeier, Justin Xu, Zhihong Chen, Maya Varma, Louis Blankemeier, Christian Blueth- gen, Arne Edward Michalson Md, Michael Moseley, Curtis Langlotz, Akshay S Chaudhari, et al. Green: Generative radiology report evaluation and error notation. InFindings of the association for computational linguistics: EMNLP 2024, pages 374–390, 2024
2024
-
[49]
Llm evaluators recognize and favor their own generations.Advances in Neural Information Processing Systems, 37:68772–68802, 2024
Arjun Panickssery, Samuel R Bowman, and Shi Feng. Llm evaluators recognize and favor their own generations.Advances in Neural Information Processing Systems, 37:68772–68802, 2024
2024
-
[50]
Bleu: a method for automatic evaluation of machine translation
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. InProceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318, 2002. 12
2002
-
[51]
Patient access of their radiology reports before and after implementation of 21st century cures act information-blocking provisions at a large multicampus health system
Jordan R Pollock, Skye A Buckner Petty, John J Schmitz, Jacob Varner, Allie M Metcalfe, and Nelly Tan. Patient access of their radiology reports before and after implementation of 21st century cures act information-blocking provisions at a large multicampus health system. American Journal of Roentgenology, 222(6):e2330343, 2024
2024
-
[52]
A prospective controlled trial of large language model–based simplification of oncologic ct reports for patients with cancer.Radiology, 317(2):e251844, 2025
Philipp Prucker, Keno K Bressem, Jan Peeken, Mateo Jukic, Alexander W Marka, Maximilian Strenzke, Su Hwan Kim, Christian J Mertens, Dominik Weller, Tristan Lemke, et al. A prospective controlled trial of large language model–based simplification of oncologic ct reports for patients with cancer.Radiology, 317(2):e251844, 2025
2025
-
[53]
Qwen3.5: Towards native multimodal agents, February 2026
Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026. URL https: //qwen.ai/blog?id=qwen3.5
2026
-
[54]
Qwen3.6-35B-A3B: Agentic coding power, now open to all, April 2026
Qwen Team. Qwen3.6-35B-A3B: Agentic coding power, now open to all, April 2026. URL https://qwen.ai/blog?id=qwen3.6-35b-a3b
2026
-
[55]
Seco de Herrera, et al
Johannes Rückert, Louise Bloch, Raphael Brüngel, Ahmad Idrissi-Yaghir, Henning Schäfer, Cynthia S Schmidt, Sven Koitka, Obioma Pelka, Asma Ben Abacha, Alba G. Seco de Herrera, et al. Rocov2: Radiology objects in context version 2, an updated multimodal image dataset. Scientific Data, 11(1):688, 2024
2024
-
[56]
Medgemma technical report.arXiv preprint arXiv:2507.05201, 2025
Andrew Sellergren, Sahar Kazemzadeh, Tiam Jaroensri, Atilla Kiraly, Madeleine Traverse, Timo Kohlberger, Shawn Xu, Fayaz Jamil, Cían Hughes, Charles Lau, et al. Medgemma technical report.arXiv preprint arXiv:2507.05201, 2025
Pith/arXiv arXiv 2025
-
[57]
Neural text simplification of clinical letters with a domain specific phrase table
Matthew Shardlow and Raheel Nawaz. Neural text simplification of clinical letters with a domain specific phrase table. InProceedings of the 57th annual meeting of the association for computational linguistics, pages 380–389, 2019
2019
-
[58]
Kaiwen Shi, Yifei Li, Binh Ho, Jovian Wang, and Kobe Guo. Universal lesion segmentation chal- lenge 2023: a comparative research of different algorithms.arXiv preprint arXiv:2502.10608, 2025
arXiv 2023
-
[59]
Nci thesaurus: a semantic model integrating cancer-related clinical and molecular information.Journal of biomedical informatics, 40(1):30–43, 2007
Nicholas Sioutos, Sherri de Coronado, Margaret W Haber, Frank W Hartel, Wen-Ling Shaiu, and Lawrence W Wright. Nci thesaurus: a semantic model integrating cancer-related clinical and molecular information.Journal of biomedical informatics, 40(1):30–43, 2007
2007
-
[60]
Perspec- tives of patients about immediate access to test results through an online patient portal.JAMA Network Open, 6(3):e233572, 2023
Bryan D Steitz, Robert W Turer, Chen-Tan Lin, Scott MacDonald, Liz Salmi, Adam Wright, Christoph U Lehmann, Karen Langford, Samuel A McDonald, Thomas J Reese, et al. Perspec- tives of patients about immediate access to test results through an online patient portal.JAMA Network Open, 6(3):e233572, 2023
2023
-
[61]
Pat Verga, Sebastian Hofstatter, Sophia Althammer, Yixuan Su, Aleksandra Piktus, Arkady Arkhangorodsky, Minjie Xu, Naomi White, and Patrick Lewis. Replacing judges with juries: Evaluating llm generations with a panel of diverse models.arXiv preprint arXiv:2404.18796, 2024
Pith/arXiv arXiv 2024
-
[62]
The patient-friendly radiology report: history, evolution, challenges and opportunities.Clinical Imaging, 89:128–135, 2022
Nina S Vincoff, Matthew A Barish, and Gregory Grimaldi. The patient-friendly radiology report: history, evolution, challenges and opportunities.Clinical Imaging, 89:128–135, 2022
2022
-
[63]
Automated metrics for medical multi-document summarization disagree with human evaluations
Lucy Lu Wang, Julia Otmakhova, Jay DeYoung, Thinh Hung Truong, Bailey Kuehl, Erin Bransom, and Byron C Wallace. Automated metrics for medical multi-document summarization disagree with human evaluations. InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9871–9889, 2023
2023
-
[64]
Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024
Pith/arXiv arXiv 2024
-
[65]
Chestx-ray8: Hospital-scale chest x-ray database and benchmarks on weakly- supervised classification and localization of common thorax diseases
Xiaosong Wang, Yifan Peng, Le Lu, Zhiyong Lu, Mohammadhadi Bagheri, and Ronald M Summers. Chestx-ray8: Hospital-scale chest x-ray database and benchmarks on weakly- supervised classification and localization of common thorax diseases. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 2097–2106, 2017. 13
2097
-
[66]
Self-preference bias in llm-as-a-judge
Koki Wataoka, Tsubasa Takahashi, and Ryokan Ri. Self-preference bias in llm-as-a-judge. arXiv preprint arXiv:2410.21819, 2024
Pith/arXiv arXiv 2024
-
[67]
Overview of the biolaysumm 2025 shared task on lay summarization of biomedical research articles and radiology reports
Chenghao Xiao, Kun Zhao, Xiao Wang, Siwei Wu, Sixing Yan, Tomas Goldsack, Sophia Anani- adou, Noura Al Moubayed, Liang Zhan, William K Cheung, et al. Overview of the biolaysumm 2025 shared task on lay summarization of biomedical research articles and radiology reports. In Proceedings of the 24th Workshop on Biomedical Language Processing, pages 365–377, 2025
2025
-
[68]
Yunfei Xie, Ce Zhou, Lang Gao, Juncheng Wu, Xianhang Li, Hong-Yu Zhou, Sheng Liu, Lei Xing, James Zou, Cihang Xie, et al. Medtrinity-25m: A large-scale multimodal dataset with multigranular annotations for medicine. arXiv preprint arXiv:2408.02900, 2024
arXiv 2024
-
[69]
Radeval: A framework for radiology text evaluation
Justin Xu, Xi Zhang, Javid Abderezaei, Julie Bauml, Roger Boodoo, Fatemeh Haghighi, Ali Ganjizadeh, Eric Brattain, Dave Van Veen, Zaiqiao Meng, et al. Radeval: A framework for radiology text evaluation. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 546–557, 2025
2025
-
[70]
Optimizing statistical machine translation for text simplification
Wei Xu, Courtney Napoles, Ellie Pavlick, Quanze Chen, and Chris Callison-Burch. Optimizing statistical machine translation for text simplification. Transactions of the Association for Computational Linguistics, 4:401–415, 2016
2016
-
[71]
Weiwen Xu, Hou Pong Chan, Long Li, Mahani Aljunied, Ruifeng Yuan, Jianyu Wang, Chenghao Xiao, Guizhen Chen, Chaoqun Liu, Zhaodonghui Li, et al. Lingshu: A generalist foundation model for unified multimodal medical understanding and reasoning. arXiv preprint arXiv:2506.07044, 2025
Pith/arXiv arXiv 2025
-
[72]
Deeplesion: automated mining of large-scale lesion annotations and universal lesion detection with deep learning
Ke Yan, Xiaosong Wang, Le Lu, and Ronald M Summers. Deeplesion: automated mining of large-scale lesion annotations and universal lesion detection with deep learning. Journal of medical imaging, 5(3):036501–036501, 2018
2018
-
[73]
Readme: Bridging medical jargon and lay understanding for patient education through data-centric nlp
Zonghai Yao, Nandyala Siddharth Kantu, Guanghao Wei, Hieu Tran, Zhangqi Duan, Sunjae Kwon, Zhichao Yang, and Hong Yu. Readme: Bridging medical jargon and lay understanding for patient education through data-centric nlp. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 12609–12629, 2024
2024
-
[74]
Medical thinking with multiple images
Zonghai Yao, Benlu Wang, Yifan Zhang, Junda Wang, Iris Xia, Zhipeng Tang, Shuo Han, Feiyun Ouyang, Zhichao Yang, Arman Cohan, et al. Medical thinking with multiple images. arXiv preprint arXiv:2604.16506, 2026
Pith/arXiv arXiv 2026
-
[75]
Exploring and developing consumer health vocabularies
Qing T Zeng and Tony Tse. Exploring and developing consumer health vocabularies. Journal of the American Medical Informatics Association, 13(1):24–29, 2006
2006
-
[76]
A multimodal biomedical foundation model trained from fifteen million image–text pairs
Sheng Zhang, Yanbo Xu, Naoto Usuyama, Hanwen Xu, Jaspreet Bagga, Robert Tinn, Sam Preston, Rajesh Rao, Mu Wei, Naveen Valluri, et al. A multimodal biomedical foundation model trained from fifteen million image–text pairs. Nejm Ai, 2(1):AIoa2400640, 2025
2025
-
[77]
Bertscore: Evaluating text generation with bert
Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. Bertscore: Evaluating text generation with bert. arXiv preprint arXiv:1904.09675, 2019
Pith/arXiv arXiv 2019
-
[78]
Xiaoman Zhang, Chaoyi Wu, Ziheng Zhao, Weixiong Lin, Ya Zhang, Yanfeng Wang, and Weidi Xie. Pmc-vqa: Visual instruction tuning for medical visual question answering. arXiv preprint arXiv:2305.10415, 2023
Pith/arXiv arXiv 2023
-
[79]
Kun Zhao, Chenghao Xiao, Sixing Yan, Haoteng Tang, William K Cheung, Noura Al Moubayed, Liang Zhan, and Chenghua Lin. X-ray made simple: Lay radiology report generation and robust evaluation. arXiv preprint arXiv:2406.17911, 2024
arXiv 2024
-
[80]
Ratescore: A metric for radiology report generation
Weike Zhao, Chaoyi Wu, Xiaoman Zhang, Ya Zhang, Yanfeng Wang, and Weidi Xie. Ratescore: A metric for radiology report generation. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 15004–15019, 2024
2024