RoiMAM: Region-of-Interest Medical Attention Model for Efficient Vision-Language Understanding

Jiayan Yang; Wenqi Fang; Zhuoyu Wu

arxiv: 2605.15561 · v1 · pith:XN6M3JVNnew · submitted 2026-05-15 · 💻 cs.CV

Pith reviewed 2026-05-20 19:55 UTC · model grok-4.3

classification 💻 cs.CV

keywords medical visual question answering · efficient vision language models · region of interest attention · model efficiency · semantic selective suppression · text prompt enhancer

The pith

RoiMAM delivers higher accuracy in medical visual question answering using a model less than one-fifth the size of standard approaches.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper develops RoiMAM to address the inefficiency of large vision-language models in medical visual question answering. By incorporating a training-free module that generates and prioritizes regions of interest in images along with a text prompt enhancer, the model achieves better results on standard benchmarks while using far fewer parameters. A sympathetic reader would care because this suggests specialized medical AI can be made practical and deployable without massive computational resources. The approach demonstrates that targeted focus on relevant image areas can substitute for overall model scale in this task.

Core claim

The central claim is that integrating a training-free ROI Generation Module with Semantic Selective Suppression to focus on lesion-relevant regions, along with a parameter-free Text Prompt Enhancer, produces an efficient vision-language model that improves accuracy by about 2% on SLAKE and 4.6% on PMC-VQA while using less than 20% of the parameters of the MedVInT-TD model.

What carries the argument

The ROI Generation Module with Semantic Selective Suppression that identifies and prioritizes lesion-relevant regions in medical images without requiring training.

If this is right

Medical visual question answering systems can operate effectively with substantially reduced model sizes.
Training-free techniques for selecting important image regions can enhance performance in domain-specific vision-language tasks.
Parameter-free text prompt enhancements can provide necessary context for better multimodal understanding.
Such designs support more accessible clinical applications of AI in medical diagnostics.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This method could be adapted to non-medical visual question answering by developing analogous region selection strategies for other image types.
Combining RoiMAM with even larger base vision-language models might further boost performance without linear increases in size.
Validation on real-world clinical workflows would test whether the efficiency translates to practical time and cost savings in hospitals.

Load-bearing premise

The training-free ROI Generation Module with Semantic Selective Suppression can reliably identify and prioritize the lesion-relevant regions in medical images for the questions posed.

What would settle it

The claim would be undermined if the ROI module selected irrelevant areas on a held-out set of medical images and the model's accuracy consequently fell below that of the larger baseline model.

Original abstract

Vision-Language Models (VLMs) facilitate medical visual question answering (MedVQA) by jointly interpreting images and text. However, existing models typically depend on large architectures and closed-set answers, which limits their efficiency and potential clinical applicability. To overcome these shortcomings, we introduce RoiMAM, an efficient VLM. It integrates a training-free ROI Generation Module with Semantic Selective Suppression to focus on lesion-relevant regions, alongside a Text Prompt Enhancer module that provides modality-specific context without introducing training parameters. Compared to the widely used MedVInT-TD model, our design achieves efficient and accurate diagnosis at less than 20% of the model size, while improving accuracy by approximately 2% on SLAKE and 4.6% on PMC-VQA.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces RoiMAM, an efficient vision-language model for medical visual question answering. It combines a training-free ROI Generation Module with Semantic Selective Suppression to prioritize lesion-relevant regions in images, plus a Text Prompt Enhancer for modality-specific context without added parameters. The central empirical claim is that the resulting model uses less than 20% of the parameters of MedVInT-TD while improving accuracy by ~2% on SLAKE and ~4.6% on PMC-VQA.

Significance. If the ROI module's lesion prioritization is shown to be reliable and the accuracy gains are reproducible, the work would offer a practical route to smaller, deployable VLMs for clinical MedVQA. The training-free design is a positive feature for reproducibility.

major comments (2)

[§3.2] §3.2 (ROI Generation Module and Semantic Selective Suppression): the central efficiency/accuracy claim rests on this module correctly identifying lesion-relevant regions without training, yet no IoU, Dice, or overlap metrics versus expert lesion annotations are reported on SLAKE or PMC-VQA, and no ablation isolates its contribution from prompt engineering or baseline differences.
[§4] §4 (Experiments): the reported gains lack error bars, dataset statistics, failure-case analysis for small/low-contrast/multi-focal lesions, and ablations that would confirm the ROI module drives the 2–4.6% improvements rather than other design choices.
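
For readers unfamiliar with the overlap metrics named in the first major comment, the following is a minimal sketch of IoU and Dice for two binary masks. It assumes a predicted ROI mask and an expert annotation mask of the same shape; the function names and the toy masks are illustrative and do not come from the paper.

```python
from typing import Sequence, Tuple

Mask = Sequence[Sequence[int]]  # binary mask: 1 = inside region, 0 = outside


def overlap_counts(pred: Mask, truth: Mask) -> Tuple[int, int, int]:
    """Return (intersection, predicted area, annotated area) in pixels."""
    if len(pred) != len(truth) or any(len(p) != len(t) for p, t in zip(pred, truth)):
        raise ValueError("masks must have the same shape")
    inter = pred_area = truth_area = 0
    for p_row, t_row in zip(pred, truth):
        for p, t in zip(p_row, t_row):
            p, t = bool(p), bool(t)
            inter += p and t
            pred_area += p
            truth_area += t
    return inter, pred_area, truth_area


def iou(pred: Mask, truth: Mask) -> float:
    """Intersection over union; two empty masks count as a perfect match."""
    inter, a, b = overlap_counts(pred, truth)
    union = a + b - inter
    return inter / union if union else 1.0


def dice(pred: Mask, truth: Mask) -> float:
    """Dice coefficient: 2 * |A ∩ B| / (|A| + |B|)."""
    inter, a, b = overlap_counts(pred, truth)
    return 2 * inter / (a + b) if (a + b) else 1.0


# Toy masks for illustration only (not data from SLAKE or PMC-VQA).
toy_pred = [[1, 1, 0], [0, 1, 0]]
toy_truth = [[1, 0, 0], [0, 1, 1]]
toy_iou = iou(toy_pred, toy_truth)
toy_dice = dice(toy_pred, toy_truth)
```

Both metrics need a pixel-level reference mask, which is why the rebuttal's point about missing annotations matters for this comment.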

minor comments (2)

[Abstract] Abstract: state the exact accuracy numbers and the precise definition of 'model size' (parameters, FLOPs, or memory) used in the <20 % comparison.
[§2] §2 (Related Work): add a brief comparison table of parameter counts and accuracies for the cited MedVQA baselines to contextualize the efficiency claim.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive comments, which help strengthen the validation of the ROI module. We respond to each major point below, committing to revisions that enhance the manuscript without misrepresenting our current results.

point-by-point responses

Referee: [§3.2] §3.2 (ROI Generation Module and Semantic Selective Suppression): the central efficiency/accuracy claim rests on this module correctly identifying lesion-relevant regions without training, yet no IoU, Dice, or overlap metrics versus expert lesion annotations are reported on SLAKE or PMC-VQA, and no ablation isolates its contribution from prompt engineering or baseline differences.

Authors: We appreciate this observation on the need for direct validation of the training-free ROI module. The SLAKE and PMC-VQA datasets consist of VQA pairs without pixel-level expert lesion annotations, precluding computation of IoU or Dice scores. We will add qualitative ROI visualizations in the revised supplementary material to illustrate region selection. We will also include a new ablation in Section 4 that removes the ROI Generation Module and Semantic Selective Suppression while retaining the Text Prompt Enhancer, to isolate its contribution to the accuracy gains. revision: partial
Referee: [§4] §4 (Experiments): the reported gains lack error bars, dataset statistics, failure-case analysis for small/low-contrast/multi-focal lesions, and ablations that would confirm the ROI module drives the 2–4.6% improvements rather than other design choices.

Authors: We agree these elements would improve rigor. In the revised manuscript we will report error bars from multiple runs with different seeds. Section 4.1 will be expanded with additional dataset statistics including question-type distributions and modality breakdowns. A new failure-case subsection will analyze performance on small, low-contrast, and multi-focal lesions with example images. The ablation study referenced above will further isolate the ROI module's role. revision: yes
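
The error bars promised in this response could be reported as a mean and sample standard deviation over seeds. The sketch below is one way to do that, assuming one accuracy value per seed; the helper name is hypothetical, and the numbers are placeholders rather than results from the paper.

```python
import statistics
from typing import Sequence, Tuple


def summarize_runs(accuracies: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation of per-seed accuracies."""
    if len(accuracies) < 2:
        raise ValueError("need at least two runs to estimate spread")
    return statistics.mean(accuracies), statistics.stdev(accuracies)


# Placeholder values only; these are NOT accuracies reported by the authors.
placeholder_runs = [0.80, 0.82, 0.81]
mean_acc, std_acc = summarize_runs(placeholder_runs)
print(f"accuracy: {mean_acc:.3f} ± {std_acc:.3f} over {len(placeholder_runs)} seeds")
```

With only a handful of seeds the standard deviation is itself a rough estimate, so the number of runs should be stated alongside it.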

standing simulated objections not resolved

Direct IoU, Dice or overlap metrics versus expert lesion annotations on SLAKE or PMC-VQA, as these datasets do not provide such annotations.

Circularity Check

0 steps flagged

No circularity in derivation chain; empirical architecture with external benchmarks

full rationale

The paper introduces RoiMAM as an empirical VLM architecture combining a training-free ROI Generation Module with Semantic Selective Suppression and a Text Prompt Enhancer. No equations, first-principles derivations, fitted parameters, or predictions are described that could reduce to inputs by construction. Central claims rest on direct accuracy and efficiency comparisons to the external baseline MedVInT-TD on SLAKE and PMC-VQA datasets, without self-citation load-bearing, ansatz smuggling, or renaming of known results. The design is self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review is based on abstract only; no free parameters, axioms, or invented entities are specified beyond the high-level module names.

pith-pipeline@v0.9.0 · 5659 in / 1078 out tokens · 29940 ms · 2026-05-20T19:55:12.420215+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: integrates a training-free ROI Generation Module with Semantic Selective Suppression to focus on lesion-relevant regions... $M = \arg\max_i \frac{E_I \cdot E_{L_i}}{\|E_I\| \, \|E_{L_i}\|}$

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: $SA_{i,j} = SA^{\mathrm{ori}}_{i,j} - SA^{\mathrm{back}}_{i,j}$ if $SA^{\mathrm{back}} \le \delta$, else $\epsilon \cdot (\ldots)$
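
The two quoted passages can be read as a cosine-similarity argmax and a thresholded subtraction. The sketch below illustrates both under stated assumptions: E_I is taken to be an image embedding and E_Li the embedding of the i-th candidate prompt, and the threshold is applied per element. The quoted formula elides the else branch, so the sketch takes it as a caller-supplied function rather than guessing it. All function and parameter names are hypothetical.

```python
import math
from typing import Callable, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / norm


def best_match(image_emb: Sequence[float],
               candidate_embs: Sequence[Sequence[float]]) -> int:
    """Index i that maximizes cosine(image_emb, candidate_embs[i])."""
    if not candidate_embs:
        raise ValueError("need at least one candidate embedding")
    sims = [cosine(image_emb, e) for e in candidate_embs]
    return max(range(len(sims)), key=sims.__getitem__)


def suppress(sa_ori: Sequence[Sequence[float]],
             sa_back: Sequence[Sequence[float]],
             delta: float,
             high_background: Callable[[float, float], float]) -> List[List[float]]:
    """Subtract the background map where it is at most delta.

    Where the background value exceeds delta, defer to `high_background`,
    which stands in for the branch the quoted passage leaves elided.
    """
    return [
        [o - b if b <= delta else high_background(o, b)
         for o, b in zip(row_o, row_b)]
        for row_o, row_b in zip(sa_ori, sa_back)
    ]


# Toy inputs for illustration only.
idx = best_match([1.0, 0.0], [[0.0, 1.0], [2.0, 0.1], [-1.0, 0.0]])

# Placeholder else branch returning 0.0; this is NOT the paper's epsilon * (...) term.
out = suppress([[0.9, 0.5]], [[0.2, 0.8]], delta=0.5,
               high_background=lambda o, b: 0.0)
```

Passing the elided branch in as a function keeps the sketch honest about what the page does and does not show.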

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

21 extracted references · 21 canonical work pages · 4 internal anchors

[1]

INTRODUCTION Computer-aided medical diagnosis has progressed rapidly with the advent of deep-learning technologies. Recent Vision–Language Models (VLMs), usually built on Large Language Models (LLMs), have advanced medical visual question answering (MedVQA) systems by jointly interpreting clinical images and textual queries. These systems show promise...

work page internal anchor Pith review Pith/arXiv arXiv 2026
[2]

Early works, mainly based on the CLIP-based model, framed MedVQA as a classification task with predefined answer sets

RELATED WORKS MedVQA systems aim to answer questions from medical images, mirroring real-world diagnostic reasoning. Early works, mainly based on the CLIP-based model, framed MedVQA as a classification task with predefined answer sets. For instance, PubMedCLIP demonstrated CLIP’s applicability to the medical domain, and later datasets like PMC-VQA an...

work page
[3]

PROPOSED METHOD In this section, we detail our RoiMAM framework, designed to process and synthesize information from three distinct streams: an input image (I), a question (T), and a prompt specifying the image modality (L). As shown in the left panel of Figure 1, the input image is processed through parallel pathways: a Vision Encoder generates image-l...

work page
[4]

EXPERIMENT RESULTS AND ANALYSIS 4.1. Experiment Setup Base Model. We utilized UniMedCLIP as the base CLIP model for both TPE and RGMo, adopted its vision encoder for RoiMAM, and selected Qwen2-1.5B as the base LLM due to its open-source availability and suitable parameter size [11][12]. Regarding the KeyBERT module, we employed a pretrained all-MiniLM-L6-v...

work page
[5]

The Text Prompt Enhancer leverages a CLIP model, sharing its vision encoder with RoiMAM, and a lightweight BERT layer to generate informative text prompts for the LLM

CONCLUSION In this paper, we introduce RoiMAM, a medical VLM that incorporates two key modules—the ROI Generation Module and the Text Prompt Enhancer—to guide the model toward relevant image regions for MedVQA tasks. The Text Prompt Enhancer leverages a CLIP model, sharing its vision encoder with RoiMAM, and a lightweight BERT layer to generate informativ...

work page
[6]

LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day

Chunyuan Li, Cliff Wong, Sheng Zhang, Naoto Usuyama, Haotian Liu, Jianwei Yang, Tristan Naumann, Hoifung Poon, and Jianfeng Gao, “LLaVA-Med: Training a large language-and-vision assistant for biomedicine in one day,” arXiv preprint arXiv:2306.00890, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[7]

Generative models in medical visual question answering: A survey

Wenjie Dong, Shuhao Shen, Yuqiang Han, Tao Tan, Jian Wu, and Hongxia Xu, “Generative models in medical visual question answering: A survey,” Applied Sciences, vol. 15, no. 6, 2025

work page 2025
[8]

Fine-grained adaptive visual prompt for generative medical visual question answering

Ting Yu, Zixuan Tong, Jun Yu, and Ke Zhang, “Fine-grained adaptive visual prompt for generative medical visual question answering,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2025, vol. 39, pp. 9662–9670

work page 2025
[9]

Guiding medical vision-language models with diverse visual prompts: Framework design and comprehensive exploration of prompt variations

Kangyu Zhu, Ziyuan Qin, Huahui Yi, Zekun Jiang, Qicheng Lao, Shaoting Zhang, and Kang Li, “Guiding medical vision-language models with diverse visual prompts: Framework design and comprehensive exploration of prompt variations,” in Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics...

work page 2025
[10]

PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering

Xiaoman Zhang, Chaoyi Wu, Ziheng Zhao, Weixiong Lin, Ya Zhang, Yanfeng Wang, and Weidi Xie, “PMC-VQA: Visual instruction tuning for medical visual question answering,” arXiv preprint arXiv:2305.10415, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[11]

PubMedCLIP: How much does CLIP benefit visual question answering in the medical domain?

Sedigheh Eslami, Christoph Meinel, and Gerard De Melo, “PubMedCLIP: How much does CLIP benefit visual question answering in the medical domain?,” in Findings of the Association for Computational Linguistics: EACL 2023, 2023, pp. 1151–1163

work page 2023
[12]

PMC-CLIP: Contrastive language-image pre-training using biomedical documents

Weixiong Lin, Ziheng Zhao, Xiaoman Zhang, Chaoyi Wu, Ya Zhang, Yanfeng Wang, and Weidi Xie, “PMC-CLIP: Contrastive language-image pre-training using biomedical documents,” arXiv preprint arXiv:2303.07240, 2023

work page arXiv 2023
[13]

HuatuoGPT-Vision, towards injecting medical visual knowledge into multimodal LLMs at scale

Junying Chen, Ruyi Ouyang, Anningzhe Gao, Shunian Chen, Guiming Hardy Chen, Xidong Wang, Ruifei Zhang, Zhenyang Cai, Ke Ji, Guangjun Yu, Xiang Wan, and Benyou Wang, “HuatuoGPT-Vision, towards injecting medical visual knowledge into multimodal LLMs at scale,” arXiv preprint arXiv:2406.19280, 2024

work page arXiv 2024
[14]

SAM-Med2D

Junlong Cheng, Jin Ye, Zhongying Deng, Jianpin Chen, Tianbin Li, Haoyu Wang, Yanzhou Su, Ziyan Huang, Jilong Chen, Lei Jiang, Hui Sun, Junjun He, Shaoting Zhang, Min Zhu, and Yu Qiao, “SAM-Med2D,” arXiv preprint arXiv:2308.16184, 2023

work page arXiv 2023
[15]

Visual explanations of image-text representations via multi-modal information bottleneck attribution

Ying Wang, Tim G. J. Rudner, and Andrew Gordon Wilson, “Visual explanations of image-text representations via multi-modal information bottleneck attribution,” in Thirty-seventh Conference on Neural Information Processing Systems, 2023

work page 2023
[16]

UniMed-CLIP: Towards a unified image-text pretraining paradigm for diverse medical imaging modalities

Muhammad Uzair Khattak, Shahina Kunhimon, Muzammal Naseer, Salman Khan, and Fahad Shahbaz Khan, “UniMed-CLIP: Towards a unified image-text pretraining paradigm for diverse medical imaging modalities,” arXiv preprint arXiv:2412.10372, 2024

work page arXiv 2024
[17]

Qwen2 Technical Report

An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, Mei Li, Mingfen...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[18]

PubMed author-assigned keyword extraction (PubMedAKE) benchmark

Jiasheng Sheng, Zelalem Gero, and Joyce C. Ho, “PubMed author-assigned keyword extraction (PubMedAKE) benchmark,” in Proceedings of the 31st ACM International Conference on Information & Knowledge Management, 2022, pp. 4470–4474

work page 2022
[19]

A dataset of clinically generated visual questions and answers about radiology images

Jason J Lau, Soumya Gayen, Asma Ben Abacha, and Dina Demner-Fushman, “A dataset of clinically generated visual questions and answers about radiology images,” Scientific Data, vol. 5, no. 1, pp. 1–10, 2018

work page 2018
[20]

SLAKE: A semantically-labeled knowledge-enhanced dataset for medical visual question answering

Bo Liu, Li-Ming Zhan, Li Xu, Lin Ma, Yan Yang, and Xiao-Ming Wu, “SLAKE: A semantically-labeled knowledge-enhanced dataset for medical visual question answering,” in 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI), 2021, pp. 1650–1654

work page 2021
[21]

Radiology objects in context (ROCO): A multimodal image dataset

Obioma Pelka, Sven Koitka, Johannes Rückert, Felix Nensa, and C. Friedrich, “Radiology objects in context (ROCO): A multimodal image dataset,” in MICCAI Workshop on Large-scale Annotation of Biomedical Data and Expert Label Synthesis (LABELS), 2018

work page 2018
