
arxiv: 2606.12169 · v1 · pith:OFZIPU6Knew · submitted 2026-06-10 · 💻 cs.CV · cs.AI · cs.CL · cs.LG

OpenMedReason: Scientific Reasoning Supervision for Medical Vision-Language Models

Pith reviewed 2026-06-27 09:52 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.CL · cs.LG
keywords medical vision-language models · reasoning supervision · visual question answering · multimodal dataset · biomedical reasoning · scientific articles · OpenMedReason · fine-grained evaluation

The pith

A corpus of about 450K medical image-question-answer instances, with reasoning traces derived from human-authored biomedical articles, raises vision-language model VQA accuracy by 20 percent on average.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces OpenMedReason, a dataset of about 450,000 medical image-question-answer instances whose reasoning traces come mainly from curated human-written biomedical scientific articles instead of synthetic chains of thought. This resource is intended to train large vision-language models so their answers rest on both visual evidence and clinical knowledge rather than pattern matching alone. When models undergo supervised fine-tuning or reinforcement alignment on the corpus, they gain 20 percent in visual question answering accuracy on average over the base model and land within 4.2 percent of the strongest medical models of comparable scale. The improvements occur together in visual perception, medical knowledge, and the quality of the written rationale, and the new reasoning traces are preferred over those of the base model in 86.1 percent of pairwise comparisons. A held-out benchmark, OpenMedReason-Bench, scores each of those three capabilities separately so progress can be measured beyond final-answer correctness.

Core claim

OpenMedReason supplies high-fidelity supervision for medical vision-language models by deriving reasoning traces primarily from curated human-authored biomedical scientific articles rather than synthetic chains of thought, resulting in models that improve 20 percent in VQA accuracy, advance jointly on perception, knowledge, and rationale, and produce traces preferred 86.1 percent of the time.

What carries the argument

OpenMedReason, a multimodal medical reasoning corpus of approximately 450K image-question-answer instances whose reasoning traces are derived from curated biomedical human-authored scientific articles.

If this is right

  • Both supervised fine-tuning and reinforcement-based alignment produce measurable gains when trained on the corpus.
  • Improvements appear jointly across perception, medical knowledge, and rationale rather than being confined to one axis.
  • Reasoning traces generated after training are preferred over base-model traces in 86.1 percent of pairwise comparisons.
  • Final performance reaches within 4.2 percent of the strongest medical LVLMs of comparable scale.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • A parallel corpus built from human-authored articles in another technical domain could be expected to produce similar joint gains for domain-specific vision-language models.
  • The three-axis benchmark could be applied to diagnose whether future models continue to lag on particular modalities such as charts or microscopic images.
  • Repeating the training experiments on models substantially larger or smaller than those tested would reveal whether the reported accuracy lift depends on model scale.
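
The modality-level diagnosis suggested in the second extension could be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual tooling: the record schema (a `modality` string plus a `scores` dict keyed by `perception`, `knowledge`, and `rationale`) and both function names are assumptions made for the example.

```python
from collections import defaultdict

# The three capability axes named in the paper's abstract.
AXES = ("perception", "knowledge", "rationale")


def mean_axis_scores_by_modality(records):
    """Average each axis score per imaging modality.

    records: iterable of dicts such as
        {"modality": "microscopy", "scores": {"perception": 0.8, ...}}
    This schema is hypothetical; the real benchmark format may differ.
    Returns {modality: {axis: mean_score}}.
    """
    sums = defaultdict(lambda: dict.fromkeys(AXES, 0.0))
    counts = defaultdict(int)
    for rec in records:
        modality = rec["modality"]
        counts[modality] += 1
        for axis in AXES:
            sums[modality][axis] += rec["scores"][axis]
    return {
        m: {axis: sums[m][axis] / counts[m] for axis in AXES}
        for m in sums
    }


def weakest_axis(axis_means):
    """Return the axis with the lowest mean score for one modality."""
    return min(AXES, key=lambda axis: axis_means[axis])
```

Calling `weakest_axis` on each modality's entry gives a quick view of whether, say, charts lag on perception while microscopy lags on knowledge.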

Load-bearing premise

Reasoning traces extracted from human-authored biomedical scientific articles supply higher-fidelity supervision than synthetic chains of thought for training medical vision-language models.

What would settle it

Train one model on OpenMedReason and an identical model on an equal volume of synthetic reasoning traces, then compare both on held-out VQA accuracy and on human preference for their generated reasoning traces.
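
One way to score the accuracy half of that comparison is a paired bootstrap over the held-out questions. The sketch below assumes each model's results are available as aligned lists of 0/1 correctness flags; the function name and the choice of a percentile bootstrap are illustrative, not something the paper specifies.

```python
import random


def paired_bootstrap_diff(correct_a, correct_b, n_resamples=2000, seed=0):
    """Accuracy difference (A minus B) with a 95% percentile bootstrap interval.

    correct_a / correct_b: equal-length lists of 0/1 flags, one per held-out
    question, aligned so index i refers to the same question for both models.
    Returns (point_estimate, lower, upper).
    """
    if len(correct_a) != len(correct_b) or not correct_a:
        raise ValueError("inputs must be non-empty and aligned")
    n = len(correct_a)
    rng = random.Random(seed)
    point = (sum(correct_a) - sum(correct_b)) / n
    diffs = []
    for _ in range(n_resamples):
        # Resample questions with replacement, keeping the A/B pairing intact.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(correct_a[i] - correct_b[i] for i in idx) / n)
    diffs.sort()
    lower = diffs[int(0.025 * n_resamples)]
    upper = diffs[int(0.975 * n_resamples) - 1]
    return point, lower, upper
```

If the interval excludes zero, the accuracy gap between the article-derived and synthetic-trace models is unlikely to be resampling noise; the preference half of the test would need a separate analysis.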

Figures

Figures reproduced from arXiv: 2606.12169 by Abeer Badawi, Adibvafa Fallahpour, Ali Etemad, Arash Afkanpour, Elham Dolatabadi, Leonid Sigal, Michael Colacci, Negin Baghbanzadeh, Pritam Sarkar.

Figure 1: (A) Overview of the multi-stage OPENMEDREASON curation pipeline, including quality filtering, context extraction, question construction, reasoning-trace generation, and verification. (B) Distribution of the 19 clinical task categories. Color coding corresponds to the task families defined in …
Figure 2: Performance of our model. (Left) Pairwise win / tie / lose rates from our model’s perspective against four baselines, judged head-to-head on medical VQA traces. Baseline wins (slate) denotes the baseline trace was preferred, Tie (light gray) indicates the judge marked the responses as equivalent, and Ours wins (purple) denotes our model was preferred. Our checkpoint (SFT + GRPO) is favored across all four …
Figure 3: Examples of images removed by the visual quality filtering stage. The filtering criteria …
Figure 4: Dataset statistics for OPENMEDREASON. (A) Distribution of question categories, with bars stacked by question style, showing the balance between short image-only questions and long image-plus-clinical-context questions. (B) Distribution of imaging modalities, grouped by primary modality, illustrating the diversity of radiology, visible-light photography, microscopy, diagrams, and plots/charts represented in…
Figure 5: Qualitative example of image-grounded medical reasoning. The transesophageal echocar…
Figure 6: Qualitative example of image-grounded medical reasoning. The postoperative coronal …
Figure 7: Qualitative example of image-grounded medical reasoning. The fluorescein and indocyanine …
Figure 8: A screenshot of the designed framework with 100 examples for the physician expert to …
Original abstract

High-stakes clinical use of large vision-language models (LVLMs) requires reasoning that is grounded in visual evidence and clinical knowledge, not just correct final answers. We introduce OpenMedReason, a large-scale, open multimodal medical reasoning corpus comprising approximately 450K image-question-answer instances whose reasoning traces are primarily derived from curated biomedical, human-authored scientific articles. OpenMedReason provides high-fidelity supervision beyond synthetic chains of thought, covering diverse medical domain vision modalities such as radiological scans, microscopic images, visible light photographs, charts, and others. We complement it with OpenMedReason-Bench, a held-out benchmark that allows fine-grained evaluation of LVLMs along three complementary axes of capability, including perception, medical knowledge, and rationale, enabling diagnostic evaluation beyond final-answer accuracy. OpenMedReason is a rich training resource that exhibits its effectiveness in both supervised fine-tuning (SFT) and reinforcement-based alignment. Training with OpenMedReason yields a 20% average improvement in VQA accuracy over the base model and achieves performance within 4.2% of the strongest comparable-scale medical LVLMs. Fine-grained performance analysis confirms that the gains are not concentrated in any single axis: OpenMedReason improves perception, medical knowledge, and rationale jointly, and its reasoning traces are preferred over those of the base model in 86.1% of pairwise comparisons. We release the code and dataset at huggingface.co/datasets/neginb/OpenMedReason.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces OpenMedReason, an open multimodal medical reasoning corpus of ~450K image-question-answer instances whose reasoning traces are primarily derived from curated biomedical, human-authored scientific articles. It also releases OpenMedReason-Bench for fine-grained evaluation along perception, medical knowledge, and rationale axes. The authors report that SFT and reinforcement alignment on this data produce a 20% average VQA accuracy gain over the base model, performance within 4.2% of the strongest comparable-scale medical LVLMs, joint gains across the three axes, and an 86.1% pairwise preference for the generated reasoning traces over those of the base model.

Significance. An openly released, large-scale medical reasoning dataset with human-authored article grounding could be a useful resource for training LVLMs if the claimed fidelity advantage is substantiated. The public release of both dataset and code is a clear positive.

major comments (2)
  1. [Abstract] Abstract (paragraph 2): The claim that the traces supply 'high-fidelity supervision beyond synthetic chains of thought' because they are 'primarily derived from curated biomedical, human-authored scientific articles' is load-bearing for the central contribution, yet the abstract (and thus the manuscript's headline claim) provides no description of the derivation pipeline—manual extraction, LLM summarization, human editing, or automated alignment to images. Without this, the qualitative distinction from synthetic CoT cannot be evaluated.
  2. [Abstract] Abstract: No curation criteria for the source articles, no inter-annotator agreement statistics for reasoning quality, and no statistical significance tests or confidence intervals are reported for the 20% VQA improvement or the 86.1% preference rate. These omissions prevent assessment of whether the quantitative gains are reliable or attributable to the claimed fidelity rather than scale or domain coverage alone.
minor comments (1)
  1. [Abstract] Abstract: The phrase 'diverse medical domain vision modalities such as radiological scans, microscopic images, visible light photographs, charts, and others' would benefit from a quantitative breakdown of modality distribution.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the detailed feedback on the abstract and evaluation reporting. We address each major comment below and will revise the manuscript accordingly where the points can be addressed without misrepresenting the work.

Point-by-point responses
  1. Referee: [Abstract] Abstract (paragraph 2): The claim that the traces supply 'high-fidelity supervision beyond synthetic chains of thought' because they are 'primarily derived from curated biomedical, human-authored scientific articles' is load-bearing for the central contribution, yet the abstract (and thus the manuscript's headline claim) provides no description of the derivation pipeline—manual extraction, LLM summarization, human editing, or automated alignment to images. Without this, the qualitative distinction from synthetic CoT cannot be evaluated.

    Authors: We agree the abstract should include a concise description of the derivation pipeline to support the central claim. The full manuscript (Section 3) details that reasoning traces are obtained via automated parsing of scientific articles followed by human curation and image alignment. We will add a one-sentence summary of this process to the abstract in revision.

    Revision: yes

  2. Referee: [Abstract] Abstract: No curation criteria for the source articles, no inter-annotator agreement statistics for reasoning quality, and no statistical significance tests or confidence intervals are reported for the 20% VQA improvement or the 86.1% preference rate. These omissions prevent assessment of whether the quantitative gains are reliable or attributable to the claimed fidelity rather than scale or domain coverage alone.

    Authors: Curation criteria for source articles are provided in Section 3.1. Inter-annotator agreement statistics are not reported because the traces are extracted from pre-existing human-authored articles rather than newly created multi-annotator labels. We will add statistical significance tests and confidence intervals for the VQA gains and preference rates in the revised results section.

    Revision: partial
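
As an illustration of the kind of interval the referee asks for on the preference rate, the sketch below computes a Wilson score interval for a proportion. It is a generic statistical helper, not the authors' analysis: the number of pairwise comparisons is not stated in the abstract, so any `n` passed in is an assumption, and ties (which Figure 2 reports separately) would need to be excluded or handled before counting wins.

```python
import math


def wilson_interval(wins, n, z=1.96):
    """Wilson score interval for a win rate of `wins` out of `n` comparisons.

    z=1.96 corresponds to an approximate 95% interval.
    Returns (lower, upper) as proportions in [0, 1].
    """
    if n <= 0 or not 0 <= wins <= n:
        raise ValueError("need n > 0 and 0 <= wins <= n")
    p = wins / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Hypothetical counts chosen only to match an 86.1% rate; the real n is unknown.
lower, upper = wilson_interval(wins=861, n=1000)
```

The width of the interval depends strongly on `n`, which is why the raw comparison counts matter as much as the headline percentage.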

standing simulated objections not resolved
  • Inter-annotator agreement statistics for reasoning quality remain unreported; the authors' position is that the dataset derives its traces from existing scientific articles without new multi-annotator labeling.

Circularity Check

0 steps flagged

No circularity; dataset presented as external resource with empirical evaluation

Full rationale

The paper introduces OpenMedReason as a corpus of ~450K instances whose reasoning traces are stated to be derived from curated human-authored biomedical articles, with no equations, fitted parameters, predictions, or self-citations invoked to justify any derivation. Performance gains (20% VQA lift, 86.1% preference) are reported as empirical outcomes on a held-out benchmark after SFT or alignment, without any reduction to model outputs or self-referential inputs. The central claim rests on the external sourcing of traces rather than any internal loop, making the work self-contained as a data contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the premise that reasoning traces extracted from curated human-authored articles constitute higher-fidelity supervision than synthetic alternatives; this is a domain assumption rather than a derived quantity.

axioms (1)
  • domain assumption: Reasoning traces extracted from curated biomedical, human-authored scientific articles supply high-fidelity supervision beyond synthetic chains of thought.
    Stated directly in the abstract as the distinguishing property of the corpus.



A Data Curation Details

A.1 Multi-Level Quality Filtering

A.1.1 Visual Quality

The visual-usability filter described in Section 3.1 applies four pixel-level checks; an image is rejected if it fails any of them. Represent...

  • The text has acceptable English quality.
  • The content is human/clinical and relevant to patient-level medical interpretation.
  • The content is not primarily non-human, veterinary, animal-model, bench-only, or non-medical.
  • The text is informative enough for downstream medical-image QA.
  • If an image is provided, it is usable for visual medical QA.
  • The text contains reasoning signals that support a complex image-grounded QA pair. Reasoning signals include:
      • Causal: explains the cause or effect of a visual feature.
      • Comparative: contrasts the sub-figure with another condition, baseline, or group.
      • Methodological/Functional: explains how a mechanism works or why a visual pattern appears.
    FAIL if the text...

  • Perception. The trace first states the image-grounded observations needed to answer the question. This includes only clinically relevant visual evidence, such as the modality, anatomical region, visible abnormality, spatial pattern, morphology, signal, density, uptake, or microscopic appearance, depending on the image type.
  • Clinical interpretation and medical knowledge. The trace then explains how the perceptual evidence should be interpreted clinically or biomedically. This step uses relevant medical knowledge, together with the source context, to connect the observed findings to the diagnosis, mechanism, management decision, anatomical interpretation, risk assessment, or ...
  • Answer justification. The trace ends with a concise justification that explicitly links the key perceptual evidence, clinical interpretation, and relevant medical knowledge to the selected answer. This structure keeps the rationale focused on the path from image evidence, through clinically grounded medical knowledge, to the final answer, while avoiding un...
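
The three-part trace structure described above (perception, then clinical interpretation, then answer justification) could be represented with a small container like the one below. This is a hedged sketch: the class name, field names, and rendering format are all hypothetical and are not taken from the released dataset or code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReasoningTrace:
    """Hypothetical container for one three-step reasoning trace."""

    perception: str       # image-grounded observations
    interpretation: str   # clinical interpretation and medical knowledge
    justification: str    # link from evidence and knowledge to the answer

    def is_complete(self) -> bool:
        """True only if every step contains non-whitespace text."""
        parts = (self.perception, self.interpretation, self.justification)
        return all(part.strip() for part in parts)

    def render(self) -> str:
        """Join the steps in their fixed order, one labeled line per step."""
        steps = (
            ("Perception", self.perception),
            ("Interpretation", self.interpretation),
            ("Justification", self.justification),
        )
        return "\n".join(f"{label}: {text.strip()}" for label, text in steps)
```

Keeping the steps as separate fields, rather than one free-text string, would make it straightforward to check that no step is missing before a trace is used for training or scored per axis.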