pith. machine review for the scientific record. sign in

arxiv: 2605.09384 · v1 · submitted 2026-05-10 · 💻 cs.CV · cs.AI· q-bio.QM

Recognition: 2 theorem links

· Lean Theorem

LiteMedCoT-VL: Parameter-Efficient Adaptation for Medical Visual Question Answering

Authors on Pith no claims yet

Pith reviewed 2026-05-12 03:35 UTC · model grok-4.3

classification 💻 cs.CV cs.AIq-bio.QM
keywords medical visual question answeringchain-of-thought reasoningparameter-efficient adaptationvision-language modelsknowledge distillationLoRA fine-tuningPMC-VQA benchmarkmedical AI deployment
0
0 comments X

The pith

A 2B vision-language model with distilled chain-of-thought reasoning outperforms larger models on medical visual question answering.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to show that compact vision-language models can acquire the multi-step reasoning needed for medical visual question answering by learning from a much larger model's explanations. This would matter because it allows powerful medical AI to run on portable devices with limited computing power while providing interpretable answers. The approach creates training data enriched with chain-of-thought steps and uses efficient adaptation to teach the small model without changing its size. Experiments indicate this closes much of the performance gap to bigger models and avoids shortcuts based on text alone.

Core claim

LiteMedCoT-VL is a pipeline that transfers chain-of-thought reasoning from a 235B teacher model to 2B student models by fine-tuning on explanation-enriched data using LoRA. All inference happens without image captions to simulate direct clinical interpretation. On the PMC-VQA benchmark the method reaches 64.9 percent accuracy, 11 points above the zero-shot 4B baseline and ahead of other published systems. This shows that reasoning distillation lets small models match or beat larger ones on medical tasks.

What carries the argument

The chain-of-thought distillation pipeline that enriches data with teacher explanations and applies parameter-efficient adaptation to the student model.

Load-bearing premise

The chain-of-thought explanations from the large teacher model are accurate enough and transfer cleanly to the small model during fine-tuning without adding errors or missing key medical insights.

What would settle it

The adapted 2B model would fail to show improved accuracy or would produce inconsistent reasoning chains compared to the teacher on held-out medical images.

Figures

Figures reproduced from arXiv: 2605.09384 by Caizhi Liao, Guo Liu, Haonan Lyu, Runze Ma, Shunbo Jia.

Figure 1
Figure 1. Figure 1: Overview of the LiteMedCoT-VL pipeline. Training samples are processed through [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Answer label distribution in the PMC-VQA training and test sets. Options B and C [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Accuracy of evaluated models on the PMC-VQA test set. The horizontal dashed line [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Image ablation results across all baseline models. Removing images causes substantial [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Per-category accuracy on the PMC-VQA test set. Sample sizes per category are shown [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
read the original abstract

The reasoning gap between large and compact vision-language models (VLMs) limits the deployment of medical AI on portable clinical devices. Compact VLMs of 2--4B parameters can run on resource-constrained hardware but lack the multi-step reasoning capacity needed for interpretable clinical decision support. Existing knowledge distillation methods transfer answers without the reasoning process behind them. Medical visual question answering (VQA) serves as a testbed for this problem, as it requires models to integrate visual evidence with clinical knowledge through structured reasoning chains. We introduce LiteMedCoT-VL, a pipeline that transfers chain-of-thought reasoning from a 235B teacher model to 2B student models through LoRA-based fine-tuning on explanation-enriched training data. All inference is conducted without image captions by default, simulating the clinical scenario in which a physician interprets a medical image directly without an accompanying radiology report. On the PMC-VQA benchmark, LiteMedCoT-VL achieves 64.9% accuracy, exceeding the zero-shot Qwen3-VL-4B baseline of 53.9% by 11.0 percentage points and outperforming all published baselines. This result indicates that a 2B model with reasoning distillation can match or exceed models with twice the parameters. Visual grounding analysis shows that the model relies on image content rather than exploiting textual priors. Our code is publicly available at https://anonymous.4open.science/r/LiteMedCoT-VL.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces LiteMedCoT-VL, a pipeline for parameter-efficient adaptation of 2B-parameter vision-language models to medical visual question answering. It distills chain-of-thought reasoning from a 235B teacher model via LoRA fine-tuning on explanation-enriched VQA data, with all inference performed without image captions. On the PMC-VQA benchmark the method reports 64.9% accuracy, an 11-point gain over the zero-shot Qwen3-VL-4B baseline of 53.9% and outperformance of all published baselines. The authors conclude that reasoning distillation enables compact models to match or exceed larger models and release code publicly.

Significance. If the reported gains are shown to stem specifically from CoT transfer rather than standard supervised fine-tuning, the work would be significant for enabling interpretable medical VQA on resource-constrained hardware. The public code release is a clear strength that supports reproducibility and further research.

major comments (2)
  1. [Experiments] Experiments section: the central claim that the 11pp accuracy lift (64.9% vs. 53.9%) results from chain-of-thought reasoning distillation is not supported by any ablation that compares the full explanation-enriched training set against an otherwise identical LoRA fine-tuning run using only question-answer pairs. Without this control the necessity of the 235B teacher, the CoT component, and the specific pipeline remains unestablished.
  2. [Methods] Methods section: the manuscript provides no description of the teacher prompt engineering used to generate the CoT explanations, no statistical significance tests on the benchmark results, and no explicit controls or analysis for data leakage between the teacher-generated data and the PMC-VQA test set. These omissions directly affect the reliability of the transfer claim.
minor comments (2)
  1. [Abstract / Experiments] The abstract and results text should explicitly state the exact number of training examples and the LoRA hyperparameters (rank, alpha, dropout) used in the reported runs.
  2. [Results] Figure captions and the visual-grounding analysis paragraph would benefit from clearer description of how attention or grounding metrics were computed and what baseline they are compared against.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments on our manuscript. We have reviewed each point carefully and provide the following point-by-point responses. We agree that additional controls and details are needed to strengthen the claims and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [Experiments] Experiments section: the central claim that the 11pp accuracy lift (64.9% vs. 53.9%) results from chain-of-thought reasoning distillation is not supported by any ablation that compares the full explanation-enriched training set against an otherwise identical LoRA fine-tuning run using only question-answer pairs. Without this control the necessity of the 235B teacher, the CoT component, and the specific pipeline remains unestablished.

    Authors: We agree that the current experiments do not isolate the contribution of the CoT explanations from standard supervised fine-tuning on QA pairs. In the revised manuscript we will add a direct ablation: an otherwise identical LoRA fine-tuning run on the same 2B model using only question-answer pairs (no explanations). The accuracy difference between this control and the full explanation-enriched run will be reported to substantiate the specific benefit of reasoning distillation. revision: yes

  2. Referee: [Methods] Methods section: the manuscript provides no description of the teacher prompt engineering used to generate the CoT explanations, no statistical significance tests on the benchmark results, and no explicit controls or analysis for data leakage between the teacher-generated data and the PMC-VQA test set. These omissions directly affect the reliability of the transfer claim.

    Authors: We will expand the Methods section with the exact prompts used to elicit CoT explanations from the 235B teacher. We will also add statistical significance testing (e.g., bootstrap confidence intervals or McNemar’s test) for the reported accuracy gains. For data leakage, we will include an explicit analysis checking for any overlap between the teacher-generated training instances and the PMC-VQA test set, together with a description of the generation protocol that avoids using test data. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark results are self-contained

full rationale

The paper describes a LoRA-based distillation pipeline that enriches VQA training data with chain-of-thought explanations from a 235B teacher and reports measured accuracy (64.9%) on the external PMC-VQA benchmark against zero-shot baselines. No equations, parameter fits, or uniqueness theorems are presented; the central claim is an empirical comparison that does not reduce to any input by construction. No self-citations are invoked as load-bearing premises, and the method is described without renaming known results or smuggling ansatzes. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

Review is limited to the abstract; the ledger therefore lists only the high-level assumptions visible from the summary. No invented entities are introduced. Free parameters are the usual LoRA and training choices not enumerated here.

free parameters (1)
  • LoRA rank, alpha, and dropout
    Standard hyperparameters in LoRA fine-tuning whose specific values are not stated in the abstract but are required to reproduce the adaptation.
axioms (1)
  • domain assumption Chain-of-thought explanations generated by the 235B teacher are sufficiently accurate and generalizable to serve as training targets for the student.
    This premise underpins the entire distillation pipeline described in the abstract.

pith-pipeline@v0.9.0 · 5583 in / 1516 out tokens · 53494 ms · 2026-05-12T03:35:32.732587+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

59 extracted references · 59 canonical work pages · 12 internal anchors

  1. [1]

    S Kevin Zhou, Hayit Greenspan, Christos Davatzikos, James S Duncan, Bram Van Ginneken, Anant Madabhushi, Jerry L Prince, Daniel Rueckert, and Ronald M Summers. A review of deep learning in medical imaging: Imaging traits, technology trends, case studies with progress highlights, and future promises.Proceedings of the IEEE, 109(5):820–838, 2021

  2. [2]

    Deep learning models in medical image analysis.Journal of Oral Bio- sciences, 64(3):312–320, 2022

    Masayuki Tsuneki. Deep learning models in medical image analysis.Journal of Oral Bio- sciences, 64(3):312–320, 2022

  3. [3]

    Medical image segmentation: A comprehensive review of deep learning-based methods.To- mography, 11(5):52, 2025

    Yuxiao Gao, Yang Jiang, Yanhong Peng, Fujiang Yuan, Xinyue Zhang, and Jianfeng Wang. Medical image segmentation: A comprehensive review of deep learning-based methods.To- mography, 11(5):52, 2025

  4. [4]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agar- wal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, pages 8748–8763. PMLR, 2021

  5. [5]

    Scaling up visual and vision-language representation learning with noisy text supervision

    Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun- Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. InInternational conference on machine learning, pages 4904–4916. PMLR, 2021

  6. [6]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context.arXiv preprint arXiv:2403.05530, 2024

  7. [7]

    Vision-language models for vision tasks: A survey.IEEE transactions on pattern analysis and machine intelligence, 46(8):5625– 5644, 2024

    Jingyi Zhang, Jiaxing Huang, Sheng Jin, and Shijian Lu. Vision-language models for vision tasks: A survey.IEEE transactions on pattern analysis and machine intelligence, 46(8):5625– 5644, 2024

  8. [8]

    Large language model for medical images: A survey of taxonomy, systematic review, and future trends.Big Data Mining and Analytics, 8(2):496, 2025

    Peng Wang, Wenpeng Lu, Chunlin Lu, Ruoxi Zhou, Min Li, and Libo Qin. Large language model for medical images: A survey of taxonomy, systematic review, and future trends.Big Data Mining and Analytics, 8(2):496, 2025

  9. [9]

    Visual–language foundation models in medicine.The Visual Computer, 41(4):2953–2972, 2025

    Chunyu Liu, Yixiao Jin, Zhouyu Guan, Tingyao Li, Yiming Qin, Bo Qian, Zehua Jiang, Yi- lan Wu, Xiangning Wang, Ying Feng Zheng, et al. Visual–language foundation models in medicine.The Visual Computer, 41(4):2953–2972, 2025

  10. [10]

    Large language models for diabetes care: Potentials and prospects.Science bulletin, 69(5):583–588, 2024

    Bin Sheng, Zhouyu Guan, Lee-Ling Lim, Zehua Jiang, Nestoras Mathioudakis, Jiajia Li, Ruhan Liu, Yuqian Bao, Yong Mong Bee, Ya-Xing Wang, et al. Large language models for diabetes care: Potentials and prospects.Science bulletin, 69(5):583–588, 2024

  11. [11]

    Knowledge distillation: A survey.International journal of computer vision, 129(6):1789–1819, 2021

    Jianping Gou, Baosheng Yu, Stephen J Maybank, and Dacheng Tao. Knowledge distillation: A survey.International journal of computer vision, 129(6):1789–1819, 2021

  12. [12]

    Huanru Henry Mao

    Amir M Mansourian, Rozhan Ahmadi, Masoud Ghafouri, Amir Mohammad Babaei, Ela- heh Badali Golezani, Zeynab Yasamani Ghamchi, Vida Ramezanian, Alireza Taherian, Kimia Dinashi, Amirali Miri, et al. A comprehensive survey on knowledge distillation.arXiv preprint arXiv:2503.12067, 2025

  13. [13]

    Step-cot: Stepwise visual chain-of-thought for medical visual question answering.arXiv preprint arXiv:2603.13878, 2026

    Lin Fan, Yafei Ou, Zhipeng Deng, Pengyu Dai, Hou Chongxian, Jiale Yan, Yaqian Li, Kai- wen Long, Xun Gong, Masayuki Ikebe, et al. Step-cot: Stepwise visual chain-of-thought for medical visual question answering.arXiv preprint arXiv:2603.13878, 2026

  14. [14]

    CheXthought: A global multimodal dataset of clinical chain-of-thought reasoning and visual attention for chest X-ray interpretation

    Sonali Sharma, Jin Long, George Shih, Sarah Eid, Christian Bluethgen, Francine L Jacob- son, Emily B Tsai, Ahmed M Alaa, Curtis P Langlotz, Global Radiology Consortium, et al. Chexthought: A global multimodal dataset of clinical chain-of-thought reasoning and visual attention for chest x-ray interpretation.arXiv preprint arXiv:2604.26288, 2026

  15. [15]

    Improving Medical VQA through Trajectory-Aware Process Supervision

    Halil Ibrahim Gulluk and Olivier Gevaert. Improving medical vqa through trajectory-aware process supervision.arXiv preprint arXiv:2605.04064, 2026. 14

  16. [16]

    Care: Towards clinical ac- countability in multi-modal medical reasoning with an evidence-grounded agentic framework

    Yuexi Du, Jinglu Wang, Shujie Liu, Nicha C Dvornek, and Yan Lu. Care: Towards clinical ac- countability in multi-modal medical reasoning with an evidence-grounded agentic framework. arXiv preprint arXiv:2603.01607, 2026

  17. [17]

    Elicit and enhance: Advancing multimodal reasoning in medical scenarios.arXiv preprint arXiv:2505.23118, 2025

    Zhongzhen Huang, Linjie Mu, Yakun Zhu, Xiangyu Zhao, Shaoting Zhang, and Xiaofan Zhang. Elicit and enhance: Advancing multimodal reasoning in medical scenarios.arXiv preprint arXiv:2505.23118, 2025

  18. [18]

    Pmc-vqa: Visual instruction tuning for medical visual question answering.arXiv preprint arXiv:2305.10415, 2023

    Xiaoman Zhang, Chaoyi Wu, Ziheng Zhao, Weixiong Lin, Ya Zhang, Yanfeng Wang, and Weidi Xie. Pmc-vqa: Visual instruction tuning for medical visual question answering.arXiv preprint arXiv:2305.10415, 2023

  19. [19]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

  20. [20]

    Image thresholding approaches for medi- cal image segmentation-short literature review.Procedia Computer Science, 219:1485–1492, 2023

    Sandra Jardim, João António, and Carlos Mora. Image thresholding approaches for medi- cal image segmentation-short literature review.Procedia Computer Science, 219:1485–1492, 2023

  21. [21]

    Handcrafted vs

    Mohammad R Salmanpour, Somayeh Sadat Mehrnia, Sajad Jabarzadeh Ghandilu, Zhino Safahi, Sonya Falahati, Shahram Taeb, Ghazal Mousavi, Mehdi Maghsudi, Ahmad Shariftabrizi, Ilker Hacihaliloglu, et al. Handcrafted vs. deep radiomics vs. fusion vs. deep learning: A comprehensive review of machine learning-based cancer outcome prediction in pet and spect imagin...

  22. [22]

    A review of convolutional neural network based methods for medical image classification.Computers in biology and medicine, 185:109507, 2025

    Chao Chen, Nor Ashidi Mat Isa, and Xin Liu. A review of convolutional neural network based methods for medical image classification.Computers in biology and medicine, 185:109507, 2025

  23. [23]

    Deep learning-based object detection algorithms in medical imaging: Systematic review.Heliyon, 11(1), 2025

    Carina Albuquerque, Roberto Henriques, and Mauro Castelli. Deep learning-based object detection algorithms in medical imaging: Systematic review.Heliyon, 11(1), 2025

  24. [24]

    Deep convolutional neural networks in medical image analysis: A review.Information, 16(3):195, 2025

    Ibomoiye Domor Mienye, Theo G Swart, George Obaido, Matt Jordan, and Philip Ilono. Deep convolutional neural networks in medical image analysis: A review.Information, 16(3):195, 2025

  25. [25]

    TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation

    Jieneng Chen, Yongyi Lu, Qihang Yu, Xiangde Luo, Ehsan Adeli, Yan Wang, Le Lu, Alan L Yuille, and Yuyin Zhou. Transunet: Transformers make strong encoders for medical image segmentation.arXiv preprint arXiv:2102.04306, 2021

  26. [26]

    Swin-unet: Unet-like pure transformer for medical image segmentation

    Hu Cao, Yueyue Wang, Joy Chen, Dongsheng Jiang, Xiaopeng Zhang, Qi Tian, and Manning Wang. Swin-unet: Unet-like pure transformer for medical image segmentation. InEuropean conference on computer vision, pages 205–218. Springer, 2022

  27. [27]

    Transformers in medical image analysis.Intelligent Medicine, 3(1):59–78, 2023

    Kelei He, Chen Gan, Zhuoyuan Li, Islem Rekik, Zihao Yin, Wen Ji, Yang Gao, Qian Wang, Junfeng Zhang, and Dinggang Shen. Transformers in medical image analysis.Intelligent Medicine, 3(1):59–78, 2023

  28. [28]

    Generative models in medical visual question answering: A survey.Applied Sciences, 15(6):2983, 2025

    Wenjie Dong, Shuhao Shen, Yuqiang Han, Tao Tan, Jian Wu, and Hongxia Xu. Generative models in medical visual question answering: A survey.Applied Sciences, 15(6):2983, 2025

  29. [29]

    Blip-2: Bootstrapping language- image pre-training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language- image pre-training with frozen image encoders and large language models. InInternational conference on machine learning, pages 19730–19742. PMLR, 2023

  30. [30]

    Flava: A foundational language and vision alignment model

    Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, and Douwe Kiela. Flava: A foundational language and vision alignment model. InProceedings of the IEEE/CVF conference on computer vision and pattern recogni- tion, pages 15638–15650, 2022

  31. [31]

    Llava-med: Training a large language-and-vision assistant for biomedicine in one day.Advances in Neural Information Processing Systems, 36: 28541–28564, 2023

    Chunyuan Li, Cliff Wong, Sheng Zhang, Naoto Usuyama, Haotian Liu, Jianwei Yang, Tristan Naumann, Hoifung Poon, and Jianfeng Gao. Llava-med: Training a large language-and-vision assistant for biomedicine in one day.Advances in Neural Information Processing Systems, 36: 28541–28564, 2023. 15

  32. [32]

    Slake: A semantically- labeled knowledge-enhanced dataset for medical visual question answering

    Bo Liu, Li-Ming Zhan, Li Xu, Lin Ma, Yan Yang, and Xiao-Ming Wu. Slake: A semantically- labeled knowledge-enhanced dataset for medical visual question answering. In2021 IEEE 18th international symposium on biomedical imaging (ISBI), pages 1650–1654. IEEE, 2021

  33. [33]

    PathVQA: 30000+ Questions for Medical Visual Question Answering

    Xuehai He, Yichen Zhang, Luntian Mou, Eric Xing, and Pengtao Xie. Pathvqa: 30000+ questions for medical visual question answering.arXiv preprint arXiv:2003.10286, 2020

  34. [34]

    Overview of the vqa-med task at imageclef 2021: Visual question answering and generation in the medical domain

    Asma Ben Abacha, Mourad Sarrouti, Dina Demner-Fushman, Sadid A Hasan, and Henning Müller. Overview of the vqa-med task at imageclef 2021: Visual question answering and generation in the medical domain. InProceedings of the CLEF 2021 Conference and Labs of the Evaluation Forum-working notes. 21-24 September 2021, 2021

  35. [35]

    Knowledge distillation based on transformed teacher matching

    Kaixiang Zheng and En-Hui Yang. Knowledge distillation based on transformed teacher matching.arXiv preprint arXiv:2402.11148, 2024

  36. [36]

    Student-friendly knowledge distillation

    Mengyang Yuan, Bo Lang, and Fengnan Quan. Student-friendly knowledge distillation. Knowledge-Based Systems, 296:111915, 2024

  37. [37]

    Robust multi- modal fusion architecture for medical data with knowledge distillation.Computer methods and programs in biomedicine, 260:108568, 2025

    Muyu Wang, Shiyu Fan, Yichen Li, Binyu Gao, Zhongrang Xie, and Hui Chen. Robust multi- modal fusion architecture for medical data with knowledge distillation.Computer methods and programs in biomedicine, 260:108568, 2025

  38. [38]

    Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

    Marah Abdin, Jyoti Aneja, Hany Awadalla, Ahmed Awadallah, Ammar Ahmad Awan, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Jianfeng Bao, Harkirat Behl, et al. Phi-3 technical report: A highly capable language model locally on your phone.arXiv preprint arXiv:2404.14219, 2024

  39. [39]

    How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites.Science China Information Sciences, 67(12):220101, 2024

    Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites.Science China Information Sciences, 67(12):220101, 2024

  40. [40]

    SmolVLM: Redefining small and efficient multimodal models

    Andrés Marafioti, Orr Zohar, Miquel Farré, Merve Noyan, Elie Bakouch, Pedro Cuenca, Cyril Zakka, Loubna Ben Allal, Anton Lozhkov, Nouamane Tazi, et al. Smolvlm: Redefining small and efficient multimodal models.arXiv preprint arXiv:2504.05299, 2025

  41. [41]

    Comt: Chain-of-medical-thought reduces hallucination in medical report gener- ation

    Yue Jiang, Jiawei Chen, Dingkang Yang, Mingcheng Li, Shunli Wang, Tong Wu, Ke Li, and Lihua Zhang. Comt: Chain-of-medical-thought reduces hallucination in medical report gener- ation. InICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2025

  42. [42]

    Large language models are reasoning teach- ers

    Namgyu Ho, Laura Schmid, and Se-Young Yun. Large language models are reasoning teach- ers. InProceedings of the 61st annual meeting of the association for computational linguistics (volume 1: long papers), pages 14852–14882, 2023

  43. [43]

    Self-Consistency Improves Chain of Thought Reasoning in Language Models

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in lan- guage models.arXiv preprint arXiv:2203.11171, 2022

  44. [44]

    Lora: Low-rank adaptation of large language models.Iclr, 1(2):3, 2022

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models.Iclr, 1(2):3, 2022

  45. [45]

    Parameter-efficient transfer learning for nlp

    Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for nlp. InInternational conference on machine learning, pages 2790–2799. PMLR, 2019

  46. [46]

    Prefix-tuning: Optimizing continuous prompts for generation

    Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. InProceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4582–4597, 2021. 16

  47. [47]

    Unipelt: A unified framework for parameter-efficient language model tuning

    Yuning Mao, Lambert Mathias, Rui Hou, Amjad Almahairi, Hao Ma, Jiawei Han, Scott Yih, and Madian Khabsa. Unipelt: A unified framework for parameter-efficient language model tuning. InProceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6253–6264, 2022

  48. [48]

    A survey on lora of large language models.Frontiers of Computer Science, 19(7):197605, 2025

    Yuren Mao, Yuhang Ge, Yijiang Fan, Wenyi Xu, Yu Mi, Zhonghao Hu, and Yunjun Gao. A survey on lora of large language models.Frontiers of Computer Science, 19(7):197605, 2025

  49. [49]

    Visual-language model fine-tuning via lora for structed medical reports generating for lung x-ray skans

    Egor V olkov, Vadim Sechin, and Alexey Averkin. Visual-language model fine-tuning via lora for structed medical reports generating for lung x-ray skans. In2025 XXVIII International Conference on Soft Computing and Measurements (SCM), pages 438–442. IEEE, 2025

  50. [50]

    Mirage the illusion of visual understanding.arXiv preprint arXiv:2603.21687, 2026

    Mohammad Asadi, Jack W O’Sullivan, Fang Cao, Tahoura Nedaee, Kamyar Fardi, Fei-Fei Li, Ehsan Adeli, and Euan Ashley. Mirage the illusion of visual understanding.arXiv preprint arXiv:2603.21687, 2026

  51. [51]

    MedVR: Annotation-Free Medical Visual Reasoning via Agentic Reinforcement Learning

    Zheng Jiang, Heng Guo, Chengyu Fang, Changchen Xiao, Xinyang Hu, Lifeng Sun, and Min- feng Xu. Medvr: Annotation-free medical visual reasoning via agentic reinforcement learning. arXiv preprint arXiv:2604.08203, 2026

  52. [52]

    arXiv preprint arXiv:2510.10052 (2025)

    Kaitao Chen, Shaohao Rui, Yankai Jiang, Jiamin Wu, Qihao Zheng, Chunfeng Song, Xiaosong Wang, Mu Zhou, and Mianxin Liu. Think twice to see more: Iterative visual reasoning in medical vlms.arXiv preprint arXiv:2510.10052, 2025

  53. [53]

    Dual Causal Inference: Integrating Backdoor Adjustment and Instrumental Variable Learning for Medical VQA

    Zibo Xu, Qiang Li, Ke Lu, Jin Wang, Weizhi Nie, and Yuting Su. Dual causal inference: Integrating backdoor adjustment and instrumental variable learning for medical vqa.arXiv preprint arXiv:2604.20306, 2026

  54. [54]

    Beyond accuracy: Evaluating visual grounding in multimodal medical reasoning.arXiv preprint arXiv:2603.03437, 2026

    Anas Zafar, Leema Krishna Murali, and Ashish Vashist. Beyond accuracy: Evaluating visual grounding in multimodal medical reasoning.arXiv preprint arXiv:2603.03437, 2026

  55. [55]

    Hallucination benchmark in medical visual question answering.arXiv preprint arXiv:2401.05827, 2024

    Jinge Wu, Yunsoo Kim, and Honghan Wu. Hallucination benchmark in medical visual question answering.arXiv preprint arXiv:2401.05827, 2024

  56. [56]

    The impact of image resolution on biomedical multimodal large language models.arXiv preprint arXiv:2510.18304, 2025

    Liangyu Chen, James Burgess, Jeffrey J Nirschl, Orr Zohar, and Serena Yeung-Levy. The impact of image resolution on biomedical multimodal large language models.arXiv preprint arXiv:2510.18304, 2025

  57. [57]

    arXiv preprint arXiv:2505.16964 (2025)

    Suhao Yu, Haojin Wang, Juncheng Wu, Luyang Luo, Jingshen Wang, Cihang Xie, Pranav Rajpurkar, Carl Yang, Yang Yang, Kang Wang, et al. Medframeqa: A multi-image medical vqa benchmark for clinical reasoning.arXiv preprint arXiv:2505.16964, 2025

  58. [58]

    OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models

    Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, et al. Openflamingo: An open- source framework for training large autoregressive vision-language models.arXiv preprint arXiv:2308.01390, 2023

  59. [59]

    A dataset of clinically generated visual questions and answers about radiology images.Scientific data, 5 (1):180251, 2018

    Jason J Lau, Soumya Gayen, Asma Ben Abacha, and Dina Demner-Fushman. A dataset of clinically generated visual questions and answers about radiology images.Scientific data, 5 (1):180251, 2018. 17