M3CoTBench: Benchmark Chain-of-Thought of MLLMs in Medical Image Understanding

Jiangning Zhang; Jinsheng Bai; Juntao Jiang; Shuicheng Yan; Weiwei Jin; Weixuan Liu; Xiaobin Hu; Yali Bi; Yong Liu; Zhucun Xue

arxiv: 2601.08758 · v4 · pith:H7YKSN5Snew · submitted 2026-01-13 · eess.IV · cs.CV

Juntao Jiang, Jiangning Zhang, Yali Bi, Jinsheng Bai, Weixuan Liu, Weiwei Jin, Zhucun Xue, Yong Liu, Xiaobin Hu, Shuicheng Yan

keywords: reasoning, m3cotbench, medical, mllms, image, understanding, benchmark, chain-of-thought

Chain-of-Thought (CoT) reasoning has proven effective in enhancing large language models by encouraging step-by-step intermediate reasoning, and recent advances have extended this paradigm to Multimodal Large Language Models (MLLMs). In the medical domain, where diagnostic decisions depend on nuanced visual cues and sequential reasoning, CoT aligns naturally with clinical thinking processes. However, current benchmarks for medical image understanding generally focus on the final answer while ignoring the reasoning path. Such opaque reasoning offers no reliable basis for judgment, making it difficult to assist doctors in diagnosis. To address this gap, we introduce M3CoTBench, a new benchmark specifically designed to evaluate the correctness, efficiency, impact, and consistency of CoT reasoning in medical image understanding. M3CoTBench features 1) a diverse, multi-level difficulty dataset covering 24 examination types, 2) 13 tasks of varying difficulty, 3) a suite of CoT-specific evaluation metrics (correctness, efficiency, impact, and consistency) tailored to clinical reasoning, and 4) a performance analysis of multiple MLLMs. M3CoTBench systematically evaluates CoT reasoning across diverse medical imaging tasks, revealing current limitations of MLLMs in generating reliable and clinically interpretable reasoning, and aims to foster the development of transparent, trustworthy, and diagnostically accurate AI systems for healthcare. Project page at https://juntaojianggavin.github.io/projects/M3CoTBench/.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

MuteBench: Modality Unavailability Tolerance Evaluation for Incomplete Multimodal Fusion
cs.LG 2026-05 unverdicted novelty 7.0

MuteBench evaluates the robustness of multimodal fusion to missing modalities and to missing data within a modality on 125,000 samples from 9 clinical datasets, finding that architecture family predicts tolerance better than parameter count.

OpenMedReason: Scientific Reasoning Supervision for Medical Vision-Language Models
cs.CV 2026-06 unverdicted novelty 6.0

OpenMedReason supplies a large open corpus of multimodal medical reasoning examples extracted from scientific articles, paired with a benchmark that measures perception, knowledge, and rationale quality, yielding 20% ...