Recognition: no theorem link
Better Eyes, Better Thoughts: Why Vision Chain-of-Thought Fails in Medicine
Pith reviewed 2026-05-15 18:29 UTC · model grok-4.3
The pith
In medical visual question answering, chain-of-thought prompting often reduces accuracy compared to direct answers because it amplifies early visual perception errors.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
On medical visual question answering, chain-of-thought (CoT) prompting frequently underperforms direct answering (DirA) across both general-purpose and medical-specific models. The gap is attributed to a medical perception bottleneck: subtle, domain-specific cues weaken visual grounding, and CoT compounds early perceptual uncertainty rather than correcting it. Two training-free interventions, perception anchoring and description grounding, improve accuracy and mitigate the degradation.
What carries the argument
A medical perception bottleneck that weakens visual grounding and causes chain-of-thought reasoning to compound early perceptual errors rather than correct them.
Load-bearing premise
The observed performance gap is driven primarily by compounding perceptual uncertainty from the medical perception bottleneck rather than by variations in prompting, model size, or dataset artifacts.
What would settle it
Measure CoT versus DirA accuracy after supplying near-perfect initial visual perception through oracle cues (e.g., ground-truth regions of interest or expert image descriptions). If the CoT–DirA gap shrinks or disappears under oracle perception, the perception bottleneck is the main cause; if it persists, the degradation lies in the reasoning process itself.
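A minimal sketch of how such an oracle test could be run is below. The query_model helper, the dataset fields (roi_description, label), and the prompt wording are illustrative assumptions, not the paper's evaluation code.

```python
# Hypothetical oracle-perception test: compare CoT vs. direct answering
# with and without ground-truth region cues. All names are placeholders.

def oracle_gap(query_model, dataset):
    """Return accuracy per condition; dataset items need question, image,
    roi_description (oracle cue), and label fields."""
    conditions = {
        "dira": lambda ex: f"{ex['question']} Answer directly.",
        "cot": lambda ex: f"{ex['question']} Think step by step, then answer.",
        "dira_oracle": lambda ex: (f"Relevant region: {ex['roi_description']}. "
                                   f"{ex['question']} Answer directly."),
        "cot_oracle": lambda ex: (f"Relevant region: {ex['roi_description']}. "
                                  f"{ex['question']} Think step by step, then answer."),
    }
    hits = {name: 0 for name in conditions}
    for ex in dataset:
        for name, build in conditions.items():
            answer = query_model(image=ex["image"], prompt=build(ex))
            hits[name] += int(answer.strip().lower() == ex["label"].lower())
    return {name: count / len(dataset) for name, count in hits.items()}

# If the perception-bottleneck hypothesis holds, the CoT-vs-DirA gap should
# shrink or reverse in the *_oracle conditions.
```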
Original abstract
Large vision-language models (VLMs) often benefit from chain-of-thought (CoT) prompting in general domains, yet its efficacy in medical vision-language tasks remains underexplored. We report a counter-intuitive trend: on medical visual question answering, CoT frequently underperforms direct answering (DirA) across general-purpose and medical-specific models. We attribute this to a medical perception bottleneck: subtle, domain-specific cues can weaken visual grounding, and CoT may compound early perceptual uncertainty rather than correct it. To probe this hypothesis, we introduce two training-free, inference-time grounding interventions: (i) perception anchoring via region-of-interest cues and (ii) description grounding via high-quality textual guidance. Across multiple benchmarks and model families, these interventions improve accuracy, mitigate CoT degradation, and in several settings reverse the CoT–DirA inversion. Our findings suggest that reliable clinical VLMs require robust visual grounding and cross-modal alignment, beyond extending text-driven reasoning chains. Code is available at https://github.com/TianYin123/Better_Eyes_Better_Thoughts.
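As a rough illustration of how the two interventions enter the prompt at inference time, the sketch below composes a medical VQA prompt with optional region-of-interest cues and textual descriptions. The cue format and wording are assumptions; the authors' exact prompts may differ.

```python
# Illustrative prompt construction for the two training-free interventions.
# Not the paper's code: wording and cue format are placeholders.

def build_prompt(question, roi_cue=None, description=None, use_cot=False):
    """Compose a medical VQA prompt with optional grounding interventions."""
    parts = []
    if roi_cue is not None:
        # Perception anchoring: point the model at the relevant region first.
        parts.append(f"Focus on the region: {roi_cue}.")
    if description is not None:
        # Description grounding: condition on a high-quality textual description.
        parts.append(f"Image description: {description}")
    parts.append(question)
    parts.append("Think step by step, then give the final answer."
                 if use_cot else "Answer with a single word or phrase.")
    return "\n".join(parts)

# Example: CoT with perception anchoring only.
prompt = build_prompt(
    "Is there evidence of pleural effusion?",
    roi_cue="left costophrenic angle",
    use_cot=True,
)
```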
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript reports that chain-of-thought (CoT) prompting underperforms direct answering (DirA) on medical visual question answering tasks across general-purpose and medical-specific vision-language models. The authors attribute the gap to a 'medical perception bottleneck' in which subtle domain-specific visual cues weaken grounding and CoT compounds early perceptual errors. They introduce two training-free interventions—perception anchoring via region-of-interest cues and description grounding via high-quality textual descriptions—and show that these interventions raise accuracy, reduce CoT degradation, and in several settings reverse the CoT–DirA ordering. Experiments span multiple benchmarks and model families; code is released.
Significance. If the core empirical pattern holds, the work usefully documents a domain-specific limitation of text-centric reasoning techniques in medical imaging and supplies simple, inference-time fixes that improve reliability. The breadth of models and benchmarks, together with the open code, makes the findings actionable for clinical VLM development and provides a concrete baseline for future grounding research.
major comments (2)
- [§4] §4 (Results): the reported CoT < DirA gap is not accompanied by prompt-length or output-token controls; because the interventions simultaneously alter visual focus, prompt length, and textual conditioning, the data do not isolate whether the original degradation arises from perceptual compounding or from generic prompt-complexity effects.
- [§3.2] §3.2 (Interventions): no ablation matches total prompt length or isolates the contribution of ROI cues versus added textual descriptions; without such controls the claim that the interventions specifically repair a perception bottleneck remains correlational rather than causal.
minor comments (2)
- [Abstract] Abstract: the term 'CoT–DirA inversion' is used without a brief parenthetical definition; a short clarification would aid readers outside the immediate subfield.
- [Figures] Figure captions (throughout): axis labels and legend entries should explicitly state whether accuracy is reported as mean ± std or as single-run values.
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive comments. The concerns about prompt-length controls and ablation design are valid and point to ways we can make the causal claims more rigorous. We will incorporate the suggested controls and ablations in the revised manuscript, which we believe will strengthen rather than undermine the core finding of a medical perception bottleneck.
Point-by-point responses
-
Referee: [§4] §4 (Results): the reported CoT < DirA gap is not accompanied by prompt-length or output-token controls; because the interventions simultaneously alter visual focus, prompt length, and textual conditioning, the data do not isolate whether the original degradation arises from perceptual compounding or from generic prompt-complexity effects.
Authors: We agree that prompt length and token count are potential confounds. In the revision we will add a new set of length-controlled experiments: for each model and benchmark we will create prompt variants that match the token length of the CoT condition (via neutral filler text or rephrasing) while preserving the direct-answer structure, and we will report both input and output token statistics across all conditions. These controls will allow us to quantify how much of the original CoT–DirA gap persists after length equalization and thereby better isolate perceptual compounding from generic complexity effects. revision: yes
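A minimal sketch of the proposed length-equalization control, assuming a Hugging Face tokenizer and the neutral-filler variant (the checkpoint name, filler text, and example prompts are placeholders, not the authors' materials):

```python
# Build a direct-answer prompt whose token count approximately matches the
# CoT prompt for the same example, so prompt length is no longer a confound.
from transformers import AutoTokenizer

FILLER = " Please read the question carefully before responding."

def length_matched_dira(dira_prompt, cot_prompt, tokenizer):
    """Pad the DirA prompt with neutral filler until it reaches (at least)
    the CoT prompt's token count."""
    target = len(tokenizer.encode(cot_prompt))
    padded = dira_prompt
    while len(tokenizer.encode(padded)) < target:
        padded += FILLER
    return padded

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")
dira = "Is the lesion benign or malignant? Answer directly."
cot = ("Is the lesion benign or malignant? Describe the relevant findings, "
       "reason step by step, then state the final answer.")
matched = length_matched_dira(dira, cot, tokenizer)
print(len(tokenizer.encode(matched)), len(tokenizer.encode(cot)))
```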
-
Referee: [§3.2] §3.2 (Interventions): no ablation matches total prompt length or isolates the contribution of ROI cues versus added textual descriptions; without such controls the claim that the interventions specifically repair a perception bottleneck remains correlational rather than causal.
Authors: We accept that the current intervention results are correlational with respect to the individual factors. In the revised version we will add three new ablation arms per benchmark: (i) ROI cues paired with length-matched neutral text, (ii) high-quality textual descriptions without ROI cues, and (iii) length-matched prompts containing neither intervention. By comparing these conditions we will be able to separate the contribution of visual anchoring from that of added textual conditioning and from prompt length, thereby providing a more direct test of whether the interventions repair the hypothesized perception bottleneck. revision: yes
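The three proposed arms plus the original full intervention form a small condition grid. A sketch of how it could be enumerated is below; the condition names and the run_condition callable are placeholders rather than the paper's code.

```python
# Ablation grid implied by the rebuttal: separate ROI anchoring, textual
# description, and prompt length. Flags are placeholders for illustration.
ABLATION_ARMS = {
    "roi_plus_filler":   dict(roi=True,  description=False, filler=True),   # arm (i)
    "description_only":  dict(roi=False, description=True,  filler=False),  # arm (ii)
    "filler_only":       dict(roi=False, description=False, filler=True),   # arm (iii)
    "full_intervention": dict(roi=True,  description=True,  filler=False),  # original
}

def run_ablation(run_condition, benchmark):
    """run_condition(benchmark, **flags) -> accuracy; returns one value per arm."""
    return {name: run_condition(benchmark, **flags)
            for name, flags in ABLATION_ARMS.items()}
```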
Circularity Check
No significant circularity; purely empirical evaluation
Full rationale
The paper reports benchmark results comparing CoT vs. direct answering on medical VQA tasks across models, attributes the gap to a perceptual bottleneck hypothesis, and evaluates two training-free interventions (ROI anchoring and description grounding). No equations, fitted parameters, or derivations appear; the central claims rest on observed accuracy deltas and intervention effects rather than any self-definitional reduction or self-citation chain. The work is self-contained against external benchmarks with no load-bearing steps that collapse to inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Existing medical VQA benchmarks are valid proxies for clinical visual reasoning tasks.
invented entities (1)
- medical perception bottleneck: no independent evidence
Reference graph
Works this paper leans on
- [1]
- [2] Bai, S., Cai, Y., Chen, R., et al.: Qwen3-VL technical report (2025), https://arxiv.org/abs/2511.21631
- [3]
- [4] Du, Y., Wei, F., Zhang, Z., et al.: Learning to prompt for open-vocabulary object detection with vision-language model. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 14084–14093 (June 2022)
- [5] Fu, X., Hu, Y., Li, B., et al.: BLINK: Multimodal large language models can see but not perceive. In: European Conference on Computer Vision. pp. 148–166. Springer (2024)
- [6] Guo, D., Yang, D., Zhang, H., et al.: DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning. Nature 645(8081), 633–638 (2025)
- [7] He, X., Zhang, Y., Mou, L., Xing, E., Xie, P.: PathVQA: 30000+ questions for medical visual question answering. arXiv preprint arXiv:2003.10286 (2020)
- [8] Hu, Y., Li, T., Lu, Q., et al.: OmniMedVQA: A new large-scale comprehensive evaluation benchmark for medical LVLM. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 22170–22183 (2024)
- [9] Jiang, S., Wang, Y., Song, S., et al.: Hulu-Med: A transparent generalist model towards holistic medical vision-language understanding. arXiv preprint arXiv:2510.08668 (2025)
- [10] Kang, Z., Gong, J., Yan, J., et al.: HSSBench: Benchmarking humanities and social sciences ability for multimodal large language models. arXiv preprint arXiv:2506.03922 (2025)
- [11] Kembhavi, A., Salvato, M., Kolve, E., et al.: A diagram is worth a dozen images. In: European Conference on Computer Vision. pp. 235–251. Springer (2016)
- [12] Kwon, W., Li, Z., Zhuang, S., et al.: Efficient memory management for large language model serving with PagedAttention. In: SOSP '23. pp. 611–626. Association for Computing Machinery, New York, NY, USA (2023)
- [13] Lau, J.J., Gayen, S., Ben Abacha, A., Demner-Fushman, D.: A dataset of clinically generated visual questions and answers about radiology images. Scientific Data 5(1), 180251 (2018)
- [14] Liu, B., Zhan, L.M., Xu, L., et al.: SLAKE: A semantically-labeled knowledge-enhanced dataset for medical visual question answering. In: 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI). pp. 1650–1654. IEEE (2021)
- [15] Liu, J., Wang, Y., Du, J., Zhou, J.T., Liu, Z.: MedCoT: Medical chain of thought via hierarchical expert. In: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. pp. 17371–17389 (2024)
- [16] Lu, P., Bansal, H., Xia, T., et al.: MathVista: Evaluating mathematical reasoning of foundation models in visual contexts. arXiv preprint arXiv:2310.02255 (2023)
- [17] Saikh, T., Ghosal, T., Mittal, A., Ekbal, A., Bhattacharyya, P.: ScienceQA: A novel resource for question answering on scholarly articles. International Journal on Digital Libraries 23(3), 289–301 (2022)
- [18] Team, K., Bai, T., Bai, Y., et al.: Kimi K2.5: Visual agentic intelligence (2026), https://arxiv.org/abs/2602.02276
- [19] Tong, P., Brown, E., Wu, P., et al.: Cambrian-1: A fully open, vision-centric exploration of multimodal LLMs. Advances in Neural Information Processing Systems 37, 87310–87356 (2024)
- [20] Wang, Y., Li, Z., Zang, Y., et al.: Unified multimodal chain-of-thought reward model through reinforcement fine-tuning. In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (2025)
- [21] Wang, Y., Liu, J., Gao, S., et al.: V2T-CoT: From vision to text chain-of-thought for medical reasoning and diagnosis. In: Proceedings of Medical Image Computing and Computer Assisted Intervention – MICCAI 2025. vol. LNCS 15964. Springer Nature Switzerland (September 2025)
- [22] Wei, J., Wang, X., Schuurmans, D., et al.: Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems 35, 24824–24837 (2022)
- [23] Xu, W., Chan, H.P., Li, L., et al.: Lingshu: A generalist foundation model for unified multimodal medical understanding and reasoning. arXiv preprint arXiv:2506.07044 (2025)
- [24] Yang, Z., Qian, J., Peng, Z., et al.: Med-REFL: Medical reasoning enhancement via self-corrected fine-grained reflection. arXiv preprint arXiv:2506.13793 (2025)
- [25] Ye, Z., Niu, X., Wu, X., et al.: Unveiling and bridging the functional perception gap in MLLMs: Atomic visual alignment and hierarchical evaluation via PET-Bench. arXiv preprint arXiv:2601.02737 (2026)
- [26] Zhang, X., Wu, C., Zhao, Z., et al.: PMC-VQA: Visual instruction tuning for medical visual question answering. arXiv preprint arXiv:2305.10415 (2023)
- [27] Zheng, G., Yang, B., Tang, J., Zhou, H.Y., Yang, S.: DDCoT: Duty-distinct chain-of-thought prompting for multimodal reasoning in language models. In: Thirty-seventh Conference on Neural Information Processing Systems (2023)
- [28] Zhu, J., Wang, W., Chen, Z., et al.: InternVL3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479 (2025)