pith. machine review for the scientific record.

arxiv: 2603.06665 · v2 · submitted 2026-03-02 · 💻 cs.CV · cs.AI

Recognition: no theorem link

Better Eyes, Better Thoughts: Why Vision Chain-of-Thought Fails in Medicine


Pith reviewed 2026-05-15 18:29 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords medical visual question answering · chain-of-thought prompting · vision-language models · perception bottleneck · visual grounding · inference-time interventions

The pith

In medical visual question answering, chain-of-thought prompting often reduces accuracy compared to direct answers because it amplifies early visual perception errors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that chain-of-thought reasoning underperforms direct answering on medical visual question answering tasks for both general and specialized vision-language models. This occurs because subtle medical visual features create a perception bottleneck that weakens initial grounding, allowing chain-of-thought to build upon and compound those uncertainties. The authors introduce two simple interventions—region-of-interest anchoring and textual description grounding—that boost performance and often reverse the trend. This matters because it challenges the assumption that more reasoning always helps in complex domains such as medicine, pointing instead to the need for stronger visual foundations.

Core claim

On medical visual question answering, CoT frequently underperforms direct answering across general-purpose and medical-specific models due to a medical perception bottleneck where subtle cues weaken visual grounding and CoT compounds early perceptual uncertainty. Training-free interventions via perception anchoring and description grounding improve accuracy and mitigate the degradation.
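The two prompting regimes and the two grounding interventions can be sketched as prompt-construction logic. This is a minimal illustration under stated assumptions: the wording of the templates and the `roi`/`description` fields are hypothetical, not the authors' exact prompts.

```python
def build_prompt(question, mode, roi=None, description=None):
    """Assemble the text prompt for one medical VQA example.

    mode: "cot" (chain-of-thought) or "dira" (direct answering).
    roi / description: the two training-free grounding interventions
    (perception anchoring, description grounding); illustrative only.
    """
    parts = []
    if roi is not None:
        # Perception anchoring: direct attention to a region of interest.
        parts.append(f"Focus on the region {roi} of the image.")
    if description is not None:
        # Description grounding: supply a high-quality textual description.
        parts.append(f"Image findings: {description}")
    parts.append(f"Question: {question}")
    if mode == "cot":
        parts.append("Think step by step, then state the final answer.")
    else:
        parts.append("Answer directly with a single option.")
    return "\n".join(parts)

baseline = build_prompt("Is there a pleural effusion?", "cot")
anchored = build_prompt("Is there a pleural effusion?", "cot",
                        roi=(120, 80, 260, 210),
                        description="Blunting of the left costophrenic angle.")
```

The key design point is that both interventions act purely at inference time: they change only the conditioning text, never the model weights.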

What carries the argument

Medical perception bottleneck that weakens visual grounding and causes chain-of-thought to compound perceptual errors.

Load-bearing premise

The observed performance gap is driven primarily by compounding perceptual uncertainty from the medical perception bottleneck rather than by variations in prompting, model size, or dataset artifacts.

What would settle it

Measure CoT versus DirA performance after supplying perfect initial visual perception via oracle cues (e.g., ground-truth regions and descriptions). If the CoT–DirA gap disappears under oracle grounding, the perception bottleneck is the dominant cause; if it persists, something else drives the degradation.
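The oracle test above reduces to comparing one accuracy gap under two conditions. A minimal sketch, assuming a hypothetical `evaluate` function that runs a VLM over a benchmark and returns accuracy:

```python
def cot_dira_gap(evaluate, benchmark, oracle_cues=False):
    """Accuracy gap between chain-of-thought and direct answering.

    Positive gap: CoT helps. Negative gap: CoT degrades (the paper's
    reported pattern on medical VQA). `evaluate` is a stand-in for a
    full benchmark run, not a real API.
    """
    acc_cot = evaluate(benchmark, mode="cot", oracle_cues=oracle_cues)
    acc_dira = evaluate(benchmark, mode="dira", oracle_cues=oracle_cues)
    return acc_cot - acc_dira

# The bottleneck hypothesis predicts: gap < 0 without oracle cues,
# and gap shrinking toward zero (or reversing) with them.
```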

Figures

Figures reproduced from arXiv: 2603.06665 by Guanxing Chen, Jiayu Qian, Qiankun Li, Songpan Gao, Yu-An Huang, Yuan Wu, Zhi-An Huang, Zongxian Yang.

Figure 1
Figure 1. The three-stage medical VLM CoT framework and targeted interventions. view at source ↗
Figure 2
Figure 2. Main results across RQ1–RQ3. (a) CoT improves general benchmarks but degrades medical benchmarks. (b) CoT is more sensitive to progressive visual degradation than DirA. (c) Supplementing models with expert-level image descriptions alone effectively mitigates CoT degradation. (d,e) Counterfactual inputs reveal pseudo-robustness in DirA and stronger visual dependence in CoT. (f) Incorrect RoI and descripti… view at source ↗
Figure 3
Figure 3. Qualitative case study. Standard CoT exhibits misaligned attention patterns and incorrect conclusions, while grounded interventions provide additional spatial and semantic priors that yield more visually consistent reasoning trajectories. view at source ↗
read the original abstract

Large vision-language models (VLMs) often benefit from chain-of-thought (CoT) prompting in general domains, yet its efficacy in medical vision-language tasks remains underexplored. We report a counter-intuitive trend: on medical visual question answering, CoT frequently underperforms direct answering (DirA) across general-purpose and medical-specific models. We attribute this to a \emph{medical perception bottleneck}: subtle, domain-specific cues can weaken visual grounding, and CoT may compound early perceptual uncertainty rather than correct it. To probe this hypothesis, we introduce two training-free, inference-time grounding interventions: (i) \emph{perception anchoring} via region-of-interest cues and (ii) \emph{description grounding} via high-quality textual guidance. Across multiple benchmarks and model families, these interventions improve accuracy, mitigate CoT degradation, and in several settings reverse the CoT--DirA inversion. Our findings suggest that reliable clinical VLMs require robust visual grounding and cross-modal alignment, beyond extending text-driven reasoning chains. Code is available \href{https://github.com/TianYin123/Better_Eyes_Better_Thoughts}{here}.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript reports that chain-of-thought (CoT) prompting underperforms direct answering (DirA) on medical visual question answering tasks across general-purpose and medical-specific vision-language models. The authors attribute the gap to a 'medical perception bottleneck' in which subtle domain-specific visual cues weaken grounding and CoT compounds early perceptual errors. They introduce two training-free interventions—perception anchoring via region-of-interest cues and description grounding via high-quality textual descriptions—and show that these interventions raise accuracy, reduce CoT degradation, and in several settings reverse the CoT–DirA ordering. Experiments span multiple benchmarks and model families; code is released.

Significance. If the core empirical pattern holds, the work usefully documents a domain-specific limitation of text-centric reasoning techniques in medical imaging and supplies simple, inference-time fixes that improve reliability. The breadth of models and benchmarks, together with the open code, makes the findings actionable for clinical VLM development and provides a concrete baseline for future grounding research.

major comments (2)
  1. [§4] §4 (Results): the reported CoT < DirA gap is not accompanied by prompt-length or output-token controls; because the interventions simultaneously alter visual focus, prompt length, and textual conditioning, the data do not isolate whether the original degradation arises from perceptual compounding or from generic prompt-complexity effects.
  2. [§3.2] §3.2 (Interventions): no ablation matches total prompt length or isolates the contribution of ROI cues versus added textual descriptions; without such controls the claim that the interventions specifically repair a perception bottleneck remains correlational rather than causal.
minor comments (2)
  1. [Abstract] Abstract: the term 'CoT–DirA inversion' is used without a brief parenthetical definition; a short clarification would aid readers outside the immediate subfield.
  2. [Figures] Figure captions (throughout): axis labels and legend entries should explicitly state whether accuracy is reported as mean ± std or as single-run values.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments. The concerns about prompt-length controls and ablation design are valid and point to ways we can make the causal claims more rigorous. We will incorporate the suggested controls and ablations in the revised manuscript, which we believe will strengthen rather than undermine the core finding of a medical perception bottleneck.

read point-by-point responses
  1. Referee: [§4] §4 (Results): the reported CoT < DirA gap is not accompanied by prompt-length or output-token controls; because the interventions simultaneously alter visual focus, prompt length, and textual conditioning, the data do not isolate whether the original degradation arises from perceptual compounding or from generic prompt-complexity effects.

    Authors: We agree that prompt length and token count are potential confounds. In the revision we will add a new set of length-controlled experiments: for each model and benchmark we will create prompt variants that match the token length of the CoT condition (via neutral filler text or rephrasing) while preserving the direct-answer structure, and we will report both input and output token statistics across all conditions. These controls will allow us to quantify how much of the original CoT–DirA gap persists after length equalization and thereby better isolate perceptual compounding from generic complexity effects. revision: yes

  2. Referee: [§3.2] §3.2 (Interventions): no ablation matches total prompt length or isolates the contribution of ROI cues versus added textual descriptions; without such controls the claim that the interventions specifically repair a perception bottleneck remains correlational rather than causal.

    Authors: We accept that the current intervention results are correlational with respect to the individual factors. In the revised version we will add three new ablation arms per benchmark: (i) ROI cues paired with length-matched neutral text, (ii) high-quality textual descriptions without ROI cues, and (iii) length-matched prompts containing neither intervention. By comparing these conditions we will be able to separate the contribution of visual anchoring from that of added textual conditioning and from prompt length, thereby providing a more direct test of whether the interventions repair the hypothesized perception bottleneck. revision: yes
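The three promised ablation arms, combined with the length-equalization control from the first response, can be sketched as a small experiment-design helper. The filler text, arm names, and crude whitespace tokenization are assumptions for illustration, not the authors' protocol.

```python
FILLER = "Note: consider the image carefully."

def pad_to_length(prompt, target_tokens):
    """Crudely pad (or truncate) a prompt to a fixed whitespace-token count
    so prompt length cannot explain accuracy differences between arms."""
    tokens = prompt.split()
    while len(tokens) < target_tokens:
        tokens.extend(FILLER.split())
    return " ".join(tokens[:target_tokens])

def ablation_arms(question, roi_cue, description, target_tokens=64):
    """The three length-matched arms proposed in the rebuttal:
    (i) ROI cue + neutral filler, (ii) description only, (iii) neither."""
    arms = {
        "roi_only": f"{roi_cue} {question}",
        "description_only": f"{description} {question}",
        "neither": question,
    }
    return {name: pad_to_length(p, target_tokens) for name, p in arms.items()}
```

Comparing accuracy across these equal-length conditions separates visual anchoring from textual conditioning from sheer prompt size, which is exactly the causal isolation the referee asked for.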

Circularity Check

0 steps flagged

No significant circularity; purely empirical evaluation

full rationale

The paper reports benchmark results comparing CoT vs. direct answering on medical VQA tasks across models, attributes the gap to a perceptual bottleneck hypothesis, and evaluates two training-free interventions (ROI anchoring and description grounding). No equations, fitted parameters, or derivations appear; the central claims rest on observed accuracy deltas and intervention effects rather than any self-definitional reduction or self-citation chain. The work is self-contained against external benchmarks with no load-bearing steps that collapse to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on empirical observations from existing benchmarks and a hypothesized explanatory mechanism; no new free parameters are introduced.

axioms (1)
  • domain assumption Existing medical VQA benchmarks are valid proxies for clinical visual reasoning tasks.
    The paper evaluates on standard benchmarks without questioning their representativeness.
invented entities (1)
  • medical perception bottleneck no independent evidence
    purpose: Explanatory concept for why CoT compounds visual errors in medicine
    Introduced to account for the observed CoT degradation; no independent falsifiable test provided in the abstract.

pith-pipeline@v0.9.0 · 5527 in / 1240 out tokens · 41054 ms · 2026-05-15T18:29:33.761677+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages · 6 internal anchors

  1. Amjith, S., Dusad, M., Muramalla, N., Shah, S.: Can large reasoning models improve accuracy on mathematical tasks using flawed thinking? arXiv preprint arXiv:2512.17079 (2025)
  2. Bai, S., Cai, Y., Chen, R., et al.: Qwen3-VL technical report (2025), https://arxiv.org/abs/2511.21631
  3. Chen, K., Rui, S., Jiang, Y., et al.: Think twice to see more: Iterative visual reasoning in medical VLMs (2025), https://arxiv.org/abs/2510.10052
  4. Du, Y., Wei, F., Zhang, Z., et al.: Learning to prompt for open-vocabulary object detection with vision-language model. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 14084–14093 (June 2022)
  5. Fu, X., Hu, Y., Li, B., et al.: BLINK: Multimodal large language models can see but not perceive. In: European Conference on Computer Vision. pp. 148–166. Springer (2024)
  6. Guo, D., Yang, D., Zhang, H., et al.: DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning. Nature 645(8081), 633–638 (2025)
  7. He, X., Zhang, Y., Mou, L., Xing, E., Xie, P.: PathVQA: 30000+ questions for medical visual question answering. arXiv preprint arXiv:2003.10286 (2020)
  8. Hu, Y., Li, T., Lu, Q., et al.: OmniMedVQA: A new large-scale comprehensive evaluation benchmark for medical LVLM. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 22170–22183 (2024)
  9. Jiang, S., Wang, Y., Song, S., et al.: Hulu-Med: A transparent generalist model towards holistic medical vision-language understanding. arXiv preprint arXiv:2510.08668 (2025)
  10. Kang, Z., Gong, J., Yan, J., et al.: HSSBench: Benchmarking humanities and social sciences ability for multimodal large language models. arXiv preprint arXiv:2506.03922 (2025)
  11. Kembhavi, A., Salvato, M., Kolve, E., et al.: A diagram is worth a dozen images. In: European Conference on Computer Vision. pp. 235–251. Springer (2016)
  12. Kwon, W., Li, Z., Zhuang, S., et al.: Efficient memory management for large language model serving with PagedAttention. pp. 611–626. SOSP '23, Association for Computing Machinery, New York, NY, USA (2023)
  13. Lau, J.J., Gayen, S., Ben Abacha, A., Demner-Fushman, D.: A dataset of clinically generated visual questions and answers about radiology images. Scientific Data 5(1), 180251 (2018)
  14. Liu, B., Zhan, L.M., Xu, L., et al.: SLAKE: A semantically-labeled knowledge-enhanced dataset for medical visual question answering. In: 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI). pp. 1650–1654. IEEE (2021)
  15. Liu, J., Wang, Y., Du, J., Zhou, J.T., Liu, Z.: MedCoT: Medical chain of thought via hierarchical expert. In: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. pp. 17371–17389 (2024)
  16. Lu, P., Bansal, H., Xia, T., et al.: MathVista: Evaluating mathematical reasoning of foundation models in visual contexts. arXiv preprint arXiv:2310.02255 (2023)
  17. Saikh, T., Ghosal, T., Mittal, A., Ekbal, A., Bhattacharyya, P.: ScienceQA: A novel resource for question answering on scholarly articles. International Journal on Digital Libraries 23(3), 289–301 (2022)
  18. Team, K., Bai, T., Bai, Y., et al.: Kimi K2.5: Visual agentic intelligence (2026), https://arxiv.org/abs/2602.02276
  19. Tong, P., Brown, E., Wu, P., et al.: Cambrian-1: A fully open, vision-centric exploration of multimodal LLMs. Advances in Neural Information Processing Systems 37, 87310–87356 (2024)
  20. Wang, Y., Li, Z., Zang, Y., et al.: Unified multimodal chain-of-thought reward model through reinforcement fine-tuning. In: The Thirty-Ninth Annual Conference on Neural Information Processing Systems (2025)
  21. Wang, Y., Liu, J., Gao, S., et al.: V2T-CoT: From vision to text chain-of-thought for medical reasoning and diagnosis. In: Proceedings of Medical Image Computing and Computer Assisted Intervention – MICCAI 2025. vol. LNCS 15964. Springer Nature Switzerland (September 2025)
  22. Wei, J., Wang, X., Schuurmans, D., et al.: Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems 35, 24824–24837 (2022)
  23. Xu, W., Chan, H.P., Li, L., et al.: Lingshu: A generalist foundation model for unified multimodal medical understanding and reasoning. arXiv preprint arXiv:2506.07044 (2025)
  24. Yang, Z., Qian, J., Peng, Z., et al.: Med-REFL: Medical reasoning enhancement via self-corrected fine-grained reflection. arXiv preprint arXiv:2506.13793 (2025)
  25. Ye, Z., Niu, X., Wu, X., et al.: Unveiling and bridging the functional perception gap in MLLMs: Atomic visual alignment and hierarchical evaluation via PET-Bench. arXiv preprint arXiv:2601.02737 (2026)
  26. Zhang, X., Wu, C., Zhao, Z., et al.: PMC-VQA: Visual instruction tuning for medical visual question answering. arXiv preprint arXiv:2305.10415 (2023)
  27. Zheng, G., Yang, B., Tang, J., Zhou, H.Y., Yang, S.: DDCoT: Duty-distinct chain-of-thought prompting for multimodal reasoning in language models. In: Thirty-Seventh Conference on Neural Information Processing Systems (2023)
  28. Zhu, J., Wang, W., Chen, Z., et al.: InternVL3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479 (2025)