Recognition: no theorem link
Better Eyes, Better Thoughts: Why Vision Chain-of-Thought Fails in Medicine
Pith reviewed 2026-05-15 18:29 UTC · model grok-4.3
The pith
In medical visual question answering, chain-of-thought prompting often reduces accuracy compared to direct answers because it amplifies early visual perception errors.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
On medical visual question answering, chain-of-thought (CoT) prompting frequently underperforms direct answering (DirA) across both general-purpose and medical-specific models. The gap is attributed to a medical perception bottleneck: subtle, domain-specific cues weaken visual grounding, and CoT compounds early perceptual uncertainty rather than correcting it. Two training-free interventions, perception anchoring and description grounding, improve accuracy and mitigate the degradation.
What carries the argument
A medical perception bottleneck that weakens visual grounding and causes chain-of-thought reasoning to compound early perceptual errors rather than correct them.
Load-bearing premise
The observed performance gap is driven primarily by compounding perceptual uncertainty from the medical perception bottleneck rather than by variations in prompting, model size, or dataset artifacts.
What would settle it
Measure CoT versus DirA accuracy after supplying near-perfect initial visual perception through oracle cues (e.g., ground-truth regions of interest or expert image descriptions). If the CoT–DirA gap shrinks or disappears under oracle perception, the perception bottleneck is the main cause; if it persists, the degradation lies in the reasoning process itself.
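A minimal sketch of how such an oracle test could be run is below. The query_model helper, the dataset fields (roi_description, label), and the prompt wording are illustrative assumptions, not the paper's evaluation code.

```python
# Hypothetical oracle-perception test: compare CoT vs. direct answering
# with and without ground-truth region cues. All names are placeholders.

def oracle_gap(query_model, dataset):
    """Return accuracy per condition; dataset items need question, image,
    roi_description (oracle cue), and label fields."""
    conditions = {
        "dira": lambda ex: f"{ex['question']} Answer directly.",
        "cot": lambda ex: f"{ex['question']} Think step by step, then answer.",
        "dira_oracle": lambda ex: (f"Relevant region: {ex['roi_description']}. "
                                   f"{ex['question']} Answer directly."),
        "cot_oracle": lambda ex: (f"Relevant region: {ex['roi_description']}. "
                                  f"{ex['question']} Think step by step, then answer."),
    }
    hits = {name: 0 for name in conditions}
    for ex in dataset:
        for name, build in conditions.items():
            answer = query_model(image=ex["image"], prompt=build(ex))
            hits[name] += int(answer.strip().lower() == ex["label"].lower())
    return {name: count / len(dataset) for name, count in hits.items()}

# If the perception-bottleneck hypothesis holds, the CoT-vs-DirA gap should
# shrink or reverse in the *_oracle conditions.
```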
Original abstract
Large vision-language models (VLMs) often benefit from chain-of-thought (CoT) prompting in general domains, yet its efficacy in medical vision-language tasks remains underexplored. We report a counter-intuitive trend: on medical visual question answering, CoT frequently underperforms direct answering (DirA) across general-purpose and medical-specific models. We attribute this to a medical perception bottleneck: subtle, domain-specific cues can weaken visual grounding, and CoT may compound early perceptual uncertainty rather than correct it. To probe this hypothesis, we introduce two training-free, inference-time grounding interventions: (i) perception anchoring via region-of-interest cues and (ii) description grounding via high-quality textual guidance. Across multiple benchmarks and model families, these interventions improve accuracy, mitigate CoT degradation, and in several settings reverse the CoT–DirA inversion. Our findings suggest that reliable clinical VLMs require robust visual grounding and cross-modal alignment, beyond extending text-driven reasoning chains. Code is available at https://github.com/TianYin123/Better_Eyes_Better_Thoughts.
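As a rough illustration of how the two interventions enter the prompt at inference time, the sketch below composes a medical VQA prompt with optional region-of-interest cues and textual descriptions. The cue format and wording are assumptions; the authors' exact prompts may differ.

```python
# Illustrative prompt construction for the two training-free interventions.
# Not the paper's code: wording and cue format are placeholders.

def build_prompt(question, roi_cue=None, description=None, use_cot=False):
    """Compose a medical VQA prompt with optional grounding interventions."""
    parts = []
    if roi_cue is not None:
        # Perception anchoring: point the model at the relevant region first.
        parts.append(f"Focus on the region: {roi_cue}.")
    if description is not None:
        # Description grounding: condition on a high-quality textual description.
        parts.append(f"Image description: {description}")
    parts.append(question)
    parts.append("Think step by step, then give the final answer."
                 if use_cot else "Answer with a single word or phrase.")
    return "\n".join(parts)

# Example: CoT with perception anchoring only.
prompt = build_prompt(
    "Is there evidence of pleural effusion?",
    roi_cue="left costophrenic angle",
    use_cot=True,
)
```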
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript reports that chain-of-thought (CoT) prompting underperforms direct answering (DirA) on medical visual question answering tasks across general-purpose and medical-specific vision-language models. The authors attribute the gap to a 'medical perception bottleneck' in which subtle domain-specific visual cues weaken grounding and CoT compounds early perceptual errors. They introduce two training-free interventions—perception anchoring via region-of-interest cues and description grounding via high-quality textual descriptions—and show that these interventions raise accuracy, reduce CoT degradation, and in several settings reverse the CoT–DirA ordering. Experiments span multiple benchmarks and model families; code is released.
Significance. If the core empirical pattern holds, the work usefully documents a domain-specific limitation of text-centric reasoning techniques in medical imaging and supplies simple, inference-time fixes that improve reliability. The breadth of models and benchmarks, together with the open code, makes the findings actionable for clinical VLM development and provides a concrete baseline for future grounding research.
major comments (2)
- [§4] §4 (Results): the reported CoT < DirA gap is not accompanied by prompt-length or output-token controls; because the interventions simultaneously alter visual focus, prompt length, and textual conditioning, the data do not isolate whether the original degradation arises from perceptual compounding or from generic prompt-complexity effects.
- [§3.2] §3.2 (Interventions): no ablation matches total prompt length or isolates the contribution of ROI cues versus added textual descriptions; without such controls the claim that the interventions specifically repair a perception bottleneck remains correlational rather than causal.
minor comments (2)
- [Abstract] Abstract: the term 'CoT–DirA inversion' is used without a brief parenthetical definition; a short clarification would aid readers outside the immediate subfield.
- [Figures] Figure captions (throughout): axis labels and legend entries should explicitly state whether accuracy is reported as mean ± std or as single-run values.
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive comments. The concerns about prompt-length controls and ablation design are valid and point to ways we can make the causal claims more rigorous. We will incorporate the suggested controls and ablations in the revised manuscript, which we believe will strengthen rather than undermine the core finding of a medical perception bottleneck.
Point-by-point responses
-
Referee: [§4] §4 (Results): the reported CoT < DirA gap is not accompanied by prompt-length or output-token controls; because the interventions simultaneously alter visual focus, prompt length, and textual conditioning, the data do not isolate whether the original degradation arises from perceptual compounding or from generic prompt-complexity effects.
Authors: We agree that prompt length and token count are potential confounds. In the revision we will add a new set of length-controlled experiments: for each model and benchmark we will create prompt variants that match the token length of the CoT condition (via neutral filler text or rephrasing) while preserving the direct-answer structure, and we will report both input and output token statistics across all conditions. These controls will allow us to quantify how much of the original CoT–DirA gap persists after length equalization and thereby better isolate perceptual compounding from generic complexity effects. revision: yes
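A minimal sketch of the proposed length-equalization control, assuming a Hugging Face tokenizer and the neutral-filler variant (the checkpoint name, filler text, and example prompts are placeholders, not the authors' materials):

```python
# Build a direct-answer prompt whose token count approximately matches the
# CoT prompt for the same example, so prompt length is no longer a confound.
from transformers import AutoTokenizer

FILLER = " Please read the question carefully before responding."

def length_matched_dira(dira_prompt, cot_prompt, tokenizer):
    """Pad the DirA prompt with neutral filler until it reaches (at least)
    the CoT prompt's token count."""
    target = len(tokenizer.encode(cot_prompt))
    padded = dira_prompt
    while len(tokenizer.encode(padded)) < target:
        padded += FILLER
    return padded

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")
dira = "Is the lesion benign or malignant? Answer directly."
cot = ("Is the lesion benign or malignant? Describe the relevant findings, "
       "reason step by step, then state the final answer.")
matched = length_matched_dira(dira, cot, tokenizer)
print(len(tokenizer.encode(matched)), len(tokenizer.encode(cot)))
```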
-
Referee: [§3.2] §3.2 (Interventions): no ablation matches total prompt length or isolates the contribution of ROI cues versus added textual descriptions; without such controls the claim that the interventions specifically repair a perception bottleneck remains correlational rather than causal.
Authors: We accept that the current intervention results are correlational with respect to the individual factors. In the revised version we will add three new ablation arms per benchmark: (i) ROI cues paired with length-matched neutral text, (ii) high-quality textual descriptions without ROI cues, and (iii) length-matched prompts containing neither intervention. By comparing these conditions we will be able to separate the contribution of visual anchoring from that of added textual conditioning and from prompt length, thereby providing a more direct test of whether the interventions repair the hypothesized perception bottleneck. revision: yes
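The three proposed arms plus the original full intervention form a small condition grid. A sketch of how it could be enumerated is below; the condition names and the run_condition callable are placeholders rather than the paper's code.

```python
# Ablation grid implied by the rebuttal: separate ROI anchoring, textual
# description, and prompt length. Flags are placeholders for illustration.
ABLATION_ARMS = {
    "roi_plus_filler":   dict(roi=True,  description=False, filler=True),   # arm (i)
    "description_only":  dict(roi=False, description=True,  filler=False),  # arm (ii)
    "filler_only":       dict(roi=False, description=False, filler=True),   # arm (iii)
    "full_intervention": dict(roi=True,  description=True,  filler=False),  # original
}

def run_ablation(run_condition, benchmark):
    """run_condition(benchmark, **flags) -> accuracy; returns one value per arm."""
    return {name: run_condition(benchmark, **flags)
            for name, flags in ABLATION_ARMS.items()}
```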
Circularity Check
No significant circularity; purely empirical evaluation
Full rationale
The paper reports benchmark results comparing CoT vs. direct answering on medical VQA tasks across models, attributes the gap to a perceptual bottleneck hypothesis, and evaluates two training-free interventions (ROI anchoring and description grounding). No equations, fitted parameters, or derivations appear; the central claims rest on observed accuracy deltas and intervention effects rather than any self-definitional reduction or self-citation chain. The work is self-contained against external benchmarks with no load-bearing steps that collapse to inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Existing medical VQA benchmarks are valid proxies for clinical visual reasoning tasks.
invented entities (1)
- medical perception bottleneck: no independent evidence
Reference graph
Works this paper leans on
- [1]
- [2] Bai, S., Cai, Y., Chen, R., et al.: Qwen3-VL technical report (2025), https://arxiv.org/abs/2511.21631
- [3]
- [4] Du, Y., Wei, F., Zhang, Z., et al.: Learning to prompt for open-vocabulary object detection with vision-language model. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 14084–14093 (June 2022)
- [5] Fu, X., Hu, Y., Li, B., et al.: BLINK: Multimodal large language models can see but not perceive. In: European Conference on Computer Vision. pp. 148–166. Springer (2024)
- [6] Guo, D., Yang, D., Zhang, H., et al.: DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning. Nature 645(8081), 633–638 (2025)
- [7] He, X., Zhang, Y., Mou, L., Xing, E., Xie, P.: PathVQA: 30000+ questions for medical visual question answering. arXiv preprint arXiv:2003.10286 (2020)
- [8] Hu, Y., Li, T., Lu, Q., et al.: OmniMedVQA: A new large-scale comprehensive evaluation benchmark for medical LVLM. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 22170–22183 (2024)
- [9] Jiang, S., Wang, Y., Song, S., et al.: Hulu-Med: A transparent generalist model towards holistic medical vision-language understanding. arXiv preprint arXiv:2510.08668 (2025)
- [10] Kang, Z., Gong, J., Yan, J., et al.: HSSBench: Benchmarking humanities and social sciences ability for multimodal large language models. arXiv preprint arXiv:2506.03922 (2025)
- [11] Kembhavi, A., Salvato, M., Kolve, E., et al.: A diagram is worth a dozen images. In: European Conference on Computer Vision. pp. 235–251. Springer (2016)
- [12] Kwon, W., Li, Z., Zhuang, S., et al.: Efficient memory management for large language model serving with PagedAttention. In: SOSP '23. pp. 611–626. Association for Computing Machinery, New York, NY, USA (2023)
- [13] Lau, J.J., Gayen, S., Ben Abacha, A., Demner-Fushman, D.: A dataset of clinically generated visual questions and answers about radiology images. Scientific Data 5(1), 180251 (2018)
- [14] Liu, B., Zhan, L.M., Xu, L., et al.: SLAKE: A semantically-labeled knowledge-enhanced dataset for medical visual question answering. In: 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI). pp. 1650–1654. IEEE (2021)
- [15] Liu, J., Wang, Y., Du, J., Zhou, J.T., Liu, Z.: MedCoT: Medical chain of thought via hierarchical expert. In: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. pp. 17371–17389 (2024)
- [16] Lu, P., Bansal, H., Xia, T., et al.: MathVista: Evaluating mathematical reasoning of foundation models in visual contexts. arXiv preprint arXiv:2310.02255 (2023)
- [17] Saikh, T., Ghosal, T., Mittal, A., Ekbal, A., Bhattacharyya, P.: ScienceQA: A novel resource for question answering on scholarly articles. International Journal on Digital Libraries 23(3), 289–301 (2022)
- [18] Team, K., Bai, T., Bai, Y., et al.: Kimi K2.5: Visual agentic intelligence (2026), https://arxiv.org/abs/2602.02276
- [19] Tong, P., Brown, E., Wu, P., et al.: Cambrian-1: A fully open, vision-centric exploration of multimodal LLMs. Advances in Neural Information Processing Systems 37, 87310–87356 (2024)
- [20] Wang, Y., Li, Z., Zang, Y., et al.: Unified multimodal chain-of-thought reward model through reinforcement fine-tuning. In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (2025)
- [21] Wang, Y., Liu, J., Gao, S., et al.: V2T-CoT: From vision to text chain-of-thought for medical reasoning and diagnosis. In: Proceedings of Medical Image Computing and Computer Assisted Intervention – MICCAI 2025. vol. LNCS 15964. Springer Nature Switzerland (September 2025)
- [22] Wei, J., Wang, X., Schuurmans, D., et al.: Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems 35, 24824–24837 (2022)
- [23] Xu, W., Chan, H.P., Li, L., et al.: Lingshu: A generalist foundation model for unified multimodal medical understanding and reasoning. arXiv preprint arXiv:2506.07044 (2025)
- [24] Yang, Z., Qian, J., Peng, Z., et al.: Med-REFL: Medical reasoning enhancement via self-corrected fine-grained reflection. arXiv preprint arXiv:2506.13793 (2025)
- [25] Ye, Z., Niu, X., Wu, X., et al.: Unveiling and bridging the functional perception gap in MLLMs: Atomic visual alignment and hierarchical evaluation via PET-Bench. arXiv preprint arXiv:2601.02737 (2026)
- [26] Zhang, X., Wu, C., Zhao, Z., et al.: PMC-VQA: Visual instruction tuning for medical visual question answering. arXiv preprint arXiv:2305.10415 (2023)
- [27] Zheng, G., Yang, B., Tang, J., Zhou, H.Y., Yang, S.: DDCoT: Duty-distinct chain-of-thought prompting for multimodal reasoning in language models. In: Thirty-seventh Conference on Neural Information Processing Systems (2023)
- [28] Zhu, J., Wang, W., Chen, Z., et al.: InternVL3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479 (2025)