pith. sign in

arxiv: 2606.17412 · v3 · pith:4R4F4N7Enew · submitted 2026-06-16 · 💻 cs.CV · cs.AI

Enhancing Pathological VLMs with Cross-scale Reasoning

Pith reviewed 2026-06-27 02:06 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords pathological imagesvision-language modelscross-scale reasoningmulti-magnification VQAreinforcement learningbenchmark curationScale-VQAScaleReasoner-R1
0
0 comments X

The pith

A reinforcement-learned VLM trained on curated cross-scale pathology questions reaches SOTA on both multi-magnification and single-scale benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Pathological images require pathologists to combine evidence from low-magnification tissue architecture and high-magnification cellular details. Existing vision-language models lack explicit training on cross-scale reasoning tasks even when multi-scale images are present. The paper therefore builds a leakage-aware curation pipeline that blocks text-only shortcuts in multi-image VQA and uses it to create the Scale-VQA benchmark of 4,685 questions over 2,537 images. ScaleReasoner-R1 is then trained with reinforcement learning on this benchmark. The resulting model sets new performance records on the cross-scale task and also improves results on established single-scale pathology benchmarks.

Core claim

ScaleReasoner-R1, trained via reinforcement learning to optimize performance on cross-scale VQA tasks, achieves state-of-the-art performance on the cross-scale reasoning benchmark and generalizes to state-of-the-art performance on established single-scale benchmarks.

What carries the argument

The leakage-aware curation pipeline (adversarial text-only screening plus constraint-guided question design) that produces the Scale-VQA benchmark used to train ScaleReasoner-R1 with reinforcement learning.

If this is right

  • Limited cross-scale supervision can significantly improve pathological understanding in VLMs.
  • The model generalizes from cross-scale training to SOTA results on single-scale benchmarks.
  • Pathology interpretation can be formulated as multi-magnification reasoning.
  • Multi-image VQA benchmarks require explicit protection against text-only shortcuts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Cross-scale supervision may prove useful for VLMs in any domain whose data naturally spans multiple resolutions.
  • The same curation approach could be tested on other multi-image medical or scientific VQA tasks to reduce shortcut learning.
  • Reinforcement learning may be especially suited to teaching evidence integration across scales compared with standard supervised fine-tuning.

Load-bearing premise

The leakage-aware curation pipeline successfully removes text-only shortcuts so that performance truly reflects cross-scale visual reasoning.

What would settle it

If models trained on Scale-VQA retain high accuracy when the same questions are rephrased to restore magnification-dependent text cues, or if single-scale benchmark gains disappear when cross-scale questions are removed from training, the claim that gains stem from visual cross-scale reasoning would be falsified.

Figures

Figures reproduced from arXiv: 2606.17412 by Chi Phan, Dan Hu, Qiaochu Xue, Sudong Wang, Tianyi Zhang, Yueming Jin, Yufeng Wu, Zeyu Liu.

Figure 1
Figure 1. Figure 1: A comparison of (a) single-scale VQA, (b) naïve cross-scale VQA with text-only shortcut solutions, and (c) our leakage-aware cross-scale VQA. WSIs [9]. We curate 2,537 ROI captions at 10×, 40×, and 200×, and collect 937 cross-scale captions that synthesize visual patterns across magnifications. Second, we design a leakage-aware VQA curation pipeline (Fig. 2a) that com￾bines adversarial text-only screening … view at source ↗
Figure 2
Figure 2. Figure 2: Overview of Scale-VQA and ScaleReasoner-R1. (a) Leakage-aware curation pipeline. (b) Dataset overview. (c) GRPO-based RL training. (d) Cross-scale reasoning results. (e) Example cross-scale VQA. 2 Methods 2.1 Clinical Annotation for Cross-scale Reasoning We construct a clinically verified foundation for cross-scale reasoning by anno￾tating 177 publicly available TCGA WSIs [9]. Unlike prior benchmarks [21, … view at source ↗
read the original abstract

Pathological images are inherently multi-scale, requiring pathologists to integrate evidence from global tissue architecture at low magnification to cellular morphology at higher magnification for accurate diagnosis. While existing pathological datasets for vision-language models (VLMs) include various scales, they often lack explicit cross-scale reasoning objectives. This limitation prevents VLMs from capturing essential cross-scale representations and learning evidence-based reasoning. To bridge this gap, we introduce the first cross-scale training and evaluation paradigm that formulates pathology interpretation as multi-magnification reasoning. However, creating such a task reveals a critical challenge: multi-image visual question answering (VQA) is prone to text-only shortcuts, which allow models to guess answers using magnification-dependent artifacts rather than visual evidence. To address this, we propose a leakage-aware curation pipeline that combines adversarial text-only screening with constraint-guided question design. Using this pipeline, we construct Scale-VQA, a high-quality benchmark with 4,685 multiple-choice questions grounded in 2,537 pathology images across multiple magnification levels. Finally, we present ScaleReasoner-R1, a model trained via reinforcement learning to optimize performance on cross-scale VQA tasks. ScaleReasoner-R1 achieves state-of-the-art performance on our cross-scale reasoning benchmark and generalizes to SOTA performance on established single-scale benchmarks. Findings suggest that even the limited cross-scale supervision can significantly improve pathological understanding. Code is available at https://github.com/iMVR-PL/ScaleReasoner-R1.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces a cross-scale reasoning paradigm for pathological VLMs. It constructs Scale-VQA, a benchmark of 4,685 multiple-choice VQA items grounded in 2,537 multi-magnification pathology images, via a leakage-aware curation pipeline (adversarial text-only screening plus constraint-guided question design) intended to eliminate magnification-dependent textual shortcuts. ScaleReasoner-R1 is then trained with reinforcement learning on this benchmark and is reported to achieve SOTA on Scale-VQA while generalizing to SOTA on established single-scale pathology benchmarks. The authors conclude that even limited cross-scale supervision improves pathological understanding; code is released.

Significance. If the curation pipeline demonstrably removes text-only shortcuts, the work would be significant: it supplies the first explicit cross-scale VQA benchmark and training objective for pathology, where multi-magnification integration is clinically essential. The RL optimization step and the reported generalization to single-scale tasks are concrete strengths. Public code release supports reproducibility.

major comments (1)
  1. [Benchmark construction / leakage-aware curation pipeline] The leakage-aware curation pipeline (described in the methods section on benchmark construction) is load-bearing for the central claim that Scale-VQA performance reflects cross-scale visual reasoning. The manuscript provides no quantitative validation—such as accuracy of text-only baselines on the final 4,685-question set—leaving open the possibility that residual shortcuts remain. This directly affects both the SOTA result on Scale-VQA and the generalization claim.
minor comments (2)
  1. The abstract states that 'even the limited cross-scale supervision can significantly improve pathological understanding' without quantifying the improvement or discussing limitations of the RL objective; a dedicated limitations paragraph would strengthen the discussion.
  2. Table or figure reporting the exact text-only screening rejection rates and final question statistics would make the curation pipeline more transparent.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback on the leakage-aware curation pipeline, which is central to the validity of Scale-VQA. We address the major comment below.

read point-by-point responses
  1. Referee: [Benchmark construction / leakage-aware curation pipeline] The leakage-aware curation pipeline (described in the methods section on benchmark construction) is load-bearing for the central claim that Scale-VQA performance reflects cross-scale visual reasoning. The manuscript provides no quantitative validation—such as accuracy of text-only baselines on the final 4,685-question set—leaving open the possibility that residual shortcuts remain. This directly affects both the SOTA result on Scale-VQA and the generalization claim.

    Authors: We agree that the manuscript would be strengthened by explicit quantitative validation of the text-only screening. The adversarial text-only screening step was applied during curation to remove magnification-dependent textual shortcuts, but post-curation accuracy of text-only baselines on the final 4,685-question set was not reported. In the revised manuscript we will add these results (text-only model accuracy on the curated set, expected near random chance) in the benchmark construction section to confirm that residual shortcuts are minimal. This addition directly supports the cross-scale reasoning claim and the reported generalization. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark curation and RL training with independent evaluation claims.

full rationale

The paper constructs Scale-VQA via an adversarial text-only screening pipeline and constraint-guided design, then trains ScaleReasoner-R1 with RL and reports empirical SOTA results on the new benchmark plus generalization to single-scale tasks. No equations, fitted parameters renamed as predictions, or self-citations appear in the provided text that would reduce any claim to a self-referential definition. The load-bearing assumption about shortcut removal is an empirical verification issue rather than a definitional or self-citation reduction. The derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; no free parameters, axioms, or invented entities are described.

pith-pipeline@v0.9.1-grok · 5811 in / 899 out tokens · 24711 ms · 2026-06-27T02:06:54.421118+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

27 extracted references · 17 canonical work pages · 7 internal anchors

  1. [1]

    The Journal of pathol- ogy249(3), 286–294 (2019)

    Abels, E., Pantanowitz, L., Aeffner, F., Zarella, M.D., Van der Laak, J., Bui, M.M., Vemuri, V.N., Parwani, A.V., Gibbs, J., Agosto-Arroyo, E., et al.: Computational pathology definitions, best practices, and recommendations for regulatory guid- ance: a white paper from the digital pathology association. The Journal of pathol- ogy249(3), 286–294 (2019)

  2. [2]

    In: Proceedings of the IEEE conference on computer vision and pattern recognition

    Agrawal, A., Batra, D., Parikh, D., Kembhavi, A.: Don’t just assume; look and an- swer: Overcoming priors for visual question answering. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 4971–4980 (2018). https://doi.org/10.1109/CVPR.2018.00522

  3. [3]

    Qwen2.5-VL Technical Report

    Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., et al.: Qwen2. 5-vl technical report. arXiv preprint arXiv:2502.13923 (2025)

  4. [4]

    Nature Computational Science (2025)

    Chen, K., Liu, M., Yan, F., et al.: Cost-effective instruction learning for pathology vision and language analysis. Nature Computational Science (2025). https://doi.org/10.1038/s43588-025-00818-5

  5. [5]

    In: European Conference on Com- puter Vision

    Chen, P., Zhu, C., Zheng, S., Li, H., Yang, L.: Wsi-vqa: Interpreting whole slide images by generative visual question answering. In: European Conference on Com- puter Vision. pp. 401–417. Springer (2025)

  6. [6]

    arXiv preprint arXiv:2410.11761 (2024) 10 C

    Chen, Y., Wang, G., Ji, Y., Li, Y., Ye, J., Li, T., , Ming, H., Yu, R., Qiao, Y., He, J.: Slidechat: A large vision-language assistant for whole-slide pathology image understanding. arXiv preprint arXiv:2410.11761 (2024) 10 C. Phan et al

  7. [7]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

    Chen, Y., Wang, G., Ji, Y., Li, Y., Ye, J., Li, T., Hu, M., Yu, R., Qiao, Y., He, J.: Slidechat: A large vision-language assistant for whole-slide pathology image understanding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 5134–5143 (Jun 2025)

  8. [8]

    SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training

    Chu, T., Zhai, Y., Yang, J., Tong, S., Xie, S., Schuurmans, D., Le, Q.V., Levine, S., Ma, Y.: Sft memorizes, rl generalizes: A comparative study of foundation model post-training. arXiv preprint arXiv:2501.17161 (2025)

  9. [9]

    Nucleic acids research 44(8), e71–e71 (2016)

    Colaprico, A., Silva, T.C., Olsen, C., Garofano, L., Cava, C., Garolini, D., Sabedot, T.S., Malta, T.M., Pagnotta, S.M., Castiglioni, I., et al.: Tcgabiolinks: an r/bioconductor package for integrative analysis of tcga data. Nucleic acids research 44(8), e71–e71 (2016)

  10. [10]

    In2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017

    Goyal,Y.,Khot,T.,Summers-Stay,D.,Batra,D.,Parikh,D.:MakingthevinVQA matter: Elevating the role of image understanding in visual question answering. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 6904–6913 (2017). https://doi.org/10.1109/CVPR.2017.670

  11. [11]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al.: Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948 (2025)

  12. [12]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Hashimoto, N., Fukushima, D., Koga, R., Takagi, Y., Ko, K., Kohno, K., Nakaguro, M., Nakamura, S., Hontani, H., Takeuchi, I.: Multi-scale domain- adversarial multiple-instance cnn for cancer subtype classification with unanno- tated histopathological images. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 3852–38...

  13. [13]

    PathVQA: 30000+ Questions for Medical Visual Question Answering

    He, X., Zhang, Y., Mou, L., Xing, E., Xie, P.: Pathvqa: 30000+ questions for medical visual question answering. arXiv preprint arXiv:2003.10286 (2020)

  14. [14]

    Advances in Neural Information Processing Systems36, 28541–28564 (2023)

    Li, C., Wong, C., Zhang, S., Usuyama, N., Liu, H., Yang, J., Naumann, T., Poon, H., Gao, J.: Llava-med: Training a large language-and-vision assistant for biomedicine in one day. Advances in Neural Information Processing Systems36, 28541–28564 (2023)

  15. [15]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Liang, Y., Lyu, X., Chen, W., Ding, M., Zhang, J., He, X., Wu, S., Xing, X., Yang, S., Wang, X., et al.: Wsi-llava: A multimodal large language model for whole slide image. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 22718–22727 (2025)

  16. [16]

    Liao, D., Chen, S., Xi, N., Xue, Q., Li, J., Hou, L., Liu, Z., Low, C.H., Wu, Y., Liu, Y., Jiang, Y., Li, D., Lyu, S.: Unpuzzle: A unified framework for pathology image analysis (2025), https://arxiv.org/abs/2503.03152

  17. [17]

    A Multimodal Generative AI Copilot for Human Pathology,

    Lu, M.Y., Chen, B., Williamson, D.F.K., Chen, R.J., Zhao, M., Chow, A.K., Ike- mura, K., Kim, A., Pouli, D., Patel, A., Soliman, A., Chen, C., Ding, T., Wang, J.J., Gerber, G., Liang, I., Le, L.P., Parwani, A.V., Weishaupt, L.L., Mahmood, F.: A multimodal generative ai copilot for human pathology. Nature634(8033), 466–473 (Oct 2024). https://doi.org/10.10...

  18. [18]

    arXiv e-prints pp

    Saygin Seyfioglu, M., Ikezogwo, W.O., Ghezloo, F., Krishna, R., Shapiro, L.: Quilt- llava: Visual instruction tuning by extracting localized narratives from open-source histopathology videos. arXiv e-prints pp. arXiv–2312 (2023)

  19. [19]

    In: Proceedings of the 58th annual meet- ing of the association for computational linguistics

    Shrestha, R., Kafle, K., Kanan, C.: A negative case analysis of visual grounding methods for VQA. In: Proceedings of the 58th annual meet- ing of the association for computational linguistics. pp. 8172–8181 (2020). https://doi.org/10.18653/v1/2020.acl-main.727

  20. [20]

    OpenAI GPT-5 System Card

    Singh, A., Fry, A., Perelman, A., Tart, A., Ganesh, A., El-Kishky, A., McLaughlin, A., Low, A., Ostrow, A., Ananthram, A., et al.: Openai gpt-5 system card. arXiv preprint arXiv:2601.03267 (2025) Enhancing Pathological VLMs with Cross-scale Reasoning 11

  21. [21]

    In: European Conference on Computer Vision

    Sun, Y., Wu, H., Zhu, C., Zheng, S., Chen, Q., Zhang, K., Zhang, Y., Wan, D., Lan, X., Zheng, M., et al.: Pathmmu: A massive multimodal expert-level benchmark for understanding and reasoning in pathology. In: European Conference on Computer Vision. pp. 56–73. Springer (2024)

  22. [22]

    Gemini: A Family of Highly Capable Multimodal Models

    Team, G., Anil, R., Borgeaud, S., Wu, Y., Alayrac, J.B., Yu, J., Soricut, R., Schalk- wyk,J.,Dai,A.M.,Hauth,A.,etal.:Gemini:afamilyofhighlycapablemultimodal models. arXiv preprint arXiv:2312.11805 (2023)

  23. [23]

    Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning

    Xu, W., Chan, H.P., Li, L., Aljunied, M., Yuan, R., Wang, J., Xiao, C., Chen, G., Liu,C.,Li,Z.,etal.:Lingshu:Ageneralistfoundationmodelforunifiedmultimodal medical understanding and reasoning. arXiv preprint arXiv:2506.07044 (2025)

  24. [24]

    arXiv preprint arXiv:2507.17303 (2025)

    Xu, Z., Liu, Z., Hou, J., Ma, J., Jin, C., Wang, Y., Chen, Z., Zhang, Z., Huang, F., Guo, Z., et al.: A versatile pathology co-pilot via reasoning enhanced multimodal large language model. arXiv preprint arXiv:2507.17303 (2025)

  25. [25]

    arXiv preprint arXiv:2305.15075 (2023)

    Zhang, H., Chen, J., Jiang, F., Yu, F., Chen, Z., Li, J., Chen, G., Wu, X., Zhang, Z., Xiao, Q., Wan, X., Wang, B., Li, H.: Huatuogpt, towards taming language models to be a doctor. arXiv preprint arXiv:2305.15075 (2023)

  26. [26]

    arXiv preprint arXiv:2505.11404 (2025)

    Zhang, W., Zhang, P., Guo, J., Cheng, T., Chen, J., Zhang, S., Zhang, Z., Yi, Y., Bu, H.: Patho-r1: A multimodal reinforcement learning-based pathology expert reasoner. arXiv preprint arXiv:2505.11404 (2025)

  27. [27]

    Nature Machine Intelligence1(5), 236–245 (2019)

    Zhang, Z., Chen, P., McGough, M., Xing, F., Wang, C., Bui, M., Xie, Y., Sapkota, M., Cui, L., Dhillon, J., et al.: Pathologist-level interpretable whole-slide cancer diagnosis with deep learning. Nature Machine Intelligence1(5), 236–245 (2019)