Enhancing Pathological VLMs with Cross-scale Reasoning

Chi Phan; Dan Hu; Qiaochu Xue; Sudong Wang; Tianyi Zhang; Yueming Jin; Yufeng Wu; Zeyu Liu

arXiv: 2606.17412 · v3 · pith:4R4F4N7Enew · submitted 2026-06-16 · cs.CV · cs.AI

Chi Phan, Tianyi Zhang, Qiaochu Xue, Yufeng Wu, Dan Hu, Zeyu Liu, Sudong Wang, Yueming Jin

Pith reviewed 2026-06-27 02:06 UTC · model grok-4.3

keywords: pathological images · vision-language models · cross-scale reasoning · multi-magnification VQA · reinforcement learning · benchmark curation · Scale-VQA · ScaleReasoner-R1

The pith

A reinforcement-learned VLM trained on curated cross-scale pathology questions reaches SOTA on both multi-magnification and single-scale benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Pathological images require pathologists to combine evidence from low-magnification tissue architecture and high-magnification cellular details. Existing vision-language models lack explicit training on cross-scale reasoning tasks even when multi-scale images are present. The paper therefore builds a leakage-aware curation pipeline that blocks text-only shortcuts in multi-image VQA and uses it to create the Scale-VQA benchmark of 4,685 questions over 2,537 images. ScaleReasoner-R1 is then trained with reinforcement learning on this benchmark. The resulting model sets new performance records on the cross-scale task and also improves results on established single-scale pathology benchmarks.

Core claim

ScaleReasoner-R1, trained via reinforcement learning to optimize performance on cross-scale VQA tasks, achieves state-of-the-art performance on the cross-scale reasoning benchmark and generalizes to state-of-the-art performance on established single-scale benchmarks.

What carries the argument

The leakage-aware curation pipeline (adversarial text-only screening plus constraint-guided question design) that produces the Scale-VQA benchmark used to train ScaleReasoner-R1 with reinforcement learning.
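The adversarial text-only screening step can be pictured as a filter that discards any question a text-only solver answers correctly without seeing the images. The following is a minimal sketch under stated assumptions, not the authors' pipeline: `MCQuestion`, `screen_text_only`, the `max_correct` threshold, and both toy solvers are hypothetical names introduced for illustration, and a real screen would presumably call text-only language models in place of the toy heuristics.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class MCQuestion:
    """Hypothetical multiple-choice item; field names are illustrative only."""
    question: str
    options: Tuple[str, ...]
    answer_index: int


# A text-only solver sees the question and options, never the images.
TextOnlySolver = Callable[[str, Sequence[str]], int]


def screen_text_only(
    questions: Sequence[MCQuestion],
    solvers: Sequence[TextOnlySolver],
    max_correct: int = 0,
) -> Tuple[List[MCQuestion], List[MCQuestion]]:
    """Split questions into (kept, rejected).

    A question is rejected when more than `max_correct` text-only solvers
    pick the right option without seeing any image.
    """
    kept, rejected = [], []
    for q in questions:
        n_correct = sum(s(q.question, q.options) == q.answer_index for s in solvers)
        (rejected if n_correct > max_correct else kept).append(q)
    return kept, rejected


def overlap_solver(question: str, options: Sequence[str]) -> int:
    """Toy shortcut: choose the option sharing the most words with the question."""
    q_words = set(question.lower().split())
    scores = [len(q_words & set(o.lower().split())) for o in options]
    return scores.index(max(scores))


def longest_solver(question: str, options: Sequence[str]) -> int:
    """Toy shortcut: choose the longest option."""
    lengths = [len(o) for o in options]
    return lengths.index(max(lengths))


# Toy items (invented for this sketch, not drawn from Scale-VQA).
leaky = MCQuestion(
    "At 40x magnification which feature is visible?",
    ("nuclear atypia at 40x magnification", "gland crowding", "fibrosis", "necrosis"),
    0,
)
clean = MCQuestion(
    "Which finding do the two views jointly support?",
    ("invasive carcinoma", "benign hyperplasia", "chronic inflammation", "normal mucosa"),
    3,
)
kept, rejected = screen_text_only([leaky, clean], [overlap_solver, longest_solver])
print(len(kept), "kept;", len(rejected), "rejected")
```

In this toy run the item whose correct option repeats the magnification wording of the question is rejected, while the item that gives neither heuristic a foothold is kept. Raising `max_correct` loosens the filter, which trades benchmark size against residual leakage.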

If this is right

Limited cross-scale supervision can significantly improve pathological understanding in VLMs.
The model generalizes from cross-scale training to SOTA results on single-scale benchmarks.
Pathology interpretation can be formulated as multi-magnification reasoning.
Multi-image VQA benchmarks require explicit protection against text-only shortcuts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Cross-scale supervision may prove useful for VLMs in any domain whose data naturally spans multiple resolutions.
The same curation approach could be tested on other multi-image medical or scientific VQA tasks to reduce shortcut learning.
Reinforcement learning may be especially suited to teaching evidence integration across scales compared with standard supervised fine-tuning.

Load-bearing premise

The leakage-aware curation pipeline successfully removes text-only shortcuts so that performance truly reflects cross-scale visual reasoning.

What would settle it

If models trained on Scale-VQA retain high accuracy when the same questions are rephrased to restore magnification-dependent text cues, or if single-scale benchmark gains disappear when cross-scale questions are removed from training, the claim that gains stem from visual cross-scale reasoning would be falsified.

Figures

Figures reproduced from arXiv: 2606.17412 by Chi Phan, Dan Hu, Qiaochu Xue, Sudong Wang, Tianyi Zhang, Yueming Jin, Yufeng Wu, Zeyu Liu.

**Figure 1.** A comparison of (a) single-scale VQA, (b) naïve cross-scale VQA with text-only shortcut solutions, and (c) our leakage-aware cross-scale VQA.

Adjacent text from the paper: … WSIs [9]. We curate 2,537 ROI captions at 10×, 40×, and 200×, and collect 937 cross-scale captions that synthesize visual patterns across magnifications. Second, we design a leakage-aware VQA curation pipeline (Fig. 2a) that combines adversarial text-only screening …

**Figure 2.** Overview of Scale-VQA and ScaleReasoner-R1. (a) Leakage-aware curation pipeline. (b) Dataset overview. (c) GRPO-based RL training. (d) Cross-scale reasoning results. (e) Example cross-scale VQA.

Adjacent text from the paper (Section 2 Methods, 2.1 Clinical Annotation for Cross-scale Reasoning): We construct a clinically verified foundation for cross-scale reasoning by annotating 177 publicly available TCGA WSIs [9]. Unlike prior benchmarks [21, …

Original abstract

Pathological images are inherently multi-scale, requiring pathologists to integrate evidence from global tissue architecture at low magnification to cellular morphology at higher magnification for accurate diagnosis. While existing pathological datasets for vision-language models (VLMs) include various scales, they often lack explicit cross-scale reasoning objectives. This limitation prevents VLMs from capturing essential cross-scale representations and learning evidence-based reasoning. To bridge this gap, we introduce the first cross-scale training and evaluation paradigm that formulates pathology interpretation as multi-magnification reasoning. However, creating such a task reveals a critical challenge: multi-image visual question answering (VQA) is prone to text-only shortcuts, which allow models to guess answers using magnification-dependent artifacts rather than visual evidence. To address this, we propose a leakage-aware curation pipeline that combines adversarial text-only screening with constraint-guided question design. Using this pipeline, we construct Scale-VQA, a high-quality benchmark with 4,685 multiple-choice questions grounded in 2,537 pathology images across multiple magnification levels. Finally, we present ScaleReasoner-R1, a model trained via reinforcement learning to optimize performance on cross-scale VQA tasks. ScaleReasoner-R1 achieves state-of-the-art performance on our cross-scale reasoning benchmark and generalizes to SOTA performance on established single-scale benchmarks. Findings suggest that even the limited cross-scale supervision can significantly improve pathological understanding. Code is available at https://github.com/iMVR-PL/ScaleReasoner-R1.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper builds a new multi-magnification VQA benchmark for pathology with shortcut-aware curation and trains an RL model on it, but the curation's success at blocking text-only exploits is not quantified in the abstract.

The main thing here is that the authors introduce Scale-VQA, a benchmark of 4,685 multiple-choice questions drawn from 2,537 pathology images across magnification levels, and train ScaleReasoner-R1 with reinforcement learning to handle explicit cross-scale reasoning. They claim this gives SOTA on their benchmark and also improves results on existing single-scale pathology VQA sets.

The new element is the deliberate framing of pathology VQA as multi-magnification reasoning plus the leakage-aware pipeline that combines adversarial text-only screening with constraint-guided question writing. That pipeline directly targets the real risk that models will latch onto magnification-dependent text patterns instead of visual features. The RL objective is a straightforward way to push the model toward integrating low- and high-magnification evidence, which matches how pathologists actually work.
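Figure 2(c) names GRPO as the reinforcement learning method. The core of GRPO is a group-relative advantage: several answers are sampled for the same question, each is scored, and each score is normalized by the mean and standard deviation of its own group. The sketch below illustrates only that normalization. The binary correctness reward, the function names, and the use of the population standard deviation are assumptions made for illustration; the reward actually used for ScaleReasoner-R1 is not stated in the text reviewed here.

```python
import statistics
from typing import List, Sequence


def correctness_reward(predicted: str, gold: str) -> float:
    """Assumed reward: 1.0 if the chosen option letter matches, else 0.0."""
    return 1.0 if predicted.strip().upper() == gold.strip().upper() else 0.0


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward by the mean and std of its own sampling group.

    Uses the population standard deviation; some implementations use the
    sample standard deviation instead.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four hypothetical sampled answers to one multiple-choice question.
sampled = ["B", "a", "B", "C"]
gold = "B"
rewards = [correctness_reward(s, gold) for s in sampled]
advantages = group_relative_advantages(rewards)
print(rewards, advantages)
```

Correct samples receive positive advantages and incorrect ones negative, so the policy update favors answers that beat the group average. When every sample in a group earns the same reward, all advantages are zero and that question contributes no learning signal.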

The construction effort looks honest. They acknowledge the shortcut problem up front and try to engineer around it, and the claim that even limited cross-scale supervision can lift overall performance is worth checking.

The soft spot is exactly the one the stress test flags. The abstract describes the screening step but gives no numbers on how well a text-only baseline performs on the final 4,685-question set. Without those figures it is hard to know whether the reported gains come from genuine cross-scale visual reasoning or from whatever leakage survived the filter. The generalization result to single-scale benchmarks would also need tighter controls on training data and baselines to be fully convincing.

This is for groups working on medical VLMs who care about realistic multi-scale evaluation. It deserves peer review because the problem is practical, the benchmark construction is thoughtful, and the RL approach is reproducible enough that referees can test the curation claims directly.

Referee Report

1 major / 2 minor

Summary. The paper introduces a cross-scale reasoning paradigm for pathological VLMs. It constructs Scale-VQA, a benchmark of 4,685 multiple-choice VQA items grounded in 2,537 multi-magnification pathology images, via a leakage-aware curation pipeline (adversarial text-only screening plus constraint-guided question design) intended to eliminate magnification-dependent textual shortcuts. ScaleReasoner-R1 is then trained with reinforcement learning on this benchmark and is reported to achieve SOTA on Scale-VQA while generalizing to SOTA on established single-scale pathology benchmarks. The authors conclude that even limited cross-scale supervision improves pathological understanding; code is released.

Significance. If the curation pipeline demonstrably removes text-only shortcuts, the work would be significant: it supplies the first explicit cross-scale VQA benchmark and training objective for pathology, where multi-magnification integration is clinically essential. The RL optimization step and the reported generalization to single-scale tasks are concrete strengths. Public code release supports reproducibility.

major comments (1)

[Benchmark construction / leakage-aware curation pipeline] The leakage-aware curation pipeline (described in the methods section on benchmark construction) is load-bearing for the central claim that Scale-VQA performance reflects cross-scale visual reasoning. The manuscript provides no quantitative validation—such as accuracy of text-only baselines on the final 4,685-question set—leaving open the possibility that residual shortcuts remain. This directly affects both the SOTA result on Scale-VQA and the generalization claim.

minor comments (2)

The abstract states that 'even the limited cross-scale supervision can significantly improve pathological understanding' without quantifying the improvement or discussing limitations of the RL objective; a dedicated limitations paragraph would strengthen the discussion.
A table or figure reporting the exact text-only screening rejection rates and final question statistics would make the curation pipeline more transparent.
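The validation requested in the major comment could take the form of a text-only baseline accuracy on the final question set, compared against chance. The following is a sketch under stated assumptions: the four-option format and the count of correct answers are placeholders (the text reviewed here gives neither), and `accuracy` and `binomial_tail` are hypothetical helper names. Only the total of 4,685 questions comes from the paper.

```python
import math
from typing import Sequence


def accuracy(predictions: Sequence[int], answers: Sequence[int]) -> float:
    """Fraction of predicted option indices that match the gold indices."""
    if len(predictions) != len(answers) or not answers:
        raise ValueError("predictions and answers must be non-empty and equal in length")
    return sum(p == a for p, a in zip(predictions, answers)) / len(answers)


def binomial_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p), with 0 < p < 1.

    Each term is computed in log space so large n does not overflow.
    """
    log_p, log_q = math.log(p), math.log1p(-p)
    total = 0.0
    for i in range(k, n + 1):
        log_pmf = (
            math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
            + i * log_p + (n - i) * log_q
        )
        total += math.exp(log_pmf)
    return min(total, 1.0)


n_questions = 4685          # benchmark size reported in the paper
n_options = 4               # ASSUMPTION: options per question is not stated here
n_correct_text_only = 1200  # PLACEHOLDER count, not a reported result

chance = 1.0 / n_options
p_value = binomial_tail(n_correct_text_only, n_questions, chance)
print(f"text-only accuracy: {n_correct_text_only / n_questions:.3f}")
print(f"one-sided p-value against chance ({chance:.2f}): {p_value:.4f}")
```

A small p-value would indicate that the text-only baseline beats chance by more than sampling noise, that is, that some shortcut survived the filter. A large one is consistent with the authors' stated expectation of near-chance performance but does not prove that no leakage remains.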

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on the leakage-aware curation pipeline, which is central to the validity of Scale-VQA. We address the major comment below.

Referee: [Benchmark construction / leakage-aware curation pipeline] The leakage-aware curation pipeline (described in the methods section on benchmark construction) is load-bearing for the central claim that Scale-VQA performance reflects cross-scale visual reasoning. The manuscript provides no quantitative validation—such as accuracy of text-only baselines on the final 4,685-question set—leaving open the possibility that residual shortcuts remain. This directly affects both the SOTA result on Scale-VQA and the generalization claim.

Authors: We agree that the manuscript would be strengthened by explicit quantitative validation of the text-only screening. The adversarial text-only screening step was applied during curation to remove magnification-dependent textual shortcuts, but post-curation accuracy of text-only baselines on the final 4,685-question set was not reported. In the revised manuscript we will add these results (text-only model accuracy on the curated set, expected near random chance) in the benchmark construction section to confirm that residual shortcuts are minimal. This addition directly supports the cross-scale reasoning claim and the reported generalization.

Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark curation and RL training with independent evaluation claims.

The paper constructs Scale-VQA via an adversarial text-only screening pipeline and constraint-guided design, then trains ScaleReasoner-R1 with RL and reports empirical SOTA results on the new benchmark plus generalization to single-scale tasks. No equations, fitted parameters renamed as predictions, or self-citations appear in the provided text that would reduce any claim to a self-referential definition. The load-bearing assumption about shortcut removal is an empirical verification issue rather than a definitional or self-citation reduction. The derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; no free parameters, axioms, or invented entities are described.

Reference graph

Works this paper leans on

27 extracted references · 17 canonical work pages · 7 internal anchors

[1] Abels, E., Pantanowitz, L., Aeffner, F., Zarella, M.D., Van der Laak, J., Bui, M.M., Vemuri, V.N., Parwani, A.V., Gibbs, J., Agosto-Arroyo, E., et al.: Computational pathology definitions, best practices, and recommendations for regulatory guidance: a white paper from the digital pathology association. The Journal of pathology 249(3), 286–294 (2019)

[2] Agrawal, A., Batra, D., Parikh, D., Kembhavi, A.: Don’t just assume; look and answer: Overcoming priors for visual question answering. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 4971–4980 (2018). https://doi.org/10.1109/CVPR.2018.00522

[3] Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., et al.: Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923 (2025)

[4] Chen, K., Liu, M., Yan, F., et al.: Cost-effective instruction learning for pathology vision and language analysis. Nature Computational Science (2025). https://doi.org/10.1038/s43588-025-00818-5

[5] Chen, P., Zhu, C., Zheng, S., Li, H., Yang, L.: Wsi-vqa: Interpreting whole slide images by generative visual question answering. In: European Conference on Computer Vision. pp. 401–417. Springer (2025)

[6] Chen, Y., Wang, G., Ji, Y., Li, Y., Ye, J., Li, T., Ming, H., Yu, R., Qiao, Y., He, J.: Slidechat: A large vision-language assistant for whole-slide pathology image understanding. arXiv preprint arXiv:2410.11761 (2024)

[7] Chen, Y., Wang, G., Ji, Y., Li, Y., Ye, J., Li, T., Hu, M., Yu, R., Qiao, Y., He, J.: Slidechat: A large vision-language assistant for whole-slide pathology image understanding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 5134–5143 (Jun 2025)

[8] Chu, T., Zhai, Y., Yang, J., Tong, S., Xie, S., Schuurmans, D., Le, Q.V., Levine, S., Ma, Y.: Sft memorizes, rl generalizes: A comparative study of foundation model post-training. arXiv preprint arXiv:2501.17161 (2025)

[9] Colaprico, A., Silva, T.C., Olsen, C., Garofano, L., Cava, C., Garolini, D., Sabedot, T.S., Malta, T.M., Pagnotta, S.M., Castiglioni, I., et al.: Tcgabiolinks: an r/bioconductor package for integrative analysis of tcga data. Nucleic acids research 44(8), e71–e71 (2016)

[10] Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., Parikh, D.: Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 6904–6913 (2017). https://doi.org/10.1109/CVPR.2017.670

[11] Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al.: Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948 (2025)

[12] Hashimoto, N., Fukushima, D., Koga, R., Takagi, Y., Ko, K., Kohno, K., Nakaguro, M., Nakamura, S., Hontani, H., Takeuchi, I.: Multi-scale domain-adversarial multiple-instance cnn for cancer subtype classification with unannotated histopathological images. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 3852–38... (2020)

[13] He, X., Zhang, Y., Mou, L., Xing, E., Xie, P.: Pathvqa: 30000+ questions for medical visual question answering. arXiv preprint arXiv:2003.10286 (2020)

[14] Li, C., Wong, C., Zhang, S., Usuyama, N., Liu, H., Yang, J., Naumann, T., Poon, H., Gao, J.: Llava-med: Training a large language-and-vision assistant for biomedicine in one day. Advances in Neural Information Processing Systems 36, 28541–28564 (2023)

[15] Liang, Y., Lyu, X., Chen, W., Ding, M., Zhang, J., He, X., Wu, S., Xing, X., Yang, S., Wang, X., et al.: Wsi-llava: A multimodal large language model for whole slide image. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 22718–22727 (2025)

[16] Liao, D., Chen, S., Xi, N., Xue, Q., Li, J., Hou, L., Liu, Z., Low, C.H., Wu, Y., Liu, Y., Jiang, Y., Li, D., Lyu, S.: Unpuzzle: A unified framework for pathology image analysis (2025), https://arxiv.org/abs/2503.03152

[17] Lu, M.Y., Chen, B., Williamson, D.F.K., Chen, R.J., Zhao, M., Chow, A.K., Ikemura, K., Kim, A., Pouli, D., Patel, A., Soliman, A., Chen, C., Ding, T., Wang, J.J., Gerber, G., Liang, I., Le, L.P., Parwani, A.V., Weishaupt, L.L., Mahmood, F.: A multimodal generative ai copilot for human pathology. Nature 634(8033), 466–473 (Oct 2024). https://doi.org/10.1038/s41586-024-07618-3

[18] Saygin Seyfioglu, M., Ikezogwo, W.O., Ghezloo, F., Krishna, R., Shapiro, L.: Quilt-llava: Visual instruction tuning by extracting localized narratives from open-source histopathology videos. arXiv e-prints pp. arXiv–2312 (2023)

[19] Shrestha, R., Kafle, K., Kanan, C.: A negative case analysis of visual grounding methods for VQA. In: Proceedings of the 58th annual meeting of the association for computational linguistics. pp. 8172–8181 (2020). https://doi.org/10.18653/v1/2020.acl-main.727

[20] Singh, A., Fry, A., Perelman, A., Tart, A., Ganesh, A., El-Kishky, A., McLaughlin, A., Low, A., Ostrow, A., Ananthram, A., et al.: Openai gpt-5 system card. arXiv preprint arXiv:2601.03267 (2025)

[21] Sun, Y., Wu, H., Zhu, C., Zheng, S., Chen, Q., Zhang, K., Zhang, Y., Wan, D., Lan, X., Zheng, M., et al.: Pathmmu: A massive multimodal expert-level benchmark for understanding and reasoning in pathology. In: European Conference on Computer Vision. pp. 56–73. Springer (2024)

[22] Team, G., Anil, R., Borgeaud, S., Wu, Y., Alayrac, J.B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A.M., Hauth, A., et al.: Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805 (2023)

[23] Xu, W., Chan, H.P., Li, L., Aljunied, M., Yuan, R., Wang, J., Xiao, C., Chen, G., Liu, C., Li, Z., et al.: Lingshu: A generalist foundation model for unified multimodal medical understanding and reasoning. arXiv preprint arXiv:2506.07044 (2025)

[24] Xu, Z., Liu, Z., Hou, J., Ma, J., Jin, C., Wang, Y., Chen, Z., Zhang, Z., Huang, F., Guo, Z., et al.: A versatile pathology co-pilot via reasoning enhanced multimodal large language model. arXiv preprint arXiv:2507.17303 (2025)

[25] Zhang, H., Chen, J., Jiang, F., Yu, F., Chen, Z., Li, J., Chen, G., Wu, X., Zhang, Z., Xiao, Q., Wan, X., Wang, B., Li, H.: Huatuogpt, towards taming language models to be a doctor. arXiv preprint arXiv:2305.15075 (2023)

[26] Zhang, W., Zhang, P., Guo, J., Cheng, T., Chen, J., Zhang, S., Zhang, Z., Yi, Y., Bu, H.: Patho-r1: A multimodal reinforcement learning-based pathology expert reasoner. arXiv preprint arXiv:2505.11404 (2025)

[27] Zhang, Z., Chen, P., McGough, M., Xing, F., Wang, C., Bui, M., Xie, Y., Sapkota, M., Cui, L., Dhillon, J., et al.: Pathologist-level interpretable whole-slide cancer diagnosis with deep learning. Nature Machine Intelligence 1(5), 236–245 (2019)
