arxiv: 2606.12407 · v1 · pith:F3F7AJHRnew · submitted 2026-06-10 · 💻 cs.CV

How Seemingly Inconsequential Design Choices Dictate Performance of LLMs in Pathology

Pith reviewed 2026-06-27 09:36 UTC · model grok-4.3

classification 💻 cs.CV
keywords pathology · whole-slide images · large language models · input design · performance evaluation · MultiPathQA · TCGA · GTEx

The pith

Seemingly minor input choices raise GPT-5 from 15.1% to 39.5% accuracy on whole-slide cancer-type classification

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that general-purpose LLMs have appeared weak on whole-slide pathology images mainly because of non-optimized ways of breaking those images into patches for processing. A systematic test of four design factors—inference mode, patch size, magnification, and patch count—identifies one balanced setup, large patches at lower magnification processed jointly, that lifts performance sharply. On the MultiPathQA benchmark this change moves GPT-5 from 15.1% to 39.5% on cancer-type classification and from 38.1% to 62.9% on organ classification. The same setup improves two other models and a held-out cohort without any extra tuning. If the finding holds, much of the reported gap between general LLMs and specialized pathology systems stems from input handling rather than model architecture.

Core claim

Prior studies have overstated the gap between specialized pathology models and general-purpose LLMs by using non-optimized input configurations; a single balanced configuration of large patches at lower magnification processed jointly raises GPT-5 from 15.1% to 39.5% on TCGA cancer-type classification and from 38.1% to 62.9% on GTEx organ classification, with per-task optimization yielding further gains and the configuration generalizing to other models and the held-out CPTAC cohort.

What carries the argument

Factorial analysis of the four input design factors (inference mode, patch size, magnification, patch count) that control how whole-slide images are divided and fed to an LLM.
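
To make the idea of a factorial analysis concrete, here is a minimal sketch of how a full grid over the four factors could be enumerated. The factor names follow the text; the specific levels are assumptions for illustration only (224 px, 896 px, and the 5×/10×/20× magnifications are mentioned elsewhere on this page, while the patch counts are placeholders), and this is not the paper's actual grid.

```python
from itertools import product

# Illustrative levels only; the paper's real grid is not given on this page.
FACTORS = {
    "inference_mode": ["all_in_one", "majority_vote"],
    "patch_size_px": [224, 896],
    "magnification": [5, 10, 20],
    "patch_count": [10, 30],
}


def factorial_grid(factors):
    """Return one dict per combination of factor levels (full factorial)."""
    names = list(factors)
    return [dict(zip(names, combo)) for combo in product(*factors.values())]


grid = factorial_grid(FACTORS)
print(len(grid))  # number of configurations to evaluate
print(grid[0])    # first configuration in the grid
```

Each dictionary in `grid` describes one configuration to run against the benchmark, so the cost of the study scales with the product of the level counts.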

If this is right

  • Comparisons between general LLMs and specialized pathology models have overstated the performance gap.
  • A single non-tuned configuration improves results across multiple models and datasets.
  • Per-task optimization of the four factors can produce still larger gains.
  • The same input choices transfer to fully held-out cohorts without retraining.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Input optimization of this kind may narrow apparent gaps between general and specialized models on other high-resolution image domains.
  • Benchmark papers should document and vary input configuration details when testing general models on large images.
  • Future work could test whether additional factors such as color normalization or patch overlap produce further gains.

Load-bearing premise

The four tested input design factors and the MultiPathQA benchmark plus held-out cohorts sufficiently represent the variables and tasks that determine LLM performance on real-world pathology WSIs.

What would settle it

A new experiment on additional pathology tasks or models in which the balanced large-patch low-magnification joint configuration produces no accuracy gain or a loss would falsify the claim.

Figures

Figures reproduced from arXiv: 2606.12407 by Arjun K. Manrai, Kian R. Weihrauch, Thomas A. Buckley, William Lotter.

Figure 1. General-purpose LLMs are widely used as baselines for pathology-specific models, yet …
Figure 2. Variance decomposition of GPT-5 results on 100 WSI subset using ANOVA with two-factor …
Figure 3. Performance distribution on the 100-WSI subset, grouped by factor level. Each box …
Figure 4. All-in-One vs. Majority Vote on the full 934-question MultiPathQA dataset. Both inference …
Figure 5. Scaling curves under All-in-One inference across the full 934-question MultiPathQA …
Figure 6. Heatmaps showing accuracy as a function of patch size and magnification under All-in-One …
Figure 7. Variance decomposition over the full 934-question MultiPathQA dataset (All-in-One …
Figure 8. Total field of view, (Patch Size/Magnification) × Patch Count.
B Additional statistical analyses. Justification and implementation of two-way ANOVA. To quantify the relative contribution of each factor, we perform a two-way interaction ANOVA on the accuracy of each setting per benchmark. We estimate effect sizes with omega squared ($\omega^2$):

$$\omega^2 = \max\left(0,\ \frac{SS_{\text{effect}} - df_{\text{effect}} \cdot MS_{\text{error}}}{SS_{\text{total}} + MS_{\text{error}}}\right) \qquad (1)$$

W…
Figure 9. Scaling behavior as a function of total field of view across the full MultiPathQA dataset …
Figure 10. Performance relative to FOV-based prediction on the full MultiPathQA dataset. Boxen …
Figure 11. Performance across configurations.
C.2 Mixed-magnification ablation. Because benchmark performance varied across magnifications, we tested whether combining multiple spatial scales in a single input could provide a more robust representation. Using All-in-One inference, each whole-slide image was represented by 30 total patches at 896 px: 10 patches at 5×, 10 at 10×, and 10 at 20×, all jointly provided to …
Figure 12. Representative example patches comparing the prior baseline configuration (224 px …
Figure 13. Representative patches extracted from the same spatial coordinate under different magnifi…
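
The two quantities quoted under Figure 8, total field of view and the omega-squared effect size in Eq. (1), are simple enough to sketch directly. The function and argument names below are assumptions for illustration; the sums of squares and mean squares would come from whatever ANOVA routine is used, which the excerpt does not specify, and the field-of-view value is left in the caption's own unspecified units.

```python
def total_field_of_view(patch_size_px, magnification, patch_count):
    """(Patch Size / Magnification) x Patch Count, as stated in the Figure 8 caption."""
    if magnification <= 0:
        raise ValueError("magnification must be positive")
    return (patch_size_px / magnification) * patch_count


def omega_squared(ss_effect, df_effect, ms_error, ss_total):
    """Omega-squared effect size, clipped at zero as in Eq. (1)."""
    denominator = ss_total + ms_error
    if denominator <= 0:
        raise ValueError("SS_total + MS_error must be positive")
    return max(0.0, (ss_effect - df_effect * ms_error) / denominator)


# Compare a small, high-magnification setup with a large, low-magnification one
# (same patch count; the specific values are illustrative).
small_high_mag = total_field_of_view(224, 20, 10)
large_low_mag = total_field_of_view(896, 5, 10)
print(large_low_mag / small_high_mag)
```

The clipping in `omega_squared` matters because the unclipped estimator can go negative when an effect's sum of squares is smaller than what noise alone would produce.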
Original abstract

General-purpose large language models (LLMs) are routinely used as baselines when evaluating specialized pathology models on whole-slide images (WSIs). Because WSIs exceed contemporary model context limits, LLM baselines routinely use small, high-magnification patches processed independently via majority voting, without systematic evaluation of seemingly inconsequential design choices such as patch size, patch count, and magnification. Generalist LLMs have consistently underperformed specialized systems, reinforcing the perception that domain-specific training or architectural adaptation is necessary for pathology tasks involving WSIs. Here, we conduct a systematic factorial analysis of four input design factors: inference mode, patch size, magnification, and patch count. We demonstrate that prior studies have overstated the gap between specialized models and general-purpose LLMs by choosing non-optimized input configurations. On the MultiPathQA benchmark, switching to a single balanced configuration (large patches at lower magnification, processed jointly) raises GPT-5 from 15.1% to 39.5% on cancer-type classification (TCGA) and from 38.1% to 62.9% on organ classification (GTEx). Per-task optimization yields further gains up to 43.9% (TCGA) and 71.6% (GTEx). The same configuration generalizes to two other models and to a fully held-out CPTAC cohort, where it improves Gemini 3 Flash by 23.4 percentage points without any task-specific tuning.
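
The two inference modes contrasted in the abstract differ only in how patches reach the model: one call per patch followed by a vote, or a single call that sees every patch. The sketch below illustrates that difference. `classify` is a hypothetical stand-in for an LLM call that takes a list of patches and returns a label, and breaking ties by first occurrence is an assumption, not something the abstract specifies.

```python
from collections import Counter


def majority_vote(patch_predictions):
    """Aggregate per-patch labels into one slide-level label.

    Ties go to the label that appeared first (an assumption of this sketch).
    """
    if not patch_predictions:
        raise ValueError("need at least one patch prediction")
    return Counter(patch_predictions).most_common(1)[0][0]


def predict_majority_vote(patches, classify):
    """Query the model once per patch, then vote over the answers."""
    return majority_vote([classify([patch]) for patch in patches])


def predict_all_in_one(patches, classify):
    """Query the model once with all patches in a single input."""
    return classify(list(patches))
```

Under majority voting the model never sees two patches together, so any evidence that only emerges from comparing regions is unavailable to it; the joint mode removes that restriction at the cost of a longer single input.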

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that prior LLM evaluations on whole-slide pathology images overstated the performance gap versus specialized models by using non-optimized input design choices (inference mode, patch size, magnification, patch count). Through a factorial analysis on the MultiPathQA benchmark, a single balanced configuration (large patches at lower magnification, joint processing) raises GPT-5 from 15.1% to 39.5% on TCGA cancer-type classification and from 38.1% to 62.9% on GTEx organ classification; per-task optimization yields up to 43.9% and 71.6%. The configuration generalizes to other models and improves Gemini 3 Flash by 23.4 pp on the held-out CPTAC cohort without task-specific tuning.

Significance. If the results hold, the work shows that input configuration choices can close much of the reported gap between generalist LLMs and domain-specific pathology systems, with direct implications for benchmarking practices. Credit is due for the use of external held-out cohorts (CPTAC) and evaluation across multiple models without fitted parameters or self-referential derivations.

major comments (2)
  1. [Abstract and Results] The concrete percentage-point gains (15.1%→39.5%, 38.1%→62.9%, +23.4 pp) are reported without error bars, confidence intervals, or any statistical significance tests. This is load-bearing for the central claim that design choices dictate performance, as the improvements cannot be assessed for reliability versus experimental variance.
  2. [Methods] The description of the factorial analysis does not report how variance across random seeds, multiple runs, or interaction effects was quantified, nor does it provide the full experimental protocol. This undermines confidence in the robustness of the recommended 'balanced configuration' and its generalization.
minor comments (1)
  1. [Introduction/Methods] Ensure the MultiPathQA benchmark, including task definitions and cohort details, is explicitly defined with references in the main text rather than assumed from the abstract.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on statistical reporting and methodological transparency. We address each major comment below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [Abstract and Results] The concrete percentage-point gains (15.1%→39.5%, 38.1%→62.9%, +23.4 pp) are reported without error bars, confidence intervals, or any statistical significance tests. This is load-bearing for the central claim that design choices dictate performance, as the improvements cannot be assessed for reliability versus experimental variance.

    Authors: We agree that error bars and significance tests would strengthen the presentation. The reported figures come from single deterministic runs per configuration (fixed temperature=0 where supported) due to the high cost of proprietary LLM API calls on thousands of WSI patches. The factorial design itself demonstrates robustness via consistent directional improvements across four factors, two tasks, three models, and a held-out cohort. We will add a limitations paragraph explicitly noting the single-run nature and the magnitude of gains relative to typical LLM variance, and we will compute and report binomial confidence intervals on the classification accuracies in the revised Results. revision: partial

  2. Referee: [Methods] The description of the factorial analysis does not report how variance across random seeds, multiple runs, or interaction effects was quantified, nor does it provide the full experimental protocol. This undermines confidence in the robustness of the recommended 'balanced configuration' and its generalization.

    Authors: We will expand the Methods section with the complete experimental protocol (exact patch extraction code, API parameters, prompt templates, and decision rules for each factor combination). Because the study used a full 2^4 factorial design on fixed data splits and deterministic inference settings, interaction effects were not separately modeled; main-effect trends are reported. We will add an explicit statement that random-seed variance was not quantified owing to cost and determinism, while noting that the same balanced configuration improved performance on two additional models and the fully held-out CPTAC cohort without any re-tuning. revision: yes
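
The binomial confidence intervals promised in the first response could be computed along the following lines. This is a minimal sketch using the Wilson score interval, which is one common choice for accuracies; the rebuttal does not say which interval the authors intend to use, and the counts in the usage line are hypothetical rather than taken from the paper.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= correct <= total:
        raise ValueError("correct must lie between 0 and total")
    p_hat = correct / total
    denom = 1 + z**2 / total
    center = (p_hat + z**2 / (2 * total)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / total + z**2 / (4 * total**2)) / denom
    )
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Hypothetical counts: 40 correct answers out of 100 questions.
low, high = wilson_interval(40, 100)
print(f"accuracy 0.40, 95% CI [{low:.3f}, {high:.3f}]")
```

Two configurations evaluated on the same questions are paired, so non-overlapping intervals are only a rough guide; a paired test would be the more direct check of whether a gain is reliable.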

Circularity Check

0 steps flagged

No circularity: empirical benchmark study with direct measurements on held-out data

Full rationale

The paper conducts a factorial experiment on four input design factors (inference mode, patch size, magnification, patch count) and reports measured accuracy changes on MultiPathQA (TCGA/GTEx) plus a fully held-out CPTAC cohort. The central claim does not rest on derived equations, fitted parameters renamed as predictions, load-bearing self-citations, or an ansatz or renaming of known results. All reported gains (e.g., 15.1% to 39.5%) are direct empirical outcomes from tested configurations, externally verifiable on the stated benchmarks without reduction to the paper's own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are introduced; the study relies on existing public benchmarks (TCGA, GTEx, CPTAC) and off-the-shelf models.

pith-pipeline@v0.9.1-grok · 5808 in / 1227 out tokens · 32888 ms · 2026-06-27T09:36:22.680010+00:00 · methodology

