How Seemingly Inconsequential Design Choices Dictate Performance of LLMs in Pathology

Arjun K. Manrai; Kian R. Weihrauch; Thomas A. Buckley; William Lotter

arxiv: 2606.12407 · v1 · pith:F3F7AJHRnew · submitted 2026-06-10 · 💻 cs.CV

How Seemingly Inconsequential Design Choices Dictate Performance of LLMs in Pathology

Kian R. Weihrauch , Thomas A. Buckley , William Lotter , Arjun K. Manrai This is my paper

Pith reviewed 2026-06-27 09:36 UTC · model grok-4.3

classification 💻 cs.CV

keywords pathologywhole-slide imageslarge language modelsinput designperformance evaluationMultiPathQATCGAGTEx

0 comments

The pith

Seemingly minor input choices raise general LLMs from 15% to 40% accuracy on pathology whole-slide tasks

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that general-purpose LLMs have appeared weak on whole-slide pathology images mainly because of non-optimized ways of breaking those images into patches for processing. A systematic test of four design factors—inference mode, patch size, magnification, and patch count—identifies one balanced setup, large patches at lower magnification processed jointly, that lifts performance sharply. On the MultiPathQA benchmark this change moves GPT-5 from 15.1% to 39.5% on cancer-type classification and from 38.1% to 62.9% on organ classification. The same setup improves two other models and a held-out cohort without any extra tuning. If the finding holds, much of the reported gap between general LLMs and specialized pathology systems stems from input handling rather than model architecture.

Core claim

Prior studies have overstated the gap between specialized pathology models and general-purpose LLMs by using non-optimized input configurations; a single balanced configuration of large patches at lower magnification processed jointly raises GPT-5 from 15.1% to 39.5% on TCGA cancer-type classification and from 38.1% to 62.9% on GTEx organ classification, with per-task optimization yielding further gains and the configuration generalizing to other models and the held-out CPTAC cohort.

What carries the argument

Factorial analysis of the four input design factors (inference mode, patch size, magnification, patch count) that control how whole-slide images are divided and fed to an LLM.

If this is right

Comparisons between general LLMs and specialized pathology models have overstated the performance gap.
A single non-tuned configuration improves results across multiple models and datasets.
Per-task optimization of the four factors can produce still larger gains.
The same input choices transfer to fully held-out cohorts without retraining.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Input optimization of this kind may narrow apparent gaps between general and specialized models on other high-resolution image domains.
Benchmark papers should document and vary input configuration details when testing general models on large images.
Future work could test whether additional factors such as color normalization or patch overlap produce further gains.

Load-bearing premise

The four tested input design factors and the MultiPathQA benchmark plus held-out cohorts sufficiently represent the variables and tasks that determine LLM performance on real-world pathology WSIs.

What would settle it

A new experiment on additional pathology tasks or models in which the balanced large-patch low-magnification joint configuration produces no accuracy gain or a loss would falsify the claim.

Figures

Figures reproduced from arXiv: 2606.12407 by Arjun K. Manrai, Kian R. Weihrauch, Thomas A. Buckley, William Lotter.

**Figure 2.** Figure 2: Variance decomposition of GPT-5 results on 100 WSI subset using ANOVA with two-factor [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 3.** Figure 3: Performance distribution on the 100-WSI subset, grouped by factor level. Each box [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 4.** Figure 4: All-in-One vs. Majority Vote on the full 934-question MultiPathQA dataset. Both inference [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

**Figure 5.** Figure 5: Scaling curves under All-in-One inference across the full 934-question MultiPathQA [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗

**Figure 6.** Figure 6: Heatmaps showing accuracy as a function of patch size and magnification under All-in-One [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗

**Figure 7.** Figure 7: Variance decomposition over the full 934-question MultiPathQA dataset (All-in-One [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗

**Figure 8.** Figure 8: Total field of view, (Patch Size/Magnification) × Patch Count. B Additional statistical analyses Justification and implementation of two-way ANOVA. To quantify the relative contribution of each factor, we perform a two-way interaction ANOVA on the accuracy of each setting per benchmark. We estimate effect sizes with omega squared (ω 2 ): ω 2 = max 0, SSeffect − dfeffect · MSerror SStotal + MSerror (1) W… view at source ↗

**Figure 9.** Figure 9: Scaling behavior as a function of total field of view across the full MultiPathQA dataset [PITH_FULL_IMAGE:figures/full_fig_p016_9.png] view at source ↗

**Figure 10.** Figure 10: Performance relative to FOV-based prediction on the full MultiPathQA dataset. Boxen [PITH_FULL_IMAGE:figures/full_fig_p016_10.png] view at source ↗

**Figure 11.** Figure 11: Performance across configurations. C.2 Mixed-magnification ablation Because benchmark performance varied across magnifications, we tested whether combining multiple spatial scales in a single input could provide a more robust representation. Using All-in-One inference, each whole-slide image was represented by 30 total patches at 896 px: 10 patches at 5×, 10 at 10×, and 10 at 20×, all jointly provided to … view at source ↗

**Figure 12.** Figure 12: Representative example patches comparing the prior baseline configuration (224 px [PITH_FULL_IMAGE:figures/full_fig_p021_12.png] view at source ↗

**Figure 13.** Figure 13: Representative patches extracted from the same spatial coordinate under different magnifi [PITH_FULL_IMAGE:figures/full_fig_p022_13.png] view at source ↗

read the original abstract

General-purpose large language models (LLMs) are routinely used as baselines when evaluating specialized pathology models on whole-slide images (WSIs). Because WSIs exceed contemporary model context limits, LLM baselines routinely use small, high-magnification patches processed independently via majority voting, without systematic evaluation of seemingly inconsequential design choices such as patch size, patch count, and magnification. Generalist LLMs have consistently underperformed specialized systems, reinforcing the perception that domain-specific training or architectural adaptation is necessary for pathology tasks involving WSIs. Here, we conduct a systematic factorial analysis of four input design factors: inference mode, patch size, magnification, and patch count. We demonstrate that prior studies have overstated the gap between specialized models and general-purpose LLMs by choosing non-optimized input configurations. On the MultiPathQA benchmark, switching to a single balanced configuration (large patches at lower magnification, processed jointly) raises GPT-5 from 15.1% to 39.5% on cancer-type classification (TCGA) and from 38.1% to 62.9% on organ classification (GTEx). Per-task optimization yields further gains up to 43.9% (TCGA) and 71.6% (GTEx). The same configuration generalizes to two other models and to a fully held-out CPTAC cohort, where it improves Gemini 3 Flash by 23.4 percentage points without any task-specific tuning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Input tweaks close much of the LLM-specialist gap on pathology WSIs, but the work needs stats and broader validation.

read the letter

The main point is that non-optimized input choices have made general LLMs look worse than they are on whole-slide pathology tasks. Switching to larger patches at lower magnification and processing them together boosts GPT-5 from 15.1% to 39.5% on TCGA cancer classification and from 38.1% to 62.9% on GTEx organ classification. The same setup improves another model on held-out CPTAC data by over 20 points.

The paper does a factorial sweep over inference mode, patch size, magnification, and patch count. That systematic approach is the real contribution. It shows the gains transfer across models without per-task tuning.

The evidence is limited though. No error bars or statistical tests appear in the abstract, and the full methods aren't detailed here. The tasks are coarse classification; finer-grained pathology work might respond differently to magnification. Unexamined factors like stain handling could matter as much or more.

This is worth attention for anyone setting LLM baselines in medical imaging. Readers evaluating generalist models on WSIs will find the input optimization angle practical. It should go to peer review because the core empirical observation is clear and could shift how baselines are built, provided the authors add variance estimates and test more tasks.

Referee Report

2 major / 1 minor

Summary. The paper claims that non-optimized input design choices (inference mode, patch size, magnification, patch count) in prior LLM evaluations on whole-slide pathology images have overstated the performance gap versus specialized models. Through a factorial analysis on the MultiPathQA benchmark, a single balanced configuration (large patches at lower magnification, joint processing) raises GPT-5 from 15.1% to 39.5% on TCGA cancer-type classification and 38.1% to 62.9% on GTEx organ classification; per-task optimization yields up to 43.9% and 71.6%. The configuration generalizes to other models and improves Gemini 3 Flash by 23.4 pp on the held-out CPTAC cohort without task-specific tuning.

Significance. If the results hold, the work shows that input configuration choices can close much of the reported gap between generalist LLMs and domain-specific pathology systems, with direct implications for benchmarking practices. Credit is due for the use of external held-out cohorts (CPTAC) and evaluation across multiple models without fitted parameters or self-referential derivations.

major comments (2)

[Abstract and Results] Abstract and Results sections: the concrete percentage-point gains (15.1%→39.5%, 38.1%→62.9%, +23.4 pp) are reported without error bars, confidence intervals, or any statistical significance tests. This is load-bearing for the central claim that design choices dictate performance, as the improvements cannot be assessed for reliability versus experimental variance.
[Methods] Methods (factorial analysis description): the systematic evaluation of the four factors does not report how variance across random seeds, multiple runs, or interaction effects was quantified, nor does it provide the full experimental protocol. This undermines confidence in the robustness of the recommended 'balanced configuration' and its generalization.

minor comments (1)

[Introduction/Methods] Ensure the MultiPathQA benchmark, including task definitions and cohort details, is explicitly defined with references in the main text rather than assumed from the abstract.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on statistical reporting and methodological transparency. We address each major comment below and will revise the manuscript accordingly.

read point-by-point responses

Referee: [Abstract and Results] Abstract and Results sections: the concrete percentage-point gains (15.1%→39.5%, 38.1%→62.9%, +23.4 pp) are reported without error bars, confidence intervals, or any statistical significance tests. This is load-bearing for the central claim that design choices dictate performance, as the improvements cannot be assessed for reliability versus experimental variance.

Authors: We agree that error bars and significance tests would strengthen the presentation. The reported figures come from single deterministic runs per configuration (fixed temperature=0 where supported) due to the high cost of proprietary LLM API calls on thousands of WSI patches. The factorial design itself demonstrates robustness via consistent directional improvements across four factors, two tasks, three models, and a held-out cohort. We will add a limitations paragraph explicitly noting the single-run nature and the magnitude of gains relative to typical LLM variance, and we will compute and report binomial confidence intervals on the classification accuracies in the revised Results. revision: partial
Referee: [Methods] Methods (factorial analysis description): the systematic evaluation of the four factors does not report how variance across random seeds, multiple runs, or interaction effects was quantified, nor does it provide the full experimental protocol. This undermines confidence in the robustness of the recommended 'balanced configuration' and its generalization.

Authors: We will expand the Methods section with the complete experimental protocol (exact patch extraction code, API parameters, prompt templates, and decision rules for each factor combination). Because the study used a full 2^4 factorial design on fixed data splits and deterministic inference settings, interaction effects were not separately modeled; main-effect trends are reported. We will add an explicit statement that random-seed variance was not quantified owing to cost and determinism, while noting that the same balanced configuration improved performance on two additional models and the fully held-out CPTAC cohort without any re-tuning. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark study with direct measurements on held-out data

full rationale

The paper conducts a factorial experiment on four input design factors (inference mode, patch size, magnification, patch count) and reports measured accuracy changes on MultiPathQA (TCGA/GTEx) plus a fully held-out CPTAC cohort. No equations, no fitted parameters renamed as predictions, no self-citation load-bearing uniqueness claims, and no ansatz or renaming of known results. All reported gains (e.g., 15.1% to 39.5%) are direct empirical outcomes from tested configurations, externally verifiable on the stated benchmarks without reduction to the paper's own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are introduced; the study relies on existing public benchmarks (TCGA, GTEx, CPTAC) and off-the-shelf models.

pith-pipeline@v0.9.1-grok · 5808 in / 1227 out tokens · 32888 ms · 2026-06-27T09:36:22.680010+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

36 extracted references · 11 canonical work pages · 4 internal anchors

[1]

Mllm-hwsi: A multimodal large language model for hierarchical whole slide image understanding.arXiv preprint arXiv:2603.23067, 2026

Basit Alawode, Arif Mahmood, Muaz Khalifa Al-Radi, Shahad Albastaki, Asim Khan, Muham- mad Bilal, Moshira Ali Abdalla, Mohammed Bennamoun, and Sajid Javed. Mllm-hwsi: A multimodal large language model for hierarchical whole slide image understanding.arXiv preprint arXiv:2603.23067, 2026

work page arXiv 2026
[2]

Navigating Gigapixel Pathology Images with Large Multimodal Models

Thomas A Buckley, Kian R Weihrauch, Katherine Latham, Andrew Z Zhou, Padmini A Manrai, and Arjun K Manrai. Navigating gigapixel pathology images with large multimodal models. arXiv preprint arXiv:2511.19652, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

Artificial intelligence for diagnosis and gleason grading of prostate cancer: the panda challenge.Nature medicine, 28(1):154–163, 2022

Wouter Bulten, Kimmo Kartasalo, Po-Hsuan Cameron Chen, Peter Ström, Hans Pinckaers, Kunal Nagpal, Yuannan Cai, David F Steiner, Hester Van Boven, Robert Vink, et al. Artificial intelligence for diagnosis and gleason grading of prostate cancer: the panda challenge.Nature medicine, 28(1):154–163, 2022

2022
[4]

A novel approach to high-quality postmortem tissue procurement: the gtex project.Biopreservation and biobanking, 13(5):311–319, 2015

Latarsha J Carithers, Kristin Ardlie, Mary Barcus, Philip A Branton, Angela Britton, Stephen A Buia, Carolyn C Compton, David S DeLuca, Joanne Peter-Demchok, Ellen T Gelfand, et al. A novel approach to high-quality postmortem tissue procurement: the gtex project.Biopreservation and biobanking, 13(5):311–319, 2015

2015
[5]

Pathagent: Toward interpretable analysis of whole- slide pathology images via large language model-based agentic reasoning.arXiv preprint arXiv:2511.17052, 2025

Jingyun Chen, Linghan Cai, Zhikang Wang, Yi Huang, Songhan Jiang, Shenjin Huang, Hongpeng Wang, and Yongbing Zhang. Pathagent: Toward interpretable analysis of whole- slide pathology images via large language model-based agentic reasoning.arXiv preprint arXiv:2511.17052, 2025

work page arXiv 2025
[6]

Scaling vision transformers to gigapixel images via hierarchical self-supervised learning

Richard J Chen, Chengkuan Chen, Yicong Li, Tiffany Y Chen, Andrew D Trister, Rahul G Krishnan, and Faisal Mahmood. Scaling vision transformers to gigapixel images via hierarchical self-supervised learning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16144–16155, 2022

2022
[7]

Towards a general- purpose foundation model for computational pathology.Nature medicine, 30(3):850–862, 2024

Richard J Chen, Tong Ding, Ming Y Lu, Drew FK Williamson, Guillaume Jaume, Andrew H Song, Bowen Chen, Andrew Zhang, Daniel Shao, Muhammad Shaban, et al. Towards a general- purpose foundation model for computational pathology.Nature medicine, 30(3):850–862, 2024

2024
[8]

Slidechat: A large vision-language assistant for whole-slide pathology image understanding

Ying Chen, Guoan Wang, Yuanfeng Ji, Yanjun Li, Jin Ye, Tianbin Li, Ming Hu, Rongshan Yu, Yu Qiao, and Junjun He. Slidechat: A large vision-language assistant for whole-slide pathology image understanding. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 5134–5143, 2025

2025
[9]

A multimodal whole-slide foundation model for pathology.Nature medicine, pages 1–13, 2025

Tong Ding, Sophia J Wagner, Andrew H Song, Richard J Chen, Ming Y Lu, Andrew Zhang, Anurag J Vaidya, Guillaume Jaume, Muhammad Shaban, Ahrong Kim, et al. A multimodal whole-slide foundation model for pathology.Nature medicine, pages 1–13, 2025

2025
[10]

The cptac data portal: a resource for cancer proteomics research.Journal of proteome research, 14(6):2707–2713, 2015

Nathan J Edwards, Mauricio Oberti, Ratna R Thangudu, Shuang Cai, Peter B McGarvey, Shine Jacob, Subha Madhavan, and Karen A Ketchum. The cptac data portal: a resource for cancer proteomics research.Journal of proteome research, 14(6):2707–2713, 2015

2015
[11]

Pathfinder: A multi-modal multi-agent system for medical diagnostic decision-making applied to histopathol- ogy

Fatemeh Ghezloo, Mehmet Saygin Seyfioglu, Rustin Soraki, Wisdom O Ikezogwo, Beibin Li, Tejoram Vivekanandan, Joann G Elmore, Ranjay Krishna, and Linda Shapiro. Pathfinder: A multi-modal multi-agent system for medical diagnostic decision-making applied to histopathol- ogy. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 234...

2025
[12]

Attention-based deep multiple instance learning

Maximilian Ilse, Jakub Tomczak, and Max Welling. Attention-based deep multiple instance learning. InInternational conference on machine learning, pages 2127–2136. PMLR, 2018

2018
[13]

A co-evolving agentic ai system for medical imaging analysis.arXiv preprint arXiv:2509.20279, 2025

Songhao Li, Jonathan Xu, Tiancheng Bao, Yuxuan Liu, Yuchen Liu, Yihang Liu, Lilin Wang, Wenhui Lei, Sheng Wang, Yinuo Xu, et al. A co-evolving agentic ai system for medical imaging analysis.arXiv preprint arXiv:2509.20279, 2025. 10

work page arXiv 2025
[14]

Wsi-llava: A multimodal large language model for whole slide image

Yuci Liang, Xinheng Lyu, Wenting Chen, Meidan Ding, Jipeng Zhang, Xiangjian He, Song Wu, Xiaohan Xing, Sen Yang, Xiyue Wang, et al. Wsi-llava: A multimodal large language model for whole slide image. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 22718–22727, 2025

2025
[15]

Data-efficient and weakly supervised computational pathology on whole-slide images.Nature biomedical engineering, 5(6):555–570, 2021

Ming Y Lu, Drew FK Williamson, Tiffany Y Chen, Richard J Chen, Matteo Barbieri, and Faisal Mahmood. Data-efficient and weakly supervised computational pathology on whole-slide images.Nature biomedical engineering, 5(6):555–570, 2021

2021
[16]

A visual-language foundation model for computational pathology.Nature medicine, 30(3):863–874, 2024

Ming Y Lu, Bowen Chen, Drew FK Williamson, Richard J Chen, Ivy Liang, Tong Ding, Guil- laume Jaume, Igor Odintsov, Long Phi Le, Georg Gerber, et al. A visual-language foundation model for computational pathology.Nature medicine, 30(3):863–874, 2024

2024
[17]

A multimodal generative ai copilot for human pathology.Nature, 634(8033):466–473, 2024

Ming Y Lu, Bowen Chen, Drew FK Williamson, Richard J Chen, Melissa Zhao, Aaron K Chow, Kenji Ikemura, Ahrong Kim, Dimitra Pouli, Ankush Patel, et al. A multimodal generative ai copilot for human pathology.Nature, 634(8033):466–473, 2024

2024
[18]

A new era of intelligence with gemini 3.Mountain View, CA: Google, 2025

Sundar Pichai, Demis Hassabis, and Koray Kavukcuoglu. A new era of intelligence with gemini 3.Mountain View, CA: Google, 2025

2025
[19]

Qwen3.5: Towards native multimodal agents, February 2026

Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026. URL https: //qwen.ai/blog?id=qwen3.5

2026
[20]

MedGemma 1.5 Technical Report

Andrew Sellergren, Chufan Gao, Fereshteh Mahvar, Timo Kohlberger, Fayaz Jamil, Madeleine Traverse, Alberto Tono, Bashir Sadjad, Lin Yang, Charles Lau, et al. Medgemma 1.5 technical report.arXiv preprint arXiv:2604.05081, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[21]

Quilt-llava: Visual instruction tuning by extracting localized narratives from open- source histopathology videos

Mehmet Saygin Seyfioglu, Wisdom O Ikezogwo, Fatemeh Ghezloo, Ranjay Krishna, and Linda Shapiro. Quilt-llava: Visual instruction tuning by extracting localized narratives from open- source histopathology videos. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13183–13192, 2024

2024
[22]

Prism: A multi-modal generative foundation model for slide-level histopathology.arXiv preprint arXiv:2405.10254, 2024

George Shaikovski, Adam Casson, Kristen Severson, Eric Zimmermann, Yi Kan Wang, Jeremy D Kunz, Juan A Retamero, Gerard Oakley, David Klimstra, Christopher Kanan, et al. Prism: A multi-modal generative foundation model for slide-level histopathology.arXiv preprint arXiv:2405.10254, 2024

work page arXiv 2024
[23]

Transmil: Transformer based correlated multiple instance learning for whole slide image classification

Zhuchen Shao, Hao Bian, Yang Chen, Yifeng Wang, Jian Zhang, Xiangyang Ji, et al. Transmil: Transformer based correlated multiple instance learning for whole slide image classification. Advances in neural information processing systems, 34:2136–2147, 2021

2021
[24]

OpenAI GPT-5 System Card

Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card.arXiv preprint arXiv:2601.03267, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[25]

Cpath-omni: A unified multimodal foundation model for patch and whole slide image analysis in computational pathology

Yuxuan Sun, Yixuan Si, Chenglu Zhu, Xuan Gong, Kai Zhang, Pingyi Chen, Ye Zhang, Zhongyi Shui, Tao Lin, and Lin Yang. Cpath-omni: A unified multimodal foundation model for patch and whole slide image analysis in computational pathology. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10360–10371, 2025

2025
[26]

Yuxuan Sun, Yixuan Si, Chenglu Zhu, Kai Zhang, Zhongyi Shui, Bowen Ding, Tao Lin, and Lin Yang. Cpathagent: An agent-based foundation model for interpretable high-resolution pathology image analysis mimicking pathologists’ diagnostic logic.Advances in Neural Information Processing Systems, 38:101673–101731, 2025

2025
[27]

Pathgen-1.6 m: 1.6 million pathology image-text pairs generation through multi-agent collaboration

Yuxuan Sun, Yunlong Zhang, Yixuan Si, Chenglu Zhu, Kai Zhang, Zhongyi Shui, Jingxiong Li, Xuan Gong, Xinheng Lyu, Tao Lin, et al. Pathgen-1.6 m: 1.6 million pathology image-text pairs generation through multi-agent collaboration. InInternational Conference on Learning Representations, volume 2025, pages 94611–94653, 2025

2025
[28]

A foundation model for clinical-grade computational pathology and rare cancers detection.Nature medicine, 30(10):2924–2935, 2024

Eugene V orontsov, Alican Bozkurt, Adam Casson, George Shaikovski, Michal Zelechowski, Kristen Severson, Eric Zimmermann, James Hall, Neil Tenenholtz, Nicolo Fusi, et al. A foundation model for clinical-grade computational pathology and rare cancers detection.Nature medicine, 30(10):2924–2935, 2024. 11

2024
[29]

Pathology-cot: Learning visual chain-of-thought agent from expert whole slide image diagnosis behavior.arXiv preprint arXiv:2510.04587, 2025

Sheng Wang, Ruiming Wu, Charles Herndon, Yihang Liu, Shunsuke Koga, Jeanne Shen, and Zhi Huang. Pathology-cot: Learning visual chain-of-thought agent from expert whole slide image diagnosis behavior.arXiv preprint arXiv:2510.04587, 2025

work page arXiv 2025
[30]

A pathology foundation model for cancer diagnosis and prognosis prediction.Nature, 634(8035):970–978, 2024

Xiyue Wang, Junhan Zhao, Eliana Marostica, Wei Yuan, Jietian Jin, Jiayu Zhang, Ruijiang Li, Hongping Tang, Kanran Wang, Yu Li, et al. A pathology foundation model for cancer diagnosis and prognosis prediction.Nature, 634(8035):970–978, 2024

2024
[31]

Self-Consistency Improves Chain of Thought Reasoning in Language Models

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models.arXiv preprint arXiv:2203.11171, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[32]

The cancer genome atlas pan-cancer analysis project.Nature genetics, 45(10):1113–1120, 2013

John N Weinstein, Eric A Collisson, Gordon B Mills, Kenna R Shaw, Brad A Ozenberger, Kyle Ellrott, Ilya Shmulevich, Chris Sander, and Joshua M Stuart. The cancer genome atlas pan-cancer analysis project.Nature genetics, 45(10):1113–1120, 2013

2013
[33]

Evidence- based diagnostic reasoning with multi-agent copilot for human pathology.arXiv preprint arXiv:2506.20964, 2025

Luca L Weishaupt, Chengkuan Chen, Drew FK Williamson, Richard J Chen, Guillaume Jaume, Tong Ding, Bowen Chen, Anurag Vaidya, Long Phi Le, Ming Y Lu, et al. Evidence- based diagnostic reasoning with multi-agent copilot for human pathology.arXiv preprint arXiv:2506.20964, 2025

work page arXiv 2025
[34]

A whole-slide foundation model for digital pathology from real-world data.Nature, 630(8015):181–188, 2024

Hanwen Xu, Naoto Usuyama, Jaspreet Bagga, Sheng Zhang, Rajesh Rao, Tristan Naumann, Cliff Wong, Zelalem Gero, Javier González, Yu Gu, et al. A whole-slide foundation model for digital pathology from real-world data.Nature, 630(8015):181–188, 2024

2024
[35]

Ac- celerating data processing and benchmarking of ai models for pathology.arXiv preprint arXiv:2502.06750, 2025

Andrew Zhang, Guillaume Jaume, Anurag Vaidya, Tong Ding, and Faisal Mahmood. Ac- celerating data processing and benchmarking of ai models for pathology.arXiv preprint arXiv:2502.06750, 2025

work page arXiv 2025
[36]

isup_grade

Wenchuan Zhang, Penghao Zhang, Jingru Guo, Tao Cheng, Jie Chen, Shuwan Zhang, Zhang Zhang, Yuhao Yi, and Hong Bu. Patho-r1: A multimodal reinforcement learning-based pathol- ogy expert reasoner. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 28418–28426, 2026. 12 Appendix A Implementation and evaluation details A.1 Patch...

2026

[1] [1]

Mllm-hwsi: A multimodal large language model for hierarchical whole slide image understanding.arXiv preprint arXiv:2603.23067, 2026

Basit Alawode, Arif Mahmood, Muaz Khalifa Al-Radi, Shahad Albastaki, Asim Khan, Muham- mad Bilal, Moshira Ali Abdalla, Mohammed Bennamoun, and Sajid Javed. Mllm-hwsi: A multimodal large language model for hierarchical whole slide image understanding.arXiv preprint arXiv:2603.23067, 2026

work page arXiv 2026

[2] [2]

Navigating Gigapixel Pathology Images with Large Multimodal Models

Thomas A Buckley, Kian R Weihrauch, Katherine Latham, Andrew Z Zhou, Padmini A Manrai, and Arjun K Manrai. Navigating gigapixel pathology images with large multimodal models. arXiv preprint arXiv:2511.19652, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[3] [3]

Artificial intelligence for diagnosis and gleason grading of prostate cancer: the panda challenge.Nature medicine, 28(1):154–163, 2022

Wouter Bulten, Kimmo Kartasalo, Po-Hsuan Cameron Chen, Peter Ström, Hans Pinckaers, Kunal Nagpal, Yuannan Cai, David F Steiner, Hester Van Boven, Robert Vink, et al. Artificial intelligence for diagnosis and gleason grading of prostate cancer: the panda challenge.Nature medicine, 28(1):154–163, 2022

2022

[4] [4]

A novel approach to high-quality postmortem tissue procurement: the gtex project.Biopreservation and biobanking, 13(5):311–319, 2015

Latarsha J Carithers, Kristin Ardlie, Mary Barcus, Philip A Branton, Angela Britton, Stephen A Buia, Carolyn C Compton, David S DeLuca, Joanne Peter-Demchok, Ellen T Gelfand, et al. A novel approach to high-quality postmortem tissue procurement: the gtex project.Biopreservation and biobanking, 13(5):311–319, 2015

2015

[5] [5]

Pathagent: Toward interpretable analysis of whole- slide pathology images via large language model-based agentic reasoning.arXiv preprint arXiv:2511.17052, 2025

Jingyun Chen, Linghan Cai, Zhikang Wang, Yi Huang, Songhan Jiang, Shenjin Huang, Hongpeng Wang, and Yongbing Zhang. Pathagent: Toward interpretable analysis of whole- slide pathology images via large language model-based agentic reasoning.arXiv preprint arXiv:2511.17052, 2025

work page arXiv 2025

[6] [6]

Scaling vision transformers to gigapixel images via hierarchical self-supervised learning

Richard J Chen, Chengkuan Chen, Yicong Li, Tiffany Y Chen, Andrew D Trister, Rahul G Krishnan, and Faisal Mahmood. Scaling vision transformers to gigapixel images via hierarchical self-supervised learning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16144–16155, 2022

2022

[7] [7]

Towards a general- purpose foundation model for computational pathology.Nature medicine, 30(3):850–862, 2024

Richard J Chen, Tong Ding, Ming Y Lu, Drew FK Williamson, Guillaume Jaume, Andrew H Song, Bowen Chen, Andrew Zhang, Daniel Shao, Muhammad Shaban, et al. Towards a general- purpose foundation model for computational pathology.Nature medicine, 30(3):850–862, 2024

2024

[8] [8]

Slidechat: A large vision-language assistant for whole-slide pathology image understanding

Ying Chen, Guoan Wang, Yuanfeng Ji, Yanjun Li, Jin Ye, Tianbin Li, Ming Hu, Rongshan Yu, Yu Qiao, and Junjun He. Slidechat: A large vision-language assistant for whole-slide pathology image understanding. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 5134–5143, 2025

2025

[9] [9]

A multimodal whole-slide foundation model for pathology.Nature medicine, pages 1–13, 2025

Tong Ding, Sophia J Wagner, Andrew H Song, Richard J Chen, Ming Y Lu, Andrew Zhang, Anurag J Vaidya, Guillaume Jaume, Muhammad Shaban, Ahrong Kim, et al. A multimodal whole-slide foundation model for pathology.Nature medicine, pages 1–13, 2025

2025

[10] [10]

The cptac data portal: a resource for cancer proteomics research.Journal of proteome research, 14(6):2707–2713, 2015

Nathan J Edwards, Mauricio Oberti, Ratna R Thangudu, Shuang Cai, Peter B McGarvey, Shine Jacob, Subha Madhavan, and Karen A Ketchum. The cptac data portal: a resource for cancer proteomics research.Journal of proteome research, 14(6):2707–2713, 2015

2015

[11] [11]

Pathfinder: A multi-modal multi-agent system for medical diagnostic decision-making applied to histopathol- ogy

Fatemeh Ghezloo, Mehmet Saygin Seyfioglu, Rustin Soraki, Wisdom O Ikezogwo, Beibin Li, Tejoram Vivekanandan, Joann G Elmore, Ranjay Krishna, and Linda Shapiro. Pathfinder: A multi-modal multi-agent system for medical diagnostic decision-making applied to histopathol- ogy. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 234...

2025

[12] [12]

Attention-based deep multiple instance learning

Maximilian Ilse, Jakub Tomczak, and Max Welling. Attention-based deep multiple instance learning. InInternational conference on machine learning, pages 2127–2136. PMLR, 2018

2018

[13] [13]

A co-evolving agentic ai system for medical imaging analysis.arXiv preprint arXiv:2509.20279, 2025

Songhao Li, Jonathan Xu, Tiancheng Bao, Yuxuan Liu, Yuchen Liu, Yihang Liu, Lilin Wang, Wenhui Lei, Sheng Wang, Yinuo Xu, et al. A co-evolving agentic ai system for medical imaging analysis.arXiv preprint arXiv:2509.20279, 2025. 10

work page arXiv 2025

[14] [14]

Wsi-llava: A multimodal large language model for whole slide image

Yuci Liang, Xinheng Lyu, Wenting Chen, Meidan Ding, Jipeng Zhang, Xiangjian He, Song Wu, Xiaohan Xing, Sen Yang, Xiyue Wang, et al. Wsi-llava: A multimodal large language model for whole slide image. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 22718–22727, 2025

2025

[15] [15]

Data-efficient and weakly supervised computational pathology on whole-slide images.Nature biomedical engineering, 5(6):555–570, 2021

Ming Y Lu, Drew FK Williamson, Tiffany Y Chen, Richard J Chen, Matteo Barbieri, and Faisal Mahmood. Data-efficient and weakly supervised computational pathology on whole-slide images.Nature biomedical engineering, 5(6):555–570, 2021

2021

[16] [16]

A visual-language foundation model for computational pathology.Nature medicine, 30(3):863–874, 2024

Ming Y Lu, Bowen Chen, Drew FK Williamson, Richard J Chen, Ivy Liang, Tong Ding, Guil- laume Jaume, Igor Odintsov, Long Phi Le, Georg Gerber, et al. A visual-language foundation model for computational pathology.Nature medicine, 30(3):863–874, 2024

2024

[17] [17]

A multimodal generative ai copilot for human pathology.Nature, 634(8033):466–473, 2024

Ming Y Lu, Bowen Chen, Drew FK Williamson, Richard J Chen, Melissa Zhao, Aaron K Chow, Kenji Ikemura, Ahrong Kim, Dimitra Pouli, Ankush Patel, et al. A multimodal generative ai copilot for human pathology.Nature, 634(8033):466–473, 2024

2024

[18] [18]

A new era of intelligence with gemini 3.Mountain View, CA: Google, 2025

Sundar Pichai, Demis Hassabis, and Koray Kavukcuoglu. A new era of intelligence with gemini 3.Mountain View, CA: Google, 2025

2025

[19] [19]

Qwen3.5: Towards native multimodal agents, February 2026

Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026. URL https: //qwen.ai/blog?id=qwen3.5

2026

[20] [20]

MedGemma 1.5 Technical Report

Andrew Sellergren, Chufan Gao, Fereshteh Mahvar, Timo Kohlberger, Fayaz Jamil, Madeleine Traverse, Alberto Tono, Bashir Sadjad, Lin Yang, Charles Lau, et al. Medgemma 1.5 technical report.arXiv preprint arXiv:2604.05081, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[21] [21]

Quilt-llava: Visual instruction tuning by extracting localized narratives from open- source histopathology videos

Mehmet Saygin Seyfioglu, Wisdom O Ikezogwo, Fatemeh Ghezloo, Ranjay Krishna, and Linda Shapiro. Quilt-llava: Visual instruction tuning by extracting localized narratives from open- source histopathology videos. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13183–13192, 2024

2024

[22] [22]

Prism: A multi-modal generative foundation model for slide-level histopathology.arXiv preprint arXiv:2405.10254, 2024

George Shaikovski, Adam Casson, Kristen Severson, Eric Zimmermann, Yi Kan Wang, Jeremy D Kunz, Juan A Retamero, Gerard Oakley, David Klimstra, Christopher Kanan, et al. Prism: A multi-modal generative foundation model for slide-level histopathology.arXiv preprint arXiv:2405.10254, 2024

work page arXiv 2024

[23] [23]

Transmil: Transformer based correlated multiple instance learning for whole slide image classification

Zhuchen Shao, Hao Bian, Yang Chen, Yifeng Wang, Jian Zhang, Xiangyang Ji, et al. Transmil: Transformer based correlated multiple instance learning for whole slide image classification. Advances in neural information processing systems, 34:2136–2147, 2021

2021

[24] [24]

OpenAI GPT-5 System Card

Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card.arXiv preprint arXiv:2601.03267, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[25] [25]

Cpath-omni: A unified multimodal foundation model for patch and whole slide image analysis in computational pathology

Yuxuan Sun, Yixuan Si, Chenglu Zhu, Xuan Gong, Kai Zhang, Pingyi Chen, Ye Zhang, Zhongyi Shui, Tao Lin, and Lin Yang. Cpath-omni: A unified multimodal foundation model for patch and whole slide image analysis in computational pathology. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10360–10371, 2025

2025

[26] [26]

Yuxuan Sun, Yixuan Si, Chenglu Zhu, Kai Zhang, Zhongyi Shui, Bowen Ding, Tao Lin, and Lin Yang. Cpathagent: An agent-based foundation model for interpretable high-resolution pathology image analysis mimicking pathologists’ diagnostic logic.Advances in Neural Information Processing Systems, 38:101673–101731, 2025

2025

[27] [27]

Pathgen-1.6 m: 1.6 million pathology image-text pairs generation through multi-agent collaboration

Yuxuan Sun, Yunlong Zhang, Yixuan Si, Chenglu Zhu, Kai Zhang, Zhongyi Shui, Jingxiong Li, Xuan Gong, Xinheng Lyu, Tao Lin, et al. Pathgen-1.6 m: 1.6 million pathology image-text pairs generation through multi-agent collaboration. InInternational Conference on Learning Representations, volume 2025, pages 94611–94653, 2025

2025

[28] [28]

A foundation model for clinical-grade computational pathology and rare cancers detection.Nature medicine, 30(10):2924–2935, 2024

Eugene V orontsov, Alican Bozkurt, Adam Casson, George Shaikovski, Michal Zelechowski, Kristen Severson, Eric Zimmermann, James Hall, Neil Tenenholtz, Nicolo Fusi, et al. A foundation model for clinical-grade computational pathology and rare cancers detection.Nature medicine, 30(10):2924–2935, 2024. 11

2024

[29] [29]

Pathology-cot: Learning visual chain-of-thought agent from expert whole slide image diagnosis behavior.arXiv preprint arXiv:2510.04587, 2025

Sheng Wang, Ruiming Wu, Charles Herndon, Yihang Liu, Shunsuke Koga, Jeanne Shen, and Zhi Huang. Pathology-cot: Learning visual chain-of-thought agent from expert whole slide image diagnosis behavior.arXiv preprint arXiv:2510.04587, 2025

work page arXiv 2025

[30] [30]

A pathology foundation model for cancer diagnosis and prognosis prediction.Nature, 634(8035):970–978, 2024

Xiyue Wang, Junhan Zhao, Eliana Marostica, Wei Yuan, Jietian Jin, Jiayu Zhang, Ruijiang Li, Hongping Tang, Kanran Wang, Yu Li, et al. A pathology foundation model for cancer diagnosis and prognosis prediction.Nature, 634(8035):970–978, 2024

2024

[31] [31]

Self-Consistency Improves Chain of Thought Reasoning in Language Models

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models.arXiv preprint arXiv:2203.11171, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[32] [32]

The cancer genome atlas pan-cancer analysis project.Nature genetics, 45(10):1113–1120, 2013

John N Weinstein, Eric A Collisson, Gordon B Mills, Kenna R Shaw, Brad A Ozenberger, Kyle Ellrott, Ilya Shmulevich, Chris Sander, and Joshua M Stuart. The cancer genome atlas pan-cancer analysis project.Nature genetics, 45(10):1113–1120, 2013

2013

[33] [33]

Evidence- based diagnostic reasoning with multi-agent copilot for human pathology.arXiv preprint arXiv:2506.20964, 2025

Luca L Weishaupt, Chengkuan Chen, Drew FK Williamson, Richard J Chen, Guillaume Jaume, Tong Ding, Bowen Chen, Anurag Vaidya, Long Phi Le, Ming Y Lu, et al. Evidence- based diagnostic reasoning with multi-agent copilot for human pathology.arXiv preprint arXiv:2506.20964, 2025

work page arXiv 2025

[34] [34]

A whole-slide foundation model for digital pathology from real-world data.Nature, 630(8015):181–188, 2024

Hanwen Xu, Naoto Usuyama, Jaspreet Bagga, Sheng Zhang, Rajesh Rao, Tristan Naumann, Cliff Wong, Zelalem Gero, Javier González, Yu Gu, et al. A whole-slide foundation model for digital pathology from real-world data.Nature, 630(8015):181–188, 2024

2024

[35] [35]

Ac- celerating data processing and benchmarking of ai models for pathology.arXiv preprint arXiv:2502.06750, 2025

Andrew Zhang, Guillaume Jaume, Anurag Vaidya, Tong Ding, and Faisal Mahmood. Ac- celerating data processing and benchmarking of ai models for pathology.arXiv preprint arXiv:2502.06750, 2025

work page arXiv 2025

[36] [36]

isup_grade

Wenchuan Zhang, Penghao Zhang, Jingru Guo, Tao Cheng, Jie Chen, Shuwan Zhang, Zhang Zhang, Yuhao Yi, and Hong Bu. Patho-r1: A multimodal reinforcement learning-based pathol- ogy expert reasoner. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 28418–28426, 2026. 12 Appendix A Implementation and evaluation details A.1 Patch...

2026