Recognition: no theorem link
Enabling clinical use of foundation models for computational pathology
Pith reviewed 2026-05-15 19:13 UTC · model grok-4.3
The pith
Adding novel robustness losses while training downstream models on foundation model features reduces sensitivity to scanner and staining artifacts in computational pathology.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Introducing novel robustness losses during downstream model training reduces sensitivity to technical variability captured by foundation models for computational pathology. In a large-scale setup using 27,042 whole-slide images from 6,155 patients, the losses improve model robustness and classification accuracy by directing attention toward biologically relevant features, allowing clinically suitable models to be built without retraining the foundation models themselves.
What carries the argument
Novel robustness losses inserted into the downstream training objective that penalize dependence on technical variation while the model learns from frozen foundation-model features.
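The paper does not disclose the exact form of these losses (a point raised in the referee report below), so the following is only a minimal sketch of one plausible instantiation, assuming a DANN-style adversarial penalty: a scanner-classification head trained through a gradient-reversal layer on top of the frozen features. All names, the architecture, and the choice of adversarial loss are our assumptions, not the authors' published method.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; negates and scales gradients on backward."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

class RobustDownstreamHead(nn.Module):
    """Task classifier plus an adversarial scanner head on frozen FM features."""
    def __init__(self, feat_dim: int, n_classes: int, n_scanners: int):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU())
        self.task_head = nn.Linear(256, n_classes)
        self.scanner_head = nn.Linear(256, n_scanners)  # the adversary

    def forward(self, feats: torch.Tensor, lam: float = 1.0):
        # feats: precomputed embeddings from a frozen foundation model
        h = self.trunk(feats)
        return self.task_head(h), self.scanner_head(GradReverse.apply(h, lam))

def training_step(model, optimizer, feats, y_task, y_scanner, robust_weight=0.5):
    """One step: task cross-entropy plus the weighted robustness term.
    robust_weight stands in for the free parameter listed in the ledger below
    (the robustness-loss weight)."""
    ce = nn.CrossEntropyLoss()
    task_logits, scanner_logits = model(feats, lam=robust_weight)
    loss = ce(task_logits, y_task) + robust_weight * ce(scanner_logits, y_scanner)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The gradient-reversal trick pushes the trunk features to be uninformative about the scanner while the task head keeps them predictive of the label, which is one standard way to penalize dependence on technical variation.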
If this is right
- Downstream classifiers become less biased by differences in scanning equipment or staining batches across sites.
- Classification accuracy rises because training is steered toward disease-related signals rather than artifacts.
- New clinical models can be developed from existing foundation models without the compute cost of retraining them from scratch.
- Models trained this way are more likely to maintain performance when moved between hospitals with different equipment.
- The same training-time approach can be applied to any frozen foundation model to improve its suitability for real-world use.
Where Pith is reading between the lines
- Similar robustness losses could be tested on foundation models in radiology or other imaging fields that also face equipment-driven variation.
- Hospitals could apply this method locally to adapt a shared foundation model to their own scanner fleet without sharing raw patient data.
- Direct measurement of feature embeddings before and after the losses would strengthen evidence that technical directions are actually suppressed (a sketch follows this list).
- Combining these losses with light fine-tuning of the foundation model might produce further gains in both robustness and accuracy.
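On the embedding-measurement point, here is a minimal sketch of one such direct check, assuming slide-level embeddings and scanner labels are available as arrays; it estimates the mutual information between embedding dimensions and the scanner label with scikit-learn. The variable names are illustrative.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def technical_mi(embeddings: np.ndarray, scanner_ids: np.ndarray) -> float:
    """Mean mutual information (nats) between embedding dimensions and the
    scanner label. A drop in this score after training with the robustness
    losses would be direct evidence that technical directions are suppressed."""
    mi = mutual_info_classif(embeddings, scanner_ids, discrete_features=False)
    return float(mi.mean())

# Hypothetical usage: compare embeddings of the same slides produced by
# downstream models trained without and with the robustness losses.
# print(technical_mi(emb_baseline, scanners), technical_mi(emb_robust, scanners))
```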
Load-bearing premise
The robustness losses selectively suppress technical variation while leaving biologically relevant features intact, with this selectivity shown only indirectly through accuracy improvements.
What would settle it
A controlled test in which biological content is held identical while technical factors (scanner or stain) are varied: if models trained with the robustness losses still change their predictions with the technical factor, the claim of selective suppression is falsified; stable predictions would support it.
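A minimal sketch of the evaluation metric for such a test, assuming paired slide-level predictions for the same tissue digitized on two scanners; prediction_flip_rate is a hypothetical helper, not from the paper.

```python
import numpy as np

def prediction_flip_rate(p_scanner_a: np.ndarray, p_scanner_b: np.ndarray,
                         threshold: float = 0.5) -> float:
    """Fraction of paired cases whose predicted label changes with the scanner.

    Inputs are predicted positive-class probabilities for the *same* slides
    digitized on two scanners, so biology is fixed and only technique varies.
    A flip rate well above zero for a model trained with the robustness losses
    would falsify selective suppression of technical variation."""
    flips = (p_scanner_a >= threshold) != (p_scanner_b >= threshold)
    return float(flips.mean())
```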
Original abstract
Foundation models for computational pathology are expected to facilitate the development of high-performing, generalisable deep learning systems. However, in addition to biologically relevant features, current foundation models also capture pre-analytic and scanner-specific variation that bias the predictions made by downstream task-specific models trained on these features. Here we show that introducing novel robustness losses during downstream model training reduces sensitivity to technical variability. A purpose-designed comprehensive experimentation setup with 27,042 whole-slide images from 6,155 patients is used to train thousands of models from the features of eight well-known foundation models for computational pathology. In addition to a substantial improvement in robustness, our approach improves classification accuracy by focusing on biologically relevant features. It mitigates robustness limitations of foundation models for computational pathology without retraining the foundation models themselves, enabling development of models that are more suitable in real-world clinical use.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that novel robustness losses introduced during downstream training on features from foundation models for computational pathology reduce sensitivity to technical variability (scanner and pre-analytic factors) while improving classification accuracy by focusing on biologically relevant features. This is shown via a large-scale setup training thousands of models on 27,042 WSIs from 6,155 patients across eight foundation models, without retraining the foundation models, enabling more clinically suitable systems.
Significance. If the central claim holds, the work is significant for computational pathology because it offers a practical mitigation strategy for a known robustness limitation of foundation models without the computational cost of retraining them. The comprehensive experimentation with patient-stratified splits and thousands of models supplies substantial empirical support, strengthening the case for real-world clinical translation.
Major comments (2)
- Experimental Results section: The claim that the robustness losses selectively suppress technical covariates while preserving biological signals rests only on observed accuracy gains and reduced sensitivity on technical-shift test sets. No direct feature-level validation (e.g., mutual information between embeddings and technical metadata, feature attribution maps, or ablations isolating technical vs. biological dimensions) is reported, so equivalent gains could arise from generic regularization rather than the targeted mechanism.
- Methods section: The exact mathematical form of the novel robustness losses is not specified with equations, nor are the values or selection procedure for the free parameters (robustness loss weights) detailed. This information is load-bearing for reproducing the reported selectivity and verifying that the losses do not inadvertently suppress biological signal.
Minor comments (2)
- Abstract: The number of foundation models (eight) and the specific downstream tasks should be stated explicitly to provide immediate context for the scale of the experiments.
- Figure captions and tables: Ensure all technical-shift test sets and patient-stratified split details are clearly labeled so readers can assess the robustness evaluation protocol without cross-referencing the main text.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the recommendation for major revision. We address each major comment below and will revise the manuscript accordingly to improve clarity and strengthen the evidence for our claims.
Point-by-point responses
Referee: Experimental Results section: The claim that the robustness losses selectively suppress technical covariates while preserving biological signals rests only on observed accuracy gains and reduced sensitivity on technical-shift test sets. No direct feature-level validation (e.g., mutual information between embeddings and technical metadata, feature attribution maps, or ablations isolating technical vs. biological dimensions) is reported, so equivalent gains could arise from generic regularization rather than the targeted mechanism.
Authors: We agree that direct feature-level analyses would provide stronger mechanistic evidence and help rule out generic regularization effects. Our current evidence relies on the large-scale experimental design with patient-stratified splits, technical-shift test sets, and consistent accuracy gains across eight foundation models. In the revised manuscript, we will add mutual information computations between the learned embeddings and technical metadata (scanner type, pre-analytic factors) as well as ablation studies comparing the robustness losses to standard regularization baselines. Revision: yes.
Referee: Methods section: The exact mathematical form of the novel robustness losses is not specified with equations, nor are the values or selection procedure for the free parameters (robustness loss weights) detailed. This information is load-bearing for reproducing the reported selectivity and verifying that the losses do not inadvertently suppress biological signal.
Authors: We acknowledge that this omission limits reproducibility. The revised Methods section will include the full mathematical formulation of the robustness losses as equations, the specific hyperparameter values used for the loss weights, and the selection procedure (grid search with cross-validation on a patient-stratified validation subset). These additions will allow readers to verify that biological signal is preserved while technical variation is suppressed. Revision: yes.
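To illustrate the selection procedure named above, a minimal sketch of a patient-stratified grid search using scikit-learn's GroupKFold, which keeps all slides of a patient in one fold; train_eval is a hypothetical callable standing in for downstream training and validation scoring.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

def select_loss_weight(feats, labels, patient_ids, train_eval,
                       weights=(0.1, 0.5, 1.0), n_splits=5):
    """Choose the robustness-loss weight by cross-validated grid search.

    Grouping folds by patient_ids prevents slides from one patient appearing
    in both training and validation, so scores are not inflated by leakage.
    train_eval(train_idx, val_idx, weight) -> validation score (hypothetical)."""
    gkf = GroupKFold(n_splits=n_splits)
    scores = {}
    for w in weights:
        fold_scores = [train_eval(tr, va, w)
                       for tr, va in gkf.split(feats, labels, groups=patient_ids)]
        scores[w] = float(np.mean(fold_scores))
    best = max(scores, key=scores.get)
    return best, scores
```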
Circularity Check
No circularity: empirical robustness gains measured directly on stratified splits
Full rationale
The paper introduces robustness losses and validates them via large-scale training and evaluation on 27,042 WSIs from 6,155 patients using patient-stratified data splits across eight foundation models. All reported improvements in accuracy and reduced technical sensitivity are measured outcomes on held-out test sets rather than quantities derived from fitted parameters or self-citations. No equations, ansatzes, or uniqueness theorems appear in the derivation chain; the central claim is supported by direct empirical comparison without reduction to inputs by construction.
Axiom & Free-Parameter Ledger
Free parameters (1)
- Robustness loss weights
Axioms (1)
- Domain assumption: technical variations in whole-slide images are separable from biologically relevant features through auxiliary loss functions.
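One hedged way to state this assumption formally (our notation, not the paper's): the frozen embedding approximately factors into biological and technical components, and a weighted auxiliary loss can drive the downstream representation's dependence on the technical covariate toward zero while preserving its dependence on the label.

```latex
z = f(x) \approx (z_{\mathrm{bio}},\, z_{\mathrm{tech}}), \qquad
\min_{\theta}\ \mathcal{L}_{\mathrm{task}}(\theta)
  + \lambda\, \mathcal{L}_{\mathrm{robust}}(\theta)
\quad \text{such that} \quad
I\big(h_{\theta}(z);\, t\big) \to 0
\ \text{while}\
I\big(h_{\theta}(z);\, y\big)\ \text{is preserved}
```

Here x is a tile image, t the technical covariate (scanner, stain batch), y the label, h_θ the downstream model, λ the robustness-loss weight, and I denotes mutual information. If no such factorization exists, the losses cannot be selective, which is what makes this the load-bearing axiom.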
Reference graph
Works this paper leans on
- [1] Rishi Bommasani et al. 'On the opportunities and risks of foundation models'. Preprint (2021). doi:10.48550/arXiv.2108.07258
- [2] Richard J Chen et al. 'Towards a general-purpose foundation model for computational pathology'. Nature Medicine 30.3 (2024), pp. 850–862
- [3] Hanwen Xu et al. 'A whole-slide foundation model for digital pathology from real-world data'. Nature 630.8015 (2024), pp. 181–188
- [4] Eric Zimmermann et al. 'Virchow2: Scaling self-supervised mixed magnification models in pathology'. arXiv preprint arXiv:2408.00738 (2024)
- [5] Alexandre Filiot et al. 'Phikon-v2, a large and public feature extractor for biomarker prediction'. arXiv preprint arXiv:2409.09173 (2024)
- [6] Dmitry Nechaev, Alexey Pchelnikov and Ekaterina Ivanova. 'Hibou: A family of foundational vision transformers for pathology'. arXiv preprint arXiv:2406.05074 (2024)
- [7] Andreas Kleppe et al. 'Designing deep learning studies in cancer diagnostics'. Nature Reviews Cancer 21 (2021), pp. 199–211
- [8] Jeroen Awm van der Laak, Geert J. S. Litjens and Francesco Ciompi. 'Deep learning in histopathology: the path to the clinic'. Nature Medicine 27 (2021), pp. 775–784
- [9] Robert Geirhos et al. 'Shortcut learning in deep neural networks'. Nature Machine Intelligence 2.11 (2020), pp. 665–673. doi:10.1038/s42256-020-00257-z
- [10] Akhila Narla et al. 'Automated Classification of Skin Lesions: From Pixels to Practice'. Journal of Investigative Dermatology 138.10 (2018), pp. 2108–2110. doi:10.1016/j.jid.2018.06.175
- [11] Julia K. Winkler et al. 'Association Between Surgical Skin Markings in Dermoscopic Images and Diagnostic Performance of a Deep Learning Convolutional Neural Network for Melanoma Recognition'. JAMA Dermatology 155.10 (2019), pp. 1135–1141. doi:10.1001/jamadermatol.2019.1735
- [12] John R Zech et al. 'Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: A cross-sectional study'. PLoS Medicine 15.11 (2018), e1002683
- [13] Frederick M. Howard et al. 'The impact of site-specific digital histology signatures on deep learning model accuracy and bias'. Nature Communications 12.1 (2021), p. 4423. doi:10.1038/s41467-021-24698-1
- [14] Taher Dehkharghanian et al. 'Biased data, biased AI: deep networks predict the acquisition site of TCGA images'. Diagnostic Pathology 18.1 (2023), p. 67. doi:10.1186/s13000-023-01355-3
- [15] Jan Clusmann et al. 'Incidental Prompt Injections on Vision–Language Models in Real-Life Histopathology'. NEJM AI (2025)
- [16] Farnaz Kheiri et al. 'Investigation on potential bias factors in histopathology datasets'. Scientific Reports 15 (2025), p. 11349. doi:10.1038/s41598-025-89210-x
- [17] David Tellez et al. 'Quantifying the effects of data augmentation and stain color normalization in convolutional neural networks for computational pathology'. Medical Image Analysis 58 (2019), p. 101544. doi:10.1016/j.media.2019.101544
- [18] Siyu Lin et al. 'Impact of stain variation and color normalization for prognostic predictions in pathology'. Scientific Reports 15.1 (2025), p. 2369. doi:10.1038/s41598-024-83267-w
- [19] Karin Stacke et al. 'A Closer Look at Domain Shift for Deep Learning in Histopathology'. CoRR abs/1909.11575 (2019). arXiv:1909.11575
- [20] Marc Aubreville et al. 'Mitosis domain generalization in histopathology images — The MIDOG challenge'. Medical Image Analysis 84 (2023), p. 102699. doi:10.1016/j.media.2022.102699
- [21] Erik Thiringer et al. 'Scanner-Induced Domain Shifts Undermine the Robustness of Pathology Foundation Models' (2026). arXiv:2601.04163
- [22] Edwin D. de Jong, Eric Marcus and Jonas Teuwen. 'Current Pathology Foundation Models are unrobust to Medical Center Differences'. CoRR abs/2501.18055 (2025). doi:10.48550/arXiv.2501.18055
- [23] Vaibhav Mishra and William Lotter. 'Comparing Computational Pathology Foundation Models using Representational Similarity Analysis' (2025). arXiv:2509.15482
- [24] Jonah Kömen et al. 'Do Histopathological Foundation Models Eliminate Batch Effects? A Comparative Study' (2024). arXiv:2411.05489
- [25] Gianluca Carloni et al. 'Pathology Foundation Models are Scanner Sensitive: Benchmark and Mitigation with Contrastive ScanGen Loss' (2025). arXiv:2507.22092
- [26] Maximilian Ilse, Jakub M Tomczak and Max Welling. 'Attention-based Deep Multiple Instance Learning'. arXiv preprint arXiv:1802.04712 (2018)
- [27] Timothy Iveson et al. '3 versus 6 months of adjuvant oxaliplatin-fluoropyrimidine combination therapy for colorectal cancer (SCOT): an international, randomised, phase 3, non-inferiority trial'. The Lancet Oncology 19 (2018), pp. 562–578
- [28] Laurens van der Maaten and Geoffrey Hinton. 'Visualizing data using t-SNE'. Journal of Machine Learning Research 9.11 (2008)
- [30] Karin Stacke et al. 'Measuring Domain Shift for Deep Learning in Histopathology'. IEEE Journal of Biomedical and Health Informatics 25 (2020), pp. 325–336
- [31] Ole-Johan Skrede et al. 'Deep learning for prediction of colorectal cancer outcome: a discovery and validation study'. The Lancet 395.10221 (2020), pp. 350–360
- [32] Adam Goode et al. 'OpenSlide: A vendor-neutral software foundation for digital pathology'. Journal of Pathology Informatics 4 (2013)
- [33] J Bondi et al. 'Expression and gene amplification of primary (A, B1, D1, D3, and E) and secondary (C and H) cyclins in colon adenocarcinomas and correlation with patient outcome'. Journal of Clinical Pathology 58.5 (2005), pp. 509–514
- [34] MA Merok et al. 'Microsatellite instability has a positive prognostic impact on stage II colorectal cancer after complete resection: results from a large, consecutive Norwegian series'. Annals of Oncology 24.5 (2013), pp. 1274–1282
- [35] TS Hveem et al. 'Prognostic impact of genomic instability in colorectal cancer'. British Journal of Cancer 110.8 (2014), pp. 2159–2164
- [36] VC Petersen et al. 'Identification of objective pathological prognostic determinants and models of prognosis in Dukes' B colon cancer'. Gut 51.1 (2002), pp. 65–69
- [37] David J Kerr et al. 'Rofecoxib and cardiovascular adverse events in adjuvant treatment of colorectal cancer'. New England Journal of Medicine 357.4 (2007), pp. 360–369
- [38] Rachel S. Midgley et al. 'Phase III randomized trial assessing rofecoxib in the adjuvant setting of colorectal cancer: final results of the VICTOR trial'. Journal of Clinical Oncology 28.30 (2010), pp. 4575–4580
- [39] Rachel S Kerr et al. 'Adjuvant capecitabine plus bevacizumab versus capecitabine alone in patients with colorectal cancer (QUASAR 2): an open-label, randomised phase 3 trial'. The Lancet Oncology 17.11 (2016), pp. 1543–1557
- [40] Masaaki Miyo et al. 'DENEB: Development of new criteria for curability after local excision of pathological T1 colorectal cancer using liquid biopsy'. Cancer Science 113 (2021), pp. 1531–1534
- [41] Krijn J. C. Haasnoot et al. 'Associations of non-pedunculated T1 colorectal adenocarcinoma outcome with consensus molecular subtypes, immunoscore, and microsatellite status: a multicenter case-cohort study'. Modern Pathology (2020), pp. 1–11
- [42] Yara Backes et al. 'Histologic Factors Associated With Need for Surgery in Patients With Pedunculated T1 Colorectal Carcinomas'. Gastroenterology 154.6 (2018), pp. 1647–1659
- [43] Bioptimus. H-optimus-1 (2025). https://huggingface.co/bioptimus/H-optimus-1
- [44] Charlie Saillard et al. H-optimus-0 (2024). https://github.com/bioptimus/releases/tree/main/models/h-optimus/v0
- [45] Richard J Chen et al. 'Towards a General-Purpose Foundation Model for Computational Pathology'. Nature Medicine (2024)
- [46] Stefan Klein et al. 'Elastix: a toolbox for intensity-based medical image registration'. IEEE Transactions on Medical Imaging 29.1 (2009), pp. 196–205
- [47] Ole-Johan Skrede et al. 'Generalisation of automatic tumour segmentation in histopathological whole-slide images across multiple cancer types'. npj Precision Oncology (2026)
- [48] Andrew L Maas, Awni Y Hannun, Andrew Y Ng et al. 'Rectifier nonlinearities improve neural network acoustic models'. Proc. ICML, vol. 30, no. 1, Atlanta, GA (2013), p. 3
- [49] Sergey Ioffe and Christian Szegedy. 'Batch normalization: Accelerating deep network training by reducing internal covariate shift'. International Conference on Machine Learning, PMLR (2015), pp. 448–456
- [50] Aaron van den Oord, Yazhe Li and Oriol Vinyals. 'Representation Learning with Contrastive Predictive Coding'. arXiv preprint arXiv:1807.03748 (2018)

Recovered figure-caption fragment: Fig. 1 | Method overview. a, Conventional use of multiple-instance learning and foundation models in histopathology for predicting attributes of a slide or a patient.