DALPHIN: Benchmarking Digital Pathology AI Copilots Against Pathologists on an Open Multicentric Dataset
Pith reviewed 2026-05-08 01:21 UTC · model grok-4.3
The pith
PathChat+ shows no statistically significant difference from expert pathologists in four of six tasks on a new open multicentric pathology benchmark.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors establish that foundation models with visual question answering capabilities for pathology can reach performance statistically indistinguishable from that of expert pathologists on a deliberately diverse, multicentric set of cases when evaluated through standardized tasks and scoring. PathChat+ achieves this equivalence in four of the six tasks, Gemini 2.5 Pro in two, and GPT-5 in one, under both sequential and independent answer-generation protocols. The benchmark design includes explicit controls for case rarity, geographic variation, and subspecialty coverage to ground these comparisons.
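As a reading aid, here is a minimal sketch of the two answer-generation protocols as we read them. The Copilot interface, its ask method, and the case structure are invented for illustration; the paper defines the protocols, not this API.

```python
# Hypothetical illustration of the two answer-generation protocols.
# The Copilot interface is invented for this sketch; the paper defines
# the protocols, not any particular API.
from typing import Protocol

class Copilot(Protocol):
    def ask(self, images: list[str], question: str,
            history: list[tuple[str, str]]) -> str: ...

def sequential(copilot: Copilot, images: list[str],
               questions: list[str]) -> list[str]:
    """Each answer is generated with the case's earlier Q&A in context."""
    history: list[tuple[str, str]] = []
    answers: list[str] = []
    for question in questions:
        answer = copilot.ask(images, question, history)
        history.append((question, answer))
        answers.append(answer)
    return answers

def independent(copilot: Copilot, images: list[str],
                questions: list[str]) -> list[str]:
    """Each answer is generated fresh, with no shared conversational state."""
    return [copilot.ask(images, question, []) for question in questions]
```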
What carries the argument
The DALPHIN benchmark: a fixed collection of 1236 images drawn from 300 real cases, used to score AI and human answers on six predefined diagnostic tasks against sequestered, indirectly accessible ground truth. A minimal sketch of the scoring setup this implies follows below.
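The Submission structure and the exact-match rule in this sketch are placeholders; the BERTScore and BioBERT entries in the reference graph below suggest the actual pipeline scores free-text answers semantically rather than by string match.

```python
# Hypothetical sketch of server-side scoring against sequestered ground
# truth. Field names and the match rule are placeholders; the reference
# graph's BERTScore/BioBERT entries suggest the actual pipeline compares
# free-text answers semantically, not by exact string match.
from dataclasses import dataclass

@dataclass
class Submission:
    case_id: str   # one of the 300 cases
    task: str      # one of the six predefined diagnostic tasks
    answer: str    # free-text answer from a copilot or a pathologist

def score(submissions: list[Submission],
          ground_truth: dict[tuple[str, str], str]) -> dict[str, float]:
    """Per-task accuracy; ground_truth never leaves the evaluation server."""
    correct: dict[str, int] = {}
    total: dict[str, int] = {}
    for s in submissions:
        reference = ground_truth[(s.case_id, s.task)]
        total[s.task] = total.get(s.task, 0) + 1
        # Placeholder rule; a semantic-similarity threshold would go here.
        hit = s.answer.strip().lower() == reference.strip().lower()
        correct[s.task] = correct.get(s.task, 0) + int(hit)
    return {task: correct.get(task, 0) / n for task, n in total.items()}
```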
If this is right
- AI copilots could be deployed as assistance tools for the subset of tasks where equivalence holds without expected loss in diagnostic accuracy.
- The public release of DALPHIN with sequestered ground truth creates a stable platform for tracking whether future models surpass or fall behind current expert baselines.
- Gaps remaining in two to five tasks per model identify concrete targets for further training or prompting improvements.
- The multicentric construction implies that observed performance levels are not limited to a single institution or region.
Where Pith is reading between the lines
- If the equivalence holds under broader clinical use, workload relief for pathologists becomes feasible on routine cases while preserving accuracy.
- The benchmark could be extended to test multi-turn diagnostic dialogues or integration with clinical context beyond static images.
- Differences in success rates between the pathology-specific and general-purpose models indicate that domain-specific fine-tuning still confers measurable advantages.
Load-bearing premise
The six tasks, selected images, and answer evaluation rules in DALPHIN adequately represent the full range of complexity, ambiguity, and real-world decisions pathologists face in daily practice across subspecialties and countries.
What would settle it
A replication study on a new collection of cases or an expanded task set in which any of the tested models shows a statistically significant performance deficit relative to the pathologist cohort would falsify the reported equivalence.
Original abstract
Foundation models with visual question answering capabilities for digital pathology are emerging. Such unprecedented technology requires independent benchmarking to assess its potential in assisting pathologists in routine diagnostics. We created DALPHIN, the first multicentric open benchmark for pathology AI copilots, comprising 1236 images from 300 cases, spanning 130 rare to common diagnoses, 6 countries, and 14 subspecialties. The DALPHIN design and dataset are introduced alongside a human performance benchmark of 31 pathologists from 10 countries with varying expertise. We report results for two general-purpose (GPT-5, Gemini 2.5 Pro) and one pathology-specific copilot (PathChat+) for sequential and independent answer generation. We observed no statistically significant difference from expert-level performance in four of six tasks for PathChat, 2/6 tasks for Gemini, and 1/6 tasks for GPT. DALPHIN is publicly released with sequestered, indirectly accessible ground truth to foster robust and enduring benchmarking. Data, methods, and the evaluation platform are accessible through dalphin.grand-challenge.org.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces DALPHIN, the first multicentric open benchmark for pathology AI copilots. It consists of 1236 images from 300 cases spanning 130 diagnoses across 6 countries and 14 subspecialties, accompanied by a human performance benchmark from 31 pathologists in 10 countries. Evaluations of GPT-5, Gemini 2.5 Pro, and PathChat+ on six tasks show no statistically significant difference from expert-level performance in four tasks for PathChat+, two for Gemini, and one for GPT. The benchmark, methods, and evaluation platform are publicly released with sequestered ground truth via dalphin.grand-challenge.org to support reproducible assessment.
Significance. If the benchmark design and statistical comparisons hold, this work delivers a valuable, open, and enduring resource for the field. It enables standardized evaluation of AI copilots against human experts on a diverse, multicentric dataset, which can accelerate development of reliable pathology AI tools and provide falsifiable performance baselines for future models.
minor comments (2)
- The abstract and results would benefit from a summary table listing per-task metrics, sample sizes, exact p-values, and any multiple-comparison corrections, so that the 'no statistically significant difference' claims are immediately verifiable; a sketch of that verification follows this list.
- Clarify the precise model versions and prompting strategies used for GPT-5 and Gemini 2.5 Pro, as these details affect reproducibility of the sequential and independent answer-generation experiments.
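A minimal sketch of the verification the first comment asks for, assuming per-task accuracies are compared as proportions between a copilot and the pathologist cohort and corrected across the six tasks with Holm's method. Every count and task name below is a placeholder, not a value from the paper.

```python
# Hypothetical sketch: per-task two-proportion tests between a copilot
# and the pathologist cohort, with a Holm correction across the six
# tasks. Every number below is a placeholder, not a value from the paper.
from statsmodels.stats.multitest import multipletests
from statsmodels.stats.proportion import proportions_ztest

# (task, copilot correct, pathologists correct, scored questions per arm)
results = [
    ("task 1", 82, 85, 100),
    ("task 2", 74, 78, 100),
    ("task 3", 69, 80, 100),
    ("task 4", 88, 87, 100),
    ("task 5", 55, 70, 100),
    ("task 6", 90, 92, 100),
]

raw_p = []
for _, ai_ok, human_ok, n in results:
    # Two-proportion z-test: H0 is equal accuracy on this task.
    _, p = proportions_ztest(count=[ai_ok, human_ok], nobs=[n, n])
    raw_p.append(p)

# Holm correction keeps the family-wise error rate at alpha across all
# six per-task "no significant difference" claims.
reject, p_adj, _, _ = multipletests(raw_p, alpha=0.05, method="holm")
for (task, *_), p, r in zip(results, p_adj, reject):
    verdict = "significant difference" if r else "no significant difference"
    print(f"{task}: adjusted p = {p:.3f} ({verdict})")
```

Whether the paper used proportion tests, another statistic, or any correction at all is exactly what the requested table would make explicit.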
Simulated Author's Rebuttal
We thank the referee for their positive assessment of the DALPHIN benchmark and for recommending minor revision. We are pleased that the work is recognized as delivering a valuable open resource for standardized evaluation of pathology AI copilots.
Circularity Check
No circularity: purely empirical benchmark study
full rationale
This is an empirical benchmarking paper that creates an open multicentric dataset (DALPHIN) with 1236 images from 300 cases across 130 diagnoses and 14 subspecialties, then directly compares AI copilots (PathChat+, Gemini, GPT) against performance of 31 human pathologists on six tasks. No derivations, equations, fitted parameters, or predictions by construction appear in the reported setup or results. Central claims rest on statistical comparisons to external human ground truth with sequestered evaluation, not on internal definitions or self-referential steps. The design is presented as transparent and reproducible via public release, with no load-bearing self-citations or ansatzes that reduce the findings to inputs.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Statistical tests used to declare 'no statistically significant difference' are valid and appropriately powered for the sample sizes and metrics in each task (a power sketch follows this list).
- domain assumption The six tasks and 1236 images sufficiently represent the diagnostic challenges pathologists face in clinical practice.
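The first assumption can be probed with a standard power calculation. Here is a minimal sketch using the closed-form sample size for a two-sided two-proportion z-test; the accuracies are illustrative, not values from the paper.

```python
# Hypothetical power check behind the "appropriately powered" assumption:
# closed-form sample size per arm for a two-sided two-proportion z-test.
# The accuracies are illustrative, not values from the paper.
from scipy.stats import norm

def n_per_arm(p1: float, p2: float, alpha: float = 0.05, power: float = 0.8) -> float:
    """Questions per arm needed to detect the accuracy gap p1 vs p2."""
    z_alpha = norm.ppf(1 - alpha / 2)  # two-sided critical value
    z_beta = norm.ppf(power)           # quantile for the target power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2

# Telling 85% copilot accuracy apart from 90% pathologist accuracy:
print(round(n_per_arm(0.85, 0.90)))  # ~683 questions per arm
```

With a few hundred scored questions per task, accuracy gaps of several percentage points can go undetected, which is why the sample sizes behind each 'no significant difference' claim matter.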
Reference graph
Works this paper leans on
- [1] Antol, S. et al. VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision (ICCV) (2015).
- [2] Chen, C. et al. Evidence-based diagnostic reasoning with multi-agent copilot for human pathology (2025). arXiv:2506.20964.
- [3] Vorontsov, E. et al. PRISM2: Unlocking multi-modal general pathology AI with clinical dialogue (2025). arXiv:2506.13063.
- [4] Xu, Z. et al. A versatile pathology co-pilot via reasoning enhanced multimodal large language model (2025). arXiv:2507.17303.
- [5] Abacha, A. B. et al. VQA-Med: Overview of the medical visual question answering task at ImageCLEF 2019. In CLEF 2019 Working Notes (2019).
- [6] Bereuter, J.-P. et al. Benchmarking vision capabilities of large language models in surgical examination questions. J. Surg. Educ. 82, 103442, 10.1016/j.jsurg.2025.103442 (2025).
- [7] Clusmann, J. et al. The future landscape of large language models in medicine. Commun. Medicine 3, 10.1038/s43856-023-00370-1 (2023).
- [8] Zhang, X. et al. Development of a large-scale medical visual question-answering dataset. Commun. Medicine 4, 10.1038/s43856-024-00709-2 (2024).
- [9] He, X., Zhang, Y., Mou, L., Xing, E. & Xie, P. PathVQA: 30000+ questions for medical visual question answering (2020). arXiv:2003.10286.
- [10] Lu, M. Y. et al. A multimodal generative AI copilot for human pathology. Nature 634, 466–473, 10.1038/s41586-024-07618-3 (2024).
- [11] Stegeman, M. et al. Designing UNICORN: a unified benchmark for imaging in computational pathology, radiology, and natural language (2026). arXiv:2603.02790.
- [12] van Rijthoven, M. et al. Tumor-infiltrating lymphocytes in breast cancer through artificial intelligence: biomarker analysis from the results of the TIGER challenge. medRxiv, 10.1101/2025.02.28.25323078 (2025).
- [13] Bulten, W. et al. Artificial intelligence for diagnosis and Gleason grading of prostate cancer: the PANDA challenge. Nat. Medicine 28, 154–163, 10.1038/s41591-021-01620-2 (2022).
- [14] Meakin, J. et al. Grand-Challenge.org (v2025.08.1), 10.5281/zenodo.16780413 (2025).
- [15] Google DeepMind. Gemini 2.5 Pro. https://developers.generativeai.google/ (2025). Accessed September 2025.
- [16] OpenAI. GPT-5. https://openai.com/gpt-5 (2025). Accessed September 2025.
- [17] Nederlandse Vereniging voor Pathologie. Modernisering opleidingsplan 2 (MOP2) pathologie. https://pathologie.nl/opleidingseisen/ (2024).
- [18] Gatta, G. et al. Rare cancers are not so rare: The rare cancer burden in Europe. Eur. J. Cancer 47, 2493–2511, 10.1016/j.ejca.2011.08.008 (2011).
- [19] Bankhead, P. et al. QuPath: Open source software for digital pathology image analysis. Sci. Reports 7, 16878, 10.1038/s41598-017-17204-5 (2017).
- [20] Indica Labs. HALO image analysis platform. https://www.indicalab.com/halo (2025). Albuquerque, NM, USA.
- [21] Lems, C. et al. Towards a multicentric open digital pathology assistant benchmark: Initial results from the DALPHIN study. Lab. Investig. 105, 103609, 10.1016/j.labinv.2024.103609 (2025).
- [22] Moonemans, S. et al. Democratizing pathology co-pilots: An open pipeline and dataset for whole-slide vision-language modelling. In Medical Imaging with Deep Learning (2026).
- [23] Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q. & Artzi, Y. BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations (ICLR) (2020).
- [24] Lee, J. et al. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36, 1234–1240, 10.1093/bioinformatics/btz682 (2020).