pith. machine review for the scientific record.

arxiv: 2605.03544 · v1 · submitted 2026-05-05 · 💻 cs.CV · cs.AI

Recognition: unknown

DALPHIN: Benchmarking Digital Pathology AI Copilots Against Pathologists on an Open Multicentric Dataset

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 01:21 UTC · model grok-4.3

classification: 💻 cs.CV · cs.AI
keywords: digital pathology · AI copilots · benchmarking · multicentric dataset · visual question answering · pathologist performance · PathChat · foundation models

The pith

PathChat matches expert pathologists with no statistically significant difference in four of six tasks on a new open multicentric pathology benchmark.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates DALPHIN, the first open multicentric benchmark for pathology AI copilots, with 1236 images from 300 cases covering 130 diagnoses across six countries and 14 subspecialties. It pairs this dataset with a human performance baseline collected from 31 pathologists of varying expertise in ten countries. On six tasks involving image interpretation and diagnosis, the pathology-specific model PathChat shows no statistically significant gap from expert performance in four tasks, while the general models Gemini and GPT match experts in two and one task respectively. The benchmark is released publicly with sequestered ground truth to support repeated, independent testing of future AI systems.

Core claim

The authors establish that foundation models equipped for visual question answering in pathology can reach performance levels statistically indistinguishable from expert pathologists on a deliberately diverse, multicentric set of cases when evaluated through standardized tasks and scoring. PathChat+ achieves this equivalence in four of the six tasks, Gemini 2.5 Pro in two tasks, and GPT-5 in one task, under both sequential and independent answer generation protocols. The benchmark design includes explicit controls for case rarity, geographic variation, and subspecialty coverage to ground these comparisons.
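The paper's prompting code is not reproduced here, but the distinction between the two evaluation protocols can be illustrated with a minimal sketch, assuming a generic chat-style `ask(messages)` wrapper around any of the evaluated VLMs; the message format, prompts, and image handling are placeholders, not the study's actual implementation.

```python
# Minimal sketch of the two answer-generation protocols named in the paper
# (sequential vs. independent). `ask` stands in for any chat-style VLM call;
# message format and image handling are placeholders, not the study's code.
from typing import Callable, Sequence

Message = dict  # e.g. {"role": "user", "content": "..."} plus image parts

def sequential_answers(ask: Callable[[list], str],
                       case_images: Sequence[Message],
                       questions: Sequence[str]) -> list[str]:
    """All questions for a case share one growing conversation."""
    history = list(case_images)
    answers = []
    for q in questions:
        history.append({"role": "user", "content": q})
        reply = ask(history)                        # model sees earlier Q&A
        history.append({"role": "assistant", "content": reply})
        answers.append(reply)
    return answers

def independent_answers(ask: Callable[[list], str],
                        case_images: Sequence[Message],
                        questions: Sequence[str]) -> list[str]:
    """Each question is asked in a fresh context holding only the case images."""
    return [ask(list(case_images) + [{"role": "user", "content": q}])
            for q in questions]
```

The only difference between the two is whether later answers can condition on earlier questions and replies for the same case.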

What carries the argument

The DALPHIN benchmark: a fixed collection of 1236 images drawn from 300 real cases, used to score AI and human answers on six predefined diagnostic tasks, with sequestered ground truth that is accessible only indirectly through the evaluation platform.
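For the free-response tasks, Figure 3 reports BioBERT similarity scores [22] between model or pathologist answers and the ground-truth diagnosis. The benchmark's own scoring runs behind the sequestered platform, but the general shape of such an embedding-similarity score can be sketched as follows; the checkpoint choice and mean pooling are assumptions for illustration, not the study's documented pipeline.

```python
# Sketch of an embedding-similarity score between a free-text answer and the
# ground-truth diagnosis, in the spirit of the BioBERT similarity in Figure 3.
# Checkpoint choice and mean pooling are assumptions, not the paper's pipeline.
import torch
from transformers import AutoTokenizer, AutoModel

MODEL = "dmis-lab/biobert-base-cased-v1.1"  # public BioBERT checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL).eval()

def embed(text: str) -> torch.Tensor:
    """Mean-pooled last-hidden-state embedding of a short text."""
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state        # (1, tokens, dim)
    mask = enc["attention_mask"].unsqueeze(-1)         # ignore padding tokens
    return (hidden * mask).sum(1) / mask.sum(1)

def similarity(answer: str, ground_truth: str) -> float:
    """Cosine similarity in [-1, 1]; higher means semantically closer."""
    return torch.nn.functional.cosine_similarity(
        embed(answer), embed(ground_truth)).item()

print(similarity("invasive ductal carcinoma of the breast",
                 "breast carcinoma, invasive, no special type"))
```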

If this is right

  • AI copilots could be deployed as assistance tools for the subset of tasks where equivalence holds without expected loss in diagnostic accuracy.
  • The public release of DALPHIN with sequestered ground truth creates a stable platform for tracking whether future models surpass or fall behind current expert baselines.
  • Gaps remaining in two to five tasks per model identify concrete targets for further training or prompting improvements.
  • The multicentric construction implies that observed performance levels are not limited to a single institution or region.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If the equivalence holds under broader clinical use, workload relief for pathologists becomes feasible on routine cases while preserving accuracy.
  • The benchmark could be extended to test multi-turn diagnostic dialogues or integration with clinical context beyond static images.
  • Differences in success rates between the pathology-specific and general-purpose models indicate that domain-specific fine-tuning still confers measurable advantages.

Load-bearing premise

The six tasks, selected images, and answer evaluation rules in DALPHIN adequately represent the full range of complexity, ambiguity, and real-world decisions pathologists face in daily practice across subspecialties and countries.

What would settle it

A replication study on a new collection of cases or an expanded task set in which any of the tested models shows a statistically significant performance deficit relative to the pathologist cohort would falsify the reported equivalence.

Figures

Figures reproduced from arXiv: 2605.03544 by Adam Kowalewski, Anne-Marie Vos, Anouk B. Bouwmeester, Arvydas Laurinavicius, Biagio Brattoli, Brinder S. Chohan, Carlijn Lems, Chengkuan Chen, Diana Montezuma, Dina Tiniakos, Domingos Oliveira, Dominique van Midden, Donatas Petroska, Enrico Munari, Faisal Mahmood, Francesco Ciompi, Frédérique Meeuwsen, Geert J.L.H. van Leenders, Giulia Querzoli, Iris Nagtegaal, Jaeike W. Faber, Jake S.F. Maurits, Jan H. von der Thüsen, Jeroen van der Laak, Joan Lop Gros, Jolique van Ipenburg, Jordi Temprana-Salvador, Josef Skopal, Julius Drachneris, Katrien Grünberg, Koen Winkler, Konnie Hebeda, Laura Pons, Lodewijk A.A. Brosens, Luca Cima, Maschenka Balkenhol, Mateusz Maniewski, Mauricio Eduardo Suárez-Franck, Ming Yang Lu, Nadieh Khalili, Natálie Klubíčková, Pedro Luis Fernandez, Pieter Wesseling, Renaldas Augulis, Robert Barna, Rogier Donders, Ronald R. de Krijger, Sander Moonemans, Sandrine Florquin, Sapir Hochman, Seokhwi Kim, Shoko Vos, Taebum Lee, Uta Flucke, Veronica Vilaplana, Yosamin Gonzalez Belisario.

Figure 1: Overview of this study. (a) Illustrative example demonstrating the format of a DALPHIN benchmark case. Each case includes a low-resolution overview of one or more histopathology whole-slide images and one or more regions of interest (ROIs) selected by the contributing pathologist. In addition, each case includes four standard questions (when applicable), a case-specific multiple-choice question, and option…

Figure 2: Evaluation of VLMs on initial case-orienting tasks and comparison with subspecialty expert and non-expert (resident) pathologists. Error bars for VLMs and shaded regions for pathologists indicate 95% confidence intervals. (a) Organ-recognition performance of VLMs, experts, and non-experts on a free-response organ recognition task (Qtissue) in DALPHINfull and DALPHINreader. (b) Organ-recognition performance…

Figure 3: Evaluation of VLMs on a free-response diagnosis task (Qdiagnosis) and comparison with subspecialty expert and non-expert (resident) pathologists. (a) BioBERT similarity scores of VLMs, experts, and non-experts in DALPHINfull and DALPHINreader. White points with error bars indicate mean scores with 95% confidence intervals. (b) Traditional and semantic NLP metric scores for VLMs, experts, and non-experts, s…

Figure 4: Evaluation of VLMs on case-specific multiple-choice (Qmc) and free-response questions (Qopen) and comparison with subspecialty expert and non-expert (resident) pathologists. Unless otherwise noted, error bars for VLMs and shaded regions for pathologists indicate 95% confidence intervals. (a) Accuracy of VLMs, experts, and non-experts on Qmc questions in DALPHINfull and DALPHINreader. (b) Accuracy of VLMs f…
Original abstract

Foundation models with visual question answering capabilities for digital pathology are emerging. Such unprecedented technology requires independent benchmarking to assess its potential in assisting pathologists in routine diagnostics. We created DALPHIN, the first multicentric open benchmark for pathology AI copilots, comprising 1236 images from 300 cases, spanning 130 rare to common diagnoses, 6 countries, and 14 subspecialties. The DALPHIN design and dataset are introduced alongside a human performance benchmark of 31 pathologists from 10 countries with varying expertise. We report results for two general-purpose (GPT-5, Gemini 2.5 Pro) and one pathology-specific copilot (PathChat+) for sequential and independent answer generation. We observed no statistically significant difference from expert-level performance in four of six tasks for PathChat, 2/6 tasks for Gemini, and 1/6 tasks for GPT. DALPHIN is publicly released with sequestered, indirectly accessible ground truth to foster robust and enduring benchmarking. Data, methods, and the evaluation platform are accessible through dalphin.grand-challenge.org.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript introduces DALPHIN, the first multicentric open benchmark for pathology AI copilots. It consists of 1236 images from 300 cases spanning 130 diagnoses across 6 countries and 14 subspecialties, accompanied by a human performance benchmark from 31 pathologists in 10 countries. Evaluations of GPT-5, Gemini 2.5 Pro, and PathChat+ on six tasks show no statistically significant difference from expert-level performance in four tasks for PathChat+, two for Gemini, and one for GPT. The benchmark, methods, and evaluation platform are publicly released with sequestered ground truth via dalphin.grand-challenge.org to support reproducible assessment.

Significance. If the benchmark design and statistical comparisons hold, this work delivers a valuable, open, and enduring resource for the field. It enables standardized evaluation of AI copilots against human experts on a diverse, multicentric dataset, which can accelerate development of reliable pathology AI tools and provide falsifiable performance baselines for future models.

minor comments (2)
  1. The abstract and results would benefit from a summary table listing per-task metrics, sample sizes, exact p-values, and any multiple-comparison corrections to make the 'no statistically significant difference' claims immediately verifiable; a sketch of one such correction follows this list.
  2. Clarify the precise model versions and prompting strategies used for GPT-5 and Gemini 2.5 Pro, as these details affect reproducibility of the sequential and independent answer-generation experiments.
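How the correction asked for in minor comment 1 might look in practice: a minimal sketch of a per-task permutation comparison between a model and the expert cohort, followed by a Holm step-down adjustment across the six tasks. The per-case scores, the 300-case size, and the choice of Holm over other corrections are illustrative assumptions, not the paper's actual statistical procedure.

```python
# Sketch of per-task significance testing with a Holm step-down correction,
# in the spirit of what minor comment 1 asks the authors to report. Per-case
# scores, the 300-case size, and the permutation test are assumptions.
import numpy as np

rng = np.random.default_rng(0)

def permutation_p(model_scores, expert_scores, n_perm=10_000):
    """Two-sided permutation p-value for the difference in mean per-case score."""
    observed = model_scores.mean() - expert_scores.mean()
    pooled = np.concatenate([model_scores, expert_scores])
    n = len(model_scores)
    diffs = np.empty(n_perm)
    for b in range(n_perm):
        shuffled = rng.permutation(pooled)           # null: labels exchangeable
        diffs[b] = shuffled[:n].mean() - shuffled[n:].mean()
    return float(np.mean(np.abs(diffs) >= abs(observed)))

def holm_adjust(pvals):
    """Holm step-down adjusted p-values, returned in the original task order."""
    order = np.argsort(pvals)
    adjusted = np.empty(len(pvals))
    running_max = 0.0
    m = len(pvals)
    for rank, idx in enumerate(order):
        running_max = max(running_max, (m - rank) * pvals[idx])
        adjusted[idx] = min(1.0, running_max)
    return adjusted

# Hypothetical per-case correctness (0/1) for one model and the expert cohort
raw = np.array([permutation_p(rng.integers(0, 2, 300).astype(float),
                              rng.integers(0, 2, 300).astype(float))
                for _ in range(6)])
for i, (p_raw, p_adj) in enumerate(zip(raw, holm_adjust(raw)), start=1):
    print(f"task {i}: raw p = {p_raw:.3f}, Holm-adjusted p = {p_adj:.3f}")
```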

Simulated Authors' Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of the DALPHIN benchmark and for recommending minor revision. We are pleased that the work is recognized as delivering a valuable open resource for standardized evaluation of pathology AI copilots.

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark study

full rationale

This is an empirical benchmarking paper that creates an open multicentric dataset (DALPHIN) with 1236 images from 300 cases across 130 diagnoses and 14 subspecialties, then directly compares AI copilots (PathChat+, Gemini, GPT) against the performance of 31 human pathologists on six tasks. No derivations, equations, fitted parameters, or predictions by construction appear in the reported setup or results. Central claims rest on statistical comparisons to external human ground truth with sequestered evaluation, not on internal definitions or self-referential steps. The design is presented as transparent and reproducible via public release, with no load-bearing self-citations or ansatzes that reduce the findings to inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The work rests on standard assumptions about statistical significance testing and the representativeness of the chosen cases and tasks for real pathology practice; no free parameters are fitted to produce the main claims, and no new entities are postulated.

axioms (2)
  • Domain assumption: Statistical tests used to declare 'no statistically significant difference' are valid and appropriately powered for the sample sizes and metrics in each task.
    The central performance claims depend on these tests being correctly applied, with any multiple-comparison corrections and other caveats disclosed (a rough power sketch follows this list).
  • Domain assumption: The six tasks and 1236 images sufficiently represent the diagnostic challenges pathologists face in clinical practice.
    This is required for the human-AI comparison to generalize beyond the benchmark itself.
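A quick way to probe the first assumption is to simulate the power of a per-task comparison: with roughly 300 cases per task, how often would a given true accuracy gap be flagged as significant? The 300-case count, 5% level, and two-proportion z-test below are illustrative assumptions, not the paper's analysis, and the real per-task sample sizes and metrics may differ.

```python
# Rough power simulation for the first assumption: with ~300 cases per task,
# how detectable is a true accuracy gap between a model and the expert baseline?
# The case count, 5% level, and two-proportion z-test are assumptions.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

def two_proportion_p(k1, k2, n):
    """Two-sided z-test for equal proportions with n cases per arm."""
    p1, p2, pooled = k1 / n, k2 / n, (k1 + k2) / (2 * n)
    se = np.sqrt(2 * pooled * (1 - pooled) / n)
    if se == 0:
        return 1.0
    return 2 * norm.sf(abs(p1 - p2) / se)

def power(expert_acc, gap, n_cases=300, alpha=0.05, n_sim=5000):
    """Fraction of simulated benchmarks in which the gap reaches significance."""
    hits = 0
    for _ in range(n_sim):
        k_expert = rng.binomial(n_cases, expert_acc)
        k_model = rng.binomial(n_cases, expert_acc - gap)
        hits += two_proportion_p(k_expert, k_model, n_cases) < alpha
    return hits / n_sim

for gap in (0.05, 0.10, 0.15):
    print(f"true gap {gap:.0%}: power ~ {power(0.85, gap):.2f}")
```

Small true gaps can easily go undetected at this scale, which is why the adequacy of power matters for interpreting 'no statistically significant difference'.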

pith-pipeline@v0.9.0 · 5812 in / 1567 out tokens · 66827 ms · 2026-05-08T01:21:17.231139+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

22 extracted references · 13 canonical work pages · 1 internal anchor

  1. [1] Antol, S. et al. VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision (ICCV) (2015).
  2. [2] Chen, C. et al. Evidence-based diagnostic reasoning with multi-agent copilot for human pathology (2025). arXiv:2506.20964.
  3. [3] Vorontsov, E. et al. PRISM2: Unlocking multi-modal general pathology AI with clinical dialogue (2025). arXiv:2506.13063.
  4. [4] Xu, Z. et al. A versatile pathology co-pilot via reasoning enhanced multimodal large language model (2025). arXiv:2507.17303.
  5. [5] Abacha, A. B. et al. VQA-Med: Overview of the medical visual question answering task at ImageCLEF 2019. In CLEF 2019 Working Notes (2019).
  6. [6] Bereuter, J.-P. et al. Benchmarking vision capabilities of large language models in surgical examination questions. J. Surg. Educ. 82, 103442, 10.1016/j.jsurg.2025.103442 (2025).
  7. [7] Clusmann, J. et al. The future landscape of large language models in medicine. Commun. Medicine 3, 10.1038/s43856-023-00370-1 (2023).
  8. [8] Zhang, X. et al. Development of a large-scale medical visual question-answering dataset. Commun. Medicine 4, 10.1038/s43856-024-00709-2 (2024).
  9. [9] He, X., Zhang, Y., Mou, L., Xing, E. & Xie, P. PathVQA: 30000+ questions for medical visual question answering (2020). arXiv:2003.10286.
  10. [10] Lu, M. Y. et al. A multimodal generative AI copilot for human pathology. Nature 634, 466–473, 10.1038/s41586-024-07618-3 (2024).
  11. [11] Stegeman, M. et al. Designing UNICORN: a unified benchmark for imaging in computational pathology, radiology, and natural language (2026). arXiv:2603.02790.
  12. [12] van Rijthoven, M. et al. Tumor-infiltrating lymphocytes in breast cancer through artificial intelligence: biomarker analysis from the results of the TIGER challenge. medRxiv 10.1101/2025.02.28.25323078 (2025).
  13. [13] Bulten, W. et al. Artificial intelligence for diagnosis and Gleason grading of prostate cancer: the PANDA challenge. Nat. Medicine 28, 154–163, 10.1038/s41591-021-01620-2 (2022); Meakin, J. et al. Grand-Challenge.org (v2025.08.1), 10.5281/zenodo.16780413 (2025).
  14. [14] Google DeepMind. Gemini 2.5 Pro. https://developers.generativeai.google/ (2025). Accessed September 2025; OpenAI. GPT-5. https://openai.com/gpt-5 (2025). Accessed September 2025.
  15. [15] Nederlandse Vereniging voor Pathologie. Modernisering opleidingsplan 2 (MOP2) pathologie. https://pathologie.nl/opleidingseisen/ (2024).
  16. [16] Gatta, G. et al. Rare cancers are not so rare: The rare cancer burden in Europe. Eur. J. Cancer 47, 2493–2511, 10.1016/j.ejca.2011.08.008 (2011).
  17. [17] Bankhead, P. et al. QuPath: Open source software for digital pathology image analysis. Sci. Reports 7, 16878, 10.1038/s41598-017-17204-5 (2017).
  18. [18] Indica Labs. HALO image analysis platform. https://www.indicalab.com/halo (2025). Albuquerque, NM, USA.
  19. [19] Lems, C. et al. Towards a multicentric open digital pathology assistant benchmark: Initial results from the DALPHIN study. Lab. Investig. 105, 103609, 10.1016/j.labinv.2024.103609 (2025).
  20. [20] Moonemans, S. et al. Democratizing pathology co-pilots: An open pipeline and dataset for whole-slide vision-language modelling. In Medical Imaging with Deep Learning (2026).
  21. [21] Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q. & Artzi, Y. BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations (ICLR) (2020).
  22. [22] Lee, J. et al. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36, 1234–1240, 10.1093/bioinformatics/btz682 (2020).