pith. machine review for the scientific record.

arxiv: 2605.03544 · v1 · submitted 2026-05-05 · 💻 cs.CV · cs.AI

Recognition: unknown

DALPHIN: Benchmarking Digital Pathology AI Copilots Against Pathologists on an Open Multicentric Dataset

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 01:21 UTC · model grok-4.3

classification: 💻 cs.CV · cs.AI
keywords: digital pathology · AI copilots · benchmarking · multicentric dataset · visual question answering · pathologist performance · PathChat · foundation models

The pith

PathChat matches expert pathologists with no statistically significant difference in four of six tasks on a new open multicentric pathology benchmark.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates DALPHIN, the first open multicentric benchmark for pathology AI copilots, with 1236 images from 300 cases covering 130 diagnoses across six countries and 14 subspecialties. It pairs this dataset with a human performance baseline collected from 31 pathologists of varying expertise in ten countries. On six tasks involving image interpretation and diagnosis, the pathology-specific model PathChat shows no statistically significant gap from expert performance in four tasks, while the general models Gemini and GPT match experts in two and one task respectively. The benchmark is released publicly with sequestered ground truth to support repeated, independent testing of future AI systems.

Core claim

The authors establish that foundation models equipped for visual question answering in pathology can reach performance levels statistically indistinguishable from expert pathologists on a deliberately diverse, multicentric set of cases when evaluated through standardized tasks and scoring. PathChat+ achieves this equivalence in four of the six tasks, Gemini 2.5 Pro in two tasks, and GPT-5 in one task, under both sequential and independent answer generation protocols. The benchmark design includes explicit controls for case rarity, geographic variation, and subspecialty coverage to ground these comparisons.
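The paper's prompting code is not reproduced here, but the distinction between the two evaluation protocols can be illustrated with a minimal sketch, assuming a generic chat-style `ask(messages)` wrapper around any of the evaluated VLMs; the message format, prompts, and image handling are placeholders, not the study's actual implementation.

```python
# Minimal sketch of the two answer-generation protocols named in the paper
# (sequential vs. independent). `ask` stands in for any chat-style VLM call;
# message format and image handling are placeholders, not the study's code.
from typing import Callable, Sequence

Message = dict  # e.g. {"role": "user", "content": "..."} plus image parts

def sequential_answers(ask: Callable[[list], str],
                       case_images: Sequence[Message],
                       questions: Sequence[str]) -> list[str]:
    """All questions for a case share one growing conversation."""
    history = list(case_images)
    answers = []
    for q in questions:
        history.append({"role": "user", "content": q})
        reply = ask(history)                        # model sees earlier Q&A
        history.append({"role": "assistant", "content": reply})
        answers.append(reply)
    return answers

def independent_answers(ask: Callable[[list], str],
                        case_images: Sequence[Message],
                        questions: Sequence[str]) -> list[str]:
    """Each question is asked in a fresh context holding only the case images."""
    return [ask(list(case_images) + [{"role": "user", "content": q}])
            for q in questions]
```

The only difference between the two is whether later answers can condition on earlier questions and replies for the same case.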

What carries the argument

The DALPHIN benchmark: a fixed collection of 1236 images drawn from 300 real cases, used to score AI and human answers on six predefined diagnostic tasks, with sequestered ground truth that is accessible only indirectly through the evaluation platform.
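For the free-response tasks, Figure 3 reports BioBERT similarity scores [22] between model or pathologist answers and the ground-truth diagnosis. The benchmark's own scoring runs behind the sequestered platform, but the general shape of such an embedding-similarity score can be sketched as follows; the checkpoint choice and mean pooling are assumptions for illustration, not the study's documented pipeline.

```python
# Sketch of an embedding-similarity score between a free-text answer and the
# ground-truth diagnosis, in the spirit of the BioBERT similarity in Figure 3.
# Checkpoint choice and mean pooling are assumptions, not the paper's pipeline.
import torch
from transformers import AutoTokenizer, AutoModel

MODEL = "dmis-lab/biobert-base-cased-v1.1"  # public BioBERT checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL).eval()

def embed(text: str) -> torch.Tensor:
    """Mean-pooled last-hidden-state embedding of a short text."""
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state        # (1, tokens, dim)
    mask = enc["attention_mask"].unsqueeze(-1)         # ignore padding tokens
    return (hidden * mask).sum(1) / mask.sum(1)

def similarity(answer: str, ground_truth: str) -> float:
    """Cosine similarity in [-1, 1]; higher means semantically closer."""
    return torch.nn.functional.cosine_similarity(
        embed(answer), embed(ground_truth)).item()

print(similarity("invasive ductal carcinoma of the breast",
                 "breast carcinoma, invasive, no special type"))
```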

If this is right

  • AI copilots could be deployed as assistance tools for the subset of tasks where equivalence holds without expected loss in diagnostic accuracy.
  • The public release of DALPHIN with sequestered ground truth creates a stable platform for tracking whether future models surpass or fall behind current expert baselines.
  • Gaps remaining in two to five tasks per model identify concrete targets for further training or prompting improvements.
  • The multicentric construction implies that observed performance levels are not limited to a single institution or region.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If the equivalence holds under broader clinical use, workload relief for pathologists becomes feasible on routine cases while preserving accuracy.
  • The benchmark could be extended to test multi-turn diagnostic dialogues or integration with clinical context beyond static images.
  • Differences in success rates between the pathology-specific and general-purpose models indicate that domain-specific fine-tuning still confers measurable advantages.

Load-bearing premise

The six tasks, selected images, and answer evaluation rules in DALPHIN adequately represent the full range of complexity, ambiguity, and real-world decisions pathologists face in daily practice across subspecialties and countries.

What would settle it

A replication study on a new collection of cases or an expanded task set in which any of the tested models shows a statistically significant performance deficit relative to the pathologist cohort would falsify the reported equivalence.

Figures

Figures reproduced from arXiv: 2605.03544 by Adam Kowalewski, Anne-Marie Vos, Anouk B. Bouwmeester, Arvydas Laurinavicius, Biagio Brattoli, Brinder S. Chohan, Carlijn Lems, Chengkuan Chen, Diana Montezuma, Dina Tiniakos, Domingos Oliveira, Dominique van Midden, Donatas Petroska, Enrico Munari, Faisal Mahmood, Francesco Ciompi, Frédérique Meeuwsen, Geert J.L.H. van Leenders, Giulia Querzoli, Iris Nagtegaal, Jaeike W. Faber, Jake S.F. Maurits, Jan H. von der Thüsen, Jeroen van der Laak, Joan Lop Gros, Jolique van Ipenburg, Jordi Temprana-Salvador, Josef Skopal, Julius Drachneris, Katrien Grünberg, Koen Winkler, Konnie Hebeda, Laura Pons, Lodewijk A.A. Brosens, Luca Cima, Maschenka Balkenhol, Mateusz Maniewski, Mauricio Eduardo Suárez-Franck, Ming Yang Lu, Nadieh Khalili, Natálie Klubíčková, Pedro Luis Fernandez, Pieter Wesseling, Renaldas Augulis, Robert Barna, Rogier Donders, Ronald R. de Krijger, Sander Moonemans, Sandrine Florquin, Sapir Hochman, Seokhwi Kim, Shoko Vos, Taebum Lee, Uta Flucke, Veronica Vilaplana, Yosamin Gonzalez Belisario.

Figure 1: Overview of this study. (a) Illustrative example demonstrating the format of a DALPHIN benchmark case. Each case includes a low-resolution overview of one or more histopathology whole-slide images and one or more regions of interest (ROIs) selected by the contributing pathologist. In addition, each case includes four standard questions (when applicable), a case-specific multiple-choice question, and option…

Figure 2: Evaluation of VLMs on initial case-orienting tasks and comparison with subspecialty expert and non-expert (resident) pathologists. Error bars for VLMs and shaded regions for pathologists indicate 95% confidence intervals. (a) Organ-recognition performance of VLMs, experts, and non-experts on a free-response organ recognition task (Qtissue) in DALPHINfull and DALPHINreader. (b) Organ-recognition performance…

Figure 3: Evaluation of VLMs on a free-response diagnosis task (Qdiagnosis) and comparison with subspecialty expert and non-expert (resident) pathologists. (a) BioBERT similarity scores of VLMs, experts, and non-experts in DALPHINfull and DALPHINreader. White points with error bars indicate mean scores with 95% confidence intervals. (b) Traditional and semantic NLP metric scores for VLMs, experts, and non-experts, s…

Figure 4: Evaluation of VLMs on case-specific multiple-choice (Qmc) and free-response questions (Qopen) and comparison with subspecialty expert and non-expert (resident) pathologists. Unless otherwise noted, error bars for VLMs and shaded regions for pathologists indicate 95% confidence intervals. (a) Accuracy of VLMs, experts, and non-experts on Qmc questions in DALPHINfull and DALPHINreader. (b) Accuracy of VLMs f…
Original abstract

Foundation models with visual question answering capabilities for digital pathology are emerging. Such unprecedented technology requires independent benchmarking to assess its potential in assisting pathologists in routine diagnostics. We created DALPHIN, the first multicentric open benchmark for pathology AI copilots, comprising 1236 images from 300 cases, spanning 130 rare to common diagnoses, 6 countries, and 14 subspecialties. The DALPHIN design and dataset are introduced alongside a human performance benchmark of 31 pathologists from 10 countries with varying expertise. We report results for two general-purpose (GPT-5, Gemini 2.5 Pro) and one pathology-specific copilot (PathChat+) for sequential and independent answer generation. We observed no statistically significant difference from expert-level performance in four of six tasks for PathChat, 2/6 tasks for Gemini, and 1/6 tasks for GPT. DALPHIN is publicly released with sequestered, indirectly accessible ground truth to foster robust and enduring benchmarking. Data, methods, and the evaluation platform are accessible through dalphin.grand-challenge.org.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript introduces DALPHIN, the first multicentric open benchmark for pathology AI copilots. It consists of 1236 images from 300 cases spanning 130 diagnoses across 6 countries and 14 subspecialties, accompanied by a human performance benchmark from 31 pathologists in 10 countries. Evaluations of GPT-5, Gemini 2.5 Pro, and PathChat+ on six tasks show no statistically significant difference from expert-level performance in four tasks for PathChat+, two for Gemini, and one for GPT. The benchmark, methods, and evaluation platform are publicly released with sequestered ground truth via dalphin.grand-challenge.org to support reproducible assessment.

Significance. If the benchmark design and statistical comparisons hold, this work delivers a valuable, open, and enduring resource for the field. It enables standardized evaluation of AI copilots against human experts on a diverse, multicentric dataset, which can accelerate development of reliable pathology AI tools and provide falsifiable performance baselines for future models.

minor comments (2)
  1. The abstract and results would benefit from a summary table listing per-task metrics, sample sizes, exact p-values, and any multiple-comparison corrections to make the 'no statistically significant difference' claims immediately verifiable; a sketch of one such correction follows this list.
  2. Clarify the precise model versions and prompting strategies used for GPT-5 and Gemini 2.5 Pro, as these details affect reproducibility of the sequential and independent answer-generation experiments.
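How the correction asked for in minor comment 1 might look in practice: a minimal sketch of a per-task permutation comparison between a model and the expert cohort, followed by a Holm step-down adjustment across the six tasks. The per-case scores, the 300-case size, and the choice of Holm over other corrections are illustrative assumptions, not the paper's actual statistical procedure.

```python
# Sketch of per-task significance testing with a Holm step-down correction,
# in the spirit of what minor comment 1 asks the authors to report. Per-case
# scores, the 300-case size, and the permutation test are assumptions.
import numpy as np

rng = np.random.default_rng(0)

def permutation_p(model_scores, expert_scores, n_perm=10_000):
    """Two-sided permutation p-value for the difference in mean per-case score."""
    observed = model_scores.mean() - expert_scores.mean()
    pooled = np.concatenate([model_scores, expert_scores])
    n = len(model_scores)
    diffs = np.empty(n_perm)
    for b in range(n_perm):
        shuffled = rng.permutation(pooled)           # null: labels exchangeable
        diffs[b] = shuffled[:n].mean() - shuffled[n:].mean()
    return float(np.mean(np.abs(diffs) >= abs(observed)))

def holm_adjust(pvals):
    """Holm step-down adjusted p-values, returned in the original task order."""
    order = np.argsort(pvals)
    adjusted = np.empty(len(pvals))
    running_max = 0.0
    m = len(pvals)
    for rank, idx in enumerate(order):
        running_max = max(running_max, (m - rank) * pvals[idx])
        adjusted[idx] = min(1.0, running_max)
    return adjusted

# Hypothetical per-case correctness (0/1) for one model and the expert cohort
raw = np.array([permutation_p(rng.integers(0, 2, 300).astype(float),
                              rng.integers(0, 2, 300).astype(float))
                for _ in range(6)])
for i, (p_raw, p_adj) in enumerate(zip(raw, holm_adjust(raw)), start=1):
    print(f"task {i}: raw p = {p_raw:.3f}, Holm-adjusted p = {p_adj:.3f}")
```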

Simulated Authors' Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of the DALPHIN benchmark and for recommending minor revision. We are pleased that the work is recognized as delivering a valuable open resource for standardized evaluation of pathology AI copilots.

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark study

full rationale

This is an empirical benchmarking paper that creates an open multicentric dataset (DALPHIN) with 1236 images from 300 cases across 130 diagnoses and 14 subspecialties, then directly compares AI copilots (PathChat+, Gemini, GPT) against the performance of 31 human pathologists on six tasks. No derivations, equations, fitted parameters, or predictions by construction appear in the reported setup or results. Central claims rest on statistical comparisons to external human ground truth with sequestered evaluation, not on internal definitions or self-referential steps. The design is presented as transparent and reproducible via public release, with no load-bearing self-citations or ansatzes that reduce the findings to inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The work rests on standard assumptions about statistical significance testing and the representativeness of the chosen cases and tasks for real pathology practice; no free parameters are fitted to produce the main claims, and no new entities are postulated.

axioms (2)
  • Domain assumption: Statistical tests used to declare 'no statistically significant difference' are valid and appropriately powered for the sample sizes and metrics in each task.
    The central performance claims depend on these tests being correctly applied, with any multiple-comparison corrections and other caveats disclosed (a rough power sketch follows this list).
  • Domain assumption: The six tasks and 1236 images sufficiently represent the diagnostic challenges pathologists face in clinical practice.
    This is required for the human-AI comparison to generalize beyond the benchmark itself.
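A quick way to probe the first assumption is to simulate the power of a per-task comparison: with roughly 300 cases per task, how often would a given true accuracy gap be flagged as significant? The 300-case count, 5% level, and two-proportion z-test below are illustrative assumptions, not the paper's analysis, and the real per-task sample sizes and metrics may differ.

```python
# Rough power simulation for the first assumption: with ~300 cases per task,
# how detectable is a true accuracy gap between a model and the expert baseline?
# The case count, 5% level, and two-proportion z-test are assumptions.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

def two_proportion_p(k1, k2, n):
    """Two-sided z-test for equal proportions with n cases per arm."""
    p1, p2, pooled = k1 / n, k2 / n, (k1 + k2) / (2 * n)
    se = np.sqrt(2 * pooled * (1 - pooled) / n)
    if se == 0:
        return 1.0
    return 2 * norm.sf(abs(p1 - p2) / se)

def power(expert_acc, gap, n_cases=300, alpha=0.05, n_sim=5000):
    """Fraction of simulated benchmarks in which the gap reaches significance."""
    hits = 0
    for _ in range(n_sim):
        k_expert = rng.binomial(n_cases, expert_acc)
        k_model = rng.binomial(n_cases, expert_acc - gap)
        hits += two_proportion_p(k_expert, k_model, n_cases) < alpha
    return hits / n_sim

for gap in (0.05, 0.10, 0.15):
    print(f"true gap {gap:.0%}: power ~ {power(0.85, gap):.2f}")
```

Small true gaps can easily go undetected at this scale, which is why the adequacy of power matters for interpreting 'no statistically significant difference'.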

pith-pipeline@v0.9.0 · 5812 in / 1567 out tokens · 66827 ms · 2026-05-08T01:21:17.231139+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

22 extracted references · 13 canonical work pages · 1 internal anchor

  1. [1] Antol, S. et al. VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision (ICCV) (2015).
  2. [2] Chen, C. et al. Evidence-based diagnostic reasoning with multi-agent copilot for human pathology (2025). arXiv:2506.20964.
  3. [3] Vorontsov, E. et al. PRISM2: Unlocking multi-modal general pathology AI with clinical dialogue (2025). arXiv:2506.13063.
  4. [4] Xu, Z. et al. A versatile pathology co-pilot via reasoning enhanced multimodal large language model (2025). arXiv:2507.17303.
  5. [5] Abacha, A. B. et al. VQA-Med: Overview of the medical visual question answering task at ImageCLEF 2019. In CLEF 2019 Working Notes (2019).
  6. [6] Bereuter, J.-P. et al. Benchmarking vision capabilities of large language models in surgical examination questions. J. Surg. Educ. 82, 103442, 10.1016/j.jsurg.2025.103442 (2025).
  7. [7] Clusmann, J. et al. The future landscape of large language models in medicine. Commun. Medicine 3, 10.1038/s43856-023-00370-1 (2023).
  8. [8] Zhang, X. et al. Development of a large-scale medical visual question-answering dataset. Commun. Medicine 4, 10.1038/s43856-024-00709-2 (2024).
  9. [9] He, X., Zhang, Y., Mou, L., Xing, E. & Xie, P. PathVQA: 30000+ questions for medical visual question answering (2020). arXiv:2003.10286.
  10. [10] Lu, M. Y. et al. A multimodal generative AI copilot for human pathology. Nature 634, 466–473, 10.1038/s41586-024-07618-3 (2024).
  11. [11] Stegeman, M. et al. Designing UNICORN: a unified benchmark for imaging in computational pathology, radiology, and natural language (2026). arXiv:2603.02790.
  12. [12] van Rijthoven, M. et al. Tumor-infiltrating lymphocytes in breast cancer through artificial intelligence: biomarker analysis from the results of the TIGER challenge. medRxiv 10.1101/2025.02.28.25323078 (2025).
  13. [13] Bulten, W. et al. Artificial intelligence for diagnosis and Gleason grading of prostate cancer: the PANDA challenge. Nat. Medicine 28, 154–163, 10.1038/s41591-021-01620-2 (2022); Meakin, J. et al. Grand-Challenge.org (v2025.08.1), 10.5281/zenodo.16780413 (2025).
  14. [14] Google DeepMind. Gemini 2.5 Pro. https://developers.generativeai.google/ (2025). Accessed September 2025; OpenAI. GPT-5. https://openai.com/gpt-5 (2025). Accessed September 2025.
  15. [15] Nederlandse Vereniging voor Pathologie. Modernisering opleidingsplan 2 (MOP2) pathologie. https://pathologie.nl/opleidingseisen/ (2024).
  16. [16] Gatta, G. et al. Rare cancers are not so rare: The rare cancer burden in Europe. Eur. J. Cancer 47, 2493–2511, 10.1016/j.ejca.2011.08.008 (2011).
  17. [17] Bankhead, P. et al. QuPath: Open source software for digital pathology image analysis. Sci. Reports 7, 16878, 10.1038/s41598-017-17204-5 (2017).
  18. [18] Indica Labs. HALO image analysis platform. https://www.indicalab.com/halo (2025). Albuquerque, NM, USA.
  19. [19] Lems, C. et al. Towards a multicentric open digital pathology assistant benchmark: Initial results from the DALPHIN study. Lab. Investig. 105, 103609, 10.1016/j.labinv.2024.103609 (2025).
  20. [20] Moonemans, S. et al. Democratizing pathology co-pilots: An open pipeline and dataset for whole-slide vision-language modelling. In Medical Imaging with Deep Learning (2026).
  21. [21] Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q. & Artzi, Y. BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations (ICLR) (2020).
  22. [22] Lee, J. et al. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36, 1234–1240, 10.1093/bioinformatics/btz682 (2020).