Assessing Pancreatic Ductal Adenocarcinoma Vascular Invasion: the PDACVI Benchmark
Pith reviewed 2026-05-07 09:55 UTC · model grok-4.3
The pith
Methods that model expert disagreement produce more reliable vascular invasion maps for pancreatic cancer.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A densely annotated dataset with five independent expert annotations per case, paired with an evaluation framework that combines volumetric metrics, probabilistic calibration, and interface-specific analysis, demonstrates that uncertainty-aware methods outperform binary segmentation approaches by maintaining performance in low-consensus ambiguous regions.
What carries the argument
The multi-annotation dataset that captures inter-rater variability at tumor-vessel boundaries, used within a multi-metric framework to separate global accuracy from localized clinical utility.
Load-bearing premise
The five independent expert annotations faithfully capture diagnostic ambiguity at tumor-vessel interfaces and the multi-metric framework reflects surgical decision utility.
What would settle it
A validation study against actual surgical findings or additional expert consensus in which uncertainty-modeling methods fail to show superior calibration or robustness relative to binary methods.
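The load-bearing premise presupposes a concrete fusion step from five expert masks to a per-voxel target. A minimal sketch of one common choice (hypothetical illustration, not the benchmark's actual code; `consensus_maps` is an invented name), deriving soft labels, a majority-vote mask, and a voxel-wise disagreement map:

```python
import numpy as np

def consensus_maps(annotations):
    """Fuse binary expert masks (shape: raters x D x H x W) into
    soft labels, a majority-vote mask, and a disagreement map."""
    annotations = np.asarray(annotations, dtype=float)
    soft = annotations.mean(axis=0)            # fraction of raters marking each voxel
    majority = (soft >= 0.5).astype(np.uint8)  # hard label by majority vote
    # Binary entropy: zero where raters agree, maximal at a 50/50 split.
    p = np.clip(soft, 1e-8, 1 - 1e-8)
    entropy = -(p * np.log(p) + (1 - p) * np.log(1 - p))
    return soft, majority, entropy

# Toy example: five raters, one ambiguous voxel (index 1).
raters = np.array([[[1, 1, 0]], [[1, 1, 0]], [[1, 0, 0]], [[1, 1, 0]], [[1, 0, 0]]])
soft, majority, entropy = consensus_maps(raters)
```

The entropy map is what makes "low-consensus regions" measurable at all: binary methods discard it, while disagreement-modeling methods can train or evaluate against it.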
Original abstract
Surgical resection remains the only potentially curative treatment for pancreatic ductal adenocarcinoma (PDAC), and eligibility depends on accurate assessment of vascular invasion (VI), i.e., tumor extension into adjacent critical vessels. Despite its importance for preoperative staging and surgical planning, computational VI assessment remains underexplored. Two major challenges are the lack of public datasets and the diagnostic ambiguity at the tumor-vessel interface, which leads to substantial inter-rater variability even among expert radiologists. To address these limitations, we introduce the CURVAS-PDACVI Dataset and Challenge, an open benchmark for uncertainty-aware AI in PDAC staging based on a densely annotated dataset with five independent expert annotations per scan. We also propose a multi-metric evaluation framework that extends beyond spatial overlap to include probabilistic calibration and VI assessment. Evaluation of six state-of-the-art methods shows that strong global volumetric overlap does not necessarily translate into reliable performance at clinically critical tumor-vessel interfaces. In particular, methods optimized for binary segmentation perform competitively on average overlap metrics, but often degrade in high-complexity cases with low expert consensus, either collapsing in volume or overextending at uncertain boundaries. In contrast, methods that model inter-rater disagreement produce better calibrated probabilistic maps and show greater robustness in these ambiguous cases. The benchmark highlights the limitations of volumetric accuracy as a proxy for localized surgical utility, motivating uncertainty-aware probabilistic models for preoperative decision-making.
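The abstract's distinction between global overlap and performance "at clinically critical tumor-vessel interfaces" can be illustrated with a restricted-overlap metric. This is a speculative sketch, not the paper's actual metric; `interface_dice` and the `band` width are invented for illustration:

```python
import numpy as np
from scipy import ndimage

def interface_dice(pred, gt, vessel, band=2):
    """Dice score restricted to a narrow shell around the vessel surface,
    where vascular-invasion calls (and resectability decisions) are made."""
    outer = ndimage.binary_dilation(vessel, iterations=band)
    inner = ndimage.binary_erosion(vessel, iterations=band)
    surface = outer & ~inner                  # shell of +/- `band` voxels
    p = pred[surface].astype(bool)
    g = gt[surface].astype(bool)
    denom = p.sum() + g.sum()
    return 1.0 if denom == 0 else 2.0 * np.logical_and(p, g).sum() / denom
```

Under a metric like this, a model can score well on whole-volume Dice while still collapsing or overextending inside the shell, which is the failure mode the abstract attributes to binary methods.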
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the CURVAS-PDACVI Dataset and Challenge, an open benchmark consisting of CT scans for pancreatic ductal adenocarcinoma (PDAC) vascular invasion (VI) assessment, each with five independent expert annotations to capture inter-rater variability at tumor-vessel interfaces. It proposes a multi-metric evaluation framework extending beyond volumetric overlap to include probabilistic calibration and localized interface-specific metrics. Evaluation of six state-of-the-art methods demonstrates that binary segmentation approaches achieve competitive global overlap but degrade in high-ambiguity cases, whereas disagreement-modeling methods yield better-calibrated probabilistic maps and greater robustness at uncertain boundaries. The work concludes that standard overlap metrics are poor proxies for localized surgical utility and motivates uncertainty-aware models for preoperative staging.
Significance. If the empirical comparisons hold, this provides a valuable public resource addressing the scarcity of densely annotated datasets for PDAC VI, a clinically critical task determining surgical resectability. The multi-annotation design and extended metrics (calibration plus interface assessment) are strengths that enable more nuanced benchmarking than typical Dice-focused evaluations. The findings offer concrete evidence favoring probabilistic modeling in ambiguous medical segmentation scenarios, with potential to guide development of AI tools that better reflect diagnostic uncertainty.
major comments (2)
- Abstract and Results: The central claim that disagreement-modeling methods produce better calibrated maps and greater robustness at tumor-vessel interfaces depends on the multi-metric framework serving as a valid proxy for surgical decision utility. However, all evaluations are performed solely against the five expert annotations (majority vote or soft labels) with no external anchor such as actual surgical findings, resectability outcomes, or surgeon decision thresholds; this leaves open whether superior calibration reflects true VI status or merely annotation distribution matching.
- Methods and Results: The selection and definition of 'high-complexity cases with low expert consensus' is not accompanied by explicit quantitative criteria, exclusion rules, or the number of such cases; without these details, the reported degradation of binary methods versus robustness of uncertainty-aware methods cannot be fully assessed for reproducibility or generalizability.
minor comments (3)
- Abstract: Inclusion of at least one or two concrete quantitative results (e.g., specific calibration error values or interface metric differences across the six methods) would strengthen the summarized comparative findings.
- The paper should clarify the exact implementation of the probabilistic calibration metric and interface-specific scores, including any hyperparameters or thresholds used.
- Ensure the dataset release includes clear documentation on annotation protocol, imaging parameters, and patient inclusion criteria to maximize utility for the community.
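The request to clarify the calibration metric can be made concrete with one common choice: a binned expected calibration error measured against the empirical rater fraction (soft label) rather than a single hard label. A hedged sketch, assuming this formulation; the paper may define its metric differently:

```python
import numpy as np

def multirater_ece(prob, soft_label, n_bins=10):
    """Expected calibration error of a predicted probability map against
    the rater fraction, binned by predicted confidence."""
    prob, soft = prob.ravel(), soft_label.ravel()
    bins = np.clip((prob * n_bins).astype(int), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            # Weight each bin's |confidence - observed rater fraction| gap
            # by the share of voxels falling in that bin.
            ece += mask.mean() * abs(prob[mask].mean() - soft[mask].mean())
    return ece

soft = np.array([0.0, 0.2, 0.4, 0.6, 0.8, 1.0])
```

A perfectly calibrated map (predicted probability equal to the rater fraction everywhere) scores zero; an overconfident all-ones map is penalized in proportion to the disagreement it ignores.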
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and detailed comments. We address each major comment below, indicating planned revisions where appropriate.
Point-by-point responses
-
Referee: Abstract and Results: The central claim that disagreement-modeling methods produce better calibrated maps and greater robustness at tumor-vessel interfaces depends on the multi-metric framework serving as a valid proxy for surgical decision utility. However, all evaluations are performed solely against the five expert annotations (majority vote or soft labels) with no external anchor such as actual surgical findings, resectability outcomes, or surgeon decision thresholds; this leaves open whether superior calibration reflects true VI status or merely annotation distribution matching.
Authors: We agree that direct validation against surgical outcomes or resectability data would provide stronger evidence of clinical utility. The current benchmark relies on multi-expert annotations to quantify inter-rater variability at ambiguous tumor-vessel interfaces, which is a core clinical challenge. The extended metrics (calibration and interface-specific) are designed to evaluate how well models reflect this uncertainty rather than assuming a single ground truth. We will revise the abstract, results, and discussion sections to explicitly frame these as annotation-based proxies and to note the absence of outcome-level validation as a limitation, while highlighting it as an important direction for future work. No new external data can be added at this stage.
Revision: partial
-
Referee: Methods and Results: The selection and definition of 'high-complexity cases with low expert consensus' is not accompanied by explicit quantitative criteria, exclusion rules, or the number of such cases; without these details, the reported degradation of binary methods versus robustness of uncertainty-aware methods cannot be fully assessed for reproducibility or generalizability.
Authors: We appreciate this observation and agree that explicit criteria are required. In the revised manuscript, we will expand the Methods section to define high-complexity cases using precise quantitative measures of inter-annotator disagreement (such as average pairwise overlap or voxel-wise entropy thresholds), specify any exclusion rules applied during case selection, and report the exact number of such cases within the dataset. These details will also be referenced in the Results when discussing performance differences, enabling full reproducibility and assessment of generalizability.
Revision: yes
- A remaining limitation is the absence of external validation against actual surgical findings, resectability outcomes, or surgeon decision thresholds, since such data is not part of the current benchmark dataset.
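The criterion the authors promise (flagging low-consensus cases by average pairwise overlap) can be operationalized in a few lines. A hedged sketch, assuming mean pairwise inter-rater Dice as the disagreement measure; the 0.5 threshold and function names are invented for illustration:

```python
import numpy as np
from itertools import combinations

def pairwise_dice(a, b):
    """Dice overlap between two binary masks."""
    denom = a.sum() + b.sum()
    return 1.0 if denom == 0 else 2.0 * np.logical_and(a, b).sum() / denom

def is_low_consensus(masks, threshold=0.5):
    """Flag a case as high-complexity / low-consensus when the mean
    pairwise Dice across all rater masks falls below `threshold`."""
    scores = [pairwise_dice(a, b) for a, b in combinations(masks, 2)]
    mean_score = float(np.mean(scores))
    return mean_score < threshold, mean_score

# Toy example: three agreeing raters vs. three disagreeing ones.
m = np.zeros((4, 4), dtype=bool); m[1:3, 1:3] = True
n = np.zeros((4, 4), dtype=bool); n[0, 0] = True
flagged_agree, score_agree = is_low_consensus([m, m, m])
flagged_disagree, score_disagree = is_low_consensus([m, n, np.zeros((4, 4), dtype=bool)])
```

Reporting the threshold, the per-case score, and the resulting case count would make the binary-vs-uncertainty-aware comparison reproducible.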
Circularity Check
No circularity: purely empirical benchmark with independent evaluation against held-out annotations
Full rationale
This is an empirical benchmark paper introducing a dataset (CURVAS-PDACVI) with five expert annotations per case and a multi-metric framework (overlap, probabilistic calibration, interface-specific scores). All performance claims are measured directly against these external annotations as ground truth, with no equations, fitted parameters, derivations, or self-referential reductions. Methods are compared on held-out data; superior calibration or robustness in ambiguous cases is reported as an observed empirical outcome rather than derived by construction from the inputs. No self-citation chains, ansatzes, or uniqueness theorems are invoked to justify the central results. The evaluation framework is defined independently of the tested methods and does not reduce to renaming or fitting the same quantities it claims to predict.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Five independent expert annotations per scan capture the diagnostic ambiguity at tumor-vessel interfaces.
- domain assumption: The multi-metric framework (volumetric overlap plus probabilistic calibration plus interface assessment) reflects surgical decision-making utility.
Reference graph
Works this paper leans on
- [1] H. Sun, H. Ma, G. Hong, H. Sun, J. Wang, Survival improvement in patients with pancreatic cancer by decade: A period analysis of the SEER database, 1981–2010, Scientific Reports 4 (2014) 6747. doi:10.1038/srep06747. URL: https://pmc.ncbi.nlm.nih.gov/articles/PMC5381379/
- [2] M. Brügel, E. J. Rummeny, M. Dobritz, Vascular invasion in pancreatic cancer, Abdominal Imaging 29 (2) (2004) 239–245. doi:10.1007/s00261-003-0102-2. URL: https://doi.org/10.1007/s00261-003-0102-2
- [3] L. Yao, Z. Zhang, E. Keles, C. Yazici, T. Tirkes, U. Bagci, A review of deep learning and radiomics approaches for pancreatic cancer diagnosis from medical imaging, Current Opinion in Gastroenterology 39 (5) (2023) 436–447. doi:10.1097/MOG.0000000000000966
- [4] Z. Alidina, A. A. M. Hussain, I. Banani, M. M. Khan, T. M. Pawlik, Radiomics for early detection of pancreatic cancer: a systematic review and meta-analysis, Journal of Gastrointestinal Surgery 30 (5) (2026) 102374. doi:10.1016/j.gassur.2026.102374. URL: https://www.sciencedirect.com/science/article/pii/S1091255X26000557
- [5] Y. Xia, Q. Yu, W. Shen, Y. Zhou, E. K. Fishman, A. L. Yuille, Detecting Pancreatic Ductal Adenocarcinoma in Multi-phase CT Scans via Alignment Ensemble, in: A. L. Martel, P. Abolmaesumi, D. Stoyanov, D. Mateus, M. A. Zuluaga, S. K. Zhou, D. Racoceanu, L. Joskowicz (Eds.), Medical Image Computing and Computer Assisted Intervention – MICCAI 2020, Sp...
- [6] J. Zhao, J. Wang, Y. Gu, X. Huang, L. Wang, Diagnostic methods for pancreatic cancer and their clinical applications (Review), Oncology Letters 30 (1) (2025) 370. doi:10.3892/ol.2025.15116. URL: https://pmc.ncbi.nlm.nih.gov/articles/PMC12150009/
- [7] F. Isensee, P. F. Jaeger, S. A. A. Kohl, J. Petersen, K. H. Maier-Hein, nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation, Nature Methods 18 (2) (2021) 203–211. doi:10.1038/s41592-020-01008-z
- [8] J. I. Bereska, B. V. Janssen, C. Y. Nio, M. P. M. Kop, G. Kazemier, O. R. Busch, F. Struik, H. A. Marquering, J. Stoker, M. G. Besselink, I. M. Verpalen, for the Pancreatobiliary and Hepatic Artificial Intelligence Research (PHAIR) consortium, Artificial intelligence for assessment of vascular involvement and tumor resectability on CT in patients wi...
- [9] A. Jungo, R. Meier, E. Ermis, M. Blatti-Moreno, E. Herrmann, R. Wiest, M. Reyes, On the Effect of Inter-observer Variability for a Reliable Estimation of Uncertainty of Medical Image Segmentation, in: A. F. Frangi, J. A. Schnabel, C. Davatzikos, C. Alberola-López, G. Fichtinger (Eds.), Medical Image Computing and Computer Assisted Intervention – MIC...
- [10] L. Joskowicz, D. Cohen, N. Caplan, J. Sosna, Inter-observer variability of manual contour delineation of structures in CT, European Radiology 29 (3) (2019) 1391–1399. doi:10.1007/s00330-018-5695-5
- [11] N. C. Buchs, M. Chilcott, P.-A. Poletti, L. H. Buhler, P. Morel, Vascular invasion in pancreatic cancer: Imaging modalities, preoperative diagnosis and surgical management, World Journal of Gastroenterology 16 (7) (2010) 818–831. doi:10.3748/wjg.v16.i7.818. URL: https://pmc.ncbi.nlm.nih.gov/articles/PMC2825328/
- [12] R. K. G. Do, A. Kambadakone, Radiomics for CT Assessment of Vascular Contact in Pancreatic Adenocarcinoma, Radiology 301 (3) (2021) 623–624. doi:10.1148/radiol.2021211635. URL: https://pubs.rsna.org/doi/full/10.1148/radiol.2021211635
- [13] S. Warfield, K. Zou, W. Wells, Simultaneous truth and performance level estimation (STAPLE): an algorithm for the validation of image segmentation, IEEE Transactions on Medical Imaging 23 (7) (2004) 903–921. doi:10.1109/TMI.2004.828354. URL: https://ieeexplore.ieee.org/document/1309714
- [14] M. H. Jensen, D. R. Jørgensen, R. Jalaboi, M. E. Hansen, M. A. Olsen, Improving Uncertainty Estimation in Convolutional Neural Networks Using Inter-rater Agreement, in: D. Shen, T. Liu, T. M. Peters, L. H. Staib, C. Essert, S. Zhou, P.-T. Yap, A. Khan (Eds.), Medical Image Computing and Computer Assisted Intervention – MICCAI 2019, Springer Internatio...
- [15] C. H. Sudre, B. G. Anson, S. Ingala, C. D. Lane, D. Jimenez, L. Haider, T. Varsavsky, R. Tanno, L. Smith, S. Ourselin, R. H. Jäger, M. J. Cardoso, Let's Agree to Disagree: Learning Highly Debatable Multirater Labelling, in: D. Shen, T. Liu, T. M. Peters, L. H. Staib, C. Essert, S. Zhou, P.-T. Yap, A. Khan (Eds.), Medical Image Computing and Computer...
- [16] J. Zhang, Y. Zheng, Y. Shi, A Soft Label Method for Medical Image Segmentation with Multirater Annotations, Computational Intelligence and Neuroscience 2023 (2023) 1883597. doi:10.1155/2023/1883597
- [17] M. Islam, B. Glocker, Spatially Varying Label Smoothing: Capturing Uncertainty from Expert Annotations, in: A. Feragen, S. Sommer, J. Schnabel, M. Nielsen (Eds.), Information Processing in Medical Imaging, Springer International Publishing, Cham, 2021, pp. 677–688. doi:10.1007/978-3-030-78191-0_52
- [18] M. Riera-Marín, J. G. López, J. Rodríguez-Comas, M. A. G. Ballester, A. Galdran, Multi-Rater Calibration Error Estimation, in: C. H. Sudre, M. I. Hoque, R. Mehta, C. Ouyang, C. Qin, M. Rakic, W. M. Wells (Eds.), Uncertainty for Safe Utilization of Machine Learning in Medical Imaging, Springer Nature Switzerland, Cham, 2026, pp. 147–157. doi:10.100...
- [19]
- [20] H. K. Yang, M.-S. Park, M. Choi, J. Shin, S. S. Lee, W. K. Jeong, S. H. Hwang, S. H. Choi, Systematic review and meta-analysis of diagnostic performance of CT imaging for assessing resectability of pancreatic ductal adenocarcinoma after neoadjuvant therapy: importance of CT criteria, Abdominal Radiology (New York) 46 (11) (2021) 5201–5217. doi:10.1007...
- [21] Y. N. Shen, X. L. Bai, G. G. Li, T. B. Liang, Review of radiological classifications of pancreatic cancer with peripancreatic vessel invasion: are new grading criteria required?, Cancer Imaging 17 (2017) 14. doi:10.1186/s40644-017-0115-7. URL: https://pmc.ncbi.nlm.nih.gov/articles/PMC5420088/
- [22] N. Alves, M. Schuurmans, D. Rutkowski, D. Yakar, I. Haldorsen, M. Liedenbaum, A. Molven, P. Vendittelli, G. Litjens, J. Hermans, H. Huisman, The PANORAMA Study Protocol: Pancreatic Cancer Diagnosis - Radiologists Meet AI, Tech. rep., Zenodo (Jan. 2024). doi:10.5281/zenodo.10599559. URL: https://zenodo.org/records/10599559
- [23] M. Riera-Marín, S. O K, M. M. Duh, A. Aubanell, R. de Figueiredo Cardoso, S. Egger-Hackenschmidt, M. S. May, S. Bernaus Tomé, J. Rodríguez-Comas, M. Á. González Ballester, J. Garcia López, CURVAS-PDACVI dataset (Nov. 2025). doi:10.5281/zenodo.17552201. URL: https://doi.org/10.5281/zenodo.17552201
- [24] T. Kirscher, A. Ertl, K. Maier-Hein, X. Coubez, P. Meyer, S. Faisan, TwinTrack: Post-hoc Multi-Rater Calibration for Medical Image Segmentation, arXiv:2604.15950 [cs] (Apr. 2026). doi:10.48550/arXiv.2604.15950. URL: http://arxiv.org/abs/2604.15950
- [25] B. Lakshminarayanan, A. Pritzel, C. Blundell, Simple and scalable predictive uncertainty estimation using deep ensembles, in: Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17, Curran Associates Inc., Red Hook, NY, USA, 2017, pp. 6405–6416
- [26] M. P. Naeini, G. F. Cooper, M. Hauskrecht, Obtaining well calibrated probabilities using bayesian binning, in: Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, AAAI'15, AAAI Press, Austin, Texas, 2015, pp. 2901–2907
- [27] R. E. Barlow, D. J. Bartholomew, J. M. Bremner, H. D. Brunk, Statistical inference under order restrictions, Statistica Neerlandica 27 (4) (1973) 189–189. doi:10.1111/j.1467-9574.1973.tb00228.x. URL: https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1467-9574.1973.tb00228.x
- [28] G. Franchi, O. Laurent, M. Leguéry, A. Bursuc, A. Pilzer, A. Yao, Make Me a BNN: A Simple Strategy for Estimating Bayesian Uncertainty from Pre-trained Models, in: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 12194–12204. doi:10.1109/CVPR52733.2024.01159. URL: https://ieeexplore.ieee.org/document/10656702
- [29] C. Hémon, B. Texier, C. Lafond, J.-C. Nunes, A. Barateau, Towards trustworthy AI in radiotherapy: a comprehensive review of uncertainty-aware techniques, Physics in Medicine and Biology 71 (1) (Dec. 2025). doi:10.1088/1361-6560/ae2a9f
- [30] curvas-challenge/CURVAS-PDACVI 2025 at main · SYCAI-Technologies/curvas-challenge (2025). URL: https://github.com/SYCAI-Technologies/curvas-challenge/tree/main/CURVAS-PDACVI_2025
- [31] Appendix 6.1. Dataset Supplementary Details. The original PANORAMA CT scans used in this work are available at https://zenodo.org/records/11034178, and the CURVAS-PDACVI benchmark release, including the multi-rater annotations and evaluation resources, is available at https://zenodo.org/records/15401568. In the released repository, each case follows a stan...