pith. machine review for the scientific record.

arxiv: 2604.27582 · v1 · submitted 2026-04-30 · 💻 cs.CV

Recognition: unknown

Assessing Pancreatic Ductal Adenocarcinoma Vascular Invasion: the PDACVI Benchmark

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 09:55 UTC · model grok-4.3

classification 💻 cs.CV
keywords pancreatic cancer · vascular invasion · image segmentation · uncertainty quantification · inter-rater variability · medical imaging benchmark · probabilistic models

The pith

Methods that model expert disagreement produce more reliable vascular invasion maps for pancreatic cancer.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a benchmark using multiple expert annotations to evaluate AI performance in determining vascular invasion for pancreatic cancer surgery. It shows that high average overlap with consensus labels does not guarantee reliability at the ambiguous tumor-vessel interfaces that matter most for surgical planning. Methods that account for disagreement among experts produce probabilistic outputs that remain accurate and robust even when expert consensus is low. This distinction matters because vascular invasion status directly affects whether a patient can undergo potentially curative resection.

Core claim

A densely annotated dataset with five independent expert annotations per case, paired with an evaluation framework that combines volumetric metrics, probabilistic calibration, and interface-specific analysis, demonstrates that uncertainty-aware methods outperform binary segmentation approaches by maintaining performance in low-consensus ambiguous regions.
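
To make the evaluation logic concrete, here is a minimal sketch of the three metric families the framework combines, assuming five binary expert masks, one probabilistic prediction, and a vessel mask as NumPy arrays. The majority-vote consensus, 0.5 threshold, and 3-voxel interface band are illustrative assumptions, not the paper's published settings.

    import numpy as np
    from scipy.ndimage import binary_dilation

    def dice(a, b):
        # Dice overlap between two boolean masks
        inter = np.logical_and(a, b).sum()
        denom = a.sum() + b.sum()
        return 2.0 * inter / denom if denom > 0 else 1.0

    def evaluate(pred_prob, expert_masks, vessel_mask, band_vox=3, thr=0.5):
        # expert_masks: (n_raters, ...) boolean; pred_prob: floats in [0, 1]
        soft_label = expert_masks.mean(axis=0)
        consensus = soft_label >= 0.5                # majority vote
        pred_bin = pred_prob >= thr
        interface = binary_dilation(vessel_mask, iterations=band_vox)
        return {
            "dice_global": dice(pred_bin, consensus),
            # same overlap, restricted to a band around the vessel surface
            "dice_interface": dice(pred_bin & interface, consensus & interface),
            # Brier score against the soft label rewards calibrated
            # probabilities instead of hard agreement with one consensus
            "brier_soft": float(np.mean((pred_prob - soft_label) ** 2)),
        }

Separating the global score from the interface-restricted one is what lets a benchmark of this kind expose methods that look strong on volume overlap while failing exactly where resectability is decided.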

What carries the argument

The multi-annotation dataset that captures inter-rater variability at tumor-vessel boundaries, used within a multi-metric framework to separate global accuracy from localized clinical utility.

Load-bearing premise

The five independent expert annotations faithfully capture diagnostic ambiguity at tumor-vessel interfaces and the multi-metric framework reflects surgical decision utility.

What would settle it

A validation study in which uncertainty-modeling methods fail to demonstrate superior calibration or robustness when compared to binary methods against actual surgical findings or additional expert consensus.

Figures

Figures reproduced from arXiv: 2604.27582 by A. Aubanell, A. Galdran, C. Hémon, C. Lüth, J.-C. Nunes, J. García-López, J.-L. Dillenseger, J. Rodríguez-Comas, J. Traub, K.-C. Kahl, M. A. González-Ballester, M. M. Duh, M. Riera-Marín, M. S. May, O. K. Sikha, P.-H. Conze, P. Meyer, R. de Figueiredo Cardoso, S. Egger-Hackenschmidt, S. Faisan, T. Kirscher, V. Boussot, X. Coubez, X. Liang, X. Zhou, Z. Pan.

Figure 1: Examples of High Diagnostic Divergence in PDAC Segmentation. The five columns display the independent annotations from the five human experts. Tumor annotations are delineated in red, adjacent vascular structures in green, and areas of vascular invasion (tumor-vessel contact) are highlighted in white. (Top row) Substantial disagreement on the infiltrative borders and the specific extent of the tumor-vessel… view at source ↗
Figure 2: Inter-rater agreement (mean ± SD) per rater pair. The matrix shows average agreement across our dataset between individual experts and also the STAPLE consensus, pointing to the 1-year junior resident (Rater 5) as the primary outlier. view at source ↗
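
A sketch of how such a pairwise agreement matrix can be computed, assuming each case provides aligned binary masks for all five raters; the Dice metric and the (n_raters, ...) layout are illustrative, and the STAPLE column would require an extra consensus estimate not shown here.

    import numpy as np

    def dice(a, b):
        inter = np.logical_and(a, b).sum()
        denom = a.sum() + b.sum()
        return 2.0 * inter / denom if denom > 0 else 1.0

    def agreement_matrix(masks_per_case):
        # masks_per_case: list of (n_raters, ...) boolean arrays, one per case
        n = masks_per_case[0].shape[0]
        mean = np.zeros((n, n))
        sd = np.zeros((n, n))
        for i in range(n):
            for j in range(n):
                scores = [dice(case[i], case[j]) for case in masks_per_case]
                mean[i, j] = np.mean(scores)
                sd[i, j] = np.std(scores)
        return mean, sd  # mean ± SD per rater pair, as in the heatmap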
Figure 3: Ranking stability via bootstrap analysis. Bubble charts illustrating the rank frequency (1st to 6th) achieved by each participating team across 500 bootstrap iterations for all nine metrics. The size of each round marker corresponds to the percentage of iterations in which a team achieved a given rank. view at source ↗
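
The bootstrap ranking procedure behind these bubble charts can be sketched as follows, assuming a (teams × cases) score table for one metric; the 500 iterations match the caption, while the function name, seeding, and tie handling are assumptions.

    import numpy as np

    def rank_frequencies(scores, n_boot=500, higher_is_better=True, seed=0):
        # scores: (n_teams, n_cases) per-case values of one metric
        rng = np.random.default_rng(seed)
        n_teams, n_cases = scores.shape
        freq = np.zeros((n_teams, n_teams))               # freq[team, rank]
        for _ in range(n_boot):
            idx = rng.integers(0, n_cases, size=n_cases)  # resample cases
            means = scores[:, idx].mean(axis=1)
            order = np.argsort(-means) if higher_is_better else np.argsort(means)
            for rank, team in enumerate(order):
                freq[team, rank] += 1
        return freq / n_boot  # fraction of iterations at each rank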
Figure 4: Representative predictions across increasing (top-to-bottom) levels of diagnostic ambiguity. First column: CT slice with expert annotations. Second-to-last columns: predictions of different participants, highlighting different confidence behaviors at the tumor-vessel interface. In low-ambiguity cases, most methods localize the tumor consistently, whereas in ambiguous and high-complexity cases binary-target… view at source ↗
Figure 5: Performance metric inter-correlation analysis. Overlap metrics were strongly correlated with each other, but only weakly associated with vessel-specific vascular invasion errors, supporting the use of a multi-metric benchmark. view at source ↗
Figure 6: Pairwise Wilcoxon signed-rank test matrices across benchmark metrics. Heatmaps display p-values for pairwise method comparisons over global segmentation, calibration, and vessel-specific vascular invasion metrics. Several differences remain significant for overlap and calibration, whereas most vascular comparisons do not reach significance, indicating that case-wise variance in vascular assessment is stro… view at source ↗
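
For readers who want to reproduce this style of analysis, a minimal sketch of the pairwise test matrix, assuming per-case scores for each method on a single metric; any multiple-comparison correction the paper may apply is not shown.

    import numpy as np
    from scipy.stats import wilcoxon

    def pairwise_wilcoxon(scores):
        # scores: (n_methods, n_cases) values of one metric, same cases per row
        n = scores.shape[0]
        pvals = np.ones((n, n))
        for i in range(n):
            for j in range(i + 1, n):
                # paired signed-rank test over the shared test cases
                _, p = wilcoxon(scores[i], scores[j])
                pvals[i, j] = pvals[j, i] = p
        return pvals

Note that scipy's wilcoxon raises an error when two methods produce identical per-case scores, and a correction such as Holm's would typically be applied before reading significance off the matrix.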
Figure 7: Case-wise performance signatures across the test cohort. Standardized metric profiles reveal both method-specific trade-offs and recurrent hard studies that degrade performance across multiple architectures. view at source ↗
Figure 8: Failure analysis and cross-algorithm consistency in high-complexity cases. Methods exhibited distinct failure profiles, but also shared a subset of recurrent hard studies that degraded performance across architectures. view at source ↗
read the original abstract

Surgical resection remains the only potentially curative treatment for pancreatic ductal adenocarcinoma (PDAC), and eligibility depends on accurate assessment of vascular invasion (VI), i.e., tumor extension into adjacent critical vessels. Despite its importance for preoperative staging and surgical planning, computational VI assessment remains underexplored. Two major challenges are the lack of public datasets and the diagnostic ambiguity at the tumor-vessel interface, which leads to substantial inter-rater variability even among expert radiologists. To address these limitations, we introduce the CURVAS-PDACVI Dataset and Challenge, an open benchmark for uncertainty-aware AI in PDAC staging based on a densely annotated dataset with five independent expert annotations per scan. We also propose a multi-metric evaluation framework that extends beyond spatial overlap to include probabilistic calibration and VI assessment. Evaluation of six state-of-the-art methods shows that strong global volumetric overlap does not necessarily translate into reliable performance at clinically critical tumor-vessel interfaces. In particular, methods optimized for binary segmentation perform competitively on average overlap metrics, but often degrade in high-complexity cases with low expert consensus, either collapsing in volume or overextending at uncertain boundaries. In contrast, methods that model inter-rater disagreement produce better calibrated probabilistic maps and show greater robustness in these ambiguous cases. The benchmark highlights the limitations of volumetric accuracy as a proxy for localized surgical utility, motivating uncertainty-aware probabilistic models for preoperative decision-making.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper introduces the CURVAS-PDACVI Dataset and Challenge, an open benchmark consisting of CT scans for pancreatic ductal adenocarcinoma (PDAC) vascular invasion (VI) assessment, each with five independent expert annotations to capture inter-rater variability at tumor-vessel interfaces. It proposes a multi-metric evaluation framework extending beyond volumetric overlap to include probabilistic calibration and localized interface-specific metrics. Evaluation of six state-of-the-art methods demonstrates that binary segmentation approaches achieve competitive global overlap but degrade in high-ambiguity cases, whereas disagreement-modeling methods yield better-calibrated probabilistic maps and greater robustness at uncertain boundaries. The work concludes that standard overlap metrics are poor proxies for localized surgical utility and motivates uncertainty-aware models for preoperative staging.

Significance. If the empirical comparisons hold, this provides a valuable public resource addressing the scarcity of densely annotated datasets for PDAC VI, a clinically critical task determining surgical resectability. The multi-annotation design and extended metrics (calibration plus interface assessment) are strengths that enable more nuanced benchmarking than typical Dice-focused evaluations. The findings offer concrete evidence favoring probabilistic modeling in ambiguous medical segmentation scenarios, with potential to guide development of AI tools that better reflect diagnostic uncertainty.

major comments (2)
  1. Abstract and Results: The central claim that disagreement-modeling methods produce better calibrated maps and greater robustness at tumor-vessel interfaces depends on the multi-metric framework serving as a valid proxy for surgical decision utility. However, all evaluations are performed solely against the five expert annotations (majority vote or soft labels) with no external anchor such as actual surgical findings, resectability outcomes, or surgeon decision thresholds; this leaves open whether superior calibration reflects true VI status or merely annotation distribution matching.
  2. Methods and Results: The selection and definition of 'high-complexity cases with low expert consensus' is not accompanied by explicit quantitative criteria, exclusion rules, or the number of such cases; without these details, the reported degradation of binary methods versus robustness of uncertainty-aware methods cannot be fully assessed for reproducibility or generalizability.
minor comments (3)
  1. Abstract: Inclusion of at least one or two concrete quantitative results (e.g., specific calibration error values or interface metric differences across the six methods) would strengthen the summarized comparative findings.
  2. The paper should clarify the exact implementation of the probabilistic calibration metric and interface-specific scores, including any hyperparameters or thresholds used; one common binned construction is sketched after this list.
  3. Ensure the dataset release includes clear documentation on annotation protocol, imaging parameters, and patient inclusion criteria to maximize utility for the community.
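
For concreteness, one standard construction of such a calibration metric is a binned expected calibration error computed against the soft multi-rater label. The sketch below illustrates the general technique, not the paper's exact implementation; the bin count and binning scheme are placeholders.

    import numpy as np

    def expected_calibration_error(pred_prob, soft_label, n_bins=10):
        # pred_prob: predicted foreground probabilities per voxel
        # soft_label: fraction of raters marking each voxel as foreground
        p = pred_prob.ravel()
        y = soft_label.ravel()
        edges = np.linspace(0.0, 1.0, n_bins + 1)
        ece = 0.0
        for k in range(n_bins):
            lo, hi = edges[k], edges[k + 1]
            in_bin = (p >= lo) & ((p < hi) if k < n_bins - 1 else (p <= hi))
            if in_bin.any():
                # gap between mean confidence and observed rater agreement
                gap = abs(p[in_bin].mean() - y[in_bin].mean())
                ece += in_bin.mean() * gap
        return ece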

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback and detailed comments. We address each major comment below, indicating planned revisions where appropriate.

read point-by-point responses
  1. Referee: Abstract and Results: The central claim that disagreement-modeling methods produce better calibrated maps and greater robustness at tumor-vessel interfaces depends on the multi-metric framework serving as a valid proxy for surgical decision utility. However, all evaluations are performed solely against the five expert annotations (majority vote or soft labels) with no external anchor such as actual surgical findings, resectability outcomes, or surgeon decision thresholds; this leaves open whether superior calibration reflects true VI status or merely annotation distribution matching.

    Authors: We agree that direct validation against surgical outcomes or resectability data would provide stronger evidence of clinical utility. The current benchmark relies on multi-expert annotations to quantify inter-rater variability at ambiguous tumor-vessel interfaces, which is a core clinical challenge. The extended metrics (calibration and interface-specific) are designed to evaluate how well models reflect this uncertainty rather than assuming a single ground truth. We will revise the abstract, results, and discussion sections to explicitly frame these as annotation-based proxies and to note the absence of outcome-level validation as a limitation, while highlighting it as an important direction for future work. No new external data can be added at this stage. revision: partial

  2. Referee: Methods and Results: The selection and definition of 'high-complexity cases with low expert consensus' is not accompanied by explicit quantitative criteria, exclusion rules, or the number of such cases; without these details, the reported degradation of binary methods versus robustness of uncertainty-aware methods cannot be fully assessed for reproducibility or generalizability.

    Authors: We appreciate this observation and agree that explicit criteria are required. In the revised manuscript, we will expand the Methods section to define high-complexity cases using precise quantitative measures of inter-annotator disagreement (such as average pairwise overlap or voxel-wise entropy thresholds), specify any exclusion rules applied during case selection, and report the exact number of such cases within the dataset. These details will also be referenced in the Results when discussing performance differences, enabling full reproducibility and assessment of generalizability. revision: yes
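
A minimal sketch of the kind of entropy-based criterion proposed here, assuming five aligned binary masks per case; the threshold value and the restriction to annotated voxels are placeholders, not values from the paper.

    import numpy as np

    def is_high_complexity(expert_masks, entropy_thr=0.15):
        # expert_masks: (n_raters, ...) boolean annotations for one case
        p = expert_masks.mean(axis=0)   # per-voxel fraction of positive raters
        eps = 1e-12
        entropy = -(p * np.log2(p + eps) + (1 - p) * np.log2(1 - p + eps))
        # average over voxels any rater marked, so empty background
        # does not dilute the disagreement score
        region = expert_masks.any(axis=0)
        mean_entropy = entropy[region].mean() if region.any() else 0.0
        return mean_entropy > entropy_thr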

standing simulated objections not resolved
  • The absence of external validation against actual surgical findings, resectability outcomes, or surgeon decision thresholds, as this data is not part of the current benchmark dataset.

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark with independent evaluation against held-out annotations

full rationale

This is an empirical benchmark paper introducing a dataset (CURVAS-PDACVI) with five expert annotations per case and a multi-metric framework (overlap, probabilistic calibration, interface-specific scores). All performance claims are measured directly against these external annotations as ground truth, with no equations, fitted parameters, derivations, or self-referential reductions. Methods are compared on held-out data; superior calibration or robustness in ambiguous cases is reported as an observed empirical outcome rather than derived by construction from the inputs. No self-citation chains, ansatzes, or uniqueness theorems are invoked to justify the central results. The evaluation framework is defined independently of the tested methods and does not reduce to renaming or fitting the same quantities it claims to predict.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

This is an empirical benchmark paper; the central claims rest on the domain assumption that multi-rater annotations represent clinically meaningful uncertainty and that interface-specific metrics predict surgical utility. No free parameters, mathematical axioms, or new physical entities are introduced.

axioms (2)
  • domain assumption Five independent expert annotations per scan capture the diagnostic ambiguity at tumor-vessel interfaces
    Invoked in the dataset construction and in the claim that uncertainty-aware methods are more robust in low-consensus cases.
  • domain assumption The multi-metric framework (volumetric overlap plus probabilistic calibration plus interface assessment) reflects surgical decision-making utility
    Underpins the conclusion that standard overlap metrics are insufficient proxies for clinical value.

pith-pipeline@v0.9.0 · 5704 in / 1512 out tokens · 65405 ms · 2026-05-07T09:55:26.526364+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and pith papers without signing in.

Reference graph

Works this paper leans on

31 extracted references · 26 canonical work pages · 1 internal anchor

  1. [1]

    H. Sun, H. Ma, G. Hong, H. Sun, J. Wang, Survival improvement in patients with pancreatic cancer by decade: A period analysis of the SEER database, 1981–2010, Scientific Reports 4 (2014) 6747. doi:10.1038/srep06747. URL: https://pmc.ncbi.nlm.nih.gov/articles/PMC5381379/

  2. [2]

    M. Brügel, E. J. Rummeny, M. Dobritz, Vascular invasion in pancreatic cancer, Abdominal Imaging 29 (2) (2004) 239–245. doi:10.1007/s00261-003-0102-2. URL: https://doi.org/10.1007/s00261-003-0102-2

  3. [3]

    L. Yao, Z. Zhang, E. Keles, C. Yazici, T. Tirkes, U. Bagci, A review of deep learning and radiomics approaches for pancreatic cancer diagnosis from medical imaging, Current Opinion in Gastroenterology 39 (5) (2023) 436–447. doi:10.1097/MOG.0000000000000966

  4. [4]

    Z. Alidina, A. A. M. Hussain, I. Banani, M. M. Khan, T. M. Pawlik, Radiomics for early detection of pancreatic cancer: a systematic review and meta-analysis, Journal of Gastrointestinal Surgery 30 (5) (2026) 102374. doi:10.1016/j.gassur.2026.102374. URL: https://www.sciencedirect.com/science/article/pii/S1091255X26000557

  5. [5]

    Y. Xia, Q. Yu, W. Shen, Y. Zhou, E. K. Fishman, A. L. Yuille, Detecting Pancreatic Ductal Adenocarcinoma in Multi-phase CT Scans via Alignment Ensemble, in: A. L. Martel, P. Abolmaesumi, D. Stoyanov, D. Mateus, M. A. Zuluaga, S. K. Zhou, D. Racoceanu, L. Joskowicz (Eds.), Medical Image Computing and Computer Assisted Intervention – MICCAI 2020, Sp...

  6. [6]

    J. Zhao, J. Wang, Y. Gu, X. Huang, L. Wang, Diagnostic methods for pancreatic cancer and their clinical applications (Review), Oncology Letters 30 (1) (2025) 370. doi:10.3892/ol.2025.15116. URL: https://pmc.ncbi.nlm.nih.gov/articles/PMC12150009/

  7. [7]

    F. Isensee, P. F. Jaeger, S. A. A. Kohl, J. Petersen, K. H. Maier-Hein, nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation, Nature Methods 18 (2) (2021) 203–211. doi:10.1038/s41592-020-01008-z

  8. [8]

    J. I. Bereska, B. V. Janssen, C. Y. Nio, M. P. M. Kop, G. Kazemier, O. R. Busch, F. Struik, H. A. Marquering, J. Stoker, M. G. Besselink, I. M. Verpalen, for the Pancreatobiliary and Hepatic Artificial Intelligence Research (PHAIR) consortium, Artificial intelligence for assessment of vascular involvement and tumor resectability on CT in patients wi...

  9. [9]

    A. Jungo, R. Meier, E. Ermis, M. Blatti-Moreno, E. Herrmann, R. Wiest, M. Reyes, On the Effect of Inter-observer Variability for a Reliable Estimation of Uncertainty of Medical Image Segmentation, in: A. F. Frangi, J. A. Schnabel, C. Davatzikos, C. Alberola-López, G. Fichtinger (Eds.), Medical Image Computing and Computer Assisted Intervention – MICCAI...

  10. [10]

    L. Joskowicz, D. Cohen, N. Caplan, J. Sosna, Inter-observer variability of manual contour delineation of structures in CT, European Radiology 29 (3) (2019) 1391–1399. doi:10.1007/s00330-018-5695-5

  11. [11]

    N. C. Buchs, M. Chilcott, P.-A. Poletti, L. H. Buhler, P. Morel, Vascular invasion in pancreatic cancer: Imaging modalities, preoperative diagnosis and surgical management, World Journal of Gastroenterology: WJG 16 (7) (2010) 818–831. doi:10.3748/wjg.v16.i7.818. URL: https://pmc.ncbi.nlm.nih.gov/articles/PMC2825328/

  12. [12]

    R. K. G. Do, A. Kambadakone, Radiomics for CT Assessment of Vascular Contact in Pancreatic Adenocarcinoma, Radiology 301 (3) (2021) 623–624, publisher: Radiological Society of North America. doi:10.1148/radiol.2021211635. URL: https://pubs.rsna.org/doi/full/10.1148/radiol.2021211635

  13. [13]

    S. Warfield, K. Zou, W. Wells, Simultaneous truth and performance level estimation (STAPLE): an algorithm for the validation of image segmentation, IEEE Transactions on Medical Imaging 23 (7) (2004) 903–921. doi:10.1109/TMI.2004.828354. URL: https://ieeexplore.ieee.org/document/1309714

  14. [14]

    M. H. Jensen, D. R. Jørgensen, R. Jalaboi, M. E. Hansen, M. A. Olsen, Improving Uncertainty Estimation in Convolutional Neural Networks Using Inter-rater Agreement, in: D. Shen, T. Liu, T. M. Peters, L. H. Staib, C. Essert, S. Zhou, P.-T. Yap, A. Khan (Eds.), Medical Image Computing and Computer Assisted Intervention – MICCAI 2019, Springer Internatio...

  15. [15]

    C. H. Sudre, B. G. Anson, S. Ingala, C. D. Lane, D. Jimenez, L. Haider, T. Varsavsky, R. Tanno, L. Smith, S. Ourselin, R. H. Jäger, M. J. Cardoso, Let's Agree to Disagree: Learning Highly Debatable Multirater Labelling, in: D. Shen, T. Liu, T. M. Peters, L. H. Staib, C. Essert, S. Zhou, P.-T. Yap, A. Khan (Eds.), Medical Image Computing and Computer...

  16. [16]

    J. Zhang, Y. Zheng, Y. Shi, A Soft Label Method for Medical Image Segmentation with Multirater Annotations, Computational Intelligence and Neuroscience 2023 (2023) 1883597. doi:10.1155/2023/1883597

  17. [17]

    M. Islam, B. Glocker, Spatially Varying Label Smoothing: Capturing Uncertainty from Expert Annotations, in: A. Feragen, S. Sommer, J. Schnabel, M. Nielsen (Eds.), Information Processing in Medical Imaging, Springer International Publishing, Cham, 2021, pp. 677–688. doi:10.1007/978-3-030-78191-0_52

  18. [18]

    M. Riera-Marín, J. G. López, J. Rodríguez-Comas, M. A. G. Ballester, A. Galdran, Multi-Rater Calibration Error Estimation, in: C. H. Sudre, M. I. Hoque, R. Mehta, C. Ouyang, C. Qin, M. Rakic, W. M. Wells (Eds.), Uncertainty for Safe Utilization of Machine Learning in Medical Imaging, Springer Nature Switzerland, Cham, 2026, pp. 147–157. doi:10.100...

  19. [19]

    S. Tuijn, F. Janssens, P. Robben, H. van den Bergh, Reducing interrater variability and improving health care: a meta-analytical review, Journal of Evaluation in Clinical Practice 18 (4) (2012) 887–895. doi:10.1111/j.1365-2753.2011.01705.x

  20. [20]

    H. K. Yang, M.-S. Park, M. Choi, J. Shin, S. S. Lee, W. K. Jeong, S. H. Hwang, S. H. Choi, Systematic review and meta-analysis of diagnostic performance of CT imaging for assessing resectability of pancreatic ductal adenocarcinoma after neoadjuvant therapy: importance of CT criteria, Abdominal Radiology (New York) 46 (11) (2021) 5201–5217. doi:10.1007...

  21. [21]

    Y. N. Shen, X. L. Bai, G. G. Li, T. B. Liang, Review of radiological classifications of pancreatic cancer with peripancreatic vessel invasion: are new grading criteria required?, Cancer Imaging 17 (2017) 14. doi:10.1186/s40644-017-0115-7. URL: https://pmc.ncbi.nlm.nih.gov/articles/PMC5420088/

  22. [22]

    N. Alves, M. Schuurmans, D. Rutkowski, D. Yakar, I. Haldorsen, M. Liedenbaum, A. Molven, P. Vendittelli, G. Litjens, J. Hermans, H. Huisman, The PANORAMA Study Protocol: Pancreatic Cancer Diagnosis - Radiologists Meet AI, Tech. rep., Zenodo (Jan. 2024). doi:10.5281/zenodo.10599559. URL: https://zenodo.org/records/10599559

  23. [23]

    M. Riera-Marín, S. O K, M. M. Duh, A. Aubanell, R. de Figueiredo Cardoso, S. Egger-Hackenschmidt, M. S. May, S. Bernaus Tomé, J. Rodríguez-Comas, M. Á. González Ballester, J. Garcia López, CURVAS-PDACVI dataset (Nov. 2025). doi:10.5281/zenodo.17552201. URL: https://doi.org/10.5281/zenodo.17552201

  24. [24]

    T. Kirscher, A. Ertl, K. Maier-Hein, X. Coubez, P. Meyer, S. Faisan, TwinTrack: Post-hoc Multi-Rater Calibration for Medical Image Segmentation, arXiv:2604.15950 [cs] (Apr. 2026). doi:10.48550/arXiv.2604.15950. URL: http://arxiv.org/abs/2604.15950

  25. [25]

    B. Lakshminarayanan, A. Pritzel, C. Blundell, Simple and scalable predictive uncertainty estimation using deep ensembles, in: Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17, Curran Associates Inc., Red Hook, NY, USA, 2017, pp. 6405–6416

  26. [26]

    M. P. Naeini, G. F. Cooper, M. Hauskrecht, Obtaining well calibrated probabilities using Bayesian binning, in: Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, AAAI'15, AAAI Press, Austin, Texas, 2015, pp. 2901–2907

  27. [27]

    R. E. Barlow, D. J. Bartholomew, J. M. Bremner, H. D. Brunk, Statistical inference under order restrictions, Statistica Neerlandica 27 (4) (1973) 189–189. doi:10.1111/j.1467-9574.1973.tb00228.x. URL: https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1467-9574.1973.tb00228.x

  28. [28]

    G. Franchi, O. Laurent, M. Leguéry, A. Bursuc, A. Pilzer, A. Yao, Make Me a BNN: A Simple Strategy for Estimating Bayesian Uncertainty from Pre-trained Models, in: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 12194–12204, ISSN: 2575-7075. doi:10.1109/CVPR52733.2024.01159. URL: https://ieeexplore.ieee.org/document/10656702

  29. [29]

    C. Hémon, B. Texier, C. Lafond, J.-C. Nunes, A. Barateau, Towards trustworthy AI in radiotherapy: a comprehensive review of uncertainty-aware techniques, Physics in Medicine and Biology 71 (1) (Dec. 2025). doi:10.1088/1361-6560/ae2a9f

  30. [30]

    curvas-challenge/CURVAS-PDACVI_2025 at main · SYCAI-Technologies/curvas-challenge (2025). URL: https://github.com/SYCAI-Technologies/curvas-challenge/tree/main/CURVAS-PDACVI_2025

  31. [31]

    Location & Volume

    Appendix 6.1. Dataset Supplementary Details. The original PANORAMA CT scans used in this work are available at https://zenodo.org/records/11034178, and the CURVAS-PDACVI benchmark release, including the multi-rater annotations and evaluation resources, is available at https://zenodo.org/records/15401568. In the released repository, each case follows a stan...