pith. sign in

arxiv: 2511.03769 · v2 · submitted 2025-11-05 · 🧬 q-bio.OT

Current validation practice undermines surgical AI development

Annika Reinke , Ziying O. Li , Minu D. Tizabi , Pascaline Andr\'e , Marcel Knopp , Mika M. Rother , Ines P. Machado , Maria S. Altieri
show 90 more authors
This is my paper

Pith reviewed 2026-05-18 01:38 UTC · model grok-4.3

classification 🧬 q-bio.OT
keywords surgical AIvalidation pitfallsvideo analysisDelphi consensustemporal dynamicshierarchical dataintraoperative videosclinical translation
0
0 comments X

The pith

Existing validation practices for AI in surgical videos neglect temporal and hierarchical structures, producing misleading results.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that inadequate validation is a key reason why AI tools for analyzing surgical videos have not been widely adopted in clinics. It shows that current methods often overlook how videos unfold over time and are organized in layers like procedures, phases, and frames. A consensus process with many experts identified common pitfalls in handling data, choosing metrics, and reporting results. Reviews and tests reveal these issues are common and can change which algorithms seem best. The work offers better practices to make evaluations more reliable and clinically useful.

Core claim

Existing validation practices often neglect the temporal and hierarchical structure of intraoperative videos, producing misleading, unstable, or clinically irrelevant results. Through a multi-stage Delphi process with 92 international experts, a comprehensive catalog of validation pitfalls was created, spanning data issues, metric selection, and aggregation and reporting. A systematic review and experiments on real datasets demonstrate that these pitfalls are widespread and can substantially affect algorithm performance assessments and rankings.

What carries the argument

The catalog of validation pitfalls derived from the Delphi consensus process, categorized into data, metric selection and configuration, and aggregation and reporting.

If this is right

  • Adopting the cataloged best practices will improve the stability and clinical relevance of validation results for surgical AI algorithms.
  • Accounting for temporal dynamics and hierarchical data structures will prevent understating uncertainty and obscuring failure modes.
  • More rigorous validation will support better benchmarking, reporting, regulatory review, and clinical translation of surgical AI.
  • Future studies should avoid clinically uninformative aggregation and account for frame dependencies.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the pitfalls catalog is adopted widely, it could accelerate the development of more trustworthy surgical AI by standardizing evaluation methods.
  • Similar validation issues may exist in other domains using video data, such as medical imaging or autonomous systems, suggesting broader applicability of the framework.
  • Testing the best practices on new datasets could reveal additional pitfalls not captured in the initial consensus.
  • Integrating these guidelines into AI development pipelines might reduce the gap between reported performance and real-world clinical utility.

Load-bearing premise

The multi-stage Delphi process with 92 international experts produced a comprehensive and unbiased catalog of all major validation pitfalls without significant selection or consensus bias.

What would settle it

Re-evaluating a set of published surgical AI papers using metrics that properly account for temporal stability and hierarchical structure, and finding that algorithm rankings and reported performance remain unchanged, would challenge the claim that current practices produce misleading results.

Figures

Figures reproduced from arXiv: 2511.03769 by Abdourahmane Ndong, Adrito Das, Alexander Hann, Alexander Seitel, Amine Yamlahi, Amin Madani, Aneeq Zia, Annika Reinke, Anthony Jarc, Arnaud Huaulm\'e, Axel Krieger, Beat P. Mueller, Charles Heitz, Chinedu Nwoye, Danail Stoyanov, Daniel A. Hashimoto, Daniel Donoho, Deepak Alapatt, Dominik Rivoir, Duygu Sarikaya, Elvis C.S. Chen, Evangelia Christodoulou, Felix Nickel, Fiona R. Kolbinger, Fons van der Sommen, Gabor Fichtinger, Gernot Kronreif, Gyusung I. Lee, Hani J. Marcus, Hanna Hoffmann, Hirenkumar Nakawala, Hongliang Ren, Ines Gockel, Ines P. Machado, Jan G\"odeke, Jeffrey H. Siewerdsen, Jennifer Eckhoff, Jo\"el L. Lavanchy, Juliane Neumann, Justin W. Collins, Kyle Lam, Lalith Sharan, Lars M\"undermann, Lena Maier-Hein, Leo Joskowicz, Luc Joyeux, Makoto Hashizume, Marcel Knopp, Marco Nolden, Maria S. Altieri, Martin Wagner, Matthias Seibold, Matthias Unberath, Max Kirchner, Micha Pfeiffer, Mika M. Rother, Minu D. Tizabi, Namkee Oh, Nassir Navab, Nicola Rieke, Nicolas Padoy, Oliver Burgert, Olivier Colliot, Ozanan R. Meireles, Pablo Garc\'ia Kilroy, Pascaline Andr\'e, Patrick Godau, Paul F. J\"ager, Peng Liu, Philipp Fuernstahl, Pierre Jannin, Pietro Mascagni, Qi Dou, Raphael Sznitman, Rebecca Hisey, Reuben Docea, Robert Lim, Rohit Jena, Russell Taylor, Samuel Schmidgall, Sandy Engelhardt, Sebastian Bodenstedt, Shaohua K. Zhou, Shlomi Laufer, Silvia Seidlitz, Sophia Bano, Stamatia Giannarou, Stefanie Speidel, Stephen Gilbert, Tamas Haidegger, Teodor P. Grantcharov, Thomas Pausch, Thuy N. Tran, Tim R\"adsch, Tobias Czempiel, Vinkle Srivastav, Yueming Jin, Ziying O. Li.

Figure 1
Figure 1. Figure 1: Examples of validation pitfalls in surgical video analysis related to data, metric selection and configuration, and metric aggregation and reporting. (a) Unreliable or inconsistent annotation: Inconsistent object identifiers (IDs) in the reference can mask annotation errors when using frame-based metrics such as mean Average Precision (mAP), which ignore object IDs and may falsely suggest perfect performan… view at source ↗
Figure 2
Figure 2. Figure 2: Pitfalls related to validation of surgical AI may have severe consequences and real-world risks. (a) Overview of pitfalls collected in a multi-stage Delphi process involving over 90 experts. Pitfalls were classified into pitfalls related to data [P1], metric selection and configuration [P2], and metric aggregation and reporting [P3]. (b) Connections between pitfalls and potential consequences. A colored bo… view at source ↗
Figure 3
Figure 3. Figure 3: Validation and reporting flaws are widespread in common practice. Selected key insights from a [PITH_FULL_IMAGE:figures/full_fig_p011_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Common practice leads to large underestimation of confidence intervals. Experimental evidence for two representative tasks ((a) binary instrument segmentation (Robust Medical Instrument Segmentation (RobustMIS) challenge [101]) and (b) action triplet recognition [91]). Confidence intervals (CIs) were computed either per naïve bootstrap, assuming all samples are independent (orange), or with a hierarchical … view at source ↗
Figure 5
Figure 5. Figure 5: Lack of stratification of performance values hides performance drops for relevant, potentially confounding image properties. The bar plot shows the difference in median instance Dice similarity score (DSC) for the task of surgical instrument instance segmentation between stratified and unstratified validation across algorithms (A1-A7) as well as their median performance (black bar). Hierarchical 95% confid… view at source ↗
Figure 6
Figure 6. Figure 6: Different validation strategies lead to varying algorithm rankings. (a) Different aggregation strategies such as over all frames (frame-wise aggregation), over videos (video-wise aggregation), or over phases (phase-wise aggregation) produce different rankings. Kendall’s tau is shown in comparison to the default rankings (frame-wise). Similarly to the original challenge, we used the 5% percentile as aggrega… view at source ↗
read the original abstract

Surgical data science (SDS) is rapidly advancing, yet clinical adoption of artificial intelligence (AI) in surgery remains limited, with inadequate validation emerging as an important contributing factor. In fact, existing validation practices often neglect the temporal and hierarchical structure of intraoperative videos, producing misleading, unstable, or clinically irrelevant results. In a pioneering, consensus-driven effort, we introduce a comprehensive catalog of validation pitfalls in AI-based surgical video analysis that was derived from a multi-stage Delphi process with 92 international experts. The collected pitfalls span three categories: (1) data (e.g., incomplete annotation, spurious correlations), (2) metric selection and configuration (e.g., neglect of temporal stability, mismatch with clinical needs), and (3) aggregation and reporting (e.g., clinically uninformative aggregation, failure to account for frame dependencies in hierarchical data structures). A systematic review of surgical AI papers reveals that these pitfalls are widespread in current practice, with the majority of studies failing to account for temporal dynamics or hierarchical data structure, or relying on clinically uninformative metrics. Experiments on real surgical video datasets provide empirical evidence that ignoring temporal and hierarchical data structures can substantially understate uncertainty, obscure critical failure modes, and even alter algorithm rankings. To address these shortcomings, we provide a catalogue of best practices compiled in a multi-stage Delphi process. Together, this work provides an evidence-based framework to inform more rigorous validation of surgical video analysis algorithms and to guide future efforts in benchmarking, reporting, regulatory review, and clinical translation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper argues that current validation practices for AI in surgical video analysis frequently neglect the temporal and hierarchical structure of intraoperative videos, yielding misleading, unstable, or clinically irrelevant results. It supports this via a multi-stage Delphi process with 92 international experts that produced a catalog of pitfalls across data, metric selection/configuration, and aggregation/reporting; a systematic review demonstrating high prevalence of these pitfalls in published surgical AI papers; experiments on real datasets showing that standard practices can understate uncertainty, obscure failure modes, and change algorithm rankings; and a companion catalog of best practices.

Significance. If the central claims hold, the work could meaningfully advance validation standards in surgical data science by providing an evidence-based framework that informs benchmarking, reporting, regulatory review, and clinical translation. The Delphi consensus, systematic review of prevalence, and empirical demonstrations of altered uncertainty/rankings are concrete strengths that could help shift community practice.

major comments (2)
  1. [Experiments] Experiments section: the claim that standard validation produces 'misleading' or 'clinically irrelevant' results rests on showing different uncertainty estimates and algorithm rankings when temporal/hierarchical structure is ignored, but lacks an independent external anchor (e.g., correlation with actual surgical outcomes, complication rates, or blinded expert clinical ratings). Without this, it remains possible that the 'standard' results are stable and appropriate while the adjusted ones introduce new biases; this is load-bearing for the central claim that neglect produces misleading results.
  2. [Methods / Delphi process] Delphi process description (likely §3 or Methods): the manuscript should explicitly report expert selection criteria, response rates, exact survey items, and any measures taken to mitigate selection or consensus bias. The abstract and summary statements leave these details underspecified, which weakens confidence that the catalog is comprehensive and unbiased.
minor comments (2)
  1. [Abstract] Abstract: add one sentence summarizing the exact number of papers in the systematic review and the main inclusion criteria.
  2. [Throughout] Notation and terminology: ensure consistent use of 'temporal stability' vs. 'temporal dynamics' and 'hierarchical data structure' across sections to avoid reader confusion.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed comments, which have helped us identify areas for clarification and improvement. We address each major comment below and outline the corresponding revisions.

read point-by-point responses
  1. Referee: [Experiments] Experiments section: the claim that standard validation produces 'misleading' or 'clinically irrelevant' results rests on showing different uncertainty estimates and algorithm rankings when temporal/hierarchical structure is ignored, but lacks an independent external anchor (e.g., correlation with actual surgical outcomes, complication rates, or blinded expert clinical ratings). Without this, it remains possible that the 'standard' results are stable and appropriate while the adjusted ones introduce new biases; this is load-bearing for the central claim that neglect produces misleading results.

    Authors: We thank the referee for this important observation. Our experiments demonstrate that standard validation (ignoring temporal/hierarchical structure) produces substantially different uncertainty estimates, obscures failure modes visible under structure-aware methods, and alters algorithm rankings. These inconsistencies indicate that standard practices can yield unstable conclusions about model performance, which we view as evidence of potential misleadingness in the absence of a known clinical ground truth. We acknowledge that direct correlation with surgical outcomes or expert ratings would provide the strongest validation of clinical relevance; such data are not available in the public datasets used. We will revise the manuscript to explicitly discuss this limitation, adjust phrasing from 'clinically irrelevant' to 'potentially misleading due to instability and ranking changes,' and add a paragraph on the value of future studies incorporating clinical outcome anchors. revision: partial

  2. Referee: [Methods / Delphi process] Delphi process description (likely §3 or Methods): the manuscript should explicitly report expert selection criteria, response rates, exact survey items, and any measures taken to mitigate selection or consensus bias. The abstract and summary statements leave these details underspecified, which weakens confidence that the catalog is comprehensive and unbiased.

    Authors: We agree that transparent reporting of the Delphi methodology is essential. The full Methods section describes the multi-stage process with 92 experts, but we will expand it with a dedicated subsection detailing: expert selection criteria (minimum 5 years experience in surgical data science or AI, recruited via international societies and research networks), per-stage response rates, the exact survey items and question wording used in each round, and bias mitigation steps (anonymous voting, independent facilitation, and predefined consensus thresholds). We will also add the full survey instrument as supplementary material to allow readers to assess comprehensiveness and potential bias. revision: yes

Circularity Check

0 steps flagged

No circularity: claims grounded in external Delphi consensus and review

full rationale

The paper derives its catalog of pitfalls and best practices from a multi-stage Delphi process with 92 international experts and a systematic review of the literature, then supports prevalence claims with that review and empirical sensitivity experiments on external surgical video datasets. No mathematical derivations, equations, fitted parameters, or predictions are described that reduce by construction to the authors' own inputs or prior self-citations. The central argument that current practices produce misleading results rests on external expert input and observed differences in uncertainty estimates rather than any self-referential loop or load-bearing internal citation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The work depends on the reliability of expert consensus methods and the representativeness of the reviewed literature and chosen datasets; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption A multi-stage Delphi process with international experts yields an unbiased and comprehensive catalog of validation pitfalls.
    Invoked to derive the three-category catalog of pitfalls in data, metrics, and aggregation.

pith-pipeline@v0.9.0 · 6282 in / 1166 out tokens · 35409 ms · 2026-05-18T01:38:20.090835+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

122 extracted references · 122 canonical work pages

  1. [1]

    Surgical tool classification in laparoscopic videos using convolutional neural network.Current Directions in Biomedical Engineering, 4(1):407–410, 2018

    Tamer Abdulbaki Alshirbaji, Nour Aldeen Jalal, and Knut Möller. Surgical tool classification in laparoscopic videos using convolutional neural network.Current Directions in Biomedical Engineering, 4(1):407–410, 2018

  2. [2]

    Wavemamba: Spatial-spectral wavelet mamba for hyperspectral image classification.IEEE Geoscience and Remote Sensing Letters, 2024

    Muhammad Ahmad, Muhammad Usama, Manuel Mazzara, and Salvatore Distefano. Wavemamba: Spatial-spectral wavelet mamba for hyperspectral image classification.IEEE Geoscience and Remote Sensing Letters, 2024

  3. [3]

    The sages critical view of safety challenge: A global benchmark for ai-assisted surgical quality assessment.arXiv preprint arXiv:2509.17100, 2025

    Deepak Alapatt, Jennifer Eckhoff, Zhiliang Lyu, Yutong Ban, Jean-Paul Mazellier, Sarah Choksi, Kunyi Yang, Quanzheng Li, Filippo Filicori, Xiang Li, et al. The sages critical view of safety challenge: A global benchmark for ai-assisted surgical quality assessment.arXiv preprint arXiv:2509.17100, 2025

  4. [4]

    Aggregating long-term context for learning laparoscopic and robot-assisted surgical workflows, 2021

    Yutong Ban, Guy Rosman, Thomas Ward, Daniel Hashimoto, Taisei Kondo, Hidekazu Iwaki, Ozanan Meireles, and Daniela Rus. Aggregating long-term context for learning laparoscopic and robot-assisted surgical workflows, 2021

  5. [5]

    Bias in radiology artificial intelligence: causes, evaluation and mitigation, 2024

    Imon Banerjee. Bias in radiology artificial intelligence: causes, evaluation and mitigation, 2024

  6. [6]

    Placental vessel segmentation and registration in fetoscopy: literature review and miccai fetreg2021 challenge findings.Medical Image Analysis, 92: 103066, 2024

    Sophia Bano, Alessandro Casella, Francisco Vasconcelos, Abdul Qayyum, Abdesslam Benzinou, Moona Mazher, Fabrice Meriaudeau, Chiara Lena, Ilaria Anita Cintorrino, Gaia Romana De Paolis, et al. Placental vessel segmentation and registration in fetoscopy: literature review and miccai fetreg2021 challenge findings.Medical Image Analysis, 92: 103066, 2024

  7. [7]

    Impact of data on generalization of ai for surgical intelligence applications.Scientific reports, 10(1):22208, 2020

    Omri Bar, Daniel Neimark, Maya Zohar, Gregory D Hager, Ross Girshick, Gerald M Fried, Tamir Wolf, and Dotan Asselmann. Impact of data on generalization of ai for surgical intelligence applications.Scientific reports, 10(1):22208, 2020

  8. [8]

    Maximilian Berlet, Thomas Vogel, Daniel Ostler, Tobias Czempiel, M Kähler, Stephan Brunner, Hubertus Feussner, Dirk Wilhelm, and Michael Kranzfelder. Surgical reporting for laparoscopic cholecystectomy based on phase annotation by a convolutional neural network (cnn) and the phenomenon of phase flickering: a proof of concept.International journal of compu...

  9. [9]

    A probabilistic approach to surgical tasks and skill metrics.IEEE Transactions on Biomedical Engineering, 69(7):2212–2219, 2021

    Max Berniker, Kiran D Bhattacharyya, Kristen C Brown, and Anthony Jarc. A probabilistic approach to surgical tasks and skill metrics.IEEE Transactions on Biomedical Engineering, 69(7):2212–2219, 2021

  10. [10]

    Improving temporal stability and accuracy for endoscopic video tissue classification using recurrent neural networks.Sensors, 20(15):4133, 2020

    Tim Boers, Joost van der Putten, Maarten Struyvenberg, Kiki Fockens, Jelmer Jukema, Erik Schoon, Fons van der Sommen, Jacques Bergman, and Peter de With. Improving temporal stability and accuracy for endoscopic video tissue classification using recurrent neural networks.Sensors, 20(15):4133, 2020

  11. [11]

    Frame-by-frame analysis of a commercially available artificial intelligence polyp detection system in full-length colonoscopies.Digestion, 103(5):378–385, 2022

    Markus Brand, Joel Troya, Adrian Krenzer, Costanza De Maria, Niklas Mehlhase, Sebastian Götze, Benjamin Walter, Alexander Meining, and Alexander Hann. Frame-by-frame analysis of a commercially available artificial intelligence polyp detection system in full-length colonoscopies.Digestion, 103(5):378–385, 2022

  12. [12]

    Delphi process: a methodology used for the elicitation of opinions of experts

    Bernice B Brown. Delphi process: a methodology used for the elicitation of opinions of experts. Technical report, 1968

  13. [13]

    Finding spurious correlations with function-semantic contrast analysis, 2023

    Kirill Bykov, Laura Kopf, and Marina M-C Höhne. Finding spurious correlations with function-semantic contrast analysis, 2023

  14. [14]

    Artificial intelligence for surgical scene understanding: A systematic review and reporting quality meta-analysis.medRxiv, pages 2025–07, 2025

    Matthias Carstens, Shubha Vasisht, Zheyuan Zhang, Iulia Barbur, Annika Reinke, Lena Maier-Hein, Daniel A Hashimoto, and Fiona R Kolbinger. Artificial intelligence for surgical scene understanding: A systematic review and reporting quality meta-analysis.medRxiv, pages 2025–07, 2025

  15. [15]

    Surgt challenge: Benchmark of soft-tissue trackers for robotic surgery.Medical image analysis, 91:102985, 2024

    João Cartucho, Alistair Weld, Samyakh Tukra, Haozheng Xu, Hiroki Matsuzaki, Taiyo Ishikawa, Minjun Kwon, Yong Eun Jang, Kwang-Ju Kim, Gwang Lee, et al. Surgt challenge: Benchmark of soft-tissue trackers for robotic surgery.Medical image analysis, 91:102985, 2024

  16. [16]

    Causality matters in medical imaging.Nature Communications, 11(1): 3673, 2020

    Daniel C Castro, Ian Walker, and Ben Glocker. Causality matters in medical imaging.Nature Communications, 11(1): 3673, 2020

  17. [17]

    Automatic tissue segmentation of hyperspectral images in liver and head neck surgeries using machine learning.Artificial Intelligence Surgery, 1(1):22–37, 2021

    Fernando Cervantes-Sanchez, Marianne Maktabi, Hannes Köhler, Robert Sucher, Nada Rayes, Juan Gabriel Avina- Cervantes, Ivan Cruz-Aceves, and Claire Chalopin. Automatic tissue segmentation of hyperspectral images in liver and head neck surgeries using machine learning.Artificial Intelligence Surgery, 1(1):22–37, 2021

  18. [18]

    Why deep surgical models fail?: Revisiting surgical action triplet recognition through the lens of robustness, 2023

    Yanqi Cheng, Lihao Liu, Shujun Wang, Yueming Jin, Carola-Bibiane Schönlieb, and Angelica I Aviles-Rivero. Why deep surgical models fail?: Revisiting surgical action triplet recognition through the lens of robustness, 2023

  19. [19]

    Same data, opposite results?: A call to improve surgical database research.JAMA surgery, 156(3):219–220, 2021

    Christopher P Childers and Melinda Maggard-Gibbons. Same data, opposite results?: A call to improve surgical database research.JAMA surgery, 156(3):219–220, 2021

  20. [20]

    Confidence intervals uncovered: Are we ready for real-world medical imaging ai?, 2024

    Evangelia Christodoulou, Annika Reinke, Rola Houhou, Piotr Kalinowski, Selen Erkan, Carole H Sudre, Ninon Burgos, Sofiène Boutaj, Sophie Loizillon, Maëlys Solal, et al. Confidence intervals uncovered: Are we ready for real-world medical imaging ai?, 2024

  21. [21]

    False promises in medical imaging ai? assessing validity of outperformance claims, 2025

    Evangelia Christodoulou, Annika Reinke, Pascaline Andrè, Patrick Godau, Piotr Kalinowski, Rola Houhou, Selen Erkan, Carole H Sudre, Ninon Burgos, Sofiène Boutaj, et al. False promises in medical imaging ai? assessing validity of outperformance claims, 2025. 52 Reinke et al

  22. [22]

    Gary S Collins, Paula Dhiman, Constanza L Andaur Navarro, Jie Ma, Lotty Hooft, Johannes B Reitsma, Patricia Logullo, Andrew L Beam, Lily Peng, Ben Van Calster, et al. Protocol for development of a reporting guideline (tripod-ai) and risk of bias tool (probast-ai) for diagnostic and prognostic prediction model studies based on artificial intelligence. BMJ ...

  23. [23]

    Opera: Attention-regularized transformers for surgical phase recognition, 2021

    Tobias Czempiel, Magdalini Paschali, Daniel Ostler, Seong Tae Kim, Benjamin Busam, and Nassir Navab. Opera: Attention-regularized transformers for surgical phase recognition, 2021

  24. [24]

    PhD thesis, Technische Universität München, 2023

    Tobias M Czempiel.Symphony of Time: Temporal Deep Learning for Surgical Activity Recognition. PhD thesis, Technische Universität München, 2023

  25. [25]

    Detecting spurious correlations via robust visual concepts in real and ai-generated image classification.arXiv preprint arXiv:2311.01655, 2023

    Preetam Prabhu Srikar Dammu and Chirag Shah. Detecting spurious correlations via robust visual concepts in real and ai-generated image classification.arXiv preprint arXiv:2311.01655, 2023

  26. [26]

    Deep learning in surgical workflow analysis: a review of phase and step recognition.IEEE Journal of Biomedical and Health Informatics, 27(11):5405–5417, 2023

    Kubilay Can Demir, Hannah Schieber, Tobias Weise, Daniel Roth, Matthias May, Andreas Maier, and Seung Hee Yang. Deep learning in surgical workflow analysis: a review of phase and step recognition.IEEE Journal of Biomedical and Health Informatics, 27(11):5405–5417, 2023

  27. [27]

    Automatic data-driven real-time segmentation and recognition of surgical workflow.International journal of computer assisted radiology and surgery, 11(6):1081–1089, 2016

    Olga Dergachyova, David Bouget, Arnaud Huaulmé, Xavier Morandi, and Pierre Jannin. Automatic data-driven real-time segmentation and recognition of surgical workflow.International journal of computer assisted radiology and surgery, 11(6):1081–1089, 2016

  28. [28]

    Carts: Causality-driven robot tool segmentation from vision and kinematics data, 2022

    Hao Ding, Jintan Zhang, Peter Kazanzides, Jie Ying Wu, and Mathias Unberath. Carts: Causality-driven robot tool segmentation from vision and kinematics data, 2022

  29. [29]

    Chapman and Hall/CRC, 1994

    Bradley Efron and Robert J Tibshirani.An introduction to the bootstrap. Chapman and Hall/CRC, 1994

  30. [30]

    Improving surgical training phantoms by hyperrealism: deep unpaired image-to-image translation from real surgeries, 2018

    Sandy Engelhardt, Raffaele De Simone, Peter M Full, Matthias Karck, and Ivo Wolf. Improving surgical training phantoms by hyperrealism: deep unpaired image-to-image translation from real surgeries, 2018

  31. [31]

    Video-based surgical skill assessment using 3d convolutional neural networks.International journal of computer assisted radiology and surgery, 14(7):1217–1225, 2019

    Isabel Funke, Sören Torge Mees, Jürgen Weitz, and Stefanie Speidel. Video-based surgical skill assessment using 3d convolutional neural networks.International journal of computer assisted radiology and surgery, 14(7):1217–1225, 2019

  32. [32]

    Funke, D

    Isabel Funke, Dominik Rivoir, and Stefanie Speidel. Metrics matter in surgical phase recognition.arXiv preprint arXiv:2305.13961, 2023

  33. [33]

    Trans-svnet: Accurate phase recognition from surgical videos via hybrid embedding aggregation transformer, 2021

    Xiaojie Gao, Yueming Jin, Yonghao Long, Qi Dou, and Pheng-Ann Heng. Trans-svnet: Accurate phase recognition from surgical videos via hybrid embedding aggregation transformer, 2021

  34. [34]

    Shortcut learning in deep neural networks.Nature Machine Intelligence, 2(11):665–673, 2020

    Robert Geirhos, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, and Felix A Wichmann. Shortcut learning in deep neural networks.Nature Machine Intelligence, 2(11):665–673, 2020

  35. [35]

    arXiv:2312.06295 (2023)

    Negin Ghamsarian, Yosuf El-Shabrawi, Sahar Nasirihaghighi, Doris Putzgruber-Adamitsch, Martin Zinkernagel, Se- bastian Wolf, Klaus Schoeffmann, and Raphael Sznitman. Cataract-1k: cataract surgery dataset for scene segmentation, phase recognition, and irregularity detection.arXiv preprint arXiv:2312.06295, 2023

  36. [36]

    Jäger, and Lena Maier-Hein

    Patrick Godau, Piotr Kalinowski, Evangelia Christodoulou, Annika Reinke, Minu Tizabi, Luciana Ferrer, Paul F. Jäger, and Lena Maier-Hein. Deployment of image analysis algorithms under prevalence shifts. pages 389–399, 2023

  37. [37]

    Act-net: Anchor- context action detection in surgery videos, 2023

    Luoying Hao, Yan Hu, Wenjun Lin, Qun Wang, Heng Li, Huazhu Fu, Jinming Duan, and Jiang Liu. Act-net: Anchor- context action detection in surgery videos, 2023

  38. [38]

    Computer vision analysis of intraoperative video: automated recognition of operative steps in laparoscopic sleeve gastrectomy.Annals of surgery, 270(3):414–421, 2019

    Daniel A Hashimoto, Guy Rosman, Elan R Witkowski, Caitlin Stafford, Allison J Navarette-Welton, David W Rattner, Keith D Lillemoe, Daniela L Rus, and Ozanan R Meireles. Computer vision analysis of intraoperative video: automated recognition of operative steps in laparoscopic sleeve gastrectomy.Annals of surgery, 270(3):414–421, 2019

  39. [39]

    A foundation for evaluating the surgical artificial intelligence literature.European Journal of Surgical Oncology, 50(12):108014, 2024

    Daniel A Hashimoto, Sai Koushik Sambasastry, Vivek Singh, Sruthi Kurada, Maria Altieri, Takuto Yoshida, Amin Madani, and Matjaz Jogan. A foundation for evaluating the surgical artificial intelligence literature.European Journal of Surgical Oncology, 50(12):108014, 2024

  40. [40]

    An empirical study on activity recognition in long surgical videos, 2022

    Zhuohong He, Ali Mottaghi, Aidean Sharghi, Muhammad Abdullah Jamal, and Omid Mohareri. An empirical study on activity recognition in long surgical videos, 2022

  41. [41]

    Next-generation surgical navigation: Marker-less multi-view 6dof pose estimation of surgical instruments.Medical Image Analysis, page 103613, 2025

    Jonas Hein, Nicola Cavalcanti, Daniel Suter, Lukas Zingg, Fabio Carrillo, Lilian Calvet, Mazda Farshad, Nassir Navab, Marc Pollefeys, and Philipp Fürnstahl. Next-generation surgical navigation: Marker-less multi-view 6dof pose estimation of surgical instruments.Medical Image Analysis, page 103613, 2025

  42. [42]

    Cholecseg8k: a semantic segmen- tation dataset for laparoscopic cholecystectomy based on cholec80

    W-Y Hong, C-L Kao, Y-H Kuo, J-R Wang, W-L Chang, and C-S Shih. Cholecseg8k: a semantic segmentation dataset for laparoscopic cholecystectomy based on cholec80.arXiv preprint arXiv:2012.12453, 2020

  43. [43]

    Peg transfer workflow recognition challenge report: Do multimodal data improve recognition?Computer Methods and Programs in Biomedicine, 236:107561, 2023

    Arnaud Huaulmé, Kanako Harada, Quang-Minh Nguyen, Bogyu Park, Seungbum Hong, Min-Kook Choi, Michael Peven, Yunshuang Li, Yonghao Long, Qi Dou, et al. Peg transfer workflow recognition challenge report: Do multimodal data improve recognition?Computer Methods and Programs in Biomedicine, 236:107561, 2023

  44. [44]

    Global versus local kinematic skills assessment on robotic-assisted hysterectomies.IEEE Transactions on Medical Robotics and Bionics, 2024

    Arnaud Huaulmé, Krystel Nyangoh Timoh, Victor Jan, Sonia Guerin, and Pierre Jannin. Global versus local kinematic skills assessment on robotic-assisted hysterectomies.IEEE Transactions on Medical Robotics and Bionics, 2024

  45. [45]

    Experts vs super-experts: differences in automated performance metrics and clinical outcomes for robot-assisted radical prostatectomy.BJU international, 123(5):861–868, 2019

    Andrew J Hung, Paul J Oh, Jian Chen, Saum Ghodoussipour, Christianne Lane, Anthony Jarc, and Inderbir S Gill. Experts vs super-experts: differences in automated performance metrics and clinical outcomes for robot-assisted radical prostatectomy.BJU international, 123(5):861–868, 2019. Current validation practice undermines surgical AI development 53

  46. [46]

    Risk factors for bad splits during sagittal split ramus osteotomy: a retrospective study of 964 cases.British Journal of Oral and Maxillofacial Surgery, 59(6):678–682, 2021

    N Jiang, M Wang, R Bi, G Wu, S Zhu, and Y Liu. Risk factors for bad splits during sagittal split ramus osteotomy: a retrospective study of 964 cases.British Journal of Oral and Maxillofacial Surgery, 59(6):678–682, 2021

  47. [47]

    Sv-rcnet: workflow recognition from surgical videos using recurrent convolutional network.IEEE transactions on medical imaging, 37(5): 1114–1126, 2017

    Yueming Jin, Qi Dou, Hao Chen, Lequan Yu, Jing Qin, Chi-Wing Fu, and Pheng-Ann Heng. Sv-rcnet: workflow recognition from surgical videos using recurrent convolutional network.IEEE transactions on medical imaging, 37(5): 1114–1126, 2017

  48. [48]

    Temporal memory relation network for workflow recognition from surgical video.IEEE Transactions on Medical Imaging, 40(7):1911–1923, 2021

    Yueming Jin, Yonghao Long, Cheng Chen, Zixu Zhao, Qi Dou, and Pheng-Ann Heng. Temporal memory relation network for workflow recognition from surgical video.IEEE Transactions on Medical Imaging, 40(7):1911–1923, 2021

  49. [49]

    Quality over quantity? the role of data quality and uncertainty for ai in surgery.Global Surgical Education-Journal of the Association for Surgical Education, 3(1):79, 2024

    Matjaž Jogan, Sruthi Kurada, Shubha Vasisht, Vivek Singh, and Daniel A Hashimoto. Quality over quantity? the role of data quality and uncertainty for ai in surgery.Global Surgical Education-Journal of the Association for Surgical Education, 3(1):79, 2024

  50. [50]

    Inter-observer variability of manual contour delineation of structures in ct.European radiology, 29(3):1391–1399, 2019

    Leo Joskowicz, D Cohen, N Caplan, and Jacob Sosna. Inter-observer variability of manual contour delineation of structures in ct.European radiology, 29(3):1391–1399, 2019

  51. [51]

    State-of-the-art of situation recognition systems for intraoperative procedures.Medical & Biological Engineering & Computing, 60(4):921–939, 2022

    Denise Junger, Sina Mailin Frommer, and Oliver Burgert. State-of-the-art of situation recognition systems for intraoperative procedures.Medical & Biological Engineering & Computing, 60(4):921–939, 2022

  52. [52]

    Denuka Kankanamge, Chandana Wijeweera, Zehurn Ong, T Preda, Terry Carney, Mike Wilson, and Veronica Preda. Artificial intelligence based assessment of minimally invasive surgical skills using standardised objective metrics–a narrative review.The American Journal of Surgery, 241:116074, 2025

  53. [53]

    Federated cycling (fedcy): Semi-supervised federated learning of surgical phases.IEEE transactions on medical imaging, 42(7):1920–1931, 2022

    Hasan Kassem, Deepak Alapatt, Pietro Mascagni, Alexandros Karargyris, and Nicolas Padoy. Federated cycling (fedcy): Semi-supervised federated learning of surgical phases.IEEE transactions on medical imaging, 42(7):1920–1931, 2022

  54. [54]

    A new measure of rank correlation.Biometrika, 30(1-2):81–93, 1938

    Maurice G Kendall. A new measure of rank correlation.Biometrika, 30(1-2):81–93, 1938

  55. [55]

    Artificial intelligence assisted operative anatomy recognition in endoscopic pituitary surgery.NPJ Digital Medicine, 7(1):314, 2024

    Danyal Z Khan, Alexandra Valetopoulou, Adrito Das, John G Hanrahan, Simon C Williams, Sophia Bano, Anouk Borg, Neil L Dorward, Santiago Barbarisi, Lucy Culshaw, et al. Artificial intelligence assisted operative anatomy recognition in endoscopic pituitary surgery.NPJ Digital Medicine, 7(1):314, 2024

  56. [56]

    Physical imaging parameter variation drives domain shift.Scientific Reports, 12(1):21302, 2022

    Oz Kilim, Alex Olar, Tamás Joó, Tamás Palicz, Péter Pollner, and István Csabai. Physical imaging parameter variation drives domain shift.Scientific Reports, 12(1):21302, 2022

  57. [57]

    Pelphix: Surgical phase recognition from x-ray images in percutaneous pelvic fixation, 2023

    Benjamin D Killeen, Han Zhang, Jan Mangulabnan, Mehran Armand, Russell H Taylor, Greg Osgood, and Mathias Unberath. Pelphix: Surgical phase recognition from x-ray images in percutaneous pelvic fixation, 2023

  58. [58]

    Surgical phase recognition: From public datasets to real-world data.Applied Sciences, 12(17):8746, 2022

    Kadir Kirtac, Nizamettin Aydin, Joël L Lavanchy, Guido Beldi, Marco Smit, Michael S Woods, and Florian Aspart. Surgical phase recognition: From public datasets to real-world data.Applied Sciences, 12(17):8746, 2022

  59. [59]

    A vision transformer for decoding surgeon activity from surgical videos.Nature biomedical engineering, 7(6):780–796, 2023

    Dani Kiyasseh, Runzhuo Ma, Taseen F Haque, Brian J Miles, Christian Wagner, Daniel A Donoho, Animashree Anandkumar, and Andrew J Hung. A vision transformer for decoding surgeon activity from surgical videos.Nature biomedical engineering, 7(6):780–796, 2023

  60. [60]

    Susceptibility to image resolution in face recognition and trainings strategies.arXiv preprint arXiv:2107.03769, 2021

    Martin Knoche, Stefan Hörmann, and Gerhard Rigoll. Susceptibility to image resolution in face recognition and trainings strategies.arXiv preprint arXiv:2107.03769, 2021

  61. [61]

    Adherence to the checklist for artificial intelligence in medical imaging (claim): an umbrella review with a comprehensive two-level analysis.Diagn Interv Radiol, 2025

    Burak Koçak, Fadime Köse, Ali Keleş, Abdurrezzak Şendur, İsmail Meşe, and Mehmet Karagülle. Adherence to the checklist for artificial intelligence in medical imaging (claim): an umbrella review with a comprehensive two-level analysis.Diagn Interv Radiol, 2025

  62. [62]

    Distribution shift detection for the postmarket surveillance of medical ai algorithms: a retrospective simulation study.NPJ Digital Medicine, 7(1):120, 2024

    Lisa M Koch, Christian F Baumgartner, and Philipp Berens. Distribution shift detection for the postmarket surveillance of medical ai algorithms: a retrospective simulation study.NPJ Digital Medicine, 7(1):120, 2024

  63. [63]

    Florian Kofler, Ivan Ezhov, Fabian Isensee, Fabian Balsiger, Christoph Berger, Maximilian Koerner, Beatrice Demiray, Julia Rackerseder, Johannes Paetzold, Hongwei Li, et al. Are we using appropriate segmentation metrics? identifying correlates of human expert perception for cnn training beyond rolling the dice coefficient.Machine Learning for Biomedical I...

  64. [64]

    Anatomy segmentation in laparoscopic surgery: comparison of machine learning and human expertise–an experimental study.International Journal of Surgery, 109 (10):2962–2974, 2023

    Fiona R Kolbinger, Franziska M Rinner, Alexander C Jenke, Matthias Carstens, Stefanie Krell, Stefan Leger, Marius Distler, Jürgen Weitz, Stefanie Speidel, and Sebastian Bodenstedt. Anatomy segmentation in laparoscopic surgery: comparison of machine learning and human expertise–an experimental study.International Journal of Surgery, 109 (10):2962–2974, 2023

  65. [65]

    Fiona R Kolbinger, Sebastian Bodenstedt, Matthias Carstens, Stefan Leger, Stefanie Krell, Franziska M Rinner, Thomas P Nielen, Johanna Kirchberg, Johannes Fritzmann, Jürgen Weitz, et al. Artificial intelligence for context-aware surgical guidance in complex robot-assisted oncological procedures: An exploratory feasibility study.European Journal of Surgica...

  66. [66]

    Strategies to improve real-world applicability of laparoscopic anatomy segmentation models, 2024

    Fiona R Kolbinger, Jiangpeng He, Jinge Ma, and Fengqing Zhu. Strategies to improve real-world applicability of laparoscopic anatomy segmentation models, 2024

  67. [67]

    Appendix300: A multi-institutional laparoscopic appendectomy video dataset for computational modeling tasks.medRxiv, pages 2025–09, 2025

    Fiona R Kolbinger, Max Kirchner, Kevin Pfeiffer, Sebastian Bodenstedt, Alexander C Jenke, Julia Barthel, Matthias Carstens, Karolin Dehlke, Sophia Dietz, Sotirios Emmanouilidis, et al. Appendix300: A multi-institutional laparoscopic appendectomy video dataset for computational modeling tasks.medRxiv, pages 2025–09, 2025. 54 Reinke et al

  68. [68]

    Xiaowen Kong, Yueming Jin, Qi Dou, Ziyi Wang, Zerui Wang, Bo Lu, Erbao Dong, Yun-Hui Liu, and Dong Sun. Accurate instance segmentation of surgical instruments in robotic surgery: model refinement and cross-dataset evaluation.International journal of computer assisted radiology and surgery, 16(9):1607–1614, 2021

  69. [69]

    Robust consistent video depth estimation, 2021

    Johannes Kopf, Xuejian Rong, and Jia-Bin Huang. Robust consistent video depth estimation, 2021

  70. [70]

    Surgical phase and instrument recognition: how to identify appropriate dataset splits.International Journal of Computer Assisted Radiology and Surgery, 19(4):699–711, 2024

    Georgii Kostiuchik, Lalith Sharan, Benedikt Mayer, Ivo Wolf, Bernhard Preim, and Sandy Engelhardt. Surgical phase and instrument recognition: how to identify appropriate dataset splits.International Journal of Computer Assisted Radiology and Surgery, 19(4):699–711, 2024

  71. [71]

    Joël L Lavanchy, Sanat Ramesh, Diego Dall’Alba, Cristians Gonzalez, Paolo Fiorini, Beat P Müller-Stich, Philipp C Nett, Jacques Marescaux, Didier Mutter, and Nicolas Padoy. Challenges in multi-centric generalization: phase and step recognition in roux-en-y gastric bypass surgery.International journal of computer assisted radiology and surgery, 19(11):2249...

  72. [72]

    Artificial intelligence in surgery: evolution, trends, and future directions

    Huiyang Li, Zhuoqi Han, Haixiao Wu, Elmar R Musaev, Yile Lin, Shu Li, Alexander D Makatsariya, Vladimir P Chekhonin, Wenjuan Ma, and Chao Zhang. Artificial intelligence in surgery: evolution, trends, and future directions. International Journal of Surgery, 111(2):2101–2111, 2025

  73. [73]

    Skit: a fast key information video transformer for online surgical phase recognition, 2023

    Yang Liu, Jiayu Huo, Jingjing Peng, Rachel Sparks, Prokar Dasgupta, Alejandro Granados, and Sebastien Ourselin. Skit: a fast key information video transformer for online surgical phase recognition, 2023

  74. [74]

    Artificial intelligence–enabled decision support in surgery: state-of-the-art and future directions.Annals of Surgery, 278(1):51–58, 2023

    Tyler J Loftus, Maria S Altieri, Jeremy A Balch, Kenneth L Abbott, Jeff Choi, Jayson S Marwaha, Daniel A Hashimoto, Gabriel A Brat, Yannis Raftopoulos, Heather L Evans, et al. Artificial intelligence–enabled decision support in surgery: state-of-the-art and future directions.Annals of Surgery, 278(1):51–58, 2023

  75. [75]

    Impact of quality, type and volume of data used by deep learning models in the analysis of medical images.Informatics in Medicine Unlocked, 29:100911, 2022

    Andreea Roxana Luca, Tudor Florin Ursuleanu, Liliana Gheorghe, Roxana Grigorovici, Stefan Iancu, Maria Hlusneac, and Alexandru Grigorovici. Impact of quality, type and volume of data used by deep learning models in the analysis of medical images.Informatics in Medicine Unlocked, 29:100911, 2022

  76. [76]

    Hota: A higher order metric for evaluating multi-object tracking.International journal of computer vision, 129(2): 548–578, 2021

    Jonathon Luiten, Aljosa Osep, Patrick Dendorfer, Philip Torr, Andreas Geiger, Laura Leal-Taixé, and Bastian Leibe. Hota: A higher order metric for evaluating multi-object tracking.International journal of computer vision, 129(2): 548–578, 2021

  77. [77]

    Artificial intelligence for intra- operative guidance: using semantic segmentation to identify surgical anatomy during laparoscopic cholecystectomy

    Amin Madani, Babak Namazi, Maria S Altieri, Daniel A Hashimoto, Angela Maria Rivera, Philip H Pucher, Allison Navarrete-Welton, Ganesh Sankaranarayanan, L Michael Brunt, Allan Okrainec, et al. Artificial intelligence for intra- operative guidance: using semantic segmentation to identify surgical anatomy during laparoscopic cholecystectomy. Annals of surge...

  78. [78]

    Detecting spurious correlations with sanity tests for artificial intelligence guided radiology systems.Frontiers in digital health, 3:671015, 2021

    Usman Mahmood, Robik Shrestha, David DB Bates, Lorenzo Mannelli, Giuseppe Corrias, Yusuf Emre Erdi, and Christopher Kanan. Detecting spurious correlations with sanity tests for artificial intelligence guided radiology systems.Frontiers in digital health, 3:671015, 2021

  79. [79]

    Surgical data science for next-generation interventions.Nature Biomedical Engineering, 1(9):691–696, 2017

    Lena Maier-Hein, Swaroop S Vedula, Stefanie Speidel, Nassir Navab, Ron Kikinis, Adrian Park, Matthias Eisen- mann, Hubertus Feussner, Germain Forestier, Stamatia Giannarou, et al. Surgical data science for next-generation interventions.Nature Biomedical Engineering, 1(9):691–696, 2017

  80. [80]

    Heidelberg colorectal data set for surgical data science in the sensor operating room.Scientific data, 8(1):101, 2021

    Lena Maier-Hein, Martin Wagner, Tobias Ross, Annika Reinke, Sebastian Bodenstedt, Peter M Full, Hellena Hempe, Diana Mindroc-Filimon, Patrick Scholz, Thuy Nuong Tran, et al. Heidelberg colorectal data set for surgical data science in the sensor operating room.Scientific data, 8(1):101, 2021

Showing first 80 references.