pith. sign in

arxiv: 2605.22200 · v1 · pith:5EMD2U3Knew · submitted 2026-05-21 · 💻 cs.CV · cs.AI· cs.LG

OSS: Open Suturing Skills Vision-Based Assessment Challenge 2024-2025

Pith reviewed 2026-05-22 07:55 UTC · model grok-4.3

classification 💻 cs.CV cs.AIcs.LG
keywords open surgerysurgical skill assessmentvideo analysissuturingOSATSspatiotemporal modelsinstrument trackingMICCAI challenge
0
0 comments X

The pith

General-purpose spatiotemporal video models achieve the strongest performance in assessing open suturing skills from video.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reports results from a two-year MICCAI challenge that tests machine learning methods on videos of a dry-lab open suturing task recorded with a fixed camera, along with instrument trajectories. It shows that general-purpose spatiotemporal video models deliver the best results on skill classification into four levels and on predicting the full set of OSATS scores, while other approaches can reach similar performance when carefully designed. Readers should care because such automated tools could standardize surgical training and improve outcomes, yet the work also identifies concrete limits in fine-grained scoring and in tracking hands or tools amid occlusions.

Core claim

The central claim is that general-purpose spatiotemporal video models consistently achieved the strongest performance across the challenge tasks of four-class skill level classification and eight-category OSATS prediction, although conceptually diverse approaches reached competitive levels when well-executed; predicting fine-grained OSATS scores remains challenging but improves with more training data, while keypoint tracking is hindered by frequent occlusions and out-of-frame instances.

What carries the argument

General-purpose spatiotemporal video models operating on dry-lab suturing videos supplemented by instrument trajectories.

If this is right

  • Skill level can be classified into four categories with high reliability using video models alone.
  • Fine-grained prediction of the eight OSATS categories benefits substantially from larger amounts of training data.
  • Keypoint tracking of hands and tools is currently limited by occlusions and out-of-frame motion, restricting motion-based skill analysis.
  • Conceptually different methods can reach competitive accuracy when implemented and tuned effectively.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The results point toward hybrid systems that combine video features with trajectory data to strengthen future assessment tools.
  • Expanding the dataset to include varied camera angles or real surgical footage could address current tracking failures.
  • These benchmarks suggest that video-based assessment pipelines might be integrated into simulator platforms to give trainees immediate feedback.

Load-bearing premise

Dry-lab videos captured by a static GoPro camera and paired with instrument trajectories serve as a representative proxy for real clinical open surgery skill that generalizes beyond the training task.

What would settle it

Models trained only on this dry-lab dataset would show a large drop in accuracy when tested on videos recorded during actual open operations in the operating room.

Figures

Figures reproduced from arXiv: 2605.22200 by Andr\'e Ferreira, Atsushi Kouno, Behrus Hinrichs-Puladi, Bining Long, Claas de Boer, Daehong Kang, Danail Stoyanov, Evangelos Mazomenos, Frank H\"olzle, Gabriella d'Albenzio, Guilherme Barbosa, Hanna Hoffmann, Hao Chen, Hiroki Matsuzaki, Hyoun-Joong Kong, Idan Smoller, Isabel Funke, Jan Egger, Jieun Park, Jo\~ao Carvalho, Kensaku Mori, Kousuke Hirasawa, Kunpeng Xie, Kyu Eun Lee, Lennart Johannes Gruber, Leonardo Barroso, Mariana Ribeiro, Marina Music, Marlon Neuhaus, Masahiro Oda, Max Kirchner, Nooshin Maghsoodi, Nuno Gomes, Ori Meiraz, Rafael Piexoto, Rainer R\"ohrig, Rebecca Hisey, Roi Papo, Saeid Rezaei, Satoshi Kasai, Satoshi Kondo, Sebastian Bodenstedt, Seoi Jeong, Setareh Bady, Seungjae Hong, Shlomi Laufer, Shunsuke Kikuchi, Soyeon Shin, Stefanie Speidel, Takayuki Kitasaka, Tiago Jesus, Victor Alves, Xinkai Zhao, Yang Shu, Yihui Wang, Youngbin Kong, Yuichiro Hayashi.

Figure 1
Figure 1. Figure 1: Overview of the 2024 and 2025 MICCAI EndoVis Open Suturing Skills Subchallenges. methods, data (baseline improvements for Task 2), and performance. 2. MICCAI 2024 Challenge Overview 2.1. Challenge Design Organization The challenge was hosted at MICCAI 2024 in Mar￾rakesh, Morocco as a subchallenge under the EndoVis chal￾lenge1 . It was jointly organized by the Dresden University of Technology (TUD), the Nat… view at source ↗
Figure 2
Figure 2. Figure 2: Distributions of GRS scores (Task 1) for the training and test sets for the 2024 and 2025 challenges. The 2024 data includes the additional expert samples. were easier to rate consistently. These results suggest gener￾ally good agreement among raters, though certain aspects of surgical skill assessment may be more subjective and prone to variability. The full results of the IRA analysis can be found in the… view at source ↗
Figure 3
Figure 3. Figure 3: Task 1: GRS Metric performance (top row) and rank analysis (bottom row) on the challenge dataset for F1 and Expected Cost (EC) after bootstrapping with 10,000 repetitions. Error bars denote the standard deviation for the performance graphs. Rank plot bubble size corresponds to rank frequency, solid lines are 95% confidence intervals, and crosses denote the median rank of that team. Team ranks are sorted by… view at source ↗
Figure 4
Figure 4. Figure 4: Task 2: OSATS Metric performance (top row) and rank analysis (bottom row) for F1 and Expected Cost (EC) after bootstrapping with 10,000 repetitions. Error bars denote the standard deviation for the performance graphs. Rank plot bubble size corresponds to rank frequency, solid lines are 95% confidence intervals, and crosses denote the median rank of that team. Team ranks are sorted by mode. Kinetics-400 [7]… view at source ↗
Figure 5
Figure 5. Figure 5: Task 1: GRS Bootstrapping (top row) and rank analysis (bottom row) for F1 and Expected Cost (EC) after bootstrapping with 10,000 repetitions. Error bars denote the standard deviation for the performance graphs. Rank plot bubble size corresponds to rank frequency, solid lines are 95% confidence intervals, and crosses denote the median rank of that team. Team ranks are sorted by mode. Team Scalpel’s mid-tabl… view at source ↗
Figure 6
Figure 6. Figure 6: Task 2: OSATS Bootstrapping (top row) and rank analysis (bottom row) for F1 and Expected Cost (EC) after bootstrapping with 10,000 repetitions. Error bars denote the standard deviation for the performance graphs. Rank plot bubble size corresponds to rank frequency, solid lines are 95% confidence intervals, and crosses denote the median rank of that team. Team ranks are sorted by mode. process, much as task… view at source ↗
Figure 7
Figure 7. Figure 7: Task 3: Tracking Bootstrapping (left) and rank analysis (right) for HOTA after bootstrapping with 10,000 repetitions. Error bars denote the standard deviation for the performance graphs. Rank plot bubble size corresponds to rank frequency, solid lines are 95% confidence intervals, and crosses denote the median rank of that team. Team ranks are sorted by mode [PITH_FULL_IMAGE:figures/full_fig_p020_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Task 1 confusion matrices of teams SK, Perk, and Baseline from the 2024 Challenge (top) compared with the additional expert training data confusion matrices (bottom). single video could substantially change the macro F1 score. Confusion matrices for the three leading methods (baseline, SK, and Perk) reveal what specifically changed in their predictions (see [PITH_FULL_IMAGE:figures/full_fig_p020_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Distributions of the individual OSATS categories of the train set [PITH_FULL_IMAGE:figures/full_fig_p024_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Distributions of the individual OSATS categories of the test set [PITH_FULL_IMAGE:figures/full_fig_p025_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Distributions of the individual OSATS categories of the train set [PITH_FULL_IMAGE:figures/full_fig_p026_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Distributions of the individual OSATS categories of the test set. E.2. 2025 Challenge - Task 1 Confusion matrices for the 2025 Challenge Task 1 are seen in [PITH_FULL_IMAGE:figures/full_fig_p028_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Task 1 confusion matrices of all teams for the 2024 Challenge [PITH_FULL_IMAGE:figures/full_fig_p029_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Task 1 confusion matrices of all teams for the 2025 Challenge. : Preprint submitted to Elsevier Page 27 of 31 [PITH_FULL_IMAGE:figures/full_fig_p029_14.png] view at source ↗
read the original abstract

Achieving high levels of surgical skill through effective training is essential for optimal patient outcomes. Automated, data-driven skill assessment holds significant potential to improve surgical training. While machine learning-based methods are increasingly popular for assessing skills in minimally invasive surgery, their application to open surgery remains limited. We present the results of a dedicated MICCAI challenge designed to benchmark and advance vision-based skill assessment in open surgery. The challenge dataset comprises videos of an open suturing training task recorded with a static GoPro camera in a dry-lab setting, with instrument trajectories available in addition to the primary video modality. The OSS Challenge was hosted over two consecutive years, comprising two and three independent tasks, respectively: (1) classifying skill level into four classes, (2) predicting the full Objective Structured Assessment of Technical Skills across eight categories, and (3) tracking hands and surgical tools. Participants submitted diverse solutions including deep learning-based video models, tracking-driven methods, and hybrid approaches. General-purpose spatiotemporal video models consistently achieved the strongest performance, though conceptually diverse approaches reached competitive levels when well-executed. Predicting fine-grained OSATS scores remains challenging but benefits substantially from increased training data. Keypoint tracking proves difficult given frequent occlusions and out-of-frame instances, limiting current applicability for motion-based skill analysis. This work benchmarks innovative and diverse solutions for surgical skill assessment, highlighting both the promise and current limitations of video-based evaluation in open surgery and identifying critical directions for advancing automated skill assessment toward clinical impact.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript reports results from the OSS Challenge 2024-2025, a MICCAI competition for vision-based surgical skill assessment in open suturing. The dataset consists of dry-lab videos recorded with a static GoPro camera, augmented by instrument trajectories. Two years of the challenge covered three tasks: four-class skill level classification, prediction of full OSATS scores across eight categories, and hand/tool keypoint tracking. Diverse participant submissions included spatiotemporal video models, tracking-driven methods, and hybrids. The central observation is that general-purpose spatiotemporal video models achieved the strongest aggregate performance, although well-executed conceptually diverse approaches remained competitive; fine-grained OSATS prediction improves with more data while tracking is limited by occlusions.

Significance. If the leaderboard outcomes hold, the work supplies a useful public benchmark for an under-studied domain (open-surgery skill assessment) that has lagged behind laparoscopic applications. By releasing a fixed dataset with multiple modalities and tasks, the challenge enables direct comparison of methods and surfaces concrete limitations (occlusions, data hunger for OSATS) that future research can target. The finding that off-the-shelf spatiotemporal models already lead provides a clear, actionable starting point for the community.

major comments (2)
  1. [Abstract / Evaluation protocol] Abstract and evaluation-protocol section: the claim that spatiotemporal models 'consistently achieved the strongest performance' rests on aggregate rankings, yet the manuscript does not report per-task statistical significance tests or confidence intervals on the performance gaps; without these, it is difficult to judge whether the observed differences are robust or could be explained by variance in participant submissions.
  2. [Dataset and evaluation protocol] Dataset and split description: full details on the train/validation/test partitioning ratios, number of videos per skill class, and exact definitions of the OSATS and tracking metrics (including handling of out-of-frame or occluded keypoints) are referenced only at high level; these omissions affect reproducibility of the reported rankings and the claim that increased training data substantially benefits OSATS prediction.
minor comments (3)
  1. [Abstract] Abstract: the sentence 'Predicting fine-grained OSATS scores remains challenging but benefits substantially from increased training data' would be stronger if it cited the specific data-volume ablation or participant results that support the 'substantially' qualifier.
  2. [Results] The manuscript should include a short table summarizing the number of participating teams, submissions per task, and top-three scores with metric names to give readers an immediate quantitative overview.
  3. [Discussion] Consider adding a brief discussion of how the static GoPro viewpoint and dry-lab setting may affect generalization to real operating-room conditions, even if only to reiterate the limitations already flagged.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful review and positive recommendation. We address the two major comments point-by-point below. Both points identify areas where additional detail will improve the manuscript, and we have incorporated the requested information in the revision.

read point-by-point responses
  1. Referee: [Abstract / Evaluation protocol] Abstract and evaluation-protocol section: the claim that spatiotemporal models 'consistently achieved the strongest performance' rests on aggregate rankings, yet the manuscript does not report per-task statistical significance tests or confidence intervals on the performance gaps; without these, it is difficult to judge whether the observed differences are robust or could be explained by variance in participant submissions.

    Authors: We agree that statistical support strengthens the claim. The reported rankings reflect performance on a fixed, held-out test set across all submissions. In the revised manuscript we will add per-task bootstrap confidence intervals (1000 resamples) for the top three methods and non-parametric paired tests (Wilcoxon signed-rank) between the leading spatiotemporal models and the next-best approaches. These additions will quantify whether the observed gaps are statistically distinguishable from submission variance. revision: yes

  2. Referee: [Dataset and evaluation protocol] Dataset and split description: full details on the train/validation/test partitioning ratios, number of videos per skill class, and exact definitions of the OSATS and tracking metrics (including handling of out-of-frame or occluded keypoints) are referenced only at high level; these omissions affect reproducibility of the reported rankings and the claim that increased training data substantially benefits OSATS prediction.

    Authors: We accept this criticism. The revised Dataset section will explicitly state the train/validation/test ratios (approximately 55/15/30 across both challenge years), the exact number of videos per skill class, and the precise metric formulations. For OSATS we will detail the 1–5 Likert scale per category and the averaging procedure; for tracking we will specify that occluded or out-of-frame keypoints are excluded from the error computation via the provided visibility flags. These clarifications will also make transparent the data-scaling experiments that support the OSATS improvement claim. revision: yes

Circularity Check

0 steps flagged

No significant circularity in challenge report

full rationale

The manuscript is a MICCAI challenge summary that reports empirical outcomes from independent participant submissions evaluated on a fixed, publicly released dry-lab dataset. Central observations (e.g., strongest performance by general-purpose spatiotemporal video models on skill classification and OSATS tasks) are direct leaderboard results rather than outputs of any internal derivation, fitted parameter, or self-referential equation. No load-bearing steps reduce to the paper's own inputs by construction; the text contains no equations, no self-citation chains invoked as uniqueness theorems, and no renaming of known results as novel derivations. The analysis is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on the established OSATS framework and standard computer vision evaluation practices without introducing new free parameters, axioms beyond domain conventions, or invented entities.

axioms (1)
  • domain assumption OSATS categories provide a valid and reliable measure of technical surgical skill.
    The challenge defines task 2 as prediction of full OSATS scores across eight categories, treating these scores as ground truth.

pith-pipeline@v0.9.0 · 6065 in / 1276 out tokens · 35698 ms · 2026-05-22T07:55:07.463749+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

81 extracted references · 81 canonical work pages · 2 internal anchors

  1. [1]

    Ahmed, K., Miskovic, D., Darzi, A., Athanasiou, T., Hanna, G.B.,

  2. [2]

    The American Journal of Surgery 202, 469– 480.e6

    Observational tools for assessment of procedural skills: a systematic review. The American Journal of Surgery 202, 469– 480.e6. doi:10.1016/J.AMJSURG.2010.10.020. :Preprint submitted to Elsevier Page 28 of 31

  3. [3]

    Keep your eye on the best: Contrastive regression transformer for skill assessmentinroboticsurgery

    Anastasiou, D., Jin, Y., Stoyanov, D., Mazomenos, E., 2023. Keep your eye on the best: Contrastive regression transformer for skill assessmentinroboticsurgery. IEEERoboticsandAutomationLetters 8, 1755–1762. doi:10.1109/LRA.2023.3242466

  4. [4]

    Deep neural network architecture for automated soft surgical skills evaluation using ob- jective structured assessment of technical skills criteria

    Benmansour, M., Malti, A., Jannin, P., 2023. Deep neural network architecture for automated soft surgical skills evaluation using ob- jective structured assessment of technical skills criteria. Interna- tionalJournalofComputerAssistedRadiologyandSurgery18,929–

  5. [5]

    1007/s11548-022-02827-5

    URL:https://doi.org/10.1007/s11548-022-02827-5, doi:10. 1007/s11548-022-02827-5

  6. [6]

    Is space-time attention all you need for video understanding?, in: Meila, M., Zhang, T

    Bertasius, G., Wang, H., Torresani, L., 2021. Is space-time attention all you need for video understanding?, in: Meila, M., Zhang, T. (Eds.), Proceedings of the 38th International Conference on Machine Learning, PMLR. pp. 813–824. URL:https://proceedings.mlr. press/v139/bertasius21a.html

  7. [7]

    Surgical skill and complication rates after bariatric surgery

    Birkmeyer, J.D., Finks, J.F., O’Reilly, A., Oerline, M., et al., 2013. Surgical skill and complication rates after bariatric surgery. New England Journal of Medicine 369, 1434–1442. URL:https:// www.nejm.org/doi/10.1056/NEJMsa1300625,doi:10.1056/NEJMSA1300625/ SUPPL_FILE/NEJMSA1300625_DISCLOSURES.PDF

  8. [8]

    URL:https://www.sciencedirect

    Byvshev,P.,Mettes,P.,Xiao,Y.,2022.Are3dconvolutionalnetworks inherently biased towards appearance? Computer Vision and Im- age Understanding 220, 103437. URL:https://www.sciencedirect. com/science/article/pii/S1077314222000534,doi:https://doi.org/10. 1016/j.cviu.2022.103437

  9. [9]

    Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset

    Carreira, J., Zisserman, A., Com, Z., Deepmind, 2017. Quo vadis, actionrecognition?anewmodelandthekineticsdataset. Proceedings -30thIEEEConferenceonComputerVisionandPatternRecognition, CVPR 2017 2017-January, 4724–4733. URL:https://arxiv.org/ abs/1705.07750v3, doi:10.1109/CVPR.2017.502

  10. [10]

    Chen, T., Guestrin, C., 2016. Xgboost: A scalable tree boosting system, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Associa- tion for Computing Machinery, New York, NY, USA. pp. 785–

  11. [11]

    URL:https://doi.org/10.1145/2939672.2939785, doi:10.1145/ 2939672.2939785

  12. [12]

    Cspnext:Anewefficienttokenhybridbackbone

    Chen, X., Yang, C., Mo, J., Sun, Y., Karmouni, H., Jiang, Y., Zheng, Z.,2024. Cspnext:Anewefficienttokenhybridbackbone. Eng.Appl. Artif. Intell. 132. URL:https://doi.org/10.1016/j.engappai.2024. 107886, doi:10.1016/j.engappai.2024.107886

  13. [13]

    A ConvNet for the 2020s

    Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R., 2022. Masked-attention mask transformer for universal image segmenta- tion,in:2022IEEE/CVFConferenceonComputerVisionandPattern Recognition (CVPR), pp. 1280–1289. doi:10.1109/CVPR52688.2022. 00135

  14. [14]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp

    Cherti,M.,Beaumont,R.,Wightman,R.,Wortsman,M.,Ilharco,G., Gordon, C., Schuhmann, C., Schmidt, L., Jitsev, J., 2023. Repro- duciblescalinglawsforcontrastivelanguage-imagelearning,in:2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2818–2829. doi:10.1109/CVPR52729.2023.00276

  15. [15]

    In: Moschitti, A., Pang, B., Daelemans, W

    Cho,K.,vanMerriënboer,B.,Gulcehre,C.,Bahdanau,D.,Bougares, F., Schwenk, H., Bengio, Y., 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation, in: Moschitti, A., Pang, B., Daelemans, W. (Eds.), Proceedings of the 2014 Conference on Empirical Methods in Natural Language Pro- cessing(EMNLP),AssociationforComputa...

  16. [16]

    Czempiel, T., Paschali, M., Keicher, M., Simson, W., Feussner, H., Kim, S.T., Navab, N., 2020. Tecno: Surgical phase recognition with multi-stage temporal convolutional networks, in: Martel, A.L., Abolmaesumi, P., Stoyanov, D., Mateus, D., Zuluaga, M.A., Zhou, S.K., Racoceanu, D., Joskowicz, L. (Eds.), Medical Image Comput- ing and Computer Assisted Inter...

  17. [17]

    Low-fidelity bench models for basic surgical skills trainingduringundergraduatemedicaleducation

    Denadai, R., Saad-Hossne, R., Todelo, A.P., Kirylko, L., Souto, L.R.M., 2014. Low-fidelity bench models for basic surgical skills trainingduringundergraduatemedicaleducation. RevistadoColégio Brasileiro de Cirurgiões 41, 137–145

  18. [18]

    An Introduction to the Bootstrap

    Efron, B., Tibshirani, R.J., 1994. An Introduction to the Bootstrap. 1st ed., Chapman and Hall/CRC. URL:https://doi.org/10.1201/ 9780429246593, doi:10.1201/9780429246593

  19. [19]

    The impact of simulation-based training in medical education: A review

    Elendu, C., Amaechi, D.C., Okatta, A.U., Amaechi, E.C., Elendu, T.C., Ezeh, C.P., Elendu, I.D., 2024. The impact of simulation-based training in medical education: A review. Medicine 103, e38813. doi:10.1097/MD.0000000000038813

  20. [20]

    Two-framemotionestimationbasedonpolyno- mial expansion, in: Bigun, J., Gustavsson, T

    Farnebäck,G.,2003. Two-framemotionestimationbasedonpolyno- mial expansion, in: Bigun, J., Gustavsson, T. (Eds.), Image Analysis, Springer Berlin Heidelberg, Berlin, Heidelberg. pp. 363–370

  21. [21]

    Fathabadi, F.R., Grantner, J.L., Shebrain, S.A., Abdel-Qader, I.,

  22. [22]

    Conference Proceedings - IEEE International Conference on Systems, Man and Cybernetics , 1248–1253doi:10.1109/SMC52423.2021.9658766

    Surgical skill assessment system using fuzzy logic in a multi- class detection of laparoscopic box-trainer instruments. Conference Proceedings - IEEE International Conference on Systems, Man and Cybernetics , 1248–1253doi:10.1109/SMC52423.2021.9658766

  23. [23]

    X3d: Expanding architectures for efficient video recognition doi:10.48550/arXiv.2004.04730

    Feichtenhofer, C., 2020. X3d: Expanding architectures for efficient video recognition doi:10.48550/arXiv.2004.04730

  24. [24]

    A benchmark for video-based laparoscopic skill analysis and assessment

    Funke, I., Bodenstedt, S., von Bechtolsheim, F., Oehme, F., Mar- uschke, M., Herrlich, S., Weitz, J., Distler, M., Mees, S.T., Spei- del, S., 2026. A benchmark for video-based laparoscopic skill analysis and assessment. URL:https://arxiv.org/abs/2602.09927, arXiv:2602.09927

  25. [25]

    Funke,I.,Bodenstedt,S.,Oehme,F.,vonBechtolsheim,F.,Weitz,J., Speidel, S., 2019a. Using 3d convolutional neural networks to learn spatiotemporal features for automatic surgical gesture recognition in video,in:Shen,D.,Liu,T.,Peters,T.M.,Staib,L.H.,Essert,C.,Zhou, S., Yap, P.T., Khan, A. (Eds.), Medical Image Computing and Com- puter Assisted Intervention – ...

  26. [26]

    Video-based surgical skill assessment using 3d convolutional neural networks

    Funke, I., Mees, S.T., Weitz, J., Speidel, S., 2019b. Video-based surgical skill assessment using 3d convolutional neural networks. International Journal of Computer Assisted Radiology and Surgery 14, 1217–1225. URL:https://link.springer.com/article/10.1007/ s11548-019-01995-1, doi:10.1007/S11548-019-01995-1/FIGURES/4

  27. [27]

    Goh, A.C., Goldfarb, D.W., Sander, J.C., Miles, B.J., Dunkin, B.J.,

  28. [28]

    URL:https://www.auajournals.org/doi/ 10.1016/j.juro.2011.09.032, doi:10.1016/J.JURO.2011.09.032

    Global evaluative assessment of robotic skills: Validation of a clinicalassessmenttooltomeasureroboticsurgicalskills.TheJournal of Urology 187, 247–252. URL:https://www.auajournals.org/doi/ 10.1016/j.juro.2011.09.032, doi:10.1016/J.JURO.2011.09.032

  29. [29]

    Video- based fully automatic assessment of open surgery suturing skills

    Goldbraikh,A.,D’Angelo,A.L.,Pugh,C.M.,Laufer,S.,2022. Video- based fully automatic assessment of open surgery suturing skills. International Journal of Computer Assisted Radiology and Surgery 17, 437–448. URL:https://link.springer.com/article/10.1007/ s11548-022-02559-6, doi:10.1007/S11548-022-02559-6/FIGURES/5

  30. [30]

    Automated skills assessment in open surgery: A scoping review

    Hamza, H., Shabir, D., Aboumarzouk, O., Al-Ansari, A., Shaban, K., Navkar, N.V., 2025. Automated skills assessment in open surgery: A scoping review. Engineering Applications of Artifi- cial Intelligence 153, 110893. URL:https://www.sciencedirect. com/science/article/pii/S0952197625008930,doi:https://doi.org/10. 1016/j.engappai.2025.110893

  31. [31]

    Maskr-cnn

    He,K.,Gkioxari,G.,Dollár,P.,Girshick,R.,2020. Maskr-cnn. IEEE Transactions on Pattern Analysis and Machine Intelligence 42, 386–

  32. [32]

    URL:https://api.semanticscholar.org/CorpusID:264031695

  33. [33]

    Deep residual learning for image recognition,

    He, K., Zhang, X., Ren, S., Sun, J., 2016. Deep residual learning for image recognition. Proceedings of the IEEE Computer Soci- ety Conference on Computer Vision and Pattern Recognition 2016- December, 770–778. doi:10.1109/CVPR.2016.90

  34. [34]

    Aixsuture: vision-based assessment of open suturing skills

    Hoffmann,H.,Funke,I.,Peters,P.,Venkatesh,D.K.,Egger,J.,Rivoir, D., Röhrig, R., Hölzle, F., Bodenstedt, S., Willemer, M.C., Speidel, S., Puladi, B., 2024. Aixsuture: vision-based assessment of open suturing skills. International Journal of Computer Assisted Radiol- ogy and Surgery 19, 1045–1052. URL:https://doi.org/10.1007/ s11548-024-03093-3, doi:10.1007/...

  35. [35]

    Rtmpose: Real-time multi-person pose estimation based on mmpose,

    Jiang, T., Lu, P., Zhang, L., Ma, N., Han, R., Lyu, C., Li, Y., Chen, K., 2023. Rtmpose: Real-time multi-person pose estimation based on mmpose. ArXiv abs/2303.07399. URL:https://api. semanticscholar.org/CorpusID:257504954. :Preprint submitted to Elsevier Page 29 of 31

  36. [36]

    Ke,G.,Meng,Q.,Finley,T.,Wang,T.,Chen,W.,Ma,W.,Ye,Q.,Liu, T.Y., 2017. Lightgbm: a highly efficient gradient boosting decision tree, in: Proceedings of the 31st International Conference on Neural Information Processing Systems, Curran Associates Inc., Red Hook, NY, USA. p. 3149–3157

  37. [37]

    A vision transformer for decoding surgeon activity from surgical videos

    Kiyasseh, D., Ma, R., Haque, T.F., Miles, B.J., Wagner, C., Donoho, D.A., Anandkumar, A., Hung, A.J., 2023. A vision transformer for decoding surgeon activity from surgical videos. Nature Biomedical Engineering 7, 780–796. URL:https://www.nature.com/articles/ s41551-023-01010-8, doi:10.1038/s41551-023-01010-8

  38. [38]

    Machine learn- ing for technical skill assessment in surgery: a systematic re- view

    Lam, K., Chen, J., Wang, Z., Iqbal, F.M., Darzi, A., Lo, B., Purkayastha, S., Kinross, J.M., 2022. Machine learn- ing for technical skill assessment in surgery: a systematic re- view. URL:https://www.nature.com/articles/s41746-022-00566-0. pdf, doi:10.1038/s41746-022-00566-0

  39. [39]

    Automation of surgical skill assessment using a three-stage machine learning algorithm

    Lavanchy,J.L.,Zindel,J.,Kirtac,K.,Twick,I.,Hosgor,E.,Candinas, D., Beldi, G., 2021. Automation of surgical skill assessment using a three-stage machine learning algorithm. Scientific Reports 11, 1–9. URL:https://www.nature.com/articles/s41598-021-84295-6, doi:10. 1038/s41598-021-84295-6

  40. [40]

    Automatic assessment of per- formanceintheflstrainerusingcomputervision

    Lazar, A., Sroka, G., Laufer, S., 2023. Automatic assessment of per- formanceintheflstrainerusingcomputervision. SurgicalEndoscopy 37, 6476–6482. URL:https://doi.org/10.1007/s00464-023-10132-8, doi:10.1007/s00464-023-10132-8

  41. [41]

    Automated methods of technical skill as- sessment in surgery: A systematic review

    Levin, M., McKechnie, T., Khalid, S., Grantcharov, T.P., Gold- enberg, M., 2019. Automated methods of technical skill as- sessment in surgery: A systematic review. Journal of Surgi- cal Education 76, 1629–1639. URL:https://www.sciencedirect. com/science/article/pii/S1931720419301643,doi:https://doi.org/10. 1016/j.jsurg.2019.06.011

  42. [42]

    Hrnext: High-resolution context network for crowd pose estimation

    Li, Q., Zhang, Z., Zhang, F., Xiao, F., 2023. Hrnext: High-resolution context network for crowd pose estimation. IEEE Transactions on Multimedia 25, 1521–1528. doi:10.1109/TMM.2023.3248144

  43. [43]

    Mvitv2: Improved multiscale vision transformers for classification and detection

    Li, Y., Wu, C., Fan, H., Mangalam, K., Xiong, B., Malik, J., Feicht- enhofer, C., 2021. Mvitv2: Improved multiscale vision transformers for classification and detection. 2022 IEEE/CVF Conference on ComputerVisionandPatternRecognition(CVPR),4794–4804URL: https://api.semanticscholar.org/CorpusID:244799268

  44. [44]

    Liu, Z., Mao, H., Wu, C.Y., Feichtenhofer, C., Darrell, T., Xie, S.,

  45. [45]

    11976–11986

    A convnet for the 2020s, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11976–11986

  46. [46]

    Hota:Ahigherordermetricforevaluatingmulti- object tracking

    Luiten, J., Osep, A., Dendorfer, P., Torr, P., Geiger, A., Leal-Taixé, L.,Leibe,B.,2020. Hota:Ahigherordermetricforevaluatingmulti- object tracking. International Journal of Computer Vision , 1–31

  47. [47]

    Rtmdet: An empirical study of designing real-time object detectors.arXiv preprint arXiv:2212.07784,

    Lyu, C., Zhang, W., Huang, H., Zhou, Y., Wang, Y., Liu, Y., Zhang, S., Chen, K., 2022. Rtmdet: An empirical study of designing real- time object detectors. ArXiv abs/2212.07784. URL:https://api. semanticscholar.org/CorpusID:254685870

  48. [48]

    Maier-Hein, L., Eisenmann, M., Reinke, A., Onogur, S., Stankovic, M., Scholz, P., Arbel, T., Bogunovic, H., Bradley, A.P., Carass, A., Feldmann, C., Frangi, A.F., Full, P.M., van Ginneken, B., Hanbury, A., Honauer, K., Kozubek, M., Landman, B.A., März, K., Maier, O., Maier-Hein, K., Menze, B.H., Müller, H., Neher, P.F., Niessen, W., Rajpoot, N., Sharp, G....

  49. [49]

    Nature Communications 9, 5217

    Why rankings of biomedical image analysis competitions should be interpreted with care. Nature Communications 9, 5217. URL:https://www.nature.com/articles/s41467-018-07619-7, doi:10. 1038/s41467-018-07619-7. publisher: Nature Publishing Group

  50. [50]

    Metrics reloaded: Pitfalls and recommendationsfor imageanalysisvalidationURL:https://arxiv

    Maier-Hein, L., Reinke, A., Godau, P., Tizabi, M.D., Büttner, F., Christodoulou, E., Glocker, B., Isensee, F., Kleesiek, J., Kozubek, M., Reyes, M., Riegler, M.A., Wiesenfarth, M., Kavur, E., Sudre, C.H., Baumgartner, M., Eisenmann, M., Heckmann-Nötzel, D., Räd- sch, A.T., Acion, L., Antonelli, M., Arbel, T., Bakas, S., Benis, A., Blaschko, M., Cardoso, M...

  51. [51]

    Bias:Transparentreportingofbiomedicalimageanalysis challenges

    Maier-Hein, L., Reinke, A., Kozubek, M., Martel, A.L., Arbel, T., Eisenmann, M., Hanbury, A., Jannin, P., Müller, H., Onogur, S., Saez-Rodriguez,J.,vanGinneken,B.,Kopp-Schneider,A.,Landman, B.A.,2020. Bias:Transparentreportingofbiomedicalimageanalysis challenges. Medical Image Analysis 66, 101796. URL:https: //www.sciencedirect.com/science/article/pii/S13...

  52. [52]

    Objective structured assessment of technicalskill(osats)forsurgicalresidents

    Martin, J., Regehr, G., Reznick, R., Macrae, H., Murnaghan, J., Hutchison, C., Brown, M., 1997. Objective structured assessment of technicalskill(osats)forsurgicalresidents. Britishjournalofsurgery 84, 273–278

  53. [53]

    Forming inferences about some intraclass correlation coefficients

    Mcgraw, K., Wong, S., 1996. Forming inferences about some intraclass correlation coefficients. Psychological Methods 1, 30–46. doi:10.1037/1082-989X.1.1.30

  54. [54]

    Ranking surgical skills using an attention-enhanced siamese network with piecewise aggre- gated kinematic data

    Oğul, B.B., Gilgien, M., Özdemir, S., 2022. Ranking surgical skills using an attention-enhanced siamese network with piecewise aggre- gated kinematic data. International Journal of Computer Assisted Radiology and Surgery 17, 1039–1048. URL:https://doi.org/10. 1007/s11548-022-02581-8, doi:10.1007/s11548-022-02581-8

  55. [55]

    Papo, R., Gershov, S., Friedman, T., Or, I., Bolotin, G., Laufer, S.,

  56. [56]

    Rohan:Robusthanddetectioninoperationroomdoi:10.48550/ arXiv.2501.08115

  57. [57]

    Pedrett, R., Mascagni, P., Beldi, G., Padoy, N., Lavanchy, J.L.,

  58. [58]

    Surgical Endoscopy URL:https://link.springer.com/10.1007/s00464-023-10335-z, doi:10.1007/S00464-023-10335-Z

    Technical skill assessment in minimally invasive surgery using artificial intelligence: a systematic review. Surgical Endoscopy URL:https://link.springer.com/10.1007/s00464-023-10335-z, doi:10.1007/S00464-023-10335-Z

  59. [59]

    Spatial entropy as an inductive bias for vision transformers.MachineLearning113,6945–6975.URL:https://doi

    Peruzzo, E., Sangineto, E., Liu, Y., Nadai, M.D., Bi, W., Lepri, B., Sebe, N., 2024. Spatial entropy as an inductive bias for vision transformers.MachineLearning113,6945–6975.URL:https://doi. org/10.1007/s10994-024-06570-7, doi:10.1007/s10994-024-06570-7

  60. [60]

    Peters, P., Lemos, M., Bönsch, A., Ooms, M., Ulbrich, M., Rashad, A., Krause, F., Lipprandt, M., Kuhlen, T.W., Röhrig, R., Hölzle, F., Puladi, B., 2023a. Dataset from: Effect of head-mounted displays on students’ acquisition of surgical suturing techniques compared to an e-learning and tutor-led course: A randomized controlled trial URL: https://zenodo.or...

  61. [61]

    Effect of head-mounted displays on students’ acquisition of surgical suturing techniques compared to an e-learning and tutor-led course: A ran- domized controlled trial

    Peters, P., Lemos, M., Bönsch, A., Ooms, M., Ulbrich, M., Rashad, A., Krause, F., Lipprandt, M., Kuhlen, T.W., Röhrig, R., Hölzle, F., Puladi, B., med med dent Behrus Puladi, 2023b. Effect of head-mounted displays on students’ acquisition of surgical suturing techniques compared to an e-learning and tutor-led course: A ran- domized controlled trial. Inter...

  62. [62]

    Icc4irr: A shinyapplicationtoestimateinterraterreliabilityusingintraclasscor- relation coefficients

    Psychogyiopoulos, A., Koopman, L., Ten Hove, D., 2025. Icc4irr: A shinyapplicationtoestimateinterraterreliabilityusingintraclasscor- relation coefficients. URL:https://tasospsy.shinyapps.io/icc4irr_ app/

  63. [63]

    SAM 2: Segment anything in images and videos, in: The Thirteenth International Conference on Learning Representations

    Ravi, N., Gabeur, V., Hu, Y.T., Hu, R., Ryali, C., Ma, T., Khedr, H., Rädle, R., Rolland, C., Gustafson, L., Mintun, E., Pan, J., Alwala, K.V., Carion, N., Wu, C.Y., Girshick, R., Dollar, P., Feichtenhofer, C., 2025. SAM 2: Segment anything in images and videos, in: The Thirteenth International Conference on Learning Representations. URL:https://openrevie...

  64. [64]

    You Only Look Once: Unified, Real-Time Object Detection

    Redmon, J., Divvala, S., Girshick, R., Farhadi, A., 2015. You only look once: Unified, real-time object detection URL:http://arxiv. :Preprint submitted to Elsevier Page 30 of 31 org/abs/1506.02640

  65. [65]

    Brown, K., 2025

    Rezaei, S., N. Brown, K., 2025. Generative reward machine for re- inforcementlearningforphysicalinternetdistributioncentre,in:Ma- chine Learning, Optimization, and Data Science: 10th International Conference, LOD 2024, Castiglione Della Pescaia, Italy, Septem- ber 22–25, 2024, Revised Selected Papers, Part I, Springer-Verlag, Berlin, Heidelberg. p. 317–33...

  66. [66]

    Benchmarking and error diagnosis in multi-instance pose estimation

    Ronchi, M.R., Perona, P., 2017. Benchmarking and error diagnosis in multi-instance pose estimation. 2017 IEEE International Con- ference on Computer Vision (ICCV) , 369–378URL:https://api. semanticscholar.org/CorpusID:863539

  67. [67]

    ImageNet Large Scale Visual Recognition Challenge,

    Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei- Fei, L., 2015. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115, 211–252. doi:10.1007/s11263-015-0816-y

  68. [68]

    Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C., Wightman, R., Cherti, M., Coombes, T., Katta, A., Mullis, C., Wortsman, M., Schramowski, P., Kundurthy, S., Crowson, K., Schmidt, L., Kacz- marczyk, R., Jitsev, J., 2022. Laion-5b: an open large-scale dataset for training next generation image-text models, in: Proceedings of the 36th International Confer...

  69. [69]

    Virtual reality training improves operating room performance: results of a random- ized,double-blindedstudy

    Seymour, N.E., Gallagher, A.G., Roman, S.A., O’Brien, M.K., Bansal, V.K., Andersen, D.K., Satava, R.M., 2002. Virtual reality training improves operating room performance: results of a random- ized,double-blindedstudy. AnnalsofSurgery236,458–463. doi:10. 1097/00000658-200210000-00008

  70. [70]

    Intraclass correlations: uses in assessingraterreliability

    Shrout, P.E., Fleiss, J.L., 1979. Intraclass correlations: uses in assessingraterreliability. Psychologicalbulletin862,420–8. doi:10. 1037//0033-2909.86.2.420

  71. [71]

    Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.,

  72. [72]

    2016 IEEE Conference on Computer Vision and Pattern Recog- nition (CVPR) , 2818–2826URL:https://api.semanticscholar.org/ CorpusID:206593880

    Rethinking the inception architecture for computer vision. 2016 IEEE Conference on Computer Vision and Pattern Recog- nition (CVPR) , 2818–2826URL:https://api.semanticscholar.org/ CorpusID:206593880

  73. [73]

    Updated guide- lines on selecting an intraclass correlation coefficient for interrater reliability, with applications to incomplete observational designs

    Ten Hove, D., Jorgensen, T., van der Ark, A., 2024. Updated guide- lines on selecting an intraclass correlation coefficient for interrater reliability, with applications to incomplete observational designs. Psychological Methods 29, 967–979. doi:10.1037/met0000516

  74. [74]

    Convnext: A contemporary architecture for convolutional neural networks for imageclassification

    Todi, A., Narula, N., Sharma, M., Gupta, U., 2023. Convnext: A contemporary architecture for convolutional neural networks for imageclassification. 20233rdInternationalConferenceonInnovative Sustainable Computational Technologies (CISCT) , 1–6URL:https: //api.semanticscholar.org/CorpusID:266486570

  75. [75]

    Tong, Z., Song, Y., Wang, J., Wang, L., 2022. Videomae: masked autoencodersaredata-efficientlearnersforself-supervisedvideopre- training, in: Proceedings of the 36th International Conference on Neural Information Processing Systems, Curran Associates Inc., Red Hook, NY, USA

  76. [76]

    A closer look at spatiotemporal convolutions for action recognition, in: CVPR

    Tran,D.,Wang,H.,Torresani,L.,Ray,J.,LeCun,Y.,Paluri,M.,2018. A closer look at spatiotemporal convolutions for action recognition, in: CVPR

  77. [77]

    Attention is all you need, in: Guyon, I., Luxburg, U.V., Bengio, S., Wal- lach, H., Fergus, R., Vishwanathan, S., Garnett, R

    Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L.u., Polosukhin, I., 2017. Attention is all you need, in: Guyon, I., Luxburg, U.V., Bengio, S., Wal- lach, H., Fergus, R., Vishwanathan, S., Garnett, R. (Eds.), Ad- vancesinNeuralInformationProcessingSystems,CurranAssociates, Inc. URL:https://proceedings.neurips.cc/paper...

  78. [78]

    Temporal segment networks: Towards good practices for deepactionrecognition,in:Europeanconferenceoncomputervision, Springer

    Wang,L.,Xiong,Y.,Wang,Z.,Qiao,Y.,Lin,D.,Tang,X.,VanGool, L., 2016. Temporal segment networks: Towards good practices for deepactionrecognition,in:Europeanconferenceoncomputervision, Springer. pp. 20–36

  79. [79]

    Yang, S., Luo, L., Wang, Q., Chen, H., 2024. Surgformer: Surgical TransformerwithHierarchicalTemporalAttentionforSurgicalPhase Recognition , in: proceedings of Medical Image Computing and Computer Assisted Intervention – MICCAI 2024, Springer Nature Switzerland

  80. [80]

    Swin3d: A pretrained transformer backbone for 3d indoor scene understanding

    Yang, Y.Q., Guo, Y.X., Xiong, J., Liu, Y., Pan, H., Wang, P.S., Tong, X., Guo, B., 2023. Swin3d: A pretrained transformer backbone for 3d indoor scene understanding. ArXiv abs/2304.06906. URL: https://api.semanticscholar.org/CorpusID:258170015

Showing first 80 references.