Grounding Surgical Action Triplets with Instrument Instance Segmentation: A Dataset and Target-Aware Fusion Approach

Charlie Budd; Meng Wei; Miaojing Shi; Oluwatosin Alabi; Tom Vercauteren

arxiv: 2511.00643 · v1 · submitted 2025-11-01 · 💻 cs.CV

Pith reviewed 2026-05-18 01:14 UTC · model grok-4.3

classification 💻 cs.CV

keywords surgical action triplets · instrument instance segmentation · triplet segmentation · target-aware fusion · CholecTriplet-Seg · Mask2Former · surgical scene understanding · action grounding

The pith

A new task and dataset ground surgical action triplets to specific instrument instances and anatomical targets using segmentation masks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines triplet segmentation as the task of producing spatially grounded outputs that link each instrument instance to its action verb and anatomical target in surgical video frames. It releases CholecTriplet-Seg, a dataset of more than 30,000 frames with instance masks tied to verb and target labels, to enable strongly supervised evaluation. TargetFusionNet extends Mask2Former by adding a target-aware fusion step that merges weak anatomy priors with instrument instance queries to improve target prediction. Across recognition, detection, and segmentation metrics the model outperforms baselines that use only frame-level labels or class activation maps. The results indicate that detailed instance supervision paired with approximate target knowledge yields more accurate and robust surgical interaction understanding.

Core claim

Triplet segmentation is presented as a unified task that outputs spatially grounded <instrument, verb, target> predictions. The CholecTriplet-Seg dataset supplies the first large-scale benchmark linking instrument instance masks to action and target annotations. TargetFusionNet extends Mask2Former with a target-aware fusion mechanism that integrates weak anatomy priors and instrument instance queries to produce more accurate anatomical target predictions than prior frame-level or activation-map approaches.
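
To make the task output concrete, here is a minimal sketch of how one spatially grounded <instrument, verb, target> prediction might be represented. The class name `GroundedTriplet`, its fields, the helper `triplets_for_frame`, and the example labels are all hypothetical and chosen for illustration; the paper's actual annotation schema is not described on this page.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class GroundedTriplet:
    """One spatially grounded <instrument, verb, target> record (hypothetical schema)."""
    instrument: str       # instrument class of the instance
    verb: str             # action performed by that instance
    target: str           # anatomical target acted upon
    mask: np.ndarray      # boolean H x W mask of the instrument instance
    score: float = 1.0    # confidence assigned to the whole triplet

    def area(self) -> int:
        """Number of pixels covered by the instance mask."""
        return int(self.mask.sum())


def triplets_for_frame(triplets, instrument):
    """Return every grounded triplet in a frame that belongs to one instrument class."""
    return [t for t in triplets if t.instrument == instrument]


# Illustrative labels only; they are not taken from the dataset's label set.
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True
example = GroundedTriplet("grasper", "retract", "gallbladder", mask, score=0.9)
```

The point of the structure is that the verb and target hang off a specific instance mask rather than off the frame as a whole, which is what distinguishes this task from frame-level triplet recognition.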

What carries the argument

TargetFusionNet, an extension of Mask2Former that adds a target-aware fusion mechanism to combine weak anatomy priors with instrument instance queries for improved anatomical target prediction.
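
One common way to let a set of queries draw on a separate feature map is cross-attention. The sketch below shows that generic pattern with instance queries attending to anatomy-prior features. It is an assumption about the general shape of such a fusion step, not the authors' implementation: the function name `fuse_queries_with_prior`, the single attention head, the residual connection, and the toy sizes are all illustrative.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def fuse_queries_with_prior(queries, prior_feats, w_q, w_k, w_v):
    """Single-head cross-attention: instance queries attend to anatomy-prior features.

    queries:       (N, D) one embedding per instrument instance query
    prior_feats:   (P, D) weak anatomy-prior features, flattened over positions
    w_q, w_k, w_v: (D, D) projection matrices (these would be learned in a real model)
    Returns the (N, D) fused queries and the (N, P) attention weights.
    """
    q = queries @ w_q
    k = prior_feats @ w_k
    v = prior_feats @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    fused = queries + attn @ v  # residual connection keeps the original query content
    return fused, attn


rng = np.random.default_rng(0)
N, P, D = 3, 16, 8  # toy sizes, chosen only for illustration
queries = rng.normal(size=(N, D))
prior_feats = rng.normal(size=(P, D))
w_q, w_k, w_v = (rng.normal(size=(D, D)) * 0.1 for _ in range(3))
fused, attn = fuse_queries_with_prior(queries, prior_feats, w_q, w_k, w_v)
```

Each row of `attn` sums to one, so it can be read as the share of prior positions a given instrument instance draws on when its target is predicted.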

If this is right

Triplet segmentation supplies a single framework that replaces separate recognition and weak localization steps with instance-level grounding.
Strong instance supervision plus weak target priors produces measurable gains in recognition, detection, and segmentation metrics.
The benchmark dataset enables direct comparison of future methods on spatially precise surgical action outputs.
The resulting predictions support more interpretable surgical scene understanding than frame-level triplet classifiers.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same fusion pattern could be tested on non-surgical video domains that require grounding actions to objects and their targets.
Real-time deployment on robotic platforms might use the instance-level outputs to flag unsafe instrument-tissue contacts during procedures.
Collecting fully supervised target masks on a subset of the data could quantify how much accuracy is currently lost by relying on weak priors.
The dataset structure naturally supports multi-task training that jointly optimizes segmentation and action recognition.

Load-bearing premise

The target-aware fusion mechanism can reliably combine weak anatomy priors with instrument instance queries to produce accurate target predictions without introducing new errors into the segmentation pipeline.

What would settle it

Running the model on a held-out set of surgical frames where the supplied anatomy priors are deliberately noisy or removed and checking whether target prediction accuracy drops below the level of a non-fusion baseline.
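
The check described above could be organised roughly as follows. This is a sketch under stated assumptions: `predict_fn` stands in for any model that maps frames and priors to one target class per instance, Gaussian noise is only one possible way to corrupt the priors, and `toy_predict` is a placeholder rather than a real model.

```python
import numpy as np


def target_accuracy(predict_fn, frames, priors, labels):
    """Fraction of instances whose predicted target matches the label."""
    preds = predict_fn(frames, priors)
    return float(np.mean(preds == labels))


def prior_robustness_check(predict_fn, frames, priors, labels, noise_std=1.0, seed=0):
    """Compare target accuracy with clean, noisy, and removed (zeroed) priors."""
    rng = np.random.default_rng(seed)
    noisy = priors + rng.normal(scale=noise_std, size=priors.shape)
    removed = np.zeros_like(priors)
    return {
        "clean": target_accuracy(predict_fn, frames, priors, labels),
        "noisy": target_accuracy(predict_fn, frames, noisy, labels),
        "removed": target_accuracy(predict_fn, frames, removed, labels),
    }


def toy_predict(frames, priors):
    """Placeholder model: picks the target class with the largest prior score."""
    return priors.argmax(axis=1)


labels = np.array([0, 1, 2, 1])
priors = np.eye(3)[labels]   # perfectly informative one-hot priors
frames = np.zeros((4, 1))    # ignored by the placeholder model
report = prior_robustness_check(toy_predict, frames, priors, labels)
```

The three accuracies would then be set against the accuracy of a non-fusion baseline evaluated on the same frames; the fusion step is in doubt if the "noisy" or "removed" figures fall below that baseline.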

Original abstract

Understanding surgical instrument-tissue interactions requires not only identifying which instrument performs which action on which anatomical target, but also grounding these interactions spatially within the surgical scene. Existing surgical action triplet recognition methods are limited to learning from frame-level classification, failing to reliably link actions to specific instrument instances. Previous attempts at spatial grounding have primarily relied on class activation maps, which lack the precision and robustness required for detailed instrument-tissue interaction analysis. To address this gap, we propose grounding surgical action triplets with instrument instance segmentation, or triplet segmentation for short, a new unified task which produces spatially grounded <instrument, verb, target> outputs. We start by presenting CholecTriplet-Seg, a large-scale dataset containing over 30,000 annotated frames, linking instrument instance masks with action verb and anatomical target annotations, and establishing the first benchmark for strongly supervised, instance-level triplet grounding and evaluation. To learn triplet segmentation, we propose TargetFusionNet, a novel architecture that extends Mask2Former with a target-aware fusion mechanism to address the challenge of accurate anatomical target prediction by fusing weak anatomy priors with instrument instance queries. Evaluated across recognition, detection, and triplet segmentation metrics, TargetFusionNet consistently improves performance over existing baselines, demonstrating that strong instance supervision combined with weak target priors significantly enhances the accuracy and robustness of surgical action understanding. Triplet segmentation establishes a unified framework for spatially grounding surgical action triplets. The proposed benchmark and architecture pave the way for more interpretable surgical scene understanding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper's main value is releasing the first large dataset for instance-level surgical triplet segmentation, with a Mask2Former extension that reports metric gains but leaves the fusion robustness open to question.

Hi colleague, the one thing to take away is that this work releases a new dataset linking instrument instance masks to action triplets and shows that adding target-aware fusion to Mask2Former produces better numbers on recognition, detection, and segmentation tasks. They define triplet segmentation as the joint task of producing spatially grounded outputs and introduce CholecTriplet-Seg with over 30,000 frames from cholecystectomy videos that supply instance masks plus verb and anatomical target labels. This moves past earlier frame-level triplet recognition and the imprecise class activation maps used for spatial grounding before. TargetFusionNet extends Mask2Former by fusing weak anatomy priors with the instance queries, and the abstract states consistent improvements over baselines.

Releasing the dataset and setting up the benchmark is the clearest positive step; it gives the community a concrete resource for strongly supervised evaluation in surgical scene understanding. The architecture choice is reasonable for the problem, since strong instance supervision plus weak target signals is a practical way to tackle target prediction.

The soft spot sits in the fusion mechanism. Weak priors can easily misalign with instance queries under occlusion or tissue similarity, and without explicit uncertainty handling or detailed ablations shown in the abstract, it is not yet clear whether errors get introduced rather than resolved. The overall metric gains are reported, but the central claim would be stronger with statistical tests and robustness checks on that component.

This paper is for people working in medical computer vision and surgical robotics who need instance-level interaction data. Readers building benchmarks or extending segmentation models to action triplets will find the data release and evaluation setup useful. It deserves peer review because the dataset contribution is tangible and the method is a direct response to a documented gap, even if the fusion details need referee scrutiny.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces CholecTriplet-Seg, a dataset of over 30,000 annotated frames that links instrument instance masks to action verbs and anatomical targets, establishing the first benchmark for instance-level surgical action triplet segmentation. It proposes TargetFusionNet, an extension of Mask2Former that adds a target-aware fusion module to integrate weak anatomy priors with instrument instance queries, producing spatially grounded <instrument, verb, target> outputs. Evaluations across recognition, detection, and segmentation metrics show consistent gains over baselines, attributing the improvement to strong instance supervision combined with weak target priors.

Significance. If the reported gains hold under rigorous validation, the work supplies a valuable public benchmark and a practical architecture for moving surgical action understanding from frame-level labels to instance-grounded triplets. This directly supports more interpretable models for computer-assisted surgery and could seed follow-on research on uncertainty-aware fusion and multi-instrument interaction modeling.

major comments (3)

[§3.2] §3.2, TargetFusionNet description: the target-aware fusion is presented as reliably combining weak anatomy priors with instance queries, yet the text provides no explicit uncertainty modeling, misalignment correction, or error-propagation analysis; this is load-bearing for the central claim that the module enhances accuracy without introducing new errors under occlusion or tissue similarity.
[§4.2] §4.2 and Table 2: the reported metric improvements (e.g., triplet segmentation mAP) are stated as consistent, but the manuscript does not report statistical significance tests, confidence intervals, or per-scene variance; without these, the cross-baseline superiority cannot be assessed as robust.
[§2.1] §2.1, Dataset construction: the protocol for assigning anatomical targets to specific instrument instances when multiple instruments interact with the same tissue region is described only at a high level; this detail is necessary to reproduce the ground-truth labels that underpin all quantitative claims.

minor comments (2)

[Figure 3] Figure 3 caption and §3.3: the notation for the fused query features (Q_f) is introduced without an explicit equation; adding the mathematical definition would improve clarity.
[§4.1] §4.1: the baseline implementations are referenced to prior work but the exact hyper-parameter settings and training schedules used for re-implementation are not listed; a supplementary table would aid reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We are grateful to the referee for the thorough review and valuable suggestions. We address each major comment in detail below, proposing specific revisions to enhance the clarity, rigor, and reproducibility of our work.

Referee: [§3.2] §3.2, TargetFusionNet description: the target-aware fusion is presented as reliably combining weak anatomy priors with instance queries, yet the text provides no explicit uncertainty modeling, misalignment correction, or error-propagation analysis; this is load-bearing for the central claim that the module enhances accuracy without introducing new errors under occlusion or tissue similarity.

Authors: We thank the referee for this important point. While the target-aware fusion leverages cross-attention between instrument queries and anatomy prior features to achieve spatial grounding, we did not include explicit uncertainty estimates or misalignment correction in the original submission. To address this, we will revise Section 3.2 to include a discussion of the fusion mechanism's robustness properties and add an analysis of error propagation under challenging conditions such as occlusion. We will also report results from additional experiments simulating tissue similarity scenarios.

revision: yes

Referee: [§4.2] §4.2 and Table 2: the reported metric improvements (e.g., triplet segmentation mAP) are stated as consistent, but the manuscript does not report statistical significance tests, confidence intervals, or per-scene variance; without these, the cross-baseline superiority cannot be assessed as robust.

Authors: We agree that providing statistical validation is essential for claiming consistent improvements. In the revised version, we will augment Table 2 and the corresponding text in §4.2 with statistical significance tests (e.g., paired t-tests with p-values), 95% confidence intervals obtained through bootstrapping over multiple runs, and an analysis of per-scene performance variance to better demonstrate the robustness of TargetFusionNet across different surgical procedures.

revision: yes

Referee: [§2.1] §2.1, Dataset construction: the protocol for assigning anatomical targets to specific instrument instances when multiple instruments interact with the same tissue region is described only at a high level; this detail is necessary to reproduce the ground-truth labels that underpin all quantitative claims.

Authors: We appreciate the referee's emphasis on reproducibility. The current description in §2.1 outlines the high-level annotation process involving surgical experts. To provide the necessary detail, we will expand this section with a more precise protocol: when multiple instruments interact with the same region, the target is assigned to the instrument with the most direct and sustained contact, as determined by visual inspection of tissue deformation and instrument trajectory. We will include illustrative examples and release the complete annotation guidelines alongside the dataset.

revision: yes
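
The confidence intervals discussed in the exchange on §4.2 could be computed along the following lines. This is a minimal sketch of a paired percentile bootstrap, assuming per-video (or per-frame) scores are available for both methods; the function name `paired_bootstrap_ci` is hypothetical, the resampling unit is an assumption (the rebuttal speaks of bootstrapping over runs), and the numbers are made up for illustration and are not results from the paper.

```python
import numpy as np


def paired_bootstrap_ci(scores_a, scores_b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(scores_a - scores_b) over paired units.

    scores_a, scores_b: metric values for two methods, aligned so that
    index i refers to the same video (or frame) in both sequences.
    """
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("paired scores must have the same shape")
    diff = a - b
    rng = np.random.default_rng(seed)
    # Resample units with replacement; each pair stays together via its difference.
    idx = rng.integers(0, len(diff), size=(n_boot, len(diff)))
    boot_means = diff[idx].mean(axis=1)
    lo, hi = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return float(diff.mean()), float(lo), float(hi)


# Made-up per-video scores, for illustration only.
method = [0.62, 0.58, 0.71, 0.66, 0.60, 0.69]
baseline = [0.57, 0.55, 0.68, 0.60, 0.59, 0.63]
mean_diff, lo, hi = paired_bootstrap_ci(method, baseline)
```

If the resulting interval excludes zero, the gain is unlikely to be an artefact of which videos happened to be in the test split; an interval that straddles zero would support the referee's concern.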

Circularity Check

0 steps flagged

No circularity: new dataset and empirical architecture extension are self-contained

The paper introduces CholecTriplet-Seg, a new annotated dataset for triplet segmentation, and TargetFusionNet, an extension of Mask2Former with a target-aware fusion module. All performance claims rest on direct empirical evaluation across recognition, detection, and segmentation metrics against baselines. No equations, fitted parameters, or derivations are presented that reduce outputs to inputs by construction. No self-citations are invoked to establish uniqueness theorems or load-bearing premises. The central results derive from the new instance-level annotations and the proposed fusion mechanism rather than from any self-referential reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach rests on standard computer vision assumptions about the utility of Mask2Former for instance segmentation and the value of combining strong instrument supervision with weak target priors; no new free parameters or invented entities are described in the abstract.

axioms (1)

domain assumption: Mask2Former provides a suitable base architecture for extending to target-aware fusion in surgical scenes.
The paper states that TargetFusionNet extends Mask2Former with the fusion mechanism.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Towards Unified Surgical Scene Understanding: Bridging Reasoning and Grounding via MLLMs
cs.CV 2026-05 conditional novelty 7.0

SurgMLLM unifies high-level reasoning and low-level visual grounding in one MLLM-based model for surgical videos, raising triplet recognition AP from 40.7% to 46.0% on the new CholecT45-Scene dataset with 64,299 annot...

Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages · cited by 1 Pith paper · 2 internal anchors

[1] Nwoye, C.I., Gonzalez, C., Yu, T., Mascagni, P., Mutter, D., Marescaux, J., Padoy, N.: Recognition of instrument-tissue interactions in endoscopic videos via action triplets. In: Medical Image Computing and Computer Assisted Intervention – MICCAI 2020: 23rd International Conference, Lima, Peru, October 4–8, 2020, Proceedings, Part III 23, pp. 364–374 (2020...
[2] Nwoye, C.I., Yu, T., Gonzalez, C., Seeliger, B., Mascagni, P., Mutter, D., Marescaux, J., Padoy, N.: Rendezvous: Attention mechanisms for the recognition of surgical action triplets in endoscopic videos. Medical Image Analysis 78, 102433 (2022)
[3] Yamlahi, A., Tran, T.N., Godau, P., Schellenberg, M., Michael, D., Smidt, F.-H., Nölke, J.-H., Adler, T.J., Tizabi, M.D., Nwoye, C.I., et al.: Self-distillation for surgical action recognition. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 637–646 (2023). Springer
[4] Alabi, O., Toe, K.K.Z., Zhou, Z., Budd, C., Raison, N., Shi, M., Vercauteren, T.: Cholecinstanceseg: A tool instance segmentation dataset for laparoscopic surgery. Scientific Data 12(1) (2025) https://doi.org/10.1038/s41597-025-05163-w
[5] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1290–1299 (2022)
[6] Chen, Y., He, S., Jin, Y., Qin, J.: Surgical activity triplet recognition via triplet disentanglement. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 451–461 (2023). Springer
[7] Sharma, S., Nwoye, C.I., Mutter, D., Padoy, N.: Rendezvous in time: an attention-based temporal fusion approach for surgical triplet recognition. International Journal of Computer Assisted Radiology and Surgery, 1–7 (2023)
[8] Nwoye, C.I., Yu, T., Sharma, S., Murali, A., Alapatt, D., Vardazaryan, A., Yuan, K., Hajek, J., Reiter, W., Yamlahi, A., et al.: Cholectriplet2022: Show me a tool and tell me the triplet—an endoscopic vision challenge for surgical action triplet detection. Medical Image Analysis 89, 102888 (2023)
[9] Pei, J., Zhang, J., Qin, G., Wang, K., Jin, Y., Heng, P.-A.: Instrument-tissue-guided surgical action triplet detection via textual-temporal trail exploration. IEEE Transactions on Medical Imaging (2025)
[10] Han, Z., Budd, C., Zhang, G., Tian, H., Bergeles, C., Vercauteren, T.: ROBUST-MIPS: A combined skeletal pose and instance segmentation dataset for laparoscopic surgical instruments. arXiv preprint arXiv:2508.21096 (2025)
[11] Ahmed, F.A., Yousef, M., Ahmed, M.A., Ali, H.O., Mahboob, A., Ali, H., Shah, Z., Aboumarzouk, O., Al Ansari, A., Balakrishnan, S.: Deep learning for surgical instrument recognition and segmentation in robotic-assisted surgeries: a systematic review. Artificial Intelligence Review 58(1), 1 (2024)
[12] Rueckert, T., Rauber, D., Maerkl, R., Klausmann, L., Yildiran, S.R., Gutbrod, M., Nunes, D.W., Moreno, A.F., Luengo, I., Stoyanov, D., et al.: Comparative validation of surgical phase recognition, instrument keypoint estimation, and instrument instance segmentation in endoscopy: Results of the phakir 2024 challenge. arXiv preprint arXiv:2507.16559 (2025)
[13] Ayobi, N., Rodríguez, S., Pérez, A., Hernández, I., Aparicio, N., Dessevres, E., Peña, S., Santander, J., Caicedo, J.I., Fernández, N., Arbeláez, P.: Pixel-wise recognition for holistic surgical scene understanding. arXiv (2024) 2401.11174 [cs.CV]
[14] Owen, D., Grammatikopoulou, M., Luengo, I., Stoyanov, D.: Automated identification of critical structures in laparoscopic cholecystectomy. International Journal of Computer Assisted Radiology and Surgery 17(12), 2173–2181 (2022)
[15] Batić, D., Holm, F., Özsoy, E., Czempiel, T., Navab, N.: Endovit: pretraining vision transformers on a large collection of endoscopic images. International Journal of Computer Assisted Radiology and Surgery 19(6), 1085–1091 (2024)
[16] Hong, W.-Y., Kao, C.-L., Kuo, Y.-H., Wang, J.-R., Chang, W.-L., Shih, C.-S.: Cholecseg8k: a semantic segmentation dataset for laparoscopic cholecystectomy based on cholec80. arXiv preprint arXiv:2012.12453 (2020)
[17] Nwoye, C.I., Padoy, N.: Data Splits and Metrics for Method Benchmarking on Surgical Action Triplet Datasets (2023). https://arxiv.org/abs/2204.05235
[18] Chen, K., Wang, J., Pang, J., Cao, Y., Xiong, Y., Li, X., Sun, S., Feng, W., Liu, Z., Xu, J., Zhang, Z., Cheng, D., Zhu, C., Cheng, T., Zhao, Q., Li, B., Lu, X., Zhu, R., Wu, Y., Dai, J., Wang, J., Shi, J., Ouyang, W., Loy, C.C., Lin, D.: MMDetection: Open mmlab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155 (2019)
[19] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
[20] Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
