Grounding Surgical Action Triplets with Instrument Instance Segmentation: A Dataset and Target-Aware Fusion Approach
Pith reviewed 2026-05-18 01:14 UTC · model grok-4.3
The pith
A new task and dataset ground surgical action triplets to specific instrument instances and anatomical targets using segmentation masks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Triplet segmentation is presented as a unified task that outputs spatially grounded <instrument, verb, target> predictions. The CholecTriplet-Seg dataset supplies the first large-scale benchmark linking instrument instance masks to action and target annotations. TargetFusionNet extends Mask2Former with a target-aware fusion mechanism that integrates weak anatomy priors and instrument instance queries to produce more accurate anatomical target predictions than prior frame-level or activation-map approaches.
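To make the output format concrete, here is a minimal sketch of what one spatially grounded <instrument, verb, target> prediction could look like, and how it might be matched against ground truth. The `GroundedTriplet` class, the pixel-set mask representation, the label strings, and the 0.5 IoU threshold are all illustrative assumptions; the review does not state the paper's actual data layout or matching rule.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple

Pixel = Tuple[int, int]  # (row, col) coordinate; a hypothetical mask encoding


@dataclass(frozen=True)
class GroundedTriplet:
    """One grounded <instrument, verb, target> prediction (hypothetical layout)."""
    instrument: str
    verb: str
    target: str
    mask: FrozenSet[Pixel]  # pixels covered by the instrument instance
    score: float = 1.0


def mask_iou(a: FrozenSet[Pixel], b: FrozenSet[Pixel]) -> float:
    """Intersection over union of two pixel sets; 0.0 when both are empty."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def is_match(pred: GroundedTriplet, gt: GroundedTriplet,
             iou_thresh: float = 0.5) -> bool:
    """Assumed rule: all three labels agree and the masks overlap enough."""
    same_labels = (pred.instrument, pred.verb, pred.target) == (
        gt.instrument, gt.verb, gt.target)
    return same_labels and mask_iou(pred.mask, gt.mask) >= iou_thresh


# Toy example with illustrative labels and a 2x2 ground-truth mask.
gt = GroundedTriplet("grasper", "retract", "gallbladder",
                     frozenset({(0, 0), (0, 1), (1, 0), (1, 1)}))
pred = GroundedTriplet("grasper", "retract", "gallbladder",
                       frozenset({(0, 0), (0, 1), (1, 0)}))
print(mask_iou(pred.mask, gt.mask), is_match(pred, gt))
```

Tying the labels and the mask into one record is what distinguishes this setting from frame-level triplet classification, where a correct label carries no information about which instrument instance it belongs to.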
What carries the argument
TargetFusionNet, an extension of Mask2Former that adds a target-aware fusion mechanism to combine weak anatomy priors with instrument instance queries for improved anatomical target prediction.
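The simulated rebuttal further down describes the fusion as cross-attention between instrument queries and anatomy prior features. The sketch below illustrates that general pattern only: it uses plain scaled dot-product attention with a residual add, with no learned projections, multiple heads, or normalization layers. The function names and the residual connection are assumptions, not the paper's definition of the fused query features.

```python
import math
from typing import List

Vector = List[float]


def softmax(xs: Vector) -> Vector:
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def fuse_queries(queries: List[Vector], prior_feats: List[Vector]) -> List[Vector]:
    """Each instrument query attends over anatomy-prior feature vectors.

    The attended context is added back to the query (assumed residual form).
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for q in queries:
        # Attention weights: how relevant each prior location is to this query.
        weights = softmax([dot(q, k) * scale for k in prior_feats])
        # Weighted sum of prior features, one value per feature dimension.
        context = [sum(w * k[i] for w, k in zip(weights, prior_feats))
                   for i in range(dim)]
        fused.append([qi + ci for qi, ci in zip(q, context)])
    return fused


# One query aligned with the first of two prior features.
print(fuse_queries([[4.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]]))
```

Because the weights come from a softmax over prior features, a noisy or misaligned prior still receives attention mass, which is the failure mode the referee's first major comment asks the authors to analyze.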
If this is right
- Triplet segmentation supplies a single framework that replaces separate recognition and weak localization steps with instance-level grounding.
- Strong instance supervision plus weak target priors produces measurable gains in recognition, detection, and segmentation metrics.
- The benchmark dataset enables direct comparison of future methods on spatially precise surgical action outputs.
- The resulting predictions support more interpretable surgical scene understanding than frame-level triplet classifiers.
Where Pith is reading between the lines
- The same fusion pattern could be tested on non-surgical video domains that require grounding actions to objects and their targets.
- Real-time deployment on robotic platforms might use the instance-level outputs to flag unsafe instrument-tissue contacts during procedures.
- Collecting fully supervised target masks on a subset of the data could quantify how much accuracy is currently lost by relying on weak priors.
- The dataset structure naturally supports multi-task training that jointly optimizes segmentation and action recognition.
Load-bearing premise
The target-aware fusion mechanism can reliably combine weak anatomy priors with instrument instance queries to produce accurate target predictions without introducing new errors into the segmentation pipeline.
What would settle it
Running the model on a held-out set of surgical frames where the supplied anatomy priors are deliberately noisy or removed and checking whether target prediction accuracy drops below the level of a non-fusion baseline.
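A minimal sketch of how such a test could be set up, assuming the anatomy prior is a per-pixel map with values in [0, 1] flattened into a list. The helper names, the Gaussian noise model, and the zeroed-out "removed" condition are hypothetical choices, not the paper's protocol, and no accuracy numbers are implied.

```python
import random
from typing import Dict, List


def corrupt_prior(prior: List[float], noise_std: float,
                  rng: random.Random) -> List[float]:
    """Add Gaussian noise to a prior map and clip each value to [0, 1]."""
    return [min(1.0, max(0.0, p + rng.gauss(0.0, noise_std))) for p in prior]


def remove_prior(prior: List[float]) -> List[float]:
    """Simulate a missing prior by zeroing the whole map (an assumption)."""
    return [0.0] * len(prior)


def conditions_below_baseline(fusion_acc: Dict[str, float],
                              baseline_acc: float) -> List[str]:
    """Names of prior conditions where fusion scores below the non-fusion baseline."""
    return sorted(name for name, acc in fusion_acc.items() if acc < baseline_acc)


# Build the three prior conditions for one toy prior map.
rng = random.Random(0)
clean = [0.1, 0.9, 0.8, 0.2]
conditions = {
    "clean": clean,
    "noisy": corrupt_prior(clean, noise_std=0.3, rng=rng),
    "removed": remove_prior(clean),
}
print(sorted(conditions))
```

Each condition would be fed to the trained model on the held-out frames, and the resulting target accuracies passed to `conditions_below_baseline` together with the non-fusion baseline's accuracy. A non-empty result would indicate that the fusion module can hurt when its priors degrade.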
Original abstract
Understanding surgical instrument-tissue interactions requires not only identifying which instrument performs which action on which anatomical target, but also grounding these interactions spatially within the surgical scene. Existing surgical action triplet recognition methods are limited to learning from frame-level classification, failing to reliably link actions to specific instrument instances. Previous attempts at spatial grounding have primarily relied on class activation maps, which lack the precision and robustness required for detailed instrument-tissue interaction analysis. To address this gap, we propose grounding surgical action triplets with instrument instance segmentation, or triplet segmentation for short, a new unified task which produces spatially grounded <instrument, verb, target> outputs. We start by presenting CholecTriplet-Seg, a large-scale dataset containing over 30,000 annotated frames, linking instrument instance masks with action verb and anatomical target annotations, and establishing the first benchmark for strongly supervised, instance-level triplet grounding and evaluation. To learn triplet segmentation, we propose TargetFusionNet, a novel architecture that extends Mask2Former with a target-aware fusion mechanism to address the challenge of accurate anatomical target prediction by fusing weak anatomy priors with instrument instance queries. Evaluated across recognition, detection, and triplet segmentation metrics, TargetFusionNet consistently improves performance over existing baselines, demonstrating that strong instance supervision combined with weak target priors significantly enhances the accuracy and robustness of surgical action understanding. Triplet segmentation establishes a unified framework for spatially grounding surgical action triplets. The proposed benchmark and architecture pave the way for more interpretable, surgical scene understanding.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces CholecTriplet-Seg, a dataset of over 30,000 annotated frames that links instrument instance masks to action verbs and anatomical targets, establishing the first benchmark for instance-level surgical action triplet segmentation. It proposes TargetFusionNet, an extension of Mask2Former that adds a target-aware fusion module to integrate weak anatomy priors with instrument instance queries, producing spatially grounded <instrument, verb, target> outputs. Evaluations across recognition, detection, and segmentation metrics show consistent gains over baselines, attributing the improvement to strong instance supervision combined with weak target priors.
Significance. If the reported gains hold under rigorous validation, the work supplies a valuable public benchmark and a practical architecture for moving surgical action understanding from frame-level labels to instance-grounded triplets. This directly supports more interpretable models for computer-assisted surgery and could seed follow-on research on uncertainty-aware fusion and multi-instrument interaction modeling.
major comments (3)
- [§3.2] §3.2, TargetFusionNet description: the target-aware fusion is presented as reliably combining weak anatomy priors with instance queries, yet the text provides no explicit uncertainty modeling, misalignment correction, or error-propagation analysis; this is load-bearing for the central claim that the module enhances accuracy without introducing new errors under occlusion or tissue similarity.
- [§4.2] §4.2 and Table 2: the reported metric improvements (e.g., triplet segmentation mAP) are stated as consistent, but the manuscript does not report statistical significance tests, confidence intervals, or per-scene variance; without these, the cross-baseline superiority cannot be assessed as robust.
- [§2.1] §2.1, Dataset construction: the protocol for assigning anatomical targets to specific instrument instances when multiple instruments interact with the same tissue region is described only at a high level; this detail is necessary to reproduce the ground-truth labels that underpin all quantitative claims.
minor comments (2)
- [Figure 3] Figure 3 caption and §3.3: the notation for the fused query features (Q_f) is introduced without an explicit equation; adding the mathematical definition would improve clarity.
- [§4.1] §4.1: the baseline implementations are referenced to prior work but the exact hyper-parameter settings and training schedules used for re-implementation are not listed; a supplementary table would aid reproducibility.
Simulated Author's Rebuttal
We are grateful to the referee for the thorough review and valuable suggestions. We address each major comment in detail below, proposing specific revisions to enhance the clarity, rigor, and reproducibility of our work.
- Referee: [§3.2] §3.2, TargetFusionNet description: the target-aware fusion is presented as reliably combining weak anatomy priors with instance queries, yet the text provides no explicit uncertainty modeling, misalignment correction, or error-propagation analysis; this is load-bearing for the central claim that the module enhances accuracy without introducing new errors under occlusion or tissue similarity.
  Authors: We thank the referee for this important point. While the target-aware fusion leverages cross-attention between instrument queries and anatomy prior features to achieve spatial grounding, we did not include explicit uncertainty estimates or misalignment correction in the original submission. To address this, we will revise Section 3.2 to include a discussion of the fusion mechanism's robustness properties and add an analysis of error propagation under challenging conditions such as occlusion. We will also report results from additional experiments simulating tissue similarity scenarios.
  Revision: yes
- Referee: [§4.2] §4.2 and Table 2: the reported metric improvements (e.g., triplet segmentation mAP) are stated as consistent, but the manuscript does not report statistical significance tests, confidence intervals, or per-scene variance; without these, the cross-baseline superiority cannot be assessed as robust.
  Authors: We agree that providing statistical validation is essential for claiming consistent improvements. In the revised version, we will augment Table 2 and the corresponding text in §4.2 with statistical significance tests (e.g., paired t-tests with p-values), 95% confidence intervals obtained through bootstrapping over multiple runs, and an analysis of per-scene performance variance to better demonstrate the robustness of TargetFusionNet across different surgical procedures.
  Revision: yes
- Referee: [§2.1] §2.1, Dataset construction: the protocol for assigning anatomical targets to specific instrument instances when multiple instruments interact with the same tissue region is described only at a high level; this detail is necessary to reproduce the ground-truth labels that underpin all quantitative claims.
  Authors: We appreciate the referee's emphasis on reproducibility. The current description in §2.1 outlines the high-level annotation process involving surgical experts. To provide the necessary detail, we will expand this section with a more precise protocol: when multiple instruments interact with the same region, the target is assigned to the instrument with the most direct and sustained contact, as determined by visual inspection of tissue deformation and instrument trajectory. We will include illustrative examples and release the complete annotation guidelines alongside the dataset.
  Revision: yes
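The statistical validation promised in the response to the §4.2 comment (paired tests and bootstrapped 95% confidence intervals) could be sketched as follows. This is a generic illustration using only the Python standard library. The pairing of scores by run or by video, the percentile bootstrap, and the resample count are assumptions, and the toy scores in the usage lines are placeholders rather than reported results.

```python
import math
import random
import statistics
from typing import List, Sequence, Tuple


def paired_differences(model: Sequence[float],
                       baseline: Sequence[float]) -> List[float]:
    """Per-unit score differences (model minus baseline) over matched units."""
    if len(model) != len(baseline):
        raise ValueError("model and baseline must be scored on the same units")
    return [m - b for m, b in zip(model, baseline)]


def paired_t_statistic(diffs: Sequence[float]) -> float:
    """t = mean(d) / (stdev(d) / sqrt(n)); compare to a t distribution, n-1 dof."""
    n = len(diffs)
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))


def bootstrap_ci(diffs: Sequence[float], n_resamples: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean difference."""
    rng = random.Random(seed)
    n = len(diffs)
    # Resample the paired differences with replacement and record each mean.
    means = sorted(statistics.fmean(rng.choices(diffs, k=n))
                   for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Placeholder scores for three matched units (not results from the paper).
diffs = paired_differences([0.5, 0.6, 0.7], [0.4, 0.4, 0.4])
print(paired_t_statistic(diffs), bootstrap_ci(diffs))
```

If the confidence interval for the mean difference excludes zero, the improvement is unlikely to be an artifact of run-to-run or video-to-video variation, which is the robustness question the referee raises.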
Circularity Check
No circularity: new dataset and empirical architecture extension are self-contained
The paper introduces CholecTriplet-Seg, a new annotated dataset for triplet segmentation, and TargetFusionNet, an extension of Mask2Former with a target-aware fusion module. All performance claims rest on direct empirical evaluation across recognition, detection, and segmentation metrics against baselines. No equations, fitted parameters, or derivations are presented that reduce outputs to inputs by construction. No self-citations are invoked to establish uniqueness theorems or load-bearing premises. The central results derive from the new instance-level annotations and the proposed fusion mechanism rather than from any self-referential reduction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Mask2Former provides a suitable base architecture for extending to target-aware fusion in surgical scenes.
Forward citations
Cited by 1 Pith paper
- Towards Unified Surgical Scene Understanding: Bridging Reasoning and Grounding via MLLMs
  SurgMLLM unifies high-level reasoning and low-level visual grounding in one MLLM-based model for surgical videos, raising triplet recognition AP from 40.7% to 46.0% on the new CholecT45-Scene dataset with 64,299 annot...